From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx190.postini.com [74.125.245.190]) by kanga.kvack.org (Postfix) with SMTP id E9D416B002B for ; Thu, 16 Aug 2012 15:00:40 -0400 (EDT) Message-ID: <502D42E5.7090403@redhat.com> Date: Thu, 16 Aug 2012 14:58:45 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> In-Reply-To: <20120816024610.GA5350@evergreen.ssec.wisc.edu> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, Hugh Dickins Cc: linux-mm On 08/15/2012 10:46 PM, Daniel Forrest wrote: > I'm hoping someone has seen this before... > > I've been trying to track down a performance problem with Linux 3.0.4. > The symptom is system-mode load increasing over time while user-mode > load remains constant while running a data ingest/processing program. > > Looking at /proc/meminfo I noticed SUnreclaim increasing steadily. > > Looking at /proc/slabinfo I noticed anon_vma and anon_vma_chain also > increasing steadily. Oh dear. Basically, what happens is that at fork time, a new "level" is created for the anon_vma hierarchy. This works great for normal forking daemons, since the parent process just keeps running, and forking off children. Look at anon_vma_fork() in mm/rmap.c for the details. Having each child become the new parent, and the previous parent exit, can result in an "infinite" stack of anon_vmas. Now, the parent anon_vma we cannot get rid of, because that is where the anon_vma lock lives. However, in your case you have many more anon_vma levels than you have processes! I wonder if it may be possible to fix your bug by adding a refcount to the struct anon_vma, one count for each VMA that is directly attached to the anon_vma (ie. 
vma->anon_vma == anon_vma),
and one for each page that points to the anon_vma.

If the reference count on an anon_vma reaches 0,
we can skip that anon_vma in anon_vma_clone, and
the child process should not get that anon_vma.

A scheme like that may be enough to avoid the trouble
you are running into.

Does this sound realistic?

> I was able to generate a simple test program that will cause this:
>
> ---
>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
>     pid_t pid;
>
>     while (1) {
>         pid = fork();
>         if (pid == -1) {
>             /* error */
>             return 1;
>         }
>         if (pid) {
>             /* parent */
>             sleep(2);
>             break;
>         }
>         else {
>             /* child */
>             sleep(1);
>         }
>     }
>     return 0;
> }
>
> ---
>
> In the actual program (running as a daemon), a child is reading data
> while its parent is processing the previously read data.  At any time
> there are only a few processes in existence, with older processes
> exiting and new processes being fork()ed.  Killing the program frees
> the slab usage.
>
> I patched the kernel to 3.0.40, but the problem remains.  I also
> compiled with slab debugging and can see that the growth of anon_vma
> and anon_vma_chain is due to anon_vma_clone/anon_vma_fork.
>
> Is this a known issue?  Is it fixed in a later release?
>
> Thanks,
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx184.postini.com [74.125.245.184]) by kanga.kvack.org (Postfix) with SMTP id 7ADCE6B0069 for ; Fri, 17 Aug 2012 20:03:14 -0400 (EDT) Date: Fri, 17 Aug 2012 19:03:12 -0500 From: Daniel Forrest Subject: Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20120818000312.GA4262@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <502D42E5.7090403@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm On Thu, Aug 16, 2012 at 02:58:45PM -0400, Rik van Riel wrote: > Oh dear. > > Basically, what happens is that at fork time, a new > "level" is created for the anon_vma hierarchy. This > works great for normal forking daemons, since the > parent process just keeps running, and forking off > children. > > Look at anon_vma_fork() in mm/rmap.c for the details. > > Having each child become the new parent, and the > previous parent exit, can result in an "infinite" > stack of anon_vmas. > > Now, the parent anon_vma we cannot get rid of, > because that is where the anon_vma lock lives. > > However, in your case you have many more anon_vma > levels than you have processes! > > I wonder if it may be possible to fix your bug > by adding a refcount to the struct anon_vma, > one count for each VMA that is directly attached > to the anon_vma (ie. vma->anon_vma == anon_vma), > and one for each page that points to the anon_vma. > > If the reference count on an anon_vma reaches 0, > we can skip that anon_vma in anon_vma_clone, and > the child process should not get that anon_vma. > > A scheme like that may be enough to avoid the trouble > you are running into. > > Does this sound realistic? 
Based on your comments, I came up with the following patch.  It boots
and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know
if I've overlooked something.  I'm not a kernel hacker.

--- include/linux/rmap.h.ORIG	2011-08-05 04:59:21.000000000 +0000
+++ include/linux/rmap.h	2012-08-16 22:52:25.000000000 +0000
@@ -35,6 +35,7 @@ struct anon_vma {
 	 * anon_vma if they are the last user on release
 	 */
 	atomic_t refcount;
+	atomic_t pagecount;

 	/*
 	 * NOTE: the LSB of the head.next is set by
--- mm/rmap.c.ORIG	2011-08-05 04:59:21.000000000 +0000
+++ mm/rmap.c	2012-08-17 23:55:13.000000000 +0000
@@ -85,6 +85,7 @@ static inline struct anon_vma *anon_vma_
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
 	VM_BUG_ON(atomic_read(&anon_vma->refcount));
+	VM_BUG_ON(atomic_read(&anon_vma->pagecount));

 	/*
 	 * Synchronize against page_lock_anon_vma() such that
@@ -176,6 +177,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
+			atomic_inc(&anon_vma->pagecount);
 			avc->anon_vma = anon_vma;
 			avc->vma = vma;
 			list_add(&avc->same_vma, &vma->anon_vma_chain);
@@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct
 		}
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
-		anon_vma_chain_link(dst, avc, anon_vma);
+		if (!atomic_read(&anon_vma->pagecount))
+			anon_vma_chain_free(avc);
+		else
+			anon_vma_chain_link(dst, avc, anon_vma);
 	}
 	unlock_anon_vma_root(root);
 	return 0;
@@ -314,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go.
*/
 	vma->anon_vma = anon_vma;
+	atomic_set(&anon_vma->pagecount, 1);
 	anon_vma_lock(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
 	anon_vma_unlock(anon_vma);
@@ -341,6 +347,8 @@ void unlink_anon_vmas(struct vm_area_str

 		root = lock_anon_vma_root(root, anon_vma);
 		list_del(&avc->same_anon_vma);
+		if (vma->anon_vma == anon_vma)
+			atomic_dec(&anon_vma->pagecount);

 		/*
 		 * Leave empty anon_vmas on the list - we'll need
@@ -375,6 +383,7 @@ static void anon_vma_ctor(void *data)

 	mutex_init(&anon_vma->mutex);
 	atomic_set(&anon_vma->refcount, 0);
+	atomic_set(&anon_vma->pagecount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }

@@ -996,6 +1005,7 @@ static void __page_set_anon_rmap(struct
 	if (!exclusive)
 		anon_vma = anon_vma->root;

+	atomic_inc(&anon_vma->pagecount);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);
@@ -1142,6 +1152,11 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		return;
 	if (PageAnon(page)) {
+		struct anon_vma *anon_vma;
+
+		anon_vma = page_anon_vma(page);
+		if (anon_vma)
+			atomic_dec(&anon_vma->pagecount);
 		mem_cgroup_uncharge_page(page);
 		if (!PageTransHuge(page))
 			__dec_zone_page_state(page, NR_ANON_PAGES);
@@ -1747,6 +1762,7 @@ static void __hugepage_set_anon_rmap(str
 	if (!exclusive)
 		anon_vma = anon_vma->root;

+	atomic_inc(&anon_vma->pagecount);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);

--
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx194.postini.com [74.125.245.194]) by kanga.kvack.org (Postfix) with SMTP id D74FF6B0069 for ; Fri, 17 Aug 2012 23:46:22 -0400 (EDT) Message-ID: <502F100A.1080401@redhat.com> Date: Fri, 17 Aug 2012 23:46:18 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> In-Reply-To: <20120818000312.GA4262@evergreen.ssec.wisc.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm On 08/17/2012 08:03 PM, Daniel Forrest wrote: > Based on your comments, I came up with the following patch. It boots > and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > if I've overlooked something. I'm not a kernel hacker. The patch looks reasonable to me. There is one spot left for optimization, which I have pointed out below. Of course, that leaves the big question: do we want the overhead of having the atomic addition and decrement for every anonymous memory page, or is it easier to fix this issue in userspace? Given that malicious userspace could potentially run the system out of memory, without needing special privileges, and the OOM killer may not be able to reclaim it due to internal slab fragmentation, I guess this issue could be classified as a low impact denial of service vulnerability. Furthermore, there is already a fair amount of bookkeeping being done in the rmap code, so this patch is not likely to add a whole lot - some testing might be useful, though. 
> @@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct
>  		}
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
> -		anon_vma_chain_link(dst, avc, anon_vma);
> +		if (!atomic_read(&anon_vma->pagecount))
> +			anon_vma_chain_free(avc);
> +		else
> +			anon_vma_chain_link(dst, avc, anon_vma);
>  	}
>  	unlock_anon_vma_root(root);
>  	return 0;

In this function, you can do the test before the code block
where we try to allocate an anon_vma chain.

In other words:

	list_for_each_entry_reverse(.....
		struct anon_vma *anon_vma;

+		if (!atomic_read(&anon_vma->pagecount))
+			continue;
+
		avc = anon_vma_chain_alloc(...
		if (unlikely(!avc)) {

The rest looks good.

--
All rights reversed

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx147.postini.com [74.125.245.147])
	by kanga.kvack.org (Postfix) with SMTP id 05A816B0069
	for ; Sat, 18 Aug 2012 00:07:48 -0400 (EDT)
Date: Fri, 17 Aug 2012 23:07:47 -0500
From: Daniel Forrest
Subject: Re: Repeated fork() causes SLAB to grow without bound
Message-ID: <20120818040747.GA22793@evergreen.ssec.wisc.edu>
Reply-To: Daniel Forrest
References: <20120816024610.GA5350@evergreen.ssec.wisc.edu>
	<502D42E5.7090403@redhat.com>
	<20120818000312.GA4262@evergreen.ssec.wisc.edu>
	<502F100A.1080401@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <502F100A.1080401@redhat.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Rik van Riel
Cc: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm

On Fri, Aug 17, 2012 at 11:46:18PM -0400, Rik van Riel wrote:
> On 08/17/2012 08:03 PM, Daniel Forrest wrote:
>
> >Based on your comments, I came up with the following patch.
It boots > >and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > >if I've overlooked something. I'm not a kernel hacker. > > The patch looks reasonable to me. There is one spot left > for optimization, which I have pointed out below. > > Of course, that leaves the big question: do we want the > overhead of having the atomic addition and decrement for > every anonymous memory page, or is it easier to fix this > issue in userspace? > > Given that malicious userspace could potentially run the > system out of memory, without needing special privileges, > and the OOM killer may not be able to reclaim it due to > internal slab fragmentation, I guess this issue could be > classified as a low impact denial of service vulnerability. > > Furthermore, there is already a fair amount of bookkeeping > being done in the rmap code, so this patch is not likely > to add a whole lot - some testing might be useful, though. > > >@@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct > > } > > anon_vma = pavc->anon_vma; > > root = lock_anon_vma_root(root, anon_vma); > >- anon_vma_chain_link(dst, avc, anon_vma); > >+ if (!atomic_read(&anon_vma->pagecount)) > >+ anon_vma_chain_free(avc); > >+ else > >+ anon_vma_chain_link(dst, avc, anon_vma); > > } > > unlock_anon_vma_root(root); > > return 0; > > In this function, you can do the test before the code block > where we try to allocate an anon_vma chain. > > In other words: > > list_for_each_entry_reverse(..... > struct anon_vma *anon_vma; > > + if (!atomic_read(&anon_vma->pagecount)) > + continue; > + > avc = anon_vma_chain_alloc(... > if (unlikely(!avc)) { > > The rest looks good. I was being careful since I wasn't certain about the locking. Does the test need to be protected by "lock_anon_vma_root"? That's why I chose the overhead of the possible wasted "anon_vma_chain_alloc". -- Daniel K. 
Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110])
	by kanga.kvack.org (Postfix) with SMTP id 94AD86B0069
	for ; Sat, 18 Aug 2012 00:10:17 -0400 (EDT)
Message-ID: <502F15A6.5060902@redhat.com>
Date: Sat, 18 Aug 2012 00:10:14 -0400
From: Rik van Riel
MIME-Version: 1.0
Subject: Re: Repeated fork() causes SLAB to grow without bound
References: <20120816024610.GA5350@evergreen.ssec.wisc.edu>
	<502D42E5.7090403@redhat.com>
	<20120818000312.GA4262@evergreen.ssec.wisc.edu>
	<502F100A.1080401@redhat.com>
	<20120818040747.GA22793@evergreen.ssec.wisc.edu>
In-Reply-To: <20120818040747.GA22793@evergreen.ssec.wisc.edu>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm

On 08/18/2012 12:07 AM, Daniel Forrest wrote:

> I was being careful since I wasn't certain about the locking.  Does
> the test need to be protected by "lock_anon_vma_root"?  That's why I
> chose the overhead of the possible wasted "anon_vma_chain_alloc".

The function anon_vma_clone is being called from fork().

When running fork(), the kernel holds the mm->mmap_sem
for write, which prevents page faults by the parent
process.  This means if the anon_vma in question
belongs to the parent process, no new pages will be
added to it in this time.

Likewise, if the anon_vma belonged to a grandparent
process, any new pages instantiated in it will not
be visible to the parent process, or to the newly
created process.

This means it is safe to skip the anon_vma.
-- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id A04406B0069 for ; Mon, 20 Aug 2012 04:00:54 -0400 (EDT) Received: by pbbro12 with SMTP id ro12so7521565pbb.14 for ; Mon, 20 Aug 2012 01:00:53 -0700 (PDT) Date: Mon, 20 Aug 2012 01:00:11 -0700 (PDT) From: Hugh Dickins Subject: Re: Repeated fork() causes SLAB to grow without bound In-Reply-To: <502F100A.1080401@redhat.com> Message-ID: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Daniel Forrest , Andrea Arcangeli , Michel Lespinasse , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Fri, 17 Aug 2012, Rik van Riel wrote: > On 08/17/2012 08:03 PM, Daniel Forrest wrote: > > > Based on your comments, I came up with the following patch. It boots > > and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > > if I've overlooked something. I'm not a kernel hacker. > > The patch looks reasonable to me. There is one spot left > for optimization, which I have pointed out below. > > Of course, that leaves the big question: do we want the > overhead of having the atomic addition and decrement for > every anonymous memory page, or is it easier to fix this > issue in userspace? I've not given any thought to alternatives, and I've not done any performance analysis; but my instinct says that we really do not want another atomic increment and decrement (and another cache line redirtied) for every single page mapped. 
One of the things I've often admired about Andrea's anon_vma design was the way it did not need a refcount; and although we later added one for KSM and migration, that scarcely mattered, because it was for exceptional circumstances, and not per page. May I dare to think: what if we just backed out all the anon_vma_chain complexity, and returned to the simple anon_vma list we had in 2.6.33? Just how realistic was the workload which led you to anon_vma_chains? And isn't it correct to say that the performance evaluation was made while believing that each anon_vma->lock was useful, before the sad realization that anon_vma->root->lock (or ->mutex) had to be used? I've Cc'ed Michel, because I think he has plans (or at least hopes) for the anon_vmas, in his relentless pursuit of world domination by rbtree. Hugh > > Given that malicious userspace could potentially run the > system out of memory, without needing special privileges, > and the OOM killer may not be able to reclaim it due to > internal slab fragmentation, I guess this issue could be > classified as a low impact denial of service vulnerability. > > Furthermore, there is already a fair amount of bookkeeping > being done in the rmap code, so this patch is not likely > to add a whole lot - some testing might be useful, though. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx115.postini.com [74.125.245.115]) by kanga.kvack.org (Postfix) with SMTP id 819126B00A2 for ; Mon, 20 Aug 2012 05:39:27 -0400 (EDT) Received: by iahk25 with SMTP id k25so3082117iah.14 for ; Mon, 20 Aug 2012 02:39:26 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> Date: Mon, 20 Aug 2012 02:39:26 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Rik van Riel , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, Aug 20, 2012 at 1:00 AM, Hugh Dickins wrote: > On Fri, 17 Aug 2012, Rik van Riel wrote: >> Of course, that leaves the big question: do we want the >> overhead of having the atomic addition and decrement for >> every anonymous memory page, or is it easier to fix this >> issue in userspace? > > I've not given any thought to alternatives, and I've not done any > performance analysis; but my instinct says that we really do not > want another atomic increment and decrement (and another cache > line redirtied) for every single page mapped. I am concerned about this as well. > May I dare to think: what if we just backed out all the anon_vma_chain > complexity, and returned to the simple anon_vma list we had in 2.6.33? > > Just how realistic was the workload which led you to anon_vma_chains? > And isn't it correct to say that the performance evaluation was made > while believing that each anon_vma->lock was useful, before the sad > realization that anon_vma->root->lock (or ->mutex) had to be used? Thanks for suggesting this - I certainly wish we could go that way. 
I suspect there will be a strong case against this, but I'd certainly like to hear it (and see if it can be addressed another way). Here we just don't have processes that fork a lot of children that don't immediately exec, so anon_vmas don't bring any value for us. > I've Cc'ed Michel, because I think he has plans (or at least hopes) for > the anon_vmas, in his relentless pursuit of world domination by rbtree. Unfortunately I don't have great ideas there. It would be easy to add a flag to track if an anon_vma has ever been referenced by a struct page, and not clone the anon_vma if the flag isn't set. But, this wouldn't help at all with the DOS potential here. If there are pages referencing the anon_vma, we could reassign these to the parent anon_vma, but finding all such pages would be expensive too. Instead of adding an atomic count for page references, we could limit the anon_vma stacking depth. In fork, we would only clone anon_vmas that have a low enough generation count. I think that's not great (adds a special case for the deep-fork-without-exec behavior), but still better than the atomic page reference counter. I would still prefer if we could just remove the anon_vma_chain stuff, though. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx138.postini.com [74.125.245.138]) by kanga.kvack.org (Postfix) with SMTP id B6F3C6B0069 for ; Mon, 20 Aug 2012 07:11:01 -0400 (EDT) From: Andi Kleen Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> Date: Mon, 20 Aug 2012 04:11:00 -0700 In-Reply-To: (Michel Lespinasse's message of "Mon, 20 Aug 2012 02:39:26 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Hugh Dickins , Rik van Riel , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Michel Lespinasse writes: > > I would still prefer if we could just remove the anon_vma_chain stuff, though. Would probably help with the fork locking problems too. We never really recovered from that regression. -Andi -- ak@linux.intel.com -- Speaking for myself only -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id 188C76B0069 for ; Mon, 20 Aug 2012 07:17:44 -0400 (EDT) Message-ID: <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 07:17:39 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 08/20/2012 05:39 AM, Michel Lespinasse wrote: > I would still prefer if we could just remove the anon_vma_chain stuff, though. If only we could. That simply replaces a medium issue at fork time, with the potential for a catastrophic issue at page reclaim time, in any workload with heavily forking server software. Without the anon_vma_chains, we end up scanning every single one of the child processes (and the parent) for every COWed page, which can be a real issue when the VM runs into 1000 such pages, for 1000 child processes. Unfortunately, we have seen this happen... -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx110.postini.com [74.125.245.110]) by kanga.kvack.org (Postfix) with SMTP id 423606B0069 for ; Mon, 20 Aug 2012 07:53:01 -0400 (EDT) Received: by iahk25 with SMTP id k25so3181122iah.14 for ; Mon, 20 Aug 2012 04:53:00 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <50321CD3.5050501@redhat.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 04:53:00 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, Aug 20, 2012 at 4:17 AM, Rik van Riel wrote: > Without the anon_vma_chains, we end up scanning every single > one of the child processes (and the parent) for every COWed > page, which can be a real issue when the VM runs into 1000 > such pages, for 1000 child processes. > > Unfortunately, we have seen this happen... Well, it only happens if the vma is created in the parent, and the first anon write also happens in the parent. I suppose that's a legitimate thing to do in a forking server though - say, for an expensive initialization stage, or precomputing some table, or whatever. When fork happens after the first anon page has been created, the child VMA currently ends up being added to the parent's anon_vma - even if the child might never create new anon pages into that VMA. I wonder if it might help to add the child VMA onto the parent's anon_vma only at the first child COW event. 
That way it would at least be possible (with userspace changes) for any forking servers to separate the areas they want to write into from the parent (such as things that need expensive initialization), from the ones that they want to write into from the child, and have none of the anon_vma lists grow too large. This might still be impractical if one has too many such workloads to care about. I'm just not sure how prevalent the problem workloads are. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx155.postini.com [74.125.245.155]) by kanga.kvack.org (Postfix) with SMTP id 4D4B46B005D for ; Mon, 20 Aug 2012 15:11:45 -0400 (EDT) Received: by ghrr18 with SMTP id r18so6493473ghr.14 for ; Mon, 20 Aug 2012 12:11:44 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 12:11:43 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse Content-Type: text/plain; charset=ISO-8859-1 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, Aug 20, 2012 at 4:53 AM, Michel Lespinasse wrote: > I wonder if it might help to add the child VMA onto the parent's > anon_vma only at the first child COW event. 
That way it would at least > be possible (with userspace changes) for any forking servers to > separate the areas they want to write into from the parent (such as > things that need expensive initialization), from the ones that they > want to write into from the child, and have none of the anon_vma lists > grow too large. Actually that wouldn't work. The parent's anon pages are visible from the child, so the child vma needs to be on the parent anon_vma list. Sorry for the noise :/ -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx143.postini.com [74.125.245.143]) by kanga.kvack.org (Postfix) with SMTP id 344D96B0044 for ; Tue, 21 Aug 2012 23:21:04 -0400 (EDT) Received: by pbbro12 with SMTP id ro12so890943pbb.14 for ; Tue, 21 Aug 2012 20:21:03 -0700 (PDT) Date: Tue, 21 Aug 2012 20:20:57 -0700 From: Michel Lespinasse Subject: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20120822032057.GA30871@google.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Hugh Dickins Cc: Rik van Riel , Daniel Forrest , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > Instead of adding an atomic count for page references, we could limit > the anon_vma stacking depth. In fork, we would only clone anon_vmas > that have a low enough generation count. 
I think that's not great
> (adds a special case for the deep-fork-without-exec behavior), but
> still better than the atomic page reference counter.

Here is an attached patch to demonstrate the idea.

anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);

-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;

 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)

 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
*/ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; struct anon_vma *root = NULL; + int length = 0; list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { struct anon_vma *anon_vma; @@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) anon_vma = pavc->anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); + length++; } unlock_anon_vma_root(root); - return 0; + return length; enomem_failure: unlink_anon_vmas(dst); @@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) { struct anon_vma_chain *avc; struct anon_vma *anon_vma; + int length; /* Don't bother if the parent process has no anon_vma here. */ if (!pvma->anon_vma) @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) * First, attach the new VMA to the parent VMA's anon_vmas, * so rmap can find non-COWed pages in child processes. */ - if (anon_vma_clone(vma, pvma)) + length = anon_vma_clone(vma, pvma); + if (length < 0) return -ENOMEM; + else if (length > 1) + return 0; - /* Then add our own anon_vma. */ + /* + * Then add our own anon_vma. We do this only on the first fork after + * the anon_vma is created, as we don't want the same_vma chain to + * grow arbitrarily large. + */ anon_vma = anon_vma_alloc(); if (!anon_vma) goto out_error; -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
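To make the effect of the chain-length check easier to see, here is a rough userspace model. It is not kernel code: it assumes a single VMA per process, treats each anon_vma as a reference-counted slot, and replays the serial-fork pattern from the bug report (the child becomes the next parent, the old parent exits). All names in it (model_vma, model_fork, model_live_after_serial_forks, MODEL_MAX) are hypothetical, and pages, locking and the root anon_vma are ignored.

```c
#include <string.h>

/* Userspace model only; every identifier here is made up for illustration. */
#define MODEL_MAX 2048

struct model_vma {
	int chain[MODEL_MAX];	/* indices of the anon_vmas this VMA is linked to */
	int len;		/* length of the modelled same_vma chain */
};

static int model_refs[MODEL_MAX];	/* chain links held on each anon_vma */
static int model_nr_anon_vmas;		/* anon_vmas allocated so far */

static void model_link(struct model_vma *vma, int anon_vma)
{
	vma->chain[vma->len++] = anon_vma;
	model_refs[anon_vma]++;
}

/* Mirrors the shape of the patched anon_vma_fork(); limit <= 0 disables the check. */
static void model_fork(const struct model_vma *parent, struct model_vma *child,
		       int limit)
{
	int length = 0;

	/* "Clone": the child links to every anon_vma on the parent's chain. */
	child->len = 0;
	for (int i = 0; i < parent->len; i++) {
		model_link(child, parent->chain[i]);
		length++;
	}
	if (limit > 0 && length > limit)
		return;		/* chain already long: no new anon_vma */

	/* Otherwise the child also gets an anon_vma of its own. */
	model_link(child, model_nr_anon_vmas++);
}

/* Process exit: drop this VMA's links. */
static void model_exit(struct model_vma *vma)
{
	for (int i = 0; i < vma->len; i++)
		model_refs[vma->chain[i]]--;
	vma->len = 0;
}

/*
 * Run `generations` serial forks and return how many anon_vmas are still
 * referenced afterwards, or -1 if the model's fixed tables are too small.
 */
int model_live_after_serial_forks(int generations, int limit)
{
	static struct model_vma a, b;
	struct model_vma *cur = &a, *next = &b;
	int live = 0;

	if (generations < 0 || generations >= MODEL_MAX)
		return -1;

	memset(model_refs, 0, sizeof(model_refs));
	model_nr_anon_vmas = 0;
	cur->len = 0;
	model_link(cur, model_nr_anon_vmas++);	/* the initial process */

	for (int g = 0; g < generations; g++) {
		struct model_vma *tmp;

		model_fork(cur, next, limit);
		model_exit(cur);		/* old parent goes away */
		tmp = cur;
		cur = next;
		next = tmp;
	}

	for (int i = 0; i < model_nr_anon_vmas; i++)
		if (model_refs[i] > 0)
			live++;
	return live;
}
```

In this model, 1000 serial forks with the check disabled leave 1001 anon_vmas referenced, because the surviving process still links to every ancestor's anon_vma. With a limit of 5 the count stops at 6 no matter how many generations are run.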
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx153.postini.com [74.125.245.153]) by kanga.kvack.org (Postfix) with SMTP id 01D3E6B0044 for ; Tue, 21 Aug 2012 23:30:00 -0400 (EDT) Message-ID: <50345232.4090002@redhat.com> Date: Tue, 21 Aug 2012 23:29:54 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> In-Reply-To: <20120822032057.GA30871@google.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >> Instead of adding an atomic count for page references, we could limit >> the anon_vma stacking depth. In fork, we would only clone anon_vmas >> that have a low enough generation count. I think that's not great >> (adds a special case for the deep-fork-without-exec behavior), but >> still better than the atomic page reference counter. > > Here is an attached patch to demonstrate the idea. > > anon_vma_clone() is modified to return the length of the existing same_vma > anon vma chain, and we create a new anon_vma in the child only on the first > fork (this could be tweaked to allow up to a set number of forks, but > I think the first fork would cover all the common forking server cases). I suspect we need 2 or 3. Some forking servers first fork off one child, and have the original parent exit, in order to "background the server". 
That first child then becomes the parent to the real child processes that do the work. It is conceivable that we might need an extra level for processes that do something special with privilege dropping, namespace changing, etc... Even setting the threshold to 5 should be totally harmless, since the problem does not kick in until we have really long chains, like in Dan's bug report. -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx177.postini.com [74.125.245.177]) by kanga.kvack.org (Postfix) with SMTP id 03A6A6B0082 for ; Mon, 3 Jun 2013 15:50:04 -0400 (EDT) Date: Mon, 3 Jun 2013 14:50:03 -0500 From: Daniel Forrest Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20130603195003.GA31275@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50345232.4090002@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: > On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > >On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > >>Instead of adding an atomic count for page references, we could limit > >>the anon_vma stacking depth. In fork, we would only clone anon_vmas > >>that have a low enough generation count. 
I think that's not great > >>(adds a special case for the deep-fork-without-exec behavior), but > >>still better than the atomic page reference counter. > > > >Here is an attached patch to demonstrate the idea. > > > >anon_vma_clone() is modified to return the length of the existing same_vma > >anon vma chain, and we create a new anon_vma in the child only on the first > >fork (this could be tweaked to allow up to a set number of forks, but > >I think the first fork would cover all the common forking server cases). > > I suspect we need 2 or 3. > > Some forking servers first fork off one child, and have > the original parent exit, in order to "background the server". > That first child then becomes the parent to the real child > processes that do the work. > > It is conceivable that we might need an extra level for > processes that do something special with privilege dropping, > namespace changing, etc... > > Even setting the threshold to 5 should be totally harmless, > since the problem does not kick in until we have really > long chains, like in Dan's bug report. I have been running with Michel's patch (with the threshold set to 5) for quite a few months now and can confirm that it does indeed solve my problem. I am not a kernel developer, so I would appreciate if one of you could push this into the kernel tree. NOTE: I have attached Michel's patch with "(length > 1)" modified to "(length > 5)" and added a "Tested-by:". --- On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > Instead of adding an atomic count for page references, we could limit > the anon_vma stacking depth. In fork, we would only clone anon_vmas > that have a low enough generation count. I think that's not great > (adds a special case for the deep-fork-without-exec behavior), but > still better than the atomic page reference counter. Here is an attached patch to demonstrate the idea. 
anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse
Tested-by: Daniel Forrest
---
 mm/mmap.c |  6 +++---
 mm/rmap.c | 18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only on the first fork after
+	 * the anon_vma is created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx142.postini.com [74.125.245.142]) by kanga.kvack.org (Postfix) with SMTP id 2458C6B0087 for ; Tue, 4 Jun 2013 06:37:29 -0400 (EDT) Message-ID: <51ADC365.4010307@redhat.com> Date: Tue, 04 Jun 2013 06:37:25 -0400 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> In-Reply-To: <20130603195003.GA31275@evergreen.ssec.wisc.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org On 06/03/2013 03:50 PM, Daniel Forrest wrote: > On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: >> On 08/21/2012 11:20 PM, Michel Lespinasse wrote: >>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >>>> Instead of adding an atomic count for page references, we could limit >>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas >>>> that have a low enough generation count. I think that's not great >>>> (adds a special case for the deep-fork-without-exec behavior), but >>>> still better than the atomic page reference counter. >>> >>> Here is an attached patch to demonstrate the idea. >>> >>> anon_vma_clone() is modified to return the length of the existing same_vma >>> anon vma chain, and we create a new anon_vma in the child only on the first >>> fork (this could be tweaked to allow up to a set number of forks, but >>> I think the first fork would cover all the common forking server cases). 
>> >> I suspect we need 2 or 3. >> >> Some forking servers first fork off one child, and have >> the original parent exit, in order to "background the server". >> That first child then becomes the parent to the real child >> processes that do the work. >> >> It is conceivable that we might need an extra level for >> processes that do something special with privilege dropping, >> namespace changing, etc... >> >> Even setting the threshold to 5 should be totally harmless, >> since the problem does not kick in until we have really >> long chains, like in Dan's bug report. > > I have been running with Michel's patch (with the threshold set to 5) > for quite a few months now and can confirm that it does indeed solve > my problem. I am not a kernel developer, so I would appreciate if one > of you could push this into the kernel tree. > > NOTE: I have attached Michel's patch with "(length > 1)" modified to > "(length > 5)" and added a "Tested-by:". Thank you for testing this. I believe this code should go into the Linux kernel, since it closes up what could be a denial of service attack (albeit a local one) with the anonvma code. > On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >> Instead of adding an atomic count for page references, we could limit >> the anon_vma stacking depth. In fork, we would only clone anon_vmas >> that have a low enough generation count. I think that's not great >> (adds a special case for the deep-fork-without-exec behavior), but >> still better than the atomic page reference counter. > > Here is an attached patch to demonstrate the idea. > > anon_vma_clone() is modified to return the length of the existing same_vma > anon vma chain, and we create a new anon_vma in the child only on the first > fork (this could be tweaked to allow up to a set number of forks, but > I think the first fork would cover all the common forking server cases). 
> > Signed-off-by: Michel Lespinasse > Tested-by: Daniel Forrest Reviewed-by: Rik van Riel -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id B32756B0034 for ; Wed, 5 Jun 2013 10:03:02 -0400 (EDT) Date: Wed, 5 Jun 2013 16:02:58 +0200 From: Andrea Arcangeli Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20130605140258.GL3463@redhat.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <51ADC365.4010307@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51ADC365.4010307@redhat.com> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org On Tue, Jun 04, 2013 at 06:37:25AM -0400, Rik van Riel wrote: > On 06/03/2013 03:50 PM, Daniel Forrest wrote: > > On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: > >> On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > >>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > >>>> Instead of adding an atomic count for page references, we could limit > >>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas > >>>> that have a low enough generation count. I think that's not great > >>>> (adds a special case for the deep-fork-without-exec behavior), but > >>>> still better than the atomic page reference counter. 
> >>> > >>> Here is an attached patch to demonstrate the idea. > >>> > >>> anon_vma_clone() is modified to return the length of the existing same_vma > >>> anon vma chain, and we create a new anon_vma in the child only on the first > >>> fork (this could be tweaked to allow up to a set number of forks, but > >>> I think the first fork would cover all the common forking server cases). > >> > >> I suspect we need 2 or 3. > >> > >> Some forking servers first fork off one child, and have > >> the original parent exit, in order to "background the server". > >> That first child then becomes the parent to the real child > >> processes that do the work. > >> > >> It is conceivable that we might need an extra level for > >> processes that do something special with privilege dropping, > >> namespace changing, etc... > >> > >> Even setting the threshold to 5 should be totally harmless, > >> since the problem does not kick in until we have really > >> long chains, like in Dan's bug report. > > > > I have been running with Michel's patch (with the threshold set to 5) > > for quite a few months now and can confirm that it does indeed solve > > my problem. I am not a kernel developer, so I would appreciate if one > > of you could push this into the kernel tree. > > > > NOTE: I have attached Michel's patch with "(length > 1)" modified to > > "(length > 5)" and added a "Tested-by:". > > Thank you for testing this. > > I believe this code should go into the Linux kernel, > since it closes up what could be a denial of service > attack (albeit a local one) with the anonvma code. Agreed. The only thing I don't like about this patch is the hardcoding of number 5: could we make it a variable to tweak with sysfs/sysctl so if some weird workload arises we have a tuning tweak? It'd cost one cacheline during fork, so it doesn't look excessive overhead. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f181.google.com (mail-ig0-f181.google.com [209.85.213.181]) by kanga.kvack.org (Postfix) with ESMTP id 2116E6B00CB for ; Fri, 14 Nov 2014 11:30:56 -0500 (EST) Received: by mail-ig0-f181.google.com with SMTP id l13so1932862iga.2 for ; Fri, 14 Nov 2014 08:30:55 -0800 (PST) Received: from cosmos.ssec.wisc.edu ([2607:f388:1090:0:fab1:56ff:fedf:5d9c]) by mx.google.com with ESMTP id iv13si39813136icc.88.2014.11.14.08.30.53 for ; Fri, 14 Nov 2014 08:30:54 -0800 (PST) Date: Fri, 14 Nov 2014 10:30:53 -0600 From: Daniel Forrest Subject: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141114163053.GA6547@cosmos.ssec.wisc.edu> Reply-To: Daniel Forrest References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130603195003.GA31275@evergreen.ssec.wisc.edu> Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Tim Hartrick , Michal Hocko There have been a couple of inquiries about the status of this patch over the last few months, so I am going to try pushing it out. Andrea Arcangeli has commented: > Agreed. The only thing I don't like about this patch is the hardcoding > of number 5: could we make it a variable to tweak with sysfs/sysctl so > if some weird workload arises we have a tuning tweak? It'd cost one > cacheline during fork, so it doesn't look excessive overhead. 
Adding this is beyond my experience level, so if it is required then
someone else will have to make it so.

Rik van Riel has commented:

> I believe we should just merge that patch.
>
> I have not seen any better ideas come by.
>
> The comment should probably be fixed to reflect the
> chain length of 5 though :)

So here is Michel's patch again with "(length > 1)" modified to
"(length > 5)" and fixed comments.

I have been running with this patch (with the threshold set to 5) for
over two years now and it does indeed solve the problem.

---

anon_vma_clone() is modified to return the length of the existing
same_vma anon vma chain, and we create a new anon_vma in the child
only if it is no more than five forks after the anon_vma was created,
as we don't want the same_vma chain to grow arbitrarily large.

Signed-off-by: Michel Lespinasse
Tested-by: Daniel Forrest
---
 mm/mmap.c |  6 +++---
 mm/rmap.c | 18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only for five forks after
+	 * the anon_vma was created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-ig0-f176.google.com (mail-ig0-f176.google.com [209.85.213.176]) by kanga.kvack.org (Postfix) with ESMTP id AB6316B0069 for ; Mon, 17 Nov 2014 19:02:15 -0500 (EST)
Received: by mail-ig0-f176.google.com with SMTP id l13so95059iga.15 for ; Mon, 17 Nov 2014 16:02:15 -0800 (PST)
Received: from mail.linuxfoundation.org (mail.linuxfoundation.org.
[140.211.169.12]) by mx.google.com with ESMTPS id em5si56956487icb.55.2014.11.17.16.02.14 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Nov 2014 16:02:14 -0800 (PST) Date: Mon, 17 Nov 2014 16:02:12 -0800 From: Andrew Morton Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-Id: <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> In-Reply-To: <20141114163053.GA6547@cosmos.ssec.wisc.edu> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Daniel Forrest Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest wrote: > There have been a couple of inquiries about the status of this patch > over the last few months, so I am going to try pushing it out. > > Andrea Arcangeli has commented: > > > Agreed. The only thing I don't like about this patch is the hardcoding > > of number 5: could we make it a variable to tweak with sysfs/sysctl so > > if some weird workload arises we have a tuning tweak? It'd cost one > > cacheline during fork, so it doesn't look excessive overhead. > > Adding this is beyond my experience level, so if it is required then > someone else will have to make it so. > > Rik van Riel has commented: > > > I believe we should just merge that patch. > > > > I have not seen any better ideas come by. 
> > > > The comment should probably be fixed to reflect the > > chain length of 5 though :) > > So here is Michel's patch again with "(length > 1)" modified to > "(length > 5)" and fixed comments. > > I have been running with this patch (with the threshold set to 5) for > over two years now and it does indeed solve the problem. > > --- > > anon_vma_clone() is modified to return the length of the existing > same_vma anon vma chain, and we create a new anon_vma in the child > if it is more than five forks after the anon_vma was created, as we > don't want the same_vma chain to grow arbitrarily large. hoo boy, what's going on here. - Under what circumstances are we seeing this slab windup? - What are the consequences? Can it OOM the machine? - Why is this occurring? There aren't an infinite number of vmas, so there shouldn't be an infinite number of anon_vmas or anon_vma_chains. - IOW, what has to be done to fix this properly? - What are the runtime consequences of limiting the length of the chain? > ... > > @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > * First, attach the new VMA to the parent VMA's anon_vmas, > * so rmap can find non-COWed pages in child processes. > */ > - if (anon_vma_clone(vma, pvma)) > + length = anon_vma_clone(vma, pvma); > + if (length < 0) > return -ENOMEM; This should propagate the anon_vma_clone() return val instead of assuming ENOMEM. But that won't fix anything... > + else if (length > 5) > + return 0; > > - /* Then add our own anon_vma. */ > + /* > + * Then add our own anon_vma. We do this only for five forks after > + * the anon_vma was created, as we don't want the same_vma chain to > + * grow arbitrarily large. > + */ > anon_vma = anon_vma_alloc(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
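On the return-value point above, a minimal userspace sketch of what propagating the error might look like. mock_anon_vma_clone() and mock_anon_vma_fork() are hypothetical stand-ins, not the kernel functions; the mock takes the errno to fail with as a parameter purely so that the propagation can be exercised, and the out-parameter needs_new_anon_vma is likewise an assumption of this sketch.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for anon_vma_clone(): returns the parent's chain
 * length on success, or a negative errno when asked to fail.
 */
static int mock_anon_vma_clone(int parent_chain_len, int fail_errno)
{
	if (fail_errno)
		return -fail_errno;
	return parent_chain_len;
}

/*
 * Hypothetical stand-in for the patched anon_vma_fork(). Returns 0 on
 * success and reports through *needs_new_anon_vma whether the child would
 * get its own anon_vma; on failure it hands back whatever error the clone
 * step produced instead of assuming -ENOMEM.
 */
int mock_anon_vma_fork(int parent_chain_len, int fail_errno, int limit,
		       int *needs_new_anon_vma)
{
	int length = mock_anon_vma_clone(parent_chain_len, fail_errno);

	if (length < 0)
		return length;	/* propagate, do not hardcode -ENOMEM */

	*needs_new_anon_vma = (length <= limit);
	return 0;
}
```

As noted later in the thread, the real anon_vma_clone() only ever fails with -ENOMEM, so in practice the two forms behave the same; the propagating form just stays correct if that ever changes.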
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ie0-f173.google.com (mail-ie0-f173.google.com [209.85.223.173]) by kanga.kvack.org (Postfix) with ESMTP id 158646B0069 for ; Mon, 17 Nov 2014 20:41:38 -0500 (EST) Received: by mail-ie0-f173.google.com with SMTP id y20so6350796ier.32 for ; Mon, 17 Nov 2014 17:41:37 -0800 (PST) Received: from cosmos.ssec.wisc.edu ([2607:f388:1090:0:fab1:56ff:fedf:5d9c]) by mx.google.com with ESMTP id l10si57181056icx.43.2014.11.17.17.41.36 for ; Mon, 17 Nov 2014 17:41:36 -0800 (PST) Date: Mon, 17 Nov 2014 19:41:35 -0600 From: Daniel Forrest Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141118014135.GA17252@cosmos.ssec.wisc.edu> Reply-To: Daniel Forrest References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote: > On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest wrote: > > > There have been a couple of inquiries about the status of this patch > > over the last few months, so I am going to try pushing it out. > > > > Andrea Arcangeli has commented: > > > > > Agreed. 
The only thing I don't like about this patch is the hardcoding > > > of number 5: could we make it a variable to tweak with sysfs/sysctl so > > > if some weird workload arises we have a tuning tweak? It'd cost one > > > cacheline during fork, so it doesn't look excessive overhead. > > > > Adding this is beyond my experience level, so if it is required then > > someone else will have to make it so. > > > > Rik van Riel has commented: > > > > > I believe we should just merge that patch. > > > > > > I have not seen any better ideas come by. > > > > > > The comment should probably be fixed to reflect the > > > chain length of 5 though :) > > > > So here is Michel's patch again with "(length > 1)" modified to > > "(length > 5)" and fixed comments. > > > > I have been running with this patch (with the threshold set to 5) for > > over two years now and it does indeed solve the problem. > > > > --- > > > > anon_vma_clone() is modified to return the length of the existing > > same_vma anon vma chain, and we create a new anon_vma in the child > > if it is more than five forks after the anon_vma was created, as we > > don't want the same_vma chain to grow arbitrarily large. > > hoo boy, what's going on here. > > - Under what circumstances are we seeing this slab windup? The original bug report is here: https://lkml.org/lkml/2012/8/15/765 > - What are the consequences? Can it OOM the machine? Yes, eventually you run out of SLAB space. > - Why is this occurring? There aren't an infinite number of vmas, so > there shouldn't be an infinite number of anon_vmas or > anon_vma_chains. Because of the serial forking there does indeed end up being an infinite number of vmas. The initial vma can never be deleted (even though the initial parent process has long since terminated) because the initial vma is referenced by the children. > - IOW, what has to be done to fix this properly? As far as I know, this is the best solution. 
I tried a refcounting solution based on comments by Rik van Riel: https://lkml.org/lkml/2012/8/17/536 But it didn't fully work, probably because I didn't quite get the locking done properly. In any case, at this point questions came up about the overhead of the page refcounting and Michel Lespinasse suggested the initial version of this patch: https://lkml.org/lkml/2012/8/21/730 > - What are the runtime consequences of limiting the length of the chain? I can't say, but it only affects users who fork more than five levels deep without doing an exec. On the other hand, there are at least three users (Tim Hartrick, Michal Hocko, and myself) who have real world applications where the consequence of no patch is a crashed system. I would suggest reading the thread starting with my initial bug report for what others have had to say about this. > > ... > > > > @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > > * First, attach the new VMA to the parent VMA's anon_vmas, > > * so rmap can find non-COWed pages in child processes. > > */ > > - if (anon_vma_clone(vma, pvma)) > > + length = anon_vma_clone(vma, pvma); > > + if (length < 0) > > return -ENOMEM; > > This should propagate the anon_vma_clone() return val instead of > assuming ENOMEM. But that won't fix anything... Agreed, but the only failure return value of anon_vma_clone is -ENOMEM. Scanning the code in __split_vma (mm/mmap.c) it looks like the error return is lost (between Linux 3.11 and 3.12 the err variable is now used before the call to anon_vma_clone and the default initial value of -ENOMEM is overwritten). This is an actual bug in the current code. I can update the patch to fix these issues. > > + else if (length > 5) > > + return 0; > > > > - /* Then add our own anon_vma. */ > > + /* > > + * Then add our own anon_vma. We do this only for five forks after > > + * the anon_vma was created, as we don't want the same_vma chain to > > + * grow arbitrarily large. 
> > + */ > > anon_vma = anon_vma_alloc(); -- Daniel K. Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f45.google.com (mail-qg0-f45.google.com [209.85.192.45]) by kanga.kvack.org (Postfix) with ESMTP id 7796E6B0069 for ; Mon, 17 Nov 2014 21:42:11 -0500 (EST) Received: by mail-qg0-f45.google.com with SMTP id z107so15873010qgd.4 for ; Mon, 17 Nov 2014 18:42:11 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id r10si66553821qat.106.2014.11.17.18.42.09 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 17 Nov 2014 18:42:10 -0800 (PST) Message-ID: <546AB1F5.6030306@redhat.com> Date: Mon, 17 Nov 2014 21:41:57 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> In-Reply-To: <20141118014135.GA17252@cosmos.ssec.wisc.edu> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/17/2014 08:41 PM, Daniel Forrest wrote: > On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew 
Morton wrote: >> On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest >> wrote: >> >>> There have been a couple of inquiries about the status of this >>> patch over the last few months, so I am going to try pushing it >>> out. >>> >>> Andrea Arcangeli has commented: >>> >>>> Agreed. The only thing I don't like about this patch is the >>>> hardcoding of number 5: could we make it a variable to tweak >>>> with sysfs/sysctl so if some weird workload arises we have a >>>> tuning tweak? It'd cost one cacheline during fork, so it >>>> doesn't look excessive overhead. >>> >>> Adding this is beyond my experience level, so if it is required >>> then someone else will have to make it so. >>> >>> Rik van Riel has commented: >>> >>>> I believe we should just merge that patch. >>>> >>>> I have not seen any better ideas come by. >>>> >>>> The comment should probably be fixed to reflect the chain >>>> length of 5 though :) >>> >>> So here is Michel's patch again with "(length > 1)" modified >>> to "(length > 5)" and fixed comments. >>> >>> I have been running with this patch (with the threshold set to >>> 5) for over two years now and it does indeed solve the >>> problem. >>> >>> --- >>> >>> anon_vma_clone() is modified to return the length of the >>> existing same_vma anon vma chain, and we create a new anon_vma >>> in the child if it is more than five forks after the anon_vma >>> was created, as we don't want the same_vma chain to grow >>> arbitrarily large. >> >> hoo boy, what's going on here. >> >> - Under what circumstances are we seeing this slab windup? > > The original bug report is here: > > https://lkml.org/lkml/2012/8/15/765 > >> - What are the consequences? Can it OOM the machine? > > Yes, eventually you run out of SLAB space. > >> - Why is this occurring? There aren't an infinite number of >> vmas, so there shouldn't be an infinite number of anon_vmas or >> anon_vma_chains. > > Because of the serial forking there does indeed end up being an > infinite number of vmas. 
The initial vma can never be deleted > (even though the initial parent process has long since terminated) > because the initial vma is referenced by the children. There is a finite number of VMAs, but an infinite number of anon_vmas. Subtle, yet deadly... >> - IOW, what has to be done to fix this properly? > > As far as I know, this is the best solution. I tried a > refcounting solution based on comments by Rik van Riel: > > https://lkml.org/lkml/2012/8/17/536 > > But it didn't fully work, probably because I didn't quite get the > locking done properly. In any case, at this point questions came > up about the overhead of the page refcounting and Michel > Lespinasse suggested the initial version of this patch: > > https://lkml.org/lkml/2012/8/21/730 > >> - What are the runtime consequences of limiting the length of the >> chain? > > I can't say, but it only affects users who fork more than five > levels deep without doing an exec. On the other hand, there are at > least three users (Tim Hartrick, Michal Hocko, and myself) who have > real world applications where the consequence of no patch is a > crashed system. > > I would suggest reading the thread starting with my initial bug > report for what others have had to say about this. I suspect what Andrew is hinting at is that the changelog for the patch should contain a detailed description of exactly what the bug is, how it is triggered, what the symptoms are, and how the patch avoids it. That way people can understand what the code does simply by looking at the changelog - no need to go find old linux-kernel mailing list threads. >>> ... >>> >>> @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct >>> *vma, struct vm_area_struct *pvma) * First, attach the new VMA >>> to the parent VMA's anon_vmas, * so rmap can find non-COWed >>> pages in child processes.
*/ - if (anon_vma_clone(vma, pvma)) + >>> length = anon_vma_clone(vma, pvma); + if (length < 0) return >>> -ENOMEM; >> >> This should propagate the anon_vma_clone() return val instead of >> assuming ENOMEM. But that won't fix anything... > > Agreed, but the only failure return value of anon_vma_clone is > -ENOMEM. > > Scanning the code in __split_vma (mm/mmap.c) it looks like the > error return is lost (between Linux 3.11 and 3.12 the err variable > is now used before the call to anon_vma_clone and the default > initial value of -ENOMEM is overwritten). This is an actual bug in > the current code. > > I can update the patch to fix these issues. > >>> + else if (length > 5) + return 0; >>> >>> - /* Then add our own anon_vma. */ + /* + * Then add our own >>> anon_vma. We do this only for five forks after + * the >>> anon_vma was created, as we don't want the same_vma chain to + >>> * grow arbitrarily large. + */ anon_vma = anon_vma_alloc(); > - -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUarH1AAoJEM553pKExN6DXwUH/RHNwGTYhzzwIQtbtMqnHYjE YWriqPLIOW8yWh85hkrmTsjWIegbDnEsbgNRX0Y8ANrKgx+vWRRW/eJ/s+Z+m7UY lD1DKO3vIfUSQvL4QHnViTEgEHfdychnhe0SE/kMeQbnLpUw8ywviJxX0UibeLdK L/F8xMzpUj/PBkNTtPxQRevWwUEMMMY6RS8RjHNBADe9ym/Fjd0dzAkoPCYCUapT barWfI9RMC3gYfyObFNBNYyaYyyK1FlAyBq52d/W8xCBW/5EIhEtFBGben/lAuEP alJt+jnFq4B1tXQtJIu1YBhY4OhuqWQy5lbz7NFPxg8+cECVPd3Vq6O2Bxilz9U= =GLaM -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
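The growth pattern discussed in this thread can be reproduced in miniature outside the kernel. What follows is a user-space toy model in C, not kernel code: toy_anon_vma, toy_vma, toy_fork(), toy_exit() and simulate() are hypothetical names invented for illustration, and the model assumes a single VMA per process and frees an anon_vma only once no chain links to it. It mimics the check in the quoted hunk: clone the parent's chain, then allocate a new anon_vma only if the cloned chain is not longer than the limit.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_CHAIN 4096	/* toy bound, large enough for small experiments */

/* Toy stand-in for struct anon_vma: only counts the chains that link to it. */
struct toy_anon_vma {
	int refs;
};

/* Toy stand-in for one VMA and its same_vma chain. */
struct toy_vma {
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];
	int len;
};

static int toy_live;	/* toy_anon_vma objects currently allocated */

static struct toy_anon_vma *toy_anon_vma_alloc(void)
{
	struct toy_anon_vma *av = calloc(1, sizeof(*av));

	assert(av);
	av->refs = 1;
	toy_live++;
	return av;
}

/*
 * Model of fork: link the child to every anon_vma of the parent, then add
 * the child's own anon_vma unless the cloned chain is longer than limit.
 * A negative limit models the unpatched behaviour (always allocate).
 */
static struct toy_vma *toy_fork(const struct toy_vma *parent, int limit)
{
	struct toy_vma *child = calloc(1, sizeof(*child));
	int i;

	assert(child);
	for (i = 0; i < parent->len; i++) {
		child->chain[i] = parent->chain[i];
		child->chain[i]->refs++;
	}
	child->len = parent->len;

	if (limit < 0 || child->len <= limit) {
		assert(child->len < TOY_MAX_CHAIN);
		child->chain[child->len++] = toy_anon_vma_alloc();
	}
	return child;
}

/* Model of exit: unlink from every anon_vma and free the unreferenced ones. */
static void toy_exit(struct toy_vma *vma)
{
	int i;

	for (i = 0; i < vma->len; i++) {
		if (--vma->chain[i]->refs == 0) {
			free(vma->chain[i]);
			toy_live--;
		}
	}
	free(vma);
}

/*
 * Serial forking: each child becomes the next parent and the old parent
 * exits. Returns the number of live toy anon_vmas after the last fork.
 */
static int simulate(int generations, int limit)
{
	struct toy_vma *cur = calloc(1, sizeof(*cur));
	int g, live;

	assert(cur);
	toy_live = 0;
	cur->chain[cur->len++] = toy_anon_vma_alloc();

	for (g = 0; g < generations; g++) {
		struct toy_vma *child = toy_fork(cur, limit);

		toy_exit(cur);
		cur = child;
	}
	live = toy_live;
	toy_exit(cur);
	return live;
}
```

In this model simulate(1000, -1) leaves 1001 anon_vmas alive although only one process remains, while simulate(1000, 5) levels off at 6. The real kernel also allocates anon_vma_chain objects and has many VMAs per process, so actual numbers will differ; only the shape of the growth is meant to carry over.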
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ig0-f178.google.com (mail-ig0-f178.google.com [209.85.213.178]) by kanga.kvack.org (Postfix) with ESMTP id 68EF76B0038 for ; Tue, 18 Nov 2014 15:19:40 -0500 (EST) Received: by mail-ig0-f178.google.com with SMTP id hl2so4885248igb.11 for ; Tue, 18 Nov 2014 12:19:40 -0800 (PST) Received: from mail.linuxfoundation.org (mail.linuxfoundation.org. [140.211.169.12]) by mx.google.com with ESMTPS id c9si19474036igo.10.2014.11.18.12.19.37 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 18 Nov 2014 12:19:38 -0800 (PST) Date: Tue, 18 Nov 2014 12:19:36 -0800 From: Andrew Morton Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-Id: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: <546AB1F5.6030306@redhat.com> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: > > Because of the serial forking there does indeed end up being an > > infinite number of vmas. The initial vma can never be deleted > > (even though the initial parent process has long since terminated) > > because the initial vma is referenced by the children. > > There is a finite number of VMAs, but an infinite number of > anon_vmas.
> > Subtle, yet deadly... Well, we clearly have the data structures screwed up. I've forgotten enough about this code for me to be unable to work out what the fixed up data structures would look like :( But surely there is some proper solution here. Help? > > I can't say, but it only affects users who fork more than five > > levels deep without doing an exec. On the other hand, there are at > > least three users (Tim Hartrick, Michal Hocko, and myself) who have > > real world applications where the consequence of no patch is a > > crashed system. > > > > I would suggest reading the thread starting with my initial bug > > report for what others have had to say about this. > > I suspect what Andrew is hinting at is that the > changelog for the patch should contain a detailed > description of exactly what the bug is, how it is > triggered, what the symptoms are, and how the > patch avoids it. > > That way people can understand what the code does > simply by looking at the changelog - no need to go > find old linux-kernel mailing list threads. Yes please, there's a ton of stuff here which we should attempt to capture. https://lkml.org/lkml/2012/8/15/765 is useful. I'm assuming that with the "foo < 5" hack, an application which forked 5 times then did a lot of work would still trigger the "catastrophic issue at page reclaim time" issue which Rik identified at https://lkml.org/lkml/2012/8/20/265? There are real-world workloads which are triggering this slab growth problem, yes? (Detail them in the changelog, please). This bug snuck under my radar last time - we're permitting unprivileged userspace to exhaust memory and that's bad. I'm OK with the foo<5 thing for -stable kernels, as it is simple. But I'm reluctant to merge (or at least to retain) it in mainline because then everyone will run away and think about other stuff and this bug will never get fixed properly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f52.google.com (mail-wg0-f52.google.com [74.125.82.52]) by kanga.kvack.org (Postfix) with ESMTP id B203B6B0038 for ; Tue, 18 Nov 2014 17:15:47 -0500 (EST) Received: by mail-wg0-f52.google.com with SMTP id a1so5406759wgh.25 for ; Tue, 18 Nov 2014 14:15:47 -0800 (PST) Received: from mail-wg0-x236.google.com (mail-wg0-x236.google.com. [2a00:1450:400c:c00::236]) by mx.google.com with ESMTPS id aq2si122186wjc.52.2014.11.18.14.15.46 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 18 Nov 2014 14:15:47 -0800 (PST) Received: by mail-wg0-f54.google.com with SMTP id y10so7440666wgg.13 for ; Tue, 18 Nov 2014 14:15:46 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> Date: Wed, 19 Nov 2014 02:15:46 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton wrote: > On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: > >> > Because of the serial forking there does indeed end up being an >> > infinite number of vmas. 
The initial vma can never be deleted >> > (even though the initial parent process has long since terminated) >> > because the initial vma is referenced by the children. >> >> There is a finite number of VMAs, but an infinite number of >> anon_vmas. >> >> Subtle, yet deadly... > > Well, we clearly have the data structures screwed up. I've forgotten > enough about this code for me to be unable to work out what the fixed > up data structures would look like :( But surely there is some proper > solution here. Help? Not sure if it's right, but we could probably reuse an old anon_vma from the chain on fork if it has already lost all the vmas which point to it. For the endlessly forking exploit this should work much like the proposed patch, which stops branching after some depth, but without the magic constant. > >> > I can't say, but it only affects users who fork more than five >> > levels deep without doing an exec. On the other hand, there are at >> > least three users (Tim Hartrick, Michal Hocko, and myself) who have >> > real world applications where the consequence of no patch is a >> > crashed system. >> > >> > I would suggest reading the thread starting with my initial bug >> > report for what others have had to say about this. >> >> I suspect what Andrew is hinting at is that the >> changelog for the patch should contain a detailed >> description of exactly what the bug is, how it is >> triggered, what the symptoms are, and how the >> patch avoids it. >> >> That way people can understand what the code does >> simply by looking at the changelog - no need to go >> find old linux-kernel mailing list threads. > > Yes please, there's a ton of stuff here which we should attempt to > capture. > > https://lkml.org/lkml/2012/8/15/765 is useful. > > I'm assuming that with the "foo < 5" hack, an application which forked > 5 times then did a lot of work would still trigger the "catastrophic > issue at page reclaim time" issue which Rik identified at > https://lkml.org/lkml/2012/8/20/265?
> > There are real-world workloads which are triggering this slab growth > problem, yes? (Detail them in the changelog, please). > > This bug snuck under my radar last time - we're permitting unprivileged > userspace to exhaust memory and that's bad. I'm OK with the foo<5 > thing for -stable kernels, as it is simple. But I'm reluctant to merge > (or at least to retain) it in mainline because then everyone will run > away and think about other stuff and this bug will never get fixed > properly. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pd0-f182.google.com (mail-pd0-f182.google.com [209.85.192.182]) by kanga.kvack.org (Postfix) with ESMTP id 177F06B006C for ; Tue, 18 Nov 2014 18:02:06 -0500 (EST) Received: by mail-pd0-f182.google.com with SMTP id r10so12319pdi.13 for ; Tue, 18 Nov 2014 15:02:05 -0800 (PST) Received: from mail-pd0-x22d.google.com (mail-pd0-x22d.google.com. 
[2607:f8b0:400e:c02::22d]) by mx.google.com with ESMTPS id sg2si3126819pbb.148.2014.11.18.15.02.04 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 18 Nov 2014 15:02:04 -0800 (PST) Received: by mail-pd0-f173.google.com with SMTP id ft15so3343125pdb.18 for ; Tue, 18 Nov 2014 15:02:04 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> Date: Wed, 19 Nov 2014 03:02:02 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: multipart/mixed; boundary=047d7b15a9ad5e0d0605082a15d4 Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko --047d7b15a9ad5e0d0605082a15d4 Content-Type: text/plain; charset=UTF-8 On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: > On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton > wrote: >> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >> >>> > Because of the serial forking there does indeed end up being an >>> > infinite number of vmas. The initial vma can never be deleted >>> > (even though the initial parent process has long since terminated) >>> > because the initial vma is referenced by the children. >>> >>> There is a finite number of VMAs, but an infinite number of >>> anon_vmas. >>> >>> Subtle, yet deadly... >> >> Well, we clearly have the data structures screwed up.
I've forgotten >> enough about this code for me to be unable to work out what the fixed >> up data structures would look like :( But surely there is some proper >> solution here. Help? > > Not sure if it's right but probably we could reuse on fork an old anon_vma > from the chain if it's already lost all vmas which points to it. > For endlessly forking exploit this should work mostly like proposed patch > which stops branching after some depth but without magic constant. Something like this. I leave proper comment for tomorrow. > >> >>> > I can't say, but it only affects users who fork more than five >>> > levels deep without doing an exec. On the other hand, there are at >>> > least three users (Tim Hartrick, Michal Hocko, and myself) who have >>> > real world applications where the consequence of no patch is a >>> > crashed system. >>> > >>> > I would suggest reading the thread starting with my initial bug >>> > report for what others have had to say about this. >>> >>> I suspect what Andrew is hinting at is that the >>> changelog for the patch should contain a detailed >>> description of exactly what the bug is, how it is >>> triggered, what the symptoms are, and how the >>> patch avoids it. >>> >>> That way people can understand what the code does >>> simply by looking at the changelog - no need to go >>> find old linux-kernel mailing list threads. >> >> Yes please, there's a ton of stuff here which we should attempt to >> capture. >> >> https://lkml.org/lkml/2012/8/15/765 is useful. >> >> I'm assuming that with the "foo < 5" hack, an application which forked >> 5 times then did a lot of work would still trigger the "catastrophic >> issue at page reclaim time" issue which Rik identified at >> https://lkml.org/lkml/2012/8/20/265? >> >> There are real-world workloads which are triggering this slab growth >> problem, yes? (Detail them in the changelog, please). 
>> >> This bug snuck under my radar last time - we're permitting unprivileged >> userspace to exhaust memory and that's bad. I'm OK with the foo<5 >> thing for -stable kernels, as it is simple. But I'm reluctant to merge >> (or at least to retain) it in mainline because then everyone will run >> away and think about other stuff and this bug will never get fixed >> properly. >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org --047d7b15a9ad5e0d0605082a15d4 Content-Type: application/octet-stream; name=mm-reuse-old-anon_vma-if-it-s-lost-all-vmas Content-Disposition: attachment; filename=mm-reuse-old-anon_vma-if-it-s-lost-all-vmas Content-Transfer-Encoding: 7bit X-Attachment-Id: f_i2nv7pxo0

mm: reuse old anon_vma if it's lost all vmas

From: Konstantin Khlebnikov <koct9i@gmail.com>

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 include/linux/rmap.h |    2 ++
 mm/rmap.c            |   14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..d40ca08 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -36,6 +36,8 @@ struct anon_vma {
 	 */
 	atomic_t refcount;
 
+	int nr_vmas;	/* Number of direct references from vmas */
+
 	/*
 	 * NOTE: the LSB of the rb_root.rb_node is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
diff --git a/mm/rmap.c b/mm/rmap.c
index 19886fb..ced4754 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@ static inline struct anon_vma *anon_vma_alloc(void)
 	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 	if (anon_vma) {
 		atomic_set(&anon_vma->refcount, 1);
+		anon_vma->nr_vmas = 1;
 		/*
 		 * Initialise the anon_vma root to point to itself. If called
 		 * from fork, the root will be reset to the parents anon_vma.
@@ -256,7 +257,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		if (!dst->anon_vma && !anon_vma->nr_vmas)
+			dst->anon_vma = anon_vma;
 	}
+	if (dst->anon_vma)
+		dst->anon_vma->nr_vmas++;
 	unlock_anon_vma_root(root);
 	return 0;
 
@@ -279,6 +284,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (!pvma->anon_vma)
 		return 0;
 
+	/* Drop parent anon_vma, we want find or allocate our own. */
+	vma->anon_vma = NULL;
+
 	/*
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
@@ -286,6 +294,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (anon_vma_clone(vma, pvma))
 		return -ENOMEM;
 
+	/* Old anon_vma has been reused. */
+	if (vma->anon_vma)
+		return 0;
+
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
@@ -345,6 +357,8 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
+	if (vma->anon_vma)
+		vma->anon_vma->nr_vmas--;
 	unlock_anon_vma_root(root);
 
 	/*

--047d7b15a9ad5e0d0605082a15d4-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f172.google.com (mail-wi0-f172.google.com [209.85.212.172]) by kanga.kvack.org (Postfix) with ESMTP id C2F6F6B0038 for ; Tue, 18 Nov 2014 18:50:04 -0500 (EST) Received: by mail-wi0-f172.google.com with SMTP id n3so3624949wiv.5 for ; Tue, 18 Nov 2014 15:50:04 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id t5si440739wjr.21.2014.11.18.15.50.03 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 18 Nov 2014 15:50:03 -0800 (PST) Message-ID: <546BDB29.9050403@suse.cz> Date: Wed, 19 Nov 2014 00:50:01 +0100 From: Vlastimil Babka MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov , Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On 11/19/2014 12:02 AM,
Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >> wrote: >>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>> >>>> > Because of the serial forking there does indeed end up being an >>>> > infinite number of vmas. The initial vma can never be deleted >>>> > (even though the initial parent process has long since terminated) >>>> > because the initial vma is referenced by the children. >>>> >>>> There is a finite number of VMAs, but an infinite number of >>>> anon_vmas. >>>> >>>> Subtle, yet deadly... >>> >>> Well, we clearly have the data structures screwed up. I've forgotten >>> enough about this code for me to be unable to work out what the fixed >>> up data structures would look like :( But surely there is some proper >>> solution here. Help? >> >> Not sure if it's right but probably we could reuse on fork an old anon_vma >> from the chain if it's already lost all vmas which points to it. >> For endlessly forking exploit this should work mostly like proposed patch >> which stops branching after some depth but without magic constant. > > Something like this. I leave proper comment for tomorrow. Hmm, I'm not sure that will work as it is. If I understand it correctly, your patch can detect whether the parent's anon_vma has no references of its own at fork() time. But at fork() time the parent is still alive; it only exits after the fork, right? So I guess it still has its own references, the child will still allocate its new anon_vma, and the problem is not solved. So maybe we could detect that the own references dropped to zero when the parent does exit, and then change the mapping of all relevant pages to the root anon_vma, and destroy the avc's of the children and the anon_vma itself. But that sounds quite heavyweight :/ Vlastimil >> >>> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f180.google.com (mail-qc0-f180.google.com [209.85.216.180]) by kanga.kvack.org (Postfix) with ESMTP id 6D37E6B0038 for ; Tue, 18 Nov 2014 21:48:16 -0500 (EST) Received: by mail-qc0-f180.google.com with SMTP id i8so8788749qcq.11 for ; Tue, 18 Nov 2014 18:48:16 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id q110si525021qgd.122.2014.11.18.18.48.13 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 18 Nov 2014 18:48:15 -0800 (PST) Message-ID: <546C04E0.4090209@redhat.com> Date: Tue, 18 Nov 2014 21:48:00 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Andrew Morton Cc: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/18/2014 03:19 PM, Andrew Morton wrote: > On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel > wrote: > >> That way people can understand what the code does simply by >> looking at the changelog - no need to go find old linux-kernel >> mailing list 
threads. > > Yes please, there's a ton of stuff here which we should attempt to > capture. > > https://lkml.org/lkml/2012/8/15/765 is useful. > > I'm assuming that with the "foo < 5" hack, an application which > forked 5 times then did a lot of work would still trigger the > "catastrophic issue at page reclaim time" issue which Rik > identified at https://lkml.org/lkml/2012/8/20/265? It's not "forking 5 times", it is "forking >>5 generations deep". There are a few programs that do that, but it does not appear that they are forking servers like apache or sendmail (which fork from the 2nd generation, and then sometimes again to exec a helper from the 4th generation). > There are real-world workloads which are triggering this slab > growth problem, yes? (Detail them in the changelog, please). There are, but the overlap between "forks >>5 generations deep" and "forks a bajillion child processes" appears to be zero. - -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUbATgAAoJEM553pKExN6Ds84H/ixCr4Q5C09sDISuw9y/PsVI moXPbqgefpzbS316MgD1AMl7rj2OWAMiQcRGQ6yMelXOyuB89XTiBi19t5UxaSUn tuFnxeknoIL0155yTfszETRGjN9mUKoyk9HAhND1T+x2VFLwaQYyk7CdZC/h7IQ7 m1jfwlR30r0Ie6x5lkN1XaculdWdXjr7wTwUWeOVsc6lWv3kR3dC52LKsB4fv340 gBeL5sTDNNp6r5Gfr5QL7fQR0eLVvhStSmsm4GbggpVSBSCpZ++h8eTjdtHxuJO3 jtgEGAvhnLDSqRi6NG6dKoxtXW8++hnFIKBw1Ec36NTuTkbKiHo9EQujINtXWro= =/EU5 -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f43.google.com (mail-wg0-f43.google.com [74.125.82.43]) by kanga.kvack.org (Postfix) with ESMTP id 964EE6B0038 for ; Wed, 19 Nov 2014 09:36:47 -0500 (EST) Received: by mail-wg0-f43.google.com with SMTP id l18so1016716wgh.30 for ; Wed, 19 Nov 2014 06:36:47 -0800 (PST) Received: from mail-wg0-x22a.google.com (mail-wg0-x22a.google.com. [2a00:1450:400c:c00::22a]) by mx.google.com with ESMTPS id x5si1287182wiy.6.2014.11.19.06.36.46 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 19 Nov 2014 06:36:46 -0800 (PST) Received: by mail-wg0-f42.google.com with SMTP id z12so1043209wgg.1 for ; Wed, 19 Nov 2014 06:36:45 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <546BDB29.9050403@suse.cz> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> Date: Wed, 19 Nov 2014 18:36:45 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: > On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>> wrote: >>>> On Mon, 17 Nov 2014 
21:41:57 -0500 Rik van Riel wrote: >>>> >>>>> > Because of the serial forking there does indeed end up being an >>>>> > infinite number of vmas. The initial vma can never be deleted >>>>> > (even though the initial parent process has long since terminated) >>>>> > because the initial vma is referenced by the children. >>>>> >>>>> There is a finite number of VMAs, but an infite number of >>>>> anon_vmas. >>>>> >>>>> Subtle, yet deadly... >>>> >>>> Well, we clearly have the data structures screwed up. I've forgotten >>>> enough about this code for me to be unable to work out what the fixed >>>> up data structures would look like :( But surely there is some proper >>>> solution here. Help? >>> >>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>> from the chain if it's already lost all vmas which points to it. >>> For endlessly forking exploit this should work mostly like proposed patch >>> which stops branching after some depth but without magic constant. >> >> Something like this. I leave proper comment for tomorrow. > > Hmm I'm not sure that will work as it is. If I understand it correctly, your > patch can detect if the parent's anon_vma has no own references at the fork() > time. But at the fork time, the parent is still alive, it only exits after the > fork, right? So I guess it still has own references and the child will still > allocate its new anon_vma, and the problem is not solved.

But it could reuse an anon_vma from the grandparent or an older ancestor. The number of anon_vmas in the chain will be limited by the number of live processes.

I think it's better to describe this in terms of sets of anon_vmas instead of a hierarchy: at clone time a vma inherits pages from its parent together with the set of anon_vmas to which they belong. For new pages it might allocate a new anon_vma or reuse an existing one. After my patch a vma will try to reuse an anon_vma from that set which has no vmas pointing to it.
As a result there will be no parent-child relation between anon_vma and multiple pages might have equal (anon_vma, index) pair but I see no problems here. > > So maybe we could detect that the own references dropped to zero when the parent > does exit, and then change mapping of all relevant pages to the root anon_vma, > destroy avc's of children and the anon_vma itself. But that sounds quite > heavyweight :/ > > Vlastimil > >>> >>>> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by kanga.kvack.org (Postfix) with ESMTP id B81BC6B006E for ; Wed, 19 Nov 2014 11:09:52 -0500 (EST) Received: by mail-wi0-f177.google.com with SMTP id l15so2414798wiw.16 for ; Wed, 19 Nov 2014 08:09:52 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id m1si156585wiy.80.2014.11.19.08.09.51 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 19 Nov 2014 08:09:51 -0800 (PST) Message-ID: <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 17:09:49 +0100 From: Vlastimil Babka MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: >> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>>> wrote: >>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>>>> >>>>>> > Because of the serial forking there does indeed end up being an >>>>>> > infinite number of vmas. The initial vma can never be deleted >>>>>> > (even though the initial parent process has long since terminated) >>>>>> > because the initial vma is referenced by the children. >>>>>> >>>>>> There is a finite number of VMAs, but an infite number of >>>>>> anon_vmas. >>>>>> >>>>>> Subtle, yet deadly... 
>>>>> >>>>> Well, we clearly have the data structures screwed up. I've forgotten >>>>> enough about this code for me to be unable to work out what the fixed >>>>> up data structures would look like :( But surely there is some proper >>>>> solution here. Help? >>>> >>>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>>> from the chain if it's already lost all vmas which points to it. >>>> For endlessly forking exploit this should work mostly like proposed patch >>>> which stops branching after some depth but without magic constant. >>> >>> Something like this. I leave proper comment for tomorrow. >> >> Hmm I'm not sure that will work as it is. If I understand it correctly, your >> patch can detect if the parent's anon_vma has no own references at the fork() >> time. But at the fork time, the parent is still alive, it only exits after the >> fork, right? So I guess it still has own references and the child will still >> allocate its new anon_vma, and the problem is not solved. > > But it could reuse anon_vma from grandparent or older. > Count of anon_vmas in chain will be limited with count of alive processes.

Ah, I missed that it can reuse an older anon_vma, sorry.

> I think it's better to describe this in terms of sets of anon_vma > instead hierarchy: > at clone vma inherits pages from parent together with set of anon_vma > which they belong. > For new pages it might allocate new anon_vma or reuse existing. After > my patch vma > will try to reuse anon_vma from that set which has no vmas which points to it. > As a result there will be no parent-child relation between anon_vma and > multiple pages might have equal (anon_vma, index) pair but I see no > problems here.

Hmm, I wonder if the root anon_vma should be excluded from this reuse. For performance reasons, exclusive pages go to a non-root anon_vma (see __page_set_anon_rmap()), and reusing the root anon_vma would change this.
Also from reading http://lwn.net/Articles/383162/ I understand that correctness also depends on the hierarchy and I wonder if there's a danger of reintroducing a bug like the one described there. Vlastimil >> >> So maybe we could detect that the own references dropped to zero when the parent >> does exit, and then change mapping of all relevant pages to the root anon_vma, >> destroy avc's of children and the anon_vma itself. But that sounds quite >> heavyweight :/ >> >> Vlastimil >> >>>> >>>>> >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wg0-f44.google.com (mail-wg0-f44.google.com [74.125.82.44]) by kanga.kvack.org (Postfix) with ESMTP id 054A86B0038 for ; Wed, 19 Nov 2014 11:58:23 -0500 (EST) Received: by mail-wg0-f44.google.com with SMTP id b13so1320559wgh.3 for ; Wed, 19 Nov 2014 08:58:21 -0800 (PST) Received: from mail-wg0-x231.google.com (mail-wg0-x231.google.com. 
[2a00:1450:400c:c00::231]) by mx.google.com with ESMTPS id o5si3480659wjz.59.2014.11.19.08.58.20 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 19 Nov 2014 08:58:20 -0800 (PST) Received: by mail-wg0-f49.google.com with SMTP id x12so1308949wgg.8 for ; Wed, 19 Nov 2014 08:58:20 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <546CC0CD.40906@suse.cz> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 20:58:20 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Vlastimil Babka Cc: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: > On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: >>> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >>>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>>>> wrote: >>>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>>>>> >>>>>>> > Because of the serial forking there does indeed end up being an >>>>>>> > infinite number of vmas. 
The initial vma can never be deleted >>>>>>> > (even though the initial parent process has long since terminated) >>>>>>> > because the initial vma is referenced by the children. >>>>>>> >>>>>>> There is a finite number of VMAs, but an infite number of >>>>>>> anon_vmas. >>>>>>> >>>>>>> Subtle, yet deadly... >>>>>> >>>>>> Well, we clearly have the data structures screwed up. I've forgotten >>>>>> enough about this code for me to be unable to work out what the fixed >>>>>> up data structures would look like :( But surely there is some proper >>>>>> solution here. Help? >>>>> >>>>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>>>> from the chain if it's already lost all vmas which points to it. >>>>> For endlessly forking exploit this should work mostly like proposed patch >>>>> which stops branching after some depth but without magic constant. >>>> >>>> Something like this. I leave proper comment for tomorrow. >>> >>> Hmm I'm not sure that will work as it is. If I understand it correctly, your >>> patch can detect if the parent's anon_vma has no own references at the fork() >>> time. But at the fork time, the parent is still alive, it only exits after the >>> fork, right? So I guess it still has own references and the child will still >>> allocate its new anon_vma, and the problem is not solved. >> >> But it could reuse anon_vma from grandparent or older. >> Count of anon_vmas in chain will be limited with count of alive processes. > > Ah I missed that it can reuse older anon_vma, sorry. > >> I think it's better to describe this in terms of sets of anon_vma >> instead hierarchy: >> at clone vma inherits pages from parent together with set of anon_vma >> which they belong. >> For new pages it might allocate new anon_vma or reuse existing. After >> my patch vma >> will try to reuse anon_vma from that set which has no vmas which points to it. 
>> As a result there will be no parent-child relation between anon_vma and >> multiple pages might have equal (anon_vma, index) pair but I see no >> problems here. > > Hmm I wonder if root anon_vma should be excluded from this reusal. For > performance reasons, exclusive pages go to non-root anon_vma (see > __page_set_anon_rmap()) and reusing root anon_vma would change this.

This is simple: in my patch this can be achieved by bumping the root's nr_vmas by one, so that it will never be reused.

> Also from reading http://lwn.net/Articles/383162/ I understand that correctness > also depends on the hierarchy and I wonder if there's a danger of reintroducing > a bug like the one described there.

If I remember right, that was fixed by linking non-exclusively mapped pages to the root anon_vma instead of the anon_vma of the vma where the fault happened. After my patch this still works. The topological hierarchy actually isn't used. There is just one selected "root" anon_vma, which dies last. That's all.

> > Vlastimil > >>> >>> So maybe we could detect that the own references dropped to zero when the parent >>> does exit, and then change mapping of all relevant pages to the root anon_vma, >>> destroy avc's of children and the anon_vma itself. But that sounds quite >>> heavyweight :/ >>> >>> Vlastimil >>> >>>>> >>>>>> >>> >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: email@kvack.org >> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f179.google.com (mail-qc0-f179.google.com [209.85.216.179]) by kanga.kvack.org (Postfix) with ESMTP id 89C1B6B0080 for ; Wed, 19 Nov 2014 18:14:17 -0500 (EST) Received: by mail-qc0-f179.google.com with SMTP id c9so1273051qcz.38 for ; Wed, 19 Nov 2014 15:14:17 -0800 (PST) Received: from mail-qa0-x231.google.com (mail-qa0-x231.google.com. [2607:f8b0:400d:c00::231]) by mx.google.com with ESMTPS id 67si891678qgx.12.2014.11.19.15.14.16 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 19 Nov 2014 15:14:16 -0800 (PST) Received: by mail-qa0-f49.google.com with SMTP id s7so1152125qap.8 for ; Wed, 19 Nov 2014 15:14:16 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 15:14:15 -0800 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: >> Also from reading http://lwn.net/Articles/383162/ I understand that correctness >> also depends on the hierarchy and I wonder if there's a danger 
of reintroducing >> a bug like the one described there. > > If I remember right that was fixed by linking non-exclusively mapped pages to > root anon_vma instead of anon_vma from vma where fault has happened. > After my patch this still works. Topology hierarchy actually isn't used. > Here just one selected "root' anon_vma which dies last. That's all.

That's not how I remember it.

An anon_vma corresponds to a given vma V, and is used to track all vmas (V and descendant vmas) that may include a page that was originally mapped in V.

Each anon page has a link to the anon_vma corresponding to the vma it was originally faulted in, and an offset indicating where the page was located relative to that original VMA.

The anon_vma has an interval tree of struct anon_vma_chain, and each struct anon_vma_chain includes a link to a descendant-of-V vma. This allows rmap to quickly find all the vmas that may map a given page (based on the page's anon_vma and offset).

When forking or splitting vmas, the new vma is a descendant of the same vmas as the old one, so it must be added to all the anon_vma interval trees that were referencing the old one (that is, ancestors of the new vma). To that end, all the struct anon_vma_chain pointing to a given vma are kept on a linked list, and struct anon_vma_chain includes a link to the anon_vma holding the interval tree.

Locking the entire structure is done with a single lock hosted in the root anon_vma (that is, a vma that was created by mmap() and not by cloning or forking existing vmas).

Limiting the length of the ancestors linked list is correct, though it has performance implications. In the extreme case, forcing all vmas to be added on the root vma's interval tree would be correct, though it may re-introduce the performance problems that led to the introduction of anon_vma.

The good thing about Konstantin's proposal is that it does not have any magic constant like mine did.
However, I think he is mistaken in saying that hierarchy isn't used - an ancestor vma will always have more descendents than its children, and the reason for the hierarchy is to limit the number of vmas that rmap must explore. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id AC6506B0070 for ; Thu, 20 Nov 2014 09:42:04 -0500 (EST) Received: by mail-wi0-f178.google.com with SMTP id hi2so5574495wib.17 for ; Thu, 20 Nov 2014 06:42:04 -0800 (PST) Received: from mail-wg0-x234.google.com (mail-wg0-x234.google.com. [2a00:1450:400c:c00::234]) by mx.google.com with ESMTPS id db3si7641519wib.3.2014.11.20.06.42.03 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 20 Nov 2014 06:42:03 -0800 (PST) Received: by mail-wg0-f52.google.com with SMTP id a1so3895208wgh.25 for ; Thu, 20 Nov 2014 06:42:03 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Thu, 20 Nov 2014 18:42:03 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: text/plain; charset=UTF-8 Sender: 
owner-linux-mm@kvack.org List-ID: To: Michel Lespinasse Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse wrote: > On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: >>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness >>> also depends on the hierarchy and I wonder if there's a danger of reintroducing >>> a bug like the one described there. >> >> If I remember right that was fixed by linking non-exclusively mapped pages to >> root anon_vma instead of anon_vma from vma where fault has happened. >> After my patch this still works. Topology hierarchy actually isn't used. >> Here just one selected "root' anon_vma which dies last. That's all. > > That's not how I remember it. ??? That at the end of lwn article: [quote] The fix is straightforward; when linking an existing page to an anon_vma structure, the kernel needs to pick the one which is highest in the process hierarchy; that guarantees that the anon_vma will not go away prematurely. [/quote] nowdays this happens in __page_set_anon_rmap(): /* * If the page isn't exclusively mapped into this vma, * we must use the _oldest_ possible anon_vma for the * page mapping! */ if (!exclusive) anon_vma = anon_vma->root; The rest treeish of topology affects only performance. > > An anon_vma corresponds to a given vma V, and is used to track all > vmas (V and descendant vmas) that may include a page that was > originally mapped in V. > > Each anon page has a link to the anon_vma corresponding to the vma > they were originally faulted in, and an offset indicating where the > page was located relative to that original VMA. > > The anon_vma has an interval tree of struct anon_vma_chain, and each > struct anon_vma_chain includes a link to a descendent-of-V vma. 
This > allows rmap to quickly find all the vmas that may map a given page > (based on the page's anon_vma and offset). > > When forking or splitting vmas, the new vma is a descendent of the > same vmas as the old one so it must be added to all the anon_vma > interval trees that were referencing the old one (that is, ancestors > of the new vma). To that end, all the struct anon_vma_chain pointing > to a given vma are kept on a linked list, and struct anon_vma_chain > includes a link to the anon_vma holding the interval tree. > > Locking the entire structure is done with a single lock hosted in the > root anon_vma (that is, a vma that was created by mmap() and not by > cloning or forking existing vmas). > > Limit the length of the ancestors linked list is correct, though it > has performance implications. In the extreme case, forcing all vmas to > be added on the root vma's interval tree would be correct, though it > may re-introduce the performance problems that lead to the > introduction of anon_vma. > > The good thing about Konstantin's proposal is that it does not have > any magic constant like mine did. However, I think he is mistaken in > saying that hierarchy isn't used - an ancestor vma will always have > more descendents than its children, and the reason for the hierarchy > is to limit the number of vmas that rmap must explore.

I mean that after breaking the hierarchy the whole structure stays correct and the kernel won't explode. Of course, reusing an anon_vma from an ancestor makes the rmap walk less efficient, because newly allocated pages will get falsely aliased vmas where they will never be mapped.

I'm thinking about a restriction on reusing anon_vmas which might improve performance without breaking the asymptotic bound on the number of anon_vmas in the worst case. For example, this heuristic: allow reuse only of an anon_vma with a single direct descendant. It seems there will be up to about two times more anon_vmas, but false aliasing should be much lower.
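The reuse idea discussed in this thread can be sketched with a toy user-space model. This is only an illustration under stated assumptions, not the kernel code and not the actual patch: the toy_* names, the fixed-size arrays and simulate() are hypothetical, only nr_vmas is borrowed from the discussion, and "an anon_vma is freed once no vma is linked to it" is a simplification. The single-direct-descendant heuristic is not modelled. The sketch forks a chain in which every parent exits right after forking, and counts the anon_vmas left alive with and without reuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_AV 4096 /* capacity of the toy allocator (assumption) */

/* Toy stand-in for struct anon_vma; not the kernel's layout. */
struct toy_anon_vma {
    bool live;      /* allocated and not yet freed */
    int  nr_vmas;   /* VMAs whose own anon_vma is this one */
    int  nr_links;  /* VMAs linked to this one through their inherited set */
};

/* Toy stand-in for a VMA: its own anon_vma plus the set it is linked to. */
struct toy_vma {
    int own;            /* index of this VMA's own anon_vma */
    int set[MAX_AV];    /* indices of every anon_vma this VMA is linked to */
    int nr_set;
};

static struct toy_anon_vma avs[MAX_AV];
static int nr_alloc;

static int av_alloc(void)
{
    assert(nr_alloc < MAX_AV);
    avs[nr_alloc].live = true;
    return nr_alloc++;
}

static void vma_link(struct toy_vma *vma, int av)
{
    vma->set[vma->nr_set++] = av;
    avs[av].nr_links++;
}

/* Fork: inherit the parent's set, then reuse an idle anon_vma or allocate. */
static void toy_fork(const struct toy_vma *parent, struct toy_vma *child,
                     bool reuse)
{
    child->own = -1;
    child->nr_set = 0;
    for (int i = 0; i < parent->nr_set; i++) {
        int av = parent->set[i];

        vma_link(child, av);
        /* Reuse candidate: inherited, but no VMA owns it any more. */
        if (reuse && child->own < 0 && avs[av].nr_vmas == 0)
            child->own = av;
    }
    if (child->own < 0) {
        child->own = av_alloc();
        vma_link(child, child->own);
    }
    avs[child->own].nr_vmas++;
}

/* Exit: drop every link; an anon_vma with no links left is freed. */
static void toy_exit(struct toy_vma *vma)
{
    avs[vma->own].nr_vmas--;
    for (int i = 0; i < vma->nr_set; i++)
        if (--avs[vma->set[i]].nr_links == 0)
            avs[vma->set[i]].live = false;
    vma->nr_set = 0;
}

/* Fork a chain of the given depth, each parent exiting after its fork. */
int simulate(int generations, bool reuse)
{
    static struct toy_vma a, b;
    struct toy_vma *parent = &a, *child = &b;
    int live = 0;

    memset(avs, 0, sizeof(avs));
    nr_alloc = 0;

    /* The first process starts with one VMA and its own anon_vma. */
    parent->nr_set = 0;
    parent->own = av_alloc();
    vma_link(parent, parent->own);
    avs[parent->own].nr_vmas = 1;

    for (int g = 0; g < generations; g++) {
        struct toy_vma *tmp;

        toy_fork(parent, child, reuse);
        toy_exit(parent);
        tmp = parent; parent = child; child = tmp;
    }
    for (int i = 0; i < nr_alloc; i++)
        live += avs[i].live;
    return live;
}
```

In this toy model, simulate(1000, false) leaves 1001 anon_vmas alive (one per generation plus the original), while simulate(1000, true) leaves 2, which matches the claim that with reuse the count is bounded by the number of live processes rather than by the fork depth.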
> > -- > Michel "Walken" Lespinasse > A program is never fully debugged until the last user dies. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f180.google.com (mail-wi0-f180.google.com [209.85.212.180]) by kanga.kvack.org (Postfix) with ESMTP id 823C16B0069 for ; Thu, 20 Nov 2014 10:03:25 -0500 (EST) Received: by mail-wi0-f180.google.com with SMTP id n3so5677465wiv.7 for ; Thu, 20 Nov 2014 07:03:25 -0800 (PST) Received: from mail-wg0-x233.google.com (mail-wg0-x233.google.com. [2a00:1450:400c:c00::233]) by mx.google.com with ESMTPS id dj6si3650134wjc.151.2014.11.20.07.03.24 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 20 Nov 2014 07:03:24 -0800 (PST) Received: by mail-wg0-f51.google.com with SMTP id k14so3962740wgh.10 for ; Thu, 20 Nov 2014 07:03:24 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <546DFFA1.4030700@redhat.com> References: <502D42E5.7090403@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> Date: Thu, 20 Nov 2014 19:03:24 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Thu, Nov 20, 2014 at 
5:50 PM, Rik van Riel wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: > >> I'm thinking about limitation for reusing anon_vmas which might >> increase performance without breaking asymptotic estimation of >> count anon_vma in the worst case. For example this heuristic: allow >> to reuse only anon_vma with single direct descendant. It seems >> there will be arount up to two times more anon_vmas but >> false-aliasing must be much lower. > > It may even be possible to not create a child anon_vma for the > first child a parent forks, but only create a new anon_vma once > the parent clones a second child (alive at the same time as the > first child). > > That still takes care of things like apache or sendmail, but > would not create infinite anon_vmas for a task that keeps forking > itself to infinite depth without calling exec... But this scheme is still exploitable. Malicious software easily could create sequence of forks and exits which leads to infinite chain of anon_vmas. > > - -- > All rights reversed > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1 > > iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi > GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z > 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF > z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT > ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i > Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q= > =Vk+H > -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-yh0-f51.google.com (mail-yh0-f51.google.com [209.85.213.51]) by kanga.kvack.org (Postfix) with ESMTP id CAA106B0071 for ; Thu, 20 Nov 2014 10:18:35 -0500 (EST) Received: by mail-yh0-f51.google.com with SMTP id a41so1420074yho.10 for ; Thu, 20 Nov 2014 07:18:35 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com. [209.132.183.28]) by mx.google.com with ESMTPS id c68si2909423qga.110.2014.11.20.07.18.34 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 20 Nov 2014 07:18:34 -0800 (PST) Message-ID: <546DFFA1.4030700@redhat.com> Date: Thu, 20 Nov 2014 09:50:09 -0500 From: Rik van Riel MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov , Michel Lespinasse Cc: Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: > I'm thinking about limitation for reusing anon_vmas which might > increase performance without breaking asymptotic estimation of > count anon_vma in the worst case. For example this heuristic: allow > to reuse only anon_vma with single direct descendant. 
It seems > there will be arount up to two times more anon_vmas but > false-aliasing must be much lower. It may even be possible to not create a child anon_vma for the first child a parent forks, but only create a new anon_vma once the parent clones a second child (alive at the same time as the first child). That still takes care of things like apache or sendmail, but would not create infinite anon_vmas for a task that keeps forking itself to infinite depth without calling exec... - -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q= =Vk+H -----END PGP SIGNATURE----- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qg0-f45.google.com (mail-qg0-f45.google.com [209.85.192.45]) by kanga.kvack.org (Postfix) with ESMTP id 9642E6B0071 for ; Thu, 20 Nov 2014 10:27:05 -0500 (EST) Received: by mail-qg0-f45.google.com with SMTP id f51so2231853qge.18 for ; Thu, 20 Nov 2014 07:27:05 -0800 (PST) Received: from mail-qg0-x229.google.com (mail-qg0-x229.google.com. 
[2607:f8b0:400d:c04::229]) by mx.google.com with ESMTPS id c9si3037914qcm.44.2014.11.20.07.27.04 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 20 Nov 2014 07:27:04 -0800 (PST) Received: by mail-qg0-f41.google.com with SMTP id j5so2239596qga.0 for ; Thu, 20 Nov 2014 07:27:04 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Thu, 20 Nov 2014 16:27:03 +0100 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko On Thu, Nov 20, 2014 at 3:42 PM, Konstantin Khlebnikov wrote: > On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse wrote: >> On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote: >>> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: >>>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness >>>> also depends on the hierarchy and I wonder if there's a danger of reintroducing >>>> a bug like the one described there. >>> >>> If I remember right that was fixed by linking non-exclusively mapped pages to >>> root anon_vma instead of anon_vma from vma where fault has happened. >>> After my patch this still works. Topology hierarchy actually isn't used. 
>>> Here just one selected "root' anon_vma which dies last. That's all.
>>
>> That's not how I remember it.
>
> ??? That at the end of lwn article:
>
> [quote]
> The fix is straightforward; when linking an existing page to an
> anon_vma structure,
> the kernel needs to pick the one which is highest in the process hierarchy;
> that guarantees that the anon_vma will not go away prematurely.
> [/quote]
>
> nowdays this happens in __page_set_anon_rmap():
>
>         /*
>          * If the page isn't exclusively mapped into this vma,
>          * we must use the _oldest_ possible anon_vma for the
>          * page mapping!
>          */
>         if (!exclusive)
>                 anon_vma = anon_vma->root;
>
> The rest treeish of topology affects only performance.

Ah, I see what you mean.

IIRC the !exclusive bit is for pages coming back from swap, where we
don't have enough tracking info to remember where the page was first
created so we have to assume the worst case (i.e. that it was created
in the root anon_vma). My understanding was that we don't exercise
this in the non-swap case. Looking back into it, it seems that we are
now doing this with ksm and migrate as well, though.

The point remains though that moving pages higher than necessary in
the anon_vma hierarchy is OK from a correctness perspective but could
have bad implications from a performance perspective.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f176.google.com (mail-wi0-f176.google.com [209.85.212.176]) by kanga.kvack.org (Postfix) with ESMTP id 64879800CA for ; Mon, 24 Nov 2014 02:09:41 -0500 (EST) Received: by mail-wi0-f176.google.com with SMTP id ex7so4605780wid.15 for ; Sun, 23 Nov 2014 23:09:40 -0800 (PST) Received: from mail-wg0-x22a.google.com (mail-wg0-x22a.google.com. [2a00:1450:400c:c00::22a]) by mx.google.com with ESMTPS id q3si10873180wix.22.2014.11.23.23.09.40 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sun, 23 Nov 2014 23:09:40 -0800 (PST) Received: by mail-wg0-f42.google.com with SMTP id z12so11534770wgg.29 for ; Sun, 23 Nov 2014 23:09:40 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> Date: Mon, 24 Nov 2014 11:09:40 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: multipart/mixed; boundary=089e013d1da2667a570508957af9 Sender: owner-linux-mm@kvack.org List-ID: To: Rik van Riel Cc: Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko --089e013d1da2667a570508957af9 Content-Type: text/plain; charset=UTF-8 On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> On 11/20/2014 09:42 AM, Konstantin 
Khlebnikov wrote:
>>
>>> I'm thinking about limitation for reusing anon_vmas which might
>>> increase performance without breaking asymptotic estimation of
>>> count anon_vma in the worst case. For example this heuristic: allow
>>> to reuse only anon_vma with single direct descendant. It seems
>>> there will be arount up to two times more anon_vmas but
>>> false-aliasing must be much lower.

Done. RFC patch in attachment.

This patch adds a heuristic that decides whether to reuse an existing
anon_vma instead of forking a new one. It counts VMAs and direct
descendants for each anon_vma. An anon_vma with a degree lower than two
will be reused at the next fork.

As a result, each anon_vma has either a live VMA or at least two
descendants, endless chains are no longer possible, and the number of
anon_vmas is no more than two times more than the number of VMAs.

>>
>> It may even be possible to not create a child anon_vma for the
>> first child a parent forks, but only create a new anon_vma once
>> the parent clones a second child (alive at the same time as the
>> first child).
>>
>> That still takes care of things like apache or sendmail, but
>> would not create infinite anon_vmas for a task that keeps forking
>> itself to infinite depth without calling exec...
>
> But this scheme is still exploitable. Malicious software easily could create
> sequence of forks and exits which leads to infinite chain of anon_vmas.
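
For illustration only, the degree heuristic described in this message can be modeled as a toy userspace program. This is a sketch under stated assumptions, not the attached patch: every toy_* name is hypothetical, each task has a single VMA, a flat array stands in for the anon_vma_chain list, and the `linked` counter stands in for the interval tree. Two details are assumptions of the sketch as well: the root anon_vma is treated as its own parent, and the parent VMA's own anon_vma is never picked for reuse.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_MAX_CHAIN 128

struct toy_anon_vma {
	struct toy_anon_vma *parent;	/* a root is its own parent */
	unsigned degree;		/* child anon_vmas + VMAs pointing here */
	unsigned linked;		/* VMAs chained here; stands in for rb_root */
};

struct toy_vma {
	struct toy_anon_vma *anon_vma;			/* where new pages would go */
	struct toy_anon_vma *chain[TOY_MAX_CHAIN];	/* oldest first */
	int len;
};

static int toy_live;	/* anon_vmas currently allocated */

static struct toy_anon_vma *toy_alloc(void)
{
	struct toy_anon_vma *av = calloc(1, sizeof(*av));

	if (!av)
		abort();
	av->parent = av;
	av->degree = 1;		/* reference for the first VMA */
	toy_live++;
	return av;
}

static void toy_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
	assert(vma->len < TOY_MAX_CHAIN);
	vma->chain[vma->len++] = av;
	av->linked++;
}

/* First VMA of a fresh address space: create the root anon_vma. */
static void toy_prepare(struct toy_vma *vma)
{
	struct toy_anon_vma *root = toy_alloc();

	vma->len = 0;
	vma->anon_vma = root;
	toy_link(vma, root);
	root->degree++;		/* the root counts as its own child */
}

/* Fork: inherit the parent's chain, then reuse an old anon_vma or add one. */
static void toy_fork(struct toy_vma *child, const struct toy_vma *parent, int reuse)
{
	child->anon_vma = NULL;
	child->len = 0;
	for (int i = 0; i < parent->len; i++) {
		struct toy_anon_vma *av = parent->chain[i];

		toy_link(child, av);
		/* Reuse the oldest anon_vma whose degree is lower than two. */
		if (reuse && !child->anon_vma && av != parent->anon_vma && av->degree < 2)
			child->anon_vma = av;
	}
	if (child->anon_vma) {
		child->anon_vma->degree++;
		return;
	}
	struct toy_anon_vma *av = toy_alloc();

	av->parent = parent->anon_vma;
	av->parent->degree++;
	child->anon_vma = av;
	toy_link(child, av);
}

/* Exit: unlink the VMA and free every anon_vma that no VMA is chained to. */
static void toy_exit(struct toy_vma *vma)
{
	for (int i = 0; i < vma->len; i++)
		if (--vma->chain[i]->linked == 0)
			vma->chain[i]->parent->degree--;
	vma->anon_vma->degree--;
	for (int i = 0; i < vma->len; i++) {
		if (vma->chain[i]->linked == 0) {
			assert(vma->chain[i]->degree == 0);
			free(vma->chain[i]);
			toy_live--;
		}
	}
	vma->len = 0;
}

/*
 * Each generation forks once and then exits. Returns the number of
 * anon_vmas still allocated just before the last task exits.
 */
int toy_fork_chain(int generations, int reuse)
{
	struct toy_vma a, b, *cur = &a, *next = &b;

	toy_live = 0;
	toy_prepare(cur);
	for (int g = 0; g < generations; g++) {
		struct toy_vma *tmp = cur;

		toy_fork(next, cur, reuse);
		toy_exit(cur);
		cur = next;
		next = tmp;
	}
	int live = toy_live;

	toy_exit(cur);
	assert(toy_live == 0);
	return live;
}
```

In this model, toy_fork_chain(100, 0) leaves 101 anon_vmas allocated before the last task exits, while toy_fork_chain(100, 1) leaves 3, because each new task alternates between two old anon_vmas instead of adding a level. The numbers describe only the toy model and say nothing about kernel behaviour beyond the reuse rule itself.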
> >> >> - -- >> All rights reversed >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1 >> >> iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi >> GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z >> 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF >> z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT >> ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i >> Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q= >> =Vk+H >> -----END PGP SIGNATURE----- --089e013d1da2667a570508957af9 Content-Type: application/octet-stream; name=mm-prevent-endless-growth-of-anon_vma-hierarchy Content-Disposition: attachment; filename=mm-prevent-endless-growth-of-anon_vma-hierarchy Content-Transfer-Encoding: base64 X-Attachment-Id: f_i2vhurku0 bW06IHByZXZlbnQgZW5kbGVzcyBncm93dGggb2YgYW5vbl92bWEgaGllcmFyY2h5CgpGcm9tOiBL b25zdGFudGluIEtobGVibmlrb3YgPGtvY3Q5aUBnbWFpbC5jb20+CgpDb25zdGFudGx5IGZvcmtp bmcgdGFzayBjYXVzZXMgdW5saW1pdGVkIGdyb3cgb2YgYW5vbl92bWEgY2hhaW4uCkVhY2ggbmV4 dCBjaGlsZCBhbGxvY2F0ZSBuZXcgbGV2ZWwgb2YgYW5vbl92bWFzIGFuZCBsaW5rcyB2bWFzIHRv IGFsbApwcmV2aW91cyBsZXZlbHMgYmVjYXVzZSBpdCBpbmhlcml0cyBwYWdlcyBmcm9tIHRoZW0u IE5vbmUgb2YgYW5vbl92bWFzCmNhbm5vdCBiZSBmcmVlZCBiZWNhdXNlIHRoZXJlIG1pZ2h0IGJl IHBhZ2VzIHdoaWNoIHBvaW50cyB0byB0aGVtLgoKVGhpcyBwYXRjaCBhZGRzIGhldXJpc3RpYyB3 aGljaCBkZWNpZGVzIHRvIHJldXNlIGV4aXN0aW5nIGFub25fdm1hIGluc3RlYWQKb2YgZm9ya2lu ZyBuZXcgb25lLiBJdCBjb3VudHMgdm1hcyBhbmQgZGlyZWN0IGRlc2NlbmRhbnRzIGZvciBlYWNo IGFub25fdm1hLgpBbm9uX3ZtYSB3aXRoIGRlZ3JlZSBsb3dlciB0aGFuIHR3byB3aWxsIGJlIHJl dXNlZCBhdCBuZXh0IGZvcmsuCkFzIGEgcmVzdWx0IGVhY2ggYW5vbl92bWEgaGFzIGVpdGhlciBh bGl2ZSB2bWEgb3IgYXQgbGVhc3QgdHdvIGRlc2NlbmRhbnRzLAplbmRsZXNzIGNoYWlucyBhcmUg bm8gbG9uZ2VyIHBvc3NpYmxlIGFuZCBjb3VudCBvZiBhbm9uX3ZtYXMgaXMgbm8gbW9yZSB0aGFu CnR3byB0aW1lcyBtb3JlIHRoYW4gY291bnQgb2Ygdm1hcy4KClNpZ25lZC1vZmYtYnk6IEtvbnN0 YW50aW4gS2hsZWJuaWtvdiA8a29jdDlpQGdtYWlsLmNvbT4KTGluazogaHR0cDovL2xrbWwua2Vy 
bmVsLm9yZy9yLzIwMTIwODE2MDI0NjEwLkdBNTM1MEBldmVyZ3JlZW4uc3NlYy53aXNjLmVkdQot LS0KIGluY2x1ZGUvbGludXgvcm1hcC5oIHwgICAxNiArKysrKysrKysrKysrKysrCiBtbS9ybWFw LmMgICAgICAgICAgICB8ICAgMzAgKysrKysrKysrKysrKysrKysrKysrKysrKysrKystCiAyIGZp bGVzIGNoYW5nZWQsIDQ1IGluc2VydGlvbnMoKyksIDEgZGVsZXRpb24oLSkKCmRpZmYgLS1naXQg YS9pbmNsdWRlL2xpbnV4L3JtYXAuaCBiL2luY2x1ZGUvbGludXgvcm1hcC5oCmluZGV4IGMwYzJi Y2UuLmIxZDE0MGMgMTAwNjQ0Ci0tLSBhL2luY2x1ZGUvbGludXgvcm1hcC5oCisrKyBiL2luY2x1 ZGUvbGludXgvcm1hcC5oCkBAIC00NSw2ICs0NSwyMiBAQCBzdHJ1Y3QgYW5vbl92bWEgewogCSAq IG1tX3Rha2VfYWxsX2xvY2tzKCkgKG1tX2FsbF9sb2Nrc19tdXRleCkuCiAJICovCiAJc3RydWN0 IHJiX3Jvb3QgcmJfcm9vdDsJLyogSW50ZXJ2YWwgdHJlZSBvZiBwcml2YXRlICJyZWxhdGVkIiB2 bWFzICovCisKKwkvKgorCSAqIENvdW50IG9mIGNoaWxkIGFub25fdm1hcyBhbmQgVk1BcyB3aGlj aCBwb2ludHMgdG8gdGhpcyBhbm9uX3ZtYS4KKwkgKgorCSAqIFRoaXMgY291bnRlciBpcyB1c2Vk IGZvciBtYWtpbmcgZGVjaXNpb24gYWJvdXQgcmV1c2luZyBvbGQgYW5vbl92bWEKKwkgKiBpbnN0 ZWFkIG9mIGZvcmtpbmcgbmV3IG9uZS4gSXQgYWxsb3dzIHRvIGRldGVjdCBhbm9uX3ZtYXMgd2hp Y2ggaGF2ZQorCSAqIGp1c3Qgb25lIGRpcmVjdCBkZXNjZW5kYW50IGFuZCBubyB2bWFzLiBSZXVz aW5nIHN1Y2ggYW5vbl92bWEgbm90CisJICogbGVhZHMgdG8gc2lnbmlmaWNhbnQgcHJlZm9ybWFu Y2UgcmVncmVzc2lvbiBidXQgcHJldmVudHMgZGVncmFkYXRpb24KKwkgKiBvZiBhbm9uX3ZtYSBo aWVyYXJjaHkgdG8gZW5kbGVzcyBsaW5lYXIgY2hhaW4uCisJICoKKwkgKiBSb290IGFub25fdm1h IGlzIG5ldmVyIHJldXNlZCBiZWNhdXNlIGl0IGlzIGl0cyBvd24gcGFyZW50IGFuZCBpdCBoYXMK KwkgKiBhdCBsZWF0IG9uZSB2bWEgb3IgY2hpbGQsIHRodXMgYXQgZm9yayBpdCdzIGRlZ3JlZSBp cyBhdCBsZWFzdCAyLgorCSAqLworCXVuc2lnbmVkIGRlZ3JlZTsKKworCXN0cnVjdCBhbm9uX3Zt YSAqcGFyZW50OwkvKiBQYXJlbnQgb2YgdGhpcyBhbm9uX3ZtYSAqLwogfTsKIAogLyoKZGlmZiAt LWdpdCBhL21tL3JtYXAuYyBiL21tL3JtYXAuYwppbmRleCAxOTg4NmZiLi5iYTI5ZTFjIDEwMDY0 NAotLS0gYS9tbS9ybWFwLmMKKysrIGIvbW0vcm1hcC5jCkBAIC03Miw2ICs3Miw4IEBAIHN0YXRp YyBpbmxpbmUgc3RydWN0IGFub25fdm1hICphbm9uX3ZtYV9hbGxvYyh2b2lkKQogCWFub25fdm1h ID0ga21lbV9jYWNoZV9hbGxvYyhhbm9uX3ZtYV9jYWNoZXAsIEdGUF9LRVJORUwpOwogCWlmIChh 
bm9uX3ZtYSkgewogCQlhdG9taWNfc2V0KCZhbm9uX3ZtYS0+cmVmY291bnQsIDEpOworCQlhbm9u X3ZtYS0+ZGVncmVlID0gMTsJLyogUmVmZXJlbmNlIGZvciBmaXJzdCB2bWEgKi8KKwkJYW5vbl92 bWEtPnBhcmVudCA9IGFub25fdm1hOwogCQkvKgogCQkgKiBJbml0aWFsaXNlIHRoZSBhbm9uX3Zt YSByb290IHRvIHBvaW50IHRvIGl0c2VsZi4gSWYgY2FsbGVkCiAJCSAqIGZyb20gZm9yaywgdGhl IHJvb3Qgd2lsbCBiZSByZXNldCB0byB0aGUgcGFyZW50cyBhbm9uX3ZtYS4KQEAgLTE4MCw2ICsx ODIsOCBAQCBpbnQgYW5vbl92bWFfcHJlcGFyZShzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSkK IAkJCWlmICh1bmxpa2VseSghYW5vbl92bWEpKQogCQkJCWdvdG8gb3V0X2Vub21lbV9mcmVlX2F2 YzsKIAkJCWFsbG9jYXRlZCA9IGFub25fdm1hOworCQkJLyogQnVtcCBkZWdyZWUsIHJvb3QgYW5v bl92bWEgaXMgaXRzIG93biBwYXJlbnQuICovCisJCQlhbm9uX3ZtYS0+ZGVncmVlKys7CiAJCX0K IAogCQlhbm9uX3ZtYV9sb2NrX3dyaXRlKGFub25fdm1hKTsKQEAgLTI1Niw3ICsyNjAsMTcgQEAg aW50IGFub25fdm1hX2Nsb25lKHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqZHN0LCBzdHJ1Y3Qgdm1f YXJlYV9zdHJ1Y3QgKnNyYykKIAkJYW5vbl92bWEgPSBwYXZjLT5hbm9uX3ZtYTsKIAkJcm9vdCA9 IGxvY2tfYW5vbl92bWFfcm9vdChyb290LCBhbm9uX3ZtYSk7CiAJCWFub25fdm1hX2NoYWluX2xp bmsoZHN0LCBhdmMsIGFub25fdm1hKTsKKworCQkvKgorCQkgKiBSZXVzZSBleGlzdGluZyBhbm9u X3ZtYSBpZiBpdHMgZGVncmVlIGxvd2VyIHRoYW4gdHdvLAorCQkgKiB0aGF0IG1lYW5zIGl0IGhh cyBubyB2bWEgYW5kIGp1c3Qgb25lIGFub25fdm1hIGNoaWxkLgorCQkgKi8KKwkJaWYgKCFkc3Qt PmFub25fdm1hICYmIGFub25fdm1hICE9IHNyYy0+YW5vbl92bWEgJiYKKwkJCQlhbm9uX3ZtYS0+ ZGVncmVlIDwgMikKKwkJCWRzdC0+YW5vbl92bWEgPSBhbm9uX3ZtYTsKIAl9CisJaWYgKGRzdC0+ YW5vbl92bWEpCisJCWRzdC0+YW5vbl92bWEtPmRlZ3JlZSsrOwogCXVubG9ja19hbm9uX3ZtYV9y b290KHJvb3QpOwogCXJldHVybiAwOwogCkBAIC0yNzksNiArMjkzLDkgQEAgaW50IGFub25fdm1h X2Zvcmsoc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAq cHZtYSkKIAlpZiAoIXB2bWEtPmFub25fdm1hKQogCQlyZXR1cm4gMDsKIAorCS8qIERyb3AgaW5o ZXJpdGVkIGFub25fdm1hLCB3ZSdsbCByZXVzZSBvbGQgb25lIG9yIGFsbG9jYXRlIG5ldy4gKi8K Kwl2bWEtPmFub25fdm1hID0gTlVMTDsKKwogCS8qCiAJICogRmlyc3QsIGF0dGFjaCB0aGUgbmV3 IFZNQSB0byB0aGUgcGFyZW50IFZNQSdzIGFub25fdm1hcywKIAkgKiBzbyBybWFwIGNhbiBmaW5k 
IG5vbi1DT1dlZCBwYWdlcyBpbiBjaGlsZCBwcm9jZXNzZXMuCkBAIC0yODYsNiArMzAzLDEwIEBA IGludCBhbm9uX3ZtYV9mb3JrKHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1hLCBzdHJ1Y3Qgdm1f YXJlYV9zdHJ1Y3QgKnB2bWEpCiAJaWYgKGFub25fdm1hX2Nsb25lKHZtYSwgcHZtYSkpCiAJCXJl dHVybiAtRU5PTUVNOwogCisJLyogQW4gb2xkIGFub25fdm1hIGhhcyBiZWVuIHJldXNlZC4gKi8K KwlpZiAodm1hLT5hbm9uX3ZtYSkKKwkJcmV0dXJuIDA7CisKIAkvKiBUaGVuIGFkZCBvdXIgb3du IGFub25fdm1hLiAqLwogCWFub25fdm1hID0gYW5vbl92bWFfYWxsb2MoKTsKIAlpZiAoIWFub25f dm1hKQpAQCAtMjk5LDYgKzMyMCw3IEBAIGludCBhbm9uX3ZtYV9mb3JrKHN0cnVjdCB2bV9hcmVh X3N0cnVjdCAqdm1hLCBzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnB2bWEpCiAJICogbG9jayBhbnkg b2YgdGhlIGFub25fdm1hcyBpbiB0aGlzIGFub25fdm1hIHRyZWUuCiAJICovCiAJYW5vbl92bWEt PnJvb3QgPSBwdm1hLT5hbm9uX3ZtYS0+cm9vdDsKKwlhbm9uX3ZtYS0+cGFyZW50ID0gcHZtYS0+ YW5vbl92bWE7CiAJLyoKIAkgKiBXaXRoIHJlZmNvdW50cywgYW4gYW5vbl92bWEgY2FuIHN0YXkg YXJvdW5kIGxvbmdlciB0aGFuIHRoZQogCSAqIHByb2Nlc3MgaXQgYmVsb25ncyB0by4gVGhlIHJv b3QgYW5vbl92bWEgbmVlZHMgdG8gYmUgcGlubmVkIHVudGlsCkBAIC0zMDksNiArMzMxLDcgQEAg aW50IGFub25fdm1hX2Zvcmsoc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsIHN0cnVjdCB2bV9h cmVhX3N0cnVjdCAqcHZtYSkKIAl2bWEtPmFub25fdm1hID0gYW5vbl92bWE7CiAJYW5vbl92bWFf bG9ja193cml0ZShhbm9uX3ZtYSk7CiAJYW5vbl92bWFfY2hhaW5fbGluayh2bWEsIGF2YywgYW5v bl92bWEpOworCWFub25fdm1hLT5wYXJlbnQtPmRlZ3JlZSsrOwogCWFub25fdm1hX3VubG9ja193 cml0ZShhbm9uX3ZtYSk7CiAKIAlyZXR1cm4gMDsKQEAgLTMzOSwxMiArMzYyLDE2IEBAIHZvaWQg dW5saW5rX2Fub25fdm1hcyhzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZtYSkKIAkJICogTGVhdmUg ZW1wdHkgYW5vbl92bWFzIG9uIHRoZSBsaXN0IC0gd2UnbGwgbmVlZAogCQkgKiB0byBmcmVlIHRo ZW0gb3V0c2lkZSB0aGUgbG9jay4KIAkJICovCi0JCWlmIChSQl9FTVBUWV9ST09UKCZhbm9uX3Zt YS0+cmJfcm9vdCkpCisJCWlmIChSQl9FTVBUWV9ST09UKCZhbm9uX3ZtYS0+cmJfcm9vdCkpIHsK KwkJCWFub25fdm1hLT5wYXJlbnQtPmRlZ3JlZS0tOwogCQkJY29udGludWU7CisJCX0KIAogCQls aXN0X2RlbCgmYXZjLT5zYW1lX3ZtYSk7CiAJCWFub25fdm1hX2NoYWluX2ZyZWUoYXZjKTsKIAl9 CisJaWYgKHZtYS0+YW5vbl92bWEpCisJCXZtYS0+YW5vbl92bWEtPmRlZ3JlZS0tOwogCXVubG9j 
a19hbm9uX3ZtYV9yb290KHJvb3QpOwogCiAJLyoKQEAgLTM1NSw2ICszODIsNyBAQCB2b2lkIHVu bGlua19hbm9uX3ZtYXMoc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEpCiAJbGlzdF9mb3JfZWFj aF9lbnRyeV9zYWZlKGF2YywgbmV4dCwgJnZtYS0+YW5vbl92bWFfY2hhaW4sIHNhbWVfdm1hKSB7 CiAJCXN0cnVjdCBhbm9uX3ZtYSAqYW5vbl92bWEgPSBhdmMtPmFub25fdm1hOwogCisJCUJVR19P Tihhbm9uX3ZtYS0+ZGVncmVlKTsKIAkJcHV0X2Fub25fdm1hKGFub25fdm1hKTsKIAogCQlsaXN0 X2RlbCgmYXZjLT5zYW1lX3ZtYSk7Cg== --089e013d1da2667a570508957af9-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f182.google.com (mail-wi0-f182.google.com [209.85.212.182]) by kanga.kvack.org (Postfix) with ESMTP id 4862E6B0038 for ; Tue, 25 Nov 2014 05:59:56 -0500 (EST) Received: by mail-wi0-f182.google.com with SMTP id h11so1003190wiw.3 for ; Tue, 25 Nov 2014 02:59:55 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id q6si2415239wiz.104.2014.11.25.02.59.54 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 25 Nov 2014 02:59:54 -0800 (PST) Date: Tue, 25 Nov 2014 11:59:53 +0100 From: Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141125105953.GC4607@dhcp22.suse.cz> References: <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org List-ID: To: Konstantin Khlebnikov Cc: Rik van Riel , Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote: > On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: > > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- > >> Hash: SHA1 > >> > >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: > >> > >>> I'm thinking about limitation for reusing anon_vmas which might > >>> increase performance without breaking asymptotic estimation of > >>> count anon_vma in the worst case. For example this heuristic: allow > >>> to reuse only anon_vma with single direct descendant. It seems > >>> there will be arount up to two times more anon_vmas but > >>> false-aliasing must be much lower. > > Done. RFC patch in attachment. This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have applied the patch on top of 3.18.0-rc6. [ 12.380189] ------------[ cut here ]------------ [ 12.380221] kernel BUG at mm/rmap.c:385! 
[ 12.380239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 12.380272] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 12.380518] CPU: 1 PID: 3704 Comm: kdm_greet Not tainted 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 12.380554] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 12.380584] task: ffff8801272bc2c0 ti: ffff8800bcaf0000 task.ti: ffff8800bcaf0000 [ 12.380614] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 12.380653] RSP: 0018:ffff8800bcaf3d28 EFLAGS: 00010286 [ 12.380676] RAX: ffff8800bcb3e690 RBX: ffff8800bcb35e28 RCX: ffff8801272bcb60 [ 12.380706] RDX: ffff8800bcb38e70 RSI: 0000000000000001 RDI: ffff8800bcb38e70 [ 12.380734] RBP: ffff8800bcaf3d78 R08: 0000000000000000 R09: 0000000000000000 [ 12.380764] R10: 0000000000000000 R11: ffff8800bcb3e6a0 R12: ffff8800bcb3e680 [ 12.380793] R13: ffff8800bcb3e690 R14: ffff8800bcb38e70 R15: ffff8800bcb38e70 [ 12.380822] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 12.380855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 12.380880] CR2: 00007fcd2603b0e8 CR3: 0000000001a11000 CR4: 00000000000407e0 [ 12.380908] Stack: [ 12.380918] ffff8801272e9dc0 ffff8800bcb35e38 ffff8800bcb35e38 ffff8800bcb3e680 [ 12.380953] ffff8800bcaf3d78 ffff8800bcb35dc0 ffff8800bcaf3dd8 0000000000000000 [ 12.380989] 0000000000000000 ffff8800bcb35dc0 ffff8800bcaf3dc8 ffffffff81119e26 [ 12.381024] Call Trace: [ 12.381038] [] free_pgtables+0x8e/0xcc [ 12.381062] [] exit_mmap+0x84/0x123 [ 12.381086] [] mmput+0x5e/0xbb [ 12.381107] [] do_exit+0x39c/0x97e 
[ 12.381131] [] ? context_tracking_user_exit+0x79/0x116 [ 12.381160] [] ? __this_cpu_preempt_check+0x13/0x15 [ 12.381188] [] do_group_exit+0x4c/0xc9 [ 12.381212] [] SyS_exit_group+0x14/0x14 [ 12.381238] [] system_call_fastpath+0x12/0x17 [ 12.381262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 12.381445] RIP [] unlink_anon_vmas+0x12b/0x169 [ 12.381473] RSP [ 12.386659] ---[ end trace 5761ee18fca12427 ]--- [ 12.386662] Fixing recursive fault but reboot is needed! [ 13.158240] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 13.259294] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 13.259468] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 16.790917] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx [ 16.790957] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready [ 18.846524] iwlwifi 0000:02:00.0: L1 Enabled - LTR Disabled [ 18.846742] iwlwifi 0000:02:00.0: Radio type=0x0-0x3-0x1 [ 18.941594] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 19.145595] e1000e: lan0 NIC Link is Down [ 19.287399] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.391325] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.391475] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 19.573640] e1000e: lan0 NIC Link is Down [ 19.717813] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.819729] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.819883] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 22.938849] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx [ 22.938889] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready [ 23.404027] ------------[ cut here ]------------ [ 23.404056] kernel BUG at mm/rmap.c:385! 
[ 23.404074] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC [ 23.404107] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 23.404353] CPU: 1 PID: 4506 Comm: synaptikscfg Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 23.404395] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 23.404425] task: ffff8800a337c2c0 ti: ffff88009f4ec000 task.ti: ffff88009f4ec000 [ 23.404455] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 23.404494] RSP: 0018:ffff88009f4efd28 EFLAGS: 00010282 [ 23.405766] RAX: ffff88009f54d010 RBX: ffff88009f54c488 RCX: 0000000000000000 [ 23.407062] RDX: ffff88009f5a3a50 RSI: 0000000000000001 RDI: ffff88009f5a3a50 [ 23.408352] RBP: ffff88009f4efd78 R08: 0000000000000000 R09: 0000000000000000 [ 23.409597] R10: 0000000000000000 R11: ffff88009f54d020 R12: ffff88009f54d000 [ 23.410816] R13: ffff88009f54d010 R14: ffff88009f5a3a50 R15: ffff88009f5a3a50 [ 23.411998] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 23.413167] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 23.414320] CR2: 00007f7a855608f0 CR3: 00000000a328c000 CR4: 00000000000407e0 [ 23.415471] Stack: [ 23.416603] ffff8800a3390e00 ffff88009f54c498 ffff88009f54c498 ffff88009f54d000 [ 23.417747] ffff88009f4efd78 ffff88009f54c420 ffff88009f4efdd8 0000000000000000 [ 23.418892] 0000000000000000 ffff88009f54c420 ffff88009f4efdc8 ffffffff81119e26 [ 23.420027] Call Trace: [ 23.421153] [] free_pgtables+0x8e/0xcc [ 23.422273] [] exit_mmap+0x84/0x123 [ 23.423411] [] ? 
do_exit+0x358/0x97e [ 23.424537] [] mmput+0x5e/0xbb [ 23.425665] [] do_exit+0x39c/0x97e [ 23.426766] [] ? context_tracking_user_exit+0x79/0x116 [ 23.427866] [] ? __this_cpu_preempt_check+0x13/0x15 [ 23.428962] [] do_group_exit+0x4c/0xc9 [ 23.430064] [] SyS_exit_group+0x14/0x14 [ 23.431162] [] system_call_fastpath+0x12/0x17 [ 23.432262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 23.434722] RIP [] unlink_anon_vmas+0x12b/0x169 [ 23.435924] RSP [ 23.441996] ---[ end trace 5761ee18fca12428 ]--- [ 23.442001] Fixing recursive fault but reboot is needed! [ 838.179454] ------------[ cut here ]------------ [ 838.180658] kernel BUG at mm/rmap.c:385! [ 838.181843] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC [ 838.183046] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 838.186983] CPU: 1 PID: 6643 Comm: colord-sane Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 838.188240] Hardware name: Dell Inc. 
Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 838.189503] task: ffff8800c4fd8000 ti: ffff880079c6c000 task.ti: ffff880079c6c000 [ 838.190765] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 838.192045] RSP: 0018:ffff880079c6fb68 EFLAGS: 00010286 [ 838.193324] RAX: ffff8800c5a70150 RBX: ffff8800a6fd5748 RCX: 0000000000000000 [ 838.194616] RDX: ffff8800a5379840 RSI: 0000000000000001 RDI: ffff8800a5379840 [ 838.195879] RBP: ffff880079c6fbb8 R08: 0000000000000000 R09: 0000000000000000 [ 838.197100] R10: 0000000000000000 R11: ffff8800c5a70160 R12: ffff8800c5a70140 [ 838.198289] R13: ffff8800c5a70150 R14: ffff8800a5379840 R15: ffff8800a5379840 [ 838.199448] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 838.200604] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 838.201753] CR2: 00007fdfd692cde8 CR3: 0000000079d0d000 CR4: 00000000000407e0 [ 838.202902] Stack: [ 838.204029] ffff88011e6fc540 ffff8800a6fd5758 ffff8800a6fd5758 ffff8800c5a70140 [ 838.205180] ffff880079c6fbb8 ffff8800a6fd56e0 ffff880079c6fc18 0000000000000000 [ 838.206328] 0000000000000000 ffff8800a6fd56e0 ffff880079c6fc08 ffffffff81119e26 [ 838.207477] Call Trace: [ 838.208614] [] free_pgtables+0x8e/0xcc [ 838.209762] [] exit_mmap+0x84/0x123 [ 838.210897] [] ? do_exit+0x358/0x97e [ 838.212020] [] mmput+0x5e/0xbb [ 838.213132] [] do_exit+0x39c/0x97e [ 838.214232] [] ? get_signal+0xdb/0x68a [ 838.215324] [] ? poll_select_copy_remaining+0xfe/0xfe [ 838.216420] [] do_group_exit+0x4c/0xc9 [ 838.217521] [] get_signal+0x647/0x68a [ 838.218612] [] ? context_tracking_user_enter+0xdb/0x159 [ 838.219705] [] do_signal+0x28/0x657 [ 838.220796] [] ? __acct_update_integrals+0xbf/0xd4 [ 838.221894] [] ? preempt_count_sub+0xcd/0xdb [ 838.222998] [] ? vtime_account_user+0x88/0x95 [ 838.224105] [] ? _raw_spin_unlock+0x32/0x47 [ 838.225205] [] ? context_tracking_user_exit+0x79/0x116 [ 838.226308] [] ? 
context_tracking_user_exit+0x79/0x116 [ 838.227401] [] do_notify_resume+0x3f/0x94 [ 838.228495] [] int_signal+0x12/0x17 [ 838.229581] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 838.231909] RIP [] unlink_anon_vmas+0x12b/0x169 [ 838.233003] RSP [ 838.234248] ---[ end trace 5761ee18fca12429 ]--- [ 838.234251] Fixing recursive fault but reboot is needed! [ 1806.784267] ------------[ cut here ]------------ [ 1806.785322] kernel BUG at mm/rmap.c:385! [ 1806.786361] invalid opcode: 0000 [#4] PREEMPT SMP DEBUG_PAGEALLOC [ 1806.787397] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 1806.790682] CPU: 1 PID: 8135 Comm: DNS Resolver #7 Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 1806.791728] Hardware name: Dell Inc. 
Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 1806.792779] task: ffff8800b3d40000 ti: ffff880079e34000 task.ti: ffff880079e34000 [ 1806.793816] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 1806.794863] RSP: 0018:ffff880079e37d38 EFLAGS: 00010282 [ 1806.795894] RAX: ffff8800b508d790 RBX: ffff8800bcaa4e28 RCX: 0000000000000000 [ 1806.796948] RDX: ffff880124ce0f20 RSI: 0000000000000001 RDI: ffff880124ce0f20 [ 1806.798011] RBP: ffff880079e37d88 R08: 0000000000000000 R09: 0000000000000000 [ 1806.799048] R10: 00007fc2827f9db0 R11: ffff8800b508d7a0 R12: ffff8800b508d780 [ 1806.800105] R13: ffff8800b508d790 R14: ffff880124ce0f20 R15: ffff880124ce0f20 [ 1806.801143] FS: 00007fc2827fa700(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 1806.802206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1806.803244] CR2: 00007fc2c6b87000 CR3: 00000000a3063000 CR4: 00000000000407e0 [ 1806.804305] Stack: [ 1806.805329] 00007fc280754000 ffff8800bcaa4e38 ffff8800bcaa4e38 ffff8800b508d780 [ 1806.806382] 0000000081098bfb ffff8800bcaa4dc0 ffff880079e37df8 00007fc27ff00000 [ 1806.807467] 00007fc280a00000 ffff8800bcaa4dc0 ffff880079e37dd8 ffffffff81119e26 [ 1806.808536] Call Trace: [ 1806.809570] [] free_pgtables+0x8e/0xcc [ 1806.810617] [] unmap_region+0xc8/0xec [ 1806.811658] [] ? __rb_erase_color+0x122/0x1f9 [ 1806.812724] [] do_munmap+0x275/0x2f7 [ 1806.813792] [] vm_munmap+0x48/0x61 [ 1806.814841] [] SyS_munmap+0x26/0x2f [ 1806.815884] [] system_call_fastpath+0x12/0x17 [ 1806.816951] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 1806.819300] RIP [] unlink_anon_vmas+0x12b/0x169 [ 1806.820457] RSP [ 1806.822068] ---[ end trace 5761ee18fca1242a ]--- -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-la0-f50.google.com (mail-la0-f50.google.com [209.85.215.50]) by kanga.kvack.org (Postfix) with ESMTP id B289F6B0038 for ; Tue, 25 Nov 2014 07:13:18 -0500 (EST) Received: by mail-la0-f50.google.com with SMTP id pv20so407704lab.9 for ; Tue, 25 Nov 2014 04:13:18 -0800 (PST) Received: from mail-wi0-x22b.google.com (mail-wi0-x22b.google.com. [2a00:1450:400c:c05::22b]) by mx.google.com with ESMTPS id pq5si1561349wjc.165.2014.11.25.04.13.17 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 25 Nov 2014 04:13:17 -0800 (PST) Received: by mail-wi0-f171.google.com with SMTP id bs8so8893766wib.10 for ; Tue, 25 Nov 2014 04:13:16 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <20141125105953.GC4607@dhcp22.suse.cz> References: <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz> Date: Tue, 25 Nov 2014 16:13:16 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov Content-Type: multipart/mixed; boundary=089e0163503c0399e80508add695 Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko Cc: Rik van Riel , Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick --089e0163503c0399e80508add695 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote: > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote: >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: >> >> -----BEGIN PGP SIGNED MESSAGE----- >> >> Hash: SHA1 >> >> >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: >> >> >> >>> I'm thinking about limitation for reusing anon_vmas 
which might
>> >>> increase performance without breaking asymptotic estimation of
>> >>> count anon_vma in the worst case. For example this heuristic: allow
>> >>> to reuse only anon_vma with single direct descendant. It seems
>> >>> there will be around up to two times more anon_vmas but
>> >>> false-aliasing must be much lower.
>>
>> Done. RFC patch in attachment.
>
> This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have
> applied the patch on top of 3.18.0-rc6.

It seems I've screwed up the counter when anon_vma is merged in
anon_vma_prepare. The increment must be in the next if block:

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,8 +182,6 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			if (unlikely(!anon_vma))
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
-			/* Bump degree, root anon_vma is its own parent. */
-			anon_vma->degree++;
 		}
 
 		anon_vma_lock_write(anon_vma);
@@ -192,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			anon_vma_chain_link(vma, avc, anon_vma);
+			anon_vma->degree++;
 			allocated = NULL;
 			avc = NULL;
 		}

I've tested it with trinity, but probably not for long enough.

>
> [ 12.380189] ------------[ cut here ]------------
> [ 12.380221] kernel BUG at mm/rmap.c:385!
> [ 12.380239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 12.380272] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [ 12.380518] CPU: 1 PID: 3704 Comm: kdm_greet Not tainted 3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [ 12.380554] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [ 12.380584] task: ffff8801272bc2c0 ti: ffff8800bcaf0000 task.ti: ffff8800bcaf0000
> [ 12.380614] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169
> [ 12.380653] RSP: 0018:ffff8800bcaf3d28 EFLAGS: 00010286
> [ 12.380676] RAX: ffff8800bcb3e690 RBX: ffff8800bcb35e28 RCX: ffff8801272bcb60
> [ 12.380706] RDX: ffff8800bcb38e70 RSI: 0000000000000001 RDI: ffff8800bcb38e70
> [ 12.380734] RBP: ffff8800bcaf3d78 R08: 0000000000000000 R09: 0000000000000000
> [ 12.380764] R10: 0000000000000000 R11: ffff8800bcb3e6a0 R12: ffff8800bcb3e680
> [ 12.380793] R13: ffff8800bcb3e690 R14: ffff8800bcb38e70 R15: ffff8800bcb38e70
> [ 12.380822] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [ 12.380855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 12.380880] CR2: 00007fcd2603b0e8 CR3: 0000000001a11000 CR4: 00000000000407e0
> [ 12.380908] Stack:
> [ 12.380918] ffff8801272e9dc0 ffff8800bcb35e38 ffff8800bcb35e38 ffff8800bcb3e680
> [ 12.380953] ffff8800bcaf3d78 ffff8800bcb35dc0 ffff8800bcaf3dd8 0000000000000000
> [ 12.380989] 0000000000000000 ffff8800bcb35dc0 ffff8800bcaf3dc8 ffffffff81119e26
> [ 12.381024] Call Trace:
> [ 12.381038] [] free_pgtables+0x8e/0xcc
> [ 12.381062] []
exit_mmap+0x84/0x123
> [ 12.381086] [] mmput+0x5e/0xbb
> [ 12.381107] [] do_exit+0x39c/0x97e
> [ 12.381131] [] ? context_tracking_user_exit+0x79/0x116
> [ 12.381160] [] ? __this_cpu_preempt_check+0x13/0x15
> [ 12.381188] [] do_group_exit+0x4c/0xc9
> [ 12.381212] [] SyS_exit_group+0x14/0x14
> [ 12.381238] [] system_call_fastpath+0x12/0x17
> [ 12.381262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [ 12.381445] RIP [] unlink_anon_vmas+0x12b/0x169
> [ 12.381473] RSP
> [ 12.386659] ---[ end trace 5761ee18fca12427 ]---
> [ 12.386662] Fixing recursive fault but reboot is needed!
> [ 13.158240] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 13.259294] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 13.259468] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [ 16.790917] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 16.790957] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
> [ 18.846524] iwlwifi 0000:02:00.0: L1 Enabled - LTR Disabled
> [ 18.846742] iwlwifi 0000:02:00.0: Radio type=0x0-0x3-0x1
> [ 18.941594] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [ 19.145595] e1000e: lan0 NIC Link is Down
> [ 19.287399] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 19.391325] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 19.391475] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [ 19.573640] e1000e: lan0 NIC Link is Down
> [ 19.717813] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 19.819729] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [ 19.819883] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [ 22.938849] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 22.938889] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
> [ 23.404027] ------------[ cut here ]------------
> [ 23.404056] kernel BUG at mm/rmap.c:385!
> [ 23.404074] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
> [ 23.404107] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [ 23.404353] CPU: 1 PID: 4506 Comm: synaptikscfg Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [ 23.404395] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [ 23.404425] task: ffff8800a337c2c0 ti: ffff88009f4ec000 task.ti: ffff88009f4ec000
> [ 23.404455] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169
> [ 23.404494] RSP: 0018:ffff88009f4efd28 EFLAGS: 00010282
> [ 23.405766] RAX: ffff88009f54d010 RBX: ffff88009f54c488 RCX: 0000000000000000
> [ 23.407062] RDX: ffff88009f5a3a50 RSI: 0000000000000001 RDI: ffff88009f5a3a50
> [ 23.408352] RBP: ffff88009f4efd78 R08: 0000000000000000 R09: 0000000000000000
> [ 23.409597] R10: 0000000000000000 R11: ffff88009f54d020 R12: ffff88009f54d000
> [ 23.410816] R13: ffff88009f54d010 R14: ffff88009f5a3a50 R15: ffff88009f5a3a50
> [ 23.411998] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [ 23.413167] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 23.414320] CR2: 00007f7a855608f0 CR3: 00000000a328c000 CR4: 00000000000407e0
> [ 23.415471] Stack:
> [ 23.416603] ffff8800a3390e00 ffff88009f54c498 ffff88009f54c498 ffff88009f54d000
> [ 23.417747] ffff88009f4efd78 ffff88009f54c420 ffff88009f4efdd8 0000000000000000
> [ 23.418892] 0000000000000000 ffff88009f54c420 ffff88009f4efdc8 ffffffff81119e26
> [ 23.420027] Call Trace:
> [ 23.421153] [] free_pgtables+0x8e/0xcc
> [ 23.422273] []
exit_mmap+0x84/0x123
> [ 23.423411] [] ? do_exit+0x358/0x97e
> [ 23.424537] [] mmput+0x5e/0xbb
> [ 23.425665] [] do_exit+0x39c/0x97e
> [ 23.426766] [] ? context_tracking_user_exit+0x79/0x116
> [ 23.427866] [] ? __this_cpu_preempt_check+0x13/0x15
> [ 23.428962] [] do_group_exit+0x4c/0xc9
> [ 23.430064] [] SyS_exit_group+0x14/0x14
> [ 23.431162] [] system_call_fastpath+0x12/0x17
> [ 23.432262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [ 23.434722] RIP [] unlink_anon_vmas+0x12b/0x169
> [ 23.435924] RSP
> [ 23.441996] ---[ end trace 5761ee18fca12428 ]---
> [ 23.442001] Fixing recursive fault but reboot is needed!
> [ 838.179454] ------------[ cut here ]------------
> [ 838.180658] kernel BUG at mm/rmap.c:385!
> [ 838.181843] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC
> [ 838.183046] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [ 838.186983] CPU: 1 PID: 6643 Comm: colord-sane Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [ 838.188240] Hardware name: Dell Inc.
Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [ 838.189503] task: ffff8800c4fd8000 ti: ffff880079c6c000 task.ti: ffff880079c6c000
> [ 838.190765] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169
> [ 838.192045] RSP: 0018:ffff880079c6fb68 EFLAGS: 00010286
> [ 838.193324] RAX: ffff8800c5a70150 RBX: ffff8800a6fd5748 RCX: 0000000000000000
> [ 838.194616] RDX: ffff8800a5379840 RSI: 0000000000000001 RDI: ffff8800a5379840
> [ 838.195879] RBP: ffff880079c6fbb8 R08: 0000000000000000 R09: 0000000000000000
> [ 838.197100] R10: 0000000000000000 R11: ffff8800c5a70160 R12: ffff8800c5a70140
> [ 838.198289] R13: ffff8800c5a70150 R14: ffff8800a5379840 R15: ffff8800a5379840
> [ 838.199448] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [ 838.200604] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 838.201753] CR2: 00007fdfd692cde8 CR3: 0000000079d0d000 CR4: 00000000000407e0
> [ 838.202902] Stack:
> [ 838.204029] ffff88011e6fc540 ffff8800a6fd5758 ffff8800a6fd5758 ffff8800c5a70140
> [ 838.205180] ffff880079c6fbb8 ffff8800a6fd56e0 ffff880079c6fc18 0000000000000000
> [ 838.206328] 0000000000000000 ffff8800a6fd56e0 ffff880079c6fc08 ffffffff81119e26
> [ 838.207477] Call Trace:
> [ 838.208614] [] free_pgtables+0x8e/0xcc
> [ 838.209762] [] exit_mmap+0x84/0x123
> [ 838.210897] [] ? do_exit+0x358/0x97e
> [ 838.212020] [] mmput+0x5e/0xbb
> [ 838.213132] [] do_exit+0x39c/0x97e
> [ 838.214232] [] ? get_signal+0xdb/0x68a
> [ 838.215324] [] ? poll_select_copy_remaining+0xfe/0xfe
> [ 838.216420] [] do_group_exit+0x4c/0xc9
> [ 838.217521] [] get_signal+0x647/0x68a
> [ 838.218612] [] ? context_tracking_user_enter+0xdb/0x159
> [ 838.219705] [] do_signal+0x28/0x657
> [ 838.220796] [] ? __acct_update_integrals+0xbf/0xd4
> [ 838.221894] [] ? preempt_count_sub+0xcd/0xdb
> [ 838.222998] [] ? vtime_account_user+0x88/0x95
> [ 838.224105] [] ? _raw_spin_unlock+0x32/0x47
> [ 838.225205] [] ? context_tracking_user_exit+0x79/0x116
> [ 838.226308] [] ?
context_tracking_user_exit+0x79/0x116
> [ 838.227401] [] do_notify_resume+0x3f/0x94
> [ 838.228495] [] int_signal+0x12/0x17
> [ 838.229581] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [ 838.231909] RIP [] unlink_anon_vmas+0x12b/0x169
> [ 838.233003] RSP
> [ 838.234248] ---[ end trace 5761ee18fca12429 ]---
> [ 838.234251] Fixing recursive fault but reboot is needed!
> [ 1806.784267] ------------[ cut here ]------------
> [ 1806.785322] kernel BUG at mm/rmap.c:385!
> [ 1806.786361] invalid opcode: 0000 [#4] PREEMPT SMP DEBUG_PAGEALLOC
> [ 1806.787397] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [ 1806.790682] CPU: 1 PID: 8135 Comm: DNS Resolver #7 Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [ 1806.791728] Hardware name: Dell Inc.
Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [ 1806.792779] task: ffff8800b3d40000 ti: ffff880079e34000 task.ti: ffff880079e34000
> [ 1806.793816] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169
> [ 1806.794863] RSP: 0018:ffff880079e37d38 EFLAGS: 00010282
> [ 1806.795894] RAX: ffff8800b508d790 RBX: ffff8800bcaa4e28 RCX: 0000000000000000
> [ 1806.796948] RDX: ffff880124ce0f20 RSI: 0000000000000001 RDI: ffff880124ce0f20
> [ 1806.798011] RBP: ffff880079e37d88 R08: 0000000000000000 R09: 0000000000000000
> [ 1806.799048] R10: 00007fc2827f9db0 R11: ffff8800b508d7a0 R12: ffff8800b508d780
> [ 1806.800105] R13: ffff8800b508d790 R14: ffff880124ce0f20 R15: ffff880124ce0f20
> [ 1806.801143] FS: 00007fc2827fa700(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [ 1806.802206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1806.803244] CR2: 00007fc2c6b87000 CR3: 00000000a3063000 CR4: 00000000000407e0
> [ 1806.804305] Stack:
> [ 1806.805329] 00007fc280754000 ffff8800bcaa4e38 ffff8800bcaa4e38 ffff8800b508d780
> [ 1806.806382] 0000000081098bfb ffff8800bcaa4dc0 ffff880079e37df8 00007fc27ff00000
> [ 1806.807467] 00007fc280a00000 ffff8800bcaa4dc0 ffff880079e37dd8 ffffffff81119e26
> [ 1806.808536] Call Trace:
> [ 1806.809570] [] free_pgtables+0x8e/0xcc
> [ 1806.810617] [] unmap_region+0xc8/0xec
> [ 1806.811658] [] ?
__rb_erase_color+0x122/0x1f9 > [ 1806.812724] [] do_munmap+0x275/0x2f7 > [ 1806.813792] [] vm_munmap+0x48/0x61 > [ 1806.814841] [] SyS_munmap+0x26/0x2f > [ 1806.815884] [] system_call_fastpath+0x12/0x17 > [ 1806.816951] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 1= 0 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 0= 2 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 > [ 1806.819300] RIP [] unlink_anon_vmas+0x12b/0x169 > [ 1806.820457] RSP > [ 1806.822068] ---[ end trace 5761ee18fca1242a ]--- > -- > Michal Hocko > SUSE Labs --089e0163503c0399e80508add695 Content-Type: application/octet-stream; name=mm-prevent-endless-growth-of-anon_vma-hierarchy-v2 Content-Disposition: attachment; filename=mm-prevent-endless-growth-of-anon_vma-hierarchy-v2 Content-Transfer-Encoding: base64 X-Attachment-Id: f_i2x84ucm0 bW06IHByZXZlbnQgZW5kbGVzcyBncm93dGggb2YgYW5vbl92bWEgaGllcmFyY2h5CgpGcm9tOiBL b25zdGFudGluIEtobGVibmlrb3YgPGtvY3Q5aUBnbWFpbC5jb20+CgpDb25zdGFudGx5IGZvcmtp bmcgdGFzayBjYXVzZXMgdW5saW1pdGVkIGdyb3cgb2YgYW5vbl92bWEgY2hhaW4uCkVhY2ggbmV4 dCBjaGlsZCBhbGxvY2F0ZSBuZXcgbGV2ZWwgb2YgYW5vbl92bWFzIGFuZCBsaW5rcyB2bWFzIHRv IGFsbApwcmV2aW91cyBsZXZlbHMgYmVjYXVzZSBpdCBpbmhlcml0cyBwYWdlcyBmcm9tIHRoZW0u IE5vbmUgb2YgYW5vbl92bWFzCmNhbm5vdCBiZSBmcmVlZCBiZWNhdXNlIHRoZXJlIG1pZ2h0IGJl IHBhZ2VzIHdoaWNoIHBvaW50cyB0byB0aGVtLgoKVGhpcyBwYXRjaCBhZGRzIGhldXJpc3RpYyB3 aGljaCBkZWNpZGVzIHRvIHJldXNlIGV4aXN0aW5nIGFub25fdm1hIGluc3RlYWQKb2YgZm9ya2lu ZyBuZXcgb25lLiBJdCBjb3VudHMgdm1hcyBhbmQgZGlyZWN0IGRlc2NlbmRhbnRzIGZvciBlYWNo IGFub25fdm1hLgpBbm9uX3ZtYSB3aXRoIGRlZ3JlZSBsb3dlciB0aGFuIHR3byB3aWxsIGJlIHJl dXNlZCBhdCBuZXh0IGZvcmsuCkFzIGEgcmVzdWx0IGVhY2ggYW5vbl92bWEgaGFzIGVpdGhlciBh bGl2ZSB2bWEgb3IgYXQgbGVhc3QgdHdvIGRlc2NlbmRhbnRzLAplbmRsZXNzIGNoYWlucyBhcmUg bm8gbG9uZ2VyIHBvc3NpYmxlIGFuZCBjb3VudCBvZiBhbm9uX3ZtYXMgaXMgbm8gbW9yZSB0aGFu CnR3byB0aW1lcyBtb3JlIHRoYW4gY291bnQgb2Ygdm1hcy4KCnYyOiB1cGRhdGUgZGVncmVlIGlu 
IGFub25fdm1hX3ByZXBhcmUgZm9yIG1lcmdlZCBhbm9uX3ZtYQoKU2lnbmVkLW9mZi1ieTogS29u c3RhbnRpbiBLaGxlYm5pa292IDxrb2N0OWlAZ21haWwuY29tPgpMaW5rOiBodHRwOi8vbGttbC5r ZXJuZWwub3JnL3IvMjAxMjA4MTYwMjQ2MTAuR0E1MzUwQGV2ZXJncmVlbi5zc2VjLndpc2MuZWR1 Ci0tLQogaW5jbHVkZS9saW51eC9ybWFwLmggfCAgIDE2ICsrKysrKysrKysrKysrKysKIG1tL3Jt YXAuYyAgICAgICAgICAgIHwgICAzMCArKysrKysrKysrKysrKysrKysrKysrKysrKysrKy0KIDIg ZmlsZXMgY2hhbmdlZCwgNDUgaW5zZXJ0aW9ucygrKSwgMSBkZWxldGlvbigtKQoKZGlmZiAtLWdp dCBhL2luY2x1ZGUvbGludXgvcm1hcC5oIGIvaW5jbHVkZS9saW51eC9ybWFwLmgKaW5kZXggYzBj MmJjZS4uYjFkMTQwYyAxMDA2NDQKLS0tIGEvaW5jbHVkZS9saW51eC9ybWFwLmgKKysrIGIvaW5j bHVkZS9saW51eC9ybWFwLmgKQEAgLTQ1LDYgKzQ1LDIyIEBAIHN0cnVjdCBhbm9uX3ZtYSB7CiAJ ICogbW1fdGFrZV9hbGxfbG9ja3MoKSAobW1fYWxsX2xvY2tzX211dGV4KS4KIAkgKi8KIAlzdHJ1 Y3QgcmJfcm9vdCByYl9yb290OwkvKiBJbnRlcnZhbCB0cmVlIG9mIHByaXZhdGUgInJlbGF0ZWQi IHZtYXMgKi8KKworCS8qCisJICogQ291bnQgb2YgY2hpbGQgYW5vbl92bWFzIGFuZCBWTUFzIHdo aWNoIHBvaW50cyB0byB0aGlzIGFub25fdm1hLgorCSAqCisJICogVGhpcyBjb3VudGVyIGlzIHVz ZWQgZm9yIG1ha2luZyBkZWNpc2lvbiBhYm91dCByZXVzaW5nIG9sZCBhbm9uX3ZtYQorCSAqIGlu c3RlYWQgb2YgZm9ya2luZyBuZXcgb25lLiBJdCBhbGxvd3MgdG8gZGV0ZWN0IGFub25fdm1hcyB3 aGljaCBoYXZlCisJICoganVzdCBvbmUgZGlyZWN0IGRlc2NlbmRhbnQgYW5kIG5vIHZtYXMuIFJl dXNpbmcgc3VjaCBhbm9uX3ZtYSBub3QKKwkgKiBsZWFkcyB0byBzaWduaWZpY2FudCBwcmVmb3Jt YW5jZSByZWdyZXNzaW9uIGJ1dCBwcmV2ZW50cyBkZWdyYWRhdGlvbgorCSAqIG9mIGFub25fdm1h IGhpZXJhcmNoeSB0byBlbmRsZXNzIGxpbmVhciBjaGFpbi4KKwkgKgorCSAqIFJvb3QgYW5vbl92 bWEgaXMgbmV2ZXIgcmV1c2VkIGJlY2F1c2UgaXQgaXMgaXRzIG93biBwYXJlbnQgYW5kIGl0IGhh cworCSAqIGF0IGxlYXQgb25lIHZtYSBvciBjaGlsZCwgdGh1cyBhdCBmb3JrIGl0J3MgZGVncmVl IGlzIGF0IGxlYXN0IDIuCisJICovCisJdW5zaWduZWQgZGVncmVlOworCisJc3RydWN0IGFub25f dm1hICpwYXJlbnQ7CS8qIFBhcmVudCBvZiB0aGlzIGFub25fdm1hICovCiB9OwogCiAvKgpkaWZm IC0tZ2l0IGEvbW0vcm1hcC5jIGIvbW0vcm1hcC5jCmluZGV4IDE5ODg2ZmIuLmRmNWM0NGUgMTAw NjQ0Ci0tLSBhL21tL3JtYXAuYworKysgYi9tbS9ybWFwLmMKQEAgLTcyLDYgKzcyLDggQEAgc3Rh 
dGljIGlubGluZSBzdHJ1Y3QgYW5vbl92bWEgKmFub25fdm1hX2FsbG9jKHZvaWQpCiAJYW5vbl92 bWEgPSBrbWVtX2NhY2hlX2FsbG9jKGFub25fdm1hX2NhY2hlcCwgR0ZQX0tFUk5FTCk7CiAJaWYg KGFub25fdm1hKSB7CiAJCWF0b21pY19zZXQoJmFub25fdm1hLT5yZWZjb3VudCwgMSk7CisJCWFu b25fdm1hLT5kZWdyZWUgPSAxOwkvKiBSZWZlcmVuY2UgZm9yIGZpcnN0IHZtYSAqLworCQlhbm9u X3ZtYS0+cGFyZW50ID0gYW5vbl92bWE7CiAJCS8qCiAJCSAqIEluaXRpYWxpc2UgdGhlIGFub25f dm1hIHJvb3QgdG8gcG9pbnQgdG8gaXRzZWxmLiBJZiBjYWxsZWQKIAkJICogZnJvbSBmb3JrLCB0 aGUgcm9vdCB3aWxsIGJlIHJlc2V0IHRvIHRoZSBwYXJlbnRzIGFub25fdm1hLgpAQCAtMTg4LDYg KzE5MCw4IEBAIGludCBhbm9uX3ZtYV9wcmVwYXJlKHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqdm1h KQogCQlpZiAobGlrZWx5KCF2bWEtPmFub25fdm1hKSkgewogCQkJdm1hLT5hbm9uX3ZtYSA9IGFu b25fdm1hOwogCQkJYW5vbl92bWFfY2hhaW5fbGluayh2bWEsIGF2YywgYW5vbl92bWEpOworCQkJ Lyogdm1hIGxpbmsgaWYgbWVyZ2VkIG9yIGNoaWxkIGxpbmsgZm9yIG5ldyByb290ICovCisJCQlh bm9uX3ZtYS0+ZGVncmVlKys7CiAJCQlhbGxvY2F0ZWQgPSBOVUxMOwogCQkJYXZjID0gTlVMTDsK IAkJfQpAQCAtMjU2LDcgKzI2MCwxNyBAQCBpbnQgYW5vbl92bWFfY2xvbmUoc3RydWN0IHZtX2Fy ZWFfc3RydWN0ICpkc3QsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqc3JjKQogCQlhbm9uX3ZtYSA9 IHBhdmMtPmFub25fdm1hOwogCQlyb290ID0gbG9ja19hbm9uX3ZtYV9yb290KHJvb3QsIGFub25f dm1hKTsKIAkJYW5vbl92bWFfY2hhaW5fbGluayhkc3QsIGF2YywgYW5vbl92bWEpOworCisJCS8q CisJCSAqIFJldXNlIGV4aXN0aW5nIGFub25fdm1hIGlmIGl0cyBkZWdyZWUgbG93ZXIgdGhhbiB0 d28sCisJCSAqIHRoYXQgbWVhbnMgaXQgaGFzIG5vIHZtYSBhbmQganVzdCBvbmUgYW5vbl92bWEg Y2hpbGQuCisJCSAqLworCQlpZiAoIWRzdC0+YW5vbl92bWEgJiYgYW5vbl92bWEgIT0gc3JjLT5h bm9uX3ZtYSAmJgorCQkJCWFub25fdm1hLT5kZWdyZWUgPCAyKQorCQkJZHN0LT5hbm9uX3ZtYSA9 IGFub25fdm1hOwogCX0KKwlpZiAoZHN0LT5hbm9uX3ZtYSkKKwkJZHN0LT5hbm9uX3ZtYS0+ZGVn cmVlKys7CiAJdW5sb2NrX2Fub25fdm1hX3Jvb3Qocm9vdCk7CiAJcmV0dXJuIDA7CiAKQEAgLTI3 OSw2ICsyOTMsOSBAQCBpbnQgYW5vbl92bWFfZm9yayhzdHJ1Y3Qgdm1fYXJlYV9zdHJ1Y3QgKnZt YSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICpwdm1hKQogCWlmICghcHZtYS0+YW5vbl92bWEpCiAJ CXJldHVybiAwOwogCisJLyogRHJvcCBpbmhlcml0ZWQgYW5vbl92bWEsIHdlJ2xsIHJldXNlIG9s 
ZCBvbmUgb3IgYWxsb2NhdGUgbmV3LiAqLworCXZtYS0+YW5vbl92bWEgPSBOVUxMOworCiAJLyoK IAkgKiBGaXJzdCwgYXR0YWNoIHRoZSBuZXcgVk1BIHRvIHRoZSBwYXJlbnQgVk1BJ3MgYW5vbl92 bWFzLAogCSAqIHNvIHJtYXAgY2FuIGZpbmQgbm9uLUNPV2VkIHBhZ2VzIGluIGNoaWxkIHByb2Nl c3Nlcy4KQEAgLTI4Niw2ICszMDMsMTAgQEAgaW50IGFub25fdm1hX2Zvcmsoc3RydWN0IHZtX2Fy ZWFfc3RydWN0ICp2bWEsIHN0cnVjdCB2bV9hcmVhX3N0cnVjdCAqcHZtYSkKIAlpZiAoYW5vbl92 bWFfY2xvbmUodm1hLCBwdm1hKSkKIAkJcmV0dXJuIC1FTk9NRU07CiAKKwkvKiBBbiBvbGQgYW5v bl92bWEgaGFzIGJlZW4gcmV1c2VkLiAqLworCWlmICh2bWEtPmFub25fdm1hKQorCQlyZXR1cm4g MDsKKwogCS8qIFRoZW4gYWRkIG91ciBvd24gYW5vbl92bWEuICovCiAJYW5vbl92bWEgPSBhbm9u X3ZtYV9hbGxvYygpOwogCWlmICghYW5vbl92bWEpCkBAIC0yOTksNiArMzIwLDcgQEAgaW50IGFu b25fdm1hX2Zvcmsoc3RydWN0IHZtX2FyZWFfc3RydWN0ICp2bWEsIHN0cnVjdCB2bV9hcmVhX3N0 cnVjdCAqcHZtYSkKIAkgKiBsb2NrIGFueSBvZiB0aGUgYW5vbl92bWFzIGluIHRoaXMgYW5vbl92 bWEgdHJlZS4KIAkgKi8KIAlhbm9uX3ZtYS0+cm9vdCA9IHB2bWEtPmFub25fdm1hLT5yb290Owor CWFub25fdm1hLT5wYXJlbnQgPSBwdm1hLT5hbm9uX3ZtYTsKIAkvKgogCSAqIFdpdGggcmVmY291 bnRzLCBhbiBhbm9uX3ZtYSBjYW4gc3RheSBhcm91bmQgbG9uZ2VyIHRoYW4gdGhlCiAJICogcHJv Y2VzcyBpdCBiZWxvbmdzIHRvLiBUaGUgcm9vdCBhbm9uX3ZtYSBuZWVkcyB0byBiZSBwaW5uZWQg dW50aWwKQEAgLTMwOSw2ICszMzEsNyBAQCBpbnQgYW5vbl92bWFfZm9yayhzdHJ1Y3Qgdm1fYXJl YV9zdHJ1Y3QgKnZtYSwgc3RydWN0IHZtX2FyZWFfc3RydWN0ICpwdm1hKQogCXZtYS0+YW5vbl92 bWEgPSBhbm9uX3ZtYTsKIAlhbm9uX3ZtYV9sb2NrX3dyaXRlKGFub25fdm1hKTsKIAlhbm9uX3Zt YV9jaGFpbl9saW5rKHZtYSwgYXZjLCBhbm9uX3ZtYSk7CisJYW5vbl92bWEtPnBhcmVudC0+ZGVn cmVlKys7CiAJYW5vbl92bWFfdW5sb2NrX3dyaXRlKGFub25fdm1hKTsKIAogCXJldHVybiAwOwpA QCAtMzM5LDEyICszNjIsMTYgQEAgdm9pZCB1bmxpbmtfYW5vbl92bWFzKHN0cnVjdCB2bV9hcmVh X3N0cnVjdCAqdm1hKQogCQkgKiBMZWF2ZSBlbXB0eSBhbm9uX3ZtYXMgb24gdGhlIGxpc3QgLSB3 ZSdsbCBuZWVkCiAJCSAqIHRvIGZyZWUgdGhlbSBvdXRzaWRlIHRoZSBsb2NrLgogCQkgKi8KLQkJ aWYgKFJCX0VNUFRZX1JPT1QoJmFub25fdm1hLT5yYl9yb290KSkKKwkJaWYgKFJCX0VNUFRZX1JP T1QoJmFub25fdm1hLT5yYl9yb290KSkgeworCQkJYW5vbl92bWEtPnBhcmVudC0+ZGVncmVlLS07 
CiAJCQljb250aW51ZTsKKwkJfQogCiAJCWxpc3RfZGVsKCZhdmMtPnNhbWVfdm1hKTsKIAkJYW5v bl92bWFfY2hhaW5fZnJlZShhdmMpOwogCX0KKwlpZiAodm1hLT5hbm9uX3ZtYSkKKwkJdm1hLT5h bm9uX3ZtYS0+ZGVncmVlLS07CiAJdW5sb2NrX2Fub25fdm1hX3Jvb3Qocm9vdCk7CiAKIAkvKgpA QCAtMzU1LDYgKzM4Miw3IEBAIHZvaWQgdW5saW5rX2Fub25fdm1hcyhzdHJ1Y3Qgdm1fYXJlYV9z dHJ1Y3QgKnZtYSkKIAlsaXN0X2Zvcl9lYWNoX2VudHJ5X3NhZmUoYXZjLCBuZXh0LCAmdm1hLT5h bm9uX3ZtYV9jaGFpbiwgc2FtZV92bWEpIHsKIAkJc3RydWN0IGFub25fdm1hICphbm9uX3ZtYSA9 IGF2Yy0+YW5vbl92bWE7CiAKKwkJQlVHX09OKGFub25fdm1hLT5kZWdyZWUpOwogCQlwdXRfYW5v bl92bWEoYW5vbl92bWEpOwogCiAJCWxpc3RfZGVsKCZhdmMtPnNhbWVfdm1hKTsK --089e0163503c0399e80508add695-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f173.google.com (mail-wi0-f173.google.com [209.85.212.173]) by kanga.kvack.org (Postfix) with ESMTP id AF5C56B0069 for ; Tue, 25 Nov 2014 10:00:08 -0500 (EST) Received: by mail-wi0-f173.google.com with SMTP id r20so9443114wiv.0 for ; Tue, 25 Nov 2014 07:00:08 -0800 (PST) Received: from mx2.suse.de (cantor2.suse.de. 
[195.135.220.15]) by mx.google.com with ESMTPS id gp6si3532677wib.41.2014.11.25.07.00.07 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 25 Nov 2014 07:00:07 -0800 (PST)
Date: Tue, 25 Nov 2014 16:00:06 +0100
From: Michal Hocko
Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound
Message-ID: <20141125150006.GB4415@dhcp22.suse.cz>
References: <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Konstantin Khlebnikov
Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List, "linux-mm@kvack.org", Tim Hartrick

On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
> On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote:
> > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote:
> >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote:
> >> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> Hash: SHA1
> >> >>
> >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> >> >>
> >> >>> I'm thinking about limitation for reusing anon_vmas which might
> >> >>> increase performance without breaking asymptotic estimation of
> >> >>> count anon_vma in the worst case. For example this heuristic: allow
> >> >>> to reuse only anon_vma with single direct descendant. It seems
> >> >>> there will be around up to two times more anon_vmas but
> >> >>> false-aliasing must be much lower.
> >>
> >> Done. RFC patch in attachment.
> >
> > This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have
> > applied the patch on top of 3.18.0-rc6.
>
> It seems I've screwed up the counter when anon_vma is merged in anon_vma_prepare.
> The increment must be in the next if block:
>
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -182,8 +182,6 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  			if (unlikely(!anon_vma))
>  				goto out_enomem_free_avc;
>  			allocated = anon_vma;
> -			/* Bump degree, root anon_vma is its own parent. */
> -			anon_vma->degree++;
>  		}
>
>  		anon_vma_lock_write(anon_vma);
> @@ -192,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
>
> I've tested it with trinity, but probably not for long enough.

OK, this has passed a few runs with the original reproducer:
$ date +%s; grep anon_vma /proc/slabinfo;
$ ./vma_chain_repro
$ sleep 1h
$ date +%s; grep anon_vma /proc/slabinfo
$ killall vma_chain_repro
$ date +%s; grep anon_vma /proc/slabinfo
1416923468
anon_vma 11523 11523 176 23 1 : tunables 0 0 0 : slabdata 501 501 0
1416927070
anon_vma 11477 11477 176 23 1 : tunables 0 0 0 : slabdata 499 499 0
1416927070
anon_vma 11127 11431 176 23 1 : tunables 0 0 0 : slabdata 497 497 0

anon_vmas do not seem to leak anymore. I have forwarded the patch to the
customer who was complaining about NSD, but I guess it will take some
time to get the confirmation. Anyway, thanks a lot for your help and feel
free to add
Tested-by: Michal Hocko

I have yet to look deeper into the code to give you my Reviewed-by.
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.
For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wi0-f178.google.com (mail-wi0-f178.google.com [209.85.212.178]) by kanga.kvack.org (Postfix) with ESMTP id B552B6B0069 for ; Wed, 26 Nov 2014 12:35:21 -0500 (EST)
Received: by mail-wi0-f178.google.com with SMTP id hi2so5758861wib.17 for ; Wed, 26 Nov 2014 09:35:21 -0800 (PST)
Received: from mx2.suse.de (cantor2.suse.de. [195.135.220.15]) by mx.google.com with ESMTPS id cj1si8982848wib.103.2014.11.26.09.35.20 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 26 Nov 2014 09:35:20 -0800 (PST)
Date: Wed, 26 Nov 2014 18:35:17 +0100
From: Michal Hocko
Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound
Message-ID: <20141126173517.GA8180@dhcp22.suse.cz>
References: <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz> <20141125150006.GB4415@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20141125150006.GB4415@dhcp22.suse.cz>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Konstantin Khlebnikov
Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List, "linux-mm@kvack.org", Tim Hartrick, Daniel Forrest

On Tue 25-11-14 16:00:06, Michal Hocko wrote:
> On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
> > On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote:
> > > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> > >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote:
> > >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote:
> > >> >> -----BEGIN PGP SIGNED MESSAGE-----
> > >> >> Hash: SHA1
> > >> >>
> > >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> > >> >>
> > >> >>> I'm thinking about limitation for reusing anon_vmas which might
> > >> >>> increase performance without breaking asymptotic estimation of
> > >> >>> count
anon_vma in the worst case. For example this heuristic: allow
> > >> >>> to reuse only anon_vma with single direct descendant. It seems
> > >> >>> there will be around up to two times more anon_vmas but
> > >> >>> false-aliasing must be much lower.
> > >>
> > >> Done. RFC patch in attachment.

OK, I finally managed to untangle myself from the vma chains, and your
patch makes sense to me; it is quite clever, actually. Here it is,
including the fixup.
---
> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001
> From: Konstantin Khlebnikov
> Date: Tue, 25 Nov 2014 10:54:44 +0100
> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy
>
> Constantly forking task causes unlimited grow of anon_vma chain.
> Each next child allocate new level of anon_vmas and links vmas to all
> previous levels because it inherits pages from them. None of anon_vmas
> cannot be freed because there might be pages which points to them.
>
> This patch adds heuristic which decides to reuse existing anon_vma instead
> of forking new one. It counts vmas and direct descendants for each anon_vma.
> Anon_vma with degree lower than two will be reused at next fork.
> As a result each anon_vma has either alive vma or at least two descendants,
> endless chains are no longer possible and count of anon_vmas is no more than
> two times more than count of vmas.
>
> Signed-off-by: Konstantin Khlebnikov
> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu

Tested-by: Michal Hocko
Reviewed-by: Michal Hocko

and I guess
Reported-by: Daniel Forrest
who somehow vanished from the CC list (added back) would be appropriate as
well.

plus
Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue)
and mark it for stable

Thanks!
> ---
>  include/linux/rmap.h | 16 ++++++++++++++++
>  mm/rmap.c            | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c0c2bce6b0b7..b1d140c20b37 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -45,6 +45,22 @@ struct anon_vma {
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
> +
> +	/*
> +	 * Count of child anon_vmas and VMAs which points to this anon_vma.
> +	 *
> +	 * This counter is used for making decision about reusing old anon_vma
> +	 * instead of forking new one. It allows to detect anon_vmas which have
> +	 * just one direct descendant and no vmas. Reusing such anon_vma not
> +	 * leads to significant preformance regression but prevents degradation
> +	 * of anon_vma hierarchy to endless linear chain.
> +	 *
> +	 * Root anon_vma is never reused because it is its own parent and it has
> +	 * at leat one vma or child, thus at fork it's degree is at least 2.
> +	 */
> +	unsigned degree;
> +
> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>  };
>
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19886fb2f13a..40ae8184a1e1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>  	if (anon_vma) {
>  		atomic_set(&anon_vma->refcount, 1);
> +		anon_vma->degree = 1;	/* Reference for first vma */
> +		anon_vma->parent = anon_vma;
>  		/*
>  		 * Initialise the anon_vma root to point to itself. If called
>  		 * from fork, the root will be reset to the parents anon_vma.
> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
>  		anon_vma_chain_link(dst, avc, anon_vma);
> +
> +		/*
> +		 * Reuse existing anon_vma if its degree lower than two,
> +		 * that means it has no vma and just one anon_vma child.
> +		 */
> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
> +				anon_vma->degree < 2)
> +			dst->anon_vma = anon_vma;
>  	}
> +	if (dst->anon_vma)
> +		dst->anon_vma->degree++;
>  	unlock_anon_vma_root(root);
>  	return 0;
>
> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (!pvma->anon_vma)
>  		return 0;
>
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (anon_vma_clone(vma, pvma))
>  		return -ENOMEM;
>
> +	/* An old anon_vma has been reused. */
> +	if (vma->anon_vma)
> +		return 0;
> +
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
>  	if (!anon_vma)
> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * lock any of the anon_vmas in this anon_vma tree.
>  	 */
>  	anon_vma->root = pvma->anon_vma->root;
> +	anon_vma->parent = pvma->anon_vma;
>  	/*
>  	 * With refcounts, an anon_vma can stay around longer than the
>  	 * process it belongs to.
The root anon_vma needs to be pinned until > @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > vma->anon_vma = anon_vma; > anon_vma_lock_write(anon_vma); > anon_vma_chain_link(vma, avc, anon_vma); > + anon_vma->parent->degree++; > anon_vma_unlock_write(anon_vma); > > return 0; > @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma) > * Leave empty anon_vmas on the list - we'll need > * to free them outside the lock. > */ > - if (RB_EMPTY_ROOT(&anon_vma->rb_root)) > + if (RB_EMPTY_ROOT(&anon_vma->rb_root)) { > + anon_vma->parent->degree--; > continue; > + } > > list_del(&avc->same_vma); > anon_vma_chain_free(avc); > } > + if (vma->anon_vma) > + vma->anon_vma->degree--; > unlock_anon_vma_root(root); > > /* > @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) > list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { > struct anon_vma *anon_vma = avc->anon_vma; > > + BUG_ON(anon_vma->degree); > put_anon_vma(anon_vma); > > list_del(&avc->same_vma); > -- > 2.1.3 -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qc0-f169.google.com (mail-qc0-f169.google.com [209.85.216.169]) by kanga.kvack.org (Postfix) with ESMTP id 78CB46B006E for ; Fri, 5 Dec 2014 11:18:08 -0500 (EST) Received: by mail-qc0-f169.google.com with SMTP id w7so688546qcr.14 for ; Fri, 05 Dec 2014 08:18:08 -0800 (PST) Received: from mx1.redhat.com (mx1.redhat.com.
[209.132.183.28]) by mx.google.com with ESMTPS id i6si15317310qaz.8.2014.12.05.08.18.06 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 05 Dec 2014 08:18:07 -0800 (PST) Message-ID: <5481D2F0.2090908@redhat.com> Date: Fri, 05 Dec 2014 16:44:48 +0100 From: Jerome Marchand MIME-Version: 1.0 Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz> <20141125150006.GB4415@dhcp22.suse.cz> <20141126173517.GA8180@dhcp22.suse.cz> In-Reply-To: <20141126173517.GA8180@dhcp22.suse.cz> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="ljVvaKbLTp0wFD2DBEHsqLhuNTgc52Hrq" Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Konstantin Khlebnikov Cc: Rik van Riel , Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Daniel Forrest On 11/26/2014 06:35 PM, Michal Hocko wrote: > On Tue 25-11-14 16:00:06, Michal Hocko wrote: >> On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote: >>> On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote: >>>> On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote: >>>>> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: >>>>>> On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: >>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>> Hash: SHA1 >>>>>>> >>>>>>> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: >>>>>>> >>>>>>>> I'm thinking about limitation for reusing anon_vmas which might >>>>>>>> increase performance without breaking asymptotic estimation of >>>>>>>> count anon_vma in the worst case. For example this heuristic: allow >>>>>>>> to reuse only anon_vma with single direct descendant. It seems >>>>>>>> there will be arount up to two times more anon_vmas but >>>>>>>> false-aliasing must be much lower. >>>>> >>>>> Done. RFC patch in attachment. > > Ok, finally managed to untagnle myself from vma chains and your patch > makes sense to me, it is quite clever actually. Here is it including the > fixup. > --- >> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001 >> From: Konstantin Khlebnikov >> Date: Tue, 25 Nov 2014 10:54:44 +0100 >> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy >> >> Constantly forking task causes unlimited grow of anon_vma chain. >> Each next child allocate new level of anon_vmas and links vmas to all >> previous levels because it inherits pages from them. None of anon_vmas >> cannot be freed because there might be pages which points to them. >> >> This patch adds heuristic which decides to reuse existing anon_vma instead >> of forking new one. It counts vmas and direct descendants for each anon_vma. >> Anon_vma with degree lower than two will be reused at next fork. >> As a result each anon_vma has either alive vma or at least two descendants, >> endless chains are no longer possible and count of anon_vmas is no more than >> two times more than count of vmas. >> >> Signed-off-by: Konstantin Khlebnikov >> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu > > Tested-by: Michal Hocko > Reviewed-by: Michal Hocko > > and I guess > Reported-by: Daniel Forrest
Tested-by: Jerome Marchand
Minor nitpicks below.
> > who somehow vanished from CC list (added back) would be appropriate as > well. > > plus > > Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue) > and mark it for stable > > Thanks! > >> --- >> include/linux/rmap.h | 16 ++++++++++++++++ >> mm/rmap.c | 29 ++++++++++++++++++++++++++++- >> 2 files changed, 44 insertions(+), 1 deletion(-) >> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h >> index c0c2bce6b0b7..b1d140c20b37 100644 >> --- a/include/linux/rmap.h >> +++ b/include/linux/rmap.h >> @@ -45,6 +45,22 @@ struct anon_vma { >> * mm_take_all_locks() (mm_all_locks_mutex). >> */ >> struct rb_root rb_root; /* Interval tree of private "related" vmas */ >> + >> + /* >> + * Count of child anon_vmas and VMAs which points to this anon_vma. >> + * >> + * This counter is used for making decision about reusing old anon_vma >> + * instead of forking new one. It allows to detect anon_vmas which have >> + * just one direct descendant and no vmas. Reusing such anon_vma not >> + * leads to significant preformance regression but prevents degradation
Does it or does it not lead to significant performance issue? I can't tell.
>> + * of anon_vma hierarchy to endless linear chain. >> + * >> + * Root anon_vma is never reused because it is its own parent and it has >> + * at leat one vma or child, thus at fork it's degree is at least 2.
s/leat/least/
Thanks, Jerome
>> + */ >> + unsigned degree; >> + >> + struct anon_vma *parent; /* Parent of this anon_vma */ >> }; >> >> /* >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 19886fb2f13a..40ae8184a1e1 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void) >> anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); >> if (anon_vma) { >> atomic_set(&anon_vma->refcount, 1); >> + anon_vma->degree = 1; /* Reference for first vma */ >> + anon_vma->parent = anon_vma; >> /* >> * Initialise the anon_vma root to point to itself. If called >> * from fork, the root will be reset to the parents anon_vma. >> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma) >> if (likely(!vma->anon_vma)) { >> vma->anon_vma = anon_vma; >> anon_vma_chain_link(vma, avc, anon_vma); >> + anon_vma->degree++; >> allocated = NULL; >> avc = NULL; >> } >> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) >> anon_vma = pavc->anon_vma; >> root = lock_anon_vma_root(root, anon_vma); >> anon_vma_chain_link(dst, avc, anon_vma); >> + >> + /* >> + * Reuse existing anon_vma if its degree lower than two, >> + * that means it has no vma and just one anon_vma child. >> + */ >> + if (!dst->anon_vma && anon_vma != src->anon_vma && >> + anon_vma->degree < 2) >> + dst->anon_vma = anon_vma; >> } >> + if (dst->anon_vma) >> + dst->anon_vma->degree++; >> unlock_anon_vma_root(root); >> return 0; >> >> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >> if (!pvma->anon_vma) >> return 0; >> >> + /* Drop inherited anon_vma, we'll reuse old one or allocate new. */ >> + vma->anon_vma = NULL; >> + >> /* >> * First, attach the new VMA to the parent VMA's anon_vmas, >> * so rmap can find non-COWed pages in child processes. >> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >> if (anon_vma_clone(vma, pvma)) >> return -ENOMEM; >> >> + /* An old anon_vma has been reused. */ >> + if (vma->anon_vma) >> + return 0; >> + >> /* Then add our own anon_vma. */ >> anon_vma = anon_vma_alloc(); >> if (!anon_vma) >> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >> * lock any of the anon_vmas in this anon_vma tree. >> */ >> anon_vma->root = pvma->anon_vma->root; >> + anon_vma->parent = pvma->anon_vma; >> /* >> * With refcounts, an anon_vma can stay around longer than the >> * process it belongs to. The root anon_vma needs to be pinned until >> @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) >> vma->anon_vma = anon_vma; >> anon_vma_lock_write(anon_vma); >> anon_vma_chain_link(vma, avc, anon_vma); >> + anon_vma->parent->degree++; >> anon_vma_unlock_write(anon_vma); >> >> return 0; >> @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma) >> * Leave empty anon_vmas on the list - we'll need >> * to free them outside the lock. >> */ >> - if (RB_EMPTY_ROOT(&anon_vma->rb_root)) >> + if (RB_EMPTY_ROOT(&anon_vma->rb_root)) { >> + anon_vma->parent->degree--; >> continue; >> + } >> >> list_del(&avc->same_vma); >> anon_vma_chain_free(avc); >> } >> + if (vma->anon_vma) >> + vma->anon_vma->degree--; >> unlock_anon_vma_root(root); >> >> /* >> @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma) >> list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) { >> struct anon_vma *anon_vma = avc->anon_vma; >> >> + BUG_ON(anon_vma->degree); >> put_anon_vma(anon_vma); >> >> list_del(&avc->same_vma); >> -- >> 2.1.3 >
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753810Ab2HPDRT (ORCPT ); Wed, 15 Aug 2012 23:17:19 -0400 Received: from evergreen.ssec.wisc.edu ([128.104.108.253]:59631 "EHLO evergreen.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752201Ab2HPDRR (ORCPT ); Wed, 15 Aug 2012 23:17:17 -0400 X-Greylist: delayed 1866 seconds by postgrey-1.27 at vger.kernel.org; Wed, 15 Aug 2012 23:17:17 EDT Date: Wed, 15 Aug 2012 21:46:10 -0500 From: Daniel Forrest To: linux-kernel@vger.kernel.org Subject: Repeated fork() causes SLAB to grow without bound Message-ID: <20120816024610.GA5350@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: linux-kernel@vger.kernel.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I'm hoping someone has seen this before... I've been trying to track down a performance problem with Linux 3.0.4. The symptom is system-mode load increasing over time while user-mode load remains constant while running a data ingest/processing program. Looking at /proc/meminfo I noticed SUnreclaim increasing steadily. Looking at /proc/slabinfo I noticed anon_vma and anon_vma_chain also increasing steadily. I was able to generate a simple test program that will cause this:

---

#include <unistd.h>

int main(int argc, char *argv[])
{
	pid_t pid;

	while (1) {
		pid = fork();
		if (pid == -1) {
			/* error */
			return 1;
		}
		if (pid) {
			/* parent */
			sleep(2);
			break;
		}
		else {
			/* child */
			sleep(1);
		}
	}
	return 0;
}

---

In the actual program (running as a daemon), a child is reading data while its parent is processing the previously read data.
At any time there are only a few processes in existence, with older processes exiting and new processes being fork()ed. Killing the program frees the slab usage. I patched the kernel to 3.0.40, but the problem remains. I also compiled with slab debugging and can see that the growth of anon_vma and anon_vma_chain is due to anon_vma_clone/anon_vma_fork. Is this a known issue? Is it fixed in a later release? Thanks, -- Daniel K. Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932671Ab2HRADV (ORCPT ); Fri, 17 Aug 2012 20:03:21 -0400 Received: from evergreen.ssec.wisc.edu ([128.104.108.253]:49723 "EHLO evergreen.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932071Ab2HRADO (ORCPT ); Fri, 17 Aug 2012 20:03:14 -0400 Date: Fri, 17 Aug 2012 19:03:12 -0500 From: Daniel Forrest To: Rik van Riel Cc: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm Subject: Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20120818000312.GA4262@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: Rik van Riel , linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <502D42E5.7090403@redhat.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Aug 16, 2012 at 02:58:45PM -0400, Rik van Riel wrote: > Oh dear. > > Basically, what happens is that at fork time, a new > "level" is created for the anon_vma hierarchy. This > works great for normal forking daemons, since the > parent process just keeps running, and forking off > children. > > Look at anon_vma_fork() in mm/rmap.c for the details. > > Having each child become the new parent, and the > previous parent exit, can result in an "infinite" > stack of anon_vmas. > > Now, the parent anon_vma we cannot get rid of, > because that is where the anon_vma lock lives. > > However, in your case you have many more anon_vma > levels than you have processes!
> > I wonder if it may be possible to fix your bug > by adding a refcount to the struct anon_vma, > one count for each VMA that is directly attached > to the anon_vma (ie. vma->anon_vma == anon_vma), > and one for each page that points to the anon_vma. > > If the reference count on an anon_vma reaches 0, > we can skip that anon_vma in anon_vma_clone, and > the child process should not get that anon_vma. > > A scheme like that may be enough to avoid the trouble > you are running into. > > Does this sound realistic? Based on your comments, I came up with the following patch. It boots and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know if I've overlooked something. I'm not a kernel hacker. --- include/linux/rmap.h.ORIG 2011-08-05 04:59:21.000000000 +0000 +++ include/linux/rmap.h 2012-08-16 22:52:25.000000000 +0000 @@ -35,6 +35,7 @@ struct anon_vma { * anon_vma if they are the last user on release */ atomic_t refcount; + atomic_t pagecount; /* * NOTE: the LSB of the head.next is set by --- mm/rmap.c.ORIG 2011-08-05 04:59:21.000000000 +0000 +++ mm/rmap.c 2012-08-17 23:55:13.000000000 +0000 @@ -85,6 +85,7 @@ static inline struct anon_vma *anon_vma_ static inline void anon_vma_free(struct anon_vma *anon_vma) { VM_BUG_ON(atomic_read(&anon_vma->refcount)); + VM_BUG_ON(atomic_read(&anon_vma->pagecount)); /* * Synchronize against page_lock_anon_vma() such that @@ -176,6 +177,7 @@ int anon_vma_prepare(struct vm_area_stru spin_lock(&mm->page_table_lock); if (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; + atomic_inc(&anon_vma->pagecount); avc->anon_vma = anon_vma; avc->vma = vma; list_add(&avc->same_vma, &vma->anon_vma_chain); @@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct } anon_vma = pavc->anon_vma; root = lock_anon_vma_root(root, anon_vma); - anon_vma_chain_link(dst, avc, anon_vma); + if (!atomic_read(&anon_vma->pagecount)) + anon_vma_chain_free(avc); + else + anon_vma_chain_link(dst, avc, anon_vma); } unlock_anon_vma_root(root); 
return 0; @@ -314,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct get_anon_vma(anon_vma->root); /* Mark this anon_vma as the one where our new (COWed) pages go. */ vma->anon_vma = anon_vma; + atomic_set(&anon_vma->pagecount, 1); anon_vma_lock(anon_vma); anon_vma_chain_link(vma, avc, anon_vma); anon_vma_unlock(anon_vma); @@ -341,6 +347,8 @@ void unlink_anon_vmas(struct vm_area_str root = lock_anon_vma_root(root, anon_vma); list_del(&avc->same_anon_vma); + if (vma->anon_vma == anon_vma) + atomic_dec(&anon_vma->pagecount); /* * Leave empty anon_vmas on the list - we'll need @@ -375,6 +383,7 @@ static void anon_vma_ctor(void *data) mutex_init(&anon_vma->mutex); atomic_set(&anon_vma->refcount, 0); + atomic_set(&anon_vma->pagecount, 0); INIT_LIST_HEAD(&anon_vma->head); } @@ -996,6 +1005,7 @@ static void __page_set_anon_rmap(struct if (!exclusive) anon_vma = anon_vma->root; + atomic_inc(&anon_vma->pagecount); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; page->index = linear_page_index(vma, address); @@ -1142,6 +1152,11 @@ void page_remove_rmap(struct page *page) if (unlikely(PageHuge(page))) return; if (PageAnon(page)) { + struct anon_vma *anon_vma; + + anon_vma = page_anon_vma(page); + if (anon_vma) + atomic_dec(&anon_vma->pagecount); mem_cgroup_uncharge_page(page); if (!PageTransHuge(page)) __dec_zone_page_state(page, NR_ANON_PAGES); @@ -1747,6 +1762,7 @@ static void __hugepage_set_anon_rmap(str if (!exclusive) anon_vma = anon_vma->root; + atomic_inc(&anon_vma->pagecount); anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; page->mapping = (struct address_space *) anon_vma; page->index = linear_page_index(vma, address); -- Daniel K. 
Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756056Ab2HRDqa (ORCPT ); Fri, 17 Aug 2012 23:46:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:38257 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753661Ab2HRDqX (ORCPT ); Fri, 17 Aug 2012 23:46:23 -0400 Message-ID: <502F100A.1080401@redhat.com> Date: Fri, 17 Aug 2012 23:46:18 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120717 Thunderbird/14.0 MIME-Version: 1.0 To: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> In-Reply-To: <20120818000312.GA4262@evergreen.ssec.wisc.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/17/2012 08:03 PM, Daniel Forrest wrote: > Based on your comments, I came up with the following patch. It boots > and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > if I've overlooked something. I'm not a kernel hacker. The patch looks reasonable to me. There is one spot left for optimization, which I have pointed out below. Of course, that leaves the big question: do we want the overhead of having the atomic addition and decrement for every anonymous memory page, or is it easier to fix this issue in userspace? 
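As a rough illustration of what the user-space side of that question might look like: the sketch below assumes the daemon can be restructured so that one long-lived parent forks every worker, which is the pattern described earlier in the thread as working well. It is not something proposed in the thread, and the helper name run_workers_from_one_parent() is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork `rounds` short-lived workers, one at a time,
 * always from the same long-lived parent. Returns the number of workers
 * that exited with status 0, or -1 if fork() or waitpid() failed.
 */
static int run_workers_from_one_parent(int rounds)
{
	int ok = 0;
	int i;

	for (i = 0; i < rounds; i++) {
		int status;
		pid_t pid = fork();

		if (pid == -1)
			return -1;
		if (pid == 0) {
			/* child: the real work would go here */
			_exit(0);
		}
		/* parent: reap this worker before starting the next one */
		if (waitpid(pid, &status, 0) == -1)
			return -1;
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	}
	return ok;
}
```

The difference from the test program is only in who calls fork(): here no child ever becomes the next parent, so, going by the explanation earlier in the thread, each worker should sit one level below the same parent rather than adding a new level per generation.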
Given that malicious userspace could potentially run the system out of memory, without needing special privileges, and the OOM killer may not be able to reclaim it due to internal slab fragmentation, I guess this issue could be classified as a low impact denial of service vulnerability. Furthermore, there is already a fair amount of bookkeeping being done in the rmap code, so this patch is not likely to add a whole lot - some testing might be useful, though. > @@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct > } > anon_vma = pavc->anon_vma; > root = lock_anon_vma_root(root, anon_vma); > - anon_vma_chain_link(dst, avc, anon_vma); > + if (!atomic_read(&anon_vma->pagecount)) > + anon_vma_chain_free(avc); > + else > + anon_vma_chain_link(dst, avc, anon_vma); > } > unlock_anon_vma_root(root); > return 0; In this function, you can do the test before the code block where we try to allocate an anon_vma chain. In other words: list_for_each_entry_reverse(..... struct anon_vma *anon_vma; + if (!atomic_read(&anon_vma->pagecount)) + continue; + avc = anon_vma_chain_alloc(... if (unlikely(!avc)) { The rest looks good. 
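The reordering suggested above can be sketched in user space. This is a toy model under stated assumptions, not kernel code: the toy_* names are hypothetical, a plain singly linked list stands in for the anon_vma chain, and no locking is modeled. It also spells out one detail the abbreviated snippet leaves implicit, namely that the anon_vma being tested has to be read from the parent's chain entry before the test.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures, not the real definitions. */
struct toy_anon_vma {
	int pagecount;			/* models the counter added by the patch */
};

struct toy_chain {
	struct toy_anon_vma *anon_vma;
	struct toy_chain *next;
};

static int toy_allocs;			/* chain entries allocated so far */

static void toy_chain_free_all(struct toy_chain *head)
{
	while (head) {
		struct toy_chain *next = head->next;

		free(head);
		head = next;
	}
}

/*
 * Copy the parent's chain to *dst (which must start out NULL), testing
 * pagecount before allocating so that a skipped entry costs no allocation.
 * Returns the number of entries linked, or -1 if an allocation failed, in
 * which case everything on *dst has been freed.
 */
static int toy_clone(const struct toy_chain *src, struct toy_chain **dst)
{
	int linked = 0;

	for (; src; src = src->next) {
		struct toy_anon_vma *anon_vma = src->anon_vma;
		struct toy_chain *avc;

		if (!anon_vma->pagecount)
			continue;	/* skip before allocating anything */

		avc = malloc(sizeof(*avc));
		if (!avc) {
			toy_chain_free_all(*dst);
			*dst = NULL;
			return -1;
		}
		toy_allocs++;
		avc->anon_vma = anon_vma;
		avc->next = *dst;
		*dst = avc;
		linked++;
	}
	return linked;
}
```

With the test placed first, an entry whose count is zero never reaches the allocator, so there is nothing to free again on that path.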
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751618Ab2HREH5 (ORCPT ); Sat, 18 Aug 2012 00:07:57 -0400 Received: from evergreen.ssec.wisc.edu ([128.104.108.253]:47506 "EHLO evergreen.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751156Ab2HREHs (ORCPT ); Sat, 18 Aug 2012 00:07:48 -0400 Date: Fri, 17 Aug 2012 23:07:47 -0500 From: Daniel Forrest To: Rik van Riel Cc: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm Subject: Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20120818040747.GA22793@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: Rik van Riel , linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <502F100A.1080401@redhat.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Aug 17, 2012 at 11:46:18PM -0400, Rik van Riel wrote: > On 08/17/2012 08:03 PM, Daniel Forrest wrote: > > >Based on your comments, I came up with the following patch. It boots > >and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > >if I've overlooked something. I'm not a kernel hacker. > > The patch looks reasonable to me. There is one spot left > for optimization, which I have pointed out below. > > Of course, that leaves the big question: do we want the > overhead of having the atomic addition and decrement for > every anonymous memory page, or is it easier to fix this > issue in userspace? 
> > Given that malicious userspace could potentially run the > system out of memory, without needing special privileges, > and the OOM killer may not be able to reclaim it due to > internal slab fragmentation, I guess this issue could be > classified as a low impact denial of service vulnerability. > > Furthermore, there is already a fair amount of bookkeeping > being done in the rmap code, so this patch is not likely > to add a whole lot - some testing might be useful, though. > > >@@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct > > } > > anon_vma = pavc->anon_vma; > > root = lock_anon_vma_root(root, anon_vma); > >- anon_vma_chain_link(dst, avc, anon_vma); > >+ if (!atomic_read(&anon_vma->pagecount)) > >+ anon_vma_chain_free(avc); > >+ else > >+ anon_vma_chain_link(dst, avc, anon_vma); > > } > > unlock_anon_vma_root(root); > > return 0; > > In this function, you can do the test before the code block > where we try to allocate an anon_vma chain. > > In other words: > > list_for_each_entry_reverse(..... > struct anon_vma *anon_vma; > > + if (!atomic_read(&anon_vma->pagecount)) > + continue; > + > avc = anon_vma_chain_alloc(... > if (unlikely(!avc)) { > > The rest looks good. I was being careful since I wasn't certain about the locking. Does the test need to be protected by "lock_anon_vma_root"? That's why I chose the overhead of the possible wasted "anon_vma_chain_alloc". -- Daniel K. 
Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752030Ab2HREK1 (ORCPT ); Sat, 18 Aug 2012 00:10:27 -0400 Received: from mx1.redhat.com ([209.132.183.28]:43612 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751156Ab2HREKS (ORCPT ); Sat, 18 Aug 2012 00:10:18 -0400 Message-ID: <502F15A6.5060902@redhat.com> Date: Sat, 18 Aug 2012 00:10:14 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120717 Thunderbird/14.0 MIME-Version: 1.0 To: linux-kernel@vger.kernel.org, Hugh Dickins , linux-mm Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120818040747.GA22793@evergreen.ssec.wisc.edu> In-Reply-To: <20120818040747.GA22793@evergreen.ssec.wisc.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/18/2012 12:07 AM, Daniel Forrest wrote: > I was being careful since I wasn't certain about the locking. Does > the test need to be protected by "lock_anon_vma_root"? That's why I > chose the overhead of the possible wasted "anon_vma_chain_alloc". The function anon_vma_clone is being called from fork(). When running fork(), the kernel holds the mm->mmap_sem for write, which prevents page faults by the parent process. This means if the anon_vma in question belongs to the parent process, no new pages will be added to it in this time. Likewise, if the anon_vma belonged to a grandparent process, any new pages instantiated in it will not be visible to the parent process, or to the newly created process. 
This means it is safe to skip the anon_vma. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755800Ab2HTIA5 (ORCPT ); Mon, 20 Aug 2012 04:00:57 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:42863 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755620Ab2HTIAy (ORCPT ); Mon, 20 Aug 2012 04:00:54 -0400 Date: Mon, 20 Aug 2012 01:00:11 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Rik van Riel cc: Daniel Forrest , Andrea Arcangeli , Michel Lespinasse , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: Repeated fork() causes SLAB to grow without bound In-Reply-To: <502F100A.1080401@redhat.com> Message-ID: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> User-Agent: Alpine 2.00 (LSU 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 17 Aug 2012, Rik van Riel wrote: > On 08/17/2012 08:03 PM, Daniel Forrest wrote: > > > Based on your comments, I came up with the following patch. It boots > > and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know > > if I've overlooked something. I'm not a kernel hacker. > > The patch looks reasonable to me. There is one spot left > for optimization, which I have pointed out below. > > Of course, that leaves the big question: do we want the > overhead of having the atomic addition and decrement for > every anonymous memory page, or is it easier to fix this > issue in userspace? 
I've not given any thought to alternatives, and I've not done any performance analysis; but my instinct says that we really do not want another atomic increment and decrement (and another cache line redirtied) for every single page mapped. One of the things I've often admired about Andrea's anon_vma design was the way it did not need a refcount; and although we later added one for KSM and migration, that scarcely mattered, because it was for exceptional circumstances, and not per page. May I dare to think: what if we just backed out all the anon_vma_chain complexity, and returned to the simple anon_vma list we had in 2.6.33? Just how realistic was the workload which led you to anon_vma_chains? And isn't it correct to say that the performance evaluation was made while believing that each anon_vma->lock was useful, before the sad realization that anon_vma->root->lock (or ->mutex) had to be used? I've Cc'ed Michel, because I think he has plans (or at least hopes) for the anon_vmas, in his relentless pursuit of world domination by rbtree. Hugh > > Given that malicious userspace could potentially run the > system out of memory, without needing special privileges, > and the OOM killer may not be able to reclaim it due to > internal slab fragmentation, I guess this issue could be > classified as a low impact denial of service vulnerability. > > Furthermore, there is already a fair amount of bookkeeping > being done in the rmap code, so this patch is not likely > to add a whole lot - some testing might be useful, though. 
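For readers following along, the per-page counting being weighed in this message can be modelled in userspace roughly as below. This is only a sketch using C11 atomics: struct model_anon_vma and the model_* helpers are hypothetical names made up for illustration, not the kernel's data structures or API. What it shows is that every page mapped or unmapped would pay one atomic read-modify-write on a shared counter, which is the cost (and the redirtied cache line) being objected to.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for an anon_vma carrying a refcount. */
struct model_anon_vma {
	atomic_int refcount;	/* pages (and attached VMAs) pointing at us */
};

void model_anon_vma_init(struct model_anon_vma *av)
{
	atomic_init(&av->refcount, 0);
}

/* Would run once for every anonymous page that is mapped. */
void model_page_attach(struct model_anon_vma *av)
{
	atomic_fetch_add(&av->refcount, 1);
}

/*
 * Would run once for every anonymous page that is unmapped.
 * Returns true when the last reference has just been dropped.
 */
bool model_page_detach(struct model_anon_vma *av)
{
	return atomic_fetch_sub(&av->refcount, 1) == 1;
}

/* A clone step could skip any anon_vma that nothing references. */
bool model_should_clone(struct model_anon_vma *av)
{
	return atomic_load(&av->refcount) > 0;
}
```

In such a scheme the clone path would consult something like model_should_clone() and leave out an anon_vma whose count has dropped to zero, so the price of bounding the chain is paid on every page rather than only at fork time.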
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756050Ab2HTJj3 (ORCPT ); Mon, 20 Aug 2012 05:39:29 -0400 Received: from mail-iy0-f174.google.com ([209.85.210.174]:60447 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751444Ab2HTJj1 (ORCPT ); Mon, 20 Aug 2012 05:39:27 -0400 MIME-Version: 1.0 In-Reply-To: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> Date: Mon, 20 Aug 2012 02:39:26 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse To: Hugh Dickins Cc: Rik van Riel , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 X-System-Of-Record: true Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Aug 20, 2012 at 1:00 AM, Hugh Dickins wrote: > On Fri, 17 Aug 2012, Rik van Riel wrote: >> Of course, that leaves the big question: do we want the >> overhead of having the atomic addition and decrement for >> every anonymous memory page, or is it easier to fix this >> issue in userspace? > > I've not given any thought to alternatives, and I've not done any > performance analysis; but my instinct says that we really do not > want another atomic increment and decrement (and another cache > line redirtied) for every single page mapped. I am concerned about this as well. > May I dare to think: what if we just backed out all the anon_vma_chain > complexity, and returned to the simple anon_vma list we had in 2.6.33? > > Just how realistic was the workload which led you to anon_vma_chains? 
> And isn't it correct to say that the performance evaluation was made > while believing that each anon_vma->lock was useful, before the sad > realization that anon_vma->root->lock (or ->mutex) had to be used? Thanks for suggesting this - I certainly wish we could go that way. I suspect there will be a strong case against this, but I'd certainly like to hear it (and see if it can be addressed another way). Here we just don't have processes that fork a lot of children that don't immediately exec, so anon_vmas don't bring any value for us. > I've Cc'ed Michel, because I think he has plans (or at least hopes) for > the anon_vmas, in his relentless pursuit of world domination by rbtree. Unfortunately I don't have great ideas there. It would be easy to add a flag to track if an anon_vma has ever been referenced by a struct page, and not clone the anon_vma if the flag isn't set. But, this wouldn't help at all with the DOS potential here. If there are pages referencing the anon_vma, we could reassign these to the parent anon_vma, but finding all such pages would be expensive too. Instead of adding an atomic count for page references, we could limit the anon_vma stacking depth. In fork, we would only clone anon_vmas that have a low enough generation count. I think that's not great (adds a special case for the deep-fork-without-exec behavior), but still better than the atomic page reference counter. I would still prefer if we could just remove the anon_vma_chain stuff, though. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. 
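The "has this anon_vma ever been referenced by a page" flag mentioned above can be modelled in userspace to see why it does not close the DoS hole. The sketch below is a toy model, not kernel code: struct model_vma, model_fork_flagged() and model_chain_after() are hypothetical names, and each anon_vma is reduced to a single boolean. It assumes every process in a fork chain either always or never writes a page before forking again.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CHAIN 4096	/* arbitrary cap for the toy model */

/* Hypothetical model of one VMA and the anon_vmas it is linked to. */
struct model_vma {
	size_t len;			/* number of linked anon_vmas */
	bool has_page[MODEL_MAX_CHAIN];	/* per anon_vma: ever held a page? */
};

/*
 * Model of fork with the flag check: only anon_vmas that have ever been
 * referenced by a page are cloned, then the child gets its own anon_vma.
 * child_writes says whether the child will dirty a page of its own.
 */
void model_fork_flagged(struct model_vma *child,
			const struct model_vma *parent, bool child_writes)
{
	size_t i;

	child->len = 0;
	for (i = 0; i < parent->len; i++) {
		if (parent->has_page[i] && child->len < MODEL_MAX_CHAIN - 1)
			child->has_page[child->len++] = true;
	}
	child->has_page[child->len++] = child_writes;
}

/* Each child becomes the next parent; returns the final chain length. */
size_t model_chain_after(size_t forks, bool each_writes)
{
	struct model_vma cur = { .len = 1 };
	struct model_vma next = { .len = 0 };
	size_t i;

	cur.has_page[0] = each_writes;
	for (i = 0; i < forks; i++) {
		model_fork_flagged(&next, &cur, each_writes);
		cur = next;
	}
	return cur.len;
}
```

In this model a chain of processes that never write stays at length 1, but a program that touches a single page in each generation still grows the chain by one per fork, which matches the remark that the flag alone would not help against a deliberate attack.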
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753922Ab2HTLLF (ORCPT ); Mon, 20 Aug 2012 07:11:05 -0400 Received: from mga01.intel.com ([192.55.52.88]:41362 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752848Ab2HTLLD (ORCPT ); Mon, 20 Aug 2012 07:11:03 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.77,796,1336374000"; d="scan'208";a="206337044" From: Andi Kleen To: Michel Lespinasse Cc: Hugh Dickins , Rik van Riel , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> Date: Mon, 20 Aug 2012 04:11:00 -0700 In-Reply-To: (Michel Lespinasse's message of "Mon, 20 Aug 2012 02:39:26 -0700") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Michel Lespinasse writes: > > I would still prefer if we could just remove the anon_vma_chain stuff, though. Would probably help with the fork locking problems too. We never really recovered from that regression. 
-Andi -- ak@linux.intel.com -- Speaking for myself only From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754048Ab2HTLRt (ORCPT ); Mon, 20 Aug 2012 07:17:49 -0400 Received: from mx1.redhat.com ([209.132.183.28]:57138 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753412Ab2HTLRq (ORCPT ); Mon, 20 Aug 2012 07:17:46 -0400 Message-ID: <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 07:17:39 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120717 Thunderbird/14.0 MIME-Version: 1.0 To: Michel Lespinasse CC: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/20/2012 05:39 AM, Michel Lespinasse wrote: > I would still prefer if we could just remove the anon_vma_chain stuff, though. If only we could. That simply replaces a medium issue at fork time, with the potential for a catastrophic issue at page reclaim time, in any workload with heavily forking server software. Without the anon_vma_chains, we end up scanning every single one of the child processes (and the parent) for every COWed page, which can be a real issue when the VM runs into 1000 such pages, for 1000 child processes. Unfortunately, we have seen this happen... 
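To put rough numbers on that, here is a back-of-the-envelope model in C. It is only a sketch: model_rmap_visits() is a hypothetical helper, and it assumes the reverse-map cost is simply "one VMA visit per VMA on the page's anon_vma list", ignoring locking and page table walks.

```c
#include <stdint.h>

/*
 * Hypothetical cost model: looking up the mappings of one anonymous page
 * visits every VMA on the list of the anon_vma that the page points to.
 */
uint64_t model_rmap_visits(uint64_t cowed_pages, uint64_t vmas_on_list)
{
	return cowed_pages * vmas_on_list;
}
```

With a single shared anon_vma list, 1000 COWed pages and 1000 children plus the parent give model_rmap_visits(1000, 1001), about a million VMA visits. With a per-child anon_vma, and assuming the children have not forked themselves, each COWed page sits on a list of length 1, so the same 1000 pages cost model_rmap_visits(1000, 1), i.e. 1000 visits.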
-- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754391Ab2HTLxE (ORCPT ); Mon, 20 Aug 2012 07:53:04 -0400 Received: from mail-iy0-f174.google.com ([209.85.210.174]:54768 "EHLO mail-iy0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753975Ab2HTLxB (ORCPT ); Mon, 20 Aug 2012 07:53:01 -0400 MIME-Version: 1.0 In-Reply-To: <50321CD3.5050501@redhat.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 04:53:00 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse To: Rik van Riel Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 X-System-Of-Record: true Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Aug 20, 2012 at 4:17 AM, Rik van Riel wrote: > Without the anon_vma_chains, we end up scanning every single > one of the child processes (and the parent) for every COWed > page, which can be a real issue when the VM runs into 1000 > such pages, for 1000 child processes. > > Unfortunately, we have seen this happen... Well, it only happens if the vma is created in the parent, and the first anon write also happens in the parent. I suppose that's a legitimate thing to do in a forking server though - say, for an expensive initialization stage, or precomputing some table, or whatever. When fork happens after the first anon page has been created, the child VMA currently ends up being added to the parent's anon_vma - even if the child might never create new anon pages into that VMA. 
I wonder if it might help to add the child VMA onto the parent's anon_vma only at the first child COW event. That way it would at least be possible (with userspace changes) for any forking servers to separate the areas they want to write into from the parent (such as things that need expensive initialization), from the ones that they want to write into from the child, and have none of the anon_vma lists grow too large. This might still be impractical if one has too many such workloads to care about. I'm just not sure how prevalent the problem workloads are. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753624Ab2HTTLs (ORCPT ); Mon, 20 Aug 2012 15:11:48 -0400 Received: from mail-yx0-f174.google.com ([209.85.213.174]:42164 "EHLO mail-yx0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752034Ab2HTTLp (ORCPT ); Mon, 20 Aug 2012 15:11:45 -0400 MIME-Version: 1.0 In-Reply-To: References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <50321CD3.5050501@redhat.com> Date: Mon, 20 Aug 2012 12:11:43 -0700 Message-ID: Subject: Re: Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse To: Rik van Riel Cc: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset=ISO-8859-1 X-System-Of-Record: true Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Aug 20, 2012 at 4:53 AM, Michel Lespinasse wrote: > I wonder if it might help to add the child VMA onto the parent's > anon_vma only at the first child COW event. 
That way it would at least > be possible (with userspace changes) for any forking servers to > separate the areas they want to write into from the parent (such as > things that need expensive initialization), from the ones that they > want to write into from the child, and have none of the anon_vma lists > grow too large. Actually that wouldn't work. The parent's anon pages are visible from the child, so the child vma needs to be on the parent anon_vma list. Sorry for the noise :/ -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753349Ab2HVDVI (ORCPT ); Tue, 21 Aug 2012 23:21:08 -0400 Received: from mail-pb0-f46.google.com ([209.85.160.46]:53365 "EHLO mail-pb0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751719Ab2HVDVD (ORCPT ); Tue, 21 Aug 2012 23:21:03 -0400 Date: Tue, 21 Aug 2012 20:20:57 -0700 From: Michel Lespinasse To: Hugh Dickins Cc: Rik van Riel , Daniel Forrest , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20120822032057.GA30871@google.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > Instead of adding an atomic count for page references, we could limit > the anon_vma stacking depth. In fork, we would only clone anon_vmas > that have a low enough generation count. 
I think that's not great
> (adds a special case for the deep-fork-without-exec behavior), but
> still better than the atomic page reference counter.

Here is an attached patch to demonstrate the idea.

anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 1)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only on the first fork after
+	 * the anon_vma is created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
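The effect of the length check in the patch above can be seen in a small userspace model. This is a sketch under simplifying assumptions, not kernel code: struct model_vma and the model_* functions are hypothetical names, a VMA is reduced to the length of its same_vma chain, and a negative max_len stands for the unpatched behaviour. The fork pattern is the one from the original report, where each child becomes the next parent.

```c
/* Hypothetical userspace model of one VMA's anon_vma chain. */
struct model_vma {
	int chain_len;	/* entries on the same_vma list */
};

/* Models anon_vma_clone(): link dst to all of src's anon_vmas. */
int model_clone(struct model_vma *dst, const struct model_vma *src)
{
	dst->chain_len = src->chain_len;
	return dst->chain_len;
}

/*
 * Models the patched anon_vma_fork(). If the cloned chain is already
 * longer than max_len, the child does not get an anon_vma of its own.
 * A negative max_len disables the check (unpatched behaviour).
 */
void model_fork(struct model_vma *child, const struct model_vma *parent,
		int max_len)
{
	int length = model_clone(child, parent);

	if (max_len >= 0 && length > max_len)
		return;
	child->chain_len++;	/* add the child's own anon_vma */
}

/* Each child becomes the next parent; returns the final chain length. */
int model_fork_chain(int forks, int max_len)
{
	struct model_vma parent = { .chain_len = 1 };
	int i;

	for (i = 0; i < forks; i++) {
		struct model_vma child;

		model_fork(&child, &parent, max_len);
		parent = child;
	}
	return parent.chain_len;
}
```

In this model the unpatched case grows by one entry per generation (1000 forks give a chain of 1001), the check as posted (length > 1) caps the chain at 2, and a threshold of 5 caps it at 6, however many generations follow.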
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754266Ab2HVDaK (ORCPT ); Tue, 21 Aug 2012 23:30:10 -0400 Received: from mx1.redhat.com ([209.132.183.28]:16570 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752144Ab2HVDaI (ORCPT ); Tue, 21 Aug 2012 23:30:08 -0400 Message-ID: <50345232.4090002@redhat.com> Date: Tue, 21 Aug 2012 23:29:54 -0400 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120717 Thunderbird/14.0 MIME-Version: 1.0 To: Michel Lespinasse CC: Hugh Dickins , Daniel Forrest , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> In-Reply-To: <20120822032057.GA30871@google.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >> Instead of adding an atomic count for page references, we could limit >> the anon_vma stacking depth. In fork, we would only clone anon_vmas >> that have a low enough generation count. I think that's not great >> (adds a special case for the deep-fork-without-exec behavior), but >> still better than the atomic page reference counter. > > Here is an attached patch to demonstrate the idea. 
> > anon_vma_clone() is modified to return the length of the existing same_vma > anon vma chain, and we create a new anon_vma in the child only on the first > fork (this could be tweaked to allow up to a set number of forks, but > I think the first fork would cover all the common forking server cases). I suspect we need 2 or 3. Some forking servers first fork off one child, and have the original parent exit, in order to "background the server". That first child then becomes the parent to the real child processes that do the work. It is conceivable that we might need an extra level for processes that do something special with privilege dropping, namespace changing, etc... Even setting the threshold to 5 should be totally harmless, since the problem does not kick in until we have really long chains, like in Dan's bug report. -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756949Ab3FCUMz (ORCPT ); Mon, 3 Jun 2013 16:12:55 -0400 Received: from evergreen.ssec.wisc.edu ([128.104.108.253]:35097 "EHLO evergreen.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754201Ab3FCUMw (ORCPT ); Mon, 3 Jun 2013 16:12:52 -0400 X-Greylist: delayed 1367 seconds by postgrey-1.27 at vger.kernel.org; Mon, 03 Jun 2013 16:12:52 EDT Date: Mon, 3 Jun 2013 14:50:03 -0500 From: Daniel Forrest To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20130603195003.GA31275@evergreen.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> 
<20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <50345232.4090002@redhat.com> User-Agent: Mutt/1.4.2.2i Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: > On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > >On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > >>Instead of adding an atomic count for page references, we could limit > >>the anon_vma stacking depth. In fork, we would only clone anon_vmas > >>that have a low enough generation count. I think that's not great > >>(adds a special case for the deep-fork-without-exec behavior), but > >>still better than the atomic page reference counter. > > > >Here is an attached patch to demonstrate the idea. > > > >anon_vma_clone() is modified to return the length of the existing same_vma > >anon vma chain, and we create a new anon_vma in the child only on the first > >fork (this could be tweaked to allow up to a set number of forks, but > >I think the first fork would cover all the common forking server cases). > > I suspect we need 2 or 3. > > Some forking servers first fork off one child, and have > the original parent exit, in order to "background the server". > That first child then becomes the parent to the real child > processes that do the work. > > It is conceivable that we might need an extra level for > processes that do something special with privilege dropping, > namespace changing, etc... > > Even setting the threshold to 5 should be totally harmless, > since the problem does not kick in until we have really > long chains, like in Dan's bug report. 
I have been running with Michel's patch (with the threshold set to 5)
for quite a few months now and can confirm that it does indeed solve
my problem. I am not a kernel developer, so I would appreciate if one
of you could push this into the kernel tree.

NOTE: I have attached Michel's patch with "(length > 1)" modified to
"(length > 5)" and added a "Tested-by:".

---

On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
> Instead of adding an atomic count for page references, we could limit
> the anon_vma stacking depth. In fork, we would only clone anon_vmas
> that have a low enough generation count. I think that's not great
> (adds a special case for the deep-fork-without-exec behavior), but
> still better than the atomic page reference counter.

Here is an attached patch to demonstrate the idea.

anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse
Tested-by: Daniel Forrest
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only on the first fork after
+	 * the anon_vma is created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Daniel K. Forrest            Space Science and
dan.forrest@ssec.wisc.edu    Engineering Center
(608) 890 - 0558             University of Wisconsin, Madison

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752566Ab3FDKhc (ORCPT ); Tue, 4 Jun 2013 06:37:32 -0400
Received: from mx1.redhat.com ([209.132.183.28]:55660 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750738Ab3FDKha (ORCPT ); Tue, 4 Jun 2013 06:37:30 -0400
Message-ID: <51ADC365.4010307@redhat.com>
Date: Tue, 04 Jun 2013 06:37:25 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130402 Thunderbird/17.0.5
MIME-Version: 1.0
To: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu>
In-Reply-To: <20130603195003.GA31275@evergreen.ssec.wisc.edu>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 06/03/2013 03:50 PM, Daniel Forrest wrote: > On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: >> On 08/21/2012 11:20 PM, Michel Lespinasse wrote: >>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >>>> Instead of adding an atomic count for page references, we could limit >>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas >>>> that have a low enough generation count. I think that's not great >>>> (adds a special case for the deep-fork-without-exec behavior), but >>>> still better than the atomic page reference counter. >>> >>> Here is an attached patch to demonstrate the idea. >>> >>> anon_vma_clone() is modified to return the length of the existing same_vma >>> anon vma chain, and we create a new anon_vma in the child only on the first >>> fork (this could be tweaked to allow up to a set number of forks, but >>> I think the first fork would cover all the common forking server cases). >> >> I suspect we need 2 or 3. >> >> Some forking servers first fork off one child, and have >> the original parent exit, in order to "background the server". >> That first child then becomes the parent to the real child >> processes that do the work. >> >> It is conceivable that we might need an extra level for >> processes that do something special with privilege dropping, >> namespace changing, etc... >> >> Even setting the threshold to 5 should be totally harmless, >> since the problem does not kick in until we have really >> long chains, like in Dan's bug report. > > I have been running with Michel's patch (with the threshold set to 5) > for quite a few months now and can confirm that it does indeed solve > my problem. I am not a kernel developer, so I would appreciate if one > of you could push this into the kernel tree. 
> > NOTE: I have attached Michel's patch with "(length > 1)" modified to > "(length > 5)" and added a "Tested-by:". Thank you for testing this. I believe this code should go into the Linux kernel, since it closes up what could be a denial of service attack (albeit a local one) with the anonvma code. > On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: >> Instead of adding an atomic count for page references, we could limit >> the anon_vma stacking depth. In fork, we would only clone anon_vmas >> that have a low enough generation count. I think that's not great >> (adds a special case for the deep-fork-without-exec behavior), but >> still better than the atomic page reference counter. > > Here is an attached patch to demonstrate the idea. > > anon_vma_clone() is modified to return the length of the existing same_vma > anon vma chain, and we create a new anon_vma in the child only on the first > fork (this could be tweaked to allow up to a set number of forks, but > I think the first fork would cover all the common forking server cases). 
> > Signed-off-by: Michel Lespinasse > Tested-by: Daniel Forrest Reviewed-by: Rik van Riel -- All rights reversed From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755860Ab3FEODG (ORCPT ); Wed, 5 Jun 2013 10:03:06 -0400 Received: from mx1.redhat.com ([209.132.183.28]:21723 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755624Ab3FEODD (ORCPT ); Wed, 5 Jun 2013 10:03:03 -0400 Date: Wed, 5 Jun 2013 16:02:58 +0200 From: Andrea Arcangeli To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound Message-ID: <20130605140258.GL3463@redhat.com> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <51ADC365.4010307@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <51ADC365.4010307@redhat.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 04, 2013 at 06:37:25AM -0400, Rik van Riel wrote: > On 06/03/2013 03:50 PM, Daniel Forrest wrote: > > On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote: > >> On 08/21/2012 11:20 PM, Michel Lespinasse wrote: > >>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote: > >>>> Instead of adding an atomic count for page references, we could limit > >>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas > >>>> that have a low enough generation count. 
I think that's not great > >>>> (adds a special case for the deep-fork-without-exec behavior), but > >>>> still better than the atomic page reference counter. > >>> > >>> Here is an attached patch to demonstrate the idea. > >>> > >>> anon_vma_clone() is modified to return the length of the existing same_vma > >>> anon vma chain, and we create a new anon_vma in the child only on the first > >>> fork (this could be tweaked to allow up to a set number of forks, but > >>> I think the first fork would cover all the common forking server cases). > >> > >> I suspect we need 2 or 3. > >> > >> Some forking servers first fork off one child, and have > >> the original parent exit, in order to "background the server". > >> That first child then becomes the parent to the real child > >> processes that do the work. > >> > >> It is conceivable that we might need an extra level for > >> processes that do something special with privilege dropping, > >> namespace changing, etc... > >> > >> Even setting the threshold to 5 should be totally harmless, > >> since the problem does not kick in until we have really > >> long chains, like in Dan's bug report. > > > > I have been running with Michel's patch (with the threshold set to 5) > > for quite a few months now and can confirm that it does indeed solve > > my problem. I am not a kernel developer, so I would appreciate if one > > of you could push this into the kernel tree. > > > > NOTE: I have attached Michel's patch with "(length > 1)" modified to > > "(length > 5)" and added a "Tested-by:". > > Thank you for testing this. > > I believe this code should go into the Linux kernel, > since it closes up what could be a denial of service > attack (albeit a local one) with the anonvma code. Agreed. The only thing I don't like about this patch is the hardcoding of number 5: could we make it a variable to tweak with sysfs/sysctl so if some weird workload arises we have a tuning tweak? 
It'd cost one cacheline during fork, so it doesn't look excessive overhead. Thanks, Andrea From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965655AbaKNQh0 (ORCPT ); Fri, 14 Nov 2014 11:37:26 -0500 Received: from cosmos.ssec.wisc.edu ([128.104.109.114]:48104 "EHLO cosmos.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965050AbaKNQhZ (ORCPT ); Fri, 14 Nov 2014 11:37:25 -0500 X-Greylist: delayed 390 seconds by postgrey-1.27 at vger.kernel.org; Fri, 14 Nov 2014 11:37:24 EST Date: Fri, 14 Nov 2014 10:30:53 -0600 From: Daniel Forrest To: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org Cc: Tim Hartrick , Michal Hocko Subject: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141114163053.GA6547@cosmos.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Andrew Morton , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130603195003.GA31275@evergreen.ssec.wisc.edu> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org There have been a couple of inquiries about the status of this patch over the last few months, so I am going to try pushing it out. Andrea Arcangeli has commented: > Agreed. 
The only thing I don't like about this patch is the hardcoding > of number 5: could we make it a variable to tweak with sysfs/sysctl so > if some weird workload arises we have a tuning tweak? It'd cost one > cacheline during fork, so it doesn't look excessive overhead. Adding this is beyond my experience level, so if it is required then someone else will have to make it so. Rik van Riel has commented: > I believe we should just merge that patch. > > I have not seen any better ideas come by. > > The comment should probably be fixed to reflect the > chain length of 5 though :) So here is Michel's patch again with "(length > 1)" modified to "(length > 5)" and fixed comments. I have been running with this patch (with the threshold set to 5) for over two years now and it does indeed solve the problem. --- anon_vma_clone() is modified to return the length of the existing same_vma anon vma chain, and we create a new anon_vma in the child only for the first five forks after the anon_vma was created, as we don't want the same_vma chain to grow arbitrarily large. Signed-off-by: Michel Lespinasse Tested-by: Daniel Forrest --- mm/mmap.c | 6 +++--- mm/rmap.c | 18 ++++++++++++++---- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 3edfcdfa42d9..e14b19a838cb 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -539,7 +539,7 @@ again: remove_next = 1 + (end > next->vm_end); * shrinking vma had, to cover any anon pages imported.
*/ if (exporter && exporter->anon_vma && !importer->anon_vma) { - if (anon_vma_clone(importer, exporter)) + if (anon_vma_clone(importer, exporter) < 0) return -ENOMEM; importer->anon_vma = exporter->anon_vma; } @@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma, } vma_set_policy(new, pol); - if (anon_vma_clone(new, vma)) + if (anon_vma_clone(new, vma) < 0) goto out_free_mpol; if (new->vm_file) { @@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, if (IS_ERR(pol)) goto out_free_vma; INIT_LIST_HEAD(&new_vma->anon_vma_chain); - if (anon_vma_clone(new_vma, vma)) + if (anon_vma_clone(new_vma, vma) < 0) goto out_free_mempol; vma_set_policy(new_vma, pol); new_vma->vm_start = addr; diff --git a/mm/rmap.c b/mm/rmap.c index 0f3b7cda2a24..ba8a726aaee6 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root) /* * Attach the anon_vmas from src to dst. - * Returns 0 on success, -ENOMEM on failure. + * Returns length of the anon_vma chain on success, -ENOMEM on failure. */ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) { struct anon_vma_chain *avc, *pavc; struct anon_vma *root = NULL; + int length = 0; list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) { struct anon_vma *anon_vma; @@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src) anon_vma = pavc->anon_vma; root = lock_anon_vma_root(root, anon_vma); anon_vma_chain_link(dst, avc, anon_vma); + length++; } unlock_anon_vma_root(root); - return 0; + return length; enomem_failure: unlink_anon_vmas(dst); @@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) { struct anon_vma_chain *avc; struct anon_vma *anon_vma; + int length; /* Don't bother if the parent process has no anon_vma here. 
*/ if (!pvma->anon_vma) @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) * First, attach the new VMA to the parent VMA's anon_vmas, * so rmap can find non-COWed pages in child processes. */ - if (anon_vma_clone(vma, pvma)) + length = anon_vma_clone(vma, pvma); + if (length < 0) return -ENOMEM; + else if (length > 5) + return 0; - /* Then add our own anon_vma. */ + /* + * Then add our own anon_vma. We do this only for five forks after + * the anon_vma was created, as we don't want the same_vma chain to + * grow arbitrarily large. + */ anon_vma = anon_vma_alloc(); if (!anon_vma) goto out_error; -- Daniel K. Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753104AbaKRACP (ORCPT ); Mon, 17 Nov 2014 19:02:15 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:52300 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751982AbaKRACO (ORCPT ); Mon, 17 Nov 2014 19:02:14 -0500 Date: Mon, 17 Nov 2014 16:02:12 -0800 From: Andrew Morton To: Daniel Forrest Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-Id: <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> In-Reply-To: <20141114163053.GA6547@cosmos.ssec.wisc.edu> References: <20120816024610.GA5350@evergreen.ssec.wisc.edu> <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> X-Mailer: Sylpheed 3.4.0beta7 (GTK+ 2.24.23; 
x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest wrote: > There have been a couple of inquiries about the status of this patch > over the last few months, so I am going to try pushing it out. > > Andrea Arcangeli has commented: > > > Agreed. The only thing I don't like about this patch is the hardcoding > > of number 5: could we make it a variable to tweak with sysfs/sysctl so > > if some weird workload arises we have a tuning tweak? It'd cost one > > cacheline during fork, so it doesn't look excessive overhead. > > Adding this is beyond my experience level, so if it is required then > someone else will have to make it so. > > Rik van Riel has commented: > > > I believe we should just merge that patch. > > > > I have not seen any better ideas come by. > > > > The comment should probably be fixed to reflect the > > chain length of 5 though :) > > So here is Michel's patch again with "(length > 1)" modified to > "(length > 5)" and fixed comments. > > I have been running with this patch (with the threshold set to 5) for > over two years now and it does indeed solve the problem. > > --- > > anon_vma_clone() is modified to return the length of the existing > same_vma anon vma chain, and we create a new anon_vma in the child > if it is more than five forks after the anon_vma was created, as we > don't want the same_vma chain to grow arbitrarily large. hoo boy, what's going on here. - Under what circumstances are we seeing this slab windup? - What are the consequences? Can it OOM the machine? - Why is this occurring? There aren't an infinite number of vmas, so there shouldn't be an infinite number of anon_vmas or anon_vma_chains. - IOW, what has to be done to fix this properly? - What are the runtime consequences of limiting the length of the chain? > ... 
> > @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > * First, attach the new VMA to the parent VMA's anon_vmas, > * so rmap can find non-COWed pages in child processes. > */ > - if (anon_vma_clone(vma, pvma)) > + length = anon_vma_clone(vma, pvma); > + if (length < 0) > return -ENOMEM; This should propagate the anon_vma_clone() return val instead of assuming ENOMEM. But that won't fix anything... > + else if (length > 5) > + return 0; > > - /* Then add our own anon_vma. */ > + /* > + * Then add our own anon_vma. We do this only for five forks after > + * the anon_vma was created, as we don't want the same_vma chain to > + * grow arbitrarily large. > + */ > anon_vma = anon_vma_alloc(); From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753481AbaKRBli (ORCPT ); Mon, 17 Nov 2014 20:41:38 -0500 Received: from cosmos.ssec.wisc.edu ([128.104.109.114]:43531 "EHLO cosmos.ssec.wisc.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752177AbaKRBlh (ORCPT ); Mon, 17 Nov 2014 20:41:37 -0500 Date: Mon, 17 Nov 2014 19:41:35 -0600 From: Daniel Forrest To: Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141118014135.GA17252@cosmos.ssec.wisc.edu> Reply-To: Daniel Forrest Mail-Followup-To: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> 
<20141117160212.b86d031e1870601240b0131d@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote: > On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest wrote: > > > There have been a couple of inquiries about the status of this patch > > over the last few months, so I am going to try pushing it out. > > > > Andrea Arcangeli has commented: > > > > > Agreed. The only thing I don't like about this patch is the hardcoding > > > of number 5: could we make it a variable to tweak with sysfs/sysctl so > > > if some weird workload arises we have a tuning tweak? It'd cost one > > > cacheline during fork, so it doesn't look excessive overhead. > > > > Adding this is beyond my experience level, so if it is required then > > someone else will have to make it so. > > > > Rik van Riel has commented: > > > > > I believe we should just merge that patch. > > > > > > I have not seen any better ideas come by. > > > > > > The comment should probably be fixed to reflect the > > > chain length of 5 though :) > > > > So here is Michel's patch again with "(length > 1)" modified to > > "(length > 5)" and fixed comments. > > > > I have been running with this patch (with the threshold set to 5) for > > over two years now and it does indeed solve the problem. > > > > --- > > > > anon_vma_clone() is modified to return the length of the existing > > same_vma anon vma chain, and we create a new anon_vma in the child > > if it is more than five forks after the anon_vma was created, as we > > don't want the same_vma chain to grow arbitrarily large. > > hoo boy, what's going on here. > > - Under what circumstances are we seeing this slab windup? 
The original bug report is here: https://lkml.org/lkml/2012/8/15/765 > - What are the consequences? Can it OOM the machine? Yes, eventually you run out of SLAB space. > - Why is this occurring? There aren't an infinite number of vmas, so > there shouldn't be an infinite number of anon_vmas or > anon_vma_chains. Because of the serial forking there does indeed end up being an infinite number of vmas. The initial vma can never be deleted (even though the initial parent process has long since terminated) because the initial vma is referenced by the children. > - IOW, what has to be done to fix this properly? As far as I know, this is the best solution. I tried a refcounting solution based on comments by Rik van Riel: https://lkml.org/lkml/2012/8/17/536 But it didn't fully work, probably because I didn't quite get the locking done properly. In any case, at this point questions came up about the overhead of the page refcounting and Michel Lespinasse suggested the initial version of this patch: https://lkml.org/lkml/2012/8/21/730 > - What are the runtime consequences of limiting the length of the chain? I can't say, but it only affects users who fork more than five levels deep without doing an exec. On the other hand, there are at least three users (Tim Hartrick, Michal Hocko, and myself) who have real world applications where the consequence of no patch is a crashed system. I would suggest reading the thread starting with my initial bug report for what others have had to say about this. > > ... > > > > @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma) > > * First, attach the new VMA to the parent VMA's anon_vmas, > > * so rmap can find non-COWed pages in child processes. > > */ > > - if (anon_vma_clone(vma, pvma)) > > + length = anon_vma_clone(vma, pvma); > > + if (length < 0) > > return -ENOMEM; > > This should propagate the anon_vma_clone() return val instead of > assuming ENOMEM. But that won't fix anything... 
Agreed, but the only failure return value of anon_vma_clone is -ENOMEM. Scanning the code in __split_vma (mm/mmap.c) it looks like the error return is lost (between Linux 3.11 and 3.12 the err variable is now used before the call to anon_vma_clone and the default initial value of -ENOMEM is overwritten). This is an actual bug in the current code. I can update the patch to fix these issues. > > + else if (length > 5) > > + return 0; > > > > - /* Then add our own anon_vma. */ > > + /* > > + * Then add our own anon_vma. We do this only for five forks after > > + * the anon_vma was created, as we don't want the same_vma chain to > > + * grow arbitrarily large. > > + */ > > anon_vma = anon_vma_alloc(); -- Daniel K. Forrest Space Science and dan.forrest@ssec.wisc.edu Engineering Center (608) 890 - 0558 University of Wisconsin, Madison From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753370AbaKRCmX (ORCPT ); Mon, 17 Nov 2014 21:42:23 -0500 Received: from mx1.redhat.com ([209.132.183.28]:59665 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751982AbaKRCmW (ORCPT ); Mon, 17 Nov 2014 21:42:22 -0500 Message-ID: <546AB1F5.6030306@redhat.com> Date: Mon, 17 Nov 2014 21:41:57 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 To: Andrew Morton , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> 
<20141118014135.GA17252@cosmos.ssec.wisc.edu> In-Reply-To: <20141118014135.GA17252@cosmos.ssec.wisc.edu> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/17/2014 08:41 PM, Daniel Forrest wrote: > On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote: >> On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest >> wrote: >> >>> There have been a couple of inquiries about the status of this >>> patch over the last few months, so I am going to try pushing it >>> out. >>> >>> Andrea Arcangeli has commented: >>> >>>> Agreed. The only thing I don't like about this patch is the >>>> hardcoding of number 5: could we make it a variable to tweak >>>> with sysfs/sysctl so if some weird workload arises we have a >>>> tuning tweak? It'd cost one cacheline during fork, so it >>>> doesn't look excessive overhead. >>> >>> Adding this is beyond my experience level, so if it is required >>> then someone else will have to make it so. >>> >>> Rik van Riel has commented: >>> >>>> I believe we should just merge that patch. >>>> >>>> I have not seen any better ideas come by. >>>> >>>> The comment should probably be fixed to reflect the chain >>>> length of 5 though :) >>> >>> So here is Michel's patch again with "(length > 1)" modified >>> to "(length > 5)" and fixed comments. >>> >>> I have been running with this patch (with the threshold set to >>> 5) for over two years now and it does indeed solve the >>> problem. >>> >>> --- >>> >>> anon_vma_clone() is modified to return the length of the >>> existing same_vma anon vma chain, and we create a new anon_vma >>> in the child if it is more than five forks after the anon_vma >>> was created, as we don't want the same_vma chain to grow >>> arbitrarily large. >> >> hoo boy, what's going on here. >> >> - Under what circumstances are we seeing this slab windup? 
> > The original bug report is here: > > https://lkml.org/lkml/2012/8/15/765 > >> - What are the consequences? Can it OOM the machine? > > Yes, eventually you run out of SLAB space. > >> - Why is this occurring? There aren't an infinite number of >> vmas, so there shouldn't be an infinite number of anon_vmas or >> anon_vma_chains. > > Because of the serial forking there does indeed end up being an > infinite number of vmas. The initial vma can never be deleted > (even though the initial parent process has long since terminated) > because the initial vma is referenced by the children. There is a finite number of VMAs, but an infinite number of anon_vmas. Subtle, yet deadly... >> - IOW, what has to be done to fix this properly? > > As far as I know, this is the best solution. I tried a > refcounting solution based on comments by Rik van Riel: > > https://lkml.org/lkml/2012/8/17/536 > > But it didn't fully work, probably because I didn't quite get the > locking done properly. In any case, at this point questions came > up about the overhead of the page refcounting and Michel > Lespinasse suggested the initial version of this patch: > > https://lkml.org/lkml/2012/8/21/730 > >> - What are the runtime consequences of limiting the length of the >> chain? > > I can't say, but it only affects users who fork more than five > levels deep without doing an exec. On the other hand, there are at > least three users (Tim Hartrick, Michal Hocko, and myself) who have > real world applications where the consequence of no patch is a > crashed system. > > I would suggest reading the thread starting with my initial bug > report for what others have had to say about this. I suspect what Andrew is hinting at is that the changelog for the patch should contain a detailed description of exactly what the bug is, how it is triggered, what the symptoms are, and how the patch avoids it.
That way people can understand what the code does simply by looking at the changelog - no need to go find old linux-kernel mailing list threads. >>> ... >>> >>> @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct >>> *vma, struct vm_area_struct *pvma) * First, attach the new VMA >>> to the parent VMA's anon_vmas, * so rmap can find non-COWed >>> pages in child processes. */ - if (anon_vma_clone(vma, pvma)) + >>> length = anon_vma_clone(vma, pvma); + if (length < 0) return >>> -ENOMEM; >> >> This should propagate the anon_vma_clone() return val instead of >> assuming ENOMEM. But that won't fix anything... > > Agreed, but the only failure return value of anon_vma_clone is > -ENOMEM. > > Scanning the code in __split_vma (mm/mmap.c) it looks like the > error return is lost (between Linux 3.11 and 3.12 the err variable > is now used before the call to anon_vma_clone and the default > initial value of -ENOMEM is overwritten). This is an actual bug in > the current code. > > I can update the patch to fix these issues. > >>> + else if (length > 5) + return 0; >>> >>> - /* Then add our own anon_vma. */ + /* + * Then add our own >>> anon_vma. We do this only for five forks after + * the >>> anon_vma was created, as we don't want the same_vma chain to + >>> * grow arbitrarily large. 
+ */ anon_vma = anon_vma_alloc(); > - -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUarH1AAoJEM553pKExN6DXwUH/RHNwGTYhzzwIQtbtMqnHYjE YWriqPLIOW8yWh85hkrmTsjWIegbDnEsbgNRX0Y8ANrKgx+vWRRW/eJ/s+Z+m7UY lD1DKO3vIfUSQvL4QHnViTEgEHfdychnhe0SE/kMeQbnLpUw8ywviJxX0UibeLdK L/F8xMzpUj/PBkNTtPxQRevWwUEMMMY6RS8RjHNBADe9ym/Fjd0dzAkoPCYCUapT barWfI9RMC3gYfyObFNBNYyaYyyK1FlAyBq52d/W8xCBW/5EIhEtFBGben/lAuEP alJt+jnFq4B1tXQtJIu1YBhY4OhuqWQy5lbz7NFPxg8+cECVPd3Vq6O2Bxilz9U= =GLaM -----END PGP SIGNATURE----- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754971AbaKRUTv (ORCPT ); Tue, 18 Nov 2014 15:19:51 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:35486 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754237AbaKRUTi (ORCPT ); Tue, 18 Nov 2014 15:19:38 -0500 Date: Tue, 18 Nov 2014 12:19:36 -0800 From: Andrew Morton To: Rik van Riel Cc: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-Id: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: <546AB1F5.6030306@redhat.com> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> X-Mailer: Sylpheed 3.4.0beta7 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: 
linux-kernel@vger.kernel.org On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: > > Because of the serial forking there does indeed end up being an > > infinite number of vmas. The initial vma can never be deleted > > (even though the initial parent process has long since terminated) > > because the initial vma is referenced by the children. > > There is a finite number of VMAs, but an infite number of > anon_vmas. > > Subtle, yet deadly... Well, we clearly have the data structures screwed up. I've forgotten enough about this code for me to be unable to work out what the fixed up data structures would look like :( But surely there is some proper solution here. Help? > > I can't say, but it only affects users who fork more than five > > levels deep without doing an exec. On the other hand, there are at > > least three users (Tim Hartrick, Michal Hocko, and myself) who have > > real world applications where the consequence of no patch is a > > crashed system. > > > > I would suggest reading the thread starting with my initial bug > > report for what others have had to say about this. > > I suspect what Andrew is hinting at is that the > changelog for the patch should contain a detailed > description of exactly what the bug is, how it is > triggered, what the symptoms are, and how the > patch avoids it. > > That way people can understand what the code does > simply by looking at the changelog - no need to go > find old linux-kernel mailing list threads. Yes please, there's a ton of stuff here which we should attempt to capture. https://lkml.org/lkml/2012/8/15/765 is useful. I'm assuming that with the "foo < 5" hack, an application which forked 5 times then did a lot of work would still trigger the "catastrophic issue at page reclaim time" issue which Rik identified at https://lkml.org/lkml/2012/8/20/265? There are real-world workloads which are triggering this slab growth problem, yes? (Detail them in the changelog, please). 
This bug snuck under my radar last time - we're permitting unprivileged userspace to exhaust memory and that's bad. I'm OK with the foo<5 thing for -stable kernels, as it is simple. But I'm reluctant to merge (or at least to retain) it in mainline because then everyone will run away and think about other stuff and this bug will never get fixed properly. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932214AbaKRWPt (ORCPT ); Tue, 18 Nov 2014 17:15:49 -0500 Received: from mail-wg0-f41.google.com ([74.125.82.41]:33858 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753794AbaKRWPs (ORCPT ); Tue, 18 Nov 2014 17:15:48 -0500 MIME-Version: 1.0 In-Reply-To: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> Date: Wed, 19 Nov 2014 02:15:46 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov To: Andrew Morton Cc: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton wrote: > On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: > >> > Because of the serial forking there does indeed end up being an >> > infinite number of vmas. 
The initial vma can never be deleted >> > (even though the initial parent process has long since terminated) >> > because the initial vma is referenced by the children. >> >> There is a finite number of VMAs, but an infite number of >> anon_vmas. >> >> Subtle, yet deadly... > > Well, we clearly have the data structures screwed up. I've forgotten > enough about this code for me to be unable to work out what the fixed > up data structures would look like :( But surely there is some proper > solution here. Help? Not sure if it's right but probably we could reuse on fork an old anon_vma from the chain if it's already lost all vmas which points to it. For endlessly forking exploit this should work mostly like proposed patch which stops branching after some depth but without magic constant. > >> > I can't say, but it only affects users who fork more than five >> > levels deep without doing an exec. On the other hand, there are at >> > least three users (Tim Hartrick, Michal Hocko, and myself) who have >> > real world applications where the consequence of no patch is a >> > crashed system. >> > >> > I would suggest reading the thread starting with my initial bug >> > report for what others have had to say about this. >> >> I suspect what Andrew is hinting at is that the >> changelog for the patch should contain a detailed >> description of exactly what the bug is, how it is >> triggered, what the symptoms are, and how the >> patch avoids it. >> >> That way people can understand what the code does >> simply by looking at the changelog - no need to go >> find old linux-kernel mailing list threads. > > Yes please, there's a ton of stuff here which we should attempt to > capture. > > https://lkml.org/lkml/2012/8/15/765 is useful. > > I'm assuming that with the "foo < 5" hack, an application which forked > 5 times then did a lot of work would still trigger the "catastrophic > issue at page reclaim time" issue which Rik identified at > https://lkml.org/lkml/2012/8/20/265? 
> > There are real-world workloads which are triggering this slab growth > problem, yes? (Detail them in the changelog, please). > > This bug snuck under my radar last time - we're permitting unprivileged > userspace to exhaust memory and that's bad. I'm OK with the foo<5 > thing for -stable kernels, as it is simple. But I'm reluctant to merge > (or at least to retain) it in mainline because then everyone will run > away and think about other stuff and this bug will never get fixed > properly. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932934AbaKRXuG (ORCPT ); Tue, 18 Nov 2014 18:50:06 -0500 Received: from cantor2.suse.de ([195.135.220.15]:43592 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932181AbaKRXuF (ORCPT ); Tue, 18 Nov 2014 18:50:05 -0500 Message-ID: <546BDB29.9050403@suse.cz> Date: Wed, 19 Nov 2014 00:50:01 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Konstantin Khlebnikov , Andrew Morton CC: Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> 
<20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >> wrote: >>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>> >>>> > Because of the serial forking there does indeed end up being an >>>> > infinite number of vmas. The initial vma can never be deleted >>>> > (even though the initial parent process has long since terminated) >>>> > because the initial vma is referenced by the children. >>>> >>>> There is a finite number of VMAs, but an infite number of >>>> anon_vmas. >>>> >>>> Subtle, yet deadly... >>> >>> Well, we clearly have the data structures screwed up. I've forgotten >>> enough about this code for me to be unable to work out what the fixed >>> up data structures would look like :( But surely there is some proper >>> solution here. Help? >> >> Not sure if it's right but probably we could reuse on fork an old anon_vma >> from the chain if it's already lost all vmas which points to it. >> For endlessly forking exploit this should work mostly like proposed patch >> which stops branching after some depth but without magic constant. > > Something like this. I leave proper comment for tomorrow. Hmm I'm not sure that will work as it is. If I understand it correctly, your patch can detect if the parent's anon_vma has no own references at the fork() time. But at the fork time, the parent is still alive, it only exits after the fork, right? So I guess it still has own references and the child will still allocate its new anon_vma, and the problem is not solved. 
So maybe we could detect that the own references dropped to zero when the parent does exit, and then change mapping of all relevant pages to the root anon_vma, destroy avc's of children and the anon_vma itself. But that sounds quite heavyweight :/ Vlastimil >> >>> From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754238AbaKSCs0 (ORCPT ); Tue, 18 Nov 2014 21:48:26 -0500 Received: from mx1.redhat.com ([209.132.183.28]:39853 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753731AbaKSCsZ (ORCPT ); Tue, 18 Nov 2014 21:48:25 -0500 Message-ID: <546C04E0.4090209@redhat.com> Date: Tue, 18 Nov 2014 21:48:00 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 To: Andrew Morton CC: Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> In-Reply-To: <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/18/2014 03:19 PM, Andrew Morton wrote: > On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel > wrote: > >> That way people can understand what the code 
does simply by >> looking at the changelog - no need to go find old linux-kernel >> mailing list threads. > > Yes please, there's a ton of stuff here which we should attempt to > capture. > > https://lkml.org/lkml/2012/8/15/765 is useful. > > I'm assuming that with the "foo < 5" hack, an application which > forked 5 times then did a lot of work would still trigger the > "catastrophic issue at page reclaim time" issue which Rik > identified at https://lkml.org/lkml/2012/8/20/265? It's not "forking 5 times", it is "forking >>5 generations deep". There are a few programs that do that, but it does not appear that they are forking servers like apache or sendmail (which fork from the 2nd generation, and then sometimes again to exec a helper from the 4th generation). > There are real-world workloads which are triggering this slab > growth problem, yes? (Detail them in the changelog, please). There are, but the overlap between "forks >>5 generations deep" and "forks a bajillion child processes" appears to be zero. 
- -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUbATgAAoJEM553pKExN6Ds84H/ixCr4Q5C09sDISuw9y/PsVI moXPbqgefpzbS316MgD1AMl7rj2OWAMiQcRGQ6yMelXOyuB89XTiBi19t5UxaSUn tuFnxeknoIL0155yTfszETRGjN9mUKoyk9HAhND1T+x2VFLwaQYyk7CdZC/h7IQ7 m1jfwlR30r0Ie6x5lkN1XaculdWdXjr7wTwUWeOVsc6lWv3kR3dC52LKsB4fv340 gBeL5sTDNNp6r5Gfr5QL7fQR0eLVvhStSmsm4GbggpVSBSCpZ++h8eTjdtHxuJO3 jtgEGAvhnLDSqRi6NG6dKoxtXW8++hnFIKBw1Ec36NTuTkbKiHo9EQujINtXWro= =/EU5 -----END PGP SIGNATURE----- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755487AbaKSOgs (ORCPT ); Wed, 19 Nov 2014 09:36:48 -0500 Received: from mail-wi0-f181.google.com ([209.85.212.181]:61707 "EHLO mail-wi0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754167AbaKSOgq (ORCPT ); Wed, 19 Nov 2014 09:36:46 -0500 MIME-Version: 1.0 In-Reply-To: <546BDB29.9050403@suse.cz> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> Date: Wed, 19 Nov 2014 18:36:45 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov To: Vlastimil Babka Cc: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: > On 
11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>> wrote: >>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>>> >>>>> > Because of the serial forking there does indeed end up being an >>>>> > infinite number of vmas. The initial vma can never be deleted >>>>> > (even though the initial parent process has long since terminated) >>>>> > because the initial vma is referenced by the children. >>>>> >>>>> There is a finite number of VMAs, but an infite number of >>>>> anon_vmas. >>>>> >>>>> Subtle, yet deadly... >>>> >>>> Well, we clearly have the data structures screwed up. I've forgotten >>>> enough about this code for me to be unable to work out what the fixed >>>> up data structures would look like :( But surely there is some proper >>>> solution here. Help? >>> >>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>> from the chain if it's already lost all vmas which points to it. >>> For endlessly forking exploit this should work mostly like proposed patch >>> which stops branching after some depth but without magic constant. >> >> Something like this. I leave proper comment for tomorrow. > > Hmm I'm not sure that will work as it is. If I understand it correctly, your > patch can detect if the parent's anon_vma has no own references at the fork() > time. But at the fork time, the parent is still alive, it only exits after the > fork, right? So I guess it still has own references and the child will still > allocate its new anon_vma, and the problem is not solved. But it could reuse anon_vma from grandparent or older. Count of anon_vmas in chain will be limited with count of alive processes. I think it's better to describe this in terms of sets of anon_vma instead hierarchy: at clone vma inherits pages from parent together with set of anon_vma which they belong. 
For new pages it might allocate new anon_vma or reuse existing. After my patch vma will try to reuse anon_vma from that set which has no vmas which points to it. As a result there will be no parent-child relation between anon_vma and multiple pages might have equal (anon_vma, index) pair but I see no problems here. > > So maybe we could detect that the own references dropped to zero when the parent > does exit, and then change mapping of all relevant pages to the root anon_vma, > destroy avc's of children and the anon_vma itself. But that sounds quite > heavyweight :/ > > Vlastimil > >>> >>>> > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756160AbaKSQJx (ORCPT ); Wed, 19 Nov 2014 11:09:53 -0500 Received: from cantor2.suse.de ([195.135.220.15]:60750 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754153AbaKSQJw (ORCPT ); Wed, 19 Nov 2014 11:09:52 -0500 Message-ID: <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 17:09:49 +0100 From: Vlastimil Babka User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Konstantin Khlebnikov CC: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> In-Reply-To: Content-Type: text/plain; 
charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: >> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>>> wrote: >>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>>>> >>>>>> > Because of the serial forking there does indeed end up being an >>>>>> > infinite number of vmas. The initial vma can never be deleted >>>>>> > (even though the initial parent process has long since terminated) >>>>>> > because the initial vma is referenced by the children. >>>>>> >>>>>> There is a finite number of VMAs, but an infite number of >>>>>> anon_vmas. >>>>>> >>>>>> Subtle, yet deadly... >>>>> >>>>> Well, we clearly have the data structures screwed up. I've forgotten >>>>> enough about this code for me to be unable to work out what the fixed >>>>> up data structures would look like :( But surely there is some proper >>>>> solution here. Help? >>>> >>>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>>> from the chain if it's already lost all vmas which points to it. >>>> For endlessly forking exploit this should work mostly like proposed patch >>>> which stops branching after some depth but without magic constant. >>> >>> Something like this. I leave proper comment for tomorrow. >> >> Hmm I'm not sure that will work as it is. If I understand it correctly, your >> patch can detect if the parent's anon_vma has no own references at the fork() >> time. But at the fork time, the parent is still alive, it only exits after the >> fork, right? So I guess it still has own references and the child will still >> allocate its new anon_vma, and the problem is not solved. > > But it could reuse anon_vma from grandparent or older. 
> Count of anon_vmas in chain will be limited with count of alive processes. Ah I missed that it can reuse older anon_vma, sorry. > I think it's better to describe this in terms of sets of anon_vma > instead hierarchy: > at clone vma inherits pages from parent together with set of anon_vma > which they belong. > For new pages it might allocate new anon_vma or reuse existing. After > my patch vma > will try to reuse anon_vma from that set which has no vmas which points to it. > As a result there will be no parent-child relation between anon_vma and > multiple pages might have equal (anon_vma, index) pair but I see no > problems here. Hmm I wonder if root anon_vma should be excluded from this reusal. For performance reasons, exclusive pages go to non-root anon_vma (see __page_set_anon_rmap()) and reusing root anon_vma would change this. Also from reading http://lwn.net/Articles/383162/ I understand that correctness also depends on the hierarchy and I wonder if there's a danger of reintroducing a bug like the one described there. Vlastimil >> >> So maybe we could detect that the own references dropped to zero when the parent >> does exit, and then change mapping of all relevant pages to the root anon_vma, >> destroy avc's of children and the anon_vma itself. But that sounds quite >> heavyweight :/ >> >> Vlastimil >> >>>> >>>>> >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756168AbaKSQ6X (ORCPT ); Wed, 19 Nov 2014 11:58:23 -0500 Received: from mail-wg0-f41.google.com ([74.125.82.41]:44627 "EHLO mail-wg0-f41.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754598AbaKSQ6W (ORCPT ); Wed, 19 Nov 2014 11:58:22 -0500 MIME-Version: 1.0 In-Reply-To: <546CC0CD.40906@suse.cz> References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 20:58:20 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov To: Vlastimil Babka Cc: Andrew Morton , Rik van Riel , Michel Lespinasse , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: > On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka wrote: >>> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote: >>>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov wrote: >>>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton >>>>> wrote: >>>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel wrote: >>>>>> >>>>>>> > Because of the serial forking there does indeed end up being an 
>>>>>>> > infinite number of vmas. The initial vma can never be deleted >>>>>>> > (even though the initial parent process has long since terminated) >>>>>>> > because the initial vma is referenced by the children. >>>>>>> >>>>>>> There is a finite number of VMAs, but an infite number of >>>>>>> anon_vmas. >>>>>>> >>>>>>> Subtle, yet deadly... >>>>>> >>>>>> Well, we clearly have the data structures screwed up. I've forgotten >>>>>> enough about this code for me to be unable to work out what the fixed >>>>>> up data structures would look like :( But surely there is some proper >>>>>> solution here. Help? >>>>> >>>>> Not sure if it's right but probably we could reuse on fork an old anon_vma >>>>> from the chain if it's already lost all vmas which points to it. >>>>> For endlessly forking exploit this should work mostly like proposed patch >>>>> which stops branching after some depth but without magic constant. >>>> >>>> Something like this. I leave proper comment for tomorrow. >>> >>> Hmm I'm not sure that will work as it is. If I understand it correctly, your >>> patch can detect if the parent's anon_vma has no own references at the fork() >>> time. But at the fork time, the parent is still alive, it only exits after the >>> fork, right? So I guess it still has own references and the child will still >>> allocate its new anon_vma, and the problem is not solved. >> >> But it could reuse anon_vma from grandparent or older. >> Count of anon_vmas in chain will be limited with count of alive processes. > > Ah I missed that it can reuse older anon_vma, sorry. > >> I think it's better to describe this in terms of sets of anon_vma >> instead hierarchy: >> at clone vma inherits pages from parent together with set of anon_vma >> which they belong. >> For new pages it might allocate new anon_vma or reuse existing. After >> my patch vma >> will try to reuse anon_vma from that set which has no vmas which points to it. 
>> As a result there will be no parent-child relation between anon_vma and
>> multiple pages might have equal (anon_vma, index) pair but I see no
>> problems here.
>
> Hmm I wonder if root anon_vma should be excluded from this reusal. For
> performance reasons, exclusive pages go to non-root anon_vma (see
> __page_set_anon_rmap()) and reusing root anon_vma would change this.

This is simple: in my patch this can be achieved by bumping its nr_vmas
by one, so it will never be reused.

> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
> also depends on the hierarchy and I wonder if there's a danger of reintroducing
> a bug like the one described there.

If I remember right, that was fixed by linking non-exclusively mapped
pages to the root anon_vma instead of the anon_vma from the vma where
the fault happened. After my patch this still works. The topology
hierarchy actually isn't used. There is just one selected "root"
anon_vma which dies last. That's all.

>
> Vlastimil
>
>>>
>>> So maybe we could detect that the own references dropped to zero when the parent
>>> does exit, and then change mapping of all relevant pages to the root anon_vma,
>>> destroy avc's of children and the anon_vma itself. But that sounds quite
>>> heavyweight :/
>>>
>>> Vlastimil
>>>
>>>>>
>>>>>>
>>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@kvack.org >> > From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933240AbaKSXOS (ORCPT ); Wed, 19 Nov 2014 18:14:18 -0500 Received: from mail-qg0-f43.google.com ([209.85.192.43]:61030 "EHLO mail-qg0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932633AbaKSXOR (ORCPT ); Wed, 19 Nov 2014 18:14:17 -0500 MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Wed, 19 Nov 2014 15:14:15 -0800 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse To: Konstantin Khlebnikov Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote: > On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: >> Also from reading http://lwn.net/Articles/383162/ I understand that correctness >> also depends on the hierarchy and I wonder if there's a danger of reintroducing >> a bug like the one described there. > > If I remember right that was fixed by linking non-exclusively mapped pages to > root anon_vma instead of anon_vma from vma where fault has happened. > After my patch this still works. 
Topology hierarchy actually isn't used.
> Here just one selected "root' anon_vma which dies last. That's all.

That's not how I remember it.

An anon_vma corresponds to a given vma V, and is used to track all
vmas (V and descendant vmas) that may include a page that was
originally mapped in V.

Each anon page has a link to the anon_vma corresponding to the vma
it was originally faulted in, and an offset indicating where the
page was located relative to that original VMA.

The anon_vma has an interval tree of struct anon_vma_chain, and each
struct anon_vma_chain includes a link to a descendant-of-V vma. This
allows rmap to quickly find all the vmas that may map a given page
(based on the page's anon_vma and offset).

When forking or splitting vmas, the new vma is a descendant of the
same vmas as the old one so it must be added to all the anon_vma
interval trees that were referencing the old one (that is, ancestors
of the new vma). To that end, all the struct anon_vma_chain pointing
to a given vma are kept on a linked list, and struct anon_vma_chain
includes a link to the anon_vma holding the interval tree.

Locking the entire structure is done with a single lock hosted in the
root anon_vma (that is, the anon_vma of a vma that was created by
mmap() and not by cloning or forking existing vmas).

Limiting the length of the ancestors linked list is correct, though it
has performance implications. In the extreme case, forcing all vmas to
be added on the root anon_vma's interval tree would be correct, though
it may re-introduce the performance problems that led to the
introduction of anon_vma.

The good thing about Konstantin's proposal is that it does not have
any magic constant like mine did. However, I think he is mistaken in
saying that hierarchy isn't used - an ancestor vma will always have
more descendants than its children, and the reason for the hierarchy
is to limit the number of vmas that rmap must explore.
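For readers who do not have mm/rmap.c in their head, the layout described above can be sketched in user space roughly as follows. This is only a toy model under stated assumptions, not kernel code: every toy_* name is made up for illustration, a plain linked list stands in for the interval tree, there is no locking, and vma splitting is not modeled.

```c
#include <stdio.h>
#include <stdlib.h>

struct toy_anon_vma;
struct toy_vma;

/* One link between a vma and an anon_vma (models struct anon_vma_chain). */
struct toy_avc {
	struct toy_vma *vma;
	struct toy_anon_vma *anon_vma;
	struct toy_avc *next_for_vma;      /* all links that point to the same vma */
	struct toy_avc *next_for_anon_vma; /* list standing in for the interval tree */
};

struct toy_anon_vma {
	struct toy_anon_vma *root; /* in the kernel the lock is hosted in the root */
	struct toy_avc *vmas;      /* every vma that may map pages of this anon_vma */
};

struct toy_vma {
	struct toy_anon_vma *anon_vma; /* anon_vma for pages first faulted in here */
	struct toy_avc *chain;         /* links to its own and all ancestor anon_vmas */
};

/* Hypothetical helper: zeroed allocation, exit on failure to keep the sketch short. */
static void *toy_zalloc(size_t size)
{
	void *p = calloc(1, size);

	if (!p) {
		perror("calloc");
		exit(EXIT_FAILURE);
	}
	return p;
}

/* Put vma on av's list and av on vma's chain. */
static void toy_link(struct toy_vma *vma, struct toy_anon_vma *av)
{
	struct toy_avc *avc = toy_zalloc(sizeof(*avc));

	avc->vma = vma;
	avc->anon_vma = av;
	avc->next_for_vma = vma->chain;
	vma->chain = avc;
	avc->next_for_anon_vma = av->vmas;
	av->vmas = avc;
}

/*
 * Create a vma. With parent == NULL this models mmap(); otherwise it models
 * fork(): the child is linked to every anon_vma the parent is linked to and
 * then gets a fresh anon_vma of its own.
 */
struct toy_vma *toy_new_vma(const struct toy_vma *parent)
{
	struct toy_vma *vma = toy_zalloc(sizeof(*vma));
	struct toy_anon_vma *av = toy_zalloc(sizeof(*av));
	const struct toy_avc *avc;

	if (parent)
		for (avc = parent->chain; avc; avc = avc->next_for_vma)
			toy_link(vma, avc->anon_vma);

	av->root = parent ? parent->anon_vma->root : av;
	vma->anon_vma = av;
	toy_link(vma, av);
	return vma;
}

/* Number of vmas an rmap walk would have to visit for a page of this anon_vma. */
int toy_rmap_candidates(const struct toy_anon_vma *av)
{
	const struct toy_avc *avc;
	int n = 0;

	for (avc = av->vmas; avc; avc = avc->next_for_anon_vma)
		n++;
	return n;
}
```

With a three-generation chain (a = toy_new_vma(NULL), b = toy_new_vma(a), c = toy_new_vma(b)), toy_rmap_candidates() gives 3 for a's anon_vma, 2 for b's and 1 for c's, which is the "ancestor has more descendants" property mentioned above. The sketch never frees anything, so it says nothing about anon_vma lifetime.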
-- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756779AbaKTOmH (ORCPT ); Thu, 20 Nov 2014 09:42:07 -0500 Received: from mail-wi0-f180.google.com ([209.85.212.180]:59010 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751257AbaKTOmE (ORCPT ); Thu, 20 Nov 2014 09:42:04 -0500 MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Thu, 20 Nov 2014 18:42:03 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov To: Michel Lespinasse Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse wrote: > On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote: >> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote: >>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness >>> also depends on the hierarchy and I wonder if there's a danger of reintroducing >>> a bug like the one described there. 
>>
>> If I remember right that was fixed by linking non-exclusively mapped pages to
>> root anon_vma instead of anon_vma from vma where fault has happened.
>> After my patch this still works. Topology hierarchy actually isn't used.
>> Here just one selected "root' anon_vma which dies last. That's all.
>
> That's not how I remember it.

??? That's at the end of the LWN article:

[quote]
The fix is straightforward; when linking an existing page to an
anon_vma structure, the kernel needs to pick the one which is highest
in the process hierarchy; that guarantees that the anon_vma will not
go away prematurely.
[/quote]

Nowadays this happens in __page_set_anon_rmap():

        /*
         * If the page isn't exclusively mapped into this vma,
         * we must use the _oldest_ possible anon_vma for the
         * page mapping!
         */
        if (!exclusive)
                anon_vma = anon_vma->root;

The rest of the tree-like topology affects only performance.

>
> An anon_vma corresponds to a given vma V, and is used to track all
> vmas (V and descendant vmas) that may include a page that was
> originally mapped in V.
>
> Each anon page has a link to the anon_vma corresponding to the vma
> they were originally faulted in, and an offset indicating where the
> page was located relative to that original VMA.
>
> The anon_vma has an interval tree of struct anon_vma_chain, and each
> struct anon_vma_chain includes a link to a descendent-of-V vma. This
> allows rmap to quickly find all the vmas that may map a given page
> (based on the page's anon_vma and offset).
>
> When forking or splitting vmas, the new vma is a descendent of the
> same vmas as the old one so it must be added to all the anon_vma
> interval trees that were referencing the old one (that is, ancestors
> of the new vma). To that end, all the struct anon_vma_chain pointing
> to a given vma are kept on a linked list, and struct anon_vma_chain
> includes a link to the anon_vma holding the interval tree.
>
> Locking the entire structure is done with a single lock hosted in the
> root anon_vma (that is, a vma that was created by mmap() and not by
> cloning or forking existing vmas).
>
> Limit the length of the ancestors linked list is correct, though it
> has performance implications. In the extreme case, forcing all vmas to
> be added on the root vma's interval tree would be correct, though it
> may re-introduce the performance problems that lead to the
> introduction of anon_vma.
>
> The good thing about Konstantin's proposal is that it does not have
> any magic constant like mine did. However, I think he is mistaken in
> saying that hierarchy isn't used - an ancestor vma will always have
> more descendents than its children, and the reason for the hierarchy
> is to limit the number of vmas that rmap must explore.

I mean that after breaking the hierarchy the whole structure stays
correct and the kernel wouldn't explode. Of course, reusing an anon_vma
from an ancestor makes the rmap walk less efficient, because newly
allocated pages will get falsely aliased vmas where they will never be
mapped.

I'm thinking about a restriction on reusing anon_vmas which might
improve performance without breaking the asymptotic bound on the number
of anon_vmas in the worst case. For example, this heuristic: allow
reuse only of an anon_vma with a single direct descendant. It seems
there will be up to around two times more anon_vmas, but false aliasing
must be much lower.

>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.
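To make the "bounded by the number of alive processes" argument concrete, here is a small user-space simulation of the serial-fork pattern. It is a sketch under loud assumptions and is not the proposed patch: all toy_* names are invented, an anon_vma is considered freed as soon as no vma is linked to it (pages are ignored), the grandparent is assumed to exit right after each fork, and the reuse rule is simply "take the first anon_vma on the inherited chain whose nr_vmas is zero". Neither the single-descendant restriction nor excluding the root from reuse is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_AV 1024 /* capacity of the toy anon_vma pool */

struct toy_anon_vma {
	int nr_vmas;  /* vmas whose own anon_vma is this one */
	int nr_links; /* vmas that carry this anon_vma on their chain */
};

struct toy_vma {
	int chain[TOY_MAX_AV]; /* pool indices of all linked anon_vmas */
	int chain_len;
	int own;               /* pool index used for this vma's new pages */
};

struct toy_world {
	struct toy_anon_vma pool[TOY_MAX_AV];
	int nr_allocated; /* anon_vmas ever allocated */
	int nr_live;      /* anon_vmas not yet freed */
};

static int toy_av_alloc(struct toy_world *w)
{
	assert(w->nr_allocated < TOY_MAX_AV);
	w->nr_live++;
	return w->nr_allocated++;
}

/* fork(): inherit the parent's chain, then reuse or allocate an own anon_vma. */
static void toy_fork(struct toy_world *w, const struct toy_vma *parent,
		     struct toy_vma *child, bool reuse)
{
	int i;

	child->chain_len = parent->chain_len;
	for (i = 0; i < parent->chain_len; i++) {
		child->chain[i] = parent->chain[i];
		w->pool[child->chain[i]].nr_links++;
	}

	child->own = -1;
	for (i = 0; reuse && i < child->chain_len; i++) {
		if (w->pool[child->chain[i]].nr_vmas == 0) {
			child->own = child->chain[i]; /* orphaned: take it over */
			break;
		}
	}
	if (child->own < 0) {
		child->own = toy_av_alloc(w);
		child->chain[child->chain_len++] = child->own;
		w->pool[child->own].nr_links++;
	}
	w->pool[child->own].nr_vmas++;
}

/* exit(): drop all links; an anon_vma with no links left is freed. */
static void toy_exit(struct toy_world *w, struct toy_vma *vma)
{
	int i;

	w->pool[vma->own].nr_vmas--;
	for (i = 0; i < vma->chain_len; i++)
		if (--w->pool[vma->chain[i]].nr_links == 0)
			w->nr_live--;
	vma->chain_len = 0;
}

/*
 * Each process forks one child and the grandparent exits right after,
 * so at most three processes are tracked at once. Returns the number
 * of anon_vmas still allocated at the end, or -1 if the request does
 * not fit the toy pool.
 */
int toy_serial_fork(int generations, bool reuse)
{
	static struct toy_world w;
	static struct toy_vma vma[3];
	int g;

	if (generations < 0 || generations >= TOY_MAX_AV)
		return -1;

	memset(&w, 0, sizeof(w));
	vma[0].own = toy_av_alloc(&w);
	vma[0].chain[0] = vma[0].own;
	vma[0].chain_len = 1;
	w.pool[vma[0].own].nr_links = 1;
	w.pool[vma[0].own].nr_vmas = 1;

	for (g = 1; g <= generations; g++) {
		toy_fork(&w, &vma[(g - 1) % 3], &vma[g % 3], reuse);
		if (g >= 2)
			toy_exit(&w, &vma[(g - 2) % 3]);
	}
	return w.nr_live;
}
```

In this model toy_serial_fork(100, false) leaves 101 anon_vmas allocated, one per generation, because every descendant still links to all of its ancestors' anon_vmas. toy_serial_fork(100, true) levels off at 3, the peak number of processes alive at once. The model ignores rmap walk cost, so it shows only the bound on the count and nothing about the false-aliasing trade-off discussed above.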
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757313AbaKTOua (ORCPT ); Thu, 20 Nov 2014 09:50:30 -0500 Received: from mx1.redhat.com ([209.132.183.28]:39446 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756498AbaKTOu2 (ORCPT ); Thu, 20 Nov 2014 09:50:28 -0500 Message-ID: <546DFFA1.4030700@redhat.com> Date: Thu, 20 Nov 2014 09:50:09 -0500 From: Rik van Riel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 To: Konstantin Khlebnikov , Michel Lespinasse CC: Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound References: <502D42E5.7090403@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: > I'm thinking about limitation for reusing anon_vmas which might > increase performance without breaking asymptotic estimation of > count anon_vma in the worst case. For example this heuristic: allow > to reuse only anon_vma with single direct descendant. It seems > there will be arount up to two times more anon_vmas but > false-aliasing must be much lower. 
It may even be possible to not create a child anon_vma for the first child a parent forks, but only create a new anon_vma once the parent clones a second child (alive at the same time as the first child). That still takes care of things like apache or sendmail, but would not create infinite anon_vmas for a task that keeps forking itself to infinite depth without calling exec... - -- All rights reversed -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q= =Vk+H -----END PGP SIGNATURE----- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756664AbaKTPD1 (ORCPT ); Thu, 20 Nov 2014 10:03:27 -0500 Received: from mail-wi0-f171.google.com ([209.85.212.171]:44656 "EHLO mail-wi0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751257AbaKTPDZ (ORCPT ); Thu, 20 Nov 2014 10:03:25 -0500 MIME-Version: 1.0 In-Reply-To: <546DFFA1.4030700@redhat.com> References: <502D42E5.7090403@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> Date: Thu, 20 Nov 2014 19:03:24 +0400 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Konstantin Khlebnikov To: Rik van Riel Cc: Michel 
Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List, "linux-mm@kvack.org", Tim Hartrick, Michal Hocko
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
>
>> I'm thinking about a limitation on reusing anon_vmas which might
>> increase performance without breaking the asymptotic estimate of the
>> anon_vma count in the worst case. For example, this heuristic: allow
>> reusing only an anon_vma with a single direct descendant. It seems
>> there will be up to around two times more anon_vmas, but
>> false-aliasing must be much lower.
>
> It may even be possible to not create a child anon_vma for the
> first child a parent forks, but only create a new anon_vma once
> the parent clones a second child (alive at the same time as the
> first child).
>
> That still takes care of things like apache or sendmail, but
> would not create infinite anon_vmas for a task that keeps forking
> itself to infinite depth without calling exec...

But this scheme is still exploitable. Malicious software could easily
create a sequence of forks and exits that leads to an infinite chain of
anon_vmas.

> > - -- > All rights reversed > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1 > > iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi > GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z > 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF > z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT > ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i > Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q= > =Vk+H > -----END PGP SIGNATURE----- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757003AbaKTP1H (ORCPT ); Thu, 20 Nov 2014 10:27:07 -0500 Received: from mail-qg0-f43.google.com ([209.85.192.43]:38176 "EHLO mail-qg0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753776AbaKTP1F (ORCPT ); Thu, 20 Nov 2014 10:27:05 -0500 MIME-Version: 1.0 In-Reply-To: References: <502D42E5.7090403@redhat.com> <20120818000312.GA4262@evergreen.ssec.wisc.edu> <502F100A.1080401@redhat.com> <20120822032057.GA30871@google.com> <50345232.4090002@redhat.com> <20130603195003.GA31275@evergreen.ssec.wisc.edu> <20141114163053.GA6547@cosmos.ssec.wisc.edu> <20141117160212.b86d031e1870601240b0131d@linux-foundation.org> <20141118014135.GA17252@cosmos.ssec.wisc.edu> <546AB1F5.6030306@redhat.com> <20141118121936.07b02545a0684b2cc839a10c@linux-foundation.org> <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> Date: Thu, 20 Nov 2014 16:27:03 +0100 Message-ID: Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound From: Michel Lespinasse To: Konstantin Khlebnikov Cc: Vlastimil Babka , Andrew Morton , Rik van Riel , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Michal Hocko Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Nov 20, 2014 at 3:42 PM, 
Konstantin Khlebnikov wrote:
> On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse wrote:
>> On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov wrote:
>>> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka wrote:
>>>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
>>>> also depends on the hierarchy and I wonder if there's a danger of reintroducing
>>>> a bug like the one described there.
>>>
>>> If I remember right that was fixed by linking non-exclusively mapped pages to
>>> root anon_vma instead of anon_vma from vma where fault has happened.
>>> After my patch this still works. Topology hierarchy actually isn't used.
>>> Here just one selected "root" anon_vma which dies last. That's all.
>>
>> That's not how I remember it.
>
> ??? That is at the end of the lwn article:
>
> [quote]
> The fix is straightforward; when linking an existing page to an
> anon_vma structure,
> the kernel needs to pick the one which is highest in the process hierarchy;
> that guarantees that the anon_vma will not go away prematurely.
> [/quote]
>
> nowadays this happens in __page_set_anon_rmap():
>
>         /*
>          * If the page isn't exclusively mapped into this vma,
>          * we must use the _oldest_ possible anon_vma for the
>          * page mapping!
>          */
>         if (!exclusive)
>                 anon_vma = anon_vma->root;
>
> The rest of the tree-like topology affects only performance.

Ah, I see what you mean.

IIRC the !exclusive bit is for pages coming back from swap, where we
don't have enough tracking info to remember where the page was first
created so we have to assume the worst case (i.e. that it was created
in the root anon_vma). My understanding was that we don't exercise
this in the non-swap case. Looking back into it, it seems that we are
now doing this with ksm and migrate as well, though.

The point remains though that moving pages higher than necessary in
the anon_vma hierarchy is OK from a correctness perspective but could
have bad implications from a performance perspective.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754771AbaKYLAB (ORCPT ); Tue, 25 Nov 2014 06:00:01 -0500
Received: from cantor2.suse.de ([195.135.220.15]:36925 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753721AbaKYK74 (ORCPT ); Tue, 25 Nov 2014 05:59:56 -0500
Date: Tue, 25 Nov 2014 11:59:53 +0100
From: Michal Hocko
To: Konstantin Khlebnikov
Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List, "linux-mm@kvack.org", Tim Hartrick
Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound
Message-ID: <20141125105953.GC4607@dhcp22.suse.cz>
References: <546BDB29.9050403@suse.cz> <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote:
> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> >>
> >>> I'm thinking about a limitation on reusing anon_vmas which might
> >>> increase performance without breaking the asymptotic estimate of the
> >>> anon_vma count in the worst case. For example, this heuristic: allow
> >>> reusing only an anon_vma with a single direct descendant. It seems
> >>> there will be up to around two times more anon_vmas, but
> >>> false-aliasing must be much lower.
>
> Done. RFC patch in attachment.

This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas.
I have applied the patch on top of 3.18.0-rc6. [ 12.380189] ------------[ cut here ]------------ [ 12.380221] kernel BUG at mm/rmap.c:385! [ 12.380239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 12.380272] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 12.380518] CPU: 1 PID: 3704 Comm: kdm_greet Not tainted 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 12.380554] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 12.380584] task: ffff8801272bc2c0 ti: ffff8800bcaf0000 task.ti: ffff8800bcaf0000 [ 12.380614] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 12.380653] RSP: 0018:ffff8800bcaf3d28 EFLAGS: 00010286 [ 12.380676] RAX: ffff8800bcb3e690 RBX: ffff8800bcb35e28 RCX: ffff8801272bcb60 [ 12.380706] RDX: ffff8800bcb38e70 RSI: 0000000000000001 RDI: ffff8800bcb38e70 [ 12.380734] RBP: ffff8800bcaf3d78 R08: 0000000000000000 R09: 0000000000000000 [ 12.380764] R10: 0000000000000000 R11: ffff8800bcb3e6a0 R12: ffff8800bcb3e680 [ 12.380793] R13: ffff8800bcb3e690 R14: ffff8800bcb38e70 R15: ffff8800bcb38e70 [ 12.380822] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 12.380855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 12.380880] CR2: 00007fcd2603b0e8 CR3: 0000000001a11000 CR4: 00000000000407e0 [ 12.380908] Stack: [ 12.380918] ffff8801272e9dc0 ffff8800bcb35e38 ffff8800bcb35e38 ffff8800bcb3e680 [ 12.380953] ffff8800bcaf3d78 ffff8800bcb35dc0 ffff8800bcaf3dd8 0000000000000000 [ 12.380989] 0000000000000000 ffff8800bcb35dc0 ffff8800bcaf3dc8 ffffffff81119e26 [ 12.381024] Call Trace: [ 
12.381038] [] free_pgtables+0x8e/0xcc [ 12.381062] [] exit_mmap+0x84/0x123 [ 12.381086] [] mmput+0x5e/0xbb [ 12.381107] [] do_exit+0x39c/0x97e [ 12.381131] [] ? context_tracking_user_exit+0x79/0x116 [ 12.381160] [] ? __this_cpu_preempt_check+0x13/0x15 [ 12.381188] [] do_group_exit+0x4c/0xc9 [ 12.381212] [] SyS_exit_group+0x14/0x14 [ 12.381238] [] system_call_fastpath+0x12/0x17 [ 12.381262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 12.381445] RIP [] unlink_anon_vmas+0x12b/0x169 [ 12.381473] RSP [ 12.386659] ---[ end trace 5761ee18fca12427 ]--- [ 12.386662] Fixing recursive fault but reboot is needed! [ 13.158240] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 13.259294] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 13.259468] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 16.790917] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx [ 16.790957] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready [ 18.846524] iwlwifi 0000:02:00.0: L1 Enabled - LTR Disabled [ 18.846742] iwlwifi 0000:02:00.0: Radio type=0x0-0x3-0x1 [ 18.941594] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready [ 19.145595] e1000e: lan0 NIC Link is Down [ 19.287399] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.391325] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.391475] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 19.573640] e1000e: lan0 NIC Link is Down [ 19.717813] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.819729] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X [ 19.819883] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready [ 22.938849] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx [ 22.938889] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready [ 23.404027] ------------[ cut here ]------------ [ 23.404056] kernel BUG at mm/rmap.c:385! 
[ 23.404074] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC [ 23.404107] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 23.404353] CPU: 1 PID: 4506 Comm: synaptikscfg Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 23.404395] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 23.404425] task: ffff8800a337c2c0 ti: ffff88009f4ec000 task.ti: ffff88009f4ec000 [ 23.404455] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 23.404494] RSP: 0018:ffff88009f4efd28 EFLAGS: 00010282 [ 23.405766] RAX: ffff88009f54d010 RBX: ffff88009f54c488 RCX: 0000000000000000 [ 23.407062] RDX: ffff88009f5a3a50 RSI: 0000000000000001 RDI: ffff88009f5a3a50 [ 23.408352] RBP: ffff88009f4efd78 R08: 0000000000000000 R09: 0000000000000000 [ 23.409597] R10: 0000000000000000 R11: ffff88009f54d020 R12: ffff88009f54d000 [ 23.410816] R13: ffff88009f54d010 R14: ffff88009f5a3a50 R15: ffff88009f5a3a50 [ 23.411998] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 23.413167] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 23.414320] CR2: 00007f7a855608f0 CR3: 00000000a328c000 CR4: 00000000000407e0 [ 23.415471] Stack: [ 23.416603] ffff8800a3390e00 ffff88009f54c498 ffff88009f54c498 ffff88009f54d000 [ 23.417747] ffff88009f4efd78 ffff88009f54c420 ffff88009f4efdd8 0000000000000000 [ 23.418892] 0000000000000000 ffff88009f54c420 ffff88009f4efdc8 ffffffff81119e26 [ 23.420027] Call Trace: [ 23.421153] [] free_pgtables+0x8e/0xcc [ 23.422273] [] exit_mmap+0x84/0x123 [ 23.423411] [] ? 
do_exit+0x358/0x97e [ 23.424537] [] mmput+0x5e/0xbb [ 23.425665] [] do_exit+0x39c/0x97e [ 23.426766] [] ? context_tracking_user_exit+0x79/0x116 [ 23.427866] [] ? __this_cpu_preempt_check+0x13/0x15 [ 23.428962] [] do_group_exit+0x4c/0xc9 [ 23.430064] [] SyS_exit_group+0x14/0x14 [ 23.431162] [] system_call_fastpath+0x12/0x17 [ 23.432262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 23.434722] RIP [] unlink_anon_vmas+0x12b/0x169 [ 23.435924] RSP [ 23.441996] ---[ end trace 5761ee18fca12428 ]--- [ 23.442001] Fixing recursive fault but reboot is needed! [ 838.179454] ------------[ cut here ]------------ [ 838.180658] kernel BUG at mm/rmap.c:385! [ 838.181843] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC [ 838.183046] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 838.186983] CPU: 1 PID: 6643 Comm: colord-sane Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 838.188240] Hardware name: Dell Inc. 
Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 838.189503] task: ffff8800c4fd8000 ti: ffff880079c6c000 task.ti: ffff880079c6c000 [ 838.190765] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 838.192045] RSP: 0018:ffff880079c6fb68 EFLAGS: 00010286 [ 838.193324] RAX: ffff8800c5a70150 RBX: ffff8800a6fd5748 RCX: 0000000000000000 [ 838.194616] RDX: ffff8800a5379840 RSI: 0000000000000001 RDI: ffff8800a5379840 [ 838.195879] RBP: ffff880079c6fbb8 R08: 0000000000000000 R09: 0000000000000000 [ 838.197100] R10: 0000000000000000 R11: ffff8800c5a70160 R12: ffff8800c5a70140 [ 838.198289] R13: ffff8800c5a70150 R14: ffff8800a5379840 R15: ffff8800a5379840 [ 838.199448] FS: 0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 838.200604] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 838.201753] CR2: 00007fdfd692cde8 CR3: 0000000079d0d000 CR4: 00000000000407e0 [ 838.202902] Stack: [ 838.204029] ffff88011e6fc540 ffff8800a6fd5758 ffff8800a6fd5758 ffff8800c5a70140 [ 838.205180] ffff880079c6fbb8 ffff8800a6fd56e0 ffff880079c6fc18 0000000000000000 [ 838.206328] 0000000000000000 ffff8800a6fd56e0 ffff880079c6fc08 ffffffff81119e26 [ 838.207477] Call Trace: [ 838.208614] [] free_pgtables+0x8e/0xcc [ 838.209762] [] exit_mmap+0x84/0x123 [ 838.210897] [] ? do_exit+0x358/0x97e [ 838.212020] [] mmput+0x5e/0xbb [ 838.213132] [] do_exit+0x39c/0x97e [ 838.214232] [] ? get_signal+0xdb/0x68a [ 838.215324] [] ? poll_select_copy_remaining+0xfe/0xfe [ 838.216420] [] do_group_exit+0x4c/0xc9 [ 838.217521] [] get_signal+0x647/0x68a [ 838.218612] [] ? context_tracking_user_enter+0xdb/0x159 [ 838.219705] [] do_signal+0x28/0x657 [ 838.220796] [] ? __acct_update_integrals+0xbf/0xd4 [ 838.221894] [] ? preempt_count_sub+0xcd/0xdb [ 838.222998] [] ? vtime_account_user+0x88/0x95 [ 838.224105] [] ? _raw_spin_unlock+0x32/0x47 [ 838.225205] [] ? context_tracking_user_exit+0x79/0x116 [ 838.226308] [] ? 
context_tracking_user_exit+0x79/0x116 [ 838.227401] [] do_notify_resume+0x3f/0x94 [ 838.228495] [] int_signal+0x12/0x17 [ 838.229581] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 838.231909] RIP [] unlink_anon_vmas+0x12b/0x169 [ 838.233003] RSP [ 838.234248] ---[ end trace 5761ee18fca12429 ]--- [ 838.234251] Fixing recursive fault but reboot is needed! [ 1806.784267] ------------[ cut here ]------------ [ 1806.785322] kernel BUG at mm/rmap.c:385! [ 1806.786361] invalid opcode: 0000 [#4] PREEMPT SMP DEBUG_PAGEALLOC [ 1806.787397] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd [ 1806.790682] CPU: 1 PID: 8135 Comm: DNS Resolver #7 Tainted: G D 3.18.0-rc6-test-00001-gf5bc00c103ff #409 [ 1806.791728] Hardware name: Dell Inc. 
Latitude E6320/09PHH9, BIOS A08 10/18/2011 [ 1806.792779] task: ffff8800b3d40000 ti: ffff880079e34000 task.ti: ffff880079e34000 [ 1806.793816] RIP: 0010:[] [] unlink_anon_vmas+0x12b/0x169 [ 1806.794863] RSP: 0018:ffff880079e37d38 EFLAGS: 00010282 [ 1806.795894] RAX: ffff8800b508d790 RBX: ffff8800bcaa4e28 RCX: 0000000000000000 [ 1806.796948] RDX: ffff880124ce0f20 RSI: 0000000000000001 RDI: ffff880124ce0f20 [ 1806.798011] RBP: ffff880079e37d88 R08: 0000000000000000 R09: 0000000000000000 [ 1806.799048] R10: 00007fc2827f9db0 R11: ffff8800b508d7a0 R12: ffff8800b508d780 [ 1806.800105] R13: ffff8800b508d790 R14: ffff880124ce0f20 R15: ffff880124ce0f20 [ 1806.801143] FS: 00007fc2827fa700(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000 [ 1806.802206] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1806.803244] CR2: 00007fc2c6b87000 CR3: 00000000a3063000 CR4: 00000000000407e0 [ 1806.804305] Stack: [ 1806.805329] 00007fc280754000 ffff8800bcaa4e38 ffff8800bcaa4e38 ffff8800b508d780 [ 1806.806382] 0000000081098bfb ffff8800bcaa4dc0 ffff880079e37df8 00007fc27ff00000 [ 1806.807467] 00007fc280a00000 ffff8800bcaa4dc0 ffff880079e37dd8 ffffffff81119e26 [ 1806.808536] Call Trace: [ 1806.809570] [] free_pgtables+0x8e/0xcc [ 1806.810617] [] unmap_region+0xc8/0xec [ 1806.811658] [] ? 
__rb_erase_color+0x122/0x1f9 [ 1806.812724] [] do_munmap+0x275/0x2f7 [ 1806.813792] [] vm_munmap+0x48/0x61 [ 1806.814841] [] SyS_munmap+0x26/0x2f [ 1806.815884] [] system_call_fastpath+0x12/0x17 [ 1806.816951] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 [ 1806.819300] RIP [] unlink_anon_vmas+0x12b/0x169 [ 1806.820457] RSP [ 1806.822068] ---[ end trace 5761ee18fca1242a ]--- -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750975AbaKYPAK (ORCPT ); Tue, 25 Nov 2014 10:00:10 -0500 Received: from cantor2.suse.de ([195.135.220.15]:42899 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750860AbaKYPAI (ORCPT ); Tue, 25 Nov 2014 10:00:08 -0500 Date: Tue, 25 Nov 2014 16:00:06 +0100 From: Michal Hocko To: Konstantin Khlebnikov Cc: Rik van Riel , Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141125150006.GB4415@dhcp22.suse.cz> References: <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote: > On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote: > > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote: > >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: > >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: > >> >> -----BEGIN PGP SIGNED MESSAGE----- > >> >> 
Hash: SHA1
> >> >>
> >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> >> >>
> >> >>> I'm thinking about a limitation on reusing anon_vmas which might
> >> >>> increase performance without breaking the asymptotic estimate of the
> >> >>> anon_vma count in the worst case. For example, this heuristic: allow
> >> >>> reusing only an anon_vma with a single direct descendant. It seems
> >> >>> there will be up to around two times more anon_vmas, but
> >> >>> false-aliasing must be much lower.
> >>
> >> Done. RFC patch in attachment.
> >
> > This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have
> > applied the patch on top of 3.18.0-rc6.
>
> It seems I've screwed up the counter when an anon_vma is merged in
> anon_vma_prepare. The increment must be in the next if block:
>
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -182,8 +182,6 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  			if (unlikely(!anon_vma))
>  				goto out_enomem_free_avc;
>  			allocated = anon_vma;
> -			/* Bump degree, root anon_vma is its own parent. */
> -			anon_vma->degree++;
>  		}
>  
>  		anon_vma_lock_write(anon_vma);
> @@ -192,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
>
> I've tested it with trinity, but probably not for long enough.

OK, this has passed a few runs with the original reproducer:
$ date +%s; grep anon_vma /proc/slabinfo;
$ ./vma_chain_repro
$ sleep 1h
$ date +%s; grep anon_vma /proc/slabinfo
$ killall vma_chain_repro
$ date +%s; grep anon_vma /proc/slabinfo
1416923468
anon_vma 11523 11523 176 23 1 : tunables 0 0 0 : slabdata 501 501 0
1416927070
anon_vma 11477 11477 176 23 1 : tunables 0 0 0 : slabdata 499 499 0
1416927070
anon_vma 11127 11431 176 23 1 : tunables 0 0 0 : slabdata 497 497 0

anon_vmas do not seem to leak anymore.

I have forwarded the patch to the customer who was complaining about NSD but I guess it will take some time to get the confirmation. Anyway thanks a lot for your help and feel free to add Tested-by: Michal Hocko I have yet to look deeper into the code to give you my Reviewed-by. -- Michal Hocko SUSE Labs From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752694AbaKZRfX (ORCPT ); Wed, 26 Nov 2014 12:35:23 -0500 Received: from cantor2.suse.de ([195.135.220.15]:47895 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750828AbaKZRfV (ORCPT ); Wed, 26 Nov 2014 12:35:21 -0500 Date: Wed, 26 Nov 2014 18:35:17 +0100 From: Michal Hocko To: Konstantin Khlebnikov Cc: Rik van Riel , Michel Lespinasse , Vlastimil Babka , Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Kernel Mailing List , "linux-mm@kvack.org" , Tim Hartrick , Daniel Forrest Subject: Re: [PATCH] Repeated fork() causes SLAB to grow without bound Message-ID: <20141126173517.GA8180@dhcp22.suse.cz> References: <546CC0CD.40906@suse.cz> <546DFFA1.4030700@redhat.com> <20141125105953.GC4607@dhcp22.suse.cz> <20141125150006.GB4415@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20141125150006.GB4415@dhcp22.suse.cz> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 25-11-14 16:00:06, Michal Hocko wrote: > On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote: > > On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko wrote: > > > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote: > > >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov wrote: > > >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel wrote: > > >> >> -----BEGIN PGP SIGNED MESSAGE----- > > >> >> Hash: SHA1 > > >> >> > > >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote: > > >> >> > > 
>> >>> I'm thinking about a limitation on reusing anon_vmas which might
> > >> >>> increase performance without breaking the asymptotic estimate of the
> > >> >>> anon_vma count in the worst case. For example, this heuristic: allow
> > >> >>> reusing only an anon_vma with a single direct descendant. It seems
> > >> >>> there will be up to around two times more anon_vmas, but
> > >> >>> false-aliasing must be much lower.
> > >>
> > >> Done. RFC patch in attachment.

OK, I finally managed to untangle myself from the vma chains, and your patch
makes sense to me; it is quite clever, actually. Here it is, including the
fixup.

---
> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001
> From: Konstantin Khlebnikov
> Date: Tue, 25 Nov 2014 10:54:44 +0100
> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy
>
> A constantly forking task causes unlimited growth of the anon_vma chain.
> Each new child allocates a new level of anon_vmas and links its vmas to
> all previous levels, because it inherits pages from them. None of these
> anon_vmas can be freed, because there might be pages which point to them.
>
> This patch adds a heuristic which decides to reuse an existing anon_vma
> instead of forking a new one. It counts vmas and direct descendants for
> each anon_vma. An anon_vma with degree lower than two will be reused at
> the next fork. As a result, each anon_vma has either a live vma or at
> least two descendants, endless chains are no longer possible, and the
> count of anon_vmas is no more than twice the count of vmas.
>
> Signed-off-by: Konstantin Khlebnikov
> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu

Tested-by: Michal Hocko
Reviewed-by: Michal Hocko

and I guess
Reported-by: Daniel Forrest
who somehow vanished from the CC list (added back) would be appropriate as
well.

plus
Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue)
and mark it for stable

Thanks!

> ---
>  include/linux/rmap.h | 16 ++++++++++++++++
>  mm/rmap.c            | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c0c2bce6b0b7..b1d140c20b37 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -45,6 +45,22 @@ struct anon_vma {
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
> +
> +	/*
> +	 * Count of child anon_vmas and VMAs which points to this anon_vma.
> +	 *
> +	 * This counter is used for making decision about reusing old anon_vma
> +	 * instead of forking new one. It allows to detect anon_vmas which have
> +	 * just one direct descendant and no vmas. Reusing such anon_vma not
> +	 * leads to significant preformance regression but prevents degradation
> +	 * of anon_vma hierarchy to endless linear chain.
> +	 *
> +	 * Root anon_vma is never reused because it is its own parent and it has
> +	 * at leat one vma or child, thus at fork it's degree is at least 2.
> +	 */
> +	unsigned degree;
> +
> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>  };
>  
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19886fb2f13a..40ae8184a1e1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>  	if (anon_vma) {
>  		atomic_set(&anon_vma->refcount, 1);
> +		anon_vma->degree = 1;	/* Reference for first vma */
> +		anon_vma->parent = anon_vma;
>  		/*
>  		 * Initialise the anon_vma root to point to itself. If called
>  		 * from fork, the root will be reset to the parents anon_vma.
> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
>  		anon_vma_chain_link(dst, avc, anon_vma);
> +
> +		/*
> +		 * Reuse existing anon_vma if its degree lower than two,
> +		 * that means it has no vma and just one anon_vma child.
> +		 */
> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
> +				anon_vma->degree < 2)
> +			dst->anon_vma = anon_vma;
>  	}
> +	if (dst->anon_vma)
> +		dst->anon_vma->degree++;
>  	unlock_anon_vma_root(root);
>  	return 0;
>  
> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (!pvma->anon_vma)
>  		return 0;
>  
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (anon_vma_clone(vma, pvma))
>  		return -ENOMEM;
>  
> +	/* An old anon_vma has been reused. */
> +	if (vma->anon_vma)
> +		return 0;
> +
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
>  	if (!anon_vma)
> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * lock any of the anon_vmas in this anon_vma tree.
>  	 */
>  	anon_vma->root = pvma->anon_vma->root;
> +	anon_vma->parent = pvma->anon_vma;
>  	/*
>  	 * With refcounts, an anon_vma can stay around longer than the
>  	 * process it belongs to.
The root anon_vma needs to be pinned until
> @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	vma->anon_vma = anon_vma;
>  	anon_vma_lock_write(anon_vma);
>  	anon_vma_chain_link(vma, avc, anon_vma);
> +	anon_vma->parent->degree++;
>  	anon_vma_unlock_write(anon_vma);
>  
>  	return 0;
> @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  		 * Leave empty anon_vmas on the list - we'll need
>  		 * to free them outside the lock.
>  		 */
> -		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
> +		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
> +			anon_vma->parent->degree--;
>  			continue;
> +		}
>  
>  		list_del(&avc->same_vma);
>  		anon_vma_chain_free(avc);
>  	}
> +	if (vma->anon_vma)
> +		vma->anon_vma->degree--;
>  	unlock_anon_vma_root(root);
>  
>  	/*
> @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>  		struct anon_vma *anon_vma = avc->anon_vma;
>  
> +		BUG_ON(anon_vma->degree);
>  		put_anon_vma(anon_vma);
>  
>  		list_del(&avc->same_vma);
> --
> 2.1.3

-- 
Michal Hocko
SUSE Labs