* [PATCH] mm: write protect MADV_FREE pages
From: Shaohua Li <shli@fb.com> @ 2017-01-23 23:15 UTC
To: linux-mm
Cc: Kernel-team, Minchan Kim, Andrew Morton, Hugh Dickins, Rik van Riel, stable

The page reclaim code assumes that writing to a page with a clean pte
will trigger a page fault, because there is a window between clearing
the pte and flushing the TLB in which a new write could come. If the
new write doesn't trigger a page fault, page reclaim will not notice
it, will think the page is clean, and will reclaim it. MADV_FREE pages
don't comply with this rule: the pte is only cleaned, without being
write-protected, so there will be no page fault for a new write. This
will cause data corruption.

Cc: Minchan Kim <minchan@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: stable@kernel.org
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/huge_memory.c | 1 +
 mm/madvise.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a6bd6c..9cc5de5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			tlb->fullmm);
 	orig_pmd = pmd_mkold(orig_pmd);
 	orig_pmd = pmd_mkclean(orig_pmd);
+	orig_pmd = pmd_wrprotect(orig_pmd);

 	set_pmd_at(mm, addr, pmd, orig_pmd);
 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
diff --git a/mm/madvise.c b/mm/madvise.c
index 0e3828e..bfb6800 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

 		ptent = pte_mkold(ptent);
 		ptent = pte_mkclean(ptent);
+		ptent = pte_wrprotect(ptent);
 		set_pte_at(mm, addr, pte, ptent);
 		if (PageActive(page))
 			deactivate_page(page);
--
2.9.3
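
For concreteness, here is a minimal userspace sketch of the pattern the
changelog describes. It is purely illustrative and not part of the patch or
this thread; the mapping size and fill values are arbitrary, the MADV_FREE
fallback define is an assumption for older libc headers, and actually hitting
the race would additionally require reclaim to scan the range while the
second write is in flight, i.e. external memory pressure.

#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8                        /* Linux value; older libc headers lack it */
#endif

#define LEN (64UL << 20)                   /* 64 MB, arbitrary */

int main(void)
{
	unsigned char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, LEN);              /* populate the range */

	if (madvise(p, LEN, MADV_FREE))    /* mark the range lazily freeable */
		return 1;

	memset(p, 0x55, LEN);              /* reuse it: the new write must redirty the
					      page or take a write fault, never be lost */

	for (size_t i = 0; i < LEN; i++)   /* read it back */
		if (p[i] != 0x55) {
			printf("lost data at offset %zu: 0x%02x\n", i, (unsigned)p[i]);
			return 2;
		}

	puts("no corruption observed");
	return 0;
}

Without the write protection added by the patch, reclaim racing with the
second memset could discard pages whose new data it never noticed, and the
read-back would see zero-filled pages.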
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Andrew Morton @ 2017-01-23 23:28 UTC
To: Shaohua Li
Cc: linux-mm, Kernel-team, Minchan Kim, Hugh Dickins, Rik van Riel, stable

On Mon, 23 Jan 2017 15:15:52 -0800 Shaohua Li <shli@fb.com> wrote:

> The page reclaim code assumes that writing to a page with a clean pte
> will trigger a page fault, because there is a window between clearing
> the pte and flushing the TLB in which a new write could come. If the
> new write doesn't trigger a page fault, page reclaim will not notice
> it, will think the page is clean, and will reclaim it. MADV_FREE pages
> don't comply with this rule: the pte is only cleaned, without being
> write-protected, so there will be no page fault for a new write. This
> will cause data corruption.

I'd like to see here a complete description of the bug's effects: what
sort of workload will trigger it, what the end-user visible effects
are, etc.

> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> 			tlb->fullmm);
> 	orig_pmd = pmd_mkold(orig_pmd);
> 	orig_pmd = pmd_mkclean(orig_pmd);
> +	orig_pmd = pmd_wrprotect(orig_pmd);

Is this the right way round? There's still a window where we won't get
that write fault on the cleaned pte. Should the pmd_wrprotect() happen
before the pmd_mkclean()?

> 	set_pmd_at(mm, addr, pmd, orig_pmd);
> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0e3828e..bfb6800 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>
> 	ptent = pte_mkold(ptent);
> 	ptent = pte_mkclean(ptent);
> +	ptent = pte_wrprotect(ptent);
> 	set_pte_at(mm, addr, pte, ptent);
> 	if (PageActive(page))
> 		deactivate_page(page);
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Shaohua Li @ 2017-01-24 0:01 UTC
To: Andrew Morton
Cc: linux-mm, Kernel-team, Minchan Kim, Hugh Dickins, Rik van Riel, stable

On Mon, Jan 23, 2017 at 03:28:14PM -0800, Andrew Morton wrote:
> On Mon, 23 Jan 2017 15:15:52 -0800 Shaohua Li <shli@fb.com> wrote:
>
> > The page reclaim code assumes that writing to a page with a clean pte
> > will trigger a page fault, because there is a window between clearing
> > the pte and flushing the TLB in which a new write could come. If the
> > new write doesn't trigger a page fault, page reclaim will not notice
> > it, will think the page is clean, and will reclaim it. MADV_FREE pages
> > don't comply with this rule: the pte is only cleaned, without being
> > write-protected, so there will be no page fault for a new write. This
> > will cause data corruption.
>
> I'd like to see here a complete description of the bug's effects: what
> sort of workload will trigger it, what the end-user visible effects
> are, etc.

I don't have a real workload that triggers this; it's from code study,
sorry. I think a workload like this could trigger the bug:

madvise(MADV_FREE) /* memory range */
write to the memory range
read from the memory range

With memory pressure, the data read back by the application could be
all zeroes instead of what was written.

> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > 			tlb->fullmm);
> > 	orig_pmd = pmd_mkold(orig_pmd);
> > 	orig_pmd = pmd_mkclean(orig_pmd);
> > +	orig_pmd = pmd_wrprotect(orig_pmd);
>
> Is this the right way round? There's still a window where we won't get
> that write fault on the cleaned pte. Should the pmd_wrprotect() happen
> before the pmd_mkclean()?

This doesn't matter: we haven't set the pmd value in the page table yet.

Thanks,
Shaohua

> > 	set_pmd_at(mm, addr, pmd, orig_pmd);
> > 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0e3828e..bfb6800 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >
> > 	ptent = pte_mkold(ptent);
> > 	ptent = pte_mkclean(ptent);
> > +	ptent = pte_wrprotect(ptent);
> > 	set_pte_at(mm, addr, pte, ptent);
> > 	if (PageActive(page))
> > 		deactivate_page(page);
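
As a side note, one crude way to generate the memory pressure mentioned in the
message above while running the workload is sketched below. This is an
assumption on my part, not something suggested in the thread; a memory cgroup
limit or a small VM would be a more controlled setup. The chunk size is
arbitrary, the balloon is sized at roughly the machine's RAM, and the program
can provoke the OOM killer, so it is best run in a throwaway VM.

#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Balloon sized at roughly the machine's RAM; adjust to taste. */
	size_t total = (size_t)sysconf(_SC_PHYS_PAGES) * (size_t)sysconf(_SC_PAGE_SIZE);
	size_t chunk = 64UL << 20;                     /* 64 MB pieces, arbitrary */
	size_t n = total / chunk;
	char **piece = calloc(n, sizeof(*piece));

	if (!piece)
		return 1;

	for (;;) {
		for (size_t i = 0; i < n; i++) {
			if (!piece[i])
				piece[i] = malloc(chunk);      /* may fail under pressure; retry later */
			if (piece[i])
				memset(piece[i], 0x5a, chunk); /* keep the pages dirty so reclaim must work */
		}
	}
}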
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Minchan Kim @ 2017-01-24 2:32 UTC
To: Shaohua Li
Cc: linux-mm, Kernel-team, Andrew Morton, Hugh Dickins, Rik van Riel, stable

Hi Shaohua,

On Mon, Jan 23, 2017 at 03:15:52PM -0800, Shaohua Li wrote:
> The page reclaim code assumes that writing to a page with a clean pte
> will trigger a page fault, because there is a window between clearing
> the pte and flushing the TLB in which a new write could come. If the
> new write doesn't trigger a page fault, page reclaim will not notice
> it, will think the page is clean, and will reclaim it. MADV_FREE pages
> don't comply with this rule: the pte is only cleaned, without being
> write-protected, so there will be no page fault for a new write. This
> will cause data corruption.

It's hard to understand. Could you show me the exact scenario sequence
you have in mind?

Thanks.

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: stable@kernel.org
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  mm/huge_memory.c | 1 +
>  mm/madvise.c     | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9a6bd6c..9cc5de5 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> 			tlb->fullmm);
> 	orig_pmd = pmd_mkold(orig_pmd);
> 	orig_pmd = pmd_mkclean(orig_pmd);
> +	orig_pmd = pmd_wrprotect(orig_pmd);
>
> 	set_pmd_at(mm, addr, pmd, orig_pmd);
> 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0e3828e..bfb6800 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>
> 	ptent = pte_mkold(ptent);
> 	ptent = pte_mkclean(ptent);
> +	ptent = pte_wrprotect(ptent);
> 	set_pte_at(mm, addr, pte, ptent);
> 	if (PageActive(page))
> 		deactivate_page(page);
> --
> 2.9.3
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Shaohua Li @ 2017-01-25 17:15 UTC
To: Minchan Kim
Cc: Shaohua Li, linux-mm, Kernel-team, Andrew Morton, Hugh Dickins, Rik van Riel

On Tue, Jan 24, 2017 at 11:32:12AM +0900, Minchan Kim wrote:
> Hi Shaohua,
>
> On Mon, Jan 23, 2017 at 03:15:52PM -0800, Shaohua Li wrote:
> > The page reclaim code assumes that writing to a page with a clean pte
> > will trigger a page fault, because there is a window between clearing
> > the pte and flushing the TLB in which a new write could come. If the
> > new write doesn't trigger a page fault, page reclaim will not notice
> > it, will think the page is clean, and will reclaim it. MADV_FREE pages
> > don't comply with this rule: the pte is only cleaned, without being
> > write-protected, so there will be no page fault for a new write. This
> > will cause data corruption.
>
> It's hard to understand. Could you show me the exact scenario sequence
> you have in mind?

Sorry for the delay; for some reason I didn't receive the mail.

In try_to_unmap_one:

CPU 1:                                            CPU 2:
1. pteval = ptep_get_and_clear(mm, address, pte);
                                                  2. write to the address
3. tlb flush

Step 1 gets a clean pteval and step 2 dirties the page, but the unmap
misses the dirty bit and discards the page without pageout. Step 2
doesn't trigger a page fault, because the TLB cache still has the pte
entry. The deferred flush actually makes the window bigger. There are
comments about this in try_to_unmap_one too.

Thanks,
Shaohua

> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: stable@kernel.org
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  mm/huge_memory.c | 1 +
> >  mm/madvise.c     | 1 +
> >  2 files changed, 2 insertions(+)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 9a6bd6c..9cc5de5 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > 			tlb->fullmm);
> > 	orig_pmd = pmd_mkold(orig_pmd);
> > 	orig_pmd = pmd_mkclean(orig_pmd);
> > +	orig_pmd = pmd_wrprotect(orig_pmd);
> >
> > 	set_pmd_at(mm, addr, pmd, orig_pmd);
> > 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0e3828e..bfb6800 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >
> > 	ptent = pte_mkold(ptent);
> > 	ptent = pte_mkclean(ptent);
> > +	ptent = pte_wrprotect(ptent);
> > 	set_pte_at(mm, addr, pte, ptent);
> > 	if (PageActive(page))
> > 		deactivate_page(page);
> > --
> > 2.9.3
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Minchan Kim @ 2017-01-25 23:09 UTC
To: Shaohua Li
Cc: Shaohua Li, linux-mm, Kernel-team, Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman

Hello,

On Wed, Jan 25, 2017 at 09:15:19AM -0800, Shaohua Li wrote:
> On Tue, Jan 24, 2017 at 11:32:12AM +0900, Minchan Kim wrote:
> > Hi Shaohua,
> >
> > On Mon, Jan 23, 2017 at 03:15:52PM -0800, Shaohua Li wrote:
> > > The page reclaim code assumes that writing to a page with a clean pte
> > > will trigger a page fault, because there is a window between clearing
> > > the pte and flushing the TLB in which a new write could come. If the
> > > new write doesn't trigger a page fault, page reclaim will not notice
> > > it, will think the page is clean, and will reclaim it. MADV_FREE pages
> > > don't comply with this rule: the pte is only cleaned, without being
> > > write-protected, so there will be no page fault for a new write. This
> > > will cause data corruption.
> >
> > It's hard to understand. Could you show me the exact scenario sequence
> > you have in mind?
>
> Sorry for the delay; for some reason I didn't receive the mail.
>
> In try_to_unmap_one:
>
> CPU 1:                                            CPU 2:
> 1. pteval = ptep_get_and_clear(mm, address, pte);
>                                                   2. write to the address
> 3. tlb flush
>
> Step 1 gets a clean pteval and step 2 dirties the page, but the unmap
> misses the dirty bit and discards the page without pageout. Step 2
> doesn't trigger a page fault,

I thought about that when Mel introduced the deferred flush and
concluded it should be no problem, from these discussions:

1. https://lkml.org/lkml/2015/4/15/565
2. https://lkml.org/lkml/2015/4/16/136

So, shouldn't it make a trap?

Ccing Mel.

> because the TLB cache still has the pte entry. The deferred flush
> actually makes the window bigger. There are comments about this in
> try_to_unmap_one too.
>
> Thanks,
> Shaohua
>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Cc: stable@kernel.org
> > > Signed-off-by: Shaohua Li <shli@fb.com>
> > > ---
> > >  mm/huge_memory.c | 1 +
> > >  mm/madvise.c     | 1 +
> > >  2 files changed, 2 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 9a6bd6c..9cc5de5 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1381,6 +1381,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > 			tlb->fullmm);
> > > 	orig_pmd = pmd_mkold(orig_pmd);
> > > 	orig_pmd = pmd_mkclean(orig_pmd);
> > > +	orig_pmd = pmd_wrprotect(orig_pmd);
> > >
> > > 	set_pmd_at(mm, addr, pmd, orig_pmd);
> > > 	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 0e3828e..bfb6800 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -373,6 +373,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > >
> > > 	ptent = pte_mkold(ptent);
> > > 	ptent = pte_mkclean(ptent);
> > > +	ptent = pte_wrprotect(ptent);
> > > 	set_pte_at(mm, addr, pte, ptent);
> > > 	if (PageActive(page))
> > > 		deactivate_page(page);
> > > --
> > > 2.9.3
* Re: [PATCH] mm: write protect MADV_FREE pages
From: Shaohua Li @ 2017-01-26 1:50 UTC
To: Minchan Kim
Cc: linux-mm, Andrew Morton, Hugh Dickins, Rik van Riel, Mel Gorman

On Thu, Jan 26, 2017 at 08:09:09AM +0900, Minchan Kim wrote:
> Hello,
>
> On Wed, Jan 25, 2017 at 09:15:19AM -0800, Shaohua Li wrote:
> > On Tue, Jan 24, 2017 at 11:32:12AM +0900, Minchan Kim wrote:
> > > Hi Shaohua,
> > >
> > > On Mon, Jan 23, 2017 at 03:15:52PM -0800, Shaohua Li wrote:
> > > > The page reclaim code assumes that writing to a page with a clean pte
> > > > will trigger a page fault, because there is a window between clearing
> > > > the pte and flushing the TLB in which a new write could come. If the
> > > > new write doesn't trigger a page fault, page reclaim will not notice
> > > > it, will think the page is clean, and will reclaim it. MADV_FREE pages
> > > > don't comply with this rule: the pte is only cleaned, without being
> > > > write-protected, so there will be no page fault for a new write. This
> > > > will cause data corruption.
> > >
> > > It's hard to understand. Could you show me the exact scenario sequence
> > > you have in mind?
> >
> > Sorry for the delay; for some reason I didn't receive the mail.
> >
> > In try_to_unmap_one:
> >
> > CPU 1:                                            CPU 2:
> > 1. pteval = ptep_get_and_clear(mm, address, pte);
> >                                                   2. write to the address
> > 3. tlb flush
> >
> > Step 1 gets a clean pteval and step 2 dirties the page, but the unmap
> > misses the dirty bit and discards the page without pageout. Step 2
> > doesn't trigger a page fault,
>
> I thought about that when Mel introduced the deferred flush and
> concluded it should be no problem, from these discussions:
>
> 1. https://lkml.org/lkml/2015/4/15/565
> 2. https://lkml.org/lkml/2015/4/16/136
>
> So, shouldn't it make a trap?
>
> Ccing Mel.

Ah, I didn't realize the CPU will refetch and trigger the fault. Thanks
for the clarification.

Thanks,
Shaohua