Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Nai Xia <nai.xia@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>, Nick Piggin <npiggin@suse.de>,
	"hugh@veritas.com" <hugh@veritas.com>,
	"riel@redhat.com" <riel@redhat.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"chris.mason@oracle.com" <chris.mason@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH] [13/16] HWPOISON: The high level memory error handler in  the VM v3
Date: Tue, 9 Jun 2009 14:48:55 +0800	[thread overview]
Message-ID: <20090609064855.GB5490@localhost> (raw)
In-Reply-To: <ab418ea90906080746m6d1d59d8m395ab76585575db1@mail.gmail.com>

On Mon, Jun 08, 2009 at 10:46:53PM +0800, Nai Xia wrote:
> On Mon, Jun 8, 2009 at 8:31 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Mon, Jun 08, 2009 at 07:06:12PM +0800, Nai Xia wrote:
> >> On Mon, Jun 8, 2009 at 12:02 AM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> >> > On Thu, Jun 04, 2009 at 02:25:24PM +0800, Nai Xia wrote:
> >> >> On Thu, May 28, 2009 at 10:50 PM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> >> > On Thu, May 28, 2009 at 09:45:20PM +0800, Andi Kleen wrote:
> >> >> >> On Thu, May 28, 2009 at 02:08:54PM +0200, Nick Piggin wrote:
> >> >> >
> >> >> > [snip]
> >> >> >
> >> >> >> >
> >> >> >> > BTW. I don't know if you are checking for PG_writeback often enough?
> >> >> >> > You can't remove a PG_writeback page from pagecache. The normal
> >> >> >> > pattern is lock_page(page); wait_on_page_writeback(page); which I
> >> >> >>
> >> >> >> So pages can be in writeback without being locked? I still
> >> >> >> wasn't able to find such a case (in fact unless I'm misreading
> >> >> >> the code badly the writeback bit is only used by NFS and a few
> >> >> >> obscure cases)
> >> >> >
> >> >> > Yes the writeback page is typically not locked. Only read IO requires
> >> >> > to be exclusive. Read IO is in fact page *writer*, while writeback IO
> >> >> > is page *reader* :-)
> >> >>
> >> >> Sorry for maybe somewhat a little bit off topic,
> >> >> I am trying to get a good understanding of PG_writeback & PG_locked ;)
> >> >>
> >> >> So you are saying PG_writeback & PG_locked are acting like a read/write lock?
> >> >> I notice wait_on_page_writeback(page) seems always called with page locked --
> >> >
> >> > No. Note that pages are not locked in wait_on_page_writeback_range().
> >>
> >> I see. This function seems mostly called A on the sync path,
> >> it just waits for data being synchronized to disk.
> >> No writers from the pages' POV, so no lock.
> >> I missed this case, but my argument about the role of read/write lock.
> >> seems still consistent. :)
> >
> > It's more constrained. Normal read/write locks allow concurrent readers,
> > however fsync() must wait for previous IO to finish before starting
> > its own IO.
> 
> Oh, yes, this is what I called "mixed roles". One for lock, one for
> status flag, twisted together in the same path, making the read lock
> semantics totally broken.
> 
> >
> >> >
> >> >> that is the semantics of a writer waiting to get the lock while it's
> >> >> acquired by
> >> >> some reader:The caller(e.g. truncate_inode_pages_range() A and
> >> >> invalidate_inode_pages2_range()) are the writers waiting for
> >> >> writeback readers (as you clarified ) to finish their job, right ?
> >> >
> >> > Sorry if my metaphor confused you. But they are not typical
> >> > reader/writer problems, but more about data integrities.
> >>
> >> No, you didn't :)
> >> Actually, you make me clear about the mixed roles for
> >> those bits.
> >>
> >> >
> >> > Pages have to be "not under writeback" when truncated.
> >> > Otherwise data lost is possible:
> >> >
> >> > 1) create a file with one page (page A)
> >> > 2) truncate page A that is under writeback
> >> > 3) write to file, which creates page B
> >> > 4) sync file, which sends page B to disk quickly
> >> >
> >> > Now if page B reaches disk before A, the new data will be overwritten
> >> > by truncated old data, which corrupts the file.
> >>
> >> I fully understand this scenario which you had already clarified in a
> >> previous message. :)
> >>
> >> 1. someone make index1-> page A
> >> 2. Path P1 is acting as a *reader* to a cache page at index1 by
> >> A  A  setting PG_writeback on, while at the same time as a *writer* to
> >> A  A  the corresponding file blocks.
> >> 3. Another path P2 comes in and A truncate page A, he is the writer
> >> A  A  to the same cache page.
> >> 4. Yet another path P3 comes A as the writer to the cache page
> >> A  A  A making it points to page B: index1--> page B.
> >> 5. Path P4 comes writing back the cache page(and set PG_writeback).
> >> A  A He is the reader of the cache page and the writer to the file blocks.
> >>
> >> The corrupts occur because P1 & P4 races when writing file blocks.
> >> But the _root_ of this racing is because nothing is used to serialize
> >> them on the side of writing the file blocks and above stream reading was
> >> inconsistent because of the writers(P2 & P3) to cache page at index1.
> >>
> >> Note that the "sync file" is somewhat irrelevant, even without "sync file",
> >> the racing still may exists. I know you must want to show me that this could
> >> make the corruption more easy to occur.
> >>
> >> So I think the simple logic is:
> >> 1) if you want to truncate/change the mapping from a index to a struct *page,
> >> test writeback bit because the writebacker to the file blocks is the reader
> >> of this mapping.
> >> 2) if a writebacker want to start a read of this mapping with
> >> test_set_page_writeback()
> >> or set_page_writeback(), he'd be sure this page is locked to keep out the
> >> writers to this mapping of index-->struct *page.
> >>
> >> This is really behavior of a read/write lock, right ?
> >
> > Please, that's a dangerous idea. A page can be written to at any time
> > when writeback to disk is under way. Does PG_writeback (your reader
> > lock) prevents page data writers? A NO.
> 
> I meant PG_writeback stops writers to index---->struct page mapping.

It's protected by the radix tree RCU locks. Period.

If you are referring to the reverse mapping: page->mapping is procted
by PG_lock. No one should make assumption that it won't change under
page writeback.

Thanks,
Fengguang

> I think I should make my statements more concise and the "reader/writer"
> less vague.
> 
> Here we care about the write/read operation for index---->struct page mapping.
> Not for read/write operation for the page content.
> 
> Anyone who wants to change this mapping  is a writer, he should take
> page lock.
> Anyone who wants to reference this mapping is a reader, writers should
> wait for him. And when this reader wants to get ref, he should wait for
> anyone one who is changing this mapping(e.g. page truncater).
> 
> When a path sets PG_writeback on a page, it need this index-->struct page
> mapping be 100% valid right? (otherwise may leads to corruption.)
> So writeback routines are readers of this index-->struct page mapping.
> (oh, well if we can put the other role of PG_writeback aside)
> 
> Ok,Ok, since PG_locked does mean much more than just protecting
> the per-page mapping which makes the lock abstraction even less clear.
> so indeed, forget about it.
> 
> >
> > Thanks,
> > Fengguang
> >
> >> wait_on_page_writeback_range() looks different only because "sync"
> >> operates on "struct page", it's not sensitive to index-->struct *page mapping.
> >> It does care about if pages returned by pagevec_lookup_tag() are
> >> still maintains the mapping when wait_on_page_writeback(page).
> >> Here, PG_writeback is only a status flag for "struct page" not a lock bit for
> >> index->struct *page mapping.
> >>
> >> >
> >> >> So do you think the idea is sane to group the two bits together
> >> >> to form a real read/write lock, which does not care about the _number_
> >> >> of readers ?
> >> >
> >> > We don't care number of readers here. So please forget about it.
> >> Yeah, I meant number of readers is not important.
> >>
> >> I still hold that these two bits in some way act like a _sparse_
> >> read/write lock.
> >> But I am going to drop the idea of making them a pure lock, since PG_writeback
> >> does has other meaning -- the page is being writing back: for sync
> >> path, it's only
> >> a status flag.
> >> Making a pure read/write lock definitely will lose that or at least distort it.
> >>
> >>
> >> Hoping I've made my words understandable, correct me if wrong, and
> >> many thanks for your time and patience. :-)
> >>
> >>
> >> Nai Xia
> >>
> >> >
> >> > Thanks,
> >> > Fengguang
> >> >
> >> >> > The writeback bit is _widely_ used. A test_set_page_writeback() is
> >> >> > directly used by NFS/AFS etc. But its main user is in fact
> >> >> > set_page_writeback(), which is called in 26 places.
> >> >> >
> >> >> >> > think would be safest
> >> >> >>
> >> >> >> Okay. I'll just add it after the page lock.
> >> >> >>
> >> >> >> > (then you never have to bother with the writeback bit again)
> >> >> >>
> >> >> >> Until Fengguang does something fancy with it.
> >> >> >
> >> >> > Yes I'm going to do it without wait_on_page_writeback().
> >> >> >
> >> >> > The reason truncate_inode_pages_range() has to wait on writeback page
> >> >> > is to ensure data integrity. Otherwise if there comes two events:
> >> >> > A  A  A  A truncate page A at offset X
> >> >> > A  A  A  A populate page B at offset X
> >> >> > If A and B are all writeback pages, then B can hit disk first and then
> >> >> > be overwritten by A. Which corrupts the data at offset X from user's POV.
> >> >> >
> >> >> > But for hwpoison, there are no such worries. If A is poisoned, we do
> >> >> > our best to isolate it as well as intercepting its IO. If the interception
> >> >> > fails, it will trigger another machine check before hitting the disk.
> >> >> >
> >> >> > After all, poisoned A means the data at offset X is already corrupted.
> >> >> > It doesn't matter if there comes another B page.
> >> >> >
> >> >> > Thanks,
> >> >> > Fengguang
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at A http://vger.kernel.org/majordomo-info.html
> >> >> > Please read the FAQ at A http://www.tux.org/lkml/
> >> >> >
> >> >
> >

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-06-09  6:23 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-27 20:12 [PATCH] [0/16] HWPOISON: Intro Andi Kleen
2009-05-27 20:12 ` [PATCH] [1/16] HWPOISON: Add page flag for poisoned pages Andi Kleen
2009-05-27 20:35   ` Larry H.
2009-05-27 21:15   ` Alan Cox
2009-05-28  7:54     ` Andi Kleen
2009-05-29 16:10       ` Rik van Riel
2009-05-29 16:37         ` Andi Kleen
2009-05-29 16:34           ` Rik van Riel
2009-05-29 18:24             ` Andi Kleen
2009-05-29 18:26               ` Rik van Riel
2009-05-29 18:42                 ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [2/16] HWPOISON: Export poison flag in /proc/kpageflags Andi Kleen
2009-05-29 16:37   ` Rik van Riel
2009-05-27 20:12 ` [PATCH] [3/16] HWPOISON: Export some rmap vma locking to outside world Andi Kleen
2009-05-27 20:12 ` [PATCH] [4/16] HWPOISON: Add support for poison swap entries v2 Andi Kleen
2009-05-28  8:46   ` Hidehiro Kawai
2009-05-28  9:11     ` Wu Fengguang
2009-05-28 10:42     ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [5/16] HWPOISON: Add new SIGBUS error codes for hardware poison signals Andi Kleen
2009-05-27 20:12 ` [PATCH] [6/16] HWPOISON: Add basic support for poisoned pages in fault handler v2 Andi Kleen
2009-05-29  4:15   ` Hidehiro Kawai
2009-05-29  6:28     ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [7/16] HWPOISON: Add various poison checks in mm/memory.c Andi Kleen
2009-05-27 20:12 ` [PATCH] [8/16] HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler Andi Kleen
2009-05-27 20:12 ` [PATCH] [9/16] HWPOISON: Use bitmask/action code for try_to_unmap behaviour Andi Kleen
2009-05-28  7:27   ` Nick Piggin
2009-05-28  8:03     ` Andi Kleen
2009-05-28  8:28       ` Nick Piggin
2009-05-28  9:02         ` Andi Kleen
2009-05-28 12:26           ` Nick Piggin
2009-05-27 20:12 ` [PATCH] [10/16] HWPOISON: Handle hardware poisoned pages in try_to_unmap Andi Kleen
2009-05-27 20:12 ` [PATCH] [11/16] HWPOISON: Handle poisoned pages in set_page_dirty() Andi Kleen
2009-05-27 20:12 ` [PATCH] [12/16] HWPOISON: check and isolate corrupted free pages Andi Kleen
2009-05-27 20:12 ` [PATCH] [13/16] HWPOISON: The high level memory error handler in the VM v3 Andi Kleen
2009-05-28  8:26   ` Nick Piggin
2009-05-28  9:31     ` Andi Kleen
2009-05-28 12:08       ` Nick Piggin
2009-05-28 13:45         ` Andi Kleen
2009-05-28 14:50           ` Wu Fengguang
2009-06-04  6:25             ` Nai Xia
2009-06-07 16:02               ` Wu Fengguang
2009-06-08 11:06                 ` Nai Xia
2009-06-08 12:31                   ` Wu Fengguang
2009-06-08 14:46                     ` Nai Xia
2009-06-09  6:48                       ` Wu Fengguang [this message]
2009-06-09 10:48                         ` Nick Piggin
2009-06-09 12:15                           ` Wu Fengguang
2009-06-09 12:17                             ` Nick Piggin
2009-06-09 12:47                               ` Wu Fengguang
2009-06-09 13:36                                 ` Nai Xia
2009-05-28 16:56           ` Russ Anderson
2009-05-30  6:42             ` Andi Kleen
2009-06-01 11:39               ` Nick Piggin
2009-06-01 18:19                 ` Andi Kleen
2009-06-01 12:05           ` Nick Piggin
2009-06-01 18:51             ` Andi Kleen
2009-06-02 12:10               ` Nick Piggin
2009-06-02 12:34                 ` Andi Kleen
2009-06-02 12:37                   ` Nick Piggin
2009-06-02 12:55                     ` Andi Kleen
2009-06-02 13:03                       ` Nick Piggin
2009-06-02 13:20                         ` Andi Kleen
2009-06-02 13:19                           ` Nick Piggin
2009-06-02 13:46                             ` Andi Kleen
2009-06-02 13:47                               ` Nick Piggin
2009-06-02 14:05                                 ` Andi Kleen
2009-06-02 13:30                     ` Wu Fengguang
2009-06-02 14:07                       ` Nick Piggin
2009-05-28  9:59     ` Wu Fengguang
2009-05-28 10:11       ` Andi Kleen
2009-05-28 10:33         ` Wu Fengguang
2009-05-28 10:51           ` Andi Kleen
2009-05-28 11:03             ` Wu Fengguang
2009-05-28 12:15             ` Nick Piggin
2009-05-28 13:48               ` Andi Kleen
2009-05-28 12:23       ` Nick Piggin
2009-05-28 13:54         ` Wu Fengguang
2009-06-01 11:50           ` Nick Piggin
2009-06-01 14:05             ` Wu Fengguang
2009-06-01 14:40               ` Nick Piggin
2009-06-02 11:14                 ` Wu Fengguang
2009-06-02 12:19                   ` Nick Piggin
2009-06-02 12:51                     ` Wu Fengguang
2009-06-02 14:33                       ` Nick Piggin
2009-06-03 10:21                       ` Jens Axboe
2009-06-01 21:11               ` Hugh Dickins
2009-06-01 21:41                 ` Andi Kleen
2009-06-01 18:32             ` Andi Kleen
2009-06-02 12:00               ` Nick Piggin
2009-06-02 12:47                 ` Andi Kleen
2009-06-02 12:57                   ` Nick Piggin
2009-06-02 13:25                     ` Andi Kleen
2009-06-02 13:24                       ` Nick Piggin
2009-06-02 13:41                         ` Andi Kleen
2009-06-02 13:40                           ` Nick Piggin
2009-06-02 13:53                           ` Wu Fengguang
2009-06-02 14:06                             ` Andi Kleen
2009-06-02 14:12                               ` Wu Fengguang
2009-06-02 14:21                                 ` Nick Piggin
2009-06-02 13:46                     ` Wu Fengguang
2009-06-02 14:08                       ` Andi Kleen
2009-06-02 14:10                         ` Wu Fengguang
2009-06-02 14:14                           ` Nick Piggin
2009-06-02 15:17                       ` Nick Piggin
2009-06-02 17:27                         ` Andi Kleen
2009-06-03  9:35                           ` Nick Piggin
2009-06-03 11:24                             ` Andi Kleen
2009-06-02 13:02                   ` Wu Fengguang
2009-06-02 15:09                   ` Nick Piggin
2009-06-02 17:19                     ` Andi Kleen
2009-06-03  6:24                       ` Nick Piggin
2009-06-03 15:51               ` Wu Fengguang
2009-06-03 16:05                 ` Andi Kleen
2009-05-27 20:12 ` [PATCH] [14/16] HWPOISON: FOR TESTING: Enable memory failure code unconditionally Andi Kleen
2009-05-27 20:12 ` [PATCH] [15/16] HWPOISON: Add madvise() based injector for hardware poisoned pages v3 Andi Kleen
2009-05-27 20:12 ` [PATCH] [16/16] HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs Andi Kleen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090609064855.GB5490@localhost \
    --to=fengguang.wu@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=andi@firstfloor.org \
    --cc=chris.mason@oracle.com \
    --cc=hugh@veritas.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nai.xia@gmail.com \
    --cc=npiggin@suse.de \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).