linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Simon Jeons <simon.jeons@gmail.com>
To: Mitsuhiro Tanino <mitsuhiro.tanino.gm@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable
Date: Thu, 11 Apr 2013 11:53:31 +0800	[thread overview]
Message-ID: <516633BB.40307@gmail.com> (raw)
In-Reply-To: <51662D5B.3050001@hitachi.com>

Hi Mitsuhiro,
On 04/11/2013 11:26 AM, Mitsuhiro Tanino wrote:
> Hi All,
> Please find a patch set that introduces these new sysctl interfaces,
> to handle a case when an memory error is detected on dirty page cache.
>
> - vm.memory_failure_dirty_panic
> - vm.memory_failure_print_ratelimit
> - vm.memory_failure_print_ratelimit_burst
>
> Problem
> ---------
> Recently, it is common that enterprise servers likely have a large
> amount of memory, especially for cloud environment. This means that
> possibility of memory failures is increased.
>
> To handle memory failure, Linux has a hwpoison feature. When a memory
> error is detected by memory scrub, the error is reported as machine
> check, uncorrected recoverable (UCR), to OS. Then OS isolates the memory
> region with memory failure if the memory page can be isolated.
> The hwpoison handles it according to the memory region, such as kernel,
> dirty cache, clean cache. If the memory region can be isolated, the
> page is marked "hwpoison" and it is not used again.
>
> When SRAO machine check is reported on a page which is included dirty
> page cache, the page is truncated because the memory is corrupted and
> data of the page cannot be written to a disk any more.
>
> As a result, if the dirty cache includes user data, the data is lost,
> and data corruption occurs if an application uses old data.

One question against mce instead of the patchset. ;-)

When check memory is bad? Before memory access? Is there a process scan 
it period?

>
>
> Solution
> ---------
> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
> in order to prevent data corruption comes from data lost problem.
> Also this patch displays information of affected file such as device name,
> inode number, file offset and file type if the file is mapped on a memory
> and the page is dirty cache.
>
> When SRAO machine check occurs on a dirty page cache, corresponding
> data cannot be recovered any more. Therefore, the patch proposes a kernel
> option to keep a system running or force system panic in order
> to avoid further trouble such as data corruption problem of application.
>
> System administrator can select an error action using this option
> according to characteristics of target system.
>
>
>
> Use Case
> ---------
> This option is intended to be adopted in KVM guest because it is
> supposed that Linux on KVM guest operates customers business and
> it is big impact to lost or corrupt customers data by memory failure.
>
> On the other hand, this option does not recommend to apply KVM host
> as following reasons.
>
> - Making KVM host panic has a big impact because all virtual guests are
>    affected by their host panic. Affected virtual guests are forced to stop
>    and have to be restarted on the other hypervisor.
>
> - If disk cached model of qemu is set to "none", I/O type of virtual
>    guests becomes O_DIRECT and KVM host does not cache guest's disk I/O.
>    Therefore, if SRAO machine check is reported on a dirty page cache
>    in KVM host, its virtual machines are not affected by the machine check.
>    So the host is expected to keep operating instead of kernel panic.
>
>
> Past discussion
> --------------------
> This problem was previously discussed in the kernel community,
> (refer: mail threads pertaining to
> http://marc.info/?l=linux-kernel&m=135187403804934&w=4).
>
>>> - I worry that if a hardware error occurs, it might affect a large
>>>    amount of memory all at the same time.  For example, if a 4G memory
>>>    block goes bad, this message will be printed a million times?
>
> As Andrew mentioned in the above threads, if 4GB memory blocks goes bad,
> error messages will be printed a million times and this behavior loses
> a system reliability.
>
> Therefore, the second patch introduces two sysctl parameters for
> __ratelimit() which is used at mce_notify_irq() in order to notify
> occurrence of machine check event to system administrator.
> The use of __ratelimit(), this patch can limit quantity of messages
> per interval to be output at syslog or terminal console.
>
> If system administrator needs to limit quantity of messages,
> these parameters are available.
>
> - vm.memory_failure_print_ratelimit:
>    Specifies the minimum length of time between messages.
>    By default the rate limiting is disabled.
>
> - vm.memory_failure_print_ratelimit_burst:
>    Specifies the number of messages we can send before rate limiting.
>
>
>
> Test Results
> ---------
> These patches are tested on 3.8.1 kernel(FC18) using software pseudo MCE
> injection from KVM host to guest.
>
>
> ******** Host OS Screen logs(SRAO Machine Check injection) ********
> Inject software pseudo MCE into guest qemu process.
>
> (1) Load mce-inject module
> # modprobe mce-inject
>
> (2) Find a PID of target qemu-kvm and page struct
> # ps -C qemu-kvm -o pid=
>   8176
>
> (3) Edit software pseudo MCE data
> Choose a offset of page struct and insert the offset to ADDR line in mce-file.
>
> #  ./page-types -p 8176 -LN | grep "___UDlA____Ma_b___________________"
> voffset         offset  flags
> ...
> 7fd25eb77       344d77  ___UDlA____Ma_b___________________
> 7fd25eb78       344d78  ___UDlA____Ma_b___________________
> 7fd25eb79       344d79  ___UDlA____Ma_b___________________
> 7fd25eb7a       344d7a  ___UDlA____Ma_b___________________
> 7fd25eb7b       344d7b  ___UDlA____Ma_b___________________
> 7fd25eb7c       344d7c  ___UDlA____Ma_b___________________
> 7fd25eb7d       344d7d  ___UDlA____Ma_b___________________
> ...
>
> # vi mce-file
> CPU 0 BANK 2
> STATUS UNCORRECTED SRAO 0x17a
> MCGSTATUS MCIP RIPV
> MISC 0x8c
> ADDR 0x344d77000
> EOF
>
> (4) Inject MCE
> # mce-inject mce-file
>
> Try step (3) to (4) a couple of times
>
>
>
>
> ******** Guest OS Screen logs(kdump) ********
> Receive MCE from KVM host
>
> (1) Set vm.memory_failure_dirty_panic parameter to 1.
>
> (2) When guest catches MCE injection from qemu and
>      MCE hit dirty page cache, hwpoison dirty cache handler
>      displays information of affected file such as Device name,
>      Inode Number, Offset and File Type.
>      And then, system goes a panic.
>
> ex.
> -------------
> [root@host /]# sysctl -a | grep memory_failure
> vm.memory_failure_dirty_panic = 1
> vm.memory_failure_early_kill = 0
> vm.memory_failure_recovery = 1
> [root@host /]#
> [  517.975220] MCE 0x326e6: clean LRU page recovery: Recovered
> [  521.969218] MCE 0x34df8: clean LRU page recovery: Recovered
> [  525.769171] MCE 0x37509: corrupted page was clean: dropped without side effects
> [  525.771070] MCE 0x37509: clean LRU page recovery: Recovered
> [  529.969246] MCE 0x39c18: File was corrupted: Dev:vda3 Inode:808998 Offset:6561
> [  529.969995] Kernel panic - not syncing: MCE 0x39c18: Force a panic because of dirty page cache was corrupted : File type:0x81a4
> [  529.969995]
> [  529.970055] Pid: 245, comm: kworker/0:2 Tainted: G   M         3.8.1 #22
> [  529.970055] Call Trace:
> [  529.970055]  [<ffffffff81645d1e>] panic+0xc1/0x1d0
> [  529.970055]  [<ffffffff811991fa>] me_pagecache_dirty+0xda/0x1a0
> [  529.970055]  [<ffffffff8119a0ab>] memory_failure+0x4eb/0xca0
> [  529.970055]  [<ffffffff8102f34d>] mce_process_work+0x3d/0x60
> [  529.970055]  [<ffffffff8107a5f7>] process_one_work+0x147/0x490
> [  529.970055]  [<ffffffff8102f310>] ? mce_schedule_work+0x50/0x50
> [  529.970055]  [<ffffffff8107ce8e>] worker_thread+0x15e/0x450
> [  529.970055]  [<ffffffff8107cd30>] ? busy_worker_rebind_fn+0x110/0x110
> [  529.970055]  [<ffffffff81081f50>] kthread+0xc0/0xd0
> [  529.970055]  [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_entry+0xa0/0xf0
> [  529.970055]  [<ffffffff81081e90>] ? kthread_create_on_node+0x120/0x120
> [  529.970055]  [<ffffffff81657cec>] ret_from_fork+0x7c/0xb0
> [  529.970055]  [<ffffffff81081e90>] ? kthread_create_on_node+0x120/0x120
> -------------
>
> (3) Case of a number of MCE occurs
> If a number of MCE occurs during a minute, error messages are suppressed to output
> by __ratelimit() with following message.
>
> [  414.815303] me_pagecache_dirty: 3 callbacks suppressed
>
> ex.
> -------------
> [root@host /]# sysctl -a | grep memory_failure
> vm.memory_failure_dirty_panic = 0
> vm.memory_failure_early_kill = 0
> vm.memory_failure_print_ratelimit = 30
> vm.memory_failure_print_ratelimit_burst = 2
> vm.memory_failure_recovery = 1
>
> [root@host /]#
>
> [  181.565534] MCE 0xc38c: File was corrupted: Dev:vda3 Inode:808998 Offset:9566
> [  181.566310] MCE 0xc38c: dirty LRU page recovery: Recovered
> [  183.525425] MCE 0xc45a: Unknown page state
> [  183.527225] MCE 0xc45a: unknown page state page recovery: Failed
> [  183.527907] MCE 0xc45a: unknown page state page still referenced by -1 users
> [  185.000329] MCE 0xc524: dirty LRU page recovery: Recovered
> [  186.065231] MCE 0xc5ef: dirty LRU page recovery: Recovered
> [  188.054096] MCE 0xc6ba: clean LRU page recovery: Recovered
> [  189.565275] MCE 0xc783: clean LRU page recovery: Recovered
> [  191.692628] MCE 0xc84c: clean LRU page recovery: Recovered
> [  193.000257] MCE 0xc91d: File was corrupted: Dev:vda3 Inode:808998 Offset:6201
> [  193.001222] MCE 0xc91d: dirty LRU page recovery: Recovered
> [  194.065314] MCE 0xc9e6: dirty LRU page recovery: Recovered
> [  195.711211] MCE 0xcaaf: clean LRU page recovery: Recovered
> [  197.565339] MCE 0xcb78: dirty LRU page recovery: Recovered
> [  200.054177] MCE 0xcc41: dirty LRU page recovery: Recovered
> [  201.000272] MCE 0xcd0a: clean LRU page recovery: Recovered
> [  204.054109] MCE 0xcdd3: clean LRU page recovery: Recovered
> [  205.283189] MCE 0xcf65: clean LRU page recovery: Recovered
> [  207.110339] MCE 0xd02f: Unknown page state
> [  207.110787] MCE 0xd02f: unknown page state page recovery: Failed
> [  207.111427] MCE 0xd02f: unknown page state page still referenced by -1 users
> [  209.000134] MCE 0xd0f9: dirty LRU page recovery: Recovered
> [  210.106360] MCE 0xd1c5: dirty LRU page recovery: Recovered
> [  211.796333] me_pagecache_dirty: 3 callbacks suppressed
> [  211.796961] MCE 0xd296: File was corrupted: Dev:vda3 Inode:808998 Offset:9320
> [  211.798091] MCE 0xd296: dirty LRU page recovery: Recovered
> [  213.565288] MCE 0xd35f: clean LRU page recovery: Recovered
> -------------
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-04-11  3:53 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-11  3:26 [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable Mitsuhiro Tanino
2013-04-11  3:53 ` Simon Jeons [this message]
2013-04-11 12:51   ` Mitsuhiro Tanino
2013-04-11 13:00     ` Ric Mason
2013-04-12 13:43       ` Mitsuhiro Tanino
2013-04-17  5:49         ` Simon Jeons
2013-04-11  7:11 ` Naoya Horiguchi
2013-04-12 13:24   ` Mitsuhiro Tanino
2013-04-12 14:45     ` Naoya Horiguchi
2013-04-17  7:14   ` Simon Jeons
2013-04-17 14:55     ` Naoya Horiguchi
2013-04-18  0:27       ` Simon Jeons
2013-04-11 13:49 ` Andi Kleen
2013-04-11 15:23   ` Naoya Horiguchi
2013-04-11 18:10     ` Andi Kleen
2013-04-12 13:38       ` Mitsuhiro Tanino
2013-04-12 15:13         ` Naoya Horiguchi
2013-04-17 13:58           ` Naoya Horiguchi
2013-04-17  6:42     ` Simon Jeons
2013-04-17 14:16       ` Naoya Horiguchi
2013-04-17  5:30   ` Simon Jeons
2013-04-11 15:15 ` KOSAKI Motohiro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=516633BB.40307@gmail.com \
    --to=simon.jeons@gmail.com \
    --cc=andi@firstfloor.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mitsuhiro.tanino.gm@hitachi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).