Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

From: Mel Gorman <mel@csn.ul.ie>
To: pacman@kosh.dhis.org
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Yinghai Lu <yinghai@kernel.org>,
	linux-kernel@vger.kernel.org
Subject: Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55
Date: Mon, 11 Oct 2010 15:30:22 +0100
Message-ID: <20101011143022.GD30667@csn.ul.ie>
In-Reply-To: <20101009095718.1775.qmail@kosh.dhis.org>

On Sat, Oct 09, 2010 at 04:57:18AM -0500, pacman@kosh.dhis.org wrote:
> (What a big Cc: list... scripts/get_maintainer.pl made me do it.)
> 
> This will be a long story with a weak conclusion, sorry about that, but it's
> been a long bug-hunt.
> 
> With recent kernels I've seen a bug that appears to corrupt random 4-byte
> chunks of memory. It's not easy to reproduce. It seems to happen only once
> per boot, pretty quickly after userspace has gotten started, and sometimes it
> doesn't happen at all.
> 

A corruption of 4 bytes could be consistent with a pointer value being
written to an incorrect location.

> Symptoms that I have seen multiple times include:
> #1. Oops during modprobe usbcore (in apply_relocate_add)
> #2. (more frequent than #1) e2fsck dies of SIGSEGV or SIGILL
> 
> I gdb'ed one of the e2fsck crashes and found that the SIGILL was indeed an
> illegal instruction. A single instruction had been replaced by 4 seemingly
> random bytes which did not form a valid instruction. So I began doing an md5
> check of e2fsck and its dependent libs on every boot.
> 
> This made detection easier, as I found that about 50% of the time, booting a
> bad kernel would cause an md5 mismatch in /lib/libe2p.so.2.3. None of this
> corruption was actually present on disk. I was always able to boot my old
> known-good kernel and md5 all the suspect files, and they were always fine.
> 
> Using that test procedure, all the bad kernels showed the symptom on the
> second boot, and all the good kernels had 6 consecutive boots without any
> trouble. The git bisect ended here:
> 
>   commit 6dda9d55bf545013597724bf0cd79d01bd2bd944
>   Author: Corrado Zoccolo <czoccolo@gmail.com>
> 
>       page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists
> 
>    mm/page_alloc.c |   30 +++++++++++++++++++++++++-----
>    1 files changed, 25 insertions(+), 5 deletions(-)
> 
> which is way back before 2.6.35-rc1.
> 
> Since this is code that has obviously been tested by a lot of people and
> hasn't hurt most of them, I figure it must be very sensitive to hardware
> and/or kernel config options. I also considered the possibility of a compiler
> bug. Most of my testing was done with gcc 4.3.2, but I also tried 4.4.2 and
> that didn't make a difference.
> 
> This is all happening on Pegasos2 (32-bit PPC).
> 
> The latest kernel I've confirmed the bug on was 2.6.35.7. The bad commit
> reverts cleanly on top of 2.6.35.7, and that results in a good kernel as
> expected. (I can't test the latest Linus git tree until I solve the unrelated
> bug that has apparently killed the keyboard driver.)
> 
> Can someone familiar with the code take a fresh look at 6dda9d55 and spot a
> bug? If not, what should I try next?
> 

I think there is a slight bug, but not one that would cause corruption.

	if ((order < MAX_ORDER-1) && pfn_valid_within(page_to_pfn(buddy))) {

That looks like it can result in checking the buddy for an order-(MAX_ORDER-1)
page which is a bit bogus. Thing is, it should be harmless because there
isn't an unusual write made. In case it's some weird compiler optimisation
though, could you try this?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 502a882..5b0eb8c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -530,7 +530,7 @@ static inline void __free_one_page(struct page *page,
 	 * so it's less likely to be used soon and more likely to be merged
 	 * as a higher order page
 	 */
-	if ((order < MAX_ORDER-1) && pfn_valid_within(page_to_pfn(buddy))) {
+	if ((order < MAX_ORDER-2) && pfn_valid_within(page_to_pfn(buddy))) {
 		struct page *higher_page, *higher_buddy;
 		combined_idx = __find_combined_index(page_idx, order);
 		higher_page = page + combined_idx - page_idx;
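
To make the effect of the changed bound concrete, here is a minimal
userspace sketch, not kernel code. The index arithmetic mirrors the
buddy helpers in mm/page_alloc.c of that era; the names buddy_index(),
combined_index() and probed_order() are hypothetical, and MAX_ORDER is
assumed to be 11 (it depends on the kernel configuration).

```c
#include <assert.h>

/* Assumption: 11 is a common default; the real value is config-dependent. */
#define MAX_ORDER 11

/* Index of the buddy of the 2^order block that starts at page_idx. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
	assert(order < MAX_ORDER);
	return page_idx ^ (1UL << order);
}

/* Start index of the merged 2^(order+1) block that contains page_idx. */
static unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
	assert(order < MAX_ORDER);
	return page_idx & ~(1UL << order);
}

/*
 * Hypothetical helper: models only the "order < limit" half of the guard.
 * Returns the order at which the higher buddy would be examined when the
 * guard passes, or -1 when the guard skips the check.
 */
static int probed_order(unsigned int order, unsigned int limit)
{
	return order < limit ? (int)order + 1 : -1;
}
```

With the original bound, probed_order(MAX_ORDER-2, MAX_ORDER-1) yields
MAX_ORDER-1, which is the buddy check described above. With the patched
bound, probed_order(MAX_ORDER-2, MAX_ORDER-2) yields -1, so the highest
order ever examined becomes MAX_ORDER-2.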


