From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>, Adam Litke <agl@us.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mel@csn.ul.ie>, Eric Whitney <eric.whitney@hp.com>
Subject: Re: [BUG] 2.6.25-rc4 hang/softlockups after freeing hugepages
Date: Thu, 6 Mar 2008 09:53:11 -0800
Message-ID: <20080306175311.GA14567@us.ibm.com>
In-Reply-To: <1204824183.5294.62.camel@localhost>

On 06.03.2008 [12:23:03 -0500], Lee Schermerhorn wrote:
> Test platform:  HP Proliant DL585 server - 4 socket, dual core AMD with
> 32GB memory.
> 
> I first saw this on 25-rc2-mm1 with Mel's zonelist patches, while
> investigating the interaction of hugepages and cpusets.  Thinking that
> it might be caused by the zonelist patches, I went back to 25-rc2-mm1
> w/o the patches and saw the same thing.  It sometimes takes a while for
> the softlockups to start appearing, and I wanted to find a fairly
> minimal duplicator.  Meanwhile 25-rc3 and rc4 have come out, so I tried
> the latest upstream kernel and see the same thing.

So, does 2.6.25-rc2 show the problem? Or was it something introduced in
that -mm which has since gone upstream?

> To duplicate the problem, I need only:
> 
> + log into the platform as root in one window and:
> 
> 	echo N >/proc/sys/vm/nr_hugepages
> 	echo 0 >/proc/sys/vm/nr_hugepages
> 
> In my case, N=64.  If I look, before echoing 0, I see 16 hugepages
> allocated on each of the 4 nodes, as expected.
> 
> + then in another window, log in again.  
> 
> Sometimes it will hang during the 2nd login and I'll never see a shell
> prompt.  Other times, I make it all the way to editing a file or
> starting a kernel build.  The task in the 2nd login hangs and on the
> console I see--e.g.,
> 
> BUG: soft lockup - CPU#1 stuck for 61s! [runkbuild:3320]
> CPU 1:
> Modules linked in: sunrpc ipv6 dm_mirror dm_mod parport_pc lp parport ide_cd_mod cdrom button tg3 hpwdt serio_raw amd_rng pata_acpi libata i2c_amd756 i2c_core pcspkr mptspi mptscsih sym53c8xx scsi_transport_spi sd_mod scsi_mod mptbase ext3 jbd ehci_hcd ohci_hcd uhci_hcd
> Pid: 3320, comm: runkbuild Not tainted 2.6.25-rc4 #1
> RIP: 0010:[<ffffffff803341f5>]  [<ffffffff803341f5>] copy_page_c+0x5/0x10
> RSP: 0000:ffff8103fe56fe00  EFLAGS: 00010286
> RAX: ffff810000000000 RBX: ffff8103fe56fe68 RCX: 0000000000000200
> RDX: ffffffff805d6c00 RSI: ffff8103fdada000 RDI: ffff8103fe200000
> RBP: ffff8103fe56fe68 R08: ffffe20017fc3a68 R09: 00003ffffffff000
> R10: 0000000000000002 R11: 0000000000000246 R12: ffffe2000ff6b680
> R13: ffffe2000ff88000 R14: ffff8103fe08c160 R15: ffff8103fe08fb10
> FS:  00007f20b83996f0(0000) GS:ffff8103ff028000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff8103fe200000 CR3: 00000007fe0c7000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> 
> Call Trace:
>  [<ffffffff8027b693>] ? do_wp_page+0x103/0x570
>  [<ffffffff8027e4cf>] handle_mm_fault+0x5cf/0x7f0
>  [<ffffffff804a1cdf>] do_page_fault+0x26f/0x8d0
>  [<ffffffff8049fbd9>] error_exit+0x0/0x51
> 
> ---------------------------------------------------------------------------
> 
> This one is from starting a shell script 'runkbuild' to run parallel
> kernel builds in a loop.  Never got to start any make.  Don't know
> whether I can trust the RIP.  
> 
> I have also seen hangs in get_page_from_freelist() which make more sense
> to me.  Perhaps failure to unlock a zone lru_lock?

Hrm, interesting. Barring an obvious thinko, can you bisect it at all?
If it's in mainline for 2.6.25-rc2 to -rc3, that shouldn't take too
long.

> I've been looking through the hugepage allocation/freeing functions and
> haven't seen anything that jumps out at me.

I don't see anything obvious either. You don't get any softlockups
without first growing and shrinking the pool? How about only growing it?
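
To keep the two experiments apart (grow only vs. grow then shrink), a
small helper along these lines might be handy. It is only a sketch:
set_pool is a made-up name, and the optional second argument exists
purely so the function can be tried against a scratch file instead of
the real sysctl. Writing to the real file needs root.

```shell
# set_pool COUNT [FILE]
# Hypothetical helper: ask for COUNT hugepages, then print the value the
# kernel reports back, which can be lower than the request if the pages
# could not be allocated.
# FILE defaults to the real sysctl; pass a scratch file for a dry run.
set_pool() {
    count=$1
    file=${2:-/proc/sys/vm/nr_hugepages}
    echo "$count" > "$file" || return 1
    cat "$file"
}
```

For the grow-only case, run "set_pool 64" and then do the second login.
For the case from the report, run "set_pool 64" followed by "set_pool 0"
before the second login.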

> I took a look at the recent hugetlb patches from Adam and Nish, but none
> seemed to address this symptom.  I don't think I'm dealing with surplus
> pages here.

If /proc/sys/vm/nr_overcommit_hugepages = 0, then no, you're not.
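
A quick way to check that, sketched as two shell functions. The function
names are invented for illustration, the file arguments are only there
so the functions can be pointed at test files, and the second function
assumes the running kernel reports a HugePages_Surp line in
/proc/meminfo.

```shell
# surplus_possible [FILE]
# Hypothetical helper: succeeds (exit 0) only if the overcommit limit is
# non-zero, i.e. the kernel is allowed to allocate surplus hugepages.
surplus_possible() {
    limit_file=${1:-/proc/sys/vm/nr_overcommit_hugepages}
    limit=$(cat "$limit_file") || return 2
    [ "$limit" -gt 0 ]
}

# surplus_count [FILE]
# Hypothetical helper: prints the surplus hugepage count from a
# meminfo-style file. Assumes a "HugePages_Surp:" line is present;
# prints nothing if it is not.
surplus_count() {
    meminfo=${1:-/proc/meminfo}
    awk '/^HugePages_Surp:/ { print $2 }' "$meminfo"
}
```

If surplus_possible fails and surplus_count prints 0, surplus pages can
be ruled out for this report.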

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
