public inbox for linux-omap@vger.kernel.org
From: Tony Lindgren <tony@atomide.com>
To: CF Adad <cfadad@rocketmail.com>
Cc: "linux-omap@vger.kernel.org" <linux-omap@vger.kernel.org>
Subject: Re: Please help!  AM35xx mm/slab.c BUG
Date: Tue, 5 Jun 2012 00:08:53 -0700
Message-ID: <20120605070853.GE12766@atomide.com>
In-Reply-To: <1338878255.13133.YahooMailNeo@web125205.mail.ne1.yahoo.com>

* CF Adad <cfadad@rocketmail.com> [120604 23:47]:
> All,
> 
> I'm **really** hoping someone out there can help us with this.
> 
> My team has been working with the AM3517 for several months now, and we seem to be plagued every so often by what we have termed the "slab bug".  In short, it looks something like the pasted bootlog below.  This has been an *incredibly* hard bug to figure out.  We have a couple of different AM3517-based platforms at our disposal, but the one we see the issue on almost exclusively is a custom, prototype baseboard designed around the TechNexion TAM3157.  Over the last several months, we have tried several versions of Linux from the linux-omap tree, with loads of different configurations, and even different bootloader versions and combinations.  We've spent most of our time with a linux-omap snapshot that was a 3.2-rc6, and more recently a 3.4-rc6 from a week or two back.  (Tomorrow I anticipate pulling the latest 3.5 now that I see it's out.)  In all cases, since we switched to 3.0+, we've seen these errors.
> 
> They are *very* inconsistent in when they occur, but they happen often enough to be very frustrating.  Consequently, our team has had an incredibly difficult time tracking what's causing them.  They seem to occur at random, perhaps on average once every handful of days.  We've messed with everything we can think of from tweaking kernel options (like enabling/disabling preemption), to disabling various drivers and userspace components, to reviewing every single line in any of our board files.  We have tried different versions and combinations of the OS and both bootloaders (x-loader & u-boot), and even went so far as to do a full analysis of the RAM timings in the EMIF4.  Unfortunately, nothing so far has worked.  The error occurs when operating off both the SD/MMC and the NAND devices, with or without the Ethernets (LAN9221 & EMAC) up and/or running, with or without PREEMPT, under heavy load and sometimes just idling, ...  There is simply nothing consistent about it.  After probably 2 weeks without seeing one, I saw 3 today.
> 
> Though the error's occurrence is inconsistent, the error itself is not.  It always throws an internal oops at the following section of code in mm/slab.c:
> ---
> /*
> * The slab was either on partial or free list so
> * there must be at least one object available for
> * allocation.
> */
> BUG_ON(slabp->inuse >= cachep->num);
> ---
> (It appears this was patched in eons ago: https://lkml.org/lkml/2007/2/19/20.  So it's nothing new.)

I can think of at least three issues causing errors like this:

1. Missing retention/off idle workarounds

   You can test this one by booting with nohlt cmdline option and
   seeing if that helps.

2. Broken memory

   I've seen at least one case of this where things would work
   fine if only half of the memory was in use and devices would
   oops at a random point within a week. To test for this you can
   pass cmdline options to artificially partition the memory and
   leave out some chunks to see if that helps. Or boot with
   mem=xxxM set to half of the physical memory. And run your tests
   with slab debugging (CONFIG_DEBUG_SLAB) enabled.

3. Software bugs

   My experience is that things are behaving very reliably regarding
   cache and highmem, so I would check #1 and #2 first.
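
When running the tests for #1 and #2 it is easy to boot a board
without the option you meant to pass, so it can help to confirm what
the running kernel actually received. The following is only a sketch
under stated assumptions: cmdline_has_option() is a hypothetical
userspace helper, not a kernel API, and it only checks whether a
space-separated command line string (for example the contents of
/proc/cmdline) carries a given option name, either bare ("nohlt") or
with a value ("mem=...").

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of the kernel): return 1 if the
 * whitespace-separated command line contains an option called `name`,
 * either bare ("nohlt") or with a value ("mem=128M"), else 0. */
static int cmdline_has_option(const char *cmdline, const char *name)
{
    size_t len;
    const char *p = cmdline;

    assert(cmdline != NULL && name != NULL);
    len = strlen(name);
    if (len == 0)
        return 0;

    while (*p) {
        /* Skip separators in front of the next token. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Match the whole option name, not just a prefix of a
         * longer one (so "mem" does not match "memtest=4"). */
        if (strncmp(p, name, len) == 0 &&
            (p[len] == '\0' || p[len] == '=' || p[len] == ' ' ||
             p[len] == '\t' || p[len] == '\n'))
            return 1;

        /* Advance to the end of this token. */
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
    }
    return 0;
}
```

A small test program could read one line from /proc/cmdline with
fgets() and call cmdline_has_option(buf, "nohlt") and
cmdline_has_option(buf, "mem") before starting a long soak run, so
that a week-long test is not wasted on a misconfigured boot.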

Regards,

Tony 
