LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH EDACv16 1/2] edac: Change internal representation to work with layers
From: Mauro Carvalho Chehab @ 2012-04-30 13:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shaohui Xie, Jason Uhlenkott, Aristeu Rozanski, Hitoshi Mitake,
	Mark Gross, Dmitry Eremin-Solenikov, Ranganathan Desikan,
	Egor Martovetsky, Niklas Söderlund, Tim Small, Arvind R.,
	Chris Metcalf, Olof Johansson, Doug Thompson,
	Linux Edac Mailing List, Michal Marek, Jiri Kosina,
	Linux Kernel Mailing List, Joe Perches, Andrew Morton,
	linuxppc-dev
In-Reply-To: <4F9E8CFE.1040801@redhat.com>

Em 30-04-2012 10:00, Mauro Carvalho Chehab escreveu:
> Em 30-04-2012 09:38, Borislav Petkov escreveu:
>> On Mon, Apr 30, 2012 at 08:45:09AM -0300, Mauro Carvalho Chehab wrote:
>>> Em 30-04-2012 08:11, Borislav Petkov escreveu:
>>>> On Mon, Apr 30, 2012 at 07:58:33AM -0300, Mauro Carvalho Chehab wrote:
> 
>>>> This way it says "initializing 12 dimms" and the user thinks there are
>>>> 12 DIMMs on his system where this might not be true.
>>>
>>>
>>> I'm OK to remove the "initializing 12 dimms" message. It doesn't add anything
>>> new.
>>>
>>> With regards do the other messages, if the debug messages are not clear, 
>>> then let's fix them, instead of removing. What if we print, instead,
>>> on a message like:
>>>
>>> 	"row 1, chan 1 will represent dimm5 (1:2:0) if not empty"
>>
>> How about the following instead: the specific driver calls
>> edac_mc_alloc(), it gets the allocated dimm array in mci->dimms
>> _without_ dumping each dimm%d line. Then, each driver figures out which
>> subset of that dimms array actually has populated slots and prints only
>> the populated rank/slot/...
>>
>> This information is much more valuable than saying how many _possible_
>> slots the edac core has allocated.
>>
>> Then, each driver can decide whether it makes sense to dump that info or
>> not.
> 
> No, that would add extra complexity at the drivers level just due to debug
> messages. I think that the better is to move this printk to the debug-specific
> routine that is called only when the dimm is filled (edac_mc_dump_dimm).
> 
> With this cange, the message will be printed only for the filled dimms.
> 
> This is a cleanup patch, so I'll write it, together with the change that
> will get rid of the loop that uses KERN_CONT. It will use a function added
> by a latter patch at edac_mc_sysfs so it can't be merged on this patch
> anyway.

The following patch dos the debug cleanup. I'll add at the end of my tree.

Regards,
Mauro.


From: Mauro Carvalho Chehab <mchehab@redhat.com>
Date: Mon, 30 Apr 2012 10:24:43 -0300
Subject: [PATCH] edac_mc: Cleanup per-dimm_info debug messages

The edac_mc_alloc() routine allocates one dimm_info device for all
possible memories, including the non-filled ones. The debug messages
there are somewhat confusing. So, cleans them, by moving the code
that prints the memory location to edac_mc, and using it on both
edac_mc_sysfs and edac_mc.

After this patch, a dimm-based memory controller will print the debug
info as:

[  728.430828] EDAC DEBUG: edac_mc_dump_dimm: 	dimm2: channel 0 slot 2 mapped as virtual row 0, chan 2
[  728.430834] EDAC DEBUG: edac_mc_dump_dimm: 	dimm->label = 'mc#0channel#0slot#2'
[  728.430839] EDAC DEBUG: edac_mc_dump_dimm: 	dimm->nr_pages = 0x0
[  728.430846] EDAC DEBUG: edac_mc_dump_dimm: 	dimm->grain = 0
[  728.430850] EDAC DEBUG: edac_mc_dump_dimm: 	dimm->nr_pages = 0x0

(a rank-based memory controller would print, instead, "rank2"
 on the above debug info)

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index d8278b3..1bc2843 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -40,6 +40,25 @@
 static DEFINE_MUTEX(mem_ctls_mutex);
 static LIST_HEAD(mc_devices);
 
+unsigned edac_dimm_info_location(struct dimm_info *dimm, char *buf,
+			         int len)
+{
+	struct mem_ctl_info *mci = dimm->mci;
+	int i, n, count = 0;
+	char *p = buf;
+
+	for (i = 0; i < mci->n_layers; i++) {
+		n = snprintf(p, len, "%s %d ",
+			      edac_layer_name[mci->layers[i].type],
+			      dimm->location[i]);
+		p += n;
+		len -= n;
+		count += n;
+	}
+
+	return count;
+}
+
 #ifdef CONFIG_EDAC_DEBUG
 
 static void edac_mc_dump_channel(struct rank_info *chan)
@@ -50,20 +69,18 @@ static void edac_mc_dump_channel(struct rank_info *chan)
 	debugf4("\tchannel->dimm = %p\n", chan->dimm);
 }
 
-static void edac_mc_dump_dimm(struct dimm_info *dimm)
+static void edac_mc_dump_dimm(struct dimm_info *dimm, int number)
 {
-	int i;
+	char location[80];
+
+	edac_dimm_info_location(dimm, location, sizeof(location));
 
 	debugf4("\tdimm = %p\n", dimm);
+	debugf4("\t%s%i: %smapped as virtual row %d, chan %d\n",
+		dimm->mci->mem_is_per_rank ? "rank" : "dimm",
+		number, location, dimm->csrow, dimm->cschannel);
 	debugf4("\tdimm->label = '%s'\n", dimm->label);
 	debugf4("\tdimm->nr_pages = 0x%x\n", dimm->nr_pages);
-	debugf4("\tdimm location ");
-	for (i = 0; i < dimm->mci->n_layers; i++) {
-		printk(KERN_CONT "%d", dimm->location[i]);
-		if (i < dimm->mci->n_layers - 1)
-			printk(KERN_CONT ".");
-	}
-	printk(KERN_CONT "\n");
 	debugf4("\tdimm->grain = %d\n", dimm->grain);
 	debugf4("\tdimm->nr_pages = 0x%x\n", dimm->nr_pages);
 }
@@ -337,8 +354,6 @@ struct mem_ctl_info *edac_mc_alloc(unsigned edac_index,
 	memset(&pos, 0, sizeof(pos));
 	row = 0;
 	chn = 0;
-	debugf4("initializing %d %s\n", tot_dimms,
-		per_rank ? "ranks" : "dimms");
 	for (i = 0; i < tot_dimms; i++) {
 		chan = mci->csrows[row]->channels[chn];
 		off = EDAC_DIMM_OFF(layer, n_layers, pos[0], pos[1], pos[2]);
@@ -351,10 +366,6 @@ struct mem_ctl_info *edac_mc_alloc(unsigned edac_index,
 		mci->dimms[off] = dimm;
 		dimm->mci = mci;
 
-		debugf2("%d: %s%i (%d:%d:%d): row %d, chan %d\n", i,
-			per_rank ? "rank" : "dimm", off,
-			pos[0], pos[1], pos[2], row, chn);
-
 		/*
 		 * Copy DIMM location and initialize the memory location
 		 */
@@ -730,7 +741,7 @@ int edac_mc_add_mc(struct mem_ctl_info *mci)
 				edac_mc_dump_channel(mci->csrows[i]->channels[j]);
 		}
 		for (i = 0; i < mci->tot_dimms; i++)
-			edac_mc_dump_dimm(mci->dimms[i]);
+			edac_mc_dump_dimm(mci->dimms[i], i);
 	}
 #endif
 	mutex_lock(&mem_ctls_mutex);
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 8f96c49..e3e9e75 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -488,13 +488,7 @@ static ssize_t dimmdev_location_show(struct device *dev,
 	int i;
 	char *p = data;
 
-	for (i = 0; i < mci->n_layers; i++) {
-		p += sprintf(p, "%s %d ",
-			     edac_layer_name[mci->layers[i].type],
-			     dimm->location[i]);
-	}
-
-	return p - data;
+	return edac_dimm_info_location(dimm, data, PAGE_SIZE);
 }
 
 static ssize_t dimmdev_label_show(struct device *dev,
diff --git a/drivers/edac/edac_module.h b/drivers/edac/edac_module.h
index 1af1367..de92756 100644
--- a/drivers/edac/edac_module.h
+++ b/drivers/edac/edac_module.h
@@ -34,6 +34,9 @@ extern int edac_mc_get_panic_on_ue(void);
 extern int edac_get_poll_msec(void);
 extern int edac_mc_get_poll_msec(void);
 
+unsigned edac_dimm_info_location(struct dimm_info *dimm, char *buf,
+				 int len);
+
 	/* on edac_device.c */
 extern int edac_device_register_sysfs_main_kobj(
 				struct edac_device_ctl_info *edac_dev);

^ permalink raw reply related

* Re: [REGRESSION][PATCH V4 3/3] bpf jit: Let the powerpc jit handle negative offsets
From: David Miller @ 2012-04-30 17:41 UTC (permalink / raw)
  To: benh; +Cc: kaffeemonster, eric.dumazet, matt, netdev, linux-kernel,
	linuxppc-dev
In-Reply-To: <1335763568.20866.37.camel@pasglop>

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 30 Apr 2012 15:26:08 +1000

> David, what's the right way to fix that ?

There is no doubt that sock_fprog is the correct datastructure to use.

^ permalink raw reply

* Re: [REGRESSION][PATCH V5 3/3] bpf jit: Let the powerpc jit handle negative offsets
From: David Miller @ 2012-04-30 17:41 UTC (permalink / raw)
  To: kaffeemonster; +Cc: eric.dumazet, matt, netdev, linux-kernel, linuxppc-dev
In-Reply-To: <4F9E1CDB.9020104@googlemail.com>

From: Jan Seiffert <kaffeemonster@googlemail.com>
Date: Mon, 30 Apr 2012 07:02:19 +0200

> Now the helper function from filter.c for negative offsets is exported,
> it can be used it in the jit to handle negative offsets.
> 
> First modify the asm load helper functions to handle:
> - know positive offsets
> - know negative offsets
> - any offset
> 
> then the compiler can be modified to explicitly use these helper
> when appropriate.
> 
> This fixes the case of a negative X register and allows to lift
> the restriction that bpf programs with negative offsets can't
> be jited.
> 
> Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Jan Seiffert <kaffeemonster@googlemail.com>

Applied, thanks.

^ permalink raw reply

* Re: [REGRESSION][PATCH V4 3/3] bpf jit: Let the powerpc jit handle negative offsets
From: Benjamin Herrenschmidt @ 2012-04-30 21:55 UTC (permalink / raw)
  To: David Miller
  Cc: kaffeemonster, eric.dumazet, matt, netdev, linux-kernel,
	linuxppc-dev
In-Reply-To: <20120430.134140.1738751315208907289.davem@davemloft.net>

On Mon, 2012-04-30 at 13:41 -0400, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Mon, 30 Apr 2012 15:26:08 +1000
> 
> > David, what's the right way to fix that ?
> 
> There is no doubt that sock_fprog is the correct datastructure to use.

Ok, so the right fix is to email anybody who posted code using struct
bpf_program to fix their code ? :-)

My question was more along the lines of should we attempt changing one
of the two variants to make them match on BE (since they are in effect
compatible on LE), tho of course this could have the usual annoying
consequence of breaking the mangled c++ name of the symbol).

>From your reply I assume the answer is no... so that leaves us to chase
up users and fix them. Great....

Cheers,
Ben.

^ permalink raw reply

* Re: [REGRESSION][PATCH V4 3/3] bpf jit: Let the powerpc jit handle negative offsets
From: Benjamin Herrenschmidt @ 2012-04-30 21:57 UTC (permalink / raw)
  To: David Miller
  Cc: kaffeemonster, eric.dumazet, matt, netdev, linux-kernel,
	linuxppc-dev
In-Reply-To: <1335822926.20866.47.camel@pasglop>

On Tue, 2012-05-01 at 07:55 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-04-30 at 13:41 -0400, David Miller wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Date: Mon, 30 Apr 2012 15:26:08 +1000
> > 
> > > David, what's the right way to fix that ?
> > 
> > There is no doubt that sock_fprog is the correct datastructure to use.
> 
> Ok, so the right fix is to email anybody who posted code using struct
> bpf_program to fix their code ? :-)

Actually, the right fix is for anybody using pcap-bpf.h to not
use SO_ATTACH_FILTER directly but to use pcap_setfilter() which
handles the compatibility.

I'll start spamming web sites who tell people to do the wrong thing.

Cheers,
Ben.

^ permalink raw reply

* linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Hugh Dickins @ 2012-04-30 22:37 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, linux-kernel

Hi Paul,

On 3.4.0-rc4-next-20120427 and preceding linux-nexts (I've not tried
rc5-next-20120430 but expect it's the same), on PowerPC G5 quad with
CONFIG_PREEMPT=y and CONFIG_DEBUG_ATOMIC_SLEEP=y, I'm getting spurious
"BUG: sleeping function called from invalid context" messages from
__might_sleep().

Just once I saw such a message during startup.  Once I saw such a
message when rebuilding the machine's kernel.  Usually I see them
when I'm running a swapping load of kernel builds under memory
pressure (but that's what I'm habitually running there): perhaps
after a few minutes a flurry comes, then goes away, comes back
again later, and after perhaps a couple of hours of that I see
"INFO: rcu_preempt detected stalls" messages too, and soon it
freezes (or perhaps it's still running, but I'm so flooded by
messages that I reboot anyway).

Rather like from before you fixed schedule_tail() for your per-cpu
RCU mods, but not so easy to reproduce.  I did a bisection and indeed
it converged as expected on the RCU changes.  No such problem seen on
x86: it looks as if there's some further tweak required on PowerPC.

Here are my RCU config options (I don't usually have the TORTURE_TEST
in, but tried that for half an hour this morning, in the hope that it
would generate the issue: but it did not).

# RCU Subsystem
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_TREE_RCU_TRACE=y
# CONFIG_RCU_BOOST is not set
CONFIG_HAVE_RCU_TABLE_FREE=y
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_RCU_TORTURE_TEST=m
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_VERBOSE is not set
# CONFIG_RCU_CPU_STALL_INFO is not set
CONFIG_RCU_TRACE=y

Here's the message when I was rebuilding the G5's kernel:

BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
Call Trace:
[c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
[c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
[c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
[c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
[c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
[c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30

I've plenty more examples, most of them from page faults or from kswapd;
but I don't think there's any more useful information in them.

Anything I can try later on?

Thanks!
Hugh

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Paul E. McKenney @ 2012-04-30 23:14 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Paul E. McKenney, linuxppc-dev, linux-kernel
In-Reply-To: <alpine.LSU.2.00.1204301451270.2986@eggly.anvils>

On Mon, Apr 30, 2012 at 03:37:10PM -0700, Hugh Dickins wrote:
> Hi Paul,
> 
> On 3.4.0-rc4-next-20120427 and preceding linux-nexts (I've not tried
> rc5-next-20120430 but expect it's the same), on PowerPC G5 quad with
> CONFIG_PREEMPT=y and CONFIG_DEBUG_ATOMIC_SLEEP=y, I'm getting spurious
> "BUG: sleeping function called from invalid context" messages from
> __might_sleep().
> 
> Just once I saw such a message during startup.  Once I saw such a
> message when rebuilding the machine's kernel.  Usually I see them
> when I'm running a swapping load of kernel builds under memory
> pressure (but that's what I'm habitually running there): perhaps
> after a few minutes a flurry comes, then goes away, comes back
> again later, and after perhaps a couple of hours of that I see
> "INFO: rcu_preempt detected stalls" messages too, and soon it
> freezes (or perhaps it's still running, but I'm so flooded by
> messages that I reboot anyway).
> 
> Rather like from before you fixed schedule_tail() for your per-cpu
> RCU mods, but not so easy to reproduce.  I did a bisection and indeed
> it converged as expected on the RCU changes.  No such problem seen on
> x86: it looks as if there's some further tweak required on PowerPC.
> 
> Here are my RCU config options (I don't usually have the TORTURE_TEST
> in, but tried that for half an hour this morning, in the hope that it
> would generate the issue: but it did not).
> 
> # RCU Subsystem
> CONFIG_TREE_PREEMPT_RCU=y
> CONFIG_PREEMPT_RCU=y
> CONFIG_RCU_FANOUT=64
> # CONFIG_RCU_FANOUT_EXACT is not set
> CONFIG_TREE_RCU_TRACE=y
> # CONFIG_RCU_BOOST is not set
> CONFIG_HAVE_RCU_TABLE_FREE=y
> # CONFIG_SPARSE_RCU_POINTER is not set
> CONFIG_RCU_TORTURE_TEST=m
> CONFIG_RCU_CPU_STALL_TIMEOUT=60
> # CONFIG_RCU_CPU_STALL_VERBOSE is not set
> # CONFIG_RCU_CPU_STALL_INFO is not set
> CONFIG_RCU_TRACE=y
> 
> Here's the message when I was rebuilding the G5's kernel:
> 
> BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> Call Trace:
> [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> 
> I've plenty more examples, most of them from page faults or from kswapd;
> but I don't think there's any more useful information in them.
> 
> Anything I can try later on?

Interesting...  As you say, I saw this sort of thing before applying
the changes to schedule_tail(), and it is all too possible that there
is some other "sneak path" for context switches.

Have you tried running with CONFIG_PROVE_RCU?  This enables some
additional debugging in rcu_switch_from() and rcu_switch_to() that
helped track down the schedule_tail() problem.

						Thanx, Paul

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Benjamin Herrenschmidt @ 2012-05-01  0:33 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Paul E. McKenney, linuxppc-dev, linux-kernel
In-Reply-To: <alpine.LSU.2.00.1204301451270.2986@eggly.anvils>

On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> 
> BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1

Hrm ... in_atomic and irqs_disabled are both 0 ... so yeah it smells
like a preempt count problem... odd.

Did you get a specific bisect target yet ?

Cheers,
Ben.

> Call Trace:
> [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> 
> I've plenty more examples, most of them from page faults or from kswapd;
> but I don't think there's any more useful information in them.
> 
> Anything I can try later on? 

^ permalink raw reply

* Re: [PATCH v3 03/17] powerpc: Add PFO support to the VIO bus
From: Benjamin Herrenschmidt @ 2012-05-01  3:06 UTC (permalink / raw)
  To: Kent Yoder; +Cc: rcj, linuxppc-dev, linux-kernel, linux-crypto
In-Reply-To: <1334243302.18090.10.camel@key-ThinkPad-W510>

Hrm... I don't like that much:

> +	if (op->timeout)
> +		deadline = jiffies + msecs_to_jiffies(op->timeout);
> +
> +	while (true) {
> +		hret = plpar_hcall_norets(H_COP, op->flags,
> +				vdev->resource_id,
> +				op->in, op->inlen, op->out,
> +				op->outlen, op->csbcpb);
> +
> +		if (hret == H_SUCCESS ||
> +		    (hret != H_NOT_ENOUGH_RESOURCES &&
> +		     hret != H_BUSY && hret != H_RESOURCE) ||
> +		    (op->timeout && time_after(deadline, jiffies)))
> +			break;
> +
> +		dev_dbg(dev, "%s: hcall ret(%ld), retrying.\n", __func__, hret);
> +	}
> +

Is this meant to be called in atomic context ? If not, maybe it should
at the very least do a cond_resched() ?

Else, what about ceding the processor ? Or at the very least reducing
the thread priority for a bit ?

Shouldn't we also enforce to always have a timeout ? IE. Something like
30s or so if nothing specified to avoid having the kernel just hard
lock...

In general I don't like that sort of synchronous code, I'd rather return
the busy status up the chain which gives a chance to the caller to take
more appropriate measures depending on what it's doing, but that really
depends what you use that synchronous call for. I suppose if it's for
configuration type operations, it's ok...

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH v3 03/17] powerpc: Add PFO support to the VIO bus
From: Benjamin Herrenschmidt @ 2012-05-01  4:10 UTC (permalink / raw)
  To: Kent Yoder; +Cc: rcj, linuxppc-dev, linux-kernel, linux-crypto
In-Reply-To: <1335841603.3621.8.camel@pasglop>


> Else, what about ceding the processor ? Or at the very least reducing
> the thread priority for a bit ?
> 
> Shouldn't we also enforce to always have a timeout ? IE. Something like
> 30s or so if nothing specified to avoid having the kernel just hard
> lock...
> 
> In general I don't like that sort of synchronous code, I'd rather return
> the busy status up the chain which gives a chance to the caller to take
> more appropriate measures depending on what it's doing, but that really
> depends what you use that synchronous call for. I suppose if it's for
> configuration type operations, it's ok...

In any case, don't resend the whole series, just that one patch.

Cheers,
Ben.

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Hugh Dickins @ 2012-05-01  5:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Paul E. McKenney, linuxppc-dev, linux-kernel
In-Reply-To: <1335832418.20866.95.camel@pasglop>

On Tue, 1 May 2012, Benjamin Herrenschmidt wrote:
> On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > 
> > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> 
> Hrm ... in_atomic and irqs_disabled are both 0 ... so yeah it smells
> like a preempt count problem... odd.
> 
> Did you get a specific bisect target yet ?

Oh, I went as far as we need, I think, but I didn't bother quite to
complete it because, once in that area, we know the schedule_tail()
omission would muddy the waters: the tail of my bisect log was

# bad: [e798cf3385d3aa7c84afa65677eb92e0c0876dfd] rcu: Add exports for per-CPU variables used for inlining
git bisect bad e798cf3385d3aa7c84afa65677eb92e0c0876dfd
# good: [90aec3b06194393c909e3e5a47b6ed99bb8caba5] rcu: Make exit_rcu() more precise and consolidate
git bisect good 90aec3b06194393c909e3e5a47b6ed99bb8caba5

from which I concluded that the patch responsible is

commit ab8fc41a8545d40a4b58d745876c125af72a8a5c
Author: Paul E. McKenney <paul.mckenney@linaro.org>
Date:   Fri Apr 13 14:32:01 2012 -0700

    rcu: Move __rcu_read_lock() and __rcu_read_unlock() to per-CPU variables
    
    This commit is another step towards inlinable __rcu_read_lock() and
    __rcu_read_unlock() functions for preemptible RCU.  This keeps these two
    functions out of line, but switches them to use the per-CPU variables
    that are required to export their definitions without requiring that
    all RCU users include sched.h.  These per-CPU variables are saved and
    restored at context-switch time.

> 
> Cheers,
> Ben.
> 
> > Call Trace:
> > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> > 
> > I've plenty more examples, most of them from page faults or from kswapd;
> > but I don't think there's any more useful information in them.
> > 
> > Anything I can try later on? 

I'd forgotten about CONFIG_PROVE_RCU (and hadn't been using PROVE_LOCKING
on that machine), but following Paul's suggestion have now turned them on.

But not much light shed, I'm afraid.  Within minutes it showed a trace
exactly like the one above, but the only thing PROVE_LOCKING and PROVE_RCU
had to say was that we're holding mmap_sem at that point, which is no
surprise and not a problem, just something lockdep is right to note.

That was an isolated occurrence, it continued quietly for maybe 20 minutes,
then output lots to the console screen - but garbled in a way I've not
seen before - the 0s came out just right (or perhaps all the hex digits
were being shown as 0s), but most everything else was grayly unreadable.
Then after a few minutes, spontaneously rebooted.

Perhaps I should remind myself of netdump; but getting the trace above
without complaint from PROVE_RCU tells me that it is not helping.

Hugh

^ permalink raw reply

* Re: [PATCH 1/9] cpu: Introduce clear_tasks_mm_cpumask() helper
From: Peter Zijlstra @ 2012-05-01 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linaro-kernel, Mike Frysinger, user-mode-linux-devel, linux-sh,
	Richard Weinberger, linuxppc-dev, Oleg Nesterov, linux-kernel,
	linux-mm, Anton Vorontsov, Paul Mundt, John Stultz, patches,
	KOSAKI Motohiro, Russell King, uclinux-dist-devel,
	linux-arm-kernel
In-Reply-To: <20120426165911.00cebd31.akpm@linux-foundation.org>

On Thu, 2012-04-26 at 16:59 -0700, Andrew Morton wrote:
> > +void clear_tasks_mm_cpumask(int cpu)
>=20
> The operation of this function was presumably obvious to you at the
> time you wrote it, but that isn't true of other people at later times.
>=20
> Please document it?
>=20
>=20
> > +{
> > +     struct task_struct *p;
> > +
> > +     /*
> > +      * This function is called after the cpu is taken down and marked
> > +      * offline,
>=20
> hm, well.  Who said that this function will only ever be called
> after that CPU was taken down?  There is nothing in the function name
> nor in the (absent) documentation which enforces this precondition.
>=20
> If someone tries to use this function for a different purpose, or
> copies-and-modifies it for a different purpose, we just shot them in
> the foot.
>=20
> They'd be pretty dumb to do that without reading the local comment,
> but still...

Methinks something simple like:

	WARN_ON(cpu_online(cpu));

Ought to cure that worry, no? :-)

>=20
> >        so its not like new tasks will ever get this cpu set in
> > +      * their mm mask. -- Peter Zijlstra
> > +      * Thus, we may use rcu_read_lock() here, instead of grabbing
> > +      * full-fledged tasklist_lock.
> > +      */
> > +     rcu_read_lock();
> > +     for_each_process(p) {
> > +             struct task_struct *t;
> > +
> > +             t =3D find_lock_task_mm(p);
> > +             if (!t)
> > +                     continue;
> > +             cpumask_clear_cpu(cpu, mm_cpumask(t->mm));
> > +             task_unlock(t);
> > +     }
> > +     rcu_read_unlock();
> > +}=20

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Paul E. McKenney @ 2012-05-01 13:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, linuxppc-dev, Hugh Dickins, linux-kernel
In-Reply-To: <1335832418.20866.95.camel@pasglop>

On Tue, May 01, 2012 at 10:33:38AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > 
> > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> 
> Hrm ... in_atomic and irqs_disabled are both 0 ... so yeah it smells
> like a preempt count problem... odd.

All of the preempt-count patches are now in mainline.  :-(

The CONFIG_PROVE_RCU checks verify that either the task is new or it
is the same task that was last context-switched to on this CPU.  So
the most likely suspect is a newly created task that starts running
without schedule_tail() being invoked on the path from parent task
to child task.  If so, the fix would be to invoke rcu_switch_from()
and rcu_switch_to() on that code path.

So, does Power have a way of switching to a new task without involving
schedule_tail()?  I convinced myself that my old bugbear, usermode helpers,
aren't causing this problem, but I could easily be missing something.

> Did you get a specific bisect target yet ?

On this one, Hugh is close enough.  ;-)

							Thanx, Paul

> Cheers,
> Ben.
> 
> > Call Trace:
> > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> > 
> > I've plenty more examples, most of them from page faults or from kswapd;
> > but I don't think there's any more useful information in them.
> > 
> > Anything I can try later on? 
> 

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Paul E. McKenney @ 2012-05-01 14:22 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linuxppc-dev, linux-kernel, Paul E. McKenney
In-Reply-To: <alpine.LSU.2.00.1204302142310.27208@eggly.anvils>

On Mon, Apr 30, 2012 at 10:10:06PM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2012, Benjamin Herrenschmidt wrote:
> > On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > > 
> > > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> > 
> > Hrm ... in_atomic and irqs_disabled are both 0 ... so yeah it smells
> > like a preempt count problem... odd.
> > 
> > Did you get a specific bisect target yet ?
> 
> Oh, I went as far as we need, I think, but I didn't bother quite to
> complete it because, once in that area, we know the schedule_tail()
> omission would muddy the waters: the tail of my bisect log was

Agreed, your bisect is close enough for our purposes.

> # bad: [e798cf3385d3aa7c84afa65677eb92e0c0876dfd] rcu: Add exports for per-CPU variables used for inlining
> git bisect bad e798cf3385d3aa7c84afa65677eb92e0c0876dfd
> # good: [90aec3b06194393c909e3e5a47b6ed99bb8caba5] rcu: Make exit_rcu() more precise and consolidate
> git bisect good 90aec3b06194393c909e3e5a47b6ed99bb8caba5
> 
> from which I concluded that the patch responsible is
> 
> commit ab8fc41a8545d40a4b58d745876c125af72a8a5c
> Author: Paul E. McKenney <paul.mckenney@linaro.org>
> Date:   Fri Apr 13 14:32:01 2012 -0700
> 
>     rcu: Move __rcu_read_lock() and __rcu_read_unlock() to per-CPU variables
>     
>     This commit is another step towards inlinable __rcu_read_lock() and
>     __rcu_read_unlock() functions for preemptible RCU.  This keeps these two
>     functions out of line, but switches them to use the per-CPU variables
>     that are required to export their definitions without requiring that
>     all RCU users include sched.h.  These per-CPU variables are saved and
>     restored at context-switch time.
> 
> > 
> > Cheers,
> > Ben.
> > 
> > > Call Trace:
> > > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> > > 
> > > I've plenty more examples, most of them from page faults or from kswapd;
> > > but I don't think there's any more useful information in them.
> > > 
> > > Anything I can try later on? 
> 
> I'd forgotten about CONFIG_PROVE_RCU (and hadn't been using PROVE_LOCKING
> on that machine), but following Paul's suggestion have now turned them on.
> 
> But not much light shed, I'm afraid.  Within minutes it showed a trace
> exactly like the one above, but the only thing PROVE_LOCKING and PROVE_RCU
> had to say was that we're holding mmap_sem at that point, which is no
> surprise and not a problem, just something lockdep is right to note.
> 
> That was an isolated occurrence, it continued quietly for maybe 20 minutes,
> then output lots to the console screen - but garbled in a way I've not
> seen before - the 0s came out just right (or perhaps all the hex digits
> were being shown as 0s), but most everything else was grayly unreadable.
> Then after a few minutes, spontaneously rebooted.
> 
> Perhaps I should remind myself of netdump; but getting the trace above
> without complaint from PROVE_RCU tells me that it is not helping.

My guess is that the following happened:

1.	Task A is running, with its state in RCU's per-CPU variables.

2.	Task A creates Task B and switches to it, but without invoking
	schedule_tail() or schedule().  Task B is now running, with
	Task A's state in RCU's per-CPU variables.

3.	Task B switches context, saving Task A's per-CPU RCU variables
	(with modifications by Task B, just for fun).

4.	Task A starts running again, and loads obsolete versions of its
	per-CPU RCU variables.  This can cause rcu_read_unlock_do_special()
	to be invoked at inappropriate times, which could cause
	pretty arbitrary misbehavior.

5.	Mismatched values for the RCU read-side nesting could cause
	the read-side critical section to complete prematurely, which
	could cause all manner of mischief.  However, I would expect
	this to trigger the WARN_ON_ONCE() in __rcu_read_unlock().

Hmmm...

							Thanx, Paul

^ permalink raw reply

* [RFC][PATCH 5/5] sched: Rewrite the CONFIG_NUMA sched domain support
From: Peter Zijlstra @ 2012-05-01 18:14 UTC (permalink / raw)
  To: mingo, pjt, vatsa, suresh.b.siddha, efault
  Cc: linux-mips, linux-ia64, linux-sh, David Howells, Paul Mackerras,
	H. Peter Anvin, sparclinux, Dimitri Sivanich, x86, Greg Pearson,
	Ingo Molnar, chris.mason, Matt Turner, Fenghua Yu, Peter Zijlstra,
	Chris Metcalf, Ivan Kokshaysky, Anton Blanchard, Thomas Gleixner,
	KAMEZAWA Hiroyuki, Richard Henderson, Tony Luck, linux-kernel,
	Ralf Baechle, Paul Mundt, linux-alpha, bob.picco, linuxppc-dev,
	David S. Miller
In-Reply-To: <20120501181430.007891123@chello.nl>

The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the numa topology and esp for the smaller not-fully-connected
machines out there today this might make a difference.

Therefore, build a proper numa topology based on node_distance().

Since there's no fixed numa layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT aren't usable anymore, the new code tries to
construct something similar and scales some values either on the
number of cpus in the domain and/or the node_distance() ratio.

Anton (IBM)/Dimitri (SGI)/Greg (HP)/Kamezawa-san (Fujitsu)/Oracle,
could I get node_distance() tables for your 'current' line-up of silly
large machines? You can get them by doing:

 $ cat /sys/devices/system/node/node*/distance

If this information is 'sekrit' please consider sending it privately.
If you're not the right person to ask, please redirect me.

Also, testing this code on your favourite large NUMA machine would be
most appreciated.

Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/ia64/include/asm/topology.h           |   25 --
 arch/mips/include/asm/mach-ip27/topology.h |   17 -
 arch/powerpc/include/asm/topology.h        |   36 ---
 arch/sh/include/asm/topology.h             |   25 --
 arch/sparc/include/asm/topology_64.h       |   19 -
 arch/tile/include/asm/topology.h           |   26 --
 arch/x86/include/asm/topology.h            |   38 ---
 include/linux/topology.h                   |   37 ---
 kernel/sched/core.c                        |  280 +++++++++++++++++++----------
 9 files changed, 185 insertions(+), 318 deletions(-)

--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -70,31 +70,6 @@ void build_cpu_to_node_map(void);
 	.nr_balance_failed	= 0,			\
 }
 
-/* sched_domains SD_NODE_INIT for IA64 NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 8*(min(num_online_cpus(), 32U)), \
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 64,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -36,23 +36,6 @@ extern unsigned char __node_distances[MA
 
 #define node_distance(from, to)	(__node_distances[(from)][(to)])
 
-/* sched_domains SD_NODE_INIT for SGI IP27 machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 1,			\
-	.flags			= SD_LOAD_BALANCE |	\
-				  SD_BALANCE_EXEC,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_MACH_TOPOLOGY_H */
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -18,12 +18,6 @@ struct device_node;
  */
 #define RECLAIM_DISTANCE 10
 
-/*
- * Avoid creating an extra level of balancing (SD_ALLNODES) on the largest
- * POWER7 boxes which have a maximum of 32 nodes.
- */
-#define SD_NODES_PER_DOMAIN 32
-
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
@@ -51,36 +45,6 @@ static inline int pcibus_to_node(struct
 				 cpu_all_mask :				\
 				 cpumask_of_node(pcibus_to_node(bus)))
 
-/* sched_domains SD_NODE_INIT for PPC64 machines */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 8,					\
-	.max_interval		= 32,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 0*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -3,31 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for sh machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #define cpu_to_node(cpu)	((void)(cpu),0)
 #define parent_node(node)	((void)(node),0)
 
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,25 +31,6 @@ static inline int pcibus_to_node(struct
 	 cpu_all_mask : \
 	 cpumask_of_node(pcibus_to_node(bus)))
 
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0, 			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-}
-
 #else /* CONFIG_NUMA */
 
 #include <asm-generic/topology.h>
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -78,32 +78,6 @@ static inline const struct cpumask *cpum
 	.balance_interval	= 32,					\
 }
 
-/* sched_domains SD_NODE_INIT for TILE architecture */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 16,					\
-	.max_interval		= 512,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 2,					\
-	.wake_idx		= 1,					\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 128,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -92,44 +92,6 @@ extern void setup_node_to_cpumask_map(vo
 
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
-#ifdef CONFIG_X86_32
-# define SD_CACHE_NICE_TRIES	1
-# define SD_IDLE_IDX		1
-#else
-# define SD_CACHE_NICE_TRIES	2
-# define SD_IDLE_IDX		2
-#endif
-
-/* sched_domains SD_NODE_INIT for NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 8,					\
-	.max_interval		= 32,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= SD_CACHE_NICE_TRIES,			\
-	.busy_idx		= 3,					\
-	.idle_idx		= SD_IDLE_IDX,				\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -70,7 +70,6 @@ int arch_update_cpu_topology(void);
  * Below are the 3 major initializers used in building sched_domains:
  * SD_SIBLING_INIT, for SMT domains
  * SD_CPU_INIT, for SMP domains
- * SD_NODE_INIT, for NUMA domains
  *
  * Any architecture that cares to do any tuning to these values should do so
  * by defining their own arch-specific initializer in include/asm/topology.h.
@@ -176,48 +175,12 @@ int arch_update_cpu_topology(void);
 }
 #endif
 
-/* sched_domains SD_ALLNODES_INIT for NUMA machines */
-#define SD_ALLNODES_INIT (struct sched_domain) {			\
-	.min_interval		= 64,					\
-	.max_interval		= 64*num_online_cpus(),			\
-	.busy_factor		= 128,					\
-	.imbalance_pct		= 133,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 3,					\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 0*SD_BALANCE_EXEC			\
-				| 0*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 64,					\
-}
-
-#ifndef SD_NODES_PER_DOMAIN
-#define SD_NODES_PER_DOMAIN 16
-#endif
-
 #ifdef CONFIG_SCHED_BOOK
 #ifndef SD_BOOK_INIT
 #error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
 #endif
 #endif /* CONFIG_SCHED_BOOK */
 
-#ifdef CONFIG_NUMA
-#ifndef SD_NODE_INIT
-#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
-#endif
-
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5560,7 +5560,8 @@ static int sched_domain_debug_one(struct
 			break;
 		}
 
-		if (cpumask_intersects(groupmask, sched_group_cpus(group))) {
+		if (!(sd->flags & SD_OVERLAP) &&
+		    cpumask_intersects(groupmask, sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: repeated CPUs\n");
 			break;
@@ -5898,92 +5899,6 @@ static int __init isolated_cpu_setup(cha
 
 __setup("isolcpus=", isolated_cpu_setup);
 
-#ifdef CONFIG_NUMA
-
-/**
- * find_next_best_node - find the next node to include in a sched_domain
- * @node: node whose sched_domain we're building
- * @used_nodes: nodes already in the sched_domain
- *
- * Find the next node to include in a given scheduling domain. Simply
- * finds the closest node not already in the @used_nodes map.
- *
- * Should use nodemask_t.
- */
-static int find_next_best_node(int node, nodemask_t *used_nodes)
-{
-	int i, n, val, min_val, best_node = -1;
-
-	min_val = INT_MAX;
-
-	for (i = 0; i < nr_node_ids; i++) {
-		/* Start at @node */
-		n = (node + i) % nr_node_ids;
-
-		if (!nr_cpus_node(n))
-			continue;
-
-		/* Skip already used nodes */
-		if (node_isset(n, *used_nodes))
-			continue;
-
-		/* Simple min distance search */
-		val = node_distance(node, n);
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	if (best_node != -1)
-		node_set(best_node, *used_nodes);
-	return best_node;
-}
-
-/**
- * sched_domain_node_span - get a cpumask for a node's sched_domain
- * @node: node whose cpumask we're constructing
- * @span: resulting cpumask
- *
- * Given a node, construct a good cpumask for its sched_domain to span. It
- * should be one that prevents unnecessary balancing, but also spreads tasks
- * out optimally.
- */
-static void sched_domain_node_span(int node, struct cpumask *span)
-{
-	nodemask_t used_nodes;
-	int i;
-
-	cpumask_clear(span);
-	nodes_clear(used_nodes);
-
-	cpumask_or(span, span, cpumask_of_node(node));
-	node_set(node, used_nodes);
-
-	for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
-		int next_node = find_next_best_node(node, &used_nodes);
-		if (next_node < 0)
-			break;
-		cpumask_or(span, span, cpumask_of_node(next_node));
-	}
-}
-
-static const struct cpumask *cpu_node_mask(int cpu)
-{
-	lockdep_assert_held(&sched_domains_mutex);
-
-	sched_domain_node_span(cpu_to_node(cpu), sched_domains_tmpmask);
-
-	return sched_domains_tmpmask;
-}
-
-static const struct cpumask *cpu_allnodes_mask(int cpu)
-{
-	return cpu_possible_mask;
-}
-#endif /* CONFIG_NUMA */
-
 static const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
@@ -6020,6 +5935,7 @@ struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
 	int		    flags;
+	int		    numa_level;
 	struct sd_data      data;
 };
 
@@ -6213,10 +6129,6 @@ sd_init_##type(struct sched_domain_topol
 }
 
 SD_INIT_FUNC(CPU)
-#ifdef CONFIG_NUMA
- SD_INIT_FUNC(ALLNODES)
- SD_INIT_FUNC(NODE)
-#endif
 #ifdef CONFIG_SCHED_SMT
  SD_INIT_FUNC(SIBLING)
 #endif
@@ -6338,15 +6250,191 @@ static struct sched_domain_topology_leve
 	{ sd_init_BOOK, cpu_book_mask, },
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
-#ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
-	{ sd_init_ALLNODES, cpu_allnodes_mask, },
-#endif
 	{ NULL, },
 };
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA
+
+static int sched_domains_numa_levels;
+static int sched_domains_numa_scale;
+static int *sched_domains_numa_distance;
+static struct cpumask ***sched_domains_numa_masks;
+static int sched_domains_curr_level;
+
+static inline unsigned long numa_scale(unsigned long x, int level)
+{
+	return x * sched_domains_numa_distance[level] / sched_domains_numa_scale;
+}
+
+static inline int sd_local_flags(int level)
+{
+	if (sched_domains_numa_distance[level] > REMOTE_DISTANCE)
+		return 0;
+
+	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
+}
+
+static struct sched_domain *
+sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
+	int level = tl->numa_level;
+	int sd_weight = cpumask_weight(
+			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+
+	*sd = (struct sched_domain){
+		.min_interval		= sd_weight,
+		.max_interval		= 2*sd_weight,
+		.busy_factor		= 32,
+		.imbalance_pct		= 100 + numa_scale(25, level),
+		.cache_nice_tries	= 2,
+		.busy_idx		= 3,
+		.idle_idx		= 2,
+		.newidle_idx		= 0,
+		.wake_idx		= 0,
+		.forkexec_idx		= 0,
+
+		.flags			= 1*SD_LOAD_BALANCE
+					| 1*SD_BALANCE_NEWIDLE
+					| 0*SD_BALANCE_EXEC
+					| 0*SD_BALANCE_FORK
+					| 0*SD_BALANCE_WAKE
+					| 0*SD_WAKE_AFFINE
+					| 0*SD_PREFER_LOCAL
+					| 0*SD_SHARE_CPUPOWER
+					| 0*SD_POWERSAVINGS_BALANCE
+					| 0*SD_SHARE_PKG_RESOURCES
+					| 1*SD_SERIALIZE
+					| 0*SD_PREFER_SIBLING
+					| sd_local_flags(level)
+					,
+		.last_balance		= jiffies,
+		.balance_interval	= sd_weight,
+	};
+	SD_INIT_NAME(sd, NUMA);
+	sd->private = &tl->data;
+
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	return sd;
+}
+
+static const struct cpumask *sd_numa_mask(int cpu)
+{
+	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
+}
+
+static void sched_init_numa(void)
+{
+	int next_distance, curr_distance = node_distance(0, 0);
+	struct sched_domain_topology_level *tl;
+	int level = 0;
+	int i, j, k;
+
+	sched_domains_numa_scale = curr_distance;
+	sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
+	if (!sched_domains_numa_distance)
+		return;
+
+	/*
+	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
+	 * unique distances in the node_distance() table.
+	 *
+	 * Assumes node_distance(0,j) includes all distances in
+	 * node_distance(i,j) in order to avoid cubic time.
+	 *
+	 * XXX: could be optimized to O(n log n) by using sort()
+	 */
+	next_distance = curr_distance;
+	for (i = 0; i < nr_node_ids; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			int distance = node_distance(0, j);
+			if (distance > curr_distance &&
+					(distance < next_distance ||
+					 next_distance == curr_distance))
+				next_distance = distance;
+		}
+		if (next_distance != curr_distance) {
+			sched_domains_numa_distance[level++] = next_distance;
+			sched_domains_numa_levels = level;
+			curr_distance = next_distance;
+		} else break;
+	}
+	/*
+	 * 'level' contains the number of unique distances, excluding the
+	 * identity distance node_distance(i,i).
+	 *
+	 * The sched_domains_nume_distance[] array includes the actual distance
+	 * numbers.
+	 */
+
+	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
+	if (!sched_domains_numa_masks)
+		return;
+
+	/*
+	 * Now for each level, construct a mask per node which contains all
+	 * cpus of nodes that are that many hops away from us.
+	 */
+	for (i = 0; i < level; i++) {
+		sched_domains_numa_masks[i] =
+			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+		if (!sched_domains_numa_masks[i])
+			return;
+
+		for (j = 0; j < nr_node_ids; j++) {
+			struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+			if (!mask)
+				return;
+
+			sched_domains_numa_masks[i][j] = mask;
+
+			for (k = 0; k < nr_node_ids; k++) {
+				if (node_distance(cpu_to_node(j), k) >
+						sched_domains_numa_distance[i])
+					continue;
+
+				cpumask_or(mask, mask, cpumask_of_node(k));
+			}
+		}
+	}
+
+	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
+	if (!tl)
+		return;
+
+	/*
+	 * Copy the default topology bits..
+	 */
+	for (i = 0; default_topology[i].init; i++)
+		tl[i] = default_topology[i];
+
+	/*
+	 * .. and append 'j' levels of NUMA goodness.
+	 */
+	for (j = 0; j < level; i++, j++) {
+		tl[i] = (struct sched_domain_topology_level){
+			.init = sd_numa_init,
+			.mask = sd_numa_mask,
+			.flags = SDTL_OVERLAP,
+			.numa_level = j,
+		};
+	}
+
+	sched_domain_topology = tl;
+}
+#else
+static inline void sched_init_numa(void)
+{
+}
+#endif /* CONFIG_NUMA */
+
 static int __sdt_alloc(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;
@@ -6840,6 +6928,8 @@ void __init sched_init_smp(void)
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
+	sched_init_numa();
+
 	get_online_cpus();
 	mutex_lock(&sched_domains_mutex);
 	init_sched_domains(cpu_active_mask);

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Hugh Dickins @ 2012-05-01 21:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, linux-kernel, Paul E. McKenney
In-Reply-To: <20120501142208.GA2441@linux.vnet.ibm.com>

On Tue, 1 May 2012, Paul E. McKenney wrote:
> On Mon, Apr 30, 2012 at 10:10:06PM -0700, Hugh Dickins wrote:
> > On Tue, 1 May 2012, Benjamin Herrenschmidt wrote:
> > > On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > > > 
> > > > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > > > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> > > > Call Trace:
> > > > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > > > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > > > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > > > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > > > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > > > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> > > > 

> My guess is that the following happened:
> 
> 1.	Task A is running, with its state in RCU's per-CPU variables.
> 
> 2.	Task A creates Task B and switches to it, but without invoking
> 	schedule_tail() or schedule().  Task B is now running, with
> 	Task A's state in RCU's per-CPU variables.
> 
> 3.	Task B switches context, saving Task A's per-CPU RCU variables
> 	(with modifications by Task B, just for fun).
> 
> 4.	Task A starts running again, and loads obsolete versions of its
> 	per-CPU RCU variables.  This can cause rcu_read_unlock_do_special()
> 	to be invoked at inappropriate times, which could cause
> 	pretty arbitrary misbehavior.
> 
> 5.	Mismatched values for the RCU read-side nesting could cause
> 	the read-side critical section to complete prematurely, which
> 	could cause all manner of mischief.  However, I would expect
> 	this to trigger the WARN_ON_ONCE() in __rcu_read_unlock().
> 
> Hmmm...

I didn't find anything corresponding to that under arch/powerpc.

There is something I found, that I had high hopes for, but sadly no,
it does not fix it.  I may be wrong, it's a long time since I thought
about what happens in fork().  But I believe the rcu_switch_from(prev)
you added to schedule_tail() is bogus: doesn't schedule_tail() more or
less amount to a jump into schedule(), for the child to be as if it's
emerging from the switch_to() in schedule()?

So I think the rcu_switch_from(prev) has already been done by the prev
task on the cpu, as it goes into switch_to() in schedule().  So at best
you're duplicating that in schedule_tail(), and at worst (I don't know
if the prev task can get far enough away for this to matter) you're
messing with its state.  Probably difficult to do any harm (those fields
don't matter while it's on another cpu, and it has to get on another cpu
for them to change), but it does look wrong to me.

But commenting out that line did not fix my issues.  And if you agree,
you'll probably prefer, not to comment out the line, but revert back to
rcu_switch_from(void) and just add the rcu_switch_to() to schedule_tail().

Something else that I noticed in comments on rcu_switch_from() in
sched.h (er, is sched.h really the right place for this code?): it says
"if rcu_read_unlock_special is zero, then rcu_read_lock_nesting must
also be zero" - shouldn't that say "non-zero" in each case?

I must turn away from this right now, and come back to it later.

Hugh

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Paul E. McKenney @ 2012-05-01 23:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linuxppc-dev, linux-kernel, Paul E. McKenney
In-Reply-To: <alpine.LSU.2.00.1205011410360.28232@eggly.anvils>

On Tue, May 01, 2012 at 02:42:02PM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2012, Paul E. McKenney wrote:
> > On Mon, Apr 30, 2012 at 10:10:06PM -0700, Hugh Dickins wrote:
> > > On Tue, 1 May 2012, Benjamin Herrenschmidt wrote:
> > > > On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > > > > 
> > > > > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > > > > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> > > > > Call Trace:
> > > > > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > > > > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > > > > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > > > > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > > > > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > > > > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> > > > > 
> 
> > My guess is that the following happened:
> > 
> > 1.	Task A is running, with its state in RCU's per-CPU variables.
> > 
> > 2.	Task A creates Task B and switches to it, but without invoking
> > 	schedule_tail() or schedule().  Task B is now running, with
> > 	Task A's state in RCU's per-CPU variables.
> > 
> > 3.	Task B switches context, saving Task A's per-CPU RCU variables
> > 	(with modifications by Task B, just for fun).
> > 
> > 4.	Task A starts running again, and loads obsolete versions of its
> > 	per-CPU RCU variables.  This can cause rcu_read_unlock_do_special()
> > 	to be invoked at inappropriate times, which could cause
> > 	pretty arbitrary misbehavior.
> > 
> > 5.	Mismatched values for the RCU read-side nesting could cause
> > 	the read-side critical section to complete prematurely, which
> > 	could cause all manner of mischief.  However, I would expect
> > 	this to trigger the WARN_ON_ONCE() in __rcu_read_unlock().
> > 
> > Hmmm...
> 
> I didn't find anything corresponding to that under arch/powerpc.

Nor did I.  :-(

> There is something I found, that I had high hopes for, but sadly no,
> it does not fix it.  I may be wrong, it's a long time since I thought
> about what happens in fork().  But I believe the rcu_switch_from(prev)
> you added to schedule_tail() is bogus: doesn't schedule_tail() more or
> less amount to a jump into schedule(), for the child to be as if it's
> emerging from the switch_to() in schedule()?
> 
> So I think the rcu_switch_from(prev) has already been done by the prev
> task on the cpu, as it goes into switch_to() in schedule().  So at best
> you're duplicating that in schedule_tail(), and at worst (I don't know
> if the prev task can get far enough away for this to matter) you're
> messing with its state.  Probably difficult to do any harm (those fields
> don't matter while it's on another cpu, and it has to get on another cpu
> for them to change), but it does look wrong to me.

I do believe that you are correct.  /me wonders if it was really such
a good idea to tie RCU this closely to the scheduler...

I also agree that the chance of error is small.  The parent would have
to be blocked for the child to have any probability of doing harm,
but we need the probability to be zero, which it does not appear to be.

I will semi-revert this change as you suggest.

> But commenting out that line did not fix my issues.  And if you agree,
> you'll probably prefer, not to comment out the line, but revert back to
> rcu_switch_from(void) and just add the rcu_switch_to() to schedule_tail().
> 
> Something else that I noticed in comments on rcu_switch_from() in
> sched.h (er, is sched.h really the right place for this code?): it says
> "if rcu_read_unlock_special is zero, then rcu_read_lock_nesting must
> also be zero" - shouldn't that say "non-zero" in each case?

No, but the variables should be reversed.  It should read "Both
cases rely on the fact that if rcu_read_lock_nesting is zero, then
rcu_read_unlock_special must also be zero."  Here are the two cases:

1.	rcu_read_lock_nesting is zero:  Then rcu_read_unlock_special
	must also be zero, so there is no way to get to the
	rcu_read_unlock_do_special() function.  A scheduling-clock
	interrupt might later set one of the rcu_read_unlock_special
	bits, but only RCU_READ_UNLOCK_BLOCKED, which is OK because
	rcu_read_unlock_do_special() does not mess with the scheduler
	in this case.

2.	rcu_read_lock_nesting is non-zero: Then the task is blocking
	within an RCU read-side critical section, so none of the
	scheduler's or architecture's read-side critical sections
	can have the outermost rcu_read_unlock(), so the
	rcu_read_unlock_do_special() function will not be invoked
	in the first place.

Thank you for catching this!

> I must turn away from this right now, and come back to it later.

Thank you very much for all your help with this!

							Thanx, Paul

^ permalink raw reply

* Re: Build regressions/improvements in v3.4-rc5
From: Geert Uytterhoeven @ 2012-05-02  9:39 UTC (permalink / raw)
  To: Linux Kernel Development; +Cc: Chris Zankel, Linux/PPC Development
In-Reply-To: <alpine.DEB.2.00.1205021134570.1538@ayla.of.borg>

On Wed, 2 May 2012, Geert Uytterhoeven wrote:
> Below is the list of build error/warning regressions/improvements in
> v3.4-rc5[1] compared to v3.3[2].
> 
> Too make this mail fit in the lkml limit, I deleted
>   - 104 lines about __mcount_loc on sparc64
>   - all warning improvements
> 
> Summarized:
>   - build errors: +107/-82
>   - build warnings: +284/-999
> 
> JFYI, when comparing v3.4-rc5 to v3.4-rc4[3], the summaries are:
>   - build errors: +34/-2283
>   - build warnings: +145/-681

powerpc-randconfig:

  + arch/powerpc/platforms/512x/mpc512x_shared.c: error: 'FSL_DIU_PORT_DVI' undeclared (first use in this function):  => 189:9
  + arch/powerpc/platforms/512x/mpc512x_shared.c: error: parameter 1 ('port') has incomplete type:  => 187:54, 83:56, 88:57, 69:56
  + arch/powerpc/platforms/512x/mpc512x_shared.c: error: return type is an incomplete type:  => 187:1
  + drivers/virt/fsl_hypervisor.c: error: 'MSR_GS' undeclared (first use in this function):  => 799:80

xtensa-allnoconfig:

  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `__netif_schedule':  => .text+0x48)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `alloc_etherdev_mqs':  => .text+0x11c)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `consume_skb':  => .text+0x4c)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `dev_alloc_skb':  => .text+0x60)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `dev_close':  => .text+0x84)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `eth_type_trans':  => .text+0xac)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `eth_validate_addr':  => .rodata+0x114)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `free_netdev':  => .text+0x134)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `kfree_skb':  => .text+0x70)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `netif_rx_ni':  => .text+0x6c)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `register_netdevice':  => .text+0x12c)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `rtnl_lock':  => .text+0x128)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `rtnl_unlock':  => .text+0x130)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `skb_put':  => .text+0x64)
  + error: arch/xtensa/platforms/iss/built-in.o: undefined reference to `skb_trim':  => .text+0x68)
  + error: kernel/built-in.o: undefined reference to `_sdata':  => .text+0x8c0), .text+0x84c)
  + error: network.c: undefined reference to `__netif_schedule':  => .text+0x3d7)
  + error: network.c: undefined reference to `alloc_etherdev_mqs':  => .text+0x8c8)
  + error: network.c: undefined reference to `consume_skb':  => .text+0x42e)
  + error: network.c: undefined reference to `dev_alloc_skb':  => .text+0x4c8)
  + error: network.c: undefined reference to `dev_close':  => .text+0x683)
  + error: network.c: undefined reference to `eth_type_trans':  => .text+0x7db)
  + error: network.c: undefined reference to `free_netdev':  => .text+0xad2)
  + error: network.c: undefined reference to `kfree_skb':  => .text+0x548)
  + error: network.c: undefined reference to `netif_rx_ni':  => .text+0x53e)
  + error: network.c: undefined reference to `register_netdevice':  => .text+0xab7)
  + error: network.c: undefined reference to `rtnl_lock':  => .text+0xaaf)
  + error: network.c: undefined reference to `rtnl_unlock':  => .text+0xabf)
  + error: network.c: undefined reference to `skb_put':  => .text+0x507)
  + error: network.c: undefined reference to `skb_trim':  => .text+0x514)

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

^ permalink raw reply

* [PATCH EDACv16 1/2] edac: Change internal representation to work with layers
From: Borislav Petkov @ 2012-05-02 13:30 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Shaohui Xie, Jason Uhlenkott, Aristeu Rozanski, Hitoshi Mitake,
	Mark Gross, Dmitry Eremin-Solenikov, Ranganathan Desikan,
	Egor Martovetsky, Niklas Söderlund, Tim Small, Arvind R.,
	Chris Metcalf, Olof Johansson, Doug Thompson,
	Linux Edac Mailing List, Michal Marek, Jiri Kosina,
	Linux Kernel Mailing List, Joe Perches, Andrew Morton,
	linuxppc-dev

Starting a new thread because the old one grew too big and
is out of my screen. Patch below is git-formatted from
git://git.infradead.org/users/mchehab/edac.git

> commit 447b7929e633027ffe131f2f8f246bba5690cee7
> Author: Mauro Carvalho Chehab <mchehab@redhat.com>
> Date:   Wed Apr 18 15:20:50 2012 -0300
> 
>     edac: Change internal representation to work with layers
>     
>     Change the EDAC internal representation to work with non-csrow
>     based memory controllers.
>     
>     There are lots of those memory controllers nowadays, and more
>     are coming. So, the EDAC internal representation needs to be
>     changed, in order to work with those memory controllers, while
>     preserving backward compatibility with the old ones.
>     
>     The edac core was written with the idea that memory controllers
>     are able to directly access csrows.
>     
>     This is not true for FB-DIMM and RAMBUS memory controllers.
>     
>     Also, some recent advanced memory controllers don't present a per-csrows
>     view. Instead, they view memories as DIMMs, instead of ranks.
>     
>     So, change the allocation and error report routines to allow
>     them to work with all types of architectures.
>     
>     This will allow the removal of several hacks with FB-DIMM and RAMBUS
>     memory controllers.
>     
>     Also, several tests were done on different platforms using different
>     x86 drivers.
>     
>     TODO: a multi-rank DIMMs are currently represented by multiple DIMM
>     entries in struct dimm_info. That means that changing a label for one
>     rank won't change the same label for the other ranks at the same DIMM.
>     This bug is present since the beginning of the EDAC, so it is not a big
>     deal. However, on several drivers, it is possible to fix this issue, but
>     it should be a per-driver fix, as the csrow => DIMM arrangement may not
>     be equal for all. So, don't try to fix it here yet.
>     
>     I tried to make this patch as short as possible, preceding it with
>     several other patches that simplified the logic here. Yet, as the
>     internal API changes, all drivers need changes. The changes are
>     generally bigger in the drivers for FB-DIMMs.

<snip already reviewed stuff>

>  	/* Figure out the offsets of the various items from the start of an mc
>  	 * structure.  We want the alignment of each item to be at least as
> @@ -191,12 +253,28 @@ struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
>  	 * hardcode everything into a single struct.
>  	 */
>  	mci = edac_align_ptr(&ptr, sizeof(*mci), 1);
> -	csi = edac_align_ptr(&ptr, sizeof(*csi), nr_csrows);
> -	chi = edac_align_ptr(&ptr, sizeof(*chi), nr_csrows * nr_chans);
> -	dimm = edac_align_ptr(&ptr, sizeof(*dimm), nr_csrows * nr_chans);
> +	layer = edac_align_ptr(&ptr, sizeof(*layer), n_layers);
> +	csi = edac_align_ptr(&ptr, sizeof(*csi), tot_csrows);
> +	chi = edac_align_ptr(&ptr, sizeof(*chi), tot_csrows * tot_channels);
> +	dimm = edac_align_ptr(&ptr, sizeof(*dimm), tot_dimms);
> +	count = 1;
> +	for (i = 0; i < n_layers; i++) {
> +		count *= layers[i].size;
> +		debugf4("%s: errcount layer %d size %d\n", __func__, i, count);
> +		ce_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
> +		ue_per_layer[i] = edac_align_ptr(&ptr, sizeof(u32), count);
> +		tot_errcount += 2 * count;
> +	}
> +
> +	debugf4("%s: allocating %d error counters\n", __func__, tot_errcount);
>  	pvt = edac_align_ptr(&ptr, sz_pvt, 1);
>  	size = ((unsigned long)pvt) + sz_pvt;
>  
> +	debugf1("%s(): allocating %u bytes for mci data (%d %s, %d csrows/channels)\n",
> +		__func__, size,
> +		tot_dimms,
> +		per_rank ? "ranks" : "dimms",
> +		tot_csrows * tot_channels);
>  	mci = kzalloc(size, GFP_KERNEL);
>  	if (mci == NULL)
>  		return NULL;
> @@ -204,42 +282,101 @@ struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
>  	/* Adjust pointers so they point within the memory we just allocated
>  	 * rather than an imaginary chunk of memory located at address 0.
>  	 */
> +	layer = (struct edac_mc_layer *)(((char *)mci) + ((unsigned long)layer));
>  	csi = (struct csrow_info *)(((char *)mci) + ((unsigned long)csi));
>  	chi = (struct rank_info *)(((char *)mci) + ((unsigned long)chi));
>  	dimm = (struct dimm_info *)(((char *)mci) + ((unsigned long)dimm));
> +	for (i = 0; i < n_layers; i++) {
> +		mci->ce_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ce_per_layer[i]));
> +		mci->ue_per_layer[i] = (u32 *)((char *)mci + ((unsigned long)ue_per_layer[i]));
> +	}
>  	pvt = sz_pvt ? (((char *)mci) + ((unsigned long)pvt)) : NULL;
>  
>  	/* setup index and various internal pointers */
>  	mci->mc_idx = edac_index;
>  	mci->csrows = csi;
>  	mci->dimms  = dimm;
> +	mci->tot_dimms = tot_dimms;
>  	mci->pvt_info = pvt;
> -	mci->nr_csrows = nr_csrows;
> +	mci->n_layers = n_layers;
> +	mci->layers = layer;
> +	memcpy(mci->layers, layers, sizeof(*layer) * n_layers);
> +	mci->nr_csrows = tot_csrows;
> +	mci->num_cschannel = tot_channels;
> +	mci->mem_is_per_rank = per_rank;
>  
>  	/*
> -	 * For now, assumes that a per-csrow arrangement for dimms.
> -	 * This will be latter changed.
> +	 * Fills the csrow struct

	   Fill

>  	 */
> -	dimm = mci->dimms;
> -
> -	for (row = 0; row < nr_csrows; row++) {
> -		csrow = &csi[row];
> -		csrow->csrow_idx = row;
> -		csrow->mci = mci;
> -		csrow->nr_channels = nr_chans;
> -		chp = &chi[row * nr_chans];
> -		csrow->channels = chp;
> -
> -		for (chn = 0; chn < nr_chans; chn++) {
> +	for (row = 0; row < tot_csrows; row++) {
> +		csr = &csi[row];
> +		csr->csrow_idx = row;
> +		csr->mci = mci;
> +		csr->nr_channels = tot_channels;
> +		chp = &chi[row * tot_channels];
> +		csr->channels = chp;
> +
> +		for (chn = 0; chn < tot_channels; chn++) {
>  			chan = &chp[chn];
>  			chan->chan_idx = chn;
> -			chan->csrow = csrow;
> +			chan->csrow = csr;
> +		}
> +	}
>  
> -			mci->csrows[row].channels[chn].dimm = dimm;
> -			dimm->csrow = row;
> -			dimm->csrow_channel = chn;
> -			dimm++;
> -			mci->nr_dimms++;
> +	/*
> +	 * Fills the dimm struct

	   Fill

> +	 */
> +	memset(&pos, 0, sizeof(pos));
> +	row = 0;
> +	chn = 0;
> +	debugf4("%s: initializing %d %s\n", __func__, tot_dimms,
> +		per_rank ? "ranks" : "dimms");
> +	for (i = 0; i < tot_dimms; i++) {
> +		chan = &csi[row].channels[chn];
> +		dimm = EDAC_DIMM_PTR(layer, mci->dimms, n_layers,
> +			       pos[0], pos[1], pos[2]);
> +		dimm->mci = mci;
> +
> +		debugf2("%s: %d: %s%zd (%d:%d:%d): row %d, chan %d\n", __func__,
> +			i, per_rank ? "rank" : "dimm", (dimm - mci->dimms),
> +			pos[0], pos[1], pos[2], row, chn);
> +
> +		/* Copy DIMM location */
> +		for (j = 0; j < n_layers; j++)
> +			dimm->location[j] = pos[j];
> +
> +		/* Link it to the csrows old API data */
> +		chan->dimm = dimm;
> +		dimm->csrow = row;
> +		dimm->cschannel = chn;
> +
> +		/* Increment csrow location */
> +		if (!rev_order) {

AFAICT, this rev_order is always false (in the final version of the
patches anyway) and if so, can be completely removed as an argument to
edac_mc_alloc() and also the code in the else-branch below.

> +			for (j = n_layers - 1; j >= 0; j--)
> +				if (!layers[j].is_virt_csrow)
> +					break;
> +			chn++;
> +			if (chn == tot_channels) {
> +				chn = 0;
> +				row++;
> +			}
> +		} else {
> +			for (j = n_layers - 1; j >= 0; j--)
> +				if (layers[j].is_virt_csrow)
> +					break;
> +			row++;
> +			if (row == tot_csrows) {
> +				row = 0;
> +				chn++;
> +			}
> +		}
> +
> +		/* Increment dimm location */
> +		for (j = n_layers - 1; j >= 0; j--) {
> +			pos[j]++;
> +			if (pos[j] < layers[j].size)
> +				break;
> +			pos[j] = 0;
>  		}
>  	}
>  
> @@ -263,6 +400,57 @@ struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
>  	 */
>  	return mci;
>  }
> +EXPORT_SYMBOL_GPL(new_edac_mc_alloc);
> +
> +/**
> + * edac_mc_alloc: Allocate and partially fills a struct mem_ctl_info structure

					    fill

> + * @edac_index:		Memory controller number

How about 'mc_num' for a variable name instead then? I know, I know, it
was called like that originally but 'edac_index' is misleading.

> + * @n_layers:		Nu
> +mber of layers at the MC hierarchy

needless '\n'

> + * layers:		Describes each layer as seen by the Memory Controller
> + * @rev_order:		Fills csrows/cs channels at the reverse order

you can drop that, as put above.

> + * @size_pvt:		size of private storage needed

Capitalize:			Size

> + *
> + *
> + * FIXME: drivers handle multi-rank memories on different ways: on some

						in		 (remove "on")

> + * drivers, one multi-rank memory is mapped as one DIMM, while, on others,

and say: "Some drivers map multi-ranked DIMMs as one DIMM while others
as several DIMMs".

> + * a single multi-rank DIMM would be mapped into several "dimms".
> + *
> + * Non-csrow based drivers (like FB-DIMM and RAMBUS ones) will likely report
> + * such DIMMS properly, but the csrow-based ones will likely do the wrong
> + * thing, as two chip select values are used for dual-rank memories (and 4, for
> + * quad-rank ones). I suspect that this issue could be solved inside the EDAC
> + * core for SDRAM memories, but it requires further study at JEDEC JESD 21C.

This last paragraph sounds innacurately, especially the "likely" adverbs
make it even more confusing. The gist of what you're saying is already
present in the commit message anyway, so drop it here pls.

> + *
> + * In summary, solving this issue is not easy, as it requires a lot of testing.

Also superfluous and has nothing to do with edac_mc_alloc().

> + *
> + * Everything is kmalloc'ed as one big chunk - more efficient.
> + * Only can be used if all structures have the same lifetime - otherwisea

"It can only be used if ..."

> + * you have to allocate and initialize your own structures.
> + *
> + * Use edac_mc_free() to free mc structures allocated by this function.
> + *
> + * Returns:
> + *	NULL allocation failed
> + *	struct mem_ctl_info pointer

	On failure: NULL
	On success: struct mem_ctl_info.

> + */
> +
> +struct mem_ctl_info *edac_mc_alloc(unsigned sz_pvt, unsigned nr_csrows,
> +				   unsigned nr_chans, int edac_index)
> +{
> +	unsigned n_layers = 2;
> +	struct edac_mc_layer layers[n_layers];
> +
> +	layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
> +	layers[0].size = nr_csrows;
> +	layers[0].is_virt_csrow = true;
> +	layers[1].type = EDAC_MC_LAYER_CHANNEL;
> +	layers[1].size = nr_chans;
> +	layers[1].is_virt_csrow = false;
> +
> +	return new_edac_mc_alloc(edac_index, ARRAY_SIZE(layers), layers,
> +			  false, sz_pvt);
> +}
>  EXPORT_SYMBOL_GPL(edac_mc_alloc);
>  
>  /**
> @@ -528,7 +716,6 @@ EXPORT_SYMBOL(edac_mc_find);
>   * edac_mc_add_mc: Insert the 'mci' structure into the mci global list and
>   *                 create sysfs entries associated with mci structure
>   * @mci: pointer to the mci structure to be added to the list
> - * @mc_idx: A unique numeric identifier to be assigned to the 'mci' structure.
>   *
>   * Return:
>   *	0	Success
> @@ -555,6 +742,8 @@ int edac_mc_add_mc(struct mem_ctl_info *mci)
>  				edac_mc_dump_channel(&mci->csrows[i].
>  						channels[j]);
>  		}
> +		for (i = 0; i < mci->tot_dimms; i++)
> +			edac_mc_dump_dimm(&mci->dimms[i]);
>  	}
>  #endif
>  	mutex_lock(&mem_ctls_mutex);
> @@ -712,261 +901,251 @@ int edac_mc_find_csrow_by_page(struct mem_ctl_info *mci, unsigned long page)
>  }
>  EXPORT_SYMBOL_GPL(edac_mc_find_csrow_by_page);
>  
> -/* FIXME - setable log (warning/emerg) levels */
> -/* FIXME - integrate with evlog: http://evlog.sourceforge.net/ */
> -void edac_mc_handle_ce(struct mem_ctl_info *mci,
> -		unsigned long page_frame_number,
> -		unsigned long offset_in_page, unsigned long syndrome,
> -		int row, int channel, const char *msg)
> +const char *edac_layer_name[] = {
> +	[EDAC_MC_LAYER_BRANCH] = "branch",
> +	[EDAC_MC_LAYER_CHANNEL] = "channel",
> +	[EDAC_MC_LAYER_SLOT] = "slot",
> +	[EDAC_MC_LAYER_CHIP_SELECT] = "csrow",
> +};
> +EXPORT_SYMBOL_GPL(edac_layer_name);
> +
> +static void edac_increment_ce_error(struct mem_ctl_info *mci,

This could be abbreviated to edac_inc_ce_error()

> +				    bool enable_filter,
> +				    unsigned pos[EDAC_MAX_LAYERS])

Passing the whole array as an argument instead of only a pointer to it?

>  {
> -	unsigned long remapped_page;
> -	char *label = NULL;
> -	u32 grain;
> +	int i, index = 0;
>  
> -	debugf3("MC%d: %s()\n", mci->mc_idx, __func__);
> +	mci->ce_mc++;
>  
> -	/* FIXME - maybe make panic on INTERNAL ERROR an option */
> -	if (row >= mci->nr_csrows || row < 0) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: row out of range "
> -			"(%d >= %d)\n", row, mci->nr_csrows);
> -		edac_mc_handle_ce_no_info(mci, "INTERNAL ERROR");
> +	if (!enable_filter) {
> +		mci->ce_noinfo_count++;
>  		return;
>  	}
>  
> -	if (channel >= mci->csrows[row].nr_channels || channel < 0) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: channel out of range "
> -			"(%d >= %d)\n", channel,
> -			mci->csrows[row].nr_channels);
> -		edac_mc_handle_ce_no_info(mci, "INTERNAL ERROR");
> -		return;
> -	}
> -
> -	label = mci->csrows[row].channels[channel].dimm->label;
> -	grain = mci->csrows[row].channels[channel].dimm->grain;
> +	for (i = 0; i < mci->n_layers; i++) {
> +		if (pos[i] < 0)
> +			break;
> +		index += pos[i];
> +		mci->ce_per_layer[i][index]++;
>  
> -	if (edac_mc_get_log_ce())
> -		/* FIXME - put in DIMM location */
> -		edac_mc_printk(mci, KERN_WARNING,
> -			"CE page 0x%lx, offset 0x%lx, grain %d, syndrome "
> -			"0x%lx, row %d, channel %d, label \"%s\": %s\n",
> -			page_frame_number, offset_in_page,
> -			grain, syndrome, row, channel,
> -			label, msg);
> +		if (i < mci->n_layers - 1)
> +			index *= mci->layers[i + 1].size;
> +	}
> +}
>  
> -	mci->ce_count++;
> -	mci->csrows[row].ce_count++;
> -	mci->csrows[row].channels[channel].dimm->ce_count++;
> -	mci->csrows[row].channels[channel].ce_count++;
> +static void edac_increment_ue_error(struct mem_ctl_info *mci,
> +				    bool enable_filter,
> +				    unsigned pos[EDAC_MAX_LAYERS])
> +{
> +	int i, index = 0;
>  
> -	if (mci->scrub_mode & SCRUB_SW_SRC) {
> -		/*
> -		 * Some MC's can remap memory so that it is still available
> -		 * at a different address when PCI devices map into memory.
> -		 * MC's that can't do this lose the memory where PCI devices
> -		 * are mapped.  This mapping is MC dependent and so we call
> -		 * back into the MC driver for it to map the MC page to
> -		 * a physical (CPU) page which can then be mapped to a virtual
> -		 * page - which can then be scrubbed.
> -		 */
> -		remapped_page = mci->ctl_page_to_phys ?
> -			mci->ctl_page_to_phys(mci, page_frame_number) :
> -			page_frame_number;
> +	mci->ue_mc++;
>  
> -		edac_mc_scrub_block(remapped_page, offset_in_page, grain);
> +	if (!enable_filter) {
> +		mci->ce_noinfo_count++;
> +		return;
>  	}
> -}
> -EXPORT_SYMBOL_GPL(edac_mc_handle_ce);
>  
> -void edac_mc_handle_ce_no_info(struct mem_ctl_info *mci, const char *msg)
> -{
> -	if (edac_mc_get_log_ce())
> -		edac_mc_printk(mci, KERN_WARNING,
> -			"CE - no information available: %s\n", msg);
> +	for (i = 0; i < mci->n_layers; i++) {
> +		if (pos[i] < 0)
> +			break;
> +		index += pos[i];
> +		mci->ue_per_layer[i][index]++;
>  
> -	mci->ce_noinfo_count++;
> -	mci->ce_count++;
> +		if (i < mci->n_layers - 1)
> +			index *= mci->layers[i + 1].size;
> +	}
>  }
> -EXPORT_SYMBOL_GPL(edac_mc_handle_ce_no_info);
>  
> -void edac_mc_handle_ue(struct mem_ctl_info *mci,
> -		unsigned long page_frame_number,
> -		unsigned long offset_in_page, int row, const char *msg)
> +#define OTHER_LABEL " or "
> +void edac_mc_handle_error(const enum hw_event_mc_err_type type,
> +			  struct mem_ctl_info *mci,
> +			  const unsigned long page_frame_number,
> +			  const unsigned long offset_in_page,
> +			  const unsigned long syndrome,
> +			  const int layer0,
> +			  const int layer1,
> +			  const int layer2,

Instead of passing each layer as an arg, you can prepare the array pos[]
in each edac_mc_hanlde_*() and pass around a pointer to it - you need it
anyway in the edac_mc_inc*() functions.

> +			  const char *msg,
> +			  const char *other_detail,
> +			  const void *mcelog)

Also, this function is huuuge and begs to be splitted into small,
self-contained helpers.

>  {
> -	int len = EDAC_MC_LABEL_LEN * 4;
> -	char labels[len + 1];
> -	char *pos = labels;
> -	int chan;
> -	int chars;
> -	char *label = NULL;
> +	unsigned long remapped_page;
> +	/* FIXME: too much for stack: move it to some pre-alocated area */
> +	char detail[80], location[80];
> +	char label[(EDAC_MC_LABEL_LEN + 1 + sizeof(OTHER_LABEL)) * mci->tot_dimms];
> +	char *p;
> +	int row = -1, chan = -1;
> +	int pos[EDAC_MAX_LAYERS] = { layer0, layer1, layer2 };
> +	int i;
>  	u32 grain;
> +	bool enable_filter = false;

What does this enable_filter thing mean:

	if (pos[i] >= 0)
		enable_filter = true;

This absolutely needs explanation and better naming!

>  
>  	debugf3("MC%d: %s()\n", mci->mc_idx, __func__);
>  
> -	/* FIXME - maybe make panic on INTERNAL ERROR an option */
> -	if (row >= mci->nr_csrows || row < 0) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: row out of range "
> -			"(%d >= %d)\n", row, mci->nr_csrows);
> -		edac_mc_handle_ue_no_info(mci, "INTERNAL ERROR");
> -		return;
> -	}
> -
> -	grain = mci->csrows[row].channels[0].dimm->grain;
> -	label = mci->csrows[row].channels[0].dimm->label;
> -	chars = snprintf(pos, len + 1, "%s", label);
> -	len -= chars;
> -	pos += chars;
> -
> -	for (chan = 1; (chan < mci->csrows[row].nr_channels) && (len > 0);
> -		chan++) {
> -		label = mci->csrows[row].channels[chan].dimm->label;
> -		chars = snprintf(pos, len + 1, ":%s", label);
> -		len -= chars;
> -		pos += chars;
> +	/* Check if the event report is consistent */
> +	for (i = 0; i < mci->n_layers; i++) {
> +		if (pos[i] >= (int)mci->layers[i].size) {
> +			if (type == HW_EVENT_ERR_CORRECTED) {
> +				p = "CE";
> +				mci->ce_mc++;
> +			} else {
> +				p = "UE";
> +				mci->ue_mc++;

Ok, mci->ce_mc and mci->ue_mc are being incremented here and the same
happens below again in the edac_inc*_{ce,ue}_error() functions below.

Why? Current layer is within valid elements count of current layer?

> +			}
> +			edac_mc_printk(mci, KERN_ERR,
> +				       "INTERNAL ERROR: %s value is out of range (%d >= %d)\n",
> +				       edac_layer_name[mci->layers[i].type],
> +				       pos[i], mci->layers[i].size);
> +			/*
> +			 * Instead of just returning it, let's use what's
> +			 * known about the error. The increment routines and
> +			 * the DIMM filter logic will do the right thing by
> +			 * pointing the likely damaged DIMMs.
> +			 */
> +			pos[i] = -1;
> +		}
> +		if (pos[i] >= 0)
> +			enable_filter = true;

As above, what does that filter logic mean, where it is explained?

>  
> -	if (edac_mc_get_log_ue())
> -		edac_mc_printk(mci, KERN_EMERG,
> -			"UE page 0x%lx, offset 0x%lx, grain %d, row %d, "
> -			"labels \"%s\": %s\n", page_frame_number,
> -			offset_in_page, grain, row, labels, msg);
> -
> -	if (edac_mc_get_panic_on_ue())
> -		panic("EDAC MC%d: UE page 0x%lx, offset 0x%lx, grain %d, "
> -			"row %d, labels \"%s\": %s\n", mci->mc_idx,
> -			page_frame_number, offset_in_page,
> -			grain, row, labels, msg);
> -
> -	mci->ue_count++;
> -	mci->csrows[row].ue_count++;
> -}
> -EXPORT_SYMBOL_GPL(edac_mc_handle_ue);
> +	/*
> +	 * Get the dimm label/grain that applies to the match criteria.
> +	 * As the error algorithm may not be able to point to just one memory,

What exactly do you mean by "memory" here? DIMM, slot, rank? Please be
more specific.

> +	 * the logic here will get all possible labels that could pottentially
> +	 * be affected by the error.
> +	 * On FB-DIMM memory controllers, for uncorrected errors, it is common
> +	 * to have only the MC channel and the MC dimm (also called as "rank")

							remove "as"

> +	 * but the channel is not known, as the memory is arranged in pairs,
> +	 * where each memory belongs to a separate channel within the same
> +	 * branch.
> +	 * It will also get the max grain, over the error match range
> +	 */
> +	grain = 0;
> +	p = label;
> +	*p = '\0';
> +	for (i = 0; i < mci->tot_dimms; i++) {
> +		struct dimm_info *dimm = &mci->dimms[i];
>  
> -void edac_mc_handle_ue_no_info(struct mem_ctl_info *mci, const char *msg)
> -{
> -	if (edac_mc_get_panic_on_ue())
> -		panic("EDAC MC%d: Uncorrected Error", mci->mc_idx);
> +		if (layer0 >= 0 && layer0 != dimm->location[0])
> +			continue;
> +		if (layer1 >= 0 && layer1 != dimm->location[1])
> +			continue;
> +		if (layer2 >= 0 && layer2 != dimm->location[2])
> +			continue;
>  
> -	if (edac_mc_get_log_ue())
> -		edac_mc_printk(mci, KERN_WARNING,
> -			"UE - no information available: %s\n", msg);
> -	mci->ue_noinfo_count++;
> -	mci->ue_count++;
> -}
> -EXPORT_SYMBOL_GPL(edac_mc_handle_ue_no_info);
> +		if (dimm->grain > grain)
> +			grain = dimm->grain;

Pls move the "max grain" part of the comment over this lines so that one
knows what happens.

>  
> -/*************************************************************
> - * On Fully Buffered DIMM modules, this help function is
> - * called to process UE events
> - */
> -void edac_mc_handle_fbd_ue(struct mem_ctl_info *mci,
> -			unsigned int csrow,
> -			unsigned int channela,
> -			unsigned int channelb, char *msg)
> -{
> -	int len = EDAC_MC_LABEL_LEN * 4;
> -	char labels[len + 1];
> -	char *pos = labels;
> -	int chars;
> -	char *label;
> -
> -	if (csrow >= mci->nr_csrows) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: row out of range (%d >= %d)\n",
> -			csrow, mci->nr_csrows);
> -		edac_mc_handle_ue_no_info(mci, "INTERNAL ERROR");
> -		return;
> +		/*
> +		 * If the error is memory-controller wide, there's no sense
> +		 * on seeking for the affected DIMMs, as everything may be

	"there's no need to seek for the affected DIMMs because the whole
	channel/memory controller/...  may be affected"

> +		 * affected. Also, don't show errors for non-filled dimm's.

						     for empty DIMM slots.

> +		 */
> +		if (enable_filter && dimm->nr_pages) {
> +			if (p != label) {
> +				strcpy(p, OTHER_LABEL);
> +				p += strlen(OTHER_LABEL);
> +			}
> +			strcpy(p, dimm->label);
> +			p += strlen(p);
> +			*p = '\0';
> +
> +			/*
> +			 * get csrow/channel of the dimm, in order to allow

						    DIMM

> +			 * incrementing the compat API counters
> +			 */
> +			debugf4("%s: %s csrows map: (%d,%d)\n",
> +				__func__,
> +				mci->mem_is_per_rank ? "rank" : "dimm",
> +				dimm->csrow, dimm->cschannel);

newline

> +			if (row == -1)
> +				row = dimm->csrow;
> +			else if (row >= 0 && row != dimm->csrow)
> +				row = -2;

ditto

> +			if (chan == -1)
> +				chan = dimm->cschannel;
> +			else if (chan >= 0 && chan != dimm->cschannel)
> +				chan = -2;
> +		}
>  	}
> -
> -	if (channela >= mci->csrows[csrow].nr_channels) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: channel-a out of range "
> -			"(%d >= %d)\n",
> -			channela, mci->csrows[csrow].nr_channels);
> -		edac_mc_handle_ue_no_info(mci, "INTERNAL ERROR");
> -		return;

newline here.

> +	if (!enable_filter) {
> +		strcpy(label, "any memory");
> +	} else {
> +		debugf4("%s: csrow/channel to increment: (%d,%d)\n",
> +			__func__, row, chan);
> +		if (p == label)
> +			strcpy(label, "unknown memory");
> +		if (type == HW_EVENT_ERR_CORRECTED) {
> +			if (row >= 0) {
> +				mci->csrows[row].ce_count++;
> +				if (chan >= 0)
> +					mci->csrows[row].channels[chan].ce_count++;
> +			}
> +		} else
> +			if (row >= 0)
> +				mci->csrows[row].ue_count++;
>  	}
>  
> -	if (channelb >= mci->csrows[csrow].nr_channels) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: channel-b out of range "
> -			"(%d >= %d)\n",
> -			channelb, mci->csrows[csrow].nr_channels);
> -		edac_mc_handle_ue_no_info(mci, "INTERNAL ERROR");
> -		return;
> +	/* Fill the RAM location data */
> +	p = location;
> +	for (i = 0; i < mci->n_layers; i++) {
> +		if (pos[i] < 0)
> +			continue;

newline.

> +		p += sprintf(p, "%s %d ",
> +			     edac_layer_name[mci->layers[i].type],
> +			     pos[i]);
>  	}
>  
> -	mci->ue_count++;
> -	mci->csrows[csrow].ue_count++;
> -
> -	/* Generate the DIMM labels from the specified channels */
> -	label = mci->csrows[csrow].channels[channela].dimm->label;
> -	chars = snprintf(pos, len + 1, "%s", label);
> -	len -= chars;
> -	pos += chars;
> -
> -	chars = snprintf(pos, len + 1, "-%s",
> -			mci->csrows[csrow].channels[channelb].dimm->label);
> -
> -	if (edac_mc_get_log_ue())
> -		edac_mc_printk(mci, KERN_EMERG,
> -			"UE row %d, channel-a= %d channel-b= %d "
> -			"labels \"%s\": %s\n", csrow, channela, channelb,
> -			labels, msg);
> -
> -	if (edac_mc_get_panic_on_ue())
> -		panic("UE row %d, channel-a= %d channel-b= %d "
> -			"labels \"%s\": %s\n", csrow, channela,
> -			channelb, labels, msg);
> -}
> -EXPORT_SYMBOL(edac_mc_handle_fbd_ue);
> +	/* Memory type dependent details about the error */
> +	if (type == HW_EVENT_ERR_CORRECTED)
> +		snprintf(detail, sizeof(detail),
> +			"page 0x%lx offset 0x%lx grain %d syndrome 0x%lx",
> +			page_frame_number, offset_in_page,
> +			grain, syndrome);
> +	else
> +		snprintf(detail, sizeof(detail),
> +			"page 0x%lx offset 0x%lx grain %d",
> +			page_frame_number, offset_in_page, grain);
> +
> +	if (type == HW_EVENT_ERR_CORRECTED) {
> +		if (edac_mc_get_log_ce())
> +			edac_mc_printk(mci, KERN_WARNING,
> +				       "CE %s on %s (%s%s %s)\n",
> +				       msg, label, location,
> +				       detail, other_detail);

two back-to-back if-statements with the same conditional, pls merge
them. Better yet, this edac_mc_handle_error() is huuge, pls split its
functionality into well-abstracted helpers, for example one which deals
with HW_EVENT_ERR_CORRECTED, another with HW_EVENT_ERR_UNCORRECTED, etc.

> +		edac_increment_ce_error(mci, enable_filter, pos);
> +
> +		if (mci->scrub_mode & SCRUB_SW_SRC) {
> +			/*
> +			 * Some MC's can remap memory so that it is still

				memory controllers (called MCs below)

> +			 * available at a different address when PCI devices
> +			 * map into memory.
> +			 * MC's that can't do this lose the memory where PCI

					      this, lose...

> +			 * devices are mapped. This mapping is MC dependent

								MC-dependent

> +			 * and so we call back into the MC driver for it to
> +			 * map the MC page to a physical (CPU) page which can
> +			 * then be mapped to a virtual page - which can then
> +			 * be scrubbed.
> +			 */
> +			remapped_page = mci->ctl_page_to_phys ?
> +				mci->ctl_page_to_phys(mci, page_frame_number) :
> +				page_frame_number;
> +
> +			edac_mc_scrub_block(remapped_page,
> +					    offset_in_page, grain);
> +		}

The SCRUB_SW_SRC piece can be another function.

> +	} else {
> +		if (edac_mc_get_log_ue())
> +			edac_mc_printk(mci, KERN_WARNING,
> +				"UE %s on %s (%s%s %s)\n",
> +				msg, label, location, detail, other_detail);
>  
> -/*************************************************************
> - * On Fully Buffered DIMM modules, this help function is
> - * called to process CE events
> - */
> -void edac_mc_handle_fbd_ce(struct mem_ctl_info *mci,
> -			unsigned int csrow, unsigned int channel, char *msg)
> -{
> -	char *label = NULL;
> +		if (edac_mc_get_panic_on_ue())
> +			panic("UE %s on %s (%s%s %s)\n",
> +			      msg, label, location, detail, other_detail);
>  
> -	/* Ensure boundary values */
> -	if (csrow >= mci->nr_csrows) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: row out of range (%d >= %d)\n",
> -			csrow, mci->nr_csrows);
> -		edac_mc_handle_ce_no_info(mci, "INTERNAL ERROR");
> -		return;
> +		edac_increment_ue_error(mci, enable_filter, pos);
>  	}
> -	if (channel >= mci->csrows[csrow].nr_channels) {
> -		/* something is wrong */
> -		edac_mc_printk(mci, KERN_ERR,
> -			"INTERNAL ERROR: channel out of range (%d >= %d)\n",
> -			channel, mci->csrows[csrow].nr_channels);
> -		edac_mc_handle_ce_no_info(mci, "INTERNAL ERROR");
> -		return;
> -	}
> -
> -	label = mci->csrows[csrow].channels[channel].dimm->label;
> -
> -	if (edac_mc_get_log_ce())
> -		/* FIXME - put in DIMM location */
> -		edac_mc_printk(mci, KERN_WARNING,
> -			"CE row %d, channel %d, label \"%s\": %s\n",
> -			csrow, channel, label, msg);
> -
> -	mci->ce_count++;
> -	mci->csrows[csrow].ce_count++;
> -	mci->csrows[csrow].channels[channel].dimm->ce_count++;
> -	mci->csrows[csrow].channels[channel].ce_count++;
>  }
> -EXPORT_SYMBOL(edac_mc_handle_fbd_ce);
> +EXPORT_SYMBOL_GPL(edac_mc_handle_error);
> diff --git a/include/linux/edac.h b/include/linux/edac.h
> index 3b8798d909da..2b66109bc301 100644
> --- a/include/linux/edac.h
> +++ b/include/linux/edac.h
> @@ -412,18 +412,20 @@ struct edac_mc_layer {
>  /* FIXME: add the proper per-location error counts */
>  struct dimm_info {
>  	char label[EDAC_MC_LABEL_LEN + 1];	/* DIMM label on motherboard */
> -	unsigned memory_controller;
> -	unsigned csrow;
> -	unsigned csrow_channel;
> +
> +	/* Memory location data */
> +	unsigned location[EDAC_MAX_LAYERS];
> +
> +	struct mem_ctl_info *mci;	/* the parent */
>  
>  	u32 grain;		/* granularity of reported error in bytes */
>  	enum dev_type dtype;	/* memory device type */
>  	enum mem_type mtype;	/* memory dimm type */
>  	enum edac_type edac_mode;	/* EDAC mode for this dimm */
>  
> -	u32 nr_pages;			/* number of pages in csrow */
> +	u32 nr_pages;			/* number of pages on this dimm */
>  
> -	u32 ce_count;		/* Correctable Errors for this dimm */
> +	unsigned csrow, cschannel;	/* Points to the old API data */
>  };
>  
>  /**
> @@ -443,9 +445,10 @@ struct dimm_info {
>   */
>  struct rank_info {
>  	int chan_idx;
> -	u32 ce_count;
>  	struct csrow_info *csrow;
>  	struct dimm_info *dimm;
> +
> +	u32 ce_count;		/* Correctable Errors for this csrow */
>  };
>  
>  struct csrow_info {
> @@ -497,6 +500,11 @@ struct mcidev_sysfs_attribute {
>          ssize_t (*store)(struct mem_ctl_info *, const char *,size_t);
>  };
>  
> +struct edac_hierarchy {
> +	char		*name;
> +	unsigned	nr;
> +};

What is that, unused leftovers?

> +
>  /* MEMORY controller information structure
>   */
>  struct mem_ctl_info {
> @@ -541,13 +549,18 @@ struct mem_ctl_info {
>  	unsigned long (*ctl_page_to_phys) (struct mem_ctl_info * mci,
>  					   unsigned long page);
>  	int mc_idx;
> -	int nr_csrows;
>  	struct csrow_info *csrows;
> +	unsigned nr_csrows, num_cschannel;
> +
> +	/* Memory Controller hierarchy */
> +	unsigned n_layers;
> +	struct edac_mc_layer *layers;
> +	bool mem_is_per_rank;
>  
>  	/*
>  	 * DIMM info. Will eventually remove the entire csrows_info some day
>  	 */
> -	unsigned nr_dimms;
> +	unsigned tot_dimms;
>  	struct dimm_info *dimms;
>  
>  	/*
> @@ -562,12 +575,15 @@ struct mem_ctl_info {
>  	const char *dev_name;
>  	char proc_name[MC_PROC_NAME_MAX_LEN + 1];
>  	void *pvt_info;
> -	u32 ue_noinfo_count;	/* Uncorrectable Errors w/o info */
> -	u32 ce_noinfo_count;	/* Correctable Errors w/o info */
> -	u32 ue_count;		/* Total Uncorrectable Errors for this MC */
> -	u32 ce_count;		/* Total Correctable Errors for this MC */
> +	u32 ue_count;           /* Total Uncorrectable Errors for this MC */
> +	u32 ce_count;           /* Total Correctable Errors for this MC */a

Why are you touching ue_count and ce_count, I don't see any differences
here?

>  	unsigned long start_time;	/* mci load start time (in jiffies) */
>  
> +	/* drivers shouldn't access this struct directly */

Which struct, I only see unsigneds?

> +	unsigned ce_noinfo_count, ue_noinfo_count;
> +	unsigned ce_mc, ue_mc;

What are those counters?

> +	u32 *ce_per_layer[EDAC_MAX_LAYERS], *ue_per_layer[EDAC_MAX_LAYERS];
> +
>  	struct completion complete;
>  
>  	/* edac sysfs device control */
> @@ -580,7 +596,7 @@ struct mem_ctl_info {
>  	 * by the low level driver.
>  	 *
>  	 * Set by the low level driver to provide attributes at the
> -	 * controller level, same level as 'ue_count' and 'ce_count' above.
> +	 * controller level.
>  	 * An array of structures, NULL terminated
>  	 *
>  	 * If attributes are desired, then set to array of attributes

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

^ permalink raw reply

* Re: [V2 5/5] powerpc: kernel: 16650 UART reg-shift support
From: Josh Boyer @ 2012-05-02 13:38 UTC (permalink / raw)
  To: Tanmay Inamdar; +Cc: devicetree-discuss, linuxppc-dev, linux-kernel
In-Reply-To: <1333956052-25319-5-git-send-email-tinamdar@apm.com>

On Mon, Apr 9, 2012 at 3:20 AM, Tanmay Inamdar <tinamdar@apm.com> wrote:
> In APM8018X SOC, UART register address space has been relocated to 32-bit
> data boundaries for APB bus implementation.
> Current legacy_serial driver ignores the reg-shift property. This patch
> modifies legacy_serial.c and udbg_16550.c to work with above mentioned UA=
RTs.
>
> Signed-off-by: Tanmay Inamdar <tinamdar@apm.com>
> ---
> :100644 100644 8338aef... f5fc106... M =A0arch/powerpc/include/asm/udbg.h
> :100644 100644 bedd12e... d523b7d... M =A0arch/powerpc/kernel/legacy_seri=
al.c
> :100644 100644 6837f83... e0cb7dc... M =A0arch/powerpc/kernel/udbg_16550.=
c
> =A0arch/powerpc/include/asm/udbg.h =A0 =A0 | =A0 =A02 +-
> =A0arch/powerpc/kernel/legacy_serial.c | =A0 16 +++++---
> =A0arch/powerpc/kernel/udbg_16550.c =A0 =A0| =A0 64 +++++++++++++++++++++=
+------------
> =A03 files changed, 52 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/udbg.h b/arch/powerpc/include/asm/u=
dbg.h
> index 8338aef..f5fc106 100644
> --- a/arch/powerpc/include/asm/udbg.h
> +++ b/arch/powerpc/include/asm/udbg.h
> @@ -29,7 +29,7 @@ extern void udbg_printf(const char *fmt, ...)
> =A0extern void udbg_progress(char *s, unsigned short hex);
>
> =A0extern void udbg_init_uart(void __iomem *comport, unsigned int speed,
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0unsigned int clock);
> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0unsigned int clock, =
=A0unsigned int regshift);
> =A0extern unsigned int udbg_probe_uart_speed(void __iomem *comport,
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 =A0 =A0unsigned int clock);
>
> diff --git a/arch/powerpc/kernel/legacy_serial.c b/arch/powerpc/kernel/le=
gacy_serial.c
> index bedd12e..d523b7d 100644
> --- a/arch/powerpc/kernel/legacy_serial.c
> +++ b/arch/powerpc/kernel/legacy_serial.c
> @@ -33,6 +33,7 @@ static struct legacy_serial_info {
> =A0 =A0 =A0 =A0unsigned int =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0clock;
> =A0 =A0 =A0 =A0int =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =
=A0 irq_check_parent;
> =A0 =A0 =A0 =A0phys_addr_t =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 taddr;
> + =A0 =A0 =A0 unsigned int =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0regshif=
t;
> =A0} legacy_serial_infos[MAX_LEGACY_SERIAL_PORTS];
>
> =A0static struct __initdata of_device_id legacy_serial_parents[] =3D {
> @@ -42,6 +43,7 @@ static struct __initdata of_device_id legacy_serial_par=
ents[] =3D {
> =A0 =A0 =A0 =A0{.compatible =3D "ibm,opb",},
> =A0 =A0 =A0 =A0{.compatible =3D "simple-bus",},
> =A0 =A0 =A0 =A0{.compatible =3D "wrs,epld-localbus",},
> + =A0 =A0 =A0 {.compatible =3D "apm,apb",},
> =A0 =A0 =A0 =A0{},
> =A0};
>
> @@ -163,11 +165,6 @@ static int __init add_legacy_soc_port(struct device_=
node *np,
> =A0 =A0 =A0 =A0if (of_get_property(np, "clock-frequency", NULL) =3D=3D NU=
LL)
> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return -1;
>
> - =A0 =A0 =A0 /* if reg-shift or offset, don't try to use it */
> - =A0 =A0 =A0 if ((of_get_property(np, "reg-shift", NULL) !=3D NULL) ||
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 (of_get_property(np, "reg-offset", NULL) !=
=3D NULL))
> - =A0 =A0 =A0 =A0 =A0 =A0 =A0 return -1;
> -

So we explicitly didn't support reg-shift before.  I'm guessing there is
a reason for that, but I don't recall what.  Ben?

Also, why do you need to use the legacy serial driver at all for this
SOC?  As far as I remember, the OF serial driver should be sufficient.

> +static unsigned int reg_shift;
> +#define ns16550_offset(addr) (addr - (unsigned char *)udbg_comport)
> +
> =A0static struct NS16550 __iomem *udbg_comport;
>
> +static inline u8 serial_read(unsigned char *addr)
> +{
> + =A0 =A0 =A0 u32 offset =3D ns16550_offset(addr) << reg_shift;
> + =A0 =A0 =A0 return readb(udbg_comport + offset);
> +}
> +
> +static inline void serial_write(unsigned char *addr, char val)
> +{
> + =A0 =A0 =A0 u32 offset =3D ns16550_offset(addr) << reg_shift;
> + =A0 =A0 =A0 writeb(val, udbg_comport + offset);
> +}
> +

I don't think readb/writeb are correct here.  Why did you switch to
using those instead of sticking with in_8/out_8?

josh

^ permalink raw reply

* Re: [V2 2/5] powerpc: 40x: Add AHB, APB of_device_ids
From: Josh Boyer @ 2012-05-02 13:41 UTC (permalink / raw)
  To: Tanmay Inamdar; +Cc: devicetree-discuss, linuxppc-dev, linux-kernel
In-Reply-To: <1333956052-25319-2-git-send-email-tinamdar@apm.com>

On Mon, Apr 9, 2012 at 3:20 AM, Tanmay Inamdar <tinamdar@apm.com> wrote:
> Adding of_device_id's for AHB and APB buses used in klondike (APM8018X) in
> platforms/40x/ppc40x_simple.c file
>
> Signed-off-by: Tanmay Inamdar <tinamdar@apm.com>

You should probably combine the first 3 patches in this series into one
patch.  Otherwise you run the risk of changing the DTS to something the
kernel can't support.  At the least, this change should be patch 1 in
the series, and you can probably combine the two DTS changes.

josh

^ permalink raw reply

* Please pull 'next' branch of 4xx tree
From: Josh Boyer @ 2012-05-02 13:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev

Hi Ben,

A few patches from Suzie for 47x kexec/kdump support, and some MSI patches
from Mai La.

josh

The following changes since commit ec34a6814988f17506733c1e8b058ce46602591d:

  powerpc: Remove old powerpc specific ptrace getregs/setregs calls
(2012-04-30 15:37:28 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jwboyer/powerpc-4xx.git next

for you to fetch changes up to 9c6b2353dfb80ae843b831c03fc53ddc5c3949ff:

  powerpc/44x: Add PCI MSI node for Maui APM821xx SoC and Bluestone
board in DTS (2012-05-03 08:58:21 -0400)

----------------------------------------------------------------
Mai La (2):
      powerpc/44x: Fix PCI MSI support for Maui APM821xx SoC and Bluestone board
      powerpc/44x: Add PCI MSI node for Maui APM821xx SoC and
Bluestone board in DTS

Suzuki Poulose (3):
      powerpc/44x: Fix/Initialize PID to kernel PID before the TLB search
      powerpc/47x: Kernel support for KEXEC
      powerpc/47x: Enable CRASH_DUMP

 arch/powerpc/Kconfig                |    4 +-
 arch/powerpc/boot/dts/bluestone.dts |   25 +++++
 arch/powerpc/kernel/misc_32.S       |  203 +++++++++++++++++++++++++++++++++--
 arch/powerpc/platforms/44x/Kconfig  |    2 +
 arch/powerpc/sysdev/ppc4xx_msi.c    |   42 +++++---
 5 files changed, 252 insertions(+), 24 deletions(-)

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Hugh Dickins @ 2012-05-02 20:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, linux-kernel, Paul E. McKenney
In-Reply-To: <20120501232516.GR2441@linux.vnet.ibm.com>

On Tue, 1 May 2012, Paul E. McKenney wrote:
> > > > > On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > > > > > 
> > > > > > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > > > > > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> > > > > > Call Trace:
> > > > > > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > > > > > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > > > > > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > > > > > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > > > > > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > > > > > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30

Got it at last.  Embarrassingly obvious.  __rcu_read_lock() and
__rcu_read_unlock() are not safe to be using __this_cpu operations,
the cpu may change in between the rmw's read and write: they should
be using this_cpu operations (or, I put preempt_disable/enable in the
__rcu_read_unlock below).  __this_cpus there work out fine on x86,
which was given good instructions to use; but not so well on PowerPC.

I've been running successfully for an hour now with the patch below;
but I expect you'll want to consider the tradeoffs, and may choose a
different solution.

Hugh

--- 3.4-rc4-next-20120427/include/linux/rcupdate.h	2012-04-28 09:26:38.000000000 -0700
+++ testing/include/linux/rcupdate.h	2012-05-02 11:46:06.000000000 -0700
@@ -159,7 +159,7 @@ DECLARE_PER_CPU(struct task_struct *, rc
  */
 static inline void __rcu_read_lock(void)
 {
-	__this_cpu_inc(rcu_read_lock_nesting);
+	this_cpu_inc(rcu_read_lock_nesting);
 	barrier(); /* Keep code within RCU read-side critical section. */
 }
 
--- 3.4-rc4-next-20120427/kernel/rcupdate.c	2012-04-28 09:26:40.000000000 -0700
+++ testing/kernel/rcupdate.c	2012-05-02 11:44:13.000000000 -0700
@@ -72,6 +72,7 @@ DEFINE_PER_CPU(struct task_struct *, rcu
  */
 void __rcu_read_unlock(void)
 {
+	preempt_disable();
 	if (__this_cpu_read(rcu_read_lock_nesting) != 1)
 		__this_cpu_dec(rcu_read_lock_nesting);
 	else {
@@ -83,13 +84,14 @@ void __rcu_read_unlock(void)
 		barrier();  /* ->rcu_read_unlock_special load before assign */
 		__this_cpu_write(rcu_read_lock_nesting, 0);
 	}
-#ifdef CONFIG_PROVE_LOCKING
+#if 1 /* CONFIG_PROVE_LOCKING */
 	{
 		int rln = __this_cpu_read(rcu_read_lock_nesting);
 
-		WARN_ON_ONCE(rln < 0 && rln > INT_MIN / 2);
+		BUG_ON(rln < 0 && rln > INT_MIN / 2);
 	}
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
+	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
 
--- 3.4-rc4-next-20120427/kernel/sched/core.c	2012-04-28 09:26:40.000000000 -0700
+++ testing/kernel/sched/core.c	2012-05-01 22:40:46.000000000 -0700
@@ -2024,7 +2024,7 @@ asmlinkage void schedule_tail(struct tas
 {
 	struct rq *rq = this_rq();
 
-	rcu_switch_from(prev);
+	/* rcu_switch_from(prev); */
 	rcu_switch_to();
 	finish_task_switch(rq, prev);
 
@@ -7093,6 +7093,10 @@ void __might_sleep(const char *file, int
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
+		"cpu=%d preempt_count=%x preempt_offset=%x rcu_nesting=%x nesting_save=%x\n",
+		raw_smp_processor_id(), preempt_count(), preempt_offset,
+		rcu_preempt_depth(), current->rcu_read_lock_nesting_save); 
+	printk(KERN_ERR
 		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
 			in_atomic(), irqs_disabled(),
 			current->pid, current->comm);

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Paul E. McKenney @ 2012-05-02 20:49 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linuxppc-dev, linux-kernel, Paul E. McKenney
In-Reply-To: <alpine.LSU.2.00.1205021324430.24246@eggly.anvils>

On Wed, May 02, 2012 at 01:25:30PM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2012, Paul E. McKenney wrote:
> > > > > > On Mon, 2012-04-30 at 15:37 -0700, Hugh Dickins wrote:
> > > > > > > 
> > > > > > > BUG: sleeping function called from invalid context at include/linux/pagemap.h:354
> > > > > > > in_atomic(): 0, irqs_disabled(): 0, pid: 6886, name: cc1
> > > > > > > Call Trace:
> > > > > > > [c0000001a99f78e0] [c00000000000f34c] .show_stack+0x6c/0x16c (unreliable)
> > > > > > > [c0000001a99f7990] [c000000000077b40] .__might_sleep+0x11c/0x134
> > > > > > > [c0000001a99f7a10] [c0000000000c6228] .filemap_fault+0x1fc/0x494
> > > > > > > [c0000001a99f7af0] [c0000000000e7c9c] .__do_fault+0x120/0x684
> > > > > > > [c0000001a99f7c00] [c000000000025790] .do_page_fault+0x458/0x664
> > > > > > > [c0000001a99f7e30] [c000000000005868] handle_page_fault+0x10/0x30
> 
> Got it at last.  Embarrassingly obvious.  __rcu_read_lock() and
> __rcu_read_unlock() are not safe to be using __this_cpu operations,
> the cpu may change in between the rmw's read and write: they should
> be using this_cpu operations (or, I put preempt_disable/enable in the
> __rcu_read_unlock below).  __this_cpus there work out fine on x86,
> which was given good instructions to use; but not so well on PowerPC.

Thank you very much for tracking this down!!!

> I've been running successfully for an hour now with the patch below;
> but I expect you'll want to consider the tradeoffs, and may choose a
> different solution.

The thing that puzzles me about this is that the normal path through
the scheduler would save and restore these per-CPU variables to and
from the task structure.  There must be a sneak path through the
scheduler that I failed to account for.

But given your good work, this should be easy for me to chase down
even on my x86-based laptop -- just convert from __this_cpu_inc() to a
read-inc-delay-write sequence.  And check that the underlying variable
didn't change in the meantime.  And dump an ftrace if it did.  ;-)

Thank you again, Hugh!

							Thanx, Paul

> Hugh
> 
> --- 3.4-rc4-next-20120427/include/linux/rcupdate.h	2012-04-28 09:26:38.000000000 -0700
> +++ testing/include/linux/rcupdate.h	2012-05-02 11:46:06.000000000 -0700
> @@ -159,7 +159,7 @@ DECLARE_PER_CPU(struct task_struct *, rc
>   */
>  static inline void __rcu_read_lock(void)
>  {
> -	__this_cpu_inc(rcu_read_lock_nesting);
> +	this_cpu_inc(rcu_read_lock_nesting);
>  	barrier(); /* Keep code within RCU read-side critical section. */
>  }
> 
> --- 3.4-rc4-next-20120427/kernel/rcupdate.c	2012-04-28 09:26:40.000000000 -0700
> +++ testing/kernel/rcupdate.c	2012-05-02 11:44:13.000000000 -0700
> @@ -72,6 +72,7 @@ DEFINE_PER_CPU(struct task_struct *, rcu
>   */
>  void __rcu_read_unlock(void)
>  {
> +	preempt_disable();
>  	if (__this_cpu_read(rcu_read_lock_nesting) != 1)
>  		__this_cpu_dec(rcu_read_lock_nesting);
>  	else {
> @@ -83,13 +84,14 @@ void __rcu_read_unlock(void)
>  		barrier();  /* ->rcu_read_unlock_special load before assign */
>  		__this_cpu_write(rcu_read_lock_nesting, 0);
>  	}
> -#ifdef CONFIG_PROVE_LOCKING
> +#if 1 /* CONFIG_PROVE_LOCKING */
>  	{
>  		int rln = __this_cpu_read(rcu_read_lock_nesting);
> 
> -		WARN_ON_ONCE(rln < 0 && rln > INT_MIN / 2);
> +		BUG_ON(rln < 0 && rln > INT_MIN / 2);
>  	}
>  #endif /* #ifdef CONFIG_PROVE_LOCKING */
> +	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> 
> --- 3.4-rc4-next-20120427/kernel/sched/core.c	2012-04-28 09:26:40.000000000 -0700
> +++ testing/kernel/sched/core.c	2012-05-01 22:40:46.000000000 -0700
> @@ -2024,7 +2024,7 @@ asmlinkage void schedule_tail(struct tas
>  {
>  	struct rq *rq = this_rq();
> 
> -	rcu_switch_from(prev);
> +	/* rcu_switch_from(prev); */
>  	rcu_switch_to();
>  	finish_task_switch(rq, prev);
> 
> @@ -7093,6 +7093,10 @@ void __might_sleep(const char *file, int
>  		"BUG: sleeping function called from invalid context at %s:%d\n",
>  			file, line);
>  	printk(KERN_ERR
> +		"cpu=%d preempt_count=%x preempt_offset=%x rcu_nesting=%x nesting_save=%x\n",
> +		raw_smp_processor_id(), preempt_count(), preempt_offset,
> +		rcu_preempt_depth(), current->rcu_read_lock_nesting_save); 
> +	printk(KERN_ERR
>  		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
>  			in_atomic(), irqs_disabled(),
>  			current->pid, current->comm);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply

* Re: linux-next ppc64: RCU mods cause __might_sleep BUGs
From: Benjamin Herrenschmidt @ 2012-05-02 21:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Paul E. McKenney, Paul E. McKenney, linuxppc-dev, linux-kernel
In-Reply-To: <alpine.LSU.2.00.1205021324430.24246@eggly.anvils>

On Wed, 2012-05-02 at 13:25 -0700, Hugh Dickins wrote:
> Got it at last.  Embarrassingly obvious.  __rcu_read_lock() and
> __rcu_read_unlock() are not safe to be using __this_cpu operations,
> the cpu may change in between the rmw's read and write: they should
> be using this_cpu operations (or, I put preempt_disable/enable in the
> __rcu_read_unlock below).  __this_cpus there work out fine on x86,
> which was given good instructions to use; but not so well on PowerPC.
> 
> I've been running successfully for an hour now with the patch below;
> but I expect you'll want to consider the tradeoffs, and may choose a
> different solution.

Didn't Linus recently rant about these __this_cpu vs this_cpu nonsense ?

I thought that was going out..

Cheers,
Ben.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox