* [PATCH] Correct offset_in_page mask bits in function edac_mc_handle_ce.
From: Shaohui Xie @ 2011-07-14 2:01 UTC (permalink / raw)
To: linuxppc-dev
Cc: kumar.gala, Kai.Jiang, djiang, ptyser, dougthompson, scottwood,
akpm
From: Kai.Jiang <Kai.Jiang@freescale.com>
Parameter offset_in_page in function edac_mc_handle_ce should be masked
the higher bits above the page size, not the lower bits. The original input
sometimes causes crash.
Signed-off-by: Kai.Jiang <Kai.Jiang@freescale.com>
---
drivers/edac/mpc85xx_edac.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
index 38ab8e2..b048a5f 100644
--- a/drivers/edac/mpc85xx_edac.c
+++ b/drivers/edac/mpc85xx_edac.c
@@ -854,11 +854,11 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
mpc85xx_mc_printk(mci, KERN_ERR, "PFN out of range!\n");
if (err_detect & DDR_EDE_SBE)
- edac_mc_handle_ce(mci, pfn, err_addr & PAGE_MASK,
+ edac_mc_handle_ce(mci, pfn, err_addr & ~PAGE_MASK,
syndrome, row_index, 0, mci->ctl_name);
if (err_detect & DDR_EDE_MBE)
- edac_mc_handle_ue(mci, pfn, err_addr & PAGE_MASK,
+ edac_mc_handle_ue(mci, pfn, err_addr & ~PAGE_MASK,
row_index, mci->ctl_name);
out_be32(pdata->mc_vbase + MPC85XX_MC_ERR_DETECT, err_detect);
--
1.6.4
^ permalink raw reply related
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-14 4:35 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <20110714103418.7ef25b68@kryten>
> I took a quick look and we are stuck in update_group_power:
>
> do {
> power += group->cpu_power;
> group = group->next;
> } while (group != child->groups);
>
> I looked at the linked list:
>
> child->groups = c000007b2f74ff00
>
> and dumping group as we go:
>
> c000007b2f74ff00 c000007b2f760000 c000007b2fb60000 c000007b2ff60000
>
> at this point we end up in a cycle and never make it back to
> child->groups:
>
> c000008b2e68ff00 c000008b2e6a0000 c000008b2eaa0000 c000008b2eea0000
> c000009aee77ff00 c000009aee790000 c000009aeeb90000 c000009aeef90000
> c00000bafde91800 c00000dafdf81800 c00000fafce81800 c000011afdf71800
> c00001226e70ff00 c00001226e720000 c00001226eb20000 c00001226ef20000
> c000008b2e68ff00
It looks like the group ends up in two lists. I added a BUG_ON to
ensure we never link a group twice, and it hits.
I also printed out the cpu spans as we walk through build_sched_groups:
0 1 2 3
0 4 8 12 16 20 24 28
0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
0 128 256 384
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19
20 21 22 23
24 25 26 27
28 29 30 31
32 33 34 35
32 36 40 44 48 52 56 60
36 37 38 39
40 41 42 43
44 45 46 47
48 49 50 51
52 53 54 55
56 57 58 59
60 61 62 63
64 65 66 67
64 68 72 76 80 84 88 92
68 69 70 71
72 73 74 75
76 77 78 79
80 81 82 83
84 85 86 87
88 89 90 91
92 93 94 95
96 97 98 99
96 100 104 108 112 116 120 124
100 101 102 103
104 105 106 107
108 109 110 111
112 113 114 115
116 117 118 119
120 121 122 123
124 125 126 127
128 129 130 131
128 132 136 140 144 148 152 156
Duplicates start appearing in this span:
128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
So it looks like the overlap of the 16 entry spans
(SD_NODES_PER_DOMAIN) is causing our problem.
Anton
Index: linux-2.6-work/kernel/sched.c
===================================================================
--- linux-2.6-work.orig/kernel/sched.c 2011-07-11
12:48:48.251087767 +1000 +++ linux-2.6-work/kernel/sched.c
2011-07-14 14:19:45.867094044 +1000 @@ -7021,6 +7021,7 @@
build_sched_groups(struct sched_domain *
cpumask_clear(sched_group_cpus(sg));
sg->cpu_power = 0;
+ BUG_ON(sg->next);
for_each_cpu(j, span) {
if (get_group(j, sdd, NULL) != group)
Anton
^ permalink raw reply
* Re: [v3] booke/kprobe: Fix stack corrupt issue when kprobe 'stwu'
From: tiejun.chen @ 2011-07-14 11:56 UTC (permalink / raw)
To: benh, linuxppc-dev; +Cc: Tiejun Chen
In-Reply-To: <1310383915-30543-2-git-send-email-tiejun.chen@windriver.com>
Tiejun Chen wrote:
> v1 -> v2: when allocate pgirq_ctx, use 'hw_cpu' to identify cpu ID in exc_lvl_early_init().
> v2 -> v3: add that specific return-from-program-exc to restore necessary thread info then
> we can withdraw the original patch #2.
Any feedback?
Thanks
Tiejun
>
> Tiejun
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
^ permalink raw reply
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-14 13:16 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <20110714143521.5fe4fab6@kryten>
On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:
> I also printed out the cpu spans as we walk through build_sched_groups:
> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
>=20
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.
Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the 15 nearest nodes to the one we've got
without consideration for overlap with previously generated spans.
Now that used to work because it used to simply allocate a new group
instead of using the existing one.
The thing is, we want to track state unique to a group of cpus, so
duplicating that is iffy.
Otoh, making these masks non-overlapping is probably sub-optimal from a
NUMA pov.
Looking at a slightly simpler set-up (4 socket AMD magny-cours):
$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10
We can translate that into groups like
{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...
and we can easily see there's overlap there as well in the NUMA layout
itself.
This seems to suggest we need to separate the unique state from the
sched_group.
Now all I need is a way to not consume gobs of memory.. /me goes prod
^ permalink raw reply
* Re: [v3 PATCH 1/1] booke/kprobe: make program exception to use one dedicated exception stack
From: Kumar Gala @ 2011-07-14 13:27 UTC (permalink / raw)
To: Tiejun Chen; +Cc: linuxppc-dev
In-Reply-To: <1310383915-30543-1-git-send-email-tiejun.chen@windriver.com>
On Jul 11, 2011, at 6:31 AM, Tiejun Chen wrote:
> When kprobe these operations such as store-and-update-word for SP(r1),
>=20
> stwu r1, -A(r1)
>=20
> The program exception is triggered, and PPC always allocate an =
exception frame
> as shown as the follows:
>=20
> old r1 ----------
> ...
> nip
> gpr[2] ~ gpr[31]
> gpr[1] <--------- old r1 is stored.
> gpr[0]
> -------- <--------- pr_regs @offset 16 bytes
> padding
> STACK_FRAME_REGS_MARKER
> LR
> back chain
> new r1 ----------
> Then emulate_step() will emulate this instruction, 'stwu'. Actually =
its
> equivalent to:
> 1> Update pr_regs->gpr[1] =3D mem[old r1 + (-A)]
> 2> stw [old r1], mem[old r1 + (-A)]
>=20
> Please notice the stack based on new r1 may be covered with mem[old r1
> +(-A)] when addr[old r1 + (-A)] < addr[old r1 + sizeof(an exception =
frame0].
> So the above 2# operation will overwirte something to break this =
exception
> frame then unexpected kernel problem will be issued.
>=20
> So looks we have to implement independed interrupt stack for PPC =
program
> exception when CONFIG_BOOKE is enabled. Here we can use
> EXC_LEVEL_EXCEPTION_PROLOG to replace original NORMAL_EXCEPTION_PROLOG
> for program exception if CONFIG_BOOKE. Then its always safe for kprobe
> with independed exc stack from one pre-allocated and dedicated =
thread_info.
> Actually this is just waht we did for critical/machine check =
exceptions
> on PPC.
>=20
> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
> ---
I'm still very confused why we need a unique stack frame for =
kprobe/program exceptions on book-e devices.
Can you explain this further.
- k
^ permalink raw reply
* Re: [PATCH v2] powerpc32: Kexec support for PPC440X chipsets
From: Sebastian Andrzej Siewior @ 2011-07-14 15:40 UTC (permalink / raw)
To: Suzuki K. Poulose
Cc: kexec, lkml, Josh Boyer, Paul Mackerras, linux ppc dev,
Vivek Goyal
In-Reply-To: <20110712064356.28567.48722.stgit@suzukikp.in.ibm.com>
* Suzuki K. Poulose | 2011-07-12 12:14:16 [+0530]:
>Changes from V1: Uses a tmp mapping in the other address space to setup
> the 1:1 mapping (suggested by Sebastian Andrzej Siewior).
>
>Note: Should we do the same for kernel entry code for PPC44x ?
You have one the kernel mapping. Then you make your temporary mapping
which might be just the one valid TLB entry (after you invalidated the
others). After that you setup multiple mappings to cover 0..2GiB address
space. Since the kernel runs at 0xC.... and you need 0x0.. for the same
memory you end up at some point with two mappings for the same memory.
The reference manual says one should not have two active mappings for
piece of memory. That is why the "other address space" is should be
used.
So I think using this "other address space" for the entry code isn't a
bad idea.
However, I'm fine with this patch.
Sebastian
^ permalink raw reply
* Re: [v3 PATCH 1/1] booke/kprobe: make program exception to use one dedicated exception stack
From: Scott Wood @ 2011-07-14 15:53 UTC (permalink / raw)
To: Kumar Gala; +Cc: Tiejun Chen, linuxppc-dev
In-Reply-To: <CC4F4DC6-0682-4FC6-8D89-CE9AFEA7AD85@kernel.crashing.org>
On Thu, 14 Jul 2011 08:27:44 -0500
Kumar Gala <galak@kernel.crashing.org> wrote:
>
> On Jul 11, 2011, at 6:31 AM, Tiejun Chen wrote:
>
> > When kprobe these operations such as store-and-update-word for SP(r1),
> >
> > stwu r1, -A(r1)
> >
> > The program exception is triggered, and PPC always allocate an exception frame
> > as shown as the follows:
> >
> > old r1 ----------
> > ...
> > nip
> > gpr[2] ~ gpr[31]
> > gpr[1] <--------- old r1 is stored.
> > gpr[0]
> > -------- <--------- pr_regs @offset 16 bytes
> > padding
> > STACK_FRAME_REGS_MARKER
> > LR
> > back chain
> > new r1 ----------
> > Then emulate_step() will emulate this instruction, 'stwu'. Actually its
> > equivalent to:
> > 1> Update pr_regs->gpr[1] = mem[old r1 + (-A)]
> > 2> stw [old r1], mem[old r1 + (-A)]
> >
> > Please notice the stack based on new r1 may be covered with mem[old r1
> > +(-A)] when addr[old r1 + (-A)] < addr[old r1 + sizeof(an exception frame0].
> > So the above 2# operation will overwirte something to break this exception
> > frame then unexpected kernel problem will be issued.
> >
> > So looks we have to implement independed interrupt stack for PPC program
> > exception when CONFIG_BOOKE is enabled. Here we can use
> > EXC_LEVEL_EXCEPTION_PROLOG to replace original NORMAL_EXCEPTION_PROLOG
> > for program exception if CONFIG_BOOKE. Then its always safe for kprobe
> > with independed exc stack from one pre-allocated and dedicated thread_info.
> > Actually this is just waht we did for critical/machine check exceptions
> > on PPC.
> >
> > Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
> > ---
>
> I'm still very confused why we need a unique stack frame for kprobe/program exceptions on book-e devices.
I don't know why it's booke-specific (or why they're emulating rather than
using single-step), but the problem is trying to emulate an instruction
that's expanding the non-exception stack into the area that the exception
handler is sitting on, and writing to that new stack area.
-Scott
^ permalink raw reply
* Re: [PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Ayman El-Khashab @ 2011-07-14 16:04 UTC (permalink / raw)
To: Tony Breeds; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <20110714011627.GJ20597@ozlabs.org>
Thanks Tony, some comments below.
On Thu, Jul 14, 2011 at 11:16:27AM +1000, Tony Breeds wrote:
>
> > +static void __init ppc460sx_pciex_check_link(struct ppc4xx_pciex_port *port)
> > +{
> > + void __iomem *mbase;
> > + int attempt = 50;
> > +
> > + port->link = 0;
> > +
> > + mbase = ioremap(port->cfg_space.start + 0x00000000, 0x1000);
>
> Why + 0x00000000 ? ppc4xx_pciex_port_setup_hose() does:
> mbase = ioremap(port->cfg_space.start + 0x10000000, 0x1000);
> so isn't one of these statements is wrong?
yes, that doesn't look right. I'll verify that and make
sure that it works correctly and resubmit the patch.
>
> > + if (mbase == NULL) {
> > + printk(KERN_ERR "%s: Can't map internal config space !",
> > + port->node->full_name);
> > + goto done;
> > + }
> > +
> > + while (attempt && (0 == (in_le32(mbase + PECFG_460SX_DLLSTA)
> > + & 0x00000010))) {
>
> Nitpicking, I think it'd be nice if there was #define for 0x00000010
> perhaps: #define PECFG_460SX_DLLSTA_LINKUP 0x00000010
ok.
> >
> > - if (ppc4xx_pciex_hwops->check_link)
> > - ppc4xx_pciex_hwops->check_link(port);
> > -
> > /*
> > * Initialize mapping: disable all regions and configure
> > * CFG and REG regions based on resources in the device tree
> > */
> > ppc4xx_pciex_port_init_mapping(port);
> >
> > + if (ppc4xx_pciex_hwops->check_link)
> > + ppc4xx_pciex_hwops->check_link(port);
> > +
>
> Why move this? You already iorempat the cfg space.
This was what I was asking about before. The reason that I
swapped the order of the init_mapping and check_link is
because the init_mapping currently sets up the cfgbax
registers. Those setup the base address of the
configuration space on the PLB side of the bus. As far as I
could determine, I cannot access the config space until
those registers are configured. I need to touch the config
space in order to do the check_link b/c the 460sx uses the
extended config space to keep track of the link status. I
looked at init mapping and based on what it did I did not
see any potential adverse effects.
> > out_le32(mbase + PECFG_POM2LAH, pciah);
> > @@ -1591,8 +1632,7 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port *port,
> > dcr_write(port->dcrs, DCRO_PEGPL_OMR3BAH, lah);
> > dcr_write(port->dcrs, DCRO_PEGPL_OMR3BAL, lal);
> > dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKH, 0x7fffffff);
> > - /* Note that 3 here means enabled | IO space !!! */
> > - dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL, sa | 3);
> > + dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL, sa);
> > break;
> > }
>
> I think you really want to check the definitions for OMRs 2 and 3 to verify that this is right.
Thanks, good catch. I'll change the first case block to
include a switch on the 460sx. The first case statement
needs to be | 0x5, while the others need to stay 0x3.
Ayman
^ permalink raw reply
* Re: [PATCH] Correct offset_in_page mask bits in function edac_mc_handle_ce.
From: Andrew Morton @ 2011-07-14 21:50 UTC (permalink / raw)
To: Shaohui Xie
Cc: kumar.gala, Kai.Jiang, djiang, ptyser, dougthompson, scottwood,
linuxppc-dev
In-Reply-To: <1310608902-14221-1-git-send-email-Shaohui.Xie@freescale.com>
On Thu, 14 Jul 2011 10:01:42 +0800
Shaohui Xie <Shaohui.Xie@freescale.com> wrote:
> From: Kai.Jiang <Kai.Jiang@freescale.com>
>
> Parameter offset_in_page in function edac_mc_handle_ce should be masked
> the higher bits above the page size, not the lower bits. The original input
> sometimes causes crash.
>
> Signed-off-by: Kai.Jiang <Kai.Jiang@freescale.com>
> ---
> drivers/edac/mpc85xx_edac.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
> index 38ab8e2..b048a5f 100644
> --- a/drivers/edac/mpc85xx_edac.c
> +++ b/drivers/edac/mpc85xx_edac.c
> @@ -854,11 +854,11 @@ static void mpc85xx_mc_check(struct mem_ctl_info *mci)
> mpc85xx_mc_printk(mci, KERN_ERR, "PFN out of range!\n");
>
> if (err_detect & DDR_EDE_SBE)
> - edac_mc_handle_ce(mci, pfn, err_addr & PAGE_MASK,
> + edac_mc_handle_ce(mci, pfn, err_addr & ~PAGE_MASK,
> syndrome, row_index, 0, mci->ctl_name);
>
> if (err_detect & DDR_EDE_MBE)
> - edac_mc_handle_ue(mci, pfn, err_addr & PAGE_MASK,
> + edac_mc_handle_ue(mci, pfn, err_addr & ~PAGE_MASK,
> row_index, mci->ctl_name);
>
> out_be32(pdata->mc_vbase + MPC85XX_MC_ERR_DETECT, err_detect);
The patch should have had your Signed-off-by:, as described in
Documentation/SubmittingPatches. I added your s-o-b to my copy -
please ack this.
>From the description it appears to me that this fix should be
backported into 2.6.39.x and earlier. Do you agree?
The way to indicate this intention is to add "Cc: <stable@kernel.org>"
to the changelog. I added that tag to my copy of this patch.
^ permalink raw reply
* Re: [PATCH 1/1] powerpc/4xx: enable and fix pcie gen1/gen2 on the 460sx
From: Benjamin Herrenschmidt @ 2011-07-14 22:36 UTC (permalink / raw)
To: Ayman El-Khashab; +Cc: linuxppc-dev, Paul Mackerras, linux-kernel
In-Reply-To: <20110714160456.GA6859@crust.elkhashab.com>
On Thu, 2011-07-14 at 11:04 -0500, Ayman El-Khashab wrote:
> Thanks Tony, some comments below.
>
> On Thu, Jul 14, 2011 at 11:16:27AM +1000, Tony Breeds wrote:
> >
> > > +static void __init ppc460sx_pciex_check_link(struct ppc4xx_pciex_port *port)
> > > +{
> > > + void __iomem *mbase;
> > > + int attempt = 50;
> > > +
> > > + port->link = 0;
> > > +
> > > + mbase = ioremap(port->cfg_space.start + 0x00000000, 0x1000);
> >
> > Why + 0x00000000 ? ppc4xx_pciex_port_setup_hose() does:
> > mbase = ioremap(port->cfg_space.start + 0x10000000, 0x1000);
> > so isn't one of these statements is wrong?
>
> yes, that doesn't look right. I'll verify that and make
> sure that it works correctly and resubmit the patch.
The state of that top bit is obscure. I couldn't figure out from docs
whether it's actually useful or not (it doesn't seem to make a
difference on the HW we've played with here).
It's possible that some parts need it to access the RC config space vs
emitting type 1 cycles, and some don't, in which case the address
decoding just wraps and the bit is ignored ?
> > > + if (mbase == NULL) {
> > > + printk(KERN_ERR "%s: Can't map internal config space !",
> > > + port->node->full_name);
> > > + goto done;
> > > + }
> > > +
> > > + while (attempt && (0 == (in_le32(mbase + PECFG_460SX_DLLSTA)
> > > + & 0x00000010))) {
> >
> > Nitpicking, I think it'd be nice if there was #define for 0x00000010
> > perhaps: #define PECFG_460SX_DLLSTA_LINKUP 0x00000010
>
> ok.
>
> > >
> > > - if (ppc4xx_pciex_hwops->check_link)
> > > - ppc4xx_pciex_hwops->check_link(port);
> > > -
> > > /*
> > > * Initialize mapping: disable all regions and configure
> > > * CFG and REG regions based on resources in the device tree
> > > */
> > > ppc4xx_pciex_port_init_mapping(port);
> > >
> > > + if (ppc4xx_pciex_hwops->check_link)
> > > + ppc4xx_pciex_hwops->check_link(port);
> > > +
> >
> > Why move this? You already iorempat the cfg space.
>
> This was what I was asking about before. The reason that I
> swapped the order of the init_mapping and check_link is
> because the init_mapping currently sets up the cfgbax
> registers. Those setup the base address of the
> configuration space on the PLB side of the bus. As far as I
> could determine, I cannot access the config space until
> those registers are configured. I need to touch the config
> space in order to do the check_link b/c the 460sx uses the
> extended config space to keep track of the link status. I
> looked at init mapping and based on what it did I did not
> see any potential adverse effects.
I think it makes sense to setup mappings before checking the link. There
may be a couple of mechanical issues in the code with this, I haven't
looked, but I agree in principle.
> > > out_le32(mbase + PECFG_POM2LAH, pciah);
> > > @@ -1591,8 +1632,7 @@ static int __init ppc4xx_setup_one_pciex_POM(struct ppc4xx_pciex_port *port,
> > > dcr_write(port->dcrs, DCRO_PEGPL_OMR3BAH, lah);
> > > dcr_write(port->dcrs, DCRO_PEGPL_OMR3BAL, lal);
> > > dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKH, 0x7fffffff);
> > > - /* Note that 3 here means enabled | IO space !!! */
> > > - dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL, sa | 3);
> > > + dcr_write(port->dcrs, DCRO_PEGPL_OMR3MSKL, sa);
> > > break;
> > > }
> >
> > I think you really want to check the definitions for OMRs 2 and 3 to verify that this is right.
>
> Thanks, good catch. I'll change the first case block to
> include a switch on the 460sx. The first case statement
> needs to be | 0x5, while the others need to stay 0x3.
Please make it constants using #define or something... those bits tend
to subtly change between ASICs too I noticed.
Cheers,
Ben.
> Ayman
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Anton Blanchard @ 2011-07-15 0:45 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <1310649379.2586.273.camel@twins>
Hi,
> Urgh.. so those spans are generated by sched_domain_node_span(), and
> it looks like that simply picks the 15 nearest nodes to the one we've
> got without consideration for overlap with previously generated spans.
I do wonder if we need this extra level at all on ppc64. From memory
SGI added it for their massive setups, but our largest setup is 32 nodes
and breaking that down into 16 node chunks seems overkill.
I just realised we were setting NEWIDLE on our node definition and that
was causing large amounts of rebalance work even with
SD_NODES_PER_DOMAIN=16.
After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
pretty good.
Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
extra level is only used by SGI boxes.
Anton
^ permalink raw reply
* [PATCH] powerpc: add CFAR to oops output
From: Michael Neuling @ 2011-07-15 5:25 UTC (permalink / raw)
To: Paul Mackerras, benh; +Cc: linuxppc-dev
Now we have the CFAR saved add it to the oops output.
Signed-off-by: Michael Neuling <mikey@neuling.org>
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 91e52df..b14ab9e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -650,6 +650,8 @@ void show_regs(struct pt_regs * regs)
printbits(regs->msr, msr_bits);
printk(" CR: %08lx XER: %08lx\n", regs->ccr, regs->xer);
trap = TRAP(regs);
+ if ((regs->trap != 0xc00) && cpu_has_feature(CPU_FTR_CFAR))
+ printk("CFAR: "REG"\n", regs->orig_gpr3);
if (trap == 0x300 || trap == 0x600)
#ifdef CONFIG_PPC_ADV_DEBUG_REGS
printk("DEAR: "REG", ESR: "REG"\n", regs->dar, regs->dsisr);
^ permalink raw reply related
* Re: [v3 PATCH 1/1] booke/kprobe: make program exception to use one dedicated exception stack
From: tiejun.chen @ 2011-07-15 5:28 UTC (permalink / raw)
To: Kumar Gala; +Cc: linuxppc-dev
In-Reply-To: <CC4F4DC6-0682-4FC6-8D89-CE9AFEA7AD85@kernel.crashing.org>
Kumar Gala wrote:
> On Jul 11, 2011, at 6:31 AM, Tiejun Chen wrote:
>
>> When kprobe these operations such as store-and-update-word for SP(r1),
>>
>> stwu r1, -A(r1)
>>
>> The program exception is triggered, and PPC always allocate an exception frame
>> as shown as the follows:
>>
>> old r1 ----------
>> ...
>> nip
>> gpr[2] ~ gpr[31]
>> gpr[1] <--------- old r1 is stored.
>> gpr[0]
>> -------- <--------- pr_regs @offset 16 bytes
>> padding
>> STACK_FRAME_REGS_MARKER
>> LR
>> back chain
>> new r1 ----------
>> Then emulate_step() will emulate this instruction, 'stwu'. Actually its
>> equivalent to:
>> 1> Update pr_regs->gpr[1] = mem[old r1 + (-A)]
>> 2> stw [old r1], mem[old r1 + (-A)]
>>
>> Please notice the stack based on new r1 may be covered with mem[old r1
>> +(-A)] when addr[old r1 + (-A)] < addr[old r1 + sizeof(an exception frame0].
>> So the above 2# operation will overwirte something to break this exception
>> frame then unexpected kernel problem will be issued.
>>
>> So looks we have to implement independed interrupt stack for PPC program
>> exception when CONFIG_BOOKE is enabled. Here we can use
>> EXC_LEVEL_EXCEPTION_PROLOG to replace original NORMAL_EXCEPTION_PROLOG
>> for program exception if CONFIG_BOOKE. Then its always safe for kprobe
>> with independed exc stack from one pre-allocated and dedicated thread_info.
>> Actually this is just waht we did for critical/machine check exceptions
>> on PPC.
>>
>> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
>> ---
>
> I'm still very confused why we need a unique stack frame for kprobe/program exceptions on book-e devices.
Its a bug at least for Book-E. And if you'd like to check another topic thread,
"[BUG?]3.0-rc4+ftrace+kprobe: set kprobe at instruction 'stwu' lead to system
crash/freeze", you can know this story completely :)
This bug should not be reproduced on PPC64 with the exception prolog/endlog
dedicated to PPC64. But I have no enough time to check other PPC32 & !BOOKE so
I'm not sure if we should extend this modification.
>
> Can you explain this further.
I can show one of those issued examples.
Here we kprobe the entry point of show_interrupts().
(gdb) disassemble show_interrupts
Dump of assembler code for function show_interrupts:
0xc0004ff4 <+0>: stwu r1,-48(r1)
0xc0004ff8 <+4>: mflr r0
I add some printk() inside pre_handler() to show pt_regs->gpr[1] and pt_regs->nip.
------
......
Planted kprobe at c0004ff4
pre_handler: p->addr = 0xc0004ff4, nip = 0xc0004ff4, msr = 0x29000
gpr[1] = de767e50.
nip = c0004ff4.
When hit this instruction, emulate_step() would emulate this instruction as follows:
------
#1> current pr_regs->gpr[1] = 0xde767e50 - 48 = 0xde767e20;
#2> stw (previous pr_regs->gpr[1]), @(current pr_regs->gpr[1])
==> stw (0xde767e50), 0xde767e20
But after this kprobe process something would be rewrite incorrectly:
------
......
post_handler: p->addr = 0xc0004ff4, msr = 0x29000
gpr[1] = de767e20.
nip = de767e54.
^
If everything is good nip should equal to (0xc0004ff4 + 0x4). But looks its
reset with (0xde767e50 + 0x4) via the above #2 operation. So why?
As I understand kprobe use 'trap' to enter the program exception. At now PR = 0
so the kernel allocate an exception frame as normal.
---------------- old r1[0xde767e50]
1 pt_regs->result
2 pt_regs->dsisr
3 pt_regs->dar
4 pt_regs->trap
5 pt_regs->mq
6 pt_regs->ccr
7 pt_regs->xer
8 pt_regs->link
9 pt_regs->ctr
10 pt_regs->orig_gpr3
11 pt_regs->msr
12 pt_regs->nip <-- @ 0xde767e50 - 12 x 4 = 0xde767e20
......
----------------- new r1[0xde767e50 - INT_FRAME_SIZE]
I think you can understand why pt_regs->nip is broken :) So the root cause is an
exception frame and the kprobed function stack frame are always overlap. And
then someone member inside an exception frame may be corrupted by that emulated
stw operation.
So I think we have to use one unique stack frame to avoid this when enable
CONFIG_KPROBES. Especially for Book-E we can refer easily machine
check/critical/debug exception implementation to do this like my patch.
Tiejun
>
> - k
>
>
^ permalink raw reply
* RE: [PATCH] Correct offset_in_page mask bits in function edac_mc_handle_ce.
From: Xie Shaohui-B21989 @ 2011-07-15 6:21 UTC (permalink / raw)
To: Andrew Morton
Cc: Wood Scott-B07421, djiang@mvista.com, ptyser@xes-inc.com,
Jiang Kai-B18973, dougthompson@xmission.com, Gala Kumar-B11780,
linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20110714145042.9d2e88de.akpm@linux-foundation.org>
>-----Original Message-----
>From: Andrew Morton [mailto:akpm@linux-foundation.org]
>Sent: Friday, July 15, 2011 5:51 AM
>To: Xie Shaohui-B21989
>Cc: linuxppc-dev@lists.ozlabs.org; Gala Kumar-B11780; Wood Scott-B07421;
>ptyser@xes-inc.com; djiang@mvista.com; dougthompson@xmission.com; Jiang
>Kai-B18973
>Subject: Re: [PATCH] Correct offset_in_page mask bits in function
>edac_mc_handle_ce.
>
>On Thu, 14 Jul 2011 10:01:42 +0800
>Shaohui Xie <Shaohui.Xie@freescale.com> wrote:
>
>> From: Kai.Jiang <Kai.Jiang@freescale.com>
>>
>> Parameter offset_in_page in function edac_mc_handle_ce should be
>> masked the higher bits above the page size, not the lower bits. The
>> original input sometimes causes crash.
>>
>> Signed-off-by: Kai.Jiang <Kai.Jiang@freescale.com>
>> ---
>> drivers/edac/mpc85xx_edac.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
>> index 38ab8e2..b048a5f 100644
>> --- a/drivers/edac/mpc85xx_edac.c
>> +++ b/drivers/edac/mpc85xx_edac.c
>> @@ -854,11 +854,11 @@ static void mpc85xx_mc_check(struct mem_ctl_info
>*mci)
>> mpc85xx_mc_printk(mci, KERN_ERR, "PFN out of range!\n");
>>
>> if (err_detect & DDR_EDE_SBE)
>> - edac_mc_handle_ce(mci, pfn, err_addr & PAGE_MASK,
>> + edac_mc_handle_ce(mci, pfn, err_addr & ~PAGE_MASK,
>> syndrome, row_index, 0, mci->ctl_name);
>>
>> if (err_detect & DDR_EDE_MBE)
>> - edac_mc_handle_ue(mci, pfn, err_addr & PAGE_MASK,
>> + edac_mc_handle_ue(mci, pfn, err_addr & ~PAGE_MASK,
>> row_index, mci->ctl_name);
>>
>> out_be32(pdata->mc_vbase + MPC85XX_MC_ERR_DETECT, err_detect);
>
>The patch should have had your Signed-off-by:, as described in
>Documentation/SubmittingPatches. I added your s-o-b to my copy - please
>ack this.
[Xie Shaohui] OK, thank you.
>
>From the description it appears to me that this fix should be backported
>into 2.6.39.x and earlier. Do you agree?
[Xie Shaohui] Actually this patch was created based on 2.6.32, it can be ap=
plied=20
to 2.6.39 and 3.0 without any change.
>
>The way to indicate this intention is to add "Cc: <stable@kernel.org>"
>to the changelog. I added that tag to my copy of this patch.
[Xie Shaohui] OK, thank you.
Best Regards,=20
Shaohui Xie
^ permalink raw reply
* [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Shan Hai @ 2011-07-15 8:07 UTC (permalink / raw)
To: benh, paulus
Cc: tony.luck, a.p.zijlstra, linux-kernel, cmetcalf, dhowells, tglx,
walken, linuxppc-dev, akpm
The following test case could reveal a bug in the futex_lock_pi()
BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
on Powerpc e500 core.
Cause: The linux kernel on the e500 core has no write permission on
the COW page, refer the head comment of the following test code.
ftrace on test case:
[000] 353.990181: futex_lock_pi_atomic <-futex_lock_pi
[000] 353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[snip]
[000] 353.990191: do_page_fault <-handle_page_fault
[000] 353.990192: bad_page_fault <-handle_page_fault
[000] 353.990193: search_exception_tables <-bad_page_fault
[snip]
[000] 353.990199: get_user_pages <-fault_in_user_writeable
[snip]
[000] 353.990208: mark_page_accessed <-follow_page
[000] 353.990222: futex_lock_pi_atomic <-futex_lock_pi
[snip]
[000] 353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
[ a loop occures here ]
/*
* A test case for revealing an infinite loop in the futex_lock_pi().
* - there are 2 processes, parent and a child
* - the parent process allocates and initializes a pthread_mutex MUTEX in a
* shared memory region
* - the parent process holds the MUTEX and do long time computing
* - the child process tries to hold the MUTEX during the parent holding it and
* traps into the kernel for waiting on the MUTEX because of contention
* - the kernel loops in futex_lock_pi()
* - result of 'top' command reveals that the system usage of CPU is 100%
*/
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
enum { SHM_INIT, SHM_GET };
enum { PARENT, CHILD };
#define FIXED_MMAP_ADDR 0x20000000
#define MMAP_SIZE 0x2000000
static int shmid;
static char shm_name[100];
static int sleep_period = 100000;
void * shmem_init(int flag)
{
int start = FIXED_MMAP_ADDR;
int memory_size = MMAP_SIZE;
int mode = 0666;
void *addr;
int ret;
sprintf(shm_name, "/shmem_1234");
shmid = shm_open (shm_name, O_RDWR | O_EXCL | O_CREAT | O_TRUNC, mode);
if (shmid < 0) {
if (errno == EEXIST) {
printf ("shm_open: %s\n", strerror(errno));
shmid = shm_open (shm_name, O_RDWR, mode);
} else {
printf("failed to shm_open, err=%s\n", strerror(errno));
return NULL;
}
}
ret = fcntl (shmid, F_SETFD, FD_CLOEXEC);
if (ret < 0) {
printf("fcntl: %s\n", strerror(errno));
return NULL;
}
ret = ftruncate (shmid, memory_size);
if (ret < 0) {
printf("ftruncate: %s\n", strerror(errno));
return NULL;
}
addr = mmap ((void *)start, memory_size, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_FIXED, shmid, 0);
if (addr == MAP_FAILED) {
printf ("mmap: %s\n", strerror(errno));
close (shmid);
shm_unlink (shm_name);
return NULL;
}
if (flag == SHM_INIT)
memset(addr, 0, memory_size);
return (void *)start;
}
pthread_mutex_t * shmem_mutex_init(int flag)
{
pthread_mutex_t * pmutex = (pthread_mutex_t *)shmem_init(flag);
pthread_mutexattr_t attr;
if (flag == SHM_INIT) {
pthread_mutexattr_init (&attr);
pthread_mutexattr_setpshared (&attr, PTHREAD_PROCESS_SHARED);
pthread_mutexattr_setprotocol (&attr, PTHREAD_PRIO_INHERIT);
pthread_mutexattr_setrobust_np (&attr,
PTHREAD_MUTEX_STALLED_NP);
pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ERRORCHECK);
if (pthread_mutex_init (pmutex, &attr) != 0) {
printf("Init mutex failed, err=%s\n", strerror(errno));
pthread_mutexattr_destroy (&attr);
return NULL;
}
}
return pmutex;
}
void long_running_task(int flag)
{
static int counter = 0;
if (flag == PARENT)
usleep(5*sleep_period);
else
usleep(3*sleep_period);
counter = (counter + 1) % 100;
printf("%d: completed %d computing\n", getpid(), counter);
}
void sig_handler(int signum)
{
close(shmid);
shm_unlink(shm_name);
exit(0);
}
int main(int argc, char *argv[])
{
pthread_mutex_t *mutex_parent, *mutex_child;
signal(SIGUSR1, sig_handler);
if (fork()) { /* parent process */
if ((mutex_parent = shmem_mutex_init(SHM_INIT)) == NULL) {
printf("failed to get the shmem_mutex\n");
exit(-1);
}
while (1) {
printf("%d: try to hold the lock\n", getpid());
pthread_mutex_lock(mutex_parent);
printf("%d: got the lock\n", getpid());
long_running_task(PARENT);
pthread_mutex_unlock(mutex_parent);
printf("%d: released the lock\n", getpid());
}
} else { /* child process */
usleep(sleep_period);
if ((mutex_child = shmem_mutex_init(SHM_GET)) == NULL) {
printf("failed to get the shmem_mutex\n");
exit(-1);
}
while (1) {
printf("%d: try to hold the lock\n", getpid());
pthread_mutex_lock(mutex_child);
printf("%d: got the lock\n", getpid());
long_running_task(CHILD);
pthread_mutex_unlock(mutex_child);
printf("%d: released the lock\n", getpid());
}
}
return 0;
}
---
arch/powerpc/include/asm/futex.h | 11 ++++++++++-
arch/powerpc/include/asm/tlb.h | 25 +++++++++++++++++++++++++
2 files changed, 35 insertions(+), 1 deletions(-)
^ permalink raw reply
* [PATCH 1/1] Fixup write permission of TLB on powerpc e500 core
From: Shan Hai @ 2011-07-15 8:07 UTC (permalink / raw)
To: benh, paulus
Cc: tony.luck, a.p.zijlstra, Shan Hai, linux-kernel, cmetcalf,
dhowells, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1310717238-13857-1-git-send-email-haishan.bai@gmail.com>
The kernel has no write permission on COW pages by default on e500 core, this
will cause endless loop in futex_lock_pi, because futex code assumes the kernel
has write permission on COW pages. Grant write permission to the kernel on COW
pages when access violation page fault occurs.
Signed-off-by: Shan Hai <haishan.bai@gmail.com>
---
arch/powerpc/include/asm/futex.h | 11 ++++++++++-
arch/powerpc/include/asm/tlb.h | 25 +++++++++++++++++++++++++
2 files changed, 35 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/include/asm/futex.h b/arch/powerpc/include/asm/futex.h
index c94e4a3..54c3e74 100644
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -8,6 +8,7 @@
#include <asm/errno.h>
#include <asm/synch.h>
#include <asm/asm-compat.h>
+#include <asm/tlb.h>
#define __futex_atomic_op(insn, ret, oldval, uaddr, oparg) \
__asm__ __volatile ( \
@@ -113,7 +114,15 @@ futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
: "cc", "memory");
*uval = prev;
- return ret;
+
+ /* Futex assumes the kernel has permission to write to
+ * COW pages, grant the kernel write permission on COW
+ * pages because it has none by default.
+ */
+ if (ret == -EFAULT)
+ __tlb_fixup_write_permission(current->mm, (unsigned long)uaddr);
+
+ return ret;
}
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b..3863c6a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -45,5 +45,30 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
#endif
}
+/* Grant write permission to the kernel on a page. */
+static inline void __tlb_fixup_write_permission(struct mm_struct *mm,
+ unsigned long address)
+{
+#if defined(CONFIG_FSL_BOOKE)
+ /* Grant write permission to the kernel on a page by setting TLB.SW
+ * bit, the bit setting operation is tricky here, calling
+ * handle_mm_fault with FAULT_FLAG_WRITE causes _PAGE_DIRTY bit of
+ * the pte to be set, the _PAGE_DIRTY of the pte is translated into
+ * TLB.SW on Powerpc e500 core.
+ */
+
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ if (likely(vma)) {
+ /* only fixup present page */
+ if (follow_page(vma, address, FOLL_WRITE)) {
+ handle_mm_fault(mm, vma, address, FAULT_FLAG_WRITE);
+ flush_tlb_page(vma, address);
+ }
+ }
+#endif
+}
+
#endif /* __KERNEL__ */
#endif /* __ASM_POWERPC_TLB_H */
--
1.7.1
^ permalink raw reply related
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Peter Zijlstra @ 2011-07-15 8:20 UTC (permalink / raw)
To: Shan Hai
Cc: tony.luck, linux-kernel, cmetcalf, dhowells, paulus, tglx, walken,
linuxppc-dev, akpm
In-Reply-To: <1310717238-13857-1-git-send-email-haishan.bai@gmail.com>
On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>=20
> The following test case could reveal a bug in the futex_lock_pi()
>=20
> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()=
=20
> on Powerpc e500 core.
> Cause: The linux kernel on the e500 core has no write permission on
> the COW page, refer the head comment of the following test code.
> =20
> ftrace on test case:
> [000] 353.990181: futex_lock_pi_atomic <-futex_lock_pi
> [000] 353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
> [snip]
> [000] 353.990191: do_page_fault <-handle_page_fault
> [000] 353.990192: bad_page_fault <-handle_page_fault
> [000] 353.990193: search_exception_tables <-bad_page_fault
> [snip]
> [000] 353.990199: get_user_pages <-fault_in_user_writeable
> [snip]
> [000] 353.990208: mark_page_accessed <-follow_page
> [000] 353.990222: futex_lock_pi_atomic <-futex_lock_pi
> [snip]
> [000] 353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
> [ a loop occures here ]
>=20
But but but but, that get_user_pages(.write=3D1, .force=3D0) should result
in a COW break, getting our own writable page.
What is this e500 thing smoking that this doesn't work?
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: MailingLists @ 2011-07-15 8:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tony.luck, linux-kernel, cmetcalf, dhowells, paulus, tglx, walken,
linuxppc-dev, akpm
In-Reply-To: <1310718056.2586.275.camel@twins>
On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>> The following test case could reveal a bug in the futex_lock_pi()
>>
>> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
>> on Powerpc e500 core.
>> Cause: The linux kernel on the e500 core has no write permission on
>> the COW page, refer the head comment of the following test code.
>>
>> ftrace on test case:
>> [000] 353.990181: futex_lock_pi_atomic<-futex_lock_pi
>> [000] 353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>> [snip]
>> [000] 353.990191: do_page_fault<-handle_page_fault
>> [000] 353.990192: bad_page_fault<-handle_page_fault
>> [000] 353.990193: search_exception_tables<-bad_page_fault
>> [snip]
>> [000] 353.990199: get_user_pages<-fault_in_user_writeable
>> [snip]
>> [000] 353.990208: mark_page_accessed<-follow_page
>> [000] 353.990222: futex_lock_pi_atomic<-futex_lock_pi
>> [snip]
>> [000] 353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>> [ a loop occures here ]
>>
>
> But but but but, that get_user_pages(.write=1, .force=0) should result
> in a COW break, getting our own writable page.
>
> What is this e500 thing smoking that this doesn't work?
A page could be set to read only by the kernel (supervisor in the powerpc
literature) on the e500, and that's what the kernel do. Set SW(supervisor
write) bit in the TLB entry to grant write permission to the kernel on a
page.
And further the SW bit is set according to the DIRTY flag of the PTE,
PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
COW occurred, infinite loop followed.
Thanks
Shan Hai
^ permalink raw reply
* Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra @ 2011-07-15 8:37 UTC (permalink / raw)
To: Anton Blanchard; +Cc: mahesh, linuxppc-dev, linux-kernel, mingo, torvalds
In-Reply-To: <20110715104547.29c3c509@kryten>
On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote:
> Hi,
>=20
> > Urgh.. so those spans are generated by sched_domain_node_span(), and
> > it looks like that simply picks the 15 nearest nodes to the one we've
> > got without consideration for overlap with previously generated spans.
>=20
> I do wonder if we need this extra level at all on ppc64. From memory
> SGI added it for their massive setups, but our largest setup is 32 nodes
> and breaking that down into 16 node chunks seems overkill.
>=20
> I just realised we were setting NEWIDLE on our node definition and that
> was causing large amounts of rebalance work even with
> SD_NODES_PER_DOMAIN=3D16.
>=20
> After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
> pretty good.
>=20
> Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
> extra level is only used by SGI boxes.
We can certainly remove the whole topology layer that causes this
problem for 3.0 and try to fix up for 3.1 again.
But I was rather hoping to introduce more of those layers in the near
future, I was hoping to create a layer per node_distance() value, such
that the load-balancing is aware of the interconnects.
Now for that I ran into the exact same problem, and at the time didn't
come up with a solution, but I think I now see a way out.
Something like the below ought to avoid the problem.. makes SGI sad
though :-)
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..877b9f1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7203,7 +7203,7 @@ static struct sched_domain_topology_level default_top=
ology[] =3D {
#endif
{ sd_init_CPU, cpu_cpu_mask, },
#ifdef CONFIG_NUMA
- { sd_init_NODE, cpu_node_mask, },
+// { sd_init_NODE, cpu_node_mask, },
{ sd_init_ALLNODES, cpu_allnodes_mask, },
#endif
{ NULL, },
^ permalink raw reply related
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Peter Zijlstra @ 2011-07-15 8:44 UTC (permalink / raw)
To: MailingLists
Cc: tony.luck, linux-kernel, cmetcalf, dhowells, paulus, tglx, walken,
linuxppc-dev, akpm
In-Reply-To: <4E1FFC7B.4000209@gmail.com>
On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
> On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
> > On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
> >> The following test case could reveal a bug in the futex_lock_pi()
> >>
> >> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
> >> on Powerpc e500 core.
> >> Cause: The linux kernel on the e500 core has no write permission on
> >> the COW page, refer the head comment of the following test co=
de.
> >>
> >> ftrace on test case:
> >> [000] 353.990181: futex_lock_pi_atomic<-futex_lock_pi
> >> [000] 353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
> >> [snip]
> >> [000] 353.990191: do_page_fault<-handle_page_fault
> >> [000] 353.990192: bad_page_fault<-handle_page_fault
> >> [000] 353.990193: search_exception_tables<-bad_page_fault
> >> [snip]
> >> [000] 353.990199: get_user_pages<-fault_in_user_writeable
> >> [snip]
> >> [000] 353.990208: mark_page_accessed<-follow_page
> >> [000] 353.990222: futex_lock_pi_atomic<-futex_lock_pi
> >> [snip]
> >> [000] 353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
> >> [ a loop occures here ]
> >>
> >
> > But but but but, that get_user_pages(.write=3D1, .force=3D0) should res=
ult
> > in a COW break, getting our own writable page.
> >
> > What is this e500 thing smoking that this doesn't work?
>=20
> A page could be set to read only by the kernel (supervisor in the powerpc
> literature) on the e500, and that's what the kernel do. Set SW(supervisor
> write) bit in the TLB entry to grant write permission to the kernel on a
> page.
>=20
> And further the SW bit is set according to the DIRTY flag of the PTE,
> PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
> page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
> COW occurred, infinite loop followed.
I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
mmap_sem, an pagefaul_disable() is akin to preemp_disable() on mainline.
Also get_user_pages() fully expects to be able to schedule, and in fact
can call the full pf handler path all by its lonesome self.
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Benjamin Herrenschmidt @ 2011-07-15 9:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tony.luck, Shan Hai, linux-kernel, cmetcalf, dhowells, paulus,
tglx, walken, linuxppc-dev, akpm
In-Reply-To: <1310718056.2586.275.camel@twins>
On Fri, 2011-07-15 at 10:20 +0200, Peter Zijlstra wrote:
> But but but but, that get_user_pages(.write=1, .force=0) should result
> in a COW break, getting our own writable page.
>
> What is this e500 thing smoking that this doesn't work?
Right. That should have triggered the cow & flushed the TLB entry....
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Shan Hai @ 2011-07-15 9:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tony.luck, linux-kernel, cmetcalf, dhowells, paulus, tglx, walken,
linuxppc-dev, akpm
In-Reply-To: <1310719445.2586.288.camel@twins>
On 07/15/2011 04:44 PM, Peter Zijlstra wrote:
> On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
>> On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
>>> On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
>>>> The following test case could reveal a bug in the futex_lock_pi()
>>>>
>>>> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
>>>> on Powerpc e500 core.
>>>> Cause: The linux kernel on the e500 core has no write permission on
>>>> the COW page, refer the head comment of the following test code.
>>>>
>>>> ftrace on test case:
>>>> [000] 353.990181: futex_lock_pi_atomic<-futex_lock_pi
>>>> [000] 353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>>>> [snip]
>>>> [000] 353.990191: do_page_fault<-handle_page_fault
>>>> [000] 353.990192: bad_page_fault<-handle_page_fault
>>>> [000] 353.990193: search_exception_tables<-bad_page_fault
>>>> [snip]
>>>> [000] 353.990199: get_user_pages<-fault_in_user_writeable
>>>> [snip]
>>>> [000] 353.990208: mark_page_accessed<-follow_page
>>>> [000] 353.990222: futex_lock_pi_atomic<-futex_lock_pi
>>>> [snip]
>>>> [000] 353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
>>>> [ a loop occures here ]
>>>>
>>> But but but but, that get_user_pages(.write=1, .force=0) should result
>>> in a COW break, getting our own writable page.
>>>
>>> What is this e500 thing smoking that this doesn't work?
>> A page could be set to read only by the kernel (supervisor in the powerpc
>> literature) on the e500, and that's what the kernel do. Set SW(supervisor
>> write) bit in the TLB entry to grant write permission to the kernel on a
>> page.
>>
>> And further the SW bit is set according to the DIRTY flag of the PTE,
>> PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
>> page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
>> COW occurred, infinite loop followed.
> I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
> mmap_sem, an pagefaul_disable() is akin to preemp_disable() on mainline.
>
> Also get_user_pages() fully expects to be able to schedule, and in fact
> can call the full pf handler path all by its lonesome self.
The whole scenario should be,
- the child process triggers a page fault at the first time access to
the lock, and it got its own writable page, but its *clean* for
the reason just for checking the status of the lock.
I am sorry for above "unbreakable COW".
- the futex_lock_pi() is invoked because of the lock contention,
and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
it found out the lock is free so tries to write to the lock for
reservation, a page fault occurs, because the page is read only
for kernel(e500 specific), and returns -EFAULT to the caller
- the fault_in_user_writeable() tries to fix the fault,
but from the get_user_pages() view everything is ok, because
the COW was already broken, retry futex_lock_pi_atomic()
- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
another write protection page fault
- infinite loop
Thanks
Shan Hai
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Benjamin Herrenschmidt @ 2011-07-15 9:07 UTC (permalink / raw)
To: MailingLists
Cc: tony.luck, Peter Zijlstra, linux-kernel, cmetcalf, dhowells,
paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <4E1FFC7B.4000209@gmail.com>
On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
> A page could be set to read only by the kernel (supervisor in the
> powerpc
> literature) on the e500, and that's what the kernel do. Set
> SW(supervisor
> write) bit in the TLB entry to grant write permission to the kernel on
> a
> page.
>
> And further the SW bit is set according to the DIRTY flag of the PTE,
> PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
> page fault, the PTE.DIRTY never can be set, so do the SW bit,
> unbreakable
> COW occurred, infinite loop followed.
That would be it ... the SW dirty and young tracking relies on faults to
fixup things in handle_pte_fault(). If the "disable page fault" thingy
happens before we get there, then we have a pretty nasty bug. Note that
this will hit more than just e500 (and in fact any architecture that
relies on SW to do dirty and young tracking).
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Benjamin Herrenschmidt @ 2011-07-15 9:12 UTC (permalink / raw)
To: Shan Hai
Cc: tony.luck, Peter Zijlstra, linux-kernel, cmetcalf, dhowells,
paulus, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <4E20037C.5070506@gmail.com>
On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:
> The whole scenario should be,
> - the child process triggers a page fault at the first time access to
> the lock, and it got its own writable page, but its *clean* for
> the reason just for checking the status of the lock.
> I am sorry for above "unbreakable COW".
> - the futex_lock_pi() is invoked because of the lock contention,
> and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
> it found out the lock is free so tries to write to the lock for
> reservation, a page fault occurs, because the page is read only
> for kernel(e500 specific), and returns -EFAULT to the caller
There is nothing e500 specific about user read only pages being read
only for kernel. All architectures behave the same way here afaik.
_However_ there is something not totally x86-like in the fact that we
require handle_mm_fault() to deal with dirty and young tracking, which
means that we -will- fault for a non-dirty writeable page or for any
non-young page. It's quite possible that the page fault disabling occurs
before that and thus breaks those architectures (it's not only e500 and
afaik not only powerpc) while x86 works fine due to HW update of dirty
and young.
It might be something to look into.
Cheers,
Ben.
> - the fault_in_user_writeable() tries to fix the fault,
> but from the get_user_pages() view everything is ok, because
> the COW was already broken, retry futex_lock_pi_atomic()
> - futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
> another write protection page fault
> - infinite loop
>
>
^ permalink raw reply
* Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core
From: Peter Zijlstra @ 2011-07-15 9:50 UTC (permalink / raw)
To: Shan Hai
Cc: tony.luck, linux-kernel, cmetcalf, dhowells, paulus, tglx, walken,
linuxppc-dev, akpm
In-Reply-To: <4E20037C.5070506@gmail.com>
On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:
> The whole scenario should be,
> - the child process triggers a page fault at the first time access to
> the lock, and it got its own writable page, but its *clean* for
> the reason just for checking the status of the lock.
> I am sorry for above "unbreakable COW".
> - the futex_lock_pi() is invoked because of the lock contention,
> and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
> it found out the lock is free so tries to write to the lock for
> reservation, a page fault occurs, because the page is read only
> for kernel(e500 specific), and returns -EFAULT to the caller
> - the fault_in_user_writeable() tries to fix the fault,
> but from the get_user_pages() view everything is ok, because
> the COW was already broken, retry futex_lock_pi_atomic()
but that's a bug right there, gup(.write=3D1) _should_ be a complete write
fault, and as such toggle your sw dirty/young tracking.
> - futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
> another write protection page fault
> - infinite loop
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox