* Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Linus Torvalds @ 2020-08-29 21:20 UTC (permalink / raw)
To: Guenter Roeck
Cc: Herbert Xu, Joerg Roedel, Linux Kernel Mailing List, Li Yang,
Zhang Wei, Vinod Koul, dma, Andrew Morton, linuxppc-dev,
Dan Williams, Luc Van Oostenryck
In-Reply-To: <59cc6c99-9894-08b3-1075-2156e39bfc8e@roeck-us.net>
On Sat, Aug 29, 2020 at 1:40 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> Except for
>
> CHECK: spaces preferred around that '+' (ctx:VxV)
> #29: FILE: drivers/dma/fsldma.h:223:
> + u32 val_lo = in_be32((u32 __iomem *)addr+1);
Added spaces.
> I don't see anything wrong with it either, so
>
> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
>
> Since I didn't see the real problem with the original code,
> I'd take that with a grain of salt, though.
Well, honestly, the old code was so confused that just making it build
is clearly already an improvement even if everything else were to be
wrong.
So I committed my "fix". If it turns out there's more wrong in there
and somebody tests it, we can fix it again. But now it hopefully
compiles, at least.
My bet is that if that driver ever worked on ppc32, it will continue
to work whatever we do to that function.
I _think_ the old code happened to - completely by mistake - get the
value right for the case of "little endian access, with dma_addr_t
being 32-bit". Because then it would still read the upper bits wrong,
but the cast to dma_addr_t would then throw those bits away. And the
lower bits would be right.
But for big-endian accesses or for ARCH_DMA_ADDR_T_64BIT it really
looks like it always returned a completely incorrect value.
And again - the driver may have worked even with that completely
incorrect value, since the use of it seems to be very incidental.
In either case ("it didn't work before" or "it worked because the
value doesn't really matter"), I don't think I could possibly have
made things worse.
Famous last words.
Linus
^ permalink raw reply
* Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Guenter Roeck @ 2020-08-29 20:40 UTC (permalink / raw)
To: Linus Torvalds, Luc Van Oostenryck
Cc: Herbert Xu, Joerg Roedel, Linux Kernel Mailing List, Li Yang,
Zhang Wei, Vinod Koul, dma, Andrew Morton, linuxppc-dev,
Dan Williams
In-Reply-To: <CAHk-=whH0ApHy0evN0q6AwQ+-a5RK56oMkYkkCJtTMnaq4FrNQ@mail.gmail.com>
On 8/29/20 10:29 AM, Linus Torvalds wrote:
> On Sat, Aug 29, 2020 at 5:46 AM Luc Van Oostenryck
> <luc.vanoostenryck@gmail.com> wrote:
>>
>> But the pointer is already 32-bit, so simply cast the pointer to u32.
>
> Yeah, that code was completely pointless. If the pointer had actually
> been 64-bit, the old code would have warned too.
>
> The odd thing is that the fsl_iowrite64() functions make sense. It's
> only the fsl_ioread64() functions that seem to be written by somebody
> who is really confused.
>
> That said, this patch only humors the confusion. The cast to 'u32' is
> completely pointless. In fact, it seems to be actively wrong, because
> it means that the later "fsl_addr + 1" is done entirely incorrectly -
> it now literally adds "1" to an integer value, while the iowrite()
> functions will add one to a "u32 __iomem *" pointer (so will do
> pointer arithmetic, and add 4).
>
Outch.
> So this code has never ever worked correctly to begin with, but the
> patches to fix the warning miss the point. The problem isn't the
> warning, the problem is that the code is broken and completely wrong
> to begin with.
>
> And the "lower_32_bits()" thing has always been pure and utter
> confusion and complete garbage.
>
> I *think* the right patch is the one attached, but since this code is
> clearly utterly broken, I'd want somebody to test it.
>
> It has probably never ever worked on 32-bit powerpc, or did so purely
> by mistake (perhaps because nobody really cares - the only 64-bit use
> is this:
>
> static dma_addr_t get_cdar(struct fsldma_chan *chan)
> {
> return FSL_DMA_IN(chan, &chan->regs->cdar, 64) & ~FSL_DMA_SNEN;
> }
>
> and there are two users of that: one which ignores the return value,
> and one that looks like it might end up half-way working even if the
> value read was garbage (it's used only to compare against a "current
> descriptor" value).
>
> Anyway, the fix is definitely not to just shut up the warning. The
> warning is only a sign of utter confusion in that driver.
>
> Can somebody with the hardware test this on 32-bit ppc?
>
> And if not (judging by just how broken those functions are, maybe it
> never did work), can somebody with a ppc32 setup at least compile-test
> this patch and look at whether it makes sense, in ways the old code
> did not.
>
A bit more careful this time. For the attached patch:
Compile-tested-by: Guenter Roeck <linux@roeck-us.net>
Except for
CHECK: spaces preferred around that '+' (ctx:VxV)
#29: FILE: drivers/dma/fsldma.h:223:
+ u32 val_lo = in_be32((u32 __iomem *)addr+1);
I don't see anything wrong with it either, so
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Since I didn't see the real problem with the original code,
I'd take that with a grain of salt, though.
Guenter
^ permalink raw reply
* Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Linus Torvalds @ 2020-08-29 17:29 UTC (permalink / raw)
To: Luc Van Oostenryck
Cc: Herbert Xu, Joerg Roedel, Linux Kernel Mailing List, Li Yang,
Zhang Wei, Vinod Koul, dma, Andrew Morton, linuxppc-dev,
Dan Williams, Guenter Roeck
In-Reply-To: <20200829124538.7475-1-luc.vanoostenryck@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2203 bytes --]
On Sat, Aug 29, 2020 at 5:46 AM Luc Van Oostenryck
<luc.vanoostenryck@gmail.com> wrote:
>
> But the pointer is already 32-bit, so simply cast the pointer to u32.
Yeah, that code was completely pointless. If the pointer had actually
been 64-bit, the old code would have warned too.
The odd thing is that the fsl_iowrite64() functions make sense. It's
only the fsl_ioread64() functions that seem to be written by somebody
who is really confused.
That said, this patch only humors the confusion. The cast to 'u32' is
completely pointless. In fact, it seems to be actively wrong, because
it means that the later "fsl_addr + 1" is done entirely incorrectly -
it now literally adds "1" to an integer value, while the iowrite()
functions will add one to a "u32 __iomem *" pointer (so will do
pointer arithmetic, and add 4).
So this code has never ever worked correctly to begin with, but the
patches to fix the warning miss the point. The problem isn't the
warning, the problem is that the code is broken and completely wrong
to begin with.
And the "lower_32_bits()" thing has always been pure and utter
confusion and complete garbage.
I *think* the right patch is the one attached, but since this code is
clearly utterly broken, I'd want somebody to test it.
It has probably never ever worked on 32-bit powerpc, or did so purely
by mistake (perhaps because nobody really cares - the only 64-bit use
is this:
static dma_addr_t get_cdar(struct fsldma_chan *chan)
{
return FSL_DMA_IN(chan, &chan->regs->cdar, 64) & ~FSL_DMA_SNEN;
}
and there are two users of that: one which ignores the return value,
and one that looks like it might end up half-way working even if the
value read was garbage (it's used only to compare against a "current
descriptor" value).
Anyway, the fix is definitely not to just shut up the warning. The
warning is only a sign of utter confusion in that driver.
Can somebody with the hardware test this on 32-bit ppc?
And if not (judging by just how broken those functions are, maybe it
never did work), can somebody with a ppc32 setup at least compile-test
this patch and look at whether it makes sense, in ways the old code
did not.
Linus
[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 1167 bytes --]
drivers/dma/fsldma.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
index 56f18ae99233..c574d223d52e 100644
--- a/drivers/dma/fsldma.h
+++ b/drivers/dma/fsldma.h
@@ -205,10 +205,10 @@ struct fsldma_chan {
#else
static u64 fsl_ioread64(const u64 __iomem *addr)
{
- u32 fsl_addr = lower_32_bits(addr);
- u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
+ u32 val_lo = in_le32((u32 __iomem *)addr);
+ u32 val_hi = in_le32((u32 __iomem *)addr + 1);
- return fsl_addr_hi | in_le32((u32 *)fsl_addr);
+ return ((u64)val_hi << 32) + val_lo;
}
static void fsl_iowrite64(u64 val, u64 __iomem *addr)
@@ -219,10 +219,10 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
static u64 fsl_ioread64be(const u64 __iomem *addr)
{
- u32 fsl_addr = lower_32_bits(addr);
- u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
+ u32 val_hi = in_be32((u32 __iomem *)addr);
+ u32 val_lo = in_be32((u32 __iomem *)addr+1);
- return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
+ return ((u64)val_hi << 32) + val_lo;
}
static void fsl_iowrite64be(u64 val, u64 __iomem *addr)
^ permalink raw reply related
* Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Christophe Leroy @ 2020-08-29 15:58 UTC (permalink / raw)
To: Guenter Roeck, Luc Van Oostenryck
Cc: Herbert Xu, Joerg Roedel, Linus Torvalds, linux-kernel, Li Yang,
Zhang Wei, Vinod Koul, dmaengine, Andrew Morton, linuxppc-dev,
Dan Williams
In-Reply-To: <20200829150551.GA27225@roeck-us.net>
Le 29/08/2020 à 17:05, Guenter Roeck a écrit :
> On Sat, Aug 29, 2020 at 02:45:38PM +0200, Luc Van Oostenryck wrote:
>> For ppc32, the functions fsl_ioread64() & fsl_ioread64be()
>> use lower_32_bits() as a fancy way to cast the pointer to u32
>> in order to do non-atomic 64-bit IO.
>>
>> But the pointer is already 32-bit, so simply cast the pointer to u32.
>>
>> This fixes a compile error introduced by
>> ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits")
>>
>> Fixes: ef91bb196b0db1013ef8705367bc2d7944ef696b
>
> checkpatch complains about this and prefers
>
> Fixes: ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits")
Checkpatch also complains about spacing:
CHECK:SPACING: No space is necessary after a cast
#39: FILE: drivers/dma/fsldma.h:208:
+ u32 fsl_addr = (u32) addr;
CHECK:SPACING: No space is necessary after a cast
#48: FILE: drivers/dma/fsldma.h:222:
+ u32 fsl_addr = (u32) addr;
total: 0 errors, 0 warnings, 2 checks, 16 lines checked
Christophe
>
> Otherwise
>
> Tested-by: Guenter Roeck <linux@roeck-us.net>
>
>> Reported-by: Guenter Roeck <linux@roeck-us.net>
>> Cc: Li Yang <leoyang.li@nxp.com>
>> Cc: Zhang Wei <zw@zh-kernel.org>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Vinod Koul <vkoul@kernel.org>
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: dmaengine@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
>> ---
>> drivers/dma/fsldma.h | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
>> index 56f18ae99233..6f6fa7641fa2 100644
>> --- a/drivers/dma/fsldma.h
>> +++ b/drivers/dma/fsldma.h
>> @@ -205,7 +205,7 @@ struct fsldma_chan {
>> #else
>> static u64 fsl_ioread64(const u64 __iomem *addr)
>> {
>> - u32 fsl_addr = lower_32_bits(addr);
>> + u32 fsl_addr = (u32) addr;
>> u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
>>
>> return fsl_addr_hi | in_le32((u32 *)fsl_addr);
>> @@ -219,7 +219,7 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
>>
>> static u64 fsl_ioread64be(const u64 __iomem *addr)
>> {
>> - u32 fsl_addr = lower_32_bits(addr);
>> + u32 fsl_addr = (u32) addr;
>> u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
>>
>> return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
>> --
>> 2.28.0
>>
^ permalink raw reply
* Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Guenter Roeck @ 2020-08-29 15:05 UTC (permalink / raw)
To: Luc Van Oostenryck
Cc: Herbert Xu, Joerg Roedel, linuxppc-dev, Li Yang, Zhang Wei,
Vinod Koul, dmaengine, Andrew Morton, Linus Torvalds,
Dan Williams, linux-kernel
On Sat, Aug 29, 2020 at 02:45:38PM +0200, Luc Van Oostenryck wrote:
> For ppc32, the functions fsl_ioread64() & fsl_ioread64be()
> use lower_32_bits() as a fancy way to cast the pointer to u32
> in order to do non-atomic 64-bit IO.
>
> But the pointer is already 32-bit, so simply cast the pointer to u32.
>
> This fixes a compile error introduced by
> ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits")
>
> Fixes: ef91bb196b0db1013ef8705367bc2d7944ef696b
checkpatch complains about this and prefers
Fixes: ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits")
Otherwise
Tested-by: Guenter Roeck <linux@roeck-us.net>
> Reported-by: Guenter Roeck <linux@roeck-us.net>
> Cc: Li Yang <leoyang.li@nxp.com>
> Cc: Zhang Wei <zw@zh-kernel.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Vinod Koul <vkoul@kernel.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: dmaengine@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
> ---
> drivers/dma/fsldma.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
> index 56f18ae99233..6f6fa7641fa2 100644
> --- a/drivers/dma/fsldma.h
> +++ b/drivers/dma/fsldma.h
> @@ -205,7 +205,7 @@ struct fsldma_chan {
> #else
> static u64 fsl_ioread64(const u64 __iomem *addr)
> {
> - u32 fsl_addr = lower_32_bits(addr);
> + u32 fsl_addr = (u32) addr;
> u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
>
> return fsl_addr_hi | in_le32((u32 *)fsl_addr);
> @@ -219,7 +219,7 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
>
> static u64 fsl_ioread64be(const u64 __iomem *addr)
> {
> - u32 fsl_addr = lower_32_bits(addr);
> + u32 fsl_addr = (u32) addr;
> u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
>
> return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
> --
> 2.28.0
>
^ permalink raw reply
* [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()
From: Luc Van Oostenryck @ 2020-08-29 12:45 UTC (permalink / raw)
To: Herbert Xu
Cc: linux-kernel, Joerg Roedel, linuxppc-dev, Li Yang, Zhang Wei,
Vinod Koul, Luc Van Oostenryck, dmaengine, Andrew Morton,
Linus Torvalds, Dan Williams, Guenter Roeck
In-Reply-To: <20200829105116.GA246533@roeck-us.net>
For ppc32, the functions fsl_ioread64() & fsl_ioread64be()
use lower_32_bits() as a fancy way to cast the pointer to u32
in order to do non-atomic 64-bit IO.
But the pointer is already 32-bit, so simply cast the pointer to u32.
This fixes a compile error introduced by
ef91bb196b0d ("kernel.h: Silence sparse warning in lower_32_bits")
Fixes: ef91bb196b0db1013ef8705367bc2d7944ef696b
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Li Yang <leoyang.li@nxp.com>
Cc: Zhang Wei <zw@zh-kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vinod Koul <vkoul@kernel.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: dmaengine@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
---
drivers/dma/fsldma.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h
index 56f18ae99233..6f6fa7641fa2 100644
--- a/drivers/dma/fsldma.h
+++ b/drivers/dma/fsldma.h
@@ -205,7 +205,7 @@ struct fsldma_chan {
#else
static u64 fsl_ioread64(const u64 __iomem *addr)
{
- u32 fsl_addr = lower_32_bits(addr);
+ u32 fsl_addr = (u32) addr;
u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
return fsl_addr_hi | in_le32((u32 *)fsl_addr);
@@ -219,7 +219,7 @@ static void fsl_iowrite64(u64 val, u64 __iomem *addr)
static u64 fsl_ioread64be(const u64 __iomem *addr)
{
- u32 fsl_addr = lower_32_bits(addr);
+ u32 fsl_addr = (u32) addr;
u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
--
2.28.0
^ permalink raw reply related
* [PATCH net-next] net/wan/fsl_ucc_hdlc: Add MODULE_DESCRIPTION
From: YueHaibing @ 2020-08-29 11:58 UTC (permalink / raw)
To: qiang.zhao, davem, kuba; +Cc: netdev, YueHaibing, linuxppc-dev, linux-kernel
Add missing MODULE_DESCRIPTION.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
---
drivers/net/wan/fsl_ucc_hdlc.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/wan/fsl_ucc_hdlc.c b/drivers/net/wan/fsl_ucc_hdlc.c
index 9edd94679283..dca97cd7c4e7 100644
--- a/drivers/net/wan/fsl_ucc_hdlc.c
+++ b/drivers/net/wan/fsl_ucc_hdlc.c
@@ -1295,3 +1295,4 @@ static struct platform_driver ucc_hdlc_driver = {
module_platform_driver(ucc_hdlc_driver);
MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION(DRV_DESC);
--
2.17.1
^ permalink raw reply related
* Re: [PATCH 08/10] x86: remove address space overrides using set_fs()
From: Christoph Hellwig @ 2020-08-29 9:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
Linux Kernel Mailing List, Al Viro, linux-fsdevel, linuxppc-dev,
Christoph Hellwig
In-Reply-To: <CAHk-=wjxeN+KrCB2TyC5s2RWhz-dWWO8vbBwWcCiKb0+8ipayw@mail.gmail.com>
On Thu, Aug 27, 2020 at 11:15:12AM -0700, Linus Torvalds wrote:
> > SYM_FUNC_START(__put_user_2)
> > - ENTER
> > - mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
> > + LOAD_TASK_SIZE_MAX
> > sub $1,%_ASM_BX
>
> It's even more obvious here. We load a constant and then immediately
> do a "sub $1" on that value.
>
> It's not a huge deal, you don't have to respin the series for this, I
> just wanted to point it out so that people are aware of it and if I
> forget somebody else will hopefully remember that "we should fix that
> too".
The changes seem easy enough and I need to respin at least for the
lkdtm changes, and probaby also for a pending fix in the low-level
x86 code that will hopefully be picked up for 5.9.
But the more important questions is: how do we want to pick the series
up? Especially due to the splice changes I really want it to be in
linux-next as long as possible.
^ permalink raw reply
* Re: [PATCH 05/10] lkdtm: disable set_fs-based tests for !CONFIG_SET_FS
From: Christoph Hellwig @ 2020-08-29 9:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
Linux Kernel Mailing List, Al Viro, linux-fsdevel, linuxppc-dev,
Christoph Hellwig
In-Reply-To: <CAHk-=wipbWD66sU7etETXwDW5NRsU2vnbDpXXQ5i94hiTcawyw@mail.gmail.com>
On Thu, Aug 27, 2020 at 11:06:28AM -0700, Linus Torvalds wrote:
> On Thu, Aug 27, 2020 at 8:00 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > Once we can't manipulate the address limit, we also can't test what
> > happens when the manipulation is abused.
>
> Just remove these tests entirely.
>
> Once set_fs() doesn't exist on x86, the tests no longer make any sense
> what-so-ever, because test coverage will be basically zero.
>
> So don't make the code uglier just to maintain a fiction that
> something is tested when it isn't really.
Sure fine with me unless Kees screams.
^ permalink raw reply
* Re: [PATCH 01/10] fs: don't allow kernel reads and writes without iter ops
From: 'Christoph Hellwig' @ 2020-08-29 9:23 UTC (permalink / raw)
To: David Laight
Cc: linux-arch@vger.kernel.org, Kees Cook, x86@kernel.org,
linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
Al Viro, linux-fsdevel@vger.kernel.org, Linus Torvalds,
'Christoph Hellwig'
In-Reply-To: <e5cb22d53c7c4ebea92443b8b6d86e88@AcuMS.aculab.com>
On Thu, Aug 27, 2020 at 03:58:02PM +0000, David Laight wrote:
> Is there a real justification for that?
> For system calls supplying both methods makes sense to avoid
> the extra code paths for a simple read/write.
Al asked for it as two of our four in-tree instances do have weird
semantics, and we can't change that any more. And the other two
don't make sense to be used with kernel_read and kernel_write (
(/dev/null and /dev/zero).
^ permalink raw reply
* Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper
From: Leonardo Bras @ 2020-08-28 21:28 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <54cfb977-6d30-47b8-b26b-f47efd10299f@ozlabs.ru>
On Fri, 2020-08-28 at 11:58 +1000, Alexey Kardashevskiy wrote:
>
> On 28/08/2020 08:11, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
> > > > static int find_existing_ddw_windows(void)
> > > > {
> > > > int len;
> > > > @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
> > > > if (!direct64)
> > > > continue;
> > > >
> > > > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
> > > > + window = ddw_list_add(pdn, direct64);
> > > > + if (!window || len < sizeof(*direct64)) {
> > >
> > > Since you are touching this code, it looks like the "len <
> > > sizeof(*direct64)" part should go above to "if (!direct64)".
> >
> > Sure, makes sense.
> > It will be fixed for v2.
> >
> > >
> > >
> > > > kfree(window);
> > > > remove_ddw(pdn, true);
> > > > - continue;
> > > > }
> > > > -
> > > > - window->device = pdn;
> > > > - window->prop = direct64;
> > > > - spin_lock(&direct_window_list_lock);
> > > > - list_add(&window->list, &direct_window_list);
> > > > - spin_unlock(&direct_window_list_lock);
> > > > }
> > > >
> > > > return 0;
> > > > @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > > > create.liobn, dn);
> > > >
> > > > - window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > + /* Add new window to existing DDW list */
> > >
> > > The comment seems to duplicate what the ddw_list_add name already suggests.
> >
> > Ok, I will remove it then.
> >
> > > > + window = ddw_list_add(pdn, ddwprop);
> > > > if (!window)
> > > > goto out_clear_window;
> > > >
> > > > @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > > > goto out_free_window;
> > > > }
> > > >
> > > > - window->device = pdn;
> > > > - window->prop = ddwprop;
> > > > - spin_lock(&direct_window_list_lock);
> > > > - list_add(&window->list, &direct_window_list);
> > > > - spin_unlock(&direct_window_list_lock);
> > >
> > > I'd leave these 3 lines here and in find_existing_ddw_windows() (which
> > > would make ddw_list_add -> ddw_prop_alloc). In general you want to have
> > > less stuff to do on the failure path. kmalloc may fail and needs kfree
> > > but you can safely delay list_add (which cannot fail) and avoid having
> > > the lock help twice in the same function (one of them is hidden inside
> > > ddw_list_add).
> > > Not sure if this change is really needed after all. Thanks,
> >
> > I understand this leads to better performance in case anything fails.
> > Also, I think list_add happening in the end is less error-prone (in
> > case the list is checked between list_add and a fail).
>
> Performance was not in my mind at all.
>
> I noticed you remove from a list with a lock help and it was not there
> before and there is a bunch on labels on the exit path and started
> looking for list_add() and if you do not double remove from the list.
>
>
> > But what if we put it at the end?
> > What is the chance of a kzalloc of 4 pointers (struct direct_window)
> > failing after walk_system_ram_range?
>
> This is not about chances really, it is about readability. If let's say
> kmalloc failed, you just to the error exit label and simply call kfree()
> on that pointer, kfree will do nothing if it is NULL already, simple.
> list_del() does not have this simplicity.
>
>
> > Is it not worthy doing that for making enable_ddw() easier to
> > understand?
>
> This is my goal here :)
Ok, it makes sense to me now.
I tried creating list_add() to keep everything related to list-adding
into a single place, instead of splitting it around the other stuff,
but now I understand that the code may look more complex than it was
before, because of the failing path increasing in size.
For me it was strange creating a list entry end not list_add()ing it
right away, but maybe it's something worth to get used to, as it may
increase the failing path simplicity, since list_add() don't fail.
I will try to see if the ddw_list_add() routine would become a useful
ddw_list_entry(), but if not, I will remove this patch.
Alexey, Thank you for reviewing this series!
Best regards,
Leonardo
^ permalink raw reply
* Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()
From: Leonardo Bras @ 2020-08-28 20:41 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <da473389-f921-075a-ec8e-ea516de4f177@ozlabs.ru>
On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
> > I think it would be better to keep the code as much generic as possible
> > regarding page sizes.
>
> Then you need to test it. Does 4K guest even boot (it should but I would
> not bet much on it)?
Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
should be enough, is there any chance to get indirect mapping in qemu
like this? (DDW but with smaller DMA window available)
> > > Because if we want the former (==support), then we'll have to align the
> > > size up to the bigger page size when allocating/zeroing system pages,
> > > etc.
> >
> > This part I don't understand. Why do we need to align everything to the
> > bigger pagesize?
> >
> > I mean, is not that enough that the range [ret, ret + size[ is both
> > allocated by mm and mapped on a iommu range?
> >
> > Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> > IOMMU_PAGE_SIZE() == 64k.
> > Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
> > All the space the user asked for is allocated and mapped for DMA.
>
> The user asked to map 16K, the rest - 48K - is used for something else
> (may be even mapped to another device) but you are making all 64K
> accessible by the device which only should be able to access 16K.
>
> In practice, if this happens, H_PUT_TCE will simply fail.
I have noticed mlx5 driver getting a few bytes in a buffer, and using
iommu_map_page(). It does map a whole page for as few bytes as the user
wants mapped, and the other bytes get used for something else, or just
mapped on another DMA page.
It seems to work fine.
>
>
> > > Bigger pages are not the case here as I understand it.
> >
> > I did not get this part, what do you mean?
>
> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
> supported set of sizes is different for P8/P9 and type of IO (PHB,
> NVLink/CAPI).
>
>
> > > > Update those functions to guarantee alignment with requested size
> > > > using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> > > >
> > > > Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> > > > with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> > > >
> > > > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > > > ---
> > > > arch/powerpc/kernel/iommu.c | 17 +++++++++--------
> > > > 1 file changed, 9 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > > > index 9704f3f76e63..d7086087830f 100644
> > > > --- a/arch/powerpc/kernel/iommu.c
> > > > +++ b/arch/powerpc/kernel/iommu.c
> > > > @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
> > > > }
> > > >
> > > > if (dev)
> > > > - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > > - 1 << tbl->it_page_shift);
> > > > + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
> > >
> > > Run checkpatch.pl, should complain about a long line.
> >
> > It's 86 columns long, which is less than the new limit of 100 columns
> > Linus announced a few weeks ago. checkpatch.pl was updated too:
> > https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
>
> Yay finally :) Thanks,
:)
>
>
> > >
> > > > else
> > > > - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> > > > + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> > > > /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> > > >
> > > > n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> > > > @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > unsigned int order;
> > > > unsigned int nio_pages, io_order;
> > > > struct page *page;
> > > > + size_t size_io = size;
> > > >
> > > > size = PAGE_ALIGN(size);
> > > > order = get_order(size);
> > > > @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
> > > > memset(ret, 0, size);
> > > >
> > > > /* Set up tces to cover the allocated range */
> > > > - nio_pages = size >> tbl->it_page_shift;
> > > > - io_order = get_iommu_order(size, tbl);
> > > > + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> > > > + nio_pages = size_io >> tbl->it_page_shift;
> > > > + io_order = get_iommu_order(size_io, tbl);
> > > > mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> > > > mask >> tbl->it_page_shift, io_order, 0);
> > > > if (mapping == DMA_MAPPING_ERROR) {
> > > > @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
> > > > void *vaddr, dma_addr_t dma_handle)
> > > > {
> > > > if (tbl) {
> > > > - unsigned int nio_pages;
> > > > + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> > > > + unsigned int nio_pages = size_io >> tbl->it_page_shift;
> > > >
> > > > - size = PAGE_ALIGN(size);
> > > > - nio_pages = size >> tbl->it_page_shift;
> > > > iommu_free(tbl, dma_handle, nio_pages);
> > > > +
> > >
> > > Unrelated new line.
> >
> > Will be removed. Thanks!
> >
> > >
> > > > size = PAGE_ALIGN(size);
> > > > free_pages((unsigned long)vaddr, get_order(size));
> > > > }
> > > >
^ permalink raw reply
* Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift
From: Leonardo Bras @ 2020-08-28 19:55 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1e77a3d9-dff9-f58b-45be-77be7cbea41a@ozlabs.ru>
On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>
> On 28/08/2020 01:32, Leonardo Bras wrote:
> > Hello Alexey, thank you for this feedback!
> >
> > On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > > > +#define TCE_RPN_BITS 52 /* Bits 0-51 represent RPN on TCE */
> > >
> > > Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
> > > is the actual limit.
> >
> > I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory addressable in the machine. IIUC, it means we can access physical address up to (1ul << MAX_PHYSMEM_BITS).
> >
> > This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> > 0-51 as the RPN. By looking at code, I understand that it means we may input any address < (1ul << 52) to TCE.
> >
> > In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose we can't ever pass a physical page address over
> > (1ul << 51), and TCE accepts up to (1ul << 52).
> > But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that TCE_RPN_BITS will also be increased, so I think they are independent values.
> >
> > Does it make sense? Please let me know if I am missing something.
>
> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> 6Apr2012.pdf spec says:
>
> "The number of most significant RPN bits implemented in the TCE is
> dependent on the max size of System Memory to be supported by the platform".
>
> IODA3 is the same on this matter.
>
> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> where TCE_RPN_BITS comes from exactly - I have no idea.
Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
hardcoded 40-bit mask (0xfffffffffful), for hard-coded 12-bit (4k)
pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
described as RPN, as described before.
IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
shows system memory mapping into a TCE, and the TCE also has bits 0-51
for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
In fact, by the looks of those figures, the RPN_MASK should always be a
52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
Maybe that's it?
>
>
> > >
> > > > +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
> > > > #define TCE_VALID 0x800 /* TCE valid */
> > > > #define TCE_ALLIO 0x400 /* TCE valid for all lpars */
> > > > #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > index e4198700ed1a..8fe23b7dff3a 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
> > > > u64 proto_tce;
> > > > __be64 *tcep;
> > > > u64 rpn;
> > > > + const unsigned long tceshift = tbl->it_page_shift;
> > > > + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> > > > + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
> > >
> > > Using IOMMU_PAGE_SIZE macro for the page size and not using
> > > IOMMU_PAGE_MASK for the mask - this incosistency makes my small brain
> > > explode :) I understand the history but maaaaan... Oh well, ok.
> > >
> >
> > Yeah, it feels kind of weird after two IOMMU related consts. :)
> > But sure IOMMU_PAGE_MASK() would not be useful here :)
> >
> > And this kind of let me thinking:
> > > > + rpn = __pa(uaddr) >> tceshift;
> > > > + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
> > Why not:
> > rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;
>
> A mask for a page number (but not the address!) hurts my brain, masks
> are good against addresses but numbers should already have all bits
> adjusted imho, may be it is just me :-/
>
>
> >
> > rpn = __pa(uaddr) & rpn_mask;
> > *tcep = cpu_to_be64(proto_tce | rpn)
> >
> > I am usually afraid of changing stuff like this, but I think it's safe.
> >
> > > Good, otherwise. Thanks,
> >
> > Thank you for reviewing!
> >
> >
> >
^ permalink raw reply
* Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition
From: Leonardo Bras @ 2020-08-28 18:36 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <d2d98195-982d-c40a-43bc-5853726ed1d6@ozlabs.ru>
On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > As of today, if the biggest DDW that can be created can't map the whole
> > partition, it's creation is skipped and the default DMA window
> > "ibm,dma-window" is used instead.
> >
> > DDW is 16x bigger than the default DMA window,
>
> 16x only under very specific circumstances which are
> 1. phyp
> 2. sriov
> 3. device class in hmc (or what that priority number is in the lpar config).
Yeah, missing details.
> > having the same amount of
> > pages, but increasing the page size to 64k.
> > Besides larger DMA window,
>
> "Besides being larger"?
You are right there.
>
> > it performs better for allocations over 4k,
>
> Better how?
I was thinking for allocations larger than (512 * 4k), since >2
hypercalls are needed here, and for 64k pages would still be just 1
hypercall up to (512 * 64k).
But yeah, not the usual case anyway.
>
> > so it would be nice to use it instead.
>
> I'd rather say something like:
> ===
> So far we assumed we can map the guest RAM 1:1 to the bus which worked
> with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
> ===
I mixed this in my commit message, it looks like this:
===
powerpc/pseries/iommu: Make use of DDW for indirect mapping
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.
As of today, if the assumed direct mapping is not possible, DDW
creation is skipped and the default DMA window "ibm,dma-window" is used
instead.
The default DMA window uses 4k pages instead of 64k pages, and since
the amount of pages is the same, making use of DDW instead of the
default DMA window for indirect mapping will expand in 16x the amount
of memory that can be mapped on DMA.
The DDW created will be used for direct mapping by default. [...]
===
What do you think?
> > The DDW created will be used for direct mapping by default.
> > If it's not available, indirect mapping will be used instead.
> >
> > For indirect mapping, it's necessary to update the iommu_table so
> > iommu_alloc() can use the DDW created. For this,
> > iommu_table_update_window() is called when everything else succeeds
> > at enable_ddw().
> >
> > Removing the default DMA window for using DDW with indirect mapping
> > is only allowed if there is no current IOMMU memory allocated in
> > the iommu_table. enable_ddw() is aborted otherwise.
> >
> > As there will never have both direct and indirect mappings at the same
> > time, the same property name can be used for the created DDW.
> >
> > So renaming
> > define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > to
> > define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > looks the right thing to do.
>
> I know I suggested this but this does not look so good anymore as I
> suspect it breaks kexec (from older kernel to this one) so you either
> need to check for both DT names or just keep the old one. Changing the
> macro name is fine.
>
Yeah, having 'direct' in the name don't really makes sense if it's used
for indirect mapping. I will just add this new define instead of
replacing the old one, and check for both.
Is that ok?
>
> > To make sure the property differentiates both cases, a new u32 for flags
> > was added at the end of the property, where BIT(0) set means direct
> > mapping.
> >
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 108 +++++++++++++++++++------
> > 1 file changed, 84 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 3a1ef02ad9d5..9544e3c91ced 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
> > __be64 dma_base; /* address hi,lo */
> > __be32 tce_shift; /* ilog2(tce_page_size) */
> > __be32 window_shift; /* ilog2(tce_window_size) */
> > + __be32 flags; /* DDW properties, see bellow */
> > };
> >
> > +#define DDW_FLAGS_DIRECT 0x01
>
> This is set if ((1<<window_shift) >= ddw_memory_hotplug_max()), you
> could simply check window_shift and drop the flags.
>
Yeah, it's better this way, I will revert this.
>
> > +
> > struct direct_window {
> > struct device_node *device;
> > const struct dynamic_dma_window_prop *prop;
> > @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
> > static DEFINE_SPINLOCK(direct_window_list_lock);
> > /* protects initializing window twice for same device */
> > static DEFINE_MUTEX(direct_window_init_mutex);
> > -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> >
> > static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
> > unsigned long num_pfn, const void *arg)
> > @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > if (ret)
> > return;
> >
> > - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> > + win = of_find_property(np, DMA64_PROPNAME, NULL);
> > if (!win)
> > return;
> >
> > @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > np, ret);
> > }
> >
> > -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool *direct_mapping)
> > {
> > struct direct_window *window;
> > const struct dynamic_dma_window_prop *direct64;
> > @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > if (window->device == pdn) {
> > direct64 = window->prop;
> > *dma_addr = be64_to_cpu(direct64->dma_base);
> > + *direct_mapping = be32_to_cpu(direct64->flags) & DDW_FLAGS_DIRECT;
> > found = true;
> > break;
> > }
> > @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
> > if (!firmware_has_feature(FW_FEATURE_LPAR))
> > return 0;
> >
> > - for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
> > - direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
> > + for_each_node_with_property(pdn, DMA64_PROPNAME) {
> > + direct64 = of_get_property(pdn, DMA64_PROPNAME, &len);
> > if (!direct64)
> > continue;
> >
> > @@ -1124,7 +1128,8 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > }
> >
> > static int ddw_property_create(struct property **ddw_win, const char *propname,
> > - u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > + u32 liobn, u64 dma_addr, u32 page_shift,
> > + u32 window_shift, bool direct_mapping)
> > {
> > struct dynamic_dma_window_prop *ddwprop;
> > struct property *win64;
> > @@ -1144,6 +1149,36 @@ static int ddw_property_create(struct property **ddw_win, const char *propname,
> > ddwprop->dma_base = cpu_to_be64(dma_addr);
> > ddwprop->tce_shift = cpu_to_be32(page_shift);
> > ddwprop->window_shift = cpu_to_be32(window_shift);
> > + if (direct_mapping)
> > + ddwprop->flags = cpu_to_be32(DDW_FLAGS_DIRECT);
> > +
> > + return 0;
> > +}
> > +
> > +static int iommu_table_update_window(struct iommu_table **tbl, int nid, unsigned long liobn,
> > + unsigned long win_addr, unsigned long page_shift,
> > + unsigned long window_size)
>
> Rather strange helper imho. I'd extract the most of
> iommu_table_setparms_lpar() into iommu_table_setparms() (except
> of_parse_dma_window) and call new helper from where you call
> iommu_table_update_window; and do
> iommu_pseries_alloc_table/iommu_tce_table_put there.
>
I don't see how to extract iommu_table_setparms_lpar() into
iommu_table_setparms(), they look to be used for different machine
types.
Do mean you extracting most of iommu_table_setparms_lpar() (and maybe
iommu_table_setparms() ) into a new helper, which is called in both
functions and use it instead of iommu_table_update_window() ?
>
> > +{
> > + struct iommu_table *new_tbl, *old_tbl;
> > +
> > + new_tbl = iommu_pseries_alloc_table(nid);
> > + if (!new_tbl)
> > + return -ENOMEM;
> > +
> > + old_tbl = *tbl;
> > + new_tbl->it_index = liobn;
> > + new_tbl->it_offset = win_addr >> page_shift;
> > + new_tbl->it_page_shift = page_shift;
> > + new_tbl->it_size = window_size >> page_shift;
> > + new_tbl->it_base = old_tbl->it_base;
>
> Should not be used in pseries.
>
The point here is to migrate the values from the older tbl to the
newer. I Would like to understand why this is bad, if it will still be
'unused' as the older tbl.
>
> > + new_tbl->it_busno = old_tbl->it_busno;
> > + new_tbl->it_blocksize = old_tbl->it_blocksize;
>
> 16 for pseries and does not change (may be even make it a macro).
>
> > + new_tbl->it_type = old_tbl->it_type;
>
> TCE_PCI.
>
Same as above.
>
> > + new_tbl->it_ops = old_tbl->it_ops;
> > +
> > + iommu_init_table(new_tbl, nid, old_tbl->it_reserved_start, old_tbl->it_reserved_end);
> > + iommu_tce_table_put(old_tbl);
> > + *tbl = new_tbl;
> >
> > return 0;
> > }
> > @@ -1171,12 +1206,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > struct direct_window *window;
> > struct property *win64 = NULL;
> > struct failed_ddw_pdn *fpdn;
> > - bool default_win_removed = false;
> > + bool default_win_removed = false, maps_whole_partition = false;
>
> s/maps_whole_partition/direct_mapping/
>
Sure, I will get it replaced.
>
> > + struct pci_dn *pci = PCI_DN(pdn);
> > + struct iommu_table *tbl = pci->table_group->tables[0];
> >
> > mutex_lock(&direct_window_init_mutex);
> >
> > - if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > - goto out_unlock;
> > + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &maps_whole_partition)) {
> > + mutex_unlock(&direct_window_init_mutex);
> > + return maps_whole_partition;
> > + }
> >
> > /*
> > * If we already went through this for a previous function of
> > @@ -1258,16 +1297,24 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > query.page_size);
> > goto out_failed;
> > }
> > +
> > /* verify the window * number of ptes will map the partition */
> > - /* check largest block * page size > max memory hotplug addr */
> > max_addr = ddw_memory_hotplug_max();
> > if (query.largest_available_block < (max_addr >> page_shift)) {
> > - dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
> > - "%llu-sized pages\n", max_addr, query.largest_available_block,
> > - 1ULL << page_shift);
> > - goto out_failed;
> > + dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
> > + max_addr, query.largest_available_block,
> > + 1ULL << page_shift);
> > +
> > + len = order_base_2(query.largest_available_block << page_shift);
> > + } else {
> > + maps_whole_partition = true;
> > + len = order_base_2(max_addr);
> > }
> > - len = order_base_2(max_addr);
> > +
> > + /* DDW + IOMMU on single window may fail if there is any allocation */
> > + if (default_win_removed && !maps_whole_partition &&
> > + iommu_table_in_use(tbl))
> > + goto out_failed;
> >
> > ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > if (ret != 0)
> > @@ -1277,8 +1324,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > create.liobn, dn);
> >
> > win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> > - ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> > - page_shift, len);
> > + ret = ddw_property_create(&win64, DMA64_PROPNAME, create.liobn, win_addr,
> > + page_shift, len, maps_whole_partition);
> > if (ret) {
> > dev_info(&dev->dev,
> > "couldn't allocate property, property name, or value\n");
> > @@ -1297,12 +1344,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > if (!window)
> > goto out_prop_del;
> >
> > - ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > - win64->value, tce_setrange_multi_pSeriesLP_walk);
> > - if (ret) {
> > - dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > - dn, ret);
> > - goto out_free_window;
> > + if (maps_whole_partition) {
> > + /* DDW maps the whole partition, so enable direct DMA mapping */
> > + ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > + win64->value, tce_setrange_multi_pSeriesLP_walk);
> > + if (ret) {
> > + dev_info(&dev->dev, "failed to map direct window for %pOF: %d\n",
> > + dn, ret);
> > + goto out_free_window;
> > + }
> > + } else {
> > + /* New table for using DDW instead of the default DMA window */
> > + if (iommu_table_update_window(&tbl, pci->phb->node, create.liobn,
> > + win_addr, page_shift, 1UL << len))
> > + goto out_free_window;
> > +
> > + set_iommu_table_base(&dev->dev, tbl);
> > + WARN_ON(dev->dev.archdata.dma_offset >= SZ_4G);
>
> What is this check for exactly? Why 4G, not >= 0, for example?
I am not really sure, you suggested adding it here:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200716071658.467820-6-leobras.c@gmail.com/#2488874
I can remove it if it's ok.
>
> > + goto out_unlock;
> > +
> > }
> >
> > dev->dev.archdata.dma_offset = win_addr;
> > @@ -1340,7 +1400,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > out_unlock:
> > mutex_unlock(&direct_window_init_mutex);
> > - return win64;
> > + return win64 && maps_whole_partition;
> > }
> >
> > static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> >
^ permalink raw reply
* Re: kernel since 5.6 do not boot anymore on Apple PowerBook
From: Gabriel Paubert @ 2020-08-28 17:07 UTC (permalink / raw)
To: Christophe Leroy; +Cc: Giuseppe Sacco, linuxppc-dev
In-Reply-To: <afae7efd-0d8a-5672-7b75-9394b0ff3d3c@csgroup.eu>
On Thu, Aug 27, 2020 at 12:39:06PM +0200, Christophe Leroy wrote:
> Hi,
>
> Le 27/08/2020 à 10:28, Giuseppe Sacco a écrit :
> > Il giorno gio, 27/08/2020 alle 09.46 +0200, Giuseppe Sacco ha scritto:
> > > Il giorno gio, 27/08/2020 alle 00.28 +0200, Giuseppe Sacco ha scritto:
> > > > Hello Christophe,
> > > >
> > > > Il giorno mer, 26/08/2020 alle 15.53 +0200, Christophe Leroy ha
> > > > scritto:
> > > > [...]
> > > > > If there is no warning, then the issue is something else, bad luck.
> > > > >
> > > > > Could you increase the loglevel and try again both with and without
> > > > > VMAP_STACK ? Maybe we'll get more information on where it stops.
> > > >
> > > > The problem is related to the CPU frequency changes. This is where the
> > > > system stop: cpufreq get the CPU frequency limits and then start the
> > > > default governor (performance) and then calls function
> > > > cpufreq_gov_performance_limits() that never returns.
> > > >
> > > > Rebuilding after enabling pr_debug for cpufreq.c, I've got some more
> > > > lines of output:
> > > >
> > > > cpufreq: setting new policy for CPU 0: 667000 - 867000 kHz
> > > > cpufreq: new min and max freqs are 667000 - 867000 kHz
> > > > cpufreq: governor switch
> > > > cpufreq: cpufreq_init_governor: for CPU 0
> > > > cpufreq: cpufreq_start_governor: for CPU 0
> > > > cpufreq: target for CPU 0: 867000 kHz, relation 1, requested 867000 kHz
> > > > cpufreq: __target_index: cpu: 0, oldfreq: 667000, new freq: 867000
> > > > cpufreq: notification 0 of frequency transition to 867000 kHz
> > > > cpufreq: saving 133328 as reference value for loops_per_jiffy; freq is 667000 kHz
> > > >
> > > > no more lines are printed. I think this output only refers to the
> > > > notification sent prior to the frequency change.
> > > >
> > > > I was thinking that selecting a governor that would run at 667mHz would
> > > > probably skip the problem. I added cpufreq.default_governor=powersave
> > > > to the command line parameters but it did not work: the selected
> > > > governor was still performance and the system crashed.
> > >
> > > I kept following the thread and found that CPU frequency is changed in
> > > function pmu_set_cpu_speed() in file drivers/cpufreq/pmac32-cpufreq.c.
> > > As first thing, the function calls the macro preempt_disable() and this
> > > is where it stops.
> >
> > Sorry, I made a mistake. The real problem is down, on the same
> > function, when it calls low_sleep_handler(). This is where the problem
> > probably is.
> >
>
>
> Great, you spotted the problem.
>
> I see what it is, it is in low_sleep_handler() in
> arch/powerpc/platforms/powermac/sleep.S
>
> All critical registers are saved on the stack. At restore, they are restore
> BEFORE re-enabling MMU (because they are needed for that). But when we have
> VMAP_STACK, the stack can hardly be accessed without the MMU enabled.
> tophys() doesn't work for virtual stack addresses.
>
> Therefore, the low_sleep_handler() has to be reworked for using an area in
> the linear mem instead of the stack.
Actually there is already one, at the end of sleep.S, there is a 4 byte
area called sleep_storage. It should be enough to move there the CPU state
which has to be restored before the MMU is enabled:
- SDR1
- BATs
- SPRG (they might stay on the stack by reordering the code unless I
miss something)
Note that save_cpu_setup/restore_cpu_setup also use another static
storage area to save other SPRs.
Using static area would not work on SMP, but all PPC based Apple laptops
are single core and single thread.
Gabriel
^ permalink raw reply
* Re: Flushing transparent hugepages
From: Catalin Marinas @ 2020-08-28 17:06 UTC (permalink / raw)
To: Will Deacon
Cc: linux-arch, Thomas Bogendoerfer, Vineet Gupta, Russell King,
Matthew Wilcox, linux-mips, linux-mm, Paul Mackerras, sparclinux,
linux-snps-arc, linuxppc-dev, David S. Miller, linux-arm-kernel
In-Reply-To: <20200818160815.GA16191@willie-the-truck>
On Tue, Aug 18, 2020 at 05:08:16PM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 04:07:36PM +0100, Matthew Wilcox wrote:
> > For example, arm64 seems confused in this scenario:
> >
> > void flush_dcache_page(struct page *page)
> > {
> > if (test_bit(PG_dcache_clean, &page->flags))
> > clear_bit(PG_dcache_clean, &page->flags);
> > }
> >
> > ...
> >
> > void __sync_icache_dcache(pte_t pte)
> > {
> > struct page *page = pte_page(pte);
> >
> > if (!test_and_set_bit(PG_dcache_clean, &page->flags))
> > sync_icache_aliases(page_address(page), page_size(page));
> > }
> >
> > So arm64 keeps track on a per-page basis which ones have been flushed.
> > page_size() will return PAGE_SIZE if called on a tail page or regular
> > page, but will return PAGE_SIZE << compound_order if called on a head
> > page. So this will either over-flush, or it's missing the opportunity
> > to clear the bits on all the subpages which have now been flushed.
>
> Hmm, that seems to go all the way back to 2014 as the result of a bug fix
> in 923b8f5044da ("arm64: mm: Make icache synchronisation logic huge page
> aware") which has a Reported-by Mark and a CC stable, suggesting something
> _was_ going wrong at the time :/ Was there a point where the tail pages
> could end up with PG_arch_1 uncleared on allocation?
In my experience, it's the other way around: you can end up with
PG_arch_1 cleared in a tail page when the head one was set (splitting
THP).
> > What would you _like_ to see? Would you rather flush_dcache_page()
> > were called once for each subpage, or would you rather maintain
> > the page-needs-flushing state once per compound page? We could also
> > introduce flush_dcache_thp() if some architectures would prefer it one
> > way and one the other, although that brings into question what to do
> > for hugetlbfs pages.
>
> For arm64, we'd like to see PG_arch_1 preserved during huge page splitting
> [1], but there was a worry that it might break x86 and s390. It's also not
> clear to me that we can change __sync_icache_dcache() as it's called when
> we're installing the entry in the page-table, so why would it be called
> again for the tail pages?
Indeed, __sync_icache_dcache() is called from set_pte_at() on the head
page, though it could always iterate and flush the tail pages
individually (I think we could have done this in commit 923b8f5044da).
Currently I suspect it does some over-flushing if you use THP on
executable pages (it's a no-op on non-exec pages).
With MTE (arm64 memory tagging) I'm introducing a PG_arch_2 flag and
losing this is more problematic as it can lead to clearing valid tags.
In the subsequent patch [2], mte_sync_tags() (also called from
set_pte_at()) checks the PG_arch_2 in each page of a compound one.
My preference would be to treat both PG_arch_1 and _2 similarly.
> [1] https://lore.kernel.org/linux-arch/20200703153718.16973-8-catalin.marinas@arm.com/
[2] https://lore.kernel.org/linux-arch/20200703153718.16973-9-catalin.marinas@arm.com/
--
Catalin
^ permalink raw reply
* [powerpc:fixes-test] BUILD SUCCESS 4a133eb351ccc275683ad49305d0b04dde903733
From: kernel test robot @ 2020-08-28 16:53 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test
branch HEAD: 4a133eb351ccc275683ad49305d0b04dde903733 powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU
elapsed time: 871m
configs tested: 151
configs skipped: 18
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
x86_64 allyesconfig
m68k m5208evb_defconfig
m68k mvme147_defconfig
openrisc or1ksim_defconfig
arm cm_x300_defconfig
arc nsim_700_defconfig
sh se7206_defconfig
s390 alldefconfig
mips rt305x_defconfig
powerpc ep8248e_defconfig
powerpc pseries_defconfig
arm keystone_defconfig
sh se7722_defconfig
parisc generic-64bit_defconfig
mips rs90_defconfig
m68k bvme6000_defconfig
mips bcm47xx_defconfig
c6x evmc6474_defconfig
microblaze nommu_defconfig
powerpc mpc866_ads_defconfig
powerpc pq2fads_defconfig
m68k apollo_defconfig
m68k allyesconfig
arm axm55xx_defconfig
sh alldefconfig
mips fuloong2e_defconfig
arc nsimosci_hs_smp_defconfig
arm cns3420vb_defconfig
arm qcom_defconfig
mips maltasmvp_eva_defconfig
nios2 allyesconfig
powerpc allnoconfig
parisc generic-32bit_defconfig
mips rb532_defconfig
ia64 alldefconfig
mips vocore2_defconfig
powerpc alldefconfig
arc tb10x_defconfig
m68k sun3_defconfig
sh se7721_defconfig
sh se7780_defconfig
arm efm32_defconfig
powerpc adder875_defconfig
arm mvebu_v7_defconfig
m68k m5275evb_defconfig
mips rbtx49xx_defconfig
arm neponset_defconfig
sh magicpanelr2_defconfig
arm lpc18xx_defconfig
mips jmr3927_defconfig
arm exynos_defconfig
arm pxa910_defconfig
arm lpd270_defconfig
nios2 defconfig
mips allyesconfig
mips maltaup_xpa_defconfig
sh shx3_defconfig
arm u300_defconfig
sh se7724_defconfig
mips db1xxx_defconfig
mips maltaup_defconfig
arm realview_defconfig
h8300 defconfig
mips malta_kvm_guest_defconfig
mips cavium_octeon_defconfig
arc nps_defconfig
arm spear13xx_defconfig
arm moxart_defconfig
powerpc ep88xc_defconfig
mips bigsur_defconfig
arm orion5x_defconfig
sh sdk7780_defconfig
c6x dsk6455_defconfig
arc alldefconfig
arm colibri_pxa270_defconfig
mips lemote2f_defconfig
s390 debug_defconfig
alpha alldefconfig
arm sama5_defconfig
arm mmp2_defconfig
h8300 allyesconfig
sh lboxre2_defconfig
arm bcm2835_defconfig
sh apsh4a3a_defconfig
m68k m5475evb_defconfig
mips ip22_defconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
csky defconfig
alpha defconfig
alpha allyesconfig
xtensa allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
i386 allyesconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc allmodconfig
x86_64 randconfig-a003-20200827
x86_64 randconfig-a002-20200827
x86_64 randconfig-a001-20200827
x86_64 randconfig-a005-20200827
x86_64 randconfig-a006-20200827
x86_64 randconfig-a004-20200827
i386 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64 randconfig-a015-20200828
x86_64 randconfig-a012-20200828
x86_64 randconfig-a016-20200828
x86_64 randconfig-a014-20200828
x86_64 randconfig-a011-20200828
x86_64 randconfig-a013-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* [powerpc:merge] BUILD SUCCESS f5d3660ab83be7c3dc2c8a5d662bbac7bc1b092f
From: kernel test robot @ 2020-08-28 16:53 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: f5d3660ab83be7c3dc2c8a5d662bbac7bc1b092f Automatic merge of 'master', 'next' and 'fixes' (2020-08-28 12:18)
elapsed time: 870m
configs tested: 134
configs skipped: 11
The following configs have been built successfully.
More configs may be tested in the coming days.
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
m68k m5208evb_defconfig
m68k mvme147_defconfig
openrisc or1ksim_defconfig
sh se7206_defconfig
s390 alldefconfig
mips rt305x_defconfig
i386 allyesconfig
powerpc ep8248e_defconfig
powerpc pseries_defconfig
arm keystone_defconfig
sh se7722_defconfig
parisc generic-64bit_defconfig
mips rs90_defconfig
m68k bvme6000_defconfig
mips bcm47xx_defconfig
c6x evmc6474_defconfig
microblaze nommu_defconfig
arm mmp2_defconfig
arm cerfcube_defconfig
mips vocore2_defconfig
sh rts7751r2dplus_defconfig
arm qcom_defconfig
mips maltasmvp_eva_defconfig
nios2 allyesconfig
powerpc allnoconfig
arm clps711x_defconfig
sh sh7763rdp_defconfig
arc nsim_700_defconfig
arm versatile_defconfig
powerpc mpc7448_hpc2_defconfig
powerpc alldefconfig
arc tb10x_defconfig
m68k sun3_defconfig
sh se7721_defconfig
sh se7780_defconfig
arm efm32_defconfig
arm cm_x300_defconfig
powerpc adder875_defconfig
arm mvebu_v7_defconfig
arm lpc18xx_defconfig
mips jmr3927_defconfig
arm exynos_defconfig
arm pxa910_defconfig
arm lpd270_defconfig
arm corgi_defconfig
sh se7724_defconfig
mips db1xxx_defconfig
mips maltaup_defconfig
arm realview_defconfig
h8300 defconfig
mips malta_kvm_guest_defconfig
mips cavium_octeon_defconfig
arc nps_defconfig
arm spear13xx_defconfig
h8300 allyesconfig
mips rb532_defconfig
sh lboxre2_defconfig
s390 zfcpdump_defconfig
sh sh03_defconfig
powerpc gamecube_defconfig
sh sh7785lcr_defconfig
arm oxnas_v6_defconfig
arm bcm2835_defconfig
mips loongson1b_defconfig
mips ar7_defconfig
arc hsdk_defconfig
alpha defconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
csky defconfig
alpha allyesconfig
xtensa allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc allmodconfig
i386 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64 randconfig-a015-20200828
x86_64 randconfig-a012-20200828
x86_64 randconfig-a016-20200828
x86_64 randconfig-a014-20200828
x86_64 randconfig-a011-20200828
x86_64 randconfig-a013-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
x86_64 randconfig-a003-20200827
x86_64 randconfig-a002-20200827
x86_64 randconfig-a001-20200827
x86_64 randconfig-a005-20200827
x86_64 randconfig-a006-20200827
x86_64 randconfig-a004-20200827
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* [powerpc:next-test] BUILD SUCCESS WITH WARNING dd419a93bd9954cfdd333f8387a9a0d22218720d
From: kernel test robot @ 2020-08-28 16:54 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: dd419a93bd9954cfdd333f8387a9a0d22218720d powerpc: Update documentation of ISA versions for Power10
Warning in current branch:
/usr/bin/powerpc64-linux-gnu-ld: warning: discarding dynamic section .rela.opd
Warning ids grouped by kconfigs:
clang_recent_errors
`-- powerpc64-randconfig-r025-20200827
`-- usr-bin-powerpc64-linux-gnu-ld:warning:discarding-dynamic-section-.rela.opd
elapsed time: 869m
configs tested: 129
configs skipped: 12
arm defconfig
arm64 allyesconfig
arm64 defconfig
arm allyesconfig
arm allmodconfig
m68k m5208evb_defconfig
m68k mvme147_defconfig
openrisc or1ksim_defconfig
arm cm_x300_defconfig
arc nsim_700_defconfig
sh se7206_defconfig
s390 alldefconfig
mips rt305x_defconfig
i386 allyesconfig
powerpc ep8248e_defconfig
powerpc pseries_defconfig
arm keystone_defconfig
sh se7722_defconfig
parisc generic-64bit_defconfig
mips rs90_defconfig
m68k bvme6000_defconfig
mips bcm47xx_defconfig
c6x evmc6474_defconfig
microblaze nommu_defconfig
arm qcom_defconfig
mips maltasmvp_eva_defconfig
nios2 allyesconfig
powerpc allnoconfig
powerpc alldefconfig
arc tb10x_defconfig
m68k sun3_defconfig
sh se7721_defconfig
sh se7780_defconfig
arm efm32_defconfig
powerpc adder875_defconfig
arm mvebu_v7_defconfig
arm lpc18xx_defconfig
mips jmr3927_defconfig
arm exynos_defconfig
arm pxa910_defconfig
arm lpd270_defconfig
sh se7724_defconfig
mips db1xxx_defconfig
mips maltaup_defconfig
arm realview_defconfig
h8300 defconfig
mips malta_kvm_guest_defconfig
mips cavium_octeon_defconfig
arc nps_defconfig
arm spear13xx_defconfig
s390 debug_defconfig
alpha alldefconfig
arm sama5_defconfig
arm mmp2_defconfig
h8300 allyesconfig
mips rb532_defconfig
sh lboxre2_defconfig
s390 zfcpdump_defconfig
sh sh03_defconfig
powerpc gamecube_defconfig
sh sh7785lcr_defconfig
arm oxnas_v6_defconfig
arm bcm2835_defconfig
mips loongson1b_defconfig
mips ar7_defconfig
arc hsdk_defconfig
alpha defconfig
ia64 allmodconfig
ia64 defconfig
ia64 allyesconfig
m68k allmodconfig
m68k defconfig
m68k allyesconfig
nios2 defconfig
arc allyesconfig
nds32 allnoconfig
c6x allyesconfig
nds32 defconfig
csky defconfig
alpha allyesconfig
xtensa allyesconfig
arc defconfig
sh allmodconfig
parisc defconfig
s390 allyesconfig
parisc allyesconfig
s390 defconfig
sparc allyesconfig
sparc defconfig
i386 defconfig
mips allyesconfig
mips allmodconfig
powerpc defconfig
powerpc allyesconfig
powerpc allmodconfig
x86_64 randconfig-a003-20200827
x86_64 randconfig-a002-20200827
x86_64 randconfig-a001-20200827
x86_64 randconfig-a005-20200827
x86_64 randconfig-a006-20200827
x86_64 randconfig-a004-20200827
i386 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64 randconfig-a015-20200828
x86_64 randconfig-a012-20200828
x86_64 randconfig-a014-20200828
x86_64 randconfig-a011-20200828
x86_64 randconfig-a013-20200828
x86_64 randconfig-a016-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
riscv allyesconfig
riscv allnoconfig
riscv defconfig
riscv allmodconfig
x86_64 rhel
x86_64 allyesconfig
x86_64 rhel-7.6-kselftests
x86_64 defconfig
x86_64 rhel-8.3
x86_64 kexec
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
^ permalink raw reply
* Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
From: Leonardo Bras @ 2020-08-28 15:25 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <952fb640-01dd-003f-7fcb-bd48446d526c@ozlabs.ru>
On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
>
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > Code used to create a ddw property that was previously scattered in
> > enable_ddw() is now gathered in ddw_property_create(), which deals with
> > allocation and filling the property, letting it ready for
> > of_property_add(), which now occurs in sequence.
> >
> > This created an opportunity to reorganize the second part of enable_ddw():
> >
> > Without this patch enable_ddw() does, in order:
> > kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> > ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> > of_add_property().
> >
> > With this patch enable_ddw() does, in order:
> > create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> > do tce_setrange_multi_pSeriesLP_walk in all memory.
> >
> > This change requires of_remove_property() in case anything fails after
> > of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> > in all memory, which looks the most expensive operation, only if
> > everything else succeeds.
> >
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 97 +++++++++++++++-----------
> > 1 file changed, 57 insertions(+), 40 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 4031127c9537..3a1ef02ad9d5 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > ret);
> > }
> >
> > +static int ddw_property_create(struct property **ddw_win, const char *propname,
>
> @propname is always the same, do you really want to pass it every time?
I think it reads better, like "create a ddw property with this name".
Also, it makes possible to create ddw properties with other names, in
case we decide to create properties with different names depending on
the window created.
Also, it's probably optimized / inlined at this point.
Is it ok doing it like this?
>
> > + u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
> > +{
> > + struct dynamic_dma_window_prop *ddwprop;
> > + struct property *win64;
> > +
> > + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> > + if (!win64)
> > + return -ENOMEM;
> > +
> > + win64->name = kstrdup(propname, GFP_KERNEL);
>
> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
> generic OF code does not try kfree() it but it is probably out of scope
> here.
Yeah, I had that question too.
Previous code was like that, and I as trying not to mess too much on
how it's done.
> > + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> > + win64->value = ddwprop;
> > + win64->length = sizeof(*ddwprop);
> > + if (!win64->name || !win64->value)
> > + return -ENOMEM;
>
> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
> still looks fragile. Instead you could simply return win64 as the only
> error possible here is -ENOMEM and returning NULL is equally good.
I agree. It's better if this function have it's own cleaning routine.
It will be fixed for next version.
>
>
> > +
> > + ddwprop->liobn = cpu_to_be32(liobn);
> > + ddwprop->dma_base = cpu_to_be64(dma_addr);
> > + ddwprop->tce_shift = cpu_to_be32(page_shift);
> > + ddwprop->window_shift = cpu_to_be32(window_shift);
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * If the PE supports dynamic dma windows, and there is space for a table
> > * that can map all pages in a linear offset, then setup such a table,
> > @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > - u64 max_addr;
> > + u64 max_addr, win_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > - struct property *win64;
> > - struct dynamic_dma_window_prop *ddwprop;
> > + struct property *win64 = NULL;
> > struct failed_ddw_pdn *fpdn;
> > bool default_win_removed = false;
> >
> > @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_failed;
> > }
> > len = order_base_2(max_addr);
> > - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> > - if (!win64) {
> > - dev_info(&dev->dev,
> > - "couldn't allocate property for 64bit dma window\n");
> > +
> > + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > + if (ret != 0)
>
> It is usually just "if (ret)"
It was previously like that, and all query_ddw() checks return value
this way. Should I update them all or just this one?
Thanks!
>
>
> > goto out_failed;
> > - }
> > - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
> > - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> > - win64->length = sizeof(*ddwprop);
> > - if (!win64->name || !win64->value) {
> > +
> > + dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > + create.liobn, dn);
> > +
> > + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> > + ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, win_addr,
> > + page_shift, len);
> > + if (ret) {
> > dev_info(&dev->dev,
> > - "couldn't allocate property name and value\n");
> > + "couldn't allocate property, property name, or value\n");
> > goto out_free_prop;
> > }
> >
> > - ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > - if (ret != 0)
> > + ret = of_add_property(pdn, win64);
> > + if (ret) {
> > + dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> > + pdn, ret);
> > goto out_free_prop;
> > -
> > - ddwprop->liobn = cpu_to_be32(create.liobn);
> > - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
> > - create.addr_lo);
> > - ddwprop->tce_shift = cpu_to_be32(page_shift);
> > - ddwprop->window_shift = cpu_to_be32(len);
> > -
> > - dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > - create.liobn, dn);
> > + }
> >
> > /* Add new window to existing DDW list */
> > - window = ddw_list_add(pdn, ddwprop);
> > + window = ddw_list_add(pdn, win64->value);
> > if (!window)
> > - goto out_clear_window;
> > + goto out_prop_del;
> >
> > ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
> > win64->value, tce_setrange_multi_pSeriesLP_walk);
> > @@ -1285,14 +1305,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_free_window;
> > }
> >
> > - ret = of_add_property(pdn, win64);
> > - if (ret) {
> > - dev_err(&dev->dev, "unable to add dma window property for %pOF: %d",
> > - pdn, ret);
> > - goto out_free_window;
> > - }
> > -
> > - dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
> > + dev->dev.archdata.dma_offset = win_addr;
> > goto out_unlock;
> >
> > out_free_window:
> > @@ -1302,14 +1315,18 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > kfree(window);
> >
> > -out_clear_window:
> > - remove_ddw(pdn, true);
> > +out_prop_del:
> > + of_remove_property(pdn, win64);
> >
> > out_free_prop:
> > - kfree(win64->name);
> > - kfree(win64->value);
> > - kfree(win64);
> > - win64 = NULL;
> > + if (win64) {
> > + kfree(win64->name);
> > + kfree(win64->value);
> > + kfree(win64);
> > + win64 = NULL;
> > + }
> > +
> > + remove_ddw(pdn, true);
> >
> > out_failed:
> > if (default_win_removed)
> >
^ permalink raw reply
* Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00
From: Leonardo Bras @ 2020-08-28 14:04 UTC (permalink / raw)
To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
Paul Mackerras, Christophe Leroy, Joel Stanley,
Thiago Jung Bauermann, Ram Pai, Brian King,
Murilo Fossa Vicentini, David Dai
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <3fda1c2d-20f2-7789-e072-47fe966f0265@ozlabs.ru>
On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
>
> > On 18/08/2020 09:40, Leonardo Bras wrote:
> > enable_ddw() currently returns the address of the DMA window, which is
> > considered invalid if has the value 0x00.
> >
> > Also, it only considers valid an address returned from find_existing_ddw
> > if it's not 0x00.
> >
> > Changing this behavior makes sense, given the users of enable_ddw() only
> > need to know if direct mapping is possible. It can also allow a DMA window
> > starting at 0x00 to be used.
> >
> > This will be helpful for using a DDW with indirect mapping, as the window
> > address will be different than 0x00, but it will not map the whole
> > partition.
> >
> > Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> > ---
> > arch/powerpc/platforms/pseries/iommu.c | 30 ++++++++++++--------------
> > 1 file changed, 14 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index fcdefcc0f365..4031127c9537 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
> > np, ret);
> > }
> > >
> > -static u64 find_existing_ddw(struct device_node *pdn)
> > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > {
> > struct direct_window *window;
> > const struct dynamic_dma_window_prop *direct64;
> > - u64 dma_addr = 0;
> > + bool found = false;
> >
> > spin_lock(&direct_window_list_lock);
> > /* check if we already created a window and dupe that config if so */
> > list_for_each_entry(window, &direct_window_list, list) {
> > if (window->device == pdn) {
> > direct64 = window->prop;
> > - dma_addr = be64_to_cpu(direct64->dma_base);
> > + *dma_addr = be64_to_cpu(direct64->dma_base);
> > + found = true;
> > break;
> > }
> > }
> > spin_unlock(&direct_window_list_lock);
> >
> > - return dma_addr;
> > + return found;
> > }
> >
> > static struct direct_window *ddw_list_add(struct device_node *pdn,
> > @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> > * pdn: the parent pe node with the ibm,dma_window property
> > * Future: also check if we can remap the base window for our base page size
> > *
> > - * returns the dma offset for use by the direct mapped DMA code.
> > + * returns true if can map all pages (direct mapping), false otherwise..
> > */
> > -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > {
> > int len, ret;
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > - u64 dma_addr, max_addr;
> > + u64 max_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > mutex_lock(&direct_window_init_mutex);
> >
> > - dma_addr = find_existing_ddw(pdn);
> > - if (dma_addr != 0)
> > + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > goto out_unlock;
> >
> > /*
> > @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > goto out_free_window;
> > }
> >
> > - dma_addr = be64_to_cpu(ddwprop->dma_base);
> > + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>
> Do not you need the same chunk in the find_existing_ddw() case above as
> well? Thanks,
The new signature of find_existing_ddw() is
static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
And on enable_ddw(), we call
find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)
And inside the function we do:
*dma_addr = be64_to_cpu(direct64->dma_base);
I think it's the same as the chunk before.
Am I missing something?
>
>
> > goto out_unlock;
> >
> > out_free_window:
> > @@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > kfree(win64->name);
> > kfree(win64->value);
> > kfree(win64);
> > + win64 = NULL;
> >
> > out_failed:
> > if (default_win_removed)
> > @@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >
> > out_unlock:
> > mutex_unlock(&direct_window_init_mutex);
> > - return dma_addr;
> > + return win64;
> > }
> >
> > static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> > @@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask)
> > break;
> > }
> >
> > - if (pdn && PCI_DN(pdn)) {
> > - pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
> > - if (pdev->dev.archdata.dma_offset)
> > - return true;
> > - }
> > + if (pdn && PCI_DN(pdn))
> > + return enable_ddw(pdev, pdn);
> >
> > return false;
> > }
> >
^ permalink raw reply
* Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
From: peterz @ 2020-08-28 11:15 UTC (permalink / raw)
To: Nicholas Piggin
Cc: linux-arch, Jens Axboe, Aneesh Kumar K.V, linux-kernel, linux-mm,
Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <20200828100022.1099682-2-npiggin@gmail.com>
On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:
> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().
ARM at least has activate_mm() := switch_mm(), so it could be made to
work.
^ permalink raw reply
* [PATCH v2] powerpc/book3s64/radix: Fix boot failure with large amount of guest memory
From: Aneesh Kumar K.V @ 2020-08-28 10:08 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Shirisha Ganta, Aneesh Kumar K.V, Hari Bathini
If the hypervisor doesn't support hugepages, the kernel ends up allocating a large
number of page table pages. The early page table allocation was wrongly
setting the max memblock limit to ppc64_rma_size with radix translation
which resulted in boot failure as shown below.
Kernel panic - not syncing:
early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x1000000 nid=-1 from=0x0000000000000000 max_addr=0xffffffffffffffff
CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
Call Trace:
[c0000000016f3d00] [c0000000007c6470] dump_stack+0xc4/0x114 (unreliable)
[c0000000016f3d40] [c00000000014c78c] panic+0x164/0x418
[c0000000016f3dd0] [c000000000098890] early_alloc_pgtable+0xe0/0xec
[c0000000016f3e60] [c0000000010a5440] radix__early_init_mmu+0x360/0x4b4
[c0000000016f3ef0] [c000000001099bac] early_init_mmu+0x1c/0x3c
[c0000000016f3f10] [c00000000109a320] early_setup+0x134/0x170
This was because the kernel was checking for the radix feature before we enable the
feature via mmu_features. This resulted in the kernel using hash restrictions on
radix.
Rework the early init code such that the kernel boot with memblock restrictions
as imposed by hash. At that point, the kernel still hasn't finalized the
translation the kernel will end up using.
We have three different ways of detecting radix.
1. dt_cpu_ftrs_scan -> used only in case of PowerNV
2. ibm,pa-features -> Used when we don't use cpu_dt_ftr_scan
3. CAS -> Where we negotiate with hypervisor about the supported translation.
We look at 1 or 2 early in the boot and after that, we look at the CAS vector to
finalize the translation the kernel will use. We also support a kernel command
line option (disable_radix) to switch to hash.
Update the memblock limit after mmu_early_init_devtree() if the kernel is going
to use radix translation. This forces some of the memblock allocations we do before
mmu_early_init_devtree() to be within the RMA limit.
Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init routines")
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Reported-by: Shirisha Ganta <shiganta@in.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/include/asm/book3s/64/mmu.h | 10 +++++-----
arch/powerpc/mm/book3s64/radix_pgtable.c | 15 ---------------
arch/powerpc/mm/init_64.c | 11 +++++++++--
3 files changed, 14 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 55442d45c597..b392384a3b15 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -239,14 +239,14 @@ static inline void early_init_mmu_secondary(void)
extern void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
phys_addr_t first_memblock_size);
-extern void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
- phys_addr_t first_memblock_size);
static inline void setup_initial_memory_limit(phys_addr_t first_memblock_base,
phys_addr_t first_memblock_size)
{
- if (early_radix_enabled())
- return radix__setup_initial_memory_limit(first_memblock_base,
- first_memblock_size);
+ /*
+ * Hash has more strict restrictions. At this point we don't
+ * know which translations we will pick. Hence go with hash
+ * restrictions.
+ */
return hash__setup_initial_memory_limit(first_memblock_base,
first_memblock_size);
}
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..d5f0c10d752a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -734,21 +734,6 @@ void radix__mmu_cleanup_all(void)
}
}
-void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
- phys_addr_t first_memblock_size)
-{
- /*
- * We don't currently support the first MEMBLOCK not mapping 0
- * physical on those processors
- */
- BUG_ON(first_memblock_base != 0);
-
- /*
- * Radix mode is not limited by RMA / VRMA addressing.
- */
- ppc64_rma_size = ULONG_MAX;
-}
-
#ifdef CONFIG_MEMORY_HOTPLUG
static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
{
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 02e127fa5777..8459056cce67 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -433,9 +433,16 @@ void __init mmu_early_init_devtree(void)
if (!(mfmsr() & MSR_HV))
early_check_vec5();
- if (early_radix_enabled())
+ if (early_radix_enabled()) {
radix__early_init_devtree();
- else
+ /*
+ * We have finalized the translation we are going to use by now.
+ * Radix mode is not limited by RMA / VRMA addressing.
+ * Hence don't limit memblock allocations.
+ */
+ ppc64_rma_size = ULONG_MAX;
+ memblock_set_current_limit(MEMBLOCK_ALLOC_ANYWHERE);
+ } else
hash__early_init_devtree();
}
#endif /* CONFIG_PPC_BOOK3S_64 */
--
2.26.2
^ permalink raw reply related
* [PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K.V,
linux-kernel, Nicholas Piggin, Andrew Morton, linuxppc-dev,
David S. Miller
In-Reply-To: <20200828100022.1099682-1-npiggin@gmail.com>
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.
That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.
Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix
kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/tlb.h | 13 -------------
arch/powerpc/mm/book3s64/radix_tlb.c | 23 ++++++++++++++++-------
2 files changed, 16 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
}
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
- WARN_ON(atomic_read(&mm->context.copros) > 0);
- /*
- * It's possible for mm_access to take a reference on mm_users to
- * access the remote mm from another thread, but it's not allowed
- * to set mm_cpumask, so mm_users may be > 1 here.
- */
- WARN_ON(current->mm != mm);
- atomic_set(&mm->context.active_cpus, 1);
- cpumask_clear(mm_cpumask(mm));
- cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
#else /* CONFIG_PPC_BOOK3S_64 */
static inline int mm_is_thread_local(struct mm_struct *mm)
{
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..a421a0e3f930 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;
+ /*
+ * A kthread could have done a mmget_not_zero() after the flushing CPU
+ * checked mm_users == 1, and be in the process of kthread_use_mm when
+ * interrupted here. In that case, current->mm will be set to mm,
+ * because kthread_use_mm() setting ->mm and switching to the mm is
+ * done with interrupts off.
+ */
if (current->mm == mm)
- return; /* Local CPU */
+ goto out_flush;
if (current->active_mm == mm) {
- /*
- * Must be a kernel thread because sender is single-threaded.
- */
- BUG_ON(current->mm);
+ WARN_ON_ONCE(current->mm != NULL);
+ /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
- switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+ switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+ atomic_dec(&mm->context.active_cpus);
+ cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
}
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
*/
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
- mm_reset_thread_local(mm);
}
void radix__flush_tlb_mm(struct mm_struct *mm)
--
2.23.0
^ permalink raw reply related
* [PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
From: Nicholas Piggin @ 2020-08-28 10:00 UTC (permalink / raw)
To: linux-mm
Cc: linux-arch, Jens Axboe, Peter Zijlstra, Aneesh Kumar K.V,
linux-kernel, Nicholas Piggin, sparclinux, Andrew Morton,
linuxppc-dev, David S. Miller
In-Reply-To: <20200828100022.1099682-1-npiggin@gmail.com>
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.
The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to aoperate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).
io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.
The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.
arch/sparc/kernel/smp_64.c:
if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}
vs fs/io_uring.c
if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
!mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);
mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.
I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.
The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.
Cc: sparclinux@vger.kernel.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/sparc/kernel/smp_64.c | 65 ++++++++------------------------------
1 file changed, 14 insertions(+), 51 deletions(-)
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
* are flush_tlb_*() routines, and these run after flush_cache_*()
* which performs the flushw.
*
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- * space has (potentially) executed on, this is the heuristic
- * we use to avoid doing cross calls.
- *
- * Also, for flushing from kswapd and also for clones, we
- * use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- * in the system, this allows us to play several games to avoid
- * cross calls.
- *
- * One invariant is that when a cpu switches to a process, and
- * that processes tsk->active_mm->cpu_vm_mask does not have the
- * current cpu's bit set, that tlb context is flushed locally.
- *
- * If the address space is non-shared (ie. mm->count == 1) we avoid
- * cross calls when we want to flush the currently running process's
- * tlb state. This is done by clearing all cpu bits except the current
- * processor's in current->mm->cpu_vm_mask and performing the
- * flush locally only. This will force any subsequent cpus which run
- * this task to flush the context from the local tlb if the process
- * migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- * bullet for most cases and perform the cross call (but only to
- * the cpus listed in cpu_vm_mask).
- *
- * The performance gain from "optimizing" away the cross call for threads is
- * questionable (in theory the big win for threads is the massive sharing of
- * address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
*/
/* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
void smp_flush_tlb_mm(struct mm_struct *mm)
{
u32 ctx = CTX_HWBITS(mm->context);
- int cpu = get_cpu();
- if (atomic_read(&mm->mm_users) == 1) {
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- goto local_flush_and_out;
- }
+ get_cpu();
smp_cross_call_masked(&xcall_flush_tlb_mm,
ctx, 0, 0,
mm_cpumask(mm));
-local_flush_and_out:
__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
put_cpu();
@@ -1114,17 +1080,15 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
{
u32 ctx = CTX_HWBITS(mm->context);
struct tlb_pending_info info;
- int cpu = get_cpu();
+
+ get_cpu();
info.ctx = ctx;
info.nr = nr;
info.vaddrs = vaddrs;
- if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- else
- smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
- &info, 1);
+ smp_call_function_many(mm_cpumask(mm), tlb_pending_func,
+ &info, 1);
__flush_tlb_pending(ctx, nr, vaddrs);
@@ -1134,14 +1098,13 @@ void smp_flush_tlb_pending(struct mm_struct *mm, unsigned long nr, unsigned long
void smp_flush_tlb_page(struct mm_struct *mm, unsigned long vaddr)
{
unsigned long context = CTX_HWBITS(mm->context);
- int cpu = get_cpu();
- if (mm == current->mm && atomic_read(&mm->mm_users) == 1)
- cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
- else
- smp_cross_call_masked(&xcall_flush_tlb_page,
- context, vaddr, 0,
- mm_cpumask(mm));
+ get_cpu();
+
+ smp_cross_call_masked(&xcall_flush_tlb_page,
+ context, vaddr, 0,
+ mm_cpumask(mm));
+
__flush_tlb_page(context, vaddr);
put_cpu();
--
2.23.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox