* Re: [PATCH V2] lglock: add read-preference local-global rwlock
From: Oleg Nesterov @ 2013-03-03 17:40 UTC (permalink / raw)
To: Lai Jiangshan
Cc: Lai Jiangshan, linux-doc, peterz, fweisbec, Michel Lespinasse,
mingo, linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
linux-pm, rusty, rostedt, rjw, namhyung, tglx, linux-arm-kernel,
netdev, linux-kernel, vincent.guittot, sbw, Srivatsa S. Bhat, tj,
akpm, linuxppc-dev
In-Reply-To: <20130302172003.GC29769@redhat.com>
On 03/02, Oleg Nesterov wrote:
>
> On 03/02, Lai Jiangshan wrote:
> >
> > +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
> > +{
> > + switch (__this_cpu_read(*lgrw->reader_refcnt)) {
> > + case 1:
> > + __this_cpu_write(*lgrw->reader_refcnt, 0);
> > + lg_local_unlock(&lgrw->lglock);
> > + return;
> > + case FALLBACK_BASE:
> > + __this_cpu_write(*lgrw->reader_refcnt, 0);
> > + read_unlock(&lgrw->fallback_rwlock);
> > + rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
>
> I guess "case 1:" should do rwlock_release() too.
>
> Otherwise, at first glance looks correct...
>
> However, I still think that FALLBACK_BASE only adds the unnecessary
> complications. But even if I am right this is subjective of course, please
> feel free to ignore.
Yes, but...
> And btw, I am not sure about lg->lock_dep_map, perhaps we should use
> fallback_rwlock->dep_map ?
>
> We need rwlock_acquire_read() even in the fast-path, and this acquire_read
> should be paired with rwlock_acquire() in _write_lock(), but it does
> spin_acquire(lg->lock_dep_map). Yes, currently this is the same (afaics)
> but perhaps fallback_rwlock->dep_map would be more clean.
Please ignore this part.
I missed that lg_rwlock_global_write_lock() relies on lg_global_lock(),
and I just noticed that it does rwlock_acquire(lg->lock_dep_map).
Hmm. But then I do not understand the lglock annotations. Obviously,
rwlock_acquire_read() in lg_local_lock() can't even detect the simplest
deadlock, say, lg_local_lock(LOCK) + lg_local_lock(LOCK). Not to mention
spin_lock(X) + lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).
OK, I understand that it is not easy to make these annotations correct...
Oleg.
^ permalink raw reply
* Re: Linux kernel 3.x problems on PowerMac G5
From: Benjamin Herrenschmidt @ 2013-03-03 19:03 UTC (permalink / raw)
To: Phileas Fogg; +Cc: Andreas Schwab, Denis Kirjanov, linuxppc-dev, Aaro Koskinen
In-Reply-To: <51333F1C.4080605@mail.ru>
On Sun, 2013-03-03 at 05:16 -0700, Phileas Fogg wrote:
> i'm testing currently with slightly modified g5_defconfig. I enabled
> the Ralink driver for my USB WLAN NIC and some other stuff, nothing
> fancy. Anyways, i'm sending you config for 3.0.67. Linux 3.0.67 hangs on
> my machine too.
>
> I cloned your GIT powerpc repository and at the moment bisecting and
> trying to find the commit which causes trouble on my machine.
>
> I have already found out that
> the commit 27d934b28752b860cba6c0d77ea4598861d80998 is OK
> but
> the commit 0bd41dfc9fbbcf174d5336c1c9fc5ba917519761 is NOT.
> The problem appeared between Linux 2.6.39 and Linux-3.0-rc1.
>
> Just my typical luck with PowerPC :) First PS3 and now G5.
>
> I will report back as soon as i find the commit which causes the hangs.
Thanks. It looks like a bisection might indeed be the way to go...
Out of curiosity, have you tried without some of your additional drivers ?
Maybe one of them is the culprit...
Cheers,
Ben.
^ permalink raw reply
* Re: Linux kernel 3.x problems on PowerMac G5
From: Phileas Fogg @ 2013-03-03 19:16 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: linuxppc-dev, Denis Kirjanov, Andreas Schwab, Aaro Koskinen
In-Reply-To: <1362337429.2969.17.camel@pasglop>
Benjamin Herrenschmidt wrote:
> Thanks. It looks like a bisection might indeed be the way to go...
>
> Out of curiosity, have you tried without some of your additional drivers ?
> Maybe one of them is the culprit...
>
> Cheers,
> Ben.
>
Not yet, will do.
But I tested the official Debian Wheezy RC netinstall CD with Linux 3.2,
it has the same problem and hangs at boot on my machine.
regards
^ permalink raw reply
* Re: Linux kernel 3.x problems on PowerMac G5
From: Benjamin Herrenschmidt @ 2013-03-03 19:24 UTC (permalink / raw)
To: Phileas Fogg; +Cc: linuxppc-dev, Denis Kirjanov, Andreas Schwab, Aaro Koskinen
In-Reply-To: <5133A180.2040908@mail.ru>
On Sun, 2013-03-03 at 20:16 +0100, Phileas Fogg wrote:
> Benjamin Herrenschmidt wrote:
> > Thanks. It looks like a bisection might indeed be the way to go...
> >
> > Out of curiosity, have you tried without some of your additional drivers ?
> > Maybe one of them is the culprit...
> >
> > Cheers,
> > Ben.
> >
>
> Not yet, will do.
> But I tested the official Debian Wheezy RC netinstall CD with Linux 3.2,
> it has the same problem and hangs at boot on my machine.
Ok, so it's definitely something about your configuration. Maybe
something in the 11,2 support code chokes on single-chip configs, I
don't have one of them to test, both mines are dual chip (ie. quad
core).
But it does look like a regression that should be bisectable, so let me
know went you're done there and what you get.
Cheers,
Ben.
^ permalink raw reply
* Re: [PATCH] kexec/ppc: Fix kernel program entry point while changing the load addr
From: Simon Horman @ 2013-03-04 1:41 UTC (permalink / raw)
To: Suzuki K. Poulose
Cc: Matthew McClintock, Sebastian Andrzej Siewior, kexec,
linuxppc-dev
In-Reply-To: <20130303073540.12040.16854.stgit@suzukikp>
[ Cc: linuxppc-dev@lists.ozlabs.org ]
On Sun, Mar 03, 2013 at 01:06:00PM +0530, Suzuki K. Poulose wrote:
> From: Suzuki K. Poulose <suzuki@in.ibm.com>
>
> uImage probe fills the entry point (ep) based on the load_addr
> from the uImage headers. If we change the load_addr, we should
> accordingly update the entry point.
>
> For ELF, calculate the offset of e_entry from the virtual start
> address and add it to the physical start address to find the
> physical address of kernel entry.
>
> i.e,
> pa (e_entry) = pa(phdr[0].p_vaddr) + (e_entry - phdr[0].p_vaddr)
> = kernel_addr + (e_entry - phdr[0].p_vaddr)
Would it be possible for someone to provide a review of this change?
>
> Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Matthew McClintock <msm@freescale.com>
> ---
> kexec/arch/ppc/kexec-elf-ppc.c | 12 ++++++++----
> kexec/arch/ppc/kexec-uImage-ppc.c | 6 +++++-
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/kexec/arch/ppc/kexec-elf-ppc.c b/kexec/arch/ppc/kexec-elf-ppc.c
> index 8e408cc..5f63a64 100644
> --- a/kexec/arch/ppc/kexec-elf-ppc.c
> +++ b/kexec/arch/ppc/kexec-elf-ppc.c
> @@ -397,10 +397,14 @@ int elf_ppc_load(int argc, char **argv, const char *buf, off_t len,
> die("Error device tree not loadded to address it was expecting to be loaded too!\n");
> }
>
> - /* set various variables for the purgatory ehdr.e_entry is a
> - * virtual address, we can use kernel_addr which
> - * should be the physical start address of the kernel */
> - addr = kernel_addr;
> + /*
> + * set various variables for the purgatory.
> + * ehdr.e_entry is a virtual address. we know physical start
> + * address of the kernel (kernel_addr). Find the offset of
> + * e_entry from the virtual start address(e_phdr[0].p_vaddr)
> + * and calculate the actual physical address of the 'kernel entry'.
> + */
> + addr = kernel_addr + (ehdr.e_entry - ehdr.e_phdr[0].p_vaddr);
> elf_rel_set_symbol(&info->rhdr, "kernel", &addr, sizeof(addr));
>
> addr = dtb_addr;
> diff --git a/kexec/arch/ppc/kexec-uImage-ppc.c b/kexec/arch/ppc/kexec-uImage-ppc.c
> index e0bc7bb..900cd16 100644
> --- a/kexec/arch/ppc/kexec-uImage-ppc.c
> +++ b/kexec/arch/ppc/kexec-uImage-ppc.c
> @@ -159,15 +159,19 @@ static int ppc_load_bare_bits(int argc, char **argv, const char *buf,
>
> /*
> * If the provided load_addr cannot be allocated, find a new
> - * area.
> + * area. Rebase the entry point based on the new load_addr.
> */
> if (!valid_memory_range(info, load_addr, load_addr + (len + _1MiB))) {
> + int ep_offset = ep - load_addr;
> +
> load_addr = locate_hole(info, len + _1MiB, 0, 0, max_addr, 1);
> if (load_addr == ULONG_MAX) {
> printf("Can't allocate memory for kernel of len %ld\n",
> len + _1MiB);
> return -1;
> }
> +
> + ep = load_addr + ep_offset;
> }
>
> add_segment(info, buf, len, load_addr, len + _1MiB);
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
>
^ permalink raw reply
* Re: [PATCH 0/3] Enable multiple MSI feature in pSeries
From: Mike Qiu @ 2013-03-04 3:14 UTC (permalink / raw)
To: Michael Ellerman; +Cc: tglx, linuxppc-dev, linux-kernel
In-Reply-To: <20130301035417.GA17236@concordia>
于 2013/3/1 11:54, Michael Ellerman 写道:
> On Fri, Mar 01, 2013 at 11:08:45AM +0800, Mike wrote:
>> Hi all
>>
>> Any comments? or any questions about my patchset?
> You were going to get some performance numbers that show a definite
> benefit for using more than one MSI.
Yes, but my patch just enable the kernel to support this feature, whether
to use it depens on the device driver.
And this feature has been merged to the kernel for X86 for a long time.
See commit: 5ca72c4f7c412c2002363218901eba5516c476b1
51906e779f2b13b38f8153774c4c7163d412ffd9
Actually, I'm trying to do the test. but it is difficult to do that test,
because it mostly depends on how the device driver to use this feature,
while the ipr driver patch was wrote by another person. also no any reply
from her.
> cheers
>
^ permalink raw reply
* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Paul Mackerras @ 2013-03-04 4:58 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <1361865914-13911-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
On Tue, Feb 26, 2013 at 01:34:56PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> We allocate one page for the last level of linux page table. With THP and
> large page size of 16MB, that would mean we are be wasting large part
> of that page. To map 16MB area, we only need a PTE space of 2K with 64K
> page size. This patch reduce the space wastage by sharing the page
> allocated for the last level of linux page table with multiple pmd
> entries. We call these smaller chunks PTE page fragments and allocated
> page, PTE page. We use the page->_mapcount as bitmap to indicate which
> PTE fragments are free.
>
> page->_mapcount is divided into two halves. The upper half is used for
> tracking the freed page framents in the RCU grace period.
>
> In order to support systems which doesn't have 64K HPTE support, we also
> add another 2K to PTE page fragment. The second half of the PTE fragments
> is used for storing slot and secondary bit information of an HPTE. With this
> we now have a 4K PTE fragment.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This one has taken me hours to review. Perhaps it's partly because of
the way that diff has matched things up, but it's difficult to see
what's moved where, what's common code that is now the 4k page case,
etc. For example, pmd_alloc_one() and pmd_free() are unchanged, but
the diff shows them as removed in one place and added in another.
The other general comment I have is that it's not really clear when a
page will be on the mm->context.pgtable_list and when it won't. I
would like to see an invariant that says something like "the page is
on the pgtable_list if and only if (page->_mapcount & FRAG_MASK) is
neither 0 nor FRAG_MASK". But that doesn't seem to be the case
exactly, and I can't see any consistent rule, which makes me think
there are going to be bugs in corner cases.
Consider, for example, the case where a page has two fragments still
in use, and one of them gets queued up by RCU for freeing via a call
to page_table_free_rcu, and then the other one gets freed through
page_table_free(). Neither the call to page_table_free_rcu nor the
call to page_table_free will take the page off the list AFAICS, and
then __page_table_free_rcu() will free the page while it's still on
the pgtable_list.
More specific comments below...
> -static inline void pgtable_free(void *table, unsigned index_size)
> -{
> - if (!index_size)
> - free_page((unsigned long)table);
> - else {
> - BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
> - kmem_cache_free(PGT_CACHE(index_size), table);
> - }
> -}
This is still used in the UP case, both for 4k and 64k, and UP configs
now fail to build.
> static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
> {
> free_page((unsigned long)pte);
> @@ -156,7 +118,12 @@ static inline void __tlb_remove_table(void *_table)
> void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
> unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
>
> - pgtable_free(table, shift);
> + if (!shift)
> + free_page((unsigned long)table);
> + else {
> + BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
> + kmem_cache_free(PGT_CACHE(shift), table);
> + }
Any particular reason for open-coding pgtable_free() here?
> +/*
> + * we support 15 fragments per PTE page. This is limited by how many
> + * bits we can pack in page->_mapcount. We use the first half for
> + * tracking the usage for rcu page table free.
> + */
> +#define FRAG_MASK_BITS 15
> +#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
Atomic_t variables are 32-bit, so you really should be able to make
FRAG_MASK_BITS be 16 and avoid wasting the last fragment of each page.
Paul.
^ permalink raw reply
* Re: [PATCH -V1 07/24] powerpc: Add size argument to pgtable_cache_add
From: Paul Mackerras @ 2013-03-04 5:13 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <1361865914-13911-8-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
On Tue, Feb 26, 2013 at 01:34:57PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> We will use this later with THP changes to request for pmd table of double the size.
> THP code does PTE page allocation along with large page request and deposit them
> for later use. This is to ensure that we won't have any failures when we split
> huge pages to regular pages.
>
> On powerpc we want to use the deposited PTE page for storing hash pte slot and
> secondary bit information for the HPTEs. Hence we save them in the second half
> of the pmd table.
Looks OK, but you should explain why you made the wholesale change of
"shift" to "index". Is there some important semantic difference, or
do you just prefer the "index" name for some reason?
Paul.
^ permalink raw reply
* Re: [PATCH -V1 08/24] powerpc: Use encode avpn where we need only avpn values
From: Paul Mackerras @ 2013-03-04 5:15 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <1361865914-13911-9-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
On Tue, Feb 26, 2013 at 01:34:58PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> In all these cases we are doing something similar to
>
> HPTE_V_COMPARE(hpte_v, want_v) which ignores the HPTE_V_LARGE bit
>
> With MPSS support we would need actual page size to set HPTE_V_LARGE
> bit and that won't be available in most of these cases. Since we are ignoring
> HPTE_V_LARGE bit, use the avpn value instead. There should not be any change
> in behaviour after this patch.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
^ permalink raw reply
* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Paul Mackerras @ 2013-03-04 5:48 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <1361865914-13911-10-git-send-email-aneesh.kumar@linux.vnet.ibm.com>
On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> We look at both the segment base page size and actual page size and store
> the pte-lp-encodings in an array per base page size.
>
> We also update all relevant functions to take actual page size argument
> so that we can use the correct PTE LP encoding in HPTE. This should also
> get the basic Multiple Page Size per Segment (MPSS) support. This is needed
> to enable THP on ppc64.
Mostly looks OK, comments below...
> +/*
> + * HPTE LP details
> + */
> +#define LP_SHIFT 12
> +#define LP_BITS 8
> +#define LP_MASK(i) ((0xFF >> (i)) << LP_SHIFT)
The reader might be wondering at this point what "LP" is; be kind and
make it "large page" in the comment for them.
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 71d0c90..48f6d99 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1515,7 +1515,7 @@ static void kvmppc_add_seg_page_size(struct kvm_ppc_one_seg_page_size **sps,
> (*sps)->page_shift = def->shift;
> (*sps)->slb_enc = def->sllp;
> (*sps)->enc[0].page_shift = def->shift;
> - (*sps)->enc[0].pte_enc = def->penc;
> + (*sps)->enc[0].pte_enc = def->penc[linux_psize];
> (*sps)++;
> }
This will only return the entries where actual page size == base page
size, which basically means that KVM guests won't be able to use
MPSS. We will need to return multiple entries in that case.
> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
> +{
> + unsigned int mask;
> + int i, penc, shift;
> + /* Look at the 8 bit LP value */
> + unsigned int lp = (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
> +
> + penc = 0;
> + for (i = 0; i < MMU_PAGE_COUNT; i++) {
> + /* valid entries have a shift value */
> + if (!mmu_psize_defs[i].shift)
> + continue;
> +
> + /* encoding bits per actual page size */
> + shift = mmu_psize_defs[i].shift - 11;
> + if (shift > 9)
> + shift = 9;
> + mask = (1 << shift) - 1;
> + if ((lp & mask) == mmu_psize_defs[psize].penc[i])
> + return i;
> + }
> + return -1;
> +}
This doesn't look right to me. First, it's not clear what the 11 and
9 refer to, and I think the 9 should be LP_BITS (i.e. 8). Secondly,
the mask for the comparison needs to depend on the actual page size
not the base page size.
I strongly suggest you pull out this code together with
native_hpte_insert into a little userspace test program that runs
through all the possible page size combinations, creating an HPTE and
then decoding it with hpte_actual_psize() to check that you get back
the correct actual page size.
> static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
> - int *psize, int *ssize, unsigned long *vpn)
> + int *psize, int *apsize, int *ssize, unsigned long *vpn)
> {
> unsigned long avpn, pteg, vpi;
> unsigned long hpte_r = hpte->r;
> unsigned long hpte_v = hpte->v;
> unsigned long vsid, seg_off;
> - int i, size, shift, penc;
> + int i, size, a_size = MMU_PAGE_4K, shift, penc;
>
> if (!(hpte_v & HPTE_V_LARGE))
> size = MMU_PAGE_4K;
> @@ -395,12 +422,13 @@ static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
> /* valid entries have a shift value */
> if (!mmu_psize_defs[size].shift)
> continue;
> -
> - if (penc == mmu_psize_defs[size].penc)
> - break;
> + for (a_size = 0; a_size < MMU_PAGE_COUNT; a_size++)
> + if (penc == mmu_psize_defs[size].penc[a_size])
> + goto out;
Once again I don't think this is correct, since the number of bits in
the page size encoding depends on the page size. In fact the
calculation of penc in that function looks completely bogus to me (not
that that is code that you have written or modified, but it looks to
me like it needs fixing).
> static int __init htab_dt_scan_page_sizes(unsigned long node,
> const char *uname, int depth,
> void *data)
> @@ -294,60 +318,57 @@ static int __init htab_dt_scan_page_sizes(unsigned long node,
> size /= 4;
> cur_cpu_spec->mmu_features &= ~(MMU_FTR_16M_PAGE);
> while(size > 0) {
> - unsigned int shift = prop[0];
> + unsigned int base_shift = prop[0];
> unsigned int slbenc = prop[1];
> unsigned int lpnum = prop[2];
> - unsigned int lpenc = 0;
> struct mmu_psize_def *def;
> - int idx = -1;
> + int idx, base_idx;
>
> size -= 3; prop += 3;
> - while(size > 0 && lpnum) {
> - if (prop[0] == shift)
> - lpenc = prop[1];
> + base_idx = get_idx_from_shift(base_shift);
> + if (base_idx < 0) {
> + /*
> + * skip the pte encoding also
> + */
> prop += 2; size -= 2;
> - lpnum--;
> + continue;
> }
> - switch(shift) {
> - case 0xc:
> - idx = MMU_PAGE_4K;
> - break;
> - case 0x10:
> - idx = MMU_PAGE_64K;
> - break;
> - case 0x14:
> - idx = MMU_PAGE_1M;
> - break;
> - case 0x18:
> - idx = MMU_PAGE_16M;
> + def = &mmu_psize_defs[base_idx];
> + if (base_idx == MMU_PAGE_16M)
> cur_cpu_spec->mmu_features |= MMU_FTR_16M_PAGE;
> - break;
> - case 0x22:
> - idx = MMU_PAGE_16G;
> - break;
> - }
> - if (idx < 0)
> - continue;
> - def = &mmu_psize_defs[idx];
> - def->shift = shift;
> - if (shift <= 23)
> +
> + def->shift = base_shift;
> + if (base_shift <= 23)
> def->avpnm = 0;
> else
> - def->avpnm = (1 << (shift - 23)) - 1;
> + def->avpnm = (1 << (base_shift - 23)) - 1;
> def->sllp = slbenc;
> - def->penc = lpenc;
> - /* We don't know for sure what's up with tlbiel, so
> + /*
> + * We don't know for sure what's up with tlbiel, so
> * for now we only set it for 4K and 64K pages
> */
> - if (idx == MMU_PAGE_4K || idx == MMU_PAGE_64K)
> + if (base_idx == MMU_PAGE_4K || base_idx == MMU_PAGE_64K)
> def->tlbiel = 1;
> else
> def->tlbiel = 0;
>
> - DBG(" %d: shift=%02x, sllp=%04lx, avpnm=%08lx, "
> - "tlbiel=%d, penc=%d\n",
> - idx, shift, def->sllp, def->avpnm, def->tlbiel,
> - def->penc);
> + while (size > 0 && lpnum) {
> + unsigned int shift = prop[0];
> + unsigned int penc = prop[1];
> +
> + prop += 2; size -= 2;
> + lpnum--;
> +
> + idx = get_idx_from_shift(shift);
> + if (idx < 0)
> + continue;
> +
> + def->penc[idx] = penc;
> + DBG(" %d: shift=%02x, sllp=%04lx, "
> + "avpnm=%08lx, tlbiel=%d, penc=%d\n",
> + idx, shift, def->sllp, def->avpnm,
> + def->tlbiel, def->penc[idx]);
> + }
I don't see where in this function you set the penc[] elements for
invalid actual page sizes to -1.
Paul.
^ permalink raw reply
* Re: [PATCH] kexec/ppc: Fix kernel program entry point while changing the load addr
From: Suzuki K. Poulose @ 2013-03-04 5:56 UTC (permalink / raw)
To: Simon Horman
Cc: Matthew McClintock, Sebastian Andrzej Siewior, kexec,
linuxppc-dev
In-Reply-To: <20130304014152.GB9007@verge.net.au>
On 03/04/2013 07:11 AM, Simon Horman wrote:
> [ Cc: linuxppc-dev@lists.ozlabs.org ]
>
> On Sun, Mar 03, 2013 at 01:06:00PM +0530, Suzuki K. Poulose wrote:
>> From: Suzuki K. Poulose <suzuki@in.ibm.com>
>>
>> uImage probe fills the entry point (ep) based on the load_addr
>> from the uImage headers. If we change the load_addr, we should
>> accordingly update the entry point.
>>
>> For ELF, calculate the offset of e_entry from the virtual start
>> address and add it to the physical start address to find the
>> physical address of kernel entry.
>>
>> i.e,
>> pa (e_entry) = pa(phdr[0].p_vaddr) + (e_entry - phdr[0].p_vaddr)
>> = kernel_addr + (e_entry - phdr[0].p_vaddr)
>
> Would it be possible for someone to provide a review of this change?
To make it bit more clear :
The entry point of the kernel is usually at 0 offset from the first
PT_LOAD section. The current code makes this assumption and uses the
pa(phdr[0].p_vaddr) as the kernel entry.
But this *may* not be true always, in such a case the kexec would fail.
While I fixed the uImage case, I thought it would be better to handle
the same case in ELF.
Btw, this calculation is not specific to ppc32.
Thanks
Suzuki
>
>>
>> Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
>> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>> Cc: Matthew McClintock <msm@freescale.com>
>> ---
>> kexec/arch/ppc/kexec-elf-ppc.c | 12 ++++++++----
>> kexec/arch/ppc/kexec-uImage-ppc.c | 6 +++++-
>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/kexec/arch/ppc/kexec-elf-ppc.c b/kexec/arch/ppc/kexec-elf-ppc.c
>> index 8e408cc..5f63a64 100644
>> --- a/kexec/arch/ppc/kexec-elf-ppc.c
>> +++ b/kexec/arch/ppc/kexec-elf-ppc.c
>> @@ -397,10 +397,14 @@ int elf_ppc_load(int argc, char **argv, const char *buf, off_t len,
>> die("Error device tree not loadded to address it was expecting to be loaded too!\n");
>> }
>>
>> - /* set various variables for the purgatory ehdr.e_entry is a
>> - * virtual address, we can use kernel_addr which
>> - * should be the physical start address of the kernel */
>> - addr = kernel_addr;
>> + /*
>> + * set various variables for the purgatory.
>> + * ehdr.e_entry is a virtual address. we know physical start
>> + * address of the kernel (kernel_addr). Find the offset of
>> + * e_entry from the virtual start address(e_phdr[0].p_vaddr)
>> + * and calculate the actual physical address of the 'kernel entry'.
>> + */
>> + addr = kernel_addr + (ehdr.e_entry - ehdr.e_phdr[0].p_vaddr);
>> elf_rel_set_symbol(&info->rhdr, "kernel", &addr, sizeof(addr));
>>
>> addr = dtb_addr;
>> diff --git a/kexec/arch/ppc/kexec-uImage-ppc.c b/kexec/arch/ppc/kexec-uImage-ppc.c
>> index e0bc7bb..900cd16 100644
>> --- a/kexec/arch/ppc/kexec-uImage-ppc.c
>> +++ b/kexec/arch/ppc/kexec-uImage-ppc.c
>> @@ -159,15 +159,19 @@ static int ppc_load_bare_bits(int argc, char **argv, const char *buf,
>>
>> /*
>> * If the provided load_addr cannot be allocated, find a new
>> - * area.
>> + * area. Rebase the entry point based on the new load_addr.
>> */
>> if (!valid_memory_range(info, load_addr, load_addr + (len + _1MiB))) {
>> + int ep_offset = ep - load_addr;
>> +
>> load_addr = locate_hole(info, len + _1MiB, 0, 0, max_addr, 1);
>> if (load_addr == ULONG_MAX) {
>> printf("Can't allocate memory for kernel of len %ld\n",
>> len + _1MiB);
>> return -1;
>> }
>> +
>> + ep = load_addr + ep_offset;
>> }
>>
>> add_segment(info, buf, len, load_addr, len + _1MiB);
>>
>>
>> _______________________________________________
>> kexec mailing list
>> kexec@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
>>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
^ permalink raw reply
* RE: [PATCH 1/6 v8] iommu/fsl: Store iommu domain information pointer in archdata.
From: Sethi Varun-B16395 @ 2013-03-04 6:35 UTC (permalink / raw)
To: Yoder Stuart-B08248, Alexey Kardashevskiy
Cc: Wood Scott-B07421, Alex Williamson, Joerg Roedel,
linux-kernel@vger.kernel.org list,
iommu@lists.linux-foundation.org, Paul Mackerras,
linuxppc-dev@lists.ozlabs.org list, David Gibson
In-Reply-To: <9F6FE96B71CF29479FF1CDC8046E1503565876@039-SN1MPN1-003.039d.mgd.msft.net>
> -----Original Message-----
> From: Yoder Stuart-B08248
> Sent: Friday, March 01, 2013 9:52 PM
> To: Alexey Kardashevskiy; Sethi Varun-B16395
> Cc: Kumar Gala; Benjamin Herrenschmidt; iommu@lists.linux-foundation.org;
> linuxppc-dev@lists.ozlabs.org list; linux-kernel@vger.kernel.org list;
> Wood Scott-B07421; Joerg Roedel; Paul Mackerras; David Gibson; Alex
> Williamson
> Subject: RE: [PATCH 1/6 v8] iommu/fsl: Store iommu domain information
> pointer in archdata.
>=20
>=20
>=20
> > -----Original Message-----
> > From: Alexey Kardashevskiy [mailto:aik@ozlabs.ru]
> > Sent: Friday, March 01, 2013 4:07 AM
> > To: Sethi Varun-B16395
> > Cc: Kumar Gala; Benjamin Herrenschmidt;
> > iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org list;
> > linux-kernel@vger.kernel.org list; Wood Scott-B07421; Yoder
> > Stuart-B08248; Joerg Roedel; Paul Mackerras; David Gibson; Alex
> > Williamson
> > Subject: Re: [PATCH 1/6 v8] iommu/fsl: Store iommu domain information
> pointer in archdata.
> >
> > btw the device struct already has a pointer to its iommu_group, and
> > the iommu_group struct itself has a pointer void *iommu_data which you
> > could use for anything you want (iommu_group_get_iommudata(),
> > iommu_group_set_iommudata()).
> >
> > By design you are expected to add iommu groups to a domain but not
> > devices so I am not so sure that you really need a pointer to domain
> > in the device struct.
>=20
> Well, at the lowest level the IOMMU API does attach devices to domains--
> i.e.
> API attach_dev(). So, it seems to conceptually make sense to have a ptr
> from the device to the associated domain. When you implement
> attach_dev() you need to be able to see whether the device is already
> attached to
> a domain. Adding a couple of levels of indirection...from device to
> group to domain...doesn't seems to make things simpler or better IMHO.
>=20
> x86 keeps a pointer to the domain in device archdata and since there is a
> direct correlation between a device and domain I'd rather see it where
> this patch has it.
>=20
As Stuart stated this allows us to maintain the device <-> domain relations=
hip at the lowest level.
-Varun
^ permalink raw reply
* [PATCH] fsl/sata: create a sysfs entry for rx water mark
From: Qiang Liu @ 2013-03-04 7:20 UTC (permalink / raw)
To: jgarzik, linux-ide; +Cc: Qiang Liu, linuxppc-dev
Support config RX WATER MARK via sysfs when running at run-time;
A wrokaround for fix the exception happened to some WD HDD, found on
WD3000HLFS-01G6U1, WD3000HLFS-01G6U0, some SSD disks. The read performance
is also regression (about 30%) when use default value.
According to the latest documents, 0x10 is the default value of RX WATER MARK,
but exception/performance issue happened to some disks mentioned above.
The exception log as below when testing read performance with IOZone:
ata1.00: exception Emask 0x0 SAct 0x7 SErr 0x800000 action 0x6 frozen
ata1: SError: { LinkSeq }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/00:00:ff:2c:14/01:00:02:00:00/40 tag 0 ncq 131072 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/00:08:ff:2d:14/01:00:02:00:00/40 tag 1 ncq 131072 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/10:10:af:08:6e/00:00:12:00:00/40 tag 2 ncq 8192 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: Hardreset failed, not off-lined 0
ata1: Signature Update detected @ 504 msecs
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1.00: device reported invalid CHS sector 0
ata1: EH complete
The exception/performance can be resolved when RX WATER MARK value is 0x16.
Signed-off-by: Qiang Liu <qiang.liu@freescale.com>
---
drivers/ata/sata_fsl.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 55 insertions(+), 0 deletions(-)
diff --git a/drivers/ata/sata_fsl.c b/drivers/ata/sata_fsl.c
index 124b2c1..515f66f 100644
--- a/drivers/ata/sata_fsl.c
+++ b/drivers/ata/sata_fsl.c
@@ -285,6 +285,7 @@ struct sata_fsl_host_priv {
int irq;
int data_snoop;
struct device_attribute intr_coalescing;
+ struct device_attribute rx_watermark;
};
static void fsl_sata_set_irq_coalescing(struct ata_host *host,
@@ -343,6 +344,48 @@ static ssize_t fsl_sata_intr_coalescing_store(struct device *dev,
return strlen(buf);
}
+static ssize_t fsl_sata_rx_watermark_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ unsigned int rx_watermark;
+ unsigned long flags;
+ struct ata_host *host = dev_get_drvdata(dev);
+ struct sata_fsl_host_priv *host_priv = host->private_data;
+ void __iomem *csr_base = host_priv->csr_base;
+
+ spin_lock_irqsave(&host->lock, flags);
+ rx_watermark = ioread32(csr_base + TRANSCFG);
+ rx_watermark &= 0x1f;
+
+ spin_unlock_irqrestore(&host->lock, flags);
+ return sprintf(buf, "%d\n", rx_watermark);
+}
+
+static ssize_t fsl_sata_rx_watermark_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int rx_watermark;
+ unsigned long flags;
+ struct ata_host *host = dev_get_drvdata(dev);
+ struct sata_fsl_host_priv *host_priv = host->private_data;
+ void __iomem *csr_base = host_priv->csr_base;
+ u32 temp;
+
+ if (sscanf(buf, "%d", &rx_watermark) != 1) {
+ printk(KERN_ERR "fsl-sata: wrong parameter format.\n");
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&host->lock, flags);
+ temp = ioread32(csr_base + TRANSCFG);
+ temp &= 0xffffffe0;
+ iowrite32(temp | rx_watermark, csr_base + TRANSCFG);
+
+ spin_unlock_irqrestore(&host->lock, flags);
+ return strlen(buf);
+}
+
static inline unsigned int sata_fsl_tag(unsigned int tag,
void __iomem *hcr_base)
{
@@ -1500,6 +1543,17 @@ static int sata_fsl_probe(struct platform_device *ofdev)
if (retval)
goto error_exit_with_cleanup;
+ host_priv->rx_watermark.show = fsl_sata_rx_watermark_show;
+ host_priv->rx_watermark.store = fsl_sata_rx_watermark_store;
+ sysfs_attr_init(&host_priv->rx_watermark.attr);
+ host_priv->rx_watermark.attr.name = "rx_watermark";
+ host_priv->rx_watermark.attr.mode = S_IRUGO | S_IWUSR;
+ retval = device_create_file(host->dev, &host_priv->rx_watermark);
+ if (retval) {
+ device_remove_file(&ofdev->dev, &host_priv->intr_coalescing);
+ goto error_exit_with_cleanup;
+ }
+
return 0;
error_exit_with_cleanup:
@@ -1523,6 +1577,7 @@ static int sata_fsl_remove(struct platform_device *ofdev)
struct sata_fsl_host_priv *host_priv = host->private_data;
device_remove_file(&ofdev->dev, &host_priv->intr_coalescing);
+ device_remove_file(&ofdev->dev, &host_priv->rx_watermark);
ata_host_detach(host);
--
1.7.5.1
^ permalink raw reply related
* [PATCH] powerpc/85xx: Add platform_device declaration to fsl_pci.h
From: Jia Hongtao @ 2013-03-04 8:40 UTC (permalink / raw)
To: linuxppc-dev, galak; +Cc: B07421, b38951
mpc85xx_pci_err_probe(struct platform_device *op) need platform_device
declaration for definition. Otherwise, it will cause compile error if any
files including fsl_pci.h without declaration of platform_device.
Signed-off-by: Jia Hongtao <B38951@freescale.com>
---
arch/powerpc/sysdev/fsl_pci.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/arch/powerpc/sysdev/fsl_pci.h
index c495c00..df66721 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/arch/powerpc/sysdev/fsl_pci.h
@@ -14,6 +14,8 @@
#ifndef __POWERPC_FSL_PCI_H
#define __POWERPC_FSL_PCI_H
+#include <linux/platform_device.h>
+
#define PCIE_LTSSM 0x0404 /* PCIE Link Training and Status */
#define PCIE_LTSSM_L0 0x16 /* L0 state */
#define PCIE_IP_REV_2_2 0x02080202 /* PCIE IP block version Rev2.2 */
--
1.7.5.1
^ permalink raw reply related
* [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Jia Hongtao @ 2013-03-04 8:40 UTC (permalink / raw)
To: linuxppc-dev, galak; +Cc: B07421, b38951
A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
goes down. when the link goes down, Non-posted transactions issued
via the ATMU requiring completion result in an instruction stall.
At the same time a machine-check exception is generated to the core
to allow further processing by the handler. We implements the handler
which skips the instruction caused the stall.
This patch depends on patch:
powerpc/85xx: Add platform_device declaration to fsl_pci.h
Signed-off-by: Zhao Chenhui <b35336@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Liu Shuo <soniccat.liu@gmail.com>
Signed-off-by: Jia Hongtao <B38951@freescale.com>
---
V4 changed:
* Rebased the patch on kernel v3.8.0-rc5
The link for V3:
http://patchwork.ozlabs.org/patch/171120/
arch/powerpc/kernel/cpu_setup_fsl_booke.S | 2 +-
arch/powerpc/kernel/traps.c | 3 ++
arch/powerpc/sysdev/fsl_pci.c | 41 +++++++++++++++++++++++++++++
arch/powerpc/sysdev/fsl_pci.h | 6 ++++
4 files changed, 51 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index dcd8819..f1bde90 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -66,7 +66,7 @@ _GLOBAL(__setup_cpu_e500v2)
bl __e500_icache_setup
bl __e500_dcache_setup
bl __setup_e500_ivors
-#ifdef CONFIG_FSL_RIO
+#if defined(CONFIG_FSL_RIO) || defined(CONFIG_FSL_PCI)
/* Ensure that RFXE is set */
mfspr r3,SPRN_HID1
oris r3,r3,HID1_RFXE@h
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a008cf5..dd275a4 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -59,6 +59,7 @@
#include <asm/fadump.h>
#include <asm/switch_to.h>
#include <asm/debug.h>
+#include <sysdev/fsl_pci.h>
#if defined(CONFIG_DEBUGGER) || defined(CONFIG_KEXEC)
int (*__debugger)(struct pt_regs *regs) __read_mostly;
@@ -556,6 +557,8 @@ int machine_check_e500(struct pt_regs *regs)
if (reason & MCSR_BUS_RBERR) {
if (fsl_rio_mcheck_exception(regs))
return 1;
+ if (fsl_pci_mcheck_exception(regs))
+ return 1;
}
printk("Machine check in kernel mode.\n");
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index 682084d..39a10b6 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -30,6 +30,7 @@
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
#include <asm/machdep.h>
#include <sysdev/fsl_soc.h>
#include <sysdev/fsl_pci.h>
@@ -826,6 +827,46 @@ u64 fsl_pci_immrbar_base(struct pci_controller *hose)
return 0;
}
+#ifdef CONFIG_E500
+static int is_in_pci_mem_space(phys_addr_t addr)
+{
+ struct pci_controller *hose;
+ struct resource *res;
+ int i;
+
+ list_for_each_entry(hose, &hose_list, list_node) {
+ if (!early_find_capability(hose, 0, 0, PCI_CAP_ID_EXP))
+ continue;
+
+ for (i = 0; i < 3; i++) {
+ res = &hose->mem_resources[i];
+ if ((res->flags & IORESOURCE_MEM) &&
+ addr >= res->start && addr <= res->end)
+ return 1;
+ }
+ }
+ return 0;
+}
+
+int fsl_pci_mcheck_exception(struct pt_regs *regs)
+{
+ phys_addr_t addr = 0;
+
+#ifdef CONFIG_PHYS_64BIT
+ addr = mfspr(SPRN_MCARU);
+ addr <<= 32;
+#endif
+ addr += mfspr(SPRN_MCAR);
+
+ if (is_in_pci_mem_space(addr)) {
+ regs->nip += 4;
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
#if defined(CONFIG_FSL_SOC_BOOKE) || defined(CONFIG_PPC_86xx)
static const struct of_device_id pci_ids[] = {
{ .compatible = "fsl,mpc8540-pci", },
diff --git a/arch/powerpc/sysdev/fsl_pci.h b/arch/powerpc/sysdev/fsl_pci.h
index df66721..ee90a95 100644
--- a/arch/powerpc/sysdev/fsl_pci.h
+++ b/arch/powerpc/sysdev/fsl_pci.h
@@ -115,5 +115,11 @@ static inline int mpc85xx_pci_err_probe(struct platform_device *op)
}
#endif
+#ifdef CONFIG_FSL_PCI
+extern int fsl_pci_mcheck_exception(struct pt_regs *);
+#else
+static inline int fsl_pci_mcheck_exception(struct pt_regs *regs) {return 0; }
+#endif
+
#endif /* __POWERPC_FSL_PCI_H */
#endif /* __KERNEL__ */
--
1.7.5.1
^ permalink raw reply related
* [PATCH 0/3] powerpc: FSCR fixes for POWER8
From: Michael Neuling @ 2013-03-04 9:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
Benh,
Here are a few fixes for the POWER8 FSCR.
First patch changes the FSCR so that it's set on secondary CPUs as well as when
MSR HV=0.
Second two patches make sure that the FSCR DSCR bit is set so that we don't
trap on DSCR accesses.
These are aimed to fix issues in 3.9.
Mikey
^ permalink raw reply
* [PATCH 1/3] powerpc: Fix setting FSCR for HV=0 and secondary CPUs
From: Michael Neuling @ 2013-03-04 9:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
In-Reply-To: <1362390402-17725-1-git-send-email-mikey@neuling.org>
Currently we only set the FSCR when HV=1 but this feature is available when
HV=0 also. This patch sets FSCR when HV=0.
Also, we currently only set the FSCR on the master CPU. This patch also sets
the FSCR on secondary CPUs.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Ian Munsie <imunsie@au1.ibm.com>
---
arch/powerpc/kernel/cpu_setup_power.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index d29facb..bb2d203 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -48,6 +48,7 @@ _GLOBAL(__restore_cpu_power7)
_GLOBAL(__setup_cpu_power8)
mflr r11
+ bl __init_FSCR
bl __init_hvmode_206
mtlr r11
beqlr
@@ -56,13 +57,13 @@ _GLOBAL(__setup_cpu_power8)
mfspr r3,SPRN_LPCR
oris r3, r3, LPCR_AIL_3@h
bl __init_LPCR
- bl __init_FSCR
bl __init_TLB
mtlr r11
blr
_GLOBAL(__restore_cpu_power8)
mflr r11
+ bl __init_FSCR
mfmsr r3
rldicl. r0,r3,4,63
beqlr
--
1.7.10.4
^ permalink raw reply related
* [PATCH 2/3] powerpc: Add DSCR FSCR register bit definition
From: Michael Neuling @ 2013-03-04 9:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
In-Reply-To: <1362390402-17725-1-git-send-email-mikey@neuling.org>
Also harmonise TAR bit definition too.
Signed-off-by: Michael Neuling <mikey@neuling.org>
---
arch/powerpc/include/asm/reg.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e665861..a43cd2d 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -266,7 +266,8 @@
#define SPRN_HSRR0 0x13A /* Hypervisor Save/Restore 0 */
#define SPRN_HSRR1 0x13B /* Hypervisor Save/Restore 1 */
#define SPRN_FSCR 0x099 /* Facility Status & Control Register */
-#define FSCR_TAR (1<<8) /* Enable Target Adress Register */
+#define FSCR_TAR (1 << (63-55)) /* Enable Target Adress Register */
+#define FSCR_DSCR (1 << (63-61)) /* Enable DSCR */
#define SPRN_TAR 0x32f /* Target Address Register */
#define SPRN_LPCR 0x13E /* LPAR Control Register */
#define LPCR_VPM0 (1ul << (63-0))
--
1.7.10.4
^ permalink raw reply related
* [PATCH 3/3] powerpc: Set DSCR bit in FSCR setup
From: Michael Neuling @ 2013-03-04 9:46 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, Ian Munsie
In-Reply-To: <1362390402-17725-1-git-send-email-mikey@neuling.org>
We support DSCR so we should make sure we set it in the FSCR incase some
firmwares don't set it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
---
arch/powerpc/kernel/cpu_setup_power.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index bb2d203..ea847ab 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -116,7 +116,7 @@ __init_LPCR:
__init_FSCR:
mfspr r3,SPRN_FSCR
- ori r3,r3,FSCR_TAR
+ ori r3,r3,FSCR_TAR|FSCR_DSCR
mtspr SPRN_FSCR,r3
blr
--
1.7.10.4
^ permalink raw reply related
* Re: [PATCH -V1 06/24] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-03-04 10:58 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304045853.GB27523@drongo>
Paul Mackerras <paulus@samba.org> writes:
> On Tue, Feb 26, 2013 at 01:34:56PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>
>> We allocate one page for the last level of linux page table. With THP and
>> large page size of 16MB, that would mean we are be wasting large part
>> of that page. To map 16MB area, we only need a PTE space of 2K with 64K
>> page size. This patch reduce the space wastage by sharing the page
>> allocated for the last level of linux page table with multiple pmd
>> entries. We call these smaller chunks PTE page fragments and allocated
>> page, PTE page. We use the page->_mapcount as bitmap to indicate which
>> PTE fragments are free.
>>
>> page->_mapcount is divided into two halves. The upper half is used for
>> tracking the freed page framents in the RCU grace period.
>>
>> In order to support systems which doesn't have 64K HPTE support, we also
>> add another 2K to PTE page fragment. The second half of the PTE fragments
>> is used for storing slot and secondary bit information of an HPTE. With this
>> we now have a 4K PTE fragment.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> This one has taken me hours to review. Perhaps it's partly because of
> the way that diff has matched things up, but it's difficult to see
> what's moved where, what's common code that is now the 4k page case,
> etc. For example, pmd_alloc_one() and pmd_free() are unchanged, but
> the diff shows them as removed in one place and added in another.
I updated the previous patch "powerpc: Move the pte free routines from
common header" to duplicate the 4K and 64K definitions. That helped in
making the diff better. I have inlined the resulting patch below. Let
know if this one looks better.
>
> The other general comment I have is that it's not really clear when a
> page will be on the mm->context.pgtable_list and when it won't. I
> would like to see an invariant that says something like "the page is
> on the pgtable_list if and only if (page->_mapcount & FRAG_MASK) is
> neither 0 nor FRAG_MASK". But that doesn't seem to be the case
> exactly, and I can't see any consistent rule, which makes me think
> there are going to be bugs in corner cases.
>
I added the below comment when initializing the list.
+#ifdef CONFIG_PPC_64K_PAGES
+ /*
+ * Used to support 4K PTE fragment. The pages are added to list,
+ * when we have free framents in the page. We track the whether
+ * a page frament is available using page._mapcount. A value of
+ * zero indicate none of the fragments are used and page can be
+ * freed. A value of FRAG_MASK indicate all the fragments are used
+ * and hence the page will be removed from the below list.
+ */
+ INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif
I am not sure about why you say there is no consistent rule. Can you
elaborate on that ?
> Consider, for example, the case where a page has two fragments still
> in use, and one of them gets queued up by RCU for freeing via a call
> to page_table_free_rcu, and then the other one gets freed through
> page_table_free(). Neither the call to page_table_free_rcu nor the
> call to page_table_free will take the page off the list AFAICS, and
> then __page_table_free_rcu() will free the page while it's still on
> the pgtable_list.
The last one that ends up doing atomic_xor_bits which cause the mapcount
to go zero, will take the page off the list and free the page.
>
> More specific comments below...
>
>> -static inline void pgtable_free(void *table, unsigned index_size)
>> -{
>> - if (!index_size)
>> - free_page((unsigned long)table);
>> - else {
>> - BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
>> - kmem_cache_free(PGT_CACHE(index_size), table);
>> - }
>> -}
>
> This is still used in the UP case, both for 4k and 64k, and UP configs
> now fail to build.
>
>> static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
>> {
>> free_page((unsigned long)pte);
>> @@ -156,7 +118,12 @@ static inline void __tlb_remove_table(void *_table)
>> void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
>> unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
>>
>> - pgtable_free(table, shift);
>> + if (!shift)
>> + free_page((unsigned long)table);
>> + else {
>> + BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
>> + kmem_cache_free(PGT_CACHE(shift), table);
>> + }
>
> Any particular reason for open-coding pgtable_free() here?
>
>> +/*
>> + * we support 15 fragments per PTE page. This is limited by how many
>> + * bits we can pack in page->_mapcount. We use the first half for
>> + * tracking the usage for rcu page table free.
>> + */
>> +#define FRAG_MASK_BITS 15
>> +#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
>
> Atomic_t variables are 32-bit, so you really should be able to make
> FRAG_MASK_BITS be 16 and avoid wasting the last fragment of each page.
>
commit ff1af6b4b5c90223bac43052436aae943bae1104
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Fri Oct 12 10:54:07 2012 +0530
powerpc: Reduce PTE table memory wastage
We allocate one page for the last level of linux page table. With THP and
large page size of 16MB, that would mean we are be wasting large part
of that page. To map 16MB area, we only need a PTE space of 2K with 64K
page size. This patch reduce the space wastage by sharing the page
allocated for the last level of linux page table with multiple pmd
entries. We call these smaller chunks PTE page fragments and allocated
page, PTE page. We use the page->_mapcount as bitmap to indicate which
PTE fragments are free.
page->_mapcount is divided into two halves. The upper half is used for
tracking the freed page framents in the RCU grace period.
In order to support systems which doesn't have 64K HPTE support, we also
add another 2K to PTE page fragment. The second half of the PTE fragments
is used for storing slot and secondary bit information of an HPTE. With this
we now have a 4K PTE fragment.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index 99d43e0..ffae629 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -231,6 +231,10 @@ typedef struct {
u64 high_slices_psize; /* 4 bits per slice for now */
u16 user_psize; /* page size index */
#endif
+#ifdef CONFIG_PPC_64K_PAGES
+ /* for 4K PTE fragment support */
+ struct list_head pgtable_list;
+#endif
} mm_context_t;
/* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 35bb51e..feac737 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -498,6 +498,10 @@ typedef struct {
unsigned long acop; /* mask of enabled coprocessor types */
unsigned int cop_pid; /* pid value used with coprocessors */
#endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+ /* for 4K PTE fragment support */
+ struct list_head pgtable_list;
+#endif
} mm_context_t;
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index f072e97..38e7ff6 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -378,7 +378,11 @@ void arch_free_page(struct page *page, int order);
struct vm_area_struct;
+#ifdef CONFIG_PPC_64K_PAGES
+typedef pte_t *pgtable_t;
+#else
typedef struct page *pgtable_t;
+#endif
#include <asm-generic/memory_model.h>
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index cdbf555..3b62636 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -150,6 +150,18 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
#else /* if CONFIG_PPC_64K_PAGES */
+extern unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
+extern void page_table_free(struct mm_struct *, unsigned long *);
+#ifdef CONFIG_SMP
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
+extern void __tlb_remove_table(void *_table);
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+ void *table, int shift)
+{
+ pgtable_free(table, shift);
+}
+#endif
#define pud_populate(mm, pud, pmd) pud_set(pud, (unsigned long)pmd)
static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
@@ -161,90 +173,42 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
pgtable_t pte_page)
{
- pmd_populate_kernel(mm, pmd, page_address(pte_page));
+ pmd_set(pmd, (unsigned long)pte_page);
}
static inline pgtable_t pmd_pgtable(pmd_t pmd)
{
- return pmd_page(pmd);
+ return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
}
static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
unsigned long address)
{
- return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+ return (pte_t *)page_table_alloc(mm, address);
}
static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
- unsigned long address)
+ unsigned long address)
{
- struct page *page;
- pte_t *pte;
-
- pte = pte_alloc_one_kernel(mm, address);
- if (!pte)
- return NULL;
- page = virt_to_page(pte);
- pgtable_page_ctor(page);
- return page;
+ return (pgtable_t)page_table_alloc(mm, address);
}
static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
{
- free_page((unsigned long)pte);
+ page_table_free(mm, (unsigned long *)pte);
}
static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
{
- pgtable_page_dtor(ptepage);
- __free_page(ptepage);
+ page_table_free(mm, (unsigned long *)ptepage);
}
-static inline void pgtable_free(void *table, unsigned index_size)
-{
- if (!index_size)
- free_page((unsigned long)table);
- else {
- BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
- kmem_cache_free(PGT_CACHE(index_size), table);
- }
-}
-
-#ifdef CONFIG_SMP
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
- void *table, int shift)
-{
- unsigned long pgf = (unsigned long)table;
- BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
- pgf |= shift;
- tlb_remove_table(tlb, (void *)pgf);
-}
-
-static inline void __tlb_remove_table(void *_table)
-{
- void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
- unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
-
- pgtable_free(table, shift);
-}
-#else /* !CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
- void *table, int shift)
-{
- pgtable_free(table, shift);
-}
-#endif /* CONFIG_SMP */
-
static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
unsigned long address)
{
- struct page *page = page_address(table);
-
tlb_flush_pgtable(tlb, address);
- pgtable_page_dtor(page);
- pgtable_free_tlb(tlb, page, 0);
+ pgtable_free_tlb(tlb, table, 0);
}
-
#endif /* CONFIG_PPC_64K_PAGES */
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -258,7 +222,6 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
}
-
#define __pmd_free_tlb(tlb, pmd, addr) \
pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
#ifndef CONFIG_PPC_64K_PAGES
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6da881b..442f858 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -575,7 +575,17 @@ void __init setup_arch(char **cmdline_p)
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = klimit;
-
+#ifdef CONFIG_PPC_64K_PAGES
+ /*
+ * Used to support 4K PTE fragment. The pages are added to list,
+ * when we have free framents in the page. We track the whether
+ * a page frament is available using page._mapcount. A value of
+ * zero indicate none of the fragments are used and page can be
+ * freed. A value of FRAG_MASK indicate all the fragments are used
+ * and hence the page will be removed from the below list.
+ */
+ INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif
irqstack_early_init();
exc_lvl_early_init();
emergency_stack_init();
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 59cd773..474b9af 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -86,6 +86,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
spin_lock_init(mm->context.cop_lockp);
#endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+ INIT_LIST_HEAD(&mm->context.pgtable_list);
+#endif
return 0;
}
@@ -97,13 +100,37 @@ void __destroy_context(int context_id)
}
EXPORT_SYMBOL_GPL(__destroy_context);
+#ifdef CONFIG_PPC_64K_PAGES
+static void destroy_pagetable_list(struct mm_struct *mm)
+{
+ struct page *page;
+ struct list_head *item, *tmp;
+
+ list_for_each_safe(item, tmp, &mm->context.pgtable_list) {
+ page = list_entry(item, struct page, lru);
+ list_del(&page->lru);
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
+}
+#else
+static inline void destroy_pagetable_list(struct mm_struct *mm)
+{
+ return;
+}
+#endif
+
void destroy_context(struct mm_struct *mm)
{
+
#ifdef CONFIG_PPC_ICSWX
drop_cop(mm->context.acop, mm);
kfree(mm->context.cop_lockp);
mm->context.cop_lockp = NULL;
#endif /* CONFIG_PPC_ICSWX */
+
+ destroy_pagetable_list(mm);
__destroy_context(mm->context.id);
subpage_prot_free(mm);
mm->context.id = MMU_NO_CONTEXT;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index e212a27..2a49044 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -337,3 +337,193 @@ EXPORT_SYMBOL(__ioremap_at);
EXPORT_SYMBOL(iounmap);
EXPORT_SYMBOL(__iounmap);
EXPORT_SYMBOL(__iounmap_at);
+
+#ifdef CONFIG_PPC_64K_PAGES
+/*
+ * we support 16 fragments per PTE page. This is limited by how many
+ * bits we can pack in page->_mapcount. We use the first half for
+ * tracking the usage for rcu page table free.
+ */
+#define FRAG_MASK_BITS 16
+#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
+/*
+ * We use a 2K PTE page fragment and another 2K for storing
+ * real_pte_t hash index
+ */
+#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
+
+static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
+{
+ unsigned int old, new;
+
+ do {
+ old = atomic_read(v);
+ new = old ^ bits;
+ } while (atomic_cmpxchg(v, old, new) != old);
+ return new;
+}
+
+unsigned long *page_table_alloc(struct mm_struct *mm, unsigned long vmaddr)
+{
+ struct page *page;
+ unsigned int mask, bit;
+ unsigned long *table;
+
+ spin_lock(&mm->page_table_lock);
+ mask = FRAG_MASK;
+ if (!list_empty(&mm->context.pgtable_list)) {
+ page = list_first_entry(&mm->context.pgtable_list,
+ struct page, lru);
+ table = (unsigned long *) page_address(page);
+ mask = atomic_read(&page->_mapcount);
+ /*
+ * Update with the higher order mask bits accumulated,
+ * added as a part of rcu free.
+ */
+ mask = mask | (mask >> FRAG_MASK_BITS);
+ }
+ if ((mask & FRAG_MASK) == FRAG_MASK) {
+ spin_unlock(&mm->page_table_lock);
+ page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+ if (!page)
+ return NULL;
+ pgtable_page_ctor(page);
+ atomic_set(&page->_mapcount, 1);
+ table = (unsigned long *) page_address(page);
+ spin_lock(&mm->page_table_lock);
+ INIT_LIST_HEAD(&page->lru);
+ list_add(&page->lru, &mm->context.pgtable_list);
+ } else {
+ /* The second half is used for real_pte_t hindex */
+ for (bit = 1; mask & bit; bit <<= 1)
+ table = (unsigned long *)((char *)table + PTE_FRAG_SIZE);
+
+ mask = atomic_xor_bits(&page->_mapcount, bit);
+ /*
+ * We have taken up all the space, remove this from
+ * the list, we will add it back when we have a free slot
+ */
+ if ((mask & FRAG_MASK) == FRAG_MASK)
+ list_del_init(&page->lru);
+ }
+ spin_unlock(&mm->page_table_lock);
+ /*
+ * zero out the newly allocated area, this make sure we don't
+ * see the old left over pte values
+ */
+ memset(table, 0, PTE_FRAG_SIZE);
+ return table;
+}
+
+void page_table_free(struct mm_struct *mm, unsigned long *table)
+{
+ struct page *page;
+ unsigned int bit, mask;
+
+ /* Free 4K page table fragment of a 64K page */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ spin_lock(&mm->page_table_lock);
+ mask = atomic_xor_bits(&page->_mapcount, bit);
+ if (mask == 0)
+ list_del(&page->lru);
+ else if (mask & FRAG_MASK) {
+ /*
+ * Add the page table page to pgtable_list so that
+ * the free fragment can be used by the next alloc
+ */
+ list_del_init(&page->lru);
+ list_add(&page->lru, &mm->context.pgtable_list);
+ }
+ spin_unlock(&mm->page_table_lock);
+ if (mask == 0) {
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
+}
+
+#ifdef CONFIG_SMP
+static void __page_table_free_rcu(void *table)
+{
+ unsigned int bit;
+ struct page *page;
+ /*
+ * this is a PTE page free 4K page table
+ * fragment of a 64K page.
+ */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ bit <<= FRAG_MASK_BITS;
+ /*
+ * clear the higher half and if nobody used the page in
+ * between, even lower half would be zero.
+ */
+ if (atomic_xor_bits(&page->_mapcount, bit) == 0) {
+ pgtable_page_dtor(page);
+ atomic_set(&page->_mapcount, -1);
+ __free_page(page);
+ }
+}
+
+static void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table)
+{
+ struct page *page;
+ struct mm_struct *mm;
+ unsigned int bit, mask;
+
+ mm = tlb->mm;
+ /* Free 4K page table fragment of a 64K page */
+ page = virt_to_page(table);
+ bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+ spin_lock(&mm->page_table_lock);
+ /*
+ * stash the actual mask in higher half, and clear the lower half
+ * and selectively, add remove from pgtable list
+ */
+ mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS));
+ if (!(mask & FRAG_MASK))
+ list_del(&page->lru);
+ else {
+ /*
+ * Add the page table page to pgtable_list so that
+ * the free fragment can be used by the next alloc.
+ * We will not be able to use it untill the rcu grace period
+ * is over, because we have the corresponding high half bit set
+ * and page_table_alloc looks at the high half bit.
+ */
+ list_del_init(&page->lru);
+ list_add_tail(&page->lru, &mm->context.pgtable_list);
+ }
+ spin_unlock(&mm->page_table_lock);
+ tlb_remove_table(tlb, table);
+}
+
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+ unsigned long pgf = (unsigned long)table;
+
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ pgf |= shift;
+ if (shift == 0)
+ /* PTE page needs special handling */
+ page_table_free_rcu(tlb, table);
+ else
+ tlb_remove_table(tlb, (void *)pgf);
+}
+
+void __tlb_remove_table(void *_table)
+{
+ void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+ unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+ if (!shift)
+ /* PTE page needs special handling */
+ __page_table_free_rcu(table);
+ else {
+ BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+ kmem_cache_free(PGT_CACHE(shift), table);
+ }
+}
+#endif
+#endif /* CONFIG_PPC_64K_PAGES */
^ permalink raw reply related
* Re: [PATCH -V1 07/24] powerpc: Add size argument to pgtable_cache_add
From: Aneesh Kumar K.V @ 2013-03-04 11:02 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304051340.GC27523@drongo>
Paul Mackerras <paulus@samba.org> writes:
> On Tue, Feb 26, 2013 at 01:34:57PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>
>> We will use this later with THP changes to request for pmd table of double the size.
>> THP code does PTE page allocation along with large page request and deposit them
>> for later use. This is to ensure that we won't have any failures when we split
>> huge pages to regular pages.
>>
>> On powerpc we want to use the deposited PTE page for storing hash pte slot and
>> secondary bit information for the HPTEs. Hence we save them in the second half
>> of the pmd table.
>
> Looks OK, but you should explain why you made the wholesale change of
> "shift" to "index". Is there some important semantic difference, or
> do you just prefer the "index" name for some reason?
>
Now with table_size argument, the first arg is no more the shift value,
rather it is index into the array. Hence i changed the variable name. I
will split that patch to make it easy for review.
-aneesh
^ permalink raw reply
* RE: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU API implementation.
From: Sethi Varun-B16395 @ 2013-03-04 11:31 UTC (permalink / raw)
To: Stuart Yoder
Cc: Wood Scott-B07421, Joerg Roedel, linux-kernel@vger.kernel.org,
Yoder Stuart-B08248, iommu@lists.linux-foundation.org,
linuxppc-dev@lists.ozlabs.org
In-Reply-To: <CALRxmdDEccTaS2Houuns44DoQ_8wUbXAMiXQ8w0Ks9R2F2u-Ag@mail.gmail.com>
> -----Original Message-----
> From: Stuart Yoder [mailto:b08248@gmail.com]
> Sent: Saturday, March 02, 2013 4:58 AM
> To: Sethi Varun-B16395
> Cc: iommu@lists.linux-foundation.org; linuxppc-dev@lists.ozlabs.org;
> linux-kernel@vger.kernel.org; Wood Scott-B07421; Joerg Roedel; Yoder
> Stuart-B08248
> Subject: Re: [PATCH 6/6 v8] iommu/fsl: Freescale PAMU driver and IOMMU
> API implementation.
>=20
> On Mon, Feb 18, 2013 at 6:52 AM, Varun Sethi <Varun.Sethi@freescale.com>
> wrote:
> [cut]
> > +static phys_addr_t get_phys_addr(struct fsl_dma_domain *dma_domain,
> > +unsigned long iova) {
> > + u32 win_cnt =3D dma_domain->win_cnt;
> > + struct dma_window *win_ptr =3D
> > + &dma_domain->win_arr[0];
> > + struct iommu_domain_geometry *geom;
> > +
> > + geom =3D &dma_domain->iommu_domain->geometry;
> > +
> > + if (!win_cnt || !dma_domain->geom_size) {
> > + pr_err("Number of windows/geometry not configured for
> the domain\n");
> > + return 0;
> > + }
> > +
> > + if (win_cnt > 1) {
> > + u64 subwin_size;
> > + unsigned long subwin_iova;
> > + u32 wnd;
> > +
> > + subwin_size =3D dma_domain->geom_size >> ilog2(win_cnt)=
;
>=20
> Could it be just geom_size / win_cnt ??
[Sethi Varun-B16395] You would run in to linking issues with respect to u64=
division on 32 bit kernels.
>=20
> > + subwin_iova =3D iova & ~(subwin_size - 1);
> > + wnd =3D (subwin_iova - geom->aperture_start) >>
> ilog2(subwin_size);
> > + win_ptr =3D &dma_domain->win_arr[wnd];
> > + }
> > +
> > + if (win_ptr->valid)
> > + return (win_ptr->paddr + (iova & (win_ptr->size -
> > + 1)));
> > +
> > + return 0;
> > +}
> > +
> > +static int map_liodn_subwins(int liodn, struct fsl_dma_domain
> > +*dma_domain)
>=20
> Just call it map_subwins(). They are just sub windows, not "liodn sub
> windows".
>=20
[Sethi Varun-B16395] This would be called per liodn.
> [cut]
>=20
> > +static int map_liodn_win(int liodn, struct fsl_dma_domain
> > +*dma_domain)
>=20
> Call it map_win().
[Sethi Varun-B16395] This would again be called per liodn.
>=20
> [cut]
> > +static struct fsl_dma_domain *iommu_alloc_dma_domain(void) {
> > + struct fsl_dma_domain *domain;
> > +
> > + domain =3D kmem_cache_zalloc(fsl_pamu_domain_cache, GFP_KERNEL)=
;
> > + if (!domain)
> > + return NULL;
> > +
> > + domain->stash_id =3D ~(u32)0;
> > + domain->snoop_id =3D ~(u32)0;
> > + domain->win_cnt =3D max_subwindow_count;
>=20
> To align with my previous comments on fsl_pamu.c, I think instead of
> referencing a global variable (in fsl_pamu.c) you should be making an
> accessor API call here to get the max subwindow _count.
>=20
[Sethi Varun-B16395] ok, I will make a accessor API call.
> > + domain->geom_size =3D 0;
> > +
> > + INIT_LIST_HEAD(&domain->devices);
> > +
> > + spin_lock_init(&domain->domain_lock);
> > +
> > + return domain;
> > +}
> > +
> > +static inline struct device_domain_info *find_domain(struct device
> > +*dev) {
> > + return dev->archdata.iommu_domain; }
> > +
> > +static void remove_domain_ref(struct device_domain_info *info, u32
> > +win_cnt) {
> > + list_del(&info->link);
> > + spin_lock(&iommu_lock);
> > + if (win_cnt)
> > + pamu_free_subwins(info->liodn);
> > + pamu_disable_liodn(info->liodn);
> > + spin_unlock(&iommu_lock);
> > + spin_lock(&device_domain_lock);
> > + info->dev->archdata.iommu_domain =3D NULL;
> > + kmem_cache_free(iommu_devinfo_cache, info);
> > + spin_unlock(&device_domain_lock); }
>=20
> The above function is literally removing the _device_ reference from the
> domain.
> The name implies that it is removing a "domain reference". Suggestion
> is
> to call it "remove_device_ref".
>=20
> Also, the whitespace is messed up there. You have 2 tabs instead of 1.
[Sethi Varun-B16395] ok will make the change.
>=20
> > +static void destroy_domain(struct fsl_dma_domain *dma_domain) {
> > + struct device_domain_info *info;
> > +
> > + /* Dissociate all the devices from this domain */
> > + while (!list_empty(&dma_domain->devices)) {
> > + info =3D list_entry(dma_domain->devices.next,
> > + struct device_domain_info, link);
> > + remove_domain_ref(info, dma_domain->win_cnt);
> > + }
> > +}
>=20
> This function is removing all devices from a domain...maybe to be
> consistent with my suggestion below on detach_domain(), call this
> detach_all_devices().
> We have 2 functions
> doing almost the same thing....one detaches a single device, one detaches
> all devices.
> The current names "destroy_domain" and "detach_domain" are not as clear
> to me.
>=20
[Sethi Varun-B16395]ok=20
> > +static void detach_domain(struct device *dev, struct fsl_dma_domain
> > +*dma_domain) {
> > + struct device_domain_info *info;
> > + struct list_head *entry, *tmp;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&dma_domain->domain_lock, flags);
> > + /* Remove the device from the domain device list */
> > + if (!list_empty(&dma_domain->devices)) {
> > + list_for_each_safe(entry, tmp, &dma_domain->devices) {
> > + info =3D list_entry(entry, struct
> device_domain_info, link);
> > + if (info->dev =3D=3D dev)
> > + remove_domain_ref(info, dma_domain-
> >win_cnt);
> > + }
> > + }
> > + spin_unlock_irqrestore(&dma_domain->domain_lock, flags); }
>=20
> This function is not "detaching a domain", but is detaching a device.
> Call it detach_device().
>=20
[Sethi Varun-B16395] ok, will address this.
> > +static void attach_domain(struct fsl_dma_domain *dma_domain, int
> > +liodn, struct device *dev) {
>=20
> Same thing here. This is not attaching a domain, but attaching a
> device. Call it attach_device.
>=20
[Sethi Varun-B16395] ok.
> > + struct device_domain_info *info, *old_domain_info;
> > +
> > + spin_lock(&device_domain_lock);
> > + /*
> > + * Check here if the device is already attached to domain or
> not.
> > + * If the device is already attached to a domain detach it.
> > + */
> > + old_domain_info =3D find_domain(dev);
> > + if (old_domain_info && old_domain_info->domain !=3D dma_domain)=
{
> > + spin_unlock(&device_domain_lock);
> > + detach_domain(dev, old_domain_info->domain);
> > + spin_lock(&device_domain_lock);
> > + }
> > +
> > + info =3D kmem_cache_zalloc(iommu_devinfo_cache, GFP_KERNEL);
> > +
> > + info->dev =3D dev;
> > + info->liodn =3D liodn;
> > + info->domain =3D dma_domain;
> > +
> > + list_add(&info->link, &dma_domain->devices);
> > + /*
> > + * In case of devices with multiple LIODNs just store
> > + * the info for the first LIODN as all
> > + * LIODNs share the same domain
> > + */
> > + if (!old_domain_info)
> > + dev->archdata.iommu_domain =3D info;
> > + spin_unlock(&device_domain_lock);
> > +
> > +}
> > +
>=20
> [cut]
> > +/* Configure geometry settings for all LIODNs associated with domain
> > +*/ static int configure_domain(struct fsl_dma_domain *dma_domain,
> > + struct iommu_domain_geometry *geom_attr,
> > + u32 win_cnt)
>=20
> This function is not configuring the iommu domain...which is a concept in
> the Linux driver, it is taking the domain geometry and setting up the
> PAMU tables for all LIODNs currently in the domain.
>=20
> Maybe it would help if you used a prefix like "pamu" or "paact" to
> identify functions that operate on the actual PAMU tables.
>=20
> maybe: pamu_set_domain_geometry()
>=20
[Sethi Varun-B16395] ok.
> > +{
> > + struct device_domain_info *info;
> > + int ret =3D 0;
> > +
> > + if (!list_empty(&dma_domain->devices)) {
> > + list_for_each_entry(info, &dma_domain->devices, link) {
> > + ret =3D configure_liodn(info->liodn, info->dev,
> dma_domain,
> > + geom_attr, win_cnt);
>=20
> ...and following the above naming convention, call this (?):
> pamu_set_liodn
[Sethi Varun-B16395] ok.
>=20
> > + if (ret)
> > + break;
> > + }
> > + }
> > +
> > + return ret;
> > +}
> > +
>=20
> [cut]
> > +static int fsl_pamu_window_enable(struct iommu_domain *domain, u32
> wnd_nr,
> > + phys_addr_t paddr, u64 size, int
> > +iommu_prot) {
> > + struct fsl_dma_domain *dma_domain =3D domain->priv;
> > + struct dma_window *wnd;
> > + int prot =3D 0;
> > + int ret;
> > + unsigned long flags;
> > + u64 win_size;
> > +
> > + if (iommu_prot & IOMMU_READ)
> > + prot |=3D PAACE_AP_PERMS_QUERY;
> > + if (iommu_prot & IOMMU_WRITE)
> > + prot |=3D PAACE_AP_PERMS_UPDATE;
> > +
> > + spin_lock_irqsave(&dma_domain->domain_lock, flags);
> > + if (!dma_domain->win_arr) {
> > + pr_err("Number of windows not configured\n");
> > + spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > + return -ENODEV;
> > + }
> > +
> > + if (wnd_nr >=3D dma_domain->win_cnt) {
> > + pr_err("Invalid window index\n");
> > + spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > + return -EINVAL;
> > + }
> > +
> > + win_size =3D dma_domain->geom_size >>
> > + ilog2(dma_domain->win_cnt);
>=20
> Isn't size just geom_size / win_cnt. Not sure why you do the ilog2()
> and right shift...
> We already validated that win_cnt is power of 2 when it was set.
>=20
[Sethi Varun-B16395] problem with u64 division.
> > + if (size > win_size) {
> > + pr_err("Invalid window size \n");
> > + spin_unlock_irqrestore(&dma_domain->domain_lock,
> flags);
> > + return -EINVAL;
> > + }
> > +
> > + if (dma_domain->win_cnt =3D=3D 1) {
> > + if (dma_domain->enabled) {
> > + pr_err("Disable the window before updating the
> mapping\n");
> > + spin_unlock_irqrestore(&dma_domain-
> >domain_lock, flags);
> > + return -EBUSY;
> > + }
> > +
> > + ret =3D check_size(size, domain-
> >geometry.aperture_start);
> > + if (ret) {
> > + pr_err("Aperture start not aligned to the
> size\n");
> > + spin_unlock_irqrestore(&dma_domain-
> >domain_lock, flags);
> > + return -EINVAL;
> > + }
> > + }
>=20
> Why is win_cnt=3D=3D1 a special case? Would the geometry not have been
> verified
> when it was originally set up?
>=20
[Sethi Varun-B16395] Yes, but in case of a single window you can still have=
the flexibility of specifying a different size range. But the start addres=
s should still be aligned to the new size.
> Also, do you need a check here to verify if the geometry has been set.
> Is it a requirement to set the geometry prior to window enable?
[Sethi Varun-B16395] That is already checked with the subwindow array check=
.
>=20
> > + wnd =3D &dma_domain->win_arr[wnd_nr];
> > + if (!wnd->valid) {
> > + wnd->paddr =3D paddr;
> > + wnd->size =3D size;
> > + wnd->prot =3D prot;
> > +
> > + ret =3D update_domain_mapping(dma_domain, wnd_nr);
> > + if (!ret) {
> > + wnd->valid =3D 1;
> > + dma_domain->mapped++;
> > + }
> > + } else {
> > + pr_err("Disable the window before updating the
> mapping\n");
> > + ret =3D -EBUSY;
> > + }
> > +
> > + spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> > +
> > + return ret;
> > +}
> > +
> > +/*
>=20
> [cut]
> > +static int fsl_pamu_attach_device(struct iommu_domain *domain,
> > + struct device *dev) {
> > + struct fsl_dma_domain *dma_domain =3D domain->priv;
> > + const u32 *prop;
> > + u32 prop_cnt;
> > + int len, ret =3D 0;
> > + struct pci_dev *pdev =3D NULL;
> > + struct pci_controller *pci_ctl;
> > +
> > + /*
> > + * Hack to make attach device work for the PCI devices. Simply
> assign the
> > + * the LIODN for the PCI controller to the PCI device.
> > + */
>=20
> Instead of "Simply assign...", perhaps say instead: Use the LIODN for
> the PCI controller when attaching a PCI device.
[Sethi Varun-B16395] ok.
>=20
> > + if (dev->bus =3D=3D &pci_bus_type) {
> > + pdev =3D to_pci_dev(dev);
> > + pci_ctl =3D pci_bus_to_host(pdev->bus);
> > + /*
> > + * make dev point to pci controller device
> > + * so we can get the LIODN programmed by
> > + * u-boot;
> > + */
> > + dev =3D pci_ctl->parent;
> > + }
> > +
> > + prop =3D of_get_property(dev->of_node, "fsl,liodn", &len);
>=20
> suggestion: name "prop" to be "liodn" and "prop_cnt" to be "liodn_cnt".
> That will be more clear.
[Sethi Varun-B16395] ok.
>=20
> > + if (prop) {
> > + prop_cnt =3D len / sizeof(u32);
> > + ret =3D handle_attach_device(dma_domain, dev,
> > + prop, prop_cnt);
> > + } else {
> > + pr_err("missing fsl,liodn property at %s\n",
> > + dev->of_node->full_name);
> > + ret =3D -EINVAL;
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +static void fsl_pamu_detach_device(struct iommu_domain *domain,
> > + struct device *dev) {
> > + struct fsl_dma_domain *dma_domain =3D domain->priv;
> > + const u32 *prop;
> > + int len;
> > + struct pci_dev *pdev =3D NULL;
> > + struct pci_controller *pci_ctl;
> > +
> > + /*
> > + * Hack to make detach device work for the PCI devices. Simply
> assign the
> > + * the LIODN for the PCI controller to the PCI device.
> > + */
> > + if (dev->bus =3D=3D &pci_bus_type) {
> > + pdev =3D to_pci_dev(dev);
> > + pci_ctl =3D pci_bus_to_host(pdev->bus);
> > + /*
> > + * make dev point to pci controller device
> > + * so we can get the LIODN programmed by
> > + * u-boot;
> > + */
> > + dev =3D pci_ctl->parent;
> > + }
> > +
> > + prop =3D of_get_property(dev->of_node, "fsl,liodn", &len);
> > + if (prop)
> > + detach_domain(dev, dma_domain);
> > + else
> > + pr_err("missing fsl,liodn property at %s\n",
> > + dev->of_node->full_name); }
>=20
> I understand why you need the LIODN when attaching, but why do you get it
> in the detatch function. You are not using "prop" here.
[Sethi Varun-B16395] Just a sanity check.
>=20
> [cut]
> > +static bool check_pci_ctl_endpt_part(struct pci_controller *pci_ctl)
> > +{
> > + u32 version;
> > +
> > + /* Check the PCI controller version number by readding BRR1
> register */
> > + version =3D in_be32(pci_ctl->cfg_addr + (PCI_FSL_BRR1 >> 2));
> > + version &=3D PCI_FSL_BRR1_VER;
> > + /* If PCI controller version is >=3D 0x204 we can partition
> endpoints*/
> > + if (version >=3D 0x204)
> > + return 1;
> > +
> > + return 0;
> > +}
>=20
> Can we just use the compatible string here to identify the different
> kinds of PCI
> controllers? Reading the actual device registers is ugly right now
> because
> you are #defining the BRR1 stuff in a generic powerpc header.
>=20
[Sethi Varun-B16395] hmmm......, I would have to check for various differen=
t compatible strings in that case. May be I can move the defines to a diffe=
rent file. What if I move them to some other header?
> > +static int fsl_pamu_add_device(struct device *dev) {
> > + struct iommu_group *group =3D NULL;
> > + struct pci_dev *pdev;
> > + struct pci_dev *bridge, *dma_pdev =3D NULL;
> > + struct pci_controller *pci_ctl;
> > + int ret;
> > +
> > + /*
> > + * For platform devices we allocate a separate group for
> > + * each of the devices.
> > + */
> > + if (dev->bus =3D=3D &pci_bus_type) {
> > + bool pci_endpt_part;
>=20
> That variable name is not clear, the abbreviations are not natural. I
> would just call it pci_endpoint_partitioning.
[Sethi Varun-B16395] ok.
-Varun
^ permalink raw reply
* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-04 11:41 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304054848.GE27523@drongo>
Paul Mackerras <paulus@samba.org> writes:
> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>=20
>> We look at both the segment base page size and actual page size and store
>> the pte-lp-encodings in an array per base page size.
>>=20
>> We also update all relevant functions to take actual page size argument
>> so that we can use the correct PTE LP encoding in HPTE. This should also
>> get the basic Multiple Page Size per Segment (MPSS) support. This is nee=
ded
>> to enable THP on ppc64.
>
> Mostly looks OK, comments below...
>
>> +/*
>> + * HPTE LP details
>> + */
>> +#define LP_SHIFT 12
>> +#define LP_BITS 8
>> +#define LP_MASK(i) ((0xFF >> (i)) << LP_SHIFT)
>
> The reader might be wondering at this point what "LP" is; be kind and
> make it "large page" in the comment for them.
Fixed.
>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 71d0c90..48f6d99 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1515,7 +1515,7 @@ static void kvmppc_add_seg_page_size(struct kvm_pp=
c_one_seg_page_size **sps,
>> (*sps)->page_shift =3D def->shift;
>> (*sps)->slb_enc =3D def->sllp;
>> (*sps)->enc[0].page_shift =3D def->shift;
>> - (*sps)->enc[0].pte_enc =3D def->penc;
>> + (*sps)->enc[0].pte_enc =3D def->penc[linux_psize];
>> (*sps)++;
>> }
>
> This will only return the entries where actual page size =3D=3D base page
> size, which basically means that KVM guests won't be able to use
> MPSS. We will need to return multiple entries in that case.
I did that as a the follow up patch.
[PATCH -V1 10/24] powerpc: Return all the valid pte ecndoing in
KVM_PPC_GET_SMMU_INFO ioct
>
>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> +{
>> + unsigned int mask;
>> + int i, penc, shift;
>> + /* Look at the 8 bit LP value */
>> + unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>> + penc =3D 0;
>> + for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
>> + /* valid entries have a shift value */
>> + if (!mmu_psize_defs[i].shift)
>> + continue;
>> +
>> + /* encoding bits per actual page size */
>> + shift =3D mmu_psize_defs[i].shift - 11;
>> + if (shift > 9)
>> + shift =3D 9;
>> + mask =3D (1 << shift) - 1;
>> + if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
>> + return i;
>> + }
>> + return -1;
>> +}
>
> This doesn't look right to me. First, it's not clear what the 11 and
> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8). Secondly,
> the mask for the comparison needs to depend on the actual page size
> not the base page size.
That 11 should be 12.That depends on the fact that we have below mapping
rrrr rrrz =E2=89=A58KB
Yes, that 9 should be LP_BITs.=20
We are generating mask based on actual page size above (variable i in
the for loop).
>
> I strongly suggest you pull out this code together with
> native_hpte_insert into a little userspace test program that runs
> through all the possible page size combinations, creating an HPTE and
> then decoding it with hpte_actual_psize() to check that you get back
> the correct actual page size.
>
will do.
>> static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>> - int *psize, int *ssize, unsigned long *vpn)
>> + int *psize, int *apsize, int *ssize, unsigned long *vpn)
>> {
>> unsigned long avpn, pteg, vpi;
>> unsigned long hpte_r =3D hpte->r;
>> unsigned long hpte_v =3D hpte->v;
>> unsigned long vsid, seg_off;
>> - int i, size, shift, penc;
>> + int i, size, a_size =3D MMU_PAGE_4K, shift, penc;
>>=20=20
>> if (!(hpte_v & HPTE_V_LARGE))
>> size =3D MMU_PAGE_4K;
>> @@ -395,12 +422,13 @@ static void hpte_decode(struct hash_pte *hpte, uns=
igned long slot,
>> /* valid entries have a shift value */
>> if (!mmu_psize_defs[size].shift)
>> continue;
>> -
>> - if (penc =3D=3D mmu_psize_defs[size].penc)
>> - break;
>> + for (a_size =3D 0; a_size < MMU_PAGE_COUNT; a_size++)
>> + if (penc =3D=3D mmu_psize_defs[size].penc[a_size])
>> + goto out;
>
> Once again I don't think this is correct, since the number of bits in
> the page size encoding depends on the page size. In fact the
> calculation of penc in that function looks completely bogus to me (not
> that that is code that you have written or modified, but it looks to
> me like it needs fixing).
I am fixing that in the later patch
powerpc: Fix hpte_decode to use the correct decoding for page sizes
But that will also need fixing as you suggested above.
>
>> static int __init htab_dt_scan_page_sizes(unsigned long node,
>> const char *uname, int depth,
>> void *data)
>> @@ -294,60 +318,57 @@ static int __init htab_dt_scan_page_sizes(unsigned=
long node,
>> size /=3D 4;
>> cur_cpu_spec->mmu_features &=3D ~(MMU_FTR_16M_PAGE);
>> while(size > 0) {
>> - unsigned int shift =3D prop[0];
>> + unsigned int base_shift =3D prop[0];
>> unsigned int slbenc =3D prop[1];
>> unsigned int lpnum =3D prop[2];
>> - unsigned int lpenc =3D 0;
>> struct mmu_psize_def *def;
>> - int idx =3D -1;
>> + int idx, base_idx;
>>=20=20
>> size -=3D 3; prop +=3D 3;
>> - while(size > 0 && lpnum) {
>> - if (prop[0] =3D=3D shift)
>> - lpenc =3D prop[1];
>> + base_idx =3D get_idx_from_shift(base_shift);
>> + if (base_idx < 0) {
>> + /*
>> + * skip the pte encoding also
>> + */
>> prop +=3D 2; size -=3D 2;
>> - lpnum--;
>> + continue;
>> }
>> - switch(shift) {
>> - case 0xc:
>> - idx =3D MMU_PAGE_4K;
>> - break;
>> - case 0x10:
>> - idx =3D MMU_PAGE_64K;
>> - break;
>> - case 0x14:
>> - idx =3D MMU_PAGE_1M;
>> - break;
>> - case 0x18:
>> - idx =3D MMU_PAGE_16M;
>> + def =3D &mmu_psize_defs[base_idx];
>> + if (base_idx =3D=3D MMU_PAGE_16M)
>> cur_cpu_spec->mmu_features |=3D MMU_FTR_16M_PAGE;
>> - break;
>> - case 0x22:
>> - idx =3D MMU_PAGE_16G;
>> - break;
>> - }
>> - if (idx < 0)
>> - continue;
>> - def =3D &mmu_psize_defs[idx];
>> - def->shift =3D shift;
>> - if (shift <=3D 23)
>> +
>> + def->shift =3D base_shift;
>> + if (base_shift <=3D 23)
>> def->avpnm =3D 0;
>> else
>> - def->avpnm =3D (1 << (shift - 23)) - 1;
>> + def->avpnm =3D (1 << (base_shift - 23)) - 1;
>> def->sllp =3D slbenc;
>> - def->penc =3D lpenc;
>> - /* We don't know for sure what's up with tlbiel, so
>> + /*
>> + * We don't know for sure what's up with tlbiel, so
>> * for now we only set it for 4K and 64K pages
>> */
>> - if (idx =3D=3D MMU_PAGE_4K || idx =3D=3D MMU_PAGE_64K)
>> + if (base_idx =3D=3D MMU_PAGE_4K || base_idx =3D=3D MMU_PAGE_64K)
>> def->tlbiel =3D 1;
>> else
>> def->tlbiel =3D 0;
>>=20=20
>> - DBG(" %d: shift=3D%02x, sllp=3D%04lx, avpnm=3D%08lx, "
>> - "tlbiel=3D%d, penc=3D%d\n",
>> - idx, shift, def->sllp, def->avpnm, def->tlbiel,
>> - def->penc);
>> + while (size > 0 && lpnum) {
>> + unsigned int shift =3D prop[0];
>> + unsigned int penc =3D prop[1];
>> +
>> + prop +=3D 2; size -=3D 2;
>> + lpnum--;
>> +
>> + idx =3D get_idx_from_shift(shift);
>> + if (idx < 0)
>> + continue;
>> +
>> + def->penc[idx] =3D penc;
>> + DBG(" %d: shift=3D%02x, sllp=3D%04lx, "
>> + "avpnm=3D%08lx, tlbiel=3D%d, penc=3D%d\n",
>> + idx, shift, def->sllp, def->avpnm,
>> + def->tlbiel, def->penc[idx]);
>> + }
>
> I don't see where in this function you set the penc[] elements for
> invalid actual page sizes to -1.
We do the below
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -125,7 +125,7 @@ static struct mmu_psize_def mmu_psize_defaults_old[] =
=3D {
[MMU_PAGE_4K] =3D {
.shift =3D 12,
.sllp =3D 0,
- .penc =3D 0,
+ .penc =3D { [0 ... MMU_PAGE_COUNT - 1] =3D -1 },
.avpnm =3D 0,
-aneesh
^ permalink raw reply
* Re: [PATCH -V1 09/24] powerpc: Decode the pte-lp-encoding bits correctly.
From: Aneesh Kumar K.V @ 2013-03-04 11:52 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, linux-mm
In-Reply-To: <20130304054848.GE27523@drongo>
Paul Mackerras <paulus@samba.org> writes:
> On Tue, Feb 26, 2013 at 01:34:59PM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>>=20
>> +static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
>> +{
>> + unsigned int mask;
>> + int i, penc, shift;
>> + /* Look at the 8 bit LP value */
>> + unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
>> +
>> + penc =3D 0;
>> + for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
>> + /* valid entries have a shift value */
>> + if (!mmu_psize_defs[i].shift)
>> + continue;
>> +
>> + /* encoding bits per actual page size */
>> + shift =3D mmu_psize_defs[i].shift - 11;
>> + if (shift > 9)
>> + shift =3D 9;
>> + mask =3D (1 << shift) - 1;
>> + if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
>> + return i;
>> + }
>> + return -1;
>> +}
>
> This doesn't look right to me. First, it's not clear what the 11 and
> 9 refer to, and I think the 9 should be LP_BITS (i.e. 8). Secondly,
> the mask for the comparison needs to depend on the actual page size
> not the base page size.
How about the below. I am yet to test this in user space.=20
static inline int hpte_actual_psize(struct hash_pte *hptep, int psize)
{
unsigned int mask;
int i, penc, shift;
/* Look at the 8 bit LP value */
unsigned int lp =3D (hptep->r >> LP_SHIFT) & ((1 << LP_BITS) - 1);
penc =3D 0;
for (i =3D 0; i < MMU_PAGE_COUNT; i++) {
/* valid entries have a shift value */
if (!mmu_psize_defs[i].shift)
continue;
/*
* encoding bits per actual page size
* PTE LP actual page size
* rrrr rrrz =E2=89=A58KB
* rrrr rrzz =E2=89=A516KB
* rrrr rzzz =E2=89=A532KB
* rrrr zzzz =E2=89=A564KB
* .......
*/
shift =3D mmu_psize_defs[i].shift -
mmu_psize_defs[MMU_PAGE_4K].shift;
if (shift > LP_BITS)
shift =3D LP_BITS;
mask =3D (1 << shift) - 1;
if ((lp & mask) =3D=3D mmu_psize_defs[psize].penc[i])
return i;
}
return -1;
}
^ permalink raw reply
* Re: [PATCH V4] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Stuart Yoder @ 2013-03-04 16:16 UTC (permalink / raw)
To: Jia Hongtao; +Cc: B07421, linuxppc-dev
In-Reply-To: <1362386425-20149-1-git-send-email-B38951@freescale.com>
On Mon, Mar 4, 2013 at 2:40 AM, Jia Hongtao <B38951@freescale.com> wrote:
> A PCIe erratum of mpc85xx may causes a core hang when a link of PCIe
> goes down. when the link goes down, Non-posted transactions issued
> via the ATMU requiring completion result in an instruction stall.
> At the same time a machine-check exception is generated to the core
> to allow further processing by the handler. We implements the handler
> which skips the instruction caused the stall.
Can you explain at a high level how just skipping an instruction solves
anything? If you just skip a load/store and continue like nothing is
wrong, isn't your system possibly in a really bad state.
And if the core is already hung, due to the PCI link going down, isn't
it too late? How does skipping help?
Stuart
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox