* [RFC] Simple ioremap cache
From: Eugene Surovegin @ 2004-06-05 0:29 UTC
To: linuxppc-dev
Hello all!
I'd like to present a simple optimization I have been using for a while in my
PPC 4xx tree.
PPC 4xx on-chip peripheral I/O registers are located in the same physical page:
40x - EF60'0000
44x - 1'4000'0000
Different device drivers ioremap different parts of this page. Currently the
ioremap implementation doesn't track previous requests, so we end up with
different virtual mappings for the same physical page.
Here is an ioremap profile I recorded on Ebony (PPC440GP) with only the serial,
EMAC & i2c drivers enabled (2.6.7-rc2):
ioremap(0x00000001fffffe00, 0x00001000) -> 0xfdfffe00 (0xfdfff000)
ioremap(0x0000000148000000, 0x00002000) -> 0xfdffd000 (0xfdffd000)
ioremap(0x000000020ec80000, 0x00001000) -> 0xfdffc000 (0xfdffc000)
ioremap(0x0000000208000000, 0x00010000) -> 0xfdfec000 (0xfdfec000)
ioremap(0x000000020ec00000, 0x00001000) -> 0xfdfeb000 (0xfdfeb000)
ioremap(0x0000000140000200, 0x00001000) -> 0xfdfea200 (0xfdfea000)
ioremap(0x0000000140000300, 0x00001000) -> 0xfdfe9300 (0xfdfe9000)
ioremap(0x0000000140000800, 0x00001000) -> 0xd1000800 (0xd1000000)
ioremap(0x0000000140000780, 0x00001000) -> 0xd1002780 (0xd1002000)
ioremap(0x0000000140000900, 0x00001000) -> 0xd1004900 (0xd1004000)
ioremap(0x0000000140000400, 0x00001000) -> 0xd1006400 (0xd1006000)
ioremap(0x0000000140000500, 0x00001000) -> 0xd1008500 (0xd1008000)
The first number is the physical address, the second the size, the third the
ioremap result, and the fourth the ioremap result with PAGE_MASK applied.
As you can see we could save a lot of TLB misses by using just one mapping for
_all_ 440GP peripherals (440GP has a 64-entry software-managed TLB).
To optimize ioremap allocation I implemented a very simple ioremap cache. I
chose to cache only page-sized allocations, and I used a simple 10-entry array
with linear search. ioremap is called mostly during driver initialization, so
it seemed quite reasonable not to over-complicate this stuff :)
Here is the ioremap profile _after_ my patch is applied:
ioremap(0x00000001fffffe00, 0x00001000) -> 0xfdfffe00 (0xfdfff000)
ioremap(0x0000000148000000, 0x00002000) -> 0xfdffd000 (0xfdffd000)
ioremap(0x000000020ec80000, 0x00001000) -> 0xfdffc000 (0xfdffc000)
ioremap(0x0000000208000000, 0x00010000) -> 0xfdfec000 (0xfdfec000)
ioremap(0x000000020ec00000, 0x00001000) -> 0xfdfeb000 (0xfdfeb000)
ioremap(0x0000000140000200, 0x00001000) -> 0xfdfea200 (0xfdfea000)
ioremap(0x0000000140000300, 0x00001000) -> 0xfdfea300 (0xfdfea000)
ioremap(0x0000000140000800, 0x00001000) -> 0xfdfea800 (0xfdfea000)
ioremap(0x0000000140000780, 0x00001000) -> 0xfdfea780 (0xfdfea000)
ioremap(0x0000000140000900, 0x00001000) -> 0xfdfea900 (0xfdfea000)
ioremap(0x0000000140000400, 0x00001000) -> 0xfdfea400 (0xfdfea000)
ioremap(0x0000000140000500, 0x00001000) -> 0xfdfea500 (0xfdfea000)
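To illustrate what this means from a driver's point of view, here is a
hypothetical sketch (the physical addresses come from the profile above; the
request sizes, function and variable names are made up):

	#include <linux/init.h>
	#include <linux/errno.h>
	#include <asm/io.h>
	#include <asm/page.h>

	static void *uart0_regs, *emac0_regs;

	/* Two drivers map registers that live in the same 4K physical page
	 * (0x1'4000'0000 on 440GP). With the ioremap cache, both returned
	 * pointers fall into the same virtual page, so one TLB entry covers
	 * all accesses to them. */
	static int __init map_onchip_regs(void)
	{
		uart0_regs = ioremap(0x140000200ULL, 0x100);	/* UART0 */
		emac0_regs = ioremap(0x140000800ULL, 0x100);	/* EMAC0 */
		if (!uart0_regs || !emac0_regs) {
			if (uart0_regs)
				iounmap(uart0_regs);
			return -ENOMEM;
		}
		/* Now ((unsigned long)uart0_regs & PAGE_MASK) ==
		 *     ((unsigned long)emac0_regs & PAGE_MASK). */
		return 0;
	}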
I have several questions on how we can enhance my simple hack so it can be
accepted into mainline:
0) Do we really need such stuff in mainline :) ?
1) Should this feature be enabled for all ppc32 archs or only for 4xx? I
made an ioremap profile for a 2.6.6 kernel running on my G4 Powerbook and
didn't notice much ioremap region overlap (there was one instance where
my patch would have helped if I had increased the cache size to 32 entries).
2) Should we cache allocations bigger than 4K? From the Ebony and tipb profiles
it doesn't seem advantageous. Maybe other CPUs can benefit from bigger sizes.
3) Should cache size (currently hardcoded to 10 entries) be made configurable?
4) Other enhancements I haven't thought of...
Comments/suggestions?
Here is the patch against current linux-2.5:
===== arch/ppc/mm/pgtable.c 1.19 vs edited =====
--- 1.19/arch/ppc/mm/pgtable.c Sat May 22 14:56:23 2004
+++ edited/arch/ppc/mm/pgtable.c Fri Jun 4 16:28:44 2004
@@ -10,6 +10,8 @@
* Copyright (C) 1996 Paul Mackerras
* Amiga/APUS changes by Jesper Skov (jskov@cygnus.co.uk).
*
+ * Simple ioremap cache added by Eugene Surovegin <ebs@ebshome.net>, 2004
+ *
* Derived from "arch/i386/mm/init.c"
* Copyright (C) 1991, 1992, 1993, 1994 Linus Torvalds
*
@@ -59,6 +61,17 @@
#define p_mapped_by_bats(x) (0UL)
#endif /* HAVE_BATS */
+/* simple ioremap cache */
+#define IOREMAP_CACHE_SIZE 10
+static spinlock_t ioremap_cache_lock = SPIN_LOCK_UNLOCKED;
+static int ioremap_cache_active_slots;
+static struct ioremap_cache_entry {
+ phys_addr_t pa;
+ unsigned long va;
+ unsigned long flags;
+ int users;
+} ioremap_cache[IOREMAP_CACHE_SIZE];
+
#ifdef CONFIG_44x
/* 44x uses an 8kB pgdir because it has 8-byte Linux PTEs. */
#define PGDIR_ORDER 1
@@ -137,6 +150,84 @@
__free_page(ptepage);
}
+static unsigned long ioremap_cache_check(phys_addr_t pa, unsigned long size,
+ unsigned long flags)
+{
+ unsigned long va = 0;
+ int i;
+
+ if (size != 0x1000)
+ return 0;
+
+ spin_lock(&ioremap_cache_lock);
+ if (!ioremap_cache_active_slots)
+ goto out;
+
+ for (i = 0; i < IOREMAP_CACHE_SIZE; ++i)
+ if (ioremap_cache[i].pa == pa &&
+ ioremap_cache[i].flags == flags)
+ {
+ va = ioremap_cache[i].va;
+ ++ioremap_cache[i].users;
+ break;
+ }
+out:
+ spin_unlock(&ioremap_cache_lock);
+
+ return va;
+}
+
+static void ioremap_cache_add(phys_addr_t pa, unsigned long va, unsigned long size,
+ unsigned long flags)
+{
+ int i;
+
+ if (size != 0x1000)
+ return;
+
+ spin_lock(&ioremap_cache_lock);
+ if (ioremap_cache_active_slots == IOREMAP_CACHE_SIZE)
+ goto out;
+
+ for (i = 0; i < IOREMAP_CACHE_SIZE; ++i)
+ if (!ioremap_cache[i].pa){
+ ioremap_cache[i].pa = pa;
+ ioremap_cache[i].va = va;
+ ioremap_cache[i].flags = flags;
+ ioremap_cache[i].users = 1;
+ ++ioremap_cache_active_slots;
+ break;
+ }
+out:
+ spin_unlock(&ioremap_cache_lock);
+}
+
+static int ioremap_cache_del(unsigned long va)
+{
+ int i, res = 0;
+ va &= PAGE_MASK;
+
+ spin_lock(&ioremap_cache_lock);
+ if (!ioremap_cache_active_slots)
+ goto out;
+
+ for (i = 0; i < IOREMAP_CACHE_SIZE; ++i)
+ if (ioremap_cache[i].va == va){
+ res = --ioremap_cache[i].users;
+ if (!res){
+ ioremap_cache[i].pa = 0;
+ ioremap_cache[i].va = 0;
+ ioremap_cache[i].flags = 0;
+ --ioremap_cache_active_slots;
+ }
+ break;
+ }
+out:
+ spin_unlock(&ioremap_cache_lock);
+
+ return res;
+}
+
#ifndef CONFIG_44x
void *
ioremap(phys_addr_t addr, unsigned long size)
@@ -210,6 +301,14 @@
if ((v = p_mapped_by_bats(p)) /*&& p_mapped_by_bats(p+size-1)*/ )
goto out;
+ if ((flags & _PAGE_PRESENT) == 0)
+ flags |= _PAGE_KERNEL;
+ if (flags & _PAGE_NO_CACHE)
+ flags |= _PAGE_GUARDED;
+
+ if ((v = ioremap_cache_check(p, size, flags)))
+ goto out;
+
if (mem_init_done) {
struct vm_struct *area;
area = get_vm_area(size, VM_IOREMAP);
@@ -220,11 +319,6 @@
v = (ioremap_bot -= size);
}
- if ((flags & _PAGE_PRESENT) == 0)
- flags |= _PAGE_KERNEL;
- if (flags & _PAGE_NO_CACHE)
- flags |= _PAGE_GUARDED;
-
/*
* Should check if it is a candidate for a BAT mapping
*/
@@ -238,6 +332,7 @@
return NULL;
}
+ ioremap_cache_add(p, v, size, flags);
out:
return (void *) (v + ((unsigned long)addr & ~PAGE_MASK));
}
@@ -250,8 +345,9 @@
*/
if (v_mapped_by_bats((unsigned long)addr)) return;
- if (addr > high_memory && (unsigned long) addr < ioremap_bot)
- vunmap((void *) (PAGE_MASK & (unsigned long)addr));
+ if (!ioremap_cache_del((unsigned long)addr))
+ if (addr > high_memory && (unsigned long) addr < ioremap_bot)
+ vunmap((void *) (PAGE_MASK & (unsigned long)addr));
}
int
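One consequence of the iounmap hunk above worth spelling out: a cached page
mapping is only torn down when its last user goes away. A hypothetical usage
sketch (addresses and sizes are made up):

	void *a = ioremap(0x140000200ULL, 0x100); /* cache slot: users = 1 */
	void *b = ioremap(0x140000300ULL, 0x100); /* same phys page: users = 2,
						   * b shares a's virtual page */
	/* ... */
	iounmap(a);	/* users drops to 1, the mapping (and b) stays valid */
	iounmap(b);	/* users drops to 0, vunmap() actually removes it */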
* Re: [RFC] Simple ioremap cache
From: Marius Groeger @ 2004-06-07 7:46 UTC
To: Eugene Surovegin; +Cc: linuxppc-dev
Hello Eugene,
On Fri, 4 Jun 2004, Eugene Surovegin wrote:
> I'd like to present simple optimization I have been using for a
> while in my PPC 4xx tree.
...
> As you can see we could save a lot of TLB misses by using just one
> mapping for _all_ 440GP peripherals (440GP has a 64-entry
> software-managed TLB).
This sounds very interesting. We thought about such an optimization a while
ago as well.
I'm not sure, however, whether your current patch actually saves _TLB_
misses. Have you counted them to prove it? To do this, I think you also need
to flag a bigger virtual page size to the MMU, e.g. program a different
PAGESZ_* value (see include/asm-ppc/mmu.h). If you don't, the MMU has to
manage different chunks all the same; they just happen to be virtually
contiguous. Along this line, I think not all PPC MMUs allow for variable
page sizes the way 4xx CPUs do, so this optimization may only be reasonable
for those.
So I think what you're saving right now is just mapping entries (which also
is a valid thing to gain).
What do you think? Have I missed something?
Regards,
Marius
--
Marius Groeger <mgroeger@sysgo.com> Project Manager
SYSGO AG Embedded and Real-Time Software
Voice: +49 6136 9948 0 FAX: +49 6136 9948 10
www.sysgo.com | www.elinos.com | www.osek.de | www.imerva.com
* Re: [RFC] Simple ioremap cache
From: Eugene Surovegin @ 2004-06-07 8:48 UTC
To: Marius Groeger; +Cc: linuxppc-dev
On Mon, Jun 07, 2004 at 09:46:16AM +0200, Marius Groeger wrote:
> I'm not sure, however, whether your current patch actually saves _TLB_
> misses.
Huh?
Let's consider two drivers - serial & emac on 44x. Both of them ioremap
different parts of the same phys page, but without the patch these will be
_different_ virtual mappings, so if they access their respective I/O registers,
_two_ TLB entries will be used.
Here is a real example (see my original e-mail):
UART0 ioremaps 0x0000000140000200 and gets 0xfdfea200 as its virtual kernel
address. EMAC0 ioremaps 0x0000000140000800 and gets 0xd1000800.
You'll need _two_ TLB entries for these mappings:
0xfdfea000 -> 0x0000000140000000
0xd1000000 -> 0x0000000140000000
With my patch only _one_ TLB mapping will be required for _all_ 4K-sized
ioremaps of 0x0000000140000000:
0xfdfea000 -> 0x0000000140000000
440GP has a 64-entry TLB and, believe me, when you are running a user-mode
app there are a lot of TLB misses; decreasing TLB pressure (i.e. requiring
fewer TLB entries for driver I/O) _will_ save you some TLB misses.
> Have you counted them to prove it?
No, it seemed quite obvious to me :)
> To do this, I think you also need
> to flag a bigger virtual page size to the MMU, e.g. program a different
> PAGESZ_* value (see include/asm-ppc/mmu.h). If you don't, the MMU has to
> manage different chunks all the same; they just happen to be virtually
> contiguous.
I don't follow you here, sorry. Could you give some examples with the real TLB
contents for the cases you are describing?
> So I think what you're saving right now is just mapping entries (which also
> is a valid thing to gain).
What do you mean by "just mapping entries"? TLB slots contain these "mapping
entries"; that's the whole purpose of the TLB.
Eugene.
* Re: [RFC] Simple ioremap cache
From: Marius Groeger @ 2004-06-07 9:12 UTC
To: Eugene Surovegin; +Cc: linuxppc-dev
On Mon, 7 Jun 2004, Eugene Surovegin wrote:
> With my patch only _one_ TLB mapping will be required for _all_ 4K-sized
> ioremaps of 0x0000000140000000:
Yes, now I understand. You're after a different kind of optimization
than the one I had in mind. It makes absolute sense to me, and should also
be portable to other cores.
> > Have you counted them to prove it?
>
> No, it seemed quite obvious to me :)
Yeah, to me too now ... :-)
> > To do this, I think you also need
> > to flag a bigger virtual page size to the MMU, e.g. program a different
> > PAGESZ_* value (see include/asm-ppc/mmu.h). If you don't, the MMU has to
> > manage different chunks all the same; they just happen to be virtually
> > contiguous.
>
> I don't follow you here, sorry. Could you give some examples with
> the real TLB contents for the cases you are describing?
What I mean is to merge/coalesce individual mappings within the same
IO area. E.g., consider an IO area with multiple resources at
0xd000.0000 spanning more than one 4k page. Now, driver A requests
access to a page at 0xd000.0100, and driver B wants to access
0xd000.3400. Usually, this would lead to 2 different mapping entries
for the following phys base/size pairs: (0xd000.0000, 0x1000);
(0xd000.3000, 0x1000). With this optimization, both could be handled by
one mapping at 0xd000.0000 spanning a 16k page. It's a bit like when
BATs are used to cover larger chunks.
Again, this was an idea we had a while ago. I don't know how much real
benefit there is in implementing it. There was also talk about a "big
TLB" patch at some point. I haven't checked whether this is already part of
2.5/2.6.
> What do you mean "just mapping entries" ? TLB slots contain these "mapping
> entries", that's the whole purpose of TLB.
Yes, but 4xx allows for variable-sized TLB entries.
Sorry for the confusion. But maybe this tickles your inspiration :-)
Regards,
Marius
--
Marius Groeger <mgroeger@sysgo.com> Project Manager
SYSGO AG Embedded and Real-Time Software
Voice: +49 6136 9948 0 FAX: +49 6136 9948 10
www.sysgo.com | www.elinos.com | www.osek.de | www.imerva.com
* Re: [RFC] Simple ioremap cache
From: Eugene Surovegin @ 2004-06-07 9:26 UTC
To: Marius Groeger; +Cc: linuxppc-dev
On Mon, Jun 07, 2004 at 11:12:42AM +0200, Marius Groeger wrote:
> > > To do this, I think you also need
> > > to flag a bigger virtual page size to the MMU, e.g. program a different
> > > PAGESZ_* value (see include/asm-ppc/mmu.h). If you don't, the MMU has to
> > > manage different chunks all the same; they just happen to be virtually
> > > contiguous.
> >
> > I don't follow you here, sorry. Could you give some examples with
> > the real TLB contents for the cases you are describing?
>
> What I mean is to merge/coalesce individual mappings within the same
> IO area. E.g., consider an IO area with multiple resources at
> 0xd000.0000 spanning more than one 4k page. Now, driver A requests
> access to a page at 0xd000.0100, and driver B wants to access
> 0xd000.3400. Usually, this would lead to 2 different mapping entries
> for the following phys base/size pairs: (0xd000.0000, 0x1000);
> (0xd000.3000, 0x1000). With this optimization, both could be handled by
> one mapping at 0xd000.0000 spanning a 16k page. It's a bit like when
> BATs are used to cover larger chunks.
>
> Again, this was an idea we had a while ago. I don't know how much real
> benefit there is in implementing it. There was also talk about a "big
> TLB" patch at some point. I haven't checked whether this is already part of
> 2.5/2.6.
Yeah, I see now.
This kind of coalescing might be useful for PCI device driver ioremaps (I saw
some adjacent mappings on tipb - I _think_ they were different PCI peripherals).
>
> > What do you mean "just mapping entries" ? TLB slots contain these "mapping
> > entries", that's the whole purpose of TLB.
>
> Yes, but 4xx allows for variable sized TLBs.
Sure, these big TLB entries are even used for kernel lowmem mappings on 4xx
(again, to save some TLB misses :)
Eugene
* Re: [RFC] Simple ioremap cache
From: Kumar Gala @ 2004-06-07 14:58 UTC
To: Eugene Surovegin; +Cc: linuxppc-dev
> I have several questions on how we can enhance my simple hack so it
> can be
> acceptable into mainline:
>
> 0) Do we really need such stuff in mainline :) ?
I see no reason; it's helpful for a number of embedded archs (4xx, 8xx,
e500).
> 1) Should this feature be enabled for all ppc32 archs or only for 4xx?
> I
> made ioremap profile for 2.6.6 kernel running on my G4 Powerbook and
> haven't noticed a lot of ioremap regions overlap (there was one
> instance where
> my patch would have helped if I've increased cache size to 32 entries).
This may be due to use of BATs on PPC Classic.
> 2) Should we cache allocations bigger than 4K? From the Ebony and tipb
> profiles it doesn't seem advantageous. Maybe other CPUs can benefit from
> bigger sizes.
I think 4K is reasonable; this optimization is mainly for PPCs
w/ integrated periphs.
> 3) Should cache size (currently hardcoded to 10 entries) be made
> configurable?
Seems pointless to make it a proper kernel config param. If someone
wants to change it they can always change the code themselves for their
purposes.
- kumar
* Re: [RFC] Simple ioremap cache
From: Benjamin Herrenschmidt @ 2004-06-07 15:27 UTC
To: Kumar Gala; +Cc: Eugene Surovegin, linuxppc-dev list
> > 1) Should this feature be enabled for all ppc32 archs or only for 4xx?
> > I
> > made ioremap profile for 2.6.6 kernel running on my G4 Powerbook and
> > haven't noticed a lot of ioremap regions overlap (there was one
> > instance where
> > my patch would have helped if I've increased cache size to 32 entries).
>
> This may be due to use of BATs on PPC Classic.
I don't think I use a BAT anymore for IOs on pmac. But I also don't
declare overlapping regions. The only places where I may actually
have a few overlaps are some parts of the mac-io ASIC like the DBDMA
regs since they are offset by 0x400 and I ioremap them separately
in each driver.
Ben.
* Re: [RFC] Simple ioremap cache
From: Kumar Gala @ 2004-06-07 17:47 UTC
To: Eugene Surovegin; +Cc: Linux/PPC Development
On Jun 7, 2004, at 9:58 AM, Kumar Gala wrote:
>
>> I have several questions on how we can enhance my simple hack so it
>> can be
>> acceptable into mainline:
>>
>> 0) Do we really need such stuff in mainline :) ?
>
> I see no reason; it's helpful for a number of embedded archs (4xx, 8xx,
> e500).
Should have been: 'I see no reason for it not to be'...
uugh, should wake up before sending emails.
- kumar