* Re: consistent_alloc changes for 4xx/8xx
@ 2002-01-21 20:45 Ralph Blach
2002-01-22 4:38 ` Dan Malek
0 siblings, 1 reply; 6+ messages in thread
From: Ralph Blach @ 2002-01-21 20:45 UTC
To: linuxppc-embedded; +Cc: Mark Wisner, Alan Booker
Dan, here are my responses.
>>Another, more challenging, situation also exists because you can't use
>>the __va()/__pa() macros on the addresses returned from consistent_alloc().
>>On the 4xx/8xx, the virt_to_* macros will call iopa() which will track down
>>the real physical address in the page table, which continues to work.
On the IBM Book E part, __va()/__pa() will have to be obsoleted. The
address space is 36 bits. Better now than later.
>>For testing, I added the ability on the MPC860 to pin the first 8M of kernel
>>text (of which there is probably only 512K used), up to 24 Mbytes of kernel
>>data, and the 8M IMMR space. Note that this will only work on the 860
>>processors with a 32 entry TLB. I added, but couldn't test for lack of
>>hardware, a similar feature to the 4xx. I wanted to check this in before the
>>kernel changed too much, and I have volunteers testing it, so any problems
>>should be corrected shortly. Note that this TLB pinning comes at a cost of
>>taking TLB entries out of use for applications, so IMHO it isn't something
>>that should be done without verification of total system performance
>>improvement.
>>I hope someone can find some benchmarks where this feature actually provides
>>benefit
Dan, you saw the netperf results, which demonstrated the benefit. These were
the very tests you suggested we run, and they showed a >10% performance
increase on a 405.
Chip
** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/
* Re: consistent_alloc changes for 4xx/8xx
2002-01-21 20:45 consistent_alloc changes for 4xx/8xx Ralph Blach
@ 2002-01-22 4:38 ` Dan Malek
0 siblings, 0 replies; 6+ messages in thread
From: Dan Malek @ 2002-01-22 4:38 UTC
To: Ralph Blach; +Cc: linuxppc-embedded, Mark Wisner, Alan Booker
Ralph Blach wrote:
> On the IBM Book E part, __va()/__pa() will have to be obsoleted. The
> address space is 36 bits. Better now than later.
These macros are not used on I/O space addresses, and would continue
to work properly on the Book E parts when used as intended. In fact they
are still used in all kernel ports, people just had the tendency to
take shortcuts and abuse them. I was just pointing out one case of
abuse that will now break.
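As a rough illustration of the point above, here is a minimal sketch in C of
why offset-style conversions are fine for linear-mapped RAM but cannot
represent a 36-bit real address. The names model_pa(), model_va(),
round_trips(), MODEL_KERNELBASE and the sample addresses are assumptions made
up for this example; they are not the kernel's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: base of the kernel's linear mapping of RAM. */
#define MODEL_KERNELBASE 0xC0000000u

typedef uint32_t virt_t;   /* 32-bit effective (virtual) address */
typedef uint64_t phys_t;   /* wide enough to hold a 36-bit real address */

/* Offset arithmetic in the style of __pa(); only meaningful for
 * addresses that live inside the linear mapping. */
static phys_t model_pa(virt_t va)
{
    return (phys_t)(uint32_t)(va - MODEL_KERNELBASE);
}

/* Offset arithmetic in the style of __va(); the result is truncated to
 * 32 bits, so any real-address bits above bit 31 are lost. */
static virt_t model_va(phys_t pa)
{
    return (virt_t)((uint32_t)pa + MODEL_KERNELBASE);
}

/* Returns 1 if pa survives a model_va()/model_pa() round trip. */
static int round_trips(phys_t pa)
{
    return model_pa(model_va(pa)) == pa;
}

/* Self-check: low RAM converts cleanly, a hypothetical I/O address
 * above 4G does not. */
static void model_self_check(void)
{
    assert(round_trips(0x00100000ull));
    assert(!round_trips(0x140000000ull));
}
```

Calling model_self_check() exercises both cases: the RAM address round-trips,
while the address above 4G does not, which is the kind of misuse that breaks.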
> Dan, you saw netperf which demonstrated the benefit.
Netperf is not an application that does any real work, it is simply
an IP stack performance tool. In the real world in real products, you
have to be concerned with resources consumed by applications. When you
pin TLB entries, you steal resources from applications and could cause
a decline in the overall system performance.
> .... These were the very tests you
> suggested we run and they showed a >10% performance increase on a 405.
I suggested netperf in response to your question about which network
performance test to use, not as a tool to measure system performance.
When you look at applications like MP3 players that are nearly 100%
in user space, they are very sensitive to instruction and data cache
footprint. Increasing the complexity of TLB miss handlers by adding more
code or requiring the fetching of additional data causes these applications
to actually require more processing time.
Thanks.
-- Dan
* Re: consistent_alloc changes for 4xx/8xx
@ 2002-01-24 19:10 Ralph Blach
0 siblings, 0 replies; 6+ messages in thread
From: Ralph Blach @ 2002-01-24 19:10 UTC
To: linuxppc-embedded
I just looked closely at the code for pinned TLBs in head_4xx.S and there
needs to be a fix.

        /* Load up the kernel context */
2:      SYNC                    /* For all PTE updates to finish */
        tlbia
        sync

This is incorrect for the pinned kernel configuration because it
would invalidate the pinned TLBs.
I verified this with RiscWatch.
The code needs to be changed to:

        /* Load up the kernel context */
2:      SYNC                    /* For all PTE updates to finish */
#ifndef CONFIG_TLB_PIN
        tlbia
#endif
        sync

Thanks
Chip
* Re: consistent_alloc changes for 4xx/8xx
@ 2002-01-24 13:30 Ralph Blach
0 siblings, 0 replies; 6+ messages in thread
From: Ralph Blach @ 2002-01-24 13:30 UTC
To: linuxppc-embedded
I have just looked at the code for pinning TLBs and have several
suggestions for improvement.
1) Put tlb_4xx_index in the first 64K of memory.
2) Add a tlb_4xx_watermark in the same cache line as the index.
Next, change the TLB miss code for pinning to:
        li      r22,tlb_4xx_index@l     # The index is in the first 64K, so get
                                        # its address directly. This saves two
                                        # cycles and covers the access to the
                                        # watermark into the data cache.
        lwz     r23,0(r22)              # now get the tlb index
        li      r22,tlb_4xx_watermark@l # get the watermark address
        lwz     r22,0(r22)              # now get the watermark
        cmpw    cr0,r23,r22             # compare the index to the watermark
        blt     66f
        li      r23,0                   # reset the index to 0
        li      r22,tlb_4xx_index@l     # get the address of the index
        stw     r23,0(r22)              # save the new index
This has the advantage of allowing the user to select the number of TLB
entries they wish to pin, although software in the TLB pin routines should
limit it to the current 4.
It also has very good performance, being about as long as the current
replacement mechanism.
Since the TLB index and watermark are paired in a cache line, accessing the
index covers the access to the watermark, so that it only takes one extra
cycle, even if the line is a cache miss.
Of course the tlb_pin routine has to correctly set up the watermark,
which is not a big deal.
Another suggestion would be to load the address of the index earlier and do
a dcbt, allowing these to be brought into the data cache earlier and saving
a possible miss.
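The replacement policy above can be modelled in C as follows. This is only a
sketch under stated assumptions: a 64-entry TLB, pinned entries occupying the
slots from the watermark up to the top, and an index that advances by one on
each miss. The struct, function and constant names are hypothetical and do
not come from head_4xx.S.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TLB_ENTRIES 64u  /* assumed TLB size */
#define MODEL_MAX_PINNED   4u  /* limit suggested in the text above */

/* Keeping both words in one struct mirrors the idea of placing the index
 * and the watermark in the same cache line. */
struct tlb_replace_state {
    uint32_t index;      /* next entry to replace (models tlb_4xx_index) */
    uint32_t watermark;  /* first pinned entry (models tlb_4xx_watermark) */
};

/* Reserve the top npinned entries, clamped to MODEL_MAX_PINNED.
 * Returns the number of entries actually pinned. */
static uint32_t set_pinned(struct tlb_replace_state *s, uint32_t npinned)
{
    assert(s != NULL);
    if (npinned > MODEL_MAX_PINNED)
        npinned = MODEL_MAX_PINNED;
    s->watermark = MODEL_TLB_ENTRIES - npinned;
    if (s->index >= s->watermark)
        s->index = 0;
    return npinned;
}

/* Pick the entry to replace on a TLB miss: wrap to 0 when the index
 * reaches the watermark, so pinned entries are never chosen. */
static uint32_t next_victim(struct tlb_replace_state *s)
{
    assert(s != NULL);
    if (s->index >= s->watermark)
        s->index = 0;
    return s->index++;
}
```

With four entries pinned, next_victim() cycles through slots 0 to 59 and never
returns 60 to 63.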
Also, I modified the Journeyman kernel to allow RiscWatch to function
correctly and I performed the following experiment.
In finish_tlb_load, there is a tlbsx instruction and I wanted to see if
tlbsx ever made a match.
To do this I split the different paths and put a breakpoint on the tlbsx
matching path.
I never saw a match. If we are really concerned about performance and this
instruction never produces a match, it should be eliminated from the
instruction path, thus saving two instructions.
Chip
* Re: consistent_alloc changes for 4xx/8xx
@ 2002-01-23 14:52 Ralph Blach
0 siblings, 0 replies; 6+ messages in thread
From: Ralph Blach @ 2002-01-23 14:52 UTC
To: Dan Malek, linuxppc-embedded
>> On the IBM Book E part, __va()/__pa() will have to be obsoleted. The
>> address space is 36 bits. Better now than later.
> These macros are not used on I/O space addresses, and would continue
> to work properly on the Book E parts when used as intended. In fact they
> are still used in all kernel ports, people just had the tendency to
> take shortcuts and abuse them. I was just pointing out one case of
> abuse that will now break.
Or, since translation is on in the 440, you can assign both the exception
address and translated address to be the same, so they could be null
functions.
Chip
* consistent_alloc changes for 4xx/8xx
@ 2002-01-20 10:07 Dan Malek
0 siblings, 0 replies; 6+ messages in thread
From: Dan Malek @ 2002-01-20 10:07 UTC
To: linuxppc-embedded
Hi folks.
If you don't care about 4xx/8xx and incoherent caches, you can stop
reading now.
At the request of some people that wanted a pinned TLB feature, I had
to modify the consistent_alloc() function to allocate virtual space out
of the kernel's vmalloc pool and assign individual pages with uncached
attributes. This has created some problems with the Linux VM subsystem,
most notably you can't call consistent_alloc() from an interrupt handler
as the documentation indicates. There was some discussion on l-k about
this (calling from interrupt in general) over the last few days. Since
consistent_alloc() (or pci_consistent_alloc(), the source of discussion)
can return errors and not allocate the space, I questioned the value of
adding this complexity to a driver, but I was simply told "this is the
way it is." Well, unfortunately it isn't on the 4xx and 8xx now.
The problem arises when you try to allocate out of an interrupt handler
and the vmalloc() (well, map_pages() actually) needs to allocate a page
table and a free page doesn't exist. The page table allocator will try
to sleep at this point, which you obviously can't do from an interrupt.
Another, more challenging, situation also exists because you can't use
the __va()/__pa() macros on the addresses returned from consistent_alloc().
On the 4xx/8xx, the virt_to_* macros will call iopa() which will track down
the real physical address in the page table, which continues to work. It
isn't possible to find the virtual address from the physical one, so drivers
need to be changed to keep the addresses returned from consistent_alloc()
and use them. It also means that drivers requiring uncached memory very
early in the kernel, like the serial console on the 8xx, can't get the
memory and must do something different. I don't think any other drivers
are affected by this.
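The driver change described above might look roughly like the following C
sketch: record both addresses at allocation time and never derive one from
the other. mock_consistent_alloc() is a hypothetical user-space stand-in
written for this example (it is not the kernel function, and its signature
and fake bus addresses are assumptions), as are struct ring_buf and its
helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t model_dma_addr_t;  /* hypothetical bus address type */

/* Hypothetical stand-in for an uncached allocator. It hands back a CPU
 * pointer and, separately, a made-up bus address that has no arithmetic
 * relationship to the pointer. */
static void *mock_consistent_alloc(size_t size, model_dma_addr_t *dma_handle)
{
    static model_dma_addr_t next_bus = 0x01000000u;  /* fake bus space */
    void *cpu = malloc(size);

    if (cpu == NULL)
        return NULL;
    *dma_handle = next_bus;
    next_bus += (model_dma_addr_t)((size + 4095u) & ~(size_t)4095u);
    return cpu;
}

/* The driver keeps both addresses for the lifetime of the buffer. */
struct ring_buf {
    void *cpu_addr;             /* what the CPU dereferences */
    model_dma_addr_t dma_addr;  /* what gets programmed into the device */
    size_t size;
};

/* Returns 0 on success, -1 if the allocation failed. */
static int ring_buf_init(struct ring_buf *rb, size_t size)
{
    assert(rb != NULL);
    rb->cpu_addr = mock_consistent_alloc(size, &rb->dma_addr);
    if (rb->cpu_addr == NULL)
        return -1;  /* the allocator may fail; the caller must cope */
    rb->size = size;
    return 0;
}

static void ring_buf_free(struct ring_buf *rb)
{
    assert(rb != NULL);
    free(rb->cpu_addr);
    rb->cpu_addr = NULL;
    rb->size = 0;
}
```

The CPU side uses rb->cpu_addr and the device side uses rb->dma_addr, so no
__va()/__pa()-style conversion is needed anywhere in the driver.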
My original implementation of consistent_alloc() assumed the kernel was
mapped with 4K pages, and when you wanted an uncached page the attributes
were simply changed in place. There wasn't any need to allocate page table
pages, so the interrupt problem didn't exist. You could also use the
fast mapping macros if desired.
For testing, I added the ability on the MPC860 to pin the first 8M of kernel
text (of which there is probably only 512K used), up to 24 Mbytes of kernel
data, and the 8M IMMR space. Note that this will only work on the 860
processors with a 32 entry TLB. I added, but couldn't test for lack of
hardware, a similar feature to the 4xx. I wanted to check this in before the
kernel changed too much, and I have volunteers testing it, so any problems
should be corrected shortly. Note that this TLB pinning comes at a cost of
taking TLB entries out of use for applications, so IMHO it isn't something
that should be done without verification of total system performance improvement.
I hope someone can find some benchmarks where this feature actually provides
benefit. On the 8xx with madplay, I gained a whole 0.300 seconds on average
using a 5 minute audio track. Not worth the hassle in my opinion, but I
congratulate anyone that can design to these extremes :-). If nothing else,
it made me finally check in the Embedded Planet HIOX audio driver.
If the crash when calling from an interrupt handler while out of memory
is an issue for someone, we can likely fix this with some minor changes
to the generic Linux VM functions. Whether they are accepted is another
challenge :-). We can also consider using the older consistent_alloc()
implementation as an option when this is a problem.
Thanks. Have fun.
-- Dan
end of thread, other threads:[~2002-01-24 19:10 UTC | newest]
Thread overview: 6+ messages
2002-01-21 20:45 consistent_alloc changes for 4xx/8xx Ralph Blach
2002-01-22 4:38 ` Dan Malek
-- strict thread matches above, loose matches on Subject: below --
2002-01-24 19:10 Ralph Blach
2002-01-24 13:30 Ralph Blach
2002-01-23 14:52 Ralph Blach
2002-01-20 10:07 Dan Malek