* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
@ 2005-02-02 19:46 ` Jesse Barnes
2005-02-02 23:33 ` Jack Steiner
` (33 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jesse Barnes @ 2005-02-02 19:46 UTC (permalink / raw)
To: linux-ia64
On Wednesday, February 2, 2005 11:10 am, Jes Sorensen wrote:
> The remaining issue I am facing is that for the uncached pool I want to
> make it node aware and we want to use the spill pages from the lower
> granules for this pool which is easily done on SN2. However I see no
> generic way to get from a physical address to node id for pages that do
> not have a struct page assigned to them. I am curious if anyone has any
> suggestions for how to solve this in a generic way?
paddr_to_nid?
Jesse
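Jesse's suggestion boils down to a range lookup over the firmware-provided node memory map. A minimal sketch of the kind of lookup paddr_to_nid performs, with a mocked-up memory-block table (the table contents and ranges below are illustrative, not SN2's real layout):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of the node memory table that paddr_to_nid walks:
 * each entry maps a physical range to a NUMA node.  The ranges below
 * are made-up values for illustration only. */
struct node_memblk_s {
    unsigned long start_paddr;
    unsigned long size;
    int nid;
};

static struct node_memblk_s node_memblk[] = {
    { 0x00000000UL, 0x40000000UL, 0 },   /* node 0: first 1GB  */
    { 0x40000000UL, 0x40000000UL, 1 },   /* node 1: second 1GB */
};

/* Linear scan, as in the real implementation -- which is why it is
 * slow; a node-aware pool might cache the result per granule. */
static int paddr_to_nid(unsigned long paddr)
{
    size_t i;
    for (i = 0; i < sizeof(node_memblk) / sizeof(node_memblk[0]); i++)
        if (paddr >= node_memblk[i].start_paddr &&
            paddr < node_memblk[i].start_paddr + node_memblk[i].size)
            return node_memblk[i].nid;
    return -1;   /* range not covered, e.g. spill memory below the table */
}
```

The -1 case is exactly the gap Jes raises later in the thread: memory that is not described by the table cannot be attributed to a node this way.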
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
2005-02-02 19:46 ` Jesse Barnes
@ 2005-02-02 23:33 ` Jack Steiner
2005-02-03 7:47 ` Jes Sorensen
` (32 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jack Steiner @ 2005-02-02 23:33 UTC (permalink / raw)
To: linux-ia64
On Wed, Feb 02, 2005 at 02:10:32PM -0500, Jes Sorensen wrote:
General comment:
1) I may be paranoid, but I'm nervous about using memory visible to
the VM system for fetchops. If ANYTHING in the kernel makes a
reference to the memory and causes a TLB dropin to occur,
then we are exposed to data corruption. If memory being used for
fetchops is loaded into the cache, either directly or by speculation,
then data corruption of the uncached fetchop memory can occur.
Am I being overly paranoid? How can we be certain that nothing will
ever reference the fetchop memory allocated from the general VM
pool? lcrash, for example.
2) Is there a limit on the number of mspec pages that can be allocated?
Is there a shaker that will cause unused mspec granules to be freed?
What prevents a malicious/stupid/buggy user from filling the system with
mspec pages?
3) Is an "mspec" address a physical address or an uncached virtual address?
Some places in the code appear inconsistent. For example:
mspec_free_page(TO_PHYS(maddr))
vs.
maddr; /* phys addr of start of mspecs. */
A few code specific issues:
...
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_visibility! (%i)\n", status);
+ }
+
+ sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
Don't the TLBs need to be flushed before you flush caches? Otherwise,
the cpu may reload data via speculation.
I don't see any TLB flushing of the kernel TLB entries that map the
chunks. That needs to be done.
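The ordering Jack is asking for can be sketched as follows. The two flush functions are stubs standing in for flush_tlb_kernel_range() and sn_flush_all_caches(); the operation log exists only to make the ordering visible, and all names here are illustrative:

```c
#include <assert.h>
#include <string.h>

#define GRANULE_SIZE (16UL * 1024 * 1024)

static char oplog[64];   /* records the order of operations */

/* Stand-in for flush_tlb_kernel_range(): purge stale translations
 * first, so no cpu can speculatively reload through them. */
static void purge_tlb(unsigned long start, unsigned long len)
{
    (void)start; (void)len;
    strcat(oplog, "tlb;");
}

/* Stand-in for sn_flush_all_caches(): only after the TLB purge is it
 * safe to evict the cache lines for the granule. */
static void flush_caches(unsigned long start, unsigned long len)
{
    (void)start; (void)len;
    strcat(oplog, "cache;");
}

static void make_granule_uncached(unsigned long gaddr)
{
    purge_tlb(gaddr, GRANULE_SIZE);     /* 1: no stale translations remain */
    flush_caches(gaddr, GRANULE_SIZE);  /* 2: then evict the cache lines */
}
```

Doing it the other way around leaves a window where speculation through a still-valid TLB entry can repopulate the cache after the flush.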
...
+ /*
+ * The kernel requires a page structure to be returned upon
+ * success, but there are no page structures for low granule pages.
+ * remap_page_range() creates the pte for us and we return a
+ * bogus page back to the kernel fault handler to keep it happy
+ * (the page is freed immediately there).
+ */
Ugly hack. Isn't there a better way? (I know this isn't your code & you
probably don't like this either. I had hoped for a cleaner solution in
2.6....)
...
+ /*
+ * Use the bte to ensure cache lines
+ * are actually pulled from the
+ * processor back to the md.
+ */
+
This doesn't need to be done if the memory was being used for fetchops or
uncached memory.
+ s <<= 1;
+ }
+ a = (unsigned long) h[j].next;
It appears that you are keeping a linked list of free memory WITHIN the
mspec memory itself. If I'm reading this correctly, all the addresses are uncached
virtual addresses so that should be ok. However, it might be good to add
debugging code to make sure that you never cause a cachable reference to be made to any
of the fetchop memory. The resulting data corruption problems are almost
impossible to debug.
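One way to implement the debugging check Jack suggests is to test the region bits of every address handed back to the free path: on ia64 the top three address bits select the region, and the uncached identity mapping lives in region 6 (the cached one in region 7). A minimal sketch; the macros mirror the kernel's names but are redefined here for illustration:

```c
#include <assert.h>

/* Top three bits of an ia64 virtual address select the region. */
#define REGION_NUMBER(addr)  ((unsigned long)(addr) >> 61)
#define RGN_UNCACHED 6UL    /* uncached identity mapping */
#define RGN_KERNEL   7UL    /* cached identity mapping   */

/* Debug predicate: is this an uncached identity-mapped address?
 * A free routine could BUG on a cached address being returned. */
static int is_uncached_vaddr(unsigned long vaddr)
{
    return REGION_NUMBER(vaddr) == RGN_UNCACHED;
}
```

A check like this at the entry of the free routine would catch the cached-reference bugs Jack describes at the point of the mistake, rather than as silent corruption later.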
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
2005-02-02 19:46 ` Jesse Barnes
2005-02-02 23:33 ` Jack Steiner
@ 2005-02-03 7:47 ` Jes Sorensen
2005-02-03 8:38 ` Jes Sorensen
` (31 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-03 7:47 UTC (permalink / raw)
To: linux-ia64
>>>>> "Jesse" = Jesse Barnes <jbarnes@engr.sgi.com> writes:
Jesse> On Wednesday, February 2, 2005 11:10 am, Jes Sorensen wrote:
>> The remaining issue I am facing is that for the uncached pool I
>> want to make it node aware and we want to use the spill pages from
>> the lower granules for this pool which is easily done on
>> SN2. However I see no generic way to get from a physical address to
>> node id for pages that do not have a struct page assigned to
>> them. I am curious if anyone has any suggestions for how to solve
>> this in a generic way?
Jesse> paddr_to_nid?
It's not clear to me this covers the spill memory found in the lower
granules and second it's really really slow, though that may not be a
real problem.
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (2 preceding siblings ...)
2005-02-03 7:47 ` Jes Sorensen
@ 2005-02-03 8:38 ` Jes Sorensen
2005-02-03 11:19 ` Robin Holt
` (30 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-03 8:38 UTC (permalink / raw)
To: linux-ia64
>>>>> "Jack" = Jack Steiner <steiner@sgi.com> writes:
Jack> On Wed, Feb 02, 2005 at 02:10:32PM -0500, Jes Sorensen wrote:
Jack> General comment:
Jack, thanks for the comments, I'll look at it. However, I have the
following comments (which may or may not be correct from my side):
Jack> 1) I may be paranoid, but I'm nervous about using memory visible
Jack> to the VM system for fetchops. If ANYTHING in the kernel makes a
Jack> reference to the memory and causes a TLB dropin to occur, then
Jack> we are exposed to data corruption. If memory being used for
Jack> fetchops is loaded into the cache, either directly or by
Jack> speculation, then data corruption of the uncached fetchop memory
Jack> can occur.
Jack> Am I being overly paranoid? How can we be certain that nothing
Jack> will ever reference the fetchop memory allocated from the general
Jack> VM pool? lcrash, for example.
Once a page is handed out using alloc_pages, the kernel won't touch it
again unless you explicitly map it etc. or if some process touches
memory at random, which could also happen with the spill pages. So I
don't think the situation is any worse than it is for the spill pages
in the lower granules.
Jack> 2) Is there a limit on the number of mspec pages that can be
Jack> allocated? Is there a shaker that will cause unused mspec
Jack> granules to be freed? What prevents a malicious/stupid/buggy
Jack> user from filling the system with mspec pages?
Currently there is no limit on this, however it could easily be
imposed either by having a max number of granules allocated per node
or system wide.
I could add code to free granules when it's all released but I believe
the amount of memory being pulled in for this in real life situations
is so limited it's not really worth the complexity. Adding a hard
limit for how much is allowed to be allocated seems simpler.
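The simple hard limit Jes proposes could look roughly like this: a per-node counter checked before a granule is pulled in. The counter array, cap value, and function names are all hypothetical, not from the posted driver:

```c
#include <assert.h>

#define MAX_NODES             4
#define MAX_GRANULES_PER_NODE 8    /* hypothetical per-node cap */

static int granules_on_node[MAX_NODES];

/* Refuse to grow the mspec pool past the per-node cap. */
static int mspec_reserve_granule(int nid)
{
    if (granules_on_node[nid] >= MAX_GRANULES_PER_NODE)
        return -1;                 /* would be -ENOMEM in the kernel */
    granules_on_node[nid]++;
    return 0;
}

static void mspec_release_granule(int nid)
{
    granules_on_node[nid]--;
}
```

A system-wide cap is the same check against a single counter; either way it bounds what a malicious or buggy user can pull out of the general pool without the complexity of a shrinker.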
Jack> 3) Is an "mspec" address a physical address or an uncached
Jack> virtual address? Some places in the code appear
Jack> inconsistent. For example:
Jack>    mspec_free_page(TO_PHYS(maddr))
Jack> vs.
Jack>    maddr; /* phys addr of start of mspecs. */
Uncached virtual, the comments you point out are leftovers from the
old version of the driver.
Jack> A few code specific issues:
Jack> ...
Jack> +  printk(KERN_WARNING "smp_call_function failed for "
Jack> +         "mspec_ipi_visibility! (%i)\n", status);
Jack> + }
Jack> +
Jack> + sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
Jack> Don't the TLBs need to be flushed before you flush
Jack> caches? Otherwise, the cpu may reload data via speculation.
Jack> I don't see any TLB flushing of the kernel TLB entries that map
Jack> the chunks. That needs to be done. ...
I thought about this one a fair bit after reading your comments and I
don't think it's an issue. The pages in the kernel's cached mapping
are identity mapped which means we shouldn't see any tlbs for this,
which leaves us with just tlbs for pages that have explicitly been
mapped somewhere - user tlbs should be removed when a process is shot
down or pages unmapped and vfree() calls flush_tlb_all(). Or, am I
missing something?
Jack> + /*
Jack> +  * The kernel requires a page structure to be returned upon
Jack> +  * success, but there are no page structures for low granule pages.
Jack> +  * remap_page_range() creates the pte for us and we return a
Jack> +  * bogus page back to the kernel fault handler to keep it happy
Jack> +  * (the page is freed immediately there).
Jack> +  */
Jack> Ugly hack. Isn't there a better way? (I know this isn't your
Jack> code & you probably don't like this either. I had hoped for a
Jack> cleaner solution in 2.6....)
It's gross, ugly and I hate it ... not sure if there's a simpler way.
Maybe we can use the same approach as the fbmem driver and do it all
in the mmap() function, I will have to investigate that.
Jack> + /*
Jack> +  * Use the bte to ensure cache lines
Jack> +  * are actually pulled from the
Jack> +  * processor back to the md.
Jack> +  */
Jack> +
Jack> This doesn't need to be done if the memory was being used for
Jack> fetchops or uncached memory.
I'll check.
Jack> + s <<= 1;
Jack> + }
Jack> + a = (unsigned long) h[j].next;
Jack> It appears that you are keeping a linked list of free memory
Jack> WITHIN the mspec memory itself. If I'm reading this correctly,
Jack> all the addresses are uncached virtual addresses so that should
Jack> be ok. However, it might be good to add debugging code to make
Jack> sure that you never cause a cachable reference to be made to any
Jack> of the fetchop memory. The resulting data corruption problems
Jack> are almost impossible to debug.
You are correct that I keep the lists in the memory. I may change the
allocator at a later stage to use descriptors instead, but for now I
think this should be ok. I'll add a check to make sure we never
receive a cached address back into mspec_free_page.
Thanks,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (3 preceding siblings ...)
2005-02-03 8:38 ` Jes Sorensen
@ 2005-02-03 11:19 ` Robin Holt
2005-02-03 14:55 ` Jes Sorensen
` (29 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Robin Holt @ 2005-02-03 11:19 UTC (permalink / raw)
To: linux-ia64
> Jack> Ugly hack. Isn't there a better way? (I know this isn't your
> Jack> code & you probably don't like this either. I had hoped for a
> Jack> cleaner solution in 2.6....)
>
> It's gross, ugly and I hate it ... not sure if there's a simpler way.
> Maybe we can use the same approach as the fbmem driver and do it all
> in the mmap() function, I will have to investigate that.
If you do it in the mmap, all the pages will be allocated and mapped
on the node doing the map. This will cause large applications
using multiple threads to incur _LARGE_ amounts of NUMA traffic.
The first fault is critical for performance.
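Robin's distinction between mmap-time and fault-time placement can be modeled in miniature. Here current_node stands in for numa_node_id(), and the vma_model structure and both policy functions are purely illustrative:

```c
#include <assert.h>

static int current_node;   /* stand-in for numa_node_id() */

struct vma_model {
    int page_node[16];     /* which node each page landed on */
    int populated[16];
};

/* mmap-time policy: every page is allocated up front, so all of them
 * land on the node of the cpu doing the mmap. */
static void populate_at_mmap(struct vma_model *v, int npages)
{
    int i;
    for (i = 0; i < npages; i++) {
        v->page_node[i] = current_node;
        v->populated[i] = 1;
    }
}

/* fault-time (nopage-style) policy: a page is placed only when first
 * touched, so it lands on the toucher's node. */
static void fault_in(struct vma_model *v, int page)
{
    if (!v->populated[page]) {
        v->page_node[page] = current_node;
        v->populated[page] = 1;
    }
}
```

With the mmap-time policy a worker thread on another node ends up with remote pages for the lifetime of the mapping, which is Robin's NUMA-traffic concern.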
>
> Jack> + /*
> Jack> +  * Use the bte to ensure cache lines
> Jack> +  * are actually pulled from the
> Jack> +  * processor back to the md.
> Jack> +  */
> Jack> +
>
> Jack> This doesn't need to be done if the memory was being used for
> Jack> fetchops or uncached memory.
>
> I'll check.
The bte zero is needed for memory mapped cached (one of the mechanisms
here) and also ensures the memory is zeroed out when returned to the
memory pool.
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (4 preceding siblings ...)
2005-02-03 11:19 ` Robin Holt
@ 2005-02-03 14:55 ` Jes Sorensen
2005-02-03 17:06 ` Jesse Barnes
` (28 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-03 14:55 UTC (permalink / raw)
To: linux-ia64
>>>>> "Robin" = Robin Holt <holt@sgi.com> writes:
Jack> Ugly hack. Isn't there a better way? (I know this isn't your
Jack> code & you probably don't like this either. I had hoped for a
Jack> cleaner solution in 2.6....)
>> It's gross, ugly and I hate it ... not sure if there's a simpler
>> way. Maybe we can use the same approach as the fbmem driver and do
>> it all in the mmap() function, I will have to investigate that.
Robin> If you do it in the mmap, all the pages will be allocated and
Robin> mapped on the node doing the map. This will result in large
Robin> applications using multiple threads to incur _LARGE_ amounts of
Robin> numa traffic. The first fault is critical for performance.
Robin,
Is this because the applications will normally allocate their fetchops
in the main thread before spawning off the threads? If the mmap is
done by the thread that will 'own' the individual fetchop this
wouldn't be a problem? Sorry, just trying to understand the nature of
how these applications work, my knowledge of MPI etc. is ehm
... limited ;)
Jack> + /*
Jack> +  * Use the bte to ensure cache lines
Jack> +  * are actually pulled from the
Jack> +  * processor back to the md.
Jack> +  */
Jack> +
>>
Jack> This doesn't need to be done if the memory was being used for
Jack> fetchops or uncached memory.
>> I'll check.
Robin> The bte zero is needed for memory mapped cached (one of the
Robin> mechanisms here) and also ensures the memory is zeroed out when
Robin> returned to the memory pool.
I already updated the comment to state that we want the memory
zeroed in the pool.
Thanks,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (5 preceding siblings ...)
2005-02-03 14:55 ` Jes Sorensen
@ 2005-02-03 17:06 ` Jesse Barnes
2005-02-03 17:31 ` Dean Roe
` (27 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jesse Barnes @ 2005-02-03 17:06 UTC (permalink / raw)
To: linux-ia64
On Thursday, February 3, 2005 12:38 am, Jes Sorensen wrote:
> Jack> ... + printk(KERN_WARNING "smp_call_function failed for " +
> Jack> "mspec_ipi_visibility! (%i)\n", status); + } + +
> Jack> sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
>
> Jack> Don't the TLBs need to be flushed before you flush
> Jack> caches. Otherwise, the cpu may reload data via speculation.
>
> Jack> I dont see any TLB flushing of the kernel TLB entries that map
> Jack> the chunks. That needs to be done. ...
>
> I thought about this one a fair bit after reading your comments and I
> don't think it's an issue. The pages in the kernel's cached mapping
> are identity mapped which means we shouldn't see any tlbs for this,
> which leaves us with just tlbs for pages that have explicitly been
> mapped somewhere - user tlbs should be removed when a process is shot
> down or pages unmapped and vfree() calls flush_tlb_all(). Or, am I
> missing something?
Even identity mapped regions have TLB entries associated with them. The
translation registers only cover the code and static data section, afaik.
When we take a miss on an identity mapped region, the kernel still does an
'itc', so you'll still need to purge the TLB.
Jesse
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (6 preceding siblings ...)
2005-02-03 17:06 ` Jesse Barnes
@ 2005-02-03 17:31 ` Dean Roe
2005-02-03 17:31 ` Robin Holt
` (26 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Dean Roe @ 2005-02-03 17:31 UTC (permalink / raw)
To: linux-ia64
>
> >>>>> "Robin" = Robin Holt <holt@sgi.com> writes:
>
> Jack> Ugly hack. Isn't there a better way? (I know this isn't your
> Jack> code & you probably don't like this either. I had hoped for a
> Jack> cleaner solution in 2.6....)
> >> It's gross, ugly and I hate it ... not sure if there's a simpler
> >> way. Maybe we can use the same approach as the fbmem driver and do
> >> it all in the mmap() function, I will have to investigate that.
>
> Robin> If you do it in the mmap, all the pages will be allocated and
> Robin> mapped on the node doing the map. This will result in large
> Robin> applications using multiple threads to incur _LARGE_ amounts of
> Robin> numa traffic. The first fault is critical for performance.
>
> Robin,
>
> Is this because the applications will normally allocate their fetchops
> in the main thread before spawning off the threads? If the mmap is
> done by the thread that will 'own' the individual fetchop this
> wouldn't be a problem? Sorry, just trying to understand the nature of
> how these applications work, my knowledge of MPI etc. is ehm
> ... limited ;)
You are correct. The parent process mmaps one segment of fetchop
space and forks off all the worker processes. The worker processes
inherit the fetchop mapping across the fork and each have their
portion of the segment they "own". Now every process can read/write
every other process' fetchop area.
Dean
---
Dean Roe
Silicon Graphics, Inc.
roe@sgi.com
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (7 preceding siblings ...)
2005-02-03 17:31 ` Dean Roe
@ 2005-02-03 17:31 ` Robin Holt
2005-02-03 17:41 ` Jesse Barnes
` (25 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Robin Holt @ 2005-02-03 17:31 UTC (permalink / raw)
To: linux-ia64
On Thu, Feb 03, 2005 at 09:55:26AM -0500, Jes Sorensen wrote:
> Is this because the applications will normally allocate their fetchops
> in the main thread before spawning off the threads? If the mmap is
> done by the thread that will 'own' the individual fetchop this
> wouldn't be a problem? Sorry, just trying to understand the nature of
> how these applications work, my knowledge of MPI etc. is ehm
> ... limited ;)
That is the normal way of doing things. Nearly all large applications
have been operating under that assumption for a very long time
and changing all those apps to do the mmap after spawning workers
is unlikely.
Robin
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (8 preceding siblings ...)
2005-02-03 17:31 ` Robin Holt
@ 2005-02-03 17:41 ` Jesse Barnes
2005-02-03 18:54 ` Jack Steiner
` (24 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jesse Barnes @ 2005-02-03 17:41 UTC (permalink / raw)
To: linux-ia64
On Wednesday, February 2, 2005 11:47 pm, Jes Sorensen wrote:
> >>>>> "Jesse" = Jesse Barnes <jbarnes@engr.sgi.com> writes:
>
> Jesse> On Wednesday, February 2, 2005 11:10 am, Jes Sorensen wrote:
> >> The remaining issue I am facing is that for the uncached pool I
> >> want to make it node aware and we want to use the spill pages from
> >> the lower granules for this pool which is easily done on
> >> SN2. However I see no generic way to get from a physical address to
> >> node id for pages that do not have a struct page assigned to
> >> them. I am curious if anyone has any suggestions for how to solve
> >> this in a generic way?
>
> Jesse> paddr_to_nid?
>
> It's not clear to me this covers the spill memory found in the lower
> granules and second it's really really slow, though that may not be a
> real problem.
You're right, it won't deal with low addresses, only regular, cachable ones
afaict.
Jesse
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (9 preceding siblings ...)
2005-02-03 17:41 ` Jesse Barnes
@ 2005-02-03 18:54 ` Jack Steiner
2005-02-03 19:39 ` David Mosberger
` (23 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jack Steiner @ 2005-02-03 18:54 UTC (permalink / raw)
To: linux-ia64
On Thu, Feb 03, 2005 at 03:38:45AM -0500, Jes Sorensen wrote:
> >>>>> "Jack" = Jack Steiner <steiner@sgi.com> writes:
Sorry - I missed your reply. Apparently, it looks like SPAM:
>>> Subject: ***** SUSPECTED SPAM ***** Re: [rfc] generic allocator and mspec driver
>>> From: Jes Sorensen <jes@wildopensource.com>
>>> X-Virus-Scanned: by cuda.sgi.com at sgi.com
>>> X-Barracuda-Spam-Score: 0.60
>>> X-Barracuda-Spam-Status: Yes, SCORE=0.60 using per-user scores of TAG_LEVEL=0.2 QUARANTINE_LEVEL=2.3 KILL_LEVEL=1000.0 tests=FORGED_RCVD_HELO,
>>> MARKETING_SUBJECT
>>> X-Barracuda-Spam-Report: Code version 2.64, rules version 2.1.1028
>>> Rule breakdown below pts rule name description
>>> ---- ---------------------- -------------------------------------------
>>> 0.60 MARKETING_SUBJECT Subject contains popular marketing words
>>> 0.00 FORGED_RCVD_HELO Received: contains a forged HELO
>>> X-Priority: 5 (Lowest)
Oh well....
>
> Jack> On Wed, Feb 02, 2005 at 02:10:32PM -0500, Jes Sorensen wrote:
> Jack> General comment:
>
> Jack, thanks for the comments, I'll look at it, however I have the
> following comments (which may or may not be correct from my side):
>
> Jack> 1) I may be paranoid, but I'm nervous about using memory visible
> Jack> to the VM system for fetchops. If ANYTHING in the kernel makes a
> Jack> reference to the memory and causes a TLB dropin to occur, then
> Jack> we are exposed to data corruption. If memory being used for
> Jack> fetchops is loaded into the cache, either directly or by
> Jack> speculation, then data corruption of the uncached fetchop memory
> Jack> can occur.
>
> Jack> Am I being overly paranoid? How can we be certain that nothing
> Jack> will ever reference the fetchop memory allocated from the general
> Jack> VM pool? lcrash, for example.
>
> Once a page is handed out using alloc_pages, the kernel won't touch it
> again unless you explicitly map it etc. or if some process touches
> memory at random, which could also happen with the spill pages. So I
> don't think the situation is any worse than it is for the spill pages
> in the lower granules.
In theory, you are correct & maybe I'm being overly paranoid. However,
a failure is almost impossible to debug.
Using the UC area in the low granules seems safe but it still took
us a long time to get it right. The kernel is unaware
of the memory & with the exception of the fetchop code, nothing in
the kernel ever references the spill areas.
I can't find any specific place that will fail using kernel memory for
mspec but my gut feeling is that we are more exposed to errors using
memory the kernel knows about than in using the spill areas.
For example, although I don't see any problems here because of its limited use,
virt_addr_valid() & pfn_valid() are FALSE for the spill area but TRUE
for kernel memory.
What prevents lcrash (or /dev/kmem or /proc/kcore) from referencing
special memory being used for fetchops? Granted, this takes
root privilege but the consequences of a bad reference can
cause silent data corruption that is impossible to debug.
Should we add code to prohibit these areas from referencing
granules being used for mspec memory?
Forgive the paranoia but several of us spent a long time debugging
some of these issues. Maybe all I'm asking is that everyone spend
a little extra time thinking of ways that the kernel could cause
a TLB entry to be made to a granule being used for mspec memory.
>
> Jack> 2) Is there a limit on the number of mspec pages that can be
> Jack> allocated? Is there a shaker that will cause unused mspec
> Jack> granules to be freed? What prevents a malicious/stupid/buggy
> Jack> user from filling the system with mspec pages?
>
> Currently there is no limit on this, however it could easily be
> imposed either by having a max number of granules allocated per node
> or system wide.
>
> I could add code to free granules when it's all released but I believe
> the amount of memory being pulled in for this in real life situations
> is so limited it's not really worth the complexity. Adding a hard
> limit for how much is allowed to be allocated seems simpler.
Seems like some sort of limit is needed. I agree - something simple
is all that is needed.
>
> Jack> 3) Is an "mspec" address a physical address or an uncached
> Jack> virtual address? Some places in the code appear
> Jack> inconsistent. For example:
>
> Jack>    mspec_free_page(TO_PHYS(maddr))
> Jack> vs.
> Jack>    maddr; /* phys addr of start of mspecs. */
>
> Uncached virtual, the comments you point out are leftovers from the
> old version of the driver.
>
> Jack> A few code specific issues:
>
> Jack> ...
> Jack> +  printk(KERN_WARNING "smp_call_function failed for "
> Jack> +         "mspec_ipi_visibility! (%i)\n", status);
> Jack> + }
> Jack> +
> Jack> + sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
>
> Jack> Don't the TLBs need to be flushed before you flush
> Jack> caches? Otherwise, the cpu may reload data via speculation.
>
> Jack> I don't see any TLB flushing of the kernel TLB entries that map
> Jack> the chunks. That needs to be done. ...
>
> I thought about this one a fair bit after reading your comments and I
> don't think it's an issue. The pages in the kernel's cached mapping
> are identity mapped which means we shouldn't see any tlbs for this,
> which leaves us with just tlbs for pages that have explicitly been
> mapped somewhere - user tlbs should be removed when a process is shot
> down or pages unmapped and vfree() calls flush_tlb_all(). Or, am I
> missing something?
Identity mapped memory still requires a TLB entry. Somewhere, these
entries need to be purged before using a newly allocated granule for fetchops
or uncached memory. Also, the TLB entries need to be purged before
the cache is flushed. And the cache flushing can't require a
cacheable TLB entry to be made.
>
> Jack> + /*
> Jack> +  * The kernel requires a page structure to be returned upon
> Jack> +  * success, but there are no page structures for low granule pages.
> Jack> +  * remap_page_range() creates the pte for us and we return a
> Jack> +  * bogus page back to the kernel fault handler to keep it happy
> Jack> +  * (the page is freed immediately there).
> Jack> +  */
>
> Jack> Ugly hack. Isn't there a better way? (I know this isn't your
> Jack> code & you probably don't like this either. I had hoped for a
> Jack> cleaner solution in 2.6....)
>
> It's gross, ugly and I hate it ... not sure if there's a simpler way.
> Maybe we can use the same approach as the fbmem driver and do it all
> in the mmap() function, I will have to investigate that.
>
> Jack> + /*
> Jack> +  * Use the bte to ensure cache lines
> Jack> +  * are actually pulled from the
> Jack> +  * processor back to the md.
> Jack> +  */
> Jack> +
>
> Jack> This doesn't need to be done if the memory was being used for
> Jack> fetchops or uncached memory.
>
> I'll check.
>
> Jack> + s <<= 1;
> Jack> + }
> Jack> + a = (unsigned long) h[j].next;
>
> Jack> It appears that you are keeping a linked list of free memory
> Jack> WITHIN the mspec memory itself. If I'm reading this correctly,
> Jack> all the addresses are uncached virtual addresses so that should
> Jack> be ok. However, it might be good to add debugging code to make
> Jack> sure that you never cause a cachable reference to be made to any
> Jack> of the fetchop memory. The resulting data corruption problems
> Jack> are almost impossible to debug.
>
> You are correct that I keep the lists in the memory. I may change the
> allocator at a later stage to use descriptors instead, but for now I
> think this should be ok. I'll add a check to make sure we never
> receive a cached address back into mspec_free_page.
>
> Thanks,
> Jes
--
Thanks
Jack Steiner (steiner@sgi.com) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (10 preceding siblings ...)
2005-02-03 18:54 ` Jack Steiner
@ 2005-02-03 19:39 ` David Mosberger
2005-02-03 20:58 ` Jes Sorensen
` (22 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-03 19:39 UTC (permalink / raw)
To: linux-ia64
>>>>> On Thu, 3 Feb 2005 12:54:06 -0600, Jack Steiner <steiner@sgi.com> said:
Jack> Am I being overly paranoid?
I don't think so. Your concerns are hitting the nail on the head...
--david
^ permalink raw reply [flat|nested] 36+ messages in thread

* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (11 preceding siblings ...)
2005-02-03 19:39 ` David Mosberger
@ 2005-02-03 20:58 ` Jes Sorensen
2005-02-03 21:40 ` Luck, Tony
` (21 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-03 20:58 UTC (permalink / raw)
To: linux-ia64
>>>>> "Jack" = Jack Steiner <steiner@sgi.com> writes:
Jack> On Thu, Feb 03, 2005 at 03:38:45AM -0500, Jes Sorensen wrote:
Jack> sorry - I missed your reply. Apparently, it looks like SPAM:
Urgh, Barracuda! No comment ;-(
Jack> On Wed, Feb 02, 2005 at 02:10:32PM -0500, Jes Sorensen wrote:
Jack> General comment:
>> Jack, thanks for the comments, I'll look at it. However, I have the
>> following comments (which may or may not be correct from my side):
>>
Jack> 1) I may be paranoid, but I'm nervous about using memory visible
Jack> to the VM system for fetchops. If ANYTHING in the kernel makes a
Jack> reference to the memory and causes a TLB dropin to occur, then
Jack> we are exposed to data corruption. If memory being used for
Jack> fetchops is loaded into the cache, either directly or by
Jack> speculation, then data corruption of the uncached fetchop memory
Jack> can occur.
Jack> Am I being overly paranoid? How can we be certain that nothing
Jack> will ever reference the fetchop memory allocated from the general
Jack> VM pool? lcrash, for example.
Jack,
I hear your concerns! However, at the same time, if something within
the kernel starts mucking with memory it doesn't own, that's a bug.
Admittedly that can be a royal pain in the b*tt to debug since a simple
load can trigger it.
I was more concerned that there would be a case where prefetching or
speculation would spill into a page in another granule and thereby
cause a cacheable operation on the memory. However I don't
understand the hardware at this level well enough to be 110% sure.
>> Once a page is handed out using alloc_pages, the kernel won't
>> touch it again unless you explicitly map it etc. or if some process
>> touches memory at random, which could also happen with the spill
>> pages. So I don't think the situation is any worse than it is for
>> the spill pages in the lower granules.
Jack> In theory, you are correct & maybe I'm being overly
Jack> paranoid. However, a failure is almost impossible to debug.
I wonder if it's something one could run a test on by running all of
the kernel data memory outside the identity mapped area and then
read-protecting the pages handed to the uncached allocator. Would
require a fair bit of instrumentation though so I am not sure it would
be feasible to try out.
Jack> Using the UC area in the low granules seems safe but it still
Jack> took us a long time to get it right. The kernel is unaware of
Jack> the memory & with the exception of the fetchop code, nothing in
Jack> the kernel ever references the spill areas.
Jack> I can't find any specific place that will fail using kernel
Jack> memory for mspec but my gut feeling is that we are more exposed
Jack> to errors using memory the kernel knows about than in using the
Jack> spill areas. For example, although I don't see any problems
Jack> here because of it's limited use, virt_addr_valid() &
Jack> pfn_valid() is FALSE for the spill area but TRUE for kernel
Jack> memory.
I added support for converting pages since it was my understanding
this was something there was a wish for. We can limit the uncached
allocator to just use the spill pages, but then we're back in the exact
same situation we had with the old allocator in fetchop.c.
Jack> What prevents lcrash (or /dev/kmem or /proc/kcore) from
Jack> referencing special memory being used for fetchops? Granted,
Jack> this takes root privilege but the consequences of a bad
Jack> reference can cause silent data corruption that is impossible to
Jack> debug. Should we add code to prohibit these areas from
Jack> referencing granules being used for mspec memory?
I believe you will have the same problem with anyone messing with
/dev/mem at random on the spill pages. I can't see us doing anything
here besides saying that fetchop isn't supported if someone plays those
kinds of games on their system as root.
Jack> Forgive the paranoia but several of us spent a long time
Jack> debugging some of these issues. Maybe all I'm asking is that
Jack> everyone spend a little extra time thinking of ways that the
Jack> kernel could cause a TLB entry to be made to a granule being
Jack> used for mspec memory.
Perfectly reasonable, which is also why I posted it to the list. More
brains thinking about a problem are more likely to find any potential
caveats.
>> I thought about this one a fair bit after reading your comments
>> and I don't think it's an issue. The pages in the kernel's cached
>> mapping are identity mapped which means we shouldn't see any tlbs
>> for this, which leaves us with just tlbs for pages that have
>> explicitly been mapped somewhere - user tlbs should be removed when
>> a process is shot down or pages unmapped and vfree() calls
>> flush_tlb_all(). Or, am I missing something?
Jack> Identity mapped memory still requires a TLB entry. Somewhere,
Jack> these entries need to be purged before using a newly allocated
Jack> granule for fetchops or uncached memory. Also, the TLB entries
Jack> need to be purged before the cache is flushed. And the cache
Jack> flushing can't require a cacheable TLB entry to be made.
Aha, I wasn't aware of this, I thought the region registers worked
like some giant TLB. I'll add a flush for the granule when it's pulled
into the allocator.
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (12 preceding siblings ...)
2005-02-03 20:58 ` Jes Sorensen
@ 2005-02-03 21:40 ` Luck, Tony
2005-02-04 1:00 ` David Mosberger
` (20 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2005-02-03 21:40 UTC (permalink / raw)
To: linux-ia64
>Aha, I wasn't aware of this, I thought the region registers worked
>like some giant TLB. I'll add a flush for the granule when it's pulled
>into the allocator.
No, the Alt-DTLB miss handler in ivt.S will blindly insert TLB
entries to fix any faults in region 6 or 7 ... giving the illusion
that it's all mapped 1:1, but actually only a few granules will
have mappings at any given time.
Maybe this handler should consult a "granule bitmap" before the
insert to make sure it doesn't compound an error from a wild access
by creating a TLB entry pointing at something that should be left
unmapped? But that would be a big piece of memory for a sparse
machine like an Altix, and an extra memory load in the Alt-DTLB
path.
-Tony
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (13 preceding siblings ...)
2005-02-03 21:40 ` Luck, Tony
@ 2005-02-04 1:00 ` David Mosberger
2005-02-13 21:57 ` Jes Sorensen
` (19 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-04 1:00 UTC (permalink / raw)
To: linux-ia64
>>>>> On 03 Feb 2005 15:58:16 -0500, Jes Sorensen <jes@wildopensource.com> said:
Jes> I hear your concerns! However, at the same time, if something
Jes> within the kernel starts mucking with memory it doesn't own,
Jes> that's a bug.
I'm not convinced of that. Jack already mentioned lcrash...
Jes> I was more concerned that there would be a case where
Jes> prefetching or speculation would spill into a page in another
Jes> granule and thereby cause a cacheable operation on the
Jes> memory.
This is not an issue. The AltTLB miss-handlers ignore misses
triggered by speculative accesses (including lfetch) or accesses
triggered by user-level.
Jack> What prevents lcrash (or /dev/kmem or /proc/kcore) from
Jack> referencing special memory being used for fetchops? Granted,
Jack> this takes root privilege but the consequences of a bad
Jack> reference can cause silent data corruption that is impossible
Jack> to debug. Should we add code to prohibit these areas from
Jack> referencing granules being used for mspec memory?
Jes> I believe you will have the same problem with anyone messing
Jes> with /dev/mem at random on the spill pages.
/dev/mem does check the memory attributes and accesses the memory
uncached if EFI_MEMORY_WB is _not_ set. This probably needs to be
extended to check what the current usage of the page is (as opposed to
what the potential usages are). Now that the kernel has its own copy
of the memory-map, that's probably something we can do more easily.
OTOH, for pages which have a page-descriptor, a flag-bit would be the
most logical choice and that would make it possible to efficiently
check whether it is OK to access a page via a cacheable reference.
Jes> Aha, I wasn't aware of this, I thought the region registers
Jes> worked like some giant TLB. I'll add a flush for the granule
Jes> when it's pulled into the allocator.
Don't you have to purge the all TLBs completely anyhow?
--david
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (14 preceding siblings ...)
2005-02-04 1:00 ` David Mosberger
@ 2005-02-13 21:57 ` Jes Sorensen
2005-02-14 19:12 ` Luck, Tony
` (18 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-13 21:57 UTC (permalink / raw)
To: linux-ia64
>>>>> "David" = David Mosberger <davidm@napali.hpl.hp.com> writes:
>>>>> On 03 Feb 2005 15:58:16 -0500, Jes Sorensen <jes@wildopensource.com> said:
Jes> I hear your concerns! However, at the same time, if something
Jes> within the kernel starts mucking with memory it doesn't own,
Jes> that's a bug.
David> I'm not convinced of that. Jack already mentioned lcrash...
Mmmmm, I took a look at lcrash and I'll refrain from commenting on
its method of operation here to avoid too much public swearing. Yuck!
Anyway, it seems that lcrash at least only accesses the memory using
read() and write(), which means it is relatively easy to use a
struct page flag to do uncached vs cached access on the fly. mmap on
the other hand would be a bit of a nightmare as it would require a
custom fault handler that would know when to go cached and when not
to.
How would you feel about using PG_arch_1 for this on ia64?
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (15 preceding siblings ...)
2005-02-13 21:57 ` Jes Sorensen
@ 2005-02-14 19:12 ` Luck, Tony
2005-02-14 19:17 ` David Mosberger
` (17 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2005-02-14 19:12 UTC (permalink / raw)
To: linux-ia64
>Anyway, it seems that lcrash at least only accesses the memory using
>read() and write(), which means it is relatively easy to use a
>struct page flag to do uncached vs cached access on the fly. mmap on
>the other hand would be a bit of a nightmare as it would require a
>custom fault handler that would know when to go cached and when not
>to.
>
>How would you feel about using PG_arch_1 for this on ia64?
Is ia64 the only architecture that has uncached vs. cached access
issues? If so, the PG_arch_1 might have to be the solution, but
surely others have cache coherence problems too if there are mixed
cacheable and uncacheable access to the same memory? In which case
a new generic bit would be more appropriate.
A few weeks back there was also some discussion about allocating
one or more bits to mark bad pages (those containing ECC errors)
as "do not touch".
-Tony
^ permalink raw reply [flat|nested] 36+ messages in thread
* RE: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (16 preceding siblings ...)
2005-02-14 19:12 ` Luck, Tony
@ 2005-02-14 19:17 ` David Mosberger
2005-02-15 8:43 ` Jes Sorensen
` (16 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-14 19:17 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 14 Feb 2005 11:12:54 -0800, "Luck, Tony" <tony.luck@intel.com> said:
Tony> If so, the PG_arch_1 might have to be the solution
Note that PG_arch_1 is already being used as an "i-cache coherent" flag.
--david
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (17 preceding siblings ...)
2005-02-14 19:17 ` David Mosberger
@ 2005-02-15 8:43 ` Jes Sorensen
2005-02-15 8:44 ` Jes Sorensen
` (15 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-15 8:43 UTC (permalink / raw)
To: linux-ia64
>>>>> "Tony" = Luck, Tony <tony.luck@intel.com> writes:
>> Anyway, it seems that lcrash at least only accesses the memory
>> using read() and write(), which means it is relatively easy to
>> use a struct page flag to do uncached vs cached access on the
>> fly. mmap on the other hand would be a bit of a nightmare as it
>> would require a custom fault handler that would know when to go
>> cached and when not to.
>>
>> How would you feel about using PG_arch_1 for this on ia64?
Tony> Is ia64 the only architecture that has uncached vs. cached
Tony> access issues? If so, the PG_arch_1 might have to be the
Tony> solution, but surely others have cache coherence problems too if
Tony> there are mixed cacheable and uncacheable access to the same
Tony> memory? In which case a new generic bit would be more
Tony> appropriate.
None of the ones I know well have this problem, but I have little
knowledge about this level of stuff on most architectures. The ones
that could have issues would probably be like PPC, PARISC and maybe
Alpha .....
Tony> A few weeks back there was also some discussion about allocating
Tony> one or more bits to mark bad pages (those containing ECC errors)
Tony> as "do not touch".
Probably something which needs to be done sooner or later.
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (18 preceding siblings ...)
2005-02-15 8:43 ` Jes Sorensen
@ 2005-02-15 8:44 ` Jes Sorensen
2005-02-15 17:35 ` Grant Grundler
` (14 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-15 8:44 UTC (permalink / raw)
To: linux-ia64
>>>>> "David" = David Mosberger <davidm@napali.hpl.hp.com> writes:
>>>>> On Mon, 14 Feb 2005 11:12:54 -0800, "Luck, Tony" <tony.luck@intel.com> said:
Tony> If so, the PG_arch_1 might have to be the solution
David> Note that PG_arch_1 is already being used as an "i-cache
David> coherent" flag.
Ahh, so not totally off. It's also used for internal debugging in
reiser4, but I'll happily ignore that while waving the 'friends don't
let friends run reiser4' flag ;)
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (19 preceding siblings ...)
2005-02-15 8:44 ` Jes Sorensen
@ 2005-02-15 17:35 ` Grant Grundler
2005-02-15 18:04 ` David Mosberger
` (13 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Grant Grundler @ 2005-02-15 17:35 UTC (permalink / raw)
To: linux-ia64
On Tue, Feb 15, 2005 at 03:43:10AM -0500, Jes Sorensen wrote:
> None of the ones I know well have this problem, but I have little
> knowledge about this level of stuff on most architectures. The ones
> that could have issues would probably be like PPC, PARISC and maybe
> Alpha .....
With parisc 2.0 CPU, one could map the same physical address
as both cacheable and uncacheable via two entries in the page table.
AFAIK, the uncacheable bit is only used to map MMIO address space.
In practice, I'm not sure what happens since I'm not aware of
any need for both mappings at the same time because we can use
"LDWA" (Load Word Absolute). LDWA can load from (almost) any physical
address and is by nature uncached (caches are VIVT).
Older PA1.x CPUs had 0xf0000000-0xffffffff hardwired to be uncacheable.
I.e. it's not possible to have an overlap.
hth,
grant
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (20 preceding siblings ...)
2005-02-15 17:35 ` Grant Grundler
@ 2005-02-15 18:04 ` David Mosberger
2005-02-15 18:05 ` David Mosberger
` (12 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-15 18:04 UTC (permalink / raw)
To: linux-ia64
>>>>> On 15 Feb 2005 03:43:10 -0500, Jes Sorensen <jes@wildopensource.com> said:
>>>>> "Tony" = Luck, Tony <tony.luck@intel.com> writes:
Tony> Is ia64 the only architecture that has uncached vs. cached
Tony> access issues? If so, the PG_arch_1 might have to be the
Tony> solution, but surely others have cache coherence problems too
Tony> if there are mixed cacheable and uncacheable access to the
Tony> same memory? In which case a new generic bit would be more
Tony> appropriate.
Jes> None of the ones I know well have this problem, but I have
Jes> little knowledge about this level of stuff on most
Jes> architectures. The ones that could have issues would probably
Jes> be like PPC, PARISC and maybe Alpha .....
Well, any CPU that allows overlapping mappings and does _any_ sort of
speculative accesses will have a problem. The only question is
whether you'll get an explicit error notification (e.g., MCA) or
silent data corruption.
--david
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (21 preceding siblings ...)
2005-02-15 18:04 ` David Mosberger
@ 2005-02-15 18:05 ` David Mosberger
2005-02-15 19:11 ` Robin Holt
` (11 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-15 18:05 UTC (permalink / raw)
To: linux-ia64
>>>>> On 15 Feb 2005 03:44:26 -0500, Jes Sorensen <jes@wildopensource.com> said:
Jes> Ahh, so not totally off. It's also used for internal debugging
Jes> in reiser4, but I'll happily ignore that while waving the
Jes> 'friends don't let friends run reiser4' flag ;)
What are they smoking? Which part of PG_arch_1 do they not understand?
^^^^
--david
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (22 preceding siblings ...)
2005-02-15 18:05 ` David Mosberger
@ 2005-02-15 19:11 ` Robin Holt
2005-02-15 19:19 ` David Mosberger
` (10 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Robin Holt @ 2005-02-15 19:11 UTC (permalink / raw)
To: linux-ia64
On Tue, Feb 15, 2005 at 10:04:49AM -0800, David Mosberger wrote:
> >>>>> On 15 Feb 2005 03:43:10 -0500, Jes Sorensen <jes@wildopensource.com> said:
>
> >>>>> "Tony" = Luck, Tony <tony.luck@intel.com> writes:
>
> Tony> Is ia64 the only architecture that has uncached vs. cached
> Tony> access issues? If so, the PG_arch_1 might have to be the
> Tony> solution, but surely others have cache coherence problems too
> Tony> if there are mixed cacheable and uncacheable access to the
> Tony> same memory? In which case a new generic bit would be more
> Tony> appropriate.
>
> Jes> None of the ones I know well have this problem, but I have
> Jes> little knowledge about this level of stuff on most
> Jes> architectures. The ones that could have issues would probably
> Jes> be like PPC, PARISC and maybe Alpha .....
>
> Well, any CPU that allows overlapping mappings and does _any_ sort of
> speculative accesses will have a problem. The only question is
> whether you'll get an explicit error notification (e.g., MCA) or
> silent data corruption.
We saw many silent data corruptions. The SN2 hardware will give an
MCA if both types of references are made by the cpu at about the
same time, but that depends on both transactions being on the bus
in close relationship to each other.
Robin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (23 preceding siblings ...)
2005-02-15 19:11 ` Robin Holt
@ 2005-02-15 19:19 ` David Mosberger
2005-02-15 19:49 ` Robin Holt
` (9 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-15 19:19 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 15 Feb 2005 13:11:20 -0600, Robin Holt <holt@sgi.com> said:
Robin> We saw many silent data corruptions. The SN2 hardware will
Robin> give an MCA if both types of references are made by the cpu
Robin> at about the same time, but that depends on both transactions
Robin> being on the bus in close relationship to each other.
Yes, but the point is it affects _most_ of today's machines, including
x86. A while ago, there were some nasty data corruption issues on AMD
chips caused by the AGP GART and the large mappings used by the Linux
kernel, for example.
--david
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (24 preceding siblings ...)
2005-02-15 19:19 ` David Mosberger
@ 2005-02-15 19:49 ` Robin Holt
2005-02-15 19:59 ` Jes Sorensen
` (8 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Robin Holt @ 2005-02-15 19:49 UTC (permalink / raw)
To: linux-ia64
On Tue, Feb 15, 2005 at 11:19:40AM -0800, David Mosberger wrote:
> >>>>> On Tue, 15 Feb 2005 13:11:20 -0600, Robin Holt <holt@sgi.com> said:
>
> Robin> We saw many silent data corruptions. The SN2 hardware will
> Robin> give an MCA if both types of references are made by the cpu
> Robin> at about the same time, but that depends on both transactions
> Robin> being on the bus in close relationship to each other.
>
> Yes, but the point is it affects _most_ of today's machines, including
> x86. A while ago, there were some nasty data corruption issues on AMD
> chips caused by the AGP GART and the large mappings used by the Linux
> kernel, for example.
One thing I had not thought about before this is SN2 hardware will also
allow us to change the memory protections for the cachelines in the pages
of this granule to not allow cached references. I am not sure if this
has been tested aside from design verification testing and probably some
of the offline diagnostic tests.
Would anybody have a concern with marking those cache lines as processor
uncached only? Would there ever be a time that we would expect _ANY_
type of cached reference, while the granule is being used as uncached
memory, that would be an acceptable use?
Thanks,
Robin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (25 preceding siblings ...)
2005-02-15 19:49 ` Robin Holt
@ 2005-02-15 19:59 ` Jes Sorensen
2005-02-15 20:02 ` Jes Sorensen
` (7 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-15 19:59 UTC (permalink / raw)
To: linux-ia64
>>>>> "David" = David Mosberger <davidm@napali.hpl.hp.com> writes:
>>>>> On 15 Feb 2005 03:44:26 -0500, Jes Sorensen <jes@wildopensource.com> said:
Jes> Ahh, so not totally off. It's also used for internal debugging in
Jes> reiser4, but I'll happily ignore that while waving the 'friends
Jes> don't let friends run reiser4' flag ;)
David> What are they smoking? Which part of PG_arch_1 do they not
David> understand? ^^^^
I suspect it's some sort of garden weed, the one labelled 'if we use
this magic flag, then nobody will notice us sneaking something in
under the rug'.
Gotta love 'em ;-)
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (26 preceding siblings ...)
2005-02-15 19:59 ` Jes Sorensen
@ 2005-02-15 20:02 ` Jes Sorensen
2005-02-15 20:24 ` Robin Holt
` (6 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-15 20:02 UTC (permalink / raw)
To: linux-ia64
>>>>> "Robin" = Robin Holt <holt@sgi.com> writes:
Robin> Would anybody have a concern with marking those cache lines as
Robin> processor uncached only? Would there ever be a time that we
Robin> would expect _ANY_ type of cached reference when the granule is
Robin> being used for uncached that would be an acceptable use?
Robin,
I don't see any valid such cases. In principle one has to go through
the reverse procedure to convert something from uncached to cached and
since we don't do that in any place I don't see any valid reason to do
so. Anyone doing it in kdb etc. would be in for what they are getting,
so I don't think that's a valid concern.
Cheers,
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (27 preceding siblings ...)
2005-02-15 20:02 ` Jes Sorensen
@ 2005-02-15 20:24 ` Robin Holt
2005-02-16 7:32 ` Jes Sorensen
` (5 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Robin Holt @ 2005-02-15 20:24 UTC (permalink / raw)
To: linux-ia64
On Tue, Feb 15, 2005 at 03:02:01PM -0500, Jes Sorensen wrote:
> >>>>> "Robin" = Robin Holt <holt@sgi.com> writes:
>
> Robin> Would anybody have a concern with marking those cache lines as
> Robin> processor uncached only? Would there ever be a time that we
> Robin> would expect _ANY_ type of cached reference when the granule is
> Robin> being used for uncached that would be an acceptable use?
>
> Robin,
>
> I don't see any valid such cases. In principle one has to go through
> the reverse procedure to convert something from uncached to cached and
> since we don't do that in any place I don't see any valid reason to do
> so. Anyone doing it in kdb etc. would be in for what they are getting,
> so I don't think that's a valid concern.
I guess I didn't follow you. Are you saying you would have no concern
with changing the memory protections to prevent that type of reference?
Thanks,
Robin
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (28 preceding siblings ...)
2005-02-15 20:24 ` Robin Holt
@ 2005-02-16 7:32 ` Jes Sorensen
2005-02-16 14:17 ` Jes Sorensen
` (4 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-16 7:32 UTC (permalink / raw)
To: linux-ia64
>>>>> "Robin" = Robin Holt <holt@sgi.com> writes:
Robin> On Tue, Feb 15, 2005 at 03:02:01PM -0500, Jes Sorensen wrote:
>> I don't see any valid such cases. In principle one has to go
>> through the reverse procedure to convert something from uncached to
>> cached and since we don't do that in any place I don't see any
>> valid reason to do so. Anyone doing it in kdb etc. would be in for
>> what they are getting, so I don't think thats valid concern.
Robin> I guess I didn't follow you. Are you saying you would have no
Robin> concern with changing the memory protections to prevent that
Robin> type of reference?
Yes
Jes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (29 preceding siblings ...)
2005-02-16 7:32 ` Jes Sorensen
@ 2005-02-16 14:17 ` Jes Sorensen
2005-02-16 17:50 ` Luck, Tony
` (3 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-16 14:17 UTC (permalink / raw)
To: linux-ia64
Hi
Here's another version of the mspec driver with the generic
allocator. I have tried to implement all the features we discussed on
the list earlier, including marking pages as uncached using PG_arch_1
and making /dev/mem honor it. In order to do so I had to move our
/dev/mem read/write functions into arch/ia64/kernel to avoid totally
cluttering drivers/char/mem.c with #ifdefs. I also moved the mspec.c
driver to arch/ia64/kernel to reflect that it now works on non-sn
hardware for uncached and cached mappings.
Any more ideas/objections for this one?
The patch is still against 2.6.11-rc2-mm2; however, it won't compile
straight out of the box as it depends on another patch I dumped on
Jesse which moved shubio.h to include/asm/sn.
Cheers,
Jes
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/Kconfig linux-2.6.11-rc2-mm2/arch/ia64/Kconfig
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/Kconfig 2005-02-02 04:31:21 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/Kconfig 2005-02-16 04:26:12 -08:00
@@ -225,6 +225,15 @@
If you are compiling a kernel that will run under SGI's IA-64
simulator (Medusa) then say Y, otherwise say N.
+config MSPEC
+ tristate "Special Memory support"
+ help
+ This driver allows for cached and uncached mappings of memory
+ to user processes. On SGI SN hardware it will also export the
+ special fetchop memory facility.
+ Fetchops are atomic memory operations that are implemented in the
+ memory controller on SGI SN hardware.
+
config FORCE_MAX_ZONEORDER
int
default "18"
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/configs/sn2_defconfig linux-2.6.11-rc2-mm2/arch/ia64/configs/sn2_defconfig
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/configs/sn2_defconfig 2005-02-02 04:31:21 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/configs/sn2_defconfig 2005-02-16 04:27:24 -08:00
@@ -82,6 +82,7 @@
# CONFIG_IA64_CYCLONE is not set
CONFIG_IOSAPIC=y
CONFIG_IA64_SGI_SN_SIM=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/defconfig linux-2.6.11-rc2-mm2/arch/ia64/defconfig
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/defconfig 2005-02-02 04:31:05 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/defconfig 2005-02-16 04:26:50 -08:00
@@ -80,6 +80,7 @@
CONFIG_DISCONTIGMEM=y
CONFIG_IA64_CYCLONE=y
CONFIG_IOSAPIC=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/Makefile linux-2.6.11-rc2-mm2/arch/ia64/kernel/Makefile
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/Makefile 2005-02-02 04:31:05 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/kernel/Makefile 2005-02-16 04:29:52 -08:00
@@ -7,7 +7,7 @@
obj-y := acpi.o entry.o efi.o efi_stub.o gate-data.o fsys.o ia64_ksyms.o irq.o irq_ia64.o \
irq_lsapic.o ivt.o machvec.o pal.o patch.o process.o perfmon.o ptrace.o sal.o \
salinfo.o semaphore.o setup.o signal.o sys_ia64.o time.o traps.o unaligned.o \
- unwind.o mca.o mca_asm.o topology.o
+ unwind.o mca.o mca_asm.o topology.o mem.o
obj-$(CONFIG_IA64_BRL_EMU) += brl_emu.o
obj-$(CONFIG_IA64_GENERIC) += acpi-ext.o
@@ -20,6 +20,7 @@
obj-$(CONFIG_PERFMON) += perfmon_default_smpl.o
obj-$(CONFIG_IA64_CYCLONE) += cyclone.o
obj-$(CONFIG_IA64_MCA_RECOVERY) += mca_recovery.o
+obj-$(CONFIG_MSPEC) += mspec.o
mca_recovery-y += mca_drv.o mca_drv_asm.o
# The gate DSO image is built using a special linker script.
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/mem.c linux-2.6.11-rc2-mm2/arch/ia64/kernel/mem.c
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/mem.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/kernel/mem.c 2005-02-16 04:17:09 -08:00
@@ -0,0 +1,151 @@
+/*
+ * arch/ia64/kernel/mem.c
+ *
+ * IA64 specific portions of /dev/mem access, notably handling
+ * read/write from uncached memory
+ *
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2005 Jes Sorensen <jes@wildopensource.com>
+ */
+
+
+#include <linux/mm.h>
+
+#include <asm/io.h>
+#include <asm/uaccess.h>
+
+
+extern loff_t memory_lseek(struct file * file, loff_t offset, int orig);
+extern int mmap_kmem(struct file * file, struct vm_area_struct * vma);
+extern int open_port(struct inode * inode, struct file * filp);
+
+
+static inline int range_is_allowed(unsigned long from, unsigned long to)
+{
+ unsigned long cursor;
+
+ cursor = from >> PAGE_SHIFT;
+ while ((cursor << PAGE_SHIFT) < to) {
+ if (!devmem_is_allowed(cursor))
+ return 0;
+ cursor++;
+ }
+ return 1;
+}
+
+
+/*
+ * This function reads the *physical* memory. The f_pos points directly
+ * to the memory location.
+ */
+static ssize_t read_mem(struct file * file, char __user * buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned long p = *ppos;
+ ssize_t read, sz;
+ struct page *page;
+ char *ptr;
+
+
+ if (!valid_phys_addr_range(p, &count))
+ return -EFAULT;
+ read = 0;
+
+ while (count > 0) {
+ /*
+ * Handle first page in case it's not aligned
+ */
+ if (-p & (PAGE_SIZE - 1))
+ sz = -p & (PAGE_SIZE - 1);
+ else
+ sz = min(PAGE_SIZE, count);
+
+ page = pfn_to_page(p >> PAGE_SHIFT);
+ /*
+ * On ia64 if a page has been mapped somewhere as
+ * uncached, then it must also be accessed uncached
+ * by the kernel or data corruption may occur
+ */
+ if (test_bit(PG_arch_1, &page->flags))
+ ptr = (char *)p + __IA64_UNCACHED_OFFSET;
+ else
+ ptr = __va(p);
+ if (copy_to_user(buf, ptr, sz))
+ return -EFAULT;
+ buf += sz;
+ p += sz;
+ count -= sz;
+ read += sz;
+ }
+
+ *ppos += read;
+ return read;
+}
+
+
+static ssize_t write_mem(struct file * file, const char __user * buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned long p = *ppos;
+ unsigned long copied;
+ ssize_t written, sz;
+ struct page *page;
+ char *ptr;
+
+ if (!valid_phys_addr_range(p, &count))
+ return -EFAULT;
+
+ written = 0;
+
+ if (!range_is_allowed(p, p + count))
+ return -EPERM;
+ /*
+ * Need virtual p below here
+ */
+ while (count > 0) {
+ /*
+ * Handle first page in case it's not aligned
+ */
+ if (-p & (PAGE_SIZE - 1))
+ sz = -p & (PAGE_SIZE - 1);
+ else
+ sz = min(PAGE_SIZE, count);
+
+ page = pfn_to_page(p >> PAGE_SHIFT);
+ /*
+ * On ia64 if a page has been mapped somewhere as
+ * uncached, then it must also be accessed uncached
+ * by the kernel or data corruption may occur
+ */
+ if (test_bit(PG_arch_1, &page->flags))
+ ptr = (char *)p + __IA64_UNCACHED_OFFSET;
+ else
+ ptr = __va(p);
+
+ copied = copy_from_user(ptr, buf, sz);
+ if (copied) {
+ ssize_t ret;
+
+ ret = written + (sz - copied);
+ if (ret)
+ return ret;
+ return -EFAULT;
+ }
+ buf += sz;
+ p += sz;
+ count -= sz;
+ written += sz;
+ }
+
+ *ppos += written;
+ return written;
+}
+
+
+struct file_operations mem_fops = {
+ .llseek = memory_lseek,
+ .read = read_mem,
+ .write = write_mem,
+ .mmap = mmap_kmem,
+ .open = open_port,
+};
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/mspec.c linux-2.6.11-rc2-mm2/arch/ia64/kernel/mspec.c
--- linux-2.6.11-rc2-mm2-vanilla/arch/ia64/kernel/mspec.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.11-rc2-mm2/arch/ia64/kernel/mspec.c 2005-02-16 05:12:22 -08:00
@@ -0,0 +1,804 @@
+/*
+ * Copyright (C) 2001-2005 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user processes.
+ * There are three types of memory made available through this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached pages are used to exploit the write combining feature of
+ * the ia64 cpu.
+ *
+ * Cached pages are used for areas of memory that are accessed as cached
+ * addresses on our partition and as uncached addresses from other
+ * partitions.  Due to a design constraint of the SN2 Shub, processors
+ * on the same FSB cannot perform both a cached and an uncached reference
+ * to the same cache line.  These special cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/efi.h>
+#include <linux/genalloc.h>
+#include <asm/page.h>
+#include <asm/pal.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define DEBUG 0
+
+#define FETCHOP_DRIVER_ID_STR "MSPEC Fetchop Device Driver"
+#define CACHED_DRIVER_ID_STR "MSPEC Cached Device Driver"
+#define UNCACHED_DRIVER_ID_STR "MSPEC Uncached Device Driver"
+#define REVISION "3.0"
+#define MSPEC_BASENAME "mspec"
+
+
+#define BTE_ZERO_BLOCK(_maddr, _len) \
+ bte_copy(0, _maddr - __IA64_UNCACHED_OFFSET, _len, BTE_WACQUIRE | BTE_ZERO_FILL, NULL)
+
+static int fetchop_mmap(struct file *file, struct vm_area_struct *vma);
+static int cached_mmap(struct file *file, struct vm_area_struct *vma);
+static int uncached_mmap(struct file *file, struct vm_area_struct *vma);
+static void mspec_open(struct vm_area_struct *vma);
+static void mspec_close(struct vm_area_struct *vma);
+static struct page * mspec_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *unused);
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+static struct file_operations fetchop_fops = {
+	.owner = THIS_MODULE,
+	.mmap = fetchop_mmap
+};
+static struct miscdevice fetchop_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgi_fetchop",
+	.fops = &fetchop_fops
+};
+
+
+static struct file_operations cached_fops = {
+	.owner = THIS_MODULE,
+	.mmap = cached_mmap
+};
+static struct miscdevice cached_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgi_cached",
+	.fops = &cached_fops
+};
+
+
+static struct file_operations uncached_fops = {
+	.owner = THIS_MODULE,
+	.mmap = uncached_mmap
+};
+static struct miscdevice uncached_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sgi_uncached",
+	.fops = &uncached_fops
+};
+
+
+static struct vm_operations_struct mspec_vm_ops = {
+	.open = mspec_open,
+	.close = mspec_close,
+	.nopage = mspec_nopage
+};
+
+/*
+ * There is one of these structs per node. It is used to manage the mspec
+ * space that is available on the node. Current assumption is that there is
+ * only 1 mspec block of memory per node.
+ */
+struct node_mspecs {
+ long maddr; /* phys addr of start of mspecs. */
+ int count; /* Total number of mspec pages. */
+ atomic_t free; /* Number of pages currently free. */
+ unsigned long bits[1]; /* Bitmap for managing pages. */
+};
+
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[1]; /* Array of MSPEC addresses. */
+};
+
+
+/*
+ * Memory Special statistics.
+ */
+struct mspec_stats {
+ atomic_t map_count; /* Number of active mmap's */
+ atomic_t pages_in_use; /* Number of mspec pages in use */
+ unsigned long pages_total; /* Total number of mspec pages */
+};
+
+static struct mspec_stats mspec_stats;
+static struct node_mspecs *node_mspecs[MAX_NUMNODES];
+
+#define MAX_UNCACHED_GRANULES 5
+static int allocated_granules;
+
+struct gen_pool *mspec_pool[MAX_NUMNODES];
+
+static void mspec_ipi_visibility(void *data)
+{
+ int status;
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+ if ((status != PAL_VISIBILITY_OK) &&
+ (status != PAL_VISIBILITY_OK_REMOTE_NEEDED))
+ printk(KERN_DEBUG "pal_prefetch_visibility() returns %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static void mspec_ipi_mc_drain(void *data)
+{
+ int status;
+ status = ia64_pal_mc_drain();
+ if (status)
+ printk(KERN_WARNING "ia64_pal_mc_drain() failed with %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static unsigned long
+mspec_get_new_chunk(struct gen_pool *poolp)
+{
+ struct page *page;
+ void *tmp;
+ int status, node, i;
+ unsigned long addr;
+
+ if (allocated_granules >= MAX_UNCACHED_GRANULES)
+ return 0;
+
+ node = (int)poolp->private;
+ page = alloc_pages_node(node, GFP_KERNEL,
+ IA64_GRANULE_SHIFT-PAGE_SHIFT);
+
+#if DEBUG
+ printk(KERN_INFO "get_new_chunk page %p, addr %lx\n",
+ page, (unsigned long)(page-vmem_map) << PAGE_SHIFT);
+#endif
+
+ /*
+ * Do magic if no mem on local node! XXX
+ */
+ if (!page)
+ return 0;
+ tmp = page_address(page);
+ memset(tmp, 0, IA64_GRANULE_SIZE);
+
+ /*
+ * There's a small race here where it's possible for someone to
+ * access the page through /dev/mem halfway through the conversion
+ * to uncached - not sure it's really worth bothering about
+ */
+ for (i = 0; i < (IA64_GRANULE_SIZE / PAGE_SIZE); i++)
+ page[i].flags |= PG_arch_1;
+
+	flush_tlb_kernel_range((unsigned long)tmp, (unsigned long)tmp + IA64_GRANULE_SIZE);
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+#if DEBUG
+ printk(KERN_INFO "pal_prefetch_visibility() returns %i on cpu %i\n",
+ status, get_cpu());
+#endif
+ if (!status) {
+ status = smp_call_function(mspec_ipi_visibility, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_visibility! (%i)\n", status);
+ }
+
+ sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
+ ia64_pal_mc_drain();
+ status = smp_call_function(mspec_ipi_mc_drain, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_mc_drain! (%i)\n", status);
+
+ addr = (unsigned long)tmp - PAGE_OFFSET + __IA64_UNCACHED_OFFSET;
+
+ allocated_granules++;
+ return addr;
+}
+
+
+/*
+ * mspec_alloc_page
+ *
+ * Allocate 1 mspec page.  Allocates on the requested node.  If no
+ * mspec pages are available on the requested node, fall back and
+ * search round-robin starting with the higher nodes.
+ */
+static unsigned long
+mspec_alloc_page(int nid, int type)
+{
+ unsigned long maddr;
+
+ maddr = gen_pool_alloc(mspec_pool[nid], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page returns %lx on node %i\n",
+ maddr, nid);
+#endif
+
+ /*
+	 * If no memory is available on our local node, try the
+ * remaining nodes in the system.
+ */
+ if (!maddr) {
+ int i;
+
+ for (i = MAX_NUMNODES - 1; i >= 0; i--) {
+			if (i == nid || !node_online(i))
+ continue;
+ maddr = gen_pool_alloc(mspec_pool[i], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page alternate search "
+ "returns %lx on node %i\n", maddr, i);
+#endif
+ if (maddr) {
+ break;
+ }
+ }
+ }
+
+ if (maddr)
+ atomic_inc(&mspec_stats.pages_in_use);
+
+ return maddr;
+}
+
+
+/*
+ * mspec_free_page
+ *
+ * Free a single mspec page.
+ */
+static void
+mspec_free_page(unsigned long maddr)
+{
+ int node;
+
+ node = nasid_to_cnodeid(NASID_GET(maddr));
+#if DEBUG
+ printk(KERN_DEBUG "mspec_free_page(%lx) on node %i\n", maddr, node);
+#endif
+	if ((maddr & (0xFUL << 60)) != __IA64_UNCACHED_OFFSET)
+ panic("mspec_free_page invalid address %lx\n", maddr);
+
+ atomic_dec(&mspec_stats.pages_in_use);
+ gen_pool_free(mspec_pool[node], maddr, PAGE_SIZE);
+}
+
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (!(vdata = vmalloc(sizeof(struct vma_data)+(pages-1)*sizeof(long))))
+ return -ENOMEM;
+ memset(vdata, 0, sizeof(struct vma_data)+(pages-1)*sizeof(long));
+
+ vdata->type = type;
+ vdata->lock = SPIN_LOCK_UNLOCKED;
+ vdata->refcnt = ATOMIC_INIT(1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_SHM | VM_LOCKED);
+	if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ atomic_inc(&mspec_stats.map_count);
+ return 0;
+}
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages;
+ bte_result_t br;
+
+ vdata = vma->vm_private_data;
+	if (atomic_dec(&vdata->refcnt) == 0) {
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] != 0) {
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ br = BTE_ZERO_BLOCK(vdata->maddr[i], PAGE_SIZE);
+				if (br == BTE_SUCCESS)
+ mspec_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): BTE failed to zero page\n");
+ }
+ }
+ if (vdata->count)
+ atomic_dec(&mspec_stats.map_count);
+ vfree(vdata);
+ }
+}
+
+/*
+ * mspec_get_one_pte
+ *
+ * Return the pte for a given mm and address.
+ */
+static __inline__ int
+mspec_get_one_pte(struct mm_struct *mm, u64 address, pte_t **pte)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, address);
+ if (pud_present(*pud)) {
+ pmd = pmd_offset(pud, address);
+ if (pmd_present(*pmd)) {
+ *pte = pte_offset_map(pmd, address);
+ if (pte_present(**pte)) {
+ return 0;
+ }
+ }
+ }
+ }
+
+ return -1;
+}
+
+/*
+ * mspec_nopage
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static struct page *
+mspec_nopage(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr = 0;
+ unsigned long pfn;
+ int index;
+ pte_t *pte;
+ struct page *page;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ spin_lock(&vdata->lock);
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+	if (vdata->maddr[index] == 0) {
+		vdata->count++;
+		maddr = mspec_alloc_page(numa_node_id(), vdata->type);
+		if (maddr == 0)
+			BUG();
+		vdata->maddr[index] = maddr;
+	} else if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ printk(KERN_ERR "page already mapped\n");
+ /*
+ * The page may have already been faulted by another
+ * pthread. If so, we need to avoid remapping the
+ * page or we will trip a BUG check in the
+ * remap_page_range() path.
+ */
+ goto getpage;
+ }
+
+	if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(vdata->maddr[index]);
+ else
+ paddr = __pa(TO_CAC(vdata->maddr[index]));
+
+ /*
+ * XXX - is this correct?
+ */
+ pfn = paddr >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, address, pfn, PAGE_SIZE, vma->vm_page_prot)) {
+ printk(KERN_ERR "remap_pfn_range failed!\n");
+ goto error;
+ }
+
+ /*
+ * The kernel requires a page structure to be returned upon
+ * success, but there are no page structures for low granule pages.
+ * remap_page_range() creates the pte for us and we return a
+ * bogus page back to the kernel fault handler to keep it happy
+ * (the page is freed immediately there).
+ */
+	if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ spin_lock(&vma->vm_mm->page_table_lock);
+ vma->vm_mm->rss++;
+ spin_unlock(&vma->vm_mm->page_table_lock);
+
+ set_pte(pte, pte_mkwrite(pte_mkdirty(*pte)));
+ }
+getpage:
+ /*
+ * Is this really correct?
+ */
+ page = alloc_pages(GFP_USER, 0);
+ spin_unlock(&vdata->lock);
+ return page;
+#if 0
+ page = pfn_to_page(pfn);
+ printk(KERN_ERR "fall through!\n");
+ goto getpage;
+ return page;
+#endif
+error:
+ if (maddr) {
+ mspec_free_page(vdata->maddr[index]);
+ vdata->maddr[index] = 0;
+ vdata->count--;
+ }
+ spin_unlock(&vdata->lock);
+ return NOPAGE_SIGBUS;
+}
+
+
+#ifdef CONFIG_PROC_FS
+static void *
+mspec_seq_start(struct seq_file *file, loff_t *offset)
+{
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void *
+mspec_seq_next(struct seq_file *file, void *data, loff_t *offset)
+{
+ (*offset)++;
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void
+mspec_seq_stop(struct seq_file *file, void *data)
+{
+}
+
+static int
+mspec_seq_show(struct seq_file *file, void *data)
+{
+ struct node_mspecs *mspecs;
+ int i;
+
+ i = *(loff_t *)data;
+
+ if (!i) {
+ seq_printf(file, "mappings : %i\n",
+ atomic_read(&mspec_stats.map_count));
+ seq_printf(file, "current mspec pages : %i\n",
+ atomic_read(&mspec_stats.pages_in_use));
+ seq_printf(file, "%4s %7s %7s\n", "node", "total", "free");
+ }
+
+ if (i < MAX_NUMNODES) {
+ int free, count;
+ mspecs = node_mspecs[i];
+ if (mspecs) {
+ free = atomic_read(&mspecs->free);
+ count = mspecs->count;
+ seq_printf(file, "%4d %7d %7d\n", i, count, free);
+ }
+ }
+
+ return 0;
+}
+
+
+static struct seq_operations mspec_seq_ops = {
+ .start = mspec_seq_start,
+ .next = mspec_seq_next,
+ .stop = mspec_seq_stop,
+ .show = mspec_seq_show
+};
+
+int
+mspec_proc_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &mspec_seq_ops);
+}
+
+static struct file_operations proc_mspec_operations = {
+ .open = mspec_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+
+static struct proc_dir_entry *proc_mspec;
+
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * mspec_build_memmap,
+ *
+ * Called at boot time to build a map of pages that can be used for
+ * memory special operations.
+ */
+static int __init
+mspec_build_memmap(unsigned long start, unsigned long end)
+{
+ long length;
+ bte_result_t br;
+ unsigned long vstart, vend;
+ int node;
+
+ length = end - start;
+ vstart = start + __IA64_UNCACHED_OFFSET;
+ vend = end + __IA64_UNCACHED_OFFSET;
+
+#if DEBUG
+ printk(KERN_ERR "mspec_build_memmap(%lx %lx)\n", start, end);
+#endif
+
+ br = BTE_ZERO_BLOCK(vstart, length);
+ if (br != BTE_SUCCESS)
+ panic("BTE Failed while trying to zero mspec page. bte_result_t = %d\n", (int) br);
+
+ node = nasid_to_cnodeid(NASID_GET(start));
+
+ for (; vstart < vend ; vstart += PAGE_SIZE) {
+#if DEBUG
+ printk(KERN_INFO "sticking %lx into the pool!\n", vstart);
+#endif
+ gen_pool_free(mspec_pool[node], vstart, PAGE_SIZE);
+ }
+
+ return 0;
+}
+
+/*
+ * Walk the EFI memory map to pull out leftover pages in the lower
+ * memory regions which do not end up in the regular memory map and
+ * stick them into the uncached allocator
+ */
+static void __init
+mspec_walk_efi_memmap_uc (void)
+{
+ void *efi_map_start, *efi_map_end, *p;
+ efi_memory_desc_t *md;
+ u64 efi_desc_size, start, end;
+
+ efi_map_start = __va(ia64_boot_param->efi_memmap);
+ efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
+ efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+ for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+ md = p;
+		if (md->attribute == EFI_MEMORY_UC) {
+ start = PAGE_ALIGN(md->phys_addr);
+ end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
+ if (mspec_build_memmap(start, end) < 0)
+ return;
+ }
+ }
+}
+
+
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int i, ret;
+
+ /*
+ * The fetchop device only works on SN2 hardware, uncached and cached
+ * memory drivers should both be valid on all ia64 hardware
+ */
+ if (ia64_platform_is("sn2")) {
+ if ((ret = misc_register(&fetchop_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ FETCHOP_DRIVER_ID_STR, ret);
+ return ret;
+ }
+ }
+ if ((ret = misc_register(&cached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+ if ((ret = misc_register(&uncached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+
+ /*
+ * /proc code needs to be updated to work with the new
+ * allocation scheme
+ */
+#ifdef CONFIG_PROC_FS
+ if (!(proc_mspec = create_proc_entry(MSPEC_BASENAME, 0444, NULL))){
+ printk(KERN_ERR "%s: unable to create proc entry",
+ FETCHOP_DRIVER_ID_STR);
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return -EINVAL;
+ }
+ proc_mspec->proc_fops = &proc_mspec_operations;
+#endif /* CONFIG_PROC_FS */
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!node_online(i))
+ continue;
+ printk(KERN_DEBUG "Setting up pool for node %i\n", i);
+ mspec_pool[i] = alloc_gen_pool(0, IA64_GRANULE_SHIFT,
+ &mspec_get_new_chunk, i);
+ }
+
+ mspec_walk_efi_memmap_uc();
+
+ printk(KERN_INFO "%s: v%s\n", FETCHOP_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", CACHED_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", UNCACHED_DRIVER_ID_STR, REVISION);
+
+ return 0;
+}
+
+
+static void __exit
+mspec_exit(void)
+{
+ BUG_ON(atomic_read(&mspec_stats.pages_in_use) > 0);
+
+#ifdef CONFIG_PROC_FS
+ remove_proc_entry(MSPEC_BASENAME, NULL);
+#endif
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+}
+
+
+unsigned long
+mspec_kalloc_page(int nid)
+{
+ return TO_AMO(mspec_alloc_page(nid, MSPEC_FETCHOP));
+}
+
+
+void
+mspec_kfree_page(unsigned long maddr)
+{
+ mspec_free_page(TO_PHYS(maddr) + __IA64_UNCACHED_OFFSET);
+}
+EXPORT_SYMBOL(mspec_kalloc_page);
+EXPORT_SYMBOL(mspec_kfree_page);
+
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
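The fetchop semantics described at the top of mspec.c — a command encoded in the low bits of an uncached address, executed by the Shub memory controller, which returns the pre-operation value — can be illustrated with a small software mock. Everything below (`amo_mock_t`, `amo_mock_load_op`) is a hypothetical stand-in written for this note only; on real SN2 hardware the operation happens in the memory controller on an uncached reference, not in software:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Offsets mirror the FETCHOP_* command encodings in mspec.h. */
#define FETCHOP_LOAD		0
#define FETCHOP_INCREMENT	8
#define FETCHOP_DECREMENT	16
#define FETCHOP_CLEAR		24

/* Mock of the 64-byte AMO variable: only the first 8 bytes hold data. */
typedef struct { uint64_t variable; uint64_t unused[7]; } amo_mock_t;

/*
 * Software stand-in for FETCHOP_LOAD_OP().  The "address offset" that
 * the hardware would decode is decoded here in software instead.
 * Returns the pre-operation value, as the hardware does.
 */
static uint64_t amo_mock_load_op(amo_mock_t *amo, int op)
{
	uint64_t old = amo->variable;

	switch (op) {
	case FETCHOP_INCREMENT:	amo->variable++; break;
	case FETCHOP_DECREMENT:	amo->variable--; break;
	case FETCHOP_CLEAR:	amo->variable = 0; break;
	default:		break;	/* FETCHOP_LOAD: plain read */
	}
	return old;
}
```

A userspace client of the real driver would instead mmap /dev/sgi_fetchop and issue ordinary loads at these command offsets within each 64-byte variable.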
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/drivers/char/mem.c linux-2.6.11-rc2-mm2/drivers/char/mem.c
--- linux-2.6.11-rc2-mm2-vanilla/drivers/char/mem.c 2005-02-02 04:31:22 -08:00
+++ linux-2.6.11-rc2-mm2/drivers/char/mem.c 2005-02-16 04:23:20 -08:00
@@ -125,39 +125,7 @@
}
return 1;
}
-static ssize_t do_write_mem(void *p, unsigned long realp,
- const char __user * buf, size_t count, loff_t *ppos)
-{
- ssize_t written;
- unsigned long copied;
-
- written = 0;
-#if defined(__sparc__) || (defined(__mc68000__) && defined(CONFIG_MMU))
- /* we don't have page 0 mapped on sparc and m68k.. */
- if (realp < PAGE_SIZE) {
- unsigned long sz = PAGE_SIZE-realp;
- if (sz > count) sz = count;
- /* Hmm. Do something? */
- buf+=sz;
- p+=sz;
- count-=sz;
- written+=sz;
- }
-#endif
- if (!range_is_allowed(realp, realp+count))
- return -EPERM;
- copied = copy_from_user(p, buf, count);
- if (copied) {
- ssize_t ret = written + (count - copied);
- if (ret)
- return ret;
- return -EFAULT;
- }
- written += count;
- *ppos += written;
- return written;
-}
#ifndef ARCH_HAS_DEV_MEM
/*
@@ -196,6 +164,40 @@
return read;
}
+static ssize_t do_write_mem(void *p, unsigned long realp,
+ const char __user * buf, size_t count, loff_t *ppos)
+{
+ ssize_t written;
+ unsigned long copied;
+
+ written = 0;
+#if defined(__sparc__) || (defined(__mc68000__) && defined(CONFIG_MMU))
+ /* we don't have page 0 mapped on sparc and m68k.. */
+ if (realp < PAGE_SIZE) {
+ unsigned long sz = PAGE_SIZE-realp;
+ if (sz > count) sz = count;
+ /* Hmm. Do something? */
+ buf+=sz;
+ p+=sz;
+ count-=sz;
+ written+=sz;
+ }
+#endif
+ if (!range_is_allowed(realp, realp+count))
+ return -EPERM;
+ copied = copy_from_user(p, buf, count);
+ if (copied) {
+ ssize_t ret = written + (count - copied);
+
+ if (ret)
+ return ret;
+ return -EFAULT;
+ }
+ written += count;
+ *ppos += written;
+ return written;
+}
+
static ssize_t write_mem(struct file * file, const char __user * buf,
size_t count, loff_t *ppos)
{
@@ -207,7 +209,8 @@
}
#endif
-static int mmap_kmem(struct file * file, struct vm_area_struct * vma)
+
+int mmap_kmem(struct file * file, struct vm_area_struct * vma)
{
#ifdef pgprot_noncached
unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
@@ -553,7 +556,7 @@
* also note that seeking relative to the "end of file" isn't supported:
* it has no meaning, so it returns -EINVAL.
*/
-static loff_t memory_lseek(struct file * file, loff_t offset, int orig)
+loff_t memory_lseek(struct file * file, loff_t offset, int orig)
{
loff_t ret;
@@ -576,7 +579,7 @@
return ret;
}
-static int open_port(struct inode * inode, struct file * filp)
+int open_port(struct inode * inode, struct file * filp)
{
return capable(CAP_SYS_RAWIO) ? 0 : -EPERM;
}
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/io.h linux-2.6.11-rc2-mm2/include/asm-ia64/io.h
--- linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/io.h 2005-02-02 04:31:11 -08:00
+++ linux-2.6.11-rc2-mm2/include/asm-ia64/io.h 2005-02-16 02:38:18 -08:00
@@ -481,4 +481,6 @@
#define BIO_VMERGE_BOUNDARY (ia64_max_iommu_merge_mask + 1)
#endif
+#define ARCH_HAS_DEV_MEM
+
#endif /* _ASM_IA64_IO_H */
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/sn/fetchop.h linux-2.6.11-rc2-mm2/include/asm-ia64/sn/fetchop.h
--- linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/sn/fetchop.h 2004-12-24 13:35:00 -08:00
+++ linux-2.6.11-rc2-mm2/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00
@@ -1,85 +0,0 @@
-/*
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
- */
-
-#ifndef _ASM_IA64_SN_FETCHOP_H
-#define _ASM_IA64_SN_FETCHOP_H
-
-#include <linux/config.h>
-
-#define FETCHOP_BASENAME "sgi_fetchop"
-#define FETCHOP_FULLNAME "/dev/sgi_fetchop"
-
-
-
-#define FETCHOP_VAR_SIZE 64 /* 64 byte per fetchop variable */
-
-#define FETCHOP_LOAD 0
-#define FETCHOP_INCREMENT 8
-#define FETCHOP_DECREMENT 16
-#define FETCHOP_CLEAR 24
-
-#define FETCHOP_STORE 0
-#define FETCHOP_AND 24
-#define FETCHOP_OR 32
-
-#define FETCHOP_CLEAR_CACHE 56
-
-#define FETCHOP_LOAD_OP(addr, op) ( \
- *(volatile long *)((char*) (addr) + (op)))
-
-#define FETCHOP_STORE_OP(addr, op, x) ( \
- *(volatile long *)((char*) (addr) + (op)) = (long) (x))
-
-#ifdef __KERNEL__
-
-/*
- * Convert a region 6 (kaddr) address to the address of the fetchop variable
- */
-#define FETCHOP_KADDR_TO_MSPEC_ADDR(kaddr) TO_MSPEC(kaddr)
-
-
-/*
- * Each Atomic Memory Operation (AMO formerly known as fetchop)
- * variable is 64 bytes long. The first 8 bytes are used. The
- * remaining 56 bytes are unaddressable due to the operation taking
- * that portion of the address.
- *
- * NOTE: The AMO_t _MUST_ be placed in either the first or second half
- * of the cache line. The cache line _MUST NOT_ be used for anything
- * other than additional AMO_t entries. This is because there are two
- * addresses which reference the same physical cache line. One will
- * be a cached entry with the memory type bits all set. This address
- * may be loaded into processor cache. The AMO_t will be referenced
- * uncached via the memory special memory type. If any portion of the
- * cached cache-line is modified, when that line is flushed, it will
- * overwrite the uncached value in physical memory and lead to
- * inconsistency.
- */
-typedef struct {
- u64 variable;
- u64 unused[7];
-} AMO_t;
-
-
-/*
- * The following APIs are externalized to the kernel to allocate/free pages of
- * fetchop variables.
- * fetchop_kalloc_page - Allocate/initialize 1 fetchop page on the
- * specified cnode.
- * fetchop_kfree_page - Free a previously allocated fetchop page
- */
-
-unsigned long fetchop_kalloc_page(int nid);
-void fetchop_kfree_page(unsigned long maddr);
-
-
-#endif /* __KERNEL__ */
-
-#endif /* _ASM_IA64_SN_FETCHOP_H */
-
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/sn/mspec.h linux-2.6.11-rc2-mm2/include/asm-ia64/sn/mspec.h
--- linux-2.6.11-rc2-mm2-vanilla/include/asm-ia64/sn/mspec.h 1969-12-31 16:00:00 -08:00
+++ linux-2.6.11-rc2-mm2/include/asm-ia64/sn/mspec.h 2005-02-02 04:52:48 -08:00
@@ -0,0 +1,72 @@
+/*
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
+ */
+
+#ifndef _ASM_IA64_SN_MSPEC_H
+#define _ASM_IA64_SN_MSPEC_H
+
+#define FETCHOP_VAR_SIZE 64 /* 64 bytes per fetchop variable */
+
+#define FETCHOP_LOAD 0
+#define FETCHOP_INCREMENT 8
+#define FETCHOP_DECREMENT 16
+#define FETCHOP_CLEAR 24
+
+#define FETCHOP_STORE 0
+#define FETCHOP_AND 24
+#define FETCHOP_OR 32
+
+#define FETCHOP_CLEAR_CACHE 56
+
+#define FETCHOP_LOAD_OP(addr, op) ( \
+ *(volatile long *)((char*) (addr) + (op)))
+
+#define FETCHOP_STORE_OP(addr, op, x) ( \
+ *(volatile long *)((char*) (addr) + (op)) = (long) (x))
+
+#ifdef __KERNEL__
+
+/*
+ * Each Atomic Memory Operation (AMO formerly known as fetchop)
+ * variable is 64 bytes long. The first 8 bytes are used. The
+ * remaining 56 bytes are unaddressable due to the operation taking
+ * that portion of the address.
+ *
+ * NOTE: The AMO_t _MUST_ be placed in either the first or second half
+ * of the cache line. The cache line _MUST NOT_ be used for anything
+ * other than additional AMO_t entries. This is because there are two
+ * addresses which reference the same physical cache line. One will
+ * be a cached entry with the memory type bits all set. This address
+ * may be loaded into processor cache. The AMO_t will be referenced
+ * uncached via the mspec (memory special) memory type. If any portion
+ * of the cached cache line is modified, when that line is flushed, it will
+ * overwrite the uncached value in physical memory and lead to
+ * inconsistency.
+ */
+typedef struct {
+ u64 variable;
+ u64 unused[7];
+} AMO_t;
+
+
+/*
+ * The following APIs are externalized to the kernel to allocate/free pages of
+ * fetchop variables.
+ * mspec_kalloc_page - Allocate/initialize 1 fetchop page on the
+ * specified cnode.
+ * mspec_kfree_page - Free a previously allocated fetchop page
+ */
+
+extern unsigned long mspec_kalloc_page(int);
+extern void mspec_kfree_page(unsigned long);
+
+
+#endif /* __KERNEL__ */
+
+#endif /* _ASM_IA64_SN_MSPEC_H */
+
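The AMO_t layout rule above (8 bytes used, 56 unaddressable, and exactly two entries per 128-byte SN2 cache line) can be pinned down with a small compile-time check. The `amo_layout_check` typedef below is illustrative only, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the AMO_t definition added in mspec.h. */
typedef struct {
	uint64_t variable;
	uint64_t unused[7];
} AMO_t;

/*
 * Compile-time layout check: each AMO_t is 64 bytes, so exactly two
 * fit in a 128-byte cache line.  A negative array size here would
 * break the build if the layout ever changed.
 */
typedef char amo_layout_check[(2 * sizeof(AMO_t) == 128) ? 1 : -1];
```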
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/include/linux/genalloc.h linux-2.6.11-rc2-mm2/include/linux/genalloc.h
--- linux-2.6.11-rc2-mm2-vanilla/include/linux/genalloc.h 1969-12-31 16:00:00 -08:00
+++ linux-2.6.11-rc2-mm2/include/linux/genalloc.h 2005-02-02 04:52:48 -08:00
@@ -0,0 +1,46 @@
+/*
+ * Basic general purpose allocator for managing special purpose memory
+ * not managed by the regular kmalloc/kfree interface.
+ * Uses for this include on-device special memory, uncached memory,
+ * etc.
+ *
+ * This code is based on the buddy allocator found in the sym53c8xx_2
+ * driver, adapted for general purpose use.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/spinlock.h>
+
+#define ALLOC_MIN_SHIFT 5 /* 32 bytes minimum */
+/*
+ * Link between free memory chunks of a given size.
+ */
+struct gen_pool_link {
+ struct gen_pool_link *next;
+};
+
+/*
+ * Memory pool of a given kind.
+ * Ideally, we want to use:
+ * 1) one pool for memory we do not need to involve in DMA.
+ * 2) the same pool for controllers that require the same DMA
+ *    constraints and features.
+ * The OS specific m_pool_id_t thing and the gen_pool_match()
+ * method are expected to tell the driver about these.
+ */
+struct gen_pool {
+ spinlock_t lock;
+ unsigned long (*get_new_chunk)(struct gen_pool *);
+ struct gen_pool *next;
+ struct gen_pool_link *h;
+ unsigned long private;
+ int max_chunk_shift;
+};
+
+unsigned long gen_pool_alloc(struct gen_pool *poolp, int size);
+void gen_pool_free(struct gen_pool *mp, unsigned long ptr, int size);
+struct gen_pool *alloc_gen_pool(int nr_chunks, int max_chunk_shift,
+ unsigned long (*fp)(struct gen_pool *),
+ unsigned long data);
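gen_pool_alloc() hands out naturally aligned power-of-two chunks between ALLOC_MIN_SHIFT (32 bytes) and the pool's max_chunk_shift, so a request is rounded up to the nearest such size class. A minimal sketch of that rounding is below; `gen_pool_size_shift` is a hypothetical helper for illustration — the patch performs this computation inside gen_pool_alloc() itself:

```c
#include <assert.h>

#define ALLOC_MIN_SHIFT 5	/* 32-byte minimum chunk, as in genalloc.h */

/*
 * Round a request up to the allocator's power-of-two size class.
 * Returns the chunk shift, or -1 if the request exceeds the largest
 * chunk the pool can hand out.
 */
static int gen_pool_size_shift(int size, int max_chunk_shift)
{
	int shift = ALLOC_MIN_SHIFT;

	if (size <= 0 || size > (1 << max_chunk_shift))
		return -1;		/* request cannot be satisfied */
	while ((1 << shift) < size)
		shift++;
	return shift;
}
```

With IA64_GRANULE_SHIFT as max_chunk_shift, for example, a PAGE_SIZE request from mspec_alloc_page() maps to the PAGE_SHIFT class.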
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/init/main.c linux-2.6.11-rc2-mm2/init/main.c
--- linux-2.6.11-rc2-mm2-vanilla/init/main.c 2005-02-02 04:31:24 -08:00
+++ linux-2.6.11-rc2-mm2/init/main.c 2005-02-02 04:52:48 -08:00
@@ -78,6 +78,7 @@
static int init(void *);
+extern void gen_pool_init(void);
extern void init_IRQ(void);
extern void sock_init(void);
extern void fork_init(unsigned long);
@@ -482,6 +483,7 @@
#endif
vfs_caches_init_early();
mem_init();
+ gen_pool_init();
kmem_cache_init();
numa_policy_init();
if (late_time_init)
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/lib/Makefile linux-2.6.11-rc2-mm2/lib/Makefile
--- linux-2.6.11-rc2-mm2-vanilla/lib/Makefile 2005-02-02 04:31:24 -08:00
+++ linux-2.6.11-rc2-mm2/lib/Makefile 2005-02-02 04:54:02 -08:00
@@ -6,7 +6,7 @@
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
kobject.o kref.o idr.o div64.o parser.o int_sqrt.o \
bitmap.o extable.o kobject_uevent.o prio_tree.o \
- sha1.o halfmd4.o
+ sha1.o halfmd4.o genalloc.o
ifeq ($(CONFIG_DEBUG_KOBJECT),y)
CFLAGS_kobject.o += -DDEBUG
diff -X /usr/people/jes/exclude-linux -urN linux-2.6.11-rc2-mm2-vanilla/lib/genalloc.c linux-2.6.11-rc2-mm2/lib/genalloc.c
--- linux-2.6.11-rc2-mm2-vanilla/lib/genalloc.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.11-rc2-mm2/lib/genalloc.c 2005-02-02 04:52:48 -08:00
@@ -0,0 +1,218 @@
+/*
+ * Basic general purpose allocator for managing special purpose memory
+ * not managed by the regular kmalloc/kfree interface.
+ * Uses for this include on-device special memory, uncached memory,
+ * etc.
+ *
+ * This code is based on the buddy allocator found in the sym53c8xx_2
+ * driver Copyright (C) 1999-2001 Gerard Roudier <groudier@free.fr>,
+ * and adapted for general purpose use.
+ *
+ * This source code is licensed under the GNU General Public License,
+ * Version 2. See the file COPYING for more details.
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/stddef.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/spinlock.h>
+#include <linux/genalloc.h>
+
+#include <asm/page.h>
+#include <asm/pal.h>
+
+
+#define DEBUG 0
+
+struct gen_pool *alloc_gen_pool(int nr_chunks, int max_chunk_shift,
+ unsigned long (*fp)(struct gen_pool *),
+ unsigned long data)
+{
+ struct gen_pool *poolp;
+ unsigned long tmp;
+ int i;
+
+ /*
+ * This is really an arbitrary limit, +10 is enough for
+ * IA64_GRANULE_SHIFT.
+ */
+ if ((max_chunk_shift > (PAGE_SHIFT + 10)) ||
+ ((max_chunk_shift < ALLOC_MIN_SHIFT) && max_chunk_shift))
+ return NULL;
+
+ if (!max_chunk_shift)
+ max_chunk_shift = PAGE_SHIFT;
+
+ poolp = kmalloc(sizeof(struct gen_pool), GFP_KERNEL);
+ if (!poolp)
+ return NULL;
+ memset(poolp, 0, sizeof(struct gen_pool));
+ poolp->h = kmalloc(sizeof(struct gen_pool_link) *
+ (max_chunk_shift - ALLOC_MIN_SHIFT + 1),
+ GFP_KERNEL);
+ if (!poolp->h) {
+ printk(KERN_WARNING "alloc_gen_pool() failed to allocate\n");
+ kfree(poolp);
+ return NULL;
+ }
+ memset(poolp->h, 0, sizeof(struct gen_pool_link) *
+ (max_chunk_shift - ALLOC_MIN_SHIFT + 1));
+
+ spin_lock_init(&poolp->lock);
+ poolp->get_new_chunk = fp;
+ poolp->max_chunk_shift = max_chunk_shift;
+ poolp->private = data;
+
+ for (i = 0; i < nr_chunks; i++) {
+ tmp = poolp->get_new_chunk(poolp);
+ printk(KERN_INFO "allocated %lx\n", tmp);
+ if (!tmp)
+ break;
+ gen_pool_free(poolp, tmp, (1 << poolp->max_chunk_shift));
+ }
+
+ return poolp;
+}
+
+
+/*
+ * Simple power of two buddy-like generic allocator.
+ * Provides naturally aligned memory chunks.
+ */
+unsigned long gen_pool_alloc(struct gen_pool *poolp, int size)
+{
+ int j, i, s, max_chunk_size;
+ unsigned long a, flags;
+ struct gen_pool_link *h = poolp->h;
+
+ max_chunk_size = 1 << poolp->max_chunk_shift;
+
+ if (size > max_chunk_size)
+ return 0;
+
+ i = 0;
+ s = (1 << ALLOC_MIN_SHIFT);
+ while (size > s) {
+ s <<= 1;
+ i++;
+ }
+
+#if DEBUG
+ printk(KERN_DEBUG "gen_pool_alloc: s %02x, i %i, h %p\n", s, i, h);
+#endif
+
+ j = i;
+
+ spin_lock_irqsave(&poolp->lock, flags);
+ while (!h[j].next) {
+ if (s == max_chunk_size) {
+ struct gen_pool_link *ptr;
+ spin_unlock_irqrestore(&poolp->lock, flags);
+ ptr = (struct gen_pool_link *)poolp->get_new_chunk(poolp);
+ spin_lock_irqsave(&poolp->lock, flags);
+ h[j].next = ptr;
+ if (h[j].next)
+ h[j].next->next = NULL;
+#if DEBUG
+ printk(KERN_DEBUG "gen_pool_alloc() max chunk j %i\n", j);
+#endif
+ break;
+ }
+ j++;
+ s <<= 1;
+ }
+ a = (unsigned long) h[j].next;
+ if (a) {
+ h[j].next = h[j].next->next;
+ /*
+ * This should be split into a separate function doing
+ * the chunk split in order to support custom handling
+ * of memory not physically accessible by the host.
+ */
+ while (j > i) {
+#if DEBUG
+ printk(KERN_DEBUG "gen_pool_alloc() splitting i %i j %i %x a %02lx\n", i, j, s, a);
+#endif
+ j -= 1;
+ s >>= 1;
+ h[j].next = (struct gen_pool_link *) (a + s);
+ h[j].next->next = NULL;
+ }
+ }
+ spin_unlock_irqrestore(&poolp->lock, flags);
+#if DEBUG
+ printk(KERN_DEBUG "gen_pool_alloc(%d) = %p\n", size, (void *) a);
+#endif
+ return a;
+}
+
+/*
+ * Counter-part of the generic allocator.
+ */
+void gen_pool_free(struct gen_pool *poolp, unsigned long ptr, int size)
+{
+ struct gen_pool_link *q;
+ struct gen_pool_link *h = poolp->h;
+ unsigned long a, b, flags;
+ int i, max_chunk_size;
+ int s = (1 << ALLOC_MIN_SHIFT);
+
+#if DEBUG
+ printk(KERN_DEBUG "gen_pool_free(%lx, %d)\n", ptr, size);
+#endif
+
+ max_chunk_size = 1 << poolp->max_chunk_shift;
+
+ if (size > max_chunk_size)
+ return;
+
+ i = 0;
+ while (size > s) {
+ s <<= 1;
+ i++;
+ }
+
+ a = ptr;
+
+ spin_lock_irqsave(&poolp->lock, flags);
+ while (1) {
+ if (s == max_chunk_size) {
+ ((struct gen_pool_link *)a)->next = h[i].next;
+ h[i].next = (struct gen_pool_link *)a;
+ break;
+ }
+ b = a ^ s;
+ q = &h[i];
+
+ while (q->next && q->next != (struct gen_pool_link *)b) {
+ q = q->next;
+ }
+
+ if (!q->next) {
+ ((struct gen_pool_link *)a)->next = h[i].next;
+ h[i].next = (struct gen_pool_link *)a;
+ break;
+ }
+ q->next = q->next->next;
+ a = a & b;
+ s <<= 1;
+ i++;
+ }
+ spin_unlock_irqrestore(&poolp->lock, flags);
+}
+
+
+int __init gen_pool_init(void)
+{
+ printk(KERN_INFO "Generic memory pool allocator v1.0\n");
+ return 0;
+}
+
+EXPORT_SYMBOL(alloc_gen_pool);
+EXPORT_SYMBOL(gen_pool_alloc);
+EXPORT_SYMBOL(gen_pool_free);
* RE: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (30 preceding siblings ...)
2005-02-16 14:17 ` Jes Sorensen
@ 2005-02-16 17:50 ` Luck, Tony
2005-02-16 18:18 ` Luck, Tony
` (2 subsequent siblings)
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2005-02-16 17:50 UTC (permalink / raw)
To: linux-ia64
>Here's another version of the mspec driver with the generic
>allocator. I have tried to implement all the features we discussed on
>the list earlier, including marking pages as uncached using PG_arch_1
>and making /dev/mem honor it.
Did you see David's e-mail? The one where he said:
>Note that PG_arch_1 is already being used as a "i-cache coherent" flag.
Won't bad things happen if someone tried to read a page from /dev/mem
that was marked PG_arch_1 by swiotlb code?
I'm also not thrilled about mem.c being cloned into arch/ia64.
-Tony
* RE: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (31 preceding siblings ...)
2005-02-16 17:50 ` Luck, Tony
@ 2005-02-16 18:18 ` Luck, Tony
2005-02-16 19:10 ` Jes Sorensen
2005-02-16 21:02 ` David Mosberger
34 siblings, 0 replies; 36+ messages in thread
From: Luck, Tony @ 2005-02-16 18:18 UTC (permalink / raw)
To: linux-ia64
>I'm also not thrilled about mem.c being cloned into arch/ia64.
Looking more closely at what you did, things aren't as bad as I first
thought. Strike that comment from the record.
But the PG_arch_1 dual usage issue remains.
-Tony
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (32 preceding siblings ...)
2005-02-16 18:18 ` Luck, Tony
@ 2005-02-16 19:10 ` Jes Sorensen
2005-02-16 21:02 ` David Mosberger
34 siblings, 0 replies; 36+ messages in thread
From: Jes Sorensen @ 2005-02-16 19:10 UTC (permalink / raw)
To: linux-ia64
>>>>> "Tony" == Luck, Tony <tony.luck@intel.com> writes:
>> I'm also not thrilled about mem.c being cloned into arch/ia64.
Tony> Looking more closely at what you did, things aren't as bad as I
Tony> first thought. Strike that comment from the record.
Tony,
No problem. I really didn't want to do it, but it seems that the -mm
series already introduced this for Xen. It was that or 3 zillion
#ifdefs ;-(
Tony> But the PG_arch_1 dual usage issue remains.
Hmmm for some reason I had the impression PG_arch_1 was only used on
PPC/PPC64 for icache coherency status, but it looks like you're
right. Darn ;-(
Can't we just drop swiotlb and tell people to buy real hardware? ;-)))
Ok, I'll invent a new flag!
Thanks for the input!
Cheers,
Jes
* Re: [rfc] generic allocator and mspec driver
2005-02-02 19:10 [rfc] generic allocator and mspec driver Jes Sorensen
` (33 preceding siblings ...)
2005-02-16 19:10 ` Jes Sorensen
@ 2005-02-16 21:02 ` David Mosberger
34 siblings, 0 replies; 36+ messages in thread
From: David Mosberger @ 2005-02-16 21:02 UTC (permalink / raw)
To: linux-ia64
>>>>> On 16 Feb 2005 14:10:56 -0500, Jes Sorensen <jes@wildopensource.com> said:
Jes> Hmmm for some reason I had the impression PG_arch_1 was only
Jes> used on PPC/PPC64 for icache current status, but it looks like
Jes> you're right. Darn ;-(
Jes> Can't we just drop swiotlb and tell people to buy real
Jes> hardware? ;-)))
The use of PG_arch_1 has nothing to do with swiotlb. It lets us
optimize away expensive cache-flushing by taking advantage of the fact
that DMA transactions automatically establish coherency.
--david