public inbox for linux-ia64@vger.kernel.org
* Re: efi_memmapwalk re-write
@ 2005-08-03 22:45 Luck, Tony
  2005-08-03 23:00 ` Luck, Tony
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Luck, Tony @ 2005-08-03 22:45 UTC (permalink / raw)
  To: linux-ia64


>one on x86. This patch is relative to 2.6.13-rc3 and applies on top of 
>the EFI memory map walk rewrite patch at 
><http://free.linux.hp.com/~khalid/ia64/efi_memmapwalk_2.6.13rc3.patch>
>(Tony, this EFI memory map walk patch is same as the one I sent you this
>morning).

Khalid,

Thanks for working on this.  I'm sorry it has taken this long to look
at it.

I think some areas can still benefit from more simplification.  The only
place I see you split a kern_memdesc_t is in efi_trim_memory() in order
to limit memory according to either of mem_limit or max_addr.  Wouldn't
it be simpler to just adjust the num_pages field of the descriptor instead
of splitting?

If you do that ... then you don't need to have a linked list of kern_memdesc
structures, you can treat them just like an array, nor do you need
MEM_DESC_SAFETY_MARGIN.
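
A minimal sketch of the in-place adjustment suggested above, assuming a summary descriptor that holds only a start address and a page count. The names `sketch_memdesc`, `sketch_trim` and `SKETCH_PAGE_SHIFT` are hypothetical and not taken from the patch, and the 4KB page size is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB EFI pages */

/* Hypothetical summary descriptor: a start address and a page count. */
struct sketch_memdesc {
	uint64_t start;
	uint64_t npages;
};

/*
 * Clip one descriptor in place so that it ends at or below max_addr and
 * so that the running total never exceeds mem_limit.  No second
 * descriptor is created for the part that is cut off.
 * Returns the number of bytes kept.
 */
static uint64_t
sketch_trim(struct sketch_memdesc *d, uint64_t max_addr,
	    uint64_t mem_limit, uint64_t total_so_far)
{
	uint64_t end;

	assert(total_so_far <= mem_limit);

	end = d->start + (d->npages << SKETCH_PAGE_SHIFT);
	if (end > max_addr)
		end = max_addr;			/* max_addr= style limit */
	if (end <= d->start) {
		d->npages = 0;			/* nothing left below the limit */
		return 0;
	}
	if (end - d->start > mem_limit - total_so_far)
		end = d->start + (mem_limit - total_so_far);	/* mem= style limit */

	d->npages = (end - d->start) >> SKETCH_PAGE_SHIFT;
	return d->npages << SKETCH_PAGE_SHIFT;
}
```

Because only the page count shrinks, the descriptors can live in a plain array and no spare entries are needed for the split-off remainder.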

Likewise the granule alignment functions.  The original trim_top() and
trim_bottom() are insanely complex ... and perhaps you were led astray
trying to duplicate their behaviour?  I believe that you should end up
with the desired behaviour if you just do any coalescing of memory blocks
that are WB and have one of the allowable types, then round the base
addresses up to granule boundaries and the tops down.  All that scanning
around looking for holes or non-WB sections of memory looks pointless to
me ... perhaps I'm missing some incredible subtlety in the original?
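
The rounding step described above can be sketched roughly as follows. This is illustrative only: `SKETCH_GRANULE_SIZE` (16MB here) and the helper names are assumptions that stand in for the kernel's own GRANULEROUNDUP() and GRANULEROUNDDOWN() macros.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GRANULE_SIZE (1ULL << 24)	/* assumption: 16MB granule */

static uint64_t
sketch_round_up(uint64_t addr)
{
	/* note: wraps for addresses within one granule of 2^64 */
	return (addr + SKETCH_GRANULE_SIZE - 1) & ~(SKETCH_GRANULE_SIZE - 1);
}

static uint64_t
sketch_round_down(uint64_t addr)
{
	return addr & ~(SKETCH_GRANULE_SIZE - 1);
}

/*
 * Shrink the coalesced range [*lo, *hi) inward to granule boundaries.
 * Returns 1 if any memory is left, 0 if the range collapsed.
 */
static int
sketch_granule_align(uint64_t *lo, uint64_t *hi)
{
	assert(*lo <= *hi);

	*lo = sketch_round_up(*lo);
	*hi = sketch_round_down(*hi);
	return *lo < *hi;
}
```

A range that does not cover at least one whole granule collapses to nothing, which is the case the old trim functions reported as a "granule hole".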

By only copying from "is_available()" types into kern_memdesc_t structures
you can avoid calling is_available() in your new efi_memmap_walk(), and
indeed drop the "type" field from the structure.

find_memmap_space() should be in efi.c ... just pass a pointer to space
for it to fill in the address of the block it allocated and have it
return the size so that reserve_memory() can fill these into an entry
in rsvd_region[].  It shouldn't use is_available_memory() to decide
which blocks are candidates for the allocation.  This could result in
choosing memory that is still in use by EFI, command line arguments, or
even overwriting the kernel.  EFI_CONVENTIONAL_MEMORY is definitely
safe for this, perhaps EFI_LOADER_CODE too?
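
One possible shape for that interface, as a sketch under stated assumptions: the descriptor layout is simplified, the type constants are local stand-ins for the EFI memory types, and `sketch_find_space` is a hypothetical name. It fills in the address through a pointer, returns the size, and considers only conventional memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for two EFI memory types (illustrative only). */
enum { SKETCH_LOADER_DATA = 2, SKETCH_CONVENTIONAL = 7 };

/* Hypothetical simplified descriptor: type, base address, size in bytes. */
struct sketch_efi_desc {
	uint32_t type;
	uint64_t phys_addr;
	uint64_t size;
};

/*
 * Find the first conventional-memory block of at least `needed` bytes.
 * On success, store its base address in *addr and return `needed`;
 * return 0 if no block fits.  Loader and boot-services memory is never
 * chosen because it may still hold the kernel or the command line.
 */
static uint64_t
sketch_find_space(const struct sketch_efi_desc *map, size_t n,
		  uint64_t needed, uint64_t *addr)
{
	size_t i;

	assert(addr != NULL);

	for (i = 0; i < n; i++) {
		if (map[i].type != SKETCH_CONVENTIONAL)
			continue;
		if (map[i].size >= needed) {
			*addr = map[i].phys_addr;
			return needed;
		}
	}
	return 0;
}
```

The caller would then fill start and end of its reserved-region entry from the address and the returned size, and treat a zero return as a fatal error.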

It is not OK to call "machine_restart()" if you couldn't allocate space.
This would put the machine into a loop ... rebooting and failing, with
no opportunity to read the error message.  Just use panic().

-Tony



* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
@ 2005-08-03 23:00 ` Luck, Tony
  2005-08-04 18:16 ` Khalid Aziz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2005-08-03 23:00 UTC (permalink / raw)
  To: linux-ia64


>addresses up to granule boundaries and the tops down.  All that scanning
>around looking for holes or non-WB sections of memory looks pointless to
>me ... perhaps I'm missing some incredible subtlety in the original?

Duh.

Ok ... it is a bit more complex than I make it appear because there
are regions of WB memory that aren't usable by Linux ... they shouldn't
stop Linux from using granules that contain such blocks.
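
To make that subtlety concrete, here is a small sketch with a simplified descriptor and hypothetical names, assuming a 16MB granule. The granule rounding is applied to the whole contiguous WB run, whatever the types inside it, and only then are the usable descriptors clipped against the rounded run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GRANULE (1ULL << 24)	/* assumption: 16MB granule */

/* Hypothetical simplified descriptor: byte range [start, end) plus flags. */
struct sketch_desc {
	uint64_t start, end;
	int wb;		/* write-back cacheable? */
	int usable;	/* a type Linux may use? */
};

static uint64_t sketch_up(uint64_t a)
{
	return (a + SKETCH_GRANULE - 1) & ~(SKETCH_GRANULE - 1);
}

static uint64_t sketch_down(uint64_t a)
{
	return a & ~(SKETCH_GRANULE - 1);
}

/* Total bytes Linux could use from an address-ordered map. */
static uint64_t
sketch_usable_bytes(const struct sketch_desc *map, size_t n)
{
	uint64_t lo = 0, hi = 0, total = 0;
	size_t i, j;

	assert(map != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!map[i].wb)
			continue;
		/* first descriptor of a new contiguous WB run? */
		if (i == 0 || !map[i - 1].wb || map[i - 1].end != map[i].start) {
			hi = map[i].end;
			for (j = i + 1; j < n && map[j].wb && map[j].start == hi; j++)
				hi = map[j].end;
			lo = sketch_up(map[i].start);
			hi = sketch_down(hi);
		}
		if (map[i].usable) {
			uint64_t as = map[i].start > lo ? map[i].start : lo;
			uint64_t ae = map[i].end < hi ? map[i].end : hi;

			if (ae > as)
				total += ae - as;
		}
	}
	return total;
}
```

With a WB but unusable block in the middle of a run, the usable blocks on both sides survive whole; replace that block with a non-WB hole and the partial granules next to it are dropped.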

-Tony


* Re: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
  2005-08-03 23:00 ` Luck, Tony
@ 2005-08-04 18:16 ` Khalid Aziz
  2005-08-04 22:41 ` Luck, Tony
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Khalid Aziz @ 2005-08-04 18:16 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2005-08-03 at 15:45 -0700, Luck, Tony wrote:
> >one on x86. This patch is relative to 2.6.13-rc3 and applies on top of 
> >the EFI memory map walk rewrite patch at 
> ><http://free.linux.hp.com/~khalid/ia64/efi_memmapwalk_2.6.13rc3.patch>
> >(Tony, this EFI memory map walk patch is same as the one I sent you this
> >morning).
> 
> Khalid,
> 
> Thanks for working on this.  I'm sorry it has taken this long to look
> at it.
> 
> I think some areas can still benefit from more simplification.  The only
> place I see you split a kern_memdesc_t is in efi_trim_memory() in order
> to limit memory according to either of mem_limit or max_addr.  Wouldn't
> it be simpler to just adjust the num_pages field of the descriptor instead
> of splitting?
> 
> If you do that ... then you don't need to have a linked list of kern_memdesc
> structures, you can treat them just like an array, nor do you need
> MEM_DESC_SAFETY_MARGIN.
> 
> Likewise the granule alignment functions.  The original trim_top() and
> trim_bottom() are insanely complex ... and perhaps you were led astray
> trying to duplicate their behaviour?  I believe that you should end up
> with the desired behaviour if you just do any coalescing of memory blocks
> that are WB and have one of the allowable types, then round the base
> addresses up to granule boundaries and the tops down.  All that scanning
> around looking for holes or non-WB sections of memory looks pointless to
> me ... perhaps I'm missing some incredible subtlety in the original?
> 
> By only copying from "is_available()" types into kern_memdesc_t structures
> you can avoid calling is_available() in your new efi_memmap_walk(), and
> indeed drop the "type" field from the structure.
> 
> find_memmap_space() should be in efi.c ... just pass a pointer to space
> for it to fill in the address of the block it allocated and have it
> return the size so that reserve_memory() can fill these into an entry
> in rsvd_region[].  It shouldn't use is_available_memory() to decide
> which blocks are candidates for the allocation.  This could result in
> choosing memory that is still in use by EFI, command line arguments, or
> even overwriting the kernel.  EFI_CONVENTIONAL_MEMORY is definitely
> safe for this, perhaps EFI_LOADER_CODE too?
> 
> It is not OK to call "machine_restart()" if you couldn't allocate space.
> This would put the machine into a loop ... rebooting and failing, with
> no opportunity to read the error message.  Just use panic().
> 
> -Tony
> 

Tony,

Thanks for the feedback. I will take a look at these and see if I can
simplify it further.

-- 
Khalid

==================================
Khalid Aziz                       Open Source and Linux Organization
(970)898-9214                                        Hewlett-Packard
khalid.aziz@hp.com                                  Fort Collins, CO

"The Linux kernel is subject to relentless development" 
                                - Alessandro Rubini



* Re: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
  2005-08-03 23:00 ` Luck, Tony
  2005-08-04 18:16 ` Khalid Aziz
@ 2005-08-04 22:41 ` Luck, Tony
  2005-08-05 22:46 ` Khalid Aziz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2005-08-04 22:41 UTC (permalink / raw)
  To: linux-ia64

> Thanks for the feedback. I will take a look at these and see if I can
> simplify it further.

Here's part of the solution ... a combination of your gather and trim
functions.  I bypassed the problem of allocating the space for the
kern_memdesc structures for this example ... so you'll have to
integrate this with your code to allocate space.

I think it is a significant improvement in readability, but I'm biased.

Oh, it doesn't print out all the:

efi.trim_top: ignoring 4KB of memory at 0x0 due to granule hole at 0x0
and
efi.trim_bottom: ignoring 8KB of memory at 0x1feffe000 due to granule hole at 0x1fe000000

messages.  If anyone thinks they are useful, then they could be added, but
they seem like a lot of noise to me.

This looks like it runs ok for the memory map on my tiger, but I didn't try it on
anything else.

-Tony

---- cut here and drop into efi.c ----


struct kern_memdesc {
	u64	start;
	u64	npages;
} kern_memdesc[100];

static inline u64
efi_end(efi_memory_desc_t *e)
{
	return e->phys_addr + (e->num_pages<<EFI_PAGE_SHIFT);
}

static inline int
efi_wb(efi_memory_desc_t *e)
{
	return e->attribute & EFI_MEMORY_WB;
}

static inline u64
kern_end(struct kern_memdesc *k)
{
	return k->start + (k->npages<<EFI_PAGE_SHIFT);
}

void
efi_gather(void)
{
	struct kern_memdesc *k = kern_memdesc, *prev = NULL;
	u64	contig_low=0, contig_high=0;
	u64	as, ae;
	void *efi_map_start, *efi_map_end, *p, *q;
	efi_memory_desc_t *md, *pmd = NULL, *check_md;
	u64	efi_desc_size;
	unsigned long total_mem = 0;

	efi_map_start = __va(ia64_boot_param->efi_memmap);
	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
	efi_desc_size = ia64_boot_param->efi_memdesc_size;

	for (p = efi_map_start; p < efi_map_end; pmd = md, p += efi_desc_size) {
		md = p;
		if (!efi_wb(md))
			continue;
		if (pmd == NULL || !efi_wb(pmd) || efi_end(pmd) != md->phys_addr) {
			contig_low = GRANULEROUNDUP(md->phys_addr);
			contig_high = efi_end(md);
			for (q = p + efi_desc_size; q < efi_map_end; q += efi_desc_size) {
				check_md = q;
				if (!efi_wb(check_md))
					break;
				if (contig_high != check_md->phys_addr)
					break;
				contig_high = efi_end(check_md);
			}
			contig_high = GRANULEROUNDDOWN(contig_high);
		}
		if (!is_available_memory(md))
			continue;

		/* round ends inward to granule boundaries */
		as = max(contig_low, md->phys_addr);
		ae = min(contig_high, efi_end(md));

		/* keep within max_addr= command line arg */
		ae = min(ae, max_addr);
		if (ae <= as)
			continue;

		/* avoid going over mem= command line arg */
		if (total_mem + (ae - as) > mem_limit)
			ae -= total_mem + (ae - as) - mem_limit;

		if (ae <= as)
			continue;
		if (prev && kern_end(prev) == md->phys_addr) {
			prev->npages += (ae - as) >> EFI_PAGE_SHIFT;
			total_mem += ae - as;
			continue;
		}
		k->start = as;
		k->npages = (ae - as) >> EFI_PAGE_SHIFT;
		total_mem += ae - as;
		prev = k++;
	}
	k->start = ~0L; /* end-marker */
}


* Re: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (2 preceding siblings ...)
  2005-08-04 22:41 ` Luck, Tony
@ 2005-08-05 22:46 ` Khalid Aziz
  2005-08-08 18:59 ` Luck, Tony
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Khalid Aziz @ 2005-08-05 22:46 UTC (permalink / raw)
  To: linux-ia64

On Wed, 2005-08-03 at 15:45 -0700, Luck, Tony wrote:
> >one on x86. This patch is relative to 2.6.13-rc3 and applies on top of 
> >the EFI memory map walk rewrite patch at 
> ><http://free.linux.hp.com/~khalid/ia64/efi_memmapwalk_2.6.13rc3.patch>
> >(Tony, this EFI memory map walk patch is same as the one I sent you this
> >morning).
> 
> Khalid,
> 
> Thanks for working on this.  I'm sorry it has taken this long to look
> at it.
> 
> I think some areas can still benefit from more simplification.  The only
> place I see you split a kern_memdesc_t is in efi_trim_memory() in order
> to limit memory according to either of mem_limit or max_addr.  Wouldn't
> it be simpler to just adjust the num_pages field of the descriptor instead
> of splitting?
> 
> If you do that ... then you don't need to have a linked list of kern_memdesc
> structures, you can treat them just like an array, nor do you need
> MEM_DESC_SAFETY_MARGIN.

Tony,

I was a little reluctant to throw away information about the physical
memory present on the system when I originally wrote this code. That
information could be useful if, for instance, we choose to implement
adding that memory back to the system without a reboot, using the
memory hotplug infrastructure. The cost of retaining this
information looked reasonable enough to me.

> 
> Likewise the granule alignment functions.  The original trim_top() and
> trim_bottom() are insanely complex ... and perhaps you were led astray
> trying to duplicate their behaviour?  I believe that you should end up
> with the desired behaviour if you just do any coalescing of memory blocks
> that are WB and have one of the allowable types, then round the base
> addresses up to granule boundaries and the tops down.  All that scanning
> around looking for holes or non-WB sections of memory looks pointless to
> me ... perhaps I'm missing some incredible subtlety in the original?

I was mostly afraid to touch that code :) It has been debugged
over the years and I did not want to lose those fixes by rewriting it. I
will take another look at it.

-- 
Khalid

==================================
Khalid Aziz                       Open Source and Linux Organization
(970)898-9214                                        Hewlett-Packard
khalid.aziz@hp.com                                  Fort Collins, CO

"The Linux kernel is subject to relentless development" 
                                - Alessandro Rubini



* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (3 preceding siblings ...)
  2005-08-05 22:46 ` Khalid Aziz
@ 2005-08-08 18:59 ` Luck, Tony
  2005-08-12 23:05 ` Khalid Aziz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2005-08-08 18:59 UTC (permalink / raw)
  To: linux-ia64


>I was a little reluctant to throw away information about the physical
>memory present on the system when I originally wrote this code. That
>information could be useful if, for instance, we choose to implement
>adding that memory back to the system without a reboot, using the
>memory hotplug infrastructure. The cost of retaining this
>information looked reasonable enough to me.

We already throw away information ... the existing code does it
irreversibly when it makes adjustments to the phys_addr and num_pages
fields in efi_memory_desc_t structures.  Your new code avoids doing
that, but since you coalesce sections that are usable by Linux that
are contiguous, you drop any type information that is different
between sections.

I think this is fine ... anyone that really needs to know will be
able to go look at the original efi_memory_desc_t structures.  Linux
currently doesn't need to know, so we might as well optimize these
new summary structures for use by Linux.

-Tony


* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (4 preceding siblings ...)
  2005-08-08 18:59 ` Luck, Tony
@ 2005-08-12 23:05 ` Khalid Aziz
  2005-08-12 23:48 ` Luck, Tony
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Khalid Aziz @ 2005-08-12 23:05 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]

Tony,

Here is the updated patch. It incorporates your suggestions. I have left
kern_memdesc as a linked list rather than an array. A linked list is a
slightly more versatile structure and costs little more than an array.
Unless you feel strongly about using an array instead of a linked list,
I would prefer to leave it this way.

-- 
Khalid

====================================================================
Khalid Aziz                       Open Source and Linux Organization
(970)898-9214                                        Hewlett-Packard
khalid.aziz@hp.com                                  Fort Collins, CO

"The Linux kernel is subject to relentless development" 
                                - Alessandro Rubini

[-- Attachment #2: efi_memmapwalk_2.6.13rc3.patch --]
[-- Type: text/x-patch, Size: 17761 bytes --]

diff -urNp linux-2.6.13-rc3/arch/ia64/kernel/efi.c linux-2.6.13-rc3-efimemmap/arch/ia64/kernel/efi.c
--- linux-2.6.13-rc3/arch/ia64/kernel/efi.c	2005-07-28 13:37:40.000000000 -0600
+++ linux-2.6.13-rc3-efimemmap/arch/ia64/kernel/efi.c	2005-08-12 16:56:48.000000000 -0600
@@ -17,6 +17,10 @@
  *
  * Goutham Rao: <goutham.rao@intel.com>
  *	Skip non-WB memory and ignore empty memory ranges.
+ *
+ * Rewrote efi_memmap_walk() to create a linked list of available 
+ * memory regions instead of editing EFI memory map in place 
+ * 				- Khalid Aziz <khalid.aziz@hp.com>
  */
 #include <linux/config.h>
 #include <linux/module.h>
@@ -35,12 +39,17 @@
 
 #define EFI_DEBUG	0
 
+#define efi_md_size(md)	(md->num_pages << EFI_PAGE_SHIFT)
+
 extern efi_status_t efi_call_phys (void *, ...);
 
 struct efi efi;
 EXPORT_SYMBOL(efi);
 static efi_runtime_services_t *runtime;
 static unsigned long mem_limit = ~0UL, max_addr = ~0UL;
+static kern_memdesc_t *kern_memmap = NULL;
+static unsigned long efi_total_mem = 0UL;
+kern_memdesc_t *memdesc_area, *memdesc_end;
 
 #define efi_call_virt(f, args...)	(*(f))(args)
 
@@ -222,190 +231,232 @@ efi_gettimeofday (struct timespec *ts)
 	ts->tv_nsec = tm.nanosecond;
 }
 
-static int
-is_available_memory (efi_memory_desc_t *md)
+#define is_usable_memory(md)	((md->type == EFI_LOADER_CODE)? 1: \
+				 ((md->type == EFI_BOOT_SERVICES_CODE)? 1: \
+				  ((md->type == EFI_BOOT_SERVICES_DATA)? 1: \
+				   ((md->type == EFI_CONVENTIONAL_MEMORY)? 1:0))))
+
+static inline int
+efi_wb(efi_memory_desc_t *md)
 {
-	if (!(md->attribute & EFI_MEMORY_WB))
-		return 0;
+	return (md->attribute & EFI_MEMORY_WB);
+}
 
-	switch (md->type) {
-	      case EFI_LOADER_CODE:
-	      case EFI_LOADER_DATA:
-	      case EFI_BOOT_SERVICES_CODE:
-	      case EFI_BOOT_SERVICES_DATA:
-	      case EFI_CONVENTIONAL_MEMORY:
-		return 1;
-	}
-	return 0;
+static inline u64
+kern_end(kern_memdesc_t *kmd)
+{
+	return (kmd->start + (kmd->num_pages << EFI_PAGE_SHIFT));
 }
 
-/*
- * Trim descriptor MD so its starts at address START_ADDR.  If the descriptor covers
- * memory that is normally available to the kernel, issue a warning that some memory
- * is being ignored.
- */
-static void
-trim_bottom (efi_memory_desc_t *md, u64 start_addr)
+int
+find_memmap_space (struct rsvd_region *rsvd_rgn)
 {
-	u64 num_skipped_pages;
+	void *efi_map_start, *efi_map_end, *p, *q;
+	u64 efi_desc_size, space_needed;
+	u64 smallest_block = UINT_MAX;
+	u64 small_block_addr = -1UL;
+	u64 block_size;
+	efi_memory_desc_t *md, *check_md;
 
-	if (md->phys_addr >= start_addr || !md->num_pages)
-		return;
+	/*
+	 * Look for the smallest contiguous block of usable WB memory
+	 * that is big enough to hold the kernel memory descriptors. Make sure
+	 * this block is at least granule sized so it does not get trimmed
+	 */
+	efi_map_start = __va(ia64_boot_param->efi_memmap);
+	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
+	efi_desc_size = ia64_boot_param->efi_memdesc_size;
 
-	num_skipped_pages = (start_addr - md->phys_addr) >> EFI_PAGE_SHIFT;
-	if (num_skipped_pages > md->num_pages)
-		num_skipped_pages = md->num_pages;
-
-	if (is_available_memory(md))
-		printk(KERN_NOTICE "efi.%s: ignoring %luKB of memory at 0x%lx due to granule hole "
-		       "at 0x%lx\n", __FUNCTION__,
-		       (num_skipped_pages << EFI_PAGE_SHIFT) >> 10,
-		       md->phys_addr, start_addr - IA64_GRANULE_SIZE);
 	/*
-	 * NOTE: Don't set md->phys_addr to START_ADDR because that could cause the memory
-	 * descriptor list to become unsorted.  In such a case, md->num_pages will be
-	 * zero, so the Right Thing will happen.
+	 * We will allocate enough memory to hold as many nodes as 
+	 * there are in EFI memory map and a null node. 
 	 */
-	md->phys_addr += num_skipped_pages << EFI_PAGE_SHIFT;
-	md->num_pages -= num_skipped_pages;
+	space_needed = sizeof(kern_memdesc_t)*((ia64_boot_param->efi_memmap_size/efi_desc_size) + 1);
+
+	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+		md = p;
+
+		/* skip over non-WB and non-available memory descriptors */
+		if ((!efi_wb(md)) || (!is_usable_memory(md)))
+			continue;
+		block_size = efi_md_size(md);
+
+		/* Look for any contiguous blocks of memory */
+		for (q = p+efi_desc_size; q < efi_map_end; q += efi_desc_size) {
+			check_md = q;
+
+			if (efi_wb(check_md) &&
+			    (check_md->phys_addr == md->phys_addr+block_size) &&
+			    is_usable_memory(check_md)) {
+				block_size += efi_md_size(check_md);
+				p += efi_desc_size;
+			}
+			else
+				break;
+		}
+
+		if ((block_size < smallest_block) && 
+			(block_size >= space_needed) &&
+			 (block_size >= IA64_GRANULE_SIZE)) {
+			smallest_block = block_size;
+			small_block_addr = md->phys_addr;
+		}
+
+	}
+
+	/* 
+	 * We will allocate a chunk of memory from the smallest block
+	 * of memory we found.
+	 */
+	rsvd_rgn->start = small_block_addr;
+	rsvd_rgn->end = small_block_addr + space_needed;
+	memdesc_area = __va(small_block_addr);
+	memdesc_end = (kern_memdesc_t *)((char *)memdesc_area + space_needed);
+	return 0;
+}
+
+/* 
+ * Allocate a node for kernel memory descriptor. These allocations are never
+ * freed.
+ */
+static inline kern_memdesc_t  *
+memdesc_alloc (void)
+{
+	if (memdesc_area >= memdesc_end)
+		return NULL;	/* callers test for NULL on failure */
+	return((kern_memdesc_t  *)memdesc_area++);
 }
 
-static void
-trim_top (efi_memory_desc_t *md, u64 end_addr)
+/*
+ * Walks the EFI memory map and calls CALLBACK once for each EFI 
+ * memory descriptor that has memory that is available for OS use.
+ */
+void
+efi_memmap_walk (efi_freemem_callback_t callback, void *arg)
 {
-	u64 num_dropped_pages, md_end_addr;
+	kern_memdesc_t *memnode;
+	u64 start, end;
 
-	md_end_addr = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
+	memnode = kern_memmap;
 
-	if (md_end_addr <= end_addr || !md->num_pages)
-		return;
+	while (memnode != NULL) {
+		start = PAGE_OFFSET + memnode->start;
+		end = (start + efi_md_size(memnode)) & PAGE_MASK;
+
+		if ((*callback)(start, end, arg) < 0)
+			return;
+		memnode = memnode->next;
+	}
+}
 
-	num_dropped_pages = (md_end_addr - end_addr) >> EFI_PAGE_SHIFT;
-	if (num_dropped_pages > md->num_pages)
-		num_dropped_pages = md->num_pages;
-
-	if (is_available_memory(md))
-		printk(KERN_NOTICE "efi.%s: ignoring %luKB of memory at 0x%lx due to granule hole "
-		       "at 0x%lx\n", __FUNCTION__,
-		       (num_dropped_pages << EFI_PAGE_SHIFT) >> 10,
-		       md->phys_addr, end_addr);
-	md->num_pages -= num_dropped_pages;
+static inline u64
+efi_end(efi_memory_desc_t *md)
+{
+	return (md->phys_addr + efi_md_size(md));
 }
 
 /*
- * Walks the EFI memory map and calls CALLBACK once for each EFI memory descriptor that
- * has memory that is available for OS use.
+ * Walk the EFI memory map and gather all memory available for kernel 
+ * to use. 
  */
 void
-efi_memmap_walk (efi_freemem_callback_t callback, void *arg)
+efi_gather_memory (void)
 {
-	int prev_valid = 0;
-	struct range {
-		u64 start;
-		u64 end;
-	} prev, curr;
 	void *efi_map_start, *efi_map_end, *p, *q;
-	efi_memory_desc_t *md, *check_md;
-	u64 efi_desc_size, start, end, granule_addr, last_granule_addr, first_non_wb_addr = 0;
-	unsigned long total_mem = 0;
+	efi_memory_desc_t *md, *check_md, *pmd = NULL;
+	u64 efi_desc_size;
+	u64 contig_low=0, contig_high=0, range_end;
+	int no_allocate = 0;
+	kern_memdesc_t *newnode, *prevnode = NULL;
 
 	efi_map_start = __va(ia64_boot_param->efi_memmap);
 	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
 	efi_desc_size = ia64_boot_param->efi_memdesc_size;
 
-	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+	for (p = efi_map_start; p < efi_map_end; pmd=md, p += efi_desc_size) {
 		md = p;
 
-		/* skip over non-WB memory descriptors; that's all we're interested in... */
-		if (!(md->attribute & EFI_MEMORY_WB))
+		if (!efi_wb(md) || !is_available_memory(md))
 			continue;
 
+		if (!no_allocate && (newnode = memdesc_alloc()) == NULL) {
+			printk(KERN_ERR "ERROR: Failed to allocate node for kernel memory descriptor\n");
+			printk(KERN_ERR "       Continuing with limited memory\n");
+			break;
+		}
+		no_allocate = 0;
+		newnode->start = md->phys_addr;
+		newnode->num_pages = md->num_pages;
+		newnode->next = newnode->prev = NULL;
+		if (kern_memmap == NULL)
+			kern_memmap = newnode;
+
 		/*
-		 * granule_addr is the base of md's first granule.
-		 * [granule_addr - first_non_wb_addr) is guaranteed to
-		 * be contiguous WB memory.
+		 * Granule align and coalesce contiguous ranges 
 		 */
-		granule_addr = GRANULEROUNDDOWN(md->phys_addr);
-		first_non_wb_addr = max(first_non_wb_addr, granule_addr);
-
-		if (first_non_wb_addr < md->phys_addr) {
-			trim_bottom(md, granule_addr + IA64_GRANULE_SIZE);
-			granule_addr = GRANULEROUNDDOWN(md->phys_addr);
-			first_non_wb_addr = max(first_non_wb_addr, granule_addr);
+		if (pmd == NULL || !efi_wb(pmd) || efi_end(pmd) != md->phys_addr) {
+			contig_low = GRANULEROUNDUP(newnode->start);
+			contig_high = efi_end(md);
+			for (q = p+efi_desc_size; q < efi_map_end; q += efi_desc_size) {
+				check_md = q;
+
+				if (!efi_wb(check_md) || 
+					(check_md->phys_addr != contig_high)) {
+					break;
+				}
+				contig_high = efi_end(check_md);
+			}
+			contig_high = GRANULEROUNDDOWN(contig_high);
 		}
+		if (!is_available_memory(md))
+			continue;
 
-		for (q = p; q < efi_map_end; q += efi_desc_size) {
-			check_md = q;
+		newnode->start = max(contig_low, md->phys_addr);
+		range_end = min(contig_high, efi_end(md));
 
-			if ((check_md->attribute & EFI_MEMORY_WB) &&
-			    (check_md->phys_addr == first_non_wb_addr))
-				first_non_wb_addr += check_md->num_pages << EFI_PAGE_SHIFT;
-			else
-				break;		/* non-WB or hole */
+		/* Apply max_addr= limit */
+		range_end = min(range_end, max_addr);
+		if (range_end <= newnode->start) {
+			no_allocate = 1;
+			continue;
 		}
 
-		last_granule_addr = GRANULEROUNDDOWN(first_non_wb_addr);
-		if (last_granule_addr < md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT))
-			trim_top(md, last_granule_addr);
-
-		if (is_available_memory(md)) {
-			if (md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) >= max_addr) {
-				if (md->phys_addr >= max_addr)
-					continue;
-				md->num_pages = (max_addr - md->phys_addr) >> EFI_PAGE_SHIFT;
-				first_non_wb_addr = max_addr;
-			}
-
-			if (total_mem >= mem_limit)
+		/* Enforce mem= limit */
+		if ((efi_total_mem + range_end - newnode->start) > mem_limit)
+			range_end -= (efi_total_mem + range_end - 
+					newnode->start) - mem_limit;
+
+		if (range_end <= newnode->start)
+			newnode->num_pages = 0;
+		else {
+			/* Can we merge this range with previous one */
+			if (prevnode && kern_end(prevnode) == md->phys_addr) {
+				prevnode->num_pages += (range_end - newnode->start) >> EFI_PAGE_SHIFT;
+				efi_total_mem += range_end - newnode->start;
+				no_allocate = 1;
 				continue;
-
-			if (total_mem + (md->num_pages << EFI_PAGE_SHIFT) > mem_limit) {
-				unsigned long limit_addr = md->phys_addr;
-
-				limit_addr += mem_limit - total_mem;
-				limit_addr = GRANULEROUNDDOWN(limit_addr);
-
-				if (md->phys_addr > limit_addr)
-					continue;
-
-				md->num_pages = (limit_addr - md->phys_addr) >>
-				                EFI_PAGE_SHIFT;
-				first_non_wb_addr = max_addr = md->phys_addr +
-				              (md->num_pages << EFI_PAGE_SHIFT);
 			}
-			total_mem += (md->num_pages << EFI_PAGE_SHIFT);
-
-			if (md->num_pages == 0)
-				continue;
+			else
+				newnode->num_pages = (range_end - newnode->start) >> EFI_PAGE_SHIFT;
+		}
+		/* 
+		 * Are we left with any pages after all the alignment? 
+		 * If not, we will simply reuse the node we just allocated
+		 * and not allocate a new one.
+		 */
+		if (!newnode->num_pages) {
+			no_allocate = 1;
+			continue;
+		} 
 
-			curr.start = PAGE_OFFSET + md->phys_addr;
-			curr.end   = curr.start + (md->num_pages << EFI_PAGE_SHIFT);
+		efi_total_mem += efi_md_size(newnode);
 
-			if (!prev_valid) {
-				prev = curr;
-				prev_valid = 1;
-			} else {
-				if (curr.start < prev.start)
-					printk(KERN_ERR "Oops: EFI memory table not ordered!\n");
-
-				if (prev.end == curr.start) {
-					/* merge two consecutive memory ranges */
-					prev.end = curr.end;
-				} else {
-					start = PAGE_ALIGN(prev.start);
-					end = prev.end & PAGE_MASK;
-					if ((end > start) && (*callback)(start, end, arg) < 0)
-						return;
-					prev = curr;
-				}
-			}
+		/* Link this node in the list */
+		if (prevnode != NULL) {
+			newnode->prev = prevnode;
+			prevnode->next = newnode;
 		}
-	}
-	if (prev_valid) {
-		start = PAGE_ALIGN(prev.start);
-		end = prev.end & PAGE_MASK;
-		if (end > start)
-			(*callback)(start, end, arg);
+		prevnode = newnode;
 	}
 }
 
@@ -644,7 +695,7 @@ efi_init (void)
 			md = p;
 			printk("mem%02u: type=%u, attr=0x%lx, range=[0x%016lx-0x%016lx) (%luMB)\n",
 			       i, md->type, md->attribute, md->phys_addr,
-			       md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT),
+			       md->phys_addr + efi_md_size(md),
 			       md->num_pages >> (20 - EFI_PAGE_SHIFT));
 		}
 	}
@@ -673,7 +724,7 @@ efi_enter_virtual_mode (void)
 			 * Some descriptors have multiple bits set, so the order of
 			 * the tests is relevant.
 			 */
-			if (md->attribute & EFI_MEMORY_WB) {
+			if (efi_wb(md)) {
 				md->virt_addr = (u64) __va(md->phys_addr);
 			} else if (md->attribute & EFI_MEMORY_UC) {
 				md->virt_addr = (u64) ioremap(md->phys_addr, 0);
@@ -765,7 +816,7 @@ efi_mem_type (unsigned long phys_addr)
 	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
 		md = p;
 
-		if (phys_addr - md->phys_addr < (md->num_pages << EFI_PAGE_SHIFT))
+		if (phys_addr - md->phys_addr < efi_md_size(md))
 			 return md->type;
 	}
 	return 0;
@@ -785,7 +836,7 @@ efi_mem_attributes (unsigned long phys_a
 	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
 		md = p;
 
-		if (phys_addr - md->phys_addr < (md->num_pages << EFI_PAGE_SHIFT))
+		if (phys_addr - md->phys_addr < efi_md_size(md))
 			return md->attribute;
 	}
 	return 0;
@@ -806,12 +857,12 @@ valid_phys_addr_range (unsigned long phy
 	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
 		md = p;
 
-		if (phys_addr - md->phys_addr < (md->num_pages << EFI_PAGE_SHIFT)) {
-			if (!(md->attribute & EFI_MEMORY_WB))
+		if (phys_addr - md->phys_addr < efi_md_size(md)) {
+			if (!efi_wb(md))
 				return 0;
 
-			if (*size > md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - phys_addr)
-				*size = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) - phys_addr;
+			if (*size > md->phys_addr + efi_md_size(md) - phys_addr)
+				*size = md->phys_addr + efi_md_size(md) - phys_addr;
 			return 1;
 		}
 	}
diff -urNp linux-2.6.13-rc3/arch/ia64/kernel/setup.c linux-2.6.13-rc3-efimemmap/arch/ia64/kernel/setup.c
--- linux-2.6.13-rc3/arch/ia64/kernel/setup.c	2005-07-28 13:37:40.000000000 -0600
+++ linux-2.6.13-rc3-efimemmap/arch/ia64/kernel/setup.c	2005-08-09 14:34:18.000000000 -0600
@@ -163,6 +164,8 @@ sort_regions (struct rsvd_region *rsvd_r
 	}
 }
 
+extern int find_memmap_space(struct rsvd_region *);
+
 /**
  * reserve_memory - setup reserved memory areas
  *
@@ -203,6 +206,11 @@ reserve_memory (void)
 	}
 #endif
 
+	if (find_memmap_space(&rsvd_region[n]) != 0) {
+		panic("Failed to find space to build kernel EFI memory map");
+	}
+	n++;
+
 	/* end of memory marker */
 	rsvd_region[n].start = ~0UL;
 	rsvd_region[n].end   = ~0UL;
diff -urNp linux-2.6.13-rc3/arch/ia64/mm/contig.c linux-2.6.13-rc3-efimemmap/arch/ia64/mm/contig.c
--- linux-2.6.13-rc3/arch/ia64/mm/contig.c	2005-06-17 13:48:29.000000000 -0600
+++ linux-2.6.13-rc3-efimemmap/arch/ia64/mm/contig.c	2005-08-12 16:36:36.000000000 -0600
@@ -148,6 +148,8 @@ find_memory (void)
 
 	reserve_memory();
 
+	efi_gather_memory();
+
 	/* first find highest page frame number */
 	max_pfn = 0;
 	efi_memmap_walk(find_max_pfn, &max_pfn);
diff -urNp linux-2.6.13-rc3/arch/ia64/mm/discontig.c linux-2.6.13-rc3-efimemmap/arch/ia64/mm/discontig.c
--- linux-2.6.13-rc3/arch/ia64/mm/discontig.c	2005-07-28 13:37:40.000000000 -0600
+++ linux-2.6.13-rc3-efimemmap/arch/ia64/mm/discontig.c	2005-08-12 16:36:40.000000000 -0600
@@ -433,6 +433,8 @@ void __init find_memory(void)
 
 	reserve_memory();
 
+	efi_gather_memory();
+
 	if (num_online_nodes() == 0) {
 		printk(KERN_ERR "node info missing!\n");
 		node_set_online(0);
diff -urNp linux-2.6.13-rc3/include/asm-ia64/meminit.h linux-2.6.13-rc3-efimemmap/include/asm-ia64/meminit.h
--- linux-2.6.13-rc3/include/asm-ia64/meminit.h	2005-06-17 13:48:29.000000000 -0600
+++ linux-2.6.13-rc3-efimemmap/include/asm-ia64/meminit.h	2005-08-12 16:36:05.000000000 -0600
@@ -16,10 +16,11 @@
  * 	- initrd (optional)
  * 	- command line string
  * 	- kernel code & data
+ * 	- Kernel memory map built from EFI memory map
  *
  * More could be added if necessary
  */
-#define IA64_MAX_RSVD_REGIONS 5
+#define IA64_MAX_RSVD_REGIONS 6
 
 struct rsvd_region {
 	unsigned long start;	/* virtual address of beginning of element */
@@ -29,6 +30,12 @@ struct rsvd_region {
 extern struct rsvd_region rsvd_region[IA64_MAX_RSVD_REGIONS + 1];
 extern int num_rsvd_regions;
 
+typedef struct kern_memdesc {
+	u64 start;
+	u64 num_pages;
+	struct kern_memdesc *next, *prev;
+} kern_memdesc_t;
+
 extern void find_memory (void);
 extern void reserve_memory (void);
 extern void find_initrd (void);
@@ -57,4 +64,10 @@ extern int filter_rsvd_memory (unsigned 
   extern int create_mem_map_page_table (u64 start, u64 end, void *arg);
 #endif
 
+#define is_available_memory(md) ((md->type == EFI_LOADER_CODE)? 1: \
+				  ((md->type == EFI_LOADER_DATA)? 1: \
+				   ((md->type == EFI_BOOT_SERVICES_CODE)? 1: \
+				    ((md->type == EFI_BOOT_SERVICES_DATA)? 1: \
+				     ((md->type == EFI_CONVENTIONAL_MEMORY)? 1:0)))))
+
 #endif /* meminit_h */

* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (5 preceding siblings ...)
  2005-08-12 23:05 ` Khalid Aziz
@ 2005-08-12 23:48 ` Luck, Tony
  2005-08-15 14:33 ` Khalid Aziz
  2005-09-03  6:25 ` tony.luck
  8 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2005-08-12 23:48 UTC (permalink / raw)
  To: linux-ia64


>Here is the updated patch. It incorporates your suggestions. I have left
>kern_memdesc as a linked list as opposed to an array. A linked list is a little
>more versatile structure and the cost over an array is minimal. Unless
>you feel strongly about using arrays instead of linked list, I would
>prefer to leave it this way.

Khalid,

Looking good (from my 2 minute scan, I'll take a better look
later).  One question ... why do we need "is_available_memory"
and "is_usable_memory"?  They look to be almost the same (except
is_usable... doesn't consider EFI_LOADER_DATA to be usable).

It doesn't look like is_available_memory needs to be in meminit.h
[in last version of this patch it was used in more than one file,
but now the only usage is in efi.c]

I'll ponder the linked list question ... if you really think the
extra versatility is worthwhile, then I'll consider them ...
but that will raise the issue of why aren't you using the
standard Linux kernel list.h macros.

-Tony

* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (6 preceding siblings ...)
  2005-08-12 23:48 ` Luck, Tony
@ 2005-08-15 14:33 ` Khalid Aziz
  2005-09-03  6:25 ` tony.luck
  8 siblings, 0 replies; 10+ messages in thread
From: Khalid Aziz @ 2005-08-15 14:33 UTC (permalink / raw)
  To: linux-ia64

On Fri, 2005-08-12 at 16:48 -0700, Luck, Tony wrote:
> >Here is the updated patch. It incorporates your suggestions. I have left
> >kern_memdesc as a linked list as opposed to an array. A linked list is a little
> >more versatile structure and the cost over an array is minimal. Unless
> >you feel strongly about using arrays instead of linked list, I would
> >prefer to leave it this way.
> 
> Khalid,
> 
> Looking good (from my 2 minute scan, I'll take a better look
> later).  One question ... why do we need "is_available_memory"
> and "is_usable_memory"?  They look to be almost the same (except
> is_usable... doesn't consider EFI_LOADER_DATA to be usable).

I am using is_usable_memory while scanning for space for kern_memdesc
structure. Since EFI_LOADER_DATA can contain boot parameters and/or
ramdisk, I don't want to allocate any space out of there.
is_available_memory treats EFI_LOADER_DATA memory as available
because that memory gets marked reserved anyway.
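
A minimal sketch of the two predicates as described above. The type
values follow the EFI memory type numbering; struct md_sketch is a
hypothetical stand-in for efi_memory_desc_t, and is_usable_memory() is
written from this description rather than copied from the patch, so
treat it as an assumption.

```c
#include <stdint.h>

/* EFI memory types (numbering from the EFI specification). */
enum {
	EFI_LOADER_CODE         = 1,
	EFI_LOADER_DATA         = 2,
	EFI_BOOT_SERVICES_CODE  = 3,
	EFI_BOOT_SERVICES_DATA  = 4,
	EFI_CONVENTIONAL_MEMORY = 7,
};

/* Hypothetical stand-in for efi_memory_desc_t; only the field used here. */
struct md_sketch {
	uint32_t type;
};

/* Memory the kernel may eventually own, including EFI_LOADER_DATA. */
static int is_available_memory(const struct md_sketch *md)
{
	switch (md->type) {
	case EFI_LOADER_CODE:
	case EFI_LOADER_DATA:
	case EFI_BOOT_SERVICES_CODE:
	case EFI_BOOT_SERVICES_DATA:
	case EFI_CONVENTIONAL_MEMORY:
		return 1;
	}
	return 0;
}

/*
 * Memory that is safe to allocate from this early: same set, minus
 * EFI_LOADER_DATA, which may still hold boot parameters or the ramdisk.
 */
static int is_usable_memory(const struct md_sketch *md)
{
	return is_available_memory(md) && md->type != EFI_LOADER_DATA;
}
```

Written this way, the second predicate is just the first with one
extra exclusion, which is also how find_memmap_space() could express
the check without a separate macro.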

> 
> It doesn't look like is_available_memory needs to be in meminit.h
> [in last version of this patch it was used in more than one file,
> but now the only usage is in efi.c]

You are right. It can be moved to efi.c.

> 
> I'll ponder the linked list question ... if you really think the
> extra versatility is worthwhile, then I'll consider them ...
> but that will raise the issue of why aren't you using the
> standard Linux kernel list.h macros.
> 
> -Tony

--
Khalid 

==================================
Khalid Aziz                       Open Source and Linux Organization
(970)898-9214                                        Hewlett-Packard
khalid.aziz@hp.com                                  Fort Collins, CO

"The Linux kernel is subject to relentless development" 
                                - Alessandro Rubini


* RE: efi_memmapwalk re-write
  2005-08-03 22:45 efi_memmapwalk re-write Luck, Tony
                   ` (7 preceding siblings ...)
  2005-08-15 14:33 ` Khalid Aziz
@ 2005-09-03  6:25 ` tony.luck
  8 siblings, 0 replies; 10+ messages in thread
From: tony.luck @ 2005-09-03  6:25 UTC (permalink / raw)
  To: linux-ia64

>Here is the updated patch. It incorporates your suggestions. I have left
>kern_memdesc as a linked list as opposed to an array. A linked list is a little
>more versatile structure and the cost over an array is minimal. Unless
>you feel strongly about using arrays instead of linked list, I would
>prefer to leave it this way.

I ditched the list ... it seems like overkill for a structure that
we only need to append things to.
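
For illustration, an append-only array with an end marker might look
like the sketch below. The structure layout and the ~0 end marker match
the patch; the helper names (kmd_append, kmd_terminate, kmd_count) and
the ATTR_* constants are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct kern_memdesc {
	uint64_t attribute;
	uint64_t start;
	uint64_t num_pages;
} kern_memdesc_t;

#define ATTR_UC    0x1ULL            /* same value as EFI_MEMORY_UC */
#define ATTR_WB    0x8ULL            /* same value as EFI_MEMORY_WB */
#define END_MARKER (~(uint64_t)0)    /* start value that ends the array */

/* Fill the next free slot and return the slot after it. */
static kern_memdesc_t *
kmd_append(kern_memdesc_t *k, uint64_t attr, uint64_t start, uint64_t pages)
{
	k->attribute = attr;
	k->start = start;
	k->num_pages = pages;
	return k + 1;
}

/* Write the end marker into the slot after the last real entry. */
static void
kmd_terminate(kern_memdesc_t *k)
{
	k->start = END_MARKER;
}

/* Walk up to the end marker, counting entries with the given attribute. */
static size_t
kmd_count(const kern_memdesc_t *map, uint64_t attr)
{
	size_t n = 0;

	for (const kern_memdesc_t *k = map; k->start != END_MARKER; k++)
		if (k->attribute == attr)
			n++;
	return n;
}
```

Merging with the previous entry is still possible with an array,
because the previous entry is always at k - 1.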

There were a couple of problems in the "gather" code where you
dropped efi memory sections before doing the contiguity scan, so
you ended up not using memory that could have been used. That may
have contributed to it not booting on tiger (as it ignored all memory
below 4G!).  But I didn't track down exactly what was wrong.

Then I went wild and re-arranged a whole lot of stuff, and added
code to pick out the UC memory, and save the trimmed off pieces
for use by efi_memmap_walk_uc().
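
The trimming idea can be sketched in isolation as follows. This is a
simplified illustration, not the code from the patch: it handles one
contiguous WB range, assumes a 16MB granule (the real IA64_GRANULE_SIZE
is a kernel configuration choice), and the struct and function names
are hypothetical.

```c
#include <stdint.h>

#define GRANULE_SIZE          ((uint64_t)1 << 24)	/* assumed 16MB */
#define GRANULE_ROUND_DOWN(x) ((x) & ~(GRANULE_SIZE - 1))
#define GRANULE_ROUND_UP(x)   (((x) + GRANULE_SIZE - 1) & ~(GRANULE_SIZE - 1))

/* All ranges are half-open: [start, end). Empty when start == end. */
struct trimmed {
	uint64_t low_start, low_end;	/* trimming below the first granule */
	uint64_t kern_start, kern_end;	/* whole granules for the kernel */
	uint64_t high_start, high_end;	/* trimming above the last granule */
};

/* Round the ends inward to granule boundaries and keep the leftovers. */
static struct trimmed
trim_to_granules(uint64_t start, uint64_t end)
{
	struct trimmed t;
	uint64_t lo = GRANULE_ROUND_UP(start);
	uint64_t hi = GRANULE_ROUND_DOWN(end);

	/* No granule boundary inside the range: everything is trimming. */
	if (hi < lo)
		lo = hi = end;

	t.low_start = start;
	t.low_end = lo;
	t.kern_start = lo;
	t.kern_end = hi;
	t.high_start = hi;
	t.high_end = end;
	return t;
}
```

In the patch the leftover pieces are only handed to the uncached
allocator when the descriptor also carries the UC attribute; that check
is left out here to keep the rounding visible.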

Remaining issues:
1) Does it boot on anything else? (I only tried Tiger so far, but I
did confirm that memory passed to the kernel is identical to the
old efi_memmap_walk/trim_top/trim_bottom).
2) I trimmed down the find_memmap_space() routine ... but I am not
really happy with it.  We need to find a relatively tiny piece of
memory to use, but I fear we have to duplicate most of the code
to figure out the granule boundaries to be sure of picking from
memory that will eventually be part of the kernel.
3) Does it find the UC memory on sn2?  Or do the type/attribute
checks need tweaking?  My version returns region 6 addresses (for
symmetry with the region 7 addresses returned by the regular walk),
but the original just returned physical addresses. I can switch to
physical easily if that is of more use to the callers.
4) Most of these routines should be marked "__init".
5) There is debug code at the bottom of efi.c to print out a summary
of the blocks of memory available for the kernel ("K") and for the
uncached allocator ("U").  This shouldn't be checked in, but it may
be useful to check whether this code is doing the right thing.
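
On issue 3, the difference between the two address forms is only the
region offset added to the physical address. A small sketch, assuming
the usual ia64 layout where region 7 (cached identity map, PAGE_OFFSET)
starts at 0xe000000000000000 and region 6 (uncached identity map,
__IA64_UNCACHED_OFFSET) at 0xc000000000000000; the function name and
the want_physical flag are hypothetical:

```c
#include <stdint.h>

#define REGION7_OFFSET 0xe000000000000000ULL	/* PAGE_OFFSET on ia64 */
#define REGION6_OFFSET 0xc000000000000000ULL	/* __IA64_UNCACHED_OFFSET */
#define ATTR_UC        0x1ULL			/* same value as EFI_MEMORY_UC */
#define ATTR_WB        0x8ULL			/* same value as EFI_MEMORY_WB */

/*
 * Address handed to the walk callback for a block at physical address
 * phys: region 7 for WB blocks, region 6 for UC blocks, or the plain
 * physical address if the caller prefers that.
 */
static uint64_t
walk_address(uint64_t phys, uint64_t attr, int want_physical)
{
	if (want_physical)
		return phys;
	return phys + (attr == ATTR_WB ? REGION7_OFFSET : REGION6_OFFSET);
}
```

Switching the UC walk back to physical addresses would then be a one
line change in the walker rather than in each caller.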

-Tony

diff --git a/arch/ia64/kernel/efi.c b/arch/ia64/kernel/efi.c
--- a/arch/ia64/kernel/efi.c
+++ b/arch/ia64/kernel/efi.c
@@ -239,57 +239,30 @@ is_available_memory (efi_memory_desc_t *
 	return 0;
 }
 
-/*
- * Trim descriptor MD so its starts at address START_ADDR.  If the descriptor covers
- * memory that is normally available to the kernel, issue a warning that some memory
- * is being ignored.
- */
-static void
-trim_bottom (efi_memory_desc_t *md, u64 start_addr)
-{
-	u64 num_skipped_pages;
+typedef struct kern_memdesc {
+	u64 attribute;
+	u64 start;
+	u64 num_pages;
+} kern_memdesc_t;
 
-	if (md->phys_addr >= start_addr || !md->num_pages)
-		return;
-
-	num_skipped_pages = (start_addr - md->phys_addr) >> EFI_PAGE_SHIFT;
-	if (num_skipped_pages > md->num_pages)
-		num_skipped_pages = md->num_pages;
-
-	if (is_available_memory(md))
-		printk(KERN_NOTICE "efi.%s: ignoring %luKB of memory at 0x%lx due to granule hole "
-		       "at 0x%lx\n", __FUNCTION__,
-		       (num_skipped_pages << EFI_PAGE_SHIFT) >> 10,
-		       md->phys_addr, start_addr - IA64_GRANULE_SIZE);
-	/*
-	 * NOTE: Don't set md->phys_addr to START_ADDR because that could cause the memory
-	 * descriptor list to become unsorted.  In such a case, md->num_pages will be
-	 * zero, so the Right Thing will happen.
-	 */
-	md->phys_addr += num_skipped_pages << EFI_PAGE_SHIFT;
-	md->num_pages -= num_skipped_pages;
-}
+static kern_memdesc_t *kern_memmap;
 
 static void
-trim_top (efi_memory_desc_t *md, u64 end_addr)
+walk (efi_freemem_callback_t callback, void *arg, u64 attr)
 {
-	u64 num_dropped_pages, md_end_addr;
+	kern_memdesc_t *k;
+	u64 start, end;
 
-	md_end_addr = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
-
-	if (md_end_addr <= end_addr || !md->num_pages)
-		return;
-
-	num_dropped_pages = (md_end_addr - end_addr) >> EFI_PAGE_SHIFT;
-	if (num_dropped_pages > md->num_pages)
-		num_dropped_pages = md->num_pages;
-
-	if (is_available_memory(md))
-		printk(KERN_NOTICE "efi.%s: ignoring %luKB of memory at 0x%lx due to granule hole "
-		       "at 0x%lx\n", __FUNCTION__,
-		       (num_dropped_pages << EFI_PAGE_SHIFT) >> 10,
-		       md->phys_addr, end_addr);
-	md->num_pages -= num_dropped_pages;
+	for (k = kern_memmap; k->start != ~0UL; k++) {
+		if (k->attribute != attr)
+			continue;
+		start = (attr == EFI_MEMORY_WB) ? PAGE_OFFSET : __IA64_UNCACHED_OFFSET;
+		start += PAGE_ALIGN(k->start);
+		end = (start + (k->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK;
+		if (start < end)
+			if ((*callback)(start, end, arg) < 0)
+				return;
+	}
 }
 
 /*
@@ -299,148 +272,19 @@ trim_top (efi_memory_desc_t *md, u64 end
 void
 efi_memmap_walk (efi_freemem_callback_t callback, void *arg)
 {
-	int prev_valid = 0;
-	struct range {
-		u64 start;
-		u64 end;
-	} prev, curr;
-	void *efi_map_start, *efi_map_end, *p, *q;
-	efi_memory_desc_t *md, *check_md;
-	u64 efi_desc_size, start, end, granule_addr, last_granule_addr, first_non_wb_addr = 0;
-	unsigned long total_mem = 0;
-
-	efi_map_start = __va(ia64_boot_param->efi_memmap);
-	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
-	efi_desc_size = ia64_boot_param->efi_memdesc_size;
-
-	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
-		md = p;
-
-		/* skip over non-WB memory descriptors; that's all we're interested in... */
-		if (!(md->attribute & EFI_MEMORY_WB))
-			continue;
-
-		/*
-		 * granule_addr is the base of md's first granule.
-		 * [granule_addr - first_non_wb_addr) is guaranteed to
-		 * be contiguous WB memory.
-		 */
-		granule_addr = GRANULEROUNDDOWN(md->phys_addr);
-		first_non_wb_addr = max(first_non_wb_addr, granule_addr);
-
-		if (first_non_wb_addr < md->phys_addr) {
-			trim_bottom(md, granule_addr + IA64_GRANULE_SIZE);
-			granule_addr = GRANULEROUNDDOWN(md->phys_addr);
-			first_non_wb_addr = max(first_non_wb_addr, granule_addr);
-		}
-
-		for (q = p; q < efi_map_end; q += efi_desc_size) {
-			check_md = q;
-
-			if ((check_md->attribute & EFI_MEMORY_WB) &&
-			    (check_md->phys_addr == first_non_wb_addr))
-				first_non_wb_addr += check_md->num_pages << EFI_PAGE_SHIFT;
-			else
-				break;		/* non-WB or hole */
-		}
-
-		last_granule_addr = GRANULEROUNDDOWN(first_non_wb_addr);
-		if (last_granule_addr < md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT))
-			trim_top(md, last_granule_addr);
-
-		if (is_available_memory(md)) {
-			if (md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT) >= max_addr) {
-				if (md->phys_addr >= max_addr)
-					continue;
-				md->num_pages = (max_addr - md->phys_addr) >> EFI_PAGE_SHIFT;
-				first_non_wb_addr = max_addr;
-			}
-
-			if (total_mem >= mem_limit)
-				continue;
-
-			if (total_mem + (md->num_pages << EFI_PAGE_SHIFT) > mem_limit) {
-				unsigned long limit_addr = md->phys_addr;
-
-				limit_addr += mem_limit - total_mem;
-				limit_addr = GRANULEROUNDDOWN(limit_addr);
-
-				if (md->phys_addr > limit_addr)
-					continue;
-
-				md->num_pages = (limit_addr - md->phys_addr) >>
-				                EFI_PAGE_SHIFT;
-				first_non_wb_addr = max_addr = md->phys_addr +
-				              (md->num_pages << EFI_PAGE_SHIFT);
-			}
-			total_mem += (md->num_pages << EFI_PAGE_SHIFT);
-
-			if (md->num_pages == 0)
-				continue;
-
-			curr.start = PAGE_OFFSET + md->phys_addr;
-			curr.end   = curr.start + (md->num_pages << EFI_PAGE_SHIFT);
-
-			if (!prev_valid) {
-				prev = curr;
-				prev_valid = 1;
-			} else {
-				if (curr.start < prev.start)
-					printk(KERN_ERR "Oops: EFI memory table not ordered!\n");
-
-				if (prev.end == curr.start) {
-					/* merge two consecutive memory ranges */
-					prev.end = curr.end;
-				} else {
-					start = PAGE_ALIGN(prev.start);
-					end = prev.end & PAGE_MASK;
-					if ((end > start) && (*callback)(start, end, arg) < 0)
-						return;
-					prev = curr;
-				}
-			}
-		}
-	}
-	if (prev_valid) {
-		start = PAGE_ALIGN(prev.start);
-		end = prev.end & PAGE_MASK;
-		if (end > start)
-			(*callback)(start, end, arg);
-	}
+	walk(callback, arg, EFI_MEMORY_WB);
 }
 
 /*
- * Walk the EFI memory map to pull out leftover pages in the lower
- * memory regions which do not end up in the regular memory map and
- * stick them into the uncached allocator
- *
- * The regular walk function is significantly more complex than the
- * uncached walk which means it really doesn't make sense to try and
- * marge the two.
+ * Walks the EFI memory map and calls CALLBACK once for each EFI memory descriptor that
+ * has memory that is available for uncached allocator.
  */
-void __init
-efi_memmap_walk_uc (efi_freemem_callback_t callback)
+void
+efi_memmap_walk_uc (efi_freemem_callback_t callback, void *arg)
 {
-	void *efi_map_start, *efi_map_end, *p;
-	efi_memory_desc_t *md;
-	u64 efi_desc_size, start, end;
-
-	efi_map_start = __va(ia64_boot_param->efi_memmap);
-	efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
-	efi_desc_size = ia64_boot_param->efi_memdesc_size;
-
-	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
-		md = p;
-		if (md->attribute == EFI_MEMORY_UC) {
-			start = PAGE_ALIGN(md->phys_addr);
-			end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
-			if ((*callback)(start, end, NULL) < 0)
-				return;
-		}
-	}
+	walk(callback, arg, EFI_MEMORY_UC);
 }
 
-
 /*
  * Look for the PAL_CODE region reported by EFI and maps it using an
  * ITR to enable safe PAL calls in virtual mode.  See IA-64 Processor
@@ -862,3 +706,196 @@ efi_uart_console_only(void)
 	printk(KERN_ERR "Malformed %s value\n", name);
 	return 0;
 }
+
+#define efi_md_size(md)	(md->num_pages << EFI_PAGE_SHIFT)
+
+static inline u64
+kmd_end(kern_memdesc_t *kmd)
+{
+	return (kmd->start + (kmd->num_pages << EFI_PAGE_SHIFT));
+}
+
+static inline u64
+efi_md_end(efi_memory_desc_t *md)
+{
+	return (md->phys_addr + efi_md_size(md));
+}
+
+static inline int
+efi_wb(efi_memory_desc_t *md)
+{
+	return (md->attribute & EFI_MEMORY_WB);
+}
+
+static inline int
+efi_uc(efi_memory_desc_t *md)
+{
+	return (md->attribute & EFI_MEMORY_UC);
+}
+
+/*
+ * Look for the smallest available memory descriptor that is big
+ * enough to hold the kernel memory map. Make sure this descriptor
+ * is at least granule sized so it does not get trimmed
+ */
+struct kern_memdesc *
+find_memmap_space (void)
+{
+	void *efi_map_start, *efi_map_end, *p;
+	u64 efi_desc_size, space_needed;
+	u64 smallest_block = UINT_MAX;
+	u64 small_block_addr = -1UL;
+	u64 block_size;
+	efi_memory_desc_t *md;
+
+	efi_map_start = __va(ia64_boot_param->efi_memmap);
+	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
+	efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+	/*
+	 * We will allocate enough memory to hold as many nodes as
+	 * there are in EFI memory map and a null node.
+	 */
+	space_needed = sizeof(kern_memdesc_t)*((ia64_boot_param->efi_memmap_size/efi_desc_size) + 1);
+
+	for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+		md = p;
+
+		/* skip over non-available memory descriptors */
+		if (!is_available_memory(md) || md->type == EFI_LOADER_DATA)
+			continue;
+		block_size = efi_md_size(md);
+
+		if ((block_size < smallest_block) &&
+			(block_size >= space_needed) &&
+			 (block_size >= IA64_GRANULE_SIZE)) {
+			smallest_block = block_size;
+			small_block_addr = md->phys_addr;
+		}
+
+	}
+
+	return __va(small_block_addr);
+}
+
+/*
+ * Walk the EFI memory map and gather all memory available for kernel
+ * to use.  We can allocate partial granules only if the unavailable
+ * parts exist, and are WB.
+ */
+void
+efi_memmap_init(unsigned long *s, unsigned long *e)
+{
+	struct kern_memdesc *k, *prev = 0;
+	u64	contig_low=0, contig_high=0;
+	u64	as, ae, lim;
+	void *efi_map_start, *efi_map_end, *p, *q;
+	efi_memory_desc_t *md, *pmd = NULL, *check_md;
+	u64	efi_desc_size;
+	unsigned long total_mem = 0;
+
+	k = kern_memmap = find_memmap_space();
+
+	efi_map_start = __va(ia64_boot_param->efi_memmap);
+	efi_map_end   = efi_map_start + ia64_boot_param->efi_memmap_size;
+	efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+	for (p = efi_map_start; p < efi_map_end; pmd = md, p += efi_desc_size) {
+		md = p;
+		if (!efi_wb(md)) {
+			if (efi_uc(md) && md->type == EFI_CONVENTIONAL_MEMORY) {
+				k->attribute = EFI_MEMORY_UC;
+				k->start = md->phys_addr;
+				k->num_pages = md->num_pages;
+				k++;
+			}
+			continue;
+		}
+		if (pmd == NULL || !efi_wb(pmd) || efi_md_end(pmd) != md->phys_addr) {
+			contig_low = GRANULEROUNDUP(md->phys_addr);
+			contig_high = efi_md_end(md);
+			for (q = p + efi_desc_size; q < efi_map_end; q += efi_desc_size) {
+				check_md = q;
+				if (!efi_wb(check_md))
+					break;
+				if (contig_high != check_md->phys_addr)
+					break;
+				contig_high = efi_md_end(check_md);
+			}
+			contig_high = GRANULEROUNDDOWN(contig_high);
+		}
+		if (!is_available_memory(md))
+			continue;
+
+		/*
+		 * Round ends inward to granule boundaries
+		 * Give trimmings to uncached allocator
+		 */
+		if (md->phys_addr < contig_low) {
+			lim = min(kmd_end(k), contig_low);
+			if (efi_uc(md)) {
+				if (k > kern_memmap && (k-1)->attribute == EFI_MEMORY_UC &&
+				    kmd_end(k-1) == md->phys_addr) {
+					(k-1)->num_pages += (lim - md->phys_addr) >> EFI_PAGE_SHIFT;
+				} else {
+					k->attribute = EFI_MEMORY_UC;
+					k->start = md->phys_addr;
+					k->num_pages = (lim - md->phys_addr) >> EFI_PAGE_SHIFT;
+					k++;
+				}
+			}
+			as = lim;
+		} else
+			as = md->phys_addr;
+
+		if (efi_md_end(md) > contig_high) {
+			lim = max(md->phys_addr, contig_high);
+			if (efi_uc(md)) {
+				if (lim == md->phys_addr && k > kern_memmap &&
+				    (k-1)->attribute == EFI_MEMORY_UC &&
+				    kmd_end(k-1) == md->phys_addr) {
+					(k-1)->num_pages += md->num_pages;
+				} else {
+					k->attribute = EFI_MEMORY_UC;
+					k->start = lim;
+					k->num_pages = (efi_md_end(md) - k->start) >> EFI_PAGE_SHIFT;
+					k++;
+				}
+			}
+			ae = lim;
+		} else
+			ae = efi_md_end(md);
+
+		/* keep within max_addr= command line arg */
+		ae = min(ae, max_addr);
+		if (ae <= as)
+			continue;
+
+		/* avoid going over mem= command line arg */
+		if (total_mem + (ae - as) > mem_limit)
+			ae -= total_mem + (ae - as) - mem_limit;
+
+		if (ae <= as)
+			continue;
+		if (prev && kmd_end(prev) == md->phys_addr) {
+			prev->num_pages += (ae - as) >> EFI_PAGE_SHIFT;
+			total_mem += ae - as;
+			continue;
+		}
+		k->attribute = EFI_MEMORY_WB;
+		k->start = as;
+		k->num_pages = (ae - as) >> EFI_PAGE_SHIFT;
+		total_mem += ae - as;
+		prev = k++;
+	}
+	k->start = ~0L; /* end-marker */
+
+	/* reserve the memory we are using for kern_memmap */
+	*s = (u64)kern_memmap;
+	*e = (u64)++k;
+
+/*DEBUG PRINTK TO BE REMOVED*/
+	for (k = kern_memmap; k->start != ~0L; k++)
+		printk("%c %.16lx %.16lx\n", (k->attribute == EFI_MEMORY_WB)?'K':'U',
+			k->start, kmd_end(k));
+}
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -211,6 +211,9 @@ reserve_memory (void)
 	}
 #endif
 
+	efi_memmap_init(&rsvd_region[n].start, &rsvd_region[n].end);
+	n++;
+
 	/* end of memory marker */
 	rsvd_region[n].start = ~0UL;
 	rsvd_region[n].end   = ~0UL;
diff --git a/include/asm-ia64/meminit.h b/include/asm-ia64/meminit.h
--- a/include/asm-ia64/meminit.h
+++ b/include/asm-ia64/meminit.h
@@ -16,10 +16,11 @@
  * 	- initrd (optional)
  * 	- command line string
  * 	- kernel code & data
+ * 	- Kernel memory map built from EFI memory map
  *
  * More could be added if necessary
  */
-#define IA64_MAX_RSVD_REGIONS 5
+#define IA64_MAX_RSVD_REGIONS 6
 
 struct rsvd_region {
 	unsigned long start;	/* virtual address of beginning of element */
@@ -33,6 +34,7 @@ extern void find_memory (void);
 extern void reserve_memory (void);
 extern void find_initrd (void);
 extern int filter_rsvd_memory (unsigned long start, unsigned long end, void *arg);
+extern void efi_memmap_init(unsigned long *, unsigned long *);
 
 /*
  * For rounding an address to the next IA64_GRANULE_SIZE or order
