public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* /proc/kcore - how to fix it
@ 2003-05-20 20:05 Luck, Tony
  2003-05-20 20:30 ` Valdis.Kletnieks
  0 siblings, 1 reply; 10+ messages in thread
From: Luck, Tony @ 2003-05-20 20:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ia64, rmk, ak, David Mosberger

This conversation started over on the ia64 kernel list, but it needs a
wider audience.

The story so far for new readers:
/proc/kcore has been broken on some architectures for a long time. Problems
surround the fact that some architectures allocate memory for vmalloc()
and thus modules at addresses below PAGE_OFFSET, which results in negative
file offsets in the virtual core file image provided by /proc/kcore. There
are also pending problems for discontig memory systems as /proc/kcore just
pretends that there are no holes between "PAGE_OFFSET" and "high_memory",
so an unwary user (ok super-user) can read non-existant memory which may
do bad things. There may also be kernel objects that would be nice to
view in /proc/kcore, but do not show up there.

A pending change on ia64 to allow booting on machines that don't have
physical memory in any convenient pre-determined place will make things
even worse, because the kernel itself won't show up in the current
implementation of /proc/kcore!

I tried to fix this by taking a suggestion made by Andi Kleen in one of
the previous threads on this topic, and added an entry to "vmlist" for
the kernel ...

 Tony> I've juggled the addresses around again, moving the kernel up to
 Tony> 0xA000004000000000 and VMALLOC_START back down to 0xA000000000030000
 Tony> so that the entry for the kernel goes on the *end* of the vmlist,
 Tony> so we don't have to uselessly step over it on every call to vmalloc().

 David> I don't want the kernel layout to be constrained by something as
 David> esoteric as kcore.  Let's fix kcore for good.

So I'm looking at what it might take to "fix kcore for good", and the first
issue I hit was I'm not exactly sure what the requirements are, since I
never use /proc/kcore except when people complain that my kernel address
shuffling has broken things.

Presumably people expect to see kernel code/data in /proc/kcore.
On x86 they can also see every piece of vmalloc() memory. Do they really
need them all, or just the ones that are being used for modules? Do we
really want to have separate elf_phdrs for every one of them, or could
we just have one section that covers addresses from VMALLOC_START to
VMALLOC_END, and then test whether pages have been mapped in that range
before accessing them?

What about discontiguous memory.  Since /proc/kcore is super-user only
we could continue with the attitude that the user should be careful not
to touch memory that doesn't exist, or we could be kind and provide an
API so that the architecture specific code that finds the memory can tell
/proc/kcore what exists.

What about other random architecture specific blobs of kernel address
space memory (e.g. ia64 percpu memory).  Should they get a elf_phdr too?

What about the "negative file offset" problem.  Current 2.5 code has
KCORE_BASE which is part of a solution for this ... a simple change
so that kcore.c reads:
	#ifndef KCORE_BASE
	#define KCORE_BASE PAGE_OFFSET
	#endif
so that architectures could define their own KCORE_BASE value would
help for ia64 (since all kernel addresses are in the top 3/8 of the
address space ... we can avoid negative offsets by picking KCORE_BASE
low enough that all offsets are positive).  This won't help you if your
architecture spreads kernel objects across your full address space.

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: /proc/kcore - how to fix it
  2003-05-20 20:05 Luck, Tony
@ 2003-05-20 20:30 ` Valdis.Kletnieks
  0 siblings, 0 replies; 10+ messages in thread
From: Valdis.Kletnieks @ 2003-05-20 20:30 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 1039 bytes --]

On Tue, 20 May 2003 13:05:15 PDT, "Luck, Tony" said:

> What about discontiguous memory.  Since /proc/kcore is super-user only
> we could continue with the attitude that the user should be careful not
> to touch memory that doesn't exist, or we could be kind and provide an
> API so that the architecture specific code that finds the memory can tell
> /proc/kcore what exists.

"don't touch memory that doesn't exist" is a bad idea unless there is *some*
sort of API that allows the program to intuit what does/doesn't exist.  If
the program can't find out what is legal without hitting an oops or worse,
nobody will use /proc/kcore, and then why bother implementing it?

(Note that I'd consider "look *here* for a pointer to known-existing memory,
then look 24 bytes into there for a pointer to a linked list of memory
block address/size pairs" sufficient, no need for a fancy /proc interface.
Of course, that's just my opinion - those who don't have memories of
pointer chasing in S/360 assembler under OS/MVT may have other opinions ;)

[-- Attachment #2: Type: application/pgp-signature, Size: 226 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: /proc/kcore - how to fix it
@ 2003-05-23 19:13 Luck, Tony
  2003-05-23 19:41 ` Andi Kleen
  2003-05-23 21:11 ` Russell King
  0 siblings, 2 replies; 10+ messages in thread
From: Luck, Tony @ 2003-05-23 19:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ia64, rmk, ak, David Mosberger, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 3434 bytes --]

Not a lot of replies to this on the list, but I did notice
that the lse-tech IRC discussion mentioned me by name, saying
that I had a proposal for a fix ... and Andrew has stuck my
name against this in the must-fix list for 2.6.  So, not wanting
to be the person who is holding up 2.6 ... here's what I am
proposing.

The patch attached provides enough hooks that each architecture
should be able to make /proc/kcore useful.  The patch is INCOMPLETE
in that *use* of those hooks is ONLY PROVIDED FOR IA64.

Here's how it works.  The default code in fs/proc/kcore.c doesn't
set up any "elf_phdr" sections ... it is left to each architecture
to make appropriate calls to "kclist_add()" to specify a base
address and size for each piece of kernel virtual address space
that needs to be made accessible through /proc/kcore.  To get the
old functionality, you'll need two calls that look something like:

 kclist_add(&kcore_mem, __va(0),
             max_low_pfn * PAGE_SIZE);
 kclist_add(&kcore_vmem, (void *)VMALLOC_START,
             VMALLOC_END-VMALLOC_START);

The first makes all of memory visible (__i386__, __mc68000__ and
__x86_64__ should use __va(PAGE_SIZE) to duplicate the original
lack of access to page 0).  The second provides a single map for
all "vmalloc" space (the code still searches the vmlist to see
what actually exists before accessing it).

Other blocks of kernel virtual space can be added as needed, and
removed again (with kclist_del()).  E.g. discontiguous memory
machines can add an entry for each block of memory.  Architectures
that allocate memory for modules someplace outside of vmalloc-land
can add/remove entries on module insert and remove.

The second piece of abstraction is the kc_vaddr_to_offset() and
kc_offset_to_vaddr() macros.  These provide mappings from kernel
virtual addresses to offsets in the virtual file that /proc/kcore
instantiates.  I hope they are sufficient to avoid negative offset
problems that plagued the old /proc/kcore.  Default versions are
provided for the old behaviour (mapping simply adds/subtracts
PAGE_OFFSET).  For ia64 I just need to use a different offset
as all kernel virtual allocations are in the high 37.5% of the
64-bit virtual address space.
x86_64 was the other architecture with this problem. I don't know
enough (anything) about the kernel memory map on x86_64, so I hope
these provide a big enough hook.  I'm hoping that you have some low
stuff, and some high stuff with a big hole in the middle ... in
which case the macros might look something like:

#define kc_vaddr_to_offset(v) ((v) < 0x1000000000000000 ? (v) : \
                              ((v) - 0xF000000000000000))

But if you have interesting stuff scattered across *every* part of
the unsigned address range, then you won't be able to squeeze it all
into a signed file offset.

There are a couple of bug fixes too:
1) get_kcore_size() didn't account for the elf_prstatus, elf_prpsinfo
   and task_struct that are placed in the PT_NOTE section that is
   part of the header.  We were saved on most configurations by the
   round-up to PAGE_SIZE ... but it's possible that some architectures
   or machines corrupted memory beyond the space allocated for the
   header.

2) The size of the PT_NOTES section was incorrectly set to the size
   of the last note, rather than the sum of the sizes of all the notes.

-Tony


[-- Attachment #2: kcore.patch --]
[-- Type: application/octet-stream, Size: 10569 bytes --]

diff -ru l2569-mosberger/arch/ia64/mm/init.c l2569-aegl/arch/ia64/mm/init.c
--- l2569-mosberger/arch/ia64/mm/init.c	Mon May 19 09:40:04 2003
+++ l2569-aegl/arch/ia64/mm/init.c	Fri May 23 09:22:56 2003
@@ -32,7 +32,7 @@
 struct mmu_gather mmu_gathers[NR_CPUS];
 
 /* References to section boundaries: */
-extern char _stext, _etext, _edata, __init_begin, __init_end;
+extern char _stext, _etext, _edata, __init_begin, __init_end, _end;
 
 extern void ia64_tlb_init (void);
 
@@ -576,6 +576,7 @@
 	long reserved_pages, codesize, datasize, initsize;
 	unsigned long num_pgt_pages;
 	pg_data_t *pgdat;
+	static struct kcore_list kcore_mem, kcore_vmem, kcore_kernel;
 
 #ifdef CONFIG_PCI
 	/*
@@ -594,6 +595,10 @@
 
 	high_memory = __va(max_low_pfn * PAGE_SIZE);
 
+	kclist_add(&kcore_mem, __va(0), max_low_pfn * PAGE_SIZE);
+	kclist_add(&kcore_vmem, (void *)VMALLOC_START, VMALLOC_END-VMALLOC_START);
+	kclist_add(&kcore_kernel, &_stext, &_end - &_stext);
+
 	for_each_pgdat(pgdat)
 		totalram_pages += free_all_bootmem_node(pgdat);
 
diff -ru l2569-mosberger/fs/proc/kcore.c l2569-aegl/fs/proc/kcore.c
--- l2569-mosberger/fs/proc/kcore.c	Mon May 19 09:40:57 2003
+++ l2569-aegl/fs/proc/kcore.c	Fri May 23 09:21:00 2003
@@ -99,7 +99,12 @@
 }
 #else /* CONFIG_KCORE_AOUT */
 
-#define	KCORE_BASE	PAGE_OFFSET
+#ifndef kc_vaddr_to_offset
+#define	kc_vaddr_to_offset(v) ((v) - PAGE_OFFSET)
+#endif
+#ifndef	kc_offset_to_vaddr
+#define	kc_offset_to_vaddr(o) ((o) + PAGE_OFFSET)
+#endif
 
 #define roundup(x, y)  ((((x)+((y)-1))/(y))*(y))
 
@@ -112,29 +117,60 @@
 	void *data;
 };
 
+static struct kcore_list *kclist;
+static rwlock_t kclist_lock = RW_LOCK_UNLOCKED;
+
+void
+kclist_add(struct kcore_list *new, void *addr, size_t size)
+{
+	new->addr = (unsigned long)addr;
+	new->size = size;
+
+	write_lock(&kclist_lock);
+	new->next = kclist;
+	kclist = new;
+	write_unlock(&kclist_lock);
+}
+
+struct kcore_list *
+kclist_del(void *addr)
+{
+	struct kcore_list *m, **p = &kclist;
+
+	write_lock(&kclist_lock);
+	for (m = *p; m; p = &m->next) {
+		if (m->addr == (unsigned long)addr) {
+			*p = m->next;
+			write_unlock(&kclist_lock);
+			return m;
+		}
+	}
+	write_unlock(&kclist_lock);
+	return 0;
+}
+
 extern char saved_command_line[];
 
-static size_t get_kcore_size(int *num_vma, size_t *elf_buflen)
+static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
 {
 	size_t try, size;
-	struct vm_struct *m;
+	struct kcore_list *m;
 
-	*num_vma = 0;
-	size = ((size_t)high_memory - KCORE_BASE + PAGE_SIZE);
-	if (!vmlist) {
-		*elf_buflen = PAGE_SIZE;
-		return (size);
-	}
+	*nphdr = 1; /* PT_NOTE */
+	size = 0;
 
-	for (m=vmlist; m; m=m->next) {
-		try = (size_t)m->addr + m->size;
-		if (try > KCORE_BASE + size)
-			size = try - KCORE_BASE;
-		*num_vma = *num_vma + 1;
+	for (m=kclist; m; m=m->next) {
+		try = kc_vaddr_to_offset((size_t)m->addr + m->size);
+		if (try > size)
+			size = try;
+		*nphdr = *nphdr + 1;
 	}
 	*elf_buflen =	sizeof(struct elfhdr) + 
-			(*num_vma + 2)*sizeof(struct elf_phdr) + 
-			3 * sizeof(struct memelfnote);
+			(*nphdr + 2)*sizeof(struct elf_phdr) + 
+			3 * sizeof(struct memelfnote) +
+			sizeof(struct elf_prstatus) +
+			sizeof(struct elf_prpsinfo) +
+			sizeof(struct task_struct);
 	*elf_buflen = PAGE_ALIGN(*elf_buflen);
 	return size + *elf_buflen;
 }
@@ -184,9 +220,9 @@
 
 /*
  * store an ELF coredump header in the supplied buffer
- * num_vma is the number of elements in vmlist
+ * nphdr is the number of elf_phdr to insert
  */
-static void elf_kcore_store_hdr(char *bufp, int num_vma, int dataoff)
+static void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff)
 {
 	struct elf_prstatus prstatus;	/* NT_PRSTATUS */
 	struct elf_prpsinfo prpsinfo;	/* NT_PRPSINFO */
@@ -194,7 +230,7 @@
 	struct elfhdr *elf;
 	struct memelfnote notes[3];
 	off_t offset = 0;
-	struct vm_struct *m;
+	struct kcore_list *m;
 
 	/* setup ELF header */
 	elf = (struct elfhdr *) bufp;
@@ -214,7 +250,7 @@
 	elf->e_flags	= 0;
 	elf->e_ehsize	= sizeof(struct elfhdr);
 	elf->e_phentsize= sizeof(struct elf_phdr);
-	elf->e_phnum	= 2 + num_vma;
+	elf->e_phnum	= nphdr;
 	elf->e_shentsize= 0;
 	elf->e_shnum	= 0;
 	elf->e_shstrndx	= 0;
@@ -232,33 +268,17 @@
 	nhdr->p_flags	= 0;
 	nhdr->p_align	= 0;
 
-	/* setup ELF PT_LOAD program header for the 
-	 * virtual range 0xc0000000 -> high_memory */
-	phdr = (struct elf_phdr *) bufp;
-	bufp += sizeof(struct elf_phdr);
-	offset += sizeof(struct elf_phdr);
-	phdr->p_type	= PT_LOAD;
-	phdr->p_flags	= PF_R|PF_W|PF_X;
-	phdr->p_offset	= PAGE_OFFSET - KCORE_BASE + dataoff;
-	phdr->p_vaddr	= PAGE_OFFSET;
-	phdr->p_paddr	= __pa(PAGE_OFFSET);
-	phdr->p_filesz	= phdr->p_memsz = ((unsigned long)high_memory - PAGE_OFFSET);
-	phdr->p_align	= PAGE_SIZE;
-
-	/* setup ELF PT_LOAD program header for every vmalloc'd area */
-	for (m=vmlist; m; m=m->next) {
-		if (m->flags & VM_IOREMAP) /* don't dump ioremap'd stuff! (TA) */
-			continue;
-
+	/* setup ELF PT_LOAD program header for every area */
+	for (m=kclist; m; m=m->next) {
 		phdr = (struct elf_phdr *) bufp;
 		bufp += sizeof(struct elf_phdr);
 		offset += sizeof(struct elf_phdr);
 
 		phdr->p_type	= PT_LOAD;
 		phdr->p_flags	= PF_R|PF_W|PF_X;
-		phdr->p_offset	= (size_t)m->addr - KCORE_BASE + dataoff;
+		phdr->p_offset	= kc_vaddr_to_offset(m->addr) + dataoff;
 		phdr->p_vaddr	= (size_t)m->addr;
-		phdr->p_paddr	= __pa(m->addr);
+		phdr->p_paddr	= 0;
 		phdr->p_filesz	= phdr->p_memsz	= m->size;
 		phdr->p_align	= PAGE_SIZE;
 	}
@@ -294,7 +314,7 @@
 	strcpy(prpsinfo.pr_fname, "vmlinux");
 	strncpy(prpsinfo.pr_psargs, saved_command_line, ELF_PRARGSZ);
 
-	nhdr->p_filesz	= notesize(&notes[1]);
+	nhdr->p_filesz	+= notesize(&notes[1]);
 	bufp = storenote(&notes[1], bufp);
 
 	/* set up the task structure */
@@ -303,7 +323,7 @@
 	notes[2].datasz	= sizeof(struct task_struct);
 	notes[2].data	= current;
 
-	nhdr->p_filesz	= notesize(&notes[2]);
+	nhdr->p_filesz	+= notesize(&notes[2]);
 	bufp = storenote(&notes[2], bufp);
 
 } /* end elf_kcore_store_hdr() */
@@ -317,13 +337,14 @@
 	ssize_t acc = 0;
 	size_t size, tsz;
 	size_t elf_buflen;
-	int num_vma;
+	int nphdr;
 	unsigned long start;
 
-	read_lock(&vmlist_lock);
-	proc_root_kcore->size = size = get_kcore_size(&num_vma, &elf_buflen);
+	read_lock(&kclist_lock);
+	tsz =  get_kcore_size(&nphdr, &elf_buflen);
+	proc_root_kcore->size = size = tsz + elf_buflen;
 	if (buflen == 0 || *fpos >= size) {
-		read_unlock(&vmlist_lock);
+		read_unlock(&kclist_lock);
 		return 0;
 	}
 
@@ -340,12 +361,12 @@
 			tsz = buflen;
 		elf_buf = kmalloc(elf_buflen, GFP_ATOMIC);
 		if (!elf_buf) {
-			read_unlock(&vmlist_lock);
+			read_unlock(&kclist_lock);
 			return -ENOMEM;
 		}
 		memset(elf_buf, 0, elf_buflen);
-		elf_kcore_store_hdr(elf_buf, num_vma, elf_buflen);
-		read_unlock(&vmlist_lock);
+		elf_kcore_store_hdr(elf_buf, nphdr, elf_buflen);
+		read_unlock(&kclist_lock);
 		if (copy_to_user(buffer, elf_buf + *fpos, tsz)) {
 			kfree(elf_buf);
 			return -EFAULT;
@@ -360,41 +381,30 @@
 		if (buflen == 0)
 			return acc;
 	} else
-		read_unlock(&vmlist_lock);
-
-	/* where page 0 not mapped, write zeros into buffer */
-#if defined (__i386__) || defined (__mc68000__) || defined(__x86_64__)
-	if (*fpos < PAGE_SIZE + elf_buflen) {
-		/* work out how much to clear */
-		tsz = PAGE_SIZE + elf_buflen - *fpos;
-		if (buflen < tsz)
-			tsz = buflen;
-
-		/* write zeros to buffer */
-		if (clear_user(buffer, tsz))
-			return -EFAULT;
-		buflen -= tsz;
-		*fpos += tsz;
-		buffer += tsz;
-		acc += tsz;
+		read_unlock(&kclist_lock);
 
-		/* leave now if filled buffer already */
-		if (buflen == 0)
-			return tsz;
-	}
-#endif
-	
 	/*
-	 * Fill the remainder of the buffer from kernel VM space.
-	 * We said in the ELF header that the data which starts
-	 * at 'elf_buflen' is virtual address KCORE_BASE. --rmk
+	 * Check to see if our file offset matches with any of
+	 * the addresses in the elf_phdr on our list.
 	 */
-	start = KCORE_BASE + (*fpos - elf_buflen);
+	start = kc_offset_to_vaddr(*fpos - elf_buflen);
 	if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
 		tsz = buflen;
 		
 	while (buflen) {
-		if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
+		struct kcore_list *m;
+
+		read_lock(&kclist_lock);
+		for (m=kclist; m; m=m->next) {
+			if (start >= m->addr && start < (m->addr+m->size))
+				break;
+		}
+		read_unlock(&kclist_lock);
+
+		if (m == NULL) {
+			if (clear_user(buffer, tsz))
+				return -EFAULT;
+		} else if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
 			char * elf_buf;
 			struct vm_struct *m;
 			unsigned long curstart = start;
@@ -439,8 +449,7 @@
 				return -EFAULT;
 			}
 			kfree(elf_buf);
-		} else if ((start > PAGE_OFFSET) && (start < 
-						(unsigned long)high_memory)) {
+		} else {
 			if (kern_addr_valid(start)) {
 				if (copy_to_user(buffer, (char *)start, tsz))
 					return -EFAULT;
@@ -448,9 +457,6 @@
 				if (clear_user(buffer, tsz))
 					return -EFAULT;
 			}
-		} else {
-			if (clear_user(buffer, tsz))
-				return -EFAULT;
 		}
 		buflen -= tsz;
 		*fpos += tsz;
diff -ru l2569-mosberger/include/asm-ia64/pgtable.h l2569-aegl/include/asm-ia64/pgtable.h
--- l2569-mosberger/include/asm-ia64/pgtable.h	Mon May 19 09:41:01 2003
+++ l2569-aegl/include/asm-ia64/pgtable.h	Fri May 23 09:13:08 2003
@@ -217,6 +217,10 @@
 # define VMALLOC_END		(0xa000000000000000 + (1UL << (4*PAGE_SHIFT - 9)))
 #endif
 
+/* fs/proc/kcore.c */
+#define	kc_vaddr_to_offset(v) ((v) - 0xA000000000000000)
+#define	kc_offset_to_vaddr(o) ((o) + 0xA000000000000000)
+
 /*
  * Conversion functions: convert page frame number (pfn) and a protection value to a page
  * table entry (pte).
diff -ru l2569-mosberger/include/linux/proc_fs.h l2569-aegl/include/linux/proc_fs.h
--- l2569-mosberger/include/linux/proc_fs.h	Mon May 19 09:41:07 2003
+++ l2569-aegl/include/linux/proc_fs.h	Fri May 23 09:23:51 2003
@@ -74,6 +74,12 @@
 	kdev_t	rdev;
 };
 
+struct kcore_list {
+	struct kcore_list *next;
+	unsigned long addr;
+	size_t size;
+};
+
 #ifdef CONFIG_PROC_FS
 
 extern struct proc_dir_entry proc_root;
@@ -168,6 +174,12 @@
 	remove_proc_entry(name,proc_net);
 }
 
+/*
+ * fs/proc/kcore.c
+ */
+extern void kclist_add(struct kcore_list *, void *, size_t);
+extern struct kcore_list *kclist_del(void *);
+
 #else
 
 #define proc_root_driver NULL
@@ -201,6 +213,8 @@
 
 extern struct proc_dir_entry proc_root;
 
+static inline void kclist_add(struct kcore_list *new, void *addr, size_t size) {};
+static inline struct kcore_list * kclist_del(void *addr) {return NULL};
 #endif /* CONFIG_PROC_FS */
 
 struct proc_inode {

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: /proc/kcore - how to fix it
  2003-05-23 19:13 Luck, Tony
@ 2003-05-23 19:41 ` Andi Kleen
  2003-05-23 21:11 ` Russell King
  1 sibling, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2003-05-23 19:41 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, linux-ia64, rmk, davidm, akpm

On Fri, 23 May 2003 12:13:04 -0700
"Luck, Tony" <tony.luck@intel.com> wrote:


> Here's how it works.  The default code in fs/proc/kcore.c doesn't
> set up any "elf_phdr" sections ... it is left to each architecture
> to make appropriate calls to "kclist_add()" to specify a base
> address and size for each piece of kernel virtual address space
> that needs to be made accessible through /proc/kcore.  To get the
> old functionality, you'll need two calls that look something like:
> 
>  kclist_add(&kcore_mem, __va(0),
>              max_low_pfn * PAGE_SIZE);
>  kclist_add(&kcore_vmem, (void *)VMALLOC_START,
>              VMALLOC_END-VMALLOC_START);

Looks good to me.

/dev/mem / dev/kmem has the same problem, it could use that too.

> 
> The first makes all of memory visible (__i386__, __mc68000__ and
> __x86_64__ should use __va(PAGE_SIZE) to duplicate the original
> lack of access to page 0).  The second provides a single map for
> all "vmalloc" space (the code still searches the vmlist to see
> what actually exists before accessing it).
> 
> Other blocks of kernel virtual space can be added as needed, and
> removed again (with kclist_del()).  E.g. discontiguous memory

Remove could get racy - /proc/kcore can sleep while accessing such
a block. You would need a sleeping lock hold all the time.

What would you need remove for?

> The second piece of abstraction is the kc_vaddr_to_offset() and
> kc_offset_to_vaddr() macros.  These provide mappings from kernel
> virtual addresses to offsets in the virtual file that /proc/kcore
> instantiates.  I hope they are sufficient to avoid negative offset
> problems that plagued the old /proc/kcore.  Default versions are
> provided for the old behaviour (mapping simply adds/subtracts
> PAGE_OFFSET).  For ia64 I just need to use a different offset
> as all kernel virtual allocations are in the high 37.5% of the
> 64-bit virtual address space.

I'm not sure it is a good idea from the interface standpoint, especially
for /dev/kmem.  It would surely be confusing for the user. Yes, a few applications and even some kernel code has signedness problems, but I would
better fix them instead of adding such a weird interface. 1:1 mapping would 
be a lot cleaner. It should not be that bad because on i386 the kernel
is also in negative space.

[i still have some 2.4 patches to fix a few 64bit signedness problems in /proc,
perhaps I should dust them off and port to 2.5]


> But if you have interesting stuff scattered across *every* part of
> the unsigned address range, then you won't be able to squeeze it all
> into a signed file offset.

The memory map on x86-64 is rather clean, that's no problem

Thanks for doing this work,
-Andi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: /proc/kcore - how to fix it
@ 2003-05-23 20:52 Luck, Tony
  2003-05-23 21:39 ` Andi Kleen
  0 siblings, 1 reply; 10+ messages in thread
From: Luck, Tony @ 2003-05-23 20:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-ia64, rmk, davidm, akpm

> /dev/mem / dev/kmem has the same problem, it could use that too.

Hmmm ... so "kclist" needs to be globally visible instead of static,
probably needs to be maintained by the mem driver rather than kcore.c
(which might not be configured) ... and would need a new name to
reflect its many uses (kvmlist?)
 
> > Other blocks of kernel virtual space can be added as needed, and
> > removed again (with kclist_del()).  E.g. discontiguous memory
> 
> Remove could get racy - /proc/kcore can sleep while accessing such
> a block. You would need a sleeping lock hold all the time.
> 
> What would you need remove for?

Someday we'll support memory hot-add and hot-remove.  But in the
more immediate future I think that arch/arm allocates space for
modules outside of vmalloc-land ... so might want to add space to
the list on module insert, and remove at rmmod time.

Races are a problem ... I'm just not sure how big of a problem.  The
virtual address to offset mapping stuff below is set up so that the
offsets of sections in the virtual /proc/kcore file do not change as
sections appear/disappear (just like the existing kcore code). So
if you are accessing some vmalloc'd structure and the kernel vfree()s
some other structure, then you won't get hurt.  But opening /proc/kcore
and reading the headers doesn't make any promises that memory listed
in an elf_phdr will still be there by the time you lseek and read,
which is no different from the existing implementation.

> 
> > The second piece of abstraction is the kc_vaddr_to_offset() and
> > kc_offset_to_vaddr() macros.  These provide mappings from kernel
> > virtual addresses to offsets in the virtual file that /proc/kcore
> > instantiates.  I hope they are sufficient to avoid negative offset
> > problems that plagued the old /proc/kcore.  Default versions are
> > provided for the old behaviour (mapping simply adds/subtracts
> > PAGE_OFFSET).  For ia64 I just need to use a different offset
> > as all kernel virtual allocations are in the high 37.5% of the
> > 64-bit virtual address space.
> 
> I'm not sure it is a good idea from the interface standpoint, 
> especially for /dev/kmem.  It would surely be confusing for the user.

I agree. /dev/kmem should provide a flat 1:1 map ... and leave users
to deal with all the negative addresses.  Applications already expect
this (because on x86 the kernel is already in negative space).

/proc/kcore is a bit different because it's trying to present
a regular file view, rather than a char-special file view to
any tool that wants to use it.  If someone fixes up gdb, objdump,
readelf, etc. then the macros can be easily removed to provide 1:1
(though even then it isn't quite 1:1 ... offset in file would be
"vaddr + elf_buflen" to allow space for the elf headers at the start
of the file.

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: /proc/kcore - how to fix it
  2003-05-23 19:13 Luck, Tony
  2003-05-23 19:41 ` Andi Kleen
@ 2003-05-23 21:11 ` Russell King
  1 sibling, 0 replies; 10+ messages in thread
From: Russell King @ 2003-05-23 21:11 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel, linux-ia64, ak, David Mosberger, Andrew Morton

On Fri, May 23, 2003 at 12:13:04PM -0700, Luck, Tony wrote:
> Not a lot of replies to this on the list, but I did notice
> that the lse-tech IRC discussion mentioned me by name, saying
> that I had a proposal for a fix ... and Andrew has stuck my
> name against this in the must-fix list for 2.6.  So, not wanting
> to be the person who is holding up 2.6 ... here's what I am
> proposing.

This is what I currently have hacked into kcore.c which works (but
*will* break for non-ARM stuff.)

I suspect the easiest thing may be to arrange for the discontig direct
mapped memory blocks to appear on the vmlist and then remove the special
case for the direct mapped RAM.  I don't see why architecture support
needs to come into the picture really.

I don't believe any races here are that important (except of course
ensuring that we produce consistent data for a particular read() and
not oopsing the kernel) - take a moment to think where the information
/proc/kcore provides ends up and realise that as soon as it hits
userspace, you can't rely on it.  eg, Today, if you're using it and
you insert a module, the structure of the file changes, and gdb's
idea of offsets of data into the "core" becomes invalid.

--- orig/fs/proc/kcore.c	Sat Nov  2 18:58:18 2002
+++ linux/fs/proc/kcore.c	Mon May 19 23:30:41 2003
@@ -99,7 +99,13 @@
 }
 #else /* CONFIG_KCORE_AOUT */
 
+#define KCORE_BASE	(PAGE_OFFSET - 0x01000000)
+#define in_vmlist_region(x) (((x) >= KCORE_BASE && (x) < PAGE_OFFSET) || ((x) >= VMALLOC_START && (x) < VMALLOC_END))
+
+#ifndef KCORE_BASE
 #define	KCORE_BASE	PAGE_OFFSET
+#define in_vmlist_region(x) ((x) >= VMALLOC_START && (x) < VMALLOC_END)
+#endif
 
 #define roundup(x, y)  ((((x)+((y)-1))/(y))*(y))
 
@@ -394,7 +400,7 @@
 		tsz = buflen;
 		
 	while (buflen) {
-		if ((start >= VMALLOC_START) && (start < VMALLOC_END)) {
+		if (in_vmlist_region(start)) {
 			char * elf_buf;
 			struct vm_struct *m;
 			unsigned long curstart = start;


-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: /proc/kcore - how to fix it
  2003-05-23 20:52 Luck, Tony
@ 2003-05-23 21:39 ` Andi Kleen
  0 siblings, 0 replies; 10+ messages in thread
From: Andi Kleen @ 2003-05-23 21:39 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Andi Kleen, linux-kernel, linux-ia64, rmk, davidm, akpm

On Fri, May 23, 2003 at 01:52:25PM -0700, Luck, Tony wrote:
> > /dev/mem / dev/kmem has the same problem, it could use that too.
> 
> Hmmm ... so "kclist" needs to be globally visible instead of static,
> probably needs to be maintained by the mem driver rather than kcore.c
> (which might not be configured) ... and would need a new name to
> reflect its many uses (kvmlist?)

One alternative I considered was to use just do a page table lookup.
But I fear that some architectures use direct mapping registers etc.
with mappings not in the page tables for the direct mapping, so it 
probably won't work for everybody.

>  
> > > Other blocks of kernel virtual space can be added as needed, and
> > > removed again (with kclist_del()).  E.g. discontiguous memory
> > 
> > Remove could get racy - /proc/kcore can sleep while accessing such
> > a block. You would need a sleeping lock hold all the time.
> > 
> > What would you need remove for?
> 
> Someday we'll support memory hot-add and hot-remove.  But in the
> more immediate future I think that arch/arm allocates space for
> modules outside of vmalloc-land ... so might want to add space to
> the list on module insert, and remove at rmmod time.

x86-64 does that too. My prefered solution would be to to just handle
the exception when this happens and thread module / vmalloc area as a 
big chunk. But it would probably require too much architecture
specific code again to be practical (on many archs you can just use
__copy_*_user, but some do funky things in there and it won't work
for them)

Ignoring that the choices are either: memcpy to temporary buffer with 
spinlock hold or a semaphore over the copy_to_user.

> 
> Races are a problem ... I'm just not sure how big of a problem.  The
> virtual address to offset mapping stuff below is set up so that the
> offsets of sections in the virtual /proc/kcore file do not change as
> sections appear/disappear (just like the existing kcore code). So
> if you are accessing some vmalloc'd structure and the kernel vfree()s
> some other structure, then you won't get hurt.  But opening /proc/kcore
> and reading the headers doesn't make any promises that memory listed
> in an elf_phdr will still be there by the time you lseek and read,
> which is no different from the existing implementation.

What I'm worrying about is that the kernel will oops when accessing
unmapped memory. That certainly should not happen.

> /proc/kcore is a bit different because it's trying to present
> a regular file view, rather than a char-special file view to
> any tool that wants to use it.  If someone fixes up gdb, objdump,
> readelf, etc. then the macros can be easily removed to provide 1:1
> (though even then it isn't quite 1:1 ... offset in file would be
> "vaddr + elf_buflen" to allow space for the elf headers at the start
> of the file.

You're doing this to handle tools that have signedness bugs while
parsing core files? iirc gdb is clean. What other tools have the 
problem? 

-Andi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: /proc/kcore - how to fix it
@ 2003-05-23 23:43 Luck, Tony
  0 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2003-05-23 23:43 UTC (permalink / raw)
  To: Russell King; +Cc: linux-kernel, linux-ia64, ak, David Mosberger, Andrew Morton

> I suspect the easiest thing may be to arrange for the discontig direct
> mapped memory blocks to appear on the vmlist and then remove 
> the special
> case for the direct mapped RAM.  I don't see why architecture support
> needs to come into the picture really.

I did experiment with adding the architectural weird stuff to the vmlist.
But I wasn't entirely happy with it.  There were two choices:

1) Add any random thing to vmlist and fixup /proc/kcore code to check
   for other stuff outside the VMALLOC_START,VMALLOC_END range (as you
   suggested last year, an implemented in the patch you showed today).

This feels like it is abusing the vmlist.  Only one piece of code would
actually break (get_vmalloc_info() in fs/proc/proc_misc.c, and it could
easily be fixed).  But it just feels kludgy.

2) Force the random things to be allocated in the VMALLOC_START,VMALLOC_END
   address range.

I tried this, and David complained about esoteric features like kcore
dictating important stuff like kernel layout decisions.  Probably isn't
even an option for most architecture.

> I don't believe any races here are that important (except of course
> ensuring that we produce consistent data for a particular read() and
> not oopsing the kernel) - take a moment to think where the information
> /proc/kcore provides ends up and realise that as soon as it hits
> userspace, you can't rely on it.  eg, Today, if you're using it and
> you insert a module, the structure of the file changes, and gdb's
> idea of offsets of data into the "core" becomes invalid.

Actually inserting/deleting modules won't change the offsets
of other objects inside /proc/kcore (unless you add enough modules
to bump the elf_buflen size of the header to consume an extra page).
The offsets in /proc/kcore are all based on the virtual address
of the object ... not on the number/size of any preceeding objects,
so the /proc/kcore file is full of holes in the spots where things
are not currently mapped.
If gdb went back and re-read the elf headers it might be surprised
to find the elf_phdrs were appearing and disappearing ... but offsets
into the rest of the file won't (generally) change.

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: /proc/kcore - how to fix it
@ 2003-05-23 23:51 Luck, Tony
  2003-05-24  7:38 ` Russell King
  0 siblings, 1 reply; 10+ messages in thread
From: Luck, Tony @ 2003-05-23 23:51 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, linux-ia64, rmk, davidm, akpm

> One alternative I considered was to use just do a page table lookup.
> But I fear that some architectures use direct mapping registers etc.
> with mappings not in the page tables for the direct mapping, so it 
> probably won't work for everybody.

You are right.  IA64 maps the kernel with some locked registers, so
there are no pagetables to show that the mapping exists.
 
> What I'm worrying about is that the kernel will oops when accessing
> unmapped memory. That certainly should not happen.

Agreed.  Oops is always a bad thing.

> > /proc/kcore is a bit different because it's trying to present
> > a regular file view, rather than a char-special file view to
> > any tool that wants to use it.  If someone fixes up gdb, objdump,
> > readelf, etc. then the macros can be easily removed to provide 1:1
> > (though even then it isn't quite 1:1 ... offset in file would be
> > "vaddr + elf_buflen" to allow space for the elf headers at the start
> > of the file.
> 
> You're doing this to handle tools that have signedness bugs while
> parsing core files? iirc gdb is clean. What other tools have the 
> problem? 

I don't know ... you'll have to dust off those fixes for /proc to let
the negative file offsets get as far as the kcore.c code so we can
see what utilities work.  In practice we probably don't care about
anything other than gdb.

-Tony

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: /proc/kcore - how to fix it
  2003-05-23 23:51 /proc/kcore - how to fix it Luck, Tony
@ 2003-05-24  7:38 ` Russell King
  0 siblings, 0 replies; 10+ messages in thread
From: Russell King @ 2003-05-24  7:38 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Andi Kleen, linux-kernel, linux-ia64, davidm, akpm

On Fri, May 23, 2003 at 04:51:43PM -0700, Luck, Tony wrote:
> > One alternative I considered was to use just do a page table lookup.
> > But I fear that some architectures use direct mapping registers etc.
> > with mappings not in the page tables for the direct mapping, so it 
> > probably won't work for everybody.
> 
> You are right.  IA64 maps the kernel with some locked registers, so
> there are no pagetables to show that the mapping exists.

ARM maps the kernel direct-mapped RAM using 1MB section mappings, which
the normal pgd/pmd/pte macros don't recognise as being valid.

> I don't know ... you'll have to dust off those fixes for /proc to let
> the negative file offsets get as far as the kcore.c code so we can
> see what utilities work.  In practice we probably don't care about
> anything other than gdb.

gdb definitely breaks - that's why I had to do the changes in the first
place.  gdb tries to lseek to negative 64-bit file offsets, which the
kernel rejects with EINVAL iirc.  (Tried it earlier this week.)

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2003-05-24  7:25 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-05-23 23:51 /proc/kcore - how to fix it Luck, Tony
2003-05-24  7:38 ` Russell King
  -- strict thread matches above, loose matches on Subject: below --
2003-05-23 23:43 Luck, Tony
2003-05-23 20:52 Luck, Tony
2003-05-23 21:39 ` Andi Kleen
2003-05-23 19:13 Luck, Tony
2003-05-23 19:41 ` Andi Kleen
2003-05-23 21:11 ` Russell King
2003-05-20 20:05 Luck, Tony
2003-05-20 20:30 ` Valdis.Kletnieks

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox