/proc/$pid/pagemap troubles

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* /proc/$pid/pagemap troubles
@ 2007-07-31 20:36 Dave Hansen
  2007-07-31 21:37 ` Matt Mackall
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2007-07-31 20:36 UTC (permalink / raw)
  To: Matt Mackall; +Cc: linux-kernel@vger.kernel.org

Since the pagemap code has a little header on it to help describe the
format, I wrote a little c program to parse its output.  I get some
strange results.  If I do this:

	fd = open("/proc/1/pagemap", O_RDONLY);
	count = read(fd, &endianness, 1);

count will always be 4.  

hexdump gets similar, but even worse results:

        qemu:~# strace hexdump -C /proc/self/pagemap 
        ...
        read(0, "\1\f\4\4\377\377\377\377\377\377\377\377\377\377\377\377"..., 16) = 20
        read(0, 0x804d39c, 4294967292)          = -1 EFAULT (Bad address)
        --- SIGSEGV (Segmentation fault) @ 0 (0) ---
        +++ killed by SIGSEGV +++

Note that the kernel returns 20 to the read request of 16.  I think the
kernel is actually copying over something important in hexdump's memory
which is adjacent to the buffer and causing it to segfault.

The code is basically organized not to output the right thing for any
unaligned access, and it apparently gets confused about exactly what
userspace has asked for.  I think this is largely due to its overwriting
of "count" in pagemap_read().

So, a couple of questions.  Don't we need to support non-sizeof(unsigned
long)-aligned reads?

Do we _really_ need that header in each and every file?

> * first byte:   0 for big endian, 1 for little

Do we ever have cases where userspace and kernel differ in their
endianness?  Or, are you hoping to dump these files raw on one
architecture and parse them on another?  I would think it makes more
sense to get this from elsewhere.  Or, should we just output in network
byte order and be done with it?

If anything, we could put it in /proc/$pid/status.

> * second byte:  page shift (eg 12 for 4096 byte pages)

This might actually (in theory) change on a per-process basis, so it
makes sense.  But, it seems more global to the process that just pagemap
output.  Would this always be the same as getpagesize()?   Or, should it
always map 1:1 with the amount of memory mapped by a kernel pte_t.  I
_think_ these can be slightly different because we have 64k PAGE_SIZE on
ppc64, but allow mappings to happen in 4k 

> * third byte:   entry size in bytes (currently either 4 or 8)

This one really boils down to "what is the kernel's sizeof(unsigned
long)" because we'll always store pfns in those.  It seems like we
should have a better way to go fetch that.

> * fourth byte:  header size

If we can get rid of the other three this, of course, goes away.

-- Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-07-31 20:36 /proc/$pid/pagemap troubles Dave Hansen
@ 2007-07-31 21:37 ` Matt Mackall
  2007-07-31 22:43   ` Dave Hansen
  2007-07-31 22:58   ` Andreas Schwab
  0 siblings, 2 replies; 7+ messages in thread
From: Matt Mackall @ 2007-07-31 21:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel@vger.kernel.org

On Tue, Jul 31, 2007 at 01:36:14PM -0700, Dave Hansen wrote:
> Since the pagemap code has a little header on it to help describe the
> format, I wrote a little c program to parse its output.  I get some
> strange results.  If I do this:
> 
> 	fd = open("/proc/1/pagemap", O_RDONLY);
> 	count = read(fd, &endianness, 1);
> 
> count will always be 4.  

Known bug, fixed in my pending and not-currently-working update. It
ought to return 0 for short reads.
 
> hexdump gets similar, but even worse results:
>         
>         qemu:~# strace hexdump -C /proc/self/pagemap 
>         ...
>         read(0, "\1\f\4\4\377\377\377\377\377\377\377\377\377\377\377\377"..., 16) = 20
>         read(0, 0x804d39c, 4294967292)          = -1 EFAULT (Bad address)
>         --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>         +++ killed by SIGSEGV +++
> 
> Note that the kernel returns 20 to the read request of 16.  I think the
> kernel is actually copying over something important in hexdump's memory
> which is adjacent to the buffer and causing it to segfault.

Also fixed.

> The code is basically organized not to output the right thing for any
> unaligned access, and it apparently gets confused about exactly what
> userspace has asked for.  I think this is largely due to its overwriting
> of "count" in pagemap_read().
> 
> So, a couple of questions.  Don't we need to support non-sizeof(unsigned
> long)-aligned reads?

Why? We should obviously never return more data than we were asked for
(that's clearly a bug), but lots of things refuse to read or write
stuff that isn't well sized and aligned.

> Do we _really_ need that header in each and every file?

Well there's either a header or there isn't.

> > * first byte:   0 for big endian, 1 for little
> 
> Do we ever have cases where userspace and kernel differ in their
> endianness?  Or, are you hoping to dump these files raw on one
> architecture and parse them on another?

Potentially, yes.

> > * second byte:  page shift (eg 12 for 4096 byte pages)
> 
> This might actually (in theory) change on a per-process basis, so it
> makes sense.  But, it seems more global to the process that just pagemap
> output.  Would this always be the same as getpagesize()?   Or, should it
> always map 1:1 with the amount of memory mapped by a kernel pte_t.  I
> _think_ these can be slightly different because we have 64k PAGE_SIZE on
> ppc64, but allow mappings to happen in 4k 
> 
> > * third byte:   entry size in bytes (currently either 4 or 8)
> 
> This one really boils down to "what is the kernel's sizeof(unsigned
> long)" because we'll always store pfns in those.  It seems like we
> should have a better way to go fetch that.
> 
> > * fourth byte:  header size
> 
> If we can get rid of the other three this, of course, goes away.

True. But the variable-sized header lets us add other stuff later.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-07-31 21:37 ` Matt Mackall
@ 2007-07-31 22:43   ` Dave Hansen
  2007-08-01  0:14     ` Matt Mackall
  2007-07-31 22:58   ` Andreas Schwab
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2007-07-31 22:43 UTC (permalink / raw)
  To: Matt Mackall; +Cc: linux-kernel@vger.kernel.org

On Tue, 2007-07-31 at 16:37 -0500, Matt Mackall wrote:
> 
> > So, a couple of questions.  Don't we need to support
> non-sizeof(unsigned
> > long)-aligned reads?
> 
> Why? We should obviously never return more data than we were asked for
> (that's clearly a bug), but lots of things refuse to read or write
> stuff that isn't well sized and aligned. 

For the 1-byte stuff at the beginning of the file, I think it makes
perfect sense:

	char __big_endian;
	int big_endian;
	read(fd, &__big_endian, 1);
	big_endian = __big_endian;

> > Do we _really_ need that header in each and every file?
> 
> Well there's either a header or there isn't.

I just mean that perhaps we should be putting some of that stuff in a
global file.  I think there's some added simplicity both in the kernel
and in the userspace programs if we just let "*ppos << PAGE_SHIFT ==
vaddr".

I was going to send you a patch, but since you're updating anyway, could
you add some symbolic names like this:

#define PAGEMAP_ENTRY_SIZE_BYTES        sizeof(unsigned long)
#define PAGEMAP_HEADER_SIZE_BYTES       sizeof(unsigned long)
#define PAGEMAP_MAX_VIRT_PFN            (((~0UL) >> PAGE_SHIFT) + 1)
#define PAGEMAP_BUFFER_SLOTS		(PAGE_SIZE/PAGEMAP_ENTRY_SIZE_BYTES)
#define PAGEMAP_UNPOPULATED_PAGE	(-1UL)

+       evpfn = min((src + count) / sizeof(unsigned long),
+                   ((~0UL) >> PAGE_SHIFT) + 1);

Should that hunk of code be any different for 32-bit processes on a
64-bit kernel?  Do we want to use TASK_SIZE_OF() or restrict these to
actual virtual address space (which is only 48 bits on x86_64).

It also seems that we need some replacement for the CONFIG_HIGHPTE
#ifdefs.  What is the basic reason for those?  The copy_to_user() called
by add_to_pagemap()->flush_pagemap() conflicts with the kmap_atomic()
from pte_offset_map()?

Perhaps we should have the page table walker code return -EAGAIN when it
needs to flush.  Then, it will fall all the way out to read_pagemap()
where we could do the actual flush outside of the kmaps and restart the
walker.

Or, we could just call the walker only for ranges that we know will fit
into the buffer that we have.  If there was no header on the file, we
could determine this completely by the address that we were currently
walking.

As I think about it, do we really even need to walk the VMA list at all?
We don't do anything differently if an address is inside a VMA.  We just
save time by not walking the pagetables for areas that we know are
zeroed.  But, we won't be able to walk down to the PTE level in most
cases, anyway.  So, it probably won't waste _that_ much time.

+struct pagemapread {
+       struct mm_struct *mm;
+       unsigned long next;
+       unsigned long *buf;
+       pte_t *ptebuf;
+       unsigned long pos;
+       size_t count;
+       int index;
+       char __user *out;
+};

Just looking at the structure, it's _really_ hard to tell what fields go
with other fields, and it took me a while to unravel it all.  There's
also some redundancy in it, which I find a bit confusing.

"->next" shouldn't be necessary at all, unless the page table walker is
called twice with overlapping page ranges.  I modified pagemap_fill() to
return the number of pagemap entries written.  It uses "->next" heavily
but doesn't really need to if you simply pass an address range into it.

"->pos" isn't necessary as long as we keep an original copy of the "buf"
pagemap_read() argument around.

"->index" is necessary, but goes with the "->buf" field and should
probably go next to it.  The same goes for "->count" and "->out".

I ended up with a structure that looks like this:

struct pagemapread {
        struct mm_struct *mm;
        pte_t *ptebuf;

        unsigned long *buf;
        int next_buf_slot;
        size_t slots_to_fill; /* not strictly necessary with the
                               * "bytes_to_write" field here */

        /*
         * We need this because the page table walking
         * code doesn't really tell us how far it walked.
         * It is internal and not to be used otherwise.
         */
        unsigned long __last_vaddr_filled;

        char __user *user_buffer;
        size_t userspace_bytes_left_to_write;
};

-- Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-07-31 21:37 ` Matt Mackall
  2007-07-31 22:43   ` Dave Hansen
@ 2007-07-31 22:58   ` Andreas Schwab
  2007-07-31 23:06     ` Dave Hansen
  1 sibling, 1 reply; 7+ messages in thread
From: Andreas Schwab @ 2007-07-31 22:58 UTC (permalink / raw)
  To: Matt Mackall; +Cc: Dave Hansen, linux-kernel@vger.kernel.org

Matt Mackall <mpm@selenic.com> writes:

> On Tue, Jul 31, 2007 at 01:36:14PM -0700, Dave Hansen wrote:
>> Since the pagemap code has a little header on it to help describe the
>> format, I wrote a little c program to parse its output.  I get some
>> strange results.  If I do this:
>> 
>> 	fd = open("/proc/1/pagemap", O_RDONLY);
>> 	count = read(fd, &endianness, 1);
>> 
>> count will always be 4.  
>
> Known bug, fixed in my pending and not-currently-working update. It
> ought to return 0 for short reads.

That's not a good choice.  Returning 0 means EOF, but there is actually
data to be read.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-07-31 22:58   ` Andreas Schwab
@ 2007-07-31 23:06     ` Dave Hansen
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2007-07-31 23:06 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Matt Mackall, linux-kernel@vger.kernel.org

On Wed, 2007-08-01 at 00:58 +0200, Andreas Schwab wrote:
> Matt Mackall <mpm@selenic.com> writes:
> 
> > On Tue, Jul 31, 2007 at 01:36:14PM -0700, Dave Hansen wrote:
> >> Since the pagemap code has a little header on it to help describe the
> >> format, I wrote a little c program to parse its output.  I get some
> >> strange results.  If I do this:
> >> 
> >> 	fd = open("/proc/1/pagemap", O_RDONLY);
> >> 	count = read(fd, &endianness, 1);
> >> 
> >> count will always be 4.  
> >
> > Known bug, fixed in my pending and not-currently-working update. It
> > ought to return 0 for short reads.
> 
> That's not a good choice.  Returning 0 means EOF, but there is actually
> data to be read.

This should actually be pretty easy to fix.  We have a nice PAGE_SIZE
buffer.  So, if we are unaligned and would have overflowed the PAGE_SIZE
buffer, we return a short read.

If they ask for a <sizeof(unsigned long) read, we can copy that into the
PAGE_SIZE buffer, then just copy_to_user() the portion that was asked
for.  

-- Dave


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-07-31 22:43   ` Dave Hansen
@ 2007-08-01  0:14     ` Matt Mackall
  2007-08-01 16:06       ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Matt Mackall @ 2007-08-01  0:14 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel@vger.kernel.org

On Tue, Jul 31, 2007 at 03:43:10PM -0700, Dave Hansen wrote:
> On Tue, 2007-07-31 at 16:37 -0500, Matt Mackall wrote:
> > 
> > > So, a couple of questions.  Don't we need to support
> > non-sizeof(unsigned
> > > long)-aligned reads?
> > 
> > Why? We should obviously never return more data than we were asked for
> > (that's clearly a bug), but lots of things refuse to read or write
> > stuff that isn't well sized and aligned. 
> 
> For the 1-byte stuff at the beginning of the file, I think it makes
> perfect sense:
> 
> 	char __big_endian;
> 	int big_endian;
> 	read(fd, &__big_endian, 1);
> 	big_endian = __big_endian;

True.
 
> > > Do we _really_ need that header in each and every file?
> > 
> > Well there's either a header or there isn't.
> 
> I just mean that perhaps we should be putting some of that stuff in a
> global file.  I think there's some added simplicity both in the kernel
> and in the userspace programs if we just let "*ppos << PAGE_SHIFT ==
> vaddr".

Absolutely. It's the source of both of the long read bugs you
mentioned. But there are also advantages to having it attached.

> I was going to send you a patch, but since you're updating anyway, could
> you add some symbolic names like this:
> 
> #define PAGEMAP_ENTRY_SIZE_BYTES        sizeof(unsigned long)
> #define PAGEMAP_HEADER_SIZE_BYTES       sizeof(unsigned long)
> #define PAGEMAP_MAX_VIRT_PFN            (((~0UL) >> PAGE_SHIFT) + 1)
> #define PAGEMAP_BUFFER_SLOTS		(PAGE_SIZE/PAGEMAP_ENTRY_SIZE_BYTES)
> #define PAGEMAP_UNPOPULATED_PAGE	(-1UL)

I suppose.

> +       evpfn = min((src + count) / sizeof(unsigned long),
> +                   ((~0UL) >> PAGE_SHIFT) + 1);
> 
> Should that hunk of code be any different for 32-bit processes on a
> 64-bit kernel?  Do we want to use TASK_SIZE_OF() or restrict these to
> actual virtual address space (which is only 48 bits on x86_64).

No. A page for a 32-bit task can line anywhere in the >32-bit physical
address space.

> It also seems that we need some replacement for the CONFIG_HIGHPTE
> #ifdefs.  What is the basic reason for those?  The copy_to_user() called
> by add_to_pagemap()->flush_pagemap() conflicts with the kmap_atomic()
> from pte_offset_map()?

My not-yet-working code uses get_user_pages and does away with all the
double-buffering and a bunch of the locking issues. 

> As I think about it, do we really even need to walk the VMA list at all?
> We don't do anything differently if an address is inside a VMA.  We just
> save time by not walking the pagetables for areas that we know are
> zeroed.  But, we won't be able to walk down to the PTE level in most
> cases, anyway.  So, it probably won't waste _that_ much time.

Perhaps not.

> +struct pagemapread {
> +       struct mm_struct *mm;
> +       unsigned long next;
> +       unsigned long *buf;
> +       pte_t *ptebuf;
> +       unsigned long pos;
> +       size_t count;
> +       int index;
> +       char __user *out;
> +};
> 
> Just looking at the structure, it's _really_ hard to tell what fields go
> with other fields, and it took me a while to unravel it all.  There's
> also some redundancy in it, which I find a bit confusing.

Most of those go away with the get_user_pages approach.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: /proc/$pid/pagemap troubles
  2007-08-01  0:14     ` Matt Mackall
@ 2007-08-01 16:06       ` Dave Hansen
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2007-08-01 16:06 UTC (permalink / raw)
  To: Matt Mackall; +Cc: linux-kernel@vger.kernel.org

On Tue, 2007-07-31 at 19:14 -0500, Matt Mackall wrote:
> On Tue, Jul 31, 2007 at 03:43:10PM -0700, Dave Hansen wrote:
> > +       evpfn = min((src + count) / sizeof(unsigned long),
> > +                   ((~0UL) >> PAGE_SHIFT) + 1);
> > 
> > Should that hunk of code be any different for 32-bit processes on a
> > 64-bit kernel?  Do we want to use TASK_SIZE_OF() or restrict these to
> > actual virtual address space (which is only 48 bits on x86_64).
> 
> No. A page for a 32-bit task can line anywhere in the >32-bit physical
> address space.

An evpfn is an ending _virtual_ pfn right?  A virtual pfn is a vaddr >>
PAGE_SHIFT, right?  We're not calculating anything having to do with
physical addresses.  Having 64-bit physical addresses would be a reason
to use 64-bits as the entry size, but not to round down this "evpfn".

"src" in this case is "*ppos", which is the offset into the file.
Divide that by the entry size (sizeof(unsigned long)) and you get the
number of pages offset into the vaddr space that the user is attempting
to read.  There might be a bug here with it forgetting to offset the
header size, but oh well.

In any case, what _is_ the ((~0UL) >> PAGE_SHIFT)  operation there for?
Is it to make sure that a user reading at the end of the virtual address
space doesn't ask to read 10 pages starting at 0xfffffff0?

> Most of those go away with the get_user_pages approach.

Are you working on this actively?  If not, I'd be happy to take whatever
you have and fix it up.  Care to post it, even in a slightly unpolished
state?

-- Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-08-01 16:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-31 20:36 /proc/$pid/pagemap troubles Dave Hansen
2007-07-31 21:37 ` Matt Mackall
2007-07-31 22:43   ` Dave Hansen
2007-08-01  0:14     ` Matt Mackall
2007-08-01 16:06       ` Dave Hansen
2007-07-31 22:58   ` Andreas Schwab
2007-07-31 23:06     ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).