* [patch] broken fault_in_pages_readable call in generic_file_buffered_write.
@ 2005-06-03 17:54 Martin Schwidefsky
2005-06-03 18:55 ` Andrew Morton
0 siblings, 1 reply; 3+ messages in thread
From: Martin Schwidefsky @ 2005-06-03 17:54 UTC (permalink / raw)
To: akpm, linux-kernel
[patch] broken fault_in_pages_readable call in generic_file_buffered_write.
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
With one of our test we discovered a problem with the writev system
call that can cause the stack to grow to an insane size.
The following test shows the effect if run with unlimited stack size:
---
#include <stdio.h>
#include <sys/mman.h>
#include <sys/uio.h>
int main(void)
{
char syscmd[32];
struct iovec iov[2];
unsigned long maddr;
char *ptr;
FILE *fp;
int fd;
fp = tmpfile();
fd = fileno(fp);
maddr = ((unsigned long) &syscmd - 0x2000000) & ~0xfffffUL;
ptr = mmap((void *) maddr, 32768, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
memset(ptr, 0, 32768);
sprintf(syscmd, "/bin/cat /proc/%i/maps", getpid());
system(syscmd);
printf("\n");
iov[0] = (struct iovec) { ptr + (32768 - 64), 32 };
iov[1] = (struct iovec) { ptr + 16384, 1024 };
writev(fd, iov, 2);
system(syscmd);
fclose(fp);
return 0;
}
---
The output you'll get is two dumps of /proc/<pid>/maps, on my i386 system
it looked like this:
08048000-08049000 r-xp 00000000 03:07 847321 /home/sky/source/writevtest
08049000-0804a000 rwxp 00000000 03:07 847321 /home/sky/source/writevtest
0804a000-0806b000 rwxp 0804a000 00:00 0 [heap]
40000000-40016000 r-xp 00000000 03:06 1212530 /lib/ld-2.3.2.so
40016000-40017000 rwxp 00015000 03:06 1212530 /lib/ld-2.3.2.so
40017000-40019000 rwxp 40017000 00:00 0
4002b000-40155000 r-xp 00000000 03:06 557207 /lib/tls/libc-2.3.2.so
40155000-4015e000 rwxp 00129000 03:06 557207 /lib/tls/libc-2.3.2.so
4015e000-40161000 rwxp 4015e000 00:00 0
bdf00000-bdf08000 rwxp bdf00000 00:00 0
bffa7000-bffbd000 rwxp bffa7000 00:00 0 [stack]
ffffe000-fffff000 ---p 00000000 00:00 0 [vdso]
08048000-08049000 r-xp 00000000 03:07 847321 /home/sky/source/writevtest
08049000-0804a000 rwxp 00000000 03:07 847321 /home/sky/source/writevtest
0804a000-0806b000 rwxp 0804a000 00:00 0 [heap]
40000000-40016000 r-xp 00000000 03:06 1212530 /lib/ld-2.3.2.so
40016000-40017000 rwxp 00015000 03:06 1212530 /lib/ld-2.3.2.so
40017000-4001a000 rwxp 40017000 00:00 0
4002b000-40155000 r-xp 00000000 03:06 557207 /lib/tls/libc-2.3.2.so
40155000-4015e000 rwxp 00129000 03:06 557207 /lib/tls/libc-2.3.2.so
4015e000-40161000 rwxp 4015e000 00:00 0
bdf00000-bdf08000 rwxp bdf00000 00:00 0
bdf08000-bffbd000 rwxp bdf08000 00:00 0 [stack]
ffffe000-fffff000 ---p 00000000 00:00 0 [vdso]
The stack segment grew from bffa7000-c0000000 to bdf08000-c0000000
by a perfectly valid writev call. That should not happen.
This is caused by an invalid size on the fault_in_pages_readable call
in generic_file_buffered_write. The length of the passed buffer needs
to be clipped to the maximum size of the current iov. The attached
patch tries to solve this problem. While looking at this I wondered
about another potential issue, fault_in_pages_readable faults the
pages of the iov into memory by using __get_user(). Nobody has made
sure that the page really stays in memory. While it is unlikely that
it gets removed before generic_file_buffered_write has done its jobs,
in theory on a virtual system that runs under extreme memory pressure
it can happen that the page is reclaimed immediatly. So the race that
is mentioned in the comment is not really fixed...
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
diffstat:
mm/filemap.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff -urpN linux-2.6/mm/filemap.c linux-2.6-patched/mm/filemap.c
--- linux-2.6/mm/filemap.c 2005-06-03 16:25:21.000000000 +0200
+++ linux-2.6-patched/mm/filemap.c 2005-06-03 16:25:38.000000000 +0200
@@ -1968,6 +1968,7 @@ generic_file_buffered_write(struct kiocb
do {
unsigned long index;
unsigned long offset;
+ unsigned long maxlen;
size_t copied;
offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
@@ -1982,7 +1983,10 @@ generic_file_buffered_write(struct kiocb
* same page as we're writing to, without it being marked
* up-to-date.
*/
- fault_in_pages_readable(buf, bytes);
+ maxlen = cur_iov->iov_len - iov_base;
+ if (maxlen > bytes)
+ maxlen = bytes;
+ fault_in_pages_readable(buf, maxlen);
page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
if (!page) {
@@ -2024,7 +2028,8 @@ generic_file_buffered_write(struct kiocb
filemap_set_next_iovec(&cur_iov,
&iov_base, status);
buf = cur_iov->iov_base + iov_base;
- }
+ } else
+ iov_base += status;
}
}
if (unlikely(copied != bytes))
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [patch] broken fault_in_pages_readable call in generic_file_buffered_write.
2005-06-03 17:54 [patch] broken fault_in_pages_readable call in generic_file_buffered_write Martin Schwidefsky
@ 2005-06-03 18:55 ` Andrew Morton
2005-06-06 9:06 ` Martin Schwidefsky
0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2005-06-03 18:55 UTC (permalink / raw)
To: Martin Schwidefsky; +Cc: linux-kernel
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> The stack segment grew from bffa7000-c0000000 to bdf08000-c0000000
> by a perfectly valid writev call. That should not happen.
>
> This is caused by an invalid size on the fault_in_pages_readable call
> in generic_file_buffered_write. The length of the passed buffer needs
> to be clipped to the maximum size of the current iov.
Gad that code has become opaque - I need to stare at it for half an hour.
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>
> diffstat:
> mm/filemap.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff -urpN linux-2.6/mm/filemap.c linux-2.6-patched/mm/filemap.c
> --- linux-2.6/mm/filemap.c 2005-06-03 16:25:21.000000000 +0200
> +++ linux-2.6-patched/mm/filemap.c 2005-06-03 16:25:38.000000000 +0200
> @@ -1968,6 +1968,7 @@ generic_file_buffered_write(struct kiocb
> do {
> unsigned long index;
> unsigned long offset;
> + unsigned long maxlen;
> size_t copied;
>
> offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
> @@ -1982,7 +1983,10 @@ generic_file_buffered_write(struct kiocb
> * same page as we're writing to, without it being marked
> * up-to-date.
> */
> - fault_in_pages_readable(buf, bytes);
> + maxlen = cur_iov->iov_len - iov_base;
> + if (maxlen > bytes)
> + maxlen = bytes;
> + fault_in_pages_readable(buf, maxlen);
Can you explain the bug a bit more completely? AFACIT, `bytes' here was
always in the range 0 .. PAGE_CACHE_SIZE, so how can it have caused large
amounts of the stack segment to have been faulted in?
> While looking at this I wondered
> about another potential issue, fault_in_pages_readable faults the
> pages of the iov into memory by using __get_user(). Nobody has made
> sure that the page really stays in memory. While it is unlikely that
> it gets removed before generic_file_buffered_write has done its jobs,
> in theory on a virtual system that runs under extreme memory pressure
> it can happen that the page is reclaimed immediatly. So the race that
> is mentioned in the comment is not really fixed...
>
yup, that's a "well known" problem and we've never come up with an adequate
solution. It is possible, rarely, to cause that page to get unmapped in
the window which you identify. It is much harder to cause the page to
actually be reclaimed. And it is much harder still to cause a mmap/write
deadlock once the page has been reclaimed. We've been admiring this
problem on and off for four or five years...
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [patch] broken fault_in_pages_readable call in generic_file_buffered_write.
2005-06-03 18:55 ` Andrew Morton
@ 2005-06-06 9:06 ` Martin Schwidefsky
0 siblings, 0 replies; 3+ messages in thread
From: Martin Schwidefsky @ 2005-06-06 9:06 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Andrew Morton <akpm@osdl.org> wrote on 06/03/2005 08:55:12 PM:
> > This is caused by an invalid size on the fault_in_pages_readable call
> > in generic_file_buffered_write. The length of the passed buffer needs
> > to be clipped to the maximum size of the current iov.
>
> Gad that code has become opaque - I need to stare at it for half an hour.
Ask me, I had to find that stupid bug..
> Can you explain the bug a bit more completely? AFACIT, `bytes' here was
> always in the range 0 .. PAGE_CACHE_SIZE, so how can it have caused large
> amounts of the stack segment to have been faulted in?
Sure. If you have a small iov at the end of a vma followed by a large iov
somewhere else then the bytes variable is the number of bytes between the
current file position and the end of the page cache page. That offset is
used in the fault_in_pages_readable call which now will span TWO pages.
For the small iov only one page may get accessed. The second page lies
after the end of the vma. If that vma happens to be the last one before
the stack and the stack size is unlimited then VM_GROWSDOWN will create
a big stack vma. In our case later memory allocations failed because
there wasn't enough virtual address space.
> > While looking at this I wondered
> > about another potential issue, fault_in_pages_readable faults the
> > pages of the iov into memory by using __get_user(). Nobody has made
> > sure that the page really stays in memory. While it is unlikely that
> > it gets removed before generic_file_buffered_write has done its jobs,
> > in theory on a virtual system that runs under extreme memory pressure
> > it can happen that the page is reclaimed immediatly. So the race that
> > is mentioned in the comment is not really fixed...
> >
>
> yup, that's a "well known" problem and we've never come up with an adequate
> solution. It is possible, rarely, to cause that page to get unmapped in
> the window which you identify. It is much harder to cause the page to
> actually be reclaimed. And it is much harder still to cause a mmap/write
> deadlock once the page has been reclaimed. We've been admiring this
> problem on and off for four or five years...
Ok, now there is one admirer more ..
blue skies,
Martin
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2005-06-06 9:06 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-03 17:54 [patch] broken fault_in_pages_readable call in generic_file_buffered_write Martin Schwidefsky
2005-06-03 18:55 ` Andrew Morton
2005-06-06 9:06 ` Martin Schwidefsky
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.