* wrong madvise(MADV_DONTNEED) semantic
@ 2005-06-28 13:43 Samuel Thibault
2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel
2005-06-28 18:16 ` Andy Isaacson
0 siblings, 2 replies; 18+ messages in thread
From: Samuel Thibault @ 2005-06-28 13:43 UTC (permalink / raw)
To: linux-kernel
Hi,
There is something wrong with the current madvise(MADV_DONTNEED)
implementation. Both the manpage and the source code says that
MADV_DONTNEED means that the application does not care about the data,
so it might be thrown away by the kernel. But that's not what posix
says:
http://www.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html
It says that "The posix_madvise() function shall have no effect on the
semantics of access to memory in the specified range". I.e. the data
that was recorded shall be saved!
The current linux implementation of MADV_DONTNEED is rather an
implementation of solaris' MADV_FREE, see its manpage:
http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view
Hence the current madvise_dontneed() implementation could be renamed
into madvise_free() and the appropriate MADV_FREE case be added, while
a new implementation of madvise_dontneed() _needs_ be written. It may
for instance go through the range so as to zap clean pages (since it is
safe), and set dirty pages as being least recently used so that they
will be considered as good candidates for eviction.
(And the manpage should get corrected too)
Regards,
Samuel Thibault
^ permalink raw reply [flat|nested] 18+ messages in thread
* [Patch] Hotfix for Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
@ 2005-06-28 14:38 ` Jörn Engel
2005-06-28 18:16 ` Andy Isaacson
1 sibling, 0 replies; 18+ messages in thread
From: Jörn Engel @ 2005-06-28 14:38 UTC (permalink / raw)
To: Andrew Morton; +Cc: Samuel Thibault, linux-kernel
On Tue, 28 June 2005 15:43:16 +0200, Samuel Thibault wrote:
>
> There is something wrong with the current madvise(MADV_DONTNEED)
> implementation. Both the manpage and the source code says that
> MADV_DONTNEED means that the application does not care about the data,
> so it might be thrown away by the kernel. But that's not what posix
> says:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html
>
> It says that "The posix_madvise() function shall have no effect on the
> semantics of access to memory in the specified range". I.e. the data
> that was recorded shall be saved!
>
> The current linux implementation of MADV_DONTNEED is rather an
> implementation of solaris' MADV_FREE, see its manpage:
> http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view
>
> Hence the current madvise_dontneed() implementation could be renamed
> into madvise_free() and the appropriate MADV_FREE case be added, while
> a new implementation of madvise_dontneed() _needs_ be written. It may
> for instance go through the range so as to zap clean pages (since it is
> safe), and set dirty pages as being least recently used so that they
> will be considered as good candidates for eviction.
Agreed. Unless someone else comes up with something better soon, you
can use below patch.
Akpm, would you merge this?
Jörn
--
The story so far:
In the beginning the Universe was created. This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de>
---
mm/madvise.c | 4 ++++
1 files changed, 4 insertions(+)
--- linux-2.6.12-rc4logfs/mm/madvise.c~madvice_bug 2005-03-04 11:40:50.000000000 +0100
+++ linux-2.6.12-rc4logfs/mm/madvise.c 2005-06-28 16:34:54.000000000 +0200
@@ -97,6 +97,10 @@ static long madvise_willneed(struct vm_a
static long madvise_dontneed(struct vm_area_struct * vma,
unsigned long start, unsigned long end)
{
+ return 0;
+ /* Really ugly bugfix - madvice may never change anything
+ * beyond performance. Dropping dirty pages is right out.
+ */
if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma))
return -EINVAL;
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel
@ 2005-06-28 18:16 ` Andy Isaacson
2005-06-28 18:28 ` Robert Love
2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
1 sibling, 2 replies; 18+ messages in thread
From: Andy Isaacson @ 2005-06-28 18:16 UTC (permalink / raw)
To: Samuel Thibault, linux-kernel
On Tue, Jun 28, 2005 at 03:43:16PM +0200, Samuel Thibault wrote:
> There is something wrong with the current madvise(MADV_DONTNEED)
> implementation. Both the manpage and the source code says that
> MADV_DONTNEED means that the application does not care about the data,
> so it might be thrown away by the kernel.
I agree that the Linux madvise manpage is unclear and should be fixed.
If your interpretation of the problem is correct, then it should be
trivial to write a test program demonstrating the problem. Did you
write the simple test program and run it?
I did, and the results make me think that the Linux implementation does
conform to the POSIX spec you quote. I did not observe any data loss.
So it's just a documentation issue.
Besides, if you read the documentation closely, it does not say what you
think it says.
MADV_DONTNEED
Do not expect access in the near future. (For the time
being, the application is finished with the given range,
so the kernel can free resources associated with it.)
Subsequent accesses of pages in this range will succeed,
but will result either in reloading of the memory contents
from the underlying mapped file (see mmap) or
zero-fill-on-demand pages for mappings without an
underlying file.
You seem to think that "reloading ... from the underlying mapped file"
means that changes are lost, but that's not implied.
Also, the parenthetical near the top of the manpage "(except in the case
of MADV_DONTNEED)" is, AFAICS, wrong. MADV_DONTNEED does not affect
semantics. It looks to me like someone "improved" madvise.2 without
actually understanding what they were talking about.
Below is the test program I used.
-andy
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <sys/stat.h>
typedef unsigned int u32;
void die(char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
exit(1);
}
int check_cookie(char *file, u32 cookie)
{
u32 buf;
int fd;
if((fd = open(file, O_RDONLY, 0)) == -1)
die("%s: %s\n", file, strerror(errno));
if(read(fd, &buf, sizeof(buf)) == -1)
die("read: %s\n", strerror(errno));
close(fd);
return buf == cookie;
}
int dotest(char *file, u32 cookie, int do_msync, int do_madvise)
{
void *p;
int fd, len = 16 * 1024, prot = PROT_READ|PROT_WRITE;
if((fd = open(file, O_RDWR|O_CREAT, 0666)) == -1)
die("%s: %s\n", file, strerror(errno));
if(ftruncate(fd, len) == -1)
die("ftruncate: %s\n", strerror(errno));
if(write(fd, "", 1) == -1)
die("write: %s\n", strerror(errno));
if((p = mmap(0, len, prot, MAP_SHARED, fd, 0)) == MAP_FAILED)
die("mmap: %s\n", strerror(errno));
*(u32 *)p = cookie;
if(do_msync)
if(msync(p, len, MS_SYNC) == -1)
die("msync: %s\n", strerror(errno));
if(do_madvise)
if(madvise(p, len, MADV_DONTNEED) == -1)
die("madvise: %s\n", strerror(errno));
if(close(fd) == -1)
die("close: %s\n", strerror(errno));
printf("c = %08x msync: %s madvise: %s %s\n",
cookie, do_msync ? "YES" : " NO", do_madvise ? "YES" : " NO",
check_cookie(file, cookie) ? "ok" : "FAILED");
}
int main(int argc, char **argv)
{
if(argc != 2) die("usage: %s file\n", argv[0]);
lrand48();
dotest(argv[1], lrand48(), 0, 0);
dotest(argv[1], lrand48(), 1, 0);
dotest(argv[1], lrand48(), 0, 1);
dotest(argv[1], lrand48(), 1, 1);
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 18:16 ` Andy Isaacson
@ 2005-06-28 18:28 ` Robert Love
2005-06-28 18:53 ` Andy Isaacson
2005-06-29 16:34 ` Michael Kerrisk
2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
1 sibling, 2 replies; 18+ messages in thread
From: Robert Love @ 2005-06-28 18:28 UTC (permalink / raw)
To: Andy Isaacson; +Cc: Samuel Thibault, linux-kernel
On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote:
> Besides, if you read the documentation closely, it does not say what you
> think it says.
>
> MADV_DONTNEED
> Do not expect access in the near future. (For the time
> being, the application is finished with the given range,
> so the kernel can free resources associated with it.)
> Subsequent accesses of pages in this range will succeed,
> but will result either in reloading of the memory contents
> from the underlying mapped file (see mmap) or
> zero-fill-on-demand pages for mappings without an
> underlying file.
>
> You seem to think that "reloading ... from the underlying mapped file"
> means that changes are lost, but that's not implied.
This wording _does_ imply that changes are lost if the file is mapped
writable and not mysnc'ed or if the memory mapping is anonymous.
In the former, changes are dropped and the file is reread from the stale
on-disk copy. In the latter case, the data is dropped and the pages are
zero-filled on access.
Robert Love
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 18:28 ` Robert Love
@ 2005-06-28 18:53 ` Andy Isaacson
2005-06-28 19:23 ` Robert Love
2005-06-29 16:34 ` Michael Kerrisk
1 sibling, 1 reply; 18+ messages in thread
From: Andy Isaacson @ 2005-06-28 18:53 UTC (permalink / raw)
To: Robert Love; +Cc: Samuel Thibault, linux-kernel
On Tue, Jun 28, 2005 at 02:28:20PM -0400, Robert Love wrote:
> On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote:
> > Besides, if you read the documentation closely, it does not say what you
> > think it says.
> >
> > MADV_DONTNEED
> > Do not expect access in the near future. (For the time
> > being, the application is finished with the given range,
> > so the kernel can free resources associated with it.)
> > Subsequent accesses of pages in this range will succeed,
> > but will result either in reloading of the memory contents
> > from the underlying mapped file (see mmap) or
> > zero-fill-on-demand pages for mappings without an
> > underlying file.
> >
> > You seem to think that "reloading ... from the underlying mapped file"
> > means that changes are lost, but that's not implied.
>
> This wording _does_ imply that changes are lost
I contest your interpretation of the manpage; while it could be read
the way you suggest, I claim that because Linux mmap is inherently
coherent (as opposed to, for example, AIX 4.1 mmap) then the "underlying
file" already contains the updated contents, and ergo msync is not
required for correct MAP_SHARED semantics on Linux, and the manpage as
it stands is (misleading, but) both accurate to the 2.6.11
implementation and compliant with the POSIX description posted earlier.
> if the file is mapped writable and not mysnc'ed
This is the case that my posted example code exercises, and I did not
see any problems. Is there some additional circumstance that is
necessary to cause it to break? (I tested on 2.6.11-rc5 or something
close to that.)
> or if the memory mapping is anonymous.
>
> In the latter case, the data is dropped and the pages are
> zero-filled on access.
Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have
to write the testcase for that...
I think the correct docs fix is to simply delete the misleading parts of
madvise.2 so that it reads
MADV_DONTNEED
Do not expect access in the near future. (For the time
being, the application is finished with the given range,
so the kernel can free resources associated with it.)
and remove the erroneous parenthetical in the first paragraph.
... unless, of course, someone can actually demonstrate a case where
madvise results in differing semantics...
-andy
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 18:16 ` Andy Isaacson
2005-06-28 18:28 ` Robert Love
@ 2005-06-28 18:54 ` Samuel Thibault
1 sibling, 0 replies; 18+ messages in thread
From: Samuel Thibault @ 2005-06-28 18:54 UTC (permalink / raw)
To: Andy Isaacson; +Cc: linux-kernel
Andy Isaacson, le Tue 28 Jun 2005 11:16:20 -0700, a écrit :
> If your interpretation of the problem is correct, then it should be
> trivial to write a test program demonstrating the problem. Did you
> write the simple test program and run it?
I indeed didn't, trusting both the man page, the source code comments,
and my knowledge of zap_page_range().
> MADV_DONTNEED
> Do not expect access in the near future. (For the time
> being, the application is finished with the given range,
> so the kernel can free resources associated with it.)
> Subsequent accesses of pages in this range will succeed,
> but will result either in reloading of the memory contents
> from the underlying mapped file (see mmap) or
> zero-fill-on-demand pages for mappings without an
> underlying file.
>
> You seem to think that "reloading ... from the underlying mapped file"
> means that changes are lost, but that's not implied.
I didn't say anything precise. What mostly feared me was the
"zero-fill-on-demand pages for mappings without an underlying file."
> Below is the test program I used.
It does indeed work, but this is no proof. It your testcase it does
indeed work, since the page still remains in the page cache (it's a
shared mapping of the file). But now try this one. It uses private
mappings, and fails as expected, getting 0 in the ANONYMOUS case, and
the original file value in the file mapping case.
Regards,
Samuel Thibault
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <sys/stat.h>
typedef unsigned int u32;
void die(char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
vfprintf(stderr, fmt, ap);
va_end(ap);
exit(1);
}
int check_cookie(char *file, u32 cookie)
{
u32 buf;
int fd;
if((fd = open(file, O_RDONLY, 0)) == -1)
die("%s: %s\n", file, strerror(errno));
if(read(fd, &buf, sizeof(buf)) == -1)
die("read: %s\n", strerror(errno));
close(fd);
return buf == cookie;
}
int dotest(char *file, u32 cookie, int do_anonymous, int do_msync, int do_madvise)
{
void *p;
int fd, len = 16 * 1024, prot = PROT_READ|PROT_WRITE;
u32 newcookie;
if (!do_anonymous) {
if((fd = open(file, O_RDWR|O_CREAT, 0666)) == -1)
die("%s: %s\n", file, strerror(errno));
if(ftruncate(fd, len) == -1)
die("ftruncate: %s\n", strerror(errno));
if(write(fd, "", 1) == -1)
die("write: %s\n", strerror(errno));
}
if((p = mmap(0, len, prot, (do_anonymous ? MAP_ANONYMOUS : 0) | MAP_PRIVATE, do_anonymous ? -1 : fd, 0)) == MAP_FAILED)
die("mmap: %s\n", strerror(errno));
*(u32 *)p = cookie;
if(do_msync)
if(msync(p, len, MS_SYNC) == -1)
die("msync: %s\n", strerror(errno));
if(do_madvise)
if(madvise(p, len, MADV_DONTNEED) == -1)
die("madvise: %s\n", strerror(errno));
newcookie = *(u32 *)p;
printf("c = %08x msync: %s madvise: %s %s\n",
cookie, do_msync ? "YES" : " NO", do_madvise ? "YES" : " NO",
newcookie == cookie ? "ok" : "FAILED");
}
int main(int argc, char **argv)
{
if(argc != 2) die("usage: %s file\n", argv[0]);
lrand48();
dotest(argv[1], lrand48(), 0, 0, 0);
dotest(argv[1], lrand48(), 0, 1, 0);
dotest(argv[1], lrand48(), 0, 0, 1);
dotest(argv[1], lrand48(), 0, 1, 1);
dotest(argv[1], lrand48(), 1, 0, 0);
dotest(argv[1], lrand48(), 1, 1, 0);
dotest(argv[1], lrand48(), 1, 0, 1);
dotest(argv[1], lrand48(), 1, 1, 1);
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 18:53 ` Andy Isaacson
@ 2005-06-28 19:23 ` Robert Love
2005-06-28 19:41 ` Samuel Thibault
2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk
0 siblings, 2 replies; 18+ messages in thread
From: Robert Love @ 2005-06-28 19:23 UTC (permalink / raw)
To: Andy Isaacson; +Cc: Samuel Thibault, linux-kernel
On Tue, 2005-06-28 at 11:53 -0700, Andy Isaacson wrote:
> I contest your interpretation of the manpage; while it could be read
> the way you suggest, I claim that because Linux mmap is inherently
> coherent (as opposed to, for example, AIX 4.1 mmap) then the "underlying
> file" already contains the updated contents, and ergo msync is not
> required for correct MAP_SHARED semantics on Linux, and the manpage as
> it stands is (misleading, but) both accurate to the 2.6.11
> implementation and compliant with the POSIX description posted earlier.
Well, there is nothing guaranteeing (either in the Linux implementation
or the code) that the in-memory changes are synced back to disk. You
have to call msyc().
> > if the file is mapped writable and not mysnc'ed
>
> This is the case that my posted example code exercises, and I did not
> see any problems. Is there some additional circumstance that is
> necessary to cause it to break? (I tested on 2.6.11-rc5 or something
> close to that.)
Yah, I am not--at all--talking about actual behavior. Just that the
wording definitely says, its kind of been a common belief, that
MADV_DONTNEED literally ditches your data. If you need it, don't call
MADV_DONTNEED.
> > or if the memory mapping is anonymous.
> >
> > In the latter case, the data is dropped and the pages are
> > zero-filled on access.
>
> Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have
> to write the testcase for that...
It would have to discard the pages, losing the data, unless it caused
swapout (let's hope not).
> I think the correct docs fix is to simply delete the misleading parts of
> madvise.2 so that it reads
>
> MADV_DONTNEED
> Do not expect access in the near future. (For the time
> being, the application is finished with the given range,
> so the kernel can free resources associated with it.)
>
> and remove the erroneous parenthetical in the first paragraph.
>
> .. unless, of course, someone can actually demonstrate a case where
> madvise results in differing semantics...
Nod.
I think we need to resolve the differences between the man pages,
comments, expected user behavior, kernel implementation, POSIX standard,
and what other OS's do. Figure out what to do, then unify everything.
Robert Love
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 19:23 ` Robert Love
@ 2005-06-28 19:41 ` Samuel Thibault
2005-06-28 20:03 ` Jörn Engel
2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk
1 sibling, 1 reply; 18+ messages in thread
From: Samuel Thibault @ 2005-06-28 19:41 UTC (permalink / raw)
To: Robert Love; +Cc: Andy Isaacson, linux-kernel
Robert Love, le Tue 28 Jun 2005 15:23:43 -0400, a écrit :
> I think we need to resolve the differences between the man pages,
> comments, expected user behavior, kernel implementation, POSIX standard,
> and what other OS's do. Figure out what to do, then unify everything.
Well, posix says data should be kept. The solaris man page for madvise,
since it proposes a MADV_FREE case too do agree with posix.
I've tested the program (thanks Andy) on solaris 5.8, it does work fine.
On OSF1 it failed on the anonymous case.
Some people may want their data to be kept, so the safe side is to keep
data safe, i.e. both kernel and manpage be corrected, but leave a bug
notice in the manpage about previous kernels.
Regards,
Samuel
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 19:41 ` Samuel Thibault
@ 2005-06-28 20:03 ` Jörn Engel
2005-06-28 20:05 ` Robert Love
0 siblings, 1 reply; 18+ messages in thread
From: Jörn Engel @ 2005-06-28 20:03 UTC (permalink / raw)
To: Samuel Thibault, Robert Love, Andy Isaacson, linux-kernel
On Tue, 28 June 2005 21:41:28 +0200, Samuel Thibault wrote:
> Robert Love, le Tue 28 Jun 2005 15:23:43 -0400, a écrit :
> > I think we need to resolve the differences between the man pages,
> > comments, expected user behavior, kernel implementation, POSIX standard,
> > and what other OS's do. Figure out what to do, then unify everything.
>
> Well, posix says data should be kept. The solaris man page for madvise,
> since it proposes a MADV_FREE case too do agree with posix.
Plus, common sense agrees with posix. An implementation of madvice
that doesn't do anything should be perfectly sane and functionally
equivalent to any other implementation. Only differences should be in
performance.
Dropping dirty pages is a clear functional difference.
> I've tested the program (thanks Andy) on solaris 5.8, it does work fine.
> On OSF1 it failed on the anonymous case.
>
> Some people may want their data to be kept, so the safe side is to keep
> data safe, i.e. both kernel and manpage be corrected, but leave a bug
> notice in the manpage about previous kernels.
Sounds sane.
Jörn
--
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm I could do better than that!"
-- Douglas Adams in a slashdot interview
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:03 ` Jörn Engel
@ 2005-06-28 20:05 ` Robert Love
2005-06-28 20:17 ` Jörn Engel
0 siblings, 1 reply; 18+ messages in thread
From: Robert Love @ 2005-06-28 20:05 UTC (permalink / raw)
To: Jörn Engel; +Cc: Samuel Thibault, Andy Isaacson, linux-kernel
On Tue, 2005-06-28 at 22:03 +0200, Jörn Engel wrote:
> Plus, common sense agrees with posix. An implementation of madvice
> that doesn't do anything should be perfectly sane and functionally
> equivalent to any other implementation. Only differences should be in
> performance.
>
> Dropping dirty pages is a clear functional difference.
I like the idea (I think someone suggested this early on) of renaming
the current MADV_DONTNEED to MADV_FREE and then adding a correct
MADV_DONTNEED.
And, as I said, the man page needs clarification.
Robert Love
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:05 ` Robert Love
@ 2005-06-28 20:17 ` Jörn Engel
2005-06-28 20:20 ` Samuel Thibault
2005-07-05 23:39 ` Darren Hart
0 siblings, 2 replies; 18+ messages in thread
From: Jörn Engel @ 2005-06-28 20:17 UTC (permalink / raw)
To: Robert Love; +Cc: Samuel Thibault, Andy Isaacson, linux-kernel
On Tue, 28 June 2005 16:05:11 -0400, Robert Love wrote:
>
> I like the idea (I think someone suggested this early on) of renaming
> the current MADV_DONTNEED to MADV_FREE and then adding a correct
> MADV_DONTNEED.
Imo, that's still a crime against common sense. Madvice should give
the kernel some advice about which data to keep or not to keep in
memory, hence the name. It should *not* tell the kernel to corrupt
data, which currently appears to be the case.
If the application knows 100% that it is the _only_ possible user of
this data and will never again use it, dropping dirty pages might be a
sane option. Effectively that translates to anonymous memory only.
In all other cases, dirty pages should be written back.
> And, as I said, the man page needs clarification.
Definitely.
Jörn
--
Eighty percent of success is showing up.
-- Woody Allen
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:17 ` Jörn Engel
@ 2005-06-28 20:20 ` Samuel Thibault
2005-06-28 20:30 ` Jörn Engel
2005-06-28 20:37 ` Andy Isaacson
2005-07-05 23:39 ` Darren Hart
1 sibling, 2 replies; 18+ messages in thread
From: Samuel Thibault @ 2005-06-28 20:20 UTC (permalink / raw)
To: Jörn Engel; +Cc: Robert Love, Andy Isaacson, linux-kernel
Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit :
> If the application knows 100% that it is the _only_ possible user of
> this data and will never again use it, dropping dirty pages might be a
> sane option. Effectively that translates to anonymous memory only.
And private file mappings?
Regards,
Samue
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:20 ` Samuel Thibault
@ 2005-06-28 20:30 ` Jörn Engel
2005-06-28 20:37 ` Andy Isaacson
1 sibling, 0 replies; 18+ messages in thread
From: Jörn Engel @ 2005-06-28 20:30 UTC (permalink / raw)
To: Samuel Thibault, Robert Love, Andy Isaacson, linux-kernel
On Tue, 28 June 2005 22:20:53 +0200, Samuel Thibault wrote:
> Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit :
> > If the application knows 100% that it is the _only_ possible user of
> > this data and will never again use it, dropping dirty pages might be a
> > sane option. Effectively that translates to anonymous memory only.
>
> And private file mappings?
As in inode->i_nlink == 0? Yes, if you can prove that only this one
thread still has it open. How to deal with multithreaded processes?
I don't know and would default to "write it back" again.
Besides, writing the dirty pages to backing store can only hurt
performance, never correctness. The data could already be synced at
the time of the madvice call anyway.
Jörn
--
This above all: to thine own self be true.
-- Shakespeare
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:20 ` Samuel Thibault
2005-06-28 20:30 ` Jörn Engel
@ 2005-06-28 20:37 ` Andy Isaacson
1 sibling, 0 replies; 18+ messages in thread
From: Andy Isaacson @ 2005-06-28 20:37 UTC (permalink / raw)
To: Samuel Thibault, Jörn Engel, Robert Love, linux-kernel
On Tue, Jun 28, 2005 at 10:20:53PM +0200, Samuel Thibault wrote:
> Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit :
> > If the application knows 100% that it is the _only_ possible user of
> > this data and will never again use it, dropping dirty pages might be a
> > sane option. Effectively that translates to anonymous memory only.
>
> And private file mappings?
So long as the mapping exists, the data should not disappear. So a
MAP_PRIVATE mapping should behave just like a MAP_SHARED mapping, and
both need to be fixed to not lose data due to madvise(MADV_DONTNEED).
(I agree with Joern, MADV_FREE seems like an ill-advised extension.)
If this means some swap needs to be allocated, well, so be it.
-andy
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic
2005-06-28 18:28 ` Robert Love
2005-06-28 18:53 ` Andy Isaacson
@ 2005-06-29 16:34 ` Michael Kerrisk
1 sibling, 0 replies; 18+ messages in thread
From: Michael Kerrisk @ 2005-06-29 16:34 UTC (permalink / raw)
To: Robert Love; +Cc: adi, samuel.thibault, linux-kernel, mtk-lists
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="us-ascii", Size: 1424 bytes --]
> On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote:
>
> > Besides, if you read the documentation closely, it does not say what
> > you
> > think it says.
> >
> > MADV_DONTNEED
> > Do not expect access in the near future. (For the time
> > being, the application is finished with the given range,
> > so the kernel can free resources associated with it.)
> > Subsequent accesses of pages in this range will succeed,
> > but will result either in reloading of the memory contents
> > from the underlying mapped file (see mmap) or
> > zero-fill-on-demand pages for mappings without an
> > underlying file.
> >
> > You seem to think that "reloading ... from the underlying mapped file"
> > means that changes are lost, but that's not implied.
>
> This wording _does_ imply that changes are lost if the file is mapped
> writable and not mysnc'ed or if the memory mapping is anonymous.
A little late into this thread, but...
Indeed it does imply that, because that was what I understood
when (IIRC) I wrote that text in the man page.
> In the former, changes are dropped and the file is reread from the stale
> on-disk copy. In the latter case, the data is dropped and the pages are
> zero-filled on access.
Yes.
Cheers,
Michael
--
5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail
+++ GMX - die erste Adresse für Mail, Message, More +++
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic
2005-06-28 19:23 ` Robert Love
2005-06-28 19:41 ` Samuel Thibault
@ 2005-06-29 16:53 ` Michael Kerrisk
2005-06-29 17:22 ` Jamie Lokier
1 sibling, 1 reply; 18+ messages in thread
From: Michael Kerrisk @ 2005-06-29 16:53 UTC (permalink / raw)
To: Robert Love; +Cc: adi, samuel.thibault, linux-kernel, mtk-lists, jamie Lokier
> > I contest your interpretation of the manpage; while it could be read
> > the way you suggest, I claim that because Linux mmap is inherently
> > coherent (as opposed to, for example, AIX 4.1 mmap) then the
> > "underlying
> > file" already contains the updated contents, and ergo msync is not
> > required for correct MAP_SHARED semantics on Linux, and the manpage as
> > it stands is (misleading, but) both accurate to the 2.6.11
> > implementation and compliant with the POSIX description posted earlier.
>
> Well, there is nothing guaranteeing (either in the Linux implementation
> or the code) that the in-memory changes are synced back to disk. You
> have to call msyc().
>
> > > if the file is mapped writable and not mysnc'ed
> >
> > This is the case that my posted example code exercises, and I did not
> > see any problems. Is there some additional circumstance that is
> > necessary to cause it to break? (I tested on 2.6.11-rc5 or something
> > close to that.)
>
> Yah, I am not--at all--talking about actual behavior. Just that the
> wording definitely says, its kind of been a common belief, that
> MADV_DONTNEED literally ditches your data. If you need it, don't call
> MADV_DONTNEED.
>
> > > or if the memory mapping is anonymous.
> > >
> > > In the latter case, the data is dropped and the pages are
> > > zero-filled on access.
> >
> > Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have
> > to write the testcase for that...
>
> It would have to discard the pages, losing the data, unless it caused
> swapout (let's hope not).
>
> > I think the correct docs fix is to simply delete the misleading parts
> > of
> > madvise.2 so that it reads
> >
> > MADV_DONTNEED
> > Do not expect access in the near future. (For the time
> > being, the application is finished with the given range,
> > so the kernel can free resources associated with it.)
> >
> > and remove the erroneous parenthetical in the first paragraph.
> >
> > .. unless, of course, someone can actually demonstrate a case where
> > madvise results in differing semantics...
>
> Nod.
Well, as far as I last knew Linux MADV_DONTNEED is indeed
destructive, and not POSIX.1-2001 conformant.
> I think we need to resolve the differences between the man pages,
> comments, expected user behavior, kernel implementation, POSIX standard,
> and what other OS's do. Figure out what to do, then unify everything.
First of all, I'm always happy to receive man page patches
at mtk-manpages@gmx.net.
My understanding of Linux MADV_DONTNEED is that it differs
from POSIX and from *some* other Unix implementations.
For Linux:
For a MAP_PRIVATE region, the mapped pages are explicitly
discarded, so that modifications to the pages are lost. The
next access results in a page fault re-initializing either
with the contents of the underlying mapped file, or, in the
case of an anonymous mapping, with 0. (A while ago, I
verified this behaviour using a test program.)
For a MAP_SHARED region, the kernel may discard modified
pages, depending on the architecture. (I seem to recall
getting this information about architecture variation from
an email conversation with Jamie Lokier; I haven't
verified it. On x86 at least, I haven't seen an pages
discarded for MADV_DONTNEED on a MAP_SHARED region.)
On some some other Unix implementations, MADV_DONTNEED
simply informs the kernel that the specified pages can be
swapped out if necessary. (i.e., MADV_DONTNEED on those
implementations is non-destructive.)
I realise that the current Linux madvise(2) manual page
doesn't go into quite the above level of detail. If
someone wants to confirm or improve the above, I'm happy
to update the man page.
Cheers,
Michael
--
Weitersagen: GMX DSL-Flatrates mit Tempo-Garantie!
Ab 4,99 Euro/Monat: http://www.gmx.net/de/go/dsl
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic
2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk
@ 2005-06-29 17:22 ` Jamie Lokier
0 siblings, 0 replies; 18+ messages in thread
From: Jamie Lokier @ 2005-06-29 17:22 UTC (permalink / raw)
To: Michael Kerrisk
Cc: Robert Love, adi, samuel.thibault, linux-kernel, mtk-lists
Michael Kerrisk wrote:
> For a MAP_SHARED region, the kernel may discard modified
> pages, depending on the architecture. (I seem to recall
> getting this information about architecture variation from
> an email conversation with Jamie Lokier; I haven't
> verified it. On x86 at least, I haven't seen an pages
> discarded for MADV_DONTNEED on a MAP_SHARED region.)
The discard in MAP_SHARED case isn't explicit. It's due to CPU cache
incoherency, so the data (or just some of the data) is lost simply by
not syncing properly between userspace and the page cache.
It wouldn't be apparent on fully-coherent architectures like x86.
That was when I last looked at the code (during the 2.5 time frame).
I didn't test it, only examined the code, but quite thoroughly.
-- Jamie
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic
2005-06-28 20:17 ` Jörn Engel
2005-06-28 20:20 ` Samuel Thibault
@ 2005-07-05 23:39 ` Darren Hart
1 sibling, 0 replies; 18+ messages in thread
From: Darren Hart @ 2005-07-05 23:39 UTC (permalink / raw)
To: Jörn Engel; +Cc: Robert Love, Samuel Thibault, Andy Isaacson, linux-kernel
Jörn Engel wrote:
> On Tue, 28 June 2005 16:05:11 -0400, Robert Love wrote:
>
>>I like the idea (I think someone suggested this early on) of renaming
>>the current MADV_DONTNEED to MADV_FREE and then adding a correct
>>MADV_DONTNEED.
>
>
> Imo, that's still a crime against common sense. Madvice should give
> the kernel some advice about which data to keep or not to keep in
> memory, hence the name. It should *not* tell the kernel to corrupt
> data, which currently appears to be the case.
>
> If the application knows 100% that it is the _only_ possible user of
> this data and will never again use it, dropping dirty pages might be a
> sane option. Effectively that translates to anonymous memory only.
> In all other cases, dirty pages should be written back.
There is also the case of shmget/shmat memory segments. Some
applications will use these in order to map a very large amount of
memory and then madvise(MADV_DONTNEED) in order to play nice with the
rest of the system should memory pressure / system load / etc require
it. Obviously if other tasks have these segments mapped, the pages
cannot be discarded. If a task is the sole "mapper" of the region and
doesn't need that memory (ever again) it would be good to avoid the i/o
overhead of swapping it out and just discarding it. Perhaps
MADV_DONTNEED isn't the right place for this, but there is demand for
this behavior.
--Darren Hart
>
>
>>And, as I said, the man page needs clarification.
>
>
> Definitely.
>
> Jörn
>
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2005-07-05 23:42 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel
2005-06-28 18:16 ` Andy Isaacson
2005-06-28 18:28 ` Robert Love
2005-06-28 18:53 ` Andy Isaacson
2005-06-28 19:23 ` Robert Love
2005-06-28 19:41 ` Samuel Thibault
2005-06-28 20:03 ` Jörn Engel
2005-06-28 20:05 ` Robert Love
2005-06-28 20:17 ` Jörn Engel
2005-06-28 20:20 ` Samuel Thibault
2005-06-28 20:30 ` Jörn Engel
2005-06-28 20:37 ` Andy Isaacson
2005-07-05 23:39 ` Darren Hart
2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk
2005-06-29 17:22 ` Jamie Lokier
2005-06-29 16:34 ` Michael Kerrisk
2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox