* wrong madvise(MADV_DONTNEED) semantic @ 2005-06-28 13:43 Samuel Thibault 2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel 2005-06-28 18:16 ` Andy Isaacson 0 siblings, 2 replies; 18+ messages in thread From: Samuel Thibault @ 2005-06-28 13:43 UTC (permalink / raw) To: linux-kernel Hi, There is something wrong with the current madvise(MADV_DONTNEED) implementation. Both the manpage and the source code says that MADV_DONTNEED means that the application does not care about the data, so it might be thrown away by the kernel. But that's not what posix says: http://www.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html It says that "The posix_madvise() function shall have no effect on the semantics of access to memory in the specified range". I.e. the data that was recorded shall be saved! The current linux implementation of MADV_DONTNEED is rather an implementation of solaris' MADV_FREE, see its manpage: http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view Hence the current madvise_dontneed() implementation could be renamed into madvise_free() and the appropriate MADV_FREE case be added, while a new implementation of madvise_dontneed() _needs_ be written. It may for instance go through the range so as to zap clean pages (since it is safe), and set dirty pages as being least recently used so that they will be considered as good candidates for eviction. (And the manpage should get corrected too) Regards, Samuel Thibault ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Patch] Hotfix for Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault @ 2005-06-28 14:38 ` Jörn Engel 2005-06-28 18:16 ` Andy Isaacson 1 sibling, 0 replies; 18+ messages in thread From: Jörn Engel @ 2005-06-28 14:38 UTC (permalink / raw) To: Andrew Morton; +Cc: Samuel Thibault, linux-kernel On Tue, 28 June 2005 15:43:16 +0200, Samuel Thibault wrote: > > There is something wrong with the current madvise(MADV_DONTNEED) > implementation. Both the manpage and the source code says that > MADV_DONTNEED means that the application does not care about the data, > so it might be thrown away by the kernel. But that's not what posix > says: > > http://www.opengroup.org/onlinepubs/009695399/functions/posix_madvise.html > > It says that "The posix_madvise() function shall have no effect on the > semantics of access to memory in the specified range". I.e. the data > that was recorded shall be saved! > > The current linux implementation of MADV_DONTNEED is rather an > implementation of solaris' MADV_FREE, see its manpage: > http://docs.sun.com/app/docs/doc/816-5168/6mbb3hrde?a=view > > Hence the current madvise_dontneed() implementation could be renamed > into madvise_free() and the appropriate MADV_FREE case be added, while > a new implementation of madvise_dontneed() _needs_ be written. It may > for instance go through the range so as to zap clean pages (since it is > safe), and set dirty pages as being least recently used so that they > will be considered as good candidates for eviction. Agreed. Unless someone else comes up with something better soon, you can use below patch. Akpm, would you merge this? Jörn -- The story so far: In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- Douglas Adams Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> --- mm/madvise.c | 4 ++++ 1 files changed, 4 insertions(+) --- linux-2.6.12-rc4logfs/mm/madvise.c~madvice_bug 2005-03-04 11:40:50.000000000 +0100 +++ linux-2.6.12-rc4logfs/mm/madvise.c 2005-06-28 16:34:54.000000000 +0200 @@ -97,6 +97,10 @@ static long madvise_willneed(struct vm_a static long madvise_dontneed(struct vm_area_struct * vma, unsigned long start, unsigned long end) { + return 0; + /* Really ugly bugfix - madvice may never change anything + * beyond performance. Dropping dirty pages is right out. + */ if ((vma->vm_flags & VM_LOCKED) || is_vm_hugetlb_page(vma)) return -EINVAL; ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault 2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel @ 2005-06-28 18:16 ` Andy Isaacson 2005-06-28 18:28 ` Robert Love 2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault 1 sibling, 2 replies; 18+ messages in thread From: Andy Isaacson @ 2005-06-28 18:16 UTC (permalink / raw) To: Samuel Thibault, linux-kernel On Tue, Jun 28, 2005 at 03:43:16PM +0200, Samuel Thibault wrote: > There is something wrong with the current madvise(MADV_DONTNEED) > implementation. Both the manpage and the source code says that > MADV_DONTNEED means that the application does not care about the data, > so it might be thrown away by the kernel. I agree that the Linux madvise manpage is unclear and should be fixed. If your interpretation of the problem is correct, then it should be trivial to write a test program demonstrating the problem. Did you write the simple test program and run it? I did, and the results make me think that the Linux implementation does conform to the POSIX spec you quote. I did not observe any data loss. So it's just a documentation issue. Besides, if you read the documentation closely, it does not say what you think it says. MADV_DONTNEED Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) Subsequent accesses of pages in this range will succeed, but will result either in reloading of the memory contents from the underlying mapped file (see mmap) or zero-fill-on-demand pages for mappings without an underlying file. You seem to think that "reloading ... from the underlying mapped file" means that changes are lost, but that's not implied. Also, the parenthetical near the top of the manpage "(except in the case of MADV_DONTNEED)" is, AFAICS, wrong. MADV_DONTNEED does not affect semantics. It looks to me like someone "improved" madvise.2 without actually understanding what they were talking about. Below is the test program I used. -andy #include <stdio.h> #include <stdlib.h> #include <stdarg.h> #include <string.h> #include <errno.h> #include <unistd.h> #include <sys/mman.h> #include <fcntl.h> #include <sys/stat.h> typedef unsigned int u32; void die(char *fmt, ...) { va_list ap; va_start(ap, fmt); vfprintf(stderr, fmt, ap); va_end(ap); exit(1); } int check_cookie(char *file, u32 cookie) { u32 buf; int fd; if((fd = open(file, O_RDONLY, 0)) == -1) die("%s: %s\n", file, strerror(errno)); if(read(fd, &buf, sizeof(buf)) == -1) die("read: %s\n", strerror(errno)); close(fd); return buf == cookie; } int dotest(char *file, u32 cookie, int do_msync, int do_madvise) { void *p; int fd, len = 16 * 1024, prot = PROT_READ|PROT_WRITE; if((fd = open(file, O_RDWR|O_CREAT, 0666)) == -1) die("%s: %s\n", file, strerror(errno)); if(ftruncate(fd, len) == -1) die("ftruncate: %s\n", strerror(errno)); if(write(fd, "", 1) == -1) die("write: %s\n", strerror(errno)); if((p = mmap(0, len, prot, MAP_SHARED, fd, 0)) == MAP_FAILED) die("mmap: %s\n", strerror(errno)); *(u32 *)p = cookie; if(do_msync) if(msync(p, len, MS_SYNC) == -1) die("msync: %s\n", strerror(errno)); if(do_madvise) if(madvise(p, len, MADV_DONTNEED) == -1) die("madvise: %s\n", strerror(errno)); if(close(fd) == -1) die("close: %s\n", strerror(errno)); printf("c = %08x msync: %s madvise: %s %s\n", cookie, do_msync ? "YES" : " NO", do_madvise ? "YES" : " NO", check_cookie(file, cookie) ? "ok" : "FAILED"); } int main(int argc, char **argv) { if(argc != 2) die("usage: %s file\n", argv[0]); lrand48(); dotest(argv[1], lrand48(), 0, 0); dotest(argv[1], lrand48(), 1, 0); dotest(argv[1], lrand48(), 0, 1); dotest(argv[1], lrand48(), 1, 1); } ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 18:16 ` Andy Isaacson @ 2005-06-28 18:28 ` Robert Love 2005-06-28 18:53 ` Andy Isaacson 2005-06-29 16:34 ` Michael Kerrisk 2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault 1 sibling, 2 replies; 18+ messages in thread From: Robert Love @ 2005-06-28 18:28 UTC (permalink / raw) To: Andy Isaacson; +Cc: Samuel Thibault, linux-kernel On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote: > Besides, if you read the documentation closely, it does not say what you > think it says. > > MADV_DONTNEED > Do not expect access in the near future. (For the time > being, the application is finished with the given range, > so the kernel can free resources associated with it.) > Subsequent accesses of pages in this range will succeed, > but will result either in reloading of the memory contents > from the underlying mapped file (see mmap) or > zero-fill-on-demand pages for mappings without an > underlying file. > > You seem to think that "reloading ... from the underlying mapped file" > means that changes are lost, but that's not implied. This wording _does_ imply that changes are lost if the file is mapped writable and not mysnc'ed or if the memory mapping is anonymous. In the former, changes are dropped and the file is reread from the stale on-disk copy. In the latter case, the data is dropped and the pages are zero-filled on access. Robert Love ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 18:28 ` Robert Love @ 2005-06-28 18:53 ` Andy Isaacson 2005-06-28 19:23 ` Robert Love 2005-06-29 16:34 ` Michael Kerrisk 1 sibling, 1 reply; 18+ messages in thread From: Andy Isaacson @ 2005-06-28 18:53 UTC (permalink / raw) To: Robert Love; +Cc: Samuel Thibault, linux-kernel On Tue, Jun 28, 2005 at 02:28:20PM -0400, Robert Love wrote: > On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote: > > Besides, if you read the documentation closely, it does not say what you > > think it says. > > > > MADV_DONTNEED > > Do not expect access in the near future. (For the time > > being, the application is finished with the given range, > > so the kernel can free resources associated with it.) > > Subsequent accesses of pages in this range will succeed, > > but will result either in reloading of the memory contents > > from the underlying mapped file (see mmap) or > > zero-fill-on-demand pages for mappings without an > > underlying file. > > > > You seem to think that "reloading ... from the underlying mapped file" > > means that changes are lost, but that's not implied. > > This wording _does_ imply that changes are lost I contest your interpretation of the manpage; while it could be read the way you suggest, I claim that because Linux mmap is inherently coherent (as opposed to, for example, AIX 4.1 mmap) then the "underlying file" already contains the updated contents, and ergo msync is not required for correct MAP_SHARED semantics on Linux, and the manpage as it stands is (misleading, but) both accurate to the 2.6.11 implementation and compliant with the POSIX description posted earlier. > if the file is mapped writable and not mysnc'ed This is the case that my posted example code exercises, and I did not see any problems. Is there some additional circumstance that is necessary to cause it to break? (I tested on 2.6.11-rc5 or something close to that.) > or if the memory mapping is anonymous. > > In the latter case, the data is dropped and the pages are > zero-filled on access. Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have to write the testcase for that... I think the correct docs fix is to simply delete the misleading parts of madvise.2 so that it reads MADV_DONTNEED Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.) and remove the erroneous parenthetical in the first paragraph. ... unless, of course, someone can actually demonstrate a case where madvise results in differing semantics... -andy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 18:53 ` Andy Isaacson @ 2005-06-28 19:23 ` Robert Love 2005-06-28 19:41 ` Samuel Thibault 2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk 0 siblings, 2 replies; 18+ messages in thread From: Robert Love @ 2005-06-28 19:23 UTC (permalink / raw) To: Andy Isaacson; +Cc: Samuel Thibault, linux-kernel On Tue, 2005-06-28 at 11:53 -0700, Andy Isaacson wrote: > I contest your interpretation of the manpage; while it could be read > the way you suggest, I claim that because Linux mmap is inherently > coherent (as opposed to, for example, AIX 4.1 mmap) then the "underlying > file" already contains the updated contents, and ergo msync is not > required for correct MAP_SHARED semantics on Linux, and the manpage as > it stands is (misleading, but) both accurate to the 2.6.11 > implementation and compliant with the POSIX description posted earlier. Well, there is nothing guaranteeing (either in the Linux implementation or the code) that the in-memory changes are synced back to disk. You have to call msyc(). > > if the file is mapped writable and not mysnc'ed > > This is the case that my posted example code exercises, and I did not > see any problems. Is there some additional circumstance that is > necessary to cause it to break? (I tested on 2.6.11-rc5 or something > close to that.) Yah, I am not--at all--talking about actual behavior. Just that the wording definitely says, its kind of been a common belief, that MADV_DONTNEED literally ditches your data. If you need it, don't call MADV_DONTNEED. > > or if the memory mapping is anonymous. > > > > In the latter case, the data is dropped and the pages are > > zero-filled on access. > > Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have > to write the testcase for that... It would have to discard the pages, losing the data, unless it caused swapout (let's hope not). > I think the correct docs fix is to simply delete the misleading parts of > madvise.2 so that it reads > > MADV_DONTNEED > Do not expect access in the near future. (For the time > being, the application is finished with the given range, > so the kernel can free resources associated with it.) > > and remove the erroneous parenthetical in the first paragraph. > > .. unless, of course, someone can actually demonstrate a case where > madvise results in differing semantics... Nod. I think we need to resolve the differences between the man pages, comments, expected user behavior, kernel implementation, POSIX standard, and what other OS's do. Figure out what to do, then unify everything. Robert Love ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 19:23 ` Robert Love @ 2005-06-28 19:41 ` Samuel Thibault 2005-06-28 20:03 ` Jörn Engel 2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk 1 sibling, 1 reply; 18+ messages in thread From: Samuel Thibault @ 2005-06-28 19:41 UTC (permalink / raw) To: Robert Love; +Cc: Andy Isaacson, linux-kernel Robert Love, le Tue 28 Jun 2005 15:23:43 -0400, a écrit : > I think we need to resolve the differences between the man pages, > comments, expected user behavior, kernel implementation, POSIX standard, > and what other OS's do. Figure out what to do, then unify everything. Well, posix says data should be kept. The solaris man page for madvise, since it proposes a MADV_FREE case too do agree with posix. I've tested the program (thanks Andy) on solaris 5.8, it does work fine. On OSF1 it failed on the anonymous case. Some people may want their data to be kept, so the safe side is to keep data safe, i.e. both kernel and manpage be corrected, but leave a bug notice in the manpage about previous kernels. Regards, Samuel ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 19:41 ` Samuel Thibault @ 2005-06-28 20:03 ` Jörn Engel 2005-06-28 20:05 ` Robert Love 0 siblings, 1 reply; 18+ messages in thread From: Jörn Engel @ 2005-06-28 20:03 UTC (permalink / raw) To: Samuel Thibault, Robert Love, Andy Isaacson, linux-kernel On Tue, 28 June 2005 21:41:28 +0200, Samuel Thibault wrote: > Robert Love, le Tue 28 Jun 2005 15:23:43 -0400, a écrit : > > I think we need to resolve the differences between the man pages, > > comments, expected user behavior, kernel implementation, POSIX standard, > > and what other OS's do. Figure out what to do, then unify everything. > > Well, posix says data should be kept. The solaris man page for madvise, > since it proposes a MADV_FREE case too do agree with posix. Plus, common sense agrees with posix. An implementation of madvice that doesn't do anything should be perfectly sane and functionally equivalent to any other implementation. Only differences should be in performance. Dropping dirty pages is a clear functional difference. > I've tested the program (thanks Andy) on solaris 5.8, it does work fine. > On OSF1 it failed on the anonymous case. > > Some people may want their data to be kept, so the safe side is to keep > data safe, i.e. both kernel and manpage be corrected, but leave a bug > notice in the manpage about previous kernels. Sounds sane. Jörn -- There's nothing better for promoting creativity in a medium than making an audience feel "Hmm I could do better than that!" -- Douglas Adams in a slashdot interview ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:03 ` Jörn Engel @ 2005-06-28 20:05 ` Robert Love 2005-06-28 20:17 ` Jörn Engel 0 siblings, 1 reply; 18+ messages in thread From: Robert Love @ 2005-06-28 20:05 UTC (permalink / raw) To: Jörn Engel; +Cc: Samuel Thibault, Andy Isaacson, linux-kernel On Tue, 2005-06-28 at 22:03 +0200, Jörn Engel wrote: > Plus, common sense agrees with posix. An implementation of madvice > that doesn't do anything should be perfectly sane and functionally > equivalent to any other implementation. Only differences should be in > performance. > > Dropping dirty pages is a clear functional difference. I like the idea (I think someone suggested this early on) of renaming the current MADV_DONTNEED to MADV_FREE and then adding a correct MADV_DONTNEED. And, as I said, the man page needs clarification. Robert Love ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:05 ` Robert Love @ 2005-06-28 20:17 ` Jörn Engel 2005-06-28 20:20 ` Samuel Thibault 2005-07-05 23:39 ` Darren Hart 0 siblings, 2 replies; 18+ messages in thread From: Jörn Engel @ 2005-06-28 20:17 UTC (permalink / raw) To: Robert Love; +Cc: Samuel Thibault, Andy Isaacson, linux-kernel On Tue, 28 June 2005 16:05:11 -0400, Robert Love wrote: > > I like the idea (I think someone suggested this early on) of renaming > the current MADV_DONTNEED to MADV_FREE and then adding a correct > MADV_DONTNEED. Imo, that's still a crime against common sense. Madvice should give the kernel some advice about which data to keep or not to keep in memory, hence the name. It should *not* tell the kernel to corrupt data, which currently appears to be the case. If the application knows 100% that it is the _only_ possible user of this data and will never again use it, dropping dirty pages might be a sane option. Effectively that translates to anonymous memory only. In all other cases, dirty pages should be written back. > And, as I said, the man page needs clarification. Definitely. Jörn -- Eighty percent of success is showing up. -- Woody Allen ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:17 ` Jörn Engel @ 2005-06-28 20:20 ` Samuel Thibault 2005-06-28 20:30 ` Jörn Engel 2005-06-28 20:37 ` Andy Isaacson 2005-07-05 23:39 ` Darren Hart 1 sibling, 2 replies; 18+ messages in thread From: Samuel Thibault @ 2005-06-28 20:20 UTC (permalink / raw) To: Jörn Engel; +Cc: Robert Love, Andy Isaacson, linux-kernel Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit : > If the application knows 100% that it is the _only_ possible user of > this data and will never again use it, dropping dirty pages might be a > sane option. Effectively that translates to anonymous memory only. And private file mappings? Regards, Samue ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:20 ` Samuel Thibault @ 2005-06-28 20:30 ` Jörn Engel 2005-06-28 20:37 ` Andy Isaacson 1 sibling, 0 replies; 18+ messages in thread From: Jörn Engel @ 2005-06-28 20:30 UTC (permalink / raw) To: Samuel Thibault, Robert Love, Andy Isaacson, linux-kernel On Tue, 28 June 2005 22:20:53 +0200, Samuel Thibault wrote: > Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit : > > If the application knows 100% that it is the _only_ possible user of > > this data and will never again use it, dropping dirty pages might be a > > sane option. Effectively that translates to anonymous memory only. > > And private file mappings? As in inode->i_nlink == 0? Yes, if you can prove that only this one thread still has it open. How to deal with multithreaded processes? I don't know and would default to "write it back" again. Besides, writing the dirty pages to backing store can only hurt performance, never correctness. The data could already be synced at the time of the madvice call anyway. Jörn -- This above all: to thine own self be true. -- Shakespeare ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:20 ` Samuel Thibault 2005-06-28 20:30 ` Jörn Engel @ 2005-06-28 20:37 ` Andy Isaacson 1 sibling, 0 replies; 18+ messages in thread From: Andy Isaacson @ 2005-06-28 20:37 UTC (permalink / raw) To: Samuel Thibault, Jörn Engel, Robert Love, linux-kernel On Tue, Jun 28, 2005 at 10:20:53PM +0200, Samuel Thibault wrote: > Jörn Engel, le Tue 28 Jun 2005 22:17:04 +0200, a écrit : > > If the application knows 100% that it is the _only_ possible user of > > this data and will never again use it, dropping dirty pages might be a > > sane option. Effectively that translates to anonymous memory only. > > And private file mappings? So long as the mapping exists, the data should not disappear. So a MAP_PRIVATE mapping should behave just like a MAP_SHARED mapping, and both need to be fixed to not lose data due to madvise(MADV_DONTNEED). (I agree with Joern, MADV_FREE seems like an ill-advised extension.) If this means some swap needs to be allocated, well, so be it. -andy ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 20:17 ` Jörn Engel 2005-06-28 20:20 ` Samuel Thibault @ 2005-07-05 23:39 ` Darren Hart 1 sibling, 0 replies; 18+ messages in thread From: Darren Hart @ 2005-07-05 23:39 UTC (permalink / raw) To: Jörn Engel; +Cc: Robert Love, Samuel Thibault, Andy Isaacson, linux-kernel Jörn Engel wrote: > On Tue, 28 June 2005 16:05:11 -0400, Robert Love wrote: > >>I like the idea (I think someone suggested this early on) of renaming >>the current MADV_DONTNEED to MADV_FREE and then adding a correct >>MADV_DONTNEED. > > > Imo, that's still a crime against common sense. Madvice should give > the kernel some advice about which data to keep or not to keep in > memory, hence the name. It should *not* tell the kernel to corrupt > data, which currently appears to be the case. > > If the application knows 100% that it is the _only_ possible user of > this data and will never again use it, dropping dirty pages might be a > sane option. Effectively that translates to anonymous memory only. > In all other cases, dirty pages should be written back. There is also the case of shmget/shmat memory segments. Some applications will use these in order to map a very large amount of memory and then madvise(MADV_DONTNEED) in order to play nice with the rest of the system should memory pressure / system load / etc require it. Obviously if other tasks have these segments mapped, the pages cannot be discarded. If a task is the sole "mapper" of the region and doesn't need that memory (ever again) it would be good to avoid the i/o overhead of swapping it out and just discarding it. Perhaps MADV_DONTNEED isn't the right place for this, but there is demand for this behavior. --Darren Hart > > >>And, as I said, the man page needs clarification. > > > Definitely. > > Jörn > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic 2005-06-28 19:23 ` Robert Love 2005-06-28 19:41 ` Samuel Thibault @ 2005-06-29 16:53 ` Michael Kerrisk 2005-06-29 17:22 ` Jamie Lokier 1 sibling, 1 reply; 18+ messages in thread From: Michael Kerrisk @ 2005-06-29 16:53 UTC (permalink / raw) To: Robert Love; +Cc: adi, samuel.thibault, linux-kernel, mtk-lists, jamie Lokier > > I contest your interpretation of the manpage; while it could be read > > the way you suggest, I claim that because Linux mmap is inherently > > coherent (as opposed to, for example, AIX 4.1 mmap) then the > > "underlying > > file" already contains the updated contents, and ergo msync is not > > required for correct MAP_SHARED semantics on Linux, and the manpage as > > it stands is (misleading, but) both accurate to the 2.6.11 > > implementation and compliant with the POSIX description posted earlier. > > Well, there is nothing guaranteeing (either in the Linux implementation > or the code) that the in-memory changes are synced back to disk. You > have to call msyc(). > > > > if the file is mapped writable and not mysnc'ed > > > > This is the case that my posted example code exercises, and I did not > > see any problems. Is there some additional circumstance that is > > necessary to cause it to break? (I tested on 2.6.11-rc5 or something > > close to that.) > > Yah, I am not--at all--talking about actual behavior. Just that the > wording definitely says, its kind of been a common belief, that > MADV_DONTNEED literally ditches your data. If you need it, don't call > MADV_DONTNEED. > > > > or if the memory mapping is anonymous. > > > > > > In the latter case, the data is dropped and the pages are > > > zero-filled on access. > > > > Yes, MAP_ANONYMOUS is a more interesting case. Somebody else will have > > to write the testcase for that... > > It would have to discard the pages, losing the data, unless it caused > swapout (let's hope not). > > > I think the correct docs fix is to simply delete the misleading parts > > of > > madvise.2 so that it reads > > > > MADV_DONTNEED > > Do not expect access in the near future. (For the time > > being, the application is finished with the given range, > > so the kernel can free resources associated with it.) > > > > and remove the erroneous parenthetical in the first paragraph. > > > > .. unless, of course, someone can actually demonstrate a case where > > madvise results in differing semantics... > > Nod. Well, as far as I last knew Linux MADV_DONTNEED is indeed destructive, and not POSIX.1-2001 conformant. > I think we need to resolve the differences between the man pages, > comments, expected user behavior, kernel implementation, POSIX standard, > and what other OS's do. Figure out what to do, then unify everything. First of all, I'm always happy to receive man page patches at mtk-manpages@gmx.net. My understanding of Linux MADV_DONTNEED is that it differs from POSIX and from *some* other Unix implementations. For Linux: For a MAP_PRIVATE region, the mapped pages are explicitly discarded, so that modifications to the pages are lost. The next access results in a page fault re-initializing either with the contents of the underlying mapped file, or, in the case of an anonymous mapping, with 0. (A while ago, I verified this behaviour using a test program.) For a MAP_SHARED region, the kernel may discard modified pages, depending on the architecture. (I seem to recall getting this information about architecture variation from an email conversation with Jamie Lokier; I haven't verified it. On x86 at least, I haven't seen an pages discarded for MADV_DONTNEED on a MAP_SHARED region.) On some some other Unix implementations, MADV_DONTNEED simply informs the kernel that the specified pages can be swapped out if necessary. (i.e., MADV_DONTNEED on those implementations is non-destructive.) I realise that the current Linux madvise(2) manual page doesn't go into quite the above level of detail. If someone wants to confirm or improve the above, I'm happy to update the man page. Cheers, Michael -- Weitersagen: GMX DSL-Flatrates mit Tempo-Garantie! Ab 4,99 Euro/Monat: http://www.gmx.net/de/go/dsl ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic 2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk @ 2005-06-29 17:22 ` Jamie Lokier 0 siblings, 0 replies; 18+ messages in thread From: Jamie Lokier @ 2005-06-29 17:22 UTC (permalink / raw) To: Michael Kerrisk Cc: Robert Love, adi, samuel.thibault, linux-kernel, mtk-lists Michael Kerrisk wrote: > For a MAP_SHARED region, the kernel may discard modified > pages, depending on the architecture. (I seem to recall > getting this information about architecture variation from > an email conversation with Jamie Lokier; I haven't > verified it. On x86 at least, I haven't seen an pages > discarded for MADV_DONTNEED on a MAP_SHARED region.) The discard in MAP_SHARED case isn't explicit. It's due to CPU cache incoherency, so the data (or just some of the data) is lost simply by not syncing properly between userspace and the page cache. It wouldn't be apparent on fully-coherent architectures like x86. That was when I last looked at the code (during the 2.5 time frame). I didn't test it, only examined the code, but quite thoroughly. -- Jamie ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV DONTNEED) semantic 2005-06-28 18:28 ` Robert Love 2005-06-28 18:53 ` Andy Isaacson @ 2005-06-29 16:34 ` Michael Kerrisk 1 sibling, 0 replies; 18+ messages in thread From: Michael Kerrisk @ 2005-06-29 16:34 UTC (permalink / raw) To: Robert Love; +Cc: adi, samuel.thibault, linux-kernel, mtk-lists [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="us-ascii", Size: 1424 bytes --] > On Tue, 2005-06-28 at 11:16 -0700, Andy Isaacson wrote: > > > Besides, if you read the documentation closely, it does not say what > > you > > think it says. > > > > MADV_DONTNEED > > Do not expect access in the near future. (For the time > > being, the application is finished with the given range, > > so the kernel can free resources associated with it.) > > Subsequent accesses of pages in this range will succeed, > > but will result either in reloading of the memory contents > > from the underlying mapped file (see mmap) or > > zero-fill-on-demand pages for mappings without an > > underlying file. > > > > You seem to think that "reloading ... from the underlying mapped file" > > means that changes are lost, but that's not implied. > > This wording _does_ imply that changes are lost if the file is mapped > writable and not mysnc'ed or if the memory mapping is anonymous. A little late into this thread, but... Indeed it does imply that, because that was what I understood when (IIRC) I wrote that text in the man page. > In the former, changes are dropped and the file is reread from the stale > on-disk copy. In the latter case, the data is dropped and the pages are > zero-filled on access. Yes. Cheers, Michael -- 5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail +++ GMX - die erste Adresse für Mail, Message, More +++ ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: wrong madvise(MADV_DONTNEED) semantic 2005-06-28 18:16 ` Andy Isaacson 2005-06-28 18:28 ` Robert Love @ 2005-06-28 18:54 ` Samuel Thibault 1 sibling, 0 replies; 18+ messages in thread From: Samuel Thibault @ 2005-06-28 18:54 UTC (permalink / raw) To: Andy Isaacson; +Cc: linux-kernel Andy Isaacson, le Tue 28 Jun 2005 11:16:20 -0700, a écrit : > If your interpretation of the problem is correct, then it should be > trivial to write a test program demonstrating the problem. Did you > write the simple test program and run it? I indeed didn't, trusting both the man page, the source code comments, and my knowledge of zap_page_range(). > MADV_DONTNEED > Do not expect access in the near future. (For the time > being, the application is finished with the given range, > so the kernel can free resources associated with it.) > Subsequent accesses of pages in this range will succeed, > but will result either in reloading of the memory contents > from the underlying mapped file (see mmap) or > zero-fill-on-demand pages for mappings without an > underlying file. > > You seem to think that "reloading ... from the underlying mapped file" > means that changes are lost, but that's not implied. I didn't say anything precise. What mostly feared me was the "zero-fill-on-demand pages for mappings without an underlying file." > Below is the test program I used. It does indeed work, but this is no proof. It your testcase it does indeed work, since the page still remains in the page cache (it's a shared mapping of the file). But now try this one. It uses private mappings, and fails as expected, getting 0 in the ANONYMOUS case, and the original file value in the file mapping case. Regards, Samuel Thibault #include <stdio.h> #include <stdlib.h> #include <stdarg.h> #include <string.h> #include <errno.h> #include <signal.h> #include <unistd.h> #include <sys/mman.h> #include <fcntl.h> #include <sys/stat.h> typedef unsigned int u32; void die(char *fmt, ...) { va_list ap; va_start(ap, fmt); vfprintf(stderr, fmt, ap); va_end(ap); exit(1); } int check_cookie(char *file, u32 cookie) { u32 buf; int fd; if((fd = open(file, O_RDONLY, 0)) == -1) die("%s: %s\n", file, strerror(errno)); if(read(fd, &buf, sizeof(buf)) == -1) die("read: %s\n", strerror(errno)); close(fd); return buf == cookie; } int dotest(char *file, u32 cookie, int do_anonymous, int do_msync, int do_madvise) { void *p; int fd, len = 16 * 1024, prot = PROT_READ|PROT_WRITE; u32 newcookie; if (!do_anonymous) { if((fd = open(file, O_RDWR|O_CREAT, 0666)) == -1) die("%s: %s\n", file, strerror(errno)); if(ftruncate(fd, len) == -1) die("ftruncate: %s\n", strerror(errno)); if(write(fd, "", 1) == -1) die("write: %s\n", strerror(errno)); } if((p = mmap(0, len, prot, (do_anonymous ? MAP_ANONYMOUS : 0) | MAP_PRIVATE, do_anonymous ? -1 : fd, 0)) == MAP_FAILED) die("mmap: %s\n", strerror(errno)); *(u32 *)p = cookie; if(do_msync) if(msync(p, len, MS_SYNC) == -1) die("msync: %s\n", strerror(errno)); if(do_madvise) if(madvise(p, len, MADV_DONTNEED) == -1) die("madvise: %s\n", strerror(errno)); newcookie = *(u32 *)p; printf("c = %08x msync: %s madvise: %s %s\n", cookie, do_msync ? "YES" : " NO", do_madvise ? "YES" : " NO", newcookie == cookie ? "ok" : "FAILED"); } int main(int argc, char **argv) { if(argc != 2) die("usage: %s file\n", argv[0]); lrand48(); dotest(argv[1], lrand48(), 0, 0, 0); dotest(argv[1], lrand48(), 0, 1, 0); dotest(argv[1], lrand48(), 0, 0, 1); dotest(argv[1], lrand48(), 0, 1, 1); dotest(argv[1], lrand48(), 1, 0, 0); dotest(argv[1], lrand48(), 1, 1, 0); dotest(argv[1], lrand48(), 1, 0, 1); dotest(argv[1], lrand48(), 1, 1, 1); } ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2005-07-05 23:42 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2005-06-28 13:43 wrong madvise(MADV_DONTNEED) semantic Samuel Thibault 2005-06-28 14:38 ` [Patch] Hotfix for " Jörn Engel 2005-06-28 18:16 ` Andy Isaacson 2005-06-28 18:28 ` Robert Love 2005-06-28 18:53 ` Andy Isaacson 2005-06-28 19:23 ` Robert Love 2005-06-28 19:41 ` Samuel Thibault 2005-06-28 20:03 ` Jörn Engel 2005-06-28 20:05 ` Robert Love 2005-06-28 20:17 ` Jörn Engel 2005-06-28 20:20 ` Samuel Thibault 2005-06-28 20:30 ` Jörn Engel 2005-06-28 20:37 ` Andy Isaacson 2005-07-05 23:39 ` Darren Hart 2005-06-29 16:53 ` wrong madvise(MADV DONTNEED) semantic Michael Kerrisk 2005-06-29 17:22 ` Jamie Lokier 2005-06-29 16:34 ` Michael Kerrisk 2005-06-28 18:54 ` wrong madvise(MADV_DONTNEED) semantic Samuel Thibault
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox