* Q about pagecache data never written to disk
@ 2004-09-05 8:01 Andrey Savochkin
  2004-09-05 9:22 ` William Lee Irwin III
  2004-09-05 10:52 ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Andrey Savochkin @ 2004-09-05 8:01 UTC
To: linux-kernel

Let's suppose an mmap'ed (SHARED, RW) file has a hole.
AFAICS, we allow to dirty the file pages without allocating the space for the
hole - filemap_nopage just "reads" the page filling it with zeroes, and
nothing is done about the on-disk data until writepage.

So, if the page can't be written to disk (no space), the dirty data just
stays in the pagecache. The data can be read or seen via mmap, but it isn't
and never will be on disk. The pagecache stays unsynchronized with the on-disk
content forever.

Is it the intended behavior?
Shouldn't we call the filesystem to fill the hole at the moment of the first
write access?

	Andrey
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-05 9:22 UTC
To: Andrey Savochkin; +Cc: linux-kernel

On Sun, Sep 05, 2004 at 12:01:47PM +0400, Andrey Savochkin wrote:
> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> AFAICS, we allow to dirty the file pages without allocating the space for the
> hole - filemap_nopage just "reads" the page filling it with zeroes, and
> nothing is done about the on-disk data until writepage.
> So, if the page can't be written to disk (no space), the dirty data just
> stays in the pagecache. The data can be read or seen via mmap, but it isn't
> and never will be on disk. The pagecache stays unsynchronized with the on-disk
> content forever.
> Is it the intended behavior?
> Shouldn't we call the filesystem to fill the hole at the moment of the first
> write access?

We would have to trap the first write access for that. What we do at the
moment is lazily collecting the results of write accesses a.k.a. dirty
bits from the pte data structures only under memory pressure. i.e. mmap()
IO is permabust, and fixing it is permavetoed.

At the moment the only protection faults the kernel understands are those
meant for copy-on-write; trapping these accesses involves understanding
that a protection fault may occur for other reasons. So, in do_no_page()
we now have:

	 * Note that if write_access is true, we either now have
	 * an exclusive copy of the page, or this is a shared mapping,
	 * so we can make it writable and dirty to avoid having to
	 * handle that later.
	 */
	/* Only go through if we didn't race with anybody else... */
	if (pte_none(*page_table)) {
		if (!PageReserved(new_page))
			++mm->rss;
		flush_icache_page(vma, new_page);
		entry = mk_pte(new_page, vma->vm_page_prot);
		if (write_access)
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		set_pte(page_table, entry);

Here we would have to rearrange the pte setup, something vaguely like,
but not precisely like, the following snippet. The basic idea is that you
arrange for the event to occur in do_no_page() when a read fault is taken
on the thing, and then handle it later in do_wp_page(), or otherwise
process it immediately if you know it's happening right in do_no_page().
It probably makes sense to do set_page_dirty() and other things around
this kind of situation as well, and the calling convention in this
example may not be ideal. At any rate, if the space reservation for the
page may block, you'll have to spin_unlock(&vma->vm_mm->page_table_lock)
and also have to pte_unmap(page_table) in enough_space() in this
arrangement.

Of course, you won't be able to use this out of the box; you'll have to
implement enough_space() and possibly even add a new filesystem method
for enough_space() to call to do its ENOSPC detection. This may also not
interoperate particularly well with architecture support code for
non-i386 architectures. I see low enough odds of this kind of affair
getting merged that I don't really see a point in going through with much
of this myself, though if you yourself have a need, maybe this tells you
something useful enough for you to carry out the rest.

There was some kind of talk of an alternative to be carried out at
mmap()-time, but as of yet there's been no coherent explanation of how
it's possible for such half measures to cope with the realities of block
indexing metadata or space consumers competing with mmap().


-- wli

--- mm3-2.6.9-rc1/mm/memory.c	2004-09-03 03:06:24.000000000 -0700
+++ mmap-io-2.6.9-rc1-mm3/mm/memory.c	2004-09-05 02:19:24.469265712 -0700
@@ -1056,6 +1056,19 @@ static int do_wp_page(struct mm_struct *
 	unsigned long pfn = pte_pfn(pte);
 	pte_t entry;
 
+	if (vma->vm_flags & VM_WRITE) {
+		int ret;
+		if (enough_space(vma, address, pte, &page_table)) {
+			ret = VM_FAULT_MINOR;
+			pte = pte_mkwrite(pte_mkyoung(pte_mkdirty(pte)));
+			set_pte(page_table, pte);
+		} else
+			ret = VM_FAULT_SIGBUS;
+		pte_unmap(page_table);
+		spin_unlock(&mm->page_table_lock);
+		return ret;
+	}
+
 	if (unlikely(!pfn_valid(pfn))) {
 		/*
 		 * This should really halt the system so it can be debugged or
@@ -1562,13 +1575,24 @@ do_no_page(struct mm_struct *
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
-			++mm->rss;
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte(page_table, entry);
+		else
+			entry = pte_wrprotect(entry);
+		if (!write_access ||
+		    enough_space(vma, address, entry, &page_table))
+			set_pte(page_table, entry);
+		else {
+			spin_unlock(&mm->page_table_lock);
+			pte_unmap(page_table);
+			ret = VM_FAULT_SIGBUS;
+			page_cache_release(new_page);
+			goto out;
+		}
+		if (!PageReserved(new_page))
+			++mm->rss;
 		if (anon) {
 			lru_cache_add_active(new_page);
 			page_add_anon_rmap(new_page, vma, address);
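
[Editorial illustration, not part of the original message.] The control flow proposed above (map the page write-protected on a read fault, then reserve space on the first write fault and raise SIGBUS if the reservation fails) can be modelled in plain user-space C. This is only a toy sketch: `toy_pte`, `toy_fs`, `toy_enough_space()`, `toy_read_fault()`, `toy_write_fault()` and the `FAULT_*` values are hypothetical names invented for the illustration, and nothing here corresponds to real kernel interfaces or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the proposal; not kernel code. */
enum fault_result { FAULT_MINOR, FAULT_SIGBUS };

struct toy_pte {
    bool present;   /* page is mapped */
    bool writable;  /* stores are allowed without taking a fault */
    bool dirty;     /* page was written through this mapping */
};

struct toy_fs {
    size_t free_blocks;  /* blocks still available for reservation */
};

/* Stand-in for the unimplemented enough_space(): reserve one block. */
bool toy_enough_space(struct toy_fs *fs)
{
    assert(fs != NULL);
    if (fs->free_blocks == 0)
        return false;
    fs->free_blocks--;
    return true;
}

/* Read fault: map the page write-protected so the first store traps. */
enum fault_result toy_read_fault(struct toy_pte *pte)
{
    assert(pte != NULL);
    pte->present = true;
    pte->writable = false;
    pte->dirty = false;
    return FAULT_MINOR;
}

/* Write fault: reserve space first, only then make the pte writable. */
enum fault_result toy_write_fault(struct toy_fs *fs, struct toy_pte *pte)
{
    assert(pte != NULL);
    if (pte->present && pte->writable)
        return FAULT_MINOR;       /* space was reserved earlier */
    if (!toy_enough_space(fs))
        return FAULT_SIGBUS;      /* the ENOSPC case: no mapping change */
    pte->present = true;
    pte->writable = true;
    pte->dirty = true;
    return FAULT_MINOR;
}
```

In this model a store to a page is refused before any data is dirtied, which is the property the patch is after; the cost is one extra fault per clean page, as discussed later in the thread.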
* Re: Q about pagecache data never written to disk
From: Andrew Morton @ 2004-09-05 10:52 UTC
To: Andrey Savochkin; +Cc: linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>
> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> AFAICS, we allow to dirty the file pages without allocating the space for the
> hole - filemap_nopage just "reads" the page filling it with zeroes, and
> nothing is done about the on-disk data until writepage.
>
> So, if the page can't be written to disk (no space), the dirty data just
> stays in the pagecache. The data can be read or seen via mmap, but it isn't
> and never will be on disk. The pagecache stays unsynchronized with the on-disk
> content forever.

The kernel will make one attempt to write the data to disk. If that write
hits ENOSPC, the page is not redirtied (ie: the data can be lost).

When that write hits ENOSPC an error flag is set in the address_space and
that will be returned from a subsequent msync(). The application will then
need to do something about it.

If your application doesn't msync() the memory then it doesn't care about
its data anyway. If your application _does_ msync the pages then we
reliably report errors.

> Is it the intended behavior?
> Shouldn't we call the filesystem to fill the hole at the moment of the first
> write access?

That would be a retrograde step - it would be nice to move in the other
direction: perform disk allocation at writeback time rather than at write()
time, even for regular write() data. To do that we (probably) need space
reservation APIs. And yes, we perhaps could reserve space in the
filesystem when that page is first written to.

But then what would we do if there's no space? SIGBUS? SIGSEGV?
Inappropriate. SIGENOSPC?
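
[Editorial illustration, not part of the original message.] The application-side pattern described above, dirtying a hole through a shared mapping and then relying on msync() to learn about a failed write, might look roughly like the following. It is a minimal sketch: the function name `dirty_hole_and_sync()` is made up for this example, and the claim that the store itself does not fail reflects the behaviour described in this thread rather than a guarantee for every kernel or filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Dirty the first byte of an all-hole file through a MAP_SHARED mapping,
 * then force writeback with msync(MS_SYNC). Returns 0 on success, or the
 * errno of the failing call (for example ENOSPC reported by msync()).
 */
int dirty_hole_and_sync(const char *path, off_t file_size)
{
    assert(path != NULL && file_size > 0);
    int err = 0;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return errno;

    /* ftruncate() grows the file without allocating blocks: one big hole. */
    if (ftruncate(fd, file_size) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    char *p = mmap(NULL, (size_t)file_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        err = errno;
        close(fd);
        return err;
    }

    /* Under the behaviour described in the thread, ENOSPC does not show
     * up here; the page is simply dirtied in the pagecache. */
    p[0] = 'x';

    /* Synchronous writeback: the place where the error is reported. */
    if (msync(p, (size_t)file_size, MS_SYNC) < 0)
        err = errno;

    munmap(p, (size_t)file_size);
    close(fd);
    return err;
}
```

A caller that gets a non-zero result back still has the data in memory only, so it has to decide what to do with it (retry elsewhere, report, abort); the kernel will not retry on its behalf.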
* Re: Q about pagecache data never written to disk
From: Andrey Savochkin @ 2004-09-05 11:43 UTC
To: Andrew Morton; +Cc: linux-kernel

Hi Andrew,

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> >
> > Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> > AFAICS, we allow to dirty the file pages without allocating the space for the
> > hole - filemap_nopage just "reads" the page filling it with zeroes, and
> > nothing is done about the on-disk data until writepage.
> >
> > So, if the page can't be written to disk (no space), the dirty data just
> > stays in the pagecache. The data can be read or seen via mmap, but it isn't
> > and never will be on disk. The pagecache stays unsynchronized with the on-disk
> > content forever.
>
> The kernel will make one attempt to write the data to disk. If that write
> hits ENOSPC, the page is not redirtied (ie: the data can be lost).
>
> When that write hits ENOSPC an error flag is set in the address_space and
> that will be returned from a subsequent msync(). The application will then
> need to do something about it.
>
> If your application doesn't msync() the memory then it doesn't care about
> its data anyway. If your application _does_ msync the pages then we
> reliably report errors.

This question came to my mind when I was thinking about journal_start in
ext3_prepare_write and copy_from_user issue...
Did you follow that discussion?

In the considered scenario not only the application is not
guaranteed anything till msync(), but all other programs doing regular read()
may also be fooled about the file content, and this idea surprised me.
On the other hand, after a write() other programs also see the new content
without a guarantee that this content corresponds with what is on the disk...

>
> > Is it the intended behavior?
> > Shouldn't we call the filesystem to fill the hole at the moment of the first
> > write access?
>
> That would be a retrograde step - it would be nice to move in the other
> direction: perform disk allocation at writeback time rather than at write()
> time, even for regular write() data. To do that we (probably) need space
> reservation APIs. And yes, we perhaps could reserve space in the
> filesystem when that page is first written to.
>
> But then what would we do if there's no space? SIGBUS? SIGSEGV?
> Inappropriate. SIGENOSPC?

Should the space be allocated on close()?
Who will get the signal if nobody accesses the file anymore?
I'm also thinking about various shell scripts with redirects to files...

	Andrey
* Re: Q about pagecache data never written to disk
From: Andrew Morton @ 2004-09-05 21:00 UTC
To: Andrey Savochkin; +Cc: linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>
> Hi Andrew,
>
> On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> > Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> > >
> > > Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> > > AFAICS, we allow to dirty the file pages without allocating the space for the
> > > hole - filemap_nopage just "reads" the page filling it with zeroes, and
> > > nothing is done about the on-disk data until writepage.
> > >
> > > So, if the page can't be written to disk (no space), the dirty data just
> > > stays in the pagecache. The data can be read or seen via mmap, but it isn't
> > > and never will be on disk. The pagecache stays unsynchronized with the on-disk
> > > content forever.
> >
> > The kernel will make one attempt to write the data to disk. If that write
> > hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> >
> > When that write hits ENOSPC an error flag is set in the address_space and
> > that will be returned from a subsequent msync(). The application will then
> > need to do something about it.
> >
> > If your application doesn't msync() the memory then it doesn't care about
> > its data anyway. If your application _does_ msync the pages then we
> > reliably report errors.
>
> This question came to my mind when I was thinking about journal_start in
> ext3_prepare_write and copy_from_user issue...
> Did you follow that discussion?

Yup. Chris and I have been admiring the problem for a few months now.

> In the considered scenario not only the application is not
> guaranteed anything till msync(), but all other programs doing regular read()
> may also be fooled about the file content, and this idea surprised me.
> On the other hand, after a write() other programs also see the new content
> without a guarantee that this content corresponds with what is on the disk...

No, read() will see the modified pagecache data immediately, apart from CPU
cache coherency effects.

> >
> > > Is it the intended behavior?
> > > Shouldn't we call the filesystem to fill the hole at the moment of the first
> > > write access?
> >
> > That would be a retrograde step - it would be nice to move in the other
> > direction: perform disk allocation at writeback time rather than at write()
> > time, even for regular write() data. To do that we (probably) need space
> > reservation APIs. And yes, we perhaps could reserve space in the
> > filesystem when that page is first written to.
> >
> > But then what would we do if there's no space? SIGBUS? SIGSEGV?
> > Inappropriate. SIGENOSPC?
>
> Should the space be allocated on close()?

What effect are you trying to achieve?

> Who will get the signal if nobody accesses the file anymore?

Nobody. That's the point. Plus there _is_ no signal defined for this.
Neither in Linux nor in POSIX.

> I'm also thinking about various shell scripts with redirects to files...

? I doubt that they're writing files via MAP_SHARED.
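
[Editorial illustration, not part of the original message.] The point that read() sees data stored through a shared mapping straight away, with no msync() in between, can be checked from user space. The sketch below assumes a Linux-style unified pagecache; `read_after_mmap_store()` is a name made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store one byte through a MAP_SHARED mapping, then read the same offset
 * back with pread() on a second descriptor, without calling msync().
 * Returns the byte that pread() saw, or -1 on any failure.
 */
int read_after_mmap_store(const char *path, unsigned char value)
{
    assert(path != NULL);
    const size_t len = 4096;
    unsigned char seen = 0;
    unsigned char *p = MAP_FAILED;
    int rfd = -1;
    int result = -1;

    int wfd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;

    /* Create a hole-only file and map it shared and writable. */
    if (ftruncate(wfd, (off_t)len) < 0)
        goto out;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, wfd, 0);
    if (p == MAP_FAILED)
        goto out;

    /* A second, independent descriptor plays the "other program". */
    rfd = open(path, O_RDONLY);
    if (rfd < 0)
        goto out;

    p[0] = value;                       /* dirty the pagecache page */

    if (pread(rfd, &seen, 1, 0) == 1)   /* served from the same page */
        result = seen;

out:
    if (p != MAP_FAILED)
        munmap(p, len);
    if (rfd >= 0)
        close(rfd);
    close(wfd);
    return result;
}
```

Note that this only demonstrates what readers observe; it says nothing about whether the byte ever reaches the disk, which is the concern raised in the rest of the thread.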
* Re: Q about pagecache data never written to disk
From: Andrey Savochkin @ 2004-09-06 7:06 UTC
To: Andrew Morton; +Cc: linux-kernel

On Sun, Sep 05, 2004 at 02:00:40PM -0700, Andrew Morton wrote:
> Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> > On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> > > That would be a retrograde step - it would be nice to move in the other
> > > direction: perform disk allocation at writeback time rather than at write()
> > > time, even for regular write() data. To do that we (probably) need space
> > > reservation APIs. And yes, we perhaps could reserve space in the
> > > filesystem when that page is first written to.
> > >
> > > But then what would we do if there's no space? SIGBUS? SIGSEGV?
> > > Inappropriate. SIGENOSPC?
> >
> > Should the space be allocated on close()?
>
> What effect are you trying to achieve?

Sending a signal while there is still a process...

> > Who will get the signal if nobody accesses the file anymore?
>
> Nobody. That's the point. Plus there _is_ no signal defined for this.
> Neither in Linux nor in POSIX.
>
> > I'm also thinking about various shell scripts with redirects to files...
>
> ? I doubt that they're writing files via MAP_SHARED.

I was deliberating on your idea about delayed allocation for regular
write()s also...
* Re: Q about pagecache data never written to disk
From: Pavel Machek @ 2004-09-09 12:39 UTC
To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

Hi!

> > > The kernel will make one attempt to write the data to disk. If that write
> > > hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> > >
> > > When that write hits ENOSPC an error flag is set in the address_space and
> > > that will be returned from a subsequent msync(). The application will then
> > > need to do something about it.
> > >
> > > If your application doesn't msync() the memory then it doesn't care about
> > > its data anyway. If your application _does_ msync the pages then we
> > > reliably report errors.
> >
> > This question came to my mind when I was thinking about journal_start in
> > ext3_prepare_write and copy_from_user issue...
> > Did you follow that discussion?
>
> Yup. Chris and I have been admiring the problem for a few months now.
>
> > In the considered scenario not only the application is not
> > guaranteed anything till msync(), but all other programs doing regular read()
> > may also be fooled about the file content, and this idea surprised me.
> > On the other hand, after a write() other programs also see the new content
> > without a guarantee that this content corresponds with what is on the disk...
>
> No, read() will see the modified pagecache data immediately, apart from CPU
> cache coherency effects.

Is not this quite a big security hole?

cat evil_data > /tmp/sign.me	[Okay, evil_data probably have to
				contain lot of zeroes?]
sync, fill disk or wait for someone to fill disk completely

attempt to write good_data to /tmp/sign.me using mmap

"Hey, root, see what /tmp/sign.me contains, can you make it suid?"

root reads /tmp/sign.me, and sees it is good.

root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me

kernel realizes that there's not enough disk space, and discard
changes, therefore /tmp/sign.me reverts to previous, evil, content.

								Pavel
-- 
When do you have heart between your knees?
* Re: Q about pagecache data never written to disk
From: Nick Piggin @ 2004-09-09 13:15 UTC
To: Pavel Machek; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Pavel Machek wrote:

>>No, read() will see the modified pagecache data immediately, apart from CPU
>>cache coherency effects.
>
>
> Is not this quite a big security hole?
>
> cat evil_data > /tmp/sign.me	[Okay, evil_data probably have to
> 				contain lot of zeroes?]
> sync, fill disk or wait for someone to fill disk completely
>
> attempt to write good_data to /tmp/sign.me using mmap
>
> "Hey, root, see what /tmp/sign.me contains, can you make it suid?"
>
> root reads /tmp/sign.me, and sees it is good.
>
> root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me
>
> kernel realizes that there's not enough disk space, and discard
> changes, therefore /tmp/sign.me reverts to previous, evil, content.
>

root would have to make that change while user has the file open,
and should welcome the subsequent unleashing of evil content as a
valuable lesson.
* Re: Q about pagecache data never written to disk
From: Pavel Machek @ 2004-09-09 13:37 UTC
To: Nick Piggin; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Hi!

> >>No, read() will see the modified pagecache data immediately, apart from CPU
> >>cache coherency effects.
> >
> >
> >Is not this quite a big security hole?
> >
> >cat evil_data > /tmp/sign.me	[Okay, evil_data probably have to
> >				contain lot of zeroes?]
> >sync, fill disk or wait for someone to fill disk completely
> >
> >attempt to write good_data to /tmp/sign.me using mmap
> >
> >"Hey, root, see what /tmp/sign.me contains, can you make it suid?"
> >
> >root reads /tmp/sign.me, and sees it is good.
> >
> >root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me
> >
> >kernel realizes that there's not enough disk space, and discard
> >changes, therefore /tmp/sign.me reverts to previous, evil, content.
> >
>
> root would have to make that change while user has the file open,
> and should welcome the subsequent unleashing of evil content as a
> valuable lesson.

Really? I thought that writeback is not synchronous at close()
time.... Hmm.... It probably could be in case of mmap....

It is still pretty unexpected. Like "root sees you have that file
open, so he stops you via ptrace".... but ok....

								Pavel
* Re: Q about pagecache data never written to disk
From: Nick Piggin @ 2004-09-09 13:32 UTC
To: Pavel Machek; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Pavel Machek wrote:

>>>kernel realizes that there's not enough disk space, and discard
>>>changes, therefore /tmp/sign.me reverts to previous, evil, content.
>>>
>>
>>root would have to make that change while user has the file open,
>>and should welcome the subsequent unleashing of evil content as a
>>valuable lesson.
>
>
> Really? I thought that writeback is not synchronous at close()
> time.... Hmm.... It probably could be in case of mmap....
>

writeback isn't, but the pages will get marked dirty at unmap.
But I think I am wrong actually - I don't actually see why the
user would have to have the file open.

> It is still pretty unexpected. Like "root sees you have that file
> open, so he stops you via ptrace".... but ok....

Or maybe

cp /tmp/sign.me ~/
chown ... ~/sign.me
chmod ... ~/sign.me
mv ~/sign.me /tmp/signed
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-09 17:24 UTC
To: Nick Piggin; +Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

On Thu, Sep 09, 2004 at 11:32:01PM +1000, Nick Piggin wrote:
> writeback isn't, but the pages will get marked dirty at unmap.
> But I think I am wrong actually - I don't actually see why the
> user would have to have the file open.

Dirty memory "limits" have no force as applied to mmap() IO, which is
not a pretty state of affairs with respect to various attempts the VM
makes at mitigating data structure proliferation associated with dirty
data.


-- wli
* Re: Q about pagecache data never written to disk
From: Nick Piggin @ 2004-09-09 17:14 UTC
To: William Lee Irwin III
Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

William Lee Irwin III wrote:
> On Thu, Sep 09, 2004 at 11:32:01PM +1000, Nick Piggin wrote:
>
>>writeback isn't, but the pages will get marked dirty at unmap.
>>But I think I am wrong actually - I don't actually see why the
>>user would have to have the file open.
>
>
> Dirty memory "limits" have no force as applied to mmap() IO, which is
> not a pretty state of affairs with respect to various attempts the VM
> makes at mitigating data structure proliferation associated with dirty
> data.
>

Yeah I know. data structure proliferation and just the simple fact
that it can't immediately be freed is a problem.

What is the alternative? Take a fault every time we write to a clean,
mmapped page?
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-09 17:35 UTC
To: Nick Piggin; +Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

William Lee Irwin III wrote:
>> Dirty memory "limits" have no force as applied to mmap() IO, which is
>> not a pretty state of affairs with respect to various attempts the VM
>> makes at mitigating data structure proliferation associated with dirty
>> data.

On Fri, Sep 10, 2004 at 03:14:09AM +1000, Nick Piggin wrote:
> Yeah I know. data structure proliferation and just the simple fact
> that it can't immediately be freed is a problem.
> What is the alternative? Take a fault every time we write to a clean,
> mmapped page?

That's the only option I'm now aware of. I suspect it may make sense to
try it to get a notion of just how large the performance impact is so
its cost can be properly weighed against the expected stability benefit.
But it's also worth noting that with some care the additional fault may
be circumvented in some performance-relevant instances.


-- wli
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-05 16:33 UTC
To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
>> AFAICS, we allow to dirty the file pages without allocating the
>> space for the hole - filemap_nopage just "reads" the page filling it
>> with zeroes, and nothing is done about the on-disk data until
>> writepage. So, if the page can't be written to disk (no space), the
>> dirty data just stays in the pagecache. The data can be read or
>> seen via mmap, but it isn't and never will be on disk. The pagecache
>> stays unsynchronized with the on-disk content forever.

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> The kernel will make one attempt to write the data to disk. If that write
> hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> When that write hits ENOSPC an error flag is set in the address_space and
> that will be returned from a subsequent msync(). The application will then
> need to do something about it.
> If your application doesn't msync() the memory then it doesn't care about
> its data anyway. If your application _does_ msync the pages then we
> reliably report errors.

msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and
returns 0 unconditionally AFAICT, so things are stuck blocking and
waiting for disk to reap the status of the IO at all. Maybe if that
worked the fault handling wouldn't be as important. Maybe we should be
reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is
we stash the fact those IO errors ever happened. I'm also not sure what
people think would be the right way to kick off IO in the background
there, as trying to kmalloc() a workqueue element, then doing
schedule_work() on it has resource management issues, but forcing
userspace to block on the IO to ensure it's been initiated at all
defeats the point of it.

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>> Is it the intended behavior?
>> Shouldn't we call the filesystem to fill the hole at the moment of
>> the first write access?

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> That would be a retrograde step - it would be nice to move in the other
> direction: perform disk allocation at writeback time rather than at write()
> time, even for regular write() data. To do that we (probably) need space
> reservation APIs. And yes, we perhaps could reserve space in the
> filesystem when that page is first written to.
> But then what would we do if there's no space? SIGBUS? SIGSEGV?
> Inappropriate. SIGENOSPC?

I believe SIGBUS is conventional, though you seem to be leaning toward
solutions outside the fault path. I presume the "You're screwed without
msync(2)" bit is standards-conformant.


-- wli
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-06 6:24 UTC
To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

On Sun, Sep 05, 2004 at 09:33:44AM -0700, William Lee Irwin III wrote:
> msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and
> returns 0 unconditionally AFAICT, so things are stuck blocking and
> waiting for disk to reap the status of the IO at all. Maybe if that
> worked the fault handling wouldn't be as important. Maybe we should be
> reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is
> we stash the fact those IO errors ever happened. I'm also not sure what
> people think would be the right way to kick off IO in the background
> there, as trying to kmalloc() a workqueue element, then doing
> schedule_work() on it has resource management issues, but forcing
> userspace to block on the IO to ensure it's been initiated at all
> defeats the point of it.

And, interestingly, the only user of the result of set_page_dirty() is
redirty_page_for_writepage(), whose results are ignored by all callers.
It appears that something is amiss here, as failed reservations aren't
reported until something attempts background writeback or IO syscalls.
That is, it would seem that checking the results of set_page_dirty(),
also called in the MS_ASYNC case, suffices, however, it does not return
useful results in most (all?) cases, and nothing now checks its result.
The calling convention looks very very odd also; filemap_fdatawait() is
the only apparent way to extract an ENOSPC result without calling the
->writepage() method directly, and this, instead of checking for things
returning -ENOSPC as one would expect, does a rather odd thing, that is
test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
one of the results whenever there are multiple concurrent callers of it
on a single inode. Worse yet, that can be legitimate, particularly when
multiple tasks concurrently msync() disjoint subsets of a file's data.


-- wli
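
[Editorial illustration, not part of the original message.] The "lose all but one of the results" effect of a test-and-clear error flag can be shown with a small user-space model built on C11 atomics. Everything here is hypothetical: `toy_mapping`, `TOY_AS_ENOSPC`, `toy_record_enospc()` and `toy_fdatawait()` merely imitate the shape of the kernel code being discussed and are not its actual definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_AS_ENOSPC (1u << 0)   /* hypothetical bit position */

/* One shared flag word per file, as with mapping->flags. */
struct toy_mapping {
    atomic_uint flags;
};

/* Writeback side: remember that some page of this file hit ENOSPC. */
void toy_record_enospc(struct toy_mapping *m)
{
    assert(m != NULL);
    atomic_fetch_or(&m->flags, TOY_AS_ENOSPC);
}

/*
 * Waiter side: atomically test and clear the flag, in the manner of
 * test_and_clear_bit(). Only the caller that observes the bit set gets
 * -ENOSPC; every later caller sees 0 until another error is recorded.
 */
int toy_fdatawait(struct toy_mapping *m)
{
    assert(m != NULL);
    unsigned int old = atomic_fetch_and(&m->flags, ~TOY_AS_ENOSPC);
    return (old & TOY_AS_ENOSPC) ? -ENOSPC : 0;
}
```

If two pages fail and two tasks then wait on disjoint ranges of the same file, the first waiter gets -ENOSPC and the second gets 0, even though its own range may be the one that failed. A single bit also cannot say how many writes failed or where.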
* Re: Q about pagecache data never written to disk 2004-09-06 6:24 ` William Lee Irwin III @ 2004-09-06 7:02 ` Andrew Morton 2004-09-06 15:12 ` William Lee Irwin III 0 siblings, 1 reply; 17+ messages in thread From: Andrew Morton @ 2004-09-06 7:02 UTC (permalink / raw) To: William Lee Irwin III; +Cc: saw, linux-kernel William Lee Irwin III <wli@holomorphy.com> wrote: > > On Sun, Sep 05, 2004 at 09:33:44AM -0700, William Lee Irwin III wrote: > > msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and > > returns 0 unconditionally AFAICT, so things are stuck blocking and > > waiting for disk to reap the status of the IO at all. Maybe if that > > worked the fault handling wouldn't be as important. Maybe we should be > > reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is > > we stash the fact those IO errors ever happened. I'm also not sure what > > people think would be the right way to kick off IO in the background > > there, as trying to kmalloc() a workqueue element, then doing > > schedule_work() on it has resource management issues, but forcing > > userspace to block on the IO to ensure it's been initiated at all > > defeats the point of it. > > And, interestingly, the only user of the result of set_page_dirty() is > redirty_page_for_writepage(), whose results are ignored by all callers. > It appears that something is amiss here, as failed reservations aren't > reported until something attempts background writeback or IO syscalls. > That is, it would seem that checking the results of set_page_dirty(), > also called in the MS_ASYNC case, suffices, however, it does not return > useful results in most (all?) cases, and nothing now checks its result. Yes, the non-void return value from set_page_dirty() is a holdover from my very early allocate-on-flush patches, wherein set_page_dirty() did indeed reserve space in the filesystem. 
> The calling convention looks very very odd also; filemap_fdatawait() is
> the only apparent way to extract an ENOSPC result without calling the
> ->writepage() method directly, and this, instead of checking for things
> returning -ENOSPC as one would expect, does a rather odd thing, that is
> test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
> one of the results whenever there are multiple concurrent callers of it
> on a single inode. Worse yet, that can be legitimate, particularly when
> multiple tasks concurrently msync() disjoint subsets of a file's data.
>

Yes. But at least _someone_ gets told that there was an ENOSPC/EIO. What
are the alternatives?
* Re: Q about pagecache data never written to disk
From: William Lee Irwin III @ 2004-09-06 15:12 UTC
To: Andrew Morton; +Cc: saw, linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>> And, interestingly, the only user of the result of set_page_dirty() is
>> redirty_page_for_writepage(), whose results are ignored by all callers.
>> It appears that something is amiss here, as failed reservations aren't
>> reported until something attempts background writeback or IO syscalls.
>> That is, it would seem that checking the results of set_page_dirty(),
>> also called in the MS_ASYNC case, suffices, however, it does not return
>> useful results in most (all?) cases, and nothing now checks its result.

On Mon, Sep 06, 2004 at 12:02:54AM -0700, Andrew Morton wrote:
> Yes, the non-void return value from set_page_dirty() is a holdover from my
> very early allocate-on-flush patches, wherein set_page_dirty() did indeed
> reserve space in the filesystem.

Supposing one maintained upper and lower bounds on reserved space the
best it appears to be able to do is a check on the lower bound and
opportunistically add to the upper bound, as it's nonblocking. If the
callers could be given hints to back out of their locks and retry
reservations while blocking, that may do. filemap_fdatawait() can, but
there are a lot of bizarre callers, e.g. fs/hfsplus/bnode.c, and no one
maintains that kind of information.
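To make the idea of a nonblocking reservation at dirtying time concrete, here is a rough userspace sketch. It is an assumption-laden simplification: it collapses the upper and lower bounds mentioned above into one exact counter, and struct toy_sb, toy_reserve_blocks() and toy_release_blocks() are hypothetical names, not anything from the kernel or from the allocate-on-flush patches. The point is only that a compare-and-swap loop can either take the blocks or fail with -ENOSPC without ever sleeping.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical per-filesystem accounting; not a real kernel structure. */
struct toy_sb {
    atomic_long free_blocks;    /* blocks neither allocated nor reserved */
};

/*
 * Nonblocking reservation attempt, as might run when a page over a hole
 * is first dirtied. Returns 0 on success or -ENOSPC, and never sleeps.
 */
static int toy_reserve_blocks(struct toy_sb *sb, long nr)
{
    long avail = atomic_load(&sb->free_blocks);

    do {
        if (avail < nr)
            return -ENOSPC;
        /* A failed exchange reloads avail, so the check above is redone. */
    } while (!atomic_compare_exchange_weak(&sb->free_blocks, &avail,
                                           avail - nr));
    return 0;
}

/* Give a reservation back, e.g. if the dirty page is truncated away. */
static void toy_release_blocks(struct toy_sb *sb, long nr)
{
    atomic_fetch_add(&sb->free_blocks, nr);
}

/* Small self-check: one free block, two attempts to reserve it. */
static void toy_reserve_selftest(void)
{
    struct toy_sb sb;

    atomic_init(&sb.free_blocks, 1);
    assert(toy_reserve_blocks(&sb, 1) == 0);        /* first dirtier wins */
    assert(toy_reserve_blocks(&sb, 1) == -ENOSPC);  /* second is refused */
    toy_release_blocks(&sb, 1);
    assert(toy_reserve_blocks(&sb, 1) == 0);        /* space came back */
}
```

With a scheme like this the failure is reported to the task that dirtied the page, at the time it dirtied it, rather than to whichever task later waits on writeback. A caller that is allowed to block could drop its locks and retry after a refusal, which is the hint-and-retry idea raised above.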

William Lee Irwin III <wli@holomorphy.com> wrote:
>> The calling convention looks very very odd also; filemap_fdatawait() is
>> the only apparent way to extract an ENOSPC result without calling the
>> ->writepage() method directly, and this, instead of checking for things
>> returning -ENOSPC as one would expect, does a rather odd thing, that is
>> test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
>> one of the results whenever there are multiple concurrent callers of it
>> on a single inode. Worse yet, that can be legitimate, particularly when
>> multiple tasks concurrently msync() disjoint subsets of a file's data.

On Mon, Sep 06, 2004 at 12:02:54AM -0700, Andrew Morton wrote:
> Yes. But at least _someone_ gets told that there was an ENOSPC/EIO. What
> are the alternatives?

It seems more like a property of the sb, so referring things to ->i_sb
and flagging the condition in there may make some sense. But a worse
problem with all this is that the wrong one may catch the error. e.g.
one process doing mmap() IO writes to a hole on a full fs, one process
doing mmap() IO writes to already-allocated blocks, both block,
AS_ENOSPC is set on behalf of the writer to the hole, and the writer to
the already-allocated blocks reaps it.

The only IO codepath that runs out of context from the submitting
processes without alternative methods of returning the error is
vmscan.c, so in principle returning errors to callers should work. But
converting fs drivers to doing something for set_page_dirty() et al to
report or even propagating -ENOSPC back from all of the ->writepage()
and ->writepages() callsites looks painful. And supposing one moved the
ENOSPC flag to the sb the events that would clear it aren't now trapped
by anything.

Hmm. More questions than answers, again. Let me know if there are any
fs/ or mm/ sweeps you want done for ENOSPC-relevant things (not
necessarily any of the above).

-- wli
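One possible answer to the "alternatives" question, not proposed anywhere in this thread, is to replace the clear-on-read bit with a generation counter that each waiter compares against its own cursor. The sketch below is a simplified userspace illustration under that assumption; struct toy_errlog and the toy_* functions are hypothetical names. Much later kernels adopted a related idea under the name errseq_t, but this is not a copy of that code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical error log for one mapping: last error plus a generation. */
struct toy_errlog {
    atomic_uint seq;    /* bumped on every recorded error */
    atomic_int  err;    /* most recent negative errno, 0 if none */
};

/* A waiter samples the generation before it starts its own IO. */
static unsigned int toy_sample(struct toy_errlog *log)
{
    return atomic_load(&log->seq);
}

/* Writeback side: store the error first, then publish a new generation. */
static void toy_record(struct toy_errlog *log, int err)
{
    atomic_store(&log->err, err);
    atomic_fetch_add(&log->seq, 1);
}

/* Report an error recorded since *since, then advance the waiter's cursor. */
static int toy_check(struct toy_errlog *log, unsigned int *since)
{
    unsigned int now = atomic_load(&log->seq);

    if (now == *since)
        return 0;
    *since = now;
    return atomic_load(&log->err);
}

/* Two waiters, one failure: both are told, and neither is told twice. */
static void toy_errlog_selftest(void)
{
    struct toy_errlog log;
    unsigned int a, b;

    atomic_init(&log.seq, 0);
    atomic_init(&log.err, 0);
    a = toy_sample(&log);
    b = toy_sample(&log);
    toy_record(&log, -ENOSPC);
    assert(toy_check(&log, &a) == -ENOSPC);
    assert(toy_check(&log, &b) == -ENOSPC);
    assert(toy_check(&log, &a) == 0);
}
```

This addresses the lost-result problem, since reading the error no longer destroys it for other waiters. It does not address the wrong-recipient problem described above: the writer to already-allocated blocks would still see -ENOSPC caused by the writer to the hole, because the log is per mapping rather than per page range.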