Q about pagecache data never written to disk

public inbox for linux-kernel@vger.kernel.org

* Q about pagecache data never written to disk
@ 2004-09-05  8:01 Andrey Savochkin
  2004-09-05  9:22 ` William Lee Irwin III
  2004-09-05 10:52 ` Andrew Morton
  0 siblings, 2 replies; 17+ messages in thread
From: Andrey Savochkin @ 2004-09-05  8:01 UTC
  To: linux-kernel

Let's suppose an mmap'ed (SHARED, RW) file has a hole.
AFAICS, we allow the file pages to be dirtied without allocating space for
the hole - filemap_nopage just "reads" the page, filling it with zeroes, and
nothing is done about the on-disk data until writepage.

So, if the page can't be written to disk (no space), the dirty data just
stays in the pagecache.  The data can be read or seen via mmap, but it isn't
and never will be on disk.  The pagecache stays unsynchronized with the
on-disk content forever.

Is it the intended behavior?
Shouldn't we call the filesystem to fill the hole at the moment of the first
write access?

	Andrey
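
The scenario can be set up from userspace roughly as follows. This is a
minimal sketch, not code from the thread: the function name
dirty_page_in_hole() and the /tmp scratch path are assumptions made for
illustration. It creates a file that is a single hole, dirties one page
through a MAP_SHARED mapping, and reports st_blocks before and after the
store. Whether st_blocks changes at that point depends on the filesystem
and kernel in use.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty the first page of a sparse file through a
 * shared mapping. Returns 0 on success and -1 on any setup failure.
 */
int dirty_page_in_hole(long *blocks_before, long *blocks_after)
{
	char path[] = "/tmp/hole-demo-XXXXXX";	/* assumed scratch location */
	const size_t len = 1 << 20;		/* 1 MiB, all of it a hole */
	struct stat st;
	char *map;
	int fd, ret = -1;

	assert(blocks_before && blocks_after);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives until fd is closed */

	/* Extending an empty file with ftruncate() leaves a hole. */
	if (ftruncate(fd, (off_t)len) < 0 || fstat(fd, &st) < 0)
		goto out;
	*blocks_before = (long)st.st_blocks;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/*
	 * A plain store into the hole. On the kernel discussed in this
	 * thread, nothing asks the filesystem for space at this point.
	 */
	map[0] = 'x';

	if (fstat(fd, &st) == 0) {
		*blocks_after = (long)st.st_blocks;
		ret = 0;
	}
	munmap(map, len);
out:
	close(fd);
	return ret;
}
```

On a filesystem with free space the call simply succeeds; the interesting
case in the thread is the one where writeback later fails with ENOSPC.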


* Re: Q about pagecache data never written to disk
  2004-09-05  8:01 Q about pagecache data never written to disk Andrey Savochkin
@ 2004-09-05  9:22 ` William Lee Irwin III
  2004-09-05 10:52 ` Andrew Morton
  1 sibling, 0 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-05  9:22 UTC
  To: Andrey Savochkin; +Cc: linux-kernel

On Sun, Sep 05, 2004 at 12:01:47PM +0400, Andrey Savochkin wrote:
> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> AFAICS, we allow the file pages to be dirtied without allocating space for
> the hole - filemap_nopage just "reads" the page, filling it with zeroes, and
> nothing is done about the on-disk data until writepage.
> So, if the page can't be written to disk (no space), the dirty data just
> stays in the pagecache.  The data can be read or seen via mmap, but it isn't
> and never will be on disk.  The pagecache stays unsynchronized with the
> on-disk content forever.
> Is it the intended behavior?
> Shouldn't we call the filesystem to fill the hole at the moment of the first
> write access?

We would have to trap the first write access for that. What we do at
the moment is lazily collect the results of write accesses, a.k.a.
dirty bits, from the pte data structures, and only under memory pressure.
I.e. mmap() IO is permabust, and fixing it is permavetoed. At the
moment the only protection faults the kernel understands are those
meant for copy-on-write; trapping these accesses involves understanding
that a protection fault may occur for other reasons.
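
The idea of trapping the first write access has a well-known userspace
analogue, sketched below. It is not the kernel mechanism itself: the page
is mapped read-only, the first store raises SIGSEGV, and the handler
records the fault and upgrades the protection so that later stores run
without faulting. The names count_first_write_faults() and on_segv() are
made up for this illustration, and calling mprotect() from a signal
handler is an assumption that holds on Linux but is not promised by POSIX.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t write_faults;
static char *tracked_page;
static size_t tracked_len;

/* Count the protection fault, then make the page writable. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
	char *addr = (char *)info->si_addr;

	(void)sig;
	(void)ctx;
	if (addr < tracked_page || addr >= tracked_page + tracked_len)
		_exit(1);	/* a real crash, not the tracked page */
	write_faults++;
	mprotect(tracked_page, tracked_len, PROT_READ | PROT_WRITE);
}

/* Returns the number of write faults taken for two stores, or -1. */
int count_first_write_faults(void)
{
	struct sigaction sa, old;
	volatile char *p;

	tracked_len = (size_t)sysconf(_SC_PAGESIZE);
	tracked_page = mmap(NULL, tracked_len, PROT_READ,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (tracked_page == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) < 0) {
		munmap(tracked_page, tracked_len);
		return -1;
	}

	write_faults = 0;
	p = tracked_page;
	p[0] = 1;	/* first store: trapped, then restarted */
	p[1] = 2;	/* second store: page is already writable */
	assert(p[0] == 1 && p[1] == 2);

	sigaction(SIGSEGV, &old, NULL);
	munmap(tracked_page, tracked_len);
	return (int)write_faults;
}
```

Only the first store pays for a fault, which is the cost being weighed
elsewhere in this thread.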

So, in do_no_page() we now have:
	 * Note that if write_access is true, we either now have
	 * an exclusive copy of the page, or this is a shared mapping,
	 * so we can make it writable and dirty to avoid having to
	 * handle that later.
	 */
	/* Only go through if we didn't race with anybody else... */
	if (pte_none(*page_table)) {
		if (!PageReserved(new_page))
			++mm->rss;
		flush_icache_page(vma, new_page);
		entry = mk_pte(new_page, vma->vm_page_prot);
		if (write_access)
			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		set_pte(page_table, entry);

Here we would have to rearrange the pte setup, something vaguely like,
but not precisely like, the following snippet.

The basic idea is that you arrange for the event to occur in
do_no_page() when a read fault is taken on the thing, and then handle
it later in do_no_page(), or otherwise process it immediately if you
know it's happening right in do_no_page(). It probably makes sense to
do set_page_dirty() and other things around this kind of situation
as well, and the calling convention in this example may not be ideal.
At any rate, if the space reservation for the page may block, you'll
have to spin_unlock(&vma->vm_mm->page_table_lock) and also have to
pte_unmap(page_table) in enough_space() in this arrangement.

Of course, you won't be able to use this out of the box; you'll have
to implement enough_space() and possibly even add a new filesystem
method for enough_space() to call to do its ENOSPC detection. This
may also not interoperate particularly well with architecture support
code for non-i386 architectures. I see low enough odds of this kind of
affair getting merged that I don't really see a point in going through
with much of this myself, though if you yourself have a need, maybe
this tells you something useful enough for you to carry out the rest.

There was some kind of talk of an alternative to be carried out at
mmap()-time, but as of yet there's been no coherent explanation of
how it's possible for such half measures to cope with the realities
of block indexing metadata or space consumers competing with mmap().

-- wli

--- mm3-2.6.9-rc1/mm/memory.c	2004-09-03 03:06:24.000000000 -0700
+++ mmap-io-2.6.9-rc1-mm3/mm/memory.c	2004-09-05 02:19:24.469265712 -0700
@@ -1056,6 +1056,19 @@ static int do_wp_page(struct mm_struct *
 	unsigned long pfn = pte_pfn(pte);
 	pte_t entry;

+	if (vma->vm_flags & VM_WRITE) {
+		int ret;
+		if (enough_space(vma, address, pte, &page_table)) {
+			ret = VM_FAULT_MINOR;
+			pte = pte_mkwrite(pte_mkyoung(pte_mkdirty(pte)));
+			set_pte(page_table, pte);
+		} else
+			ret = VM_FAULT_SIGBUS;
+		pte_unmap(page_table);
+		spin_unlock(&mm->page_table_lock);
+		return ret;
+	}
+
 	if (unlikely(!pfn_valid(pfn))) {
 		/*
 		 * This should really halt the system so it can be debugged or
@@ -1562,13 +1575,24 @@ do_no_page(struct mm_struct *
 	 */
 	/* Only go through if we didn't race with anybody else... */
 	if (pte_none(*page_table)) {
-		if (!PageReserved(new_page))
-			++mm->rss;
 		flush_icache_page(vma, new_page);
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		set_pte(page_table, entry);
+		else
+			entry = pte_wrprotect(entry);
+		if (!write_access ||
+				enough_space(vma, address, entry, &page_table))
+			set_pte(page_table, entry);
+		else {
+			spin_unlock(&mm->page_table_lock);
+			pte_unmap(page_table);
+			ret = VM_FAULT_SIGBUS;
+			page_cache_release(new_page);
+			goto out;
+		}
+		if (!PageReserved(new_page))
+			++mm->rss;
 		if (anon) {
 			lru_cache_add_active(new_page);
 			page_add_anon_rmap(new_page, vma, address);


* Re: Q about pagecache data never written to disk
  2004-09-05  8:01 Q about pagecache data never written to disk Andrey Savochkin
  2004-09-05  9:22 ` William Lee Irwin III
@ 2004-09-05 10:52 ` Andrew Morton
  2004-09-05 11:43   ` Andrey Savochkin
  2004-09-05 16:33   ` William Lee Irwin III
  1 sibling, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2004-09-05 10:52 UTC
  To: Andrey Savochkin; +Cc: linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>
> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
>  AFAICS, we allow the file pages to be dirtied without allocating space for
>  the hole - filemap_nopage just "reads" the page, filling it with zeroes, and
>  nothing is done about the on-disk data until writepage.
> 
>  So, if the page can't be written to disk (no space), the dirty data just
>  stays in the pagecache.  The data can be read or seen via mmap, but it isn't
>  and never will be on disk.  The pagecache stays unsynchronized with the
>  on-disk content forever.

The kernel will make one attempt to write the data to disk.  If that write
hits ENOSPC, the page is not redirtied (ie: the data can be lost).

When that write hits ENOSPC an error flag is set in the address_space and
that will be returned from a subsequent msync().  The application will then
need to do something about it.

If your application doesn't msync() the memory then it doesn't care about
its data anyway.  If your application _does_ msync the pages then we
reliably report errors.
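
From the application side, the msync() contract described above might be
used as in the following sketch. The helper names mmap_store_checked() and
mmap_store_demo(), and the /tmp scratch path, are assumptions made for
this example. The helper stores through a shared mapping, forces writeback
with MS_SYNC, and hands back whatever errno value mmap() or msync()
reported.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy n bytes to offset off of fd through a MAP_SHARED mapping, then
 * wait for writeback. Returns 0, or an errno value such as ENOSPC or EIO.
 */
int mmap_store_checked(int fd, size_t file_len, size_t off,
		       const void *src, size_t n)
{
	char *map;
	int err = 0;

	assert(off <= file_len && n <= file_len - off);
	map = mmap(NULL, file_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return errno;

	memcpy(map + off, src, n);
	if (msync(map, file_len, MS_SYNC) < 0)
		err = errno;	/* a writeback failure is reported here */

	munmap(map, file_len);
	return err;
}

/* Exercise the helper on a two-page sparse scratch file. */
int mmap_store_demo(void)
{
	char path[] = "/tmp/msync-demo-XXXXXX";	/* assumed scratch location */
	int fd = mkstemp(path);
	int err;

	if (fd < 0)
		return errno;
	unlink(path);
	if (ftruncate(fd, 8192) < 0)
		err = errno;
	else
		err = mmap_store_checked(fd, 8192, 4096, "hello", 5);
	close(fd);
	return err;
}
```

An application that skips the msync() call never learns about a failed
writeback, which is the case the paragraph above writes off.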

>  Is it the intended behavior?
>  Shouldn't we call the filesystem to fill the hole at the moment of the first
>  write access?

That would be a retrograde step - it would be nice to move in the other
direction: perform disk allocation at writeback time rather than at write()
time, even for regular write() data.  To do that we (probably) need space
reservation APIs.  And yes, we perhaps could reserve space in the
filesystem when that page is first written to.

But then what would we do if there's no space?  SIGBUS?  SIGSEGV? 
Inappropriate.  SIGENOSPC?
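
Short of a kernel-side reservation API, an application that wants ENOSPC
reported as an ordinary error can allocate the range itself before
touching the mapping. The sketch below uses posix_fallocate(), which
returns the error number directly instead of setting errno. The wrapper
names reserve_hole_before_mmap() and reserve_demo() and the /tmp path are
assumptions made for this example, and this is a userspace workaround, not
the reservation scheme discussed above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Ask the filesystem to allocate [off, off + len) up front.
 * Returns 0 on success or an error number such as ENOSPC.
 */
int reserve_hole_before_mmap(int fd, off_t off, off_t len)
{
	assert(off >= 0 && len > 0);
	return posix_fallocate(fd, off, len);
}

/* Create a 64 KiB hole in a scratch file and fill it in. */
int reserve_demo(void)
{
	char path[] = "/tmp/reserve-demo-XXXXXX";	/* assumed location */
	const off_t len = 64 * 1024;
	int fd, err;

	fd = mkstemp(path);
	if (fd < 0)
		return errno;
	unlink(path);

	if (ftruncate(fd, len) < 0)
		err = errno;
	else
		err = reserve_hole_before_mmap(fd, 0, len);

	close(fd);
	return err;
}
```

With the range allocated, later stores through a shared mapping no longer
depend on free space being available at writeback time.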


* Re: Q about pagecache data never written to disk
  2004-09-05 10:52 ` Andrew Morton
@ 2004-09-05 11:43   ` Andrey Savochkin
  2004-09-05 21:00     ` Andrew Morton
  2004-09-05 16:33   ` William Lee Irwin III
  1 sibling, 1 reply; 17+ messages in thread
From: Andrey Savochkin @ 2004-09-05 11:43 UTC
  To: Andrew Morton; +Cc: linux-kernel

Hi Andrew,

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> >
> > Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> >  AFAICS, we allow the file pages to be dirtied without allocating space for
> >  the hole - filemap_nopage just "reads" the page, filling it with zeroes, and
> >  nothing is done about the on-disk data until writepage.
> > 
> >  So, if the page can't be written to disk (no space), the dirty data just
> >  stays in the pagecache.  The data can be read or seen via mmap, but it isn't
> >  and never will be on disk.  The pagecache stays unsynchronized with the
> >  on-disk content forever.
> 
> The kernel will make one attempt to write the data to disk.  If that write
> hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> 
> When that write hits ENOSPC an error flag is set in the address_space and
> that will be returned from a subsequent msync().  The application will then
> need to do something about it.
> 
> If your application doesn't msync() the memory then it doesn't care about
> its data anyway.  If your application _does_ msync the pages then we
> reliably report errors.

This question came to my mind when I was thinking about journal_start in
ext3_prepare_write and copy_from_user issue...
Did you follow that discussion?

In the scenario under consideration, not only is the application not
guaranteed anything until msync(), but all other programs doing a regular
read() may also be fooled about the file content, and this idea surprised me.
On the other hand, after a write() other programs also see the new content
without a guarantee that this content corresponds with what is on the disk...

> 
> >  Is it the intended behavior?
> >  Shouldn't we call the filesystem to fill the hole at the moment of the first
> >  write access?
> 
> That would be a retrograde step - it would be nice to move in the other
> direction: perform disk allocation at writeback time rather than at write()
> time, even for regular write() data.  To do that we (probably) need space
> reservation APIs.  And yes, we perhaps could reserve space in the
> filesystem when that page is first written to.
> 
> But then what would we do if there's no space?  SIGBUS?  SIGSEGV? 
> Inappropriate.  SIGENOSPC?

Should the space be allocated on close()?
Who will get the signal if nobody accesses the file anymore?
I'm also thinking about various shell scripts with redirects to files...

	Andrey


* Re: Q about pagecache data never written to disk
  2004-09-05 11:43   ` Andrey Savochkin
@ 2004-09-05 21:00     ` Andrew Morton
  2004-09-06  7:06       ` Andrey Savochkin
  2004-09-09 12:39       ` Pavel Machek
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2004-09-05 21:00 UTC
  To: Andrey Savochkin; +Cc: linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>
> Hi Andrew,
> 
> On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> > Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> > >
> > > Let's suppose an mmap'ed (SHARED, RW) file has a hole.
> > >  AFAICS, we allow the file pages to be dirtied without allocating space for
> > >  the hole - filemap_nopage just "reads" the page, filling it with zeroes, and
> > >  nothing is done about the on-disk data until writepage.
> > > 
> > >  So, if the page can't be written to disk (no space), the dirty data just
> > >  stays in the pagecache.  The data can be read or seen via mmap, but it isn't
> > >  and never will be on disk.  The pagecache stays unsynchronized with the
> > >  on-disk content forever.
> > 
> > The kernel will make one attempt to write the data to disk.  If that write
> > hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> > 
> > When that write hits ENOSPC an error flag is set in the address_space and
> > that will be returned from a subsequent msync().  The application will then
> > need to do something about it.
> > 
> > If your application doesn't msync() the memory then it doesn't care about
> > its data anyway.  If your application _does_ msync the pages then we
> > reliably report errors.
> 
> This question came to my mind when I was thinking about journal_start in
> ext3_prepare_write and copy_from_user issue...
> Did you follow that discussion?

Yup.  Chris and I have been admiring the problem for a few months now.

> In the scenario under consideration, not only is the application not
> guaranteed anything until msync(), but all other programs doing a regular
> read() may also be fooled about the file content, and this idea surprised me.
> On the other hand, after a write() other programs also see the new content
> without a guarantee that this content corresponds with what is on the disk...

No, read() will see the modified pagecache data immediately, apart from CPU
cache coherency effects.
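
This visibility is easy to observe from userspace. The sketch below is an
illustration under stated assumptions: the function name
read_sees_mmap_store() and the /tmp scratch path are invented for it. It
stores one byte through a MAP_SHARED mapping and reads the same offset
back through the file descriptor with pread(), without calling msync() in
between.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if pread() observed the byte stored through the mapping,
 * 0 if it did not, and -1 on a setup failure.
 */
int read_sees_mmap_store(void)
{
	char path[] = "/tmp/coherent-demo-XXXXXX";	/* assumed location */
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *map, c = 0;
	int fd, seen = -1;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ftruncate(fd, (off_t)page) < 0)
		goto out;
	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	map[0] = 'X';	/* no msync(): nothing has been forced to disk */

	/* read()-style access goes through the same pagecache page. */
	if (pread(fd, &c, 1, 0) == 1)
		seen = (c == 'X');

	munmap(map, page);
out:
	close(fd);
	return seen;
}
```

The result says nothing about what is on disk, only about what other
readers of the file are shown.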

> > 
> > >  Is it the intended behavior?
> > >  Shouldn't we call the filesystem to fill the hole at the moment of the first
> > >  write access?
> > 
> > That would be a retrograde step - it would be nice to move in the other
> > direction: perform disk allocation at writeback time rather than at write()
> > time, even for regular write() data.  To do that we (probably) need space
> > reservation APIs.  And yes, we perhaps could reserve space in the
> > filesystem when that page is first written to.
> > 
> > But then what would we do if there's no space?  SIGBUS?  SIGSEGV? 
> > Inappropriate.  SIGENOSPC?
> 
> Should the space be allocated on close()?

What effect are you trying to achieve?

> Who will get the signal if nobody accesses the file anymore?

Nobody.  That's the point.  Plus there _is_ no signal defined for this. 
Neither in Linux nor in POSIX.

> I'm also thinking about various shell scripts with redirects to files...

?  I doubt that they're writing files via MAP_SHARED.


* Re: Q about pagecache data never written to disk
  2004-09-05 21:00     ` Andrew Morton
@ 2004-09-06  7:06       ` Andrey Savochkin
  2004-09-09 12:39       ` Pavel Machek
  1 sibling, 0 replies; 17+ messages in thread
From: Andrey Savochkin @ 2004-09-06  7:06 UTC
  To: Andrew Morton; +Cc: linux-kernel

On Sun, Sep 05, 2004 at 02:00:40PM -0700, Andrew Morton wrote:
> Andrey Savochkin <saw@saw.sw.com.sg> wrote:
> > On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> > > That would be a retrograde step - it would be nice to move in the other
> > > direction: perform disk allocation at writeback time rather than at write()
> > > time, even for regular write() data.  To do that we (probably) need space
> > > reservation APIs.  And yes, we perhaps could reserve space in the
> > > filesystem when that page is first written to.
> > > 
> > > But then what would we do if there's no space?  SIGBUS?  SIGSEGV? 
> > > Inappropriate.  SIGENOSPC?
> > 
> > Should the space be allocated on close()?
> 
> What effect are you trying to achieve?

Sending a signal while there is still a process...

> > Who will get the signal if nobody accesses the file anymore?
> 
> Nobody.  That's the point.  Plus there _is_ no signal defined for this. 
> Neither in Linux nor in POSIX.
> 
> > I'm also thinking about various shell scripts with redirects to files...
> 
> ?  I doubt that they're writing files via MAP_SHARED.

I was deliberating on your idea about delayed allocation for regular write()s
also...


* Re: Q about pagecache data never written to disk
  2004-09-05 21:00     ` Andrew Morton
  2004-09-06  7:06       ` Andrey Savochkin
@ 2004-09-09 12:39       ` Pavel Machek
  2004-09-09 13:15         ` Nick Piggin
  1 sibling, 1 reply; 17+ messages in thread
From: Pavel Machek @ 2004-09-09 12:39 UTC
  To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

Hi!

> > > The kernel will make one attempt to write the data to disk.  If that write
> > > hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> > > 
> > > When that write hits ENOSPC an error flag is set in the address_space and
> > > that will be returned from a subsequent msync().  The application will then
> > > need to do something about it.
> > > 
> > > If your application doesn't msync() the memory then it doesn't care about
> > > its data anyway.  If your application _does_ msync the pages then we
> > > reliably report errors.
> > 
> > This question came to my mind when I was thinking about journal_start in
> > ext3_prepare_write and copy_from_user issue...
> > Did you follow that discussion?
> 
> Yup.  Chris and I have been admiring the problem for a few months now.
> 
> > > In the scenario under consideration, not only is the application not
> > > guaranteed anything until msync(), but all other programs doing a regular
> > > read() may also be fooled about the file content, and this idea surprised me.
> > On the other hand, after a write() other programs also see the new content
> > without a guarantee that this content corresponds with what is on the disk...
> 
> No, read() will see the modified pagecache data immediately, apart from CPU
> cache coherency effects.

Isn't this quite a big security hole?

cat evil_data > /tmp/sign.me   [Okay, evil_data probably has to
				contain a lot of zeroes?]
sync, fill disk or wait for someone to fill disk completely

attempt to write good_data to /tmp/sign.me using mmap

"Hey, root, see what /tmp/sign.me contains, can you make it suid?"

root reads /tmp/sign.me, and sees it is good.

root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me

kernel realizes that there's not enough disk space, and discards
changes, therefore /tmp/sign.me reverts to previous, evil, content.

								Pavel
-- 
When do you have heart between your knees?


* Re: Q about pagecache data never written to disk
  2004-09-09 12:39       ` Pavel Machek
@ 2004-09-09 13:15         ` Nick Piggin
  2004-09-09 13:37           ` Pavel Machek
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-09-09 13:15 UTC
  To: Pavel Machek; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Pavel Machek wrote:

>>No, read() will see the modified pagecache data immediately, apart from CPU
>>cache coherency effects.
> 
> 
> Isn't this quite a big security hole?
> 
> cat evil_data > /tmp/sign.me   [Okay, evil_data probably has to
> 				contain a lot of zeroes?]
> sync, fill disk or wait for someone to fill disk completely
> 
> attempt to write good_data to /tmp/sign.me using mmap
> 
> "Hey, root, see what /tmp/sign.me contains, can you make it suid?"
> 
> root reads /tmp/sign.me, and sees it is good.
> 
> root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me
> 
> kernel realizes that there's not enough disk space, and discards
> changes, therefore /tmp/sign.me reverts to previous, evil, content.
> 

root would have to make that change while user has the file open,
and should welcome the subsequent unleashing of evil content as a
valuable lesson.


* Re: Q about pagecache data never written to disk
  2004-09-09 13:15         ` Nick Piggin
@ 2004-09-09 13:37           ` Pavel Machek
  2004-09-09 13:32             ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Machek @ 2004-09-09 13:37 UTC
  To: Nick Piggin; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Hi!

> >>No, read() will see the modified pagecache data immediately, apart from
> >>CPU cache coherency effects.
> >
> >
> >Isn't this quite a big security hole?
> >
> >cat evil_data > /tmp/sign.me   [Okay, evil_data probably has to
> >				contain a lot of zeroes?]
> >sync, fill disk or wait for someone to fill disk completely
> >
> >attempt to write good_data to /tmp/sign.me using mmap
> >
> >"Hey, root, see what /tmp/sign.me contains, can you make it suid?"
> >
> >root reads /tmp/sign.me, and sees it is good.
> >
> >root does chown root.root /tmp/sign.me; chmod 4755 /tmp/sign.me
> >
> >kernel realizes that there's not enough disk space, and discards
> >changes, therefore /tmp/sign.me reverts to previous, evil, content.
> >
> 
> root would have to make that change while user has the file open,
> and should welcome the subsequent unleashing of evil content as a
> valuable lesson.

Really? I thought that writeback is not synchronous at close()
time.... Hmm.... It probably could be in case of mmap....

It is still pretty unexpected. Like "root sees you have that file
open, so he stops you via ptrace".... but ok....
							Pavel


* Re: Q about pagecache data never written to disk
  2004-09-09 13:37           ` Pavel Machek
@ 2004-09-09 13:32             ` Nick Piggin
  2004-09-09 17:24               ` William Lee Irwin III
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-09-09 13:32 UTC
  To: Pavel Machek; +Cc: Andrew Morton, Andrey Savochkin, linux-kernel

Pavel Machek wrote:

>>>kernel realizes that there's not enough disk space, and discards
>>>changes, therefore /tmp/sign.me reverts to previous, evil, content.
>>>
>>
>>root would have to make that change while user has the file open,
>>and should welcome the subsequent unleashing of evil content as a
>>valuable lesson.
> 
> 
> Really? I thought that writeback is not synchronous at close()
> time.... Hmm.... It probably could be in case of mmap....
> 

writeback isn't, but the pages will get marked dirty at unmap.
But I think I am wrong actually - I don't actually see why the
user would have to have the file open.

> It is still pretty unexpected. Like "root sees you have that file
> open, so he stops you via ptrace".... but ok....

Or maybe

cp /tmp/sign.me ~/
chown ... ~/sign.me
chmod ... ~/sign.me
mv ~/sign.me /tmp/signed


* Re: Q about pagecache data never written to disk
  2004-09-09 13:32             ` Nick Piggin
@ 2004-09-09 17:24               ` William Lee Irwin III
  2004-09-09 17:14                 ` Nick Piggin
  0 siblings, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-09 17:24 UTC
  To: Nick Piggin; +Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

On Thu, Sep 09, 2004 at 11:32:01PM +1000, Nick Piggin wrote:
> writeback isn't, but the pages will get marked dirty at unmap.
> But I think I am wrong actually - I don't actually see why the
> user would have to have the file open.

Dirty memory "limits" have no force as applied to mmap() IO, which is
not a pretty state of affairs with respect to various attempts the VM
makes at mitigating data structure proliferation associated with dirty
data.


-- wli


* Re: Q about pagecache data never written to disk
  2004-09-09 17:24               ` William Lee Irwin III
@ 2004-09-09 17:14                 ` Nick Piggin
  2004-09-09 17:35                   ` William Lee Irwin III
  0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2004-09-09 17:14 UTC
  To: William Lee Irwin III
  Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

William Lee Irwin III wrote:
> On Thu, Sep 09, 2004 at 11:32:01PM +1000, Nick Piggin wrote:
> 
>>writeback isn't, but the pages will get marked dirty at unmap.
>>But I think I am wrong actually - I don't actually see why the
>>user would have to have the file open.
> 
> 
> Dirty memory "limits" have no force as applied to mmap() IO, which is
> not a pretty state of affairs with respect to various attempts the VM
> makes at mitigating data structure proliferation associated with dirty
> data.
> 

Yeah, I know. Data structure proliferation, and just the simple fact
that it can't immediately be freed, is a problem.

What is the alternative? Take a fault every time we write to a clean,
mmapped page?


* Re: Q about pagecache data never written to disk
  2004-09-09 17:14                 ` Nick Piggin
@ 2004-09-09 17:35                   ` William Lee Irwin III
  0 siblings, 0 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-09 17:35 UTC
  To: Nick Piggin; +Cc: Pavel Machek, Andrew Morton, Andrey Savochkin, linux-kernel

William Lee Irwin III wrote:
>> Dirty memory "limits" have no force as applied to mmap() IO, which is
>> not a pretty state of affairs with respect to various attempts the VM
>> makes at mitigating data structure proliferation associated with dirty
>> data.

On Fri, Sep 10, 2004 at 03:14:09AM +1000, Nick Piggin wrote:
> Yeah, I know. Data structure proliferation, and just the simple fact
> that it can't immediately be freed, is a problem.
> What is the alternative? Take a fault every time we write to a clean,
> mmapped page?

That's the only option I'm now aware of. I suspect it may make sense to
try it to get a notion of just how large the performance impact is so
its cost can be properly weighed against the expected stability
benefit. But it's also worth noting that with some care the additional
fault may be circumvented in some performance-relevant instances.


-- wli


* Re: Q about pagecache data never written to disk
  2004-09-05 10:52 ` Andrew Morton
  2004-09-05 11:43   ` Andrey Savochkin
@ 2004-09-05 16:33   ` William Lee Irwin III
  2004-09-06  6:24     ` William Lee Irwin III
  1 sibling, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-05 16:33 UTC
  To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>> Let's suppose an mmap'ed (SHARED, RW) file has a hole.
>> AFAICS, we allow the file pages to be dirtied without allocating
>> space for the hole - filemap_nopage just "reads" the page, filling it
>> with zeroes, and nothing is done about the on-disk data until
>> writepage. So, if the page can't be written to disk (no space), the
>> dirty data just stays in the pagecache.  The data can be read or
>> seen via mmap, but it isn't and never will be on disk.  The pagecache
>> stays unsynchronized with the on-disk content forever.

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> The kernel will make one attempt to write the data to disk.  If that write
> hits ENOSPC, the page is not redirtied (ie: the data can be lost).
> When that write hits ENOSPC an error flag is set in the address_space and
> that will be returned from a subsequent msync().  The application will then
> need to do something about it.
> If your application doesn't msync() the memory then it doesn't care about
> its data anyway.  If your application _does_ msync the pages then we
> reliably report errors.

msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and
returns 0 unconditionally AFAICT, so things are stuck blocking and
waiting for disk to reap the status of the IO at all. Maybe if that
worked the fault handling wouldn't be as important. Maybe we should be
reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is
we stash the fact those IO errors ever happened. I'm also not sure what
people think would be the right way to kick off IO in the background
there, as trying to kmalloc() a workqueue element, then doing
schedule_work() on it has resource management issues, but forcing
userspace to block on the IO to ensure it's been initiated at all
defeats the point of it.


Andrey Savochkin <saw@saw.sw.com.sg> wrote:
>> Is it the intended behavior?
>> Shouldn't we call the filesystem to fill the hole at the moment of
>> the first write access?

On Sun, Sep 05, 2004 at 03:52:33AM -0700, Andrew Morton wrote:
> That would be a retrograde step - it would be nice to move in the other
> direction: perform disk allocation at writeback time rather than at write()
> time, even for regular write() data.  To do that we (probably) need space
> reservation APIs.  And yes, we perhaps could reserve space in the
> filesystem when that page is first written to.
> But then what would we do if there's no space?  SIGBUS?  SIGSEGV? 
> Inappropriate.  SIGENOSPC?

I believe SIGBUS is conventional, though you seem to be leaning toward
solutions outside the fault path. I presume the "You're screwed without
msync(2)" bit is standards-conformant.
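
There is existing precedent for SIGBUS on file mappings: a store to a
mapped page that lies wholly beyond the end of the file raises it. The
sketch below shows that existing behaviour, not the proposed ENOSPC
handling. The function name store_beyond_eof_raises_sigbus() and the /tmp
scratch path are assumptions made for this example, and the handler
recovers with siglongjmp() only to keep the demonstration self-contained.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(bus_env, 1);	/* abandon the faulting store */
}

/*
 * Returns 1 if the store raised SIGBUS, 0 if it went through,
 * and -1 on a setup failure.
 */
int store_beyond_eof_raises_sigbus(void)
{
	char path[] = "/tmp/sigbus-demo-XXXXXX";	/* assumed location */
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	volatile char *map;
	volatile int got_sigbus = 0;
	int fd;

	assert(page > 0);
	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	/* The file is empty, so the whole mapping lies beyond EOF. */
	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) < 0) {
		munmap((void *)map, page);
		close(fd);
		return -1;
	}

	if (sigsetjmp(bus_env, 1) == 0)
		map[0] = 1;		/* expected to fault */
	else
		got_sigbus = 1;		/* reached via siglongjmp() */

	sigaction(SIGBUS, &old, NULL);
	munmap((void *)map, page);
	close(fd);
	return got_sigbus;
}
```

An out-of-space fault would reach the application the same way, which is
also why it is hard for ordinary programs to handle gracefully.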


-- wli


* Re: Q about pagecache data never written to disk
  2004-09-05 16:33   ` William Lee Irwin III
@ 2004-09-06  6:24     ` William Lee Irwin III
  2004-09-06  7:02       ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-06  6:24 UTC
  To: Andrew Morton; +Cc: Andrey Savochkin, linux-kernel

On Sun, Sep 05, 2004 at 09:33:44AM -0700, William Lee Irwin III wrote:
> msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and
> returns 0 unconditionally AFAICT, so things are stuck blocking and
> waiting for disk to reap the status of the IO at all. Maybe if that
> worked the fault handling wouldn't be as important. Maybe we should be
> reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is
> we stash the fact those IO errors ever happened. I'm also not sure what
> people think would be the right way to kick off IO in the background
> there, as trying to kmalloc() a workqueue element, then doing
> schedule_work() on it has resource management issues, but forcing
> userspace to block on the IO to ensure it's been initiated at all
> defeats the point of it.

And, interestingly, the only user of the result of set_page_dirty() is
redirty_page_for_writepage(), whose results are ignored by all callers.
It appears that something is amiss here, as failed reservations aren't
reported until something attempts background writeback or IO syscalls.
That is, it would seem that checking the results of set_page_dirty(),
also called in the MS_ASYNC case, suffices; however, it does not return
useful results in most (all?) cases, and nothing now checks its result.

The calling convention looks very very odd also; filemap_fdatawait() is
the only apparent way to extract an ENOSPC result without calling the
->writepage() method directly, and this, instead of checking for things
returning -ENOSPC as one would expect, does a rather odd thing, that is
test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
one of the results whenever there are multiple concurrent callers of it
on a single inode. Worse yet, that can be legitimate, particularly when
multiple tasks concurrently msync() disjoint subsets of a file's data.
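
The lost-result problem can be modelled in a few lines of userspace C.
This is a toy model, not kernel code: struct demo_mapping, DEMO_AS_ENOSPC
and the demo_* functions are hypothetical stand-ins for the address_space
flags word and for a caller doing the equivalent of
test_and_clear_bit(AS_ENOSPC, &mapping->flags). One recorded failure can
be consumed by only one caller, so a second waiter on the same inode is
told that everything went fine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_AS_ENOSPC 0x1u	/* hypothetical stand-in for AS_ENOSPC */

/* Toy model of the per-inode flags word. */
struct demo_mapping {
	atomic_uint flags;
};

void demo_mapping_init(struct demo_mapping *m)
{
	assert(m != NULL);
	atomic_init(&m->flags, 0u);
}

/* Writeback failed with ENOSPC: remember that in the shared flags. */
void demo_record_enospc(struct demo_mapping *m)
{
	atomic_fetch_or(&m->flags, DEMO_AS_ENOSPC);
}

/*
 * Atomic test-and-clear. Returns 1 if this caller consumed the error
 * bit and 0 if the bit was already clear.
 */
int demo_reap_enospc(struct demo_mapping *m)
{
	unsigned int old = atomic_fetch_and(&m->flags, ~DEMO_AS_ENOSPC);

	return (old & DEMO_AS_ENOSPC) != 0;
}
```

Two tasks that msync() disjoint ranges of the same file map onto two
demo_reap_enospc() calls after a single demo_record_enospc(): the first
call returns 1 and the second returns 0.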


-- wli


* Re: Q about pagecache data never written to disk
  2004-09-06  6:24     ` William Lee Irwin III
@ 2004-09-06  7:02       ` Andrew Morton
  2004-09-06 15:12         ` William Lee Irwin III
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2004-09-06  7:02 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: saw, linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>
> On Sun, Sep 05, 2004 at 09:33:44AM -0700, William Lee Irwin III wrote:
> > msync(p, sz, MS_ASYNC) only does set_page_dirty() at the moment and
> > returns 0 unconditionally AFAICT, so things are stuck blocking and
> > waiting for disk to reap the status of the IO at all. Maybe if that
> > worked the fault handling wouldn't be as important. Maybe we should be
> > reaping AS_EIO and/or AS_ENOSPC in the MS_ASYNC case, or wherever it is
> > we stash the fact those IO errors ever happened. I'm also not sure what
> > people think would be the right way to kick off IO in the background
> > there, as trying to kmalloc() a workqueue element, then doing
> > schedule_work() on it has resource management issues, but forcing
> > userspace to block on the IO to ensure it's been initiated at all
> > defeats the point of it.
> 
> And, interestingly, the only user of the result of set_page_dirty() is
> redirty_page_for_writepage(), whose results are ignored by all callers.
> It appears that something is amiss here, as failed reservations aren't
> reported until something attempts background writeback or IO syscalls.
> That is, it would seem that checking the results of set_page_dirty(),
> also called in the MS_ASYNC case, suffices, however, it does not return
> useful results in most (all?) cases, and nothing now checks its result.

Yes, the non-void return value from set_page_dirty() is a holdover from my
very early allocate-on-flush patches, wherein set_page_dirty() did indeed
reserve space in the filesystem.
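
As a rough illustration of what reserving at dirty time means, here is a toy userspace model. It is not taken from those patches: struct toy_fs, struct toy_page and the toy_* functions are invented for the example, and one page is assumed to need exactly one block.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical filesystem space accounting, in blocks. */
struct toy_fs {
	long free_blocks;	/* blocks not yet allocated on disk */
	long reserved_blocks;	/* promised to dirty pages over holes */
};

/* Hypothetical page state. */
struct toy_page {
	bool dirty;
	bool has_block;		/* already backed by an on-disk block */
	bool reserved;		/* holds a reservation for a block */
};

/* Dirtying a page over a hole must reserve a block, or fail early. */
int toy_set_page_dirty(struct toy_fs *fs, struct toy_page *pg)
{
	if (!pg->has_block && !pg->reserved) {
		if (fs->free_blocks - fs->reserved_blocks < 1)
			return -ENOSPC;	/* reported at dirty time */
		fs->reserved_blocks++;
		pg->reserved = true;
	}
	pg->dirty = true;
	return 0;
}

/* Writeback turns the reservation into an allocation. */
void toy_writepage(struct toy_fs *fs, struct toy_page *pg)
{
	if (!pg->dirty)
		return;
	if (pg->reserved) {
		/* The reservation guarantees a free block exists. */
		assert(fs->free_blocks > 0);
		fs->free_blocks--;
		fs->reserved_blocks--;
		pg->reserved = false;
		pg->has_block = true;
	}
	pg->dirty = false;
}
```

In this model the failure surfaces when the page is dirtied, so a page that cannot be written back never becomes dirty in the first place.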

> The calling convention looks very very odd also; filemap_fdatawait() is
> the only apparent way to extract an ENOSPC result without calling the
> ->writepage() method directly, and this, instead of checking for things
> returning -ENOSPC as one would expect, does a rather odd thing, that is
> test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
> one of the results whenever there are multiple concurrent callers of it
> on a single inode. Worse yet, that can be legitimate, particularly when
> multiple tasks concurrently msync() disjoint subsets of a file's data.
> 

Yes.  But at least _someone_ gets told that there was an ENOSPC/EIO.  What
are the alternatives?


* Re: Q about pagecache data never written to disk
  2004-09-06  7:02       ` Andrew Morton
@ 2004-09-06 15:12         ` William Lee Irwin III
  0 siblings, 0 replies; 17+ messages in thread
From: William Lee Irwin III @ 2004-09-06 15:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: saw, linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>> And, interestingly, the only user of the result of set_page_dirty() is
>> redirty_page_for_writepage(), whose results are ignored by all callers.
>> It appears that something is amiss here, as failed reservations aren't
>> reported until something attempts background writeback or IO syscalls.
>> That is, it would seem that checking the results of set_page_dirty(),
>> also called in the MS_ASYNC case, suffices, however, it does not return
>> useful results in most (all?) cases, and nothing now checks its result.

On Mon, Sep 06, 2004 at 12:02:54AM -0700, Andrew Morton wrote:
> Yes, the non-void return value from set_page_dirty() is a holdover from my
> very early allocate-on-flush patches, wherein set_page_dirty() did indeed
> reserve space in the filesystem.

Supposing one maintained upper and lower bounds on reserved space, the
best set_page_dirty() appears to be able to do is check the lower bound
and opportunistically add to the upper bound, as it's nonblocking. If
the callers could be given hints to back out of their locks and retry
reservations while blocking, that may do. filemap_fdatawait() can, but
there are a lot of bizarre callers, e.g. fs/hfsplus/bnode.c, and no one
maintains that kind of information.

William Lee Irwin III <wli@holomorphy.com> wrote:
>> The calling convention looks very very odd also; filemap_fdatawait() is
>> the only apparent way to extract an ENOSPC result without calling the
>> ->writepage() method directly, and this, instead of checking for things
>> returning -ENOSPC as one would expect, does a rather odd thing, that is
>> test_and_clear_bit(AS_ENOSPC, &mapping->flags), which will lose all but
>> one of the results whenever there are multiple concurrent callers of it
>> on a single inode. Worse yet, that can be legitimate, particularly when
>> multiple tasks concurrently msync() disjoint subsets of a file's data.

On Mon, Sep 06, 2004 at 12:02:54AM -0700, Andrew Morton wrote:
> Yes.  But at least _someone_ gets told that there was an ENOSPC/EIO.  What
> are the alternatives?

It seems more like a property of the sb, so referring things to ->i_sb
and flagging the condition in there may make some sense. But a worse
problem with all this is that the wrong process may catch the error,
e.g. one process doing mmap() IO writes to a hole on a full fs, another
process doing mmap() IO writes to already-allocated blocks, both block,
AS_ENOSPC is set on behalf of the writer to the hole, and the writer to
the already-allocated blocks reaps it. The only IO codepath that runs
out of context from the submitting processes without alternative methods
of returning the error is vmscan.c, so in principle returning errors to
callers should work. But converting fs drivers to have set_page_dirty()
et al report something, or even propagating -ENOSPC back from all of
the ->writepage() and ->writepages() callsites, looks painful. And
supposing one moved the ENOSPC flag to the sb, the events that would
clear it aren't now trapped by anything.

Hmm. More questions than answers, again. Let me know if there are any
fs/ or mm/ sweeps you want done for ENOSPC-relevant things (not
necessarily any of the above).

-- wli


Thread overview: 17+ messages
2004-09-05  8:01 Q about pagecache data never written to disk Andrey Savochkin
2004-09-05  9:22 ` William Lee Irwin III
2004-09-05 10:52 ` Andrew Morton
2004-09-05 11:43   ` Andrey Savochkin
2004-09-05 21:00     ` Andrew Morton
2004-09-06  7:06       ` Andrey Savochkin
2004-09-09 12:39       ` Pavel Machek
2004-09-09 13:15         ` Nick Piggin
2004-09-09 13:37           ` Pavel Machek
2004-09-09 13:32             ` Nick Piggin
2004-09-09 17:24               ` William Lee Irwin III
2004-09-09 17:14                 ` Nick Piggin
2004-09-09 17:35                   ` William Lee Irwin III
2004-09-05 16:33   ` William Lee Irwin III
2004-09-06  6:24     ` William Lee Irwin III
2004-09-06  7:02       ` Andrew Morton
2004-09-06 15:12         ` William Lee Irwin III
