linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
       [not found] <20170516073616.GB767@jagdpanzerIV.localdomain>
@ 2017-05-17  8:32 ` Minchan Kim
  2017-05-17  9:14   ` Sergey Senozhatsky
  2017-05-21  7:04   ` Christoph Hellwig
  0 siblings, 2 replies; 5+ messages in thread
From: Minchan Kim @ 2017-05-17  8:32 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team, linux-fsdevel, linux-block, hch, viro, axboe,
	jlayton, tytso

Hi Sergey,

On Tue, May 16, 2017 at 04:36:17PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 16:16), Minchan Kim wrote:
> > > but would this be correct? the data is not valid - we failed to store
> > > the valid one. but instead we assure application that read()/swapin/etc.,
> > > depending on the usage scenario, is successful (even though the data is
> > > not what application really expects to see), application tries to use the
> > > data from that page and probably crashes (dunno, for example page contained
> > > hash tables with pointers that are not valid anymore, etc. etc.).
> > > 
> > > I'm not optimistic about stale data reads; it basically will look like
> > > data corruption to the application.
> > 
> > Hmm, I don't understand what you say.
> > My point is zram_free_page should be done only if whoe write operation
> > is successful.
> > With you change, following situation can happens.
> > 
> > write block 4, 'all A' -> success
> > read  block 4, 'all A' verified -> Good
> > write block 4, 'all B' -> but failed with ENOMEM
> > read  block 4  expected 'all A' but 'all 0' -> Oops
> 
> yes. 'all A' in #4 can be incorrect. zram can be used as a block device
> with a file system, and pid that does write op not necessarily does read
> op later. it can be a completely different application. e.g. compilation,
> or anything else.
> 
> suppose PID A does
> 
> wr block 1       all a
> wr block 2       all a + 1
> wr block 3       all a + 2
> wr block 4       all a + 3
> 
> now PID A does
> 
> wr block 1       all m
> wr block 2       all m + 1
> wr block 3       all m + 2
> wr block 4       failed. block still has 'all a + 3'.
> exit
> 
> another application, PID C, reads in the file and tries to do
> something sane with it
> 
> rd block 1       all m
> rd block 2       all m + 1
> rd block 3       all m + 3
> rd block 4       all a + 3      << this is dangerous. we should return
>                                    error from read() here; not stale data.
> 
> 
> what we can return now is a `partially updated' data, with some new
> and some stale pages. this is quite unlikely to end up anywhere good.
> am I wrong?
> 
> why does `rd block 4' in your case causes Oops? as a worst case scenario?
> application does not expect page to be 'all A' at this point. pages are
> likely to belong to some mappings/files/etc., and there is likely a data
> dependency between them, dunno C++ objects that span across pages or
> JPEG images, etc. so returning "new data   new data   stale data" is a bit
> fishy.

I thought more about it and start to confuse. :/
So, let's Cc linux-block, fs peoples.

The question is that 

Is block device(esp, zram which is compressed ram block device) okay to
return garbage when ongoing overwrite IO fails?

O_DIRECT write 4 block "aaa.." -> success
read  4 block "aaa.." -> success
O_DIRECT write 4 block "bbb.." -> fail
read  4 block "000..' -> it is okay?

Hope to get an answer form experts. :)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  8:32 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
@ 2017-05-17  9:14   ` Sergey Senozhatsky
  2017-05-18  4:53     ` Minchan Kim
  2017-05-21  7:04   ` Christoph Hellwig
  1 sibling, 1 reply; 5+ messages in thread
From: Sergey Senozhatsky @ 2017-05-17  9:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch,
	viro, axboe, jlayton, tytso

Hello Minchan,

On (05/17/17 17:32), Minchan Kim wrote:
[..]
> > what we can return now is a `partially updated' data, with some new
> > and some stale pages. this is quite unlikely to end up anywhere good.
> > am I wrong?
> > 
> > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > application does not expect page to be 'all A' at this point. pages are
> > likely to belong to some mappings/files/etc., and there is likely a data
> > dependency between them, dunno C++ objects that span across pages or
> > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > fishy.
> 
> I thought more about it and start to confuse. :/

sorry, I'm not sure I see what's the source of your confusion :)

my point is - we should not let READ succeed if we know that WRITE
failed. assume JPEG image example,


over-write block 1 aaa->xxx OK
over-write block 2 bbb->yyy OK
over-write block 3 ccc->zzz error

reading that JPEG file

read block 1 xxx OK
read block 2 yyy OK
read block 3 ccc OK   << we should not return OK here. because
                         "xxxyyyccc" is not the correct JPEG file
                         anyway.

do you agree that telling application that read() succeeded and at
the same returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
not correct?



so how about this,

- if we fail to compress page (S/W or H/W compressor error, depending
  on particular setup) let's store it uncompressed (page_size-d zspool
  object).

?

this should do the trick. at least we will have correct data:
	xxx  - compressed
	yyy  - compressed
	zzz  - uncompressed, because compressing back-end returned an error.

thoughts?

	-ss

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  9:14   ` Sergey Senozhatsky
@ 2017-05-18  4:53     ` Minchan Kim
  0 siblings, 0 replies; 5+ messages in thread
From: Minchan Kim @ 2017-05-18  4:53 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team, linux-fsdevel, linux-block, hch, viro, axboe,
	jlayton, tytso

Hi Sergey,

On Wed, May 17, 2017 at 06:14:23PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (05/17/17 17:32), Minchan Kim wrote:
> [..]
> > > what we can return now is a `partially updated' data, with some new
> > > and some stale pages. this is quite unlikely to end up anywhere good.
> > > am I wrong?
> > > 
> > > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > > application does not expect page to be 'all A' at this point. pages are
> > > likely to belong to some mappings/files/etc., and there is likely a data
> > > dependency between them, dunno C++ objects that span across pages or
> > > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > > fishy.
> > 
> > I thought more about it and start to confuse. :/
> 
> sorry, I'm not sure I see what's the source of your confusion :)
> 
> my point is - we should not let READ succeed if we know that WRITE
> failed. assume JPEG image example,

I don't think we shoul do it. I will write down my thought below. :)

> 
> 
> over-write block 1 aaa->xxx OK
> over-write block 2 bbb->yyy OK
> over-write block 3 ccc->zzz error
> 
> reading that JPEG file
> 
> read block 1 xxx OK
> read block 2 yyy OK
> read block 3 ccc OK   << we should not return OK here. because
>                          "xxxyyyccc" is not the correct JPEG file
>                          anyway.
> 
> do you agree that telling application that read() succeeded and at
> the same returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
> not correct?

I don't agree. I *think* block device is a just dumb device so
zram doesn't need to know about any objects from the upper layer.
What zram should consider is basically read/write success or fail of
IO unit(maybe, BIO).

So if we assume each step from above example is bio unit, I think
it's no problem returns "xxxyyyccc".

What I meant "started confused" was about atomicity, not above
thing.

I think it's okay to return ccc instead of zzz but is it okay
zram to return "000", not "ccc" and "zzz"?
My conclusion is that it's okay now after discussion from one
of my FS friends.

Let's think about it.

FS requests write "aaa" to block 4 and fails by somethings
(H/W failure, S/W failure like ENOMEM). The interface to catch
the failure is the function registered by bio_endio which is
normally handles by AS_EIO by mappint_set_error as well as
PG_error flags of the page. In this case, FS assumes the block
4 can have stale data, not 'zzz' and 'ccc' because the device
was broken in the middle of write some data to a block if
the block device doesn't support atomic write(I guess it's
more popular) so it would be safe to consider the block
has garbage now rather than old value, new value.
(I hope I explain my thought well :/)

Having said that, I think everyone likes block device supports
atomicity(ie, old or new). so I am reluctant to change the
behavior for simple refactoring.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  8:32 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
  2017-05-17  9:14   ` Sergey Senozhatsky
@ 2017-05-21  7:04   ` Christoph Hellwig
  2017-05-21  7:15     ` Minchan Kim
  1 sibling, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2017-05-21  7:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch,
	viro, axboe, jlayton, tytso

On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> Is block device(esp, zram which is compressed ram block device) okay to
> return garbage when ongoing overwrite IO fails?
> 
> O_DIRECT write 4 block "aaa.." -> success
> read  4 block "aaa.." -> success
> O_DIRECT write 4 block "bbb.." -> fail
> read  4 block "000..' -> it is okay?
> 
> Hope to get an answer form experts. :)

It's "okay" as it's what existing real block devices do (at least on a
sector boundary).  It's not "nice" though, so if you can avoid it,
please do.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-21  7:04   ` Christoph Hellwig
@ 2017-05-21  7:15     ` Minchan Kim
  0 siblings, 0 replies; 5+ messages in thread
From: Minchan Kim @ 2017-05-21  7:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, viro,
	axboe, jlayton, tytso

On Sun, May 21, 2017 at 12:04:27AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> > Is block device(esp, zram which is compressed ram block device) okay to
> > return garbage when ongoing overwrite IO fails?
> > 
> > O_DIRECT write 4 block "aaa.." -> success
> > read  4 block "aaa.." -> success
> > O_DIRECT write 4 block "bbb.." -> fail
> > read  4 block "000..' -> it is okay?
> > 
> > Hope to get an answer form experts. :)
> 
> It's "okay" as it's what existing real block devices do (at least on a
> sector boundary).  It's not "nice" though, so if you can avoid it,
> please do.

That was my understanding so I wanted to avoid it for just simple
code refactoring. Your comment helps to confirm the thhought.

Thanks, Christoph!

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-05-21  7:15 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20170516073616.GB767@jagdpanzerIV.localdomain>
2017-05-17  8:32 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
2017-05-17  9:14   ` Sergey Senozhatsky
2017-05-18  4:53     ` Minchan Kim
2017-05-21  7:04   ` Christoph Hellwig
2017-05-21  7:15     ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).