All of lore.kernel.org
 help / color / mirror / Atom feed
* possible regression by the barrier patch in 2.6.30-rc2
@ 2009-04-17  4:57 Kiyoshi Ueda
  2009-04-17 13:22 ` Alasdair G Kergon
  0 siblings, 1 reply; 10+ messages in thread
From: Kiyoshi Ueda @ 2009-04-17  4:57 UTC (permalink / raw)
  To: Alasdair Kergon, Mikulas Patocka; +Cc: device-mapper development

Hi Alasdair, Mikulas,

I have reviewed the barrier patch-set and found some possible
regression from 2.6.29 in dm-implement-basic-barrier-support.patch:
   http://git.kernel.org/?p=linux/kernel/git/agk/linux-2.6-dm.git;a=commitdiff;h=af7e466a1acededbc10beaba9eec8531d561c566;hp=92c639021ca6e962645114f02e356e7feb131d0b

Please take a look and consider to fix them (I don't have enough
time to make patches now unfortunately).

1. The semantics of flush suspend has been changed.
   For flush suspend, dm used to flush all bios submitted before
   the invocation of suspend.
   But now, dm may not flush such bios:
     1. submit a barrier => set QUEUE_IO_TO_THREAD flag and kick kdmflush
     2. submit some bios => queued into md->defered if the barrier
                            is still in progress
     3. invoke suspend   => set BLOCK_IO_FOR_SUSPEND flag
                            => kdmflush stops flushing md->defered
                               => complete suspend with holding the bios
   Then, we could expect bios submitted at 2 to be flushed, but
   we can't expect that any more now.
   I think no suspend user wants this semantics change.
   Is it intended change or just a bug?


2. Possible invalid pointer reference problem again.
   Milan had fixed a possible NULL pointer reference in 2.6.29 by
   http://git.kernel.org/?p=linux/kernel/git/agk/linux-2.6-dm.git;a=commitdiff;h=b35f8caa0890169000fec22902290d9a15274cbd;hp=b2174eebd1fadb76454dad09a1dacbc17081e6b0.
   His patch means that we must not touch md after calling bio_endio()
   in dec_pending() because the bio submitter can close/free the md
   once bio_endio() is called.
   The barrier patch breaks his fix and we get the possible invalid
   pointer reference problem again.
   The problematic part is below and free_io() should be called
   before bio_endio().
----------------------------------------------------------------------
-               free_io(md, io);
+               if (bio_barrier(bio)) {
+                       /*
+                        * There can be just one barrier request so we use
+                        * a per-device variable for error reporting.
+                        * Note that you can't touch the bio after end_io_acct
+                        */
+                       md->barrier_error = io_error;
+                       end_io_acct(io);
+               } else {
+                       end_io_acct(io);
 
-               if (io_error != DM_ENDIO_REQUEUE) {
-                       trace_block_bio_complete(md->queue, bio);
+                       if (io_error != DM_ENDIO_REQUEUE) {
+                               trace_block_bio_complete(md->queue, bio);
 
-                       bio_endio(bio, io_error);
+                               bio_endio(bio, io_error);
+                       }
                }
+
+               free_io(md, io);
----------------------------------------------------------------------

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-04-17  4:57 possible regression by the barrier patch in 2.6.30-rc2 Kiyoshi Ueda
@ 2009-04-17 13:22 ` Alasdair G Kergon
  2009-04-20  4:27   ` Kiyoshi Ueda
  0 siblings, 1 reply; 10+ messages in thread
From: Alasdair G Kergon @ 2009-04-17 13:22 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Kiyoshi Ueda, device-mapper development

On Fri, Apr 17, 2009 at 01:57:02PM +0900, Kiyoshi Ueda wrote:
> 1. The semantics of flush suspend has been changed.

We absolutely must complete any I/O issued as a result of the lock_fs()
call in dm_suspend().

Alasdair
-- 
agk@redhat.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-04-17 13:22 ` Alasdair G Kergon
@ 2009-04-20  4:27   ` Kiyoshi Ueda
  2009-04-20  6:20     ` Mikulas Patocka
  0 siblings, 1 reply; 10+ messages in thread
From: Kiyoshi Ueda @ 2009-04-20  4:27 UTC (permalink / raw)
  To: Mikulas Patocka, Alasdair G Kergon; +Cc: device-mapper development

Hi Alasdair, Mikulas,

On 2009/04/17 22:22 +0900, Alasdair G Kergon wrote:
> On Fri, Apr 17, 2009 at 01:57:02PM +0900, Kiyoshi Ueda wrote:
>> 1. The semantics of flush suspend has been changed.
> 
> We absolutely must complete any I/O issued as a result of the lock_fs()
> call in dm_suspend().

I think that lock_fs() waits for I/O to complete, so no semantics change
in case of LOCKFS && FLUSH. (All I/O issued from lock_fs() are flushed.)

But in case of NO_LOCKFS && FLUSH, the semantics is changed:
  from: I/Os submitted before the suspend invocation are flushed
  to:   I/Os submitted even before the suspend invocation may not be flushed

I have no idea whether someone gets real damage by this semantics change.

> Alasdair

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-04-20  4:27   ` Kiyoshi Ueda
@ 2009-04-20  6:20     ` Mikulas Patocka
  2009-10-28 15:21       ` Alasdair G Kergon
  0 siblings, 1 reply; 10+ messages in thread
From: Mikulas Patocka @ 2009-04-20  6:20 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: device-mapper development, Alasdair G Kergon

On Mon, 20 Apr 2009, Kiyoshi Ueda wrote:

> Hi Alasdair, Mikulas,
> 
> On 2009/04/17 22:22 +0900, Alasdair G Kergon wrote:
> > On Fri, Apr 17, 2009 at 01:57:02PM +0900, Kiyoshi Ueda wrote:
> >> 1. The semantics of flush suspend has been changed.
> > 
> > We absolutely must complete any I/O issued as a result of the lock_fs()
> > call in dm_suspend().
> 
> I think that lock_fs() waits for I/O to complete, so no semantics change
> in case of LOCKFS && FLUSH. (All I/O issued from lock_fs() are flushed.)

True. It waits, so there's no problem with it.

> But in case of NO_LOCKFS && FLUSH, the semantics is changed:
>   from: I/Os submitted before the suspend invocation are flushed
>   to:   I/Os submitted even before the suspend invocation may not be flushed
> 
> I have no idea whether someone gets real damage by this semantics change.

I have no idea whether it hurts too. I think it doesn't hurt because 
suspend may come only from ioctl and that ioctl syscall doesn't submit any 
requests prior to suspend.

It would be trivial to fix, but I don't know about any case where it could 
cause a bug.

Mikulas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-04-20  6:20     ` Mikulas Patocka
@ 2009-10-28 15:21       ` Alasdair G Kergon
  2009-11-05  1:53         ` Mikulas Patocka
  2009-11-06  7:34         ` Kiyoshi Ueda
  0 siblings, 2 replies; 10+ messages in thread
From: Alasdair G Kergon @ 2009-10-28 15:21 UTC (permalink / raw)
  To: Mikulas Patocka, Kiyoshi Ueda; +Cc: device-mapper development

Well let's go back to first principles.

There are two types of suspend.

(1) I don't care about the ordering of the I/O on the disk relative to
the suspend.  This one is easy: it's --noflush --nolockfs.

(2) I do want some control over the state of the device at the point
of the suspend.

Break this second case down.
If I have a filesystem, I require it to be consistent, so I require lockfs.

If the device belongs to a userspace database, I require it to be consistent,
so I must pause the database, at which point there is no further I/O being
issued to the device, then I suspend (and everything prior to this must be
flushed) and resume etc.  This can be either a "suspend with flush" OR the
flush could have been issued prior to the suspend (and any decent database
would have done that).

If I have any other application owning the device that cares about the
state on the device at the point of the suspend it *must* stop issuing
I/O while the suspend ioctl is run.  If any I/O continues to arrive
during the suspend, then that tells us we are necessarily in case (1).

So the implementation of 'suspend with flush' is simply 'issue a flush
using the standard mechanism for doing this and then issue a suspend'.
The 'suspend' itself does not need any special handling for 'flush'.
In practice, anything that requires a flush will issue it prior to
calling the suspend ioctl and not send I/O concurrently with the
suspend.  For backwards compatibility we could still support the
'suspend with flush' as I described - issue a flush internally before
entering the code that performs the suspend.

Alasdair

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-10-28 15:21       ` Alasdair G Kergon
@ 2009-11-05  1:53         ` Mikulas Patocka
  2009-11-05  2:02           ` Mikulas Patocka
  2009-11-05  2:04           ` Re: possible regression by the barrier patch in 2.6.30-rc2 Alasdair G Kergon
  2009-11-06  7:34         ` Kiyoshi Ueda
  1 sibling, 2 replies; 10+ messages in thread
From: Mikulas Patocka @ 2009-11-05  1:53 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: Kiyoshi Ueda, device-mapper development



On Wed, 28 Oct 2009, Alasdair G Kergon wrote:

> Well let's go back to first principles.
> 
> There are two types of suspend.
> 
> (1) I don't care about the ordering of the I/O on the disk relative to
> the suspend.  This one is easy: it's --noflush --nolockfs.
> 
> (2) I do want some control over the state of the device at the point
> of the suspend.
> 
> Break this second case down.
> If I have a filesystem, I require it to be consistent, so I require lockfs.
> 
> If the device belongs to a userspace database, I require it to be consistent,
> so I must pause the database, at which point there is no further I/O being
> issued to the device, then I suspend (and everything prior to this must be
> flushed) and resume etc.  This can be either a "suspend with flush" OR the
> flush could have been issued prior to the suspend (and any decent database
> would have done that).

In this database case, you have to pause the database, the pause procedure 
will wait until all I/O finishes and won't submit new I/O. When the pause 
procedure finishes, the database has no I/O in flight, so it doesn't 
matter if you use flush or noflush suspend.

The reason is that there may be another I/O midlayers between the database 
and the device mapper. So, if the database submits I/O, it doesn't have to 
immediatelly arrive to the device mapper. If you paused the database 
(without waiting for complete I/Os) and then issued "flush" suspend, the 
I/O may still be pending somewhere above the suspended device, then the 
device finishes flush suspend, then the I/O arrives and waits until 
unsuspend.

I'm somehow starting to think that "flush" suspend is not needed at all 
and all suspends may be "noflush". Do you have any counterexamples?

Mikulas

> If I have any other application owning the device that cares about the
> state on the device at the point of the suspend it *must* stop issuing
> I/O while the suspend ioctl is run.  If any I/O continues to arrive
> during the suspend, then that tells us we are necessarily in case (1).
> 
> So the implementation of 'suspend with flush' is simply 'issue a flush
> using the standard mechanism for doing this and then issue a suspend'.
> The 'suspend' itself does not need any special handling for 'flush'.
> In practice, anything that requires a flush will issue it prior to
> calling the suspend ioctl and not send I/O concurrently with the
> suspend.  For backwards compatibility we could still support the
> 'suspend with flush' as I described - issue a flush internally before
> entering the code that performs the suspend.
> 
> Alasdair
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-11-05  1:53         ` Mikulas Patocka
@ 2009-11-05  2:02           ` Mikulas Patocka
  2009-11-06  2:42             ` snapshot of mirror (was: possible regression by the barrier patch in 2.6.30-rc2) Mikulas Patocka
  2009-11-05  2:04           ` Re: possible regression by the barrier patch in 2.6.30-rc2 Alasdair G Kergon
  1 sibling, 1 reply; 10+ messages in thread
From: Mikulas Patocka @ 2009-11-05  2:02 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: Kiyoshi Ueda, device-mapper development

On Wed, 4 Nov 2009, Mikulas Patocka wrote:

> On Wed, 28 Oct 2009, Alasdair G Kergon wrote:
> 
> > Well let's go back to first principles.
> > 
> > There are two types of suspend.
> > 
> > (1) I don't care about the ordering of the I/O on the disk relative to
> > the suspend.  This one is easy: it's --noflush --nolockfs.
> > 
> > (2) I do want some control over the state of the device at the point
> > of the suspend.
> > 
> > Break this second case down.
> > If I have a filesystem, I require it to be consistent, so I require lockfs.
> > 
> > If the device belongs to a userspace database, I require it to be consistent,
> > so I must pause the database, at which point there is no further I/O being
> > issued to the device, then I suspend (and everything prior to this must be
> > flushed) and resume etc.  This can be either a "suspend with flush" OR the
> > flush could have been issued prior to the suspend (and any decent database
> > would have done that).
> 
> In this database case, you have to pause the database, the pause procedure 
> will wait until all I/O finishes and won't submit new I/O. When the pause 
> procedure finishes, the database has no I/O in flight, so it doesn't 
> matter if you use flush or noflush suspend.
> 
> The reason is that there may be another I/O midlayers between the database 
> and the device mapper. So, if the database submits I/O, it doesn't have to 
> immediatelly arrive to the device mapper. If you paused the database 
> (without waiting for complete I/Os) and then issued "flush" suspend, the 
> I/O may still be pending somewhere above the suspended device, then the 
> device finishes flush suspend, then the I/O arrives and waits until 
> unsuspend.
> 
> I'm somehow starting to think that "flush" suspend is not needed at all 
> and all suspends may be "noflush". Do you have any counterexamples?
> 
> Mikulas

Anyway, suspend unconditionally purges all I/Os from target drivers. All 
that this "noflush" flag does is that it allows the I/O to be held and 
requeued for new dm table incarnation, if the target requests it with 
DM_MAPIO_REQUEUE (dm-raid1 and dm-multipath do it). If no target returns 
DM_MAPIO_REQUEUE, then "flush" and "noflush" suspend is equivalent.

Question: could "noflush" be the default behavior and could the "flush" 
flag be dropped? I think yes.

Mikulas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-11-05  1:53         ` Mikulas Patocka
  2009-11-05  2:02           ` Mikulas Patocka
@ 2009-11-05  2:04           ` Alasdair G Kergon
  1 sibling, 0 replies; 10+ messages in thread
From: Alasdair G Kergon @ 2009-11-05  2:04 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Kiyoshi Ueda, device-mapper development

On Wed, Nov 04, 2009 at 08:53:23PM -0500, Mikulas Patocka wrote:
> On Wed, 28 Oct 2009, Alasdair G Kergon wrote:
> > This can be either a "suspend with flush" OR the
> > flush could have been issued prior to the suspend (and any decent database
> > would have done that).
 
> I'm somehow starting to think that "flush" suspend is not needed at all 
> and all suspends may be "noflush". 

Indeed - that was what my email was trying to say - 'decent database' is
the one you describe, which waits properly, and the only reason for 'flush
suspend' is backwards compatibility with our previous releases to support
poorly-written software (which quite probably doesn't exist).

Alasdair

^ permalink raw reply	[flat|nested] 10+ messages in thread

* snapshot of mirror (was: possible regression by the barrier patch in 2.6.30-rc2)
  2009-11-05  2:02           ` Mikulas Patocka
@ 2009-11-06  2:42             ` Mikulas Patocka
  0 siblings, 0 replies; 10+ messages in thread
From: Mikulas Patocka @ 2009-11-06  2:42 UTC (permalink / raw)
  To: Alasdair G Kergon; +Cc: Kiyoshi Ueda, device-mapper development

> Anyway, suspend unconditionally purges all I/Os from target drivers. All 
> that this "noflush" flag does is that it allows the I/O to be held and 
> requeued for new dm table incarnation, if the target requests it with 
> DM_MAPIO_REQUEUE (dm-raid1 and dm-multipath do it). If no target returns 
> DM_MAPIO_REQUEUE, then "flush" and "noflush" suspend is equivalent.

Hmm, I think it makes snapshot-of-mirrors implementation impossible.

If we create the first snapshot or delete the last snapshot, we must 
inevitably suspend the underlying origin device. If we do a "flush" 
suspend and mirror log or leg fails at the same time, the i/o would be 
incorrectly returned with -EIO. If we do a "noflush" suspend, we couldn't 
synchronize the filesystem on snapshot creation and we couldn't get the 
I/Os out of origin target (because they would be queued in the underlying 
-real device).

Mirrors must to be self-sufficient (i.e. handle errors on their own 
without queuing i/os and needing immediate action of dmeventd) to enable 
snapshots-of-mirrors work. Jon, will your log duplication work meet this 
requirement?

Mikulas

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Re: possible regression by the barrier patch in 2.6.30-rc2
  2009-10-28 15:21       ` Alasdair G Kergon
  2009-11-05  1:53         ` Mikulas Patocka
@ 2009-11-06  7:34         ` Kiyoshi Ueda
  1 sibling, 0 replies; 10+ messages in thread
From: Kiyoshi Ueda @ 2009-11-06  7:34 UTC (permalink / raw)
  To: Alasdair Kergon; +Cc: device-mapper development, Mikulas Patocka

Hi Alasdair,

On 10/29/2009 12:21 AM +0900, Alasdair G Kergon wrote:
> Well let's go back to first principles.
> 
> There are two types of suspend.
> 
> (1) I don't care about the ordering of the I/O on the disk relative to
> the suspend.  This one is easy: it's --noflush --nolockfs.
> 
> (2) I do want some control over the state of the device at the point
> of the suspend.
> 
> Break this second case down.
> If I have a filesystem, I require it to be consistent, so I require lockfs.
> 
> If the device belongs to a userspace database, I require it to be consistent,
> so I must pause the database, at which point there is no further I/O being
> issued to the device, then I suspend (and everything prior to this must be
> flushed) and resume etc.  This can be either a "suspend with flush" OR the
> flush could have been issued prior to the suspend (and any decent database
> would have done that).
> 
> If I have any other application owning the device that cares about the
> state on the device at the point of the suspend it *must* stop issuing
> I/O while the suspend ioctl is run.  If any I/O continues to arrive
> during the suspend, then that tells us we are necessarily in case (1).
> 
> So the implementation of 'suspend with flush' is simply 'issue a flush
> using the standard mechanism for doing this and then issue a suspend'.
> The 'suspend' itself does not need any special handling for 'flush'.
> In practice, anything that requires a flush will issue it prior to
> calling the suspend ioctl and not send I/O concurrently with the
> suspend.  For backwards compatibility we could still support the
> 'suspend with flush' as I described - issue a flush internally before
> entering the code that performs the suspend.

OK, I basically agree with the idea (which issues a flush before
the guts of the suspend code).
But I think issuing a flush after lock_fs() may be better like below:

	dm_suspend() {
		...
		if (!noflush && do_lockfs)
			lock_fs();

		if (!noflush)
			blkdev_issue_flush();
		...
		set_bit(DMF_BLOCK_IO_FOR_SUSPEND);
		...
		dm_wait_for_completion();
		...
	}

By doing this, we can add another feature, which guarantees
the flushed I/Os are written to the medium, for "flush" suspend,
not only for the backward compatibility of "--nolockfs && --flush".
(Maybe the "lock_fs() and blkdev_issue_flush()" can be moved out
 from the suspend code as a pair.)


By the way, I also think we need only one flag for "suspend with flush"
from application viewpoints.
Probably, we can drop "lockfs" flag because its naming is internal 
and applications shouldn't know what "lockfs" is.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2009-11-06  7:34 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-17  4:57 possible regression by the barrier patch in 2.6.30-rc2 Kiyoshi Ueda
2009-04-17 13:22 ` Alasdair G Kergon
2009-04-20  4:27   ` Kiyoshi Ueda
2009-04-20  6:20     ` Mikulas Patocka
2009-10-28 15:21       ` Alasdair G Kergon
2009-11-05  1:53         ` Mikulas Patocka
2009-11-05  2:02           ` Mikulas Patocka
2009-11-06  2:42             ` snapshot of mirror (was: possible regression by the barrier patch in 2.6.30-rc2) Mikulas Patocka
2009-11-05  2:04           ` Re: possible regression by the barrier patch in 2.6.30-rc2 Alasdair G Kergon
2009-11-06  7:34         ` Kiyoshi Ueda

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.