- * [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-26 10:49   ` Stefan Hajnoczi
  2022-01-18 16:27 ` [PATCH 02/12] block/io.c: make bdrv_do_drained_begin_quiesce static and introduce bdrv_drained_begin_no_poll Emanuele Giuseppe Esposito
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Same as BDRV_POLL_WHILE, but uses AIO_WAIT_WHILE_UNLOCKED.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 include/block/block-global-state.h | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 419fe8427f..7ad9496f56 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -158,6 +158,11 @@ void bdrv_drain_all(void);
     AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),              \
                    cond); })
 
+#define BDRV_POLL_WHILE_UNLOCKED(bs, cond) ({              \
+    BlockDriverState *bs_ = (bs);                          \
+    AIO_WAIT_WHILE_UNLOCKED(bdrv_get_aio_context(bs_),     \
+                            cond); })
+
 int bdrv_has_zero_init_1(BlockDriverState *bs);
 int bdrv_has_zero_init(BlockDriverState *bs);
 int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED
  2022-01-18 16:27 ` [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED Emanuele Giuseppe Esposito
@ 2022-01-26 10:49   ` Stefan Hajnoczi
  2022-02-03 13:57     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 10:49 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
[-- Attachment #1: Type: text/plain, Size: 1052 bytes --]
On Tue, Jan 18, 2022 at 11:27:27AM -0500, Emanuele Giuseppe Esposito wrote:
> Same as BDRV_POLL_WHILE, but uses AIO_WAIT_WHILE_UNLOCKED.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  include/block/block-global-state.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
> index 419fe8427f..7ad9496f56 100644
> --- a/include/block/block-global-state.h
> +++ b/include/block/block-global-state.h
> @@ -158,6 +158,11 @@ void bdrv_drain_all(void);
>      AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),              \
>                     cond); })
>  
> +#define BDRV_POLL_WHILE_UNLOCKED(bs, cond) ({              \
> +    BlockDriverState *bs_ = (bs);                          \
> +    AIO_WAIT_WHILE_UNLOCKED(bdrv_get_aio_context(bs_),     \
> +                            cond); })
No doc comments? When and why is this API useful? Are there any
preconditions or assumptions (e.g. cond must be thread-safe)?
Stefan
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED
  2022-01-26 10:49   ` Stefan Hajnoczi
@ 2022-02-03 13:57     ` Emanuele Giuseppe Esposito
  2022-02-04 12:13       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-03 13:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
On 26/01/2022 11:49, Stefan Hajnoczi wrote:
> On Tue, Jan 18, 2022 at 11:27:27AM -0500, Emanuele Giuseppe Esposito wrote:
>> Same as BDRV_POLL_WHILE, but uses AIO_WAIT_WHILE_UNLOCKED.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>  include/block/block-global-state.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
>> index 419fe8427f..7ad9496f56 100644
>> --- a/include/block/block-global-state.h
>> +++ b/include/block/block-global-state.h
>> @@ -158,6 +158,11 @@ void bdrv_drain_all(void);
>>      AIO_WAIT_WHILE(bdrv_get_aio_context(bs_),              \
>>                     cond); })
>>  
>> +#define BDRV_POLL_WHILE_UNLOCKED(bs, cond) ({              \
>> +    BlockDriverState *bs_ = (bs);                          \
>> +    AIO_WAIT_WHILE_UNLOCKED(bdrv_get_aio_context(bs_),     \
>> +                            cond); })
> 
> No doc comments? When and why is this API useful? Are there any
> preconditions or assumptions (e.g. cond must be thread-safe)?
> 
I am planning to add the following comment above the macro:
/*
 * Unlocked counterpart of BDRV_POLL_WHILE. Uses AIO_WAIT_WHILE_UNLOCKED,
 * so it does not release+acquire the AioContext lock if we are in the
 * main loop, therefore the caller does not need to take it.
 * This function avoids taking the AioContext lock unnecessarly, so use
 * it only when sure that the lock is not taken already, otherwise it
 * might cause deadlocks.
 *
 * @cond must be thread-safe.
 */
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED
  2022-02-03 13:57     ` Emanuele Giuseppe Esposito
@ 2022-02-04 12:13       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2022-02-04 12:13 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, John Snow
On 2/3/22 14:57, Emanuele Giuseppe Esposito wrote:
>   * This function avoids taking the AioContext lock unnecessarly, so use
>   * it only when sure that the lock is not taken already, otherwise it
>   * might cause deadlocks.
"so use" should be "but use". :)
Paolo
>   *
>   * @cond must be thread-safe.
>   */
^ permalink raw reply	[flat|nested] 38+ messages in thread 
 
 
 
- * [PATCH 02/12] block/io.c: make bdrv_do_drained_begin_quiesce static and introduce bdrv_drained_begin_no_poll
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:11   ` Paolo Bonzini
  2022-01-18 16:27 ` [PATCH 03/12] block.c: bdrv_replace_child_noperm: first remove the child, and then call ->detach() Emanuele Giuseppe Esposito
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Using bdrv_do_drained_begin_quiesce() in bdrv_child_cb_drained_end()
is not a good idea: the callback might be called when running
a drain in a coroutine, and bdrv_drained_begin_poll() does not
handle that case, resulting in assertion failure.
Instead, bdrv_do_drained_begin with no recursion and poll
will accomplish the same thing (invoking bdrv_do_drained_begin_quiesce)
but will firstly check if we are already in a coroutine, and exit
from that via bdrv_co_yield_to_drain().
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c                  | 2 +-
 block/io.c               | 7 ++++++-
 include/block/block-io.h | 8 +++++---
 3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/block.c b/block.c
index 668986c879..08fde585f4 100644
--- a/block.c
+++ b/block.c
@@ -1206,7 +1206,7 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c)
 static void bdrv_child_cb_drained_begin(BdrvChild *child)
 {
     BlockDriverState *bs = child->opaque;
-    bdrv_do_drained_begin_quiesce(bs, NULL, false);
+    bdrv_drained_begin_no_poll(bs);
 }
 
 static bool bdrv_child_cb_drained_poll(BdrvChild *child)
diff --git a/block/io.c b/block/io.c
index 3be08cad29..5123b7b713 100644
--- a/block/io.c
+++ b/block/io.c
@@ -425,7 +425,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
     }
 }
 
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
+static void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
                                    BdrvChild *parent, bool ignore_bds_parents)
 {
     assert(!qemu_in_coroutine());
@@ -487,6 +487,11 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs)
     bdrv_do_drained_begin(bs, true, NULL, false, true);
 }
 
+void bdrv_drained_begin_no_poll(BlockDriverState *bs)
+{
+    bdrv_do_drained_begin(bs, false, NULL, false, false);
+}
+
 /**
  * This function does not poll, nor must any of its recursively called
  * functions.  The *drained_end_counter pointee will be incremented
diff --git a/include/block/block-io.h b/include/block/block-io.h
index a69bc5bd36..34a991649c 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -238,13 +238,15 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
 void bdrv_drained_begin(BlockDriverState *bs);
 
 /**
- * bdrv_do_drained_begin_quiesce:
+ * bdrv_drained_begin_no_poll:
  *
  * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
  * running requests to complete.
+ * Same as bdrv_drained_begin(), but do not poll for the subgraph to
+ * actually become unquiesced. Therefore, no graph changes will occur
+ * with this function.
  */
-void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
-                                   BdrvChild *parent, bool ignore_bds_parents);
+void bdrv_drained_begin_no_poll(BlockDriverState *bs);
 
 /**
  * Like bdrv_drained_begin, but recursively begins a quiesced section for
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 02/12] block/io.c: make bdrv_do_drained_begin_quiesce static and introduce bdrv_drained_begin_no_poll
  2022-01-18 16:27 ` [PATCH 02/12] block/io.c: make bdrv_do_drained_begin_quiesce static and introduce bdrv_drained_begin_no_poll Emanuele Giuseppe Esposito
@ 2022-01-19  9:11   ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:11 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> Using bdrv_do_drained_begin_quiesce() in bdrv_child_cb_drained_end()
bdrv_child_cb_drained_begin()
> is not a good idea: the callback might be called when running
> a drain in a coroutine, and bdrv_drained_begin_poll() does not
> handle that case, resulting in assertion failure.
> 
> Instead, bdrv_do_drained_begin with no recursion and poll
> will accomplish the same thing (invoking bdrv_do_drained_begin_quiesce)
> but will firstly check if we are already in a coroutine, and exit
> from that via bdrv_co_yield_to_drain().
A better subject would be "fix bdrv_child_cb_drained_begin invocations 
from a coroutine".
Paolo
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c                  | 2 +-
>   block/io.c               | 7 ++++++-
>   include/block/block-io.h | 8 +++++---
>   3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 668986c879..08fde585f4 100644
> --- a/block.c
> +++ b/block.c
> @@ -1206,7 +1206,7 @@ static char *bdrv_child_get_parent_desc(BdrvChild *c)
>   static void bdrv_child_cb_drained_begin(BdrvChild *child)
>   {
>       BlockDriverState *bs = child->opaque;
> -    bdrv_do_drained_begin_quiesce(bs, NULL, false);
> +    bdrv_drained_begin_no_poll(bs);
>   }
>   
>   static bool bdrv_child_cb_drained_poll(BdrvChild *child)
> diff --git a/block/io.c b/block/io.c
> index 3be08cad29..5123b7b713 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -425,7 +425,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
>       }
>   }
>   
> -void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
> +static void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
>                                      BdrvChild *parent, bool ignore_bds_parents)
>   {
>       assert(!qemu_in_coroutine());
> @@ -487,6 +487,11 @@ void bdrv_subtree_drained_begin(BlockDriverState *bs)
>       bdrv_do_drained_begin(bs, true, NULL, false, true);
>   }
>   
> +void bdrv_drained_begin_no_poll(BlockDriverState *bs)
> +{
> +    bdrv_do_drained_begin(bs, false, NULL, false, false);
> +}
> +
>   /**
>    * This function does not poll, nor must any of its recursively called
>    * functions.  The *drained_end_counter pointee will be incremented
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index a69bc5bd36..34a991649c 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -238,13 +238,15 @@ bool bdrv_drain_poll(BlockDriverState *bs, bool recursive,
>   void bdrv_drained_begin(BlockDriverState *bs);
>   
>   /**
> - * bdrv_do_drained_begin_quiesce:
> + * bdrv_drained_begin_no_poll:
>    *
>    * Quiesces a BDS like bdrv_drained_begin(), but does not wait for already
>    * running requests to complete.
> + * Same as bdrv_drained_begin(), but do not poll for the subgraph to
> + * actually become unquiesced. Therefore, no graph changes will occur
> + * with this function.
>    */
> -void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
> -                                   BdrvChild *parent, bool ignore_bds_parents);
> +void bdrv_drained_begin_no_poll(BlockDriverState *bs);
>   
>   /**
>    * Like bdrv_drained_begin, but recursively begins a quiesced section for
^ permalink raw reply	[flat|nested] 38+ messages in thread
 
- * [PATCH 03/12] block.c: bdrv_replace_child_noperm: first remove the child, and then call ->detach()
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 01/12] introduce BDRV_POLL_WHILE_UNLOCKED Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 02/12] block/io.c: make bdrv_do_drained_begin_quiesce static and introduce bdrv_drained_begin_no_poll Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 04/12] block.c: bdrv_replace_child_noperm: first call ->attach(), and then add child Emanuele Giuseppe Esposito
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Doing the opposite can make ->detach() (more precisely
bdrv_unapply_subtree_drain() in bdrv_child_cb_detach) undo the subtree_drain
just performed to protect the removal of the child from the graph,
thus making the fully-enabled assert_bdrv_graph_writable fail.
Note that assert_bdrv_graph_writable is not yet fully enabled.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/block.c b/block.c
index 08fde585f4..29de2b62b5 100644
--- a/block.c
+++ b/block.c
@@ -2861,14 +2861,16 @@ static void bdrv_replace_child_noperm(BdrvChild **childp,
     }
 
     if (old_bs) {
-        /* Detach first so that the recursive drain sections coming from @child
+        assert_bdrv_graph_writable(old_bs);
+        QLIST_REMOVE(child, next_parent);
+        /*
+         * Detach first so that the recursive drain sections coming from @child
          * are already gone and we only end the drain sections that came from
-         * elsewhere. */
+         * elsewhere.
+         */
         if (child->klass->detach) {
             child->klass->detach(child);
         }
-        assert_bdrv_graph_writable(old_bs);
-        QLIST_REMOVE(child, next_parent);
     }
 
     child->bs = new_bs;
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * [PATCH 04/12] block.c: bdrv_replace_child_noperm: first call ->attach(), and then add child
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (2 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 03/12] block.c: bdrv_replace_child_noperm: first remove the child, and then call ->detach() Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains Emanuele Giuseppe Esposito
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Doing the opposite can make adding the child node to a non-drained node,
as apply_subtree_drain is only done in ->attach() and thus make
assert_bdrv_graph_writable fail.
This can happen for example during a transaction rollback (test 245,
test_io_with_graph_changes):
1. a node is removed from the graph, thus it is undrained
2. then something happens, and we need to roll back the transactions
   through tran_abort()
3. at this point, the current code would first attach the undrained node
   to the graph via QLIST_INSERT_HEAD, and then call ->attach() that
   will take care of restoring the drain with apply_subtree_drain(),
   leaving the node undrained between the two operations.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/block.c b/block.c
index 29de2b62b5..fb5bc3077a 100644
--- a/block.c
+++ b/block.c
@@ -2879,8 +2879,6 @@ static void bdrv_replace_child_noperm(BdrvChild **childp,
     }
 
     if (new_bs) {
-        assert_bdrv_graph_writable(new_bs);
-        QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
 
         /*
          * Detaching the old node may have led to the new node's
@@ -2897,6 +2895,10 @@ static void bdrv_replace_child_noperm(BdrvChild **childp,
         if (child->klass->attach) {
             child->klass->attach(child);
         }
+
+        assert_bdrv_graph_writable(new_bs);
+        QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
+
     }
 
     /*
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (3 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 04/12] block.c: bdrv_replace_child_noperm: first call ->attach(), and then add child Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:18   ` Paolo Bonzini
  2022-01-18 16:27 ` [PATCH 06/12] test-bdrv-drain.c: remove test_detach_by_parent_cb() Emanuele Giuseppe Esposito
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
There will be 2 problems in this test when we will add
subtree drains in bdrv_replace_child_noperm:
- First of all, inconsistency between block_job_create under
aiocontext lock that internally calls blk_insert_bs and therefore
bdrv_replace_child_noperm, and blk_insert_bs that is called two lines
above in the same test without aiocontext. There seems to be no
reason on why we need the lock there, so move the aiocontext lock further
down.
- test_detach_indirect: here it is simply a matter of wrong callbacks
used. In the original test, there was only a subtree drain, so
overriding .drained_begin was not a problem. Now that we want to have
additional subtree drains, we risk to call the test callback
to early, or multiple times. We do not want that, so override
the callback only when we actually want to use it.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 tests/unit/test-bdrv-drain.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index 5a35dc391d..f6af206748 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -912,12 +912,12 @@ static void test_blockjob_common_drain_node(enum drain_type drain_type,
     blk_insert_bs(blk_target, target, &error_abort);
     blk_set_allow_aio_context_change(blk_target, true);
 
-    aio_context_acquire(ctx);
     tjob = block_job_create("job0", &test_job_driver, NULL, src,
                             0, BLK_PERM_ALL,
                             0, 0, NULL, NULL, &error_abort);
     tjob->bs = src;
     job = &tjob->common;
+    aio_context_acquire(ctx);
     block_job_add_bdrv(job, "target", target, 0, BLK_PERM_ALL, &error_abort);
 
     switch (result) {
@@ -1322,15 +1322,18 @@ static void detach_by_parent_aio_cb(void *opaque, int ret)
     }
 }
 
+static BdrvChildClass detach_by_driver_cb_class;
+
 static void detach_by_driver_cb_drained_begin(BdrvChild *child)
 {
+    /* restore .drained_begin cb, we don't need it anymore. */
+    detach_by_driver_cb_class.drained_begin = child_of_bds.drained_begin;
+
     aio_bh_schedule_oneshot(qemu_get_current_aio_context(),
                             detach_indirect_bh, &detach_by_parent_data);
     child_of_bds.drained_begin(child);
 }
 
-static BdrvChildClass detach_by_driver_cb_class;
-
 /*
  * Initial graph:
  *
@@ -1362,8 +1365,6 @@ static void test_detach_indirect(bool by_parent_cb)
 
     if (!by_parent_cb) {
         detach_by_driver_cb_class = child_of_bds;
-        detach_by_driver_cb_class.drained_begin =
-            detach_by_driver_cb_drained_begin;
     }
 
     /* Create all involved nodes */
@@ -1421,6 +1422,12 @@ static void test_detach_indirect(bool by_parent_cb)
     acb = blk_aio_preadv(blk, 0, &qiov, 0, detach_by_parent_aio_cb, NULL);
     g_assert(acb != NULL);
 
+    if (!by_parent_cb) {
+        /* set .drained_begin cb to run only in the following drain. */
+        detach_by_driver_cb_class.drained_begin =
+            detach_by_driver_cb_drained_begin;
+    }
+
     /* Drain and check the expected result */
     bdrv_subtree_drained_begin(parent_b);
 
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains
  2022-01-18 16:27 ` [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains Emanuele Giuseppe Esposito
@ 2022-01-19  9:18   ` Paolo Bonzini
  2022-02-03 11:41     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:18 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> - First of all, inconsistency between block_job_create under
> aiocontext lock that internally calls blk_insert_bs and therefore
> bdrv_replace_child_noperm, and blk_insert_bs that is called two lines
> above in the same test without aiocontext. There seems to be no
> reason on why we need the lock there, so move the aiocontext lock further
> down.
> 
> - test_detach_indirect: here it is simply a matter of wrong callbacks
> used. In the original test, there was only a subtree drain, so
> overriding .drained_begin was not a problem. Now that we want to have
> additional subtree drains, we risk to call the test callback
> to early, or multiple times. We do not want that, so override
> the callback only when we actually want to use it.
The language here is a bit overcomplicated.  Don't think that you're 
writing Italian, instead use simple sentences.
First, the test is inconsistent about taking the AioContext lock when 
calling bdrv_replace_child_noperm.  bdrv_replace_child_noperm is reached 
in two places: from blk_insert_bs directly, and via block_job_create. 
Only the second does it with the AioContext lock taken, and there seems 
to be no reason why the lock is needed.  Move aio_context_acquire 
further down.  [Any reason why you don't move it even further down, to 
cover only job_start?]
Second, test_detach_indirect is only interested in observing the first 
call to .drained_begin.  In the original test, there was only a single 
subtree drain; however, with additional drains introduced in 
bdrv_replace_child_noperm, the test callback would be called too early 
and/or multiple times.  Override the callback only when we actually want 
to use it, and put back the original after it's been invoked.
This could also be split in two commits.
Paolo
^ permalink raw reply	[flat|nested] 38+ messages in thread 
- * Re: [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains
  2022-01-19  9:18   ` Paolo Bonzini
@ 2022-02-03 11:41     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-03 11:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 19/01/2022 10:18, Paolo Bonzini wrote:
> On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
>> - First of all, inconsistency between block_job_create under
>> aiocontext lock that internally calls blk_insert_bs and therefore
>> bdrv_replace_child_noperm, and blk_insert_bs that is called two lines
>> above in the same test without aiocontext. There seems to be no
>> reason on why we need the lock there, so move the aiocontext lock further
>> down.
>>
>> - test_detach_indirect: here it is simply a matter of wrong callbacks
>> used. In the original test, there was only a subtree drain, so
>> overriding .drained_begin was not a problem. Now that we want to have
>> additional subtree drains, we risk to call the test callback
>> to early, or multiple times. We do not want that, so override
>> the callback only when we actually want to use it.
> 
> The language here is a bit overcomplicated.  Don't think that you're
> writing Italian, instead use simple sentences.
> 
> First, the test is inconsistent about taking the AioContext lock when
> calling bdrv_replace_child_noperm.  bdrv_replace_child_noperm is reached
> in two places: from blk_insert_bs directly, and via block_job_create.
> Only the second does it with the AioContext lock taken, and there seems
> to be no reason why the lock is needed.  Move aio_context_acquire
> further down.  [Any reason why you don't move it even further down, to
> cover only job_start?]
The lock is left just to cover block_job_add_bdrv, since it internally
releases and then acquires the lock.
Fixing that is a future TODO.
job_start did not and does not need the AioContext lock :)
> 
> Second, test_detach_indirect is only interested in observing the first
> call to .drained_begin.  In the original test, there was only a single
> subtree drain; however, with additional drains introduced in
> bdrv_replace_child_noperm, the test callback would be called too early
> and/or multiple times.  Override the callback only when we actually want
> to use it, and put back the original after it's been invoked.
> 
> This could also be split in two commits.
> 
Will update the commit message, thank you!
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread 
 
 
- * [PATCH 06/12] test-bdrv-drain.c: remove test_detach_by_parent_cb()
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (4 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 05/12] test-bdrv-drain.c: adapt test to the coming subtree drains Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked Emanuele Giuseppe Esposito
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
This test uses a callback of an I/O function (blk_aio_preadv)
to modify the graph, using bdrv_attach_child.
This is simply not allowed anymore. I/O cannot change the graph.
Before "block/io.c: make bdrv_do_drained_begin_quiesce static
and introduce bdrv_drained_begin_no_poll", the test would simply
be at risk of failure, because if bdrv_replace_child_noperm()
(called to modify the graph) would call a drain,
then one callback of .drained_begin() is bdrv_do_drained_begin_quiesce,
that specifically asserts that we are not in a coroutine.
Now that we fixed the behavior, the drain will invoke a bh in the
main loop, so we don't have such problem. However, this test is still
illegal and fails because we forbid graph changes from I/O paths.
Once we add the required subtree_drains to protect
bdrv_replace_child_noperm(), the key problem in this test is in:
acb = blk_aio_preadv(blk, 0, &qiov, 0, detach_by_parent_aio_cb, NULL);
/* Drain and check the expected result */
bdrv_subtree_drained_begin(parent_b);
because the detach_by_parent_aio_cb calls detach_indirect_bh(), that
modifies the graph and is invoked during bdrv_subtree_drained_begin().
The call stack is the following:
1. blk_aio_preadv() creates a coroutine, increments in_flight counter
and enters the coroutine running blk_aio_read_entry()
2. blk_aio_read_entry() performs the read and then schedules a bh to
   complete (blk_aio_complete)
3. at this point, subtree_drained_begin() kicks in and waits for all
   in_flight requests, polling
4. polling allows the bh to be scheduled, so blk_aio_complete runs
5. blk_aio_complete *first* invokes the callback
   (detach_by_parent_aio_cb) and then decrements the in_flight counter
6. Here we have the problem: detach_by_parent_aio_cb modifies the graph,
   so both bdrv_unref_child() and bdrv_attach_child() will have
   subtree_drains inside. And this causes a deadlock, because the
   nested drain will wait for in_flight counter to go to zero, which
   is only happening once the drain itself finishes.
Different story is test_detach_by_driver_cb(): in this case,
detach_by_parent_aio_cb() does not call detach_indirect_bh(),
but it is instead called as a bh running in the main loop by
detach_by_driver_cb_drained_begin(), the callback for
.drained_begin().
This test was added in 231281ab42 and part of the series
"Drain fixes and cleanups, part 3"
https://lists.nongnu.org/archive/html/qemu-block/2018-05/msg01132.html
but as explained above I believe that it is not valid anymore, and
can be discarded.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 tests/unit/test-bdrv-drain.c | 46 +++++++++---------------------------
 1 file changed, 11 insertions(+), 35 deletions(-)
diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index f6af206748..af3501c46d 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -1296,7 +1296,6 @@ struct detach_by_parent_data {
     BdrvChild *child_b;
     BlockDriverState *c;
     BdrvChild *child_c;
-    bool by_parent_cb;
 };
 static struct detach_by_parent_data detach_by_parent_data;
 
@@ -1314,12 +1313,7 @@ static void detach_indirect_bh(void *opaque)
 
 static void detach_by_parent_aio_cb(void *opaque, int ret)
 {
-    struct detach_by_parent_data *data = &detach_by_parent_data;
-
     g_assert_cmpint(ret, ==, 0);
-    if (data->by_parent_cb) {
-        detach_indirect_bh(data);
-    }
 }
 
 static BdrvChildClass detach_by_driver_cb_class;
@@ -1341,31 +1335,24 @@ static void detach_by_driver_cb_drained_begin(BdrvChild *child)
  *    \ /   \
  *     A     B     C
  *
- * by_parent_cb == true:  Test that parent callbacks don't poll
- *
- *     PA has a pending write request whose callback changes the child nodes of
- *     PB: It removes B and adds C instead. The subtree of PB is drained, which
- *     will indirectly drain the write request, too.
- *
- * by_parent_cb == false: Test that bdrv_drain_invoke() doesn't poll
+ * Test that bdrv_drain_invoke() doesn't poll
  *
  *     PA's BdrvChildClass has a .drained_begin callback that schedules a BH
  *     that does the same graph change. If bdrv_drain_invoke() calls it, the
  *     state is messed up, but if it is only polled in the single
  *     BDRV_POLL_WHILE() at the end of the drain, this should work fine.
  */
-static void test_detach_indirect(bool by_parent_cb)
+static void test_detach_indirect(void)
 {
     BlockBackend *blk;
     BlockDriverState *parent_a, *parent_b, *a, *b, *c;
     BdrvChild *child_a, *child_b;
     BlockAIOCB *acb;
+    BDRVTestState *s;
 
     QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, 0);
 
-    if (!by_parent_cb) {
-        detach_by_driver_cb_class = child_of_bds;
-    }
+    detach_by_driver_cb_class = child_of_bds;
 
     /* Create all involved nodes */
     parent_a = bdrv_new_open_driver(&bdrv_test, "parent-a", BDRV_O_RDWR,
@@ -1384,10 +1371,8 @@ static void test_detach_indirect(bool by_parent_cb)
 
     /* If we want to get bdrv_drain_invoke() to call aio_poll(), the driver
      * callback must not return immediately. */
-    if (!by_parent_cb) {
-        BDRVTestState *s = parent_a->opaque;
-        s->sleep_in_drain_begin = true;
-    }
+    s = parent_a->opaque;
+    s->sleep_in_drain_begin = true;
 
     /* Set child relationships */
     bdrv_ref(b);
@@ -1399,7 +1384,7 @@ static void test_detach_indirect(bool by_parent_cb)
 
     bdrv_ref(a);
     bdrv_attach_child(parent_a, a, "PA-A",
-                      by_parent_cb ? &child_of_bds : &detach_by_driver_cb_class,
+                      &detach_by_driver_cb_class,
                       BDRV_CHILD_DATA, &error_abort);
 
     g_assert_cmpint(parent_a->refcnt, ==, 1);
@@ -1417,16 +1402,13 @@ static void test_detach_indirect(bool by_parent_cb)
         .parent_b = parent_b,
         .child_b = child_b,
         .c = c,
-        .by_parent_cb = by_parent_cb,
     };
     acb = blk_aio_preadv(blk, 0, &qiov, 0, detach_by_parent_aio_cb, NULL);
     g_assert(acb != NULL);
 
-    if (!by_parent_cb) {
-        /* set .drained_begin cb to run only in the following drain. */
-        detach_by_driver_cb_class.drained_begin =
-            detach_by_driver_cb_drained_begin;
-    }
+    /* set .drained_begin cb to run only in the following drain. */
+    detach_by_driver_cb_class.drained_begin =
+        detach_by_driver_cb_drained_begin;
 
     /* Drain and check the expected result */
     bdrv_subtree_drained_begin(parent_b);
@@ -1462,14 +1444,9 @@ static void test_detach_indirect(bool by_parent_cb)
     bdrv_unref(c);
 }
 
-static void test_detach_by_parent_cb(void)
-{
-    test_detach_indirect(true);
-}
-
 static void test_detach_by_driver_cb(void)
 {
-    test_detach_indirect(false);
+    test_detach_indirect();
 }
 
 static void test_append_to_drained(void)
@@ -2224,7 +2201,6 @@ int main(int argc, char **argv)
     g_test_add_func("/bdrv-drain/detach/drain_all", test_detach_by_drain_all);
     g_test_add_func("/bdrv-drain/detach/drain", test_detach_by_drain);
     g_test_add_func("/bdrv-drain/detach/drain_subtree", test_detach_by_drain_subtree);
-    g_test_add_func("/bdrv-drain/detach/parent_cb", test_detach_by_parent_cb);
     g_test_add_func("/bdrv-drain/detach/driver_cb", test_detach_by_driver_cb);
 
     g_test_add_func("/bdrv-drain/attach/drain", test_append_to_drained);
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (5 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 06/12] test-bdrv-drain.c: remove test_detach_by_parent_cb() Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:52   ` Paolo Bonzini
  2022-01-26 11:04   ` Stefan Hajnoczi
  2022-01-18 16:27 ` [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing Emanuele Giuseppe Esposito
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Same as the locked version, but use BDRV_POLL_UNLOCKED
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block/io.c               | 50 +++++++++++++++++++++++++++++-----------
 include/block/block-io.h |  2 ++
 2 files changed, 39 insertions(+), 13 deletions(-)
diff --git a/block/io.c b/block/io.c
index 5123b7b713..9d5167f64a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -244,6 +244,7 @@ typedef struct {
     bool begin;
     bool recursive;
     bool poll;
+    bool unlock;
     BdrvChild *parent;
     bool ignore_bds_parents;
     int *drained_end_counter;
@@ -334,7 +335,7 @@ static bool bdrv_drain_poll_top_level(BlockDriverState *bs, bool recursive,
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
                                   BdrvChild *parent, bool ignore_bds_parents,
-                                  bool poll);
+                                  bool poll, bool unlock);
 static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
                                 BdrvChild *parent, bool ignore_bds_parents,
                                 int *drained_end_counter);
@@ -352,7 +353,8 @@ static void bdrv_co_drain_bh_cb(void *opaque)
         if (data->begin) {
             assert(!data->drained_end_counter);
             bdrv_do_drained_begin(bs, data->recursive, data->parent,
-                                  data->ignore_bds_parents, data->poll);
+                                  data->ignore_bds_parents, data->poll,
+                                  data->unlock);
         } else {
             assert(!data->poll);
             bdrv_do_drained_end(bs, data->recursive, data->parent,
@@ -374,6 +376,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
                                                 BdrvChild *parent,
                                                 bool ignore_bds_parents,
                                                 bool poll,
+                                                bool unlock,
                                                 int *drained_end_counter)
 {
     BdrvCoDrainData data;
@@ -394,6 +397,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
         .parent = parent,
         .ignore_bds_parents = ignore_bds_parents,
         .poll = poll,
+        .unlock = unlock,
         .drained_end_counter = drained_end_counter,
     };
 
@@ -441,13 +445,13 @@ static void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
 
 static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
                                   BdrvChild *parent, bool ignore_bds_parents,
-                                  bool poll)
+                                  bool poll, bool unlock)
 {
     BdrvChild *child, *next;
 
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(bs, true, recursive, parent, ignore_bds_parents,
-                               poll, NULL);
+                               poll, unlock, NULL);
         return;
     }
 
@@ -458,7 +462,7 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
         bs->recursive_quiesce_counter++;
         QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
             bdrv_do_drained_begin(child->bs, true, child, ignore_bds_parents,
-                                  false);
+                                  false, false);
         }
     }
 
@@ -473,23 +477,35 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
      */
     if (poll) {
         assert(!ignore_bds_parents);
-        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
+        if (unlock) {
+            BDRV_POLL_WHILE_UNLOCKED(bs,
+                                     bdrv_drain_poll_top_level(bs, recursive,
+                                                               parent));
+        } else {
+            BDRV_POLL_WHILE(bs,
+                            bdrv_drain_poll_top_level(bs, recursive, parent));
+        }
     }
 }
 
 void bdrv_drained_begin(BlockDriverState *bs)
 {
-    bdrv_do_drained_begin(bs, false, NULL, false, true);
+    bdrv_do_drained_begin(bs, false, NULL, false, true, false);
 }
 
 void bdrv_subtree_drained_begin(BlockDriverState *bs)
 {
-    bdrv_do_drained_begin(bs, true, NULL, false, true);
+    bdrv_do_drained_begin(bs, true, NULL, false, true, false);
+}
+
+void bdrv_subtree_drained_begin_unlocked(BlockDriverState *bs)
+{
+    bdrv_do_drained_begin(bs, true, NULL, false, true, true);
 }
 
 void bdrv_drained_begin_no_poll(BlockDriverState *bs)
 {
-    bdrv_do_drained_begin(bs, false, NULL, false, false);
+    bdrv_do_drained_begin(bs, false, NULL, false, false, false);
 }
 
 /**
@@ -517,7 +533,7 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
 
     if (qemu_in_coroutine()) {
         bdrv_co_yield_to_drain(bs, false, recursive, parent, ignore_bds_parents,
-                               false, drained_end_counter);
+                               false, false, drained_end_counter);
         return;
     }
     assert(bs->quiesce_counter > 0);
@@ -561,12 +577,19 @@ void bdrv_subtree_drained_end(BlockDriverState *bs)
     BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
 }
 
+void bdrv_subtree_drained_end_unlocked(BlockDriverState *bs)
+{
+    int drained_end_counter = 0;
+    bdrv_do_drained_end(bs, true, NULL, false, &drained_end_counter);
+    BDRV_POLL_WHILE_UNLOCKED(bs, qatomic_read(&drained_end_counter) > 0);
+}
+
 void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
 {
     int i;
 
     for (i = 0; i < new_parent->recursive_quiesce_counter; i++) {
-        bdrv_do_drained_begin(child->bs, true, child, false, true);
+        bdrv_do_drained_begin(child->bs, true, child, false, true, false);
     }
 }
 
@@ -651,7 +674,8 @@ void bdrv_drain_all_begin(void)
     assert(qemu_in_main_thread());
 
     if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, NULL);
+        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, false,
+                               NULL);
         return;
     }
 
@@ -676,7 +700,7 @@ void bdrv_drain_all_begin(void)
         AioContext *aio_context = bdrv_get_aio_context(bs);
 
         aio_context_acquire(aio_context);
-        bdrv_do_drained_begin(bs, false, NULL, true, false);
+        bdrv_do_drained_begin(bs, false, NULL, true, false, false);
         aio_context_release(aio_context);
     }
 
diff --git a/include/block/block-io.h b/include/block/block-io.h
index 34a991649c..a329ae24c1 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -253,6 +253,7 @@ void bdrv_drained_begin_no_poll(BlockDriverState *bs);
  * exclusive access to all child nodes as well.
  */
 void bdrv_subtree_drained_begin(BlockDriverState *bs);
+void bdrv_subtree_drained_begin_unlocked(BlockDriverState *bs);
 
 /**
  * bdrv_drained_end:
@@ -285,6 +286,7 @@ void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
  * End a quiescent section started by bdrv_subtree_drained_begin().
  */
 void bdrv_subtree_drained_end(BlockDriverState *bs);
+void bdrv_subtree_drained_end_unlocked(BlockDriverState *bs);
 
 bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
                                      uint32_t granularity, Error **errp);
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
  2022-01-18 16:27 ` [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked Emanuele Giuseppe Esposito
@ 2022-01-19  9:52   ` Paolo Bonzini
  2022-01-26 11:04   ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:52 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> Same as the locked version, but use BDRV_POLL_UNLOCKED
... We are going to add drains to all graph modifications, and they are 
generally performed without the AioContext lock taken.
Paolo
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block/io.c               | 50 +++++++++++++++++++++++++++++-----------
>   include/block/block-io.h |  2 ++
>   2 files changed, 39 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 5123b7b713..9d5167f64a 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -244,6 +244,7 @@ typedef struct {
>       bool begin;
>       bool recursive;
>       bool poll;
> +    bool unlock;
>       BdrvChild *parent;
>       bool ignore_bds_parents;
>       int *drained_end_counter;
> @@ -334,7 +335,7 @@ static bool bdrv_drain_poll_top_level(BlockDriverState *bs, bool recursive,
>   
>   static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
>                                     BdrvChild *parent, bool ignore_bds_parents,
> -                                  bool poll);
> +                                  bool poll, bool unlock);
>   static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
>                                   BdrvChild *parent, bool ignore_bds_parents,
>                                   int *drained_end_counter);
> @@ -352,7 +353,8 @@ static void bdrv_co_drain_bh_cb(void *opaque)
>           if (data->begin) {
>               assert(!data->drained_end_counter);
>               bdrv_do_drained_begin(bs, data->recursive, data->parent,
> -                                  data->ignore_bds_parents, data->poll);
> +                                  data->ignore_bds_parents, data->poll,
> +                                  data->unlock);
>           } else {
>               assert(!data->poll);
>               bdrv_do_drained_end(bs, data->recursive, data->parent,
> @@ -374,6 +376,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
>                                                   BdrvChild *parent,
>                                                   bool ignore_bds_parents,
>                                                   bool poll,
> +                                                bool unlock,
>                                                   int *drained_end_counter)
>   {
>       BdrvCoDrainData data;
> @@ -394,6 +397,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
>           .parent = parent,
>           .ignore_bds_parents = ignore_bds_parents,
>           .poll = poll,
> +        .unlock = unlock,
>           .drained_end_counter = drained_end_counter,
>       };
>   
> @@ -441,13 +445,13 @@ static void bdrv_do_drained_begin_quiesce(BlockDriverState *bs,
>   
>   static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
>                                     BdrvChild *parent, bool ignore_bds_parents,
> -                                  bool poll)
> +                                  bool poll, bool unlock)
>   {
>       BdrvChild *child, *next;
>   
>       if (qemu_in_coroutine()) {
>           bdrv_co_yield_to_drain(bs, true, recursive, parent, ignore_bds_parents,
> -                               poll, NULL);
> +                               poll, unlock, NULL);
>           return;
>       }
>   
> @@ -458,7 +462,7 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
>           bs->recursive_quiesce_counter++;
>           QLIST_FOREACH_SAFE(child, &bs->children, next, next) {
>               bdrv_do_drained_begin(child->bs, true, child, ignore_bds_parents,
> -                                  false);
> +                                  false, false);
>           }
>       }
>   
> @@ -473,23 +477,35 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
>        */
>       if (poll) {
>           assert(!ignore_bds_parents);
> -        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
> +        if (unlock) {
> +            BDRV_POLL_WHILE_UNLOCKED(bs,
> +                                     bdrv_drain_poll_top_level(bs, recursive,
> +                                                               parent));
> +        } else {
> +            BDRV_POLL_WHILE(bs,
> +                            bdrv_drain_poll_top_level(bs, recursive, parent));
> +        }
>       }
>   }
>   
>   void bdrv_drained_begin(BlockDriverState *bs)
>   {
> -    bdrv_do_drained_begin(bs, false, NULL, false, true);
> +    bdrv_do_drained_begin(bs, false, NULL, false, true, false);
>   }
>   
>   void bdrv_subtree_drained_begin(BlockDriverState *bs)
>   {
> -    bdrv_do_drained_begin(bs, true, NULL, false, true);
> +    bdrv_do_drained_begin(bs, true, NULL, false, true, false);
> +}
> +
> +void bdrv_subtree_drained_begin_unlocked(BlockDriverState *bs)
> +{
> +    bdrv_do_drained_begin(bs, true, NULL, false, true, true);
>   }
>   
>   void bdrv_drained_begin_no_poll(BlockDriverState *bs)
>   {
> -    bdrv_do_drained_begin(bs, false, NULL, false, false);
> +    bdrv_do_drained_begin(bs, false, NULL, false, false, false);
>   }
>   
>   /**
> @@ -517,7 +533,7 @@ static void bdrv_do_drained_end(BlockDriverState *bs, bool recursive,
>   
>       if (qemu_in_coroutine()) {
>           bdrv_co_yield_to_drain(bs, false, recursive, parent, ignore_bds_parents,
> -                               false, drained_end_counter);
> +                               false, false, drained_end_counter);
>           return;
>       }
>       assert(bs->quiesce_counter > 0);
> @@ -561,12 +577,19 @@ void bdrv_subtree_drained_end(BlockDriverState *bs)
>       BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
>   }
>   
> +void bdrv_subtree_drained_end_unlocked(BlockDriverState *bs)
> +{
> +    int drained_end_counter = 0;
> +    bdrv_do_drained_end(bs, true, NULL, false, &drained_end_counter);
> +    BDRV_POLL_WHILE_UNLOCKED(bs, qatomic_read(&drained_end_counter) > 0);
> +}
> +
>   void bdrv_apply_subtree_drain(BdrvChild *child, BlockDriverState *new_parent)
>   {
>       int i;
>   
>       for (i = 0; i < new_parent->recursive_quiesce_counter; i++) {
> -        bdrv_do_drained_begin(child->bs, true, child, false, true);
> +        bdrv_do_drained_begin(child->bs, true, child, false, true, false);
>       }
>   }
>   
> @@ -651,7 +674,8 @@ void bdrv_drain_all_begin(void)
>       assert(qemu_in_main_thread());
>   
>       if (qemu_in_coroutine()) {
> -        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, NULL);
> +        bdrv_co_yield_to_drain(NULL, true, false, NULL, true, true, false,
> +                               NULL);
>           return;
>       }
>   
> @@ -676,7 +700,7 @@ void bdrv_drain_all_begin(void)
>           AioContext *aio_context = bdrv_get_aio_context(bs);
>   
>           aio_context_acquire(aio_context);
> -        bdrv_do_drained_begin(bs, false, NULL, true, false);
> +        bdrv_do_drained_begin(bs, false, NULL, true, false, false);
>           aio_context_release(aio_context);
>       }
>   
> diff --git a/include/block/block-io.h b/include/block/block-io.h
> index 34a991649c..a329ae24c1 100644
> --- a/include/block/block-io.h
> +++ b/include/block/block-io.h
> @@ -253,6 +253,7 @@ void bdrv_drained_begin_no_poll(BlockDriverState *bs);
>    * exclusive access to all child nodes as well.
>    */
>   void bdrv_subtree_drained_begin(BlockDriverState *bs);
> +void bdrv_subtree_drained_begin_unlocked(BlockDriverState *bs);
>   
>   /**
>    * bdrv_drained_end:
> @@ -285,6 +286,7 @@ void bdrv_drained_end_no_poll(BlockDriverState *bs, int *drained_end_counter);
>    * End a quiescent section started by bdrv_subtree_drained_begin().
>    */
>   void bdrv_subtree_drained_end(BlockDriverState *bs);
> +void bdrv_subtree_drained_end_unlocked(BlockDriverState *bs);
>   
>   bool bdrv_can_store_new_dirty_bitmap(BlockDriverState *bs, const char *name,
>                                        uint32_t granularity, Error **errp);
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked
  2022-01-18 16:27 ` [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked Emanuele Giuseppe Esposito
  2022-01-19  9:52   ` Paolo Bonzini
@ 2022-01-26 11:04   ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 11:04 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
[-- Attachment #1: Type: text/plain, Size: 1148 bytes --]
On Tue, Jan 18, 2022 at 11:27:33AM -0500, Emanuele Giuseppe Esposito wrote:
> diff --git a/block/io.c b/block/io.c
> index 5123b7b713..9d5167f64a 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -244,6 +244,7 @@ typedef struct {
>      bool begin;
>      bool recursive;
>      bool poll;
> +    bool unlock;
>      BdrvChild *parent;
>      bool ignore_bds_parents;
>      int *drained_end_counter;
...
> @@ -473,23 +477,35 @@ static void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive,
>       */
>      if (poll) {
>          assert(!ignore_bds_parents);
> -        BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, recursive, parent));
> +        if (unlock) {
"Unlock" is a verb that suggests we'll perform an unlock operation.
Please call it "unlocked" instead.
> +            BDRV_POLL_WHILE_UNLOCKED(bs,
> +                                     bdrv_drain_poll_top_level(bs, recursive,
> +                                                               parent));
> +        } else {
> +            BDRV_POLL_WHILE(bs,
> +                            bdrv_drain_poll_top_level(bs, recursive, parent));
> +        }
>      }
>  }
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply	[flat|nested] 38+ messages in thread
 
- * [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (6 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 07/12] block/io.c: introduce bdrv_subtree_drained_{begin/end}_unlocked Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:33   ` Paolo Bonzini
  2022-01-26 11:16   ` Stefan Hajnoczi
  2022-01-18 16:27 ` [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed Emanuele Giuseppe Esposito
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Depending on the options given to reopen_state,
bdrv_reopen_parse_file_or_backing could pick another bs
that could be from another graph, and thus not protected
by subtree_drained_begin called by the callers of this
function.
We can't simply drain-undrain here, because of transactions.
To simplify the logic, transactions always assume that they
are run under drain, so the various subtree_drain introduced
so far always take care of covering tran_commit().
And since we cannot directly do it, as the transaction is
created/committed higher above, we can just add a new
transaction to the list that just executes subtree_drained_end
to match the drained_begin done in this function.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/block.c b/block.c
index fb5bc3077a..fcc44a49a0 100644
--- a/block.c
+++ b/block.c
@@ -4522,6 +4522,10 @@ int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
     return bdrv_reopen(bs, opts, true, errp);
 }
 
+TransactionActionDrv bdrv_drv_subtree_end = {
+    .clean = (void (*)(void *)) bdrv_subtree_drained_end_unlocked,
+};
+
 /*
  * Take a BDRVReopenState and check if the value of 'backing' in the
  * reopen_state->options QDict is valid or not.
@@ -4550,6 +4554,7 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
     const char *child_name = is_backing ? "backing" : "file";
     QObject *value;
     const char *str;
+    int ret = 0;
 
     assert(qemu_in_main_thread());
 
@@ -4573,6 +4578,8 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
                        "cycle", str, child_name, bs->node_name);
             return -EINVAL;
         }
+        /* This will be paired with a drained_end in tran_commit */
+        bdrv_subtree_drained_begin_unlocked(new_child_bs);
         break;
     default:
         /*
@@ -4583,18 +4590,19 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
     }
 
     if (old_child_bs == new_child_bs) {
-        return 0;
+        goto end;
     }
 
     if (old_child_bs) {
         if (bdrv_skip_implicit_filters(old_child_bs) == new_child_bs) {
-            return 0;
+            goto end;
         }
 
         if (old_child_bs->implicit) {
             error_setg(errp, "Cannot replace implicit %s child of %s",
                        child_name, bs->node_name);
-            return -EPERM;
+            ret = -EPERM;
+            goto end;
         }
     }
 
@@ -4605,7 +4613,8 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
          */
         error_setg(errp, "'%s' is a %s filter node that does not support a "
                    "%s child", bs->node_name, bs->drv->format_name, child_name);
-        return -EINVAL;
+        ret = -EINVAL;
+        goto end;
     }
 
     if (is_backing) {
@@ -4614,8 +4623,14 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
         reopen_state->old_file_bs = old_child_bs;
     }
 
-    return bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
+    ret =  bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
                                            tran, errp);
+
+end:
+    if (new_child_bs) {
+        tran_add(tran, &bdrv_drv_subtree_end, new_child_bs);
+    }
+    return ret;
 }
 
 /*
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing
  2022-01-18 16:27 ` [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing Emanuele Giuseppe Esposito
@ 2022-01-19  9:33   ` Paolo Bonzini
  2022-01-26 11:16   ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:33 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> Depending on the options given to reopen_state,
> bdrv_reopen_parse_file_or_backing could pick another bs
> that could be from another graph, and thus not protected
> by subtree_drained_begin called by the callers of this
> function.
> 
> We can't simply drain-undrain here, because of transactions.
> To simplify the logic, transactions always assume that they
> are run under drain, so the various subtree_drain introduced
> so far always take care of covering tran_commit().
> 
> And since we cannot directly do it, as the transaction is
> created/committed higher above, we can just add a new
> transaction to the list that just executes subtree_drained_end
> to match the drained_begin done in this function.
A rewrite of the last two paragraphs:
---
Of the two callers of bdrv_set_file_or_backing_noperm, 
bdrv_reopen_parse_file_or_backing is unique in that it does not complete 
the reopen, it only prepares a transaction.  The actual commit (or 
abort) is done higher in the call chain.
Therefore, the call to bdrv_subtree_drained_end_unlocked must be delayed 
until after the transaction's fate has been determined and its actions 
have been performed.  Do this by recording a TransactionActionDrv to end 
the drained section for new_child_bs.
---
Likewise, the subject can be "block: keep children drained across a 
reopen transaction"; explain the change and not the technicalities of 
how you did it.
Paolo
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 25 ++++++++++++++++++++-----
>   1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fb5bc3077a..fcc44a49a0 100644
> --- a/block.c
> +++ b/block.c
> @@ -4522,6 +4522,10 @@ int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
>       return bdrv_reopen(bs, opts, true, errp);
>   }
>   
> +TransactionActionDrv bdrv_drv_subtree_end = {
> +    .clean = (void (*)(void *)) bdrv_subtree_drained_end_unlocked,
> +};
> +
>   /*
>    * Take a BDRVReopenState and check if the value of 'backing' in the
>    * reopen_state->options QDict is valid or not.
> @@ -4550,6 +4554,7 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>       const char *child_name = is_backing ? "backing" : "file";
>       QObject *value;
>       const char *str;
> +    int ret = 0;
>   
>       assert(qemu_in_main_thread());
>   
> @@ -4573,6 +4578,8 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>                          "cycle", str, child_name, bs->node_name);
>               return -EINVAL;
>           }
> +        /* This will be paired with a drained_end in tran_commit */
> +        bdrv_subtree_drained_begin_unlocked(new_child_bs);
>           break;
>       default:
>           /*
> @@ -4583,18 +4590,19 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>       }
>   
>       if (old_child_bs == new_child_bs) {
> -        return 0;
> +        goto end;
>       }
>   
>       if (old_child_bs) {
>           if (bdrv_skip_implicit_filters(old_child_bs) == new_child_bs) {
> -            return 0;
> +            goto end;
>           }
>   
>           if (old_child_bs->implicit) {
>               error_setg(errp, "Cannot replace implicit %s child of %s",
>                          child_name, bs->node_name);
> -            return -EPERM;
> +            ret = -EPERM;
> +            goto end;
>           }
>       }
>   
> @@ -4605,7 +4613,8 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>            */
>           error_setg(errp, "'%s' is a %s filter node that does not support a "
>                      "%s child", bs->node_name, bs->drv->format_name, child_name);
> -        return -EINVAL;
> +        ret = -EINVAL;
> +        goto end;
>       }
>   
>       if (is_backing) {
> @@ -4614,8 +4623,14 @@ static int bdrv_reopen_parse_file_or_backing(BDRVReopenState *reopen_state,
>           reopen_state->old_file_bs = old_child_bs;
>       }
>   
> -    return bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
> +    ret =  bdrv_set_file_or_backing_noperm(bs, new_child_bs, is_backing,
>                                              tran, errp);
> +
> +end:
> +    if (new_child_bs) {
> +        tran_add(tran, &bdrv_drv_subtree_end, new_child_bs);
> +    }
> +    return ret;
>   }
>   
>   /*
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing
  2022-01-18 16:27 ` [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing Emanuele Giuseppe Esposito
  2022-01-19  9:33   ` Paolo Bonzini
@ 2022-01-26 11:16   ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 11:16 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]
On Tue, Jan 18, 2022 at 11:27:34AM -0500, Emanuele Giuseppe Esposito wrote:
> Depending on the options given to reopen_state,
> bdrv_reopen_parse_file_or_backing could pick another bs
> that could be from another graph, and thus not protected
> by subtree_drained_begin called by the callers of this
> function.
> 
> We can't simply drain-undrain here, because of transactions.
> To simplify the logic, transactions always assume that they
> are run under drain, so the various subtree_drain introduced
> so far always take care of covering tran_commit().
> 
> And since we cannot directly do it, as the transaction is
> created/committed higher above, we can just add a new
> transaction to the list that just executes subtree_drained_end
> to match the drained_begin done in this function.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>  block.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fb5bc3077a..fcc44a49a0 100644
> --- a/block.c
> +++ b/block.c
> @@ -4522,6 +4522,10 @@ int bdrv_reopen_set_read_only(BlockDriverState *bs, bool read_only,
>      return bdrv_reopen(bs, opts, true, errp);
>  }
>  
> +TransactionActionDrv bdrv_drv_subtree_end = {
> +    .clean = (void (*)(void *)) bdrv_subtree_drained_end_unlocked,
Please don't cast function pointers. If the types don't match please
define a wrapper function so the compiler can check the types.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply	[flat|nested] 38+ messages in thread
 
- * [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (7 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 08/12] reopen: add a transaction to drain_end nodes picked in bdrv_reopen_parse_file_or_backing Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-26 11:21   ` Stefan Hajnoczi
  2022-01-18 16:27 ` [PATCH 10/12] block.c: add subtree_drains where needed Emanuele Giuseppe Esposito
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
If a drain happens while a job is sleeping, the timeout
gets cancelled and the job continues once the drain ends.
This is especially bad for the sleep performed in commit and stream
jobs, since that is dictated by ratelimit to maintain a certain speed.
Basically the execution path is the followig:
1. job calls job_sleep_ns, and yield with a timer in @ns ns.
2. meanwhile, a drain is executed, and
   child_job_drained_{begin/end} could be executed as ->drained_begin()
   and ->drained_end() callbacks.
   Therefore child_job_drained_begin() enters the job, that continues
   execution in job_sleep_ns() and calls job_pause_point_locked().
3. job_pause_point_locked() detects that we are in the middle of a
   drain, and firstly deletes any existing timer and then yields again,
   waiting for ->drained_end().
4. Once draining is finished, child_job_drained_end() runs and resumes
   the job. At this point, the timer has been lost and we just resume
   without checking if enough time has passed.
This fix implies that from now onwards, job_sleep_ns will force the job
to sleep @ns, even if it is wake up (purposefully or not) in the middle
of the sleep. Therefore qemu-iotests test might run a little bit slower,
depending on the speed of the job. Setting a job speed to values like "1"
is not allowed anymore (unless you want to wait forever).
Because of this fix, test_stream_parallel() in tests/qemu-iotests/030
takes too long, since speed of stream job is just 1024 and before
it was skipping all the wait thanks to the drains. Increase the
speed to 256 * 1024. Exactly the same happens for test 151.
Instead we need to sleep less in test_cancel_ready() test-blockjob.c,
so that the job will be able to exit the sleep and transition to ready
before the main loop asserts.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 job.c                      | 28 +++++++++++++++++-----------
 tests/qemu-iotests/030     |  2 +-
 tests/qemu-iotests/151     |  4 ++--
 tests/unit/test-blockjob.c |  2 +-
 4 files changed, 21 insertions(+), 15 deletions(-)
diff --git a/job.c b/job.c
index 83921dd79b..6ef2adead4 100644
--- a/job.c
+++ b/job.c
@@ -584,17 +584,15 @@ static void coroutine_fn job_do_yield_locked(Job *job, uint64_t ns)
     assert(job->busy);
 }
 
-void coroutine_fn job_pause_point(Job *job)
+/* Called with job_mutex held, but releases it temporarly. */
+static void coroutine_fn job_pause_point_locked(Job *job)
 {
     assert(job && job_started(job));
 
-    job_lock();
     if (!job_should_pause_locked(job)) {
-        job_unlock();
         return;
     }
     if (job_is_cancelled_locked(job)) {
-        job_unlock();
         return;
     }
 
@@ -614,13 +612,20 @@ void coroutine_fn job_pause_point(Job *job)
         job->paused = false;
         job_state_transition_locked(job, status);
     }
-    job_unlock();
 
     if (job->driver->resume) {
+        job_unlock();
         job->driver->resume(job);
+        job_lock();
     }
 }
 
+void coroutine_fn job_pause_point(Job *job)
+{
+    JOB_LOCK_GUARD();
+    job_pause_point_locked(job);
+}
+
 void job_yield(Job *job)
 {
     WITH_JOB_LOCK_GUARD() {
@@ -641,21 +646,22 @@ void job_yield(Job *job)
 
 void coroutine_fn job_sleep_ns(Job *job, int64_t ns)
 {
-    WITH_JOB_LOCK_GUARD() {
-        assert(job->busy);
+    int64_t end_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns;
+    JOB_LOCK_GUARD();
+    assert(job->busy);
 
+    do {
         /* Check cancellation *before* setting busy = false, too!  */
         if (job_is_cancelled_locked(job)) {
             return;
         }
 
         if (!job_should_pause_locked(job)) {
-            job_do_yield_locked(job,
-                                qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + ns);
+            job_do_yield_locked(job, end_ns);
         }
-    }
 
-    job_pause_point(job);
+        job_pause_point_locked(job);
+    } while (qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < end_ns);
 }
 
 /* Assumes the job_mutex is held */
diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 567bf1da67..969b246d0f 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -248,7 +248,7 @@ class TestParallelOps(iotests.QMPTestCase):
             pending_jobs.append(job_id)
             result = self.vm.qmp('block-stream', device=node_name,
                                  job_id=job_id, bottom=f'node{i-1}',
-                                 speed=1024)
+                                 speed=256*1024)
             self.assert_qmp(result, 'return', {})
 
         # Do this in reverse: After unthrottling them, some jobs may finish
diff --git a/tests/qemu-iotests/151 b/tests/qemu-iotests/151
index 93d14193d0..5998beb5c4 100755
--- a/tests/qemu-iotests/151
+++ b/tests/qemu-iotests/151
@@ -129,7 +129,7 @@ class TestActiveMirror(iotests.QMPTestCase):
                              sync='full',
                              copy_mode='write-blocking',
                              buf_size=(1048576 // 4),
-                             speed=1)
+                             speed=1024*1024)
         self.assert_qmp(result, 'return', {})
 
         # Start an unaligned request to a dirty area
@@ -154,7 +154,7 @@ class TestActiveMirror(iotests.QMPTestCase):
                              target='target-node',
                              sync='full',
                              copy_mode='write-blocking',
-                             speed=1)
+                             speed=1024*1024)
 
         self.vm.hmp_qemu_io('source', 'break write_aio A')
         self.vm.hmp_qemu_io('source', 'aio_write 0 1M')  # 1
diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index c926db7b5d..0b3010b94d 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -184,7 +184,7 @@ static int coroutine_fn cancel_job_run(Job *job, Error **errp)
             job_transition_to_ready(&s->common.job);
         }
 
-        job_sleep_ns(&s->common.job, 100000);
+        job_sleep_ns(&s->common.job, 100);
     }
 
     return 0;
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed
  2022-01-18 16:27 ` [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed Emanuele Giuseppe Esposito
@ 2022-01-26 11:21   ` Stefan Hajnoczi
  2022-02-03 14:21     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 11:21 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
[-- Attachment #1: Type: text/plain, Size: 2145 bytes --]
On Tue, Jan 18, 2022 at 11:27:35AM -0500, Emanuele Giuseppe Esposito wrote:
> If a drain happens while a job is sleeping, the timeout
> gets cancelled and the job continues once the drain ends.
> This is especially bad for the sleep performed in commit and stream
> jobs, since that is dictated by ratelimit to maintain a certain speed.
> 
> Basically the execution path is the followig:
s/followig/following/
> 1. job calls job_sleep_ns, and yield with a timer in @ns ns.
> 2. meanwhile, a drain is executed, and
>    child_job_drained_{begin/end} could be executed as ->drained_begin()
>    and ->drained_end() callbacks.
>    Therefore child_job_drained_begin() enters the job, that continues
>    execution in job_sleep_ns() and calls job_pause_point_locked().
> 3. job_pause_point_locked() detects that we are in the middle of a
>    drain, and firstly deletes any existing timer and then yields again,
>    waiting for ->drained_end().
> 4. Once draining is finished, child_job_drained_end() runs and resumes
>    the job. At this point, the timer has been lost and we just resume
>    without checking if enough time has passed.
> 
> This fix implies that from now onwards, job_sleep_ns will force the job
> to sleep @ns, even if it is wake up (purposefully or not) in the middle
> of the sleep. Therefore qemu-iotests test might run a little bit slower,
> depending on the speed of the job. Setting a job speed to values like "1"
> is not allowed anymore (unless you want to wait forever).
> 
> Because of this fix, test_stream_parallel() in tests/qemu-iotests/030
> takes too long, since speed of stream job is just 1024 and before
> it was skipping all the wait thanks to the drains. Increase the
> speed to 256 * 1024. Exactly the same happens for test 151.
> 
> Instead we need to sleep less in test_cancel_ready() test-blockjob.c,
> so that the job will be able to exit the sleep and transition to ready
> before the main loop asserts.
I remember seeing Hanna and Kevin use carefully rate-limited jobs in
qemu-iotests. They might have thoughts on whether this patch is
acceptable or not.
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed
  2022-01-26 11:21   ` Stefan Hajnoczi
@ 2022-02-03 14:21     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-03 14:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
On 26/01/2022 12:21, Stefan Hajnoczi wrote:
> On Tue, Jan 18, 2022 at 11:27:35AM -0500, Emanuele Giuseppe Esposito wrote:
>> If a drain happens while a job is sleeping, the timeout
>> gets cancelled and the job continues once the drain ends.
>> This is especially bad for the sleep performed in commit and stream
>> jobs, since that is dictated by ratelimit to maintain a certain speed.
>>
>> Basically the execution path is the followig:
> 
> s/followig/following/
> 
>> 1. job calls job_sleep_ns, and yield with a timer in @ns ns.
>> 2. meanwhile, a drain is executed, and
>>    child_job_drained_{begin/end} could be executed as ->drained_begin()
>>    and ->drained_end() callbacks.
>>    Therefore child_job_drained_begin() enters the job, that continues
>>    execution in job_sleep_ns() and calls job_pause_point_locked().
>> 3. job_pause_point_locked() detects that we are in the middle of a
>>    drain, and firstly deletes any existing timer and then yields again,
>>    waiting for ->drained_end().
>> 4. Once draining is finished, child_job_drained_end() runs and resumes
>>    the job. At this point, the timer has been lost and we just resume
>>    without checking if enough time has passed.
>>
>> This fix implies that from now onwards, job_sleep_ns will force the job
>> to sleep @ns, even if it is wake up (purposefully or not) in the middle
>> of the sleep. Therefore qemu-iotests test might run a little bit slower,
>> depending on the speed of the job. Setting a job speed to values like "1"
>> is not allowed anymore (unless you want to wait forever).
>>
>> Because of this fix, test_stream_parallel() in tests/qemu-iotests/030
>> takes too long, since speed of stream job is just 1024 and before
>> it was skipping all the wait thanks to the drains. Increase the
>> speed to 256 * 1024. Exactly the same happens for test 151.
>>
>> Instead we need to sleep less in test_cancel_ready() test-blockjob.c,
>> so that the job will be able to exit the sleep and transition to ready
>> before the main loop asserts.
> 
> I remember seeing Hanna and Kevin use carefully rate-limited jobs in
> qemu-iotests. They might have thoughts on whether this patch is
> acceptable or not.
> 
I think the speed was carefully set as "slow enough" just to give time
to the operation to happen while the job was running. Anyways, all tests
I run work as intended, I just increased their speed slightly.
Having speed=1 would make e job really really slow.
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread
 
 
- * [PATCH 10/12] block.c: add subtree_drains where needed
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (8 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 09/12] jobs: ensure sleep in job_sleep_ns is fully performed Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:47   ` Paolo Bonzini
  2022-02-01 14:47   ` Vladimir Sementsov-Ogievskiy
  2022-01-18 16:27 ` [PATCH 11/12] block/io.c: fully enable assert_bdrv_graph_writable Emanuele Giuseppe Esposito
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Protect bdrv_replace_child_noperm, as it modifies the
graph by adding/removing elements to .children and .parents
list of a bs. Use the newly introduced
bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
that and be free from the aiocontext lock.
One important criteria to keep in mind is that if the caller of
bdrv_replace_child_noperm creates a transaction, we need to make sure that the
whole transaction is under the same drain block. This is imperative, as having
multiple drains also in the .abort() class of functions causes discrepancies
in the drained counters (as nodes are put back into the original positions),
making it really hard to retourn all to zero and leaving the code very buggy.
See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
for more explanations.
Unfortunately we still need to have bdrv_subtree_drained_begin/end
in bdrv_detach_child() releasing and then holding the AioContext
lock, since it later invokes bdrv_try_set_aio_context() that is
not safe yet. Once all is cleaned up, we can also remove the
acquire/release locks in job_unref, artificially added because of this.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 44 insertions(+), 6 deletions(-)
diff --git a/block.c b/block.c
index fcc44a49a0..6196c95aae 100644
--- a/block.c
+++ b/block.c
@@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
     BlockDriverState *old_bs = (*childp)->bs;
 
     assert(qemu_in_main_thread());
+    if (old_bs) {
+        /*
+         * TODO: this is called by job_unref with lock held, because
+         * afterwards it calls bdrv_try_set_aio_context.
+         * Once all of this is fixed, take care of removing
+         * the aiocontext lock and make this function _unlocked.
+         */
+        bdrv_subtree_drained_begin(old_bs);
+    }
+
     bdrv_replace_child_noperm(childp, NULL, true);
 
+    if (old_bs) {
+        bdrv_subtree_drained_end(old_bs);
+    }
+
     if (old_bs) {
         /*
          * Update permissions for old node. We're just taking a parent away, so
@@ -3154,6 +3168,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
     Transaction *tran = tran_new();
 
     assert(qemu_in_main_thread());
+    bdrv_subtree_drained_begin_unlocked(child_bs);
 
     ret = bdrv_attach_child_common(child_bs, child_name, child_class,
                                    child_role, perm, shared_perm, opaque,
@@ -3168,6 +3183,7 @@ out:
     tran_finalize(tran, ret);
     /* child is unset on failure by bdrv_attach_child_common_abort() */
     assert((ret < 0) == !child);
+    bdrv_subtree_drained_end_unlocked(child_bs);
 
     bdrv_unref(child_bs);
     return child;
@@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 
     assert(qemu_in_main_thread());
 
+    bdrv_subtree_drained_begin_unlocked(parent_bs);
+    bdrv_subtree_drained_begin_unlocked(child_bs);
+
     ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
                                    child_role, &child, tran, errp);
     if (ret < 0) {
@@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
 out:
     tran_finalize(tran, ret);
     /* child is unset on failure by bdrv_attach_child_common_abort() */
+    bdrv_subtree_drained_end_unlocked(child_bs);
+    bdrv_subtree_drained_end_unlocked(parent_bs);
+
     assert((ret < 0) == !child);
 
     bdrv_unref(child_bs);
@@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
 
     assert(qemu_in_main_thread());
 
+    bdrv_subtree_drained_begin_unlocked(bs);
+    if (backing_hd) {
+        bdrv_subtree_drained_begin_unlocked(backing_hd);
+    }
+
     ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
     if (ret < 0) {
         goto out;
@@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
     ret = bdrv_refresh_perms(bs, errp);
 out:
     tran_finalize(tran, ret);
+    if (backing_hd) {
+        bdrv_subtree_drained_end_unlocked(backing_hd);
+    }
+    bdrv_subtree_drained_end_unlocked(bs);
 
     return ret;
 }
@@ -5266,7 +5297,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
 
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
     assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
-    bdrv_drained_begin(from);
+    bdrv_subtree_drained_begin_unlocked(from);
+    bdrv_subtree_drained_begin_unlocked(to);
 
     /*
      * Do the replacement without permission update.
@@ -5298,7 +5330,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
 out:
     tran_finalize(tran, ret);
 
-    bdrv_drained_end(from);
+    bdrv_subtree_drained_end_unlocked(to);
+    bdrv_subtree_drained_end_unlocked(from);
     bdrv_unref(from);
 
     return ret;
@@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
 
     assert(!bs_new->backing);
 
+    bdrv_subtree_drained_begin_unlocked(bs_new);
+    bdrv_subtree_drained_begin_unlocked(bs_top);
+
     ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
                                    &child_of_bds, bdrv_backing_role(bs_new),
                                    &bs_new->backing, tran, errp);
@@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
     ret = bdrv_refresh_perms(bs_new, errp);
 out:
     tran_finalize(tran, ret);
+    bdrv_subtree_drained_end_unlocked(bs_top);
+    bdrv_subtree_drained_end_unlocked(bs_new);
 
     bdrv_refresh_limits(bs_top, NULL, NULL);
 
@@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
     assert(qemu_in_main_thread());
 
     bdrv_ref(old_bs);
-    bdrv_drained_begin(old_bs);
-    bdrv_drained_begin(new_bs);
+    bdrv_subtree_drained_begin_unlocked(old_bs);
+    bdrv_subtree_drained_begin_unlocked(new_bs);
 
     bdrv_replace_child_tran(&child, new_bs, tran, true);
     /* @new_bs must have been non-NULL, so @child must not have been freed */
@@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
 
     tran_finalize(tran, ret);
 
-    bdrv_drained_end(old_bs);
-    bdrv_drained_end(new_bs);
+    bdrv_subtree_drained_end_unlocked(new_bs);
+    bdrv_subtree_drained_end_unlocked(old_bs);
     bdrv_unref(old_bs);
 
     return ret;
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-01-18 16:27 ` [PATCH 10/12] block.c: add subtree_drains where needed Emanuele Giuseppe Esposito
@ 2022-01-19  9:47   ` Paolo Bonzini
  2022-02-03 13:13     ` Emanuele Giuseppe Esposito
  2022-02-01 14:47   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:47 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
Subject doesn't say what needs them: "block: drain block devices across 
graph modifications"
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> Protect bdrv_replace_child_noperm, as it modifies the
> graph by adding/removing elements to .children and .parents
> list of a bs. Use the newly introduced
> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
> that and be free from the aiocontext lock.
> 
> One important criteria to keep in mind is that if the caller of
> bdrv_replace_child_noperm creates a transaction, we need to make sure that the
> whole transaction is under the same drain block.
Okay, this is the important part and it should be mentioned in patch 8 
as well.  It should also be in a comment above bdrv_replace_child_noperm().
> This is imperative, as having
> multiple drains also in the .abort() class of functions causes discrepancies
> in the drained counters (as nodes are put back into the original positions),
> making it really hard to retourn all to zero and leaving the code very buggy.
> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
> for more explanations.
> 
> Unfortunately we still need to have bdrv_subtree_drained_begin/end
> in bdrv_detach_child() releasing and then holding the AioContext
> lock, since it later invokes bdrv_try_set_aio_context() that is
> not safe yet. Once all is cleaned up, we can also remove the
> acquire/release locks in job_unref, artificially added because of this.
About this:
> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.
It may be clear to you, but it's quite cryptic:
- which lock is held by job_unref()?  Also, would it make more sense to 
talk about block_job_free() rather than job_unref()?  I can't quite 
follow where the AioContext lock is taken.
- what is "all of this", and what do you mean by "not safe yet"?  Do 
both refer to bdrv_try_set_aio_context() needing the AioContext lock?
- what is "this function" (that should become _unlocked)?
I think you could also split the patch in multiple parts for different 
call chains.  In particular bdrv_set_backing_hd can be merged with the 
patch to bdrv_reopen_parse_file_or_backing, since both of them deal with 
bdrv_set_file_or_backing_noperm.
Paolo
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fcc44a49a0..6196c95aae 100644
> --- a/block.c
> +++ b/block.c
> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>       BlockDriverState *old_bs = (*childp)->bs;
>   
>       assert(qemu_in_main_thread());
> +    if (old_bs) {
> +        /*
> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.
> +         */
> +        bdrv_subtree_drained_begin(old_bs);
> +    }
> +
>       bdrv_replace_child_noperm(childp, NULL, true);
>   
> +    if (old_bs) {
> +        bdrv_subtree_drained_end(old_bs);
> +    }
> +
>       if (old_bs) {
>           /*
>            * Update permissions for old node. We're just taking a parent away, so
> @@ -3154,6 +3168,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
>       Transaction *tran = tran_new();
>   
>       assert(qemu_in_main_thread());
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>   
>       ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>                                      child_role, perm, shared_perm, opaque,
> @@ -3168,6 +3183,7 @@ out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>       assert((ret < 0) == !child);
> +    bdrv_subtree_drained_end_unlocked(child_bs);
>   
>       bdrv_unref(child_bs);
>       return child;
> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(parent_bs);
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
> +
>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
>                                      child_role, &child, tran, errp);
>       if (ret < 0) {
> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
> +    bdrv_subtree_drained_end_unlocked(child_bs);
> +    bdrv_subtree_drained_end_unlocked(parent_bs);
> +
>       assert((ret < 0) == !child);
>   
>       bdrv_unref(child_bs);
> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(bs);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
> +    }
> +
>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>       if (ret < 0) {
>           goto out;
> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>       ret = bdrv_refresh_perms(bs, errp);
>   out:
>       tran_finalize(tran, ret);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_end_unlocked(backing_hd);
> +    }
> +    bdrv_subtree_drained_end_unlocked(bs);
>   
>       return ret;
>   }
> @@ -5266,7 +5297,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   
>       assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
> -    bdrv_drained_begin(from);
> +    bdrv_subtree_drained_begin_unlocked(from);
> +    bdrv_subtree_drained_begin_unlocked(to);
>   
>       /*
>        * Do the replacement without permission update.
> @@ -5298,7 +5330,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   out:
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(from);
> +    bdrv_subtree_drained_end_unlocked(to);
> +    bdrv_subtree_drained_end_unlocked(from);
>       bdrv_unref(from);
>   
>       return ret;
> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>   
>       assert(!bs_new->backing);
>   
> +    bdrv_subtree_drained_begin_unlocked(bs_new);
> +    bdrv_subtree_drained_begin_unlocked(bs_top);
> +
>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>                                      &child_of_bds, bdrv_backing_role(bs_new),
>                                      &bs_new->backing, tran, errp);
> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>       ret = bdrv_refresh_perms(bs_new, errp);
>   out:
>       tran_finalize(tran, ret);
> +    bdrv_subtree_drained_end_unlocked(bs_top);
> +    bdrv_subtree_drained_end_unlocked(bs_new);
>   
>       bdrv_refresh_limits(bs_top, NULL, NULL);
>   
> @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>       assert(qemu_in_main_thread());
>   
>       bdrv_ref(old_bs);
> -    bdrv_drained_begin(old_bs);
> -    bdrv_drained_begin(new_bs);
> +    bdrv_subtree_drained_begin_unlocked(old_bs);
> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>   
>       bdrv_replace_child_tran(&child, new_bs, tran, true);
>       /* @new_bs must have been non-NULL, so @child must not have been freed */
> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>   
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(old_bs);
> -    bdrv_drained_end(new_bs);
> +    bdrv_subtree_drained_end_unlocked(new_bs);
> +    bdrv_subtree_drained_end_unlocked(old_bs);
>       bdrv_unref(old_bs);
>   
>       return ret;
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-01-19  9:47   ` Paolo Bonzini
@ 2022-02-03 13:13     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-03 13:13 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 19/01/2022 10:47, Paolo Bonzini wrote:
> 
> About this:
> 
>> +         * TODO: this is called by job_unref with lock held, because
>> +         * afterwards it calls bdrv_try_set_aio_context.
>> +         * Once all of this is fixed, take care of removing
>> +         * the aiocontext lock and make this function _unlocked.
> 
> It may be clear to you, but it's quite cryptic:
> 
I think you figured it by looking at the job patches, but:
> - which lock is held by job_unref()?  Also, would it make more sense to
> talk about block_job_free() rather than job_unref()?  I can't quite
> follow where the AioContext lock is taken.
AioContext lock. I think this is a change I made in the job patches, so
comparing it with the master would make this piece harder to understand.
In the job series, I reduce the granularity of the AioContext lock,
ending up having it only around few callbacks of JobDriver, namely
->free(). This is why I talk about job_unref, because it calls ->free.
The actual lock is taken in job_unref, but the caller (->free) is
block_job_free. Yes it makes more sense mentioning block_job_free.
> - what is "all of this", and what do you mean by "not safe yet"?  Do
> both refer to bdrv_try_set_aio_context() needing the AioContext lock?
Yes
> - what is "this function" (that should become _unlocked)?
bdrv_subtree_drained_begin
This is the new comment I intend to put:
/*
* TODO: this function is called by BlockJobDriver's ->free()
* callback, block_job_free.
* We unfortunately must still take the AioContext lock around
* ->free() in job_unref, because of the bdrv_try_set_aio_context
* call below that still releases/acquires it.
* Once bdrv_try_set_aio_context does not require the AioContext lock,
* take care of removing it around ->free() and replace
* the following bdrv_subtree_drained_begin with its
* _unlocked version.
*/
> 
> I think you could also split the patch in multiple parts for different
> call chains.  In particular bdrv_set_backing_hd can be merged with the
> patch to bdrv_reopen_parse_file_or_backing, since both of them deal with
> bdrv_set_file_or_backing_noperm.
> Ok, I will try to do that.
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread 
 
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-01-18 16:27 ` [PATCH 10/12] block.c: add subtree_drains where needed Emanuele Giuseppe Esposito
  2022-01-19  9:47   ` Paolo Bonzini
@ 2022-02-01 14:47   ` Vladimir Sementsov-Ogievskiy
  2022-02-02 15:37     ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 38+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-02-01 14:47 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Stefan Hajnoczi, Paolo Bonzini,
	Fam Zheng, John Snow, qemu-devel
18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
> Protect bdrv_replace_child_noperm, as it modifies the
> graph by adding/removing elements to .children and .parents
> list of a bs. Use the newly introduced
> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
> that and be free from the aiocontext lock.
> 
> One important criteria to keep in mind is that if the caller of
> bdrv_replace_child_noperm creates a transaction, we need to make sure that the
> whole transaction is under the same drain block. This is imperative, as having
> multiple drains also in the .abort() class of functions causes discrepancies
> in the drained counters (as nodes are put back into the original positions),
> making it really hard to retourn all to zero and leaving the code very buggy.
> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
> for more explanations.
> 
> Unfortunately we still need to have bdrv_subtree_drained_begin/end
> in bdrv_detach_child() releasing and then holding the AioContext
> lock, since it later invokes bdrv_try_set_aio_context() that is
> not safe yet. Once all is cleaned up, we can also remove the
> acquire/release locks in job_unref, artificially added because of this.
> 
> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
> ---
>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 44 insertions(+), 6 deletions(-)
> 
> diff --git a/block.c b/block.c
> index fcc44a49a0..6196c95aae 100644
> --- a/block.c
> +++ b/block.c
> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>       BlockDriverState *old_bs = (*childp)->bs;
>   
>       assert(qemu_in_main_thread());
> +    if (old_bs) {
> +        /*
> +         * TODO: this is called by job_unref with lock held, because
> +         * afterwards it calls bdrv_try_set_aio_context.
> +         * Once all of this is fixed, take care of removing
> +         * the aiocontext lock and make this function _unlocked.
> +         */
> +        bdrv_subtree_drained_begin(old_bs);
> +    }
> +
>       bdrv_replace_child_noperm(childp, NULL, true);
>   
> +    if (old_bs) {
> +        bdrv_subtree_drained_end(old_bs);
> +    }
> +
>       if (old_bs) {
>           /*
>            * Update permissions for old node. We're just taking a parent away, so
> @@ -3154,6 +3168,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
>       Transaction *tran = tran_new();
>   
>       assert(qemu_in_main_thread());
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>   
>       ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>                                      child_role, perm, shared_perm, opaque,
> @@ -3168,6 +3183,7 @@ out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>       assert((ret < 0) == !child);
> +    bdrv_subtree_drained_end_unlocked(child_bs);
>   
>       bdrv_unref(child_bs);
>       return child;
> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(parent_bs);
> +    bdrv_subtree_drained_begin_unlocked(child_bs);
> +
>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name, child_class,
>                                      child_role, &child, tran, errp);
>       if (ret < 0) {
> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
>   out:
>       tran_finalize(tran, ret);
>       /* child is unset on failure by bdrv_attach_child_common_abort() */
> +    bdrv_subtree_drained_end_unlocked(child_bs);
> +    bdrv_subtree_drained_end_unlocked(parent_bs);
> +
>       assert((ret < 0) == !child);
>   
>       bdrv_unref(child_bs);
> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>   
>       assert(qemu_in_main_thread());
>   
> +    bdrv_subtree_drained_begin_unlocked(bs);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
> +    }
> +
>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>       if (ret < 0) {
>           goto out;
> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs, BlockDriverState *backing_hd,
>       ret = bdrv_refresh_perms(bs, errp);
>   out:
>       tran_finalize(tran, ret);
> +    if (backing_hd) {
> +        bdrv_subtree_drained_end_unlocked(backing_hd);
> +    }
> +    bdrv_subtree_drained_end_unlocked(bs);
>   
>       return ret;
>   }
> @@ -5266,7 +5297,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   
>       assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
> -    bdrv_drained_begin(from);
> +    bdrv_subtree_drained_begin_unlocked(from);
> +    bdrv_subtree_drained_begin_unlocked(to);
>   
>       /*
>        * Do the replacement without permission update.
> @@ -5298,7 +5330,8 @@ static int bdrv_replace_node_common(BlockDriverState *from,
>   out:
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(from);
> +    bdrv_subtree_drained_end_unlocked(to);
> +    bdrv_subtree_drained_end_unlocked(from);
>       bdrv_unref(from);
>   
>       return ret;
> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>   
>       assert(!bs_new->backing);
>   
> +    bdrv_subtree_drained_begin_unlocked(bs_new);
> +    bdrv_subtree_drained_begin_unlocked(bs_top);
> +
>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>                                      &child_of_bds, bdrv_backing_role(bs_new),
>                                      &bs_new->backing, tran, errp);
> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new, BlockDriverState *bs_top,
>       ret = bdrv_refresh_perms(bs_new, errp);
>   out:
>       tran_finalize(tran, ret);
> +    bdrv_subtree_drained_end_unlocked(bs_top);
> +    bdrv_subtree_drained_end_unlocked(bs_new);
>   
>       bdrv_refresh_limits(bs_top, NULL, NULL);
>   
> @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>       assert(qemu_in_main_thread());
>   
>       bdrv_ref(old_bs);
> -    bdrv_drained_begin(old_bs);
> -    bdrv_drained_begin(new_bs);
> +    bdrv_subtree_drained_begin_unlocked(old_bs);
> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>   
>       bdrv_replace_child_tran(&child, new_bs, tran, true);
>       /* @new_bs must have been non-NULL, so @child must not have been freed */
> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child, BlockDriverState *new_bs,
>   
>       tran_finalize(tran, ret);
>   
> -    bdrv_drained_end(old_bs);
> -    bdrv_drained_end(new_bs);
> +    bdrv_subtree_drained_end_unlocked(new_bs);
> +    bdrv_subtree_drained_end_unlocked(old_bs);
>       bdrv_unref(old_bs);
>   
>       return ret;
> 
As we started to discuss in a thread started with my "[PATCH] block: bdrv_set_backing_hd(): use drained section", I think draining the whole subtree when we modify some parent-child relation is too much. In-flight requests in subtree don't touch these relations, so why to wait/stop them? Imagine two disks A and B with same backing file C. If we want to detach A from C, what is the reason to drain in-fligth read request of B in C?
Modifying children/parents lists should be protected somehow. Currently it's protected by aio-context lock.. If we drop it, we probably need some new mutex for it. Also, graph modification should be protected from each other, so that one graph modifiction is not started until another is in-flight, probably we need some global mutex for graph modification. But that's all not about drains.
Drains are about in-flight requests. And when we switch a child of node X, we should do it in drained section for X, as in-flight requests in X can touch its children. But another in-flight requests in subtree are safe and should not be awaited.
-- 
Best regards,
Vladimir
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-01 14:47   ` Vladimir Sementsov-Ogievskiy
@ 2022-02-02 15:37     ` Emanuele Giuseppe Esposito
  2022-02-02 17:38       ` Paolo Bonzini
  2022-02-04  9:49       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-02 15:37 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow
On 01/02/2022 15:47, Vladimir Sementsov-Ogievskiy wrote:
> 18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
>> Protect bdrv_replace_child_noperm, as it modifies the
>> graph by adding/removing elements to .children and .parents
>> list of a bs. Use the newly introduced
>> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
>> that and be free from the aiocontext lock.
>>
>> One important criteria to keep in mind is that if the caller of
>> bdrv_replace_child_noperm creates a transaction, we need to make sure
>> that the
>> whole transaction is under the same drain block. This is imperative,
>> as having
>> multiple drains also in the .abort() class of functions causes
>> discrepancies
>> in the drained counters (as nodes are put back into the original
>> positions),
>> making it really hard to retourn all to zero and leaving the code very
>> buggy.
>> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
>> for more explanations.
>>
>> Unfortunately we still need to have bdrv_subtree_drained_begin/end
>> in bdrv_detach_child() releasing and then holding the AioContext
>> lock, since it later invokes bdrv_try_set_aio_context() that is
>> not safe yet. Once all is cleaned up, we can also remove the
>> acquire/release locks in job_unref, artificially added because of this.
>>
>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>> ---
>>   block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 44 insertions(+), 6 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index fcc44a49a0..6196c95aae 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>>       BlockDriverState *old_bs = (*childp)->bs;
>>         assert(qemu_in_main_thread());
>> +    if (old_bs) {
>> +        /*
>> +         * TODO: this is called by job_unref with lock held, because
>> +         * afterwards it calls bdrv_try_set_aio_context.
>> +         * Once all of this is fixed, take care of removing
>> +         * the aiocontext lock and make this function _unlocked.
>> +         */
>> +        bdrv_subtree_drained_begin(old_bs);
>> +    }
>> +
>>       bdrv_replace_child_noperm(childp, NULL, true);
>>   +    if (old_bs) {
>> +        bdrv_subtree_drained_end(old_bs);
>> +    }
>> +
>>       if (old_bs) {
>>           /*
>>            * Update permissions for old node. We're just taking a
>> parent away, so
>> @@ -3154,6 +3168,7 @@ BdrvChild
>> *bdrv_root_attach_child(BlockDriverState *child_bs,
>>       Transaction *tran = tran_new();
>>         assert(qemu_in_main_thread());
>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>         ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>>                                      child_role, perm, shared_perm,
>> opaque,
>> @@ -3168,6 +3183,7 @@ out:
>>       tran_finalize(tran, ret);
>>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>>       assert((ret < 0) == !child);
>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>         bdrv_unref(child_bs);
>>       return child;
>> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>> *parent_bs,
>>         assert(qemu_in_main_thread());
>>   +    bdrv_subtree_drained_begin_unlocked(parent_bs);
>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>> +
>>       ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
>> child_class,
>>                                      child_role, &child, tran, errp);
>>       if (ret < 0) {
>> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>> *parent_bs,
>>   out:
>>       tran_finalize(tran, ret);
>>       /* child is unset on failure by bdrv_attach_child_common_abort() */
>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>> +    bdrv_subtree_drained_end_unlocked(parent_bs);
>> +
>>       assert((ret < 0) == !child);
>>         bdrv_unref(child_bs);
>> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>> BlockDriverState *backing_hd,
>>         assert(qemu_in_main_thread());
>>   +    bdrv_subtree_drained_begin_unlocked(bs);
>> +    if (backing_hd) {
>> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
>> +    }
>> +
>>       ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>>       if (ret < 0) {
>>           goto out;
>> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>> BlockDriverState *backing_hd,
>>       ret = bdrv_refresh_perms(bs, errp);
>>   out:
>>       tran_finalize(tran, ret);
>> +    if (backing_hd) {
>> +        bdrv_subtree_drained_end_unlocked(backing_hd);
>> +    }
>> +    bdrv_subtree_drained_end_unlocked(bs);
>>         return ret;
>>   }
>> @@ -5266,7 +5297,8 @@ static int
>> bdrv_replace_node_common(BlockDriverState *from,
>>         assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>>       assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
>> -    bdrv_drained_begin(from);
>> +    bdrv_subtree_drained_begin_unlocked(from);
>> +    bdrv_subtree_drained_begin_unlocked(to);
>>         /*
>>        * Do the replacement without permission update.
>> @@ -5298,7 +5330,8 @@ static int
>> bdrv_replace_node_common(BlockDriverState *from,
>>   out:
>>       tran_finalize(tran, ret);
>>   -    bdrv_drained_end(from);
>> +    bdrv_subtree_drained_end_unlocked(to);
>> +    bdrv_subtree_drained_end_unlocked(from);
>>       bdrv_unref(from);
>>         return ret;
>> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new,
>> BlockDriverState *bs_top,
>>         assert(!bs_new->backing);
>>   +    bdrv_subtree_drained_begin_unlocked(bs_new);
>> +    bdrv_subtree_drained_begin_unlocked(bs_top);
>> +
>>       ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>>                                      &child_of_bds,
>> bdrv_backing_role(bs_new),
>>                                      &bs_new->backing, tran, errp);
>> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new,
>> BlockDriverState *bs_top,
>>       ret = bdrv_refresh_perms(bs_new, errp);
>>   out:
>>       tran_finalize(tran, ret);
>> +    bdrv_subtree_drained_end_unlocked(bs_top);
>> +    bdrv_subtree_drained_end_unlocked(bs_new);
>>         bdrv_refresh_limits(bs_top, NULL, NULL);
>>   @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>> BlockDriverState *new_bs,
>>       assert(qemu_in_main_thread());
>>         bdrv_ref(old_bs);
>> -    bdrv_drained_begin(old_bs);
>> -    bdrv_drained_begin(new_bs);
>> +    bdrv_subtree_drained_begin_unlocked(old_bs);
>> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>>         bdrv_replace_child_tran(&child, new_bs, tran, true);
>>       /* @new_bs must have been non-NULL, so @child must not have been
>> freed */
>> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>> BlockDriverState *new_bs,
>>         tran_finalize(tran, ret);
>>   -    bdrv_drained_end(old_bs);
>> -    bdrv_drained_end(new_bs);
>> +    bdrv_subtree_drained_end_unlocked(new_bs);
>> +    bdrv_subtree_drained_end_unlocked(old_bs);
>>       bdrv_unref(old_bs);
>>         return ret;
>>
From the other thread:
> So you mean that backing_hd is modified as its parent is changed, do I understand correctly?
Yes
> 
> As we started to discuss in a thread started with my "[PATCH] block:
> bdrv_set_backing_hd(): use drained section", I think draining the whole
> subtree when we modify some parent-child relation is too much. In-flight
> requests in subtree don't touch these relations, so why to wait/stop
> them? Imagine two disks A and B with same backing file C. If we want to
> detach A from C, what is the reason to drain in-fligth read request of B
> in C?
If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
the old backing_hd, but let's not think about this for a moment).
So we have disk B with backing file C, and new disk A that wants to have
backing file C.
I think I understand what you mean, so in theory the operation would be
- create new child
- add child to A->children list
- add child to C->parents list
So in theory we need to
* drain A (without subtree), because it can't happen that child nodes of
  A have in-flight requests that look at A status (children list), right?
  In other words, if A has another node X, can a request in X inspect
  A->children
* drain C, as parents can inspect C status (like B). Same assumption
  here, C->children[x]->bs cannot have requests inspecting C->parents
  list?
> Modifying children/parents lists should be protected somehow. Currently
> it's protected by aio-context lock.. If we drop it, we probably need
> some new mutex for it. Also, graph modification should be protected from
> each other, so that one graph modifiction is not started until another
> is in-flight, probably we need some global mutex for graph modification.
> But that's all not about drains.
The idea was to use BQL + drain, to obtain the same effect of aiocontext
lock. BQL should prevent concurrent graph modifications, drain
concurrent I/O reading the state while being modificated.
Emanuele
> 
> Drains are about in-flight requests. And when we switch a child of node
> X, we should do it in drained section for X, as in-flight requests in X
> can touch its children. But another in-flight requests in subtree are
> safe and should not be awaited.
> 
> 
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-02 15:37     ` Emanuele Giuseppe Esposito
@ 2022-02-02 17:38       ` Paolo Bonzini
  2022-02-03 10:09         ` Emanuele Giuseppe Esposito
  2022-02-04  9:49       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2022-02-02 17:38 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, Vladimir Sementsov-Ogievskiy,
	qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Hanna Reitz, Stefan Hajnoczi,
	John Snow
On 2/2/22 16:37, Emanuele Giuseppe Esposito wrote:
> So we have disk B with backing file C, and new disk A that wants to have
> backing file C.
> 
> I think I understand what you mean, so in theory the operation would be
> - create new child
> - add child to A->children list
> - add child to C->parents list
> 
> So in theory we need to
> * drain A (without subtree), because it can't happen that child nodes of
>    A have in-flight requests that look at A status (children list), right?
>    In other words, if A has another node X, can a request in X inspect
>    A->children
> * drain C, as parents can inspect C status (like B). Same assumption
>    here, C->children[x]->bs cannot have requests inspecting C->parents
>    list?
In that case (i.e. if parents have to be drained, but children need not) 
bdrv_drained_begin_unlocked would be enough, right?
That would mean that ->children is I/O state but ->parents is global 
state.  I think it's quite a bit more complicated to analyze and to 
understand.
Paolo
^ permalink raw reply	[flat|nested] 38+ messages in thread 
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-02 17:38       ` Paolo Bonzini
@ 2022-02-03 10:09         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-03 10:09 UTC (permalink / raw)
  To: Paolo Bonzini, Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Hanna Reitz, Stefan Hajnoczi,
	John Snow
On 02/02/2022 18:38, Paolo Bonzini wrote:
> On 2/2/22 16:37, Emanuele Giuseppe Esposito wrote:
>> So we have disk B with backing file C, and new disk A that wants to have
>> backing file C.
>>
>> I think I understand what you mean, so in theory the operation would be
>> - create new child
>> - add child to A->children list
>> - add child to C->parents list
>>
>> So in theory we need to
>> * drain A (without subtree), because it can't happen that child nodes of
>>    A have in-flight requests that look at A status (children list),
>> right?
>>    In other words, if A has another node X, can a request in X inspect
>>    A->children
I am not sure if this can happen. It doesn't seem so, though. All
functions inspecting ->parents are either GS or don't call recursive
function that go down on children list again.
>> * drain C, as parents can inspect C status (like B). Same assumption
>>    here, C->children[x]->bs cannot have requests inspecting C->parents
>>    list?
What if C->children[x]->bs drains? we would have a child inspecting
C->parents.
> 
> In that case (i.e. if parents have to be drained, but children need not)
> bdrv_drained_begin_unlocked would be enough, right?
Should be, if that is the case.
> 
> That would mean that ->children is I/O state but ->parents is global
> state.  I think it's quite a bit more complicated to analyze and to
> understand.
Not sure I follow this one, why is ->parents GS? it is also used by the
drain API...
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread 
 
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-02 15:37     ` Emanuele Giuseppe Esposito
  2022-02-02 17:38       ` Paolo Bonzini
@ 2022-02-04  9:49       ` Vladimir Sementsov-Ogievskiy
  2022-02-04 13:30         ` Emanuele Giuseppe Esposito
  1 sibling, 1 reply; 38+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-02-04  9:49 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Stefan Hajnoczi, Paolo Bonzini,
	Fam Zheng, John Snow, qemu-devel
02.02.2022 18:37, Emanuele Giuseppe Esposito wrote:
> 
> 
> On 01/02/2022 15:47, Vladimir Sementsov-Ogievskiy wrote:
>> 18.01.2022 19:27, Emanuele Giuseppe Esposito wrote:
>>> Protect bdrv_replace_child_noperm, as it modifies the
>>> graph by adding/removing elements to .children and .parents
>>> list of a bs. Use the newly introduced
>>> bdrv_subtree_drained_{begin/end}_unlocked drains to achieve
>>> that and be free from the aiocontext lock.
>>>
>>> One important criteria to keep in mind is that if the caller of
>>> bdrv_replace_child_noperm creates a transaction, we need to make sure
>>> that the
>>> whole transaction is under the same drain block. This is imperative,
>>> as having
>>> multiple drains also in the .abort() class of functions causes
>>> discrepancies
>>> in the drained counters (as nodes are put back into the original
>>> positions),
>>> making it really hard to retourn all to zero and leaving the code very
>>> buggy.
>>> See https://patchew.org/QEMU/20211213104014.69858-1-eesposit@redhat.com/
>>> for more explanations.
>>>
>>> Unfortunately we still need to have bdrv_subtree_drained_begin/end
>>> in bdrv_detach_child() releasing and then holding the AioContext
>>> lock, since it later invokes bdrv_try_set_aio_context() that is
>>> not safe yet. Once all is cleaned up, we can also remove the
>>> acquire/release locks in job_unref, artificially added because of this.
>>>
>>> Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
>>> ---
>>>    block.c | 50 ++++++++++++++++++++++++++++++++++++++++++++------
>>>    1 file changed, 44 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index fcc44a49a0..6196c95aae 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -3114,8 +3114,22 @@ static void bdrv_detach_child(BdrvChild **childp)
>>>        BlockDriverState *old_bs = (*childp)->bs;
>>>          assert(qemu_in_main_thread());
>>> +    if (old_bs) {
>>> +        /*
>>> +         * TODO: this is called by job_unref with lock held, because
>>> +         * afterwards it calls bdrv_try_set_aio_context.
>>> +         * Once all of this is fixed, take care of removing
>>> +         * the aiocontext lock and make this function _unlocked.
>>> +         */
>>> +        bdrv_subtree_drained_begin(old_bs);
>>> +    }
>>> +
>>>        bdrv_replace_child_noperm(childp, NULL, true);
>>>    +    if (old_bs) {
>>> +        bdrv_subtree_drained_end(old_bs);
>>> +    }
>>> +
>>>        if (old_bs) {
>>>            /*
>>>             * Update permissions for old node. We're just taking a
>>> parent away, so
>>> @@ -3154,6 +3168,7 @@ BdrvChild
>>> *bdrv_root_attach_child(BlockDriverState *child_bs,
>>>        Transaction *tran = tran_new();
>>>          assert(qemu_in_main_thread());
>>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>>          ret = bdrv_attach_child_common(child_bs, child_name, child_class,
>>>                                       child_role, perm, shared_perm,
>>> opaque,
>>> @@ -3168,6 +3183,7 @@ out:
>>>        tran_finalize(tran, ret);
>>>        /* child is unset on failure by bdrv_attach_child_common_abort() */
>>>        assert((ret < 0) == !child);
>>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>>          bdrv_unref(child_bs);
>>>        return child;
>>> @@ -3197,6 +3213,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>>> *parent_bs,
>>>          assert(qemu_in_main_thread());
>>>    +    bdrv_subtree_drained_begin_unlocked(parent_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(child_bs);
>>> +
>>>        ret = bdrv_attach_child_noperm(parent_bs, child_bs, child_name,
>>> child_class,
>>>                                       child_role, &child, tran, errp);
>>>        if (ret < 0) {
>>> @@ -3211,6 +3230,9 @@ BdrvChild *bdrv_attach_child(BlockDriverState
>>> *parent_bs,
>>>    out:
>>>        tran_finalize(tran, ret);
>>>        /* child is unset on failure by bdrv_attach_child_common_abort() */
>>> +    bdrv_subtree_drained_end_unlocked(child_bs);
>>> +    bdrv_subtree_drained_end_unlocked(parent_bs);
>>> +
>>>        assert((ret < 0) == !child);
>>>          bdrv_unref(child_bs);
>>> @@ -3456,6 +3478,11 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>>> BlockDriverState *backing_hd,
>>>          assert(qemu_in_main_thread());
>>>    +    bdrv_subtree_drained_begin_unlocked(bs);
>>> +    if (backing_hd) {
>>> +        bdrv_subtree_drained_begin_unlocked(backing_hd);
>>> +    }
>>> +
>>>        ret = bdrv_set_backing_noperm(bs, backing_hd, tran, errp);
>>>        if (ret < 0) {
>>>            goto out;
>>> @@ -3464,6 +3491,10 @@ int bdrv_set_backing_hd(BlockDriverState *bs,
>>> BlockDriverState *backing_hd,
>>>        ret = bdrv_refresh_perms(bs, errp);
>>>    out:
>>>        tran_finalize(tran, ret);
>>> +    if (backing_hd) {
>>> +        bdrv_subtree_drained_end_unlocked(backing_hd);
>>> +    }
>>> +    bdrv_subtree_drained_end_unlocked(bs);
>>>          return ret;
>>>    }
>>> @@ -5266,7 +5297,8 @@ static int
>>> bdrv_replace_node_common(BlockDriverState *from,
>>>          assert(qemu_get_current_aio_context() == qemu_get_aio_context());
>>>        assert(bdrv_get_aio_context(from) == bdrv_get_aio_context(to));
>>> -    bdrv_drained_begin(from);
>>> +    bdrv_subtree_drained_begin_unlocked(from);
>>> +    bdrv_subtree_drained_begin_unlocked(to);
>>>          /*
>>>         * Do the replacement without permission update.
>>> @@ -5298,7 +5330,8 @@ static int
>>> bdrv_replace_node_common(BlockDriverState *from,
>>>    out:
>>>        tran_finalize(tran, ret);
>>>    -    bdrv_drained_end(from);
>>> +    bdrv_subtree_drained_end_unlocked(to);
>>> +    bdrv_subtree_drained_end_unlocked(from);
>>>        bdrv_unref(from);
>>>          return ret;
>>> @@ -5345,6 +5378,9 @@ int bdrv_append(BlockDriverState *bs_new,
>>> BlockDriverState *bs_top,
>>>          assert(!bs_new->backing);
>>>    +    bdrv_subtree_drained_begin_unlocked(bs_new);
>>> +    bdrv_subtree_drained_begin_unlocked(bs_top);
>>> +
>>>        ret = bdrv_attach_child_noperm(bs_new, bs_top, "backing",
>>>                                       &child_of_bds,
>>> bdrv_backing_role(bs_new),
>>>                                       &bs_new->backing, tran, errp);
>>> @@ -5360,6 +5396,8 @@ int bdrv_append(BlockDriverState *bs_new,
>>> BlockDriverState *bs_top,
>>>        ret = bdrv_refresh_perms(bs_new, errp);
>>>    out:
>>>        tran_finalize(tran, ret);
>>> +    bdrv_subtree_drained_end_unlocked(bs_top);
>>> +    bdrv_subtree_drained_end_unlocked(bs_new);
>>>          bdrv_refresh_limits(bs_top, NULL, NULL);
>>>    @@ -5379,8 +5417,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>>> BlockDriverState *new_bs,
>>>        assert(qemu_in_main_thread());
>>>          bdrv_ref(old_bs);
>>> -    bdrv_drained_begin(old_bs);
>>> -    bdrv_drained_begin(new_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(old_bs);
>>> +    bdrv_subtree_drained_begin_unlocked(new_bs);
>>>          bdrv_replace_child_tran(&child, new_bs, tran, true);
>>>        /* @new_bs must have been non-NULL, so @child must not have been
>>> freed */
>>> @@ -5394,8 +5432,8 @@ int bdrv_replace_child_bs(BdrvChild *child,
>>> BlockDriverState *new_bs,
>>>          tran_finalize(tran, ret);
>>>    -    bdrv_drained_end(old_bs);
>>> -    bdrv_drained_end(new_bs);
>>> +    bdrv_subtree_drained_end_unlocked(new_bs);
>>> +    bdrv_subtree_drained_end_unlocked(old_bs);
>>>        bdrv_unref(old_bs);
>>>          return ret;
>>>
> 
>  From the other thread:
>> So you mean that backing_hd is modified as its parent is changed, do I understand correctly?
> 
> Yes
> 
>>
>> As we started to discuss in a thread started with my "[PATCH] block:
>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>> subtree when we modify some parent-child relation is too much. In-flight
>> requests in subtree don't touch these relations, so why to wait/stop
>> them? Imagine two disks A and B with same backing file C. If we want to
>> detach A from C, what is the reason to drain in-fligth read request of B
>> in C?
> 
> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
> the old backing_hd, but let's not think about this for a moment).
> So we have disk B with backing file C, and new disk A that wants to have
> backing file C.
> 
> I think I understand what you mean, so in theory the operation would be
> - create new child
> - add child to A->children list
> - add child to C->parents list
> 
> So in theory we need to
> * drain A (without subtree), because it can't happen that child nodes of
>    A have in-flight requests that look at A status (children list), right?
>    In other words, if A has another node X, can a request in X inspect
>    A->children
It should not happen. In my understanding, IO request never access parents of the node.
> * drain C, as parents can inspect C status (like B). Same assumption
>    here, C->children[x]->bs cannot have requests inspecting C->parents
>    list?
No, I think we should not drain C. IO requests don't inspect parents list. And if some _other_ operations do inspect it, drain will not help, as drain is only about IO requests.
> 
>> Modifying children/parents lists should be protected somehow. Currently
>> it's protected by aio-context lock.. If we drop it, we probably need
>> some new mutex for it. Also, graph modification should be protected from
>> each other, so that one graph modifiction is not started until another
>> is in-flight, probably we need some global mutex for graph modification.
>> But that's all not about drains.
> 
> The idea was to use BQL + drain, to obtain the same effect of aiocontext
> lock. BQL should prevent concurrent graph modifications, drain
> concurrent I/O reading the state while being modificated.
> 
Concurrent I/O should not touch bs->parents list, so we don't need to drain C.
>>
>> Drains are about in-flight requests. And when we switch a child of node
>> X, we should do it in drained section for X, as in-flight requests in X
>> can touch its children. But another in-flight requests in subtree are
>> safe and should not be awaited.
>>
>>
> 
-- 
Best regards,
Vladimir
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-04  9:49       ` Vladimir Sementsov-Ogievskiy
@ 2022-02-04 13:30         ` Emanuele Giuseppe Esposito
  2022-02-04 14:03           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-02-04 13:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Hanna Reitz, Stefan Hajnoczi,
	Paolo Bonzini, John Snow
>>
>>  From the other thread:
>>> So you mean that backing_hd is modified as its parent is changed, do
>>> I understand correctly?
>>
>> Yes
>>
>>>
>>> As we started to discuss in a thread started with my "[PATCH] block:
>>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>>> subtree when we modify some parent-child relation is too much. In-flight
>>> requests in subtree don't touch these relations, so why to wait/stop
>>> them? Imagine two disks A and B with same backing file C. If we want to
>>> detach A from C, what is the reason to drain in-fligth read request of B
>>> in C?
>>
>> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
>> the old backing_hd, but let's not think about this for a moment).
>> So we have disk B with backing file C, and new disk A that wants to have
>> backing file C.
>>
>> I think I understand what you mean, so in theory the operation would be
>> - create new child
>> - add child to A->children list
>> - add child to C->parents list
>>
>> So in theory we need to
>> * drain A (without subtree), because it can't happen that child nodes of
>>    A have in-flight requests that look at A status (children list),
>> right?
>>    In other words, if A has another node X, can a request in X inspect
>>    A->children
> 
> It should not happen. In my understanding, IO request never access
> parents of the node.
> 
>> * drain C, as parents can inspect C status (like B). Same assumption
>>    here, C->children[x]->bs cannot have requests inspecting C->parents
>>    list?
> 
> No, I think we should not drain C. IO requests don't inspect parents
> list. And if some _other_ operations do inspect it, drain will not help,
> as drain is only about IO requests.
I am still not convinced about this, because of the draining.
While drain can only be called by either main loop or the "home
iothread" (the one running the AioContext), it still means there are 2
threads involved. So while the iothread runs set_backing_hd, main loop
could easily drain one of the children of the nodes. Or the other way
around.
I am not saying that this happens, but it *might* happen.
I am a little bit confused about this, it would be nice to hear opinions
from others as well.
Once we figure this, I will know where exactly to put the assertions,
and consequently what to drain and with which function.
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread 
- * Re: [PATCH 10/12] block.c: add subtree_drains where needed
  2022-02-04 13:30         ` Emanuele Giuseppe Esposito
@ 2022-02-04 14:03           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 38+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-02-04 14:03 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Stefan Hajnoczi, Paolo Bonzini,
	Fam Zheng, John Snow, qemu-devel
04.02.2022 16:30, Emanuele Giuseppe Esposito wrote:
>>>
>>>   From the other thread:
>>>> So you mean that backing_hd is modified as its parent is changed, do
>>>> I understand correctly?
>>>
>>> Yes
>>>
>>>>
>>>> As we started to discuss in a thread started with my "[PATCH] block:
>>>> bdrv_set_backing_hd(): use drained section", I think draining the whole
>>>> subtree when we modify some parent-child relation is too much. In-flight
>>>> requests in subtree don't touch these relations, so why to wait/stop
>>>> them? Imagine two disks A and B with same backing file C. If we want to
>>>> detach A from C, what is the reason to drain in-fligth read request of B
>>>> in C?
>>>
>>> If I am not mistaken, bdrv_set_backing_hd adds a new node (yes removes
>>> the old backing_hd, but let's not think about this for a moment).
>>> So we have disk B with backing file C, and new disk A that wants to have
>>> backing file C.
>>>
>>> I think I understand what you mean, so in theory the operation would be
>>> - create new child
>>> - add child to A->children list
>>> - add child to C->parents list
>>>
>>> So in theory we need to
>>> * drain A (without subtree), because it can't happen that child nodes of
>>>     A have in-flight requests that look at A status (children list),
>>> right?
>>>     In other words, if A has another node X, can a request in X inspect
>>>     A->children
>>
>> It should not happen. In my understanding, IO request never access
>> parents of the node.
>>
>>> * drain C, as parents can inspect C status (like B). Same assumption
>>>     here, C->children[x]->bs cannot have requests inspecting C->parents
>>>     list?
>>
>> No, I think we should not drain C. IO requests don't inspect parents
>> list. And if some _other_ operations do inspect it, drain will not help,
>> as drain is only about IO requests.
> 
> I am still not convinced about this, because of the draining.
> 
> While drain can only be called by either main loop or the "home
> iothread" (the one running the AioContext), it still means there are 2
> threads involved. So while the iothread runs set_backing_hd, main loop
> could easily drain one of the children of the nodes. Or the other way
> around.
> I am not saying that this happens, but it *might* happen.
I agree that it might happen. So, parallel drains are a problem. But drain is always a part of graph modification, and any graph modifications running in parallel are already a problem, we should forbid it somehow.
When you drain node, you say: please no in-flight requests now at this node. But IO requests doesn't do drain themselves, so why to restrict them?
And anyway, I don't see how this help. Ok, assume you drain additional node C or even the whole subtree. What protect us from parallel drain from another thread?
> 
> I am a little bit confused about this, it would be nice to hear opinions
> from others as well.
> 
> Once we figure this, I will know where exactly to put the assertions,
> and consequently what to drain and with which function.
> 
> Emanuele
> 
-- 
Best regards,
Vladimir
^ permalink raw reply	[flat|nested] 38+ messages in thread 
 
 
 
 
 
- * [PATCH 11/12] block/io.c: fully enable assert_bdrv_graph_writable
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (9 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 10/12] block.c: add subtree_drains where needed Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-18 16:27 ` [PATCH 12/12] block.c: additional assert qemu in main tread Emanuele Giuseppe Esposito
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
As explained in the TODO, complete the function by checking
that the node is also drained.
In this way, we can ensure that modify the bs is thread safe,
as the drain makes sure that no I/O concurrently reads the field,
and all writes are under BQL.
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block/io.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/block/io.c b/block/io.c
index 9d5167f64a..d7b0707662 100644
--- a/block/io.c
+++ b/block/io.c
@@ -765,12 +765,7 @@ void bdrv_drain_all(void)
 
 void assert_bdrv_graph_writable(BlockDriverState *bs)
 {
-    /*
-     * TODO: this function is incomplete. Because the users of this
-     * assert lack the necessary drains, check only for BQL.
-     * Once the necessary drains are added,
-     * assert also for qatomic_read(&bs->quiesce_counter) > 0
-     */
+    assert(qatomic_read(&bs->quiesce_counter) > 0);
     assert(qemu_in_main_thread());
 }
 
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * [PATCH 12/12] block.c: additional assert qemu in main tread
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (10 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 11/12] block/io.c: fully enable assert_bdrv_graph_writable Emanuele Giuseppe Esposito
@ 2022-01-18 16:27 ` Emanuele Giuseppe Esposito
  2022-01-19  9:51 ` [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Paolo Bonzini
  2022-01-26 11:29 ` Stefan Hajnoczi
  13 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-18 16:27 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy,
	Emanuele Giuseppe Esposito, qemu-devel, Hanna Reitz,
	Stefan Hajnoczi, Paolo Bonzini, John Snow
Add some missing assertion in static functions of block.c
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
---
 block.c               | 2 ++
 block/block-backend.c | 3 +++
 2 files changed, 5 insertions(+)
diff --git a/block.c b/block.c
index 6196c95aae..7961f5a984 100644
--- a/block.c
+++ b/block.c
@@ -5227,6 +5227,7 @@ static int bdrv_replace_node_noperm(BlockDriverState *from,
     BdrvChild *c, *next;
 
     assert(to != NULL);
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH_SAFE(c, &from->parents, next_parent, next) {
         assert(c->bs == from);
@@ -6767,6 +6768,7 @@ void bdrv_invalidate_cache_all(Error **errp)
 static bool bdrv_has_bds_parent(BlockDriverState *bs, bool only_active)
 {
     BdrvChild *parent;
+    assert(qemu_in_main_thread());
 
     QLIST_FOREACH(parent, &bs->parents, next_parent) {
         if (parent->klass->parent_is_bds) {
diff --git a/block/block-backend.c b/block/block-backend.c
index 9229ff7ca7..048ba83f37 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -754,6 +754,9 @@ BlockDriverState *blk_bs(BlockBackend *blk)
 static BlockBackend *bdrv_first_blk(BlockDriverState *bs)
 {
     BdrvChild *child;
+
+    assert(qemu_in_main_thread());
+
     QLIST_FOREACH(child, &bs->parents, next_parent) {
         if (child->klass == &child_root) {
             return child->opaque;
-- 
2.31.1
^ permalink raw reply related	[flat|nested] 38+ messages in thread
- * Re: [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm.
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (11 preceding siblings ...)
  2022-01-18 16:27 ` [PATCH 12/12] block.c: additional assert qemu in main tread Emanuele Giuseppe Esposito
@ 2022-01-19  9:51 ` Paolo Bonzini
  2022-01-26 11:29 ` Stefan Hajnoczi
  13 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-19  9:51 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito, qemu-block
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-devel,
	Hanna Reitz, Stefan Hajnoczi, John Snow
On 1/18/22 17:27, Emanuele Giuseppe Esposito wrote:
> * Finally fully enable assert_bdrv_graph_writable(), checking also for the
>    drains, in patch 11.
Wow, this is definitely not just moving code around!  It's great that 
you could pull this off.  It gives me a lot more confidence that the 
locking model and the graph/IO split are correct, and also a lot more 
confidence about the logic to move the drains back and forth.
It does add a lot more complication to the API with all the _unlocked 
variants, but I think we're getting closer and closer to the endgame for 
the AioContext lock.
Paolo
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm.
  2022-01-18 16:27 [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Emanuele Giuseppe Esposito
                   ` (12 preceding siblings ...)
  2022-01-19  9:51 ` [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm Paolo Bonzini
@ 2022-01-26 11:29 ` Stefan Hajnoczi
  2022-01-27 13:46   ` Paolo Bonzini
  13 siblings, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2022-01-26 11:29 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, Paolo Bonzini, John Snow
[-- Attachment #1: Type: text/plain, Size: 403 bytes --]
If I understand correctly aio_context_acquire() calls are not explicitly
removed in this patch series but switching to BDRV_POLL_WHILE_UNLOCKED()
means that we no longer need to run under the AioContext lock?
Still, I'm a bit surprised I didn't notice any
aio_context_acquire/release() removals in this patch series because I
thought callers need that before they switch to
BDRV_POLL_WHILE_UNLOCKED()?
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply	[flat|nested] 38+ messages in thread
- * Re: [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm.
  2022-01-26 11:29 ` Stefan Hajnoczi
@ 2022-01-27 13:46   ` Paolo Bonzini
  2022-01-28 12:20     ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2022-01-27 13:46 UTC (permalink / raw)
  To: Stefan Hajnoczi, Emanuele Giuseppe Esposito
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, John Snow
On 1/26/22 12:29, Stefan Hajnoczi wrote:
> Still, I'm a bit surprised I didn't notice any
> aio_context_acquire/release() removals in this patch series because I
> thought callers need that before they switch to
> BDRV_POLL_WHILE_UNLOCKED()?
I think the callers are new and were not calling 
bdrv_subtree_drained_begin() (and thus BDRV_POLL_WHILE) at all?
Emanuele, enlighten us. :)
Paolo
^ permalink raw reply	[flat|nested] 38+ messages in thread 
- * Re: [PATCH 00/12] Removal of Aiocontext lock through drains: protect bdrv_replace_child_noperm.
  2022-01-27 13:46   ` Paolo Bonzini
@ 2022-01-28 12:20     ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 38+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-01-28 12:20 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi
  Cc: Kevin Wolf, Fam Zheng, Vladimir Sementsov-Ogievskiy, qemu-block,
	qemu-devel, Hanna Reitz, John Snow
On 27/01/2022 14:46, Paolo Bonzini wrote:
> On 1/26/22 12:29, Stefan Hajnoczi wrote:
>> Still, I'm a bit surprised I didn't notice any
>> aio_context_acquire/release() removals in this patch series because I
>> thought callers need that before they switch to
>> BDRV_POLL_WHILE_UNLOCKED()?
> 
> I think the callers are new and were not calling
> bdrv_subtree_drained_begin() (and thus BDRV_POLL_WHILE) at all?
> 
> Emanuele, enlighten us. :)
Yes, the callers were not calling any kind of drains (or at least most
of them) *and* no context lock was acquired before calling them.
The current logic (or at least how I see it) when draining is:
"ok, we need to use bdrv_drain or bdrv_subtree_drain, but these function
call BDRV_POLL_WHILE(), which in turns calls AIO_WAIT_WHILE and thus
performs aio_context_release(lock); [...] aio_context_acquire(lock);
*Therefore* we need to acquire the lock". The lock is taken as a
consequence of the drain implementation.
This makes the lock usage useless, because we are just blindly acquiring
it for the purpose of making BDRV_POLL_WHILE happy.
On the other side, here no lock was acquired before, and
BDRV_POLL_WHILE_UNLOCKED is not releasing anything, thus no lock is needed.
This seems to hold and kinda proves my logic above.
Thank you,
Emanuele
^ permalink raw reply	[flat|nested] 38+ messages in thread