From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:46142)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <vsementsov@virtuozzo.com>) id 1bpDn4-00088w-Ag
	for qemu-devel@nongnu.org; Wed, 28 Sep 2016 08:16:51 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <vsementsov@virtuozzo.com>) id 1bpDn0-0006sm-63
	for qemu-devel@nongnu.org; Wed, 28 Sep 2016 08:16:46 -0400
References: <1470683381-16680-1-git-send-email-jsnow@redhat.com>
	<57E94491.8090501@virtuozzo.com>
	<bf412a34-acb7-d3df-1710-ee7917ee2060@redhat.com>
From: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-ID: <57EBB4A1.1000104@virtuozzo.com>
Date: Wed, 28 Sep 2016 15:16:33 +0300
MIME-Version: 1.0
In-Reply-To: <bf412a34-acb7-d3df-1710-ee7917ee2060@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH 0/5] blockjobs: Fix transactional race
 condition
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: John Snow <jsnow@redhat.com>, qemu-devel <qemu-devel@nongnu.org>, Stefan Hajnoczi <stefanha@redhat.com>, Jeff Cody <jcody@redhat.com>, "Denis V. Lunev" <den@openvz.org>, qemu block <qemu-block@nongnu.org>

Ok, let's discuss it on list.

On 27.09.2016 19:19, John Snow wrote:
> Replying off-list to follow your lead, but this might be a good 
> discussion to have upstream where Stefan and Jeff can see it.
>
> On 09/26/2016 11:53 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Hi John!
>>
>> What about this series? If you are working on this, I can tell about
>> another related problem.
>>
>
> I got mired in trying to think about how to delete a BlockJob that 
> hasn't technically been started yet -- this patch series is unsafe in 
> that it will be unable to "cancel" a job that hasn't started yet.
>
> (And I don't want to start a job just to cancel it, that's gross.)
>
> I need to expand the job modifications here a little bit to make this 
> completely safe. I may need to create a job->started bool that allows 
> you to call a job_delete() function if and only if started is false.
>
> I'll try to send this along this week.
>
>> I'm working on async scheme for backup. I've started from something like
>> just wrapping backup_do_cow into coroutine and use this coroutine in
>> full-backup-loop instead of just call backup_do_cow. This simple
>> approach seems better than used in mirror - in mirror we create new
>> coroutine for each async read and write, when I create new coroutine for
>> copy of = read + write. But then I decided to rewrite this to
>> worker-coroutines scheme. I.e. we do not create new coroutine for each
>> request and control their count by " while (job->inflight_requests >=
>> MAX_IN_FLIGHT) { yield }" but we create MAX_IN_FLIGHT worker coroutines
>> at backup start, and then they just copy cluster by cluster, using
>> common cop_bitmap to synchronize. All cluster copying is done in
>> backup-workers and write-notifiers only change the order in which
>> workers copy clusters. And I liked it all until I figured out the error
>> handling.
>>
>
> Makes sense to me so far.
>
>> I can't handle errors in workers, because I need to call
>> block_job_error_action, which may stop blockjob, and it seems to be a
>> problem if several workers calls it simultaneously. So, worker have to
>
> What kind of problems are you seeing? Maybe we can fix them? is it for 
> STOP cases? (or also on IGNORE/REPORT?)
>
> Otherwise, just stash an error object in a location that the master 
> coroutine can find, pause/terminate all worker threads, and signal the 
> error.
>
>> defer error handling to main coroutine of the blockjob (this coroutine
>> in my solution doesn't do copying but only handles blockjob staff like
>> pauses, stops, throttling). So, we will have a queue of workers, waiting
>> for main coroutine to handle their errors. This looks too complicated..
>
> Really? I don't think it's too bad. The worker detects an error and 
> creates a record of the problem, then the worker yields.
>
> The controller detects the worker has yielded and sees the record of 
> the error. The controller opts not to re-schedule the worker.
>
> The controller then instructs all remaining workers to yield. Once all 
> workers are yielded, the controller invokes block_job_error_action.
>
> From there, we can handle the error accordingly.
>
> Either way it sounds like we're going to have to manage the complexity 
> of what happens when a job with worker coroutines is paused by e.g. 
> the QMP monitor: the controller needs to send and coordinate messages 
> to the workers; so this doesn't sound like too much added complexity 
> unless I'm dreaming.
>
>> And finally I come to the idea that all this problems and complications
>> are because blockjob is not created for asynchronous work, because
>> blockjob has only one working coroutine. So the true way is to maintain
>> several working coroutines on generic blockjob layer. This will lead to
>> simpler and transparent async schemes for backup and mirror (and may be
>> other blockjobs).
>>
>> So, the question is: what about refactoring blockjobs, to move from one
>> working coroutine to several ones? Is it hard and is it possible? I do
>> not understand all block-jobs related staff..
>>
>
> I think jobs will need to remain "one coroutine, one job" for now, but 
> there's no reason why drive-backup or blockdev-backup can't just 
> create multiple jobs each if that's what they need to do. (The backup 
> job object could, in theory, just have another job pointer to a helper 
> job if it really became necessary.)
>
> Let's try to solve your design problem first and see if we can't make 
> error reporting behave nicely in a contested environment.

Hmm, ok, I'll go this way for now. But anyway, I think that finally 
async scheme in backup and mirror should be the same. And it should 
share the same code, so all worker-related should be separated from 
backup to own file, or injected into block-jobs.

>
>>
>> On 08.08.2016 22:09, John Snow wrote:
>>> There are a few problems with transactional job completion right now.
>>>
>>> First, if jobs complete so quickly they complete before remaining jobs
>>> get a chance to join the transaction, the completion mode can leave 
>>> well
>>> known state and the QLIST can get corrupted and the transactional jobs
>>> can complete in batches or phases instead of all together.
>>>
>>> Second, if two or more jobs defer to the main loop at roughly the same
>>> time, it's possible for one job's cleanup to directly invoke the other
>>> job's cleanup from within the same thread, leading to a situation that
>>> will deadlock the entire transaction.
>>>
>>> Thanks to Vladimir for pointing out these modes of failure.
>>>
>>> ________________________________________________________________________________ 
>>>
>>>
>>>
>>> For convenience, this branch is available at:
>>> https://github.com/jnsnow/qemu.git  branch job-manual-start
>>> https://github.com/jnsnow/qemu/tree/job-manual-start
>>>
>>> This version is tagged job-manual-start-v1:
>>> https://github.com/jnsnow/qemu/releases/tag/job-manual-start-v1
>>>
>>> John Snow (4):
>>>    blockjob: add block_job_start
>>>    blockjob: refactor backup_start as backup_job_create
>>>    blockjob: add .clean property
>>>    iotests: add transactional failure race test
>>>
>>> Vladimir Sementsov-Ogievskiy (1):
>>>    blockjob: fix dead pointer in txn list
>>>
>>>   block/backup.c             |  50 +++++++-----
>>>   block/commit.c             |   2 +-
>>>   block/mirror.c             |   2 +-
>>>   block/stream.c             |   2 +-
>>>   blockdev.c                 | 194
>>> ++++++++++++++++++++++++++-------------------
>>>   blockjob.c                 |  24 +++++-
>>>   include/block/block_int.h  |  19 ++---
>>>   include/block/blockjob.h   |  16 ++++
>>>   tests/qemu-iotests/124     |  91 +++++++++++++++++++++
>>>   tests/qemu-iotests/124.out |   4 +-
>>>   tests/test-blockjob-txn.c  |   2 +-
>>>   11 files changed, 284 insertions(+), 122 deletions(-)
>>>
>>
>>


-- 
Best regards,
Vladimir