From: "Denis V. Lunev"
Date: Tue, 8 Sep 2015 13:50:17 +0300
Message-ID: <55EEBD69.5030701@openvz.org>
References: <1441699228-25767-1-git-send-email-den@openvz.org> <55EEAB55.2070908@redhat.com> <55EEAD4B.6090609@openvz.org>
Subject: Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
To: Andrey Korolyov
Cc: Kevin Wolf, Paolo Bonzini, "qemu-devel@nongnu.org", Stefan Hajnoczi, Raushaniya Maksudova

On 09/08/2015 01:37 PM, Andrey Korolyov wrote:
> On Tue, Sep 8, 2015 at 12:41 PM, Denis V. Lunev wrote:
>> On 09/08/2015 12:33 PM, Paolo Bonzini wrote:
>>>
>>> On 08/09/2015 10:00, Denis V. Lunev wrote:
>>>> How does the given solution work?
>>>>
>>>> If the disk-deadlines option is enabled for a drive, the completion
>>>> time of this drive's requests is tracked. The method is as follows
>>>> (further assume that this option is enabled).
>>>>
>>>> Every drive has its own red-black tree for keeping its requests.
>>>> The expiration time of a request is the key, and its cookie (the id
>>>> of the request) is the corresponding node. Assume that every request
>>>> has 8 seconds to be completed. If a request is not completed in time
>>>> for some reason (server crash or something else), the timer of this
>>>> drive fires and an appropriate callback requests to stop the Virtual
>>>> Machine (VM).
>>>>
>>>> The VM remains stopped until all requests from the disk which caused
>>>> the VM's stopping are completed. Furthermore, if there are other
>>>> disks with 'disk-deadlines=on' whose requests are waiting to be
>>>> completed, the VM is not restarted: it waits for the completion of
>>>> all "late" requests from all disks.
>>>>
>>>> Furthermore, all requests which caused the VM stopping (or those
>>>> that just were not completed in time) can be printed using the
>>>> "info disk-deadlines" qemu monitor command as follows:
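For reference, the bookkeeping described above boils down to something
like the toy model below. This is an illustration only: every name in
it is invented, and the real code uses a proper red-black tree and a
timer instead of the sorted list and the explicit polling used here.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { DEADLINE_SEC = 8 };          /* per-request completion budget  */

typedef struct Req {
    long expire;                    /* key: absolute expiration time  */
    uint64_t cookie;                /* id of the in-flight request    */
    struct Req *next;
} Req;

typedef struct Drive {
    Req *pending;                   /* kept sorted, earliest first    */
} Drive;

static bool vm_running = true;

/* Remember a request when it is submitted to the storage backend. */
static void drive_track(Drive *d, uint64_t cookie, long now)
{
    Req *r = malloc(sizeof(*r));
    Req **p = &d->pending;

    r->expire = now + DEADLINE_SEC;
    r->cookie = cookie;
    while (*p && (*p)->expire <= r->expire) {
        p = &(*p)->next;
    }
    r->next = *p;
    *p = r;
}

/* Forget a request when its completion finally arrives. */
static void drive_complete(Drive *d, uint64_t cookie)
{
    for (Req **p = &d->pending; *p; p = &(*p)->next) {
        if ((*p)->cookie == cookie) {
            Req *victim = *p;
            *p = victim->next;
            free(victim);
            return;
        }
    }
}

/* The list is sorted, so only its head has the earliest deadline. */
static bool drive_has_overdue(const Drive *d, long now)
{
    return d->pending && d->pending->expire <= now;
}

/* Stand-in for the timer callback: stop the VM instead of letting the
 * guest time out, and resume only once no drive has late requests. */
static void deadlines_poll(Drive *drives, size_t n, long now)
{
    bool overdue = false;

    for (size_t i = 0; i < n; i++) {
        overdue |= drive_has_overdue(&drives[i], now);
    }
    if (overdue && vm_running) {
        vm_running = false;
        printf("t=%ld: late request, stopping the VM\n", now);
    } else if (!overdue && !vm_running) {
        vm_running = true;
        printf("t=%ld: all late requests done, resuming the VM\n", now);
    }
}

int main(void)
{
    Drive drv = { NULL };

    drive_track(&drv, 1, 0);        /* request 1 submitted at t=0     */
    deadlines_poll(&drv, 1, 5);     /* t=5: still within the budget   */
    deadlines_poll(&drv, 1, 9);     /* t=9: deadline blown, VM stops  */
    drive_complete(&drv, 1);        /* the completion finally arrives */
    deadlines_poll(&drv, 1, 10);    /* t=10: VM resumes               */
    return 0;
}

Compiled as is, this prints a "stop" at t=9, when request 1 outlives
its 8-second budget, and a "resume" at t=10, once the completion
finally arrives.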
>>> This topic has come up several times in the past.
>>>
>>> I agree that the current behavior is not great, but I am not sure
>>> that timeouts are safe. For example, how is disk-deadlines=on
>>> different from NFS soft mounts? The NFS man page says
>>>
>>>     NB: A so-called "soft" timeout can cause silent data corruption
>>>     in certain cases. As such, use the soft option only when client
>>>     responsiveness is more important than data integrity. Using NFS
>>>     over TCP or increasing the value of the retrans option may
>>>     mitigate some of the risks of using the soft option.
>>>
>>> Note how it only says "mitigate", not solve.
>>>
>>> Paolo
>> This solution is far from perfect, as there is still a race window
>> for request completion anyway. Still, the number of failures is
>> reduced by 2-3 orders of magnitude.
>>
>> The behavior is similar not to soft mounts, which can corrupt the
>> data, but to hard mounts, which are the default AFAIR. It will not
>> corrupt the data and will patiently wait for the request to complete.
>>
>> Without the disk the guest is not able to serve any requests, so
>> keeping it running does not make much sense.
>>
>> This approach has been used by Odin in production for years, and we
>> were able to significantly reduce the number of end-user complaints.
>> We were unable to invent any reasonable solution without guest
>> modification/timeout tuning.
>>
>> Anyway, this code is off by default, storage-agnostic and well
>> separated. Yes, we would be able to maintain it for ourselves
>> out-of-tree, but...
>>
>> Den
>>
> Thanks, the series looks very promising. I have a rather side
> question: assuming that we have a guest for which scsi/ide usage is
> the only option, wouldn't the timekeeping issues from the
> pause/resume action be a corner-case problem there?

I do not think so. The guest can be paused/suspended and resumed by
the management layer anyway. Normally it takes some time for the guest
to start seeing the time difference, and the catch-up speedup is
limited.

> The assumption is based on the fact that guests with appropriate
> kvmclock settings can handle the resulting timer jump rather softly,
> and at the same time such guests are mostly not bound to the 'legacy'
> storage interfaces; but guests with interfaces which are not prone to
> timing out can commonly misbehave after a large timer jump as well.
> For IDE, the approach proposed by the patch is the only option, while
> for SCSI it is better to tune the guest driver timeout instead, if
> the guest OS allows that. So yes, a description of the possible
> drawbacks would be very useful there.

OK, I will add a note about this. Though there are cases when this
timeout cannot be tuned at all, even in the SCSI case: e.g. Windows
will BSOD with STOP 0x7B early in boot without this solution applied,
and I do not know a good way to tweak that timeout in the guest. It is
far too specific.

Den
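P.S. To make the "tune the guest driver timeout" alternative above
concrete for Linux guests: the per-device SCSI command timeout is the
standard sysfs knob /sys/block/<dev>/device/timeout (value in seconds,
30 by default), so in practice raising it is a one-line
"echo 300 > /sys/block/sda/device/timeout" or a udev rule. Spelled out
as a sketch, with "sda" as a made-up example device:

#include <stdio.h>

int main(void)
{
    /* Linux SCSI command timeout for one disk, in seconds; writing it
     * needs root inside the guest. "sda" is just an example device. */
    const char *path = "/sys/block/sda/device/timeout";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "%d\n", 300);    /* 5 minutes instead of the default 30 */
    return fclose(f) == 0 ? 0 : 1;
}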