From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f175.google.com (mail-pf1-f175.google.com [209.85.210.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 086D3391 for ; Wed, 2 Oct 2024 00:10:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.175 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727827815; cv=none; b=q/Cs7Lt4GfOIXordW6RuOsH/Nu/bRR7VOutXRSI/BR+HSLidaOIfBe9CbtA2dBIxP6/llNwkDnYlxR4uAo9D8y62y11oXbXh5p53YkolbPAh7J+VmfpZGfDrP+j+z68AbVnipQpz44Ow8ENxTzNpjam7LqIuT6pgqNQpGK/Di/4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1727827815; c=relaxed/simple; bh=EK2PJuL0rJsx0hFg/W8mjhwlsrSGQ4zPXE/RNq92Uso=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=KOeWWoaxWdMU3vJxTPWAwiQjrP+K5fQIlmnFeqwOsmy7SGiJc6lgAecu8Sk3frHnRf5DW1LLkfKE4t8VavBgoGaxys/heoHqJwIQ/Fu4jmNuAb3asBSxsGnX+P6tFneiXnc+8P4rMJYiTJZl5XGUiqIEmyYz9nPJ3hSQMsXKCNY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=BhYjTNnG; arc=none smtp.client-ip=209.85.210.175 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="BhYjTNnG" Received: by mail-pf1-f175.google.com with SMTP id d2e1a72fcca58-718e2855479so4373561b3a.1 for ; Tue, 01 Oct 2024 17:10:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1727827813; x=1728432613; darn=vger.kernel.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=o59/6GgCFkYEVN5Vt+ZhJ3o9kIV1/zeluo3Umxq+vD8=; b=BhYjTNnGetVK/YEMweIBBcTdEp6urDYoMFus8NN23miiQ+xW36graW3P+3nRAXqUdF lwdkdYKW21LZVuc4zU6RmgUZQC0OSxISgnOi2VLU2+qPOHZw+1dq11CvC8dGVMhkiBrb IEj4VcMYRZkRKF4xANAv7as2JUaYihlHwWEfA0ya3JDsIXkAvw6KXGYCEpvzFS3uBGWD qoZe+AoHd+MdhgLDCqaz4wnhgSTGibOSfkiO5HXGwkHGeCPJTYcfYo2Q/av3euCRW2Q1 gyl47kVgKuJ84RQ35ys0Aaa98UGlbMy6P3HGaerKs9XXPyPeRVRTxgS/Ybeif5h5UJe3 ySVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1727827813; x=1728432613; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=o59/6GgCFkYEVN5Vt+ZhJ3o9kIV1/zeluo3Umxq+vD8=; b=WEjxZFCQjhftHtwWCS/Voi13nDkazbrEwnmMUP2d7xEjiCuAG9yv2IA9CiSJ67uC08 kMMK5mAKZUMhEbIi+f8gH/uof7Z1n5htL1z3hYlSRRvEHaBpxswx5cM3+CfxLuQ36X4d FQ9SaQeNJ/QY5MfYApgmKV2GDOk4EjWCxxtfSma9iOvzpAso/RE3iaCA8edaNZXgL5S8 hIzG8NQDkSvAdf99f1f4iaWVBkssCzmaGN/AxlKe53rRu77KYyPJn/nI+uyyXwUOd1c5 HPWIFZbuS2tUl4kEK+nB5q97I68jMUWN4TspuklmZPPukpcZ36jfCIu0ekAi/Mn6ZmQE YjFQ== X-Gm-Message-State: AOJu0YzraZVFOZKKzQOgqA0CWq6+8ApUrCICeax8hRQZ6pRePszbd5Jy mdKJHV5azWoAdSZbNWYAgza4o4EqEv2H5mr5WJ/5jEwdlvgdIeLlLyh0Foe3TDtjgSt28caiBsb E X-Google-Smtp-Source: AGHT+IFyrvIGLKMun2zBQAu/rH4qQ5xrod/OcaKZx8jK0PKAYugHRd0HtBx9ftP0/a6Jcy2+lI4bwg== X-Received: by 2002:a05:6a00:3910:b0:710:50c8:ddcb with SMTP id d2e1a72fcca58-71dc5c424c3mr2393108b3a.5.1727827813244; Tue, 01 Oct 2024 17:10:13 -0700 (PDT) Received: from dread.disaster.area (pa49-179-78-197.pa.nsw.optusnet.com.au. [49.179.78.197]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-7e6db2942ecsm8993948a12.6.2024.10.01.17.10.12 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 01 Oct 2024 17:10:12 -0700 (PDT) Received: from dave by dread.disaster.area with local (Exim 4.96) (envelope-from ) id 1svmwS-00CiKG-2a; Wed, 02 Oct 2024 10:10:08 +1000 Date: Wed, 2 Oct 2024 10:10:08 +1000 From: Dave Chinner To: Jean-Louis Dupond Cc: linux-fsdevel@vger.kernel.org Subject: Re: FIFREEZE on loop device does not return EBUSY Message-ID: References: <67055f77-fb5e-4c2e-9570-4361ab05defa@dupond.be> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <67055f77-fb5e-4c2e-9570-4361ab05defa@dupond.be> On Tue, Oct 01, 2024 at 03:20:43PM +0200, Jean-Louis Dupond wrote: > Hi All, > > I've been investigating a hang/freeze of some of our VM's when running a > snapshot on them. > The cause seems to be that the fsFreeze call before the snapshot gets locked > forever due to the use of a loop device. > > There are multiple reports about this, like [1] or [2]. > Also there is a quite easy way to simulate it, see [3]. > > Now this seems to happen when you do a ioctl FIFREEZE on a mount point that > is backed by a loop device. > For ex: > /dev/loop0      3.9G  508K  3.7G   1% /tmp > > And loop0 is: > /dev/loop0: [2052]:25308501 (/usr/tmpDSK) > > Now if you lock the disk/partition that has /usr, and then you want to lock > /tmp, it will hang forever (or until you thaw the /usr disk). Yup, this is the behaviour freezing the backing filesystem of a loop device has had since freezing filesystems was implemented 20 years ago. So, really, this is expected behaviour... > You would expect that the call returns -EBUSY like the others, but that is > not the case. I certainly don't expect this to return -EBUSY - the /tmp filesystem has not been frozen, and so it should not return -EBUSY to an attempt to freeze it. It should be frozen, and because that requires IO to be done, the behaviour of the operation is dependent on how the lower layers process the IO that the filesystem issues. > Is this something we want to solve? Or does somebody have better idea's on > how to resolve this? The actual process of freezing complex data caching heirarchies is beyond the awareness of the kernel. To create consistent snapshot images of a filesystem, userspace applications need to be flushing cached data before the filesystem is frozen (i.e. so they suspend to a coherent state for device snapshots and backups). Then the filesystem can suspend, and once all write IO is done, the block device can be suspended. This is a top-down process - it has to be done in this order. However, loop devices mean that suspend events first need to propagate upwards to the top of the heirarchy before any suspend operation is started so that the entire suspend can be done from the top down. I think FUSE also introduces complex heirarchies which could possibly include loops, and I suspect overlay can introduce them, too. Hence walking to the top of the heirarchy before we can start a suspend operation on a filesystem involves interacting with userspace policy, configuration and applications. This really can only be done reliably from userspace. This is especially true when we consider thawing that heirarchy. What is the kernel supposed to do if it gets a thaw request in the middle of such a nested suspend? Is it supposed to cascade up and down the heirarchy? Maybe just up? Or maybe it should be ignored because it's not at the bottom of the freeze graph? Or maybe we should allow overrides in case of emergencies if userspace loses track of what its frozen? This rapidly gets complex if we try to handle all these potential policy considerations from a completely context-free environment inside the kernel. We cannot make the right choices for all cases where nested freezes might or might not be required by userspace. It's easier to say "don't do that" than it is to try to solve such a complex problem with code. Ultimately, I suspect that FIFREEZE needs to issue a blocking fanotify event so that userspace can capture such requests and apply whatever policy the user wants for nested block device hierarchies and the applications on top of them before the filesystem itself gets frozen.... > The Qemu issue is already a long standing issue, which I want to get > resolved :) It's a long standing issue because it's a complex issue that can't really be solved by adding code to the kernel alone. Loop device management is done by userspace, not the kernel, and there is no one policy for freezing that applies to every situation.... -Dave. -- Dave Chinner david@fromorbit.com