From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Cc: Christian Brauner <brauner@kernel.org>,
	Miklos Szeredi <miklos@szeredi.hu>, Gao Xiang <xiang@kernel.org>,
	Song Liu <song@kernel.org>, Amir Goldstein <amir73il@gmail.com>,
	Jan Kara <jack@suse.cz>
Subject: Fanotify (hsm or erofs) / fuse / nbd / ... + write(mmap) deadlock vector followup
Date: Wed, 10 Jun 2026 16:17:33 +0800
Message-ID: <e36c9ef5-4cd0-4281-9598-c1bfb8ebfaa0@linux.alibaba.com>
In-Reply-To: <CAOQ4uxi0yRavUeztRh-Eb8_HvpfmqkpPbWT9JgQhUn7EKXnryw@mail.gmail.com>

Hi all,

Amir just suggested that I post a long off-list discussion
on the list right away. Since it seems to relate to the
`sb->s_writers.rw_sem` locking design, I am not quite sure
how much I can help in general, but here it goes.

The background is that we have been discussing a generic
deadlock scenario for several months, as below:

fsA is a filesystem which supports fsfreeze, such as EXT4/XFS/...

fsB is a filesystem which depends on a userspace daemon
(e.g. a filesystem with fanotify HSM hooks /
   fanotify + EROFS file-backed mounts / a FUSE filesystem or
   a filesystem backed by a virtual block device)

Thread A                           Thread B                                 Userspace daemon
  write(fsA_fd, mmap(fsB_fd))
   file_start_write()
    (take SB_FREEZE_WRITE read lock)

    handle fsB mmap fault read
     -> notify userspace and wait

                                    freeze_super
                                    (try to take SB_FREEZE_WRITE write lock)
                                                                             received/handling fsB mmap read request
                                                                             (does some arbitrary work...)
                                                                             write(fsA_fd2)
                                                                              (take SB_FREEZE_WRITE read lock)

The problematic sequence is as follows. Thread A does
`write(fsA_fd, mmap(fsB_fd))` => file_start_write() (rwsem read),
then hits a page fault and waits for the userspace daemon to
finish the request;

Thread B does fsfreeze on fsA, so it waits for the write lock
(sb_wait_write(), rwsem write) and is blocked on thread A;

And the userspace daemon, which handles the page fault,
tries to write to another file (fsA_fd2) on fsA and is
blocked on thread B (file_start_write(), rwsem read).

Because the lock acquisition order is `R->W->R`, none of the
three can make progress, so the related processes (and fsA
above) get stuck.
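
For illustration only, the `R->W->R` ordering can be replayed
in a toy userspace model. This is a sketch, not kernel code:
`struct toy_rwsem` and the `toy_*` helpers are hypothetical
names, and the model only assumes that a queued writer blocks
later readers, which is the property that matters here.

```c
#include <stdbool.h>

/* Toy writer-fair read/write lock; NOT the kernel's percpu_rwsem. */
struct toy_rwsem {
	int active_readers;   /* readers currently holding the lock */
	bool writer_waiting;  /* a writer is queued behind them */
};

/* A new reader gets in only if no writer is already queued. */
static bool toy_try_read(struct toy_rwsem *s)
{
	if (s->writer_waiting)
		return false;
	s->active_readers++;
	return true;
}

/* A writer gets in only when no reader holds the lock; else it queues. */
static bool toy_try_write(struct toy_rwsem *s)
{
	if (s->active_readers > 0) {
		s->writer_waiting = true;
		return false;
	}
	s->writer_waiting = false;
	return true;
}

/* Replays the three steps; returns true if they end in a cycle. */
static bool toy_replay_deadlock(void)
{
	struct toy_rwsem fsA = { 0, false };

	bool a_holds  = toy_try_read(&fsA);   /* thread A: file_start_write() */
	bool b_frozen = toy_try_write(&fsA);  /* thread B: freeze_super() */
	bool d_holds  = toy_try_read(&fsA);   /* daemon: write(fsA_fd2) */

	/*
	 * A drops its read lock only after the daemon answers the fault,
	 * and the daemon cannot answer while it waits behind B.
	 */
	return a_holds && !b_frozen && !d_holds;
}
```

In this model `toy_replay_deadlock()` returns true: A holds
the read side, B is queued, and the daemon is stuck behind B.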


Since the issue is complex, I can only offer my own thoughts
from the perspective of the fanotify + EROFS use case:

  - fsfreeze is typically unneeded, especially in typical
    cloud + container environments;

  - we can also place the image backing file on a completely
    separate local filesystem, so that
    "write(fd, mmap(), ...)" + fsfreeze is not a practical
    issue;

  - in the worst case, user programs can be killed to recover
    the system, and fanotify already supports this, so
    it shouldn't be too harmful;

  - this vector impacts all approaches that rely on a
    userspace daemon.
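
As a rough aid for the isolation point above, a first-pass
check of whether two paths live on the same filesystem can
compare `st_dev` from stat(2). This is only a sketch:
`same_st_dev()` is a hypothetical helper, and differing
`st_dev` values are a hint rather than a guarantee of separate
superblocks on every filesystem.

```c
#include <sys/stat.h>

/*
 * Returns 1 if both paths report the same st_dev, 0 if they
 * differ, and -1 if either stat() call fails.
 */
static int same_st_dev(const char *path_a, const char *path_b)
{
	struct stat a, b;

	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return -1;
	return a.st_dev == b.st_dev;
}
```

A result of 0 for the image backing file and the directory the
daemon writes to suggests that a freeze of one does not involve
the other.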


I also shared my own preliminary idea with Jan and Amir at
LSF (just for reference):

diff --git a/fs/super.c b/fs/super.c
index 378e81efe643..2897d3572e9e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -2069,6 +2069,7 @@ static inline bool may_unfreeze(struct super_block *sb, enum freeze_holder who,
   */
  int freeze_super(struct super_block *sb, enum freeze_holder who, const void *freeze_owner)
  {
+    bool retried = false;
      int ret;

      if (!super_lock_excl(sb)) {
@@ -2111,7 +2112,15 @@ int freeze_super(struct super_block *sb, enum freeze_holder who, const void *fre
      sb->s_writers.frozen = SB_FREEZE_WRITE;
      /* Release s_umount to preserve sb_start_write -> s_umount ordering */
      super_unlock_excl(sb);
-    sb_wait_write(sb, SB_FREEZE_WRITE);
+    while ((ret = sb_wait_write_timeout(sb, SB_FREEZE_WRITE))) {
+        if (retried) {
+            sb->s_writers.frozen = SB_UNFROZEN;
+            sb_freeze_unlock(sb, SB_FREEZE_FS);
+            deactivate_locked_super(sb);
+            return ret;
+        }
+        retried = true;
+    }
      __super_lock_excl(sb);

      /* Now we go and block page faults... */

The idea is simply to add a new sb_wait_write_timeout()
interface so that the writer (freeze_super) can be woken
up by a timeout even when the locking order is already
R->W->R. Once W (freeze_super) backs off, the two readers
(R->R) are effectively merged and can make progress.
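
The back-off can be sketched in a toy userspace model. This is
an illustration under stated assumptions, not the proposed
kernel change: `struct toy_rwsem` and every `toy_*` helper are
hypothetical, and the "timeout" is modeled as the writer simply
giving up its queue slot.

```c
#include <stdbool.h>

/* Toy writer-fair read/write lock; NOT the kernel's percpu_rwsem. */
struct toy_rwsem {
	int active_readers;   /* readers currently holding the lock */
	bool writer_waiting;  /* a writer is queued behind them */
};

static bool toy_try_read(struct toy_rwsem *s)
{
	if (s->writer_waiting)
		return false;     /* readers queue behind a waiting writer */
	s->active_readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwsem *s)
{
	s->active_readers--;
}

static bool toy_try_write(struct toy_rwsem *s)
{
	if (s->active_readers > 0) {
		s->writer_waiting = true;
		return false;
	}
	s->writer_waiting = false;
	return true;
}

/* Hypothetical timeout path: the writer leaves the queue. */
static void toy_write_timeout(struct toy_rwsem *s)
{
	s->writer_waiting = false;
}

/* Returns true if the freeze succeeds after one back-off. */
static bool toy_replay_with_timeout(void)
{
	struct toy_rwsem fsA = { 0, false };

	if (!toy_try_read(&fsA))    /* thread A takes the read side */
		return false;
	if (toy_try_write(&fsA))    /* thread B must queue */
		return false;
	if (toy_try_read(&fsA))     /* daemon is stuck behind B */
		return false;

	toy_write_timeout(&fsA);    /* B times out and backs off */

	if (!toy_try_read(&fsA))    /* daemon now gets the read side */
		return false;
	toy_read_unlock(&fsA);      /* daemon finishes its write */
	toy_read_unlock(&fsA);      /* fault is served, A finishes */

	return toy_try_write(&fsA); /* B's retry finds no readers */
}
```

In this model the retry succeeds only because both readers
have drained by then; a real implementation would still need a
policy for what to do when the retry also times out.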

I wonder whether this is practical, since it still allows
"write(fd, mmap(), ...)" and would work for all approaches.
However, the main issue, at least in my opinion, is that
percpu_rwsem doesn't have percpu_down_write_timeout()
(for plain rwsem, since down_write_killable() exists, I
think down_write_timeout() should at least be doable).
I tried to vibe-code a version but have no idea whether
it performs well.


Amir and others may have better ideas for further
discussion, but as Amir suggested, I've written this
up...

Thanks,
Gao Xiang
