From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yw1-f177.google.com (mail-yw1-f177.google.com [209.85.128.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 83853189514 for ; Tue, 30 Jul 2024 16:51:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.177 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722358313; cv=none; b=tMacPz5ZsDLQ3wSJfK/Y8iTJsrgebqr5Y722dMRikdy4tyqA8DQ8FMYxw0rRROXV9klsqDRA/vOhznK4v/bSgtojZEtlqXkJtsfGxYKRWaFdrf5oopVQe6nlji84040WUNaGrsz/+tXzOChlH3yqoZ7dQ90AN8CbV+8h0vcrK9Q= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722358313; c=relaxed/simple; bh=5d6OI7h4OHmiQ6tMLwU1mLugYzOpWaHg0cCG8Q2bvnk=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=Qvje9wuWgtznFsZzm2NhgJs70GE7zgSZI4Zrd1Sfzyt8aevNYc0MIGbhxmGTEu7lbJ9RJHtV/pKEn01+9Z3kNE0wDLttIMLEsYJ98zSQ2J6pbDuRuKPgiAB4NM1mIEoaYwSBqc2eqz6CCsraMdMmkafB9LF+mCXs3toc47YlIIk= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com; spf=none smtp.mailfrom=toxicpanda.com; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b=WMAkSWml; arc=none smtp.client-ip=209.85.128.177 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=toxicpanda.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=toxicpanda-com.20230601.gappssmtp.com header.i=@toxicpanda-com.20230601.gappssmtp.com header.b="WMAkSWml" Received: by mail-yw1-f177.google.com with SMTP id 00721157ae682-66acac24443so35575297b3.1 for ; Tue, 30 Jul 2024 09:51:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=toxicpanda-com.20230601.gappssmtp.com; s=20230601; t=1722358310; x=1722963110; darn=vger.kernel.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=N2XCQgEpuVmn0WdPUiJXoXeNWgP4VBI/PJdRvkP0yMM=; b=WMAkSWmlXKmNstHOcPXfK97YMkW5/RmnTGVozzYcdE1wio2Vl/VMS312vB9A+91Ucm o2Hls5pdEh14pY/DrJ8U8dr2zkVMSvEomLKXzT7iuzTN3jrz5d0escVsN773s6FY+Opk EozWsKXxtPhSO/MP5012y1z0TK0qhHW0eJBYufwvELB0GIiAPKALR/cFtFkq2pnOd3a/ gM8leleMsGOwfVXCJpd0/L3NufUUjjJEvHvunND2G9ACWGbRKkKTGaVGn53tOCl2fhCY 4Ye1EpYka0Kd+xhuRTuYuLzV02eFc+28nZaJKzrckb9vOwq5lImWElFf82cNZ7hyZt+0 G5JA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722358310; x=1722963110; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=N2XCQgEpuVmn0WdPUiJXoXeNWgP4VBI/PJdRvkP0yMM=; b=ZOUKzsYGjoWwQi3rNO5l+OL6wtpaCOP6++WB/cz0YplNOfX9n/IBLqCGq2PYDSqLNL 5/eyxZ5y7FAviYb10fKhb/bdSrqI30pluX4QrzmHv1htTVD8JmGr4/s97PmAxN2WNa8p v3oekIlZy3OA7LAUxbMcgxX9TQ1XP0jWyldwxOfCl/ms2CC+yfM/OyNOEeGpH8nHxbBA yNW/CzwaXqu9hyaGq/CyS1nvqiBUlGOcKxr/PFNXYl+UjxQvMEWmx+rEtGj7beIjPg0Y BXS7JwRmNHUF4cR0s4N31fFkVQeO3u2bmVX/Qd+7qo0lOFgiucNp3aGBPa/MaraXUhNE ZWBQ== X-Forwarded-Encrypted: i=1; AJvYcCXVHRAS1MUCFKi/APkYXhIQ/qJuLUaqjWKNVcJvpeDPkRmiC/gynHgBDlcG9dtI4O35VUbZaBq3iFQFj150ZBunX5EF2EnWRg0FrTla6A== X-Gm-Message-State: AOJu0YzntRrN5GnkEPQG0AYJuGFOZzM8n+NFNcm0vf67UZ6huQk4NbEU khdFpFpJ5TjZiClbqXSYRCeWQTkXLrsRAZsiNR2wJfahz2yzM5eTWNcRjk9GB4M= X-Google-Smtp-Source: AGHT+IEXO624WXqf4eUcX0w8nHFqiS+UMoBu+aGD+DWsJcqhDOaLtcJ/xmrLdE+oWp4W1o+3D2z+dg== X-Received: by 2002:a0d:fe46:0:b0:650:82e0:63b1 with SMTP id 00721157ae682-67a0a13640bmr117998967b3.41.1722358310269; Tue, 30 Jul 2024 09:51:50 -0700 (PDT) Received: from localhost (syn-076-182-020-124.res.spectrum.com. [76.182.20.124]) by smtp.gmail.com with ESMTPSA id 00721157ae682-67566dd9059sm25815097b3.28.2024.07.30.09.51.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 30 Jul 2024 09:51:49 -0700 (PDT) Date: Tue, 30 Jul 2024 12:51:49 -0400 From: Josef Bacik To: Jan Kara Cc: Amir Goldstein , kernel-team@fb.com, linux-fsdevel@vger.kernel.org, brauner@kernel.org Subject: Re: [PATCH 10/10] fsnotify: generate pre-content permission event on page fault Message-ID: <20240730165149.GA3828363@perftesting> References: <1bc2855779e7ba1d80592be7d6257b43f1a91886.1721931241.git.josef@toxicpanda.com> <20240729171120.GB3596468@perftesting> <20240730121837.fixxjcbbu7caxf2s@quack3> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20240730121837.fixxjcbbu7caxf2s@quack3> On Tue, Jul 30, 2024 at 02:18:37PM +0200, Jan Kara wrote: > On Mon 29-07-24 21:57:34, Amir Goldstein wrote: > > On Mon, Jul 29, 2024 at 8:11 PM Josef Bacik wrote: > > > > If I am reading correctly, iomap (i.e. xfs) write shared memory fault > > > > does not reach this code? > > > > > > > > Do we care about writable shared memory faults use case for HSM? > > > > It does not sound very relevant to HSM, but we cannot just ignore it.. > > > > > > > > > > Sorry I realized I went off to try and solve this problem and never responded to > > > you. I'm addressing the other comments, but this one is a little tricky. > > > > > > We're kind of stuck between a rock and a hard place with this. I had originally > > > put this before the ->fault() callback, but purposefully moved it into > > > filemap_fault() because I want to be able to drop the mmap lock while we're > > > waiting for a response from the HSM. > > > > > > The reason to do this is because there are things that take the mmap lock for > > > simple things outside of the process, like /proc/$PID/smaps and other related > > > things, and this can cause high priority tasks to block behind possibly low > > > priority IO, creating a priority inversion. > > > > > > Now, I'm not sure how widespread of a problem this is anymore, I know there's > > > been work done to the kernel and tools to avoid this style of problem. I'm ok > > > with a "try it and see" approach, but I don't love that. > > > > > > > I defer this question to Jan. > > > > > However I think putting fsnotify hooks into XFS itself for this particular path > > > is a good choice either. > > > > I think you meant "not a good choice" and I agree - > > it is not only xfs, but could be any fs that will be converted to iomap > > Other fs have ->fault != filemap_fault, even if they do end up calling > > filemap_fault, IOW, there is no API guarantee that they will. > > > > > What do you think? Just move it to before ->fault(), > > > leave the mmap lock in place, and be done with it? > > > > If Jan blesses the hook called with mmap lock, then yeh, > > putting the hook in the most generic "vfs" code would be > > the best choice for maintenance. > > Well, I agree with Josef's comment about a rock and a hard place. For once, > regardless whether the hook will happen from before ->fault or from inside > the ->fault handler there will be fault callers where we cannot drop > mmap_lock (not all archs support dropping mmap_lock inside a fault AFAIR - > but a quick grep seems to show that these days maybe they do, also some > callers - most notably through GUP - don't allow dropping of mmap_lock > inside fault). So we have to have a way to handle a fault without > FAULT_FLAG_ALLOW_RETRY flag. > > Now of course waiting for userspace reply to fanotify event with mmap_lock > held is ... dangerous. For example consider application with two threads: > > T1 T2 > page fault on inode I write to inode I > lock mm->mmap_lock inode_lock(I) > send fanotify event ... > fault_in_iov_iter_readable() > lock mm->mmap_lock -> blocks > behind T1 > > now the HSM handler needs to fill in contents of inode I requested by the > page fault: > > inode_lock(I) -> deadlock > > So conceptually I think the flow could look like (in __do_fault): > > if (!(vmf->flags & FAULT_FLAG_TRIED) && > fsnotify_may_send_pre_content_event()) { > if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) > return VM_FAULT_RETRY; > fpin = maybe_unlock_mmap_for_io(vmf, NULL); > if (!fpin) > return ???VM_FAULT_SIGSEGV???; > err = fsnotify_fault(...); > if (err) > return VM_FAULT_SIGBUS | VM_FAULT_RETRY; > /* > * We are fine with proceeding with the fault. Retry the fault > * to let the filesystem handle it. > */ > return VM_FAULT_RETRY; > } > > The downside is that if we enter the page fault without ability to drop > mmap_lock on a file needing HSM handling, we get SIGSEGV. I'm not sure it > matters in practice because these are not that common paths e.g. stuff like > putting a breakpoint / uprobe on a file but maybe there are some surprises. > The only thing I don't like about this is that now the fault handler loses the ability to drop the mmap sem. I think in practice this doesn't really matter, because we're trying to avoid doing IO while under the mmap sem, and presumably this will have instantiated the pages to avoid the IO. However if you use reflink this wouldn't be the case, and now we've re-introduced the priority inversion possiblity. I'm leaning more towards just putting the fsnotify hook in the xfs code for the write case, and anybody else who implements their own ->fault without calling the generic one will just have to do the same. That being said I'm not going to die on any particular hill here, if you still want to do the above before the ->fault() handler then I'll update my code to do that and we can move on. Thanks, Josef