Date: Tue, 4 Mar 2025 16:53:09 +0000
From: Matthew Wilcox
To: Hannes Reinecke
Cc: Vlastimil Babka, Boris Pismenny, John Fastabend, Jakub Kicinski, Sagi Grimberg, "linux-nvme@lists.infradead.org", "linux-block@vger.kernel.org", linux-mm@kvack.org, Harry Yoo, "netdev@vger.kernel.org"
Subject: Re: Kernel oops with 6.14 when enabling TLS
References: <15be2446-f096-45b9-aaf3-b371a694049d@suse.com> <95b0b93b-3b27-4482-8965-01963cc8beb8@suse.cz> <6877dfb1-9f44-4023-bb6d-e7530d03e33c@suse.com>
X-Mailing-List: linux-block@vger.kernel.org

On Tue, Mar 04, 2025 at 05:32:32PM +0100, Hannes Reinecke wrote:
> On 3/4/25 17:14, Matthew Wilcox wrote:
> > I thought we'd done all the work needed to get rid of these pointless
> > refcount bumps.  Turns out that's only on the block side (eg commit
> > e4cc64657bec).  So what does networking need in order to understand
> > that some iovecs do not need to mess with the refcount?
>
> The network stack needs to get hold of the page while transmission is
> ongoing, as there is potentially rather deep queueing involved,
> requiring several calls to sendmsg() and friends before the page is finally
> transmitted. And maybe some post-processing (checksums,
> digests, you name it), too, all of which require the page to be there.
>
> It's all so jumbled up ... personally, I would _love_ to do away with
> __iov_iter_get_pages_alloc(). Allocating a page array? Seriously?
>
> And the problem with that is that it always takes a page(!) reference,
> completely oblivious to the fact whether you even _can_ take a page
> reference (eg for tail pages); we've hit this problem several times now
> (check for sendpage_ok() ...).

Calling get_page() / put_page() on a tail page is fine -- that just
redirects to the head page.  But calling it on a slab never made any
sense; at best it gets you the equivalent of TYPESAFE_BY_RCU -- that is,
the object can be freed and reallocated, but the underlying slab will
not be reallocated to some other purpose.

> But that's not the real issue; real issue is that the page reference is
> taken down in the very bowels of __iov_iter_get_pages_alloc(), but needs
> to be undone by the _caller_. Who might (or might not) have an idea
> that he needs to drop the reference here.
> That's why there is no straightforward conversion; you need to audit
> each and every caller and try to find out where the page reference (if any)
> is dropped.
> Bah.
>
> Can't we (at the very least) leave it to the caller of
> __iov_iter_get_pages() to get a page reference (he has access to the page
> array, after all ...)? That would make the interface slightly
> better, and it'll be far more obvious to the caller what needs
> to be done.

Right, that's what happened in the block layer.  We mark the bio with
BIO_PAGE_PINNED if the pincount needs to be dropped.  During the
transitional period, we also had BIO_PAGE_REFFED, which indicated that
the page refcount needed to be dropped.  Perhaps there's something
similar that networking could be doing.
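
For illustration only, the flag-based scheme described above might look
roughly like this in a userspace toy model.  This is a sketch, not kernel
code: struct toy_page, struct toy_req, the TOY_PAGE_* flags and both
functions are hypothetical names invented here, loosely modelled on the
BIO_PAGE_PINNED / BIO_PAGE_REFFED idea.  The point is that the request
records what was taken when the pages were attached, so the completion
path knows exactly what to undo, and borrowed memory (flags == 0) never
has its counters touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: just the two counters discussed here. */
struct toy_page {
	int refcount;
	int pincount;
};

/* Hypothetical flags, loosely modelled on BIO_PAGE_PINNED / BIO_PAGE_REFFED. */
#define TOY_PAGE_PINNED	(1u << 0)	/* pincount must be dropped on completion */
#define TOY_PAGE_REFFED	(1u << 1)	/* refcount must be dropped on completion */

#define TOY_MAX_PAGES	8

struct toy_req {
	unsigned int flags;
	size_t nr_pages;
	struct toy_page *pages[TOY_MAX_PAGES];
};

/*
 * Attach pages to a request.  flags == 0 means the pages are borrowed
 * (e.g. memory the submitter keeps alive itself), so no counter is touched.
 */
static bool toy_req_init(struct toy_req *req, struct toy_page **pages,
			 size_t nr_pages, unsigned int flags)
{
	size_t i;

	if (nr_pages > TOY_MAX_PAGES)
		return false;

	req->flags = flags;
	req->nr_pages = nr_pages;
	for (i = 0; i < nr_pages; i++) {
		req->pages[i] = pages[i];
		if (flags & TOY_PAGE_PINNED)
			pages[i]->pincount++;
		if (flags & TOY_PAGE_REFFED)
			pages[i]->refcount++;
	}
	return true;
}

/* Completion path: undo exactly what the flags say was done, nothing more. */
static void toy_req_complete(struct toy_req *req)
{
	size_t i;

	for (i = 0; i < req->nr_pages; i++) {
		struct toy_page *page = req->pages[i];

		if (req->flags & TOY_PAGE_PINNED) {
			assert(page->pincount > 0);
			page->pincount--;
		}
		if (req->flags & TOY_PAGE_REFFED) {
			assert(page->refcount > 0);
			page->refcount--;
		}
	}
	req->nr_pages = 0;
	req->flags = 0;
}
```

A caller would pass TOY_PAGE_PINNED for pinned user memory, TOY_PAGE_REFFED
for the legacy take-a-reference case, and 0 for memory whose lifetime it
already guarantees; toy_req_complete() then needs no per-caller audit to
know what to release.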