From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 12 Jul 2024 09:55:44 -0700
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
To: Yunsheng Lin <linyunsheng@huawei.com>
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org

On Fri, Jul 12, 2024 at 1:42 AM Yunsheng Lin wrote:
>
> On 2024/7/12 0:49, Alexander Duyck wrote:
> > On Thu, Jul 11, 2024 at 1:16 AM Yunsheng Lin wrote:
> >>
> >> On 2024/7/10 23:28, Alexander H Duyck wrote:
> >
> > ...
> >
> >>>>
> >>>> Yes, agreed. It would be good to be more specific about how to
> >>>> avoid the above problem using a signed negative number for
> >>>> 'remaining', as I am not sure how it can be done yet.
> >>>
> >>> My advice would be to go back to patch 3 and figure out how to do
> >>> this re-ordering without changing the alignment behaviour. The old
> >>> code essentially aligned both the offset and fragsz by combining
> >>> the two and then doing the alignment. Since you are doing a
> >>> count-up setup you will
> >>
> >> I am not sure I understand the 'aligned both the offset and fragsz'
> >> part, as whether 'fragsz' is aligned seems to depend on the last
> >> caller's align_mask. For the same page_frag_cache instance, suppose
> >> offset = 32768 initially for the old code:
> >> Step 1: __page_frag_alloc_align() is called with fragsz=7 and
> >>         align_mask=~0u; the offset after this is 32761, and the
> >>         true fragsz is 7 too.
> >>
> >> Step 2: __page_frag_alloc_align() is called with fragsz=7 and
> >>         align_mask=-16; the offset after this is 32752, and the
> >>         true fragsz is 9, which does not seem to be aligned.
> >
> > I was referring to my original code before this patchset. I was
> > doing the subtraction of the fragsz first, and then aligning, which
> > would end up padding the end of the frame, since pulling the offset
> > down *after* I had already subtracted fragsz added to the total
> > size. The result was that I was always adding additional room
> > depending on the setting of the fragsz and how it related to the
> > alignment. After changing the code to realign on the start of the
> > next frag, the fragsz is at the mercy of the next caller's
> > alignment. In the event that the caller didn't bother to align the
> > fragsz by the align mask beforehand, they can end up with a scenario
> > that might result in false sharing.
>
> So it is about ensuring that the additional room required by
> alignment is placed at the end of a fragment, in order to avoid false
> sharing between the last fragment and the current fragment as much as
> possible, right?
>
> I generally agree with the above if we can also ensure skb frag
> coalescing by doing offset count-up instead of offset count-down.
>
> If there is a conflict between the two, I am assuming that enabling
> skb frag coalescing is preferred over avoiding false sharing, right?

The question I would have is where do we have opportunities for this
to result in coalescing? If we are using this to allocate skb->data
then there isn't such a chance, as the tailroom gets in the way.
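
For reference, "coalescing" here refers to the skb_can_coalesce()
check: a newly added page frag is merged into the previous one only
when it starts exactly where that frag ends. Below is a simplified
version of the usual pattern (modeled on skb_append_pagefrags();
skb->len/data_len bookkeeping omitted, and not code from this
patchset):

#include <linux/skbuff.h>

static void add_frag(struct sk_buff *skb, struct page *page,
		     unsigned int off, unsigned int len)
{
	int i = skb_shinfo(skb)->nr_frags;

	if (skb_can_coalesce(skb, i, page, off))
		/* starts exactly at the previous frag's end: grow that
		 * frag instead of consuming a new frags[] slot */
		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], len);
	else
		skb_fill_page_desc(skb, i, page, off, len);
}

The test only passes when successive allocations land back to back in
the same page, which is why the count-up ordering is what makes it
possible at all.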
If this is for a device allocating for an Rx buffer we won't get that
chance either, as we have to preallocate some fixed size without
knowing the buffer that is coming, and it usually needs to be DMA
aligned in order to avoid causing partial cacheline reads/writes. The
only way these would combine well is if you are doing aligned fixed
blocks and are the only consumer of the page frag cache. It is
essentially just optimizing for jumbo frames in that case.

If this is for some software interface, why wouldn't it request the
coalesced size and do one allocation, rather than trying to figure out
how to perform a bunch of smaller allocations and then merge them
together after the fact?

> >
> >> The above is why I added the paragraph below to the doc, to make
> >> the semantics more explicit:
> >> "Depending on the alignment requirement, the page_frag API caller
> >> may call page_frag_alloc*_align*() to ensure the returned virtual
> >> address or offset of the page is aligned according to the
> >> 'align/alignment' parameter. Note that the size of the allocated
> >> fragment is not aligned; the caller needs to provide an aligned
> >> fragsz if there is an alignment requirement for the size of the
> >> fragment."
> >>
> >> And the existing callers of the page_frag aligned API did seem to
> >> follow the above rule last time I checked.
> >>
> >> Or did I miss something obvious here?
> >
> > No, you didn't miss anything. It is just that there is now more
> > potential for error than there was before.
>
> I guess the 'error' refers to the 'false sharing' mentioned above,
> right? If it is indeed an error, are we not supposed to fix it
> instead of allowing such an implicit implication? Allowing the
> implicit implication seems to make the 'error' harder to reproduce
> and debug.

The concern with the code as it stands is that, if I am not mistaken,
remaining isn't actually aligned. You aligned it, then added fragsz.
That becomes the start of the next frame, so if fragsz isn't aligned
the next requester will be getting an unaligned buffer, or one that is
only aligned to the previous caller's alignment.

> >
> >>> need to align the remaining, then add fragsz, and then I guess you
> >>> could store remaining and then subtract fragsz from your final
> >>> virtual address to get back to where the starting offset is
> >>> actually located.
> >>
> >> remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
> >> remaining += fragsz;
> >> nc->remaining = remaining;
> >> return encoded_page_address(nc->encoded_va) + (size + remaining) - fragsz;
> >>
> >> If yes, I am not sure what the point of doing the above is yet; it
> >> just seems to make things more complicated and harder to
> >> understand.
> >
> > That isn't right. I am not sure why you are adding size + remaining
> > or what those are supposed to represent.
>
> As the above assumes 'remaining' is a negative value, as you
> suggested, (size + remaining) is supposed to represent the offset of
> the next fragment, to ensure we have a count-up offset for enabling
> skb frag coalescing, and '- fragsz' is used to get the offset of the
> current fragment.
>
> >
> > The issue was that the "remaining" ends up being an unaligned value,
> > as you were starting by aligning it and then adding fragsz. So by
> > subtracting fragsz you can get back to the aligned start. What this
> > patch was doing before was adding the raw unaligned nc->remaining at
> > the end of the function.
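
To put concrete numbers on that (a toy trace, not the patch itself,
assuming a 32 KB page, a 16-byte alignment requirement, and
__ALIGN_KERNEL_MASK() from uapi/linux/const.h, which expands to
((x) + (mask)) & ~(mask)):

int remaining = -32768;	/* fresh 32 KB page, counts up toward 0 */

/* caller asks for fragsz = 7 with align_mask = -16 (~align_mask = 15) */
remaining = __ALIGN_KERNEL_MASK(remaining, 15);	/* -32768, already aligned */
remaining += 7;					/* -32761 gets stored */

/* The stored -32761 is the start of the next fragment, but it is not
 * 16-byte aligned; the next caller's __ALIGN_KERNEL_MASK() pushes it
 * up to -32752, while subtracting fragsz (7) from the stored value is
 * what recovers this caller's aligned start of -32768.
 */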
> >
> >>>
> >>> Basically your "remaining" value isn't a safe number to use for an
> >>> offset, since it isn't aligned to your starting value at any
> >>> point.
> >>
> >> Does using an 'aligned_remaining' local variable to make it more
> >> obvious seem reasonable to you?
> >
> > No, as the value you are storing above isn't guaranteed to be
> > aligned. If you stored it before adding fragsz then it would be
> > aligned.
>
> I have a feeling that what you are proposing may conflict with
> enabling skb frag coalescing, as doing offset count-up seems to need
> some room to be reserved at the beginning of an allocated fragment
> due to the alignment requirement, and that may mean we need to align
> both fragsz and the offset.
>
> Perhaps the 'remaining' change in this patch does make things harder
> to discuss. Anyway, it would be helpful if there were some pseudo
> code showing the steps of how the above can be done as you see it.

Basically what you would really need to do for all this is:

remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
nc->remaining = remaining + fragsz;
return encoded_page_address(nc->encoded_va) + size + remaining;

The issue is you cannot be using page_frag_cache_current_va, as that
pointer will always be garbage since it isn't aligned to anything. It
isn't like the code I had that was working backwards: here the offset
cannot be used as soon as you compute it, since you add fragsz to it.

For any count-up you have to align the current offset/remaining value,
use that value to compute the page address, and store the
offset/remaining plus the fragsz adjustment. At which point your
offset/remaining pointer value isn't good for the current buffer
anymore.
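
Fleshed out, that flow would look something like the sketch below.
This is illustrative only: 'remaining' is assumed to count up from
-size to 0 as in this patchset, but the refill path is omitted and
helper names such as page_frag_cache_page_size() are approximations,
not the patch's actual API.

void *__page_frag_alloc_align(struct page_frag_cache *nc,
			      unsigned int fragsz, gfp_t gfp_mask,
			      unsigned int align_mask)
{
	unsigned int size = page_frag_cache_page_size(nc->encoded_va);
	int remaining;

	/* align the *start* of this fragment; any slack left by the
	 * previous caller's unaligned fragsz is absorbed here */
	remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);

	/* 'remaining' is negative, so going above 0 means overflow */
	if (unlikely(remaining + (int)fragsz > 0))
		return NULL;	/* page exhausted; refill path omitted */

	/* store the start of the *next* fragment ... */
	nc->remaining = remaining + fragsz;

	/* ... but form this fragment's address from the aligned value
	 * captured *before* fragsz was added */
	return encoded_page_address(nc->encoded_va) + size + remaining;
}

The returned pointer is built from the aligned 'remaining'; the stored
nc->remaining only becomes aligned again once the next caller applies
its own mask, which is exactly the unaligned-fragsz caveat discussed
above.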