From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-44.mimecast.com (us-smtp-delivery-44.mimecast.com [205.139.111.44]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C31E1143C4B for ; Tue, 6 Aug 2024 11:05:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=205.139.111.44 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722942359; cv=none; b=gjlaVM5hGCVi+iXPonn+b6u8heZhUCY4Jlqdosqa13ubG4rhFAPbpBDDU0+R1wuz4Y1oxqaUly6Xt2zWeQv4fRKj6zmwWaeLdlTMaP5M5czkbk2KbSWFogADj2sU4zDG8FhQ/kG3JNzsdN2BE+VuYqaTRbKmuB6Ks7JY0V/wMiE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722942359; c=relaxed/simple; bh=rZwmuj7Cm32RNobSHUJ69JNy933Q6jvlmTbthAX4vrs=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: In-Reply-To:Content-Type:Content-Disposition; b=mbyObqO6MSTxgMeGzZEn9H8YW1ilvGmYXL/o14ktW2b2xMw+FNOYWUpvpA5xgGNtM0+C+AlD94BW8/LJTpW3Ebku983T83evlgrLNZrW8OyWfRj0tjFPI+tDwGiUSkx2Qythxzo56XqauUXoeewnkdQPU8L+Yat6bknMAg6xke4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=queasysnail.net; spf=none smtp.mailfrom=queasysnail.net; arc=none smtp.client-ip=205.139.111.44 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=queasysnail.net Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=queasysnail.net Received: from mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-606-rzu9rK7SMDWSV0Dq9pH-Pw-1; Tue, 06 Aug 2024 07:05:43 -0400 X-MC-Unique: rzu9rK7SMDWSV0Dq9pH-Pw-1 Received: from mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.12]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-01.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id 16E571955D42; Tue, 6 Aug 2024 11:05:42 +0000 (UTC) Received: from hog (unknown [10.39.192.3]) by mx-prod-int-03.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id E95E91956046; Tue, 6 Aug 2024 11:05:38 +0000 (UTC) Date: Tue, 6 Aug 2024 13:05:32 +0200 From: Sabrina Dubroca To: Christian Hopps Cc: devel@linux-ipsec.org, Steffen Klassert , netdev@vger.kernel.org, Christian Hopps Subject: Re: [PATCH ipsec-next v8 10/16] xfrm: iptfs: add fragmenting of larger than MTU user packets Message-ID: References: <20240804203346.3654426-1-chopps@chopps.org> <20240804203346.3654426-11-chopps@chopps.org> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 3.0 on 10.30.177.12 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: queasysnail.net Content-Type: text/plain; charset=UTF-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable 2024-08-06, 04:54:53 -0400, Christian Hopps wrote: >=20 > Sabrina Dubroca writes: >=20 > > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote: > > > > > +/* 1) skb->head should be cache aligned. > > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the c= acheline to > > > > > + * start -16 from data. > > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the= IPTFS payload > > > > > + * we want data to be cache line aligned so all the pushed heade= rs will be in > > > > > + * another cacheline. > > > > > + */ > > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128 > > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16) > > > > > > > > How did you pick those values? > > >=20 > > > That's what the comment is talking to. When reserving space for L2 he= aders we > > > pick 64 + 16 (a 2^(<=3D6) cacheline + 16 bytes so the the cacheline s= hould start > > > -16 from where skb->data will point at. > >=20 > > Hard-coding the x86 cacheline size is not a good idea. And what's the > > 16B for? You don't know that it's enough for the actual L2 headers. >=20 > I am not hard coding the x86 cacheline. I am picking 64 as the largest ca= cheline that this is optimized for, it also works for smaller cachelines. At least use SMP_CACHE_BYTES then? > 16B is to allow for the incredibly common 14B L2 header to fit. Why not use skb->dev->needed_headroom, like a bunch of tunnels are already doing? No guessing required. ethernet is the most common, but there's no reason to penalize other protocols when the information is available. > > > For L3 we reserve double the power of 2 space we reserved for L2 only= . > >=20 > > But that's the core of my question. Why is that correct/enough? >=20 > I have to pick a value. There is no magically perfect number that I can p= ick. I've given you technical reasons and justifications for the numbers I = have chosen -- not sure what else I can say. Do you have better suggestions= for the sizes which would be more optimal on more architectures? If not th= en let's use the numbers that I have given technical reasons for choosing. Yes, now you've spelled it out, and we can evaluate your choices. > Put this another way. I could just pick 128 b/c it's 2 cachelines > and fits lots of different headers and would be "good > enough". That's plenty justification too. I think you looking for > too much here -- this isn't a precision thing, it's a "Good Enough" > thing. I'm asking questions. That's kind of the reviewer's job, understanding how the thing they're reviewing works. =C2=AF\_(=E3=83=84)_/=C2=AF > > > We have to reserve some amount of space for pushed headers, so the ab= ove made sense to me for good performance/cache locality. > > >=20 > > > > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 = len, > > > > > +=09=09=09=09 bool l3resv) > > > > > +{ > > > > > +=09struct sk_buff *skb; > > > > > +=09u32 resv; > > > > > + > > > > > +=09if (!l3resv) { > > > > > +=09=09resv =3D XFRM_IPTFS_MIN_L2HEADROOM; > > > > > +=09} else { > > > > > +=09=09resv =3D skb_headroom(tpl); > > > > > +=09=09if (resv < XFRM_IPTFS_MIN_L3HEADROOM) > > > > > +=09=09=09resv =3D XFRM_IPTFS_MIN_L3HEADROOM; > > > > > +=09} > > > > > + > > > > > +=09skb =3D alloc_skb(len + resv, GFP_ATOMIC); > > > > > +=09if (!skb) { > > > > > +=09=09XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR= ); > > > > > > > > Hmpf, so we've gone from incrementing the wrong counter to > > > > incrementing a new counter that doesn't have a precise meaning. > > >=20 > > > The new "No SKB" counter is supposed to mean "couldn't get an SKB", > > > given plenty of other errors are logged under "OutErr" or "InErr" > > > i'm not sure what level of precision you're looking for here. :) > >=20 > > OutErr and InErr would be better than that new counter IMO. >=20 > Why? >=20 > My counter tracks the SKB depletion failure that is actually happening. W= ould you have me now pass in the direction argument just so I can tick the = correct overly general MIB counter that provides less value to the user in = identifying the actual problem? How is that good design? >=20 > I'm inclined to just delete the thing altogether rather than block on thi= s thing that will almost never happen. Fine. > > > > > +=09=09return NULL; > > > > > +=09} > > > > > + > > > > > +=09skb_reserve(skb, resv); > > > > > + > > > > > +=09/* We do not want any of the tpl->headers copied over, so we = do > > > > > +=09 * not use `skb_copy_header()`. > > > > > +=09 */ > > > > > > > > This is a bit of a bad sign for the implementation. It also worries > > > > me, as this may not be updated when changes are made to > > > > __copy_skb_header(). > > > > (c/p'd from v1 review since this was still not answered) > > >=20 > > > I don't agree that this is a bad design at all, I'm curious what you = think a good design to be. > >=20 > > Strange skb manipulations hiding in a protocol module is not good > > design. >=20 > It's a fragmentation and aggregation protocol, it's needs work with skbs = by design. It's literally the function of the protocol to manipulate packet= content. packet content !=3D cherry-picked parts of sk_buff > I would appreciate it if you could provide technical reasons to justify r= eferring to things as "bad" or "strange" -- it's not helpful otherwise. I did say it's a bad sign, not a blocking issue on its own. But that bad sign, combined with the unusual use of skb_seq and a lot of copying data around, indicates that this is not the right way to implement this part of the protocol. > > c/p bits of core code into a module (where they will never get fixed > > up when the core code gets updated) is always a bad idea. >=20 > I need some values from the SKB, so I copy them -- it's that simple. >=20 > > > I did specifically state why we are not re-using > > > skb_copy_header(). The functionality is different. We are not trying > > > to make a copy of an skb we are using an skb as a template for new > > > skbs. > >=20 > > I saw that. That doesn't mean it's a good thing to do. >=20 > Please suggest an alternative. A common helper in a location where people are going to know that they need to fix it up when they modify things about sk_buff would be a good start. > > > > > +/** > > > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel = buffer > > > > > + * @st: source skb_seq_state > > > > > + * @offset: offset in source > > > > > + * @to: destination buffer > > > > > + * @len: number of bytes to copy > > > > > + * > > > > > + * Copy @len bytes from @offset bytes into the source @st to the= destination > > > > > + * buffer @to. `offset` should increase (or be unchanged) with e= ach subsequent > > > > > + * call to this function. If offset needs to decrease from the p= revious use `st` > > > > > + * should be reset first. > > > > > + * > > > > > + * Return: 0 on success or a negative error code on failure > > > > > + */ > > > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offse= t, void *to, > > > > > +=09=09=09 int len) > > > > > > > > Probably belongs in net/core/skbuff.c, although I'm really not > > > > convinced copying data around is the right way to implement the typ= e > > > > of packet splitting IPTFS does (which sounds a bit like a kind of > > > > GSO). And there are helpers in net/core/skbuff.c (such as > > > > pskb_carve/pskb_extract) that seem to do similar things to what you > > > > need here, without as much data copying. > > >=20 > > > I don't have an issue with moving more general skb functionality > > > into skbuff.c; however, I do not want to gate IP-TFS on this change > > > to the general net infra, it is appropriate for a patchset of it's > > > own. > >=20 > > If you need helpers that don't exist, it's part of your job to make > > the core changes that are required to implement the functionality. >=20 > This is part of a new code protocol and feature addition and it's a singl= e use. Of course the helper would be single use when it's introduced. You don't know if it will remain single use. And pskb_extract is single use, it's fine. > Another patchset can present this code to the general network > community to see if they think it *also* has value outside of > IPTFS. There is *no* reason to delay IPTFS on general network > infrastructure improvements. Please don't do this. Sorry, I don't think that's how it works. > > > Re copying: Let's be clear here, we are not always copying data, > > > there are sharing code paths as well; however, there are times when > > > it is the best (and even fastest) way to accomplish things (e.g., > > > b/c the packet is small or the data is arranged in skbs in a way to > > > make sharing ridiculously complex and thus slow). > >=20 > > I'm not finding the sharing code. You mean iptfs_first_should_copy > > returning false? >=20 >=20 > /* Try share then copy. */ > if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)= ) { > ... > leftover =3D skb_add_frags(newskb, fragwalk, data, copylen= ); > } else { > /* copy fragment data into newskb */ > if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen), > ... > } You're talking about reassembly now. This patch is fragmentation/TX. --=20 Sabrina