* rewriting skb->truesize... good or bad idea
@ 2006-09-29 18:16 Vlad Yasevich
2006-09-29 21:31 ` David Miller
0 siblings, 1 reply; 4+ messages in thread
From: Vlad Yasevich @ 2006-09-29 18:16 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]
Hi Folks
I was looking at SCTP performance problem that is a result of
receive buffer exhaustion and found the we severely overcharge
the receive buffer when multiple data chunks are bundled together.
This bundling usually happens at retransmit time which penalizes us
even more.
Here is what happens. For every "data" chunk that SCTP stack receives,
we clone skb of that data chunk, charge the receive buffer for the skb,
and put the chunk on the the socket receive queue (this is skipping a
few steps, but they don't matter for the sake of this discussion). We
charge the receive buffer with the skb->truesize.
The problem shows up when multiple data chunks are "bundled" into the
same skb. We end up with multiple clones, and for each clone we charge
skb->truesize against the receive buffer. However, since skb_clone()
preservers the original truesize in all clones, we end up overcharging.
One of the proposed solutions is change the skb->truesize of the clone
to just be sizeof(struct sk_buff), if and only if this is not the first
data chunk in the packet.
I've attached the patch, in case people want to look at the code.
However, we question if this is a good idea or if this is going to break
things...
Thanks
-vlad
[-- Attachment #2: p --]
[-- Type: text/plain, Size: 1365 bytes --]
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 5b5ae79..9bb1dbd 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -5349,7 +5349,7 @@ static int sctp_eat_data(const struct sc
if (SCTP_CMD_CHUNK_ULP == deliver)
sctp_add_cmd_sf(commands, SCTP_CMD_REPORT_TSN, SCTP_U32(tsn));
- chunk->data_accepted = 1;
+ chunk->data_accepted++;
/* Note: Some chunks may get overcounted (if we drop) or overcounted
* if we renege and the chunk arrives again.
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index ee23678..0e1f11d 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -685,6 +685,17 @@ struct sctp_ulpevent *sctp_ulpevent_make
/* Initialize event with flags 0. */
sctp_ulpevent_init(event, 0);
+ /* Check to see if we need to fixup the truesize of the clone.
+ * We are about to charge the receive buffer for this chunk,
+ * and we always use skb->truesize. However, this doesn't work
+ * for bundled data chunks since we'll drastically overcharge.
+ * To get around that, keep the oiginal truesize on the clone
+ * only for the first data chunk, and update truesize for the clone
+ * on subsequent ones.
+ */
+ if (chunk->data_accepted > 1)
+ skb->truesize = sizeof(struct skb);
+
sctp_ulpevent_receive_data(event, asoc);
event->stream = ntohs(chunk->subh.data_hdr->stream);
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: rewriting skb->truesize... good or bad idea
2006-09-29 18:16 rewriting skb->truesize... good or bad idea Vlad Yasevich
@ 2006-09-29 21:31 ` David Miller
2006-10-02 14:46 ` Vlad Yasevich
0 siblings, 1 reply; 4+ messages in thread
From: David Miller @ 2006-09-29 21:31 UTC (permalink / raw)
To: vladislav.yasevich; +Cc: netdev
From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Fri, 29 Sep 2006 14:16:57 -0400
> I've attached the patch, in case people want to look at the code.
>
> However, we question if this is a good idea or if this is going to break
> things...
Modification of skb->truesize is very dangerous and is only legal
in a very limited set of circumstances.
The core problem is that if any other reference exists to the skb
you could corrupt existing socket accounting by changing skb->truesize.
Let's say that the packet went into a network tap like tcpdump and
the packet has been charged to an AF_PACKET socket via skb->truesize.
If you modify skb->truesize then when the AF_PACKET socket releases
it's reference it will subtract the wrong amount of skb->truesize
from the socket receive buffer.
If, on the other hand, you know you have exclusive access to the
skb and there are no other references, setting skb->truesize can
be OK. However setting it to sizeof(struct sk_buff) doesn't make
sense since there is at least:
sizeof(struct sk_buff) + sizeof(struct skb_shared_info)
memory assosciated with any SKB whatsoever in the kernel. That is,
even for a "zero length" skb->data buffer, we still always allocate
the skb_shared_info area which must be accounted for.
BTW, I think the whole chunking mechanism in the SCTP code is the
largest source of problems and maintainability issues in that stack.
Any time I want to make major modifications to SKBs to make them
smaller or whatever, the most difficult piece of code to adjust is
this code.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: rewriting skb->truesize... good or bad idea
2006-09-29 21:31 ` David Miller
@ 2006-10-02 14:46 ` Vlad Yasevich
2006-10-03 23:21 ` David Miller
0 siblings, 1 reply; 4+ messages in thread
From: Vlad Yasevich @ 2006-10-02 14:46 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Hi David
Thanks for the answer.
David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Fri, 29 Sep 2006 14:16:57 -0400
>
>> I've attached the patch, in case people want to look at the code.
>>
>> However, we question if this is a good idea or if this is going to break
>> things...
>
> Modification of skb->truesize is very dangerous and is only legal
> in a very limited set of circumstances.
>
> The core problem is that if any other reference exists to the skb
> you could corrupt existing socket accounting by changing skb->truesize.
Yes, that's what I've found in the code.
>
> Let's say that the packet went into a network tap like tcpdump and
> the packet has been charged to an AF_PACKET socket via skb->truesize.
> If you modify skb->truesize then when the AF_PACKET socket releases
> it's reference it will subtract the wrong amount of skb->truesize
> from the socket receive buffer.
Thankfully, this does not appear to be a problem in this particular
case. The clones that would have their truesize changed, only exist in
SCTP. The packet has already gone through all verifications and we just
queue the clones to the socket. The original packet skb remains
unchanged. SCTP calls kfree_skb on it once all the cloning is done.
>
> If, on the other hand, you know you have exclusive access to the
> skb and there are no other references, setting skb->truesize can
> be OK. However setting it to sizeof(struct sk_buff) doesn't make
> sense since there is at least:
>
> sizeof(struct sk_buff) + sizeof(struct skb_shared_info)
>
> memory assosciated with any SKB whatsoever in the kernel. That is,
> even for a "zero length" skb->data buffer, we still always allocate
> the skb_shared_info area which must be accounted for.
Well, since we are dealing with clones of the original skb, I didn't
think that we need to include skb_shared_info. I thought that was
accounted for in the original skb?
The clones, that SCTP creates, point at a subset of data...
something like this:
clone 1 ------+ clone 2 ------+
| |
| |
v v
+-------------------------------------------------------
| proto hdrs | sctp data chunk 1 | data chunk 2
+-------------------------------------------------------
^
|
|
+--- orig_skb->head
Right now, for every clone we use the generic socket accounting code
that uses skb->truesize.
The alternative to changing truesize is to write an SCTP version of
socket accounting. This is what we did back in 2.6.10 days and we
tried get away from that.
If you have another idea of how we could do this, I'd appreciate it.
>
> BTW, I think the whole chunking mechanism in the SCTP code is the
> largest source of problems and maintainability issues in that stack.
> Any time I want to make major modifications to SKBs to make them
> smaller or whatever, the most difficult piece of code to adjust is
> this code.
>
I think you've already removed all the dependencies between chunking and
SKBs. All we have now are some pointers to skb. This work actually
made the protocol a lot more stable.
Thanks
-vlad
^ permalink raw reply [flat|nested] 4+ messages in thread* Re: rewriting skb->truesize... good or bad idea
2006-10-02 14:46 ` Vlad Yasevich
@ 2006-10-03 23:21 ` David Miller
0 siblings, 0 replies; 4+ messages in thread
From: David Miller @ 2006-10-03 23:21 UTC (permalink / raw)
To: vladislav.yasevich; +Cc: netdev
From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Mon, 02 Oct 2006 10:46:42 -0400
> David Miller wrote:
> > If, on the other hand, you know you have exclusive access to the
> > skb and there are no other references, setting skb->truesize can
> > be OK. However setting it to sizeof(struct sk_buff) doesn't make
> > sense since there is at least:
> >
> > sizeof(struct sk_buff) + sizeof(struct skb_shared_info)
> >
> > memory assosciated with any SKB whatsoever in the kernel. That is,
> > even for a "zero length" skb->data buffer, we still always allocate
> > the skb_shared_info area which must be accounted for.
>
> Well, since we are dealing with clones of the original skb, I didn't
> think that we need to include skb_shared_info. I thought that was
> accounted for in the original skb?
We use the original packet's truesize, as-is, when making
SKB clones. This is our global policy.
The reason for that behavior is the usual case is that the
clone is being attached to a different socket than the
original SKB will. And we very much should charge the memory
fully to every socket that references it.
In your case it seems that the memory is already charged to
the socket, and we would just be charging that memory again
to the same socket. So in this very specific case, it should
be OK not to include skb_shared_info in the truesize calculation.
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2006-10-03 23:21 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-29 18:16 rewriting skb->truesize... good or bad idea Vlad Yasevich
2006-09-29 21:31 ` David Miller
2006-10-02 14:46 ` Vlad Yasevich
2006-10-03 23:21 ` David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).