* skb_shared_info()
From: David Miller @ 2006-08-08 23:39 UTC
To: netdev; +Cc: herbert, johnpol
I'm beginning to think that where we store the
skb_shared_info() is a weakness of the SKB design.
It makes it more difficult to have local memory
management schemes and to just wrap SKB's around
arbitrary pieces of data.
The e1000 issue is just one example of this, another
would be any attempt to consolidate the TCP retransmit
queue data management.
How to handle data refcounting and destruction is
the primary matter to address if we want to move
skb_shared_info() somewhere else. But that can probably
be handled with a data destructor callback of some sort,
which we're always talking about adding anyway. :)
* Re: skb_shared_info()
From: Herbert Xu @ 2006-08-09 0:36 UTC
To: David Miller; +Cc: netdev, johnpol
On Tue, Aug 08, 2006 at 04:39:15PM -0700, David Miller wrote:
>
> I'm beginning to think that where we store the
> skb_shared_info() is a weakness of the SKB design.
I'm not sure whether the problem is where we store skb_shared_info,
or the fact that we can't put kmalloc'ed memory into skb_shinfo->frags.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: skb_shared_info()
From: David Miller @ 2006-08-09 0:58 UTC
To: herbert; +Cc: netdev, johnpol
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 9 Aug 2006 10:36:16 +1000
> I'm not sure whether the problem is where we store skb_shared_info,
> or the fact that we can't put kmalloc'ed memory into
> skb_shinfo->frags.
That's a good point.
I guess we could do something like:
	struct skb_frag_struct {
		union {
			struct page	*page;
			void		*data;
		} u;
		__u16	page_offset;
		__u16	size;
	};
and something clever like a special page_offset encoding
means "use data, not page".
* Re: skb_shared_info()
From: Evgeniy Polyakov @ 2006-08-09 5:35 UTC
To: David Miller; +Cc: herbert, netdev
On Tue, Aug 08, 2006 at 05:58:39PM -0700, David Miller (davem@davemloft.net) wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 9 Aug 2006 10:36:16 +1000
>
> > I'm not sure whether the problem is where we store skb_shared_info,
> > or the fact that we can't put kmalloc'ed memory into
> > skb_shinfo->frags.
>
> That's a good point.
That is actually a second problem.
If we want to change memory usage in skbs, we have problems both with
frag_list and with the place where shared info lives: since it is an
add-on, only for 1500-byte MTU does it not change the alignment.
> I guess we could do something like:
>
> 	struct skb_frag_struct {
> 		union {
> 			struct page	*page;
> 			void		*data;
> 		} u;
> 		__u16	page_offset;
> 		__u16	size;
> 	};
>
> and something clever like a special page_offset encoding
> means "use data, not page".
We can tell kmalloc'ed data apart from page data by using the internal
page structures (the lru pointers and the PG_slab bit).
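Roughly like this (just a sketch; the helper name is invented):

	/* Sketch: the slab allocator sets PG_slab on pages it owns, so a
	 * kmalloc'ed buffer can be told apart from a raw page reference
	 * by looking at the page struct behind the address. */
	static inline int skb_data_is_kmalloced(const void *data)
	{
		struct page *page = virt_to_page(data);

		return PageSlab(page);
	}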
--
Evgeniy Polyakov
* Re: skb_shared_info()
From: David Miller @ 2006-08-09 5:50 UTC
To: johnpol; +Cc: herbert, netdev
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Wed, 9 Aug 2006 09:35:24 +0400
> We can tell kmalloc'ed data apart from page data by using the internal
> page structures (the lru pointers and the PG_slab bit).
Yes, it is one idea.
* Re: skb_shared_info()
From: Evgeniy Polyakov @ 2006-08-10 19:49 UTC
To: David Miller; +Cc: netdev, herbert
On Tue, Aug 08, 2006 at 04:39:15PM -0700, David Miller (davem@davemloft.net) wrote:
>
> I'm beginning to think that where we store the
> skb_shared_info() is a weakness of the SKB design.
Food for thought: unix sockets can use PAGE_SIZE'd chunks of memory
(and they almost always do), which turn into 2*PAGE_SIZE allocations due
to the alignment overhead of skb_shared_info, so unix sockets waste 3.5 KB
of memory on each skb. I think it is time to resurrect the idea of placing
shared_info inside the skb and allowing it to be allocated from its own
cache for special cases. What do you think?
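The arithmetic, roughly (illustrative sizes; the real
sizeof(skb_shared_info) depends on the config):

	#include <stdio.h>

	int main(void)
	{
		unsigned int page   = 4096;	/* the data the socket wants */
		unsigned int shinfo = 512;	/* generous sizeof(skb_shared_info) */
		unsigned int alloc  = 8192;	/* kmalloc(page + shinfo) rounds
						 * up to 2*PAGE_SIZE */

		printf("wasted: %u bytes per skb\n", alloc - page - shinfo);
		return 0;
	}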
--
Evgeniy Polyakov
* Re: skb_shared_info()
From: Alexey Kuznetsov @ 2006-08-11 14:00 UTC
To: David Miller; +Cc: netdev, herbert, johnpol
Hello!
>> management schemes and to just wrap SKB's around
>> arbitrary pieces of data.
+
> and something clever like a special page_offset encoding
> means "use data, not page".
But for what purpose do you plan to use it?
> The e1000 issue is just one example of this, another
What is this issue?
As for the aggregated tcp queue: I can guess you did not find a
place to add protocol headers, but I cannot figure out how
adding non-pagecache references could help.
You would rather want more than one skb_shared_info(): at least two,
one immutable, another for headers.
I think Evgeniy's idea of inlining skb_shared_info into the skb head
is promising and simple enough. The whole point of a shared skb_shared_info
was to make cloning fast. But it makes a lot of sense to inline some short
vector into the skb head (and, probably, even a MAX_HEADER space _instead_
of the space for fclone).
With an aggregated tcp send queue, when transmitting a segment, you could
allocate a new skb head with space for the header and either take the existing
skb_shared_info from the queue, attach it to the head and set offset/length,
or, alternatively, set one or two page pointers in the array inlined in the head.
(F.e. in the case of the AF_UNIX socket mentioned by Evgeniy, we would keep data
in pages and attach it directly to the skb head.)
Cloning becomes more expensive, but who needs it cheap, if tcp does not?
Returning to "arbitrary pieces of data".
Page cache references in skb_shared_info are unique thing,
get_page()/page_cache_release() are enough to clone data.
But it is not enough even for such simple thing as splice().
It wants we remembered some strange "pipe_buffer", where each page
is wrapped together with some ops, flags and even pipe_inode_info :-),
and called some destructor, when it is released. First thought is that
it is insane: it does not respect page cache logic, requires we implemented
additional level of refcounting, abuses amount of information, which
have to be stored in skb beyond all the limits of sanity.
But the second thought is that something like this is required in any case.
At least we must report to someone when a page is not in use and
can be recycled. I think Evgeniy knows more about this, AIO has
the same issue. But this is simpler, because release callback can be done
not per fragment or even per-skb, but actually per-transaction.
One idea is to declare (some) skb_shared_infos completely immutable and
force each layer that needs to add a header or to fragment to refer
to the original skb_shared_info as a whole, using another skb_shared_info()
or an area inlined in the skb head for its modifications.
And if someone is not able to, he must reallocate all the pages.
In this case the destructor/notification can be done not per fragment,
but for the whole aggregated skb_shared_info. It seems it will work both
with the aggregated tcp queue and with udp.
Alexey
* Re: skb_shared_info()
From: David Miller @ 2006-08-12 0:27 UTC
To: kuznet; +Cc: netdev, herbert, johnpol
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Fri, 11 Aug 2006 18:00:19 +0400
> > The e1000 issue is just one example of this, another
>
> What is this issue?
E1000 wants 16K buffers for jumbo MTU settings.
The reason is that the chip can only handle power-of-2 buffer
sizes, and next hop from 9K is 16K.
It is not possible to tell the chip to accept only 9K packets; you
must give it the whole next power-of-2 buffer size for the MTU you
wish to use.
With skb_shared_info() overhead this becomes a 32K allocation
in the simplest implementation.
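To spell out the rounding (illustrative userspace arithmetic, with a
rough stand-in for the shared info size):

	#include <stdio.h>

	/* kmalloc serves requests from power-of-2 sized caches. */
	static unsigned int kmalloc_size(unsigned int size)
	{
		unsigned int n = 32;

		while (n < size)
			n <<= 1;
		return n;
	}

	int main(void)
	{
		unsigned int chip_buf = 16384;	/* what the chip wants */
		unsigned int shinfo   = 256;	/* rough sizeof(skb_shared_info) */

		/* 16K + shared info overflows the 16K cache, so we fall
		 * through to the next power of two: 32K. */
		printf("%u\n", kmalloc_size(chip_buf + shinfo));
		return 0;
	}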
Whichever hardware person was trying to save some trace lines on the
chip should have consulted software folks before implementing things
this way :-)
> As for the aggregated tcp queue: I can guess you did not find a
> place to add protocol headers, but I cannot figure out how
> adding non-pagecache references could help.
This is not the idea. I'm trying to see if we can salvage non-SG
paths in the design.
The idea is that "struct retransmit_queue" entries could hold either
paged or non-paged data, based upon the capabilities of the transmit
device.
If we store raw kmalloc buffers, we cannot attach them to an arbitrary
skb because of skb_shared_info(). This is true even if we
purposefully allocate the necessary head room for these kmalloc based
buffers.
Its requirement to live in the skb->data area really does preclude
any kind of clever buffering scheme.
> I think Evgeniy's idea of inlining skb_shared_info into the skb head
> is promising and simple enough.
I think you are talking about "struct sk_buff" area when you say "skb
head". It is confusing talk because when I hear this phrase my brain
says "skb->head" which exactly where we want to move skb_shared_info()
away from! :-)
> But it makes a lot of sense to inline some short vector into the skb
> head (and, probably, even a MAX_HEADER space _instead_ of the space
> for fclone).
If inlined, one implementation of the retransmit queue becomes apparent:
"struct retransmit_queue" is just a list of data blobs, each represented
by the usual vector of pages with offset and length.
Transmission from the write queue is then merely propagating pages into
the inline skb vector.
Some silly sample datastructure:
	struct retransmit_block {
		struct list_head	rblk_node;
		void			*rblk_data;
		unsigned short		rblk_frags;
		skb_frag_t		frags[MAX_SKB_FRAGS];
	};

	struct retransmit_queue {
		struct list_head	rqueue_head;
		struct retransmit_block	*rqueue_send_head;
		int			rqueue_send_head_frag;
		unsigned int		rqueue_send_head_off;
	};
tcp_sendmsg() and tcp_sendpage() just accumulate into the tail
retransmit_block until all of MAX_SKB_FRAGS are consumed.
tcp_write_xmit() and friends build skbs like this:
	struct sk_buff *tcp_xmit_build(struct retransmit_block *rblk,
				       int frag, unsigned int off,
				       unsigned int len)
	{
		struct sk_buff *skb = alloc_skb(MAX_HEADER, GFP_KERNEL);
		int ent;

		if (unlikely(!skb))
			return NULL;

		ent = 0;
		while (len) {
			unsigned int this_off = rblk->frags[frag].page_offset + off;
			unsigned int this_len = rblk->frags[frag].size - off;

			if (this_len > len)
				this_len = len;

			skb->inline_info.frags[ent].page =
				rblk->frags[frag].page;
			skb->inline_info.frags[ent].page_offset = this_off;
			skb->inline_info.frags[ent].size = this_len;

			ent++;
			frag++;
			off = 0;
			len -= this_len;
		}

		return skb;
	}
(Sorry, another outer loop is also needed, to traverse to the subsequent
retransmit_blocks in the list once all rblk_frags of the current
retransmit_block have been consumed by the inner loop.)
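That outer traversal would look something like this (a sketch only,
reusing the sample structures above; the inner body is elided and
end-of-queue handling is omitted):

	while (len) {
		while (len && frag < rblk->rblk_frags) {
			/* ... inner loop from tcp_xmit_build() above,
			 * filling skb->inline_info.frags[ent++] and
			 * decrementing len ... */
		}
		if (len) {
			/* current block exhausted, step to the next one */
			rblk = list_entry(rblk->rblk_node.next,
					  struct retransmit_block, rblk_node);
			frag = 0;
			off = 0;
		}
	}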
Depending upon how we do completion callbacks, as you discuss below,
either we'll need a get_page() refcount grab in that inner loop
or we won't.
> With an aggregated tcp send queue, when transmitting a segment, you could
> allocate a new skb head with space for the header and either take the existing
> skb_shared_info from the queue, attach it to the head and set offset/length,
> or, alternatively, set one or two page pointers in the array inlined in the head.
> (F.e. in the case of the AF_UNIX socket mentioned by Evgeniy, we would keep data
> in pages and attach it directly to the skb head.)
The latter scheme is closer to what I was thinking about. Why
not inline this entire fraglist thing alongside sk_buff?
> Cloning becomes more expensive, but who needs it cheap, if tcp does not?
Exactly :)
> One idea is to declare (some) skb_shared_infos completely immutable and
> force each layer that needs to add a header or to fragment to refer
> to the original skb_shared_info as a whole, using another skb_shared_info()
> or an area inlined in the skb head for its modifications.
> And if someone is not able to, he must reallocate all the pages.
> In this case the destructor/notification can be done not per fragment,
> but for the whole aggregated skb_shared_info. It seems it will work both
> with the aggregated tcp queue and with udp.
It sounds interesting... so using my sample datastructures above
for aggregated tcp queue, such notifications would be made on
(for example) "retransmit_block" objects?
* Re: skb_shared_info()
From: Herbert Xu @ 2006-08-12 0:32 UTC
To: David Miller; +Cc: kuznet, netdev, johnpol
On Fri, Aug 11, 2006 at 05:27:30PM -0700, David Miller wrote:
>
> E1000 wants 16K buffers for jumbo MTU settings.
>
> The reason is that the chip can only handle power-of-2 buffer
> sizes, and next hop from 9K is 16K.
>
> It is not possible to tell the chip to accept only 9K packets; you
> must give it the whole next power-of-2 buffer size for the MTU you
> wish to use.
>
> With skb_shared_info() overhead this becomes a 32K allocation
> in the simplest implementation.
I think this is no longer an issue, because we've all come to the conclusion
that E1000 supports SG and therefore we can and should use 4K pages, no?
Whatever we do here, allocating 16K via kmalloc is very unlikely to
succeed in the near future, so we have to go the SG route anyway.
In which case we must have a head area which can accommodate the
skb_shared_info, as we do now.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: skb_shared_info()
From: Alexey Kuznetsov @ 2006-08-13 13:04 UTC
To: David Miller; +Cc: netdev, herbert, johnpol
Hello!
> E1000 wants 16K buffers for jumbo MTU settings.
>
> The reason is that the chip can only handle power-of-2 buffer
> sizes, and next hop from 9K is 16K.
Let it use pages. Someone should start. :-)
High order allocations are a disaster in any case.
> If we store raw kmalloc buffers, we cannot attach them to an arbitrary
> skb because of skb_shared_info(). This is true even if we
> purposefully allocate the necessary head room for these kmalloc based
> buffers.
I still do not see it. For non-SG paths, you have to keep header and data
together; you just do not need anything clever, and it does not look
like you could do anything clever anyway. If you have fixed space for a header,
you already have space for a refcnt to make an skb_shared_info().
For SG paths, TCP_PAGE() seems to be enough and even better than
kmalloc()ed buffers.
> I think you are talking about "struct sk_buff"
Yes. :-)
> If inlined, one implementation of the retransmit queue becomes apparent:
> "struct retransmit_queue" is just a list of data blobs, each represented
> by the usual vector of pages with offset and length.
>
> Transmission from the write queue is then merely propagating pages into
> the inline skb vector.
Yes. I have no idea how good it is, but at least it is something
different from what we have now. :-)
It seems too simple to be correct, though. See below.
> The latter scheme is closer to what I was thinking about. Why
> not inline this entire fraglist thing alongside sk_buff?
Sorry, that crap about cost of clone still clouds my mind. :-)
Actually, the borderline was not between "short" and "long",
but between "plain page vector" and "not-so-plain". See below.
> It sounds interesting... so using my sample datastructures above
> for aggregated tcp queue, such notifications would be made on
> (for example) "retransmit_block" objects?
Not quite.
TCP is too easy a case. If you remember, I did some ugly but 100% functional
zero-copy send() without any "architectural" changes: async_seq on the socket
side and a SIOCOUTQ/SIOCOUTWAITQ + POLLWRCOMP extension to poll() on the api
side were enough to signal when to recycle pages. This sucks for two reasons:
1. A small one. Something like a packet sniffer could hold those pages
after they are acked. If they are recycled, it will see wrong content.
It seems we can live with this. TCP works, and that's all that is really
important.
2. Non-TCP. No queue. Notifications have to be done from the skb destructor,
which means the skb has to carry all the info somewhere.
With UDP it could be the original, non-fragmented skb_shared_info().
Provided we prohibit changing it and cloning page references,
the point of recycling is when this skb_shared_info() is destroyed.
This is one place where skb_shared_info() does not want to be
inlined.
For TCP the idea was to do the same. If a transmitted skb does not have
separate pages in it, referring to an aggregated skb_shared_info()
instead, it would work. This is the second such place.
That was the first approach. Probably the best, but utterly inflexible:
f.e. it is impossible to merge two tcp segments from different sources.
I am afraid per-fragment destructors are still required.
And this is so not cool. :-) Carrying space in each skb for 18
destructors and the necessary data looks too scary.
I see two ways to do this; both treat the case when destructors
are required as exceptional.
1. Multiple skb_shared_infos per skb. Some of them are nothing but
a wrapper around a page, with a refcounter and the information
necessary for destruction.
So, in each skb you have the vector:

	struct {
		union {
			struct page		*page;
			struct skb_shared_info	*shinfo;
		} datum;
		unsigned int	offset, size;	/* selects an area inside datum */
	} data[MAX_SKB_FRAGS_INLINED];
The pointers could be distinguished by a selector hidden in the low bits
(see the sketch after this list). Sorry. :)
struct skb_shared_info is essentially preserved; only some ops and space
for the clone/release information are added, and frag_list is removed.
It seems flexible enough and even reduces overhead sometimes
(for SG you have only a struct sk_buff with MAX_HEADER inlined and
direct references to pages).
But the code promises to be ugly: sometimes page references are direct,
sometimes they are indirect.
2. Another idea is to preserve the current structure, but for those
"pathological" cases when special notifications are required, to
allocate an optional additional vector which could hold
"transaction tokens" (crap in the style of pipe_inode_info for splice(),
arbitrarily obscure and complicated). The advantage is that
it keeps "normal" skbs small and clean.
Alexey
* Re: skb_shared_info()
From: Andi Kleen @ 2006-08-14 5:04 UTC
To: David Miller; +Cc: netdev, herbert, johnpol
> The e1000 issue is just one example of this, another
> would be any attempt to consolidate the TCP retransmit
> queue data management.
Another reason to move it into the sk_buff would be better cache
coloring. Currently, for both large and small MTU packets it will always
land on the same colors.
-Andi
* Re: skb_shared_info()
From: Herbert Xu @ 2006-08-14 7:29 UTC
To: Andi Kleen; +Cc: David Miller, netdev, johnpol
On Mon, Aug 14, 2006 at 07:04:01AM +0200, Andi Kleen wrote:
>
> Another reason to move it into the sk_buff would be better cache
> coloring. Currently, for both large and small MTU packets it will always
> land on the same colors.
If we went with fully paged skbs this should be a non-issue, right?
In a fully paged representation the head would be small, which makes it
the perfect place to put the skb_shared_info.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: skb_shared_info()
From: Andi Kleen @ 2006-08-14 7:45 UTC
To: Herbert Xu; +Cc: David Miller, netdev, johnpol
On Monday 14 August 2006 09:29, Herbert Xu wrote:
> On Mon, Aug 14, 2006 at 07:04:01AM +0200, Andi Kleen wrote:
> >
> > Another reason to move it into the sk_buff would be better cache
> > coloring. Currently, for both large and small MTU packets it will always
> > land on the same colors.
>
> If we went with fully paged skbs this should be a non-issue, right?
Even for 1.5k MTU? (which is still the most common case after all)
> In a fully paged representation the head would be small, which makes it
> the perfect place to put the skb_shared_info.
If the head is small and allocated with kmalloc, then slab should
color it, yes. It won't for big allocations.
-Andi
* Re: skb_shared_info()
From: Herbert Xu @ 2006-08-14 7:50 UTC
To: Andi Kleen; +Cc: David Miller, netdev, johnpol
On Mon, Aug 14, 2006 at 09:45:53AM +0200, Andi Kleen wrote:
>
> Even for 1.5k MTU? (which is still the most common case after all)
Ideally they would stay in kmalloc memory. Could you explain the cache
colouring problem for 1500-byte packets?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: skb_shared_info()
From: Evgeniy Polyakov @ 2006-08-14 8:01 UTC
To: Alexey Kuznetsov; +Cc: David Miller, netdev, herbert
On Sun, Aug 13, 2006 at 05:04:31PM +0400, Alexey Kuznetsov (kuznet@ms2.inr.ac.ru) wrote:
> Hello!
>
> > E1000 wants 16K buffers for jumbo MTU settings.
> >
> > The reason is that the chip can only handle power-of-2 buffer
> > sizes, and next hop from 9K is 16K.
>
> Let it use pages. Someone should start. :-)
>
> High order allocations are a disaster in any case.
>
>
> > If we store raw kmalloc buffers, we cannot attach them to an arbitrary
> > skb because of skb_shared_info(). This is true even if we
> > purposefully allocate the necessary head room for these kmalloc based
> > buffers.
>
> I still do not see it. For non-SG paths, you have to keep header and data
> together; you just do not need anything clever, and it does not look
> like you could do anything clever anyway. If you have fixed space for a header,
> you already have space for a refcnt to make an skb_shared_info().
>
> For SG paths, TCP_PAGE() seems to be enough and even better than
> kmalloc()ed buffers.
What if we store an array of pages in the shared info, just like we
have right now? Then there will be only one destructor.
e1000 would set up the head/data/tail pointers to point into the area
in the first sg page.
Those drivers which do not have sg support can use the head/data pointers
as before, but the destructor would have to change in that case.
--
Evgeniy Polyakov
* Re: skb_shared_info()
From: Andi Kleen @ 2006-08-14 8:18 UTC
To: Herbert Xu; +Cc: David Miller, netdev, johnpol
On Monday 14 August 2006 09:50, Herbert Xu wrote:
> On Mon, Aug 14, 2006 at 09:45:53AM +0200, Andi Kleen wrote:
> >
> > Even for 1.5k MTU? (which is still the most common case after all)
>
> Ideally they would stay in kmalloc memory. Could you explain the cache
> colouring problem for 1500-byte packets?
kmalloc rounds the 1.5k up to 2k, and with that exactly two objects
fit into a 4k page. That leaves slab no space at all
to do cache coloring.
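Concretely (illustrative arithmetic):

	#include <stdio.h>

	int main(void)
	{
		unsigned int page = 4096, obj = 2048;
		unsigned int spare = page % obj;	/* bytes left for color offsets */

		/* Slab coloring shifts the first object of each slab by a
		 * multiple of the cache line size; with 0 spare bytes, every
		 * object lands on the same offsets, i.e. the same colors. */
		printf("objects per page: %u, spare: %u\n", page / obj, spare);
		return 0;
	}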
-Andi
* Re: skb_shared_info()
From: Alexey Kuznetsov @ 2006-08-14 13:19 UTC
To: Evgeniy Polyakov; +Cc: David Miller, netdev, herbert
Hello!
> e1000 would set up the head/data/tail pointers to point into the area
> in the first sg page.
Maybe.
But I still hope this is not necessary: the driver should be able to do
at least primitive header splitting, in which case the header could
be inlined into the skb.
Alternatively, the header can be copied with skb_pull(). Not good, but
it is at least a regular procedure, which f.e. allows keeping the pages
in highmem.
> What if we store an array of pages in the shared info, just like we
> have right now? Then there will be only one destructor.
It seems this is not enough. Actually, I was asking about your experience
with aio.
The simplest situation that confuses me, translated into aio language:
two senders send two pages. Those pages are to be recycled when we
are done. We have to notify both. And we have to merge both pages
into one segment.
The second half of that mail suggested three different solutions,
all of them creepy. :-)
Alexey
* Re: skb_shared_info()
From: Evgeniy Polyakov @ 2006-08-14 13:38 UTC
To: Alexey Kuznetsov; +Cc: David Miller, netdev, herbert
On Mon, Aug 14, 2006 at 05:19:19PM +0400, Alexey Kuznetsov (kuznet@ms2.inr.ac.ru) wrote:
> > What if we store an array of pages in the shared info, just like we
> > have right now? Then there will be only one destructor.
>
> It seems this is not enough. Actually, I was asking about your experience
> with aio.
>
> The simplest situation that confuses me, translated into aio language:
> two senders send two pages. Those pages are to be recycled when we
> are done. We have to notify both. And we have to merge both pages
> into one segment.
Since they will be sent by different users, they will be separate chunks,
each of which has its own destructor; so yes, one destructor per page in
the described case, but in general it will be one per block of pages.
Blocks can then be combined into a queue with additional blocks of
kmalloc'ed data...
Mda, that does not sound very good.
> The second half of that mail suggested three different solutions,
> all of them creepy. :-)
I still like the existing way: it is much simpler (I hope :) to convince
the e1000 developers to fix the driver's memory usage by putting
skb_shared_info(), or only a pointer to it, into the skb.
With a pointer it would be simpler to do refcounting, but it requires
an additional allocation, which is not cheap.
Inlining skb_shared_info can lead to trouble with cloning/sharing,
although it is also possible to do.
The idea of having the header inside struct sk_buff is good until we strike
something which creates a header whose size exceeds the preallocated
area (is MAX_TCP_HEADER enough for everything? what about ipv6 options?)...
So as a first step I would try to create an extended alloc_skb() with a
pointer to shared_info (crap) (in cases of appropriate sizes, i.e. when
the aligned size exceeds a power of two and so on), and let e1000 and unix
sockets use that. If there are additional requirements for other
global changes like you suggested, we can break things more
fundamentally :)
> Alexey
--
Evgeniy Polyakov
* Re: skb_shared_info()
From: Alexey Kuznetsov @ 2006-08-15 11:30 UTC
To: Evgeniy Polyakov; +Cc: David Miller, netdev, herbert
Hello!
> I still like the existing way: it is much simpler (I hope :) to convince
> the e1000 developers to fix the driver's memory usage
e1000 is not a problem at all. It just has to use pages.
If it is going to use high order allocations, it will suck,
be it order 3 or 2.
> area (is MAX_TCP_HEADER enough for everything? what about ipv6 options?)...
This does not matter. All the cases with pathologically extended
headers take the slow path, with a non-trivial pskb_pull(). That is what
all those options are invented for.
> unix
> sockets use that.
Totally not worth it, unless it is part of something bigger.
> global changes like you suggested, we can break things more
> fundamentally :)
_You_ have to. :-) Sender zerocopy stalled only because of this.
Alexey