Re: [RFC][PATCH] VM deadlock prevention core -v3

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>, netdev <netdev@vger.kernel.org>,
	Daniel Phillips <phillips@google.com>,
	David Miller <davem@davemloft.net>, Thomas Graf <tgraf@suug.ch>,
	Indan Zupancic <indan@nul.nu>
Subject: Re: [RFC][PATCH] VM deadlock prevention core -v3
Date: Thu, 10 Aug 2006 16:46:31 +0200	[thread overview]
Message-ID: <1155221191.5696.43.camel@twins> (raw)
In-Reply-To: <20060810140240.GA28989@2ka.mipt.ru>

On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote:
> On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> > Hi,
> 
> Hello, Peter.
> 
> > So I try again, please tell me if I'm still on crack and should go detox.
> > However if you do so, I kindly request some words on the how and why of it.
> 
> I think you should talk with doctor in that case, but not with kernel
> hackers :)
> 
> I have some comments about implementation, not overall design, since we
> have slightly diametral points of view there.
> 
> ....
> 
> > --- linux-2.6.orig/net/core/skbuff.c
> > +++ linux-2.6/net/core/skbuff.c
> > @@ -43,6 +43,7 @@
> >  #include <linux/kernel.h>
> >  #include <linux/sched.h>
> >  #include <linux/mm.h>
> > +#include <linux/pagemap.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/in.h>
> >  #include <linux/inet.h>
> > @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
> >   *
> >   */
> >  
> > +#define ceiling_log2(x)	fls((x) - 1)
> > +
> >  /**
> >   *	__alloc_skb	-	allocate a network buffer
> >   *	@size: size to allocate
> > @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
> >  	struct sk_buff *skb;
> >  	u8 *data;
> >  
> > +	size = SKB_DATA_ALIGN(size);

I moved it here.

> > +
> > +	if (gfp_mask & __GFP_MEMALLOC) {
> > +		/*
> > +		 * Fallback allocation for memalloc reserves.
> > +

		 * This allocator is build on alloc_pages() so that freed
		 * skbuffs return to the memalloc reserve imediately. SLAB
		 * memory might not ever be returned.

This was missing,... 

> > +		 * the page is populated like so:
> > +		 *
> > +		 *   struct sk_buff
> > +		 *   [ struct sk_buff ]
> > +		 *   [ atomic_t ]
> > +		 *   unsigned int
> > +		 *   struct skb_shared_info
> > +		 *   char []
> > +		 *
> > +		 * We have to do higher order allocations for icky jumbo
> > +		 * frame drivers :-(. They really should be migrated to
> > +		 * scather/gather DMA and use skb fragments.
> > +		 */
> > +		unsigned int data_offset =
> > +			sizeof(struct sk_buff) + sizeof(unsigned int);
> > +		unsigned long length = size + data_offset +
> > +			sizeof(struct skb_shared_info);
> > +		unsigned int pages;
> > +		unsigned int order;
> > +		struct page *page;
> > +		void *kaddr;
> > +
> > +		/*
> > +		 * Force fclone alloc in order to fudge a lacking in skb_clone().
> > +		 */
> > +		fclone = 1;
> > +		if (fclone) {
> > +			data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
> > +			length += sizeof(struct sk_buff) + sizeof(atomic_t);
> > +		}
> > +		pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +		order = ceiling_log2(pages);
> > +		skb = NULL;
> > +		if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
> > +			goto out;
> > +
> > +		kaddr = pfn_to_kaddr(page_to_pfn(page));
> > +		skb = (struct sk_buff *)kaddr;
> > +
> > +		*((unsigned int *)(kaddr + data_offset -
> > +					sizeof(unsigned int))) = order;
> > +		data = (u8 *)(kaddr + data_offset);
> > +
> 
> Tricky, but since you are using own allocator here, you could change it to
> be not so aggressive - i.e. do not round size to number of pages.

I'm not sure I follow you, I'm explicitly using
alloc_pages()/free_page(), if
I were to go smart here, I'd loose the whole reason for doing so.

> 
> > +		goto allocated;
> > +	}
> > +
> >  	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> >  
> >  	/* Get the HEAD */
> > @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
> >  		goto out;
> >  
> >  	/* Get the DATA. Size must match skb_add_mtu(). */
> > -	size = SKB_DATA_ALIGN(size);
> 
> Bad sign.

See above.

> >  	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> >  	if (!data)
> >  		goto nodata;
> >  
> > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > +		unsigned length, gfp_t gfp_mask)
> > +{
> > +	struct sk_buff *skb;
> > +
> > +	WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > +	gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > +
> > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > +	if (skb)
> > +		goto done;
> > +
> > +	if (atomic_read(&dev->rx_reserve_used) >=
> > +			dev->rx_reserve * dev->memalloc_socks)
> > +		goto out;
> > +
> > +	/*
> > +	 * pre-inc guards against a race with netdev_wait_memalloc()
> > +	 */
> > +	atomic_inc(&dev->rx_reserve_used);
> > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > +	if (unlikely(!skb)) {
> > +		atomic_dec(&dev->rx_reserve_used);
> > +		goto out;
> > +	}
> 
> Since you have added atomic operation in that path, you can use device's
> reference counter instead and do not care that it can dissapear.

Is that the sole reason taking a reference on the device is bad?

> > @@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
> >  		n->fclone = SKB_FCLONE_CLONE;
> >  		atomic_inc(fclone_ref);
> >  	} else {
> > +		/*
> > +		 * should we special-case skb->memalloc cloning?
> > +		 * for now fudge it by forcing fast-clone alloc.
> > +		 */
> > +		BUG_ON(skb->memalloc);
> > +
> >  		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> >  		if (!n)
> >  			return NULL;
> 
> Ugh... cloning is a one of the shoulders of giant where Linux network
> stack is staying...

Yes, I'm aware of that, I have a plan to fix this, however I haven't had
time
to implement it. My immediate concern is the point wrt. the net_device
mapping.

My idea was: instead of the order, store the size, and allocate clone 
skbuffs in the available room at the end of the page(s), allocating
extra pages
if needed.

> > @@ -686,6 +825,8 @@ int pskb_expand_head(struct sk_buff *skb
> >  	if (skb_shared(skb))
> >  		BUG();
> >  
> > +	BUG_ON(skb->memalloc);
> > +
> >  	size = SKB_DATA_ALIGN(size);
> >  
> >  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> 
> And that is a bug.
> That operation can happen even with usual receiving processing.

Yes, more allocator work..

> > Index: linux-2.6/net/ipv4/af_inet.c
> > ===================================================================
> > --- linux-2.6.orig/net/ipv4/af_inet.c
> > +++ linux-2.6/net/ipv4/af_inet.c
> > @@ -132,6 +132,14 @@ void inet_sock_destruct(struct sock *sk)
> >  {
> >  	struct inet_sock *inet = inet_sk(sk);
> >  
> > +	if (sk_is_memalloc(sk)) {
> > +		struct net_device *dev = ip_dev_find(inet->rcv_saddr);
> > +		if (dev) {
> > +			dev_adjust_memalloc(dev, -1);
> > +			dev_put(dev);
> > +		}
> > +	}
> > +
> 
> This looks very strange - you decrement reference counter both in socket
> destruction code and in netdevice destruction code.

This is the right place; however I was not sure net_device destruction
implied
destruction of all related sockets; in case that is not so, you will
have to
clean up there, because this destructor will not find the device any
longer.

Thanks for the feedback.

next prev parent reply	other threads:[~2006-08-10 14:46 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-08-10 13:32 [RFC][PATCH] VM deadlock prevention core -v3 Peter Zijlstra
2006-08-10 14:02 ` Evgeniy Polyakov
2006-08-10 14:46   ` Peter Zijlstra [this message]
2006-08-10 16:22     ` Evgeniy Polyakov
2006-08-10 16:50       ` Peter Zijlstra
2006-08-11  0:45         ` Indan Zupancic

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1155221191.5696.43.camel@twins \
    --to=a.p.zijlstra@chello.nl \
    --cc=davem@davemloft.net \
    --cc=indan@nul.nu \
    --cc=johnpol@2ka.mipt.ru \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=netdev@vger.kernel.org \
    --cc=phillips@google.com \
    --cc=tgraf@suug.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).