From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>, netdev <netdev@vger.kernel.org>,
Daniel Phillips <phillips@google.com>,
David Miller <davem@davemloft.net>, Thomas Graf <tgraf@suug.ch>,
Indan Zupancic <indan@nul.nu>
Subject: Re: [RFC][PATCH] VM deadlock prevention core -v3
Date: Thu, 10 Aug 2006 18:50:40 +0200 [thread overview]
Message-ID: <1155228640.5696.55.camel@twins> (raw)
In-Reply-To: <20060810162229.GA27364@2ka.mipt.ru>
> > > Tricky, but since you are using own allocator here, you could change it to
> > > be not so aggressive - i.e. do not round size to number of pages.
> >
> > I'm not sure I follow you, I'm explicitly using
> > alloc_pages()/free_page(), if
> > I were to go smart here, I'd loose the whole reason for doing so.
>
> You can use page to put there several skbs for example or at least add
> there a fclone (fast clone).
fclone support is there.
> > > > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > > > + unsigned length, gfp_t gfp_mask)
> > > > +{
> > > > + struct sk_buff *skb;
> > > > +
> > > > + WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > > > + gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > > > +
> > > > + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > > > + if (skb)
> > > > + goto done;
> > > > +
> > > > + if (atomic_read(&dev->rx_reserve_used) >=
> > > > + dev->rx_reserve * dev->memalloc_socks)
> > > > + goto out;
> > > > +
> > > > + /*
> > > > + * pre-inc guards against a race with netdev_wait_memalloc()
> > > > + */
> > > > + atomic_inc(&dev->rx_reserve_used);
> > > > + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > > > + if (unlikely(!skb)) {
> > > > + atomic_dec(&dev->rx_reserve_used);
> > > > + goto out;
> > > > + }
> > >
> > > Since you have added atomic operation in that path, you can use device's
> > > reference counter instead and do not care that it can dissapear.
> >
> > Is that the sole reason taking a reference on the device is bad?
>
> Taking a reference is bad due to performance reasons, since atomic
> increment is not that cheap. If you do it for one variable for the
> purpose of reference counting you can use device's refcnt istead, which
> will solve some races.
Yes, I understand you. However I'm not sure if performance is the only
reason not to take a refcount on the device. Anyway, I think I have just
been convinced to abandon the per device thing and go global.
> > > > @@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
> > > > n->fclone = SKB_FCLONE_CLONE;
> > > > atomic_inc(fclone_ref);
> > > > } else {
> > > > + /*
> > > > + * should we special-case skb->memalloc cloning?
> > > > + * for now fudge it by forcing fast-clone alloc.
> > > > + */
> > > > + BUG_ON(skb->memalloc);
> > > > +
> > > > n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> > > > if (!n)
> > > > return NULL;
> > >
> > > Ugh... cloning is a one of the shoulders of giant where Linux network
> > > stack is staying...
> >
> > Yes, I'm aware of that, I have a plan to fix this, however I haven't had
> > time
> > to implement it. My immediate concern is the point wrt. the net_device
> > mapping.
> >
> > My idea was: instead of the order, store the size, and allocate clone
> > skbuffs in the available room at the end of the page(s), allocating
> > extra pages
> > if needed.
>
> You can check if requested skb with fclone fits allocated pages, and if
> so use fclone magic, otherwise postpone clone allocation until it is
> required.
Yes the fclone magic works, however that will only let you have one
clone.
I'm just not confident no receive path will ever exceed that.
> Sockets can live without network devices at all, I expect it is enough
> to clean up in socket destructor, since packets can come from
> different devices into the same socket.
You are right if the reserve wasn't device bound - which I will abandon
because you are right that with multi-path routing, bridge device and
other advanced goodies this scheme is broken in that there is no
unambiguous
mapping from sockets to devices.
WARNING: multiple messages have this Message-ID (diff)
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
linux-mm <linux-mm@kvack.org>, netdev <netdev@vger.kernel.org>,
Daniel Phillips <phillips@google.com>,
David Miller <davem@davemloft.net>, Thomas Graf <tgraf@suug.ch>,
Indan Zupancic <indan@nul.nu>
Subject: Re: [RFC][PATCH] VM deadlock prevention core -v3
Date: Thu, 10 Aug 2006 18:50:40 +0200 [thread overview]
Message-ID: <1155228640.5696.55.camel@twins> (raw)
In-Reply-To: <20060810162229.GA27364@2ka.mipt.ru>
> > > Tricky, but since you are using own allocator here, you could change it to
> > > be not so aggressive - i.e. do not round size to number of pages.
> >
> > I'm not sure I follow you, I'm explicitly using
> > alloc_pages()/free_page(), if
> > I were to go smart here, I'd loose the whole reason for doing so.
>
> You can use page to put there several skbs for example or at least add
> there a fclone (fast clone).
fclone support is there.
> > > > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > > > + unsigned length, gfp_t gfp_mask)
> > > > +{
> > > > + struct sk_buff *skb;
> > > > +
> > > > + WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > > > + gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > > > +
> > > > + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > > > + if (skb)
> > > > + goto done;
> > > > +
> > > > + if (atomic_read(&dev->rx_reserve_used) >=
> > > > + dev->rx_reserve * dev->memalloc_socks)
> > > > + goto out;
> > > > +
> > > > + /*
> > > > + * pre-inc guards against a race with netdev_wait_memalloc()
> > > > + */
> > > > + atomic_inc(&dev->rx_reserve_used);
> > > > + skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > > > + if (unlikely(!skb)) {
> > > > + atomic_dec(&dev->rx_reserve_used);
> > > > + goto out;
> > > > + }
> > >
> > > Since you have added atomic operation in that path, you can use device's
> > > reference counter instead and do not care that it can dissapear.
> >
> > Is that the sole reason taking a reference on the device is bad?
>
> Taking a reference is bad due to performance reasons, since atomic
> increment is not that cheap. If you do it for one variable for the
> purpose of reference counting you can use device's refcnt istead, which
> will solve some races.
Yes, I understand you. However I'm not sure if performance is the only
reason not to take a refcount on the device. Anyway, I think I have just
been convinced to abandon the per device thing and go global.
> > > > @@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
> > > > n->fclone = SKB_FCLONE_CLONE;
> > > > atomic_inc(fclone_ref);
> > > > } else {
> > > > + /*
> > > > + * should we special-case skb->memalloc cloning?
> > > > + * for now fudge it by forcing fast-clone alloc.
> > > > + */
> > > > + BUG_ON(skb->memalloc);
> > > > +
> > > > n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> > > > if (!n)
> > > > return NULL;
> > >
> > > Ugh... cloning is a one of the shoulders of giant where Linux network
> > > stack is staying...
> >
> > Yes, I'm aware of that, I have a plan to fix this, however I haven't had
> > time
> > to implement it. My immediate concern is the point wrt. the net_device
> > mapping.
> >
> > My idea was: instead of the order, store the size, and allocate clone
> > skbuffs in the available room at the end of the page(s), allocating
> > extra pages
> > if needed.
>
> You can check if requested skb with fclone fits allocated pages, and if
> so use fclone magic, otherwise postpone clone allocation until it is
> required.
Yes the fclone magic works, however that will only let you have one
clone.
I'm just not confident no receive path will ever exceed that.
> Sockets can live without network devices at all, I expect it is enough
> to clean up in socket destructor, since packets can come from
> different devices into the same socket.
You are right if the reserve wasn't device bound - which I will abandon
because you are right that with multi-path routing, bridge device and
other advanced goodies this scheme is broken in that there is no
unambiguous
mapping from sockets to devices.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2006-08-10 16:50 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-08-10 13:32 [RFC][PATCH] VM deadlock prevention core -v3 Peter Zijlstra
2006-08-10 13:32 ` Peter Zijlstra
2006-08-10 14:02 ` Evgeniy Polyakov
2006-08-10 14:02 ` Evgeniy Polyakov
2006-08-10 14:46 ` Peter Zijlstra
2006-08-10 14:46 ` Peter Zijlstra
2006-08-10 16:22 ` Evgeniy Polyakov
2006-08-10 16:22 ` Evgeniy Polyakov
2006-08-10 16:50 ` Peter Zijlstra [this message]
2006-08-10 16:50 ` Peter Zijlstra
2006-08-11 0:45 ` Indan Zupancic
2006-08-11 0:45 ` Indan Zupancic
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1155228640.5696.55.camel@twins \
--to=a.p.zijlstra@chello.nl \
--cc=davem@davemloft.net \
--cc=indan@nul.nu \
--cc=johnpol@2ka.mipt.ru \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=netdev@vger.kernel.org \
--cc=phillips@google.com \
--cc=tgraf@suug.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.