netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: David Miller <davem@davemloft.net>,
	jakub.kicinski@netronome.com, xiyou.wangcong@gmail.com,
	jiri@resnulli.us, netdev@vger.kernel.org
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
Date: Wed, 3 Jan 2018 01:12:55 +0200	[thread overview]
Message-ID: <20180103004619-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <149cba1c-e51a-e3fa-8105-34b95ad2e582@gmail.com>

On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel its possible to have the empty check run past the end of the
> >>>> array. The scenario occurs when an empty check is run while
> >>>> __ptr_ring_discard_one() is in progress. Specifically after the
> >>>> consumer_head is incremented but before (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroe'd.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we did a rework to avoid the extra slot it looks
> >>>> like the normal case checks would suffer some so best to just
> >>>> allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> > 
> > And replaced with something like the following - stil untested, but
> > apparently there's some urgency to fix it so posting for review ASAP.
> > 
> 
> So the above ptr_ring patch is meant for the dequeue() case not
> the peek case. Dequeue case shown here,
> 
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
>         struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>         struct sk_buff *skb = NULL;
>         int band;
> 
>         for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>                 struct skb_array *q = band2list(priv, band);
> 
>                 if (__skb_array_empty(q))
>                         continue;
> 
>                 skb = skb_array_consume_bh(q);
>         }
>         if (likely(skb)) {
>                 qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>                 qdisc_bstats_cpu_update(qdisc, skb);
>                 qdisc_qstats_cpu_qlen_dec(qdisc);
>         }
> 
>         return skb;
> }
> 
> In the dequeue case we use it purely as a hint and then do a proper
> consume (with locks) if needed.  A false negative in this case means
> a consume is happening on another cpu due to a reset op or a
> dequeue op. (an aside but I'm evaluating if we should only allow a
> single dequeue'ing cpu at a time more below?) If its a reset op
> that caused the false negative it is OK because we are flushing the
> array anyways. If it is a dequeue op it is also OK because this core
> will abort but the running core will continue to dequeue avoiding a
> stall. So I believe false negatives are OK in the above function.

I'm not 100% sure I understand. We don't always dequeue until empty.
E.g. it seems that another CPU could dequeue an skb and then stop
because e.g. it reached a byte queue limit, then this CPU gets a false
negativesand stops because of that.


More generally, what makes this usage safe?
Is there a way to formalize it at the API level?

> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
> > 
> 
> I haven't benchmarked it but in the dequeue case taking a lock for
> every priority even when its empty seems unneeded.

Well it does seem to make the code more comprehensible and more robust.

But OK -  I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to. I posted a draft earlier in this
thread, it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and document the results.



> > -->
> > 
> > net: don't misuse ptr_ring_peek
> > 
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > ---
> > 
> 
> [...]
> 
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> >  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> >  		struct skb_array *q = band2list(priv, band);
> >  
> > -		skb = __skb_array_peek(q);
> > +		skb = skb_array_peek_bh(q);
> 
> Ah I should have added a comment here. For now peek() is only used from
> locking qdiscs. So peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false negative case with
> my patch or the out of bounds reference without my patch.
> 
> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> current code another cpu could consume the skb and free it. Either we
> can ensure a single consumer runs at a time on an array (not the same as
> qdisc maybe) or just avoid peek operations in this case. My current plan
> was to just avoid peek() ops altogether, they seem unnecessary with the
> types of qdiscs I want to be build.
> 
> Thanks,
> John

For the lockless qdisc, for net-next, do we need the patch above until you fix that though?

> >  	}
> >  
> >  	return skb;
> > 

  reply	other threads:[~2018-01-02 23:12 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-28  3:50 [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds John Fastabend
2018-01-02 16:52 ` David Miller
2018-01-02 17:01   ` Michael S. Tsirkin
2018-01-02 17:17     ` Michael S. Tsirkin
2018-01-02 21:27       ` John Fastabend
2018-01-02 23:12         ` Michael S. Tsirkin [this message]
2018-01-03  0:25           ` John Fastabend
2018-01-03 15:50             ` Michael S. Tsirkin
2018-01-03 17:46               ` John Fastabend
2018-01-03 18:34                 ` Michael S. Tsirkin
2018-01-02 18:33     ` David Miller
2018-01-02 16:53 ` Michael S. Tsirkin
2018-01-02 17:01   ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180103004619-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=davem@davemloft.net \
    --cc=jakub.kicinski@netronome.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).