From: "Michael S. Tsirkin" <mst@redhat.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: David Miller <davem@davemloft.net>,
jakub.kicinski@netronome.com, xiyou.wangcong@gmail.com,
jiri@resnulli.us, netdev@vger.kernel.org
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
Date: Wed, 3 Jan 2018 01:12:55 +0200 [thread overview]
Message-ID: <20180103004619-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <149cba1c-e51a-e3fa-8105-34b95ad2e582@gmail.com>
On Tue, Jan 02, 2018 at 01:27:23PM -0800, John Fastabend wrote:
> On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> > On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Wed, 27 Dec 2017 19:50:25 -0800
> >>>
> >>>> When running consumer and/or producer operations and empty checks in
> >>>> parallel its possible to have the empty check run past the end of the
> >>>> array. The scenario occurs when an empty check is run while
> >>>> __ptr_ring_discard_one() is in progress. Specifically after the
> >>>> consumer_head is incremented but before (consumer_head >= ring_size)
> >>>> check is made and the consumer head is zeroe'd.
> >>>>
> >>>> To resolve this, without having to rework how consumer/producer ops
> >>>> work on the array, simply add an extra dummy slot to the end of the
> >>>> array. Even if we did a rework to avoid the extra slot it looks
> >>>> like the normal case checks would suffer some so best to just
> >>>> allocate an extra pointer.
> >>>>
> >>>> Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
> >>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>>
> >>> Applied, thanks John.
> >>
> >> I think that patch is wrong. I'd rather it got reverted.
> >
> > And replaced with something like the following - stil untested, but
> > apparently there's some urgency to fix it so posting for review ASAP.
> >
>
> So the above ptr_ring patch is meant for the dequeue() case not
> the peek case. Dequeue case shown here,
>
> static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> {
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> struct sk_buff *skb = NULL;
> int band;
>
> for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> struct skb_array *q = band2list(priv, band);
>
> if (__skb_array_empty(q))
> continue;
>
> skb = skb_array_consume_bh(q);
> }
> if (likely(skb)) {
> qdisc_qstats_cpu_backlog_dec(qdisc, skb);
> qdisc_bstats_cpu_update(qdisc, skb);
> qdisc_qstats_cpu_qlen_dec(qdisc);
> }
>
> return skb;
> }
>
> In the dequeue case we use it purely as a hint and then do a proper
> consume (with locks) if needed. A false negative in this case means
> a consume is happening on another cpu due to a reset op or a
> dequeue op. (an aside but I'm evaluating if we should only allow a
> single dequeue'ing cpu at a time more below?) If its a reset op
> that caused the false negative it is OK because we are flushing the
> array anyways. If it is a dequeue op it is also OK because this core
> will abort but the running core will continue to dequeue avoiding a
> stall. So I believe false negatives are OK in the above function.
I'm not 100% sure I understand. We don't always dequeue until empty.
E.g. it seems that another CPU could dequeue an skb and then stop
because e.g. it reached a byte queue limit, then this CPU gets a false
negativesand stops because of that.
More generally, what makes this usage safe?
Is there a way to formalize it at the API level?
> > John, others, could you pls confirm it's not too bad performance-wise?
> > I'll then split it up properly and re-post.
> >
>
> I haven't benchmarked it but in the dequeue case taking a lock for
> every priority even when its empty seems unneeded.
Well it does seem to make the code more comprehensible and more robust.
But OK - I was looking at fixing the unlocked empty API to make sure it
actually does what it's supposed to. I posted a draft earlier in this
thread, it needs to be looked at in depth to figure out whether it can
ever give false negatives or positives, and document the results.
> > -->
> >
> > net: don't misuse ptr_ring_peek
> >
> > ptr_ring_peek only claims to be safe if the result is never
> > dereferenced, which isn't the case for its use in sch_generic.
> > Add locked API variants and use the bh one here.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > ---
> >
>
> [...]
>
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
> > for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> > struct skb_array *q = band2list(priv, band);
> >
> > - skb = __skb_array_peek(q);
> > + skb = skb_array_peek_bh(q);
>
> Ah I should have added a comment here. For now peek() is only used from
> locking qdiscs. So peek and consume/produce operations will never happen
> in parallel. In this case we should never hit the false negative case with
> my patch or the out of bounds reference without my patch.
>
> Doing a peek() op without qdisc lock is a bit problematic anyways. With
> current code another cpu could consume the skb and free it. Either we
> can ensure a single consumer runs at a time on an array (not the same as
> qdisc maybe) or just avoid peek operations in this case. My current plan
> was to just avoid peek() ops altogether, they seem unnecessary with the
> types of qdiscs I want to be build.
>
> Thanks,
> John
For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
> > }
> >
> > return skb;
> >
next prev parent reply other threads:[~2018-01-02 23:12 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-28 3:50 [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds John Fastabend
2018-01-02 16:52 ` David Miller
2018-01-02 17:01 ` Michael S. Tsirkin
2018-01-02 17:17 ` Michael S. Tsirkin
2018-01-02 21:27 ` John Fastabend
2018-01-02 23:12 ` Michael S. Tsirkin [this message]
2018-01-03 0:25 ` John Fastabend
2018-01-03 15:50 ` Michael S. Tsirkin
2018-01-03 17:46 ` John Fastabend
2018-01-03 18:34 ` Michael S. Tsirkin
2018-01-02 18:33 ` David Miller
2018-01-02 16:53 ` Michael S. Tsirkin
2018-01-02 17:01 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20180103004619-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=davem@davemloft.net \
--cc=jakub.kicinski@netronome.com \
--cc=jiri@resnulli.us \
--cc=john.fastabend@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).