Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: pull request: wireless-next 2014-10-03
From: John W. Linville @ 2014-10-07 14:18 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, netdev, linux-kernel, larry.finger
In-Reply-To: <20141007.004724.187788111776188480.davem@davemloft.net>

On Tue, Oct 07, 2014 at 12:47:24AM -0400, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 05 Oct 2014 21:38:53 -0400 (EDT)
> 
> > From: David Miller <davem@davemloft.net>
> > Date: Sun, 05 Oct 2014 21:35:11 -0400 (EDT)
> > 
> >> From: "John W. Linville" <linville@tuxdriver.com>
> >> Date: Fri, 3 Oct 2014 14:01:52 -0400
> >> 
> >>> Please pull tihs batch of updates intended for the 3.18 stream!
> >> 
> >> Pulled, thanks for the stellar pull request text, as always.
> > 
> > John, what's the deal with the following?  Will it be resolved by the
> > driver being removed from the staging tree?
> > 
> > WARNING: drivers/staging/rtl8192ee/r8192ee: 'rtl_evm_dbm_jaguar' exported twice. Previous export was in drivers/net/wireless/rtlwifi/rtlwifi.ko
> 
> John, ping?

Sorry, Dave -- sendmail died on my local machine and I didn't notice
until now... :-(

Anyway...yes, the staging driver should be removed by a patch in
Greg's queue.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)
From: Jesper Dangaard Brouer @ 2014-10-07 13:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, therbert, hannes, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke, brouer
In-Reply-To: <1412686038.11091.111.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, 07 Oct 2014 05:47:18 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2014-10-07 at 09:34 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 03 Oct 2014 16:30:44 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > Another problem we need to address is the quota in __qdisc_run()
> > > is no longer meaningfull, if each qdisc_restart() can pump many packets.
> > 
> > I fully agree. My earlier "magic" packet limit was covering/pampering
> > over this issue.
> 
> Although quota was multiplied by 7 or 8 in worst case ?

Yes, exactly not a very elegant solution ;-)


> > > An idea would be to use the bstats (or cpu_qstats if applicable)
> > 
> > Please elaborate some more, as I don't completely follow (feel free to
> > show with a patch ;-)).
> > 
> 
> I was hoping John could finish the percpu stats before I do that.
> 
> Problem with q->bstats.packets is that TSO packets with 45 MSS add 45 to
> this counter.
> 
> Using a time quota would be better, but : jiffies is too big, and
> local_clock() might be too expensive.

Hannes hacked up this patch for me... (didn't finish testing)

The basic idea is we want keep/restore the quota fairness between
qdisc's , that we sort of broke with commit 5772e9a346 ("qdisc: bulk
dequeue support for qdiscs with TCQ_F_ONETXQUEUE").

We choose not to account for the number of packets inside the TSO/GSO
packets ("skb_gso_segs").  As the previous fairness also had this "defect".

You might view this as a short term solution, until you can fix it with
your q->bstats.packets or time quota?


[RFC PATCH] net_sched: restore quota limits after bulk dequeue

--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -57,17 +57,19 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
                                 struct sk_buff *skb,
-                                const struct netdev_queue *txq)
+                                const struct netdev_queue *txq,
+                                int *quota)
 {
        int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
 
-       while (bytelimit > 0) {
+       while (bytelimit > 0 && *quota > 0) {
                struct sk_buff *nskb = q->dequeue(q);
 
                if (!nskb)
                        break;
 
                bytelimit -= nskb->len; /* covers GSO len */
+               --*quota;
                skb->next = nskb;
                skb = nskb;
        }
@@ -77,7 +79,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
-static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
+static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, int *quota)
 {
        struct sk_buff *skb = q->gso_skb;
        const struct netdev_queue *txq = q->dev_queue;
@@ -87,18 +89,25 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
                /* check the reason of requeuing without tx lock first */
                txq = skb_get_tx_queue(txq->dev, skb);
                if (!netif_xmit_frozen_or_stopped(txq)) {
+                       struct sk_buff *iskb = skb;
+
                        q->gso_skb = NULL;
                        q->q.qlen--;
-               } else
+                       do
+                               --*quota;
+                       while ((iskb = skb->next));
+               } else {
                        skb = NULL;
+               }
                /* skb in gso_skb were already validated */
                *validate = false;
        } else {
                if (!(q->flags & TCQ_F_ONETXQUEUE) ||
                    !netif_xmit_frozen_or_stopped(txq)) {
                        skb = q->dequeue(q);
+                       --*quota;
                        if (skb && qdisc_may_bulk(q))
-                               try_bulk_dequeue_skb(q, skb, txq);
+                               try_bulk_dequeue_skb(q, skb, txq, quota);
                }
        }
        return skb;
@@ -204,7 +213,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *                             >0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q)
+static inline int qdisc_restart(struct Qdisc *q, int *quota)
 {
        struct netdev_queue *txq;
        struct net_device *dev;
@@ -213,7 +222,7 @@ static inline int qdisc_restart(struct Qdisc *q)
        bool validate;
 
        /* Dequeue packet */
-       skb = dequeue_skb(q, &validate);
+       skb = dequeue_skb(q, &validate, quota);
        if (unlikely(!skb))
                return 0;
 
@@ -230,13 +239,13 @@ void __qdisc_run(struct Qdisc *q)
 {
        int quota = weight_p;
 
-       while (qdisc_restart(q)) {
+       while (qdisc_restart(q, &quota)) {
                /*
                 * Ordered by possible occurrence: Postpone processing if
                 * 1. we've exceeded packet quota
                 * 2. another process needs the CPU;
                 */
-               if (--quota <= 0 || need_resched()) {
+               if (quota <= 0 || need_resched()) {
                        __netif_schedule(q);
                        break;
                }

^ permalink raw reply

* RE: [PATCH] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: David Laight @ 2014-10-07 14:23 UTC (permalink / raw)
  To: 'Lothar Waßmann'
  Cc: netdev@vger.kernel.org, David S. Miller, Russell King, Frank Li,
	Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <20141007161702.7cd1e800@ipc1.ka-ro>

From: Lothar
> David Laight wrote:
> > From: Lothar Waßmann
> > > commit 1b7bde6d659d ("net: fec: implement rx_copybreak to improve rx performance")
> > > introduced a regression for i.MX28. The swap_buffer() function doing
> > > the endian conversion of the received data on i.MX28 may access memory
> > > beyond the actual packet size in the DMA buffer. fec_enet_copybreak()
> > > does not copy those bytes, so that the last bytes of a packet may be
> > > filled with invalid data after swapping.
> > > This will likely lead to checksum errors on received packets.
> > > E.g. when trying to mount an NFS rootfs:
> > > UDP: bad checksum. From 192.168.1.225:111 to 192.168.100.73:44662 ulen 36
> > >
> > > Do the byte swapping and copying to the new skb in one go if
> > > necessary.
> >
> > ISTM that if you need to do the 'swap' you should copy the data regardless
> > of the length.
> >
> The swap function has to look at at most 3 bytes beyond the actual
> packet length. That is what the original swap_buffer() function does and
> what the new function swap_buffer2(), that does the endian swapping
> while copying to the new buffer, also does.

I understood the bug.

The point I was making is that if you have to do a read-write of the received
data (to byteswap it) then you might as well always copy it into a new skb that
is just big enough for the actual receive frame.

	David


^ permalink raw reply

* Re: [PATCH] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: Eric Dumazet @ 2014-10-07 14:27 UTC (permalink / raw)
  To: Lothar Waßmann
  Cc: netdev, David S. Miller, Russell King, Frank Li, Fabio Estevam,
	linux-kernel
In-Reply-To: <1412687977-11742-1-git-send-email-LW@KARO-electronics.de>

On Tue, 2014-10-07 at 15:19 +0200, Lothar Waßmann wrote:
> commit 1b7bde6d659d ("net: fec: implement rx_copybreak to improve rx performance")
> introduced a regression for i.MX28. The swap_buffer() function doing
> the endian conversion of the received data on i.MX28 may access memory
> beyond the actual packet size in the DMA buffer. fec_enet_copybreak()
> does not copy those bytes, so that the last bytes of a packet may be
> filled with invalid data after swapping.
> This will likely lead to checksum errors on received packets.
> E.g. when trying to mount an NFS rootfs:
> UDP: bad checksum. From 192.168.1.225:111 to 192.168.100.73:44662 ulen 36
> 
> Do the byte swapping and copying to the new skb in one go if
> necessary.
> 
> Signed-off-by: Lothar Waßmann <LW@KARO-electronics.de>
> ---
>  drivers/net/ethernet/freescale/fec_main.c |   25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 87975b5..eaaebad 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -339,6 +339,18 @@ static void *swap_buffer(void *bufaddr, int len)
>  	return bufaddr;
>  }
>  
> +static void *swap_buffer2(void *dst_buf, void *src_buf, int len)
> +{
> +	int i;
> +	unsigned int *src = src_buf;
> +	unsigned int *dst = dst_buf;
> +
> +	for (i = 0; i < DIV_ROUND_UP(len, 4); i++, src++, dst++)
> +		*dst = cpu_to_be32(*src);

No need for the DIV :

	for (i = 0; i < len; i += sizeof(*dst), src++, dst++)
		*dst = cpu_to_be32(*src);

Also are you sure both src/dst are aligned to word boundaries, or is
this architecture OK with possible misalignment ?

^ permalink raw reply

* RE: [PATCH] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: Eric Dumazet @ 2014-10-07 14:34 UTC (permalink / raw)
  To: David Laight
  Cc: 'Lothar Waßmann', netdev@vger.kernel.org,
	David S. Miller, Russell King, Frank Li, Fabio Estevam,
	linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D174C60B1@AcuExch.aculab.com>

On Tue, 2014-10-07 at 14:23 +0000, David Laight wrote:

> The point I was making is that if you have to do a read-write of the received
> data (to byteswap it) then you might as well always copy it into a new skb that
> is just big enough for the actual receive frame.

+1

^ permalink raw reply

* RE: [PATCH] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: David Laight @ 2014-10-07 14:41 UTC (permalink / raw)
  To: 'Eric Dumazet', Lothar Waßmann
  Cc: netdev@vger.kernel.org, David S. Miller, Russell King, Frank Li,
	Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <1412692034.11091.122.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet
> On Tue, 2014-10-07 at 15:19 +0200, Lothar Wamann wrote:
> > commit 1b7bde6d659d ("net: fec: implement rx_copybreak to improve rx performance")
> > introduced a regression for i.MX28. The swap_buffer() function doing
> > the endian conversion of the received data on i.MX28 may access memory
> > beyond the actual packet size in the DMA buffer. fec_enet_copybreak()
> > does not copy those bytes, so that the last bytes of a packet may be
> > filled with invalid data after swapping.
> > This will likely lead to checksum errors on received packets.
> > E.g. when trying to mount an NFS rootfs:
> > UDP: bad checksum. From 192.168.1.225:111 to 192.168.100.73:44662 ulen 36
> >
> > Do the byte swapping and copying to the new skb in one go if
> > necessary.
> >
> > Signed-off-by: Lothar Wamann <LW@KARO-electronics.de>
> > ---
> >  drivers/net/ethernet/freescale/fec_main.c |   25 +++++++++++++++++++++----
> >  1 file changed, 21 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> > index 87975b5..eaaebad 100644
> > --- a/drivers/net/ethernet/freescale/fec_main.c
> > +++ b/drivers/net/ethernet/freescale/fec_main.c
> > @@ -339,6 +339,18 @@ static void *swap_buffer(void *bufaddr, int len)
> >  	return bufaddr;
> >  }
> >
> > +static void *swap_buffer2(void *dst_buf, void *src_buf, int len)
> > +{
> > +	int i;
> > +	unsigned int *src = src_buf;
> > +	unsigned int *dst = dst_buf;
> > +
> > +	for (i = 0; i < DIV_ROUND_UP(len, 4); i++, src++, dst++)
> > +		*dst = cpu_to_be32(*src);
> 
> No need for the DIV :
> 
> 	for (i = 0; i < len; i += sizeof(*dst), src++, dst++)
> 		*dst = cpu_to_be32(*src);
> 
> Also are you sure both src/dst are aligned to word boundaries, or is
> this architecture OK with possible misalignment ?

I wondered about that as well.
I wouldn't have expected ppc to support misaligned transfers, and you'd also
want to make sure that cpu_to_be(*src) was using a byte-swapping instruction.
Hmmm... cpu_to_be() doesn't sound like the right 'swap' macro name.

It looks as though this is processing an entire ethernet frame, in which
case the data (destination address) would normally be on an 4n+2 boundary.
The two bytes before the data can be safely copied (as can the rounding bytes at
the end) - so provided the pointers allow for this it should be fine.
OTOH I've NFI what the actual data is.

	David


^ permalink raw reply

* Re: Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)
From: Hannes Frederic Sowa @ 2014-10-07 14:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet
  Cc: David Miller, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <20141007153050.792c9743@redhat.com>

On Tue, Oct 7, 2014, at 15:30, Jesper Dangaard Brouer wrote:
> On Tue, 07 Oct 2014 05:47:18 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2014-10-07 at 09:34 +0200, Jesper Dangaard Brouer wrote:
> > > On Fri, 03 Oct 2014 16:30:44 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > 
> > > > Another problem we need to address is the quota in __qdisc_run()
> > > > is no longer meaningfull, if each qdisc_restart() can pump many packets.
> > > 
> > > I fully agree. My earlier "magic" packet limit was covering/pampering
> > > over this issue.
> > 
> > Although quota was multiplied by 7 or 8 in worst case ?
> 
> Yes, exactly not a very elegant solution ;-)
> 
> 
> > > > An idea would be to use the bstats (or cpu_qstats if applicable)
> > > 
> > > Please elaborate some more, as I don't completely follow (feel free to
> > > show with a patch ;-)).
> > > 
> > 
> > I was hoping John could finish the percpu stats before I do that.
> > 
> > Problem with q->bstats.packets is that TSO packets with 45 MSS add 45 to
> > this counter.
> > 
> > Using a time quota would be better, but : jiffies is too big, and
> > local_clock() might be too expensive.
> 
> Hannes hacked up this patch for me... (didn't finish testing)
> 
> The basic idea is we want keep/restore the quota fairness between
> qdisc's , that we sort of broke with commit 5772e9a346 ("qdisc: bulk
> dequeue support for qdiscs with TCQ_F_ONETXQUEUE").
> 
> We choose not to account for the number of packets inside the TSO/GSO
> packets ("skb_gso_segs").  As the previous fairness also had this
> "defect".
> 
> You might view this as a short term solution, until you can fix it with
> your q->bstats.packets or time quota?
> 
> 
> [RFC PATCH] net_sched: restore quota limits after bulk dequeue
> 
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -57,17 +57,19 @@ static inline int dev_requeue_skb(struct sk_buff
> *skb, struct Qdisc *q)
>  
>  static void try_bulk_dequeue_skb(struct Qdisc *q,
>                                  struct sk_buff *skb,
> -                                const struct netdev_queue *txq)
> +                                const struct netdev_queue *txq,
> +                                int *quota)
>  {
>         int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
>  
> -       while (bytelimit > 0) {
> +       while (bytelimit > 0 && *quota > 0) {
>                 struct sk_buff *nskb = q->dequeue(q);
>  
>                 if (!nskb)
>                         break;
>  
>                 bytelimit -= nskb->len; /* covers GSO len */
> +               --*quota;
>                 skb->next = nskb;
>                 skb = nskb;
>         }
> @@ -77,7 +79,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>  /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
>   * A requeued skb (via q->gso_skb) can also be a SKB list.
>   */
> -static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
> +static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate, int
> *quota)
>  {
>         struct sk_buff *skb = q->gso_skb;
>         const struct netdev_queue *txq = q->dev_queue;
> @@ -87,18 +89,25 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q,
> bool *validate)
>                 /* check the reason of requeuing without tx lock first */
>                 txq = skb_get_tx_queue(txq->dev, skb);
>                 if (!netif_xmit_frozen_or_stopped(txq)) {
> +                       struct sk_buff *iskb = skb;
> +
>                         q->gso_skb = NULL;
>                         q->q.qlen--;
> -               } else
> +                       do
> +                               --*quota;
> +                       while ((iskb = skb->next));

This needs to be:

do
   ...
while ((iskb = iskb->next))

Bye,
Hannes

^ permalink raw reply

* Re: pull request: wireless-next 2014-10-03
From: Larry Finger @ 2014-10-07 14:45 UTC (permalink / raw)
  To: David Miller, linville; +Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <20141007.004724.187788111776188480.davem@davemloft.net>

On 10/06/2014 11:47 PM, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sun, 05 Oct 2014 21:38:53 -0400 (EDT)
>
>> From: David Miller <davem@davemloft.net>
>> Date: Sun, 05 Oct 2014 21:35:11 -0400 (EDT)
>>
>>> From: "John W. Linville" <linville@tuxdriver.com>
>>> Date: Fri, 3 Oct 2014 14:01:52 -0400
>>>
>>>> Please pull tihs batch of updates intended for the 3.18 stream!
>>>
>>> Pulled, thanks for the stellar pull request text, as always.
>>
>> John, what's the deal with the following?  Will it be resolved by the
>> driver being removed from the staging tree?
>>
>> WARNING: drivers/staging/rtl8192ee/r8192ee: 'rtl_evm_dbm_jaguar' exported twice. Previous export was in drivers/net/wireless/rtlwifi/rtlwifi.ko
>
> John, ping?

David,

Yes, that message will go away once the removal of the staging driver, which was 
handled by GregKH, gets merged with the changes from John.

Larry

^ permalink raw reply

* Re: [PATCH 2/9] netfilter: move nf_send_resetX() code to nf_reject_ipvX modules
From: Eric Dumazet @ 2014-10-07 14:45 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev
In-Reply-To: <1412356824-6362-3-git-send-email-pablo@netfilter.org>

On Fri, 2014-10-03 at 19:20 +0200, Pablo Neira Ayuso wrote:
> Move nf_send_reset() and nf_send_reset6() to nf_reject_ipv4 and
> nf_reject_ipv6 respectively. This code is shared by x_tables and
> nf_tables.
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> ---
>  include/net/netfilter/ipv4/nf_reject.h |  118 +----------------------
>  net/ipv4/netfilter/Kconfig             |    6 ++
>  net/ipv4/netfilter/Makefile            |    3 +
>  net/ipv4/netfilter/nf_reject_ipv4.c    |  127 +++++++++++++++++++++++++
>  net/ipv6/netfilter/Kconfig             |    6 ++
>  net/ipv6/netfilter/Makefile            |    3 +
>  net/ipv6/netfilter/nf_reject_ipv6.c    |  163 ++++++++++++++++++++++++++++++++
>  7 files changed, 309 insertions(+), 117 deletions(-)
>  create mode 100644 net/ipv4/netfilter/nf_reject_ipv4.c
>  create mode 100644 net/ipv6/netfilter/nf_reject_ipv6.c

I am confused

We now have two copies of nf_send_reset6()

$ git grep -n nf_send_reset6
include/net/netfilter/ipv6/nf_reject.h:21:static void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
net/bridge/netfilter/nft_reject_bridge.c:49:                    nf_send_reset6(net, pkt->skb, pkt->ops->hooknum);
net/ipv6/netfilter/ip6t_REJECT.c:66:            nf_send_reset6(net, skb, par->hooknum);
net/ipv6/netfilter/nf_reject_ipv6.c:14:void nf_send_reset6(struct net *net, struct sk_buff *oldskb, int hook)
net/ipv6/netfilter/nf_reject_ipv6.c:163:EXPORT_SYMBOL_GPL(nf_send_reset6);
net/ipv6/netfilter/nft_reject_ipv6.c:35:                nf_send_reset6(net, pkt->skb, pkt->ops->hooknum);
net/netfilter/nft_reject_inet.c:49:                     nf_send_reset6(net, pkt->skb, pkt->ops->hooknum);

^ permalink raw reply

* Re: Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)
From: Eric Dumazet @ 2014-10-07 15:01 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, David Miller, netdev, therbert, fw,
	dborkman, jhs, alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <1412693013.4140057.176149461.0FBF6CBD@webmail.messagingengine.com>

On Tue, 2014-10-07 at 16:43 +0200, Hannes Frederic Sowa wrote:

> This needs to be:
> 
> do
>    ...
> while ((iskb = iskb->next))

I do not feel needed to break the bulk dequeue at precise quota
boundary. These quotas are advisory, and bql prefers to get its full
budget for appropriate feedback from TX completion.

Quota was a packet quota, which was quite irrelevant if segmentation had
to be done, so I would just let the dequeue be done so that we benefit
from optimal xmit_more.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2b349a4de3c8e3491fad210a9400d26bda5b52fe..581ba0dcc2474f325d1c0b3e1dc957648f11992f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -58,7 +58,8 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
-				 const struct netdev_queue *txq)
+				 const struct netdev_queue *txq,
+				 int *packets)
 {
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
 
@@ -71,6 +72,7 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 		bytelimit -= nskb->len; /* covers GSO len */
 		skb->next = nskb;
 		skb = nskb;
+		(*packets)++;
 	}
 	skb->next = NULL;
 }
@@ -78,11 +80,13 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 /* Note that dequeue_skb can possibly return a SKB list (via skb->next).
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
-static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
+static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
+				   int *packets)
 {
 	struct sk_buff *skb = q->gso_skb;
 	const struct netdev_queue *txq = q->dev_queue;
 
+	*packets = 1; /* yes, this might be not accurate, only if BQL is wrong */
 	*validate = true;
 	if (unlikely(skb)) {
 		/* check the reason of requeuing without tx lock first */
@@ -99,7 +103,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate)
 		    !netif_xmit_frozen_or_stopped(txq)) {
 			skb = q->dequeue(q);
 			if (skb && qdisc_may_bulk(q))
-				try_bulk_dequeue_skb(q, skb, txq);
+				try_bulk_dequeue_skb(q, skb, txq, packets);
 		}
 	}
 	return skb;
@@ -205,7 +209,7 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *				>0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct Qdisc *q)
+static inline int qdisc_restart(struct Qdisc *q, int *packets)
 {
 	struct netdev_queue *txq;
 	struct net_device *dev;
@@ -214,7 +218,7 @@ static inline int qdisc_restart(struct Qdisc *q)
 	bool validate;
 
 	/* Dequeue packet */
-	skb = dequeue_skb(q, &validate);
+	skb = dequeue_skb(q, &validate, packets);
 	if (unlikely(!skb))
 		return 0;
 
@@ -230,14 +234,16 @@ static inline int qdisc_restart(struct Qdisc *q)
 void __qdisc_run(struct Qdisc *q)
 {
 	int quota = weight_p;
+	int packets;
 
-	while (qdisc_restart(q)) {
+	while (qdisc_restart(q, &packets)) {
 		/*
 		 * Ordered by possible occurrence: Postpone processing if
 		 * 1. we've exceeded packet quota
 		 * 2. another process needs the CPU;
 		 */
-		if (--quota <= 0 || need_resched()) {
+		quota -= packets;
+		if (quota <= 0 || need_resched()) {
 			__netif_schedule(q);
 			break;
 		}

^ permalink raw reply related

* Re: [PATCH v2 1/1] tcp: fixing TLP's FIN recovery
From: Josh Hunt @ 2014-10-07 15:03 UTC (permalink / raw)
  To: David Miller
  Cc: per.hurtig, Eric Dumazet, panweiping3, nanditad, netdev,
	anna.brunstrom, mohammad.rajiullah, ncardwell, sergei.shtylyov
In-Reply-To: <20140612.110611.584155630422154338.davem@davemloft.net>

On Thu, Jun 12, 2014 at 1:06 PM, David Miller <davem@davemloft.net> wrote:
> From: Per Hurtig <per.hurtig@kau.se>
> Date: Thu, 12 Jun 2014 17:08:32 +0200
>
>> Fix to a problem observed when losing a FIN segment that does not
>> contain data.  In such situations, TLP is unable to recover from
>> *any* tail loss and instead adds at least PTO ms to the
>> retransmission process, i.e., RTO = RTO + PTO.
>>
>> Signed-off-by: Per Hurtig <per.hurtig@kau.se>
>
> Applied, thanks.

Can we queue this up for stable? 2cd0d743b05e87 (tcp: fix
tcp_match_skb_to_sack() for unaligned SACK at end of an skb) is
already in stable and based on the changelog was put in place to fix a
case that this patch introduced:

"This was visible now because the recently simplified TLP logic in
 bef1909ee3ed1c ("tcp: fixing TLP's FIN recovery") could find that 0-byte
 skb at the end of the write queue, and now that we do not check that
 skb's length we could send it as a TLP probe."

However, the patch to fix TLP's FIN recovery is not in -stable.

Thanks
-- 
Josh

^ permalink raw reply

* Re: Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)
From: Eric Dumazet @ 2014-10-07 15:06 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, David Miller, netdev, therbert, fw,
	dborkman, jhs, alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <1412694080.11091.131.camel@edumazet-glaptop2.roam.corp.google.com>

On Tue, 2014-10-07 at 08:01 -0700, Eric Dumazet wrote:

> Quota was a packet quota, which was quite irrelevant if segmentation had
> to be done, so I would just let the dequeue be done so that we benefit
> from optimal xmit_more.

And it also is better to allow receiver to get full LRO/GRO aggregation,
if we do not break GSO train in multiple parts.

^ permalink raw reply

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Zhou, Danny @ 2014-10-07 15:21 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Daniel Borkmann, Florian Westphal, gerlitz.or@gmail.com,
	Hannes Frederic Sowa, Network Development, Ronciak, John,
	Amir Vadai, Eric Dumazet
In-Reply-To: <CA+FuTSfwTBtZLu7CXh4bPxUtVubOvpCPo+O38BsnSiLdvV_KEA@mail.gmail.com>

> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Tuesday, October 07, 2014 12:24 PM
> To: John Fastabend
> Cc: Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network Development; Ronciak, John; Amir
> Vadai; Eric Dumazet; Zhou, Danny
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> > Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support the interface,
> 
> I would not impose his constraint. There may be legitimate use
> cases for taking over all queues of a device. For instance, when
> this is a secondary nic that does not carry any control traffic.

For the secondary NIC that carries the data plane traffics only, you can use UIO or VFIO 
to map the entire NIC's entire I/O space to user space. Then the user space poll-mode driver, 
like those have been supported and open-sourced in DPDK and those supports 
Mellanox/Emulex NICs but not open-sourced, can drive the NIC as a sole driver in user space. 

> 
> > Typically in an af_packet interface a packet_type handler is
> > registered and used to filter traffic to the socket and do other
> > things such as fan out traffic to multiple sockets. In this case the
> > networking stack is being bypassed so this code is not run. So the
> > hardware must push the correct traffic to the queues obtained from
> > the ndo callback ndo_split_queue_pairs().
> 
> Why does the interface work at the level of queue_pairs instead of
> individual queues?

The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O 
on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and 
egress traffics respectively, although the flow director only applies to ingress traffics.

> 
> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> 
> If ethtool -U is used to steer traffic to a specific descriptor queue,
> then the setsockopt can pass the exact id of that queue and there
> is no need for a getsockopt follow-up.

Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via 
setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and 
available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info 
as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable. 
But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need 
to call getsocketopt.

> 
> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> 
> How will packet data be mapped and how will userspace translate
> from paddr to vaddr? Is the goal to maintain long lived mappings
> and instruct drivers to allocate from this restricted range (to
> avoid per-packet system calls and vma operations)?
> 

Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues 
completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages 
are translated to paddr) to fill in the packet descriptors.
As of security concern raised in previous discussion, the reason we think(BTW, correct me if I am wrong)
af_packet  is most suitable is because only user application with root permission is allowed to successfully 
split-off queue pairs and mmap a small window of PCIe I/O space to user space, so concern regarding "device 
can DMA from/to any arbitrary physical memory." is not that big. As all user space device drivers based on 
UIO mechanism has the same concern issue, VFIO adds protection but it is based on IOMMU which is
specific to Intel silicons.

> For throughput-oriented workloads, the syscall overhead
> involved in kicking the nic (on tx, or for increasing the ring
> consumer index on rx) can be amortized. And the operation
> can perhaps piggy-back on interrupts or other events
> (as long as interrupts are not disabled for full userspace
> polling). Latency would be harder to satisfy while maintaining
> some kernel policy enforcement. An extreme solution
> uses an asynchronously busy polling kernel worker thread
> (at high cycle cost, so acceptable for few workloads).
> 
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly. Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.
> 
> One assumption underlying exposing the hardware descriptors
> is that they are quire similar between devices. How true is this
> in the context of formats that span multiple descriptors?
> 

Packet descriptors format varies for different vendors. On Intel NICs, 1G/10G/40G NICs have 
totally different formats. Even a same Intel 10G/40G NIC supports at least 2 different descriptor 
formats. IMHO, the idea behind those patches intent to skip descriptor difference among devices, 
as it just maps certain I/O space pages to user space and user space "slave" NIC driver can handle 
it using different descriptor struct based on vendor/device ID. But I am open to add support of generic 
packet descriptor format description, per David M' suggestion.

> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be
> > + *     handed to the callee for management. After this returns the
> > + *     driver will not use the queues.
> 
> Are these queues also taken out of ethtool management, or is
> this equivalent to taking removing them from the rss set with
> ethtool -X?

As a master driver, the NIC kernel driver still takes control of flow director as a ethtool backend. Generally, 
not all queues are initialized and used by NIC kernel driver, which reports actually used rx/tx numbers to stacks. 
Before splitting off certain queues, if you want use ethtool to direct traffics to those unused queues, 
ethtool reports invalid argument. Once certain stack-unaware queues are allocated for user space slave driver, 
ethtool allows directing packets to them as the NIC driver maintains a data struct about which queues are visible 
and used by kernel, which are used by user space.

^ permalink raw reply

* Re: Quota in __qdisc_run() (was: qdisc: validate skb without holding lock)
From: Jesper Dangaard Brouer @ 2014-10-07 15:26 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Eric Dumazet, David Miller, netdev, therbert, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke, brouer
In-Reply-To: <1412693013.4140057.176149461.0FBF6CBD@webmail.messagingengine.com>

On Tue, 07 Oct 2014 16:43:33 +0200
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On Tue, Oct 7, 2014, at 15:30, Jesper Dangaard Brouer wrote:
[...]
> > 
> > The basic idea is we want keep/restore the quota fairness between
> > qdisc's , that we sort of broke with commit 5772e9a346 ("qdisc: bulk
> > dequeue support for qdiscs with TCQ_F_ONETXQUEUE").
> > 
[...]
> 
> This needs to be:
> 
> do
>    ...
> while ((iskb = iskb->next))

Check, testing with this update, now.

My netperf-wrapper test with GSO=off TSO=off, looks much more stable at
keeping the 10G link fully utilized.  Before, without this patch, I
could not get stable results at 10G with GSO=off TSO=off.  Think this
really does address the fairness (I didn't think I would be able to
measure it).

The other cases (GSO=on,TSO=off) and (GSO=on,TSO=on) looks the same.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH 08/16] virtio_net: drop config_enable
From: David Miller @ 2014-10-07 15:36 UTC (permalink / raw)
  To: mst; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20141007064915.GA2424@redhat.com>

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 7 Oct 2014 09:49:15 +0300

> On Mon, Oct 06, 2014 at 03:02:38PM -0400, David Miller wrote:
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>> Date: Sun, 5 Oct 2014 19:07:13 +0300
>> 
>> > Now that virtio core ensures config changes don't arrive during probing,
>> > drop config_enable flag in virtio net.
>> > On removal, flush is now sufficient to guarantee that no change work is
>> > queued.
>> > 
>> > This help simplify the driver, and will allow setting DRIVER_OK earlier
>> > without losing config change notifications.
>> > 
>> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> 
>> It's hard for people on the networking side to review these changes
>> since you haven't CC:'d them on any of the postings necessary to
>> understand the context of the net/ and drivers/net/ changes.
>> 
>> Please at a minimum CC: everyone on your header [PATCH 0/N] posting
>> so we know at least at a high level what is going on, and why.
>> 
>> Thanks.
> 
> It's a bit tricky for large patchsets - if I add everyone to 0/N
> then vger isn't happy with Cc list that is too large.
> 
> What is your advice here? Cc just mailing lists on 0/N?
> 
> FWIW this patchset is inteded for the virtio tree.

CC: mailing lists and "focus" developers, a small carefully selected
group of people who would be strongly interested in this change.

I really don't understand why this is so complicated, I've never run
into a situation where I had to CC: 200 people in my two decades of
kernel development :-/

^ permalink raw reply

* Re: [PATCH 08/16] virtio_net: drop config_enable
From: Michael S. Tsirkin @ 2014-10-07 15:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20141007.113643.1462528588440998168.davem@davemloft.net>

On Tue, Oct 07, 2014 at 11:36:43AM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Tue, 7 Oct 2014 09:49:15 +0300
> 
> > On Mon, Oct 06, 2014 at 03:02:38PM -0400, David Miller wrote:
> >> From: "Michael S. Tsirkin" <mst@redhat.com>
> >> Date: Sun, 5 Oct 2014 19:07:13 +0300
> >> 
> >> > Now that virtio core ensures config changes don't arrive during probing,
> >> > drop config_enable flag in virtio net.
> >> > On removal, flush is now sufficient to guarantee that no change work is
> >> > queued.
> >> > 
> >> > This help simplify the driver, and will allow setting DRIVER_OK earlier
> >> > without losing config change notifications.
> >> > 
> >> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> 
> >> It's hard for people on the networking side to review these changes
> >> since you haven't CC:'d them on any of the postings necessary to
> >> understand the context of the net/ and drivers/net/ changes.
> >> 
> >> Please at a minimum CC: everyone on your header [PATCH 0/N] posting
> >> so we know at least at a high level what is going on, and why.
> >> 
> >> Thanks.
> > 
> > It's a bit tricky for large patchsets - if I add everyone to 0/N
> > then vger isn't happy with Cc list that is too large.
> > 
> > What is your advice here? Cc just mailing lists on 0/N?
> > 
> > FWIW this patchset is inteded for the virtio tree.
> 
> CC: mailing lists and "focus" developers, a small carefully selected
> group of people who would be strongly interested in this change.
> 
> I really don't understand why this is so complicated, I've never run
> into a situation where I had to CC: 200 people in my two decades of
> kernel development :-/

Will do for the next version, thanks!

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: David Miller @ 2014-10-07 15:43 UTC (permalink / raw)
  To: David.Laight
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D174C5C35@AcuExch.aculab.com>

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 09:27:03 +0000

> That is (probably) the only scheme that stops the application
> accessing random parts of physical memory.

I don't know where this claim keeps coming from, it's false.

The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.

There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.

There is protection and control.

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Willem de Bruijn @ 2014-10-07 15:46 UTC (permalink / raw)
  To: Zhou, Danny
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet
In-Reply-To: <DFDF335405C17848924A094BC35766CF0A953308@SHSMSX104.ccr.corp.intel.com>

>> > Typically in an af_packet interface a packet_type handler is
>> > registered and used to filter traffic to the socket and do other
>> > things such as fan out traffic to multiple sockets. In this case the
>> > networking stack is being bypassed so this code is not run. So the
>> > hardware must push the correct traffic to the queues obtained from
>> > the ndo callback ndo_split_queue_pairs().
>>
>> Why does the interface work at the level of queue_pairs instead of
>> individual queues?
>
> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
> egress traffics respectively, although the flow director only applies to ingress traffics.

That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.

Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.


>>
>> >         /* Get the layout of ring space offset, page_sz, cnt */
>> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> >                    &info, &optlen);
>> >
>> >         /* request some queues from the driver */
>> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>> >
>> >         /* if we let the driver pick us queues learn which queues
>> >          * we were given
>> >          */
>> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>>
>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>> then the setsockopt can pass the exact id of that queue and there
>> is no need for a getsockopt follow-up.
>
> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
> to call getsocketopt.

One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically rollback the configuration change when the
process dies.

>
>>
>> >         /* And mmap queue pairs to user space */
>> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> >              MAP_SHARED, fd, 0);
>>
>> How will packet data be mapped and how will userspace translate
>> from paddr to vaddr? Is the goal to maintain long lived mappings
>> and instruct drivers to allocate from this restricted range (to
>> avoid per-packet system calls and vma operations)?
>>
>
> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
> are translated to paddr) to fill in the packet descriptors.

Ah, userspace is responsible for posting buffers and translation
from vaddr to paddr is straightforward. Yes that makes sense.

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: John Fastabend @ 2014-10-07 15:55 UTC (permalink / raw)
  To: Willem de Bruijn, Zhou, Danny
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet
In-Reply-To: <CA+FuTSe=vo1-Xpk+318SNc-mCH_c0WQadXo3usiA_dRBNx_fEQ@mail.gmail.com>

On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
>>>> Typically in an af_packet interface a packet_type handler is
>>>> registered and used to filter traffic to the socket and do other
>>>> things such as fan out traffic to multiple sockets. In this case the
>>>> networking stack is being bypassed so this code is not run. So the
>>>> hardware must push the correct traffic to the queues obtained from
>>>> the ndo callback ndo_split_queue_pairs().
>>>
>>> Why does the interface work at the level of queue_pairs instead of
>>> individual queues?
>>
>> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
>> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
>> egress traffics respectively, although the flow director only applies to ingress traffics.
> 
> That requirement of co-allocation is absent in existing packet
> rings. Many applications only receive or transmit. For
> receive-only, it would even be possible to map descriptor
> rings read-only, if the kernel remains responsible for posting
> buffers -- but I see below that that is not the case, so that's
> not very relevant here.
> 
> Still, some workloads want asymmetric sets of rx and tx rings.
> For instance, instead of using RSS, a process may want to
> receive on as few rings as possible, load balance across
> workers in software, but still give each worker thread its own
> private transmit ring.
> 

We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.

> 
>>>
>>>>         /* Get the layout of ring space offset, page_sz, cnt */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>>>>                    &info, &optlen);
>>>>
>>>>         /* request some queues from the driver */
>>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>>
>>>>         /* if we let the driver pick us queues learn which queues
>>>>          * we were given
>>>>          */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>
>>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>>> then the setsockopt can pass the exact id of that queue and there
>>> is no need for a getsockopt follow-up.
>>
>> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
>> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
>> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
>> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
>> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
>> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
>> to call getsocketopt.
> 
> One step further would be to move the entire configuration behind
> the packet socket interface. It's perhaps out of scope of this patch,
> but the difference between using `ethtool -U` and passing the same
> expression through the packet socket is that in the latter case the
> kernel can automatically rollback the configuration change when the
> process dies.
> 

hmm might be interesting I think  this is a follow on path to
investigate after the initial support.

>>
>>>
>>>>         /* And mmap queue pairs to user space */
>>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>>>>              MAP_SHARED, fd, 0);
>>>
>>> How will packet data be mapped and how will userspace translate
>>> from paddr to vaddr? Is the goal to maintain long lived mappings
>>> and instruct drivers to allocate from this restricted range (to
>>> avoid per-packet system calls and vma operations)?
>>>
>>
>> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
>> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
>> are translated to paddr) to fill in the packet descriptors.
> 
> Ah, userspace is responsible for posting buffers and translation
> from vaddr to paddr is straightforward. Yes that makes sense.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: David Laight @ 2014-10-07 15:59 UTC (permalink / raw)
  To: 'David Miller'
  Cc: willemb@google.com, john.fastabend@gmail.com, dborkman@redhat.com,
	fw@strlen.de, gerlitz.or@gmail.com, hannes@stressinduktion.org,
	netdev@vger.kernel.org, john.ronciak@intel.com,
	amirv@mellanox.com, eric.dumazet@gmail.com, danny.zhou@intel.com
In-Reply-To: <20141007.114341.1417644866461362364.davem@davemloft.net>

From: David 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 7 Oct 2014 09:27:03 +0000
> 
> > That is (probably) the only scheme that stops the application
> > accessing random parts of physical memory.
> 
> I don't know where this claim keeps coming from, it's false.
> 
> The application has to attach memory to the ring, and then the
> ring can only refer to that memory for the duration of the
> session.
> 
> There is no way that the user can program the address field of the
> descriptors to point at arbitrary physical memory locations.
> 
> There is protection and control.

I got the impression that the application was directly writing the ring
structure that the ethernet mac hardware uses to describe tx and rx buffers.
(ie they are mapped read-write into userspace).
Unless you have a system where you can limit the physical memory
ranges accessible to the mac hardware, I don't see how you can stop
the application putting rogue values into the ring.

Clearly I'm missing something in my quick read of the change.

	David

^ permalink raw reply

* Re: [iproute2 1/1] RFC: obsolete direct invocation of police
From: Stephen Hemminger @ 2014-10-07 16:04 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: netdev, xiyou.wangcong, john.r.fastabend
In-Reply-To: <5433CF39.5030202@mojatatu.com>

On Tue, 07 Oct 2014 07:32:09 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 10/06/14 13:01, Stephen Hemminger wrote:
> 
> > I think iproute utilities needs to accept the old syntax and warn about
> > deprecated syntax use. Later (like 2yr +) the code can be removed.
> >
> > The old syntax can be removed from all documentation and help messages
> > now though.
> 
> Ok, so ignore that one patch. I will send another one.
> Fair to put a date for when the obsoletion notice went out?
> Example:
> "As of October 10, 2014 this syntax is obsolete. Please use instead ..."
> Maybe also mention when the approximate cutoff date is.

Don't put a date in, it either will look silly, or will be ignored.
Make message a short direct one liner.

^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: David Miller @ 2014-10-07 16:05 UTC (permalink / raw)
  To: danny.zhou
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet
In-Reply-To: <DFDF335405C17848924A094BC35766CF0A953308@SHSMSX104.ccr.corp.intel.com>

From: "Zhou, Danny" <danny.zhou@intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000

> Once qpairs split-off is done, the user space driver, as a slave
> driver, will re-initialize those queues completely in user space by
> using paddr(in the case of DPDK, vaddr of DPDK used huge pages are
> translated to paddr) to fill in the packet descriptors.  As of
> security concern raised in previous discussion, the reason we
> think(BTW, correct me if I am wrong) af_packet is most suitable is
> because only user application with root permission is allowed to
> successfully split-off queue pairs and mmap a small window of PCIe
> I/O space to user space, so concern regarding "device can DMA
> from/to any arbitrary physical memory." is not that big. As all user
> space device drivers based on UIO mechanism has the same concern
> issue, VFIO adds protection but it is based on IOMMU which is
> specific to Intel silicons.

Wait a second.

If there is no memory protection performed I'm not merging this.

I thought the user has to associate a fixed pool of memory to the
queueus, the kernel attaches that memory, and then the user cannot
modify the addresses _AT_ _ALL_.

If the user can modify the addresses in the descriptors and make
the chip crap on random memory, this is a non-starter.

Sorry.

^ permalink raw reply

* Re: [PATCH] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
From: Sergei Shtylyov @ 2014-10-07 16:07 UTC (permalink / raw)
  To: Lothar Waßmann, netdev
  Cc: David S. Miller, Russell King, Frank Li, Fabio Estevam,
	linux-kernel
In-Reply-To: <1412687977-11742-1-git-send-email-LW@KARO-electronics.de>

Hello.

On 10/07/2014 05:19 PM, Lothar Waßmann wrote:

> commit 1b7bde6d659d ("net: fec: implement rx_copybreak to improve rx performance")
> introduced a regression for i.MX28. The swap_buffer() function doing
> the endian conversion of the received data on i.MX28 may access memory
> beyond the actual packet size in the DMA buffer. fec_enet_copybreak()
> does not copy those bytes, so that the last bytes of a packet may be
> filled with invalid data after swapping.
> This will likely lead to checksum errors on received packets.
> E.g. when trying to mount an NFS rootfs:
> UDP: bad checksum. From 192.168.1.225:111 to 192.168.100.73:44662 ulen 36

> Do the byte swapping and copying to the new skb in one go if
> necessary.

> Signed-off-by: Lothar Waßmann <LW@KARO-electronics.de>
> ---
>   drivers/net/ethernet/freescale/fec_main.c |   25 +++++++++++++++++++++----
>   1 file changed, 21 insertions(+), 4 deletions(-)

> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 87975b5..eaaebad 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
[...]
> @@ -1348,7 +1360,7 @@ fec_enet_new_rxbdp(struct net_device *ndev, struct bufdesc *bdp, struct sk_buff
>   }
>
>   static bool fec_enet_copybreak(struct net_device *ndev, struct sk_buff **skb,
> -			       struct bufdesc *bdp, u32 length)
> +			       struct bufdesc *bdp, u32 length, int swap)

    'bool swap' perhaps?

> @@ -1393,6 +1408,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
>   	u16	vlan_tag;
>   	int	index = 0;
>   	bool	is_copybreak;
> +	bool need_swap = id_entry->driver_data & FEC_QUIRK_SWAP_FRAME;

    ... especially talking this into account...

WBR, Sergei

^ permalink raw reply

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Zhou, Danny @ 2014-10-07 16:06 UTC (permalink / raw)
  To: Fastabend, John R, Willem de Bruijn
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet
In-Reply-To: <54340CEB.204@intel.com>



> -----Original Message-----
> From: Fastabend, John R
> Sent: Tuesday, October 07, 2014 11:55 PM
> To: Willem de Bruijn; Zhou, Danny
> Cc: John Fastabend; Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network Development;
> Ronciak, John; Amir Vadai; Eric Dumazet
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
> >>>> Typically in an af_packet interface a packet_type handler is
> >>>> registered and used to filter traffic to the socket and do other
> >>>> things such as fan out traffic to multiple sockets. In this case the
> >>>> networking stack is being bypassed so this code is not run. So the
> >>>> hardware must push the correct traffic to the queues obtained from
> >>>> the ndo callback ndo_split_queue_pairs().
> >>>
> >>> Why does the interface work at the level of queue_pairs instead of
> >>> individual queues?
> >>
> >> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
> >> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
> >> egress traffics respectively, although the flow director only applies to ingress traffics.
> >
> > That requirement of co-allocation is absent in existing packet
> > rings. Many applications only receive or transmit. For
> > receive-only, it would even be possible to map descriptor
> > rings read-only, if the kernel remains responsible for posting
> > buffers -- but I see below that that is not the case, so that's
> > not very relevant here.
> >
> > Still, some workloads want asymmetric sets of rx and tx rings.
> > For instance, instead of using RSS, a process may want to
> > receive on as few rings as possible, load balance across
> > workers in software, but still give each worker thread its own
> > private transmit ring.
> >

> 
> We can build this into the interface by having the setsockopt
> provide both the number of tx rings and number of rx rings. It
> might not be immediately available in any drivers because at
> least ixgbe is pretty dependent on tx/rx pairing.
> 
> I would have to look through the other drivers to see how
> much work it would be to support this on them. If I can't find
> a good candidate we might leave it out until we can fix up the
> drivers.
> 


Fully agreed. Unlike ixgbe, DPDK supports asymmetric sets of rx and tx rings. It should be easily enabled
if user space requests those free and available rx/tx queues, but it would cause a lot of troubles if certain 
requested queues have been occupied by ixgbe, as it breaks existing rx/tx pairing.

> >
> >>>
> >>>>         /* Get the layout of ring space offset, page_sz, cnt */
> >>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >>>>                    &info, &optlen);
> >>>>
> >>>>         /* request some queues from the driver */
> >>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >>>>                    &qpairs_info, sizeof(qpairs_info));
> >>>>
> >>>>         /* if we let the driver pick us queues learn which queues
> >>>>          * we were given
> >>>>          */
> >>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >>>>                    &qpairs_info, sizeof(qpairs_info));
> >>>
> >>> If ethtool -U is used to steer traffic to a specific descriptor queue,
> >>> then the setsockopt can pass the exact id of that queue and there
> >>> is no need for a getsockopt follow-up.
> >>
> >> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
> >> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
> >> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> >> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
> >> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
> >> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
> >> to call getsocketopt.
> >
> > One step further would be to move the entire configuration behind
> > the packet socket interface. It's perhaps out of scope of this patch,
> > but the difference between using `ethtool -U` and passing the same
> > expression through the packet socket is that in the latter case the
> > kernel can automatically rollback the configuration change when the
> > process dies.
> >
> 
> hmm might be interesting I think  this is a follow on path to
> investigate after the initial support.
> 
> >>
> >>>
> >>>>         /* And mmap queue pairs to user space */
> >>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >>>>              MAP_SHARED, fd, 0);
> >>>
> >>> How will packet data be mapped and how will userspace translate
> >>> from paddr to vaddr? Is the goal to maintain long lived mappings
> >>> and instruct drivers to allocate from this restricted range (to
> >>> avoid per-packet system calls and vma operations)?
> >>>
> >>
> >> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> >> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
> >> are translated to paddr) to fill in the packet descriptors.
> >
> > Ah, userspace is responsible for posting buffers and translation
> > from vaddr to paddr is straightforward. Yes that makes sense.
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >


^ permalink raw reply

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: David Miller @ 2014-10-07 16:08 UTC (permalink / raw)
  To: David.Laight
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D174C621F@AcuExch.aculab.com>

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 15:59:35 +0000

> From: David 
>> From: David Laight <David.Laight@ACULAB.COM>
>> Date: Tue, 7 Oct 2014 09:27:03 +0000
>> 
>> > That is (probably) the only scheme that stops the application
>> > accessing random parts of physical memory.
>> 
>> I don't know where this claim keeps coming from, it's false.
>> 
>> The application has to attach memory to the ring, and then the
>> ring can only refer to that memory for the duration of the
>> session.
>> 
>> There is no way that the user can program the address field of the
>> descriptors to point at arbitrary physical memory locations.
>> 
>> There is protection and control.
> 
> I got the impression that the application was directly writing the ring
> structure that the ethernet mac hardware uses to describe tx and rx buffers.
> (ie they are mapped read-write into userspace).
> Unless you have a system where you can limit the physical memory
> ranges accessible to the mac hardware, I don't see how you can stop
> the application putting rogue values into the ring.
> 
> Clearly I'm missing something in my quick read of the change.

No, I think I misunderstood, and apparently the Mellanox driver allows
the user to crap into arbitrary physical memory too.

All of this garbage must get fixed and this feature is a non-starter
until there is control over the memory the rings can point ti.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox