Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Eric Dumazet @ 2013-09-05  0:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378294029.7360.92.camel@edumazet-glaptop>

On Wed, 2013-09-04 at 04:27 -0700, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
> > On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
> > 
> > > > And tcpdump would certainly help ;)
> > > 
> > > See attachment.
> > > 
> > 
> > Nothing obvious on tcpdump (only that lot of frames are missing)
> > 
> > 1) Are you capturing part of the payload only (like tcpdump -s 128)
> > 
> > 2) What is the setup.
> > 
> > 3) tc -s -d qdisc
> 
> If you use FQ in the guest, then it could be that high resolution timers
> have high latency ?
> 
> So FQ arms short timers, but effective duration could be much longer.
> 
> Here I get a smooth latency of up to ~3 us
> 
> lpq83:~# ./netperf -H lpq84 ; ./tc -s -d qd ; dmesg | tail -n1
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
> 
>  87380  16384  16384    10.00    9410.82   
> qdisc fq 8005: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
>  Sent 50545633991 bytes 33385894 pkt (dropped 0, overlimits 0 requeues 19) 
>  rate 9258Mbit 764335pps backlog 0b 0p requeues 19 
>   117 flow, 115 inactive, 0 throttled
>   0 gc, 0 highprio, 0 retrans, 96861 throttled, 0 flows_plimit
> [  572.551664] latency = 3035 ns
> 
> 
> What do you get with this debugging patch ?
> 
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index 32ad015..c1312a0 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -103,6 +103,7 @@ struct fq_sched_data {
>  	u64		stat_internal_packets;
>  	u64		stat_tcp_retrans;
>  	u64		stat_throttled;
> +	s64		slatency;
>  	u64		stat_flows_plimit;
>  	u64		stat_pkts_too_long;
>  	u64		stat_allocation_errors;
> @@ -393,6 +394,7 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>  static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>  {
>  	struct rb_node *p;
> +	bool first = true;
>  
>  	if (q->time_next_delayed_flow > now)
>  		return;
> @@ -405,6 +407,13 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>  			q->time_next_delayed_flow = f->time_next_packet;
>  			break;
>  		}
> +		if (first) {
> +			s64 delay = now - f->time_next_packet;
> +
> +			first = false;
> +			delay -= q->slatency >> 3;
> +			q->slatency += delay;
> +		}
>  		rb_erase(p, &q->delayed);
>  		q->throttled_flows--;
>  		fq_flow_add_tail(&q->old_flows, f);
> @@ -711,6 +720,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
>  	if (opts == NULL)
>  		goto nla_put_failure;
>  
> +	pr_err("latency = %lld ns\n", q->slatency >> 3);
>  	if (nla_put_u32(skb, TCA_FQ_PLIMIT, sch->limit) ||
>  	    nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
>  	    nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
> 


BTW what is your HZ value ?

We have a problem in TCP stack, because srtt is in HZ units.

Before we change to us units, I guess tcp_update_pacing_rate() should be
changed a bit if HZ=250

^ permalink raw reply

* linux-next: manual merge of the h8300-remove tree with the net tree
From: Stephen Rothwell @ 2013-09-05  1:00 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: linux-next, linux-kernel, Chen Gang, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 601 bytes --]

Hi Guenter,

Today's linux-next merge of the h8300-remove tree got a conflict in
drivers/net/ethernet/8390/Kconfig between commit 061ba049abc6 ("drivers:
net: ethernet: 8390: Kconfig: add H8300H_AKI3068NET and H8300H_H8MAX
dependancy for NE_H8300") from the net tree and commit d5bfa4f18b55
("net/ethernet: Drop H8/300 Ethernet driver") from the h8300-remove tree.

I fixed it up (The latter removes the entry that is updated by the
former, so I did that) and can carry the fix as necessary (no action is
required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Eric Dumazet @ 2013-09-05  1:23 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378342226.11205.1.camel@edumazet-glaptop>

On Wed, 2013-09-04 at 17:50 -0700, Eric Dumazet wrote:

> 
> BTW what is your HZ value ?
> 
> We have a problem in TCP stack, because srtt is in HZ units.
> 
> Before we change to us units, I guess tcp_update_pacing_rate() should be
> changed a bit if HZ=250

Oh well, I feel dumb, please try this fix (If you have HZ != 1000) :

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 93d7e9d..fd96d8e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -700,7 +700,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
 	u64 rate;
 
 	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
-	rate = (u64)tp->mss_cache * 2 * (HZ << 3);
+	rate = (u64)tp->mss_cache * 2 * ((USEC_PER_SEC/HZ) << 3);
 
 	rate *= max(tp->snd_cwnd, tp->packets_out);
 

^ permalink raw reply related

* RE: [E1000-devel] [net-next v4 7/8] i40e: sysfs and debugfs interfaces
From: Nelson, Shannon @ 2013-09-05  1:25 UTC (permalink / raw)
  To: Stephen Hemminger, Kirsher, Jeffrey T
  Cc: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	Brandeburg, Jesse, gospo@redhat.com, davem@davemloft.net,
	sassmann@redhat.com
In-Reply-To: <20130904173759.7c2eec52@nehalam.linuxnetplumber.net>

> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, September 04, 2013 5:38 PM
> 

[ ... many good comments on i40e_sysfs.c ... ]

Hi Stephen,

Thanks for your comments.  Frankly, we were hoping for this kind of feedback when we posted the first version of this driver a couple of months ago, with the intent of clearing up any issues in time to get settled and accepted about now.  Yes, this is driven by business reasons that this list doesn't need to hear me whine about, but that is our reality.

We'll be happy to work with you on- or off-list over the next few days to come up with a good resolution to your concerns, and from that we can publish follow-on patches to this current submission.  We should be able to get it settled and done quickly and without too much fuss.  I won't be at Plumber's this year, but perhaps Jesse or John can get together with you and other concerned sysfs folks if needed to finalize any issues.

Will this work for you?

Thanks,
sln

^ permalink raw reply

* Re: [PATCH net-next v2 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Ding Tianhong @ 2013-09-05  2:06 UTC (permalink / raw)
  To: David Miller; +Cc: vfalico, fubar, andy, nikolay, netdev
In-Reply-To: <20130904.122512.1633906087065495330.davem@davemloft.net>

On 2013/9/5 0:25, David Miller wrote:
> From: Veaceslav Falico <vfalico@redhat.com>
> Date: Wed, 4 Sep 2013 12:18:24 +0200
> 
>> On Wed, Sep 04, 2013 at 05:43:45PM +0800, Ding Tianhong wrote:
>> ...snip...
>>> +/**
>>> + * IMPORTANT: bond_first/last_slave_rcu can return NULL in case of an
>>> empty list
>>> + * Caller must hold rcu_read_lock
>>> + */
>>> +#define bond_first_slave_rcu(bond) \
>>> +	list_first_or_null_rcu(&(bond)->slave_list, struct slave, list)
>>> +#define bond_last_slave_rcu(bond) \
>>> +	(list_empty(&(bond)->slave_list) ? NULL : \
>>> + bond_to_slave_rcu((bond)->slave_list.prev))
>>
>> Here, bond_last_slave_rcu() is racy. The list can be non-empty when
>> list_empty() is verified, however afterwards it might become empty,
>> when
>> you call bond_to_slave_rcu(), and thus you'll get
>> bond_to_slave(bond->slave_list) in the result, which is not a slave.
>>
>> Take a look at list_first_or_null_rcu() for a reference. The main idea
>> is
>> that it first gets the ->next pointer, with RCU protection, and then
>> verifies if it's the list head or not, and if not - it gets the
>> container
>> already. This way the ->next pointer won't get away.
>>
>> These kind of bugs are really rare, but are *EXTREMELY* hard to debug.
> 
> I agree with this analysis.
> 
> Ding, "rcu_read_lock()" doesn't "lock" anything.  It's just a memory
> barrier.
> 
> All the list can still change on you asynchronously to your accesses.
> 
> That's why list_first_or_null_rcu() is so carefully arranged.
> Therefore, you must make similar accomodations.
> 
> 
> 

yes, after a long time thinking, I found the problem and know how to do next, repair and resend it later.

> .
> 

^ permalink raw reply

* question about __udp6_lib_err usage of __udp6_lib_lookup
From: Dave Jones @ 2013-09-05  2:44 UTC (permalink / raw)
  To: netdev

__udp6_lib_lookup's prototype is ...

struct sock *__udp6_lib_lookup(struct net *net,
                                      const struct in6_addr *saddr, __be16 sport,
                                      const struct in6_addr *daddr, __be16 dport,
                                      int dif, struct udp_table *udptable)

But the usage in __udp6_lib_err is...

 521         sk = __udp6_lib_lookup(dev_net(skb->dev), daddr, uh->dest,
 522                                saddr, uh->source, inet6_iif(skb), udptable);

With the source/dest addr/port arguments swapped. Is this intentional ?

	Dave

^ permalink raw reply

* Intel igb module command line configuration in kernel sources
From: Andrew Davidoff @ 2013-09-05  2:51 UTC (permalink / raw)
  To: netdev

Hi,

I apologize that this is a user-type question but linux-net seems to
have gone away, and I cannot find a more appropriate networking
related mailing list.

Why are the Intel igb module configuration parameters, usually found
in igb_param.c and part of the source from Intel, not included in the
igb sources distributed with the linux kernel? I cannot find them in
the kernel's igb source, modinfo doesn't list them, and if I try to
use them as documented in the igb readme, I get errors because they
are unrecognized (ex: RSS). Am I overlooking an alternate
configuration method?

Thanks.
Andy

^ permalink raw reply

* Re: [PATCH V3 4/6] vhost_net: determine whether or not to use zerocopy at one time
From: Jason Wang @ 2013-09-05  2:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20130904115929.GA9393@redhat.com>

On 09/04/2013 07:59 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 02, 2013 at 04:40:59PM +0800, Jason Wang wrote:
>> Currently, even if the packet length is smaller than VHOST_GOODCOPY_LEN, if
>> upend_idx != done_idx we still set zcopy_used to true and rollback this choice
>> later. This could be avoided by determining zerocopy once by checking all
>> conditions at one time before.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>  drivers/vhost/net.c |   47 ++++++++++++++++++++---------------------------
>>  1 files changed, 20 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 8a6dd0d..3f89dea 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -404,43 +404,36 @@ static void handle_tx(struct vhost_net *net)
>>  			       iov_length(nvq->hdr, s), hdr_size);
>>  			break;
>>  		}
>> -		zcopy_used = zcopy && (len >= VHOST_GOODCOPY_LEN ||
>> -				       nvq->upend_idx != nvq->done_idx);
>> +
>> +		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
>> +				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
>> +				      nvq->done_idx
> Thinking about this, this looks strange.
> The original idea was that once we start doing zcopy, we keep
> using the heads ring even for short packets until no zcopy is outstanding.

What's the reason for keep using the heads ring?
>
> What's the logic behind (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
> here?

Because we initialize both upend_idx and done_idx to zero, so upend_idx
!= done_idx could not be used to check whether or not the heads ring
were full.
>> +				   && vhost_net_tx_select_zcopy(net);
>>  
>>  		/* use msg_control to pass vhost zerocopy ubuf info to skb */
>>  		if (zcopy_used) {
>> +			struct ubuf_info *ubuf;
>> +			ubuf = nvq->ubuf_info + nvq->upend_idx;
>> +
>>  			vq->heads[nvq->upend_idx].id = head;
>> -			if (!vhost_net_tx_select_zcopy(net) ||
>> -			    len < VHOST_GOODCOPY_LEN) {
>> -				/* copy don't need to wait for DMA done */
>> -				vq->heads[nvq->upend_idx].len =
>> -							VHOST_DMA_DONE_LEN;
>> -				msg.msg_control = NULL;
>> -				msg.msg_controllen = 0;
>> -				ubufs = NULL;
>> -			} else {
>> -				struct ubuf_info *ubuf;
>> -				ubuf = nvq->ubuf_info + nvq->upend_idx;
>> -
>> -				vq->heads[nvq->upend_idx].len =
>> -					VHOST_DMA_IN_PROGRESS;
>> -				ubuf->callback = vhost_zerocopy_callback;
>> -				ubuf->ctx = nvq->ubufs;
>> -				ubuf->desc = nvq->upend_idx;
>> -				msg.msg_control = ubuf;
>> -				msg.msg_controllen = sizeof(ubuf);
>> -				ubufs = nvq->ubufs;
>> -				kref_get(&ubufs->kref);
>> -			}
>> +			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
>> +			ubuf->callback = vhost_zerocopy_callback;
>> +			ubuf->ctx = nvq->ubufs;
>> +			ubuf->desc = nvq->upend_idx;
>> +			msg.msg_control = ubuf;
>> +			msg.msg_controllen = sizeof(ubuf);
>> +			ubufs = nvq->ubufs;
>> +			kref_get(&ubufs->kref);
>>  			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
>> -		} else
>> +		} else {
>>  			msg.msg_control = NULL;
>> +			ubufs = NULL;
>> +		}
>>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>>  		if (unlikely(err < 0)) {
>>  			if (zcopy_used) {
>> -				if (ubufs)
>> -					vhost_net_ubuf_put(ubufs);
>> +				vhost_net_ubuf_put(ubufs);
>>  				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>>  					% UIO_MAXIOV;
>>  			}
>> -- 
>> 1.7.1
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Jason Wang @ 2013-09-05  3:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378290638.7360.85.camel@edumazet-glaptop>

On 09/04/2013 06:30 PM, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
>
>>> And tcpdump would certainly help ;)
>> See attachment.
>>
> Nothing obvious on tcpdump (only that lot of frames are missing)
>
> 1) Are you capturing part of the payload only (like tcpdump -s 128)

No, I use something like tcpdump -i eth0 -w fq -c 300 after the netperf
start.
>
> 2) What is the setup.

Tow kvm guest with virtio-net and vhost enabled. Only one queue is
enabled and the guest were connected with bridge. Both host and guest
were net-next.git
>
> 3) tc -s -d qdisc

tc -s -d qdisc
qdisc fq 8001: dev eth0 root refcnt 2 [Unknown qdisc, optlen=64]
 Sent 6680760347 bytes 4431855 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

btw, is the fq support for tc merged into iproute2? Looks like I can't
find them.
>
>
>
>

^ permalink raw reply

* Re: rtl8188eu array/null comparison.
From: Larry Finger @ 2013-09-05  3:14 UTC (permalink / raw)
  To: Dave Jones; +Cc: netdev
In-Reply-To: <20130904233018.GB5727@redhat.com>

On 09/04/2013 06:30 PM, Dave Jones wrote:
> Hi Larry,
>
> in drivers/staging/rtl8188eu/core/rtw_mlme.c:rtw_check_join_candidate there's this code..
>
> 1721         /* check ssid, if needed */
> 1722         if (pmlmepriv->assoc_ssid.Ssid && pmlmepriv->assoc_ssid.SsidLength) {
> 1723                 if (competitor->network.Ssid.SsidLength != pmlmepriv->assoc_ssid.SsidLength ||
> 1724                     _rtw_memcmp(competitor->network.Ssid.Ssid, pmlmepriv->assoc_ssid.Ssid, pmlmepriv->assoc_ssid.SsidLength) == false)
> 1725                         goto exit;
> 1726         }
> 1727
>
> assoc_ssid.Ssid being an array, this comparison against null probably isn't what was intended ?

I think that a non-zero SsidLength should be enough of a test, but I need to 
think about a bit more.

Thanks,

Larry

^ permalink raw reply

* Re: [E1000-devel] [net-next v4 7/8] i40e: sysfs and debugfs interfaces
From: David Miller @ 2013-09-05  3:19 UTC (permalink / raw)
  To: shannon.nelson
  Cc: stephen, jeffrey.t.kirsher, e1000-devel, netdev, jesse.brandeburg,
	gospo, sassmann
In-Reply-To: <FC41C24E35F18A40888AACA1A36F3E416C61CF23@FMSMSX102.amr.corp.intel.com>

From: "Nelson, Shannon" <shannon.nelson@intel.com>
Date: Thu, 5 Sep 2013 01:25:47 +0000

> Will this work for you?

You will fix the problems people are reporting with this patch series
before I apply it.

^ permalink raw reply

* Re: question about __udp6_lib_err usage of __udp6_lib_lookup
From: David Miller @ 2013-09-05  3:21 UTC (permalink / raw)
  To: davej; +Cc: netdev
In-Reply-To: <20130905024434.GA8219@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Wed, 4 Sep 2013 22:44:34 -0400

> __udp6_lib_lookup's prototype is ...
> 
> struct sock *__udp6_lib_lookup(struct net *net,
>                                       const struct in6_addr *saddr, __be16 sport,
>                                       const struct in6_addr *daddr, __be16 dport,
>                                       int dif, struct udp_table *udptable)
> 
> But the usage in __udp6_lib_err is...
> 
>  521         sk = __udp6_lib_lookup(dev_net(skb->dev), daddr, uh->dest,
>  522                                saddr, uh->source, inet6_iif(skb), udptable);
> 
> With the source/dest addr/port arguments swapped. Is this intentional ?

Yes, because we are looking at the headers of a packet we transmitted,
quoted in an ICMP response.

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Jason Wang @ 2013-09-05  3:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378294029.7360.92.camel@edumazet-glaptop>

On 09/04/2013 07:27 PM, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
>> > On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
>> > 
>>>> > > > And tcpdump would certainly help ;)
>>> > > 
>>> > > See attachment.
>>> > > 
>> > 
>> > Nothing obvious on tcpdump (only that lot of frames are missing)
>> > 
>> > 1) Are you capturing part of the payload only (like tcpdump -s 128)
>> > 
>> > 2) What is the setup.
>> > 
>> > 3) tc -s -d qdisc
> If you use FQ in the guest, then it could be that high resolution timers
> have high latency ?

Not sure, but it should not affect so much. And I'm using kvm-clock in
guest whose overhead should be very small.
>
> So FQ arms short timers, but effective duration could be much longer.
>
> Here I get a smooth latency of up to ~3 us
>
> lpq83:~# ./netperf -H lpq84 ; ./tc -s -d qd ; dmesg | tail -n1
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
> Recv   Send    Send                          
> Socket Socket  Message  Elapsed              
> Size   Size    Size     Time     Throughput  
> bytes  bytes   bytes    secs.    10^6bits/sec  
>
>  87380  16384  16384    10.00    9410.82   
> qdisc fq 8005: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
>  Sent 50545633991 bytes 33385894 pkt (dropped 0, overlimits 0 requeues 19) 
>  rate 9258Mbit 764335pps backlog 0b 0p requeues 19 
>   117 flow, 115 inactive, 0 throttled
>   0 gc, 0 highprio, 0 retrans, 96861 throttled, 0 flows_plimit
> [  572.551664] latency = 3035 ns
>
>
> What do you get with this debugging patch ?

I'm getting about 13us-19us, one run like:

 netperf -H 192.168.100.5; tc -s -d qd; dmesg | tail -n1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.100.5 () port 0 AF_INET : demo
Recv   Send    Send                         
Socket Socket  Message  Elapsed             
Size   Size    Size     Time     Throughput 
bytes  bytes   bytes    secs.    10^6bits/sec 

 87380  16384  16384    10.00    4542.09  
qdisc fq 8001: dev eth0 root refcnt 2 [Unknown qdisc, optlen=64]
 Sent 53652327205 bytes 35580150 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
[  201.320565] latency = 14905 ns

One interesting thing is if I switch from kvm-clock to acpi_pm which has
much more overhead, the latency increase to about 50ns, and the
throughput drops very quickly.
netperf -H 192.168.100.5; tc -s -d qd; dmesg | tail -n1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
192.168.100.5 () port 0 AF_INET : demo
Recv   Send    Send                         
Socket Socket  Message  Elapsed             
Size   Size    Size     Time     Throughput 
bytes  bytes   bytes    secs.    10^6bits/sec 

 87380  16384  16384    10.00    2262.46  
qdisc fq 8001: dev eth0 root refcnt 2 [Unknown qdisc, optlen=64]
 Sent 56611533075 bytes 37550429 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
[  474.121689] latency = 51841 ns

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Jason Wang @ 2013-09-05  3:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin, KVM
In-Reply-To: <5227209F.4060708@redhat.com>

On 09/04/2013 07:59 PM, Daniel Borkmann wrote:
> On 09/04/2013 01:27 PM, Eric Dumazet wrote:
>> On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
>>> On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
>>>
>>>>> And tcpdump would certainly help ;)
>>>>
>>>> See attachment.
>>>>
>>>
>>> Nothing obvious on tcpdump (only that lot of frames are missing)
>>>
>>> 1) Are you capturing part of the payload only (like tcpdump -s 128)
>>>
>>> 2) What is the setup.
>>>
>>> 3) tc -s -d qdisc
>>
>> If you use FQ in the guest, then it could be that high resolution timers
>> have high latency ?
>
> Probably they internally switch to a lower resolution clock event
> source if
> there's no hardware support available:
>
>   The [source event] management layer provides interfaces for hrtimers to
>   implement high resolution timers [...] [and it] supports these more
> advanced
>   functions only when appropriate clock event sources have been
> registered,
>   otherwise the traditional periodic tick based behaviour is retained.
> [1]
>
> [1] https://www.kernel.org/doc/ols/2006/ols2006v1-pages-333-346.pdf 

Maybe, AFAIK, kvm-clock does not provide a clock event, only a pv
clocksource were provided.

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Eric Dumazet @ 2013-09-05  3:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <5227F57D.7030709@redhat.com>

On Thu, 2013-09-05 at 11:07 +0800, Jason Wang wrote:

> tc -s -d qdisc
> qdisc fq 8001: dev eth0 root refcnt 2 [Unknown qdisc, optlen=64]
>  Sent 6680760347 bytes 4431855 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> btw, is the fq support for tc merged into iproute2? Looks like I can't
> find them.

Yep its there on net-next-3.11 branch

# git branch -a
  master
* net-next-3.11
  remotes/origin/HEAD -> origin/master
  remotes/origin/iproute-3.5.1
  remotes/origin/master
  remotes/origin/net-next-3.11

# cat .git/config 
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[remote "origin"]
	url = git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
	remote = origin
	merge = refs/heads/master
[branch "net-next-3.11"]
	remote = origin
	merge = refs/heads/net-next-3.11

^ permalink raw reply

* Re: [PATCH v2 net-next] pkt_sched: fq: Fair Queue packet scheduler
From: Jason Wang @ 2013-09-05  3:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Neal Cardwell,
	Michael S. Tsirkin
In-Reply-To: <1378342226.11205.1.camel@edumazet-glaptop>

On 09/05/2013 08:50 AM, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 04:27 -0700, Eric Dumazet wrote:
>> On Wed, 2013-09-04 at 03:30 -0700, Eric Dumazet wrote:
>>> On Wed, 2013-09-04 at 14:30 +0800, Jason Wang wrote:
>>>
>>>>> And tcpdump would certainly help ;)
>>>> See attachment.
>>>>
>>> Nothing obvious on tcpdump (only that lot of frames are missing)
>>>
>>> 1) Are you capturing part of the payload only (like tcpdump -s 128)
>>>
>>> 2) What is the setup.
>>>
>>> 3) tc -s -d qdisc
>> If you use FQ in the guest, then it could be that high resolution timers
>> have high latency ?
>>
>> So FQ arms short timers, but effective duration could be much longer.
>>
>> Here I get a smooth latency of up to ~3 us
>>
>> lpq83:~# ./netperf -H lpq84 ; ./tc -s -d qd ; dmesg | tail -n1
>> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
>> Recv   Send    Send                          
>> Socket Socket  Message  Elapsed              
>> Size   Size    Size     Time     Throughput  
>> bytes  bytes   bytes    secs.    10^6bits/sec  
>>
>>  87380  16384  16384    10.00    9410.82   
>> qdisc fq 8005: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 1024 quantum 3028 initial_quantum 15140 
>>  Sent 50545633991 bytes 33385894 pkt (dropped 0, overlimits 0 requeues 19) 
>>  rate 9258Mbit 764335pps backlog 0b 0p requeues 19 
>>   117 flow, 115 inactive, 0 throttled
>>   0 gc, 0 highprio, 0 retrans, 96861 throttled, 0 flows_plimit
>> [  572.551664] latency = 3035 ns
>>
>>
>> What do you get with this debugging patch ?
>>
>> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
>> index 32ad015..c1312a0 100644
>> --- a/net/sched/sch_fq.c
>> +++ b/net/sched/sch_fq.c
>> @@ -103,6 +103,7 @@ struct fq_sched_data {
>>  	u64		stat_internal_packets;
>>  	u64		stat_tcp_retrans;
>>  	u64		stat_throttled;
>> +	s64		slatency;
>>  	u64		stat_flows_plimit;
>>  	u64		stat_pkts_too_long;
>>  	u64		stat_allocation_errors;
>> @@ -393,6 +394,7 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
>>  static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>>  {
>>  	struct rb_node *p;
>> +	bool first = true;
>>  
>>  	if (q->time_next_delayed_flow > now)
>>  		return;
>> @@ -405,6 +407,13 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
>>  			q->time_next_delayed_flow = f->time_next_packet;
>>  			break;
>>  		}
>> +		if (first) {
>> +			s64 delay = now - f->time_next_packet;
>> +
>> +			first = false;
>> +			delay -= q->slatency >> 3;
>> +			q->slatency += delay;
>> +		}
>>  		rb_erase(p, &q->delayed);
>>  		q->throttled_flows--;
>>  		fq_flow_add_tail(&q->old_flows, f);
>> @@ -711,6 +720,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
>>  	if (opts == NULL)
>>  		goto nla_put_failure;
>>  
>> +	pr_err("latency = %lld ns\n", q->slatency >> 3);
>>  	if (nla_put_u32(skb, TCA_FQ_PLIMIT, sch->limit) ||
>>  	    nla_put_u32(skb, TCA_FQ_FLOW_PLIMIT, q->flow_plimit) ||
>>  	    nla_put_u32(skb, TCA_FQ_QUANTUM, q->quantum) ||
>>
>
> BTW what is your HZ value ?

Guest HZ is 1000.
>
> We have a problem in TCP stack, because srtt is in HZ units.
>
> Before we change to us units, I guess tcp_update_pacing_rate() should be
> changed a bit if HZ=250
>
>

^ permalink raw reply

* Add missing braces in bnx2x:bnx2x_link_initialize
From: Dave Jones @ 2013-09-05  3:46 UTC (permalink / raw)
  To: netdev; +Cc: eilong

The indentation here implies that the intent was for this to be a multiline if.
Introduced a few years ago in commit ec146a6f019923819f5ca381980248b6d154ca1a ("bnx2x: Modify XGXS functions")

Signed-off-by: Dave Jones <davej@fedoraproject.org>

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
index 9d64b98..6645684 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_link.c
@@ -6501,12 +6501,13 @@ static int bnx2x_link_initialize(struct link_params *params,
 		struct bnx2x_phy *phy = &params->phy[INT_PHY];
 		if (vars->line_speed == SPEED_AUTO_NEG &&
 		    (CHIP_IS_E1x(bp) ||
-		     CHIP_IS_E2(bp)))
+		     CHIP_IS_E2(bp))) {
 			bnx2x_set_parallel_detection(phy, params);
 			if (params->phy[INT_PHY].config_init)
 				params->phy[INT_PHY].config_init(phy,
 								 params,
 								 vars);
+		}
 	}
 
 	/* Init external phy*/

^ permalink raw reply related

* [PATCH] Add missing braces to ath10k_pci_tx_pipe_cleanup
From: Dave Jones @ 2013-09-05  3:51 UTC (permalink / raw)
  To: netdev; +Cc: kvalo, linville, ath10k, linux-wireless

The indentation here implies this was meant to be a multi-statement if, but it lacks the braces.

Signed-off-by: Dave Jones <davej@fedoraproject.org>

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 33af467..31b69d3 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1200,7 +1200,7 @@ static void ath10k_pci_tx_pipe_cleanup(struct hif_ce_pipe_info *pipe_info)
 
 	while (ath10k_ce_cancel_send_next(ce_hdl, (void **)&netbuf,
 					  &ce_data, &nbytes, &id) == 0) {
-		if (netbuf != CE_SENDLIST_ITEM_CTXT)
+		if (netbuf != CE_SENDLIST_ITEM_CTXT) {
 			/*
 			 * Indicate the completion to higer layer to free
 			 * the buffer
@@ -1209,6 +1209,7 @@ static void ath10k_pci_tx_pipe_cleanup(struct hif_ce_pipe_info *pipe_info)
 			ar_pci->msg_callbacks_current.tx_completion(ar,
 								    netbuf,
 								    id);
+		}
 	}
 }
 

^ permalink raw reply related

* Re: [E1000-devel] [net-next v4 7/8] i40e: sysfs and debugfs interfaces
From: Brandeburg, Jesse @ 2013-09-05  4:08 UTC (permalink / raw)
  To: David Miller
  Cc: Nelson, Shannon, stephen@networkplumber.org, Kirsher, Jeffrey T,
	e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org,
	gospo@redhat.com, sassmann@redhat.com
In-Reply-To: <20130904.231950.956676623509287249.davem@davemloft.net>

On Wed, 2013-09-04 at 23:19 -0400, David Miller wrote:
> You will fix the problems people are reporting with this patch series
> before I apply it.

Okay, the quickest path to that might be to drop the sysfs patch for
now.  If that is acceptable I will re-spin the patches tonight.


^ permalink raw reply

* [PATCH] Add missing braces to multiline if in cfctrl_linkup_request
From: Dave Jones @ 2013-09-05  4:11 UTC (permalink / raw)
  To: dmitry.tarnyagin; +Cc: netdev

The indentation here implies this was meant to be a multi-line if.

Introduced several years back in commit c85c2951d4da1236e32f1858db418221e624aba5
("caif: Handle dev_queue_xmit errors.")

Signed-off-by: Dave Jones <davej@fedoraproject.org>

diff --git a/net/caif/cfctrl.c b/net/caif/cfctrl.c
index 2bd4b58..0f45522 100644
--- a/net/caif/cfctrl.c
+++ b/net/caif/cfctrl.c
@@ -293,9 +293,10 @@ int cfctrl_linkup_request(struct cflayer *layer,
 
 		count = cfctrl_cancel_req(&cfctrl->serv.layer,
 						user_layer);
-		if (count != 1)
+		if (count != 1) {
 			pr_err("Could not remove request (%d)", count);
 			return -ENODEV;
+		}
 	}
 	return 0;
 }

^ permalink raw reply related

* Re: [PATCH] net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
From: Ethan Tuttle @ 2013-09-05  4:14 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Thomas Petazzoni, David S. Miller, netdev, linux-arm-kernel,
	Lior Amsalem, Gregory Clement, Ezequiel Garcia, Jochen De Smet,
	Peter Sanford, Chény Yves-Gael, Ryan Press, Simon Guinot,
	vdonnefort, stable
In-Reply-To: <20130904163245.GI20552@1wt.eu>

Just booted with the patch on my Mirabox.  Both interfaces work!
Thank you Thomas.

One remaining issue is that the interface which uBoot didn't configure
is still getting a random mac address:

mvneta d0070000.ethernet eth0: Using hardware mac address f0:ad:4e:01:dc:97
mvneta d0074000.ethernet eth1: Using random mac address d2:35:dd:c8:99:48

This is on 3.11 plus Thomas' patch, which includes the previous fix to
"read MAC address from hardware when available".  Perhaps that fix
isn't working with the phylib?

Thanks again,

Ethan

On Wed, Sep 4, 2013 at 9:32 AM, Willy Tarreau <w@1wt.eu> wrote:
>
> Hi Thomas!
>
> On Wed, Sep 04, 2013 at 04:21:18PM +0200, Thomas Petazzoni wrote:
> > This commit fixes a long-standing bug that has been reported by many
> > users: on some Armada 370 platforms, only the network interface that
> > has been used in U-Boot to tftp the kernel works properly in
> > Linux. The other network interfaces can see a 'link up', but are
> > unable to transmit data. The reports were generally made on the Armada
> > 370-based Mirabox, but have also been given on the Armada 370-RD
> > board.
> (...)
> > This patch has been tested on Armada 370 Mirabox, and now both network
> > interfaces are usable after boot.
>
> Just as a complementary check, I can also confirm that the OpenBlocks
> AX3 continues to work fine after this change.
>
> Best regards!
> Willy
>

^ permalink raw reply

* [rfc] suspicious indentation in do_tcp_setsockopt
From: Dave Jones @ 2013-09-05  4:20 UTC (permalink / raw)
  To: netdev

What's the intent here ?

This ?


diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2f6c74..95544e4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2454,10 +2454,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 	case TCP_THIN_DUPACK:
 		if (val < 0 || val > 1)
 			err = -EINVAL;
-		else
+		else {
 			tp->thin_dupack = val;
 			if (tp->thin_dupack)
 				tcp_disable_early_retrans(tp);
+		}
 		break;
 
 	case TCP_REPAIR:

Or this ...

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b2f6c74..187c5a4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2456,8 +2456,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EINVAL;
 		else
 			tp->thin_dupack = val;
-			if (tp->thin_dupack)
-				tcp_disable_early_retrans(tp);
+
+		if (tp->thin_dupack)
+			tcp_disable_early_retrans(tp);
 		break;
 
 	case TCP_REPAIR:


I'll submit the right patch in the right form once I know what was intended.

The former seems more 'correct' to me, but I'm unsure if that could break something.

	Dave

^ permalink raw reply related

* Re: [E1000-devel] [net-next v4 7/8] i40e: sysfs and debugfs interfaces
From: David Miller @ 2013-09-05  4:37 UTC (permalink / raw)
  To: jesse.brandeburg
  Cc: shannon.nelson, stephen, jeffrey.t.kirsher, e1000-devel, netdev,
	gospo, sassmann
In-Reply-To: <1378354121.26131.4.camel@jbrandeb-mobl2>

From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Thu, 5 Sep 2013 04:08:39 +0000

> On Wed, 2013-09-04 at 23:19 -0400, David Miller wrote:
>> You will fix the problems people are reporting with this patch series
>> before I apply it.
> 
> Okay, the quickest path to that might be to drop the sysfs patch for
> now.  If that is acceptable I will re-spin the patches tonight.

You're just going to ask me to add the sysfs code later, just make
the appropriate fixes.

Thanks.

^ permalink raw reply

* Re: [rfc] suspicious indentation in do_tcp_setsockopt
From: David Miller @ 2013-09-05  4:39 UTC (permalink / raw)
  To: davej; +Cc: netdev
In-Reply-To: <20130905042045.GD15824@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Thu, 5 Sep 2013 00:20:45 -0400

> What's the intent here ?

This stuff is great, do you have a script that looks for this false
indentation pattern?

^ permalink raw reply

* Re: [rfc] suspicious indentation in do_tcp_setsockopt
From: Joe Perches @ 2013-09-05  4:43 UTC (permalink / raw)
  To: Dave Jones, Yuchung Cheng; +Cc: netdev, Neal Cardwell
In-Reply-To: <20130905042045.GD15824@redhat.com>

(Adding Yuchung Cheng and Neal Cardwell as the
 author and acker of the patch)

On Thu, 2013-09-05 at 00:20 -0400, Dave Jones wrote:
> What's the intent here ?
> 
> This ?

I think the first is most likely.

> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b2f6c74..95544e4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2454,10 +2454,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  	case TCP_THIN_DUPACK:
>  		if (val < 0 || val > 1)
>  			err = -EINVAL;
> -		else
> +		else {
>  			tp->thin_dupack = val;
>  			if (tp->thin_dupack)
>  				tcp_disable_early_retrans(tp);
> +		}
>  		break;
>  
>  	case TCP_REPAIR:
> 
> Or this ...
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b2f6c74..187c5a4 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2456,8 +2456,9 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  			err = -EINVAL;
>  		else
>  			tp->thin_dupack = val;
> -			if (tp->thin_dupack)
> -				tcp_disable_early_retrans(tp);
> +
> +		if (tp->thin_dupack)
> +			tcp_disable_early_retrans(tp);
>  		break;
>  
>  	case TCP_REPAIR:
> 
> 
> I'll submit the right patch in the right form once I know what was intended.
> 
> The former seems more 'correct' to me, but I'm unsure if that could break something.
> 
> 	Dave
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox