Netdev List


* Re: [net-next PATCH] net: codel: Avoid undefined behavior from signed overflow
From: Ben Hutchings @ 2013-10-30 20:19 UTC (permalink / raw)
  To: paulmck; +Cc: Jesper Dangaard Brouer, netdev, Eric Dumazet, Dave Taht
In-Reply-To: <20131030201327.GO4126@linux.vnet.ibm.com>

On Wed, 2013-10-30 at 13:13 -0700, Paul E. McKenney wrote:
> On Wed, Oct 30, 2013 at 07:35:48PM +0000, Ben Hutchings wrote:
> > On Wed, 2013-10-30 at 18:23 +0100, Jesper Dangaard Brouer wrote:
> > > From: Jesper Dangaard Brouer <netoptimizer@brouer.com>
> > > 
> > > As described in commit 5a581b367 (jiffies: Avoid undefined
> > > behavior from signed overflow), according to the C standard
> > > 3.4.3p3, overflow of a signed integer results in undefined
> > > behavior.
> > [...]
> > 
> > According to the real processors that Linux runs on, signed arithmetic
> > uses 2's complement representation and overflow wraps accordingly.  And
> > we rely on that behaviour in many places, so we use
> > '-fno-strict-overflow' to tell gcc not to assume we avoid signed
> > overflow.  (There is also '-fwrapv' which tells gcc to assume the
> > processor behaves this way, but shouldn't it already know how the target
> > machine works?)
> 
> We should still fix them as we come across them.  There are a few types
> of loops where '-fno-strict-overflow' results in more instructions
> being generated.

I realise there's an opportunity for optimisation, but if these cases
are fixed on an ad-hoc basis, how will we know we're ready to make the
switch?
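
For reference, the overflow-free form of a wrapping timestamp comparison can be
sketched in plain C as below. This illustrates the general technique only: the
helper name ts_after() and the 32-bit width are assumptions, not code from the
patch. The subtraction is done on unsigned operands, where wraparound is well
defined, and only the result is reinterpreted as signed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns true if timestamp a is
 * later than timestamp b, assuming the two are less than 2^31 ticks apart.
 */
static bool ts_after(uint32_t a, uint32_t b)
{
	/* b - a is computed modulo 2^32, which the C standard defines.
	 * The conversion to int32_t is implementation-defined rather than
	 * undefined; gcc and clang both define it as two's complement.
	 */
	return (int32_t)(b - a) < 0;
}
```

With this form, ts_after(5, 0xfffffff0) is true even though the counter has
wrapped, and no signed overflow is ever evaluated.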

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* RE: [net-next  9/9] igb: Add ethtool support to configure number of channels
From: Wyborny, Carolyn @ 2013-10-30 20:30 UTC (permalink / raw)
  To: Ben Hutchings, Kirsher, Jeffrey T
  Cc: davem@davemloft.net, Laura Mihaela Vasilescu,
	netdev@vger.kernel.org, gospo@redhat.com, sassmann@redhat.com
In-Reply-To: <1380643148.1939.11.camel@bwh-desktop.uk.level5networks.com>

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of Ben Hutchings
> Sent: Tuesday, October 01, 2013 8:59 AM
> To: Kirsher, Jeffrey T
> Cc: davem@davemloft.net; Laura Mihaela Vasilescu; netdev@vger.kernel.org;
> gospo@redhat.com; sassmann@redhat.com
> Subject: Re: [net-next 9/9] igb: Add ethtool support to configure number of
> channels
> 
[..]
> In case this fails, is the interface in a consistent state where it is safe to
> reconfigure the interface again or to unbind the driver?
> 
> If it fails, and the interface was up, shouldn't it call dev_close() so that it's
> obviously down and the user can then try to bring it up again?
> 
> Ben.

Good question.  I was delayed in replying as I was giving Laura a chance to reply to this thread.  

We call our close routine, igb_close(), if the device is up.  This does not call dev_close() specifically; however, it’s a good suggestion for the error case.  I'll update the patch.

Thanks,

Carolyn

Carolyn Wyborny 
Linux Development 
Networking Division 
Intel Corporation 




* Re: [PATCH] sh_eth: call phy_scan_fixups() after PHY reset
From: Sergei Shtylyov @ 2013-10-30 20:50 UTC (permalink / raw)
  To: David Miller
  Cc: nobuhiro.iwamatsu.yj, netdev, linux-sh, laurent.pinchart+renesas
In-Reply-To: <523CEAD2.7030706@cogentembedded.com>

Hello.

On 09/21/2013 04:39 AM, Sergei Shtylyov wrote:

>>> Sometimes the PHY reset that sh_eth_phy_start() does affects PHY registers
>>> whose values are vital for the correct functioning of the driver.
>>> Unfortunately, the existing PHY platform fixup mechanism doesn't help here as
>>> it only hooks PHY resets done by ioctl() calls. Calling phy_scan_fixups() from
>>> the driver helps here. With a proper platform fixup, this fixes NFS
>>> timeouts on the SH-Mobile Lager board.

    Timeouts happen because of the sideband ETH_LINK signal connected to PHY's 
LED0 pin -- it bounces on/off after each packet in the default LED mode and 
that seems to hinder packet sending and/or reception...

>     "And sets the PHY LED pins to the desired mode", I should have added.

>>> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

>> The PHY layer is designed to naturally already take care of this kind of
>> thing.  I think that part of the problem is that you're fighting the
>> natural control flow the PHY layer provides.

>> When the phy_connect() is performed, what we end up doing is calling
>> phy_attach_direct() which invokes the ->probe() method of the driver
>> and then afterwards we do phy_init_hw() which takes care of doing
>> the fixup calls.

>     Yes, I have studied the code paths beforehand.

>> So if you really need to do a BMCR reset then run the fixups I'd like
>> you to look into making that happen within the provided control
>> flow rather than with an exceptional explicit call to run the fixups.

>     That could change the behavior of many Ethernet drivers in sometimes
> unpredictable ways I think (due to extended registers the PHYs sometimes have,
> like in this case) if you meant including the PHY reset into phylib control
> flows. Anyway, that would have required more complex patches only good for
> merging at the merge window time while I aimed at a quick fix for a problem at
> hand (which is NFS timeout/slowdown and LED mode mismatch to what was designed
> for the board).
>     Some other drivers also do reset the PHYs but usually that's accompanied
> by a loop polling for reset completion, so a naive code like that one on the
> phylib's ioctl() path couldn't have helped if I wanted to hook reset writes in
> the same fashion in phy_write(). In my case reset seems just quick enough for
> the extended PHY register reads/writes to work correctly without polling the
> reset bit first...
>     That's why I took an easy way and used already exported phy_scan_fixups()
> to undo what the PHY reset did to some of the PHY's registers. The question is
> why it was exported in the first place? It doesn't seem to be used by anything
> but phylib internally...

    Well, how about I create a phy_reset() function (that will take care of BMCR
polling and calling the PHY driver/fixups) that those drivers that currently do
reset their PHYs can call (instead of open coding the BMCR reset)? That way it
seems to be less invasive than embedding PHY reset into phylib's control flow...
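
    To make the proposal concrete, here is a rough userspace sketch of the
sequence such a helper would follow: set the self-clearing reset bit in BMCR,
poll until the PHY clears it, then re-apply the fixups. Everything except the
MII_BMCR/BMCR_RESET values is hypothetical: struct phy_ops, phy_reset_sketch()
and the fake PHY are stand-ins for the MDIO accessors and phy_scan_fixups(),
and a real implementation would sleep between polls.

```c
#include <stdint.h>

#define MII_BMCR	0x00	/* basic mode control register */
#define BMCR_RESET	0x8000	/* self-clearing software reset bit */

/* Hypothetical register-access interface standing in for the MDIO bus. */
struct phy_ops {
	int (*read)(void *ctx, int reg);
	int (*write)(void *ctx, int reg, uint16_t val);
	int (*run_fixups)(void *ctx);	/* stands in for phy_scan_fixups() */
};

/* Set BMCR_RESET, poll until the PHY clears it, then run the fixups.
 * Returns 0 on success, a negative value on I/O error or timeout.
 */
static int phy_reset_sketch(const struct phy_ops *ops, void *ctx, int max_polls)
{
	int val = ops->read(ctx, MII_BMCR);
	int err;

	if (val < 0)
		return val;
	err = ops->write(ctx, MII_BMCR, (uint16_t)(val | BMCR_RESET));
	if (err < 0)
		return err;

	while (max_polls-- > 0) {
		val = ops->read(ctx, MII_BMCR);
		if (val < 0)
			return val;
		if (!(val & BMCR_RESET))
			return ops->run_fixups(ctx);
	}
	return -1;	/* reset bit never cleared */
}

/* Fake PHY used only to exercise the sketch: the reset bit stays set for
 * busy_reads reads of BMCR and clears on the next one.
 */
struct fake_phy {
	uint16_t bmcr;
	int busy_reads;
	int fixups_run;
};

static int fake_read(void *ctx, int reg)
{
	struct fake_phy *p = ctx;

	if (reg != MII_BMCR)
		return 0;
	if ((p->bmcr & BMCR_RESET) && p->busy_reads-- <= 0)
		p->bmcr &= (uint16_t)~BMCR_RESET;
	return p->bmcr;
}

static int fake_write(void *ctx, int reg, uint16_t val)
{
	struct fake_phy *p = ctx;

	if (reg == MII_BMCR)
		p->bmcr = val;
	return 0;
}

static int fake_fixups(void *ctx)
{
	struct fake_phy *p = ctx;

	p->fixups_run++;
	return 0;
}
```

    The point of the polling loop is that the fixups only run once the PHY has
actually left reset, which is the part an open-coded BMCR write usually skips.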

>> I'm willing to be convinced that this is a better or necessary approach
>> but you'll need to explain it to me.

>     Well, I didn't write this driver, so I'm probably not the best person to
> be asked about its design (maybe Iwamatsu-san could add something here). I
> don't know about the purpose of the explicit PHY reset in the driver more than
> the accompanying comment says (and it doesn't say much other than that it
> takes the PHY out of power-down). Perhaps we could just painlessly remove it,
> who knows?

    Unfortunately, Iwamatsu-san hasn't commented on its purpose... :-(

WBR, Sergei



* Re: [PATCH net-next v2] tipc: remove two indentation levels in tipc_recv_msg routine
From: David Miller @ 2013-10-30 20:54 UTC (permalink / raw)
  To: ying.xue
  Cc: David.Laight, maloy, Paul.Gortmaker, jon.maloy, erik.hugne,
	andreas.bofjall, tipc-discussion, netdev
In-Reply-To: <1383103617-28813-1-git-send-email-ying.xue@windriver.com>

From: Ying Xue <ying.xue@windriver.com>
Date: Wed, 30 Oct 2013 11:26:57 +0800

> The message dispatching part of tipc_recv_msg() is wrapped in layers of
> while/if/if/switch, causing out-of-control indentation that does not
> look very good. We reduce two indentation levels by separating the
> message dispatching from the blocks that check link state and
> sequence numbers, allowing longer function and arg names to be
> consistently indented without wrapping. Additionally, we rename the
> "cont" label to "discard" and add one new label called "unlock_discard"
> to make the code clearer. In all, these are cosmetic changes that do not
> alter the operation of TIPC in any way.
> 
> Signed-off-by: Ying Xue <ying.xue@windriver.com>
> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
> Cc: David Laight <david.laight@aculab.com>
> Cc: Andreas Bofjäll <andreas.bofjall@ericsson.com>
> ---
> v2: Incorporated comments from David Laight and Andreas Bofjäll

This patch looks good, applied, thanks.


* Re: named network namespace -- setns() with Invalid argument (errno 22)
From: Eric W. Biederman @ 2013-10-30 20:54 UTC (permalink / raw)
  To: dilip.daya; +Cc: netdev
In-Reply-To: <8738njfkdp.fsf@xmission.com>

ebiederm@xmission.com (Eric W. Biederman) writes:

> Dilip Daya <dilip.daya@hp.com> writes:
>
>> Hi All,
>>
>> Is the following intended behavior for adding "nested" named network namespaces ?
>
> Not exactly intended but this is not misbehavior either.
>
> Mostly this is a don't do that then scenario.

Let me clarify a little. The primary purpose of ip netns exec is to
allow programs that are not aware of more than one network namespace
to work without modification.  It is not intended to be a primary
environment for applications to run in.

Which is a big part of where the "don't do that then" comes from.

If you can figure out what is going on and send patches I will be happy
to accept them.

Also public conversation is appreciated so that anyone else with the
same confusions may be educated at the same time.

Eric


* Re: [PATCH] bgmac: pass received packet to the netif instead of copying it
From: David Miller @ 2013-10-30 20:58 UTC (permalink / raw)
  To: zajec5; +Cc: netdev, openwrt-devel, nlhintz
In-Reply-To: <1383116400-29905-1-git-send-email-zajec5@gmail.com>

From: Rafał Miłecki <zajec5@gmail.com>
Date: Wed, 30 Oct 2013 08:00:00 +0100

> Copying whole packets with skb_copy_from_linear_data_offset is a pretty
> bad idea. CPU was spending time in __copy_user_common and network
> performance was lower. With the new solution iperf-measured speed
> increased from 116Mb/s to 134Mb/s.
> 
> Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
> ---
> Changes since [RFC TRY#2]:
> 1) Fixed arguments alignment
> 2) Dropped code fixing old slot in case of bgmac_dma_rx_skb_for_slot
> failure. Thanks to Nathan's patch, bgmac_dma_rx_skb_for_slot doesn't
> change anything in the slot if it fails somewhere.

Looks good, applied to net-next, thanks.


* Re: [PATCH net v2 2/3] r8152: modify the tx flow
From: David Miller @ 2013-10-30 21:04 UTC (permalink / raw)
  To: hayeswang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1383117220-893-3-git-send-email-hayeswang@realtek.com>

From: Hayes Wang <hayeswang@realtek.com>
Date: Wed, 30 Oct 2013 15:13:39 +0800

> Remove the code for sending the packet in rtl8152_start_xmit().
> Let rtl8152_start_xmit() only queue the packet, and schedule a
> tasklet to send the queued packets. This simplifies the code and makes
> sure all packets are sent in the original order.
> 
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>

Basically, your driver will now queue up to 1,000 packets onto
this tx_queue list, because that is what tx_queue_len will be
for alloc_etherdev() allocated network devices.

In my previous reply to you about this patch, I asked you to
quantify and study the effects of using a limit of 60.  I said
that 60 might be too large.

You've responded by removing the limit completely, which is exactly
the opposite of what I've asked you to do.  Why did you do this?

This patch series is still not in a state where I can apply it,
sorry.
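
The kind of bound being asked for can be sketched as a small model: stop
accepting packets once the driver-private list reaches a fixed limit, and
resume once it has drained. This is a userspace illustration, not r8152 code.
The names struct tx_queue, txq_enqueue() and txq_complete() are hypothetical,
the limit of 60 is the value under discussion (which may itself be too large),
and the half-way wake threshold is an assumption. In a driver, the stopped
flag would correspond to netif_stop_queue() and netif_wake_queue().

```c
#include <stdbool.h>
#include <stddef.h>

#define TX_QLEN_LIMIT 60	/* value from the discussion; needs measuring */

/* Hypothetical model of a driver-private tx list. */
struct tx_queue {
	size_t len;	/* packets currently queued */
	bool stopped;	/* models the netif_stop_queue() state */
};

/* Try to queue one packet. Returns false if the list is already full. */
static bool txq_enqueue(struct tx_queue *q)
{
	if (q->len >= TX_QLEN_LIMIT) {
		q->stopped = true;
		return false;
	}
	q->len++;
	if (q->len >= TX_QLEN_LIMIT)
		q->stopped = true;	/* tell the stack to stop feeding us */
	return true;
}

/* One packet has been handed to the hardware and left the list. */
static void txq_complete(struct tx_queue *q)
{
	if (q->len > 0)
		q->len--;
	/* Wake only after draining below half the limit, to avoid
	 * toggling the queue state on every single packet.
	 */
	if (q->stopped && q->len < TX_QLEN_LIMIT / 2)
		q->stopped = false;
}
```

With a bound like this the backlog stays in the qdisc, where it is visible and
can be managed, instead of accumulating inside the driver.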


* Re: [PATCH] mac802154: Use pr_err(...) rather than printk(KERN_ERR ...)
From: David Miller @ 2013-10-30 21:06 UTC (permalink / raw)
  To: chenweilong
  Cc: alex.bluesman.smirnov, dbaryshkov, linux-zigbee-devel, netdev,
	dingtianhong
In-Reply-To: <1383118087-10128-1-git-send-email-chenweilong@huawei.com>

From: Chen Weilong <chenweilong@huawei.com>
Date: Wed, 30 Oct 2013 15:28:07 +0800

> This change is inspired by checkpatch.
> 
> Signed-off-by: Weilong Chen <chenweilong@huawei.com>

Applied to net-next, thanks.


* Re: [PATCH] ipv6: remove the unnecessary statement in find_match()
From: David Miller @ 2013-10-30 21:08 UTC (permalink / raw)
  To: duanj.fnst; +Cc: netdev
In-Reply-To: <5270B7AE.9020801@cn.fujitsu.com>

From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Date: Wed, 30 Oct 2013 15:39:26 +0800

> 
> After reading the function rt6_check_neigh(), we can see
> that RT6_NUD_FAIL_SOFT can be returned only when
> IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) is false, so in
> find_match() there is no need to test
> !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF).
> 
> Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>

Applied to net-next, thanks.

CONFIG_IPV6_ROUTER_PREF is another good candidate for Kconfig
removal.  I know we've had several bugs that only apply when
this option is on vs. off.  We're maintaining two different
code paths, for really no good reason.


* Re: [Xen-devel] [PATCH net-next RFC 2/5] xen-netback: Change TX path from grant copy to mapping
From: Zoltan Kiss @ 2013-10-30 21:10 UTC (permalink / raw)
  To: Paul Durrant, Ian Campbell, Wei Liu,
	xen-devel@lists.xenproject.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Jonathan Davies
In-Reply-To: <9AAE0902D5BC7E449B7C8E4E778ABCD015717A@AMSPEX01CL01.citrite.net>

On 30/10/13 09:11, Paul Durrant wrote:
>> diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-
>> netback/interface.c
>> index f5c3c57..fb16ede 100644
>> --- a/drivers/net/xen-netback/interface.c
>> +++ b/drivers/net/xen-netback/interface.c
>> @@ -336,8 +336,20 @@ struct xenvif *xenvif_alloc(struct device *parent,
>> domid_t domid,
>>   	vif->pending_prod = MAX_PENDING_REQS;
>>   	for (i = 0; i < MAX_PENDING_REQS; i++)
>>   		vif->pending_ring[i] = i;
>> -	for (i = 0; i < MAX_PENDING_REQS; i++)
>> -		vif->mmap_pages[i] = NULL;
>> +	err = alloc_xenballooned_pages(MAX_PENDING_REQS,
>> +		vif->mmap_pages,
>> +		false);
>
> Since this is a per-vif allocation, is this going to scale?
Good question, I'll look into this.

>> @@ -1620,14 +1562,25 @@ static int xenvif_tx_submit(struct xenvif *vif, int
>> budget)
>>   		memcpy(skb->data,
>>   		       (void *)(idx_to_kaddr(vif, pending_idx)|txp->offset),
>>   		       data_len);
>> +		vif->pending_tx_info[pending_idx].callback_struct.ctx = NULL;
>>   		if (data_len < txp->size) {
>>   			/* Append the packet payload as a fragment. */
>>   			txp->offset += data_len;
>>   			txp->size -= data_len;
>> -		} else {
>> +			skb_shinfo(skb)->destructor_arg =
>> +				&vif->pending_tx_info[pending_idx].callback_struct;
>> +		} else if (!skb_shinfo(skb)->nr_frags) {
>>   			/* Schedule a response immediately. */
>> +			skb_shinfo(skb)->destructor_arg = NULL;
>> +			xenvif_idx_unmap(vif, pending_idx);
>>   			xenvif_idx_release(vif, pending_idx,
>>   					   XEN_NETIF_RSP_OKAY);
>> +		} else {
>> +			/* FIXME: first request fits linear space, I don't know
>> +			 * if any guest would do that, but I think it's possible
>> +			 */
>
> The Windows frontend, because it has to parse the packet headers, will coalesce everything up to the payload in a single frag and it would be a good idea to copy this directly into the linear area.
I forgot to clarify this comment: the problem I wanted to handle here is
when the first request's size is PKT_PROT_LEN and there are more fragments.
Then skb->len will be PKT_PROT_LEN as well, and the if statement falls
through to the else branch. That might be problematic if we release the
slot of the first request separately from the others. Or am I
overlooking something? Does that matter to netfront anyway?
And this problem, if it's real, applies to the previous, grant copy
method as well.
However, I think it might be better to change the condition to
(data_len <= txp->size), rather than nesting an if-else statement in
the else branch.

>> @@ -1635,13 +1588,19 @@ static int xenvif_tx_submit(struct xenvif *vif, int
>> budget)
>>   		else if (txp->flags & XEN_NETTXF_data_validated)
>>   			skb->ip_summed = CHECKSUM_UNNECESSARY;
>>
>> -		xenvif_fill_frags(vif, skb);
>> +		xenvif_fill_frags(vif, skb, pending_idx);
>>
>>   		if (skb_is_nonlinear(skb) && skb_headlen(skb) <
>> PKT_PROT_LEN) {
>>   			int target = min_t(int, skb->len, PKT_PROT_LEN);
>>   			__pskb_pull_tail(skb, target - skb_headlen(skb));
>>   		}
>>
>> +		/* Set this flag after __pskb_pull_tail, as it can trigger
>> +		 * skb_copy_ubufs, while we are still in control of the skb
>> +		 */
>
> You can't be sure that there will be no subsequent pullups. The v6 parsing code I added may need to do that on a (hopefully) rare occasion.
The only thing that matters is that it doesn't happen between this point and
the call to netif_receive_skb. I think I will move this right before that
call, and expand the comment.

Zoli


* Re: [PATCH] [trivial]doc:net: Fix typo in Documentation/networking
From: David Miller @ 2013-10-30 21:11 UTC (permalink / raw)
  To: standby24x7; +Cc: trivial, linux-kenrel, netdev
In-Reply-To: <1383119175-1963-1-git-send-email-standby24x7@gmail.com>

From: Masanari Iida <standby24x7@gmail.com>
Date: Wed, 30 Oct 2013 16:46:15 +0900

> Correct spelling typo in Documentation/networking
> 
> Signed-off-by: Masanari Iida <standby24x7@gmail.com>

Applied with Randy's suggested adjustments, thanks.


* Re: [PATCH] ipv6: remove the unnecessary statement in find_match()
From: Hannes Frederic Sowa @ 2013-10-30 21:11 UTC (permalink / raw)
  To: David Miller; +Cc: duanj.fnst, netdev
In-Reply-To: <20131030.170837.1882918923249091614.davem@davemloft.net>

On Wed, Oct 30, 2013 at 05:08:37PM -0400, David Miller wrote:
> From: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> Date: Wed, 30 Oct 2013 15:39:26 +0800
> 
> > 
> > > After reading the function rt6_check_neigh(), we can see
> > > that RT6_NUD_FAIL_SOFT can be returned only when
> > > IS_ENABLED(CONFIG_IPV6_ROUTER_PREF) is false, so in
> > > find_match() there is no need to test
> > > !IS_ENABLED(CONFIG_IPV6_ROUTER_PREF).
> > 
> > Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
> 
> Applied to net-next, thanks.
> 
> CONFIG_IPV6_ROUTER_PREF is another good candidate for Kconfig
> removal.  I know we've had several bugs that only apply when
> this option is on vs. off.  We're maintaining two different
> code paths, for really no good reason.

I agree and actually thought about that yesterday. Do you think a sysctl
is a good option?


* Re: [Xen-devel] [PATCH net-next RFC 0/5] xen-netback: TX grant mapping instead of copy
From: Zoltan Kiss @ 2013-10-30 21:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: ian.campbell, wei.liu2, xen-devel, netdev, linux-kernel,
	jonathan.davies
In-Reply-To: <20131030191721.GA14261@phenom.dumpdata.com>

On 30/10/13 19:17, Konrad Rzeszutek Wilk wrote:
> On Wed, Oct 30, 2013 at 03:16:17PM -0400, Konrad Rzeszutek Wilk wrote:
>> Odd. I don't see the #5 patch?
>
> Ah, you have two #4 patches:
>
> [PATCH net-next RFC 4/5] xen-netback: Change RX path for mapped SKB fragments
> [PATCH net-next RFC 4/5] xen-netback: Fix indentations

Yep, sorry, I will fix it up in the next version!

Zoli


* Re: [PATCH net-next 0/4] 6lowpan: cleanup header creation
From: David Miller @ 2013-10-30 21:19 UTC (permalink / raw)
  To: alex.aring
  Cc: alex.bluesman.smirnov, linux-zigbee-devel, werner, dbaryshkov,
	netdev
In-Reply-To: <1383121104-2515-1-git-send-email-alex.aring@gmail.com>

From: Alexander Aring <alex.aring@gmail.com>
Date: Wed, 30 Oct 2013 09:18:20 +0100

> This patch series cleans up the 6LoWPAN header creation and extends the use
> of skb_*_header functions.
> 
> Patch 2/4 fixes issues with parsing the mac header. The ieee802.15.4 header
> has a dynamic size which depends on frame control bits. This patch replaces the
> static mac header len calculation with a dynamic one.

Series applied, thanks.


* [PATCH v2 net-next] net: pkt_sched: PIE AQM scheme
From: Vijay Subramanian @ 2013-10-30 21:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, shemminger, Vijay Subramanian, Mythili Prabhu, Dave Taht

From: Vijay Subramanian <vijaynsu@cisco.com>

Proportional Integral controller Enhanced (PIE) scheduler for bufferbloat.

From the IETF draft below:
"Bufferbloat is a phenomenon where excess buffers in the network cause high
latency and jitter. As more and more interactive applications (e.g. voice over
IP, real time video streaming and financial transactions) run in the Internet,
high latency and jitter degrade application performance. There is a pressing
need to design intelligent queue management schemes that can control latency
and jitter; and hence provide desirable quality of service to users.

We present here a lightweight design, PIE(Proportional Integral controller
Enhanced) that can effectively control the average queueing latency to a target
value. Simulation results, theoretical analysis and Linux testbed results have
shown that PIE can ensure low latency and achieve high link utilization under
various congestion situations. The design does not require per-packet
timestamp, so it incurs very small overhead and is simple enough to implement
in both hardware and software."
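
As a reading aid, the periodic control law described in the draft can be
sketched in floating point as below: the drop probability moves with the
distance of the current queueing delay from the target (weighted by alpha)
and with the trend of the delay since the last update (weighted by beta).
This is not the code in the patch, which uses scaled integer arithmetic and
psched time; struct pie_state and pie_update() are illustrative names, and
the units of the delay arguments are left to the caller.

```c
/* Illustrative state for the sketch; not the structures from the patch. */
struct pie_state {
	double prob;		/* current drop probability, 0.0 .. 1.0 */
	double qdelay_old;	/* queueing delay seen at the last update */
};

/* One periodic update, run every tupdate interval.
 *   prob += alpha * (qdelay - target) + beta * (qdelay - qdelay_old)
 * The result is clamped to [0, 1] and returned.
 */
static double pie_update(struct pie_state *s, double qdelay, double target,
			 double alpha, double beta)
{
	double p = s->prob
		 + alpha * (qdelay - target)
		 + beta * (qdelay - s->qdelay_old);

	if (p < 0.0)
		p = 0.0;
	if (p > 1.0)
		p = 1.0;

	s->prob = p;
	s->qdelay_old = qdelay;
	return p;
}
```

While the delay stays above target the probability keeps rising, and once the
queue drains below target it decays back towards zero.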

Many thanks to Dave Taht for extensive feedback, reviews, testing and
suggestions. Thanks also to Stephen Hemminger for initial review and suggestion
to use psched and friends.   Naeem Khademi and Dave Taht independently
contributed to ECN support.

For more information, please see the technical paper about PIE in the IEEE
Conference on High Performance Switching and Routing 2013. A copy of the paper
can be found at ftp://ftpeng.cisco.com/pie/.

Please also refer to the IETF draft submission at
http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

All relevant code, documents, test scripts and results can be found at
ftp://ftpeng.cisco.com/pie/.

For problems with the iproute2/tc or Linux kernel code, please contact Vijay
Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) or Mythili
Prabhu (mysuryan@cisco.com).

Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Mythili Prabhu <mysuryan@cisco.com>
CC: Dave Taht <dave.taht@bufferbloat.net>
---
Changes from V1: Addressed review comments regarding coding style and various
implementation issues. In particular 1) add locking for the timer and 2) use 
psched timer directly instead of a custom timer based on psched.

 include/uapi/linux/pkt_sched.h |   26 ++
 net/sched/Kconfig              |   13 +
 net/sched/Makefile             |    1 +
 net/sched/sch_pie.c            |  567 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 607 insertions(+)
 create mode 100644 net/sched/sch_pie.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index f2624b5..2fb6e6d 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -787,4 +787,30 @@ struct tc_fq_qd_stats {
 	__u32	throttled_flows;
 	__u32	pad;
 };
+
+/*PIE*/
+enum {
+	TCA_PIE_UNSPEC,
+	TCA_PIE_TARGET,
+	TCA_PIE_LIMIT,
+	TCA_PIE_TUPDATE,
+	TCA_PIE_ALPHA,
+	TCA_PIE_BETA,
+	TCA_PIE_ECN,
+	TCA_PIE_BYTEMODE,
+	__TCA_PIE_MAX
+};
+#define TCA_PIE_MAX   (__TCA_PIE_MAX - 1)
+
+struct tc_pie_xstats {
+	__u32 prob;             /* current probability */
+	__u32 delay;            /* current delay in ms */
+	__u32 avg_dq_rate;      /* current average dq_rate in bits/pie_time */
+	__u32 packets_in;       /* total number of packets enqueued */
+	__u32 dropped;          /* packets dropped due to pie_action */
+	__u32 overlimit;        /* dropped due to lack of space in queue */
+	__u32 maxq;             /* maximum queue size */
+	__u32 ecn_mark;         /* packets marked with ecn*/
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index c03a32a..a079e03 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -286,6 +286,19 @@ config NET_SCH_FQ
 
 	  If unsure, say N.
 
+config NET_SCH_PIE
+	tristate "Proportional Integral controller Enhanced (PIE) scheduler"
+	help
+	  Say Y here if you want to use the Proportional Integral controller
+	  Enhanced (PIE) packet scheduling algorithm.
+	  For more information, please see
+	  http://tools.ietf.org/html/draft-pan-tsvwg-pie-00
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_pie.
+
+	  If unsure, say N.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index e5f9abe..5b4ece9 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -40,6 +40,7 @@ obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
 obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
+obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
 
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
new file mode 100644
index 0000000..e263208
--- /dev/null
+++ b/net/sched/sch_pie.c
@@ -0,0 +1,567 @@
+/* Copyright (C) 2013 Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
+ * USA.
+ *
+ * Author: Vijay Subramanian <vijaynsu@cisco.com>
+ * Author: Mythili Prabhu <mysuryan@cisco.com>
+ *
+ * ECN support is added by Naeem Khademi <naeemk@ifi.uio.no>
+ * University of Oslo, Norway.
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+
+#define QUEUE_THRESHOLD 5000
+#define DQCOUNT_INVALID -1
+#define MAX_PROB  0xffffffff
+#define PIE_SCALE 8
+
+/* parameters used */
+struct pie_params {
+	psched_time_t target;	/* user specified target delay in pschedtime */
+	psched_time_t tupdate;	/* frequency with which the timer fires */
+	u32 limit;		/* number of packets that can be enqueued */
+	u32 alpha;		/* alpha and beta are between -4 and 4 */
+	u32 beta;		/* and are used for shift relative to 1 */
+	bool ecn;		/* true if ecn is enabled */
+	bool bytemode;		/* to scale drop early prob based on pkt size */
+};
+
+/* variables used */
+struct pie_vars {
+	u32 prob;		/* probability but scaled by u32 limit. */
+	psched_time_t burst_time;
+	psched_time_t qdelay;
+	psched_time_t qdelay_old;
+	u64 dq_count;		/* measured in bytes */
+	psched_time_t dq_tstamp;	/* drain rate */
+	u32 avg_dq_rate;	/* bytes per pschedtime tick,scaled */
+	u32 qlen_old;		/* in bytes */
+};
+
+/* statistics gathering*/
+struct pie_stats {
+	u32 packets_in;		/* total number of packets enqueued */
+	u32 dropped;		/* packets dropped due to pie_action */
+	u32 overlimit;		/* dropped due to lack of space in queue */
+	u32 maxq;		/* maximum queue size */
+	u32 ecn_mark;		/* packets marked with ECN */
+};
+
+/* private data for the Qdisc */
+struct pie_sched_data {
+	struct pie_params params;
+	struct pie_vars vars;
+	struct pie_stats stats;
+	struct timer_list adapt_timer;
+};
+
+static void pie_params_init(struct pie_params *params)
+{
+	params->alpha = 2;
+	params->beta = 20;
+	params->tupdate = PSCHED_NS2TICKS(30 * NSEC_PER_MSEC);	/* 30 ms */
+	params->limit = 200;	/* default of 200 packets */
+	params->target = PSCHED_NS2TICKS(20 * NSEC_PER_MSEC);	/* 20 ms */
+	params->ecn = false;
+	params->bytemode = false;
+}
+
+static void pie_vars_init(struct pie_vars *vars)
+{
+	vars->dq_count = DQCOUNT_INVALID;
+	vars->avg_dq_rate = 0;
+	/* default of 100 ms in pschedtime  */
+	vars->burst_time = PSCHED_NS2TICKS(100 * NSEC_PER_MSEC);
+}
+
+static bool drop_early(struct Qdisc *sch, u32 packet_size)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	u32 rnd;
+	u32 local_prob = q->vars.prob;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+
+	/* If there is still burst allowance left skip random early drop */
+	if (q->vars.burst_time > 0)
+		return false;
+
+	/* If current delay is less than half of target, and
+	 * if drop prob is low already, disable early_drop
+	 */
+	if ((q->vars.qdelay < q->params.target / 2)
+	    && (q->vars.prob < MAX_PROB / 5))
+		return false;
+
+	/* If we have fewer than 2 mtu-sized packets, disable drop_early,
+	 * similar to min_th in RED
+	 */
+	if (sch->qstats.backlog < 2 * mtu)
+		return false;
+
+	/* If bytemode is turned on, use packet size to compute new
+	 * probability. Smaller packets will have lower drop prob in this case
+	 */
+	if (q->params.bytemode && packet_size <= mtu)
+		local_prob = (local_prob / mtu) * packet_size;
+	else
+		local_prob = q->vars.prob;
+
+	rnd = net_random();
+	if (rnd < local_prob)
+		return true;
+
+	return false;
+}
+
+static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+
+	if (unlikely(qdisc_qlen(sch) >= sch->limit))
+		goto out;
+
+	if (!drop_early(sch, skb->len)) {
+		/* we can enqueue the packet */
+		q->stats.packets_in++;
+
+		if (qdisc_qlen(sch) > q->stats.maxq)
+			q->stats.maxq = qdisc_qlen(sch);
+
+		return qdisc_enqueue_tail(skb, sch);
+	} else if (q->params.ecn && INET_ECN_set_ce(skb) &&
+		   (q->vars.prob <= MAX_PROB / 10)) {
+		/* If packet is ecn capable, mark it if drop probability
+		 * is lower than 10%, else drop it.
+		 */
+		q->stats.ecn_mark++;
+		return qdisc_enqueue_tail(skb, sch);
+	}
+out:
+	q->stats.overlimit++;
+	return qdisc_drop(skb, sch);
+}
+
+static const struct nla_policy pie_policy[TCA_PIE_MAX + 1] = {
+	[TCA_PIE_TARGET] = {.type = NLA_U32},
+	[TCA_PIE_LIMIT] = {.type = NLA_U32},
+	[TCA_PIE_TUPDATE] = {.type = NLA_U32},
+	[TCA_PIE_ALPHA] = {.type = NLA_U32},
+	[TCA_PIE_BETA] = {.type = NLA_U32},
+	[TCA_PIE_ECN] = {.type = NLA_U32},
+	[TCA_PIE_BYTEMODE] = {.type = NLA_U32},
+};
+
+static int pie_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_PIE_MAX + 1];
+	unsigned int qlen;
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_PIE_MAX, opt, pie_policy);
+	if (err < 0)
+		return err;
+
+	sch_tree_lock(sch);
+
+	/* convert from microseconds to pschedtime */
+	if (tb[TCA_PIE_TARGET]) {
+		/* target is in us */
+		u32 target = nla_get_u32(tb[TCA_PIE_TARGET]);
+		/* convert to pschedtime */
+		q->params.target = PSCHED_NS2TICKS((u64) target * NSEC_PER_USEC);
+	}
+
+	if (tb[TCA_PIE_TUPDATE]) {
+		/* tupdate is in us */
+		u32 tupdate = nla_get_u32(tb[TCA_PIE_TUPDATE]);
+		/* convert to pschedtime */
+		q->params.tupdate = PSCHED_NS2TICKS((u64) tupdate * NSEC_PER_USEC);
+	}
+
+	if (tb[TCA_PIE_LIMIT]) {
+		u32 limit = nla_get_u32(tb[TCA_PIE_LIMIT]);
+		q->params.limit = limit;
+		sch->limit = limit;
+	}
+
+	if (tb[TCA_PIE_ALPHA])
+		q->params.alpha = nla_get_u32(tb[TCA_PIE_ALPHA]);
+
+	if (tb[TCA_PIE_BETA])
+		q->params.beta = nla_get_u32(tb[TCA_PIE_BETA]);
+
+	if (tb[TCA_PIE_ECN])
+		q->params.ecn = nla_get_u32(tb[TCA_PIE_ECN]);
+
+	if (tb[TCA_PIE_BYTEMODE])
+		q->params.bytemode = nla_get_u32(tb[TCA_PIE_BYTEMODE]);
+
+	/* Drop excess packets if new limit is lower */
+	qlen = sch->q.qlen;
+	while (sch->q.qlen > sch->limit) {
+		struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+		sch->qstats.backlog -= qdisc_pkt_len(skb);
+		qdisc_drop(skb, sch);
+	}
+	qdisc_tree_decrease_qlen(sch, qlen - sch->q.qlen);
+
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static void pie_process_dequeue(struct Qdisc *sch, struct sk_buff *skb)
+{
+
+	struct pie_sched_data *q = qdisc_priv(sch);
+	int qlen = sch->qstats.backlog;	/* current queue size in bytes */
+
+	/* If current queue is about 10 packets or more and dq_count is unset
+	 *  we have enough packets to calculate the drain rate. Save
+	 *  current time as dq_tstamp and start measurement cycle.
+	 */
+	if (qlen >= QUEUE_THRESHOLD && q->vars.dq_count == DQCOUNT_INVALID) {
+		q->vars.dq_tstamp = psched_get_time();
+		q->vars.dq_count = 0;
+	}
+
+	/*  Calculate the average drain rate from this value.  If queue length
+	 *  has receded to a small value viz., <= QUEUE_THRESHOLD bytes, reset
+	 *  the dq_count to -1 as we don't have enough packets to calculate the
+	 *  drain rate anymore. The following if block is entered only when we
+	 *  have a substantial queue built up (QUEUE_THRESHOLD bytes or more)
+	 *  and we calculate the drain rate for the threshold here.  dq_count is
+	 *  in bytes, time difference in psched_time, hence rate is in
+	 *  bytes/psched_time.
+	 */
+	if (q->vars.dq_count != DQCOUNT_INVALID) {
+
+		q->vars.dq_count += skb->len;
+
+		if (q->vars.dq_count >= QUEUE_THRESHOLD) {
+			psched_time_t now = psched_get_time();
+			u32 dtime = now - q->vars.dq_tstamp;
+			u32 count = q->vars.dq_count << PIE_SCALE;
+
+			if (dtime == 0)
+				return;
+
+			count = count / dtime;
+
+			if (q->vars.avg_dq_rate == 0)
+				q->vars.avg_dq_rate = count;
+			else
+				q->vars.avg_dq_rate =
+				    (q->vars.avg_dq_rate -
+				     (q->vars.avg_dq_rate >> 3)) + (count >> 3);
+
+			/* If the queue has receded below the threshold, we hold
+			 * on to the last drain rate calculated, else we reset
+			 * dq_count to 0 to re-enter the if block when the next
+			 * packet is dequeued
+			 */
+			if (qlen < QUEUE_THRESHOLD)
+				q->vars.dq_count = DQCOUNT_INVALID;
+			else {
+				q->vars.dq_count = 0;
+				q->vars.dq_tstamp = psched_get_time();
+			}
+
+			if (q->vars.burst_time > 0) {
+				if (q->vars.burst_time > dtime)
+					q->vars.burst_time -= dtime;
+				else
+					q->vars.burst_time = 0;
+			}
+		}
+	}
+}
+
+static void calculate_probability(struct Qdisc *sch)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	u32 qlen = sch->qstats.backlog;	/* queue size in bytes */
+	psched_time_t qdelay = 0;	/* in pschedtime */
+	psched_time_t qdelay_old = q->vars.qdelay;	/* in pschedtime */
+	s32 delta = 0;		/* determines the change in probability  */
+	u32 oldprob;
+	u32 alpha, beta;
+	bool update_prob = true;
+
+	q->vars.qdelay_old = q->vars.qdelay;
+
+	if (q->vars.avg_dq_rate > 0)
+		qdelay = (qlen << PIE_SCALE) / q->vars.avg_dq_rate;
+	else
+		qdelay = 0;
+
+	/* If qdelay is zero and qlen is not, it means qlen is very small, less
+	 * than dequeue_rate, so we do not update probability in this round
+	 */
+	if (qdelay == 0 && qlen != 0)
+		update_prob = false;
+
+	/* Add ranges for alpha and beta, more aggressive for high dropping
+	 * mode and gentle steps for light dropping mode
+	 * In light dropping mode, take gentle steps; in medium dropping mode,
+	 * take medium steps; in high dropping mode, take big steps.
+	 */
+	if (q->vars.prob < MAX_PROB / 100) {
+		alpha =
+		    (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7;
+		beta =
+		    (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 7;
+	} else if (q->vars.prob < MAX_PROB / 10) {
+		alpha =
+		    (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 5;
+		beta =
+		    (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 5;
+	} else {
+		alpha =
+		    (q->params.alpha * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 4;
+		beta =
+		    (q->params.beta * (MAX_PROB / PSCHED_TICKS_PER_SEC)) >> 4;
+	}
+
+	/* alpha and beta should be between 0 and 32, in multiples of 1/16
+	 */
+	delta += alpha * ((qdelay - q->params.target));
+	delta += beta * ((qdelay - qdelay_old));
+
+	oldprob = q->vars.prob;
+
+	/* addition to ensure we increase probability in steps of no
+	 *  more than 2%
+	 */
+	if (delta > (s32) (MAX_PROB / (100 / 2))
+	    && q->vars.prob >= MAX_PROB / 10)
+		delta = (MAX_PROB / 100) * 2;
+
+	/*  Non-linear drop
+	 *  Tune drop probability to increase quickly for high delays
+	 *  (250ms and above)
+	 *  250ms is derived through experiments and provides error protection
+	 */
+
+	if (qdelay > (PSCHED_NS2TICKS(250 * NSEC_PER_MSEC)))
+		delta += MAX_PROB / (100 / 2);
+
+	q->vars.prob += delta;
+
+	if (delta > 0) {
+		/* prevent overflow */
+		if (q->vars.prob < oldprob) {
+			q->vars.prob = MAX_PROB;
+			/* Prevent normalization error
+			 * If probability is the maximum value already,
+			 * we normalize it here, and skip the
+			 * check to do a non-linear drop in the next section
+			 */
+			update_prob = false;
+		}
+	} else {
+		/* prevent underflow */
+		if (q->vars.prob > oldprob)
+			q->vars.prob = 0;
+	}
+
+	/* Non-linear drop in probability */
+	/* Reduce drop probability quickly if delay is 0 for 2 consecutive
+	 * Tupdate periods
+	 */
+	if ((qdelay == 0) && (qdelay_old == 0) && update_prob)
+		q->vars.prob = (q->vars.prob * 98) / 100;
+
+	q->vars.qdelay = qdelay;
+	q->vars.qlen_old = qlen;
+
+	/* we restart the measurement cycle if the following conditions are met
+	 *  1. If the delay has been low for 2 consecutive Tupdate periods
+	 *  2. Calculated drop probability is zero
+	 *  3. We have at least one estimate for the avg_dq_rate, i.e.,
+	 *     it is a non-zero value
+	 */
+	if ((q->vars.qdelay < q->params.target / 2)
+	    && (q->vars.qdelay_old < q->params.target / 2)
+	    && (q->vars.prob == 0)
+	    && q->vars.avg_dq_rate > 0)
+		pie_vars_init(&q->vars);
+}
+
+static void pie_timer(unsigned long arg)
+{
+	struct Qdisc *sch = (struct Qdisc *)arg;
+	struct pie_sched_data *q = qdisc_priv(sch);
+	u32 tup;
+	spinlock_t *root_lock = qdisc_lock(qdisc_root_sleeping(sch));
+
+	spin_lock(root_lock);
+	calculate_probability(sch);
+
+	/* reset the timer to fire after 'tupdate'. tupdate is currently in
+	 * psched_time; mod_timer expects time to be in jiffies so convert from
+	 * pschedtime to jiffies
+	 */
+	tup = PSCHED_TICKS2NS(q->params.tupdate);
+	tup = tup / NSEC_PER_MSEC;
+	tup = (tup * HZ) / MSEC_PER_SEC;
+
+	mod_timer(&q->adapt_timer, jiffies + tup);
+	spin_unlock(root_lock);
+
+}
+
+static int pie_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+
+	pie_params_init(&q->params);
+	pie_vars_init(&q->vars);
+	sch->limit = q->params.limit;
+
+	setup_timer(&q->adapt_timer, pie_timer, (unsigned long)sch);
+	mod_timer(&q->adapt_timer, jiffies + HZ / 2);
+
+	if (opt) {
+		int err = pie_change(sch, opt);
+
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int pie_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	/* convert target and tupdate from pschedtime to us */
+	if (nla_put_u32(skb, TCA_PIE_TARGET,
+			((u32) PSCHED_TICKS2NS(q->params.target)) /
+			NSEC_PER_USEC) ||
+	    nla_put_u32(skb, TCA_PIE_LIMIT, sch->limit) ||
+	    nla_put_u32(skb, TCA_PIE_TUPDATE,
+			   ((u32) PSCHED_TICKS2NS(q->params.tupdate)) /
+			   NSEC_PER_USEC) ||
+	    nla_put_u32(skb, TCA_PIE_ALPHA, q->params.alpha) ||
+	    nla_put_u32(skb, TCA_PIE_BETA, q->params.beta) ||
+	    nla_put_u32(skb, TCA_PIE_ECN, q->params.ecn) ||
+	    nla_put_u32(skb, TCA_PIE_BYTEMODE, q->params.bytemode))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -1;
+
+}
+
+static int pie_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	struct tc_pie_xstats st = {
+		.prob		= q->vars.prob,
+		.delay		= ((u32) PSCHED_TICKS2NS(q->vars.qdelay)) /
+				   NSEC_PER_USEC,
+		/* unscale and return dq_rate in bytes per sec */
+		.avg_dq_rate	= q->vars.avg_dq_rate *
+				  (PSCHED_TICKS_PER_SEC) >> PIE_SCALE,
+		.packets_in	= q->stats.packets_in,
+		.overlimit	= q->stats.overlimit,
+		.maxq		= q->stats.maxq,
+		.dropped	= q->stats.dropped,
+		.ecn_mark	= q->stats.ecn_mark,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static struct sk_buff *pie_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct sk_buff *skb;
+	skb = __qdisc_dequeue_head(sch, &sch->q);
+
+	if (!skb)
+		return NULL;
+
+	pie_process_dequeue(sch, skb);
+	return skb;
+}
+
+static void pie_reset(struct Qdisc *sch)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	qdisc_reset_queue(sch);
+	pie_vars_init(&q->vars);
+}
+
+static void pie_destroy(struct Qdisc *sch)
+{
+	struct pie_sched_data *q = qdisc_priv(sch);
+	del_timer_sync(&q->adapt_timer);
+}
+
+static struct Qdisc_ops pie_qdisc_ops __read_mostly = {
+	.id = "pie",
+	.priv_size	= sizeof(struct pie_sched_data),
+	.enqueue	= pie_qdisc_enqueue,
+	.dequeue	= pie_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.init		= pie_init,
+	.destroy	= pie_destroy,
+	.reset		= pie_reset,
+	.change		= pie_change,
+	.dump		= pie_dump,
+	.dump_stats	= pie_dump_stats,
+	.owner		= THIS_MODULE,
+};
+
+static int __init pie_module_init(void)
+{
+	return register_qdisc(&pie_qdisc_ops);
+}
+
+static void __exit pie_module_exit(void)
+{
+	unregister_qdisc(&pie_qdisc_ops);
+}
+
+module_init(pie_module_init);
+module_exit(pie_module_exit);
+
+MODULE_DESCRIPTION("Proportional Integral controller Enhanced (PIE) scheduler");
+MODULE_AUTHOR("Vijay Subramanian");
+MODULE_AUTHOR("Mythili Prabhu");
+MODULE_LICENSE("GPL");
-- 
1.7.9.5
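
For readers following the drain-rate estimate in pie_process_dequeue(), here is a minimal user-space sketch of the same arithmetic. It is not part of the patch. PIE_SCALE is assumed to be 8 purely for illustration (its definition is outside the hunk shown above), and the helper names drain_rate_sample() and ewma_update() are hypothetical. Unlike the patch, the sketch widens the shift to 64 bits before dividing.

```c
#include <assert.h>
#include <stdint.h>

#define PIE_SCALE 8	/* assumed fixed-point shift, for illustration only */

/* One scaled drain-rate sample: (bytes << PIE_SCALE) / elapsed ticks.
 * The 64-bit intermediate keeps the shift from overflowing; the quotient
 * is then narrowed back to 32 bits.
 */
static uint32_t drain_rate_sample(uint32_t bytes, uint32_t dtime)
{
	assert(dtime != 0);	/* the patch returns early when dtime == 0 */
	return (uint32_t)(((uint64_t)bytes << PIE_SCALE) / dtime);
}

/* Fold a sample into the running average with weight 1/8:
 * avg = avg - avg/8 + sample/8. A zero average means "no estimate yet",
 * so the first sample seeds the average directly.
 */
static uint32_t ewma_update(uint32_t avg, uint32_t sample)
{
	if (avg == 0)
		return sample;
	return (avg - (avg >> 3)) + (sample >> 3);
}
```

With the 1/8 weight, a sample that doubles the rate moves the average by only one eighth of the difference, which is why a single short measurement cycle does not swing the computed queue delay much.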

^ permalink raw reply related

* Re: [PATCH 1/2] net: if_arp: add ARPHRD_RAWIP type
From: David Miller @ 2013-10-30 21:25 UTC (permalink / raw)
  To: jukka.rissanen; +Cc: netdev
In-Reply-To: <1383124271-15290-2-git-send-email-jukka.rissanen@linux.intel.com>

From: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Date: Wed, 30 Oct 2013 11:11:10 +0200

> This is used when there is no L2 header before IP header.
> Example of this is Bluetooth 6LoWPAN network.
> 
> The RAWIP header type value is already used in some Android kernels
> so same value is used here in order not to break userspace.
> 
> Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

I'm not applying patches like this until there is an actual user,
and this therefore goes for patch #2 as well.

^ permalink raw reply

* Re: [PATCH] bnx2: Use dev_kfree_skb_any() in bnx2_tx_int()
From: David Miller @ 2013-10-30 21:32 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: tdmackey, mchan, netdev, linux-kernel
In-Reply-To: <CAM_iQpV6ufX5-OOrNQtsXH0_9itjU-FriOoutyKfDEWdg-irQw@mail.gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Wed, 30 Oct 2013 12:23:52 -0700

> On Tue, Oct 29, 2013 at 11:40 PM, David Miller <davem@davemloft.net> wrote:
>> From: Cong Wang <xiyou.wangcong@gmail.com>
>> Date: Tue, 29 Oct 2013 20:50:08 -0700
>>
>>> Normally ->poll() is called in softirq context, while netpoll could
>>> be called in any context depending on its caller.
>>
>> It still makes amends to make the execution context still look
>> "compatible" as far as locking et al. is concerned.
> 
> Adjusting netpoll code for IRQ context is much harder
> than just calling dev_kfree_skb_any()...
> 
> What's more, we have similar change before:
> 
> commit ed79bab847d8e5a2986d8ff43c49c6fb8ee3265f
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Oct 14 14:36:43 2009 +0000
> 
>     virtio_net: use dev_kfree_skb_any() in free_old_xmit_skbs()

Explain to me then why other ethernet drivers implemented identically,
such as tg3, can use plain dev_kfree_skb() just fine?

^ permalink raw reply

* Re: [PATCH net-next] ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
From: David Miller @ 2013-10-30 21:36 UTC (permalink / raw)
  To: hannes; +Cc: netdev, fweimer
In-Reply-To: <20131030200725.GB9093@order.stressinduktion.org>

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 30 Oct 2013 21:07:25 +0100

> On Tue, Oct 29, 2013 at 01:04:25PM +0100, Hannes Frederic Sowa wrote:
>> I really tried hard to find alternatives or even a way to enable
>> the protection automatically given that at least unbound does apply
>> IP_PMTUDISC_DONT to its sockets already. These are the reasons why I
>> came up with the new IP_PMTUDISC_INTERFACE value:
> 
> Sorry to bother you but I would really love to hear your feedback on my
> reasoning so I can try to come up with a solution you would be happy with.

All I've read is that administrators cannot be relied upon to
configure their systems properly for the requirements they have.

And that is a non-argument for adding this new socket option as far as
I'm concerned.

"I strongly do not trust path MTU information" has a scope as small as
a route or an interface, it doesn't go down to the socket or
application level at all.

Please stop pretending that it does.

^ permalink raw reply

* Re: 3.12-rc7 regression - network panic from ipv6
From: mroos @ 2013-10-30 21:41 UTC (permalink / raw)
  To: Steffen Klassert; +Cc: David Miller, hannes, Linux Kernel list, netdev
In-Reply-To: <20131030130945.GJ31491@secunet.com>

> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Works fine, thanks!

Tested-by: Meelis Roos <mroos@linux.ee>

-- 
Meelis Roos (mroos@linux.ee)

^ permalink raw reply

* Re: RFC [PATCH 2/3] PTP: use flags to request HW features
From: Flavio Leitner @ 2013-10-30 21:48 UTC (permalink / raw)
  To: linuxptp-devel; +Cc: netdev
In-Reply-To: <1383159637-8165-3-git-send-email-fbl@redhat.com>

Adding CC to netdev.

On Wed, Oct 30, 2013 at 05:00:36PM -0200, Flavio Leitner wrote:
> Currently the user space can't tell which delay mechanism
> (E2E/P2P) or transport is needed in the ioctl(). Therefore,
> PTP silently fails when the hardware doesn't support a
> certain delay mechanism or transport.
> 
> This patch uses the ioctl flags field to pass that information
> from the user space to kernel. If the hardware supports all
> the desired features, the ioctl continues as before, otherwise
> the unsupported bits are reset to 0 and this information is
> returned to user space.
> 
> It's backwards compatible. If an older PTP application calls
> the ioctl(), the flags field will be zero and no feature is
> checked.
> 
> Signed-off-by: Flavio Leitner <fbl@redhat.com>
> ---
>  include/uapi/linux/net_tstamp.h | 21 +++++++++++++++++++++
>  net/core/dev_ioctl.c            |  3 ---
>  2 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> index ae5df12..d63f8e9 100644
> --- a/include/uapi/linux/net_tstamp.h
> +++ b/include/uapi/linux/net_tstamp.h
> @@ -44,6 +44,27 @@ struct hwtstamp_config {
>  	int rx_filter;
>  };
>  
> +/* possible values for hwtstamp_config->flags */
> +#define HWTSTAMP_FEATURE_FLAGS_MASK 0xFFFF
> +enum hwtstamp_feature_flags {
> +	/* End-to-End Delay Mechanism */
> +	HWTSTAMP_DM_E2E = (1<<0),
> +	/* Peer-to-Peer Delay Mechanism */
> +	HWTSTAMP_DM_P2P = (1<<1),
> +	/* hole: 3 bits for additional mechanisms */
> +
> +	HWTSTAMP_TRANS_UDS = (1<<5),
> +	/* Message Transport: UDP over IPv4 */
> +	HWTSTAMP_TRANS_UDP_IPV4 = (1<<6),
> +	/* Message Transport: UDP over IPv6 */
> +	HWTSTAMP_TRANS_UDP_IPV6 = (1<<7),
> +	/* Message Transport: IEEE 802.3 Ethernet */
> +	HWTSTAMP_TRANS_IEEE_802_3 = (1<<8),
> +	HWTSTAMP_TRANS_DEVICENET = (1<<9),
> +	HWTSTAMP_TRANS_CONTROLNET = (1<<10),
> +	HWTSTAMP_TRANS_PROFINET = (1<<11),
> +};
> +
>  /* possible values for hwtstamp_config->tx_type */
>  enum hwtstamp_tx_types {
>  	/*
> diff --git a/net/core/dev_ioctl.c b/net/core/dev_ioctl.c
> index 5b7d0e1..9e2407e 100644
> --- a/net/core/dev_ioctl.c
> +++ b/net/core/dev_ioctl.c
> @@ -193,9 +193,6 @@ static int net_hwtstamp_validate(struct ifreq *ifr)
>  	if (copy_from_user(&cfg, ifr->ifr_data, sizeof(cfg)))
>  		return -EFAULT;
>  
> -	if (cfg.flags) /* reserved for future extensions */
> -		return -EINVAL;
> -
>  	tx_type = cfg.tx_type;
>  	rx_filter = cfg.rx_filter;
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH] bnx2: Use dev_kfree_skb_any() in bnx2_tx_int()
From: Cong Wang @ 2013-10-30 22:01 UTC (permalink / raw)
  To: David Miller; +Cc: TD Mackey, mchan, Linux Kernel Network Developers, LKML
In-Reply-To: <20131030.173200.2256841895208134119.davem@davemloft.net>

On Wed, Oct 30, 2013 at 2:32 PM, David Miller <davem@davemloft.net> wrote:
>
> Explain to me then why other ethernet drivers implemented identically,
> such as tg3, can use plain dev_kfree_skb() just fine?

I don't think they are fine; I just don't see bug reports
for them. At the very least, I saw the same bug report for be2net too.

To reproduce this bug, we need to find someone calling printk()
within an IRQ handler, which seems rare. It seems there are few
people using the hpsa driver together with netconsole.

^ permalink raw reply

* Re: [PATCH 1/2] net: if_arp: add ARPHRD_RAWIP type
From: Marcel Holtmann @ 2013-10-30 22:26 UTC (permalink / raw)
  To: David S. Miller; +Cc: Jukka Rissanen, netdev
In-Reply-To: <20131030.172556.454916501978452298.davem@davemloft.net>

Hi Dave,

>> This is used when there is no L2 header before IP header.
>> Example of this is Bluetooth 6LoWPAN network.
>> 
>> The RAWIP header type value is already used in some Android kernels
>> so same value is used here in order not to break userspace.
>> 
>> Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
> 
> I'm not applying patches like this until there is an actual user,
> and this therefore goes for patch #2 as well.

Patches for Bluetooth 6LoWPAN have been posted to linux-bluetooth for review. So there is an actual user here.

If you do not want to merge these patches at this point, that is totally fine. We can happily carry them through bluetooth-next and wireless-next trees as well.

Posting them on netdev is mainly for checking that the changes we have to make outside the Bluetooth subsystem are in sync. So that they are reviewed and have been seen before. If you have any general objections to these assignments or changes, please let us know.

Regards

Marcel

^ permalink raw reply

* Re: [PATCH 3/3] e1000e: PTP: provide hardware features
From: Flavio Leitner @ 2013-10-30 21:50 UTC (permalink / raw)
  To: linuxptp-devel; +Cc: netdev
In-Reply-To: <1383159637-8165-4-git-send-email-fbl@redhat.com>

Adding CC to netdev.

The first patch [1/3] is the userlevel linuxptp patch, so
I am not forwarding to netdev.

This is just an example which I used for testing locally. If
the idea is acceptable, this patch must be replaced since the
card does support more features than what is listed below.

fbl

On Wed, Oct 30, 2013 at 05:00:37PM -0200, Flavio Leitner wrote:
> Provide the modes and transports supported by the hardware.
> 
> Signed-off-by: Flavio Leitner <fbl@redhat.com>
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 4ef7867..8bcf167 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -3498,10 +3498,6 @@ static int e1000e_config_hwtstamp(struct e1000_adapter *adapter)
>  	if (!(adapter->flags & FLAG_HAS_HW_TIMESTAMP))
>  		return -EINVAL;
>  
> -	/* flags reserved for future extensions - must be zero */
> -	if (config->flags)
> -		return -EINVAL;
> -
>  	switch (config->tx_type) {
>  	case HWTSTAMP_TX_OFF:
>  		tsync_tx_ctl = 0;
> @@ -5772,6 +5768,9 @@ static int e1000_mii_ioctl(struct net_device *netdev, struct ifreq *ifr,
>  	return 0;
>  }
>  
> +#define E1000E_HWSTAMP_FLAGS_MASK (HWTSTAMP_DM_E2E | HWTSTAMP_DM_P2P |\
> +				   HWTSTAMP_TRANS_IEEE_802_3 |\
> +				   HWTSTAMP_TRANS_UDP_IPV4)
>  /**
>   * e1000e_hwtstamp_ioctl - control hardware time stamping
>   * @netdev: network interface device structure
> @@ -5792,11 +5791,23 @@ static int e1000e_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr)
>  {
>  	struct e1000_adapter *adapter = netdev_priv(netdev);
>  	struct hwtstamp_config config;
> +	int flags_req;
> +	int flags_unsup;
>  	int ret_val;
>  
>  	if (copy_from_user(&config, ifr->ifr_data, sizeof(config)))
>  		return -EFAULT;
>  
> +	if (config.flags & ~HWTSTAMP_FEATURE_FLAGS_MASK)
> +		return -EINVAL;
> +
> +	flags_req = config.flags & HWTSTAMP_FEATURE_FLAGS_MASK;
> +	flags_unsup = flags_req & ~E1000E_HWSTAMP_FLAGS_MASK;
> +	if (flags_unsup) {
> +		config.flags &= ~flags_unsup;
> +		goto out;
> +	}
> +
>  	adapter->hwtstamp_config = config;
>  
>  	ret_val = e1000e_config_hwtstamp(adapter);
> @@ -5823,6 +5834,7 @@ static int e1000e_hwtstamp_ioctl(struct net_device *netdev, struct ifreq *ifr)
>  		break;
>  	}
>  
> +out:
>  	return copy_to_user(ifr->ifr_data, &config,
>  			    sizeof(config)) ? -EFAULT : 0;
>  }
> -- 
> 1.8.3.1
> 
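
The negotiation proposed in patches 2/3 and 3/3 can be sketched in plain user-space C as follows. This is only an illustration of the masking logic, not driver code. The flag values are copied from the proposed enum hwtstamp_feature_flags (they are not in mainline headers), and hwtstamp_negotiate() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the proposed enum in patch 2/3 (not a mainline API). */
#define HWTSTAMP_FEATURE_FLAGS_MASK	0xFFFFu
#define HWTSTAMP_DM_E2E			(1u << 0)
#define HWTSTAMP_DM_P2P			(1u << 1)
#define HWTSTAMP_TRANS_UDP_IPV4		(1u << 6)
#define HWTSTAMP_TRANS_UDP_IPV6		(1u << 7)
#define HWTSTAMP_TRANS_IEEE_802_3	(1u << 8)

/* Features the example driver in patch 3/3 claims to support. */
#define E1000E_HWSTAMP_FLAGS_MASK (HWTSTAMP_DM_E2E | HWTSTAMP_DM_P2P | \
				   HWTSTAMP_TRANS_IEEE_802_3 | \
				   HWTSTAMP_TRANS_UDP_IPV4)

/* Returns -1 if bits outside the feature mask are set (the driver would
 * return -EINVAL). Otherwise stores the unsupported bits in *unsup, clears
 * them from *flags, and returns 0. A caller can tell that the request was
 * not fully honoured when *unsup is non-zero.
 */
static int hwtstamp_negotiate(unsigned int *flags, unsigned int supported,
			      unsigned int *unsup)
{
	assert(flags != NULL && unsup != NULL);

	if (*flags & ~HWTSTAMP_FEATURE_FLAGS_MASK)
		return -1;

	*unsup = *flags & ~supported;
	*flags &= ~*unsup;
	return 0;
}
```

An older application that leaves flags at zero passes through unchanged, which matches the backwards-compatibility claim in the commit message of patch 2/3.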

^ permalink raw reply

* Re: [PATCH net-next] ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE
From: Hannes Frederic Sowa @ 2013-10-30 22:58 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, fweimer
In-Reply-To: <20131030.173608.2156276556221568210.davem@davemloft.net>

On Wed, Oct 30, 2013 at 05:36:08PM -0400, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 30 Oct 2013 21:07:25 +0100
> 
> > On Tue, Oct 29, 2013 at 01:04:25PM +0100, Hannes Frederic Sowa wrote:
> >> I really tried hard to find alternatives or even a way to enable
> >> the protection automatically given that at least unbound does apply
> >> IP_PMTUDISC_DONT to its sockets already. These are the reasons why I
> >> came up with the new IP_PMTUDISC_INTERFACE value:
> > 
> > Sorry to bother you but I would really love to hear your feedback on my
> > reasoning so I can try to come up with a solution you would be happy with.

Thanks for the answer but I still tend to disagree:

> All I've read is that administrators cannot be relied upon to
> configure their systems properly for the requirements they have.
> 
> And that is a non-argument for adding this new socket option as far as
> I'm concerned.

First off, it is pretty hard to get this right. Before writing this patch I
thought that ip_no_pmtu_disc does not honour pmtu updates at all. But as
soon as one fragmentation-needed ICMP message enters the box, the mtu is
reduced to min_pmtu. This was very counterintuitive for me (I don't know if
this is actually intended). Also we honour per application settings for TCP
and DCCP sockets, but I totally understand if you take this as a non-argument.

But that is not that important. Documentation can fix this.

> "I strongly do not trust path MTU information" has a scope as small as
> a route or an interface, it doesn't go down to the socket or
> application level at all.

I still disagree here. We can prevent generating UDP fragments if we
lock the mtu on the route, I agree and somehow missed that before (even
though I already used that myself once).

A DNS resolver can fall back to TCP for querying, where we can honour the
path MTU because it won't do any harm and ensures connectivity.

Also for TCP the socket is matched on the whole 4-tuple and we may
fall back to a 2-tuple lookup on unconnected UDP sockets. The new socket
option would let an application programmer choose to do path mtu discovery
because it knows it will only use a connected socket. On unconnected
sockets one can specify IP_PMTUDISC_INTERFACE to suppress the path MTU
updates and always use the interface MTU without the DF-bit set.

I really am open to suggestions. The socket option could be renamed or
we can transform IP_PMTUDISC_DONT to what I think it actually should
do. Per route or interface settings to enable the acceptance of path
MTU updates per protocol would be ok, too, but more complex.

Greetings,

  Hannes
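
For reference, this is roughly how an application selects a path MTU discovery mode per socket today, and how the proposed mode would be selected. It is a minimal sketch for Linux. The helper name udp_socket_with_pmtu_mode() is hypothetical, and the fallback value 4 for IP_PMTUDISC_INTERFACE is an assumption used for illustration on systems whose headers do not define it.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_PMTUDISC_INTERFACE
#define IP_PMTUDISC_INTERFACE 4	/* assumed value; absent from older headers */
#endif

/* Open an IPv4 UDP socket and set its IP_MTU_DISCOVER mode, for example
 * IP_PMTUDISC_DONT or the proposed IP_PMTUDISC_INTERFACE.
 * Returns the socket descriptor, or -1 on failure.
 */
static int udp_socket_with_pmtu_mode(int mode)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
		       &mode, sizeof(mode)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A resolver such as the one discussed above would call this with IP_PMTUDISC_DONT now. A kernel that does not know the requested mode is expected to reject the setsockopt() call, so the helper returns -1 in that case.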

^ permalink raw reply
