* Re: [RFC] sched: CHOKe packet scheduler (v0.2)
From: Stephen Hemminger @ 2011-01-07 7:10 UTC (permalink / raw)
To: Changli Gao; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <AANLkTi=C6c2md1_MB6dqvVHx2OL0Vj5Lgosi2C=kK9YG@mail.gmail.com>
On Fri, 7 Jan 2011 13:39:26 +0800
Changli Gao <xiaosuo@gmail.com> wrote:
> On Fri, Jan 7, 2011 at 12:55 PM, Stephen Hemminger
> <shemminger@vyatta.com> wrote:
> > On Thu, 06 Jan 2011 05:07:30 +0100
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> >> On Wednesday, 05 January 2011 at 11:21 -0800, Stephen Hemminger wrote:
> >> > This implements the CHOKe packet scheduler based on the existing
> >> > Linux RED scheduler, following the algorithm described in the paper.
> >> >
> >> > The core idea is:
> >> >   For every packet arrival:
> >> >     Calculate Qave
> >> >     if (Qave < minth)
> >> >         Queue the new packet
> >> >     else
> >> >         Select randomly a packet from the queue
> >> >         if (both packets from same flow)
> >> >             then Drop both the packets
> >> >         else if (Qave > maxth)
> >> >             Drop packet
> >> >         else
> >> >             Admit packet with probability p (same as RED)
> >> >
> >> > See also:
> >> > Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
> >> > queue management scheme for approximating fair bandwidth allocation",
> >> > Proceedings of INFOCOM 2000, March 2000.
> >> >
> >> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> >> >
> >>
> >> To be really useful in a wide range of environments, I believe that :
> >>
> >> - CHOKe should be able to use an external flow classifier (like say...
> >> SFQ) to compute a token and compare two skbs by this token instead of
> >> custom rxhash or whatever. (rxhash can be the default in absence of flow
> >> classifier). Probably you need to store the token in skb->cb[] to avoid
> >> calling tc_classify() several times for a given packet.
> >>
> >> http://lwn.net/Articles/236200/
> >> http://kerneltrap.org/mailarchive/linux-netdev/2008/1/31/667679
> >
> > Probably should split SFQ flow hash stuff into core code for reuse.
> >
> >
>
> We need not do that, since we have sch_drr and cls_flow. :)
I prefer that the qdisc be usable without any explicit flow classification,
i.e. like SFQ it should fall back to a sensible flow matching. DRR and others
put everything in one flow if no filters are used.
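
To make the fallback idea concrete, here is a minimal user-space sketch of
the quoted CHOKe decision combined with a flow token that uses a classifier
when one is attached and otherwise falls back to a built-in hash. Everything
in it is an assumption for illustration: struct pkt, classify_fn,
default_flow_hash(), flow_token() and choke_decide() are hypothetical names,
the hash is a toy, and p is treated as RED's drop probability with the
uniform sample u supplied by the caller. It is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet summary; a real qdisc would inspect the skb instead. */
struct pkt {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Optional classifier hook: returns 0 and fills *token when a filter matched. */
typedef int (*classify_fn)(const struct pkt *p, uint32_t *token);

/* Built-in fallback: a toy header hash, used when no classifier is attached. */
static uint32_t default_flow_hash(const struct pkt *p)
{
    uint32_t h = p->saddr ^ p->daddr;

    h ^= ((uint32_t)p->sport << 16) | p->dport;
    h *= 2654435761u; /* odd multiplicative constant to spread the bits */
    return h;
}

/* Use the classifier result when available, otherwise the fallback hash. */
static uint32_t flow_token(const struct pkt *p, classify_fn classify)
{
    uint32_t token;

    if (classify && classify(p, &token) == 0)
        return token;
    return default_flow_hash(p);
}

enum choke_verdict { CHOKE_ENQUEUE, CHOKE_DROP_NEW, CHOKE_DROP_BOTH };

/*
 * Decide what to do with an arriving packet.
 *   victim : packet sampled at random from the queue (NULL if queue is empty)
 *   qave   : averaged queue length, computed as in RED
 *   p, u   : assumed RED drop probability and a uniform sample in [0, 1)
 */
static enum choke_verdict choke_decide(const struct pkt *arriving,
                                       const struct pkt *victim,
                                       classify_fn classify,
                                       double qave, double minth, double maxth,
                                       double p, double u)
{
    if (qave < minth)
        return CHOKE_ENQUEUE;

    if (victim &&
        flow_token(arriving, classify) == flow_token(victim, classify))
        return CHOKE_DROP_BOTH;

    if (qave > maxth)
        return CHOKE_DROP_NEW;

    /* Between the thresholds: behave like RED. */
    return (u < p) ? CHOKE_DROP_NEW : CHOKE_ENQUEUE;
}
```

Passing NULL as the classifier exercises the fallback path, which is the
"usable without any explicit flow classification" case.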
^ permalink raw reply
* Re: [PATCH v3 05/10] net/fec: add dual fec support for mx28
From: Shawn Guo @ 2011-01-07 7:00 UTC (permalink / raw)
To: Uwe Kleine-König
Cc: davem, gerg, baruch, eric, bryan.wu, r64343, B32542, lw, w.sang,
s.hauer, netdev, linux-arm-kernel
In-Reply-To: <20110106071047.GW25121@pengutronix.de>
Hi Uwe,
On Thu, Jan 06, 2011 at 08:10:47AM +0100, Uwe Kleine-König wrote:
> Hello Shawn,
>
[...]
> > > > + /*
> > > > + * enet-mac reset will reset mac address registers too,
> > > > + * so need to reconfigure it.
> > > > + */
> > > > + if (fec_is_enetmac) {
> > > > + memcpy(&temp_mac, dev->dev_addr, ETH_ALEN);
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > + writel(cpu_to_be32(temp_mac[0]), fep->hwp + FEC_ADDR_LOW);
> > > > + writel(cpu_to_be32(temp_mac[1]), fep->hwp + FEC_ADDR_HIGH);
> > > where is the value saved to temp_mac[]? For me it looks you write
> > > uninitialized data into the mac registers.
> >
> > memcpy above.
> oops. right. I looked for something like
>
> temp_mac[0] = dev->dev_addr[0] << $shiftfor0 | ...
>
> which AFAIK is what you want here. memcpy is sensitive to (at least)
> endian issues. If you ask me, factor out the setting of the mac address
> in probe to a function and use that here, too.
>
Copying the MAC address with memcpy is widely used in fec and other
network drivers, and I would prefer not to change something so common
in this patch set.
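
For reference, the shift-based alternative Uwe describes can be sketched as
below. This is only an illustration: mac_addr_low() and mac_addr_high() are
hypothetical helper names, and the byte layout (most significant byte first,
last two bytes in the upper half of the second word) is an assumption that
would need to be checked against the hardware manual. The point is that
explicit shifts produce the same register value on little-endian and
big-endian CPUs, without any byte swapping helper.

```c
#include <stdint.h>

/* Pack address bytes 0..3 into one word, most significant byte first. */
static uint32_t mac_addr_low(const uint8_t mac[6])
{
    return ((uint32_t)mac[0] << 24) | ((uint32_t)mac[1] << 16) |
           ((uint32_t)mac[2] << 8)  |  (uint32_t)mac[3];
}

/* Pack address bytes 4..5 into the upper half of a second word. */
static uint32_t mac_addr_high(const uint8_t mac[6])
{
    return ((uint32_t)mac[4] << 24) | ((uint32_t)mac[5] << 16);
}
```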
--
Regards,
Shawn
^ permalink raw reply
* Re: Bad TCP timestamps on non-PC platforms
From: Alex Dubov @ 2011-01-07 6:39 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <1294366281.2704.36.camel@edumazet-laptop>
>
>
> You don't give new information ;)
>
> I asked if you could give information on the other side :
> The bug is to
> drop this legal packet.
>
> uname -a
> sysctl -a | grep tcp
>
One of the other machines:
Linux sunapp 2.6.31.3 #2 SMP Mon Oct 12 21:32:20 EST 2009 x86_64 Dual-Core AMD Opteron(tm) Processor 8220 AuthenticAMD GNU/Linux
error: permission denied on key 'net.ipv4.route.flush'
fs.nfs.nlm_tcpport = 0
fs.nfs.nfs_callback_tcpport = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 2
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 3095040 4126720 6190080
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = reno
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_available_congestion_control = reno
net.ipv4.tcp_allowed_congestion_control = reno
net.ipv4.tcp_max_ssthresh = 0
error: permission denied on key 'net.ipv6.route.flush'
sunrpc.transports = tcp 1048576
sunrpc.tcp_slot_table_entries = 16
sunrpc.tcp_fin_timeout = 15
^ permalink raw reply
* Re: Use ioctl() to list all network interfaces.
From: Chin Shi Hong @ 2011-01-07 5:41 UTC (permalink / raw)
To: Alexander Clouter; +Cc: netdev
In-Reply-To: <4f9hv7-kev.ln1@chipmunk.wormnet.eu>
On Fri, Jan 7, 2011 at 1:07 AM, Alexander Clouter <alex@digriz.org.uk> wrote:
> Chin Shi Hong <cshong87@gmail.com> wrote:
>>
>> The following code is just part of my application's source code:
>>
>> --begin of code--
>> struct ifconf ifc;
>> ret = ioctl(n, SIOCGIFCONF, &ifc);
>> --end of code--
>>
>> My application's role is to display all network interfaces, including
>> the network interfaces which are down.
>>
>> However, with the above code, my application only displays the network
>> interfaces which are already up. The network interfaces which are down
>> are not displayed.
>>
>> So, what can I do to make my application display all network
>> interfaces, including the network interfaces which are down?
>>
> getifaddrs(), you get the IPv6 ones too this way.
>
> Cheers
>
> --
> Alexander Clouter
> .sigmonster says: I've only got 12 cards.
>
getifaddrs() is helpful too. Thank you!
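
A minimal sketch of the getifaddrs() approach, for anyone finding this thread
later. print_link_states() is an illustrative name, not an existing API. Note
that getifaddrs() returns one entry per address, so an interface name can
appear more than once, and which entries exist for an interface that is down
depends on the system.

```c
#define _DEFAULT_SOURCE
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>

/*
 * Print one line per getifaddrs() entry together with its up/down state.
 * Returns the number of entries printed, or -1 if getifaddrs() fails.
 */
static int print_link_states(FILE *out)
{
    struct ifaddrs *head = NULL;
    struct ifaddrs *ifa;
    int entries = 0;

    if (getifaddrs(&head) != 0)
        return -1;

    for (ifa = head; ifa != NULL; ifa = ifa->ifa_next) {
        fprintf(out, "%s: %s\n", ifa->ifa_name,
                (ifa->ifa_flags & IFF_UP) ? "up" : "down");
        entries++;
    }

    freeifaddrs(head);
    return entries;
}
```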
Regards,
^ permalink raw reply
* Re: Use ioctl() to list all network interfaces.
From: Chin Shi Hong @ 2011-01-07 5:40 UTC (permalink / raw)
To: Rémi Denis-Courmont; +Cc: netdev
In-Reply-To: <201101061811.06847.remi@remlab.net>
2011/1/7 Rémi Denis-Courmont <remi@remlab.net>:
> On Thursday, 6 January 2011 17:33:39, Chin Shi Hong wrote:
>> My application's role is to display all network interfaces, including
>> the network interfaces which are down.
>
> You should use if_nameindex() then.
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
> http://fi.linkedin.com/in/remidenis
>
Thank you. if_nameindex() really helped me.
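
A minimal sketch of the if_nameindex() approach. list_all_interfaces() is an
illustrative name. The function walks the array returned by if_nameindex(),
which is terminated by an entry with a zero index and a NULL name, and
prints every interface the system reports regardless of whether it is up.

```c
#include <net/if.h>
#include <stdio.h>

/*
 * Print the index and name of every interface reported by if_nameindex().
 * Returns the number of interfaces printed, or -1 if the call fails.
 */
static int list_all_interfaces(FILE *out)
{
    struct if_nameindex *list = if_nameindex();
    struct if_nameindex *it;
    int count = 0;

    if (list == NULL)
        return -1;

    for (it = list; it->if_index != 0 && it->if_name != NULL; it++) {
        fprintf(out, "%u: %s\n", it->if_index, it->if_name);
        count++;
    }

    if_freenameindex(list);
    return count;
}
```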
Regards,
^ permalink raw reply
* Re: [RFC] sched: CHOKe packet scheduler (v0.2)
From: Changli Gao @ 2011-01-07 5:39 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <20110106205549.0de56de1@nehalam>
On Fri, Jan 7, 2011 at 12:55 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Thu, 06 Jan 2011 05:07:30 +0100
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Wednesday, 05 January 2011 at 11:21 -0800, Stephen Hemminger wrote:
>> > This implements the CHOKe packet scheduler based on the existing
>> > Linux RED scheduler, following the algorithm described in the paper.
>> >
>> > The core idea is:
>> >   For every packet arrival:
>> >     Calculate Qave
>> >     if (Qave < minth)
>> >         Queue the new packet
>> >     else
>> >         Select randomly a packet from the queue
>> >         if (both packets from same flow)
>> >             then Drop both the packets
>> >         else if (Qave > maxth)
>> >             Drop packet
>> >         else
>> >             Admit packet with probability p (same as RED)
>> >
>> > See also:
>> > Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
>> > queue management scheme for approximating fair bandwidth allocation",
>> > Proceedings of INFOCOM 2000, March 2000.
>> >
>> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>> >
>>
>> To be really useful in a wide range of environments, I believe that :
>>
>> - CHOKe should be able to use an external flow classifier (like say...
>> SFQ) to compute a token and compare two skbs by this token instead of
>> custom rxhash or whatever. (rxhash can be the default in absence of flow
>> classifier). Probably you need to store the token in skb->cb[] to avoid
>> calling tc_classify() several times for a given packet.
>>
>> http://lwn.net/Articles/236200/
>> http://kerneltrap.org/mailarchive/linux-netdev/2008/1/31/667679
>
> Probably should split SFQ flow hash stuff into core code for reuse.
>
>
We need not do that, since we have sch_drr and cls_flow. :)
--
Regards,
Changli Gao(xiaosuo@gmail.com)
^ permalink raw reply
* Re: [RFC] sched: CHOKe packet scheduler (v0.2)
From: Stephen Hemminger @ 2011-01-07 4:55 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294286850.2723.65.camel@edumazet-laptop>
On Thu, 06 Jan 2011 05:07:30 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wednesday, 05 January 2011 at 11:21 -0800, Stephen Hemminger wrote:
> > This implements the CHOKe packet scheduler based on the existing
> > Linux RED scheduler, following the algorithm described in the paper.
> >
> > The core idea is:
> >   For every packet arrival:
> >     Calculate Qave
> >     if (Qave < minth)
> >         Queue the new packet
> >     else
> >         Select randomly a packet from the queue
> >         if (both packets from same flow)
> >             then Drop both the packets
> >         else if (Qave > maxth)
> >             Drop packet
> >         else
> >             Admit packet with probability p (same as RED)
> >
> > See also:
> > Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
> > queue management scheme for approximating fair bandwidth allocation",
> > Proceedings of INFOCOM 2000, March 2000.
> >
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> >
>
> To be really useful in a wide range of environments, I believe that :
>
> - CHOKe should be able to use an external flow classifier (like say...
> SFQ) to compute a token and compare two skbs by this token instead of
> custom rxhash or whatever. (rxhash can be the default in absence of flow
> classifier). Probably you need to store the token in skb->cb[] to avoid
> calling tc_classify() several times for a given packet.
>
> http://lwn.net/Articles/236200/
> http://kerneltrap.org/mailarchive/linux-netdev/2008/1/31/667679
Probably should split SFQ flow hash stuff into core code for reuse.
> - Must use a FIFO with O(1) access to Nth skb in queue.
>
> A linked list makes this implementation too expensive for big queues.
>
> For small queues (less than 128 skbs at this moment for SFQ), existing
> schedulers are good enough.
>
> CHOKe authors don't mention this in their paper, but their experiments
> were done in 1999 with 1 Mb/s links. minth=100 and maxth=200, limit=300
>
> We want to try CHOKe with modern links, probably with minth=2000 and
> maxth=4000 or more.
>
> They said "It is arguably more difficult to drop a randomly chosen packet
> since this means removing from a linked-list. Instead of doing this, we
> propose to add one extra bit to the packet header. The bit is set to one
> if the drop candidate is to be dropped. When a packet advances to the
> head of the FIFO buffer, the status of the bit determines whether it is
> to be immediately discarded or transmitted on the outgoing line"
>
> If they thought removing a buffer from a linked list was expensive
> (!!!), they certainly assumed the previous access to the randomly chosen
> buffer was faster than the skb unlink !
>
> Using a circular buffer should be enough, with a trick similar to the one
> suggested: when dropping an skb from the ring, store a NULL pointer and
> don't memmove() the window to shrink it.
>
> struct skb_ring {
>         unsigned int head;
>         unsigned int tail;
>         unsigned int size; /* a power of two */
>         struct sk_buff **table;
> };
>
> Doing so avoids the cache misses to adjacent skbs prev/next when
> queue/dequeue is done.
The problem is that large tables of pointers in the kernel require either
contiguous allocation or some indirect table algorithm.
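
For illustration, the ring with NULL holes that Eric describes can be
sketched in user space as follows. This is a sketch under assumptions, not
kernel code: struct ring and the ring_*() helpers are hypothetical names,
entries are plain void pointers rather than skbs, and the table comes from a
single calloc(), so it does nothing about the contiguous allocation concern
raised above.

```c
#include <stddef.h>
#include <stdlib.h>

struct ring {
    unsigned int head;  /* next slot to dequeue */
    unsigned int tail;  /* next slot to fill */
    unsigned int size;  /* a power of two */
    void **table;
};

static int ring_init(struct ring *r, unsigned int size)
{
    if (size == 0 || (size & (size - 1)) != 0)
        return -1;                       /* size must be a power of two */
    r->table = calloc(size, sizeof(*r->table));
    if (r->table == NULL)
        return -1;
    r->head = r->tail = 0;
    r->size = size;
    return 0;
}

static int ring_enqueue(struct ring *r, void *item)
{
    if (r->tail - r->head == r->size)
        return -1;                       /* full; holes still occupy slots */
    r->table[r->tail & (r->size - 1)] = item;
    r->tail++;
    return 0;
}

/* O(1) access to the Nth slot counted from head; NULL means hole or out of range. */
static void *ring_peek(const struct ring *r, unsigned int n)
{
    if (n >= r->tail - r->head)
        return NULL;
    return r->table[(r->head + n) & (r->size - 1)];
}

/* Drop the Nth entry by leaving a NULL hole; nothing is moved. */
static void *ring_drop(struct ring *r, unsigned int n)
{
    unsigned int idx;
    void *item;

    if (n >= r->tail - r->head)
        return NULL;
    idx = (r->head + n) & (r->size - 1);
    item = r->table[idx];
    r->table[idx] = NULL;
    return item;
}

/* Dequeue the oldest live entry, skipping over holes left by ring_drop(). */
static void *ring_dequeue(struct ring *r)
{
    while (r->head != r->tail) {
        void *item = r->table[r->head & (r->size - 1)];

        r->head++;
        if (item != NULL)
            return item;
    }
    return NULL;
}
```

One consequence of this design is that a hole keeps consuming a slot until
the head passes it, so the ring can report full while holding fewer live
entries than its size.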
--
^ permalink raw reply
* [PATCH 2/2] sky2: convert to new VLAN model
From: Stephen Hemminger @ 2011-01-07 4:41 UTC (permalink / raw)
To: David Miller; +Cc: netdev
This converts sky2 to the new VLAN offload flag control via ethtool.
It also allows transmit offload of VLAN-tagged frames, which
was not possible before.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/drivers/net/sky2.c 2011-01-06 17:44:00.643524039 -0800
+++ b/drivers/net/sky2.c 2011-01-06 19:05:06.100968234 -0800
@@ -46,10 +46,6 @@
#include <asm/irq.h>
-#if defined(CONFIG_VLAN_8021Q) || defined(CONFIG_VLAN_8021Q_MODULE)
-#define SKY2_VLAN_TAG_USED 1
-#endif
-
#include "sky2.h"
#define DRV_NAME "sky2"
@@ -1326,40 +1322,33 @@ static int sky2_ioctl(struct net_device
return err;
}
-#ifdef SKY2_VLAN_TAG_USED
-static void sky2_set_vlan_mode(struct sky2_hw *hw, u16 port, bool onoff)
+/* Features available on VLAN with transmit tag stripped */
+#define VLAN_FEAT (NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO)
+
+static void sky2_vlan_mode(struct net_device *dev)
{
- if (onoff) {
+ struct sky2_port *sky2 = netdev_priv(dev);
+ struct sky2_hw *hw = sky2->hw;
+ u16 port = sky2->port;
+
+ if (dev->features & NETIF_F_HW_VLAN_RX)
sky2_write32(hw, SK_REG(port, RX_GMF_CTRL_T),
RX_VLAN_STRIP_ON);
+ else
+ sky2_write32(hw, SK_REG(port, RX_GMF_CTRL_T),
+ RX_VLAN_STRIP_OFF);
+
+ if (dev->features & NETIF_F_HW_VLAN_TX) {
sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T),
TX_VLAN_TAG_ON);
+ dev->vlan_features = dev->features & VLAN_FEAT;
} else {
- sky2_write32(hw, SK_REG(port, RX_GMF_CTRL_T),
- RX_VLAN_STRIP_OFF);
sky2_write32(hw, SK_REG(port, TX_GMF_CTRL_T),
TX_VLAN_TAG_OFF);
+ dev->vlan_features = dev->features & NETIF_F_HIGHDMA;
}
}
-static void sky2_vlan_rx_register(struct net_device *dev, struct vlan_group *grp)
-{
- struct sky2_port *sky2 = netdev_priv(dev);
- struct sky2_hw *hw = sky2->hw;
- u16 port = sky2->port;
-
- netif_tx_lock_bh(dev);
- napi_disable(&hw->napi);
-
- sky2->vlgrp = grp;
- sky2_set_vlan_mode(hw, port, grp != NULL);
-
- sky2_read32(hw, B0_Y2_SP_LISR);
- napi_enable(&hw->napi);
- netif_tx_unlock_bh(dev);
-}
-#endif
-
/* Amount of required worst case padding in rx buffer */
static inline unsigned sky2_rx_pad(const struct sky2_hw *hw)
{
@@ -1635,9 +1624,7 @@ static void sky2_hw_up(struct sky2_port
sky2_prefetch_init(hw, txqaddr[port], sky2->tx_le_map,
sky2->tx_ring_size - 1);
-#ifdef SKY2_VLAN_TAG_USED
- sky2_set_vlan_mode(hw, port, sky2->vlgrp != NULL);
-#endif
+ sky2_vlan_mode(sky2->netdev);
sky2_rx_start(sky2);
}
@@ -1780,7 +1767,7 @@ static netdev_tx_t sky2_xmit_frame(struc
}
ctrl = 0;
-#ifdef SKY2_VLAN_TAG_USED
+
/* Add VLAN tag, can piggyback on LRGLEN or ADDR64 */
if (vlan_tx_tag_present(skb)) {
if (!le) {
@@ -1792,7 +1779,6 @@ static netdev_tx_t sky2_xmit_frame(struc
le->length = cpu_to_be16(vlan_tx_tag_get(skb));
ctrl |= INS_VLAN;
}
-#endif
/* Handle TCP checksum offload */
if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -2432,11 +2418,8 @@ static struct sk_buff *sky2_receive(stru
struct sk_buff *skb = NULL;
u16 count = (status & GMR_FS_LEN) >> 16;
-#ifdef SKY2_VLAN_TAG_USED
- /* Account for vlan tag */
- if (sky2->vlgrp && (status & GMR_FS_VLAN))
- count -= VLAN_HLEN;
-#endif
+ if (status & GMR_FS_VLAN)
+ count -= VLAN_HLEN; /* Account for vlan tag */
netif_printk(sky2, rx_status, KERN_DEBUG, dev,
"rx slot %u status 0x%x len %d\n",
@@ -2504,17 +2487,9 @@ static inline void sky2_tx_done(struct n
static inline void sky2_skb_rx(const struct sky2_port *sky2,
u32 status, struct sk_buff *skb)
{
-#ifdef SKY2_VLAN_TAG_USED
- u16 vlan_tag = be16_to_cpu(sky2->rx_tag);
- if (sky2->vlgrp && (status & GMR_FS_VLAN)) {
- if (skb->ip_summed == CHECKSUM_NONE)
- vlan_hwaccel_receive_skb(skb, sky2->vlgrp, vlan_tag);
- else
- vlan_gro_receive(&sky2->hw->napi, sky2->vlgrp,
- vlan_tag, skb);
- return;
- }
-#endif
+ if (status & GMR_FS_VLAN)
+ __vlan_hwaccel_put_tag(skb, be16_to_cpu(sky2->rx_tag));
+
if (skb->ip_summed == CHECKSUM_NONE)
netif_receive_skb(skb);
else
@@ -2631,7 +2606,6 @@ static int sky2_status_intr(struct sky2_
goto exit_loop;
break;
-#ifdef SKY2_VLAN_TAG_USED
case OP_RXVLAN:
sky2->rx_tag = length;
break;
@@ -2639,7 +2613,6 @@ static int sky2_status_intr(struct sky2_
case OP_RXCHKSVLAN:
sky2->rx_tag = length;
/* fall through */
-#endif
case OP_RXCHKS:
if (likely(sky2->flags & SKY2_FLAG_RX_CHECKSUM))
sky2_rx_checksum(sky2, status);
@@ -3042,6 +3015,10 @@ static int __devinit sky2_init(struct sk
| SKY2_HW_NEW_LE
| SKY2_HW_AUTO_TX_SUM
| SKY2_HW_ADV_POWER_CTL;
+
+ /* The workaround for FE+ status conflicts with VLAN tag detection. */
+ if (hw->chip_rev == CHIP_REV_YU_FE2_A0)
+ hw->flags |= SKY2_HW_VLAN_BROKEN;
break;
case CHIP_ID_YUKON_SUPR:
@@ -4237,15 +4214,28 @@ static int sky2_set_eeprom(struct net_de
static int sky2_set_flags(struct net_device *dev, u32 data)
{
struct sky2_port *sky2 = netdev_priv(dev);
- u32 supported =
- (sky2->hw->flags & SKY2_HW_RSS_BROKEN) ? 0 : ETH_FLAG_RXHASH;
+ unsigned long old_feat = dev->features;
+ u32 supported = 0;
int rc;
+ if (!(sky2->hw->flags & SKY2_HW_RSS_BROKEN))
+ supported |= ETH_FLAG_RXHASH;
+
+ if (!(sky2->hw->flags & SKY2_HW_VLAN_BROKEN))
+ supported |= ETH_FLAG_RXVLAN | ETH_FLAG_TXVLAN;
+
+ printk(KERN_DEBUG "sky2 set_flags: supported %x data %x\n",
+ supported, data);
+
rc = ethtool_op_set_flags(dev, data, supported);
if (rc)
return rc;
- rx_set_rss(dev);
+ if ((old_feat ^ dev->features) & NETIF_F_RXHASH)
+ rx_set_rss(dev);
+
+ if ((old_feat ^ dev->features) & (NETIF_F_HW_VLAN_RX|NETIF_F_HW_VLAN_TX))
+ sky2_vlan_mode(dev);
return 0;
}
@@ -4281,6 +4271,7 @@ static const struct ethtool_ops sky2_eth
.get_sset_count = sky2_get_sset_count,
.get_ethtool_stats = sky2_get_ethtool_stats,
.set_flags = sky2_set_flags,
+ .get_flags = ethtool_op_get_flags,
};
#ifdef CONFIG_SKY2_DEBUG
@@ -4562,9 +4553,6 @@ static const struct net_device_ops sky2_
.ndo_change_mtu = sky2_change_mtu,
.ndo_tx_timeout = sky2_tx_timeout,
.ndo_get_stats64 = sky2_get_stats,
-#ifdef SKY2_VLAN_TAG_USED
- .ndo_vlan_rx_register = sky2_vlan_rx_register,
-#endif
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = sky2_netpoll,
#endif
@@ -4580,9 +4568,6 @@ static const struct net_device_ops sky2_
.ndo_change_mtu = sky2_change_mtu,
.ndo_tx_timeout = sky2_tx_timeout,
.ndo_get_stats64 = sky2_get_stats,
-#ifdef SKY2_VLAN_TAG_USED
- .ndo_vlan_rx_register = sky2_vlan_rx_register,
-#endif
},
};
@@ -4633,7 +4618,8 @@ static __devinit struct net_device *sky2
sky2->port = port;
dev->features |= NETIF_F_IP_CSUM | NETIF_F_SG
- | NETIF_F_TSO | NETIF_F_GRO;
+ | NETIF_F_TSO | NETIF_F_GRO;
+
if (highmem)
dev->features |= NETIF_F_HIGHDMA;
@@ -4641,13 +4627,8 @@ static __devinit struct net_device *sky2
if (!(hw->flags & SKY2_HW_RSS_BROKEN))
dev->features |= NETIF_F_RXHASH;
-#ifdef SKY2_VLAN_TAG_USED
- /* The workaround for FE+ status conflicts with VLAN tag detection. */
- if (!(sky2->hw->chip_id == CHIP_ID_YUKON_FE_P &&
- sky2->hw->chip_rev == CHIP_REV_YU_FE2_A0)) {
+ if (!(hw->flags & SKY2_HW_VLAN_BROKEN))
dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
- }
-#endif
/* read the mac address */
memcpy_fromio(dev->dev_addr, hw->regs + B2_MAC_1 + port * 8, ETH_ALEN);
--- a/drivers/net/sky2.h 2011-01-06 17:44:01.939546184 -0800
+++ b/drivers/net/sky2.h 2011-01-06 17:59:12.430966390 -0800
@@ -2236,11 +2236,8 @@ struct sky2_port {
u16 rx_pending;
u16 rx_data_size;
u16 rx_nfrags;
-
-#ifdef SKY2_VLAN_TAG_USED
u16 rx_tag;
- struct vlan_group *vlgrp;
-#endif
+
struct {
unsigned long last;
u32 mac_rp;
@@ -2284,6 +2281,7 @@ struct sky2_hw {
#define SKY2_HW_AUTO_TX_SUM 0x00000040 /* new IP decode for Tx */
#define SKY2_HW_ADV_POWER_CTL 0x00000080 /* additional PHY power regs */
#define SKY2_HW_RSS_BROKEN 0x00000100
+#define SKY2_HW_VLAN_BROKEN 0x00000200
u8 chip_id;
u8 chip_rev;
^ permalink raw reply
* [PATCH 1/2] sky2: fix limited auto negotiation
From: Stephen Hemminger @ 2011-01-07 4:40 UTC (permalink / raw)
To: Mohsen Hariri, David Miller; +Cc: netdev
In-Reply-To: <AANLkTimGZbDNeRccp4SkgOsJsLXjyKNq+ma4UvW=-5GG@mail.gmail.com>
The sky2 driver would always try all possible supported speeds even
if the user only asked for a limited set of speed/duplex combinations.
Reported-by: Mohsen Hariri <m.hariri@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
--- a/drivers/net/sky2.c 2011-01-04 08:46:31.796080594 -0800
+++ b/drivers/net/sky2.c 2011-01-06 20:09:25.660684849 -0800
@@ -3411,18 +3411,15 @@ static u32 sky2_supported_modes(const st
u32 modes = SUPPORTED_10baseT_Half
| SUPPORTED_10baseT_Full
| SUPPORTED_100baseT_Half
- | SUPPORTED_100baseT_Full
- | SUPPORTED_Autoneg | SUPPORTED_TP;
+ | SUPPORTED_100baseT_Full;
if (hw->flags & SKY2_HW_GIGABIT)
modes |= SUPPORTED_1000baseT_Half
| SUPPORTED_1000baseT_Full;
return modes;
} else
- return SUPPORTED_1000baseT_Half
- | SUPPORTED_1000baseT_Full
- | SUPPORTED_Autoneg
- | SUPPORTED_FIBRE;
+ return SUPPORTED_1000baseT_Half
+ | SUPPORTED_1000baseT_Full;
}
static int sky2_get_settings(struct net_device *dev, struct ethtool_cmd *ecmd)
@@ -3436,9 +3433,11 @@ static int sky2_get_settings(struct net_
if (sky2_is_copper(hw)) {
ecmd->port = PORT_TP;
ecmd->speed = sky2->speed;
+ ecmd->supported |= SUPPORTED_Autoneg | SUPPORTED_TP;
} else {
ecmd->speed = SPEED_1000;
ecmd->port = PORT_FIBRE;
+ ecmd->supported |= SUPPORTED_Autoneg | SUPPORTED_FIBRE;
}
ecmd->advertising = sky2->advertising;
@@ -3455,8 +3454,19 @@ static int sky2_set_settings(struct net_
u32 supported = sky2_supported_modes(hw);
if (ecmd->autoneg == AUTONEG_ENABLE) {
+ if (ecmd->advertising & ~supported)
+ return -EINVAL;
+
+ if (sky2_is_copper(hw))
+ sky2->advertising = ecmd->advertising |
+ ADVERTISED_TP |
+ ADVERTISED_Autoneg;
+ else
+ sky2->advertising = ecmd->advertising |
+ ADVERTISED_FIBRE |
+ ADVERTISED_Autoneg;
+
sky2->flags |= SKY2_FLAG_AUTO_SPEED;
- ecmd->advertising = supported;
sky2->duplex = -1;
sky2->speed = -1;
} else {
@@ -3500,8 +3510,6 @@ static int sky2_set_settings(struct net_
sky2->flags &= ~SKY2_FLAG_AUTO_SPEED;
}
- sky2->advertising = ecmd->advertising;
-
if (netif_running(dev)) {
sky2_phy_reinit(sky2);
sky2_set_multicast(dev);
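
As a standalone illustration of the check the patch adds for the autoneg
case: reject a request that advertises any mode outside the supported set,
then add the port type and autoneg bits. This is a sketch under assumptions,
not driver code. The MODE_* bits and set_advertising() are hypothetical
stand-ins for the ethtool SUPPORTED_/ADVERTISED_ masks and the logic in
sky2_set_settings(), and only the copper case is shown.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mode bits, standing in for the ethtool masks. */
#define MODE_10_HALF   (1u << 0)
#define MODE_10_FULL   (1u << 1)
#define MODE_100_HALF  (1u << 2)
#define MODE_100_FULL  (1u << 3)
#define MODE_AUTONEG   (1u << 6)
#define MODE_TP        (1u << 7)

/*
 * Validate the requested speed/duplex modes against the supported set.
 * On success, store the requested modes plus the port and autoneg bits in
 * *advertising and return 0. On failure, leave *advertising untouched and
 * return -EINVAL.
 */
static int set_advertising(uint32_t requested, uint32_t supported,
                           uint32_t *advertising)
{
    if (requested & ~supported)
        return -EINVAL;

    *advertising = requested | MODE_TP | MODE_AUTONEG;
    return 0;
}
```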
^ permalink raw reply
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Jesse Gross @ 2011-01-07 4:38 UTC (permalink / raw)
To: Matt Carlson
Cc: Eric Dumazet, Michael Leun, Michael Chan, David Miller,
Ben Greear, linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107034130.GA18028@mcarlson.broadcom.com>
On Thu, Jan 6, 2011 at 10:41 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> On Thu, Jan 06, 2011 at 07:04:46PM -0800, Eric Dumazet wrote:
>> On Thursday, 06 January 2011 at 18:59 -0800, Matt Carlson wrote:
>> > On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
>> > > On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
>> > > > On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
>> > > >
>> > > > > Hi Eric. Sorry for the delay. I was under the impression that your
>> > > > > problems were software related and that you just needed a revised
>> > > > > version of these VLAN patches I was sending to Michael. Is this not
>> > > > > true?
>> > > > >
>> > > > > Having a hardware stat increment suggests this is a new problem.
>> > > > > Maybe I missed it, but I didn't see what hardware you are working
>> > > > > with and whether or not management firmware was enabled. Could you tell
>> > > > > me that info?
>> > > > >
>> > > >
>> > > > Hi Matt
>> > > >
>> > > > I started a bisection, because I couldnt sleep tonight anyway :(
>> > > >
>> > > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
>> > > > Gigabit Ethernet (rev a3)
>> > > > Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
>> > > > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
>> > > > Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
>> > > > Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
>> > > > [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
>> > > > Capabilities: [40] PCI-X non-bridge device
>> > > > Capabilities: [48] Power Management version 2
>> > > > Capabilities: [50] Vital Product Data
>> > > > Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
>> > > > Kernel driver in use: tg3
>> > > > Kernel modules: tg3
>> > > >
>> > > >
>> > >
>> > > $ ethtool -i eth2
>> > > driver: tg3
>> > > version: 3.115
>> > > firmware-version: 5715s-v3.28
>> > > bus-info: 0000:14:04.0
>> > > $ dmesg | grep ASF
>> > > [ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
>> > > ASF[0] TSOcap[1]
>> > > [ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
>> > > ASF[0] TSOcap[1]
>> >
>> > Thanks. So management firmware is disabled. This should be a
>> > straightforward case.
>> >
>> > I'm wondering if I'm misunderstanding something though. You said earlier
>> > that VLAN tagging doesn't work unless you applied my patch. Is this no
>> > longer true?
>> >
>>
>> I dont apply your patch because Jesse said it was not a good patch ;)
>
> Oh.
>
>> Maybe I missed something and it must be applied ? Problem is : current
>> Linus tree now includes net-next-2.6 and vlan doesnt work. You should
>> resubmit it perhaps ?
>
> Yes, something needs to be submitted. I want to make sure we aren't
> chasing the same problem though. If the patch(es) fix your problem,
> then I can concentrate on finalizing the patch.
>
> I can combine my last patch (the one that always enabled VLAN tag
> stripping) and the previous patch (that implements all your comments so
> far) into one patch, but that still leaves the behavior Michael noted
> unaddressed.
Just to clarify, I think there are three separate things going on here:
* The patch, which is independent of the separately reported issues, is
good because it moves tg3 to the new vlan model. However, I don't
think we should always disable vlan stripping as is done because it is
probably useful in the majority of cases. Maybe in some situations it
needs to be disabled but those are independent and should affect both
the patched and unpatched versions.
* Eric's issue. It sounds like the commit that bisect turned up has
some interaction with stripping. The patch fixes this because it
always disables stripping but that doesn't seem like the right
solution because previous versions worked with stripping enabled.
* Michael's issue. Not clear what the cause is but disabling
stripping fixes it. It has different symptoms from Eric's though
(missing tags vs missing packets). The patch changes behavior a
little bit because it changes when stripping is enabled but doesn't
fix the underlying cause.
So each needs to be tracked down separately. Unfortunately, the fixed
patch will no longer solve Eric's issue...
^ permalink raw reply
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Jesse Gross @ 2011-01-07 4:36 UTC (permalink / raw)
To: Matt Carlson
Cc: Michael Leun, Michael Chan, Eric Dumazet, David Miller,
Ben Greear, linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107032459.GA17959@mcarlson.broadcom.com>
On Thu, Jan 6, 2011 at 10:24 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> On Sat, Dec 18, 2010 at 07:38:00PM -0800, Jesse Gross wrote:
>> On Tue, Dec 14, 2010 at 11:16 PM, Michael Leun
>> <lkml20101129@newton.leun.net> wrote:
>> > OK - all tests done on that DL320G5:
>> >
>> > For completeness, 2.6.37-rc5 unpatched:
>> >
>> > eth0, no vlan configured: totally broken - see double tagged vlans
>> > without tag, single or untagged packets missing at all
>>
>> Random behavior? This one is somewhat hard to explain - maybe there
>> are some other factors. eth0 has ASF on, so it always strips tags. I
>> would expect it to behave like the vlan configured case.
>>
>> >
>> > eth0, vlan configured: see packets without vlan tag (see double tagged
>> > packets with one vlan tag)
>>
>> Both ASF and vlan group configured cause tag stripping to be enabled.
>> Missing tag.
>>
>> >
>> > eth1 same as originally reported:
>> > without vlan configured see vlan tags (single and double tagged as
>> > expected)
>>
>> No ASF and no vlan group means tag stripping is disabled. Have tag.
>>
>> > with vlan configured: see packets without vlan tag (see double tagged
>> > packets with one vlan tag)
>>
>> Configuring vlan group causes stripping to be enabled. Missing tag.
>>
>> >
>> >
>> > 2.6.37-rc5, your tg3 use new vlan-code patch:
>> >
>> > eth0, no vlan configured: see packets without vlan tag (see double
>> > tagged packets with one vlan tag)
>>
>> ASF enables tag stripping. Missing tag.
>>
>> > eth1, no vlan configured: see vlan tags (single and double tagged as
>> > expected)
>>
>> No ASF, no vlan group means no stripping. Have tag.
>>
>> >
>> >
>> > eth0, vlan configured: as without vlan
>>
>> ASF enables stripping. Missing tag.
>>
>> > eth1, vlan configured: as without vlan
>>
>> With this patch vlan stripping is only enabled when ASF is on, so no
>> stripping. Have tag.
>>
>> >
>> > 2.6.37-rc5, your tg3 use new vlan-code patch with test patch ontop
>> >
>> > eth1 no vlan configured: see packets without vlan tag (see double tagged
>> > packets with one vlan tag)
>>
>> With the second patch, vlan stripping is always enabled. Missing tag.
>>
>> > eth1 with vlan: the same
>>
>> Stripping still always enabled. Missing tag.
>>
>> The bottom line is whenever vlan stripping is enabled we're missing
>> the outer tag. It might be worth adding some debugging in the area
>> before napi_gro_receive/vlan_gro_receive (depending on version). My
>> guess is that (desc->type_flags & RXD_FLAG_VLAN) is false even for
>> vlan packets on this NIC.
>>
>> You said that everything works on the 5752? Matt, is it possible that
>> the 5714 either has a problem with vlan stripping or a different way
>> of reporting it?
>
> I don't think this is a 5714 specific issue. I think the problem is
> rooted in the fact that the VLAN tag stripping is enabled.
It's definitely related to vlan stripping being enabled. Other cards
using tg3 seem to work fine with stripping though, which is why I
thought it might be specific to the 5714.
>
> Your RXD_FLAG_VLAN idea sounds unlikely to me, but it's worth a check.
>
> The patch here is using __vlan_hwaccel_put_tag(), which informs the
> stack a VLAN tag is present. If this is indeed a reporting problem, I'm
> not sure what else the driver should be doing.
The code to hand off the tag to the stack looks OK to me. Michael was
seeing this on older versions of the kernel as well with this NIC,
which predates both this patch and the larger vlan changes so it
doesn't seem like a problem with passing the tag to the network stack.
It's hard to know exactly what is going on though without seeing what
the hardware is reporting.
* Re: [net-next 12/12] ixgbe: update ntuple filter configuration
From: Alexander Duyck @ 2011-01-07 4:12 UTC (permalink / raw)
To: Ben Hutchings
Cc: jeffrey.t.kirsher, davem, Alexander Duyck, netdev, gosp, bphilips
In-Reply-To: <1294362158.11825.78.camel@bwh-desktop>
On Thu, Jan 6, 2011 at 5:02 PM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> On Thu, 2011-01-06 at 16:29 -0800, jeffrey.t.kirsher@intel.com wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>>
>> This change fixes several issues found in ntuple filtering while I was
>> doing the ATR refactor.
>>
>> Specifically I updated the masks to work correctly with the latest version
>> of ethtool,
> [...]
>
> Did the previous code not correctly handle a zero value with a non-zero
> mask for some fields? If so, I can revert that change to ethtool.
>
> Ben.
Actually I think the ethtool mention doesn't really apply to the
in-kernel driver. I think I just carried that over from the check-in
for our out of tree driver which hadn't been updated when you added
the code to clean up the rx ntuple filters. Also, for the driver I'm
not too worried about what the status of it was before since there are
blatant errors in bit/byte ordering and mask bit values in the code
that from what I can tell would have significantly diminished the
usability of the filters.
Thanks,
Alex
* [RFC] sched: QFQ - quick fair queue scheduler
From: Stephen Hemminger @ 2011-01-07 3:56 UTC (permalink / raw)
To: David Miller, Eric Dumazet, Fabio Checconi; +Cc: netdev, Luigi Rizzo
This is an implementation of the Quick Fair Queue (QFQ) scheduler developed
by Fabio Checconi and Luigi Rizzo. The same algorithm is already implemented in ipfw
in FreeBSD. Fabio had an earlier version developed on Linux; I just
did some cleanup and backported the FreeBSD version.
For more information, see the web page: http://info.iet.unipi.it/~luigi/qfq/
and the Google tech talk: http://www.youtube.com/watch?v=r8vBmybeKlE
This is for inspection at this point; it is barely tested.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
---
Patch against net-next-2.6.
Configuration may get patch fuzz because of testing CHOKe in
same tree.
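For readers following the timestamp arithmetic in sch_qfq.c, here is a minimal userspace sketch of the fixed-point scheme described in the file's header comment. It is not part of the patch: inv_weight(), finish_time() and advance_v() are hypothetical names, and the constants only mirror FRAC_BITS, QFQ_MAX_WSUM and IWSUM from the patch.

```c
#include <stdint.h>

#define FRAC_BITS 30
#define ONE_FP    (UINT32_C(1) << FRAC_BITS)  /* 1.0 in fixed point */
#define MAX_WSUM  (2u * (1u << 16))           /* mirrors QFQ_MAX_WSUM */
#define IWSUM     (ONE_FP / MAX_WSUM)         /* mirrors IWSUM */

/* Inverse weight in fixed point: ONE_FP / weight (weight must be >= 1). */
static uint32_t inv_weight(uint32_t weight)
{
	return ONE_FP / weight;
}

/* Finish time of a packet of len bytes whose service starts at S. */
static uint64_t finish_time(uint64_t S, unsigned int len, uint32_t inv_w)
{
	return S + (uint64_t)len * inv_w;
}

/* Advance the system virtual time V after len bytes have been sent. */
static uint64_t advance_v(uint64_t V, unsigned int len)
{
	return V + (uint64_t)len * IWSUM;
}
```

Storing the inverse weight turns the per-packet division len/w_i into a multiplication, which is why the patch keeps inv_w in each class rather than the weight itself.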
include/linux/pkt_sched.h | 14
net/sched/Kconfig | 11
net/sched/Makefile | 1
net/sched/sch_qfq.c | 1012 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1038 insertions(+)
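The class-to-group mapping done by qfq_calc_index() can be tried outside the kernel. The sketch below is an illustrative userspace rewrite, not the patch code: highest_bit() stands in for the kernel's __fls(), calc_index() and inv_weight() are hypothetical names, and the constants copy the QFQ_* values from the patch.

```c
#include <stdint.h>

#define FRAC_BITS      30
#define ONE_FP         (UINT32_C(1) << FRAC_BITS)
#define MTU_SHIFT      11
#define MAX_INDEX      19
#define MIN_SLOT_SHIFT (FRAC_BITS + MTU_SHIFT - MAX_INDEX)

/* Zero-based position of the highest set bit; x must be nonzero. */
static int highest_bit(uint64_t x)
{
	int n = 0;

	while (x >>= 1)
		n++;
	return n;
}

/* Inverse weight in fixed point: ONE_FP / weight (weight must be >= 1). */
static uint32_t inv_weight(uint32_t weight)
{
	return ONE_FP / weight;
}

/* Group index: roughly log2(maxlen / weight), scaled by MIN_SLOT_SHIFT. */
static int calc_index(uint32_t inv_w, unsigned int maxlen)
{
	uint64_t slot_size = (uint64_t)maxlen * inv_w;
	uint64_t size_map = slot_size >> MIN_SLOT_SHIFT;
	int index;

	if (!size_map)
		return 0;

	index = highest_bit(size_map) + 1;
	/* An exact power of two still fits in the next lower group. */
	if (slot_size == (UINT64_C(1) << (index + MIN_SLOT_SHIFT - 1)))
		index--;
	return index;
}
```

With these constants, a weight-1 class with a 2048-byte lmax lands in group 19 (the value of QFQ_MAX_INDEX), while a weight-65536 class with the same lmax lands in group 3.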
--- a/include/linux/pkt_sched.h 2011-01-05 09:01:33.268032043 -0800
+++ b/include/linux/pkt_sched.h 2011-01-05 23:17:20.637390255 -0800
@@ -481,4 +481,18 @@ struct tc_drr_stats {
__u32 deficit;
};
+/* QFQ */
+enum {
+ TCA_QFQ_WEIGHT,
+ TCA_QFQ_LMAX,
+ __TCA_QFQ_MAX
+};
+
+#define TCA_QFQ_MAX (__TCA_QFQ_MAX - 1)
+
+struct tc_qfq_stats {
+ __u32 weight;
+ __u32 lmax;
+};
+
#endif
--- a/net/sched/Kconfig 2011-01-05 09:01:33.280032462 -0800
+++ b/net/sched/Kconfig 2011-01-05 23:17:20.637390255 -0800
@@ -216,6 +216,17 @@ config NET_SCH_CHOKE
To compile this code as a module, choose M here: the
module will be called sch_choke.
+config NET_SCH_QFQ
+ tristate "Quick Fair Queueing Scheduler (QFQ)"
+ help
+ Say Y here if you want to use the Quick Fair Queueing Scheduler (QFQ)
+ packet scheduling algorithm.
+
+ To compile this driver as a module, choose M here: the module
+ will be called sch_qfq.
+
+ If unsure, say N.
+
config NET_SCH_INGRESS
tristate "Ingress Qdisc"
depends on NET_CLS_ACT
--- a/net/sched/Makefile 2011-01-05 09:01:33.284032598 -0800
+++ b/net/sched/Makefile 2011-01-05 23:17:20.645389829 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ) += sch_mult
obj-$(CONFIG_NET_SCH_ATM) += sch_atm.o
obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o
obj-$(CONFIG_NET_SCH_DRR) += sch_drr.o
+obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_qfq.c 2011-01-06 12:51:28.498280327 -0800
@@ -0,0 +1,1125 @@
+/*
+ * net/sched/sch_qfq.c Quick Fair Queueing Scheduler.
+ *
+ * Copyright (c) 2009 Fabio Checconi, Luigi Rizzo, and Paolo Valente.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/netdevice.h>
+#include <linux/pkt_sched.h>
+#include <net/sch_generic.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+
+/* Quick Fair Queueing
+ ===================
+
+ Sources:
+ Fabio Checconi, Paolo Valente, and Luigi Rizzo, "QFQ: Efficient
+ Packet Scheduling with Tight Bandwidth Distribution Guarantees",
+ SIGCOMM 2010
+
+ See also:
+ http://retis.sssup.it/~fabio/linux/qfq/
+ */
+
+/*
+
+ Virtual time computations.
+
+ S, F and V are all computed in fixed point arithmetic with
+ FRAC_BITS fractional bits.
+
+ QFQ_MAX_INDEX is the maximum index allowed for a group. We need
+ one bit per index.
+ QFQ_MAX_WSHIFT is the maximum power of two supported as a weight.
+
+ The layout of the bits is as below:
+
+ [ MTU_SHIFT ][ FRAC_BITS ]
+ [ MAX_INDEX ][ MIN_SLOT_SHIFT ]
+ ^.__grp->index = 0
+ *.__grp->slot_shift
+
+ where MIN_SLOT_SHIFT is derived by difference from the others.
+
+ The max group index corresponds to Lmax/w_min, where
+ Lmax=1<<MTU_SHIFT, w_min = 1 .
+ From this, and knowing how many groups (MAX_INDEX) we want,
+ we can derive the shift corresponding to each group.
+
+ Because we often need to compute
+ F = S + len/w_i and V = V + len/wsum
+ instead of storing w_i store the value
+ inv_w = (1<<FRAC_BITS)/w_i
+ so we can do F = S + len * inv_w * wsum.
+ We use W_TOT in the formulas so we can easily move between
+ static and adaptive weight sum.
+
+ The per-scheduler-instance data contain all the data structures
+ for the scheduler: bitmaps and bucket lists.
+
+ */
+
+/*
+ * Maximum number of consecutive slots occupied by backlogged classes
+ * inside a group.
+ */
+#define QFQ_MAX_SLOTS 32
+
+/*
+ * Shifts used for class<->group mapping. We allow class weights that are
+ * in the range [1, 2^MAX_WSHIFT], and we try to map each class i to the
+ * group with the smallest index that can support the L_i / r_i configured
+ * for the class.
+ *
+ * grp->index is the index of the group; and grp->slot_shift
+ * is the shift for the corresponding (scaled) sigma_i.
+ */
+#define QFQ_MAX_INDEX 19
+#define QFQ_MAX_WSHIFT 16
+
+#define QFQ_MAX_WEIGHT (1<<QFQ_MAX_WSHIFT)
+#define QFQ_MAX_WSUM (2*QFQ_MAX_WEIGHT)
+
+#define FRAC_BITS 30 /* fixed point arithmetic */
+#define ONE_FP (1UL << FRAC_BITS)
+#define IWSUM (ONE_FP/QFQ_MAX_WSUM)
+
+#define QFQ_MTU_SHIFT 11
+#define QFQ_MIN_SLOT_SHIFT (FRAC_BITS + QFQ_MTU_SHIFT - QFQ_MAX_INDEX)
+
+/*
+ * Possible group states. These values are used as indexes for the bitmaps
+ * array of struct qfq_queue.
+ */
+enum qfq_state { ER, IR, EB, IB, QFQ_MAX_STATE };
+
+struct qfq_group;
+
+struct qfq_class {
+ struct Qdisc_class_common common;
+
+ unsigned int refcnt;
+ unsigned int filter_cnt;
+
+ struct gnet_stats_basic_packed bstats;
+ struct gnet_stats_queue qstats;
+ struct gnet_stats_rate_est rate_est;
+ struct Qdisc *qdisc;
+
+ struct qfq_class *next; /* Link for the slot list. */
+ u64 S, F; /* flow timestamps (exact) */
+
+ /* group we belong to. In principle we would need the index,
+ * which is log_2(lmax/weight), but we never reference it
+ * directly, only the group.
+ */
+ struct qfq_group *grp;
+
+ /* these are copied from the flowset. */
+ u32 inv_w; /* ONE_FP/weight */
+ u32 lmax; /* Max packet size for this flow. */
+};
+
+struct qfq_group {
+ uint64_t S, F; /* group timestamps (approx). */
+ unsigned int slot_shift; /* Slot shift. */
+ unsigned int index; /* Group index. */
+ unsigned int front; /* Index of the front slot. */
+ unsigned long full_slots; /* non-empty slots */
+
+ /* Array of RR lists of active classes. */
+ struct qfq_class *slots[QFQ_MAX_SLOTS];
+};
+
+struct qfq_sched {
+ struct tcf_proto *filter_list;
+ struct Qdisc_class_hash clhash;
+
+ uint64_t V; /* Precise virtual time. */
+ u32 wsum; /* weight sum */
+
+ unsigned long bitmaps[QFQ_MAX_STATE]; /* Group bitmaps. */
+ struct qfq_group groups[QFQ_MAX_INDEX + 1]; /* The groups. */
+};
+
+static struct qfq_class *qfq_find_class(struct Qdisc *sch, u32 classid)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct Qdisc_class_common *clc;
+
+ clc = qdisc_class_find(&q->clhash, classid);
+ if (clc == NULL)
+ return NULL;
+ return container_of(clc, struct qfq_class, common);
+}
+
+static void qfq_purge_queue(struct qfq_class *cl)
+{
+ unsigned int len = cl->qdisc->q.qlen;
+
+ qdisc_reset(cl->qdisc);
+ qdisc_tree_decrease_qlen(cl->qdisc, len);
+}
+
+static const struct nla_policy qfq_policy[TCA_QFQ_MAX + 1] = {
+ [TCA_QFQ_WEIGHT] = { .type = NLA_U32 },
+ [TCA_QFQ_LMAX] = { .type = NLA_U32 },
+};
+
+/*
+ * Calculate a flow index, given its weight and maximum packet length.
+ * index = log_2(maxlen/weight) but we need to apply the scaling.
+ * This is used only once at flow creation.
+ */
+static int qfq_calc_index(u32 inv_w, unsigned int maxlen)
+{
+ u64 slot_size = (u64)maxlen *inv_w;
+ unsigned long size_map;
+ int index = 0;
+
+ size_map = slot_size >> QFQ_MIN_SLOT_SHIFT;
+ if (!size_map)
+ goto out;
+
+ index = __fls(size_map) + 1; /* basically a log_2 */
+ index -= !(slot_size - (1ULL << (index + QFQ_MIN_SLOT_SHIFT - 1)));
+
+ if (index < 0)
+ index = 0;
+out:
+ pr_debug("qfq calc_index: W = %lu, L = %u, I = %d\n",
+ (unsigned long) ONE_FP/inv_w, maxlen, index);
+
+ return index;
+}
+
+static int qfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+ struct nlattr **tca, unsigned long *arg)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl = (struct qfq_class *)*arg;
+ struct nlattr *tb[TCA_QFQ_MAX + 1];
+ u32 weight, lmax, inv_w;
+ int i, err;
+
+ if (tca[TCA_OPTIONS] == NULL)
+ return -EINVAL;
+
+ err = nla_parse_nested(tb, TCA_QFQ_MAX, tca[TCA_OPTIONS], qfq_policy);
+ if (err < 0)
+ return err;
+
+ if (tb[TCA_QFQ_WEIGHT]) {
+ weight = nla_get_u32(tb[TCA_QFQ_WEIGHT]);
+ if (!weight || weight > (1UL << QFQ_MAX_WSHIFT)) {
+ pr_notice("qfq: invalid weight %u\n", weight);
+ return -EINVAL;
+ }
+ } else
+ weight = 1;
+
+ inv_w = ONE_FP / weight;
+ weight = ONE_FP / inv_w;
+ if (q->wsum + weight > QFQ_MAX_WSUM) {
+ pr_notice("qfq: total weight out of range (%u + %u)\n",
+ weight, q->wsum);
+ return -EINVAL;
+ }
+
+ if (tb[TCA_QFQ_LMAX]) {
+ lmax = nla_get_u32(tb[TCA_QFQ_LMAX]);
+ if (!lmax || lmax > (1UL << QFQ_MTU_SHIFT)) {
+ pr_notice("qfq: invalid max length %u\n", lmax);
+ return -EINVAL;
+ }
+ } else
+ lmax = 1UL << QFQ_MTU_SHIFT;
+
+ if (cl != NULL) {
+ if (tca[TCA_RATE]) {
+ err = gen_replace_estimator(&cl->bstats, &cl->rate_est,
+ qdisc_root_sleeping_lock(sch),
+ tca[TCA_RATE]);
+ if (err)
+ return err;
+ }
+
+ sch_tree_lock(sch);
+ if (tb[TCA_QFQ_WEIGHT]) {
+ q->wsum += weight - ONE_FP / cl->inv_w;
+ cl->inv_w = inv_w;
+ }
+ sch_tree_unlock(sch);
+
+ return 0;
+ }
+
+ cl = kzalloc(sizeof(struct qfq_class), GFP_KERNEL);
+ if (cl == NULL)
+ return -ENOBUFS;
+
+ cl->refcnt = 1;
+ cl->common.classid = classid;
+ cl->lmax = lmax;
+ cl->inv_w = inv_w;
+ i = qfq_calc_index(cl->inv_w, cl->lmax);
+
+ cl->grp = &q->groups[i];
+ q->wsum += weight;
+
+ cl->qdisc = qdisc_create_dflt(sch->dev_queue,
+ &pfifo_qdisc_ops, classid);
+ if (cl->qdisc == NULL)
+ cl->qdisc = &noop_qdisc;
+
+ if (tca[TCA_RATE]) {
+ err = gen_new_estimator(&cl->bstats, &cl->rate_est,
+ qdisc_root_sleeping_lock(sch),
+ tca[TCA_RATE]);
+ if (err) {
+ qdisc_destroy(cl->qdisc);
+ kfree(cl);
+ return err;
+ }
+ }
+
+ sch_tree_lock(sch);
+ qdisc_class_hash_insert(&q->clhash, &cl->common);
+ sch_tree_unlock(sch);
+
+ qdisc_class_hash_grow(sch, &q->clhash);
+
+ *arg = (unsigned long)cl;
+ return 0;
+}
+
+static void qfq_destroy_class(struct Qdisc *sch, struct qfq_class *cl)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+
+ if (cl->inv_w) {
+ q->wsum -= ONE_FP / cl->inv_w;
+ cl->inv_w = 0;
+ }
+
+ gen_kill_estimator(&cl->bstats, &cl->rate_est);
+ qdisc_destroy(cl->qdisc);
+ kfree(cl);
+}
+
+static int qfq_delete_class(struct Qdisc *sch, unsigned long arg)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ if (cl->filter_cnt > 0)
+ return -EBUSY;
+
+ sch_tree_lock(sch);
+
+ qfq_purge_queue(cl);
+ qdisc_class_hash_remove(&q->clhash, &cl->common);
+
+ if (--cl->refcnt == 0)
+ qfq_destroy_class(sch, cl);
+
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+static unsigned long qfq_get_class(struct Qdisc *sch, u32 classid)
+{
+ struct qfq_class *cl = qfq_find_class(sch, classid);
+
+ if (cl != NULL)
+ cl->refcnt++;
+
+ return (unsigned long)cl;
+}
+
+static void qfq_put_class(struct Qdisc *sch, unsigned long arg)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ if (--cl->refcnt == 0)
+ qfq_destroy_class(sch, cl);
+}
+
+static struct tcf_proto **qfq_tcf_chain(struct Qdisc *sch, unsigned long cl)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+
+ if (cl)
+ return NULL;
+
+ return &q->filter_list;
+}
+
+static unsigned long qfq_bind_tcf(struct Qdisc *sch, unsigned long parent,
+ u32 classid)
+{
+ struct qfq_class *cl = qfq_find_class(sch, classid);
+
+ if (cl != NULL)
+ cl->filter_cnt++;
+
+ return (unsigned long)cl;
+}
+
+static void qfq_unbind_tcf(struct Qdisc *sch, unsigned long arg)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ cl->filter_cnt--;
+}
+
+static int qfq_graft_class(struct Qdisc *sch, unsigned long arg,
+ struct Qdisc *new, struct Qdisc **old)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ if (new == NULL) {
+ new = qdisc_create_dflt(sch->dev_queue,
+ &pfifo_qdisc_ops, cl->common.classid);
+ if (new == NULL)
+ new = &noop_qdisc;
+ }
+
+ sch_tree_lock(sch);
+ qfq_purge_queue(cl);
+ *old = cl->qdisc;
+ cl->qdisc = new;
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+static struct Qdisc *qfq_class_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ return cl->qdisc;
+}
+
+static int qfq_dump_class(struct Qdisc *sch, unsigned long arg,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+ struct nlattr *nest;
+
+ tcm->tcm_parent = TC_H_ROOT;
+ tcm->tcm_handle = cl->common.classid;
+ tcm->tcm_info = cl->qdisc->handle;
+
+ nest = nla_nest_start(skb, TCA_OPTIONS);
+ if (nest == NULL)
+ goto nla_put_failure;
+ NLA_PUT_U32(skb, TCA_QFQ_WEIGHT, ONE_FP/cl->inv_w);
+ NLA_PUT_U32(skb, TCA_QFQ_LMAX, cl->lmax);
+ return nla_nest_end(skb, nest);
+
+nla_put_failure:
+ nla_nest_cancel(skb, nest);
+ return -EMSGSIZE;
+}
+
+static int qfq_dump_class_stats(struct Qdisc *sch, unsigned long arg,
+ struct gnet_dump *d)
+{
+ struct qfq_class *cl = (struct qfq_class *)arg;
+ struct tc_qfq_stats xstats;
+
+ memset(&xstats, 0, sizeof(xstats));
+
+ xstats.weight = ONE_FP/cl->inv_w;
+ xstats.lmax = cl->lmax;
+
+ if (gnet_stats_copy_basic(d, &cl->bstats) < 0 ||
+ gnet_stats_copy_rate_est(d, NULL, &cl->rate_est) < 0 ||
+ gnet_stats_copy_queue(d, &cl->qdisc->qstats) < 0)
+ return -1;
+
+ return gnet_stats_copy_app(d, &xstats, sizeof(xstats));
+}
+
+static void qfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl;
+ struct hlist_node *n;
+ unsigned int i;
+
+ if (arg->stop)
+ return;
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode) {
+ if (arg->count < arg->skip) {
+ arg->count++;
+ continue;
+ }
+ if (arg->fn(sch, (unsigned long)cl, arg) < 0) {
+ arg->stop = 1;
+ return;
+ }
+ arg->count++;
+ }
+ }
+}
+
+static struct qfq_class *qfq_classify(struct sk_buff *skb, struct Qdisc *sch,
+ int *qerr)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl;
+ struct tcf_result res;
+ int result;
+
+ if (TC_H_MAJ(skb->priority ^ sch->handle) == 0) {
+ cl = qfq_find_class(sch, skb->priority);
+ if (cl != NULL)
+ return cl;
+ }
+
+ *qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ result = tc_classify(skb, q->filter_list, &res);
+ if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+ switch (result) {
+ case TC_ACT_QUEUED:
+ case TC_ACT_STOLEN:
+ *qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+ case TC_ACT_SHOT:
+ return NULL;
+ }
+#endif
+ cl = (struct qfq_class *)res.class;
+ if (cl == NULL)
+ cl = qfq_find_class(sch, res.classid);
+ return cl;
+ }
+
+ return NULL;
+}
+
+/* Generic comparison function, handling wraparound. */
+static inline int qfq_gt(u64 a, u64 b)
+{
+ return (s64)(a - b) > 0;
+}
+
+/* Round a precise timestamp to its slotted value. */
+static inline u64 qfq_round_down(u64 ts, unsigned int shift)
+{
+ return ts & ~((1ULL << shift) - 1);
+}
+
+/* return the pointer to the group with lowest index in the bitmap */
+static inline struct qfq_group *qfq_ffs(struct qfq_sched *q,
+ unsigned long bitmap)
+{
+ int index = __ffs(bitmap); /* zero-based */
+ return &q->groups[index];
+}
+/* Calculate a mask to mimic what would be ffs_from(). */
+static inline unsigned long mask_from(unsigned long bitmap, int from)
+{
+ return bitmap & ~((1UL << from) - 1);
+}
+
+/*
+ * The state computation relies on ER=0, IR=1, EB=2, IB=3
+ * First compute eligibility comparing grp->S, q->V,
+ * then check if someone is blocking us and possibly add EB
+ */
+static int qfq_calc_state(struct qfq_sched *q, const struct qfq_group *grp)
+{
+ /* if S > V we are not eligible */
+ unsigned int state = qfq_gt(grp->S, q->V);
+ unsigned long mask = mask_from(q->bitmaps[ER], grp->index);
+ struct qfq_group *next;
+
+ if (mask) {
+ next = qfq_ffs(q, mask);
+ if (qfq_gt(grp->F, next->F))
+ state |= EB;
+ }
+
+ return state;
+}
+
+
+/*
+ * In principle
+ * q->bitmaps[dst] |= q->bitmaps[src] & mask;
+ * q->bitmaps[src] &= ~mask;
+ * but we should make sure that src != dst
+ */
+static inline void qfq_move_groups(struct qfq_sched *q, unsigned long mask,
+ int src, int dst)
+{
+ q->bitmaps[dst] |= q->bitmaps[src] & mask;
+ q->bitmaps[src] &= ~mask;
+}
+
+static void qfq_unblock_groups(struct qfq_sched *q, int index, u64 old_F)
+{
+ unsigned long mask = mask_from(q->bitmaps[ER], index + 1);
+ struct qfq_group *next;
+
+ if (mask) {
+ next = qfq_ffs(q, mask);
+ if (!qfq_gt(next->F, old_F))
+ return;
+ }
+
+ mask = (1UL << index) - 1;
+ qfq_move_groups(q, mask, EB, ER);
+ qfq_move_groups(q, mask, IB, IR);
+}
+
+/*
+ * perhaps
+ *
+ old_V ^= q->V;
+ old_V >>= QFQ_MIN_SLOT_SHIFT;
+ if (old_V) {
+ ...
+ }
+ *
+ */
+static void qfq_make_eligible(struct qfq_sched *q, u64 old_V)
+{
+ unsigned long vslot = q->V >> QFQ_MIN_SLOT_SHIFT;
+ unsigned long old_vslot = old_V >> QFQ_MIN_SLOT_SHIFT;
+
+ if (vslot != old_vslot) {
+ unsigned long mask = (1UL << fls(vslot ^ old_vslot)) - 1;
+ qfq_move_groups(q, mask, IR, ER);
+ qfq_move_groups(q, mask, IB, EB);
+ }
+}
+
+/*
+ * XXX we should make sure that slot becomes less than 32.
+ * This is guaranteed by the input values.
+ * roundedS is always cl->S rounded on grp->slot_shift bits.
+ */
+static void qfq_slot_insert(struct qfq_group *grp, struct qfq_class *cl,
+ u64 roundedS)
+{
+ u64 slot = (roundedS - grp->S) >> grp->slot_shift;
+ unsigned int i = (grp->front + slot) % QFQ_MAX_SLOTS;
+
+ cl->next = grp->slots[i];
+ grp->slots[i] = cl;
+ __set_bit(slot, &grp->full_slots);
+}
+
+/*
+ * remove the entry from the slot
+ */
+static void qfq_front_slot_remove(struct qfq_group *grp)
+{
+ struct qfq_class **h = &grp->slots[grp->front];
+
+ *h = (*h)->next;
+ if (!*h)
+ __clear_bit(0, &grp->full_slots);
+}
+
+/*
+ * Returns the first full queue in a group. As a side effect,
+ * adjust the bucket list so the first non-empty bucket is at
+ * position 0 in full_slots.
+ */
+static struct qfq_class *qfq_slot_scan(struct qfq_group *grp)
+{
+ unsigned int i;
+
+ pr_debug("qfq slot_scan: grp %u full %#lx\n",
+ grp->index, grp->full_slots);
+
+ if (!grp->full_slots)
+ return NULL;
+
+ i = __ffs(grp->full_slots); /* zero based */
+ if (i > 0) {
+ grp->front = (grp->front + i) % QFQ_MAX_SLOTS;
+ grp->full_slots >>= i;
+ }
+
+ return grp->slots[grp->front];
+}
+
+/*
+ * adjust the bucket list. When the start time of a group decreases,
+ * we move the index down (modulo QFQ_MAX_SLOTS) so we don't need to
+ * move the objects. The mask of occupied slots must be shifted
+ * because we use ffs() to find the first non-empty slot.
+ * This covers decreases in the group's start time, but what about
+ * increases of the start time ?
+ * Here too we should make sure that i is less than 32
+ */
+static void qfq_slot_rotate(struct qfq_group *grp, u64 roundedS)
+{
+ unsigned int i = (grp->S - roundedS) >> grp->slot_shift;
+
+ grp->full_slots <<= i;
+ grp->front = (grp->front - i) % QFQ_MAX_SLOTS;
+}
+
+static void qfq_update_eligible(struct qfq_sched *q, u64 old_V)
+{
+ struct qfq_group *grp;
+ unsigned long ineligible;
+
+ ineligible = q->bitmaps[IR] | q->bitmaps[IB];
+ if (ineligible) {
+ if (!q->bitmaps[ER]) {
+ grp = qfq_ffs(q, ineligible);
+ if (qfq_gt(grp->S, q->V))
+ q->V = grp->S;
+ }
+ qfq_make_eligible(q, old_V);
+ }
+}
+
+/* What is length of next packet in queue (0 if queue is empty) */
+static unsigned int qdisc_peek_len(struct Qdisc *sch)
+{
+ struct sk_buff *skb;
+
+ skb = sch->ops->peek(sch);
+ return skb ? qdisc_pkt_len(skb) : 0;
+}
+
+/*
+ * Updates the class, returns true if the group also needs to be updated.
+ */
+static bool qfq_update_class(struct qfq_group *grp, struct qfq_class *cl)
+{
+ unsigned int len = qdisc_peek_len(cl->qdisc);
+
+ cl->S = cl->F;
+ if (!len)
+ qfq_front_slot_remove(grp); /* queue is empty */
+ else {
+ u64 roundedS;
+
+ cl->F = cl->S + (u64)len * cl->inv_w;
+ roundedS = qfq_round_down(cl->S, grp->slot_shift);
+ if (roundedS == grp->S)
+ return false;
+
+ qfq_front_slot_remove(grp);
+ qfq_slot_insert(grp, cl, roundedS);
+ }
+
+ return true;
+}
+
+static struct sk_buff *qfq_dequeue(struct Qdisc *sch)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_group *grp;
+ struct qfq_class *cl;
+ struct sk_buff *skb;
+ unsigned int len;
+ u64 old_V;
+
+ if (!q->bitmaps[ER])
+ return NULL;
+
+ grp = qfq_ffs(q, q->bitmaps[ER]);
+
+ cl = grp->slots[grp->front];
+ skb = qdisc_dequeue_peeked(cl->qdisc);
+ if (!skb) {
+ WARN_ONCE(1, "qfq_dequeue: non-workconserving leaf\n");
+ return NULL;
+ }
+
+ sch->q.qlen--;
+
+ old_V = q->V;
+ len = qdisc_pkt_len(skb);
+ q->V += (u64)len * IWSUM;
+ pr_debug("qfq dequeue: len %u F %lld now %lld\n",
+ len, (unsigned long long) cl->F, (unsigned long long) q->V);
+
+ if (qfq_update_class(grp, cl)) {
+ u64 old_F = grp->F;
+
+ cl = qfq_slot_scan(grp);
+ if (!cl)
+ __clear_bit(grp->index, &q->bitmaps[ER]);
+ else {
+ u64 roundedS = qfq_round_down(cl->S, grp->slot_shift);
+ unsigned int s;
+
+ if (grp->S == roundedS)
+ goto skip_unblock;
+ grp->S = roundedS;
+ grp->F = roundedS + (2ULL << grp->slot_shift);
+ __clear_bit(grp->index, &q->bitmaps[ER]);
+ s = qfq_calc_state(q, grp);
+ __set_bit(grp->index, &q->bitmaps[s]);
+ }
+
+ qfq_unblock_groups(q, grp->index, old_F);
+ }
+
+skip_unblock:
+ qfq_update_eligible(q, old_V);
+
+ return skb;
+}
+
+/*
+ * Assign a reasonable start time for a new flow k in group i.
+ * Admissible values for \hat(F) are multiples of \sigma_i
+ * no greater than V+\sigma_i . Larger values mean that
+ * we had a wraparound so we consider the timestamp to be stale.
+ *
+ * If F is not stale and F >= V then we set S = F.
+ * Otherwise we should assign S = V, but this may violate
+ * the ordering in ER. So, if we have groups in ER, set S to
+ * the F_j of the first group j which would be blocking us.
+ * We are guaranteed not to move S backward because
+ * otherwise our group i would still be blocked.
+ */
+static void qfq_update_start(struct qfq_sched *q, struct qfq_class *cl)
+{
+ unsigned long mask;
+ u64 limit, roundedF;
+ int slot_shift = cl->grp->slot_shift;
+
+ roundedF = qfq_round_down(cl->F, slot_shift);
+ limit = qfq_round_down(q->V, slot_shift) + (1UL << slot_shift);
+
+ if (!qfq_gt(cl->F, q->V) || qfq_gt(roundedF, limit)) {
+ /* timestamp was stale */
+ mask = mask_from(q->bitmaps[ER], cl->grp->index);
+ if (mask) {
+ struct qfq_group *next = qfq_ffs(q, mask);
+ if (qfq_gt(roundedF, next->F)) {
+ cl->S = next->F;
+ return;
+ }
+ }
+ cl->S = q->V;
+ } else { /* timestamp is not stale */
+ cl->S = cl->F;
+ }
+}
+
+static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_group *grp;
+ struct qfq_class *cl;
+ unsigned int len;
+ int err;
+ u64 roundedS;
+ int s;
+
+ cl = qfq_classify(skb, sch, &err);
+ if (cl == NULL || cl->qdisc->q.qlen > 80) {
+ if (err & __NET_XMIT_BYPASS)
+ sch->qstats.drops++;
+ kfree_skb(skb);
+ return err;
+ }
+
+ len = qdisc_pkt_len(skb);
+ err = qdisc_enqueue(skb, cl->qdisc);
+ if (unlikely(err != NET_XMIT_SUCCESS)) {
+ if (net_xmit_drop_count(err)) {
+ cl->qstats.drops++;
+ sch->qstats.drops++;
+ }
+ return err;
+ }
+
+ cl->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
+ cl->bstats.bytes += qdisc_pkt_len(skb);
+
+ sch->q.qlen++;
+ sch->bstats.packets += skb_is_gso(skb)?skb_shinfo(skb)->gso_segs:1;
+ sch->bstats.bytes += qdisc_pkt_len(skb);
+
+ if (cl->qdisc->q.qlen != 1)
+ return err;
+
+ /* If we reach this point, the class queue was idle. */
+ grp = cl->grp;
+ qfq_update_start(q, cl);
+
+ /* compute new finish time and rounded start. */
+ cl->F = cl->S + (u64)qdisc_pkt_len(skb) * cl->inv_w;
+ roundedS = qfq_round_down(cl->S, grp->slot_shift);
+
+ /*
+ * insert cl in the correct bucket.
+ * If cl->S >= grp->S we don't need to adjust the
+ * bucket list and simply go to the insertion phase.
+ * Otherwise grp->S is decreasing, we must make room
+ * in the bucket list, and also recompute the group state.
+ * Finally, if there were no flows in this group and nobody
+ * was in ER make sure to adjust V.
+ */
+ if (grp->full_slots) {
+ if (!qfq_gt(grp->S, cl->S))
+ goto skip_update;
+
+ /* create a slot for this cl->S */
+ qfq_slot_rotate(grp, roundedS);
+ /* group was surely ineligible, remove */
+ __clear_bit(grp->index, &q->bitmaps[IR]);
+ __clear_bit(grp->index, &q->bitmaps[IB]);
+ } else if (!q->bitmaps[ER] && qfq_gt(roundedS, q->V))
+ q->V = roundedS;
+
+ grp->S = roundedS;
+ grp->F = roundedS + (2ULL << grp->slot_shift);
+ s = qfq_calc_state(q, grp);
+ __set_bit(grp->index, &q->bitmaps[s]);
+
+ pr_debug("qfq enqueue: new state %d %#lx S %lld F %lld V %lld\n",
+ s, q->bitmaps[s],
+ (unsigned long long) cl->S,
+ (unsigned long long) cl->F,
+ (unsigned long long) q->V);
+
+skip_update:
+ qfq_slot_insert(grp, cl, roundedS);
+
+ return err;
+}
+
+
+static void qfq_slot_remove(struct qfq_sched *q, struct qfq_group *grp,
+ struct qfq_class *cl, struct qfq_class **pprev)
+{
+ unsigned int i, offset;
+ u64 roundedS;
+
+ roundedS = qfq_round_down(cl->S, grp->slot_shift);
+ offset = (roundedS - grp->S) >> grp->slot_shift;
+ i = (grp->front + offset) % QFQ_MAX_SLOTS;
+
+ if (!pprev) {
+ pprev = &grp->slots[i];
+ while (*pprev && *pprev != cl)
+ pprev = &(*pprev)->next;
+ }
+
+ *pprev = cl->next;
+ if (!grp->slots[i])
+ __clear_bit(offset, &grp->full_slots);
+}
+
+/*
+ * called to forcibly destroy a queue.
+ * If the queue is not in the front bucket, or if it has
+ * other queues in the front bucket, we can simply remove
+ * the queue with no other side effects.
+ * Otherwise we must propagate the event up.
+ */
+static void qfq_deactivate_class(struct qfq_sched *q, struct qfq_class *cl,
+ struct qfq_class **pprev)
+{
+ struct qfq_group *grp = cl->grp;
+ unsigned long mask;
+ u64 roundedS;
+ int s;
+
+ cl->F = cl->S;
+ qfq_slot_remove(q, grp, cl, pprev);
+
+ if (!grp->full_slots) {
+ __clear_bit(grp->index, &q->bitmaps[IR]);
+ __clear_bit(grp->index, &q->bitmaps[EB]);
+ __clear_bit(grp->index, &q->bitmaps[IB]);
+
+ if (test_bit(grp->index, &q->bitmaps[ER]) &&
+ !(q->bitmaps[ER] & ~((1UL << grp->index) - 1))) {
+ mask = q->bitmaps[ER] & ((1UL << grp->index) - 1);
+ if (mask)
+ mask = ~((1UL << __fls(mask)) - 1);
+ else
+ mask = ~0UL;
+ qfq_move_groups(q, mask, EB, ER);
+ qfq_move_groups(q, mask, IB, IR);
+ }
+ __clear_bit(grp->index, &q->bitmaps[ER]);
+ } else if (!grp->slots[grp->front]) {
+ cl = qfq_slot_scan(grp);
+ roundedS = qfq_round_down(cl->S, grp->slot_shift);
+ if (grp->S != roundedS) {
+ __clear_bit(grp->index, &q->bitmaps[ER]);
+ __clear_bit(grp->index, &q->bitmaps[IR]);
+ __clear_bit(grp->index, &q->bitmaps[EB]);
+ __clear_bit(grp->index, &q->bitmaps[IB]);
+ grp->S = roundedS;
+ grp->F = roundedS + (2ULL << grp->slot_shift);
+ s = qfq_calc_state(q, grp);
+ __set_bit(grp->index, &q->bitmaps[s]);
+ }
+ }
+
+ qfq_update_eligible(q, q->V);
+}
+
+static void qfq_qlen_notify(struct Qdisc *sch, unsigned long arg)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl = (struct qfq_class *)arg;
+
+ if (cl->qdisc->q.qlen == 0)
+ qfq_deactivate_class(q, cl, NULL);
+}
+
+static unsigned int qfq_drop(struct Qdisc *sch)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_group *grp;
+ struct qfq_class *cl, **pp;
+ unsigned int i, j, len;
+
+ for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+ grp = &q->groups[i];
+ for (j = 0; j < QFQ_MAX_SLOTS; j++) {
+ for (pp = &grp->slots[j]; *pp; pp = &(*pp)->next) {
+ cl = *pp;
+ if (!cl->qdisc->ops->drop)
+ continue;
+
+ len = cl->qdisc->ops->drop(cl->qdisc);
+ if (len > 0) {
+ sch->q.qlen--;
+ if (!cl->qdisc->q.qlen)
+ qfq_deactivate_class(q, cl, pp);
+
+ return len;
+ }
+ }
+ }
+ }
+
+ return 0;
+}
+
+static int qfq_init_qdisc(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_group *grp;
+ int i, err;
+
+ err = qdisc_class_hash_init(&q->clhash);
+ if (err < 0)
+ return err;
+
+ for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+ grp = &q->groups[i];
+ grp->index = i;
+ }
+
+ return 0;
+}
+
+static void qfq_reset_qdisc(struct Qdisc *sch)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_group *grp;
+ struct qfq_class *cl, **pp;
+ struct hlist_node *n;
+ unsigned int i, j;
+
+ for (i = 0; i <= QFQ_MAX_INDEX; i++) {
+ grp = &q->groups[i];
+ for (j = 0; j < QFQ_MAX_SLOTS; j++) {
+ for (pp = &grp->slots[j]; *pp; pp = &(*pp)->next) {
+ cl = *pp;
+ if (cl->qdisc->q.qlen)
+ qfq_deactivate_class(q, cl, pp);
+ }
+ }
+ }
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode)
+ qdisc_reset(cl->qdisc);
+ }
+ sch->q.qlen = 0;
+}
+
+static void qfq_destroy_qdisc(struct Qdisc *sch)
+{
+ struct qfq_sched *q = qdisc_priv(sch);
+ struct qfq_class *cl;
+ struct hlist_node *n, *next;
+ unsigned int i;
+
+ tcf_destroy_chain(&q->filter_list);
+
+ for (i = 0; i < q->clhash.hashsize; i++) {
+ hlist_for_each_entry_safe(cl, n, next, &q->clhash.hash[i],
+ common.hnode)
+ qfq_destroy_class(sch, cl);
+ }
+ qdisc_class_hash_destroy(&q->clhash);
+}
+
+static const struct Qdisc_class_ops qfq_class_ops = {
+ .change = qfq_change_class,
+ .delete = qfq_delete_class,
+ .get = qfq_get_class,
+ .put = qfq_put_class,
+ .tcf_chain = qfq_tcf_chain,
+ .bind_tcf = qfq_bind_tcf,
+ .unbind_tcf = qfq_unbind_tcf,
+ .graft = qfq_graft_class,
+ .leaf = qfq_class_leaf,
+ .qlen_notify = qfq_qlen_notify,
+ .dump = qfq_dump_class,
+ .dump_stats = qfq_dump_class_stats,
+ .walk = qfq_walk,
+};
+
+static struct Qdisc_ops qfq_qdisc_ops __read_mostly = {
+ .cl_ops = &qfq_class_ops,
+ .id = "qfq",
+ .priv_size = sizeof(struct qfq_sched),
+ .enqueue = qfq_enqueue,
+ .dequeue = qfq_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .drop = qfq_drop,
+ .init = qfq_init_qdisc,
+ .reset = qfq_reset_qdisc,
+ .destroy = qfq_destroy_qdisc,
+ .owner = THIS_MODULE,
+};
+
+static int __init qfq_init(void)
+{
+ return register_qdisc(&qfq_qdisc_ops);
+}
+
+static void __exit qfq_exit(void)
+{
+ unregister_qdisc(&qfq_qdisc_ops);
+}
+
+module_init(qfq_init);
+module_exit(qfq_exit);
+MODULE_LICENSE("GPL");
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07 3:54 UTC (permalink / raw)
To: Matt Carlson
Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107034130.GA18028@mcarlson.broadcom.com>
Le jeudi 06 janvier 2011 à 19:41 -0800, Matt Carlson a écrit :
> On Thu, Jan 06, 2011 at 07:04:46PM -0800, Eric Dumazet wrote:
> > Le jeudi 06 janvier 2011 à 18:59 -0800, Matt Carlson a écrit :
> > > On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > > > Le vendredi 07 janvier 2011 à 03:41 +0100, Eric Dumazet a écrit :
> > > > > Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :
> > > > >
> > > > > > Hi Eric. Sorry for the delay. I was under the impression that your
> > > > > > problems were software related and that you just needed a revised
> > > > > > version of these VLAN patches I was sending to Michael. Is this not
> > > > > > true?
> > > > > >
> > > > > > Having a hardware stat increment suggests this is a new problem.
> > > > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > > > with and whether or not management firmware was enabled. Could you tell
> > > > > > me that info?
> > > > > >
> > > > >
> > > > > Hi Matt
> > > > >
> > > > > I started a bisection, because I couldnt sleep tonight anyway :(
> > > > >
> > > > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > > > Gigabit Ethernet (rev a3)
> > > > > Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > > > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > > > Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > > > Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > > > [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > > > Capabilities: [40] PCI-X non-bridge device
> > > > > Capabilities: [48] Power Management version 2
> > > > > Capabilities: [50] Vital Product Data
> > > > > Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > > > Kernel driver in use: tg3
> > > > > Kernel modules: tg3
> > > > >
> > > > >
> > > >
> > > > $ ethtool -i eth2
> > > > driver: tg3
> > > > version: 3.115
> > > > firmware-version: 5715s-v3.28
> > > > bus-info: 0000:14:04.0
> > > > $ dmesg | grep ASF
> > > > [ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > > ASF[0] TSOcap[1]
> > > > [ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > > ASF[0] TSOcap[1]
> > >
> > > Thanks. So management firmware is disabled. This should be
> > > a straightforward case.
> > >
> > > I'm wondering if I'm misunderstanding something though. You said earlier
> > > that VLAN tagging doesn't work unless you applied my patch. Is this no
> > > longer true?
> > >
> >
> > I dont apply your patch because Jesse said it was not a good patch ;)
>
> Oh.
>
> > Maybe I missed something and it must be applied ? Problem is : current
> > Linus tree now includes net-next-2.6 and vlan doesnt work. You should
> > resubmit it perhaps ?
>
> Yes, something needs to be submitted. I want to make sure we aren't
> chasing the same problem though. If the patch(es) fix your problem,
> then I can concentrate on finalizing the patch.
>
I believe it did, I can test your next patch ;)
> I can combine my last patch (the one that always enabled VLAN tag
> stripping) and the previous patch (that implements all your comments so
> far) into one patch, but that still leaves the behavior Michael noted
> unaddressed.
>
> Michael, did you ever find out whether or not RXD_FLAG_VLAN was being
> set?
>
Here is the bisect log , just in case :
d2394e6bb1aa636f3bd142cb6f7845a4332514b5 is first bad commit
commit d2394e6bb1aa636f3bd142cb6f7845a4332514b5
Author: Matt Carlson <mcarlson@broadcom.com>
Date: Wed Nov 24 08:31:47 2010 +0000
tg3: Always turn on APE features in mac_mode reg
The APE needs certain bits in the mac_mode register to be enabled for
traffic to flow correctly. This patch changes the code to always enable
these bits in the presence of the APE.
Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
Reviewed-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
:040000 040000 086382ce3ecd909222faf53ca23c48d1200dd60c 0a325ac6e7aa87a6737610292ab3025156a4ed80 M drivers
$ git bisect log
git bisect start
# bad: [3c0cb7c31c206aaedb967e44b98442bbeb17a6c4] Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
git bisect bad 3c0cb7c31c206aaedb967e44b98442bbeb17a6c4
# good: [3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5] Linux 2.6.37
git bisect good 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5
# bad: [63e35cd9bd4c8ae085c8b9a70554595b529c4100] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem
git bisect bad 63e35cd9bd4c8ae085c8b9a70554595b529c4100
# bad: [67d5288049f46f816181f63eaa8f1371877ad8ea] vxge: update driver version
git bisect bad 67d5288049f46f816181f63eaa8f1371877ad8ea
# good: [b382b191ea9e9ccefc437433d23befe91f4a8925] ipv6: AF_INET6 link address family
git bisect good b382b191ea9e9ccefc437433d23befe91f4a8925
# good: [bedbbb959d2c1d1dbb4c2215f5b7074b1da3030a] ath: Add a driver_info bitmask field
git bisect good bedbbb959d2c1d1dbb4c2215f5b7074b1da3030a
# bad: [cf7afbfeb8ceb0187348d0a1a0db61305e25f05f] rtnl: make link af-specific updates atomic
git bisect bad cf7afbfeb8ceb0187348d0a1a0db61305e25f05f
# good: [22674a24b44ac53f244ef6edadd02021a270df5a] Net: dns_resolver: Makefile: Remove deprecated kbuild goal definitions
git bisect good 22674a24b44ac53f244ef6edadd02021a270df5a
# bad: [cf79003d598b1f82a4caa0564107283b4f560e14] tg3: Fix 5719 internal FIFO overflow problem
git bisect bad cf79003d598b1f82a4caa0564107283b4f560e14
# good: [094f2faaa2c4973e50979158f655a1d31a97ba98] Net: rds: Makefile: Remove deprecated items
git bisect good 094f2faaa2c4973e50979158f655a1d31a97ba98
# good: [04f6d70f6e64900a5d70a5fc199dd9d5fa787738] SELinux: Only return netlink error when we know the return is fatal
git bisect good 04f6d70f6e64900a5d70a5fc199dd9d5fa787738
# good: [5093eedc8bdfd7d906836a44a248f66a99e27d22] tg3: Apply 10Mbps fix to all 57765 revisions
git bisect good 5093eedc8bdfd7d906836a44a248f66a99e27d22
# bad: [d2394e6bb1aa636f3bd142cb6f7845a4332514b5] tg3: Always turn on APE features in mac_mode reg
git bisect bad d2394e6bb1aa636f3bd142cb6f7845a4332514b5
# good: [b75cc0e4c1caac63941d96a73b2214e8007b934b] tg3: Assign correct tx margin for 5719
git bisect good b75cc0e4c1caac63941d96a73b2214e8007b934b
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07 3:41 UTC (permalink / raw)
To: Eric Dumazet
Cc: Matthew Carlson, Jesse Gross, Michael Leun, Michael Chan,
David Miller, Ben Greear, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <1294369486.2704.53.camel@edumazet-laptop>
On Thu, Jan 06, 2011 at 07:04:46PM -0800, Eric Dumazet wrote:
> Le jeudi 06 janvier 2011 à 18:59 -0800, Matt Carlson a écrit :
> > On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > > Le vendredi 07 janvier 2011 à 03:41 +0100, Eric Dumazet a écrit :
> > > > Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :
> > > >
> > > > > Hi Eric. Sorry for the delay. I was under the impression that your
> > > > > problems were software related and that you just needed a revised
> > > > > version of these VLAN patches I was sending to Michael. Is this not
> > > > > true?
> > > > >
> > > > > Having a hardware stat increment suggests this is a new problem.
> > > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > > with and whether or not management firmware was enabled. Could you tell
> > > > > me that info?
> > > > >
> > > >
> > > > Hi Matt
> > > >
> > > > I started a bisection, because I couldnt sleep tonight anyway :(
> > > >
> > > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > > Gigabit Ethernet (rev a3)
> > > > Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > > Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > > Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > > [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > > Capabilities: [40] PCI-X non-bridge device
> > > > Capabilities: [48] Power Management version 2
> > > > Capabilities: [50] Vital Product Data
> > > > Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > > Kernel driver in use: tg3
> > > > Kernel modules: tg3
> > > >
> > > >
> > >
> > > $ ethtool -i eth2
> > > driver: tg3
> > > version: 3.115
> > > firmware-version: 5715s-v3.28
> > > bus-info: 0000:14:04.0
> > > $ dmesg | grep ASF
> > > [ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > ASF[0] TSOcap[1]
> > > [ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > ASF[0] TSOcap[1]
> >
> > Thanks. So management firmware is disabled. This should be
> > a straightforward case.
> >
> > I'm wondering if I'm misunderstanding something though. You said earlier
> > that VLAN tagging doesn't work unless you applied my patch. Is this no
> > longer true?
> >
>
> I dont apply your patch because Jesse said it was not a good patch ;)
Oh.
> Maybe I missed something and it must be applied ? Problem is : current
> Linus tree now includes net-next-2.6 and vlan doesnt work. You should
> resubmit it perhaps ?
Yes, something needs to be submitted. I want to make sure we aren't
chasing the same problem though. If the patch(es) fix your problem,
then I can concentrate on finalizing the patch.
I can combine my last patch (the one that always enabled VLAN tag
stripping) and the previous patch (that implements all your comments so
far) into one patch, but that still leaves the behavior Michael noted
unaddressed.
Michael, did you ever find out whether or not RXD_FLAG_VLAN was being
set?
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07 3:24 UTC (permalink / raw)
To: Jesse Gross
Cc: Michael Leun, Matthew Carlson, Michael Chan, Eric Dumazet,
David Miller, Ben Greear, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <AANLkTi=hGRW+DwoKQzxPAZao6-y_tvn6nNXS-tj_-Y6T@mail.gmail.com>
On Sat, Dec 18, 2010 at 07:38:00PM -0800, Jesse Gross wrote:
> On Tue, Dec 14, 2010 at 11:16 PM, Michael Leun
> <lkml20101129@newton.leun.net> wrote:
> > OK - all tests done on that DL320G5:
> >
> > For completeness, 2.6.37-rc5 unpatched:
> >
> > eth0, no vlan configured: totally broken - see double tagged vlans
> > without tag, single or untagged packets missing at all
>
> Random behavior? This one is somewhat hard to explain - maybe there
> are some other factors. eth0 has ASF on, so it always strips tags. I
> would expect it to behave like the vlan configured case.
>
> >
> > eth0, vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
>
> Both ASF and vlan group configured cause tag stripping to be enabled.
> Missing tag.
>
> >
> > eth1 same as originally reported:
> > without vlan configured see vlan tags (single and double tagged as
> > expected)
>
> No ASF and no vlan group means tag stripping is disabled. Have tag.
>
> > with vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
>
> Configuring vlan group causes stripping to be enabled. Missing tag.
>
> >
> >
> > 2.6.37-rc5, your tg3 use new vlan-code patch:
> >
> > eth0, no vlan configured: see packets without vlan tag (see double
> > tagged packets with one vlan tag)
>
> ASF enables tag stripping. Missing tag.
>
> > eth1, no vlan configured: see vlan tags (single and double tagged as
> > expected)
>
> No ASF, no vlan group means no stripping. Have tag.
>
> >
> >
> > eth0, vlan configured: as without vlan
>
> ASF enables stripping. Missing tag.
>
> > eth1, vlan configured: as without vlan
>
> With this patch vlan stripping is only enabled when ASF is on, so no
> stripping. Have tag.
>
> >
> > 2.6.37-rc5, your tg3 use new vlan-code patch with test patch ontop
> >
> > eth1 no vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
>
> With the second patch, vlan stripping is always enabled. Missing tag.
>
> > eth1 with vlan: the same
>
> Stripping still always enabled. Missing tag.
>
> The bottom line is whenever vlan stripping is enabled we're missing
> the outer tag. It might be worth adding some debugging in the area
> before napi_gro_receive/vlan_gro_receive (depending on version). My
> guess is that (desc->type_flags & RXD_FLAG_VLAN) is false even for
> vlan packets on this NIC.
>
> You said that everything works on the 5752? Matt, is it possible that
> the 5714 either has a problem with vlan stripping or a different way
> of reporting it?
I don't think this is a 5714 specific issue. I think the problem is
rooted in the fact that the VLAN tag stripping is enabled.
Your RXD_FLAG_VLAN idea sounds unlikely to me, but it's worth a check.
The patch here is using __vlan_hwaccel_put_tag(), which informs the
stack a VLAN tag is present. If this is indeed a reporting problem, I'm
not sure what else the driver should be doing.
> Also, why does ASF require vlan stripping?
This is a firmware limitation.
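
The stripping rules from the quoted analysis can be condensed into a small truth-table function. This is only an illustrative sketch: the enum labels and function names are hypothetical, nothing here is tg3 driver code, and the unpatched eth0 case encodes the behaviour Jesse expected rather than the erratic result Michael actually reported.

```c
#include <stdbool.h>

/* Hypothetical labels for the three kernels tested in the thread. */
enum tg3_variant {
	UNPATCHED,		/* 2.6.37-rc5 as released */
	VLAN_PATCH,		/* with the "use new vlan code" patch */
	VLAN_PATCH_PLUS_TEST,	/* with the test patch on top */
};

/* Is hardware VLAN tag stripping enabled, per the quoted analysis? */
static bool strip_enabled(enum tg3_variant v, bool asf, bool vlan_group)
{
	switch (v) {
	case UNPATCHED:
		return asf || vlan_group;
	case VLAN_PATCH:
		return asf;
	case VLAN_PATCH_PLUS_TEST:
		return true;
	}
	return false;
}

/* Observed symptom: the outer tag is visible only when stripping is off. */
static bool outer_tag_seen(enum tg3_variant v, bool asf, bool vlan_group)
{
	return !strip_enabled(v, asf, vlan_group);
}
```

For example, outer_tag_seen(VLAN_PATCH, false, true) is true, matching the "eth1, vlan configured" result with the first patch applied.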
* Re: [PATCH] ehea: Add some info messages and fix an issue
From: Anton Blanchard @ 2011-01-07 3:24 UTC (permalink / raw)
To: leitao; +Cc: joe, netdev, davem
In-Reply-To: <1290792387-12331-1-git-send-email-leitao@linux.vnet.ibm.com>
Hi,
> From: Breno Leitao <breno@cafe.(none)>
>
> This patch adds some debug information about ehea not being able to
> allocate enough spaces. Also it correctly updates the amount of
> available skb.
I'm seeing issues on a number of machines with the ehea device.
Sometime after boot I see a bunch of:
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
which eventually stop.
- for (i = 0; i < pr->rq1_skba.len; i++) {
+ for (i = 0; i < nr_rq1a; i++) {
It looks like you are now only initialising half the ring, but still
telling the hardware to use the whole ring. Once you get through the
entire ring once the errors go away.
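
A small userspace model may make the symptom easier to see. It is a sketch under stated assumptions, not ehea code: the ring size, the ring_model type and the helper names are hypothetical, and the model assumes each slot is refilled after it has been processed, which is what would make the errors stop after the first pass.

```c
#include <stddef.h>

#define RING_LEN 8	/* hypothetical ring size */

struct ring_model {
	int filled[RING_LEN];	/* 1 if the slot holds a buffer (an skb in the driver) */
};

/* Fill only the first nr slots, as a loop bounded by nr_rq1a would. */
static void ring_init(struct ring_model *r, size_t nr)
{
	size_t i;

	for (i = 0; i < RING_LEN; i++)
		r->filled[i] = i < nr;
}

/*
 * Walk the whole ring once, as hardware told to use every slot would.
 * Returns the number of empty slots seen (the "skb=NULL" cases) and
 * refills each slot after use (an assumption of this model).
 */
static size_t ring_pass(struct ring_model *r)
{
	size_t i, missing = 0;

	for (i = 0; i < RING_LEN; i++) {
		if (!r->filled[i])
			missing++;
		r->filled[i] = 1;
	}
	return missing;
}
```

With ring_init(&r, RING_LEN / 2), the first ring_pass() reports RING_LEN / 2 empty slots and every later pass reports none.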
Anton
* [net-next-2.6 PATCH v6 2/2] net_sched: implement a root container qdisc sch_mqprio
From: John Fastabend @ 2011-01-07 3:12 UTC (permalink / raw)
To: davem, jarkao2
Cc: hadi, eric.dumazet, shemminger, tgraf, bhutchings, nhorman,
netdev
In-Reply-To: <20110107031211.2446.35715.stgit@jf-dev1-dcblab>
This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.
Using the mqprio qdisc, the number of tcs currently in use, along
with the range of queues allotted to each class, can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.
Configurable parameters,
struct tc_mqprio_qopt {
__u8 num_tc;
__u8 prio_tc_map[TC_QOPT_BITMASK + 1];
__u8 hw;
__u16 count[TC_QOPT_MAX_QUEUE];
__u16 offset[TC_QOPT_MAX_QUEUE];
};
Here the count/offset pairing gives the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.
The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
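
The bounds checking described above can be sketched in userspace as follows. This is an illustration under assumptions, not the code from the patch: the struct and function names are hypothetical, the hw-owned case is left out, and the overlap test is a general interval check that does not depend on the ranges being listed in ascending order.

```c
#include <stdint.h>

#define QOPT_BITMASK	15	/* mirrors TC_QOPT_BITMASK */
#define QOPT_MAX_QUEUE	16	/* mirrors TC_QOPT_MAX_QUEUE */

/* Hypothetical userspace copy of the configuration layout. */
struct mqprio_qopt_model {
	uint8_t  num_tc;
	uint8_t  prio_tc_map[QOPT_BITMASK + 1];
	uint8_t  hw;
	uint16_t count[QOPT_MAX_QUEUE];
	uint16_t offset[QOPT_MAX_QUEUE];
};

/* Returns 0 if a software-owned mapping is acceptable, -1 otherwise. */
static int qopt_check(const struct mqprio_qopt_model *q, unsigned int num_txq)
{
	unsigned int i, j;

	if (q->num_tc > QOPT_MAX_QUEUE)
		return -1;

	/* Every priority must map to a traffic class that exists. */
	for (i = 0; i <= QOPT_BITMASK; i++)
		if (q->prio_tc_map[i] >= q->num_tc)
			return -1;

	for (i = 0; i < q->num_tc; i++) {
		unsigned int first = q->offset[i];
		unsigned int last = first + q->count[i];	/* one past the end */

		/* Each class needs at least one queue and must fit the device. */
		if (!q->count[i] || last > num_txq)
			return -1;

		/* Two half-open ranges overlap if each starts before the other ends. */
		for (j = i + 1; j < q->num_tc; j++) {
			unsigned int ofirst = q->offset[j];
			unsigned int olast = ofirst + q->count[j];

			if (first < olast && ofirst < last)
				return -1;
		}
	}
	return 0;
}
```

For instance, two classes with (count, offset) of (4, 0) and (4, 4) pass on an 8-queue device, while (4, 0) and (4, 3) are rejected as overlapping.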
It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc().
One expected use case is drivers will use the ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/pkt_sched.h | 12 +
net/sched/Kconfig | 12 +
net/sched/Makefile | 1
net/sched/sch_generic.c | 4
net/sched/sch_mqprio.c | 415 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 444 insertions(+), 0 deletions(-)
create mode 100644 net/sched/sch_mqprio.c
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..776cd93 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,16 @@ struct tc_drr_stats {
__u32 deficit;
};
+/* MQPRIO */
+#define TC_QOPT_BITMASK 15
+#define TC_QOPT_MAX_QUEUE 16
+
+struct tc_mqprio_qopt {
+ __u8 num_tc;
+ __u8 prio_tc_map[TC_QOPT_BITMASK + 1];
+ __u8 hw;
+ __u16 count[TC_QOPT_MAX_QUEUE];
+ __u16 offset[TC_QOPT_MAX_QUEUE];
+};
+
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a36270a..f52f5eb 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -205,6 +205,18 @@ config NET_SCH_DRR
If unsure, say N.
+config NET_SCH_MQPRIO
+ tristate "Multi-queue priority scheduler (MQPRIO)"
+ help
+ Say Y here if you want to use the Multi-queue Priority scheduler.
+ This scheduler allows QOS to be offloaded on NICs that have support
+ for offloading QOS schedulers.
+
+ To compile this driver as a module, choose M here: the module will
+ be called sch_mqprio.
+
+ If unsure, say N.
+
config NET_SCH_INGRESS
tristate "Ingress Qdisc"
depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..26ce681 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ) += sch_multiq.o
obj-$(CONFIG_NET_SCH_ATM) += sch_atm.o
obj-$(CONFIG_NET_SCH_NETEM) += sch_netem.o
obj-$(CONFIG_NET_SCH_DRR) += sch_drr.o
+obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
obj-$(CONFIG_NET_CLS_FW) += cls_fw.o
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..723b278 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.dump = pfifo_fast_dump,
.owner = THIS_MODULE,
};
+EXPORT_SYMBOL(pfifo_fast_ops);
struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
struct Qdisc_ops *ops)
@@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
return oqdisc;
}
+EXPORT_SYMBOL(dev_graft_qdisc);
static void attach_one_default_qdisc(struct net_device *dev,
struct netdev_queue *dev_queue,
@@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
dev_watchdog_up(dev);
}
}
+EXPORT_SYMBOL(dev_activate);
static void dev_deactivate_queue(struct net_device *dev,
struct netdev_queue *dev_queue,
@@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
list_add(&dev->unreg_list, &single);
dev_deactivate_many(&single);
}
+EXPORT_SYMBOL(dev_deactivate);
static void dev_init_scheduler_queue(struct net_device *dev,
struct netdev_queue *dev_queue,
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
new file mode 100644
index 0000000..4363f95
--- /dev/null
+++ b/net/sched/sch_mqprio.c
@@ -0,0 +1,415 @@
+/*
+ * net/sched/sch_mqprio.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mqprio_sched {
+ struct Qdisc **qdiscs;
+ int hw_owned;
+};
+
+static void mqprio_destroy(struct Qdisc *sch)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mqprio_sched *priv = qdisc_priv(sch);
+ unsigned int ntx;
+
+ if (!priv->qdiscs)
+ return;
+
+ for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+ qdisc_destroy(priv->qdiscs[ntx]);
+
+ if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+ dev->netdev_ops->ndo_setup_tc(dev, 0, dev->real_num_tx_queues);
+ else
+ netdev_set_num_tc(dev, 0);
+
+ kfree(priv->qdiscs);
+}
+
+static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
+{
+ int i, j;
+
+ /* Verify num_tc is not out of max range */
+ if (qopt->num_tc > TC_QOPT_MAX_QUEUE)
+ return -EINVAL;
+
+ /* Verify priority mapping uses valid tcs */
+ for (i = 0; i < TC_QOPT_BITMASK + 1; i++) {
+ if (qopt->prio_tc_map[i] >= qopt->num_tc)
+ return -EINVAL;
+ }
+
+ /* net_device does not support requested operation */
+ if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
+ return -EINVAL;
+
+ /* if hw owned qcount and qoffset are taken from LLD so
+ * no reason to verify them here
+ */
+ if (qopt->hw)
+ return 0;
+
+ for (i = 0; i < qopt->num_tc; i++) {
+ unsigned int last = qopt->offset[i] + qopt->count[i];
+
+ /* Verify the queue count is in tx range being equal to the
+ * real_num_tx_queues indicates the last queue is in use.
+ */
+ if (qopt->offset[i] >= dev->real_num_tx_queues ||
+ !qopt->count[i] ||
+ last > dev->real_num_tx_queues)
+ return -EINVAL;
+
+ /* Verify that the offset and counts do not overlap */
+ for (j = i + 1; j < qopt->num_tc; j++) {
+ if (last > qopt->offset[j])
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mqprio_sched *priv = qdisc_priv(sch);
+ struct netdev_queue *dev_queue;
+ struct Qdisc *qdisc;
+ int i, err = -EOPNOTSUPP;
+ struct tc_mqprio_qopt *qopt = NULL;
+
+ if (sch->parent != TC_H_ROOT)
+ return -EOPNOTSUPP;
+
+ if (!netif_is_multiqueue(dev))
+ return -EOPNOTSUPP;
+
+ if (nla_len(opt) < sizeof(*qopt))
+ return -EINVAL;
+
+ qopt = nla_data(opt);
+ if (mqprio_parse_opt(dev, qopt))
+ return -EINVAL;
+
+ /* pre-allocate qdisc, attachment can't fail */
+ priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+ GFP_KERNEL);
+ if (priv->qdiscs == NULL) {
+ err = -ENOMEM;
+ goto err;
+ }
+
+ for (i = 0; i < dev->num_tx_queues; i++) {
+ dev_queue = netdev_get_tx_queue(dev, i);
+ qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
+ TC_H_MAKE(TC_H_MAJ(sch->handle),
+ TC_H_MIN(i + 1)));
+ if (qdisc == NULL) {
+ err = -ENOMEM;
+ goto err;
+ }
+ qdisc->flags |= TCQ_F_CAN_BYPASS;
+ priv->qdiscs[i] = qdisc;
+ }
+
+ /* If the mqprio options indicate that hardware should own
+ * the queue mapping then run ndo_setup_tc otherwise use the
+ * supplied and verified mapping
+ */
+ if (qopt->hw) {
+ priv->hw_owned = 1;
+ err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc,
+ dev->real_num_tx_queues);
+ if (err)
+ goto err;
+ } else {
+ netdev_set_num_tc(dev, qopt->num_tc);
+ for (i = 0; i < qopt->num_tc; i++)
+ netdev_set_tc_queue(dev, i,
+ qopt->count[i], qopt->offset[i]);
+ }
+
+ /* Always use supplied priority mappings */
+ for (i = 0; i < TC_QOPT_BITMASK + 1; i++)
+ netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i]);
+
+ sch->flags |= TCQ_F_MQROOT;
+ return 0;
+
+err:
+ mqprio_destroy(sch);
+ return err;
+}
+
+static void mqprio_attach(struct Qdisc *sch)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mqprio_sched *priv = qdisc_priv(sch);
+ struct Qdisc *qdisc;
+ unsigned int ntx;
+
+ /* Attach underlying qdisc */
+ for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+ qdisc = priv->qdiscs[ntx];
+ qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+ if (qdisc)
+ qdisc_destroy(qdisc);
+ }
+ kfree(priv->qdiscs);
+ priv->qdiscs = NULL;
+}
+
+static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
+ unsigned long cl)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
+
+ if (ntx >= dev->num_tx_queues)
+ return NULL;
+ return netdev_get_tx_queue(dev, ntx);
+}
+
+static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+ struct Qdisc **old)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+ if (!dev_queue)
+ return -EINVAL;
+
+ if (dev->flags & IFF_UP)
+ dev_deactivate(dev);
+
+ *old = dev_graft_qdisc(dev_queue, new);
+
+ if (dev->flags & IFF_UP)
+ dev_activate(dev);
+
+ return 0;
+}
+
+static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ struct mqprio_sched *priv = qdisc_priv(sch);
+ unsigned char *b = skb_tail_pointer(skb);
+ struct tc_mqprio_qopt opt;
+ struct Qdisc *qdisc;
+ unsigned int i;
+
+ sch->q.qlen = 0;
+ memset(&sch->bstats, 0, sizeof(sch->bstats));
+ memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+ for (i = 0; i < dev->num_tx_queues; i++) {
+ qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+ spin_lock_bh(qdisc_lock(qdisc));
+ sch->q.qlen += qdisc->q.qlen;
+ sch->bstats.bytes += qdisc->bstats.bytes;
+ sch->bstats.packets += qdisc->bstats.packets;
+ sch->qstats.qlen += qdisc->qstats.qlen;
+ sch->qstats.backlog += qdisc->qstats.backlog;
+ sch->qstats.drops += qdisc->qstats.drops;
+ sch->qstats.requeues += qdisc->qstats.requeues;
+ sch->qstats.overlimits += qdisc->qstats.overlimits;
+ spin_unlock_bh(qdisc_lock(qdisc));
+ }
+
+ opt.num_tc = netdev_get_num_tc(dev);
+ memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
+ opt.hw = priv->hw_owned;
+
+ for (i = 0; i < netdev_get_num_tc(dev); i++) {
+ opt.count[i] = dev->tc_to_txq[i].count;
+ opt.offset[i] = dev->tc_to_txq[i].offset;
+ }
+
+ NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+ return skb->len;
+nla_put_failure:
+ nlmsg_trim(skb, b);
+ return -1;
+}
+
+static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
+{
+ struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+ if (!dev_queue)
+ return NULL;
+
+ return dev_queue->qdisc_sleeping;
+}
+
+static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ unsigned int ntx = TC_H_MIN(classid);
+
+ if (ntx > dev->num_tx_queues + netdev_get_num_tc(dev))
+ return 0;
+ return ntx;
+}
+
+static void mqprio_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ struct net_device *dev = qdisc_dev(sch);
+
+ if (cl <= netdev_get_num_tc(dev)) {
+ tcm->tcm_parent = TC_H_ROOT;
+ tcm->tcm_info = 0;
+ } else {
+ int i;
+ struct netdev_queue *dev_queue;
+
+ dev_queue = mqprio_queue_get(sch, cl);
+ tcm->tcm_parent = 0;
+ for (i = 0; i < netdev_get_num_tc(dev); i++) {
+ struct netdev_tc_txq tc = dev->tc_to_txq[i];
+ int q_idx = cl - netdev_get_num_tc(dev);
+
+ if (q_idx > tc.offset &&
+ q_idx <= tc.offset + tc.count) {
+ tcm->tcm_parent =
+ TC_H_MAKE(TC_H_MAJ(sch->handle),
+ TC_H_MIN(i + 1));
+ break;
+ }
+ }
+ tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
+ }
+ tcm->tcm_handle |= TC_H_MIN(cl);
+ return 0;
+}
+
+static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+ struct gnet_dump *d)
+{
+ struct net_device *dev = qdisc_dev(sch);
+
+ if (cl <= netdev_get_num_tc(dev)) {
+ int i;
+ struct Qdisc *qdisc;
+ struct gnet_stats_queue qstats = {0};
+ struct gnet_stats_basic_packed bstats = {0};
+ struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
+
+ /* Drop the lock here; it will be reclaimed before touching
+ * statistics. This is required because the d->lock we
+ * hold here is the lock on dev_queue->qdisc_sleeping,
+ * which is also acquired below.
+ */
+ spin_unlock_bh(d->lock);
+
+ for (i = tc.offset; i < tc.offset + tc.count; i++) {
+ qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+ spin_lock_bh(qdisc_lock(qdisc));
+ bstats.bytes += qdisc->bstats.bytes;
+ bstats.packets += qdisc->bstats.packets;
+ qstats.qlen += qdisc->qstats.qlen;
+ qstats.backlog += qdisc->qstats.backlog;
+ qstats.drops += qdisc->qstats.drops;
+ qstats.requeues += qdisc->qstats.requeues;
+ qstats.overlimits += qdisc->qstats.overlimits;
+ spin_unlock_bh(qdisc_lock(qdisc));
+ }
+ /* Reclaim root sleeping lock before completing stats */
+ spin_lock_bh(d->lock);
+ if (gnet_stats_copy_basic(d, &bstats) < 0 ||
+ gnet_stats_copy_queue(d, &qstats) < 0)
+ return -1;
+ } else {
+ struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+ sch = dev_queue->qdisc_sleeping;
+ sch->qstats.qlen = sch->q.qlen;
+ if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
+ gnet_stats_copy_queue(d, &sch->qstats) < 0)
+ return -1;
+ }
+ return 0;
+}
+
+static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ struct net_device *dev = qdisc_dev(sch);
+ unsigned long ntx;
+
+ if (arg->stop)
+ return;
+
+ /* Walk hierarchy with a virtual class per tc */
+ arg->count = arg->skip;
+ for (ntx = arg->skip;
+ ntx < dev->num_tx_queues + netdev_get_num_tc(dev);
+ ntx++) {
+ if (arg->fn(sch, ntx + 1, arg) < 0) {
+ arg->stop = 1;
+ break;
+ }
+ arg->count++;
+ }
+}
+
+static const struct Qdisc_class_ops mqprio_class_ops = {
+ .graft = mqprio_graft,
+ .leaf = mqprio_leaf,
+ .get = mqprio_get,
+ .put = mqprio_put,
+ .walk = mqprio_walk,
+ .dump = mqprio_dump_class,
+ .dump_stats = mqprio_dump_class_stats,
+};
+
+struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
+ .cl_ops = &mqprio_class_ops,
+ .id = "mqprio",
+ .priv_size = sizeof(struct mqprio_sched),
+ .init = mqprio_init,
+ .destroy = mqprio_destroy,
+ .attach = mqprio_attach,
+ .dump = mqprio_dump,
+ .owner = THIS_MODULE,
+};
+
+static int __init mqprio_module_init(void)
+{
+ return register_qdisc(&mqprio_qdisc_ops);
+}
+
+static void __exit mqprio_module_exit(void)
+{
+ unregister_qdisc(&mqprio_qdisc_ops);
+}
+
+module_init(mqprio_module_init);
+module_exit(mqprio_module_exit);
+
+MODULE_LICENSE("GPL");
* [net-next-2.6 PATCH v6 1/2] net: implement mechanism for HW based QOS
From: John Fastabend @ 2011-01-07 3:12 UTC (permalink / raw)
To: davem, jarkao2
Cc: hadi, eric.dumazet, shemminger, tgraf, bhutchings, nhorman,
netdev
This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid head-of-line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.
Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.
By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.
With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
a single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.
To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 15 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
This, in conjunction with a userspace application such as
lldpad, can be used to implement 802.1Q transmission selection
algorithms, one of these being the enhanced transmission
selection algorithm currently being used for DCB.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
include/linux/netdevice.h | 65 +++++++++++++++++++++++++++++++++++++++++++++
net/core/dev.c | 52 +++++++++++++++++++++++++++++++++++-
2 files changed, 116 insertions(+), 1 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f6b1c9..12fff42 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,14 @@ struct xps_dev_maps {
(nr_cpu_ids * sizeof(struct xps_map *)))
#endif /* CONFIG_XPS */
+#define TC_MAX_QUEUE 16
+#define TC_BITMASK 15
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+ u16 count;
+ u16 offset;
+};
+
/*
* This structure defines the management hooks for network devices.
* The following hooks can be defined; unless noted otherwise, they are
@@ -756,6 +764,7 @@ struct xps_dev_maps {
* int (*ndo_set_vf_port)(struct net_device *dev, int vf,
* struct nlattr *port[]);
* int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ * int (*ndo_setup_tc)(struct net_device *dev, u8 tc, unsigned int txq);
*/
#define HAVE_NET_DEVICE_OPS
struct net_device_ops {
@@ -814,6 +823,8 @@ struct net_device_ops {
struct nlattr *port[]);
int (*ndo_get_vf_port)(struct net_device *dev,
int vf, struct sk_buff *skb);
+ int (*ndo_setup_tc)(struct net_device *dev, u8 tc,
+ unsigned int txq);
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
int (*ndo_fcoe_enable)(struct net_device *dev);
int (*ndo_fcoe_disable)(struct net_device *dev);
@@ -1146,6 +1157,9 @@ struct net_device {
/* Data Center Bridging netlink ops */
const struct dcbnl_rtnl_ops *dcbnl_ops;
#endif
+ u8 num_tc;
+ struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
+ u8 prio_tc_map[TC_BITMASK + 1];
#if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1176,57 @@ struct net_device {
#define NETDEV_ALIGN 32
static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+ return dev->prio_tc_map[prio & TC_BITMASK];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+ if (tc >= dev->num_tc)
+ return -EINVAL;
+
+ dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
+ return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+ dev->num_tc = 0;
+ memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+ memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+ if (tc >= dev->num_tc)
+ return -EINVAL;
+
+ dev->tc_to_txq[tc].count = count;
+ dev->tc_to_txq[tc].offset = offset;
+ return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+ if (num_tc > TC_MAX_QUEUE)
+ return -EINVAL;
+
+ dev->num_tc = num_tc;
+ return 0;
+}
+
+static inline
+int netdev_get_num_tc(struct net_device *dev)
+{
+ return dev->num_tc;
+}
+
+static inline
struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
unsigned int index)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index a215269..12a2c2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1593,6 +1593,45 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
rcu_read_unlock();
}
+/* netif_setup_tc - Handle tc mappings on real_num_tx_queues change
+ * @dev: Network device
+ * @txq: number of queues available
+ *
+ * If real_num_tx_queues is changed the tc mappings may no longer be
+ * valid. To resolve this if the net_device supports ndo_setup_tc
+ * call the ops routine with the new queue number. If the ops is not
+ * available verify the tc mapping remains valid and if not NULL the
+ * mapping. With no priorities mapping to this offset/count pair it
+ * will no longer be used. In the worst case TC0 is invalid nothing
+ * can be done so disable priority mappings.
+ */
+void netif_setup_tc(struct net_device *dev, unsigned int txq)
+{
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ if (ops->ndo_setup_tc) {
+ ops->ndo_setup_tc(dev, dev->num_tc, txq);
+ } else {
+ int i;
+ struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
+
+ /* If TC0 is invalidated disable TC mapping */
+ if (tc->offset + tc->count > txq) {
+ dev->num_tc = 0;
+ return;
+ }
+
+ /* Invalidated prio to tc mappings set to TC0 */
+ for (i = 1; i < TC_BITMASK + 1; i++) {
+ int q = netdev_get_prio_tc_map(dev, i);
+ tc = &dev->tc_to_txq[q];
+
+ if (tc->offset + tc->count > txq)
+ netdev_set_prio_tc_map(dev, i, 0);
+ }
+ }
+}
+
/*
* Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
* greater then real_num_tx_queues stale skbs on the qdisc must be flushed.
@@ -1614,6 +1653,9 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
if (txq < dev->real_num_tx_queues)
qdisc_reset_all_tx_gt(dev, txq);
+
+ if (dev->num_tc)
+ netif_setup_tc(dev, txq);
}
dev->real_num_tx_queues = txq;
@@ -2165,6 +2207,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
unsigned int num_tx_queues)
{
u32 hash;
+ u16 qoffset = 0;
+ u16 qcount = num_tx_queues;
if (skb_rx_queue_recorded(skb)) {
hash = skb_get_rx_queue(skb);
@@ -2173,13 +2217,19 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
return hash;
}
+ if (dev->num_tc) {
+ u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+ qoffset = dev->tc_to_txq[tc].offset;
+ qcount = dev->tc_to_txq[tc].count;
+ }
+
if (skb->sk && skb->sk->sk_hash)
hash = skb->sk->sk_hash;
else
hash = (__force u16) skb->protocol ^ skb->rxhash;
hash = jhash_1word(hash, hashrnd);
- return (u16) (((u64) hash * num_tx_queues) >> 32);
+ return (u16) (((u64) hash * qcount) >> 32) + qoffset;
}
EXPORT_SYMBOL(__skb_tx_hash);
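For reference, the queue selection at the end of __skb_tx_hash() can be
tried out in userspace. The following is a minimal sketch, assuming the
32-bit hash has already been computed (the kernel runs it through
jhash_1word() first, which is skipped here). pick_txq() is a made-up name
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a 32-bit hash into [0, qcount) without a division, then shift
 * the result into the traffic class's queue range by adding qoffset. */
static uint16_t pick_txq(uint32_t hash, uint16_t qoffset, uint16_t qcount)
{
	assert(qcount > 0);
	return (uint16_t)((((uint64_t)hash * qcount) >> 32) + qoffset);
}
```

Because the hash is strictly less than 2^32, the scaled value is always
below qcount, so the result stays within qoffset .. qoffset + qcount - 1.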
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07 3:04 UTC (permalink / raw)
To: Matt Carlson
Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107025930.GA17808@mcarlson.broadcom.com>
On Thursday, 06 January 2011 at 18:59 -0800, Matt Carlson wrote:
> On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
> > > On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
> > >
> > > > Hi Eric. Sorry for the delay. I was under the impression that your
> > > > problems were software related and that you just needed a revised
> > > > version of these VLAN patches I was sending to Michael. Is this not
> > > > true?
> > > >
> > > > Having a hardware stat increment suggests this is a new problem.
> > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > with and whether or not management firmware was enabled. Could you tell
> > > > me that info?
> > > >
> > >
> > > Hi Matt
> > >
> > > I started a bisection, because I couldn't sleep tonight anyway :(
> > >
> > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > Gigabit Ethernet (rev a3)
> > > Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > Capabilities: [40] PCI-X non-bridge device
> > > Capabilities: [48] Power Management version 2
> > > Capabilities: [50] Vital Product Data
> > > Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > Kernel driver in use: tg3
> > > Kernel modules: tg3
> > >
> > >
> >
> > $ ethtool -i eth2
> > driver: tg3
> > version: 3.115
> > firmware-version: 5715s-v3.28
> > bus-info: 0000:14:04.0
> > $ dmesg | grep ASF
> > [ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > ASF[0] TSOcap[1]
> > [ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > ASF[0] TSOcap[1]
>
> Thanks. So management firmware is disabled. This should be
> a straightforward case.
>
> I'm wondering if I'm misunderstanding something though. You said earlier
> that VLAN tagging doesn't work unless you applied my patch. Is this no
> longer true?
>
I didn't apply your patch because Jesse said it was not a good patch ;)
Maybe I missed something and it must be applied? The problem is that the
current Linus tree now includes net-next-2.6 and VLAN doesn't work.
Perhaps you should resubmit it?
* Re: [PATCH v2] net: ppp: use {get,put}_unaligned_be{16,32}
From: Paul Mackerras @ 2011-01-07 3:01 UTC (permalink / raw)
To: Changli Gao; +Cc: David S. Miller, Harvey Harrison, linux-ppp, netdev
In-Reply-To: <1294357056-25889-1-git-send-email-xiaosuo@gmail.com>
On Fri, Jan 07, 2011 at 07:37:36AM +0800, Changli Gao wrote:
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
This patch description is inadequate. It should tell us why you are
making this change. Does it result in smaller and/or faster code, and
if so by how much on what sort of machine? Do you think it makes the
code clearer? (I don't.) Or is there some other motivation for this?
Paul.
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07 2:59 UTC (permalink / raw)
To: Eric Dumazet
Cc: Matthew Carlson, Jesse Gross, Michael Leun, Michael Chan,
David Miller, Ben Greear, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <1294368202.2704.50.camel@edumazet-laptop>
On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
> > On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
> >
> > > Hi Eric. Sorry for the delay. I was under the impression that your
> > > problems were software related and that you just needed a revised
> > > version of these VLAN patches I was sending to Michael. Is this not
> > > true?
> > >
> > > Having a hardware stat increment suggests this is a new problem.
> > > Maybe I missed it, but I didn't see what hardware you are working
> > > with and whether or not management firmware was enabled. Could you tell
> > > me that info?
> > >
> >
> > Hi Matt
> >
> > I started a bisection, because I couldn't sleep tonight anyway :(
> >
> > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > Gigabit Ethernet (rev a3)
> > Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > Capabilities: [40] PCI-X non-bridge device
> > Capabilities: [48] Power Management version 2
> > Capabilities: [50] Vital Product Data
> > Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > Kernel driver in use: tg3
> > Kernel modules: tg3
> >
> >
>
> $ ethtool -i eth2
> driver: tg3
> version: 3.115
> firmware-version: 5715s-v3.28
> bus-info: 0000:14:04.0
> $ dmesg | grep ASF
> [ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> ASF[0] TSOcap[1]
> [ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> ASF[0] TSOcap[1]
Thanks. So management firmware is disabled. This should be
a straightforward case.
I'm wondering if I'm misunderstanding something though. You said earlier
that VLAN tagging doesn't work unless you applied my patch. Is this no
longer true?
* Re: [Bugme-new] [Bug 25062] New: Bonding packet deduplication doesn't work properly anymore
From: Jay Vosburgh @ 2011-01-07 2:47 UTC (permalink / raw)
To: Andrew Morton
Cc: netdev, bugzilla-daemon, bugme-new, bugme-daemon, kevin.lapagna
In-Reply-To: <20110104133936.60d389e2.akpm@linux-foundation.org>
Andrew Morton <akpm@linux-foundation.org> wrote:
>On Fri, 17 Dec 2010 11:45:18 GMT
>bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=25062
>>
>> Summary: Bonding packet deduplication doesn't work properly
>> anymore
>> Product: Networking
>> Version: 2.5
>> Kernel Version: > 2.6.33
>> Platform: All
>> OS/Version: Linux
>> Tree: Mainline
>> Status: NEW
>> Severity: high
>> Priority: P1
>> Component: Other
>> AssignedTo: acme@ghostprotocols.net
>> ReportedBy: kevin.lapagna@bigtag.ch
>> Regression: No
>>
>>
>> Here's the setup:
>>
>> switch: ordinary cisco switch
>> eth0: NIC with kernel module tg3
>> eth1: NIC with kernel module e1000e
>> bond0: bond with slaves eth0,eth1 in mode 1 (or 5)
>> bond0.100: vlan device created with vconfig
>> bridge100: bridge created with brctl
>> tap1: tap device created with tunctl
>> vguest: qemu-kvm vguest with emulated e1000 NIC
>>
>>
>>  ________                                                         ________
>> |        |-- eth0 \                                              |        |
>> | switch |         -- bond0 -- bond0.100 -- bridge100 -- tap1 -- | vguest |
>> |________|-- eth1 /                                              |________|
>>
>> When the vguest emits an ethernet broadcast (DHCP-request), it's forwarded all
>> the way up to the switch, through eth0. The switch forwards the broadcast -
>> also to eth1. The packet travels then all the way back to bridge100. So the
>> last status known for bridge100, regarding the MAC address of the vguest, is
>> that it is behind bond0.100 (instead of tap1). If a DHCP server responds to the
>> request, the packet travels to bridge100, which now has a faulty
>> MAC address table, and the packet will be rejected and never reaches tap1 and
>> therefore not the vguest.
>>
>> I witnessed this wrong behavior in kernel 2.6.37-rc5 (debian package), 2.6.36.2
>> and 2.6.35.9 (self compiled - vanilla). The setup has worked with kernels <=
>> 2.6.33.7. I've never tried 2.6.34.
>>
>> I assume the setup above is a common way for the separation of virtual guests
>> on a network level. So this could become a major issue for a lot of people when
>> upgrading their kernels.
Just a note that I have reproduced what I believe is the same
problem (I didn't use tap, and assigned an IP to the bridge). I used
arping to generate ethernet broadcasts. I see the problem on 2.6.36.2,
but not on today's net-next-2.6.
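For readers following the report, the symptom described above is what
ordinary MAC learning would produce if the looped-back broadcast reaches
the bridge: a learning bridge remembers, per source MAC, the port a frame
last arrived on. The following is a toy model of that behaviour only. It
is not kernel code and says nothing about the root cause; the port names,
struct toy_bridge, and the bridge_* helpers are made up for illustration.

```c
#include <assert.h>
#include <string.h>

enum port { PORT_NONE = 0, PORT_TAP1, PORT_BOND0_100 };

struct fdb_entry {
	unsigned char mac[6];
	enum port port;
};

#define FDB_SIZE 8

struct toy_bridge {
	struct fdb_entry fdb[FDB_SIZE];
	int used;
};

/* Learn: record (or move) the source MAC to the port the frame came in on. */
static void bridge_learn(struct toy_bridge *br, const unsigned char *src,
			 enum port in)
{
	for (int i = 0; i < br->used; i++) {
		if (memcmp(br->fdb[i].mac, src, 6) == 0) {
			br->fdb[i].port = in;	/* entry moves to the new port */
			return;
		}
	}
	assert(br->used < FDB_SIZE);
	memcpy(br->fdb[br->used].mac, src, 6);
	br->fdb[br->used].port = in;
	br->used++;
}

/* Lookup: which port would a frame addressed to dst be sent to? */
static enum port bridge_lookup(const struct toy_bridge *br,
			       const unsigned char *dst)
{
	for (int i = 0; i < br->used; i++)
		if (memcmp(br->fdb[i].mac, dst, 6) == 0)
			return br->fdb[i].port;
	return PORT_NONE;
}
```

In this model, learning the guest MAC on tap1 and then again on bond0.100
(the duplicate coming back via eth1) leaves the entry pointing at
bond0.100, so a reply arriving on bond0.100 for the guest has nowhere
useful to go.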
I'll see if I can dig up the root cause tomorrow.
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07 2:43 UTC (permalink / raw)
To: Matt Carlson
Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1294368071.2704.49.camel@edumazet-laptop>
On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
> On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
>
> > Hi Eric. Sorry for the delay. I was under the impression that your
> > problems were software related and that you just needed a revised
> > version of these VLAN patches I was sending to Michael. Is this not
> > true?
> >
> > Having a hardware stat increment suggests this is a new problem.
> > Maybe I missed it, but I didn't see what hardware you are working
> > with and whether or not management firmware was enabled. Could you tell
> > me that info?
> >
>
> Hi Matt
>
> I started a bisection, because I couldn't sleep tonight anyway :(
>
> 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> Gigabit Ethernet (rev a3)
> Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> [virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> Capabilities: [40] PCI-X non-bridge device
> Capabilities: [48] Power Management version 2
> Capabilities: [50] Vital Product Data
> Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> Kernel driver in use: tg3
> Kernel modules: tg3
>
>
$ ethtool -i eth2
driver: tg3
version: 3.115
firmware-version: 5715s-v3.28
bus-info: 0000:14:04.0
$ dmesg | grep ASF
[ 6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
ASF[0] TSOcap[1]
[ 6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
ASF[0] TSOcap[1]
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07 2:41 UTC (permalink / raw)
To: Matt Carlson
Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107022912.GA17757@mcarlson.broadcom.com>
On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
> Hi Eric. Sorry for the delay. I was under the impression that your
> problems were software related and that you just needed a revised
> version of these VLAN patches I was sending to Michael. Is this not
> true?
>
> Having a hardware stat increment suggests this is a new problem.
> Maybe I missed it, but I didn't see what hardware you are working
> with and whether or not management firmware was enabled. Could you tell
> me that info?
>
Hi Matt
I started a bisection, because I couldn't sleep tonight anyway :(
14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
Gigabit Ethernet (rev a3)
Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
Capabilities: [40] PCI-X non-bridge device
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
Kernel driver in use: tg3
Kernel modules: tg3