Netdev List


* Re: [PATCH 02/10] GRETH: added option to disable a device node from bootloader.
From: David Miller @ 2011-01-14  6:12 UTC (permalink / raw)
  To: daniel; +Cc: netdev, kristoffer
In-Reply-To: <1294907135-24884-2-git-send-email-daniel@gaisler.com>

From: Daniel Hellstrom <daniel@gaisler.com>
Date: Thu, 13 Jan 2011 09:25:27 +0100

> Signed-off-by: Daniel Hellstrom <daniel@gaisler.com>

This is not how you do this.

Simply do not present the device in the OpenFirmware tree at all.  If you
can make this special property appear, you can also toss the device
node away completely.

There is zero reason whatsoever to create a special hack-job
non-standardized device node property to do this.

^ permalink raw reply

* Re: [PATCH v2 net-next-2.6] netdev: tilepro: Use is_unicast_ether_addr helper
From: David Miller @ 2011-01-14  5:51 UTC (permalink / raw)
  To: tklauser; +Cc: cmetcalf, netdev
In-Reply-To: <1294906508-14999-1-git-send-email-tklauser@distanz.ch>

From: Tobias Klauser <tklauser@distanz.ch>
Date: Thu, 13 Jan 2011 09:15:08 +0100

> Use is_unicast_ether_addr from linux/etherdevice.h instead of custom
> macros.
> 
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6] etherdevice.h: Add is_unicast_ether_addr function
From: David Miller @ 2011-01-14  5:51 UTC (permalink / raw)
  To: tklauser; +Cc: cmetcalf, netdev
In-Reply-To: <1294906496-14950-1-git-send-email-tklauser@distanz.ch>

From: Tobias Klauser <tklauser@distanz.ch>
Date: Thu, 13 Jan 2011 09:14:56 +0100

> From a check for !is_multicast_ether_addr it is not always obvious that
> we're checking for a unicast address. So add this helper function to
> make those code paths easier to read.
> 
> Signed-off-by: Tobias Klauser <tklauser@distanz.ch>

Applied.
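
The helper discussed above can be illustrated with a small userspace sketch. It assumes plain 6-byte arrays for addresses and reuses the kernel function names only for readability; it is not the code from linux/etherdevice.h. The point is that an address is unicast exactly when the group bit (the low bit of the first octet) is clear, so the unicast test is the negation of the multicast test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Multicast and broadcast addresses have the low bit of octet 0 set. */
static bool is_multicast_ether_addr(const uint8_t *addr)
{
	assert(addr != NULL);
	return (addr[0] & 0x01) != 0;
}

/* Readable spelling of "!is_multicast_ether_addr(addr)". */
static bool is_unicast_ether_addr(const uint8_t *addr)
{
	return !is_multicast_ether_addr(addr);
}
```

A caller can then write `if (is_unicast_ether_addr(addr))` instead of a negated multicast check, which is the readability gain the patch description argues for.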

^ permalink raw reply

* Re: [PATCH net-next-2.6] netdev: bfin_mac: Remove is_multicast_ether_addr use in netdev_for_each_mc_addr
From: David Miller @ 2011-01-14  5:51 UTC (permalink / raw)
  To: joe; +Cc: vapier.adi, tklauser, michael.hennerich, uclinux-dist-devel,
	netdev
In-Reply-To: <1294891685.4114.29.camel@Joe-Laptop>

From: Joe Perches <joe@perches.com>
Date: Wed, 12 Jan 2011 20:08:04 -0800

> Remove code that has no effect.
> 
> Signed-off-by: Joe Perches <joe@perches.com>

Applied.

^ permalink raw reply

* Re: [PATCH 2/2] ks8695net: Use default implementation of ethtool_ops::get_link
From: David Miller @ 2011-01-14  5:51 UTC (permalink / raw)
  To: bhutchings; +Cc: figo1802, zealcook, netdev
In-Reply-To: <1294941171.3946.49.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 13 Jan 2011 17:52:51 +0000

> This is completely untested as I don't have an ARM build environment.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
 ...
> -/**
>   *	ks8695_wan_get_pause - Retrieve network pause/flow-control
> advertising
>   *	@ndev: The device to retrieve settings from
>   *	@param: The structure to fill out with the information

Line-wrap damage from your mail client.

> @@ -1058,7 +1044,7 @@ static const struct ethtool_ops
> ks8695_wan_ethtool_ops = {
>  	.get_settings	= ks8695_wan_get_settings,

Same here.

I fixed this up and applied your patch.

^ permalink raw reply

* Re: [PATCH] vxge: Remember to release firmware after upgrading firmware
From: David Miller @ 2011-01-14  5:50 UTC (permalink / raw)
  To: jj
  Cc: netdev, linux-kernel, ramkrishna.vepa, jon.mason,
	sivakumar.subramani, sreenivasa.honnur
In-Reply-To: <alpine.LNX.2.00.1101132119080.11347@swampdragon.chaosbits.net>

From: Jesper Juhl <jj@chaosbits.net>
Date: Thu, 13 Jan 2011 21:25:20 +0100 (CET)

> Regardless of whether the firmware update being performed by 
> vxge_fw_upgrade() is a success or not we must still remember to always 
> release_firmware() before returning.
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Applied.
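
The rule stated in the patch description (release the firmware on both the success and the failure path) is commonly written with a single exit label. The sketch below is a userspace model only: `fake_request_firmware()`, `fake_release_firmware()`, `fake_fw_upgrade()` and the `fw_live` counter are hypothetical stand-ins, not the vxge driver code or the kernel firmware API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a loaded firmware image. */
struct fake_fw {
	size_t size;
	unsigned char *data;
};

static int fw_live;	/* number of images currently held, for checking leaks */

static int fake_request_firmware(struct fake_fw **out, size_t size)
{
	struct fake_fw *fw = malloc(sizeof(*fw));

	if (fw == NULL)
		return -1;
	fw->data = calloc(1, size);
	if (fw->data == NULL) {
		free(fw);
		return -1;
	}
	fw->size = size;
	fw_live++;
	*out = fw;
	return 0;
}

static void fake_release_firmware(struct fake_fw *fw)
{
	assert(fw != NULL);
	free(fw->data);
	free(fw);
	fw_live--;
}

/* flash_ok models whether writing the image to the device succeeded. */
static int fake_fw_upgrade(int flash_ok)
{
	struct fake_fw *fw;
	int ret;

	ret = fake_request_firmware(&fw, 16);
	if (ret)
		return ret;	/* nothing was acquired, nothing to release */

	if (!flash_ok) {
		ret = -1;
		goto out;	/* failure still goes through the release */
	}
	ret = 0;
out:
	fake_release_firmware(fw);
	return ret;
}
```

Routing every path after a successful request through the same label makes it hard to add a new early return that forgets the release.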

^ permalink raw reply

* Re: [PATCH 1/2] ks8695net: Disable non-working ethtool operations
From: David Miller @ 2011-01-14  5:50 UTC (permalink / raw)
  To: bhutchings; +Cc: figo1802, zealcook, netdev
In-Reply-To: <1294941014.3946.46.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Thu, 13 Jan 2011 17:50:14 +0000

> Some ethtool operations can only be implemented for the WAN port, and
> not all such operations are allowed to return an error code such as
> -EOPNOTSUPP.  Therefore, define two separate ethtool_ops structures
> for WAN and non-WAN ports; simplify and rename the WAN-only functions.
> 
> This is completely untested as I don't have an ARM build environment.
> 
> Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>

Applied.

^ permalink raw reply

* Re: [PATCH] USB CDC NCM: Don't deref NULL in cdc_ncm_rx_fixup() and don't use uninitialized variable.
From: David Miller @ 2011-01-14  5:50 UTC (permalink / raw)
  To: jj
  Cc: linux-kernel, oliver, gregkh, linux-usb, netdev, alexey.orishko,
	hans.petter.selasky
In-Reply-To: <alpine.LNX.2.00.1101132229260.11347@swampdragon.chaosbits.net>

From: Jesper Juhl <jj@chaosbits.net>
Date: Thu, 13 Jan 2011 22:40:11 +0100 (CET)

> skb_clone() dynamically allocates memory and may fail. If it does it 
> returns NULL. This means we'll dereference a NULL pointer in 
> drivers/net/usb/cdc_ncm.c::cdc_ncm_rx_fixup().
> As far as I can tell, the proper way to deal with this is simply to goto 
> the error label.
> 
> Furthermore gcc complains that 'skb' may be used uninitialized:
>   drivers/net/usb/cdc_ncm.c: In function ‘cdc_ncm_rx_fixup’:
>   drivers/net/usb/cdc_ncm.c:922:18: warning: ‘skb’ may be used uninitialized in this function
> and I believe it is right. On the line where we
>   pr_debug("invalid frame detected (ignored)" ...
> we are using the local variable 'skb' but nothing has ever been assigned 
> to that variable yet. I believe the correct fix for that is to use 
> 'skb_in' instead.
> 
> Signed-off-by: Jesper Juhl <jj@chaosbits.net>

Applied.

^ permalink raw reply

* Re: [PATCH] r8169: keep firmware in memory.
From: David Miller @ 2011-01-14  5:50 UTC (permalink / raw)
  To: romieu; +Cc: netdev, jarek, hayeswang, benh, torvalds
In-Reply-To: <20110113230753.GA2750@electric-eye.fr.zoreil.com>

From: Francois Romieu <romieu@fr.zoreil.com>
Date: Fri, 14 Jan 2011 00:07:53 +0100

> The firmware agent is not available during resume. Loading the firmware
> during open() (see eee3a96c6368f47df8df5bd4ed1843600652b337) is not
> enough.
> 
> close() is run during resume through rtl8169_reset_task(), whence the
> mildly natural release of firmware in the driver removal method instead.
> 
> It will help with http://bugs.debian.org/609538. It will not avoid
> the 60 seconds delay when:
> - there is no firmware
> - the driver is loaded and the device is not up before a suspend/resume
> 
> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
> Tested-by: Jarek Kamiński <jarek@vilo.eu.org>
> Cc: Hayes <hayeswang@realtek.com>
> Cc: Ben Hutchings <benh@debian.org>

Applied.

^ permalink raw reply

* Re: [PATCH] ipsec: update MAX_AH_AUTH_LEN to support sha512
From: David Miller @ 2011-01-14  5:50 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <4D2F73C7.8010107@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Thu, 13 Jan 2011 22:51:03 +0100

> From e330817aa2b33e9d1f44071072fdc4778acf8d76 Mon Sep 17 00:00:00 2001
> From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Date: Thu, 13 Jan 2011 14:20:19 -0500
> Subject: [PATCH] ipsec: update MAX_AH_AUTH_LEN to support sha512
> 
> icv_truncbits is set to 256 for sha512, so update
> MAX_AH_AUTH_LEN to 64.
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied.

^ permalink raw reply

* Re: [PATCH v4 05/10] net/fec: add dual fec support for mx28
From: Shawn Guo @ 2011-01-14  5:48 UTC (permalink / raw)
  To: Uwe Kleine-König
  Cc: davem, gerg, baruch, eric, bryan.wu, r64343, B32542, lw, w.sang,
	s.hauer, jamie, jamie, netdev, linux-arm-kernel
In-Reply-To: <20110113144805.GS24920@pengutronix.de>

Hi Uwe,

On Thu, Jan 13, 2011 at 03:48:05PM +0100, Uwe Kleine-König wrote:

[...]

> > +/* Controller is ENET-MAC */
> > +#define FEC_QUIRK_ENET_MAC           (1 << 0)
> does this really qualify to be a quirk?
> 
My understanding is that ENET-MAC is a type of "quirky" FEC
controller.

> > +/* Controller needs driver to swap frame */
> > +#define FEC_QUIRK_SWAP_FRAME         (1 << 1)
> IMHO this is a bit misnamed.  FEC_QUIRK_NEEDS_BE_DATA or similar would
> be more accurate.
> 
When you make this change, you may want to pick a better name for
function swap_buffer too.

[...]

> > +static void *swap_buffer(void *bufaddr, int len)
> > +{
> > +     int i;
> > +     unsigned int *buf = bufaddr;
> > +
> > +     for (i = 0; i < (len + 3) / 4; i++, buf++)
> > +             *buf = cpu_to_be32(*buf);
> if len isn't a multiple of 4 this accesses bytes behind len.  Is this
> generally OK here?  (E.g. because skbs always have a length that is a
> multiple of 4?)
The len may not be a multiple of 4.  But I believe bufaddr is always
a buffer allocated in a length that is a multiple of 4, and the 1~3
bytes exceeding the len very likely has no data that matters.  But
yes, it deserves a safer implementation.
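
One possible shape for the "safer implementation" mentioned above, as a userspace sketch rather than the driver's eventual fix. It assumes the caller knows the buffer capacity (`cap`), and it reverses bytes unconditionally, which models what cpu_to_be32() does on a little-endian CPU. The name `swap_buffer_checked()` and the `cap` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Reverse the bytes of every 32-bit word covering buf[0..len).
 * The last word is zero-padded first, so stale bytes past len are
 * never swapped into the frame. Returns 0 on success, or -1 if the
 * padded length would not fit within cap (buffer left untouched).
 */
static int swap_buffer_checked(uint8_t *buf, size_t len, size_t cap)
{
	size_t padded = (len + 3) & ~(size_t)3;
	size_t i;

	assert(buf != NULL);
	if (padded < len || padded > cap)
		return -1;	/* overflow, or rounding up would overrun */

	memset(buf + len, 0, padded - len);
	for (i = 0; i < padded; i += 4) {
		uint8_t t;

		t = buf[i];
		buf[i] = buf[i + 3];
		buf[i + 3] = t;
		t = buf[i + 1];
		buf[i + 1] = buf[i + 2];
		buf[i + 2] = t;
	}
	return 0;
}
```

The explicit capacity check replaces the belief that the allocation is always a multiple of 4 with something the code can verify.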

[...]

> > +     /*
> > +      * The dual fec interfaces are not equivalent with enet-mac.
> > +      * Here are the differences:
> > +      *
> > +      *  - fec0 supports MII & RMII modes while fec1 only supports RMII
> > +      *  - fec0 acts as the 1588 time master while fec1 is slave
> > +      *  - external phys can only be configured by fec0
> > +      *
> > +      * That is to say fec1 can not work independently. It only works
> > +      * when fec0 is working. The reason behind this design is that the
> > +      * second interface is added primarily for Switch mode.
> > +      *
> > +      * Because of the last point above, both phys are attached on fec0
> > +      * mdio interface in board design, and need to be configured by
> > +      * fec0 mii_bus.
> > +      */
> > +     if ((id_entry->driver_data & FEC_QUIRK_ENET_MAC) && pdev->id) {
> > +             /* fec1 uses fec0 mii_bus */
> > +             fep->mii_bus = fec0_mii_bus;
> > +             return 0;
> What happens if imx28-fec.1 is probed before imx28-fec.0?
It's something that generally should not happen, as these two fec are
not equivalent, and fec.1 should always be added after fec.0 if you
intend to get dual interfaces.  But yes, we should add error checking
for this case in the driver.
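
The error check suggested here could look roughly like the following model. It is a userspace sketch under stated assumptions: `struct fec_model`, `struct mii_bus_model`, `fec_model_mii_init()` and the choice of -ENOENT are all hypothetical, and the real driver may pick a different error code or mechanism.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's mii_bus and private data. */
struct mii_bus_model {
	int id;
};

struct fec_model {
	int id;				/* 0 for fec0, 1 for fec1 */
	struct mii_bus_model *mii_bus;
};

static struct mii_bus_model bus0_storage;
static struct mii_bus_model *fec0_bus_model;	/* NULL until fec0 probes */

static int fec_model_mii_init(struct fec_model *fep)
{
	assert(fep != NULL);

	if (fep->id > 0) {
		/* fec1 borrows fec0's bus, so fec0 must have probed first. */
		if (fec0_bus_model == NULL)
			return -ENOENT;
		fep->mii_bus = fec0_bus_model;
		return 0;
	}

	/* fec0 registers the bus that both interfaces share. */
	fec0_bus_model = &bus0_storage;
	fep->mii_bus = fec0_bus_model;
	return 0;
}
```

With the check in place, probing the second interface first fails cleanly instead of leaving it with a NULL bus pointer.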

-- 
Regards,
Shawn


^ permalink raw reply

* Re: [PATCH] net: remove dev_txq_stats_fold()
From: David Miller @ 2011-01-14  5:45 UTC (permalink / raw)
  To: eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w
  Cc: jarkao2-Re5JQEeQqe8AvxtiuMwx3w, david-b-yBeKhBN/0LDR7s880joybQ,
	neiljay-Re5JQEeQqe8AvxtiuMwx3w, linux-usb-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w,
	jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w,
	mina86-deATy8a+UHjQT0dZR+AlfA,
	sandeep.kumar-KZfg59tc24xl57MIdRCFDg
In-Reply-To: <1294870394.3335.75.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Date: Wed, 12 Jan 2011 23:13:14 +0100

> [PATCH v2] net: remove dev_txq_stats_fold()
> 
> After recent changes, (percpu stats on vlan/tunnels...), we dont need
> anymore per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.
> 
> Only remaining users are ixgbe, sch_teql, gianfar & macvlan :
> 
> 1) ixgbe can be converted to use existing tx_ring counters.
> 
> 2) macvlan incremented txq->tx_dropped, it can use the
> dev->stats.tx_dropped counter.
> 
> 3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats)
>     Now we have ndo_get_stats64(), use it, even for "unsigned long"
> fields (No need to bring back a struct net_device_stats)
> 
> 4) gianfar adds a stats structure per tx queue to hold
> tx_bytes/tx_packets
> 
> This removes a lockdep warning (and possible lockup) in rndis gadget,
> calling dev_get_stats() from hard IRQ context.
> 
> Ref: http://www.spinics.net/lists/netdev/msg149202.html
> 
> Reported-by: Neil Jones <neiljay-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> Signed-off-by: Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Jarek Poplawski <jarkao2-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> CC: Alexander Duyck <alexander.h.duyck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> CC: Jeff Kirsher <jeffrey.t.kirsher-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> CC: Sandeep Gopalpet <sandeep.kumar-KZfg59tc24xl57MIdRCFDg@public.gmane.org>
> CC: Michal Nazarewicz <mina86-deATy8a+UHjQT0dZR+AlfA@public.gmane.org>

Applied, thanks everyone.

^ permalink raw reply

* Re: [PATCH] net: add Faraday FTMAC100 10/100 Ethernet driver
From: Po-Yu Chuang @ 2011-01-14  5:37 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, linux-kernel, Po-Yu Chuang
In-Reply-To: <1294927387.3044.76.camel@localhost>

Dear Ben,

On Thu, Jan 13, 2011 at 10:03 PM, Ben Hutchings
<bhutchings@solarflare.com> wrote:
> On Thu, 2011-01-13 at 19:49 +0800, Po-Yu Chuang wrote:
>> From: Po-Yu Chuang <ratbert@faraday-tech.com>
>>
>> FTMAC100 Ethernet Media Access Controller supports 10/100 Mbps and
>> MII.  This driver has been working on some ARM/NDS32 SoC including
>> Faraday A320 and Andes AG101.
> [...]
>> +#define USE_NAPI
>
> All new network drivers should use NAPI only, so please remove the
> definition of USE_NAPI and change the conditional code to use NAPI
> always.

OK, fixed.

> [...]
>> +     struct net_device_stats stats;
>
> There is a net_device_stats structure in the net_device structure; you
> should normally use that instead of adding another one.

OK, fixed.

> [...]
>> +static int ftmac100_reset(struct ftmac100 *priv)
>> +{
>> +     struct device *dev = &priv->netdev->dev;
>> +     unsigned long flags;
>> +     int i;
>> +
>> +     /* NOTE: reset clears all registers */
>> +
>> +     spin_lock_irqsave(&priv->hw_lock, flags);
>> +     iowrite32(FTMAC100_MACCR_SW_RST, priv->base + FTMAC100_OFFSET_MACCR);
>> +     spin_unlock_irqrestore(&priv->hw_lock, flags);
>> +
>> +     for (i = 0; i < 5; i++) {
>> +             int maccr;
>> +
>> +             spin_lock_irqsave(&priv->hw_lock, flags);
>> +             maccr = ioread32(priv->base + FTMAC100_OFFSET_MACCR);
>> +             spin_unlock_irqrestore(&priv->hw_lock, flags);
>> +             if (!(maccr & FTMAC100_MACCR_SW_RST)) {
>> +                     /*
>> +                      * FTMAC100_MACCR_SW_RST cleared does not indicate
>> +                      * that hardware reset completed (what the f*ck).
>> +                      * We still need to wait for a while.
>> +                      */
>> +                     usleep_range(500, 1000);
>
> Sleeping here means this must be running in process context.  But you
> used spin_lock_irqsave() above which implies you're not sure what the
> context is.  One of these must be wrong.
>
> I wonder whether hw_lock is even needed; you seem to acquire and release
> it around each PIO (read or write).  This should be unnecessary since
> each PIO is atomic.

OK, fixed.

> I think you can also get rid of rx_lock; it's only used in the RX data
> path which is already serialised by NAPI.

OK, fixed.

> [...]
>> +     netdev->last_rx = jiffies;
>
> Don't set last_rx; the networking core takes care of it now.

OK, fixed.

>> +     priv->stats.rx_packets++;
>> +     priv->stats.rx_bytes += skb->len;
>
> This should be done earlier, so that these stats include packets that
> are dropped for any reason.

OK, fixed.

> [...]
>> +     netdev->trans_start = jiffies;
>
> Don't set trans_start; the networking core takes care of it now.

OK, fixed.

> [...]
>> +     priv->descs = dma_alloc_coherent(priv->dev,
>> +             sizeof(struct ftmac100_descs), &priv->descs_dma_addr,
>> +             GFP_KERNEL | GFP_DMA);
>
> On x86, GFP_DMA means the memory must be within the ISA DMA range
> (< 16 MB).  I don't know quite what it means on ARM but it may not be
> what you want.

On ARM, the whole 4G address space can be used for DMA (at least on all the
hardware I have ever used). All memory pages are in the DMA zone AFAIK.
I put GFP_DMA here just to be safe in case there were any constraints.

By checking drivers in drivers/net/arm/, ep93xx_eth.c also uses this flag,
so I guess this is acceptable?

> [...]
>> +     if (status & (FTMAC100_INT_XPKT_OK | FTMAC100_INT_XPKT_LOST)) {
>> +             /*
>> +              * FTMAC100_INT_XPKT_OK:
>> +              *       packet transmitted to ethernet successfully
>> +              *
>> +              * FTMAC100_INT_XPKT_LOST:
>> +              *      packet transmitted to ethernet lost due to late
>> +              *      collision or excessive collision
>> +              */
>> +             ftmac100_tx_complete(priv);
>> +     }
>
> TX completions should also be handled through NAPI if possible.

OK, I'll study how to do that.

> [...]
>> +     priv->rx_pointer = 0;
>> +     priv->tx_clean_pointer = 0;
>> +     priv->tx_pointer = 0;
>> +     spin_lock_init(&priv->hw_lock);
>> +     spin_lock_init(&priv->rx_lock);
>> +     spin_lock_init(&priv->tx_lock);
>
> These locks should be initialised in your probe function.

OK, fixed.

> [...]
>> +     unregister_netdev(netdev);
>
> There should be a netif_napi_del() before this.

OK, fixed.

> A general comment: please use netdev_err(), netdev_info() etc. for
> logging.  This ensures that both the platform device address and the
> network device name appears in the log messages.

OK, fixed.

Thanks a lot for your detailed review. I'll submit a new version ASAP.

Thanks,
Po-Yu Chuang

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Michael S. Tsirkin @ 2011-01-14  4:58 UTC (permalink / raw)
  To: Simon Horman
  Cc: Jesse Gross, Eric Dumazet, Rusty Russell, virtualization, dev,
	virtualization, netdev, kvm
In-Reply-To: <20110113234135.GC8426@verge.net.au>

On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
> On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> > On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> > >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> > >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> > >> >
> > >> > [ snip ]
> > >> > >
> > >> > > I know that everyone likes a nice netperf result but I agree with
> > >> > > Michael that this probably isn't the right question to be asking.  I
> > >> > > don't think that socket buffers are a real solution to the flow
> > >> > > control problem: they happen to provide that functionality but it's
> > >> > > more of a side effect than anything.  It's just that the amount of
> > >> > > memory consumed by packets in the queue(s) doesn't really have any
> > >> > > implicit meaning for flow control (think multiple physical adapters,
> > >> > > all with the same speed instead of a virtual device and a physical
> > >> > > device with wildly different speeds).  The analog in the physical
> > >> > > world that you're looking for would be Ethernet flow control.
> > >> > > Obviously, if the question is limiting CPU or memory consumption then
> > >> > > that's a different story.
> > >> >
> > >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> > >> > using cgroups and/or tc.
> > >>
> > >> I have found that I can successfully control the throughput using
> > >> the following techniques
> > >>
> > >> 1) Place a tc egress filter on dummy0
> > >>
> > >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> > >>    this is effectively the same as one of my hacks to the datapath
> > >>    that I mentioned in an earlier mail. The result is that eth1
> > >>    "paces" the connection.

This is actually a bug. It means that one slow connection will
affect fast ones. I intend to change the default for qemu to sndbuf=0:
this will fix it but break your "pacing". So please do not count on this behaviour.

> > > Further to this, I wonder if there is any interest in providing
> > > a method to switch the action order - using ovs-ofctl is a hack imho -
> > > and/or switching the default action order for mirroring.
> > 
> > I'm not sure that there is a way to do this that is correct in the
> > generic case.  It's possible that the destination could be a VM while
> > packets are being mirrored to a physical device or we could be
> > multicasting or some other arbitrarily complex scenario.  Just think
> > of what a physical switch would do if it has ports with two different
> > speeds.
> 
> Yes, I have considered that case. And I agree that perhaps there
> is no sensible default. But perhaps we could make it configurable somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.

-- 
MST

^ permalink raw reply

* Re: [PATCH] CHOKe flow scheduler (0.8)
From: Eric Dumazet @ 2011-01-14  3:58 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <1294976067.3403.118.camel@edumazet-laptop>

On Friday, 14 January 2011 at 04:34 +0100, Eric Dumazet wrote:

> Hmm, please wait a bit, I had another crash when I stopped my
> bench/stress

I am not sure p->qavg is correctly computed.

Crash happened because choke_peek_random() was called while no packet
was in queue.

With my params (min=10833 max=32500 burst=18055 limit=130000) this
implies qavg was very big while qlen==0 !

qdisc choke 11: dev ifb0 parent 1:11 limit 130000b min 10833b max 32500b ewma 13 Plog 21 Scell_log 30
 Sent 200857857 bytes 365183 pkt (dropped 1010937, overlimits 557577 requeues 0) 
 rate 32253Kbit 7330pps backlog 17875996b 32505p requeues 0 
  marked 0 early 557577 pdrop 0 other 0 matched 226680


Here is latest diff :

 include/linux/pkt_sched.h |    8 +++----
 net/sched/sch_choke.c     |   50 +++++++++++++++++++++++++++++-----------------
 2 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 83bac92..498c798 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -269,10 +269,10 @@ struct tc_choke_qopt {
 };
 
 struct tc_choke_xstats {
-	__u32           early;          /* Early drops */
-	__u32           pdrop;          /* Drops due to queue limits */
-	__u32           other;          /* Drops due to drop() calls */
-	__u32           marked;         /* Marked packets */
+	__u32		early;          /* Early drops */
+	__u32		pdrop;          /* Drops due to queue limits */
+	__u32		other;          /* Drops due to drop() calls */
+	__u32		marked;         /* Marked packets */
 	__u32		matched;	/* Drops due to flow match */
 };
 
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 136d4e5..2f94dad 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -74,7 +74,7 @@ struct choke_sched_data {
 };
 
 /* deliver a random number between 0 and N - 1 */
-static inline u32 random_N(unsigned int N)
+static u32 random_N(unsigned int N)
 {
 	return reciprocal_divide(random32(), N);
 }
@@ -94,18 +94,20 @@ static struct sk_buff *choke_peek_random(struct Qdisc *sch,
 			return skb;
 	} while (--retrys > 0);
 
-	/* queue is has lots of holes use the head which is known to exist */
+	/* queue is has lots of holes use the head which is known to exist
+	 * Note : result can still be NULL if q->head == q->tail
+	 */
 	return q->tab[*pidx = q->head];
 }
 
 /* Is ECN parameter configured */
-static inline int use_ecn(const struct choke_sched_data *q)
+static int use_ecn(const struct choke_sched_data *q)
 {
 	return q->flags & TC_RED_ECN;
 }
 
 /* Should packets over max just be dropped (versus marked) */
-static inline int use_harddrop(const struct choke_sched_data *q)
+static int use_harddrop(const struct choke_sched_data *q)
 {
 	return q->flags & TC_RED_HARDDROP;
 }
@@ -113,20 +115,21 @@ static inline int use_harddrop(const struct choke_sched_data *q)
 /* Move head pointer forward to skip over holes */
 static void choke_zap_head_holes(struct choke_sched_data *q)
 {
-	while (q->tab[q->head] == NULL) {
+	do {
 		q->head = (q->head + 1) & q->tab_mask;
-
-		BUG_ON(q->head == q->tail);
-	}
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->head] == NULL);
 }
 
 /* Move tail pointer backwards to reuse holes */
 static void choke_zap_tail_holes(struct choke_sched_data *q)
 {
-	while (q->tab[q->tail - 1] == NULL) {
+	do {
 		q->tail = (q->tail - 1) & q->tab_mask;
-		BUG_ON(q->head == q->tail);
-	}
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->tail] == NULL);
 }
 
 /* Drop packet from queue array by creating a "hole" */
@@ -145,7 +148,7 @@ static void choke_drop_by_idx(struct choke_sched_data *q, unsigned int idx)
    2. fast internal classification
    3. use TC filter based classification
 */
-static inline unsigned int choke_classify(struct sk_buff *skb,
+static unsigned int choke_classify(struct sk_buff *skb,
 					  struct Qdisc *sch, int *qerr)
 
 {
@@ -214,11 +217,12 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 		oskb = choke_peek_random(sch, &idx);
 
 		/* Both packets from same flow ? */
-		if (*(unsigned int *)(qdisc_skb_cb(oskb)->data) == hash) {
+		if (oskb &&
+		    *(unsigned int *)(qdisc_skb_cb(oskb)->data) == hash) {
 			/* Drop both packets */
 			q->stats.matched++;
 			choke_drop_by_idx(q, idx);
-			sch->qstats.backlog -= qdisc_pkt_len(skb);
+			sch->qstats.backlog -= qdisc_pkt_len(oskb);
 			--sch->q.qlen;
 			qdisc_drop(oskb, sch);
 			goto congestion_drop;
@@ -285,8 +289,7 @@ static struct sk_buff *choke_dequeue(struct Qdisc *sch)
 	}
 
 	skb = q->tab[q->head];
-	q->tab[q->head] = NULL; /* not really needed */
-	q->head = (q->head + 1) & q->tab_mask;
+	q->tab[q->head] = NULL;
 	choke_zap_head_holes(q);
 	--sch->q.qlen;
 	sch->qstats.backlog -= qdisc_pkt_len(skb);
@@ -371,12 +374,23 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt)
 		sch_tree_lock(sch);
 		old = q->tab;
 		if (old) {
-			unsigned int tail = 0;
+			unsigned int oqlen = sch->q.qlen, tail = 0;
 
 			while (q->head != q->tail) {
-				ntab[tail++] = q->tab[q->head];
+				struct sk_buff *skb = q->tab[q->head];
+
 				q->head = (q->head + 1) & q->tab_mask;
+				if (!skb)
+					continue;
+				if (tail < mask) {
+					ntab[tail++] = skb;
+					continue;
+				}
+				sch->qstats.backlog -= qdisc_pkt_len(skb);
+				--sch->q.qlen;
+				qdisc_drop(skb, sch);
 			}
+			qdisc_tree_decrease_qlen(sch, oqlen - sch->q.qlen);
 			q->head = 0;
 			q->tail = tail;
 		}



^ permalink raw reply related

* Re: [PATCH] CHOKe flow scheduler (0.8)
From: Eric Dumazet @ 2011-01-14  3:34 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <1294975793.3403.117.camel@edumazet-laptop>

On Friday, 14 January 2011 at 04:29 +0100, Eric Dumazet wrote:
> On Thursday, 13 January 2011 at 15:34 -0800, Stephen Hemminger wrote:
> > CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
> > packet scheduler based on the Random Early Detection (RED) algorithm.
> > 
> > The core idea is:
> >   For every packet arrival:
> >   	Calculate Qave
> > 	if (Qave < minth) 
> > 	     Queue the new packet
> > 	else 
> > 	     Select randomly a packet from the queue 
> > 	     if (both packets from same flow)
> > 	     then Drop both the packets
> > 	     else if (Qave > maxth)
> > 	          Drop packet
> > 	     else
> > 	       	  Admit packet with probability P (same as RED)
> > 
> > See also:
> >   Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
> >    queue management scheme for approximating fair bandwidth allocation", 
> >   Proceeding of INFOCOM'2000, March 2000.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > ---
> > 0.8 change queue length and holes account.
> >     keep sch->q.qlen updated, and holes counter not needed.
> > 
> > ---
> >  net/sched/Kconfig     |   11 +
> >  net/sched/Makefile    |    1 
> >  net/sched/sch_choke.c |  536 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 548 insertions(+)
> > 
> 
> Hi Stephen
> 
> Your diffstat was an old one, here the right one.
> 
>  include/linux/pkt_sched.h |   29 +
>  net/sched/Kconfig         |   11 
>  net/sched/Makefile        |    1 
>  net/sched/sch_choke.c     |  552 ++++++++++++++++++++++++++++++++++++
> 
> 
> I tested v8 and found several serious problems, please find a diff of my
> latest changes :
> 
> - wrong oskb/skb used in choke_enqueue()
> - choke_zap_head_holes() is called from choke_dequeue() and crash if we
> dequeued last packet. (!!!)
> - out of bound access in choke_zap_tail_holes()
> - choke_dequeue() can be shorter
> - choke_change() must dequeue/drop in excess packets or risk new array
> overfill (if we reduce queue limit by tc qdisc change ...)
> - inline is not needed, space errors in include file
> 
> Thanks !

Hmm, please wait a bit, I had another crash when I stopped my
bench/stress




^ permalink raw reply

* Re: [PATCH] CHOKe flow scheduler (0.8)
From: Eric Dumazet @ 2011-01-14  3:29 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev
In-Reply-To: <20110113153436.70d3c0a3@s6510>

On Thursday, 13 January 2011 at 15:34 -0800, Stephen Hemminger wrote:
> CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
> packet scheduler based on the Random Early Detection (RED) algorithm.
> 
> The core idea is:
>   For every packet arrival:
>   	Calculate Qave
> 	if (Qave < minth) 
> 	     Queue the new packet
> 	else 
> 	     Select randomly a packet from the queue 
> 	     if (both packets from same flow)
> 	     then Drop both the packets
> 	     else if (Qave > maxth)
> 	          Drop packet
> 	     else
> 	       	  Admit packet with probability P (same as RED)
> 
> See also:
>   Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
>    queue management scheme for approximating fair bandwidth allocation", 
>   Proceeding of INFOCOM'2000, March 2000.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> ---
> 0.8 change queue length and holes account.
>     keep sch->q.qlen updated, and holes counter not needed.
> 
> ---
>  net/sched/Kconfig     |   11 +
>  net/sched/Makefile    |    1 
>  net/sched/sch_choke.c |  536 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 548 insertions(+)
> 
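
The per-packet decision in the quoted pseudocode can be written as a small pure function. This is a sketch, not sch_choke.c: the random pick from the queue is assumed to happen in the caller, which passes in the victim's flow id, and `choke_decide()`, `enum choke_verdict` and the `has_victim` flag are hypothetical names. The flag covers the case where the random pick finds no packet.

```c
#include <assert.h>

enum choke_verdict {
	CHOKE_ENQUEUE,		/* below min threshold: queue the packet */
	CHOKE_DROP_BOTH,	/* same flow as the random victim: drop both */
	CHOKE_DROP_NEW,		/* above max threshold: drop the arrival */
	CHOKE_RED_PROB		/* admit with RED's probability P */
};

static enum choke_verdict choke_decide(unsigned int qavg,
				       unsigned int minth,
				       unsigned int maxth,
				       int has_victim,
				       unsigned int new_flow,
				       unsigned int victim_flow)
{
	assert(minth <= maxth);

	if (qavg < minth)
		return CHOKE_ENQUEUE;

	/* Only compare flows if the random pick actually found a packet. */
	if (has_victim && new_flow == victim_flow)
		return CHOKE_DROP_BOTH;

	if (qavg > maxth)
		return CHOKE_DROP_NEW;

	return CHOKE_RED_PROB;
}
```

Keeping the decision free of queue state makes each branch of the pseudocode easy to exercise in isolation.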

Hi Stephen

Your diffstat was an old one, here the right one.

 include/linux/pkt_sched.h |   29 +
 net/sched/Kconfig         |   11 
 net/sched/Makefile        |    1 
 net/sched/sch_choke.c     |  552 ++++++++++++++++++++++++++++++++++++


I tested v8 and found several serious problems, please find a diff of my
latest changes :

- wrong oskb/skb used in choke_enqueue()
- choke_zap_head_holes() is called from choke_dequeue() and crash if we
dequeued last packet. (!!!)
- out of bound access in choke_zap_tail_holes()
- choke_dequeue() can be shorter
- choke_change() must dequeue/drop in excess packets or risk new array
overfill (if we reduce queue limit by tc qdisc change ...)
- inline is not needed, space errors in include file
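
The dequeue crash in the list above comes from skipping holes in a circular array without stopping when the queue becomes empty. Below is a minimal userspace model of the corrected head handling, assuming a power-of-two table where NULL marks a hole. The `ring_*` names and the fixed `RING_SIZE` are hypothetical; this is not the qdisc code itself.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8u			/* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

struct ring {
	const char *tab[RING_SIZE];	/* NULL marks a hole */
	unsigned int head;		/* oldest occupied slot */
	unsigned int tail;		/* next free slot */
};

/* Advance head past the slot just emptied and any holes after it. */
static void ring_zap_head_holes(struct ring *r)
{
	do {
		r->head = (r->head + 1) & RING_MASK;
		if (r->head == r->tail)
			break;		/* queue is empty: stop, do not BUG */
	} while (r->tab[r->head] == NULL);
}

static int ring_enqueue(struct ring *r, const char *item)
{
	unsigned int next = (r->tail + 1) & RING_MASK;

	assert(r != NULL && item != NULL);
	if (next == r->head)
		return -1;		/* full */
	r->tab[r->tail] = item;
	r->tail = next;
	return 0;
}

/* Drop an arbitrary entry by turning its slot into a hole. */
static void ring_drop_by_idx(struct ring *r, unsigned int idx)
{
	idx &= RING_MASK;
	if (r->head == r->tail || r->tab[idx] == NULL)
		return;
	r->tab[idx] = NULL;
	if (idx == r->head)
		ring_zap_head_holes(r);
}

static const char *ring_dequeue(struct ring *r)
{
	const char *item;

	if (r->head == r->tail)
		return NULL;		/* empty */
	item = r->tab[r->head];
	r->tab[r->head] = NULL;
	ring_zap_head_holes(r);
	return item;
}
```

Because the loop tests for head == tail before looking at the next slot, removing the last packet leaves an empty queue instead of walking off into unused entries.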

Thanks !

 include/linux/pkt_sched.h |    8 +++---
 net/sched/sch_choke.c     |   43 ++++++++++++++++++++++--------------
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 83bac92..498c798 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -269,10 +269,10 @@ struct tc_choke_qopt {
 };
 
 struct tc_choke_xstats {
-	__u32           early;          /* Early drops */
-	__u32           pdrop;          /* Drops due to queue limits */
-	__u32           other;          /* Drops due to drop() calls */
-	__u32           marked;         /* Marked packets */
+	__u32		early;          /* Early drops */
+	__u32		pdrop;          /* Drops due to queue limits */
+	__u32		other;          /* Drops due to drop() calls */
+	__u32		marked;         /* Marked packets */
 	__u32		matched;	/* Drops due to flow match */
 };
 
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 136d4e5..c710c57 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -74,7 +74,7 @@ struct choke_sched_data {
 };
 
 /* deliver a random number between 0 and N - 1 */
-static inline u32 random_N(unsigned int N)
+static u32 random_N(unsigned int N)
 {
 	return reciprocal_divide(random32(), N);
 }
@@ -99,13 +99,13 @@ static struct sk_buff *choke_peek_random(struct Qdisc *sch,
 }
 
 /* Is ECN parameter configured */
-static inline int use_ecn(const struct choke_sched_data *q)
+static int use_ecn(const struct choke_sched_data *q)
 {
 	return q->flags & TC_RED_ECN;
 }
 
 /* Should packets over max just be dropped (versus marked) */
-static inline int use_harddrop(const struct choke_sched_data *q)
+static int use_harddrop(const struct choke_sched_data *q)
 {
 	return q->flags & TC_RED_HARDDROP;
 }
@@ -113,20 +113,21 @@ static inline int use_harddrop(const struct choke_sched_data *q)
 /* Move head pointer forward to skip over holes */
 static void choke_zap_head_holes(struct choke_sched_data *q)
 {
-	while (q->tab[q->head] == NULL) {
+	do {
 		q->head = (q->head + 1) & q->tab_mask;
-
-		BUG_ON(q->head == q->tail);
-	}
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->head] == NULL);
 }
 
 /* Move tail pointer backwards to reuse holes */
 static void choke_zap_tail_holes(struct choke_sched_data *q)
 {
-	while (q->tab[q->tail - 1] == NULL) {
+	do {
 		q->tail = (q->tail - 1) & q->tab_mask;
-		BUG_ON(q->head == q->tail);
-	}
+		if (q->head == q->tail)
+			break;
+	} while (q->tab[q->tail] == NULL);
 }
 
 /* Drop packet from queue array by creating a "hole" */
@@ -145,7 +146,7 @@ static void choke_drop_by_idx(struct choke_sched_data *q, unsigned int idx)
    2. fast internal classification
    3. use TC filter based classification
 */
-static inline unsigned int choke_classify(struct sk_buff *skb,
+static unsigned int choke_classify(struct sk_buff *skb,
 					  struct Qdisc *sch, int *qerr)
 
 {
@@ -218,7 +219,7 @@ static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 			/* Drop both packets */
 			q->stats.matched++;
 			choke_drop_by_idx(q, idx);
-			sch->qstats.backlog -= qdisc_pkt_len(skb);
+			sch->qstats.backlog -= qdisc_pkt_len(oskb);
 			--sch->q.qlen;
 			qdisc_drop(oskb, sch);
 			goto congestion_drop;
@@ -285,8 +286,7 @@ static struct sk_buff *choke_dequeue(struct Qdisc *sch)
 	}
 
 	skb = q->tab[q->head];
-	q->tab[q->head] = NULL; /* not really needed */
-	q->head = (q->head + 1) & q->tab_mask;
+	q->tab[q->head] = NULL;
 	choke_zap_head_holes(q);
 	--sch->q.qlen;
 	sch->qstats.backlog -= qdisc_pkt_len(skb);
@@ -371,12 +371,23 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt)
 		sch_tree_lock(sch);
 		old = q->tab;
 		if (old) {
-			unsigned int tail = 0;
+			unsigned int oqlen = sch->q.qlen, tail = 0;
 
 			while (q->head != q->tail) {
-				ntab[tail++] = q->tab[q->head];
+				struct sk_buff *skb = q->tab[q->head];
+
 				q->head = (q->head + 1) & q->tab_mask;
+				if (!skb)
+					continue;
+				if (tail < mask) {
+					ntab[tail++] = skb;
+					continue;
+				}
+				sch->qstats.backlog -= qdisc_pkt_len(skb);
+				--sch->q.qlen;
+				qdisc_drop(skb, sch);
 			}
+			qdisc_tree_decrease_qlen(sch, oqlen - sch->q.qlen);
 			q->head = 0;
 			q->tail = tail;
 		}



^ permalink raw reply related

* Re: [PATCH v1 2/2] TCPCT API sockopt update to draft -03
From: Eric Dumazet @ 2011-01-14  3:08 UTC (permalink / raw)
  To: William Allen Simpson
  Cc: Arnaud Lacombe, Stephen Hemminger, Linux Kernel Developers,
	Linux Kernel Network Developers, David Miller, Andrew Morton
In-Reply-To: <4D2FBC34.6050901@gmail.com>

On Thursday, 13 January 2011 at 22:00 -0500, William Allen Simpson wrote:

> Is this supposed to be humorous?  Maybe folks here find it amusing that
> somebody thinks they know more than the *author* about the contents of the
> document?  Did you note the words above?  That is, "very recent changes"?
> 
> Perhaps you are viewing an older cached version.  Please check for the
> current month on every page: "January 2011".
> 
> We discussed -- and ultimately decided -- these changes in private email
> during the independent review process before making them available to the
> general public.  That's how the RFC publication procedure works.
> 
> I tried to be helpful to the Linux community in advance of publication, so
> you would be prepared.  I'm sorry that the community here is so lacking in
> appreciation for my efforts on your behalf.
> 
> As always, what you actually do with my code is up to you....
> --


Next time you come here, provide an up-to-date link for us mere mortals, so
that we can check your code against your claims. We don't trust you
anymore; we had to fix several bugs.

This is getting ridiculous.

As I said, we are going to wait for the official RFC, because it's time
consuming to review your patches, and nobody asked for early TCPCT
coding in the Linux kernel (you already said your buddies don't care at all).

^ permalink raw reply

* Re: [PATCH v1 2/2] TCPCT API sockopt update to draft -03
From: William Allen Simpson @ 2011-01-14  3:00 UTC (permalink / raw)
  To: Arnaud Lacombe
  Cc: Stephen Hemminger, Linux Kernel Developers,
	Linux Kernel Network Developers, David Miller, Andrew Morton
In-Reply-To: <AANLkTinhnaF3-scKnTOC5rF68+5otVW2NG3v1y80L0k6@mail.gmail.com>

On 1/13/11 12:53 PM, Arnaud Lacombe wrote:
> On Thu, Jan 13, 2011 at 12:32 PM, William Allen Simpson
> <william.allen.simpson@gmail.com>  wrote:
>> Even though I'm not paid to work on Linux, I'm doing my best to give you
>> folks a quick heads up and provide code to rectify the very recent changes
>> that can be propagated back through the stable tree (to 2.6.33).
>>
>> As always, what you actually do with my code is up to you....
>>
> FWIW, what is the basis of this hunk ? The RFC text[0] seems to use
> the TCP_COOKIE_* naming, not TCPCT_.
>
> Thanks,
>   - Arnaud
>
> [0]: http://www.rfc-editor.org/authors/rfc6013.txt
>
Is this supposed to be humorous?  Maybe folks here find it amusing that
somebody thinks they know more than the *author* about the contents of the
document?  Did you note the words above?  That is, "very recent changes"?

Perhaps you are viewing an older cached version.  Please check for the
current month on every page: "January 2011".

We discussed -- and ultimately decided -- these changes in private email
during the independent review process before making them available to the
general public.  That's how the RFC publication procedure works.

I tried to be helpful to the Linux community in advance of publication, so
you would be prepared.  I'm sorry that the community here is so lacking in
appreciation for my efforts on your behalf.

As always, what you actually do with my code is up to you....

^ permalink raw reply

* Re: sch_sfb
From: Patrick McHardy @ 2011-01-14  1:39 UTC (permalink / raw)
  To: Juliusz Chroboczek; +Cc: netdev, Jesper Dangaard Brouer, David Miller
In-Reply-To: <7ir5cgs3k1.fsf@lanthane.pps.jussieu.fr>

On 14.01.2011 01:59, Juliusz Chroboczek wrote:
>> Since to my knowledge you've never attempted an upstream merge
> 
> You're kidding, I hope.
> 
>   http://thread.gmane.org/gmane.linux.network/90225
>   http://thread.gmane.org/gmane.linux.network/90375
> 
> It was reviewed in particular by one Patrick McHardy.

Well, sorry, I don't remember every patch I've ever reviewed.

Just to state it again, I actually started implementing this
years ago without ever finishing it and was quite happy to
notice your implementation for inspiration. There's no reason
to be pissed. I don't really mind which version is merged, if
any; reimplementing just forced me to think from scratch, and
having existing code at hand makes it easier to avoid or fix
mistakes.

^ permalink raw reply

* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-14  1:15 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Matthew Carlson, Michael Leun, Michael Chan, Eric Dumazet,
	David Miller, Ben Greear, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <AANLkTimQ1T0erugwK17o0zuSwxVh-Ro4yCsBpOqCykyr@mail.gmail.com>

On Thu, Jan 13, 2011 at 01:58:40PM -0800, Jesse Gross wrote:
> On Thu, Jan 13, 2011 at 3:50 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> > On Thu, Jan 13, 2011 at 07:06:22AM -0800, Jesse Gross wrote:
> >> On Wed, Jan 12, 2011 at 8:21 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> >> > On Thu, Jan 06, 2011 at 08:36:27PM -0800, Jesse Gross wrote:
> >> >> On Thu, Jan 6, 2011 at 10:24 PM, Matt Carlson <mcarlson@broadcom.com> wrote:
> >> >> > On Sat, Dec 18, 2010 at 07:38:00PM -0800, Jesse Gross wrote:
> >> >> >> On Tue, Dec 14, 2010 at 11:16 PM, Michael Leun
> >> >> >> <lkml20101129@newton.leun.net> wrote:
> >> >> >> > OK - all tests done on that DL320G5:
> >> >> >> >
> >> >> >> > For completeness, 2.6.37-rc5 unpatched:
> >> >> >> >
> >> >> >> > eth0, no vlan configured: totally broken - see double tagged vlans
> >> >> >> > without tag, single or untagged packets missing at all
> >> >> >>
> >> >> >> Random behavior?  This one is somewhat hard to explain - maybe there
> >> >> >> are some other factors.  eth0 has ASF on, so it always strips tags.  I
> >> >> >> would expect it to behave like the vlan configured case.
> >> >> >>
> >> >> >> >
> >> >> >> > eth0, vlan configured: see packets without vlan tag (see double tagged
> >> >> >> > packets with one vlan tag)
> >> >> >>
> >> >> >> Both ASF and vlan group configured cause tag stripping to be enabled.
> >> >> >> Missing tag.
> >> >> >>
> >> >> >> >
> >> >> >> > eth1 same as originally reported:
> >> >> >> > without vlan configured see vlan tags (single and double tagged as
> >> >> >> > expected)
> >> >> >>
> >> >> >> No ASF and no vlan group means tag stripping is disabled.  Have tag.
> >> >> >>
> >> >> >> > with vlan configured: see packets without vlan tag (see double tagged
> >> >> >> > packets with one vlan tag)
> >> >> >>
> >> >> >> Configuring vlan group causes stripping to be enabled.  Missing tag.
> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > 2.6.37-rc5, your tg3 use new vlan-code patch:
> >> >> >> >
> >> >> >> > eth0, no vlan configured: see packets without vlan tag (see double
> >> >> >> > tagged packets with one vlan tag)
> >> >> >>
> >> >> >> ASF enables tag stripping.  Missing tag.
> >> >> >>
> >> >> >> > eth1, no vlan configured: see vlan tags (single and double tagged as
> >> >> >> > expected)
> >> >> >>
> >> >> >> No ASF, no vlan group means no stripping.  Have tag.
> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> > eth0, vlan configured: as without vlan
> >> >> >>
> >> >> >> ASF enables stripping.  Missing tag.
> >> >> >>
> >> >> >> > eth1, vlan configured: as without vlan
> >> >> >>
> >> >> >> With this patch vlan stripping is only enabled when ASF is on, so no
> >> >> >> stripping.  Have tag.
> >> >> >>
> >> >> >> >
> >> >> >> > 2.6.37-rc5, your tg3 use new vlan-code patch with test patch ontop
> >> >> >> >
> >> >> >> > eth1 no vlan configured: see packets without vlan tag (see double tagged
> >> >> >> > packets with one vlan tag)
> >> >> >>
> >> >> >> With the second patch, vlan stripping is always enabled.  Missing tag.
> >> >> >>
> >> >> >> > eth1 with vlan: the same
> >> >> >>
> >> >> >> Stripping still always enabled.  Missing tag.
> >> >> >>
> >> >> >> The bottom line is whenever vlan stripping is enabled we're missing
> >> >> >> the outer tag.  It might be worth adding some debugging in the area
> >> >> >> before napi_gro_receive/vlan_gro_receive (depending on version).  My
> >> >> >> guess is that (desc->type_flags & RXD_FLAG_VLAN) is false even for
> >> >> >> vlan packets on this NIC.
> >> >> >>
> >> >> >> You said that everything works on the 5752?  Matt, is it possible that
> >> >> >> the 5714 either has a problem with vlan stripping or a different way
> >> >> >> of reporting it?
> >> >> >
> >> >> > I don't think this is a 5714 specific issue.  I think the problem is
> >> >> > rooted in the fact that the VLAN tag stripping is enabled.
> >> >>
> >> >> It's definitely related to vlan stripping being enabled.  Other cards
> >> >> using tg3 seem to work fine with stripping though, which is why I
> >> >> thought it might be specific to the 5714.
> >> >
> >> > I just tested this on a 5714S, using a net-next-2.6 snapshot obtained
> >> > today.  It does the right thing in both cases (2nd tg3 patch omitted /
> >> > applied).  The tag is always visible in the packet stream as seen from
> >> > tcpdump.
> >> >
> >> >> > Your RXD_FLAG_VLAN idea sounds unlikely to me, but it's worth a check.
> >> >> >
> >> >> > The patch here is using __vlan_hwaccel_put_tag(), which informs the
> >> >> > stack a VLAN tag is present.  If this is indeed a reporting problem, I'm
> >> >> > not sure what else the driver should be doing.
> >> >>
> >> >> The code to hand off the tag to the stack looks OK to me.  Michael was
> >> >> seeing this on older versions of the kernel as well with this NIC,
> >> >> which predates both this patch and the larger vlan changes so it
> >> >> doesn't seem like a problem with passing the tag to the network stack.
> >> >> It's hard to know exactly what is going on though without seeing what
> >> >> the hardware is reporting.
> >> >
> >> > When RX_MODE_KEEP_VLAN_TAG is set, the RXD_FLAG_VLAN flag will not be set
> >> > when receiving a packet.  The driver skips the __vlan_hwaccel_put_tag()
> >> > call.
> >> >
> >> > When RX_MODE_KEEP_VLAN_TAG is unset, the RXD_FLAG_VLAN flag is set, and
> >> > __vlan_hwaccel_put_tag() is called to reinject the packet.
> >>
> >> OK, thanks for testing it out.  I'm not sure that there's anything
> >> more we can do without hearing from Michael.
> >
> > In the meantime, I think what we have should go upstream.  Just to be
> > absolutely clear though, your position is that VLAN tags should always
> > be stripped?
> 
> That's what the other converted drivers do by default (though most of
> them also provide an ethtool set_flags() method to change this).  It's
> generally the most efficient and is now safe to do in all cases.  It's
> also consistent with what was happening before, since stripping
> was enabled when a vlan device was configured.  So, yes, normally I
> think stripping should be enabled.
> 
> I assumed that disabling stripping in most situations was just an
> oversight.  Was there a reason why you feel it is better not to use
> it?

Actually, the tg3 driver was trying to disable VLAN tag stripping
when possible.  I believe this was primarily to support the raw packet
interface.
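
To illustrate why stripping matters to a raw packet consumer: once the NIC
strips the tag, the frame bytes no longer contain the 802.1Q header, so the
VLAN ID can only be learned out of band. The sketch below is a stand-alone
user-space check, not tg3 or kernel code; frame_inline_vid and the two
constants are hypothetical names. It reports whether a frame still carries
an in-line tag and, if so, its VLAN ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN_BYTES   6u       /* length of one MAC address */
#define TPID_8021Q       0x8100u  /* EtherType value that marks an 802.1Q tag */

/* Returns true and stores the 12-bit VLAN ID if the frame carries an
 * in-line 802.1Q tag right after the two MAC addresses. */
static bool frame_inline_vid(const uint8_t *frame, size_t len, uint16_t *vid)
{
	const size_t off = 2 * ETH_ALEN_BYTES;	/* skip dst + src MAC */
	uint16_t tpid, tci;

	assert(frame != NULL && vid != NULL);
	if (len < off + 4)
		return false;
	tpid = (uint16_t)((frame[off] << 8) | frame[off + 1]);
	if (tpid != TPID_8021Q)
		return false;
	tci = (uint16_t)((frame[off + 2] << 8) | frame[off + 3]);
	*vid = tci & 0x0fff;	/* low 12 bits of the TCI are the VLAN ID */
	return true;
}
```

A frame whose tag was stripped by hardware fails this check even though it
arrived on a VLAN, which matches the "missing tag" observations earlier in
the thread.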

^ permalink raw reply

* Re: sch_sfb
From: Juliusz Chroboczek @ 2011-01-14  0:59 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Jesper Dangaard Brouer, David Miller
In-Reply-To: <4D2F4B83.6060001@trash.net>

> Since to my knowledge you've never attempted an upstream merge

You're kidding, I hope.

  http://thread.gmane.org/gmane.linux.network/90225
  http://thread.gmane.org/gmane.linux.network/90375

It was reviewed in particular by one Patrick McHardy.

                                        Juliusz

^ permalink raw reply

* [PATCH] ethtool : Add option -L | --set-common to set common flags.
From: Mahesh Bandewar @ 2011-01-14  0:11 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Tom Herbert, Laurent Chavey, netdev,
	Mahesh Bandewar

This patch adds the -L | --set-common option to set or clear common flags,
which include the loopback flag. The -l | --show-common option displays the
current values of these common flags.
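
The flag update in this patch is a read-modify-write: read the current flag
word, clear the bits being changed, then OR in the requested values. A
minimal sketch of that step, assuming illustrative names (update_flags and
the FLAG_* macros are hypothetical; the loopback bit position is the one
proposed by this patch, not an established ethtool flag):

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as they appear in the patch; the loopback bit is the
 * patch's proposal, not an established ethtool flag. */
#define FLAG_LOOPBACK (UINT32_C(1) << 2)
#define FLAG_TXVLAN   (UINT32_C(1) << 7)
#define FLAG_RXVLAN   (UINT32_C(1) << 8)

/* Change only the bits selected by mask; leave every other bit as read. */
static uint32_t update_flags(uint32_t current, uint32_t mask, uint32_t wanted)
{
	assert((wanted & ~mask) == 0);	/* wanted bits must lie inside mask */
	return (current & ~mask) | wanted;
}
```

With an empty mask the flag word is returned unchanged, which is why a set
operation with no recognized arguments should not alter device state.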

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 ethtool-copy.h |    1 +
 ethtool.8      |   16 ++++++++++
 ethtool.c      |   90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 75c3ae7..5fd18c7 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -309,6 +309,7 @@ struct ethtool_perm_addr {
  * flag differs from the read-only value.
  */
 enum ethtool_flags {
+	ETH_FLAG_LOOPBACK	= (1 << 2),	/* Loopback enable / disable */
 	ETH_FLAG_TXVLAN		= (1 << 7),	/* TX VLAN offload enabled */
 	ETH_FLAG_RXVLAN		= (1 << 8),	/* RX VLAN offload enabled */
 	ETH_FLAG_LRO		= (1 << 15),	/* LRO is enabled */
diff --git a/ethtool.8 b/ethtool.8
index 1760924..cf7128f 100644
--- a/ethtool.8
+++ b/ethtool.8
@@ -174,6 +174,13 @@ ethtool \- Display or change ethernet card settings
 .B2 txvlan on off
 .B2 rxhash on off
 
+.B ethtool \-l|\-\-show\-common
+.I ethX
+
+.B ethtool \-L|\-\-set\-common
+.I ethX
+.B2 loopback on off
+
 .B ethtool \-p|\-\-identify
 .I ethX
 .RI [ N ]
@@ -406,6 +413,15 @@ Specifies whether TX VLAN acceleration should be enabled
 .A2 rxhash on off
 Specifies whether receive hashing offload should be enabled
 .TP
+.B \-l \-\-show\-common
+Queries the specified ethernet device for common flag settings.
+.TP
+.B \-L \-\-set\-common
+Changes the common parameters of the specified ethernet device.
+.TP
+.A2 loopback on off
+Specifies whether loopback should be enabled.
+.TP
 .B \-p \-\-identify
 Initiates adapter-specific action intended to enable an operator to
 easily identify the adapter by sight.  Typically this involves
diff --git a/ethtool.c b/ethtool.c
index 63e0ead..1a0c10c 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -97,6 +97,8 @@ static int do_gcoalesce(int fd, struct ifreq *ifr);
 static int do_scoalesce(int fd, struct ifreq *ifr);
 static int do_goffload(int fd, struct ifreq *ifr);
 static int do_soffload(int fd, struct ifreq *ifr);
+static int do_gcommon(int fd, struct ifreq *ifr);
+static int do_scommon(int fd, struct ifreq *ifr);
 static int do_gstats(int fd, struct ifreq *ifr);
 static int rxflow_str_to_type(const char *str);
 static int parse_rxfhashopts(char *optstr, u32 *data);
@@ -142,6 +144,8 @@ static enum {
 	MODE_GNTUPLE,
 	MODE_FLASHDEV,
 	MODE_PERMADDR,
+	MODE_GCOMMON,
+	MODE_SCOMMON,
 } mode = MODE_GSET;
 
 static struct option {
@@ -211,6 +215,10 @@ static struct option {
 		"		[ ntuple on|off ]\n"
 		"		[ rxhash on|off ]\n"
     },
+    { "-l", "--show-common", MODE_GCOMMON, "Get common flags information" },
+    { "-L", "--set-common", MODE_SCOMMON, "Set common flags",
+		"               [ loopback on|off ]\n"
+    },
     { "-i", "--driver", MODE_GDRV, "Show driver information" },
     { "-d", "--register-dump", MODE_GREGS, "Do a register dump",
 		"		[ raw on|off ]\n"
@@ -309,6 +317,10 @@ static u32 off_flags_wanted = 0;
 static u32 off_flags_mask = 0;
 static int off_gro_wanted = -1;
 
+static int gcommon_changed = 0;
+static u32 common_flags_wanted = 0;
+static u32 common_flags_mask = 0;
+
 static struct ethtool_pauseparam epause;
 static int gpause_changed = 0;
 static int pause_autoneg_wanted = -1;
@@ -482,6 +494,11 @@ static struct cmdline_info cmdline_offload[] = {
 	  ETH_FLAG_RXHASH, &off_flags_mask },
 };
 
+static struct cmdline_info cmdline_commonflags[] = {
+	{ "loopback", CMDL_FLAG, &common_flags_wanted, NULL,
+	  ETH_FLAG_LOOPBACK, &common_flags_mask },
+};
+
 static struct cmdline_info cmdline_pause[] = {
 	{ "autoneg", CMDL_BOOL, &pause_autoneg_wanted, &epause.autoneg },
 	{ "rx", CMDL_BOOL, &pause_rx_wanted, &epause.rx_pause },
@@ -829,6 +846,8 @@ static void parse_cmdline(int argc, char **argp)
 			    (mode == MODE_SRING) ||
 			    (mode == MODE_GOFFLOAD) ||
 			    (mode == MODE_SOFFLOAD) ||
+			    (mode == MODE_GCOMMON) ||
+			    (mode == MODE_SCOMMON) ||
 			    (mode == MODE_GSTATS) ||
 			    (mode == MODE_GNFC) ||
 			    (mode == MODE_SNFC) ||
@@ -919,6 +938,14 @@ static void parse_cmdline(int argc, char **argp)
 				i = argc;
 				break;
 			}
+			if (mode == MODE_SCOMMON) {
+				parse_generic_cmdline(argc, argp, i,
+					&gcommon_changed,
+			      		cmdline_commonflags,
+			      		ARRAY_SIZE(cmdline_commonflags));
+				i = argc;
+				break;
+			}
 			if (mode == MODE_SNTUPLE) {
 				if (!strcmp(argp[i], "flow-type")) {
 					i += 1;
@@ -1905,6 +1932,13 @@ static int dump_offload(int rx, int tx, int sg, int tso, int ufo, int gso,
 	return 0;
 }
 
+static int dump_common_flags(int loopback)
+{
+	fprintf(stdout, "loopback: %s\n", loopback ? "on" : "off");
+
+	return 0;
+}
+
 static int dump_rxfhash(int fhash, u64 val)
 {
 	switch (fhash) {
@@ -1998,6 +2032,10 @@ static int doit(void)
 		return do_goffload(fd, &ifr);
 	} else if (mode == MODE_SOFFLOAD) {
 		return do_soffload(fd, &ifr);
+	} else if (mode == MODE_GCOMMON) {
+		return do_gcommon(fd, &ifr);
+	} else if (mode == MODE_SCOMMON) {
+		return do_scommon(fd, &ifr);
 	} else if (mode == MODE_GSTATS) {
 		return do_gstats(fd, &ifr);
 	} else if (mode == MODE_GNFC) {
@@ -2219,6 +2257,53 @@ static int do_scoalesce(int fd, struct ifreq *ifr)
 	return 0;
 }
 
+static int do_gcommon(int fd, struct ifreq *ifr)
+{
+	struct ethtool_value eval;
+	int loopback = 0;
+
+	fprintf(stdout, "Common flags for %s:\n", devname);
+
+	eval.cmd = ETHTOOL_GFLAGS;
+	ifr->ifr_data = (caddr_t)&eval;
+	if (ioctl(fd, SIOCETHTOOL, ifr)) {
+		perror("Cannot get device flags");
+	} else {
+		loopback = (eval.data & ETH_FLAG_LOOPBACK) != 0;
+	}
+
+	return dump_common_flags(loopback);
+}
+
+static int do_scommon(int fd, struct ifreq *ifr)
+{
+	struct ethtool_value eval;
+
+	if (common_flags_mask) {
+		eval.cmd = ETHTOOL_GFLAGS;
+		eval.data = 0;
+		ifr->ifr_data = (caddr_t)&eval;
+		if (ioctl(fd, SIOCETHTOOL, ifr)) {
+			perror("Cannot get device common flags");
+			return 1;
+		}
+
+		eval.cmd = ETHTOOL_SFLAGS;
+		eval.data =
+		    ((eval.data & ~(common_flags_mask | off_flags_mask)) |
+		     (common_flags_wanted | off_flags_wanted));
+
+		if (ioctl(fd, SIOCETHTOOL, ifr)) {
+			perror("Cannot set device common flags");
+			return 1;
+		}
+	} else {
+		fprintf(stdout, "No common settings changed\n");
+	}
+
+	return 0;
+}
+
 static int do_goffload(int fd, struct ifreq *ifr)
 {
 	struct ethtool_value eval;
@@ -2407,8 +2492,9 @@ static int do_soffload(int fd, struct ifreq *ifr)
 		}
 
 		eval.cmd = ETHTOOL_SFLAGS;
-		eval.data = ((eval.data & ~off_flags_mask) |
-			     off_flags_wanted);
+		eval.data =
+		    ((eval.data & ~(off_flags_mask | common_flags_mask)) |
+		     (off_flags_wanted | common_flags_wanted));
 
 		err = ioctl(fd, SIOCETHTOOL, ifr);
 		if (err) {
-- 
1.7.3.1


^ permalink raw reply related

* Re: Flow Control and Port Mirroring Revisited
From: Simon Horman @ 2011-01-13 23:41 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin
In-Reply-To: <AANLkTimO=5HmTJO1kmHGAWa-HTac+3d0TbrmJX5W4hVu@mail.gmail.com>

On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
> On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman <horms@verge.net.au> wrote:
> > On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
> >> On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
> >> > On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
> >> >
> >> > [ snip ]
> >> > >
> >> > > I know that everyone likes a nice netperf result but I agree with
> >> > > Michael that this probably isn't the right question to be asking.  I
> >> > > don't think that socket buffers are a real solution to the flow
> >> > > control problem: they happen to provide that functionality but it's
> >> > > more of a side effect than anything.  It's just that the amount of
> >> > > memory consumed by packets in the queue(s) doesn't really have any
> >> > > implicit meaning for flow control (think multiple physical adapters,
> >> > > all with the same speed instead of a virtual device and a physical
> >> > > device with wildly different speeds).  The analog in the physical
> >> > > world that you're looking for would be Ethernet flow control.
> >> > > Obviously, if the question is limiting CPU or memory consumption then
> >> > > that's a different story.
> >> >
> >> > Point taken. I will see if I can control CPU (and thus memory) consumption
> >> > using cgroups and/or tc.
> >>
> >> I have found that I can successfully control the throughput using
> >> the following techniques
> >>
> >> 1) Place a tc egress filter on dummy0
> >>
> >> 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
> >>    this is effectively the same as one of my hacks to the datapath
> >>    that I mentioned in an earlier mail. The result is that eth1
> >>    "paces" the connection.
> >
> > Further to this, I wonder if there is any interest in providing
> > a method to switch the action order - using ovs-ofctl is a hack imho -
> > and/or switching the default action order for mirroring.
> 
> I'm not sure that there is a way to do this that is correct in the
> generic case.  It's possible that the destination could be a VM while
> packets are being mirrored to a physical device or we could be
> multicasting or some other arbitrarily complex scenario.  Just think
> of what a physical switch would do if it has ports with two different
> speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable somehow?

^ permalink raw reply

* [PATCH] CHOKe flow scheduler (0.8)
From: Stephen Hemminger @ 2011-01-13 23:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev
In-Reply-To: <1294951069.3403.11.camel@edumazet-laptop>

CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
packet scheduler based on the Random Early Detection (RED) algorithm.

The core idea is:
  For every packet arrival:
  	Calculate Qave
	if (Qave < minth) 
	     Queue the new packet
	else 
	     Select randomly a packet from the queue 
	     if (both packets from same flow)
	     then Drop both the packets
	     else if (Qave > maxth)
	          Drop packet
	     else
	       	  Admit packet with probability P (same as RED)
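
The decision steps above can be sketched as a pure function. This is only
an illustration of the pseudocode, not the sch_choke.c implementation:
choke_decide and enum choke_verdict are hypothetical names, the thresholds
are plain doubles, and the random draw is passed in by the caller so the
function stays deterministic. admit_prob follows the pseudocode literally
(the probability P of admitting the packet).

```c
#include <assert.h>
#include <stdbool.h>

enum choke_verdict {
	CHOKE_ENQUEUE,     /* admit the arriving packet */
	CHOKE_DROP_NEW,    /* drop only the arriving packet */
	CHOKE_DROP_BOTH    /* drop the arriving packet and the sampled one */
};

/*
 * qavg        average queue length (as computed by RED)
 * minth/maxth lower and upper RED thresholds
 * same_flow   whether the randomly sampled queued packet is from the
 *             same flow as the arriving packet
 * admit_prob  probability P of admitting the packet, in [0, 1]
 * u           uniform random draw in [0, 1), supplied by the caller
 */
static enum choke_verdict choke_decide(double qavg, double minth, double maxth,
				       bool same_flow, double admit_prob,
				       double u)
{
	assert(minth <= maxth);

	if (qavg < minth)
		return CHOKE_ENQUEUE;	/* queue is short: no sampling needed */
	if (same_flow)
		return CHOKE_DROP_BOTH;	/* penalize the flow hogging the queue */
	if (qavg > maxth)
		return CHOKE_DROP_NEW;	/* queue is too long: forced drop */
	/* Between the thresholds: admit with probability P */
	return u < admit_prob ? CHOKE_ENQUEUE : CHOKE_DROP_NEW;
}
```

Passing the random draw as a parameter is a choice made for this sketch so
that each branch can be checked in isolation.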

See also:
  Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
   queue management scheme for approximating fair bandwidth allocation", 
  Proceedings of INFOCOM'2000, March 2000.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
0.8 change queue length and holes accounting.
    keep sch->q.qlen updated, and holes counter not needed.

---
 net/sched/Kconfig     |   11 +
 net/sched/Makefile    |    1 
 net/sched/sch_choke.c |  536 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 548 insertions(+)

--- a/net/sched/Kconfig	2011-01-13 15:19:41.542022830 -0800
+++ b/net/sched/Kconfig	2011-01-13 15:20:53.586380066 -0800
@@ -205,6 +205,17 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_CHOKE
+	tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
+	help
+	  Say Y here if you want to use the CHOKe packet scheduler (CHOose
+	  and Keep for responsive flows, CHOose and Kill for unresponsive
+	  flows). This is a variation of RED which tries to penalize flows
+	  that monopolize the queue.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called sch_choke.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
--- a/net/sched/Makefile	2011-01-13 15:19:41.578022995 -0800
+++ b/net/sched/Makefile	2011-01-13 15:20:53.586380066 -0800
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_mult
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ b/net/sched/sch_choke.c	2011-01-13 15:26:18.771992614 -0800
@@ -0,0 +1,552 @@
+/*
+ * net/sched/sch_choke.c	CHOKE scheduler
+ *
+ * Copyright (c) 2011 Stephen Hemminger <shemminger@vyatta.com>
+ * Copyright (c) 2011 Eric Dumazet <eric.dumazet@gmail.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/reciprocal_div.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+#include <net/red.h>
+
+/*	CHOKe stateless AQM for fair bandwidth allocation
+        =================================================
+
+   CHOKe (CHOose and Keep for responsive flows, CHOose and Kill for
+   unresponsive flows) is a variant of RED that penalizes misbehaving flows but
+   maintains no flow state. The difference from RED is an additional step
+   during the enqueuing process. If average queue size is over the
+   low threshold (qmin), a packet is chosen at random from the queue.
+   If both the new and chosen packet are from the same flow, both
+   are dropped. Unlike RED, CHOKe is not really a "classful" qdisc because it
+   needs to access packets in queue randomly. It has a minimal class
+   interface to allow overriding the builtin flow classifier with
+   filters.
+
+   Source:
+   R. Pan, B. Prabhakar, and K. Psounis, "CHOKe, A Stateless
+   Active Queue Management Scheme for Approximating Fair Bandwidth Allocation",
+   IEEE INFOCOM, 2000.
+
+   A. Tang, J. Wang, S. Low, "Understanding CHOKe: Throughput and Spatial
+   Characteristics", IEEE/ACM Transactions on Networking, 2004
+
+ */
+
+/* Upper bound on size of sk_buff table */
+#define CHOKE_MAX_QUEUE	(128*1024 - 1)
+
+struct choke_sched_data {
+/* Parameters */
+	u32		 limit;
+	unsigned char	 flags;
+
+	struct red_parms parms;
+
+/* Variables */
+	struct tcf_proto *filter_list;
+	struct {
+		u32	prob_drop;	/* Early probability drops */
+		u32	prob_mark;	/* Early probability marks */
+		u32	forced_drop;	/* Forced drops, qavg > max_thresh */
+		u32	forced_mark;	/* Forced marks, qavg > max_thresh */
+		u32	pdrop;          /* Drops due to queue limits */
+		u32	other;          /* Drops due to drop() calls */
+		u32	matched;	/* Drops due to flow match */
+	} stats;
+
+	unsigned int	 head;
+	unsigned int	 tail;
+
+	unsigned int	 tab_mask; /* size - 1 */
+
+	struct sk_buff **tab;
+};
+
+/* deliver a random number between 0 and N - 1 */
+static inline u32 random_N(unsigned int N)
+{
+	return reciprocal_divide(random32(), N);
+}
+
+/* Select a packet at random from the queue in O(1) and handle holes */
+static struct sk_buff *choke_peek_random(struct Qdisc *sch,
+					 unsigned int *pidx)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	int retrys = 3;
+
+	do {
+		*pidx = (q->head + random_N(sch->q.qlen)) & q->tab_mask;
+		skb = q->tab[*pidx];
+		if (skb)
+			return skb;
+	} while (--retrys > 0);
+
+	/* queue has lots of holes; use the head, which is known to exist */
+	return q->tab[*pidx = q->head];
+}
+
+/* Is ECN parameter configured */
+static inline int use_ecn(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_ECN;
+}
+
+/* Should packets over max just be dropped (versus marked) */
+static inline int use_harddrop(const struct choke_sched_data *q)
+{
+	return q->flags & TC_RED_HARDDROP;
+}
+
+/* Move head pointer forward to skip over holes */
+static void choke_zap_head_holes(struct choke_sched_data *q)
+{
+	while (q->tab[q->head] == NULL) {
+		q->head = (q->head + 1) & q->tab_mask;
+
+		BUG_ON(q->head == q->tail);
+	}
+}
+
+/* Move tail pointer backwards to reuse holes */
+static void choke_zap_tail_holes(struct choke_sched_data *q)
+{
+	while (q->tab[q->tail - 1] == NULL) {
+		q->tail = (q->tail - 1) & q->tab_mask;
+		BUG_ON(q->head == q->tail);
+	}
+}
+
+/* Drop packet from queue array by creating a "hole" */
+static void choke_drop_by_idx(struct choke_sched_data *q, unsigned int idx)
+{
+	q->tab[idx] = NULL;
+
+	if (idx == q->head)
+		choke_zap_head_holes(q);
+	if (idx == q->tail)
+		choke_zap_tail_holes(q);
+}
+
+/* Classify flow using either:
+ *  1. pre-existing classification result in skb
+ *  2. fast internal classification
+ *  3. TC filter based classification
+ */
+static inline unsigned int choke_classify(struct sk_buff *skb,
+					  struct Qdisc *sch, int *qerr)
+
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tcf_result res;
+	int result;
+
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+
+	if (TC_H_MAJ(skb->priority) == sch->handle &&
+	    TC_H_MIN(skb->priority) > 0)
+		return TC_H_MIN(skb->priority);
+
+	if (!q->filter_list)
+		return skb_get_rxhash(skb);
+
+	result = tc_classify(skb, q->filter_list, &res);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+		case TC_ACT_SHOT:
+			return 0;
+		}
+#endif
+		return TC_H_MIN(res.classid);
+	}
+
+	return 0;
+}
+
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct red_parms *p = &q->parms;
+	unsigned int hash;
+	int uninitialized_var(ret);
+
+	hash = choke_classify(skb, sch, &ret);
+	if (!hash) {
+		/* Packet was eaten by filter */
+		if (ret & __NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+
+	/* Maybe add hash as field in struct qdisc_skb_cb? */
+	*(unsigned int *)(qdisc_skb_cb(skb)->data) = hash;
+
+	/* Compute average queue usage (see RED) */
+	p->qavg = red_calc_qavg(p, sch->q.qlen);
+	if (red_is_idling(p))
+		red_end_of_idle_period(p);
+
+	/* Is queue small? */
+	if (p->qavg <= p->qth_min)
+		p->qcount = -1;
+	else {
+		struct sk_buff *oskb;
+		unsigned int idx;
+
+		/* Draw a packet at random from queue */
+		oskb = choke_peek_random(sch, &idx);
+
+		/* Both packets from same flow ? */
+		if (*(unsigned int *)(qdisc_skb_cb(oskb)->data) == hash) {
+			/* Drop both packets */
+			q->stats.matched++;
+			choke_drop_by_idx(q, idx);
+			sch->qstats.backlog -= qdisc_pkt_len(oskb);
+			--sch->q.qlen;
+			qdisc_drop(oskb, sch);
+			goto congestion_drop;
+		}
+
+		/* Queue is large, always mark/drop */
+		if (p->qavg > p->qth_max) {
+			p->qcount = -1;
+
+			sch->qstats.overlimits++;
+			if (use_harddrop(q) || !use_ecn(q) ||
+			    !INET_ECN_set_ce(skb)) {
+				q->stats.forced_drop++;
+				goto congestion_drop;
+			}
+
+			q->stats.forced_mark++;
+		} else if (++p->qcount) {
+			if (red_mark_probability(p, p->qavg)) {
+				p->qcount = 0;
+				p->qR = red_random(p);
+
+				sch->qstats.overlimits++;
+				if (!use_ecn(q) || !INET_ECN_set_ce(skb)) {
+					q->stats.prob_drop++;
+					goto congestion_drop;
+				}
+
+				q->stats.prob_mark++;
+			}
+		} else
+			p->qR = red_random(p);
+	}
+
+	/* Admit new packet */
+	if (sch->q.qlen < q->limit) {
+		q->tab[q->tail] = skb;
+		q->tail = (q->tail + 1) & q->tab_mask;
+		++sch->q.qlen;
+		sch->qstats.backlog += qdisc_pkt_len(skb);
+		qdisc_bstats_update(sch, skb);
+		return NET_XMIT_SUCCESS;
+	}
+
+	q->stats.pdrop++;
+	sch->qstats.drops++;
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+
+ congestion_drop:
+	qdisc_drop(skb, sch);
+	return NET_XMIT_CN;
+}
+
+static struct sk_buff *choke_dequeue(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	if (q->head == q->tail) {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+		return NULL;
+	}
+
+	skb = q->tab[q->head];
+	q->tab[q->head] = NULL; /* not really needed */
+	q->head = (q->head + 1) & q->tab_mask;
+	choke_zap_head_holes(q);
+	--sch->q.qlen;
+	sch->qstats.backlog -= qdisc_pkt_len(skb);
+
+	return skb;
+}
+
+static unsigned int choke_drop(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	unsigned int len;
+
+	len = qdisc_queue_drop(sch);
+	if (len > 0)
+		q->stats.other++;
+	else {
+		if (!red_is_idling(&q->parms))
+			red_start_of_idle_period(&q->parms);
+	}
+
+	return len;
+}
+
+static void choke_reset(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	red_restart(&q->parms);
+}
+
+static const struct nla_policy choke_policy[TCA_CHOKE_MAX + 1] = {
+	[TCA_CHOKE_PARMS]	= { .len = sizeof(struct tc_red_qopt) },
+	[TCA_CHOKE_STAB]	= { .len = 256 },
+};
+
+
+static void choke_free(void *addr)
+{
+	if (addr) {
+		if (is_vmalloc_addr(addr))
+			vfree(addr);
+		else
+			kfree(addr);
+	}
+}
+
+static int choke_change(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_CHOKE_MAX + 1];
+	struct tc_red_qopt *ctl;
+	int err;
+	struct sk_buff **old = NULL;
+	unsigned int mask;
+
+	if (opt == NULL)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CHOKE_MAX, opt, choke_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_CHOKE_PARMS] == NULL ||
+	    tb[TCA_CHOKE_STAB] == NULL)
+		return -EINVAL;
+
+	ctl = nla_data(tb[TCA_CHOKE_PARMS]);
+
+	if (ctl->limit > CHOKE_MAX_QUEUE)
+		return -EINVAL;
+
+	mask = roundup_pow_of_two(ctl->limit + 1) - 1;
+	if (mask != q->tab_mask) {
+		struct sk_buff **ntab;
+
+		ntab = kcalloc(mask + 1, sizeof(struct sk_buff *), GFP_KERNEL);
+		if (!ntab)
+			ntab = vzalloc((mask + 1) * sizeof(struct sk_buff *));
+		if (!ntab)
+			return -ENOMEM;
+
+		sch_tree_lock(sch);
+		old = q->tab;
+		if (old) {
+			unsigned int tail = 0;
+
+			while (q->head != q->tail) {
+				ntab[tail++] = q->tab[q->head];
+				q->head = (q->head + 1) & q->tab_mask;
+			}
+			q->head = 0;
+			q->tail = tail;
+		}
+
+		q->tab_mask = mask;
+		q->tab = ntab;
+	} else
+		sch_tree_lock(sch);
+
+	q->flags = ctl->flags;
+	q->limit = ctl->limit;
+
+	red_set_parms(&q->parms, ctl->qth_min, ctl->qth_max, ctl->Wlog,
+		      ctl->Plog, ctl->Scell_log,
+		      nla_data(tb[TCA_CHOKE_STAB]));
+
+	if (q->head == q->tail)
+		red_end_of_idle_period(&q->parms);
+
+	sch_tree_unlock(sch);
+	choke_free(old);
+	return 0;
+}
+
+static int choke_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	return choke_change(sch, opt);
+}
+
+static int choke_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts = NULL;
+	struct tc_red_qopt opt = {
+		.limit		= q->limit,
+		.flags		= q->flags,
+		.qth_min	= q->parms.qth_min >> q->parms.Wlog,
+		.qth_max	= q->parms.qth_max >> q->parms.Wlog,
+		.Wlog		= q->parms.Wlog,
+		.Plog		= q->parms.Plog,
+		.Scell_log	= q->parms.Scell_log,
+	};
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (opts == NULL)
+		goto nla_put_failure;
+
+	NLA_PUT(skb, TCA_CHOKE_PARMS, sizeof(opt), &opt);
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static int choke_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+	struct tc_choke_xstats st = {
+		.early	= q->stats.prob_drop + q->stats.forced_drop,
+		.marked	= q->stats.prob_mark + q->stats.forced_mark,
+		.pdrop	= q->stats.pdrop,
+		.other	= q->stats.other,
+		.matched = q->stats.matched,
+	};
+
+	return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void choke_destroy(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	tcf_destroy_chain(&q->filter_list);
+	choke_free(q->tab);
+}
+
+static struct Qdisc *choke_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	return NULL;
+}
+
+static unsigned long choke_get(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static void choke_put(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static unsigned long choke_bind(struct Qdisc *sch, unsigned long parent,
+				u32 classid)
+{
+	return 0;
+}
+
+static struct tcf_proto **choke_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return &q->filter_list;
+}
+
+static int choke_dump_class(struct Qdisc *sch, unsigned long cl,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static void choke_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	if (!arg->stop) {
+		if (arg->fn(sch, 1, arg) < 0) {
+			arg->stop = 1;
+			return;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops choke_class_ops = {
+	.leaf		=	choke_leaf,
+	.get		=	choke_get,
+	.put		=	choke_put,
+	.tcf_chain	=	choke_find_tcf,
+	.bind_tcf	=	choke_bind,
+	.unbind_tcf	=	choke_put,
+	.dump		=	choke_dump_class,
+	.walk		=	choke_walk,
+};
+
+static struct sk_buff *choke_peek_head(struct Qdisc *sch)
+{
+	struct choke_sched_data *q = qdisc_priv(sch);
+
+	return (q->head != q->tail) ? q->tab[q->head] : NULL;
+}
+
+static struct Qdisc_ops choke_qdisc_ops __read_mostly = {
+	.id		=	"choke",
+	.priv_size	=	sizeof(struct choke_sched_data),
+
+	.enqueue	=	choke_enqueue,
+	.dequeue	=	choke_dequeue,
+	.peek		=	choke_peek_head,
+	.drop		=	choke_drop,
+	.init		=	choke_init,
+	.destroy	=	choke_destroy,
+	.reset		=	choke_reset,
+	.change		=	choke_change,
+	.dump		=	choke_dump,
+	.dump_stats	=	choke_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init choke_module_init(void)
+{
+	return register_qdisc(&choke_qdisc_ops);
+}
+
+static void __exit choke_module_exit(void)
+{
+	unregister_qdisc(&choke_qdisc_ops);
+}
+
+module_init(choke_module_init)
+module_exit(choke_module_exit)
+
+MODULE_LICENSE("GPL");
--- a/include/linux/pkt_sched.h	2011-01-13 15:19:41.726023725 -0800
+++ b/include/linux/pkt_sched.h	2011-01-13 15:20:53.590380094 -0800
@@ -247,6 +247,35 @@ struct tc_gred_sopt {
 	__u16		pad1;
 };
 
+/* CHOKe section */
+
+enum {
+	TCA_CHOKE_UNSPEC,
+	TCA_CHOKE_PARMS,
+	TCA_CHOKE_STAB,
+	__TCA_CHOKE_MAX,
+};
+
+#define TCA_CHOKE_MAX (__TCA_CHOKE_MAX - 1)
+
+struct tc_choke_qopt {
+	__u32		limit;		/* HARD maximal queue length (packets)	*/
+	__u32		qth_min;	/* Min average length threshold (packets) */
+	__u32		qth_max;	/* Max average length threshold (packets) */
+	unsigned char   Wlog;		/* log(W)		*/
+	unsigned char   Plog;		/* log(P_max/(qth_max-qth_min))	*/
+	unsigned char   Scell_log;	/* cell size for idle damping */
+	unsigned char	flags;		/* see RED flags */
+};
+
+struct tc_choke_xstats {
+	__u32           early;          /* Early drops */
+	__u32           pdrop;          /* Drops due to queue limits */
+	__u32           other;          /* Drops due to drop() calls */
+	__u32           marked;         /* Marked packets */
+	__u32		matched;	/* Drops due to flow match */
+};
+
 /* HTB section */
 #define TC_HTB_NUMPRIO		8
 #define TC_HTB_MAXDEPTH		8
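
A note on random_N() in sch_choke.c above: as far as I can tell,
reciprocal_divide(A, R) in kernels of this period computes
((u64)A * R) >> 32, so passing random32() and N scales a uniform 32-bit
value onto 0..N-1 with a multiply and a shift rather than a modulo.
A minimal userspace sketch of that scaling follows. scale_to_range() is
a made-up name, and the n > 0 precondition is my assumption, not
something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the multiply-and-shift that
 * random_N() relies on: map a uniform 32-bit value r onto [0, n - 1]
 * without a division or modulo.
 */
static uint32_t scale_to_range(uint32_t r, uint32_t n)
{
	assert(n > 0);	/* assumption of this sketch only */
	return (uint32_t)(((uint64_t)r * n) >> 32);
}
```

For example, scale_to_range(0x80000000u, 10) gives 5, and the largest
input 0xffffffff gives 9, so the result never reaches n.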

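The hole handling in the patch may be easier to follow in isolation.
Below is a rough userspace sketch, under my own assumptions, of a
power-of-two ring in which dropped packets leave NULL holes and the
holes are trimmed at both ends. ring_match_drop() does the CHOKe
comparison against one queued slot. struct pkt, struct ring and every
function name here are hypothetical and are not the patch's code; the
slot index is passed in rather than drawn at random so the behaviour is
repeatable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u			/* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct pkt {
	uint32_t flow;			/* flow identifier (hash) */
};

struct ring {
	struct pkt *tab[RING_SIZE];	/* NULL entries are holes */
	unsigned int head;		/* oldest packet */
	unsigned int tail;		/* next free slot */
	unsigned int qlen;		/* packets stored, holes excluded */
};

/* Append at the tail; one slot stays free so head == tail means empty. */
static void ring_push(struct ring *r, struct pkt *p)
{
	assert(((r->tail + 1) & RING_MASK) != r->head);
	r->tab[r->tail] = p;
	r->tail = (r->tail + 1) & RING_MASK;
	r->qlen++;
}

/* Remove the packet in slot idx, leaving a hole, then trim both ends. */
static void ring_drop_idx(struct ring *r, unsigned int idx)
{
	r->tab[idx] = NULL;
	r->qlen--;

	/* skip holes at the head, stopping if the ring becomes empty */
	while (r->head != r->tail && r->tab[r->head] == NULL)
		r->head = (r->head + 1) & RING_MASK;

	/* pull the tail back over trailing holes; the index is masked */
	while (r->head != r->tail &&
	       r->tab[(r->tail - 1) & RING_MASK] == NULL)
		r->tail = (r->tail - 1) & RING_MASK;
}

/*
 * CHOKe comparison: if the packet in slot idx belongs to the same flow
 * as the arriving packet, drop the queued one and return 1 so the
 * caller can drop the arrival as well. Returns 0 on a hole or mismatch.
 */
static int ring_match_drop(struct ring *r, unsigned int idx, uint32_t flow)
{
	struct pkt *victim = r->tab[idx & RING_MASK];

	if (victim == NULL || victim->flow != flow)
		return 0;

	ring_drop_idx(r, idx & RING_MASK);
	return 1;
}
```

The sketch trims holes only at the two ends, as the patch does, so
interior holes stay until the head or tail reaches them. That is why
ring_push() checks the head/tail span rather than qlen.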
^ permalink raw reply
