Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 04/12] phylib: add a way to make PHY time stamps possible.
From: Richard Cochran @ 2010-06-16  5:40 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <AANLkTimhAvu0_WuMCwnCrw9Rbcy6dKxvTlzepJm7ZDdR@mail.gmail.com>

On Tue, Jun 15, 2010 at 10:33:51AM -0600, Grant Likely wrote:
> > +config NETWORK_PHY_TIMESTAMPING
> Some overhead?  At a brief glance of the series it looks like it could
> add a lot of overhead, but I'm not fully clear on what the full
> process is.  Can you describe how the hardware timestamping works?  I
> could use an overview of what the kernel has to do.

First of all, I want to emphasize that this network stack option is
purely voluntary. Only those people who know that they have a PTP
capable PHY and really want the timestamps will (or should) enable
this option. When it is not enabled, it has no effect at all.

Hardware timestamping is described in

   Documentation/networking/timestamping.txt
   Documentation/networking/timestamping/timestamping.c

The PTP subsystem is described in

   Documentation/ptp/ptp.txt

There really is more to say about the issue than appears in those
documents, but they are a good starting place for discussion.

BTW I am submitting a conference paper on the design on the PTP
subsystem. If you would like to have it, just ask me off-list.

Richard

^ permalink raw reply

* Re: [Bugme-new] [Bug 16187] New: Carrier detection failed in dhcpcd when link is up
From: Christian Casteyde @ 2010-06-16  5:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, Grant Grundler, Kyle McMartin, bugzilla-daemon,
	bugme-daemon
In-Reply-To: <20100615142418.f47e8abd.akpm@linux-foundation.org>

Exact, my kernel config is:
-*- Generic Media Independent Interface device support
[*]   "Tulip" family network device support
<*>   ULi M526x controller support

Le mardi 15 juin 2010 23:24:18, Andrew Morton a écrit :
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sat, 12 Jun 2010 15:15:31 GMT
> 
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=16187
> > 
> >            Summary: Carrier detection failed in dhcpcd when link is up
> >            Product: Networking
> >            Version: 2.5
> >     
> >     Kernel Version: 2.6.35-rc2
> >     
> >           Platform: All
> >         
> >         OS/Version: Linux
> >         
> >               Tree: Mainline
> >             
> >             Status: NEW
> >           
> >           Severity: normal
> >           Priority: P1
> >          
> >          Component: Other
> >         
> >         AssignedTo: acme@ghostprotocols.net
> >         ReportedBy: casteyde.christian@free.fr
> >         Regression: Yes
> > 
> > Created an attachment (id=26742)
> > 
> >  --> (https://bugzilla.kernel.org/attachment.cgi?id=26742)
> > 
> > lspci output for 2.6.34 on my computer
> > 
> > Kernel at least 2.6.35-rc2, 2.6.34 works fine
> 
> Seems to be post-2.6.34 breakage in the tulip driver.
> 
> > Athlon X2 3000 in 64bits mode
> > Slackware 13.1
> > "Ethernet controller: ALi Corporation ULi 1689,1573 integrated ethernet.
> > (rev 40)" (from lspci)
> > 
> > Since 2.6.35-rc2 (didn't checked -rc1), dhcpcd hangs fails to detect
> > carrier appearance at boot.
> > 
> > The Slackware network script uses dhcpcd to bring DHCP interfaces up. At
> > boot, it seems my network device doesn't have the link up immediatly.
> > dhcpcd tries to bring it up, and wait for the carrier. The carrier
> > indeed goes up, but dhcpcd gets absolutly no notification of that, and
> > therefore times out.
> > 
> > logs says that:
> > Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> > 127.0.0.0 netmask 255.0.0.0 lo
> > Jun 12 16:53:58 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10 
> > eth0 Jun 12 16:53:58 sirius dhcpcd: version 5.2.2 starting
> > Jun 12 16:53:58 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not
> > ready Jun 12 16:53:58 sirius dhcpcd: eth0: waiting for carrier
> > Jun 12 16:54:01 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> > duplex
> > Jun 12 16:54:01 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
> > becomes ready
> > then dhcpcd times out a little later (10 seconds as -t 10 is specified).
> > 
> > If I use the -K option (no carrier detection), and does ifconfig eth0 up
> > just before issueing the dhcpcd command, dhcpcd doesn't wait for the
> > carrier and gets a lease correctly.
> > 
> > Reversely, with the 2.6.34 kernel, dhcpcd indeed gets the link up
> > notification, and gets the lease immediatly. In this case, the log says:
> > 
> > Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/route add -net
> > 127.0.0.0 netmask 255.0.0.0 lo
> > Jun 12 17:04:21 sirius logger: /etc/rc.d/rc.inet1:  /sbin/dhcpcd -t 10 
> > eth0 Jun 12 17:04:22 sirius dhcpcd: version 5.2.2 starting
> > Jun 12 17:04:22 sirius kernel: ADDRCONF(NETDEV_UP): eth0: link is not
> > ready Jun 12 17:04:22 sirius dhcpcd: eth0: waiting for carrier
> > Jun 12 17:04:25 sirius dhcpcd: eth0: carrier acquired
> > Jun 12 17:04:25 sirius kernel: uli526x: eth0 NIC Link is Up 100 Mbps Full
> > duplex
> > Jun 12 17:04:25 sirius kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
> > becomes ready
> > Jun 12 17:04:25 sirius dhcpcd: eth0: broadcasting for a lease
> > Jun 12 17:04:29 sirius dhcpcd: eth0: offered 192.168.1.3 from 192.168.1.1
> > Jun 12 17:04:29 sirius dhcpcd: eth0: acknowledged 192.168.1.3 from
> > 192.168.1.1 Jun 12 17:04:29 sirius dhcpcd: eth0: checking for
> > 192.168.1.3
> > Jun 12 17:04:34 sirius dhcpcd: eth0: leased 192.168.1.3 for 864000
> > seconds Jun 12 17:04:34 sirius dhcpcd: forking to background
> > 
> > As you can see, dhcpcd asks for a lease as soon as the kernel tells it
> > the link is up (message "carrier acquired").
> > 
> > Therefore, I think 2.6.35-rc* notification of carrier is broken, and at
> > least it broke dhcpcd way of watching the link.

^ permalink raw reply

* Re: Proposed linux kernel changes : scaling  tcp/ip stack : 3rd part
From: Mitchell Erblich @ 2010-06-16  6:09 UTC (permalink / raw)
  To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
In-Reply-To: <D70D2636-7C49-48A5-B4DC-B9583A448415@earthlink.net>


On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:

> 
> On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
> 
>> 
>> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>> 
>>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>>>> To whom it may concern,
>>>> 
>>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>>> developers to see if there is any consensus that the below is a logical 
>>>> approach. Please also pass this email if there is a "owner(s)" of this stack
>>>> to identify if a case exists for the below possible changes.
>>>> 
>>>> I am not currently on the linux kernel mail group.
>>>> 			
>>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>>> merged the changes into the company's local tree and left the possible 
>>>> global integration to others.
>>>> 
>>>> I have been approached by a number of companies about scaling the
>>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>>> time on my hands and am considering looking into this area on my own.
>>>> 
>>>> The first assumption is that if extra cores are available, that a single
>>>> received homogeneous flow of a large number of packets/segments per
>>>> second (pps) can be split into non-equal flows. This split can in effect
>>>> allow a larger recv'd pps rate at the same core load while splitting off
>>>> other workloads, such as xmit'ing pure ACKs.
>>>> 
>>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>>> between cores), and creating logical separations where in a many core 
>>>> system, different cores could have new kernel threads  that operate in 
>>>> parallel within the tcp/ip stack. The initial separation points would be at 
>>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some 
>>>> form of output.
>>>> 
>>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>>>> with some form of queuing & scheduling, would be needed. In addition,
>>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>>>> to separate the I/O.
>>>> 
>>>> A possible validation test is to identify the max recv'd pps rate within the
>>>> tcp/ip modules within normal flow TCP established state with normal order 
>>>> of say 64byte non fragmented segments, before and after each 
>>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>>> 
>>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>>> changes into this tree and if there is willingness, a seen want/need then identify
>>>> how to implement the merge.
>>> 
>>> Hi Mitchell
>>> 
>>> We work everyday to improve network stack, and standard linux tree is
>>> pretty scalable, you dont need to setup a separate git tree for that.
>>> 
>>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>>> net-next-2.6 where we put all our changes.
>>> 
>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>> 
>>> I suggest you read the last patches (say .. about 10.000 of them), to
>>> have an idea of things we did during last years.
>>> 
>>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>>> placement...
>>> 
>>> Its nice to see another man joining the team !
>>> 
>>> Thanks
>>> 
>> 
>> 
>> Lets start with a two part Linux kernel change and a tcp input/output change:
>> 
>> 2 Parts: 2nd part TBD
>> 
>> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
>> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
>> in the generic kernel. TBD.
>> 
>> This change should have no effect with normal available kernel mem allocs.
>> 
>> Assuming memory pressure ( WAITING for clean memory) we should be allocating
>> our last pages for input skbufs and not for xmit allocs.
>> 
>> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
>> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
>> decrease the number of in-flight ACKs and the peer should do burst avoidance
>> if our later ack increases the window in a larger chunk..
>> 
>> And use the last pages to decrease the chance of dropping a input pkt or
>> running out of recv descriptors, because of mem back pressure.
>> 
>> The change could check for some form of mem pressure before the alloc,
>> but the alloc in itself should suffice. We could also do a ECN type check before
>> the alloc.
>> 
>> Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
>> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
>> and change just the 1 arg.
>> 
>> code : tcp_output.c : tcp_send_ack()
>>  line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
>> 
>> Suggestions, feedback??
>> 
>> Mitchell Erblich
>> 
>> 
>> 
>> 
> 
> Sorry :),
> 
> 		2nd part:
> 
> 		use GFP_NOWAIT as 2nd arg to alloc_skb()
> 
> Mitchell Erblich

Going in the same direction,


If tcp_out_of_resources() and the number of orphaned sockets is above
a configured number (maybe because of DoS attack), SHOULD we consume
our last available resources and most likely effect skbufs that we aren't
reset-ing because NOW the recv sk allocs are failing.

thus,
file tcp_timer.c : tcp_out_of_resources()
suggestion change 2nd arg GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);

Please note that even if we believed that the GFP_ATOMIC would have a higher 
probability to send a TCP pkt/seg, that gives us no guarantee that the peer
will recv it or will process it.

We COULD also do some form of ECN in this function to inform the peer that our 
system is in distress if tcp_send_active_reset() did not return void and informed 
us of a mem alloc failure with the GFP_NOWAIT.

Since the ECN would benefit the our node/system, this ECN sending event COULD
be argued to have a higher priority and mem argument then sent with a GFP_ATOMIC.


Suggestions, opinions...

Mitchell Erblich

>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16  6:16 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, Paul E. McKenney, shemminger, mst, frzhang, netdev,
	amwang, mpm
In-Reply-To: <20100616033336.GA17440@gondor.apana.org.au>

Le mercredi 16 juin 2010 à 13:33 +1000, Herbert Xu a écrit :
> On Wed, Jun 16, 2010 at 05:03:20AM +0200, Eric Dumazet wrote:
> >
> > I wonder how these patches were tested, Herbert ?
> 
> You know, not everyone enables RCU debugging...
> 

Hmm, this was not an attack Herbert, just a suggestion, I now turn on
RCU lockdep debugging when doing RCU changes, it can helps ;)

> Anyway, this patch should fix the problems you've spotted.
> 

Thanks


^ permalink raw reply

* Re: [PATCH 05/12] phylib: Allow reading and writing a mii bus from atomic context.
From: Richard Cochran @ 2010-06-16  6:20 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, Thomas Gleixner,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <AANLkTin8RI4UKzA58k_qER_urdwnD037u5GZrVQS8IoS-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

> That's right, and I fully agree with that change.  To me, going back
> to allowing spin locks is a regression because it adds a new source of
> scheduling latency.

I think that the change was not about reducing scheduling
latency. Rather, the idea was simply to allow mdio bus drivers that
sleep. Here is the change log message:

    commit 35b5f6b1a82b5c586e0b24c711dc6ba944e88ef1

    PHYLIB: Locking fixes for PHY I/O potentially sleeping
    
    PHY read/write functions can potentially sleep (e.g., a PHY accessed
    via I2C).  The following changes were made to account for this:
    
        * Change spin locks to mutex locks
        * Add a BUG_ON() to phy_read() phy_write() to warn against
          calling them from an interrupt context.
        * Use work queue for PHY state machine handling since
          it can potentially sleep
        * Change phydev lock from spinlock to mutex

The fundamental issue is this: Fro the SO_TIMESTAMPING API, receive
timestamps must appear in a control message along with the packet
data. Only the MAC driver (or the PHY driver) knows how to get the
timestamp. The stack calls the MAC driver via its napi poll
function. During the call, the driver must provide the skb with Rx
timestamp.

The only reasonable way to do this is to have the driver fetch the
timestamp durng the napi poll function. For MAC drivers with fast
register access, the performance penalty is small. For PHY drivers
with must go via the MDIO bus, the performance penalty is obviously
larger, and the user must be willing to live with it.

You might suggest the alternate that the driver would defer the
netif_receive_skb() callback until a work queue completes, providing
the Rx timestamp. The driver would then call netif_receive_skb() at
some later time.

However, there are a number of problems with this idea:

1. It is really icky for the drivers to be creating new skb queues for
   this purpose. MAC drivers would have to maintain such queues on
   behalf of the PHY drivers, but only when the PHYs support
   timestamping. Yuck.

2. There is a (soft) real time constraint on the delivery of the PTP
   packets to the user space application. Basicly, delays and jitter
   in the time to receive the packet negatively affect the clock servo
   loop.

3. It cannot work for many kinds of PTP timestamping hardware. Some of
   hardware only timestamps PTP packets. That means that not every
   received packet will have a timestamp. Such hardware provides some
   key data from the packet (like PTP UUID and sequence number) with
   the timestamp. Software must match this information to a particular
   packet. In order to defer a skb, the driver must first obtain the
   timestamp information. This is a catch-22.

Having said all that, I am still open to suggestions...

Richard

^ permalink raw reply

* Re: [0/8] netpoll/bridge fixes
From: Eric Dumazet @ 2010-06-16  6:21 UTC (permalink / raw)
  To: paulmck
  Cc: David Miller, herbert, shemminger, mst, frzhang, netdev, amwang,
	mpm
In-Reply-To: <20100616050808.GD2911@linux.vnet.ibm.com>

Le mardi 15 juin 2010 à 22:08 -0700, Paul E. McKenney a écrit :
> On Wed, Jun 16, 2010 at 04:58:59AM +0200, Eric Dumazet wrote:
> > 
> > Paul, could you please explain if current lockdep rules are correct, or could be relaxed ?
> > 
> > I thought :
> > 
> > rcu_read_lock_bh();
> > 
> > was a shorthand to
> > 
> > local_disable_bh();
> > rcu_read_lock();
> 
> In CONFIG_TREE_RCU and CONFIG_TINY_RCU, rcu_read_lock_bh() is actually
> shorthand for only local_disable_bh().  Therefore, rcu_dereference()
> will scream if only rcu_read_lock_bh() is held.
> 
> However, in CONFIG_PREEMPT_TREE_RCU, rcu_read_lock_bh() is its own
> mechanism that does local_disable_bh() but has its own set of grace
> periods, independent of those of rcu_read_lock().
> 
> > Why lockdep is not able to make a correct diagnostic ?
> 
> Here is the situation I am concerned about:
> 
> o	Task 0 does rcu_read_lock(), then p=rcu_dereference_bh().
> 	If we make the change you are asking for, rcu_dereference_bh()
> 	is OK with this.
> 
> o	Task 0 now is preempted before finishing its RCU read-side
> 	critical section.
> 
> o	Task 1 removes the data element referenced by pointer p,
> 	then invokes synchronize_rcu_bh().
> 
> o	Task 0 does not block synchronize_rcu_bh(), so the grace
> 	period completes.
> 
> o	Task 1 frees up the data element referenced by pointer p,
> 	which might be reallocated as some other type, unmapped,
> 	or whatever else.
> 
> o	Task 0 resumes, and is sadly disappointed when the data
> 	element referenced by pointer p has been swept out from
> 	under it.
> 
> Or am I missing something here?
> 

Nice thing with RCU is that I learn new things every day ;)

Thanks Paul, I'll try to remember all the details ! ;)




^ permalink raw reply

* Re: [PATCH 04/12] phylib: add a way to make PHY time stamps possible.
From: Richard Cochran @ 2010-06-16  6:29 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <27c0ad283f025c2bb71e7ceb71be07f969939429.1276615626.git.richard.cochran-3mrvs1K0uXizZXS1Dc/lvw@public.gmane.org>

On Tue, Jun 15, 2010 at 06:08:20PM +0200, Richard Cochran wrote:

> +static inline void skb_tx_timetamp(struct phy_device *phy, struct sk_buff *skb)
> +{
> +	union skb_shared_tx *shtx = skb_tx(skb);
> +
> +	if (shtx->hardware && phy && phy->drv->txtstamp)
> +		phy->drv->txtstamp(phy, skb);
> +
> +	if (shtx->software && !shtx->in_progress)
> +		skb_tstamp_tx(skb, NULL);
> +}

I forgot to mention this patch also provides a way to fix the broken
software timestamp fallback mode of the SO_TIMESTAMPING API.

We would have to add this inline call to every MAC driver in an
appropriate spot within the hard_xmit function. It is not too pretty,
but providing this as a compile time option will promote
standardization of the SO_TIMESTAMPING API for applications.

Richard

^ permalink raw reply

* Re: linux-next: build warning after merge of the final tree (net tree related)
From: Stephen Rothwell @ 2010-06-16  6:30 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, linux-next, linux-kernel
In-Reply-To: <20100615.214406.45904170.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 561 bytes --]

On Tue, 15 Jun 2010 21:44:06 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 16 Jun 2010 13:56:30 +1000
> 
> > bridge: Add const to dummy br_netpoll_send_skb
> > 
> > The version of br_netpoll_send_skb used when netpoll is off is
> > missing a const thus causing a warning.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> Applied.

Thanks guys.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: linux-next: build warnings after merge of the net tree
From: Stephen Rothwell @ 2010-06-16  6:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, linux-next, linux-kernel, bhutchings
In-Reply-To: <20100615.215212.214211994.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 470 bytes --]

Hi Dave,

On Tue, 15 Jun 2010 21:52:12 -0700 (PDT) David Miller <davem@davemloft.net> wrote:
>
> I've commited the patch below to deal with this, thanks for the report.

Thanks.

> There's some pre-existing warnings someone will need to deal with at
> some point:
> 
> drivers/usb/gadget/rndis.c: whole file: warning: coding style is bolixed

:-)

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: Proposed linux kernel changes : scaling  tcp/ip stack : 3rd part
From: Eric Dumazet @ 2010-06-16  6:37 UTC (permalink / raw)
  To: Mitchell Erblich; +Cc: netdev
In-Reply-To: <F5330DD9-0486-4831-B2CD-83B3DF111456@earthlink.net>

Le mardi 15 juin 2010 à 23:09 -0700, Mitchell Erblich a écrit :
> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
> 
> > 
> > On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
> > 
> >> 
> >> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
> >> 
> >>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
> >>>> To whom it may concern,
> >>>> 
> >>>> First, my assumption is to keep this discussion local to just a few tcp/ip
> >>>> developers to see if there is any consensus that the below is a logical 
> >>>> approach. Please also pass this email if there is a "owner(s)" of this stack
> >>>> to identify if a case exists for the below possible changes.
> >>>> 
> >>>> I am not currently on the linux kernel mail group.
> >>>> 			
> >>>> I have experience with modifications of the Linux tcp/ip stack, and have
> >>>> merged the changes into the company's local tree and left the possible 
> >>>> global integration to others.
> >>>> 
> >>>> I have been approached by a number of companies about scaling the
> >>>> stack with the assumption of a number of cpu cores. At present, I find extra
> >>>> time on my hands and am considering looking into this area on my own.
> >>>> 
> >>>> The first assumption is that if extra cores are available, that a single
> >>>> received homogeneous flow of a large number of packets/segments per
> >>>> second (pps) can be split into non-equal flows. This split can in effect
> >>>> allow a larger recv'd pps rate at the same core load while splitting off
> >>>> other workloads, such as xmit'ing pure ACKs.
> >>>> 
> >>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
> >>>> between cores), and creating logical separations where in a many core 
> >>>> system, different cores could have new kernel threads  that operate in 
> >>>> parallel within the tcp/ip stack. The initial separation points would be at 
> >>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some 
> >>>> form of output.
> >>>> 
> >>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
> >>>> with some form of queuing & scheduling, would be needed. In addition,
> >>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
> >>>> to separate the I/O.
> >>>> 
> >>>> A possible validation test is to identify the max recv'd pps rate within the
> >>>> tcp/ip modules within normal flow TCP established state with normal order 
> >>>> of say 64byte non fragmented segments, before and after each 
> >>>> incremental change. Or the same rate with fewer core/cpu cycles.
> >>>> 
> >>>> I am willing to have a private git Linux.org tree that concentrates proposed
> >>>> changes into this tree and if there is willingness, a seen want/need then identify
> >>>> how to implement the merge.
> >>> 
> >>> Hi Mitchell
> >>> 
> >>> We work everyday to improve network stack, and standard linux tree is
> >>> pretty scalable, you dont need to setup a separate git tree for that.
> >>> 
> >>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
> >>> net-next-2.6 where we put all our changes.
> >>> 
> >>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> >>> 
> >>> I suggest you read the last patches (say .. about 10.000 of them), to
> >>> have an idea of things we did during last years.
> >>> 
> >>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
> >>> placement...
> >>> 
> >>> Its nice to see another man joining the team !
> >>> 
> >>> Thanks
> >>> 
> >> 
> >> 
> >> Lets start with a two part Linux kernel change and a tcp input/output change:
> >> 
> >> 2 Parts: 2nd part TBD
> >> 
> >> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> >> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> >> in the generic kernel. TBD.
> >> 
> >> This change should have no effect with normal available kernel mem allocs.
> >> 
> >> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> >> our last pages for input skbufs and not for xmit allocs.
> >> 
> >> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
> >> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
> >> decrease the number of in-flight ACKs and the peer should do burst avoidance
> >> if our later ack increases the window in a larger chunk..
> >> 
> >> And use the last pages to decrease the chance of dropping a input pkt or
> >> running out of recv descriptors, because of mem back pressure.
> >> 
> >> The change could check for some form of mem pressure before the alloc,
> >> but the alloc in itself should suffice. We could also do a ECN type check before
> >> the alloc.
> >> 
> >> Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
> >> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
> >> and change just the 1 arg.
> >> 
> >> code : tcp_output.c : tcp_send_ack()
> >>  line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
> >> 
> >> Suggestions, feedback??
> >> 
> >> Mitchell Erblich
> >> 
> >> 
> >> 
> >> 
> > 
> > Sorry :),
> > 
> > 		2nd part:
> > 
> > 		use GFP_NOWAIT as 2nd arg to alloc_skb()
> > 
> > Mitchell Erblich
> 
> Going in the same direction,
> 
> 
> If tcp_out_of_resources() and the number of orphaned sockets is above
> a configured number (maybe because of DoS attack), SHOULD we consume
> our last available resources and most likely effect skbufs that we aren't
> reset-ing because NOW the recv sk allocs are failing.
> 
> thus,
> file tcp_timer.c : tcp_out_of_resources()
> suggestion change 2nd arg GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);
> 
> Please note that even if we believed that the GFP_ATOMIC would have a higher 
> probability to send a TCP pkt/seg, that gives us no guarantee that the peer
> will recv it or will process it.
> 
> We COULD also do some form of ECN in this function to inform the peer that our 
> system is in distress if tcp_send_active_reset() did not return void and informed 
> us of a mem alloc failure with the GFP_NOWAIT.
> 
> Since the ECN would benefit the our node/system, this ECN sending event COULD
> be argued to have a higher priority and mem argument then sent with a GFP_ATOMIC.
> 
> 
> Suggestions, opinions...


1) Acks are about the smallest chunks that are ever allocated in network
stack.

2) Their lifetime is close to 0 us. They are not cloned (queued on a
socket queue), only given to device xmit. Unless you play with trafic
shaping and insane queue lengths, acks should not use more than 0.0001 %
of your ram.

3) Under attack, adding complex algos to try to resist only delay a bit
the moment where nothing can be done to stop the attack. Being clever or
not. Dropping packets is very fine.

4) Maybe all the work you think about is the balance between ATOMIC and
non ATOMIC (GFP_KERNEL) memory allocations ? Some tuning maybe ?
   input path always use ATOMIC ops, being run from sofirq, and cannot
wait.



^ permalink raw reply

* Re: [PATCH 10/12] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Richard Cochran @ 2010-06-16  6:45 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <AANLkTikFc15j-Qhw9M2CnKLKq58Wi1xD7E2dtVNsVgeA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Tue, Jun 15, 2010 at 11:20:41AM -0600, Grant Likely wrote:
> 
> Is this header file used by anything other than gianfar_ptp.c?  If
> not, then roll the two files together.

I anticipate that it might be necessary to share the header's contents
with gianfar.c one day.

> Use dash ('-') not underscore ('_') in property names.

Okay, can do.

> If you encode this value as a string, then it will be friendly for humans too.

Okay.

> I could use more explication here.  Is this a divider value?
> Computers are good at making calculations, and the driver can obtain
> the clock frequency supplied to the device.  It may be more useful to
> specify here the desired frequency rather than the divider.  Certainly
> more human-friendly too.

It is not that simple. The basic algorithm is described in the text,
and anyone wanting to use the eTSEC for PTP will have to consider the
issue themselves, since it is a design issue with a few tradeoffs
related to the board layout. It is really too thorny to do in the
driver automatically.

I can post a tcltk calculator script for finding appropriate values,
in anyone would like to see it.

> > +static struct etsects the_clock;
> 
> Will there ever be multiple instances of this device?

No, never. If you consider how PTP works, there can only be one clock
per system.

> Consider of_iomap(), it will simplify the code a bit.

> Move ptp_gianfar_exit() definition here so it is immediately before
> the module_exit() line.

Okay.


Thanks for the review,

Richard

^ permalink raw reply

* Re: [PATCH 11/12] ptp: Added a clock driver for the IXP46x.
From: Richard Cochran @ 2010-06-16  6:54 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <AANLkTinPDaWNvUG8q8RdvRUga-qrPlSV2H5TkzVVMaFn@mail.gmail.com>

On Tue, Jun 15, 2010 at 12:41:56PM -0600, Grant Likely wrote:
> Nitpick.  We use all lower case names for structures in Linux.

Yes, I know, but in this case an exception makes sense.

I prefer to use the exact same register mnemonics as in the hardware
documentation, whenever possible. That way, anyone later working on
the driver with hardware manual in hand (and they should be doing that
way) will immediately see the connection.

> You want to get stuff as fast as possible, but there is a udelay()
> that just chews up CPU time.  Would cpu_relax() be sufficient with a
> time-based exit condition in the loop?

I am not sure. What does cpu_relax() do exactly, and when is it safe
to call?

Thanks,
Richard

^ permalink raw reply

* Re: Proposed linux kernel changes : scaling  tcp/ip stack : 3rd part
From: Mitchell Erblich @ 2010-06-16  7:46 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1276670223.19249.77.camel@edumazet-laptop>


On Jun 15, 2010, at 11:37 PM, Eric Dumazet wrote:

> Le mardi 15 juin 2010 à 23:09 -0700, Mitchell Erblich a écrit :
>> On Jun 15, 2010, at 8:30 PM, Mitchell Erblich wrote:
>> 
>>> 
>>> On Jun 15, 2010, at 8:11 PM, Mitchell Erblich wrote:
>>> 
>>>> 
>>>> On Jun 3, 2010, at 2:14 AM, Eric Dumazet wrote:
>>>> 
>>>>> Le jeudi 03 juin 2010 à 01:16 -0700, Mitchell Erblich a écrit :
>>>>>> To whom it may concern,
>>>>>> 
>>>>>> First, my assumption is to keep this discussion local to just a few tcp/ip
>>>>>> developers to see if there is any consensus that the below is a logical 
>>>>>> approach. Please also pass this email if there is a "owner(s)" of this stack
>>>>>> to identify if a case exists for the below possible changes.
>>>>>> 
>>>>>> I am not currently on the linux kernel mail group.
>>>>>> 			
>>>>>> I have experience with modifications of the Linux tcp/ip stack, and have
>>>>>> merged the changes into the company's local tree and left the possible 
>>>>>> global integration to others.
>>>>>> 
>>>>>> I have been approached by a number of companies about scaling the
>>>>>> stack with the assumption of a number of cpu cores. At present, I find extra
>>>>>> time on my hands and am considering looking into this area on my own.
>>>>>> 
>>>>>> The first assumption is that if extra cores are available, that a single
>>>>>> received homogeneous flow of a large number of packets/segments per
>>>>>> second (pps) can be split into non-equal flows. This split can in effect
>>>>>> allow a larger recv'd pps rate at the same core load while splitting off
>>>>>> other workloads, such as xmit'ing pure ACKs.
>>>>>> 
>>>>>> Simply, again assuming Amdahl's law (and not looking to equalize the load
>>>>>> between cores), and creating logical separations where in a many core 
>>>>>> system, different cores could have new kernel threads  that operate in 
>>>>>> parallel within the tcp/ip stack. The initial separation points would be at 
>>>>>> the ip/tcp layer boundry and where any recv'd sk/pkt would generate some 
>>>>>> form of output.
>>>>>> 
>>>>>> The ip/tcp layer would be split like the vintage AT&T STREAMs protocol,
>>>>>> with some form of queuing & scheduling, would be needed. In addition,
>>>>>> the queuing/schedullng of other kernel threads would occur within ip & tcp
>>>>>> to separate the I/O.
>>>>>> 
>>>>>> A possible validation test is to identify the max recv'd pps rate within the
>>>>>> tcp/ip modules within normal flow TCP established state with normal order 
>>>>>> of say 64byte non fragmented segments, before and after each 
>>>>>> incremental change. Or the same rate with fewer core/cpu cycles.
>>>>>> 
>>>>>> I am willing to have a private git Linux.org tree that concentrates proposed
>>>>>> changes into this tree and if there is willingness, a seen want/need then identify
>>>>>> how to implement the merge.
>>>>> 
>>>>> Hi Mitchell
>>>>> 
>>>>> We work everyday to improve network stack, and standard linux tree is
>>>>> pretty scalable, you dont need to setup a separate git tree for that.
>>>>> 
>>>>> Our beloved maintainer David S. Miller handles two trees, net-2.6 and
>>>>> net-next-2.6 where we put all our changes.
>>>>> 
>>>>> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>>>>> 
>>>>> I suggest you read the last patches (say .. about 10.000 of them), to
>>>>> have an idea of things we did during last years.
>>>>> 
>>>>> keywords : RCU, multiqueue, RPS, percpu data, lockless algos, cache line
>>>>> placement...
>>>>> 
>>>>> Its nice to see another man joining the team !
>>>>> 
>>>>> Thanks
>>>>> 
>>>> 
>>>> 
>>>> Lets start with a two part Linux kernel change and a tcp input/output change:
>>>> 
>>>> 2 Parts: 2nd part TBD
>>>> 
>>>> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
>>>> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
>>>> in the generic kernel. TBD.
>>>> 
>>>> This change should have no effect with normal available kernel mem allocs.
>>>> 
>>>> Assuming memory pressure ( WAITING for clean memory) we should be allocating
>>>> our last pages for input skbufs and not for xmit allocs.
>>>> 
>>>> By delaying skbuf allocations when we have low kmem, we secondarily slow down the
>>>> tcp flow : if in slow start (SS) we are almost doing a DELACK, else CA should/could
>>>> decrease the number of in-flight ACKs and the peer should do burst avoidance
>>>> if our later ack increases the window in a larger chunk..
>>>> 
>>>> And use the last pages to decrease the chance of dropping a input pkt or
>>>> running out of recv descriptors, because of mem back pressure.
>>>> 
>>>> The change could check for some form of mem pressure before the alloc,
>>>> but the alloc in itself should suffice. We could also do a ECN type check before
>>>> the alloc.
>>>> 
>>>> Now the kicker.  I want a GFP_KERNEL with NO_SLEEP OR a GFP_ATOMIC and
>>>> NOT use emergency pools, thus CAN FAIL, to have 0 other secondary effects
>>>> and change just the 1 arg.
>>>> 
>>>> code : tcp_output.c : tcp_send_ack()
>>>> line : buff = alloc_skb(MAX_TCP_HDR, GFP_KERNEL_NSLEEP);   /* with a NO SLEEP */
>>>> 
>>>> Suggestions, feedback??
>>>> 
>>>> Mitchell Erblich
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> Sorry :),
>>> 
>>> 		2nd part:
>>> 
>>> 		use GFP_NOWAIT as 2nd arg to alloc_skb()
>>> 
>>> Mitchell Erblich
>> 
>> Going in the same direction,
>> 
>> 
>> If tcp_out_of_resources() and the number of orphaned sockets is above
>> a configured number (maybe because of DoS attack), SHOULD we consume
>> our last available resources and most likely effect skbufs that we aren't
>> reset-ing because NOW the recv sk allocs are failing.
>> 
>> thus,
>> file tcp_timer.c : tcp_out_of_resources()
>> suggestion change 2nd arg GFP_ATOMIC: tcp_send_active_reset(sk, GFP_NOWAIT);
>> 
>> Please note that even if we believed that the GFP_ATOMIC would have a higher 
>> probability to send a TCP pkt/seg, that gives us no guarantee that the peer
>> will recv it or will process it.
>> 
>> We COULD also do some form of ECN in this function to inform the peer that our 
>> system is in distress if tcp_send_active_reset() did not return void and informed 
>> us of a mem alloc failure with the GFP_NOWAIT.
>> 
>> Since the ECN would benefit the our node/system, this ECN sending event COULD
>> be argued to have a higher priority and mem argument then sent with a GFP_ATOMIC.
>> 
>> 
>> Suggestions, opinions...
> 
> 
> 1) Acks are about the smallest chunks that are ever allocated in network
> stack.
> 
> 2) Their lifetime is close to 0 us. They are not cloned (queued on a
> socket queue), only given to device xmit. Unless you play with trafic
> shaping and insane queue lengths, acks should not use more than 0.0001 %
> of your ram.
> 
> 3) Under attack, adding complex algos to try to resist only delay a bit
> the moment where nothing can be done to stop the attack. Being clever or
> not. Dropping packets is very fine.
> 
> 4) Maybe all the work you think about is the balance between ATOMIC and
> non ATOMIC (GFP_KERNEL) memory allocations ? Some tuning maybe ?
>   input path always use ATOMIC ops, being run from sofirq, and cannot
> wait.
> 

I am not suggesting a pure GFP_KERNEL /sleep allocs on either side.
I suggested it with a NO-SLEEP and then looked and saw GFP_NOWAIT.

However, when under mem pressure, it might make sense to delay/slow
the opening of the snd and recv TCP windows, to allow mem pages
to be cleaned and re-used to relieve the mem pressure.

If queues are already existing, then they are the first delay-points
to be reviewed / used, however only under abnormal circumstances.

Maybe even short-lived flows have expired and those resources can now
be used.

> 
> --

Eric & group,

I am starting simple with a different look at what COULD/SHOULD IT be done?

The question that I am asking is if GFP_NOWAIT WOULD fail  AND GFP_ATOMIC
would succeed, then only the last percentage of kernel memory allocations are
being done. Do you want to use this last few percentage of memory with ACKs/xmits?

Maybe if we failed a few non-necessary / delayable items, then enough time
may occur to clean pages, and not execute any OOM like code.

The effect of generating a short DELACK SHOULD reduce memory pressure from 
the peer in 1 RTT. Also, failing and executing the non-buff code, MAY ALSO slow-down
the ramp-up based on the TCP ACK clock.

The number of ACTIVE (recv at the NIC to xmit at the NIC) tcp flows may account
for a non-neglible number of later allocs and decrease mem pressure.

If I understand the diff between GFP_NOWAIT and GFP_ATOMIC, both
don't sleep, and only GFP_NOWAIT doesn't grab for the last available pages. So,
they are both atomic/no-sleeps. 

To conclude with part 4: tcp_timer.c has 2 additional calls to tcp_send_active_reset()
use GFP_NOWAIT instead of GFP_ATOMIC.

Now, with your statement about "input path always uses atomic". SHOULD IT?
SAY under a DoS attack? Why not reject some if no-memory via GFP_NOWAIT.
If a new flow is to be started and we are under memory pressure, should it not ALSO
use GFP_NOWAIT?  If we ALSO do GFP_NOWAIT on ESTABLISHED flows and (shoot me)
drop a seg/pkt, that should drop them to 1/2 bandwidth. This is in effect TCP fair-ness.
Thus, delays in ACKs are much more preferred.

Again, we will only minimally effect flows with the NEW suggested changes if their is 
mem pressure and going to do more aggressive things like reset sockets/flows.

Since, my suggested changes are NOT in the form of a patch, then someone ELSE 
needs to agree with me and then the maintainer must see that it does no harm and 
the changes slowly moves the code in the right direction.

Mitchell Erblich

==================


	
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH net-next-2.6] syncookies: check decoded options against sysctl settings
From: Florian Westphal @ 2010-06-16  8:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100615.180947.241927235.davem@davemloft.net>

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Sun, 13 Jun 2010 23:34:35 +0200
> 
> > -	if (tcp_opt->sack_ok)
> > -		tcp_sack_reset(tcp_opt);
> > +	if (tcp_opt->sack_ok && !sysctl_tcp_sack)
> > +		return false;
> >  
> 
> If you remove the tcp_sack_reset() call here, who is going to
> do it?

Right, I should have mentioned that in the changelog, sorry about that.

Bottom line is that I failed to find out why its needed.
Both call sites of this function (cookie_v4_check, cookie_v6_check)
allocate the "struct tcp_options_received" argument on the stack, zero it,
hand it to tcp_parse_options() and then call cookie_check_timestamp().

I did not find any place in tcp_parse_options that would cause
tcp_opt->num_sacks/dsack to become nonzero.

Even if it can turn nonzero, I do not see any ill effects that might
happen then. The structure is on the stack and after tcp_parse_options()
returns only a few selected members are copied to the inet_request_sock.

^ permalink raw reply

* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Pedro Garcia @ 2010-06-16  8:49 UTC (permalink / raw)
  To: netdev; +Cc: Patrick McHardy, Ben Hutchings, Eric Dumazet
In-Reply-To: <1276542772.2444.13.camel@edumazet-laptop>

On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>
wrote:
> Le lundi 14 juin 2010 à 19:11 +0200, Patrick McHardy a écrit :
>> Ben Hutchings wrote:
>> > On Mon, 2010-06-14 at 18:49 +0200, Pedro Garcia wrote:
>> >   
>> >> On Sun, 13 Jun 2010 22:56:30 +0100, Ben Hutchings
>> >> <bhutchings@solarflare.com> wrote:
>> >>     
>> >>> I have no particular opinion on this change, but you need to read and
>> >>> follow Documentation/SubmittingPatches.
>> >>>
>> >>> Ben.
>> >>>       
>> >> Sorry, first kernel patch, and I did not know about it. I resubmit
>> >> with
>> >> the correct style / format:
>> >>     
>> > [...]
>> >
>> > Sorry, no you haven't.
>> >
>> > - Networking changes go through David Miller's net-next-2.6 tree so you
>> > need to use that as the baseline, not 2.6.26
>> > - Patches should be applicable with -p1, not -p0 (so if you use diff,
>> > you should run it from one directory level up)
>> > - The patch was word-wrapped
>> 
>> Additionally:
>> 
>> - please use the proper comment style, meaning each line begins
>>   with a '*'
>> 
>> - the pr_debug statements may be useful for debugging, but are
>>   a bit excessive for the final version
>> 
>> - + /* 2010-06-13: Pedro Garcia
>> 
>>    We have changelogs for this, simply explaining what the code
>>    does is enough.
>> 
>> - Please CC the maintainer (which is me)
>> --
> 
> Pedro, we have two kind of vlan setups :
> 
> accelerated and non accelerated ones.
> 
> Your patch address non accelated ones only, since you only touch
> vlan_skb_recv()
> 
> Accelerated vlan can follow these paths :
> 
> 1) NAPI devices
> 
> vlan_gro_receive() -> vlan_gro_common()
> 
> 2) non NAPI devices
> 
> __vlan_hwaccel_rx() 
> 
> So you might also patch __vlan_hwaccel_rx() and vlan_gro_common()
> 
> Please merge following bits to your patch submission :
> 
> http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868
> 
> 
> Good luck for your first patch !

Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.

The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)

I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).

Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
--
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 50f58f5..daaca31 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -8,6 +8,9 @@
 int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
                      u16 vlan_tci, int polling)
 {
+       struct net_device *vlan_dev;
+       u16 vlan_id;
+
        if (netpoll_rx(skb))
                return NET_RX_DROP;
 
@@ -16,10 +19,14 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
 
        skb->skb_iif = skb->dev->ifindex;
        __vlan_hwaccel_put_tag(skb, vlan_tci);
-       skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
+       vlan_id = vlan_tci & VLAN_VID_MASK;
+       vlan_dev = vlan_group_get_device(grp, vlan_id);
 
-       if (!skb->dev)
-               goto drop;
+       if (vlan_dev)
+               skb->dev = vlan_dev;
+       else
+               if (vlan_id)
+                       goto drop;
 
        return (polling ? netif_receive_skb(skb) : netif_rx(skb));
 
@@ -82,16 +89,22 @@ vlan_gro_common(struct napi_struct *napi, struct vlan_group *grp,
                unsigned int vlan_tci, struct sk_buff *skb)
 {
        struct sk_buff *p;
+       struct net_device *vlan_dev;
+       u16 vlan_id;
 
        if (skb_bond_should_drop(skb, ACCESS_ONCE(skb->dev->master)))
                skb->deliver_no_wcard = 1;
 
        skb->skb_iif = skb->dev->ifindex;
        __vlan_hwaccel_put_tag(skb, vlan_tci);
-       skb->dev = vlan_group_get_device(grp, vlan_tci & VLAN_VID_MASK);
-
-       if (!skb->dev)
-               goto drop;
+       vlan_id = vlan_tci & VLAN_VID_MASK;
+       vlan_dev = vlan_group_get_device(grp, vlan_id);
+
+       if (vlan_dev)
+               skb->dev = vlan_dev;
+       else
+               if (vlan_id)
+                       goto drop;
 
        for (p = napi->gro_list; p; p = p->next) {
                NAPI_GRO_CB(p)->same_flow =
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 5298426..65512c3 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -142,6 +142,7 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
 {
        struct vlan_hdr *vhdr;
        struct vlan_rx_stats *rx_stats;
+       struct net_device *vlan_dev;
        u16 vlan_id;
        u16 vlan_tci;
 
@@ -157,8 +158,18 @@ int vlan_skb_recv(struct sk_buff *skb, struct net_device *dev,
        vlan_id = vlan_tci & VLAN_VID_MASK;
 
        rcu_read_lock();
-       skb->dev = __find_vlan_dev(dev, vlan_id);
-       if (!skb->dev) {
+       vlan_dev = __find_vlan_dev(dev, vlan_id);
+
+       /* If the VLAN device is defined, we use it.
+        * If not, and the VID is 0, it is a 802.1p packet (not
+        * really a VLAN), so we will just netif_rx it later to the
+        * original interface, but with the skb->proto set to the
+        * wrapped proto: we do nothing here.
+        */
+
+       if (vlan_dev) {
+               skb->dev = vlan_dev;
+       } else if (vlan_id) {
                pr_debug("%s: ERROR: No net_device for VID: %u on dev: %s\n",
                         __func__, vlan_id, dev->name);
                goto err_unlock;



^ permalink raw reply related

* [PATCH] Clear IFF_XMIT_DST_RELEASE for teql interfaces
From: Tom Hughes @ 2010-06-16  8:24 UTC (permalink / raw)
  To: netdev
  Cc: akpm, Tom Hughes, Jamal Hadi Salim, David S. Miller,
	Stephen Hemminger, Patrick McHardy, Tejun Heo, linux-kernel

The sch_teql module, which can be used to load balance over a set of
underlying interfaces, stopped working after 2.6.30 and has been
broken in all kernels since then for any underlying interface which
requires the addition of link level headers.

The problem is that the transmit routine relies on being able to
access the destination address in the skb in order to do address
resolution once it has decided which underlying interface it is going
to transmit through.

In 2.6.31 the IFF_XMIT_DST_RELEASE flag was introduced, and set by
default for all interfaces, which causes the destination address to be
released before the transmit routine for the interface is called.

The solution is to clear that flag for teql interfaces.

Signed-off-by: Tom Hughes <tom@compton.nu>
---
 net/sched/sch_teql.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 3415b6c..807643b 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -449,6 +449,7 @@ static __init void teql_master_setup(struct net_device *dev)
 	dev->tx_queue_len	= 100;
 	dev->flags		= IFF_NOARP;
 	dev->hard_header_len	= LL_MAX_HEADER;
+	dev->priv_flags		&= ~IFF_XMIT_DST_RELEASE;
 }
 
 static LIST_HEAD(master_dev_list);
-- 
1.7.0.1


^ permalink raw reply related

* Re: [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
From: Eric Dumazet @ 2010-06-16  8:56 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, paulmck
In-Reply-To: <20100615.214754.42801686.davem@davemloft.net>

Le mardi 15 juin 2010 à 21:47 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 16 Jun 2010 04:45:24 +0200
> 
> > [PATCH net-next-2.6] inetpeer: do not use zero refcnt for freed entries
> > 
> > Followup of commit aa1039e73cc2 (inetpeer: RCU conversion)
> > 
> > Unused inet_peer entries have a null refcnt.
> > 
> > Using atomic_inc_not_zero() in rcu lookups is not going to work for
> > them, and slow path is taken.
> > 
> > Fix this using -1 marker instead of 0 for deleted entries.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Applied, thanks Eric.

Thanks

With 65537 peers and a DDOS frag attack, I now get following profiling
results :

-----------------------------------------------------------------------------------------
   PerfTop:    1024 irqs/sec  kernel:100.0%  exact:  0.0% [1000Hz
cycles],  (all, cpu: 0)
-----------------------------------------------------------------------------------------

             samples  pcnt function                  DSO
             _______ _____ _________________________ 

             7722.00 65.6% inet_frag_find            
             1355.00 11.5% ip4_frag_match            
              494.00  4.2% __lock_acquire            
              260.00  2.2% inet_getpeer              
              243.00  2.1% ip_route_input_common     
              151.00  1.3% lock_release              
              142.00  1.2% mark_lock                 
              126.00  1.1% lock_acquire              
              104.00  0.9% __kmalloc                 
               86.00  0.7% skb_put                   


Just to show what could be the next steps ;)






^ permalink raw reply

* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Eric Dumazet @ 2010-06-16  9:08 UTC (permalink / raw)
  To: Pedro Garcia; +Cc: netdev, Patrick McHardy, Ben Hutchings
In-Reply-To: <311b59aee7d648c6124a84b5ca06ac60@dondevamos.com>

Le mercredi 16 juin 2010 à 10:49 +0200, Pedro Garcia a écrit :
> On Mon, 14 Jun 2010 21:12:52 +0200, Eric Dumazet <eric.dumazet@gmail.com>

> > Good luck for your first patch !
> 
> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
> 
> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
> 
> I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
> net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).
> 
> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>

OK, the patch itself is correct.

Now, could you please send it again with a proper changelog ?

In this changelog, please explain why patch is needed, and
keep lines short (< 72 chars), like the one you did in your first mail.

I'll then add my Signed-off-by, since I wrote the accelerated part ;)

Note : I wonder if another patch is needed, in case 8021q module is
_not_ loaded. We probably should accept vlan 0 frames in this case ?



^ permalink raw reply

* Re: Proposed linux kernel changes : scaling  tcp/ip stack
From: Andi Kleen @ 2010-06-16  9:10 UTC (permalink / raw)
  To: Mitchell Erblich; +Cc: Eric Dumazet, netdev
In-Reply-To: <97746864-ED54-4A12-AFE7-752AA6E41CDD@earthlink.net>

Mitchell Erblich <erblichs@earthlink.net> writes:
>
> Summary: Don't use last free pages for TCP ACKs with GFP_ATOMIC for our
> sk buf allocs. 1 line change in tcp_output.c with a new gfp.h arg, and a change
> in the generic kernel. TBD.
>
> This change should have no effect with normal available kernel mem allocs.
>
> Assuming memory pressure ( WAITING for clean memory) we should be allocating
> our last pages for input skbufs and not for xmit allocs.

How about you instrument a kernel and measure if this really happens
frequently under reasonable loads?  That is you can probably
use the existing dropped page counters in netstat 
Stephen added some time ago.

Since soft irqs cannot really wait exhausted GFP_ATOMIC would normally
lead to dropped packets. FWIW I am not aware of any serious dropped
packets problem on normal loads.

Running a kernel with nearly zero free memory is dangerous anyways
-- pretty much any kernel service can fail arbitarily --
if this happened frequently I suspect we would need generic
VM solution for it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: mpd client timeouts (bisected) 2.6.35-rc3
From: Christian Kujau @ 2010-06-16 10:01 UTC (permalink / raw)
  To: markus@trippelsdorf.de
  Cc: John Fastabend, David Miller, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, yanmin_zhang, alex.shi, tim.c.chen
In-Reply-To: <20100613205922.GA1806@arch.tripp.de>

On Sun, 13 Jun 2010 at 22:59, markus@trippelsdorf.de wrote:
> This solves the problem here. Thanks.

Not sure if this is related, but I've noticed connection timeouts and 
connections going in FIN_WAIT2 state (most of them SSH tunnels) with 
2.6.35. Going back to 2.6.34 or applying John's (and Eric's) patch 
does seem to fix this issue.

Thanks,
Christian.
-- 
BOFH excuse #168:

le0: no carrier: transceiver cable problem?

^ permalink raw reply

* Re: [PATCH 12/12] ptp: Added a clock driver for the National Semiconductor PHYTER.
From: Richard Cochran @ 2010-06-16 10:05 UTC (permalink / raw)
  To: Grant Likely
  Cc: netdev, devicetree-discuss, linuxppc-dev, linux-arm-kernel,
	Krzysztof Halasa
In-Reply-To: <AANLkTilqd-wxqMWDxssC-XMDJqZ8iIiSrvp2t8xARAQV@mail.gmail.com>

On Tue, Jun 15, 2010 at 12:49:13PM -0600, Grant Likely wrote:
> Won't this break things for existing DP83640 users?

Nope, the driver was only added five patches ago, and it only offers
the timestamping stuff. The standard PHY functions just call the
generic functions, so the PHY works fine even without this driver.

> > +static struct ptp_clock *dp83640_clock;
> > +DEFINE_SPINLOCK(clock_lock); /* protects the one and only dp83640_clock */
> 
> Why only one?  Is it not possible to have 2 of these PHYs in a system?

Yes, you can have multiple PHYs, but only one PTP clock.

If you do use multiple PHYs, then you must wire their clocks together
and adjust the PTP clock on only one of the PHYs.


Thanks for your other comments,

Richard

^ permalink raw reply

* Re: BUG: unable to handle kernel paging request at 000041ed00000001
From: Arturas @ 2010-06-16 10:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <96E62960-9EF8-4F05-92DD-2D7477D0D78B@res.lt>

Your new patch doesn't show any warnings about tx queue 
and everything is working as expected.

On Jun 14, 2010, at 12:27 PM, Arturas wrote:

> 
> On Jun 14, 2010, at 11:31 AM, Eric Dumazet wrote:
>> But your problem is about bridge, not bonding (see trace).
> I want it for performance reason, not because of this bug.
> Bridge isn't a bottleneck for me, but bonding may be and not to me only,
> but for many people. I believe that performance gain would be more 
> than 1% on cpu? :-)
> 
>> 
>> And 2.6.34 wont accept such changes, its already released.
> It can be as a separate patch or I can test 2.3.35 if it would accept
> such change. I just need a stable kernel with good performance :-)
> 
>> 
>>> I also have another issue with NMI. On older machine with 5500 xeons i 
>>> have almost no overhead with nmi_watchdog enabled, but on this it is about twice.
>>> without nmi enabled cpu peak average is 30%, and with nmi enabled i have 53%.
>>> When traffic is not passing all cpus are idling at 100%.
>>> Maybe overhead could be a little bit smaller? :-)
>>> 
>> 
>> I am a bit lost here, NMI have litle to do with network stack ;)
> May this be related to very recent cpu? As i understand NMI depends on CPU.
> 
>> 
>> 
>> Could you please test another patch ?
> Applied, it's working correctly for now. If i'll get a warning i'll write you or maybe I
> shouldn't get it if a patch is correct?
> 
>> 
>> Before calling sk_tx_queue_set(sk, queue_index); we should check if dst
>> dev is current device.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* rt hash table / rt hash locks question
From: Nick Piggin @ 2010-06-16 10:46 UTC (permalink / raw)
  To: netdev

I'm just converting this scalable dentry/inode hash table to a more
compact form. I was previously using a dumb spinlock per bucket,
but this doubles the size of the tables so isn't production quality.

What I've done at the moment is to use a bit_spinlock in bit 0 of each
list pointer of the table. Bit spinlocks are now pretty nice because
we can do __bit_spin_unlock() which gives non-atomic store with release
ordering, so it should be almost as fast as spinlock.

But I look at rt hash and it seems you use a small hash on the side
for spinlocks. So I wonder, pros for each:

- bitlocks have effectively zero storage
- bitlocks hit the same cacheline that the hash walk hits.
- in RCU list, locked hash walks usually followed by hash modification,
  bitlock should have brought in the line for exclusive.
- bitlock number of locks scales with hash size
- spinlocks may be slightly better at the cacheline level (bitops
  sometimes require explicit load which may not acquire exclusive
  line on some archs). On x86 ll/sc architectures, this shouldn't
  be a problem.
- spinlocks better debugging (could be overcome with a LOCKDEP
  option to revert to spinlocks, but a bit ugly).
- in practice, contention due to aliasing in buckets to lock mapping
  is probably fairly minor.

Net code is obviously tested and tuned well, but instinctively I would
have tought bitlocks are the better way to go. Any comments on this?

Thanks,
Nick


^ permalink raw reply

* DMFE Driver in current kernel
From: Heiko Gerstung @ 2010-06-16 10:52 UTC (permalink / raw)
  To: netdev

Dear netdev members,

I apologize for contacting this list, I was pointed here by Tobias
Ringstroem who is listed as the current maintainer for the dmfe network
driver but he told me that this is not true anymore. I believe that
probably no one is looking at this driver anymore, and he suggested that
I should contact you guys because he cannot help me.

Here is my long worded story: We are a manufacturer of special network
appliances used for time synchronization and in our LANTIME time server
appliances we are currently using GNU Linux on an embedded i386 CPU
board with a Davicom network chip.

I am desperately missing an important feature in the dmfe driver (which
lists you as the current maintainer): I would want to be able to change
network speed and duplex settings without having to unload/reload the
kernel module, preferably by using ethtool.

Would anyone here interested in adding support for this into the driver?

If yes, we would want to pay for this (and of course we would love
to see that these changes are included in the next kernel version, if
possible). If nobody on this list is interested in this job, I would
appreciate recommendations regarding where to look for someone. If I do
not manage to find anyone for this task, I will try it on my own.
Pointers to suitable documentation or hints would be appreciated!

I know that this is a pretty old chip, but it seems that it is still
used in a number of embedded systems and since we want to be able to
upgrade our existing units in the field with our new firmware, I really
would like to get this functionality into the driver.

Another point I am looking for is how to get the asix driver to play
nice when used in combination with active-backup bonding. As it seems,
the current driver does not work unless I put the bonding interface into
promiscuous mode (which I find out by accident when I used tcpdump for
debugging). If anyone has an idea how to improve this, I would be very
grateful, too.

Thanks for your time,
  Heiko


-- 

Heiko Gerstung

*MEINBERG Funkuhren* GmbH & Co. KG
Lange Wand 9
D-31812 Bad Pyrmont, Germany
Phone: +49 (0)5281 9309-25
Fax: +49 (0)5281 9309-30
Amtsgericht Hannover 17HRA 100322
Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg,
Andre Hartmann, Heiko Gerstung
Email: heiko.gerstung@meinberg.de <mailto:heiko.gerstung@meinberg.de>
Web: www.meinberg.de <http://www.meinberg.de>

------------------------------------------------------------------------
*MEINBERG - Accurate Time Worldwide*



^ permalink raw reply

* Re: [PATCH] vlan_dev: VLAN 0 should be treated as "no vlan tag" (802.1p packet)
From: Patrick McHardy @ 2010-06-16 11:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Pedro Garcia, netdev, Ben Hutchings
In-Reply-To: <1276679284.2632.22.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mercredi 16 juin 2010 à 10:49 +0200, Pedro Garcia a écrit :
>> Here it is again. I added the modifications in http://kerneltrap.org/mailarchive/linux-netdev/2010/5/23/6277868 for HW accelerated incoming packets (it did not apply cleanly on the last version of
>> the kernel, so I applied manually). Now, if the VLAN 0 is not explicitly created by the user, VLAN 0 packets will be treated as no VLAN (802.1p packets), instead of dropping them.
>>
>> The patch is now for two files: vlan_core (accel) and vlan_dev (non accel)
>>
>> I can not test on HW accelerated devices, so if someone can check it I will appreciate (even though in the thread above it looked like yes). For non accel I tessted in 2.6.26. Now the patch is for
>> net-next-2.6, and it compiles OK, but I a have to setup a test environment to check it is still OK (should, but better to test).
>>
>> Signed-off-by: Pedro Garcia <pedro.netdev@dondevamos.com>
>>     
>
> OK, the patch itself is correct.
>   

Yes, looks fine to me as well.

> Now, could you please send it again with a proper changelog ?
>
> In this changelog, please explain why patch is needed, and
> keep lines short (< 72 chars), like the one you did in your first mail.
>
> I'll then add my Signed-off-by, since I wrote the accelerated part ;)
>
> Note : I wonder if another patch is needed, in case 8021q module is
> _not_ loaded. We probably should accept vlan 0 frames in this case ?
>   

I agree that this would be best for consistency, but that would mean
adding more special cases to __netif_receive_skb().

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox