Netdev List
 help / color / mirror / Atom feed
* net/can/mscan: Fix buggy listen only mode setting
From: Wolfgang Grandegger @ 2011-11-14 13:34 UTC (permalink / raw)
  To: Netdev; +Cc: linux-can, Socketcan-core, Marc Kleine-Budde

This patch fixes an issue introduced recently with commit
452448f9283e1939408b397e87974a418825b0a8.

CC: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Wolfgang Grandegger <wg@grandegger.com>
---
 drivers/net/can/mscan/mscan.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/can/mscan/mscan.c b/drivers/net/can/mscan/mscan.c
index 74f3b18..1c82dd8 100644
--- a/drivers/net/can/mscan/mscan.c
+++ b/drivers/net/can/mscan/mscan.c
@@ -581,7 +581,7 @@ static int mscan_open(struct net_device *dev)
 
 	priv->open_time = jiffies;
 
-	if (ctrlmode.flags & CAN_CTRLMODE_LISTENONLY)
+	if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
 		setbits8(&regs->canctl1, MSCAN_LISTEN);
 	else
 		clrbits8(&regs->canctl1, MSCAN_LISTEN);
-- 
1.7.6.4

^ permalink raw reply related

* RE: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: David Laight @ 2011-11-14 13:25 UTC (permalink / raw)
  To: Ian Campbell, Michael S. Tsirkin
  Cc: netdev, David S. Miller, Neil Brown, J. Bruce Fields, linux-nfs
In-Reply-To: <1321276055.3664.67.camel@zakaz.uk.xensource.com>

 
> This prevents an issue where an ACK is delayed, a retransmit is queued
(either
> at the RPC or TCP level) and the ACK arrives before the retransmission
hits the
> wire. If this happens to an NFS WRITE RPC then the write() system call
> completes and the userspace process can continue, potentially
modifying data
> referenced by the retransmission before the retransmission occurs.

The only problem I see is that the source address might get
invalidated (assuming it is a user address mapped into kernel).
Not sure what effect the fault would have...

If it is an RPC retransmittion the NFS write won't be
terminated by the RPC ACK (because it is a stale ACK)
and the destination will see two copies of the NFS write.
(This is quite common with NFS/UDP.)

If it is a TCP retransmittion, then the receiving TCP stack
will see it as duplicate data and discard the contents
without passing them to RPC.

Interrupting an NFS write with a signal is problematic
anyway - I'm sure I've seen systems (SYSV?) when NFS
writes were uninterruptable. NFI how Linux handles this.

	David

^ permalink raw reply

* Re: [PATCH 4/4] sunrpc: use SKB fragment destructors to delay completion until page is released by network stack.
From: Ian Campbell @ 2011-11-14 13:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David S. Miller,
	Neil Brown, J. Bruce Fields,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20111113101713.GB15322-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Sun, 2011-11-13 at 10:17 +0000, Michael S. Tsirkin wrote:
> On Fri, Nov 11, 2011 at 01:20:27PM +0000, Ian Campbell wrote:
> > On Fri, 2011-11-11 at 12:38 +0000, Michael S. Tsirkin wrote:
> > > On Wed, Nov 09, 2011 at 03:02:07PM +0000, Ian Campbell wrote:
> > > > This prevents an issue where an ACK is delayed, a retransmit is queued (either
> > > > at the RPC or TCP level) and the ACK arrives before the retransmission hits the
> > > > wire. If this happens to an NFS WRITE RPC then the write() system call
> > > > completes and the userspace process can continue, potentially modifying data
> > > > referenced by the retransmission before the retransmission occurs.
> > > > 
> > > > Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
> > > > Acked-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> > > > Cc: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> > > > Cc: Neil Brown <neilb-l3A5Bk7waGM@public.gmane.org>
> > > > Cc: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
> > > > Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > 
> > > So this blocks the system call until all page references
> > > are gone, right?
> > 
> > Right. The alternative is to return to userspace while the network stack
> > still has a reference to the buffer which was passed in -- that's the
> > exact class of problem this patch is supposed to fix.
> 
> BTW, the increased latency and the overhead extra wakeups might for some
> workloads be greater than the cost of the data copy.

Under normal circumstances these paths should not be activated at all.
These only come into play if there are delays in the network coming from
somewhere and I would expect any negative effect from that to outweigh
either a copy or an additional wakeup.

> > 
> > >  consider a bridged setup
> > > with an skb queued at a tap device - this cause one process
> > > to block another one by virtue of not consuming a cloned skb?
> > 
> > Hmm, yes.
> > 
> > One approach might be to introduce the concept of an skb timeout to the
> > stack as a whole and cancel (or deep copy) after that timeout occurs.
> > That's going to be tricky though I suspect...
> 
> Further, an application might use signals such as SIGALARM,
> delaying them significantly will cause trouble.

AIUI there is nothing to stop the SIGALARM being delivered in a timely
manner, all which may need to be delayed is the write() returning
-EINTR. 

When -EINTR is returned the buffer must no longer be referenced either
implicitly or explicitly by the kernel and the write must not have
completed and nor should it complete in the future (that way lies
corruption of varying sorts) so copying the data pages is not helpful in
this case.

This patch ensures that the buffer is no longer referenced when the
write returns. It's possible that NFS might need to cancel a write
operation in order to not complete it after returning -EINTR (I don't
know if it does this or not) but I don't think this series impacts that
one way or the other.

> 
> > A simpler option would be to have an end points such as a tap device
> 
> Which end points would that be? Doesn't this affect a packet socket
> with matching filters? A userspace TCP socket that happens to
> reside on the same box?  It also seems that at least with a tap device
> an skb can get queued in a qdisk, also indefinitely, right?

Those are all possibilities.

In order for this patch to have any impact on any of these scenarios
those end points would have to currently be referencing and using pages
of data which have been "returned" to the originating user process and
may be changing under their feet. This is without a doubt a Bad Thing.

In the normal case it is likely that the same end point which is
injecting delay is also the one the originating process is actually
trying to talk to and so the delay would already have been present and
this patch doesn't delay things any further.

If there are parts of the stack which can end up holding onto an skb for
an arbitrary amount of time then I think that is something which needs
to be fixed up in those end points rather than in everyone who injects
an skb into the stack. Whether an immediate deep copy or a more lazy
approach is appropriate I suspect depends upon the exact use case of
each end point.

Having said that one idea which springs to mind would be to allow
someone who has injected a page into the stack to "cancel" it. Since I
am working on pushing the struct page * down into the struct
skb_destructor this could be as simple as setting the page to NULL.
However every end point would need to be taught to expect this. I'm sure
there are also all sorts of locking nightmares underlying this idea.
Perhaps you'd need separate reference counts for queued vs in active
use.

Ian.

> 
> > which can swallow skbs for arbitrary times implement a policy in this
> > regard, either to deep copy or drop after a timeout?
> > 
> > Ian.
> 
> Or deep copy immediately?
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
From: Michael S. Tsirkin @ 2011-11-14 13:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Krishna Kumar, kvm, Asias He, virtualization, gorcunov,
	Sasha Levin, netdev, mingo
In-Reply-To: <CAOJsxLE4yRO+5XncrDEzVh5-RGXG+=HGTAmQ=bX=KMsDqf=feA@mail.gmail.com>

On Mon, Nov 14, 2011 at 02:25:17PM +0200, Pekka Enberg wrote:
> On Mon, Nov 14, 2011 at 4:04 AM, Asias He <asias.hejun@gmail.com> wrote:
> > Why both the bandwidth and latency performance are dropping so dramatically
> > with multiple VQ?
> 
> What's the expected benefit from multiple VQs

Heh, the original patchset didn't mention this :) It really should.
They are supposed to speed up networking for high smp guests.

> i.e. why are doing the patches Sasha?

-- 
MST

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Dave Taht @ 2011-11-14 12:32 UTC (permalink / raw)
  To: Neil Horman; +Cc: netdev, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <20111114114700.GA27284@hmsreliant.think-freely.org>

On Mon, Nov 14, 2011 at 12:47 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
>> Data Center Bridging environments are currently somewhat limited in their
>> ability to provide a general mechanism for controlling traffic priority.
>> Specifically they are unable to administratively control the priority at which
>> various types of network traffic are sent.
>>
>> Currently, the only ways to set the priority of a network buffer are:
>>
>> 1) Through the use of the SO_PRIORITY socket option
>> 2) By using low level hooks, like a tc action
>>
>> (1) is difficult from an administrative perspective because it requires that the
>> application to be coded to not just assume the default priority is sufficient,
>> and must expose an administrative interface to allow priority adjustment.  Such
>> a solution is not scalable in a DCB environment
>>
>> (2) is also difficult, as it requires constant administrative oversight of
>> applications so as to build appropriate rules to match traffic belonging to
>> various classes, so that priority can be appropriately set. It is further
>> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
>> only run after a root qdisc has been selected (DCB enabled hardware may reserve
>> hw queues for various traffic classes and needs the priority to be set prior to
>> selecting the root qdisc)
>>
>>
>> I've discussed various solutions with John Fastabend, and we saw a cgroup as
>> being a good general solution to this problem.  The network priority cgroup
>> allows for a per-interface priority map to be built per cgroup.  Any traffic
>> originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
>>
>> Tested by John and myself, with good results
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>> CC: John Fastabend <john.r.fastabend@intel.com>
>> CC: Robert Love <robert.w.love@intel.com>
>> CC: "David S. Miller" <davem@davemloft.net>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
> the use of skb->priority, but IMO they really seem orthogonal to the purpose of
> this change.  Any other reviews would be welcome.

Well, in part I've been playing catchup in the hope that lldp and
openlldp and/or this dcb netlink layer that I don't know anything
about (pointers please?) could help somehow to resolve the semantic
mess skb->priority has become in the first place.

I liked what was described here.

"What if we did at least carve out the DCB functionality away from
skb->priority?  Since, AIUI, we're only concerning ourselves with
locally generated traffic here, we're talking
about skbs that have a socket attached to them.  We could, instead of indexing
the prio_tc_map with skb->priority, we could index it with
skb->dev->priomap[skb->sk->prioidx] (as provided by this patch).  The cgroup
then could be, instead of a strict priority cgroup, a queue_selector cgroup (or
something more appropriately named), and we don't have to touch skb->priority at
all.  I'd really rather not start down that road until I got more opinions and
consensus on that, but it seems like a pretty good solution, one that would
allow hardware queue selection in systems that use things like DCB to co-exist
with software queueing features."

The piece that still kind of bothered me about the original proposal
(and perhaps this one) was that setting SO_PRIORITY in an app means
'give my packets more mojo'.

Taking something that took unprioritized packets and assigned them and
*them only* to a hardware queue struck me as possibly deprioritizing
the 'more mojo wanted' packets in the app(s), as they would end up in
some other, possibly overloaded, hardware queue.

So a cgroup that moves all of the packets from an application into a
given hardware queue, and then gets scheduled normally according to
skb->priority and friends (software queue, default of pfifo_fast,
etc), seems to make some sense to me. (I wouldn't mind if we had
abstractions for software queues, too, like, I need a software queue
with these properties, find me a place for it on the hardware - but
I'm dreaming)

One open question is where do packets generated from other subsystems
end up, if you are using a cgroup for the app? arp, dns, etc?

So to rephrase your original description from this:

>> Any traffic originating from an application in a cgroup, that does not explicitly set its
>> priority with SO_PRIORITY will have its priority assigned to the value
>> designated for that group on that interface.  This allows a user space daemon,
>> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
>> based on the APP_TLV value received and administratively assign applications to
>> that priority using the existing cgroup utility infrastructure.
> John, Robert, if you're supportive of these changes, some Acks would be
> appreciated.

To this:

"Any traffic originating from an application in a cgroup,  will have
its hardware queue  assigned to the value designated for that group on
that interface.  This allows a user space daemon, when conducting LLDP
negotiation with a DCB enabled peer to create a cgroup based on the
APP_TLV value received and administratively assign applications to
that hardware queue using the existing cgroup utility infrastructure."

Assuming we're on the same page here, what the heck is APP_TLV?

> John, Robert, if you're supportive of these changes, some Acks would be
> appreciated.


>
>
> Regards
> Neil
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
FR Tel: 0638645374
http://www.bufferbloat.net

^ permalink raw reply

* Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
From: Pekka Enberg @ 2011-11-14 12:25 UTC (permalink / raw)
  To: Asias He
  Cc: Sasha Levin, Michael S. Tsirkin, kvm, mingo, gorcunov,
	Krishna Kumar, Rusty Russell, virtualization, netdev
In-Reply-To: <4EC07729.3050303@gmail.com>

On Mon, Nov 14, 2011 at 4:04 AM, Asias He <asias.hejun@gmail.com> wrote:
> Why both the bandwidth and latency performance are dropping so dramatically
> with multiple VQ?

What's the expected benefit from multiple VQs i.e. why are doing the
patches Sasha?

^ permalink raw reply

* Re: net: Add network priority cgroup
From: Neil Horman @ 2011-11-14 11:47 UTC (permalink / raw)
  To: netdev; +Cc: John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1320868655-32592-1-git-send-email-nhorman@tuxdriver.com>

On Wed, Nov 09, 2011 at 02:57:33PM -0500, Neil Horman wrote:
> Data Center Bridging environments are currently somewhat limited in their
> ability to provide a general mechanism for controlling traffic priority.
> Specifically they are unable to administratively control the priority at which
> various types of network traffic are sent.
>  
> Currently, the only ways to set the priority of a network buffer are:
> 
> 1) Through the use of the SO_PRIORITY socket option
> 2) By using low level hooks, like a tc action
> 
> (1) is difficult from an administrative perspective because it requires that the
> application to be coded to not just assume the default priority is sufficient,
> and must expose an administrative interface to allow priority adjustment.  Such
> a solution is not scalable in a DCB environment
> 
> (2) is also difficult, as it requires constant administrative oversight of
> applications so as to build appropriate rules to match traffic belonging to
> various classes, so that priority can be appropriately set. It is further
> limiting when DCB enabled hardware is in use, due to the fact that tc rules are
> only run after a root qdisc has been selected (DCB enabled hardware may reserve
> hw queues for various traffic classes and needs the priority to be set prior to
> selecting the root qdisc)
> 
> 
> I've discussed various solutions with John Fastabend, and we saw a cgroup as
> being a good general solution to this problem.  The network priority cgroup
> allows for a per-interface priority map to be built per cgroup.  Any traffic
> originating from an application in a cgroup, that does not explicitly set its
> priority with SO_PRIORITY will have its priority assigned to the value
> designated for that group on that interface.  This allows a user space daemon,
> when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
> based on the APP_TLV value received and administratively assign applications to
> that priority using the existing cgroup utility infrastructure.
> 
> Tested by John and myself, with good results
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Robert Love <robert.w.love@intel.com>
> CC: "David S. Miller" <davem@davemloft.net>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Bump, any other thoughts here?  Dave T. has some reasonable thoughts regarding
the use of skb->priority, but IMO they really seem orthogonal to the purpose of
this change.  Any other reviews would be welcome.

John, Robert, if you're supportive of these changes, some Acks would be
appreciated.


Regards
Neil

^ permalink raw reply

* Re: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
From: Jan Engelhardt @ 2011-11-14 11:38 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Pablo Neira Ayuso, Hans Schillstrom, kaber, netfilter-devel,
	netdev
In-Reply-To: <201111141019.43423.hans@schillstrom.com>

On Monday 2011-11-14 10:19, Hans Schillstrom wrote:

>
>On Sunday, November 13, 2011 18:05:28 Pablo Neira Ayuso wrote:
>> BTW, I think you should split xt_HMARK to ipt_HMARK and ip6t_HMARK
>> (see recent Florian Westphal patches regarding reserve lookup for
>> instance).
>> 
>> The IPv4 and IPv6 parts for HMARK look so different that I don't think
>> it makes sense to keep them into one single xt_HMARK thing with all
>> those conditional ifdefs for IPV6.
>>
>Ok I'll do that, for some reason a thought it was better with one module.

So do I. The module overhead is so much larger.

^ permalink raw reply

* Re: [LARTC] LARTC mailing list
From: Niccolò Belli @ 2011-11-14 11:24 UTC (permalink / raw)
  To: Linux Advanced Routing & Traffic Control project
  Cc: João Almeida, linux-new-lists, netfilter,
	Linux Networking Developer Mailing List
In-Reply-To: <CAAQ01U7iwOMD9xrq3y_dOwvbmHRJZFU6VDb6QxW3qDa8mZWQnQ@mail.gmail.com>

Il 14/11/2011 12:16, João Almeida ha scritto:
> lartc.org <http://lartc.org> should be updated with decent docs as well...

Do we have access to lartc.org? Because if we have access we can even 
keep (and fix) the original mailing list, or at least dump the addresses 
database...
Otherwise we will have to make a completely new site too.

Niccolò
--
To unsubscribe from this list: send the line "unsubscribe linux-new-lists" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH] Phonet: set the pipe handle using setsockopt
From: Hemant-vilas RAMDASI @ 2011-11-14 10:36 UTC (permalink / raw)
  To: Rémi Denis-Courmont, netdev@vger.kernel.org,
	netdev-owner@vger.kernel.org
In-Reply-To: <2168215.tYVKKNoJjY@hector>

Remi,

> -----Original Message-----
> From: Rémi Denis-Courmont [mailto:remi.denis-courmont@nokia.com]
> Sent: Monday, November 14, 2011 3:00 PM
> To: Hemant-vilas RAMDASI; netdev@vger.kernel.org
> Subject: Re: [PATCH] Phonet: set the pipe handle using setsockopt
> > @@ -180,6 +182,7 @@ static inline __u8 pn_sockaddr_get_resource(const
> struct
> > sockaddr_pn *spn) /* Phonet device ioctl requests */
> >  #ifdef __KERNEL__
> >  #define SIOCPNGAUTOCONF		(SIOCDEVPRIVATE + 0)
> > +#define SIOPNPIPE_ENABLE	_IO(SIOCPNGAUTOCONF,   1)
> 
> Does this even work? I am not an expert on this, but I would think that
> device-private controls are routed to the network device, not the
> socket. In
> any case, it does not seem right.
> 
Yes, it works. The ioctl is routed to per-socket functions.

> > @@ -863,14 +898,31 @@ static int pep_sock_connect(struct sock *sk,
> struct
> > sockaddr *addr, int len) int err;
> >  	u8 data[4] = { 0 /* sub-blocks */, PAD, PAD, PAD };
> >
> > -	pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
> > +	if (pn->pipe_handle == PN_PIPE_INVALID_HANDLE)
> > +		pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
> > +
> >  	err = pipe_handler_request(sk, PNS_PEP_CONNECT_REQ,
> > -					PN_PIPE_ENABLE, data, 4);
> > -	if (err) {
> > -		pn->pipe_handle = PN_PIPE_INVALID_HANDLE;
> 
> The current backlog functions assume that pipe_handle =
> PN_PIPE_INVALID_HANDLE
> if the socket is not yet connected. That's why the old code would clear
> the
> pipe_handle always on error.
Ok..I think clearing pipe-handle on connect error should take care of this.
User-space need to set the handle again.
> 
> So it is not that simple.
> 
> > @@ -894,6 +946,16 @@ static int pep_ioctl(struct sock *sk, int cmd,
> unsigned
> > +	case SIOPNPIPE_ENABLE:
> > +		if (sk->sk_state == TCP_SYN_SENT)
> > +			return -EBUSY;
> > +		else if (sk->sk_state == TCP_ESTABLISHED)
> > +			return -EISCONN;
> > +		else
> > +			return pep_sock_enable(sk, NULL, 0);
> > +		break;
> >  	}
> 
> I strongly suspect insufficient locking here.
Ok..
> 
> >
> >  	return -ENOIOCTLCMD;
> > @@ -959,6 +1021,18 @@ static int pep_setsockopt(struct sock *sk, int
> level,
> > int optname, }
> >  		goto out_norel;
> >
> > +	case PNPIPE_HANDLE:
> > +		if ((sk->sk_state == TCP_CLOSE) &&
> > +			(val >= 0) && (val < PN_PIPE_INVALID_HANDLE))
> > +			pn->pipe_handle = val;
> > +		else
> > +			err = -EINVAL;
> > +		break;
> 
> Same problem regarding pipe_handle as above.
I think locking already exists..

> > @@ -994,6 +1068,17 @@ static int pep_getsockopt(struct sock *sk, int
> level,
> > int optname, return -EINVAL;
> >  		break;
> >
> > +	case PNPIPE_ENABLE:
> > +		if (sk->sk_state == TCP_ESTABLISHED)
> > +			val = 1;
> > +		else
> > +			val = 0;
> > +		break;
> 
> Do you still need this read-only option?
Yes.

^ permalink raw reply

* Disabling TCP slow start after idle
From: David Laight @ 2011-11-14 10:16 UTC (permalink / raw)
  To: netdev

(oOriginally sent to linux-kernel@, but suggested I report here.)

We have some connections that suffer very badly from the TCP
'slow start' algorithm. These are connections that will
always be local - they may be MAC-Switch-MAC using RGMII
crossover, they might also be connected via an external
switch. In either case the RTT is most likely to be almost
zero, certainly below 1ms.

The traffic is single packets (carrying another protocol)
so we have Nagle disabled and the send and receive sides
run separately. So the traffic is neither bulk, nor
command/response.

This means that there is very rarely any unacked data,
so almost every packet is sent using 'slow start'.

If the external switch drops a packet (they do!) then
slow start stops more packets being sent, and nothing
progresses for about 1.5 seconds by which time there
is a significant amount of traffic queued and, in some
cases, data has to be discarded.

Similar issues happen if the receiving system decides
to defer the ack until a timer tick (instead of
sending one after every second packet). In this case
only 4 packets are sent. (We fixed this one be sending
a software ACK every 4 packets.)

Quite cleary the 'slow start' algorithm just doesn't
work in these cases.

I found this https://lkml.org/lkml/2010/4/9/427
discussion about a socket option to disable slow start.
But it seems that some people are completely against the idea.
I'd have thought that the global option would be more of a
problem - since that will affect ftp connections to remote
hosts where slow start is alomost certainly benefitial.

I'd have thought it would be sensible to allow one (or more)
of the following (either as a sysctl, socket option, or
code change):
1) Disable slow start for the local subnet.
2) Disable slow start for connections with very low RTT.
3) Disable slow start for a minimum period with no traffic
   (after the last packet is acked).

I'm not sure of the resolution used by the Linux RTT
calculations. I know NetBSD had a recent set of patches
to fix calculation errors with low RTT because the code
had been written when all RTT were much longer.

	David

^ permalink raw reply

* Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
From: Sasha Levin @ 2011-11-14 10:15 UTC (permalink / raw)
  To: Asias He
  Cc: Michael S. Tsirkin, penberg, kvm, mingo, gorcunov, Krishna Kumar,
	Rusty Russell, virtualization, netdev
In-Reply-To: <4EC07729.3050303@gmail.com>

On Mon, 2011-11-14 at 10:04 +0800, Asias He wrote:
> Hi, Shsha
> 
> On 11/13/2011 11:00 PM, Sasha Levin wrote:
> > On Sun, 2011-11-13 at 12:24 +0200, Michael S. Tsirkin wrote:
> >> On Sat, Nov 12, 2011 at 12:12:01AM +0200, Sasha Levin wrote:
> >>> This is a patch based on Krishna Kumar's patch series which implements
> >>> multiple VQ support for virtio-net.
> >>>
> >>> The patch was tested with ver3 of the patch.
> >>>
> >>> Cc: Krishna Kumar<krkumar2@in.ibm.com>
> >>> Cc: Michael S. Tsirkin<mst@redhat.com>
> >>> Cc: Rusty Russell<rusty@rustcorp.com.au>
> >>> Cc: virtualization@lists.linux-foundation.org
> >>> Cc: netdev@vger.kernel.org
> >>> Signed-off-by: Sasha Levin<levinsasha928@gmail.com>
> >>
> >> Any performance numbers?
> >
> > I tried finding a box with more than two cores so I could test it on
> > something like that as well.
> >
> >> From what I see this patch causes a performance regression on my 2 core
> > box.
> >
> > I'll send an updated KVM tools patch in a bit as well.
> >
> > Before:
> >
> > # netperf -H 192.168.33.4,ipv4 -t TCP_RR
> > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> > to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 16384  87380  1        1       10.00    11160.63
> > 16384  87380
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_RR
> > MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> > to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 122880 122880 1        1       10.00    12072.64
> > 229376 229376
> >
> > # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> >   87380  16384  16384    10.00    4654.50
> >
> > netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> >   87380  16384    128    10.00     635.45
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
> > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 122880   65507   10.00      113894      0    5968.54
> > 229376           10.00       89373           4683.54
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM -- -m 128
> > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 122880     128   10.00      550634      0      56.38
> > 229376           10.00      398786             40.84
> >
> >
> > After:
> >
> > # netperf -H 192.168.33.4,ipv4 -t TCP_RR
> > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> > to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 16384  87380  1        1       10.00    8952.47
> > 16384  87380
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_RR
> > MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
> > to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size     Size    Time     Rate
> > bytes  Bytes  bytes    bytes   secs.    per sec
> >
> > 122880 122880 1        1       10.00    9534.52
> > 229376 229376
> >
> > # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> >   87380  16384  16384    10.13    2278.23
> >
> > # netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Recv   Send    Send
> > Socket Socket  Message  Elapsed
> > Size   Size    Size     Time     Throughput
> > bytes  bytes   bytes    secs.    10^6bits/sec
> >
> >   87380  16384    128    10.00     623.27
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
> > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 122880   65507   10.00      136930      0    7175.72
> > 229376           10.00       16726            876.51
> >
> > # netperf -H 192.168.33.4,ipv4 -t UDP_STREAM -- -m 128
> > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 192.168.33.4 (192.168.33.4) port 0 AF_INET
> > Socket  Message  Elapsed      Messages
> > Size    Size     Time         Okay Errors   Throughput
> > bytes   bytes    secs            #      #   10^6bits/sec
> >
> > 122880     128   10.00      982492      0     100.61
> > 229376           10.00      249597             25.56
> >
> 
> Why both the bandwidth and latency performance are dropping so 
> dramatically with multiple VQ?

It looks like theres no hash sync between host and guest, which makes
the RX VQ change for every packet. This is my guess.

-- 

Sasha.


^ permalink raw reply

* Re: [PATCH] brcmsmac: Use current logging styles
From: Arend van Spriel @ 2011-11-14 10:13 UTC (permalink / raw)
  To: Joe Perches; +Cc: John W. Linville, linux-wireless, netdev, LKML
In-Reply-To: <1321213264.15834.4.camel@Joe-Laptop>

On 11/13/2011 08:41 PM, Joe Perches wrote:
> Add and use pr_fmt and pr_<level>
> Remove useless double parentheses from macros.
> Remove function names from format strings, add to pr_debug use.
> Coalesce formats.
> Remove uncompileable undeclared variable in a DMA_NONE use.
> 
> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  drivers/net/wireless/brcm80211/brcmsmac/aiutils.c |   14 +-
>  drivers/net/wireless/brcm80211/brcmsmac/dma.c     |  145 ++++++++++-----------
>  2 files changed, 80 insertions(+), 79 deletions(-)

Thanks for the service.

Acked-by: Arend van Spriel <arend@broadcom.com>

Gr. AvS

^ permalink raw reply

* Re: [PATCH 2/4] ipv4: Update pmtu informations on inetpeer only for output routes
From: Steffen Klassert @ 2011-11-14 10:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20111108.143630.106539981030509701.davem@davemloft.net>

On Tue, Nov 08, 2011 at 02:36:30PM -0500, David Miller wrote:
> 
> What I think you can do to solve this problem is explicitly use
> dst->ops->default_mtu() in ip_forward() instead of dst_mtu().
> 

Unfortunately, that's not sufficient. We need to do an input route
check in ip_skb_dst_mtu() at least to get plain ipv4 to work,
IPsec is still broken then. So I assume most (all?) callers of
dst_mtu() don't want to get the cached PMTU for input routes on ipv4.

So for the moment I'm thinking about adding an ip_dst_mtu()
function that returns dst->ops->default_mtu() for input routes
and dst_mtu() for output routes. Then we could convert the
dst_mtu() users in net/ipv4/ over to this one.

^ permalink raw reply

* [PATCH RESUBMIT3 net-next 1/2] IPv6 routing, NLM_F_* flag support: warn if new route is created without NLM_F_CREATE
From: Matti Vaittinen @ 2011-11-14 10:14 UTC (permalink / raw)
  To: netdev; +Cc: davem



The support for NLM_F_* flags at IPv6 routing requests.

Warn if NLM_F_CREATE flag is not defined for RTM_NEWROUTE request,
creating new table. Later NLM_F_CREATE may be required for
new route creation.

Patch created against linux-3.2-rc1

Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com>
--
diff -uNr Linux-3.2-rc1.orig/net/ipv6/route.c Linux-3.2-rc1.new/net/ipv6/route.c
--- Linux-3.2-rc1.orig/net/ipv6/route.c	2011-11-10 08:44:18.000000000 +0200
+++ Linux-3.2-rc1.new/net/ipv6/route.c	2011-11-10 08:46:15.000000000 +0200
@@ -1230,9 +1230,18 @@
 	if (cfg->fc_metric == 0)
 		cfg->fc_metric = IP6_RT_PRIO_USER;
 
-	table = fib6_new_table(net, cfg->fc_table);
+	err = -ENOBUFS;
+	if (NULL != cfg->fc_nlinfo.nlh &&
+	    !(cfg->fc_nlinfo.nlh->nlmsg_flags&NLM_F_CREATE)) {
+		table = fib6_get_table(net, cfg->fc_table);
+		if (table == NULL) {
+			printk(KERN_WARNING "IPv6: NLM_F_CREATE should be specified when creating new route\n");
+			table = fib6_new_table(net, cfg->fc_table);
+		}
+	} else {
+		table = fib6_new_table(net, cfg->fc_table);
+	}
 	if (table == NULL) {
-		err = -ENOBUFS;
 		goto out;
 	}
 

^ permalink raw reply

* [PATCH RESUBMIT3 net-next 2/2] IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag
From: Matti Vaittinen @ 2011-11-14 10:15 UTC (permalink / raw)
  To: davem; +Cc: netdev


The support for NLM_F_* flags at IPv6 routing requests.

If NLM_F_CREATE flag is not defined for RTM_NEWROUTE request,
warning is printed, but no error is returned. Instead new route is
added. Later NLM_F_CREATE may be required for
new route creation.

Exception is when NLM_F_REPLACE flag is given without NLM_F_CREATE, and
no matching route is found. In this case it should be safe to assume
that the request issuer is familiar with NLM_F_* flags, and does really
not want route to be created.

Specifying NLM_F_REPLACE flag will now make the kernel to search for
matching route, and replace it with new one. If no route is found and
NLM_F_CREATE is specified as well, then new route is created.

Also, specifying NLM_F_EXCL will yield returning of error if matching
route is found.

Patch created against linux-3.2-rc1



Signed-off-by: Matti Vaittinen <Mazziesaccount@gmail.com>
--
diff -uNr Linux-3.2-rc1.orig/net/ipv6/ip6_fib.c Linux-3.2-rc1.new/net/ipv6/ip6_fib.c
--- Linux-3.2-rc1.orig/net/ipv6/ip6_fib.c	2011-11-10 08:44:18.000000000 +0200
+++ Linux-3.2-rc1.new/net/ipv6/ip6_fib.c	2011-11-14 11:47:37.000000000 +0200
@@ -425,7 +425,8 @@
 
 static struct fib6_node * fib6_add_1(struct fib6_node *root, void *addr,
 				     int addrlen, int plen,
-				     int offset)
+				     int offset, int allow_create,
+				     int replace_required)
 {
 	struct fib6_node *fn, *in, *ln;
 	struct fib6_node *pn = NULL;
@@ -447,8 +448,12 @@
 		 *	Prefix match
 		 */
 		if (plen < fn->fn_bit ||
-		    !ipv6_prefix_equal(&key->addr, addr, fn->fn_bit))
+		    !ipv6_prefix_equal(&key->addr, addr, fn->fn_bit)) {
+			if (!allow_create)
+				printk(KERN_WARNING
+				    "IPv6: NLM_F_CREATE should be set when creating new route\n");
 			goto insert_above;
+		}
 
 		/*
 		 *	Exact match ?
@@ -477,10 +482,26 @@
 		fn = dir ? fn->right: fn->left;
 	} while (fn);
 
+	if (replace_required && !allow_create) {
+		/* We should not create new node because
+		 * NLM_F_REPLACE was specified without NLM_F_CREATE
+		 * I assume it is safe to require NLM_F_CREATE when
+		 * REPLACE flag is used! Later we may want to remove the
+		 * check for replace_required, because according
+		 * to netlink specification, NLM_F_CREATE
+		 * MUST be specified if new route is created.
+		 * That would keep IPv6 consistent with IPv4
+		 */
+		printk(KERN_WARNING
+		    "IPv6: NLM_F_CREATE should be set when creating new route - ignoring request\n");
+		return ERR_PTR(-ENOENT);
+	}
 	/*
 	 *	We walked to the bottom of tree.
 	 *	Create new leaf node without children.
 	 */
+	if (!allow_create)
+		printk(KERN_WARNING "IPv6: NLM_F_CREATE should be set when creating new route\n");
 
 	ln = node_alloc();
 
@@ -614,6 +635,12 @@
 {
 	struct rt6_info *iter = NULL;
 	struct rt6_info **ins;
+	int replace = (NULL != info &&
+	    NULL != info->nlh &&
+	    (info->nlh->nlmsg_flags&NLM_F_REPLACE));
+	int add = ((NULL == info || NULL == info->nlh) ||
+	    (info->nlh->nlmsg_flags&NLM_F_CREATE));
+	int found = 0;
 
 	ins = &fn->leaf;
 
@@ -626,6 +653,13 @@
 			/*
 			 *	Same priority level
 			 */
+			if (NULL != info->nlh &&
+			    (info->nlh->nlmsg_flags&NLM_F_EXCL))
+				return -EEXIST;
+			if (replace) {
+				found++;
+				break;
+			}
 
 			if (iter->rt6i_dev == rt->rt6i_dev &&
 			    iter->rt6i_idev == rt->rt6i_idev &&
@@ -655,17 +689,40 @@
 	/*
 	 *	insert node
 	 */
+	if (!replace) {
+		if (!add)
+			printk(KERN_WARNING "IPv6: NLM_F_CREATE should be set when creating new route\n");
+
+add:
+		rt->dst.rt6_next = iter;
+		*ins = rt;
+		rt->rt6i_node = fn;
+		atomic_inc(&rt->rt6i_ref);
+		inet6_rt_notify(RTM_NEWROUTE, rt, info);
+		info->nl_net->ipv6.rt6_stats->fib_rt_entries++;
+
+		if ((fn->fn_flags & RTN_RTINFO) == 0) {
+			info->nl_net->ipv6.rt6_stats->fib_route_nodes++;
+			fn->fn_flags |= RTN_RTINFO;
+		}
 
-	rt->dst.rt6_next = iter;
-	*ins = rt;
-	rt->rt6i_node = fn;
-	atomic_inc(&rt->rt6i_ref);
-	inet6_rt_notify(RTM_NEWROUTE, rt, info);
-	info->nl_net->ipv6.rt6_stats->fib_rt_entries++;
-
-	if ((fn->fn_flags & RTN_RTINFO) == 0) {
-		info->nl_net->ipv6.rt6_stats->fib_route_nodes++;
-		fn->fn_flags |= RTN_RTINFO;
+	} else {
+		if (!found) {
+			if (add)
+				goto add;
+			printk(KERN_WARNING "IPv6: NLM_F_REPLACE set, but no existing node found!\n");
+			return -ENOENT;
+		}
+		*ins = rt;
+		rt->rt6i_node = fn;
+		rt->dst.rt6_next = iter->dst.rt6_next;
+		atomic_inc(&rt->rt6i_ref);
+		inet6_rt_notify(RTM_NEWROUTE, rt, info);
+		rt6_release(iter);
+		if ((fn->fn_flags & RTN_RTINFO) == 0) {
+			info->nl_net->ipv6.rt6_stats->fib_route_nodes++;
+			fn->fn_flags |= RTN_RTINFO;
+		}
 	}
 
 	return 0;
@@ -696,9 +753,25 @@
 {
 	struct fib6_node *fn, *pn = NULL;
 	int err = -ENOMEM;
+	int allow_create = 1;
+	int replace_required = 0;
+	if (NULL != info && NULL != info->nlh) {
+		if (!(info->nlh->nlmsg_flags&NLM_F_CREATE))
+			allow_create = 0;
+		if ((info->nlh->nlmsg_flags&NLM_F_REPLACE))
+			replace_required = 1;
+	}
+	if (!allow_create && !replace_required)
+		printk(KERN_WARNING "IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE\n");
 
 	fn = fib6_add_1(root, &rt->rt6i_dst.addr, sizeof(struct in6_addr),
-			rt->rt6i_dst.plen, offsetof(struct rt6_info, rt6i_dst));
+		    rt->rt6i_dst.plen, offsetof(struct rt6_info, rt6i_dst),
+		    allow_create, replace_required);
+
+	if (IS_ERR(fn)) {
+		err = PTR_ERR(fn);
+		fn = NULL;
+	}
 
 	if (fn == NULL)
 		goto out;
@@ -736,7 +809,8 @@
 
 			sn = fib6_add_1(sfn, &rt->rt6i_src.addr,
 					sizeof(struct in6_addr), rt->rt6i_src.plen,
-					offsetof(struct rt6_info, rt6i_src));
+					offsetof(struct rt6_info, rt6i_src),
+					allow_create, replace_required);
 
 			if (sn == NULL) {
 				/* If it is failed, discard just allocated
@@ -753,8 +827,13 @@
 		} else {
 			sn = fib6_add_1(fn->subtree, &rt->rt6i_src.addr,
 					sizeof(struct in6_addr), rt->rt6i_src.plen,
-					offsetof(struct rt6_info, rt6i_src));
+					offsetof(struct rt6_info, rt6i_src),
+					allow_create, replace_required);
 
+			if (IS_ERR(sn)) {
+				err = PTR_ERR(sn);
+				sn = NULL;
+			}
 			if (sn == NULL)
 				goto st_failure;
 		}

^ permalink raw reply

* Re: [PATCH] Phonet: set the pipe handle using setsockopt
From: Rémi Denis-Courmont @ 2011-11-14  9:29 UTC (permalink / raw)
  To: ext Hemant Vilas RAMDASI, netdev
In-Reply-To: <1321257210-17200-1-git-send-email-hemant.ramdasi@stericsson.com>

Le Lundi 14 Novembre 2011 13:23:30 ext Hemant Vilas RAMDASI a écrit :
> From: Dinesh Kumar Sharma <dinesh.sharma@stericsson.com>
> 
> This provides flexibility to set the pipe handle
> using setsockopt. The pipe can be enabled (if disabled) later
> using ioctl.
> 
> Signed-off-by: Hemant Ramdasi <hemant.ramdasi@stericsson.com>
> Signed-off-by: Dinesh Kumar Sharma <dinesh.sharma@stericsson.com>
> ---
>  include/linux/phonet.h |    3 +
>  net/phonet/pep.c       |  105
> +++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 98
> insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/phonet.h b/include/linux/phonet.h
> index 6fb1384..4c00551 100644
> --- a/include/linux/phonet.h
> +++ b/include/linux/phonet.h
> @@ -37,6 +37,8 @@
>  #define PNPIPE_ENCAP		1
>  #define PNPIPE_IFINDEX		2
>  #define PNPIPE_HANDLE		3
> +#define PNPIPE_ENABLE		4
> +#define PNPIPE_INITSTATE	5
> 
>  #define PNADDR_ANY		0
>  #define PNADDR_BROADCAST	0xFC
> @@ -180,6 +182,7 @@ static inline __u8 pn_sockaddr_get_resource(const struct
> sockaddr_pn *spn) /* Phonet device ioctl requests */
>  #ifdef __KERNEL__
>  #define SIOCPNGAUTOCONF		(SIOCDEVPRIVATE + 0)
> +#define SIOPNPIPE_ENABLE	_IO(SIOCPNGAUTOCONF,   1)

Does this even work? I am not an expert on this, but I would think that 
device-private controls are routed to the network device, not the socket. In 
any case, it does not seem right.

> 
>  struct if_phonet_autoconf {
>  	uint8_t device;
> diff --git a/net/phonet/pep.c b/net/phonet/pep.c
> index f17fd84..48339b9 100644
> --- a/net/phonet/pep.c
> +++ b/net/phonet/pep.c
> @@ -533,6 +533,29 @@ static int pep_connresp_rcv(struct sock *sk, struct
> sk_buff *skb) return pipe_handler_send_created_ind(sk);
>  }
> 
> +static int pep_enableresp_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> +	struct pnpipehdr *hdr = pnp_hdr(skb);
> +
> +	if (hdr->error_code != PN_PIPE_NO_ERROR)
> +		return -ECONNREFUSED;
> +
> +	return pep_indicate(sk, PNS_PIPE_ENABLED_IND, 0 /* sub-blocks */,
> +		NULL, 0, GFP_ATOMIC);
> +
> +}
> +
> +static void pipe_start_flow_control(struct sock *sk)
> +{
> +	struct pep_sock *pn = pep_sk(sk);
> +
> +	if (!pn_flow_safe(pn->tx_fc)) {
> +		atomic_set(&pn->tx_credits, 1);
> +		sk->sk_write_space(sk);
> +	}
> +	pipe_grant_credits(sk, GFP_ATOMIC);
> +}
> +
>  /* Queue an skb to an actively connected sock.
>   * Socket lock must be held. */
>  static int pipe_handler_do_rcv(struct sock *sk, struct sk_buff *skb)
> @@ -578,13 +601,25 @@ static int pipe_handler_do_rcv(struct sock *sk, struct
> sk_buff *skb) sk->sk_state = TCP_CLOSE_WAIT;
>  			break;
>  		}
> +		if (pn->init_enable == PN_PIPE_DISABLE)
> +			sk->sk_state = TCP_SYN_RECV;
> +		else {
> +			sk->sk_state = TCP_ESTABLISHED;
> +			pipe_start_flow_control(sk);
> +		}
> +		break;
> 
> -		sk->sk_state = TCP_ESTABLISHED;
> -		if (!pn_flow_safe(pn->tx_fc)) {
> -			atomic_set(&pn->tx_credits, 1);
> -			sk->sk_write_space(sk);
> +	case PNS_PEP_ENABLE_RESP:
> +		if (sk->sk_state != TCP_SYN_SENT)
> +			break;
> +
> +		if (pep_enableresp_rcv(sk, skb)) {
> +			sk->sk_state = TCP_CLOSE_WAIT;
> +			break;
>  		}
> -		pipe_grant_credits(sk, GFP_ATOMIC);
> +
> +		sk->sk_state = TCP_ESTABLISHED;
> +		pipe_start_flow_control(sk);
>  		break;
> 
>  	case PNS_PEP_DISCONNECT_RESP:
> @@ -863,14 +898,31 @@ static int pep_sock_connect(struct sock *sk, struct
> sockaddr *addr, int len) int err;
>  	u8 data[4] = { 0 /* sub-blocks */, PAD, PAD, PAD };
> 
> -	pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
> +	if (pn->pipe_handle == PN_PIPE_INVALID_HANDLE)
> +		pn->pipe_handle = 1; /* anything but INVALID_HANDLE */
> +
>  	err = pipe_handler_request(sk, PNS_PEP_CONNECT_REQ,
> -					PN_PIPE_ENABLE, data, 4);
> -	if (err) {
> -		pn->pipe_handle = PN_PIPE_INVALID_HANDLE;

The current backlog functions assume that pipe_handle = PN_PIPE_INVALID_HANDLE 
if the socket is not yet connected. That's why the old code would clear the 
pipe_handle always on error.

So it is not that simple.

> +				pn->init_enable, data, 4);
> +	if (err)
>  		return err;
> -	}
> +
> +	sk->sk_state = TCP_SYN_SENT;
> +
> +	return 0;
> +}
> +
> +static int pep_sock_enable(struct sock *sk, struct sockaddr *addr, int len)
> +{
> +	int err;
> +
> +	err = pipe_handler_request(sk, PNS_PEP_ENABLE_REQ, PAD,
> +				NULL, 0);
> +
> +	if (err)
> +		return err;
> +
>  	sk->sk_state = TCP_SYN_SENT;
> +
>  	return 0;
>  }
> 
> @@ -894,6 +946,16 @@ static int pep_ioctl(struct sock *sk, int cmd, unsigned
> long arg) answ = 0;
>  		release_sock(sk);
>  		return put_user(answ, (int __user *)arg);
> +		break;
> +
> +	case SIOPNPIPE_ENABLE:
> +		if (sk->sk_state == TCP_SYN_SENT)
> +			return -EBUSY;
> +		else if (sk->sk_state == TCP_ESTABLISHED)
> +			return -EISCONN;
> +		else
> +			return pep_sock_enable(sk, NULL, 0);
> +		break;
>  	}

I strongly suspect insufficient locking here.

> 
>  	return -ENOIOCTLCMD;
> @@ -959,6 +1021,18 @@ static int pep_setsockopt(struct sock *sk, int level,
> int optname, }
>  		goto out_norel;
> 
> +	case PNPIPE_HANDLE:
> +		if ((sk->sk_state == TCP_CLOSE) &&
> +			(val >= 0) && (val < PN_PIPE_INVALID_HANDLE))
> +			pn->pipe_handle = val;
> +		else
> +			err = -EINVAL;
> +		break;

Same problem regarding pipe_handle as above.

> +
> +	case PNPIPE_INITSTATE:
> +		pn->init_enable = !!val;
> +		break;
> +
>  	default:
>  		err = -ENOPROTOOPT;
>  	}
> @@ -994,6 +1068,17 @@ static int pep_getsockopt(struct sock *sk, int level,
> int optname, return -EINVAL;
>  		break;
> 
> +	case PNPIPE_ENABLE:
> +		if (sk->sk_state == TCP_ESTABLISHED)
> +			val = 1;
> +		else
> +			val = 0;
> +		break;

Do you still need this read-only option?

> +
> +	case PNPIPE_INITSTATE:
> +		val = pn->init_enable;
> +		break;
> +
>  	default:
>  		return -ENOPROTOOPT;
>  	}
-- 
Rémi Denis-Courmont
http://www.remlab.net/

^ permalink raw reply

* Re: [PATCH RESUBMIT2 net-next 1/2] IPv6 routing, NLM_F_* flag support: warn if NEWROUTE lacks of both CREATE and REPLACE flags
From: Matti Vaittinen @ 2011-11-14  9:37 UTC (permalink / raw)
  To: ext David Miller; +Cc: netdev
In-Reply-To: <20111114.033608.2000639077648429389.davem@davemloft.net>

On Mon, 2011-11-14 at 03:36 -0500, ext David Miller wrote:
> From: Matti Vaittinen <matti.vaittinen@nsn.com>
> Date: Mon, 14 Nov 2011 10:31:20 +0200
> 
> > Messages can be changed, that's no problem. I would like to hear suggestions 
> > because I personally see not much problems in current messages. 
> 
> You don't even say that the route request is for an ipv6 route.

I'll add the IPv6 in prints, and also look through the ŕest of the prints 
in 2/2 patch && see if I can think of something to improve.

Also, this patch (1/2) has wrong subject and explanation - total brainfart
from my side, sorry. The check for NLM_F_REPLACE is not in this, but in 
the 2/2 patch. I'll correct subject and explanation too.

-- 
Matti Vaittinen
+358 504863070
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Told a UDP joke the other night...
...but I'm not sure everyone got it...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

^ permalink raw reply

* Re: [v2 PATCH 1/2] NETFILTER module xt_hmark new target for HASH based fw
From: Hans Schillstrom @ 2011-11-14  9:19 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Hans Schillstrom, kaber, jengelh, netfilter-devel, netdev
In-Reply-To: <20111113170528.GB16851@1984>


On Sunday, November 13, 2011 18:05:28 Pablo Neira Ayuso wrote:
> BTW, I think you should split xt_HMARK to ipt_HMARK and ip6t_HMARK
> (see recent Florian Westphal patches regarding reserve lookup for
> instance).
> 
> The IPv4 and IPv6 parts for HMARK look so different that I don't think
> it makes sense to keep them into one single xt_HMARK thing with all
> those conditional ifdefs for IPV6.
>
Ok I'll do that, for some reason a thought it was better with one module.

--
Hans

^ permalink raw reply

* Re: [PATCH] net, bridge: print log message after state changed
From: Fritz, Wolfgang @ 2011-11-14  9:10 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, shemminger, bridge, Brunck, Holger
In-Reply-To: <20111114.020749.1176229013272872343.davem@davemloft.net>


[-- Attachment #1.1: Type: text/plain, Size: 3241 bytes --]


Am Montag, den 14.11.2011, 02:07 -0500 schrieb David Miller:
> From: "Fritz, Wolfgang" <Wolfgang.Fritz@keymile.com>
> Date: Mon, 14 Nov 2011 08:03:31 +0100
> 
> > 
> > Am Montag, den 14.11.2011, 00:37 -0500 schrieb David Miller:
> >> From: Holger Brunck <holger.brunck@keymile.com>
> >> Date: Thu, 10 Nov 2011 17:18:54 +0100
> >> 
> >> > From: Wolfgang Fritz <wolfgang.fritz@keymile.com>
> >> > 
> >> > Function br_log_state writes log message "... entering XXXX state" so it
> >> > should be called after the state has changed and not before.
> >> > 
> >> > Signed-off-by: Wolfgang Fritz <wolfgang.fritz@keymile.com>
> >> > Signed-off-by: Holger Brunck <holger.brunck@keymile.com>
> >> 
> >> "entering" can roughly mean "about to enter"
> >> 
> > 
> > Exactly.
> > 
> >> The message is therefore appropriately timed as far as I'm concerned.
> >> 
> > 
> > It's not.
> > 
> > Please test: create a bridge with STP disabled, add an interface to the
> > bridge and set that interface down. You get the message "... entering
> > forwarding state". That's wrong because it's "about to enter" disabled
> > state some lines later.
> > 
> > All other (4) calls to br_log_state are located after the state change,
> > see for example br_stp_enable_port() just some lines above the patch.
> 
> I would never expect an "entering" message to print out after the event
> happens, and in fact I'd _always_ want to see it beforehand so that if
> the state change caused a crash or similar it'd be that much easier
> to pinpoint the proper location.
> 

OK. I see what you mean.

Proposal 1 (simple): 
- Modify log message to "Entered xxx state"
- Apply patch
- Do not touch other br_log_message calls.

Proposal 2 (more complicated):
Add parameter "newstate" to br_log_message and put br_log_message before
the state change. Print log message only if new state != current state.

Advantage: Reduces log spam when STP is disabled. After the last
modifications to the bridge code which force "forwarding" state if STP
is disabled, you get lots of bogus messages "entering forwarding state":

Event                    old message      new message
link up                  learning         forwarding
forwarding delay expired forwarding       forwarding
interface down           disabled         forwarding
link down                disabled         forwarding  

If you accept one of the proposals, I'll prepare a patch. Otherwise, we
keep my original patch in our local tree.

> I'm still not applying this.  If the other log messages behave
> differently, they are broken, so go fix those instead.

Regards,
Wolfgang

-- 
"I love deadlines. I like the whooshing sound they make as they fly by"
(Douglas Adams)
=======================================================================
KEYMILE GmbH       mailto:wolfgang.fritz@keymile.com
Wolfgang Fritz     Phone: +49 (0)511 6747-692
Wohlenbergstr. 3   Fax:   +49 (0)511 6747-777
D-30179 Hannover   http://www.keymile.com

Managing Directors: Björn Claaßen, Dipl.-Kfm. Andreas Gebauer
Legal structure: GmbH, Registered office: Hanover, HRB 61069 
Local court Hanover, VAT-Reg.-No.: DE 812282795, 
WEEE-Reg.-No.: DE 59336750

[-- Attachment #1.2: Type: text/html, Size: 5051 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
Bridge mailing list
Bridge@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/bridge

^ permalink raw reply

* Re: [PATCH net-next] bnx2x: add endline at end of message
From: Dmitry Kravkov @ 2011-11-14  8:44 UTC (permalink / raw)
  To: davem@davemloft.net; +Cc: netdev@vger.kernel.org, Joe Perches
In-Reply-To: <1321260039-19254-1-git-send-email-dmitry@broadcom.com>

git send-email ignored <Reported-by> tag, adding Joe
On Mon, 2011-11-14 at 00:40 -0800, Dmitry Kravkov wrote:
> Reported-by: Joe Perches <joe@perches.com>
> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> index 33ff60d..c09e59a 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> @@ -8520,7 +8520,7 @@ sp_rtnl_not_reset:
>  	 * damage
>  	 */
>  	if (test_and_clear_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state)) {
> -		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver");
> +		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver\n");
>  		netif_device_detach(bp->dev);
>  		bnx2x_close(bp->dev);
>  	}

^ permalink raw reply

* [PATCH net-next] bnx2x: add endline at end of message
From: Dmitry Kravkov @ 2011-11-14  8:40 UTC (permalink / raw)
  To: davem, netdev; +Cc: Dmitry Kravkov

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 33ff60d..c09e59a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -8520,7 +8520,7 @@ sp_rtnl_not_reset:
 	 * damage
 	 */
 	if (test_and_clear_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state)) {
-		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver");
+		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver\n");
 		netif_device_detach(bp->dev);
 		bnx2x_close(bp->dev);
 	}
-- 
1.7.7.2

^ permalink raw reply related

* Re: [PATCH v3 net-next 09/12] bnx2x: add fan failure event handling
From: David Miller @ 2011-11-14  8:37 UTC (permalink / raw)
  To: dmitry; +Cc: joe, netdev, ariele, eilong
In-Reply-To: <1321259286.4651.9.camel@lb-tlvb-dmitry>


Come on :-/

Please make a real patch submission instead of hijacking the posting
for the original patch.

The subject line here is going to be completely wrong if I just
apply this thing as-is.

^ permalink raw reply

* Re: [PATCH RESUBMIT2 net-next 1/2] IPv6 routing, NLM_F_* flag support: warn if NEWROUTE lacks of both CREATE and REPLACE flags
From: David Miller @ 2011-11-14  8:36 UTC (permalink / raw)
  To: matti.vaittinen; +Cc: netdev
In-Reply-To: <1321259480.23935.75.camel@hakki>

From: Matti Vaittinen <matti.vaittinen@nsn.com>
Date: Mon, 14 Nov 2011 10:31:20 +0200

> Messages can be changed, that's no problem. I would like to hear suggestions 
> because I personally see not much problems in current messages. 

You don't even say that the route request is for an ipv6 route.

How in the world can we track down the application if you don't
even say what protocol the route request is for?

^ permalink raw reply

* Re: [PATCH v3 net-next 09/12] bnx2x: add fan failure event handling
From: Dmitry Kravkov @ 2011-11-14  8:28 UTC (permalink / raw)
  To: Joe Perches, davem, netdev; +Cc: Ariel Elior, Eilon Greenstein
In-Reply-To: <80A6ED01A151614884EE6C5D6AF0F2E9099806AF42@SJEXCHCCR01.corp.ad.broadcom.com>

> -----Original Message-----
> From: Joe Perches [mailto:joe@perches.com] 
> Sent: Sunday, November 13, 2011 11:12 PM
> To: Dmitry Kravkov
> Cc: davem@davemloft.net; netdev@vger.kernel.org; Ariel Elior; Eilon Greenstein
> Subject: Re: [PATCH v3 net-next 09/12] bnx2x: add fan failure event handling
> 
> On Sun, 2011-11-13 at 16:34 +0200, Dmitry Kravkov wrote:
> > From: Ariel Elior <ariele@broadcom.com>
> > Shut down the device in case of fan failure to prevent HW damage.
> trivia.
> > diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
> []
> > @@ -8503,6 +8514,17 @@ sp_rtnl_not_reset:
> []
> > +	if (test_and_clear_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state)) {
> > +		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver");
> 
> Missing a trailing "\n".
> 

Reported-by: Joe Perches <joe@perches.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 33ff60d..c09e59a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -8520,7 +8520,7 @@ sp_rtnl_not_reset:
 	 * damage
 	 */
 	if (test_and_clear_bit(BNX2X_SP_RTNL_FAN_FAILURE, &bp->sp_rtnl_state)) {
-		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver");
+		DP(BNX2X_MSG_SP, "fan failure detected. Unloading driver\n");
 		netif_device_detach(bp->dev);
 		bnx2x_close(bp->dev);
 	}
-- 
1.7.7.2

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox