Netdev List

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Vlad Yasevich @ 2010-04-28 14:17 UTC
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <4BD83F85.8090308@hp.com>



Vlad Yasevich wrote:
> I have this patch and a few others already queued.

Scratch that.  I totally misread the description and the patch.

-vlad
> 
> I was planning on sending these today for stable.
> 
> Here is the full list of stable patches I have:
> 
> sctp: Fix oops when sending queued ASCONF chunks
> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
> sctp: per_cpu variables should be in bh_disabled section
> sctp: fix potential reference of a freed pointer
> sctp: avoid irq lock inversion while call sk->sk_data_ready()
> 
> -vlad
> 
> Neil Horman wrote:
>> Hey-
>> 	Recently, it was reported to me that the kernel could oops in the
>> following way:
>>
>> <5> kernel BUG at net/core/skbuff.c:91!
>> <5> invalid operand: 0000 [#1]
>> <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
>> ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
>> vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
>> ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
>> snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
>> pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
>> mptbase sd_mod scsi_mod
>> <5> CPU:    0
>> <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
>> <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
>> <5> EIP is at skb_over_panic+0x1f/0x2d
>> <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
>> <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
>> <5> ds: 007b   es: 007b   ss: 0068
>> <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
>> <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
>> e0c2947d 
>> <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
>> df653490 
>> <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
>> 00000004 
>> <5> Call Trace:
>> <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
>> <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
>> <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
>> <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
>> <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
>> <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
>> <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
>> <5>  [<c01555a4>] cache_grow+0x140/0x233
>> <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
>> <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
>> <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
>> <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
>> <5>  [<c02d005e>] nf_iterate+0x40/0x81
>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>> <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
>> <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
>> <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
>> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
>> <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
>> <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
>> <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
>> <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
>> <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
>> <5>  [<c012a7b1>] __do_softirq+0x35/0x79
>> <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
>> <5>  [<c01094de>] do_softirq+0x46/0x4d
>>
>> It's an skb_over_panic BUG halt that results from processing an init chunk in
>> which too many of its variable length parameters are in some way malformed.
>>
>> The problem is in sctp_process_unk_param:
>> if (NULL == *errp)
>> 	*errp = sctp_make_op_error_space(asoc, chunk,
>> 					 ntohs(chunk->chunk_hdr->length));
>>
>> 	if (*errp) {
>> 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
>> 				 WORD_ROUND(ntohs(param.p->length)));
>> 		sctp_addto_chunk(*errp,
>> 			WORD_ROUND(ntohs(param.p->length)),
>> 				  param.v);
>>
>> When we allocate an error chunk, we assume that the worst case scenario requires
>> that we have chunk_hdr->length data allocated, which would be correct nominally,
>> given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
>> we also, in sctp_init_cause, insert a sctp_errhdr_t structure into the error
>> chunk, so the worst case situation in which all parameters are in violation
>> requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
>>
>> The result of this error is that a deliberately malformed packet sent to a
>> listening host can cause a remote DOS, described in CVE-2010-1173:
>> http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
>>
>> I've tested the below fix and confirmed that it fixes the issue.  It
>> pre-allocates the error chunk in sctp_verify_init, where we are able to count
>> the total number of variable length parameters, so we know how many error
>> headers we might need.  Then we simply use that chunk, if we find an error, or
>> discard/free it if all the parameters are valid.  Applies on top of the
>> lksctp-dev tree
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>
>>
>>  sm_make_chunk.c |   24 ++++++++++++++++++++++--
>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>
>>
>> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
>> index f592163..990457b 100644
>> --- a/net/sctp/sm_make_chunk.c
>> +++ b/net/sctp/sm_make_chunk.c
>> @@ -2134,6 +2134,8 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  	union sctp_params param;
>>  	int has_cookie = 0;
>>  	int result;
>> +	unsigned int param_cnt = 0;
>> +	unsigned int len;
>>  
>>  	/* Verify stream values are non-zero. */
>>  	if ((0 == peer_init->init_hdr.num_outbound_streams) ||
>> @@ -2149,6 +2151,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  
>>  		if (SCTP_PARAM_STATE_COOKIE == param.p->type)
>>  			has_cookie = 1;
>> +		param_cnt++;
>>  
>>  	} /* for (loop through all parameters) */
>>  
>> @@ -2169,6 +2172,20 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  		return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>>  						  chunk, errp);
>>  
>> +	if (!*errp) {
>> +		/*
>> +		 * Pre-allocate the error packet here
>> +		 * we do this as we need to reserve space
>> +		 * for the worst case scenario in which 
>> +		 * every parameter is in error and needs 
>> +		 * an errhdr attached to it
>> +		 */
>> +		len = ntohs(chunk->chunk_hdr->length);
>> +		len += sizeof(sctp_errhdr_t)*param_cnt;
>> +
>> +		*errp = sctp_make_op_error_space(asoc, chunk, len);
>> +	}
>> +
>>  	/* Verify all the variable length parameters */
>>  	sctp_walk_params(param, peer_init, init_hdr.params) {
>>  
>> @@ -2176,9 +2193,11 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  		switch (result) {
>>  		    case SCTP_IERROR_ABORT:
>>  		    case SCTP_IERROR_NOMEM:
>> -				return 0;
>>  		    case SCTP_IERROR_ERROR:
>> -				return 1;
>> +				len = ntohs((*errp)->chunk_hdr->length);
>> +				if ((*errp) && (len == sizeof(sctp_chunkhdr_t)))
>> +					sctp_chunk_free(*errp);
>> +				return (result == SCTP_IERROR_ERROR) ? 1 : 0;
>>  		    case SCTP_IERROR_NO_ERROR:
>>  		    default:
>>  				break;
>> @@ -2186,6 +2205,7 @@ int sctp_verify_init(const struct sctp_association *asoc,
>>  
>>  	} /* for (loop through all parameters) */
>>  
>> +	sctp_chunk_free(*errp);
>>  	return 1;
>>  }
>>  
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
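
For readers following the arithmetic in the quoted description, here is a
minimal sketch in plain userspace C (not kernel code) of the sizing gap.
ERRHDR_SIZE and the function names are assumptions made for illustration;
ERRHDR_SIZE stands in for sizeof(sctp_errhdr_t), taken here as the 4-byte
cause header (16-bit cause code plus 16-bit cause length) of RFC 4960.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for sizeof(sctp_errhdr_t): cause code + cause length. */
#define ERRHDR_SIZE 4u

/* Space reserved by the code quoted above: the INIT chunk length only. */
static size_t old_reserved(size_t init_chunk_len)
{
	return init_chunk_len;
}

/* Worst case from the description: every parameter is echoed back
 * behind its own error-cause header. */
static size_t worst_case_needed(size_t init_chunk_len, size_t param_cnt)
{
	/* Guard the multiplication and addition against wrap-around. */
	assert(param_cnt <= (SIZE_MAX - init_chunk_len) / ERRHDR_SIZE);
	return init_chunk_len + ERRHDR_SIZE * param_cnt;
}

/* Number of bytes by which the old reservation can be overrun. */
static size_t shortfall(size_t init_chunk_len, size_t param_cnt)
{
	return worst_case_needed(init_chunk_len, param_cnt) -
	       old_reserved(init_chunk_len);
}
```

Under these assumptions, a 100-byte INIT chunk carrying 5 parameters that are
all rejected needs 120 bytes, so the old reservation falls 20 bytes short.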

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-28 14:19 UTC
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272463605.2267.70.camel@edumazet-laptop>

On Wednesday 28 April 2010 at 16:06 +0200, Eric Dumazet wrote:
> On Wednesday 28 April 2010 at 08:36 -0400, jamal wrote:
> > On Wed, 2010-04-28 at 14:33 +0200, Eric Dumazet wrote:
> > 
> > > If you wait a bit, I have another patch to speedup udp receive path ;)
> > 
> > Shoot whenever you are ready ;-> I will test with and without your
> > patch..
> > 
> 
> Here it is ;)
> 
> Thanks

I forgot to say that with my previous DDOS test/bench (16 cpus trying to
feed one udp socket), my receiver can now process 420.000 pps instead of
200.000 ;)




* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Neil Horman @ 2010-04-28 14:21 UTC
  To: Vlad Yasevich; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <4BD83F85.8090308@hp.com>

On Wed, Apr 28, 2010 at 10:00:37AM -0400, Vlad Yasevich wrote:
> I have this patch and a few others already queued.
> 
> I was planning on sending these today for stable.
> 
> Here is the full list of stable patches I have:
> 
> sctp: Fix oops when sending queued ASCONF chunks
> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
> sctp: per_cpu variables should be in bh_disabled section
> sctp: fix potential reference of a freed pointer
> sctp: avoid irq lock inversion while call sk->sk_data_ready()
> 
> -vlad
> 
Are you sure?  This oops looks _very_ similar to the INIT/INIT-ACK length
calculation oops described above, but is in fact different, and requires this
patch, from what I can see.  The right fix might be in the ASCONF chunk patch
you list above, but I don't see that in your tree at the moment, so I can't be
sure.

Neil

* Re: Checkpoint and Restart of INET routing information
From: Daniel Lezcano @ 2010-04-28 14:24 UTC
  To: Dan Smith; +Cc: containers, netdev
In-Reply-To: <1272034539-19899-1-git-send-email-danms@us.ibm.com>

Dan Smith wrote:
> This set extends the existing network socket, device, and namespace support
> in the checkpoint tree to cover routing information.  It does so by making
> heavy use of RTNETLINK to dump and insert routes much like userspace would.
> Because the task doing the checkpointing or restarting needs to examine
> or setup resources for tasks in network namespaces other than its own, an
> additional kernel socket setup call is added.  It provides us the ability
> to talk to RTNETLINK in a foreign netns.
>
> The support added in this set allows me to configure various inet4 and inet6
> routes in a container and have them saved and restored successfully during
> a checkpoint/restart process.
>   

Why do you need to do that from the kernel? The same remark applies to the
ipv4/6 addresses.
What prevents you from running 'ip route show' and using that information to
restore the routes later?
Will we end up moving all the network userspace tools into the kernel? :)

If you use Eric's setns patchset, you will be able to do that easily
from userspace, no?



* Re: [PATCH 0/3] [RFC] ptp: IEEE 1588 clock support
From: Wolfgang Grandegger @ 2010-04-28 14:31 UTC
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <4BD83D37.4060301@grandegger.com>

Wolfgang Grandegger wrote:
> Richard Cochran wrote:
>> On Tue, Apr 27, 2010 at 06:20:25PM +0200, Wolfgang Grandegger wrote:
>>> Do you have also a patch adding support for hardware timestamping to ptpd?
>> Yes, I do:
>>
>>    https://sourceforge.net/tracker/index.php?func=detail&aid=2992847&group_id=139814&atid=744634
> 
> Thanks.
> 
>> I should have mentioned, you also need the gianfar HW time stamping
>> patches, recently posted to netdev by Manfred Rudigier.
> 
> I'm aware of these patches. I'm actually using the net-next-2.6 git tree.
> 
> I got ptpd working but I do not yet see the PPS-Signals on my scope. At
> first glance, the PPS-Signal seems to be configured by the gianfar_ptp
> driver (setting the fiper1 and timer1 registers) but I might have missed
> something.

That's because some 1588_PPS related bits are not yet set up in the
platform code of the mainline kernel.

Wolfgang.

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: Eric Dumazet @ 2010-04-28 14:34 UTC
  To: hadi
  Cc: David Miller, xiaosuo, therbert, shemminger, netdev,
	Eilon Greenstein, Brian Bloniarz
In-Reply-To: <1272464368.2267.72.camel@edumazet-laptop>

On Wednesday 28 April 2010 at 16:19 +0200, Eric Dumazet wrote:

> I forgot to say that with my previous DDOS test/bench (16 cpus trying to
> feed one udp socket), my receiver can now process 420.000 pps instead of
> 200.000 ;)

And perf top for the cpu dedicated to the thread doing the recvmsg()
(after the patch) shows:

----------------------------------------------------------------------------------------------------------------------------------------------
   PerfTop:    1001 irqs/sec  kernel:98.0% [1000Hz cycles],  (all, cpu: 1)
----------------------------------------------------------------------------------------------------------------------------------------------

             samples  pcnt function                      DSO
             _______ _____ _____________________________ ____________________________

             5463.00 45.5% _raw_spin_lock_bh             vmlinux                     
              761.00  6.3% copy_user_generic_string      vmlinux                     
              662.00  5.5% sock_recv_ts_and_drops        vmlinux                     
              645.00  5.4% kfree                         vmlinux                     
              568.00  4.7% _raw_spin_lock                vmlinux                     
              494.00  4.1% __skb_recv_datagram           vmlinux                     
              488.00  4.1% skb_copy_datagram_iovec       vmlinux                     
              467.00  3.9% __slab_free                   vmlinux                     
              176.00  1.5% udp_recvmsg                   vmlinux                     
              168.00  1.4% ia32_sysenter_target          vmlinux                     
              161.00  1.3% kmem_cache_free               vmlinux                     
              161.00  1.3% _raw_spin_lock_irqsave        vmlinux                     
              151.00  1.3% memcpy_toiovec                vmlinux                     
              131.00  1.1% fget_light                    vmlinux                     
              130.00  1.1% sock_rfree                    vmlinux                     
              104.00  0.9% inet_recvmsg                  vmlinux                     
               99.00  0.8% dst_release                   vmlinux                     
               98.00  0.8% skb_release_head_state        vmlinux                     
               83.00  0.7% __sk_mem_reclaim              vmlinux                     
               75.00  0.6% sys_recvfrom                  vmlinux                     
               61.00  0.5% sysexit_from_sys_call         vmlinux                     
               59.00  0.5% fput                          vmlinux                     
               56.00  0.5% schedule                      vmlinux                     
               56.00  0.5% sock_recvmsg                  vmlinux                     
               54.00  0.4% move_addr_to_user             vmlinux                     
               51.00  0.4% compat_sys_socketcall         vmlinux                     
               48.00  0.4% _raw_spin_unlock_bh           vmlinux                    



* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173)
From: Vlad Yasevich @ 2010-04-28 14:37 UTC
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <20100428142147.GB4818@hmsreliant.think-freely.org>



Neil Horman wrote:
> On Wed, Apr 28, 2010 at 10:00:37AM -0400, Vlad Yasevich wrote:
>> I have this patch and a few others already queued.
>>
>> I was planning on sending these today for stable.
>>
>> Here is the full list of stable patches I have:
>>
>> sctp: Fix oops when sending queued ASCONF chunks
>> sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
>> sctp: per_cpu variables should be in bh_disabled section
>> sctp: fix potential reference of a freed pointer
>> sctp: avoid irq lock inversion while call sk->sk_data_ready()
>>
>> -vlad
>>
> Are you sure?  This oops looks _very_ similar to the INIT/INIT-ACK length
> calculation oops described above, but is in fact different, and requires this
> patch, from what I can see.  The right fix might be in the ASCONF chunk patch
> you list above, but I don't see that in your tree at the moment, so I can't be
> sure.

As I said, I totally goofed when reading the description and I apologize.
However, I do have one comment regarding the patch.

If the bad packet is REALLY long (I mean close to the 65K IP limit), then
we'll end up allocating a super huge skb in this case and potentially exceed
the IP length limitation.  Section 11.4 of RFC 4960 allows us to omit some
errors and limit the size of the packet.

I would recommend limiting this to an MTU's worth of potential errors.  This is
on top of what the INIT-ACK is going to carry, so at most we'll send 2 MTUs'
worth.  That's still a potential amplification attack, but it's somewhat
mitigated.

Of course now we have to handle the case of checking for space before adding
an error cause. :)

-vlad
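
A rough sketch of the two points above, in userspace C rather than kernel
code: cap the error buffer at one MTU, and check for room before appending
each cause.  struct err_buf, err_buf_size() and err_buf_add_cause() are
hypothetical names for illustration, not functions from the SCTP stack, and
the 4-byte cause header is an assumption based on RFC 4960.  Causes that do
not fit are simply skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cause header size: 16-bit cause code + 16-bit cause length. */
#define ERRHDR_SIZE 4u

/* Hypothetical stand-in for an error chunk with a fixed capacity. */
struct err_buf {
	unsigned char *data;
	size_t cap;	/* capacity, capped at one MTU */
	size_t used;	/* bytes already filled */
};

/* Pick the buffer size: the worst case, but never more than one MTU. */
static size_t err_buf_size(size_t worst_case, size_t mtu)
{
	return worst_case < mtu ? worst_case : mtu;
}

/* Append one error cause if it fits.  Returns 1 if added, 0 if skipped. */
static int err_buf_add_cause(struct err_buf *b, unsigned short cause,
			     const void *param, size_t param_len)
{
	size_t padded, total;
	unsigned char *p;

	assert(b != NULL && b->data != NULL && b->used <= b->cap);

	if (param_len > b->cap)
		return 0;
	padded = (param_len + 3u) & ~(size_t)3u;	/* round up to 4 bytes */
	total = ERRHDR_SIZE + padded;
	/* Check for space first, and keep the length within 16 bits. */
	if (total > b->cap - b->used || ERRHDR_SIZE + param_len > 0xffffu)
		return 0;

	p = b->data + b->used;
	p[0] = (unsigned char)(cause >> 8);		/* cause code, big endian */
	p[1] = (unsigned char)(cause & 0xff);
	p[2] = (unsigned char)((ERRHDR_SIZE + param_len) >> 8);
	p[3] = (unsigned char)((ERRHDR_SIZE + param_len) & 0xff);
	memcpy(p + ERRHDR_SIZE, param, param_len);	/* offending parameter */
	memset(p + ERRHDR_SIZE + param_len, 0, padded - param_len);
	b->used += total;
	return 1;
}
```

The caller keeps walking the parameters either way; a return of 0 only means
that this particular cause was left out of the reply.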

> 
> Neil
> 

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: John Haxby @ 2010-04-28 14:43 UTC
  To: Jan Engelhardt
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <alpine.LSU.2.01.1004211533410.21020@obet.zrqbmnf.qr>

On 21/04/10 14:35, Jan Engelhardt wrote:
> On Wednesday 2010-04-21 15:17, Patrick McHardy wrote:
>    
>> Jan Engelhardt wrote:
>>      
>>> On Wednesday 2010-04-21 14:59, Patrick McHardy wrote:
>>>
>>>        
>>>> Jan Engelhardt wrote:
>>>>          
>>>>> The SYSRQ target will allow to remotely invoke sysrq on the local
>>>>> machine. Authentication is by means of a pre-shared key that can
>>>>> either be transmitted plaintext or digest-secured.
>>>>>            
>>>> I really think this is pushing what netfilter is meant for a bit
>>>> far. Its basically abusing the firewall ruleset to offer a network
>>>> service.
>>>>
>>>> I can see that its useful to have this in the kernel instead of
>>>> userspace, but why isn't this implemented as a stand-alone module?
>>>> That seems like a better design to me and also makes it more useful
>>>> by not depending on netfilter.
>>>>          
>>> That sort of diverts from the earlier what-seemed-to-be-consensus.
>>>
>>> Oh well, I would not mind holding the single commit up as long as the
>>> rest isn't blocked too :-)
>>>        
>> Then lets skip this one for now.
>>      
> Well you raised the concern before -- namely that kdboe would have
> the very same feature. And yet, kdboe was not part of the kernel.
> Neither is the magical stand-alone module.
> I really prefer to have it in rather than out, because I know
> that's going to mess up maintenance-here-and-there. I'm already
> having a big time with xtables-addons that still carries
> xt_condition and SYSRQ for a while, and it does have some different
> code lines than the kernel copy.
>    

I have to agree with Jan here, but I'd like to raise some additional points.

kdboe (or kgdboe) isn't part of the kernel and I don't think it 
necessarily fits all the use cases for xt_SYSRQ.  The one I have in mind 
is where there is a non-kernel hacker whose machine has got into 
trouble.  The poor harassed sys admin (in this case) has configured 
netconsole and knows that sysrq-t and sysrq-m are useful as a first 
attempt at passing useful information to someone who knows what might be 
going on and that sysrq-c to get a crash dump will also be useful.   
(This represents quite a few of the better sys admins that I come 
across.)   xt_SYSRQ is likewise easy to set up and easy to use.   It's 
true that k(g)dboe would provide this kind of information provided that 
the debuginfo was present on the target machine and the environment was 
such that any sort of debugging over netconsole was sufficiently secure 
... (is it at least as secure as the xt_SYSRQ controls?)

I was running over the design of a standalone module in my head on the 
way in this morning.   It seems fairly straightforward, but as I started 
adding in necessary requirements like limited IP addresses (which I know 
are not actually secure), limited interfaces (which are more secure in a 
controlled physical environment), user-space control and so on the more 
it was sounding as though it would just be a cut-down iptables.   And 
then, of course, that begs the question "why don't you leave all that 
extra stuff to iptables?"

My own interest in getting xt_SYSRQ into the mainline kernel is that it 
would then be easier to get it accepted in production kernels where it 
would make the poor beleaguered sys admin's life a little easier.  That 
is, _some_ useful information or even a crash dump could be extracted 
from the machine before it's big red button time.

jch

^ permalink raw reply

* Re: [Bugme-new] [Bug 15868] New: Deleting IP address from interface doesn't prevent sending a data.
From: Andrew Morton @ 2010-04-28 11:42 UTC (permalink / raw)
  To: netdev; +Cc: bugzilla-daemon, bugme-daemon, Yurij.Plotnikov
In-Reply-To: <bug-15868-10286@https.bugzilla.kernel.org/>


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 28 Apr 2010 08:11:02 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=15868
> 
>            Summary: Deleting IP address from interface doesn't prevent
>                     sending a data.
>            Product: Networking
>            Version: 2.5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: Yurij.Plotnikov@oktetlabs.ru
>         Regression: No
> 
> 
> Starting from 2.6.26, Linux kernel has strange behavior for the interface with
> two IPv4 addresses.
> 
>  Let A and B are hosts with directly connected interfaces ethA (on host A) and
> ethB (on host B). Let 10.10.0.1/24 and 10.10.0.3/24 addresses are assigned to
> ethA and 10.10.0.2/24 address is assigned to ethB. Let there is established TCP
> connection between host A and host B with sockets sock_A and sock_B that are
> bound to 10.10.0.3 and 10.10.0.2 addresses respectively. Then if someone
> deletes 10.10.0.3 address from ethA interface and after that send some data
> from sock_A socket then the data will be delivered to sock_B socket and someone
> can read it from this socket.
> 
>  There is the same picture for UDP sockets. With previous definitions if there
> are UDP sockets sock_A on host A and sock_B on host B and they are bound to
> 10.10.0.3 and 10.10.0.2 addresses respectively and they are connected to
> 10.10.0.2 and 10.10.0.3 addresses respectively then if someone deletes
> 10.10.0.3 address from ethA interface and after that send some data using
> send() function from sock_A then the data will be delivered to sock_B.
> 
> The data will not be sent in both cases if there are no addresses assigned to
> the interface after address removing.


^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: John Haxby @ 2010-04-28 14:54 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <4BD84992.5030909@oracle.com>

On 28/04/10 15:43, John Haxby wrote:
>
> kdboe (or kgdboe) isn't part of the kernel and I don't think it 
> necessarily fits all the use cases for xt_SYSRQ.  The one I have in 
> mind is where there is a non-kernel hacker whose machine has got into 
> trouble.  The poor harassed sys admin (in this case) has configured 
> netconsole and knows that sysrq-t and sysrq-m are useful as a first 
> attempt at passing useful information to someone who knows what might 
> be going on and that sysrq-c to get a crash dump will also be 
> useful.   (This represents quite a few of the better sys admins that I 
> come across.)   xt_SYSRQ is likewise easy to set up and easy to use.   
> It's true that k(g)dboe would provide this kind of information 
> provided that the debuginfo was present on the target machine and the 
> environment was such that any sort of debugging over netconsole was 
> sufficiently secure ... (is it at least as secure as the xt_SYSRQ 
> controls?)
>

I really must read what I've written more carefully.   I should have 
gone on to say that I don't see that k(g)dboe will be viable in this use 
case although for someone actually debugging a kernel on a machine that 
they have access to xt_SYSRQ leaves an awful lot to be desired :-)   But 
that isn't the common use-case I see -- the one I see is where the sys 
admins used to have a "crash trolley" which was a console and PS/2 
keyboard which they could plug into a machine to get some information, 
but as many rack machines no longer have anything PS/2 and USB hot plug 
is unlikely to work on a sick machine we need a sufficiently light 
mechanism that it will work in most cases (xt_SYSRQ is careful to 
pre-allocate most of the resources it will need).

And then I should have said that moving on to the possibility of a 
standalone module and that ...
> I was running over the design of a standalone module in my head on the 
> way in this morning.   It seems fairly straightforward, but as I 
> started adding in necessary requirements like limited IP addresses 
> (which I know are not actually secure), limited interfaces (which are 
> more secure in a controlled physical environment), user-space control 
> and so on the more it was sounding as though it would just be a 
> cut-down iptables.   And then, of course, that begs the question "why 
> don't you leave all that extra stuff to iptables?"

So unless I'm missing something obvious and different, I don't see that 
a standalone module is going to be lightweight enough to be acceptable.

Sorry for not filling these parts in earlier.

jch

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: Jan Engelhardt @ 2010-04-28 15:03 UTC (permalink / raw)
  To: John Haxby
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <4BD84C23.2000301@oracle.com>


On Wednesday 2010-04-28 16:54, John Haxby wrote:
>
> use-case I see -- the one I see is where the sys admins used to have a "crash
> trolley" which was a console and PS/2 keyboard which they could plug into a
> machine to get some information, but as many rack machines no longer have
> anything PS/2 and USB hot plug is unlikely to work on a sick machine

Oh I can tell you stories... sometimes it's so dead in the water that
the console unblanking would not work any more, rendering even any PS/2
useless. Stupid southbridge chipsets blowing up DMA :-)

^ permalink raw reply

* Re: [PATCH 2/4] [RFC] Add sock_create_kern_net()
From: Dan Smith @ 2010-04-28 15:06 UTC (permalink / raw)
  To: David Miller; +Cc: containers, netdev
In-Reply-To: <20100427.171844.77354120.davem@davemloft.net>

Hi,

DM> If you can create netlink sockets in a remote NS you can also make
DM> changes there, and the whole point is to disallow changes.

DM> So maybe you won't be making changes, but others will think about
DM> using this and doing so.

I would be making changes on restart, because I insert routes.  As has
been pointed out, Eric's setns() patches allow this sort of violation
from userspace even :)

Following that example, I could have the checkpointing task stash the
current nsproxy and temporarily jump to the destination netns to do
the checkpoint.  I'll cook up something to look at...

Thanks Dave!

-- 
Dan Smith
IBM Linux Technology Center
email: danms@us.ibm.com

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
From: Brian Bloniarz @ 2010-04-28 15:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <1272399662.2343.12.camel@edumazet-laptop>

Eric Dumazet wrote:
> Le mardi 27 avril 2010 à 19:37 +0200, Eric Dumazet a écrit :
> 
>> We might use the ticket spinlock paradigm to let writers go in parallel
>> and let the user the socket lock
>>
>> Instead of having the bh_lock_sock() to protect receive_queue *and*
>> backlog, writers get a unique slot in a table, that 'user' can handle
>> later.
>>
>> Or serialize writers (before they try to bh_lock_sock()) with a
>> dedicated lock, so that user has 50% chances to get the sock lock,
>> contending with at most one writer.
> 
> Following patch fixes the issue for me, with little performance hit on
> fast path.
> 
> Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
> receiver can now process ~200.000 pps (instead of ~100 pps before the
> patch) on my dev machine.
> 
> Thanks !
> 
> [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
> 
> Current socket backlog limit is not enough to really stop DDOS attacks,
> because user thread spend many time to process a full backlog each
> round, and user might crazy spin on socket lock.
> 
> We should add backlog size and receive_queue size (aka rmem_alloc) to
> pace writers, and let user run without being slow down too much.
> 
> Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
> stress situations.
> 
> Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
> receiver can now process ~200.000 pps (instead of ~100 pps before the
> patch) on a 8 core machine.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Wow that was awesome.

> ---
>  include/net/sock.h |   13 +++++++++++--
>  net/core/sock.c    |    5 ++++-
>  net/ipv4/udp.c     |    4 ++++
>  net/ipv6/udp.c     |    8 ++++++++
>  4 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 86a8ca1..4b0097d 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -255,7 +255,6 @@ struct sock {
>  		struct sk_buff *head;
>  		struct sk_buff *tail;
>  		int len;
> -		int limit;
>  	} sk_backlog;
>  	wait_queue_head_t	*sk_sleep;
>  	struct dst_entry	*sk_dst_cache;
> @@ -604,10 +603,20 @@ static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  	skb->next = NULL;
>  }
>  
> +/*
> + * Take into account size of receive queue and backlog queue
> + */
> +static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
> +{
> +	unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
> +
> +	return qsize + skb->truesize > sk->sk_rcvbuf;
> +}
> +

Reading sk_backlog.len without the socket lock held seems to
contradict the comment in sock.h:
  *	@sk_backlog: always used with the per-socket spinlock held
  ...
struct sock {

        ...
	/*
	 * The backlog queue is special, it is always used with
	 * the per-socket spinlock held and requires low latency
	 * access. Therefore we special case it's implementation.
	 */
	struct { ... } sk_backlog;

Is this just a doc mismatch or does sk_backlog.len need to use
atomic_read & atomic_set?

>  /* The per-socket spinlock must be held here. */
>  static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb)
>  {
> -	if (sk->sk_backlog.len >= max(sk->sk_backlog.limit, sk->sk_rcvbuf << 1))
> +	if (sk_rcvqueues_full(sk, skb))
>  		return -ENOBUFS;
>  
>  	__sk_add_backlog(sk, skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 58ebd14..5104175 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -327,6 +327,10 @@ int sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested)
>  
>  	skb->dev = NULL;
>  
> +	if (sk_rcvqueues_full(sk, skb)) {
> +		atomic_inc(&sk->sk_drops);
> +		goto discard_and_relse;
> +	}
>  	if (nested)
>  		bh_lock_sock_nested(sk);
>  	else
> @@ -1885,7 +1889,6 @@ void sock_init_data(struct socket *sock, struct sock *sk)
>  	sk->sk_allocation	=	GFP_KERNEL;
>  	sk->sk_rcvbuf		=	sysctl_rmem_default;
>  	sk->sk_sndbuf		=	sysctl_wmem_default;
> -	sk->sk_backlog.limit	=	sk->sk_rcvbuf << 1;
>  	sk->sk_state		=	TCP_CLOSE;
>  	sk_set_socket(sk, sock);
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 1e18f9c..776c844 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1372,6 +1372,10 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  			goto drop;
>  	}
>  
> +
> +	if (sk_rcvqueues_full(sk, skb))
> +		goto drop;
> +
>  	rc = 0;
>  
>  	bh_lock_sock(sk);
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 2850e35..3ead20a 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -584,6 +584,10 @@ static void flush_stack(struct sock **stack, unsigned int count,
>  
>  		sk = stack[i];
>  		if (skb1) {
> +			if (sk_rcvqueues_full(sk, skb)) {
> +				kfree_skb(skb1);
> +				goto drop;
> +			}
>  			bh_lock_sock(sk);
>  			if (!sock_owned_by_user(sk))
>  				udpv6_queue_rcv_skb(sk, skb1);
> @@ -759,6 +763,10 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
>  
>  	/* deliver */
>  
> +	if (sk_rcvqueues_full(sk, skb)) {
> +		sock_put(sk);
> +		goto discard;
> +	}
>  	bh_lock_sock(sk);
>  	if (!sock_owned_by_user(sk))
>  		udpv6_queue_rcv_skb(sk, skb);
> 
> 


^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: Eliezer Tamir @ 2010-04-28 15:44 UTC (permalink / raw)
  To: eilong
  Cc: David Miller, vladz, eric.dumazet@gmail.com, xiaosuo@gmail.com,
	hadi@cyberus.ca, therbert@google.com, shemminger@vyatta.com,
	netdev@vger.kernel.org
In-Reply-To: <1272460455.30392.24.camel@lb-tlvb-eilong.il.broadcom.com>

On Wed, Apr 28, 2010 at 4:14 PM, Eilon Greenstein <eilong@broadcom.com> wrote:
>
> On Tue, 2010-04-27 at 15:19 -0700, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Wed, 28 Apr 2010 00:18:13 +0200
> >
> > > [PATCH net-next-2.6] bnx2x: Remove two prefetch()
> > >
> > > 1) Even on 64bit arches, sizeof(struct sk_buff) < 256
> > > 2) No need to prefetch same pointer twice.
> > >
> > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > > CC: Eilon Greenstein <eilong@broadcom.com>
> >
> > Eilon please review and ACK/NACK
>
> Vlad ran few benchmarks, and we couldn't find any justification for
> those prefetch calls. After consulting with Eliezer Tamir (the original
> author) we are glad to Ack this patch.
>
> Thanks Eric!
> Acked-by: <eilong@broadcom.com>
>
>
Normally I would not have said anything, but since Eilon asked:
Acked-by: <eliezer@tamir.org.il>
(this time in plain text)

^ permalink raw reply

* Re: [PATCH 1/3] ptp: Added a brand new class driver for ptp clocks.
From: Randy Dunlap @ 2010-04-28 15:45 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev
In-Reply-To: <20100428060837.GB4516@riccoc20.at.omicron.at>

On 04/27/10 23:08, Richard Cochran wrote:
> On Tue, Apr 27, 2010 at 03:32:39PM -0700, Randy Dunlap wrote:
>> How do I use the testptp.mk file?
> 
> The makefile uses the KBUILD_OUTPUT environment variable to find the
> kernel includes, with the new header. I do something like this:
> 
>   export ARCH=powerpc
>   export KBUILD_OUTPUT=~/work/kernel/ptp_p2020
>   mkdir -p $KBUILD_OUTPUT
>   make mpc85xx_smp_defconfig
>   make menuconfig
>   make -j3 uImage
>   make headers_install
>   make -C Documentation/ptp -f testptp.mk
> 
>> Drop the ".ko".  We normally don't include that part of the module name.
> 
> Okay, can do. I just imitated what I saw in other Kbuild files.
> 
>>> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
>>> index 2fc8e14..2d616cb 100644
>>> --- a/include/linux/Kbuild
>>> +++ b/include/linux/Kbuild
>>> @@ -318,6 +318,7 @@ unifdef-y += poll.h
>>>  unifdef-y += ppp_defs.h
>>>  unifdef-y += ppp-comp.h
>>>  unifdef-y += pps.h
>>> +unifdef-y += ptp_clock.h
>>>  unifdef-y += ptrace.h
>>>  unifdef-y += quota.h
>>>  unifdef-y += random.h
>>
>> I think that the Kbuild file also needs this line:
>> header-y += ptp_clock.h
>>
>> so that builds that use O=objdir will work, but even with that
>> change, I couldn't get it to work.  (?)
> 
> Well, I am not sure what to do here. I followed the example of the PPS
> code. That code only has the unifdef-y assigment. But now I see that
> Documentation/kbuild/makefiles.txt says unifdef-y is deprecated.
> 
> Can someone clarify what is correct: is just header-y enough?

Yes, it should be.

-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [PATCH 1/2] netfilter: xtables: inclusion of xt_SYSRQ
From: John Haxby @ 2010-04-28 15:50 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Patrick McHardy, Netfilter Developer Mailing List,
	Linux Netdev List
In-Reply-To: <alpine.LSU.2.01.1004281701220.9518@obet.zrqbmnf.qr>

On 28/04/10 16:03, Jan Engelhardt wrote:
> On Wednesday 2010-04-28 16:54, John Haxby wrote:
>    
>> use-case I see -- the one I see is where the sys admins used to have a "crash
>> trolley" which was a console and PS/2 keyboard which they could plug into a
>> machine to get some information, but as many rack machines no longer have
>> anything PS/2 and USB hot plug is unlikely to work on a sick machine
>>      
> Oh I can tell you stories... sometimes it's so dead in the water that
> the console unblanking would not work any more, rendering even any PS/2
> useless. Stupid southbridge chipsets blowing up DMA :-)
>    

There's no hope in that case :-)  Just take the machine out and give it 
a decent burial.

On the other hand it's not uncommon to see reports that a machine has 
"hung totally" that include output from ping to show that it hasn't.  
Actually it's amazingly common to see this.  And in just this situation 
xt_SYSRQ is still quite likely to work.

jch

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_add_backlog() take rmem_alloc into account
From: Eric Dumazet @ 2010-04-28 15:52 UTC (permalink / raw)
  To: Brian Bloniarz; +Cc: David Miller, therbert, netdev, rick.jones2
In-Reply-To: <4BD85718.5000404@athenacr.com>

Le mercredi 28 avril 2010 à 11:41 -0400, Brian Bloniarz a écrit :

> >  
> > +/*
> > + * Take into account size of receive queue and backlog queue
> > + */
> > +static inline bool sk_rcvqueues_full(const struct sock *sk, const struct sk_buff *skb)
> > +{
> > +	unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
> > +
> > +	return qsize + skb->truesize > sk->sk_rcvbuf;
> > +}
> > +
> 
> Reading sk_backlog.len without the socket lock held seems to
> contradict the comment in sock.h:
>   *	@sk_backlog: always used with the per-socket spinlock held
>   ...
> struct sock {
> 
>         ...
> 	/*
> 	 * The backlog queue is special, it is always used with
> 	 * the per-socket spinlock held and requires low latency
> 	 * access. Therefore we special case it's implementation.
> 	 */
> 	struct { ... } sk_backlog;
> 
> Is this just a doc mismatch or does sk_backlog.len need to use
> atomic_read & atomic_set?
> 

I'll submit a doc cleanup, and will avoid this 32bit hole in a reorg of
struct sock layout.

We read 'sk_backlog.len' without the lock only to get a hint. We could have a
false positive only when the queue is full, so this is not a big deal.

Then, after locking, we call sk_rcvqueues_full() once again.
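
To make the check/lock/recheck idea concrete, here is a minimal userspace
sketch. It is not the kernel code: struct model_sock, model_rcvqueues_full()
and model_enqueue() are hypothetical names, an atomic_flag spinlock stands in
for bh_lock_sock(), and it assumes every writer updates the counters with
atomic operations. The first test is only a hint that lets a flooded socket
drop packets without touching the lock; the second test, made under the lock,
is the one that decides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct sock; not the kernel layout. */
struct model_sock {
	atomic_flag lock;		/* plays the role of bh_lock_sock() */
	atomic_uint backlog_len;	/* models sk_backlog.len */
	atomic_uint rmem_alloc;		/* models sk_rmem_alloc */
	atomic_uint drops;		/* models sk_drops */
	unsigned int rcvbuf;		/* models sk_rcvbuf */
};

static void model_sock_init(struct model_sock *sk, unsigned int rcvbuf)
{
	atomic_flag_clear(&sk->lock);
	atomic_init(&sk->backlog_len, 0);
	atomic_init(&sk->rmem_alloc, 0);
	atomic_init(&sk->drops, 0);
	sk->rcvbuf = rcvbuf;
}

/* May be called with or without the lock; without it the result is a hint. */
static bool model_rcvqueues_full(struct model_sock *sk, unsigned int truesize)
{
	unsigned int qsize = atomic_load(&sk->backlog_len) +
			     atomic_load(&sk->rmem_alloc);

	return qsize + truesize > sk->rcvbuf;
}

/* Returns 0 when the packet was queued, -1 when it was dropped. */
static int model_enqueue(struct model_sock *sk, unsigned int truesize)
{
	int ret = 0;

	/* Cheap early drop: no lock is taken when the queues look full. */
	if (model_rcvqueues_full(sk, truesize)) {
		atomic_fetch_add(&sk->drops, 1);
		return -1;
	}

	while (atomic_flag_test_and_set(&sk->lock))
		;	/* spin until the lock is ours */

	/* Authoritative check, now serialized against other writers. */
	if (model_rcvqueues_full(sk, truesize)) {
		atomic_fetch_add(&sk->drops, 1);
		ret = -1;
	} else {
		atomic_fetch_add(&sk->backlog_len, truesize);
	}

	atomic_flag_clear(&sk->lock);
	return ret;
}
```

A stale read in the first test can at worst drop or admit one packet right at
the limit, which is why treating it as a hint is acceptable in this model.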

Thanks for reviewing !



^ permalink raw reply

* Re: linux-next: manual merge of the rr tree with the net tree
From: Michael S. Tsirkin @ 2010-04-28 15:54 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Rusty Russell, linux-next, linux-kernel, David Miller, netdev
In-Reply-To: <20100427040913.GA19951@redhat.com>

On Tue, Apr 27, 2010 at 07:09:13AM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 27, 2010 at 11:58:52AM +1000, Stephen Rothwell wrote:
> > Hi Rusty,
> > 
> > Today's linux-next merge of the rr tree got a conflict in
> > drivers/net/virtio_net.c between commit
> > 5e01d2f91df62be4d6f282149bc2a8858992ceca ("virtio-net: move sg off
> > stack") from the net tree and commit
> > 7f62a724a65f864d84f50857bbfd36c240155c8f ("virtio_net: use virtqueue_xxx
> > wrappers") from the rr tree.
> > 
> > I fixed it up (see below) and can carry the fix as necessary.
> 
> Hmm, Rusty, do you intend for the patches to go through netdev this
> time? If you do, it might be simplest to just ask Dave to merge
> them in net-next-2.6 now. I can prepare and send them if you like.

For whoever develops on top of -rr, the following backports the
virtio_net change from net-next. Hope this helps.


commit 77416b2a007b67f92d2f7b3b1edac7405c5890f7
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Wed Apr 28 18:48:27 2010 +0300

    virtio-net: move sg off stack
    
    Move sg structure off stack and into virtnet_info structure.
    This helps remove extra sg_init_table calls as well as reduce
    stack usage.
    
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    Tested-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    
    Conflicts:
    
    	drivers/net/virtio_net.c

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fca44b2..dc872ba 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -40,8 +40,7 @@ module_param(gso, bool, 0444);
 
 #define VIRTNET_SEND_COMMAND_SG_MAX    2
 
-struct virtnet_info
-{
+struct virtnet_info {
 	struct virtio_device *vdev;
 	struct virtqueue *rvq, *svq, *cvq;
 	struct net_device *dev;
@@ -62,6 +61,10 @@ struct virtnet_info
 
 	/* Chain pages by the private ptr. */
 	struct page *pages;
+
+	/* fragments + linear part + virtio header */
+	struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
 };
 
 struct skb_vnet_hdr {
@@ -324,10 +327,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 {
 	struct sk_buff *skb;
 	struct skb_vnet_hdr *hdr;
-	struct scatterlist sg[2];
 	int err;
 
-	sg_init_table(sg, 2);
 	skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
 	if (unlikely(!skb))
 		return -ENOMEM;
@@ -335,11 +336,11 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 	skb_put(skb, MAX_PACKET_LEN);
 
 	hdr = skb_vnet_hdr(skb);
-	sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
+	sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	skb_to_sgvec(skb, sg + 1, 0, skb->len);
+	skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
 
-	err = virtqueue_add_buf_gfp(vi->rvq, sg, 0, 2, skb, gfp);
+	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
 	if (err < 0)
 		dev_kfree_skb(skb);
 
@@ -348,13 +349,11 @@ static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp)
 
 static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 {
-	struct scatterlist sg[MAX_SKB_FRAGS + 2];
 	struct page *first, *list = NULL;
 	char *p;
 	int i, err, offset;
 
-	sg_init_table(sg, MAX_SKB_FRAGS + 2);
-	/* page in sg[MAX_SKB_FRAGS + 1] is list tail */
+	/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
 	for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
 		first = get_a_page(vi, gfp);
 		if (!first) {
@@ -362,7 +361,7 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 				give_pages(vi, list);
 			return -ENOMEM;
 		}
-		sg_set_buf(&sg[i], page_address(first), PAGE_SIZE);
+		sg_set_buf(&vi->rx_sg[i], page_address(first), PAGE_SIZE);
 
 		/* chain new page in list head to match sg */
 		first->private = (unsigned long)list;
@@ -376,17 +375,17 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 	}
 	p = page_address(first);
 
-	/* sg[0], sg[1] share the same page */
-	/* a separated sg[0] for  virtio_net_hdr only during to QEMU bug*/
-	sg_set_buf(&sg[0], p, sizeof(struct virtio_net_hdr));
+	/* vi->rx_sg[0], vi->rx_sg[1] share the same page */
+	/* a separated vi->rx_sg[0] for virtio_net_hdr only due to QEMU bug */
+	sg_set_buf(&vi->rx_sg[0], p, sizeof(struct virtio_net_hdr));
 
-	/* sg[1] for data packet, from offset */
+	/* vi->rx_sg[1] for data packet, from offset */
 	offset = sizeof(struct padded_vnet_hdr);
-	sg_set_buf(&sg[1], p + offset, PAGE_SIZE - offset);
+	sg_set_buf(&vi->rx_sg[1], p + offset, PAGE_SIZE - offset);
 
 	/* chain first in list head */
 	first->private = (unsigned long)list;
-	err = virtqueue_add_buf_gfp(vi->rvq, sg, 0, MAX_SKB_FRAGS + 2,
+	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
 				    first, gfp);
 	if (err < 0)
 		give_pages(vi, first);
@@ -397,16 +396,15 @@ static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp)
 static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp)
 {
 	struct page *page;
-	struct scatterlist sg;
 	int err;
 
 	page = get_a_page(vi, gfp);
 	if (!page)
 		return -ENOMEM;
 
-	sg_init_one(&sg, page_address(page), PAGE_SIZE);
+	sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
 
-	err = virtqueue_add_buf_gfp(vi->rvq, &sg, 0, 1, page);
+	err = virtqueue_add_buf_gfp(vi->rvq, vi->rx_sg, 0, 1, page);
 	if (err < 0)
 		give_pages(vi, page);
 
@@ -515,12 +513,9 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 {
-	struct scatterlist sg[2+MAX_SKB_FRAGS];
 	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 
-	sg_init_table(sg, 2+MAX_SKB_FRAGS);
-
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -554,12 +549,12 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 
 	/* Encode metadata header at front. */
 	if (vi->mergeable_rx_bufs)
-		sg_set_buf(sg, &hdr->mhdr, sizeof hdr->mhdr);
+		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 	else
-		sg_set_buf(sg, &hdr->hdr, sizeof hdr->hdr);
+		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 
-	hdr->num_sg = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
-	return virtqueue_add_buf(vi->svq, sg, hdr->num_sg, 0, skb);
+	hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
+	return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg, 0, skb);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -942,6 +937,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	vdev->priv = vi;
 	vi->pages = NULL;
 	INIT_DELAYED_WORK(&vi->refill, refill_work);
+	sg_init_table(vi->rx_sg, ARRAY_SIZE(vi->rx_sg));
+	sg_init_table(vi->tx_sg, ARRAY_SIZE(vi->tx_sg));
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
-- 
MST

^ permalink raw reply related

* Re: [PATCH] sky2: use the DMA state API instead of the pci equivalents
From: Stephen Hemminger @ 2010-04-28 16:08 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: netdev
In-Reply-To: <20100428095826S.fujita.tomonori@lab.ntt.co.jp>

On Wed, 28 Apr 2010 09:57:05 +0900
FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> wrote:

> This replace the PCI DMA state API (include/linux/pci-dma.h) with the
> DMA equivalents since the PCI DMA state API will be obsolete.
> 
> No functional change.
> 
> For further information about the background:
> 
> http://marc.info/?l=linux-netdev&m=127037540020276&w=2
> 
> Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>

Tested, this works. Thanks.

Acked-by: Stephen Hemminger <shemminger@vyatta.com>


-- 

^ permalink raw reply

* Re: [PATCH] xfrm: potential uninitialized variable num_xfrms
From: David Miller @ 2010-04-28 16:41 UTC (permalink / raw)
  To: xiaosuo; +Cc: hadi, timo.teras, herbert, adobriyan, netdev
In-Reply-To: <1272439222-2935-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 28 Apr 2010 15:20:22 +0800

> potential uninitialized variable num_xfrms
> 
> fix compiler warning: 'num_xfrms' may be used uninitialized in this function.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

We've all been seeing it for weeks too, but I honestly have
reservations about trying to simply pacify the compiler here.

The num_xfrms variable is only used in code paths that actually
initialize its value.  The compiler just can't see this in the
control flow.

Check it if you don't believe me.
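
For readers who have not run into this class of warning, here is a generic
sketch of the pattern. It is not the xfrm code: lookup_count() and
use_count() are hypothetical names. The out-parameter is written only on the
success path and read only on the success path, so the read is safe, but a
compiler that cannot follow the return-value contract across the call may
still report the variable as possibly uninitialized. Whether a given compiler
actually warns here depends on inlining and optimization level.

```c
/* Hypothetical helper: writes *count only when it returns 0. */
static int lookup_count(int key, int *count)
{
	if (key < 0)
		return -1;	/* error path: *count is left untouched */

	*count = key * 2;
	return 0;
}

static int use_count(int key)
{
	int count;	/* deliberately not initialized */
	int err = lookup_count(key, &count);

	if (err)
		return err;

	return count;	/* only reached when lookup_count() set it */
}
```

Adding "= 0" to the declaration silences the warning, at the price of hiding
a real bug if a later change ever adds a path that forgets to set the value.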

^ permalink raw reply

* Re: [PATCH RFC: linux-next 1/2] irq: Add CPU mask affinity hint callback framework
From: Thomas Gleixner @ 2010-04-28 16:45 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: davem@davemloft.net, arjan@linux.jf.intel.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <Pine.WNT.4.64.1004270856560.320@PPWASKIE-MOBL2.amr.corp.intel.com>

Peter,

On Tue, 27 Apr 2010, Peter P Waskiewicz Jr wrote:
> On Tue, 27 Apr 2010, Thomas Gleixner wrote:
> > On Sun, 18 Apr 2010, Peter P Waskiewicz Jr wrote:
> > > +/**
> > > + * struct irqaffinityhint - per interrupt affinity helper
> > > + * @callback:	device driver callback function
> > > + * @dev:	reference for the affected device
> > > + * @irq:	interrupt number
> > > + */
> > > +struct irqaffinityhint {
> > > +	irq_affinity_hint_t callback;
> > > +	void *dev;
> > > +	int irq;
> > > +};
> > 
> > Why do you need that extra data structure ? The device and the irq
> > number are known, so all you need is the callback itself. So no need
> > for allocating memory ....
> 
> When I register the function callback with the interrupt layer, I need to
> know what device structures to reference back in the driver.  In other words,
> if I call into an underlying driver with just an interrupt number, then I
> have no way at getting at the dev structures (netdevice for me, plus my
> private adapter structures), unless I declare them globally (yuck).

Grr, I knew that I missed something. That'll teach me to review
patches before the coffee has reached my brain cells :)

> I had a different approach before this one where I assumed the device from
> the irq handler callback was safe to use for the device in this new callback.
> I didn't feel really great about that, since it's an implicit assumption that
> could cause things to go sideways really quickly.
>
> Let me know what you think either way.  I'm certainly willing to make a
> change, I just don't know at this point what's the safest approach from what
> I currently have.

So you need a reference to your device, so what about the following:

struct irq_affinity_hint;

struct irq_affinity_hint {
       unsigned int (*callback)(unsigned int irq, struct irq_affinity_hint *hint,
				cpumask_var_t *mask);
};

Now you embed that struct into your device private data structure and
you get the reference to it back in the callback function. No extra
kmalloc/kfree, less code.
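
A rough userspace sketch of that embedding, under stated assumptions:
container_of() is redefined locally as a stand-in for the kernel macro,
"struct my_adapter", "my_hint_callback" and "query_hint" are hypothetical
names, and a plain unsigned long replaces cpumask_var_t so the example
compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct irq_affinity_hint {
	unsigned int (*callback)(unsigned int irq,
				 struct irq_affinity_hint *hint,
				 unsigned long *mask);
};

/* Hypothetical driver private data with the hint embedded in it. */
struct my_adapter {
	int id;
	unsigned long preferred_cpus;
	struct irq_affinity_hint hint;
};

/* The callback recovers its adapter from the embedded hint pointer. */
static unsigned int my_hint_callback(unsigned int irq,
				     struct irq_affinity_hint *hint,
				     unsigned long *mask)
{
	struct my_adapter *adapter =
		container_of(hint, struct my_adapter, hint);

	(void)irq;		/* unused in this sketch */
	*mask = adapter->preferred_cpus;
	return 0;
}

/* What the core side would do: it only ever sees the hint pointer. */
static unsigned long query_hint(struct irq_affinity_hint *hint,
				unsigned int irq)
{
	unsigned long mask = 0;

	assert(hint != NULL);
	if (hint->callback)
		hint->callback(irq, hint, &mask);
	return mask;
}
```

The core stores nothing but the hint pointer, and the driver gets back to
its own structure with pointer arithmetic, so no separate allocation is
needed to carry the device reference.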

One other thing I noticed, but forgot to comment on:

> +static int irq_affinity_hint_proc_show(struct seq_file *m, void *v)
> +{
> +	struct irq_desc *desc = irq_to_desc((long)m->private);
> +	struct cpumask mask;
> +	unsigned int ret = 0;

 Why do we return 0, when there is no callback and no hint available ?

> +

  We don't want to have cpumask enforced on stack. Please make that:

     	cpumask_var_t mask;

	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
	       return -ENOMEM;

> +	if (desc->hint && desc->hint->callback) {

  The access to desc-> needs to be protected with
  desc->lock. Otherwise you might race with a callback unregister.

> +		ret = desc->hint->callback(&mask, (long)m->private,
> +		                           desc->hint->dev);
> +		if (!ret)
> +			seq_cpumask(m, &mask);
> +	}
> +
> +	seq_putc(m, '\n');
> +	return ret;
> +}

Thanks,

	tglx

^ permalink raw reply

* Re: Problem with "tcp: bind() fix when many ports are bound" commit
From: David Miller @ 2010-04-28 16:50 UTC (permalink / raw)
  To: grzegorz.chwesewicz; +Cc: linux-kernel, netdev, eric.dumazet
In-Reply-To: <4BD82B29.3000105@retis.net.pl>

From: Grzegorz Chwesewicz <grzegorz.chwesewicz@retis.net.pl>
Date: Wed, 28 Apr 2010 14:33:45 +0200

> 	Hi, I have a problem with binding to port with the latest git kernel
> (my HEAD is at 1600f9def09de07c5dbeb539e978fa73880690dd). Please CC to
> me as I'm not subscribed to the list.

Please send networking bug reports CC:'d to netdev and the people
who wrote and signed off on the commit you've narrowed down the
problem to.

Otherwise the appropriate people will take longer to find out about
your bug, and therefore the bug will take longer to fix than
necessary.

Thanks.

> 
> Example with buggy kernel:
> 
> ensima-hp ~ # /etc/init.d/apache2 start
>  * Starting apache2 ...
> (98)Address already in use: make_sock: could not bind to address
> 127.0.0.1:80
> no listening sockets available, shutting down
> Unable to open logs
> 
> As you can see nothing is listening on port 80, but there are old
> connections to port 80 with CLOSE_WAIT and FIN_WAIT2 state.
> 
> ensima-hp ~ # netstat -pan --inet|grep 80
> netstat: no support for `AF INET (sctp)' on this system.
> tcp        0      0 127.0.0.1:631           0.0.0.0:*
> LISTEN      4806/cupsd
> tcp        1      0 127.0.0.1:54040         127.0.0.1:80
> CLOSE_WAIT  5814/konquerorHk573
> tcp        0      0 127.0.0.1:80            127.0.0.1:54042
> FIN_WAIT2   -
> tcp        0      0 127.0.0.1:80            127.0.0.1:54040
> FIN_WAIT2   -
> tcp        1      0 127.0.0.1:54042         127.0.0.1:80
> CLOSE_WAIT  6175/konquerordx573
> 
> So I can't start apache as long as these connections are not fully
> closed, after that apache starts without problems.
> 
> ensima-hp ~ # netstat -pan --inet|grep 80
> netstat: no support for `AF INET (sctp)' on this system.
> tcp        0      0 127.0.0.1:631           0.0.0.0:*
> LISTEN      4806/cupsd
> 
> ensima-hp ~ # /etc/init.d/apache2 start
>  * Starting apache2 ...		[OK]
> 
> Problem occurred between 2.6.34-rc4 and latest git, bisect shows that the
> problem is caused by:
> 
> commit fda48a0d7a8412cedacda46a9c0bf8ef9cd13559
> Author: Eric Dumazet <eric.dumazet@gmail.com>
> Date:   Wed Apr 21 09:26:15 2010 +0000
> 
> tcp: bind() fix when many ports are bound
> 
> Reverting this commit from current HEAD, resolving conflict in
> 'net/ipv6/inet6_connection_sock.c' file, and compiling new kernel solves
> the problem.
> 
> 'net/ipv6/inet6_connection_sock.c' before resolving conflict:
> 
>  41         sk_for_each_bound(sk2, node, &tb->owners) {
>  42                 if (sk != sk2 &&
>  43                     (!sk->sk_bound_dev_if ||
>  44                      !sk2->sk_bound_dev_if ||
>  45 <<<<<<< HEAD
>  46                      sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
>  47                         if ((!sk->sk_reuse || !sk2->sk_reuse ||
>  48                              sk2->sk_state == TCP_LISTEN) &&
>  49                              ipv6_rcv_saddr_equal(sk, sk2))
>  50                                 break;
>  51                         else if (sk->sk_reuse && sk2->sk_reuse &&
>  52                                 !ipv6_addr_any(inet6_rcv_saddr(sk)) &&
>  53                                 ipv6_rcv_saddr_equal(sk, sk2))
>  54                                 break;
>  55                 }
>  56 =======
>  57                      sk->sk_bound_dev_if == sk2->sk_bound_dev_if) &&
>  58                     (!sk->sk_reuse || !sk2->sk_reuse ||
>  59                      sk2->sk_state == TCP_LISTEN) &&
>  60                      ipv6_rcv_saddr_equal(sk, sk2))
>  61                         break;
>  62 >>>>>>> fda48a0... tcp: bind() fix when many ports are bound
>  63         }
>  64
>  65         return node != NULL;
> 
> 
> 
>  66 }
> 
> 'net/ipv6/inet6_connection_sock.c' after resolving conflict:
> 
>  41         sk_for_each_bound(sk2, node, &tb->owners) {
>  42                 if (sk != sk2 &&
>  43                     (!sk->sk_bound_dev_if ||
>  44                      !sk2->sk_bound_dev_if ||
>  45                      sk->sk_bound_dev_if == sk2->sk_bound_dev_if) &&
>  46                     (!sk->sk_reuse || !sk2->sk_reuse ||
>  47                      sk2->sk_state == TCP_LISTEN) &&
>  48                      ipv6_rcv_saddr_equal(sk, sk2))
>  49                         break;
>  50         }
> 
> 
> 
>  51
>  52         return node != NULL;
>  53 }
> 
> -- 
> Greetings
> Grzegorz Chwesewicz
> mailto:grzegorz.chwesewicz@retis.net.pl
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: David Miller @ 2010-04-28 16:53 UTC (permalink / raw)
  To: eilong
  Cc: vladz, eliezert, eric.dumazet, xiaosuo, hadi, therbert,
	shemminger, netdev
In-Reply-To: <1272460455.30392.24.camel@lb-tlvb-eilong.il.broadcom.com>

From: "Eilon Greenstein" <eilong@broadcom.com>
Date: Wed, 28 Apr 2010 16:14:15 +0300

> On Tue, 2010-04-27 at 15:19 -0700, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 28 Apr 2010 00:18:13 +0200
>> 
>> > [PATCH net-next-2.6] bnx2x: Remove two prefetch()
>> > 
>> > 1) Even on 64bit arches, sizeof(struct sk_buff) < 256
>> > 2) No need to prefetch same pointer twice.
>> > 
>> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> > CC: Eilon Greenstein <eilong@broadcom.com>
>> 
>> Eilon please review and ACK/NACK
> 
> Vlad ran a few benchmarks, and we couldn't find any justification for
> those prefetch calls. After consulting with Eliezer Tamir (the original
> author) we are glad to Ack this patch.
> 
> Thanks Eric!
> Acked-by: <eilong@broadcom.com>

Thanks, applied.

Please put your full name as well as your email address in Acked-by:
tags, just like you do for Signed-off-by: tags.

^ permalink raw reply

* Re: [PATCH net-next-2.6] bnx2x: Remove two prefetch()
From: David Miller @ 2010-04-28 16:55 UTC (permalink / raw)
  To: eliezer
  Cc: eilong, vladz, eric.dumazet, xiaosuo, hadi, therbert, shemminger,
	netdev
In-Reply-To: <w2ue8f3c3211004280842r9f2589e8qb8fd4b7933cd9756@mail.gmail.com>

From: Eliezer Tamir <eliezer@tamir.org.il>
Date: Wed, 28 Apr 2010 18:42:37 +0300

> Acked-by: <eliezer@tamir.org.il>

Like I told Eilon, please specify your full name in future Acked-by: tags,
just as you would for a Signed-off-by: tag.

Thanks.

^ permalink raw reply

* Re: [Bugme-new] [Bug 15868] New: Deleting IP address from interface doesn't prevent sending a data.
From: David Miller @ 2010-04-28 17:00 UTC (permalink / raw)
  To: akpm; +Cc: netdev, bugzilla-daemon, bugme-daemon, Yurij.Plotnikov
In-Reply-To: <20100428074244.304683c7.akpm@linux-foundation.org>

From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 28 Apr 2010 07:42:44 -0400

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 

This is expected behavior.

^ permalink raw reply
