Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net-next-2.6 4/7] caif: Disconnect without waiting for response
From: David Miller @ 2010-04-28 20:03 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, marcel, daniel.martensson, sjurbr, linus.walleij
In-Reply-To: <1272480880-30672-4-git-send-email-sjur.brandeland@stericsson.com>

From: sjur.brandeland@stericsson.com
Date: Wed, 28 Apr 2010 20:54:37 +0200

> From: Sjur Braendeland <sjur.brandeland@stericsson.com>
> 
> Changes:
> o Function cfcnfg_disconn_adapt_layer is changed to do asynchronous
>   disconnect, not waiting for any response from the modem. Due to this
>   the function cfcnfg_linkdestroy_rsp does nothing anymore.
> o Because disconnect may take down a connection before a connect response
>   is received the function cfcnfg_linkup_rsp is checking if the client is
>   still waiting for the response, if not a disconnect request is sent to
>   the modem.
> o cfctrl is no longer keeping track of pending disconnect requests.
> o Added function cfctrl_cancel_req, which is used for deleting a pending
>   connect request if disconnect is done before connect response is received.
> o Removed unused function cfctrl_insert_req2
> o Added better handling of connect reject from modem.
> 
> Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6 5/7] caif: Rewritten socket implementation
From: David Miller @ 2010-04-28 20:03 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, marcel, daniel.martensson, sjurbr, linus.walleij
In-Reply-To: <1272480880-30672-5-git-send-email-sjur.brandeland@stericsson.com>

From: sjur.brandeland@stericsson.com
Date: Wed, 28 Apr 2010 20:54:38 +0200

> From: Sjur Braendeland <sjur.brandeland@stericsson.com>
> 
> Changes:
>  This is a complete re-write of the socket layer. Making the socket
>  implementation more aligned with the other socket layers and using more
>  of the support functions available in sock.c. Lots of code is copied
>  from af_unix (and some from af_irda).
>  Non-blocking mode should be working as well.
> 
> Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6 6/7] caif: Bugfixes in CAIF netdevice for close and flow control
From: David Miller @ 2010-04-28 20:03 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, marcel, daniel.martensson, sjurbr, linus.walleij
In-Reply-To: <1272480880-30672-6-git-send-email-sjur.brandeland@stericsson.com>

From: sjur.brandeland@stericsson.com
Date: Wed, 28 Apr 2010 20:54:39 +0200

> From: Sjur Braendeland <sjur.brandeland@stericsson.com>
> 
> Changes:
> o Bugfix: Flow control was causing the device to be destroyed.
> o Bugfix: Handle CAIF channel connect failures.
> o If the underlying link layer is gone the net-device is no longer removed,
>   but closed.
> 
> Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next-2.6 7/7] Bugfix: Link selection was swapped in switch.
From: David Miller @ 2010-04-28 20:03 UTC (permalink / raw)
  To: sjur.brandeland; +Cc: netdev, marcel, daniel.martensson, sjurbr, linus.walleij
In-Reply-To: <1272480880-30672-7-git-send-email-sjur.brandeland@stericsson.com>

From: sjur.brandeland@stericsson.com
Date: Wed, 28 Apr 2010 20:54:40 +0200

> From: Sjur Braendeland <sjur.brandeland@stericsson.com>
> 
> Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>

Applied.

^ permalink raw reply

* Re: 2.6.34-rc5-git7 (plus all patches) -- another suspicious rcu_dereference_check() usage.
From: Paul E. McKenney @ 2010-04-28 20:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg, shemminger
In-Reply-To: <1272483491.2201.9.camel@edumazet-laptop>

On Wed, Apr 28, 2010 at 09:38:11PM +0200, Eric Dumazet wrote:
> Le mercredi 28 avril 2010 à 10:54 -0700, Paul E. McKenney a écrit :
> > On Mon, Apr 26, 2010 at 08:51:06PM -0400, Miles Lane wrote:
> > > This one occurred during the wakeup from suspend to RAM.
> > > 
> > > [  984.724697] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > [  984.724700] ---------------------------------------------------
> > > [  984.724703] include/linux/fdtable.h:88 invoked
> > > rcu_dereference_check() without protection!
> > > [  984.724706]
> > > [  984.724707] other info that might help us debug this:
> > > [  984.724708]
> > > [  984.724711]
> > > [  984.724711] rcu_scheduler_active = 1, debug_locks = 1
> > > [  984.724714] no locks held by dbus-daemon/4680.
> > > [  984.724717]
> > > [  984.724717] stack backtrace:
> > > [  984.724721] Pid: 4680, comm: dbus-daemon Not tainted 2.6.34-rc5-git7 #33
> > > [  984.724724] Call Trace:
> > > [  984.724734]  [<ffffffff81074556>] lockdep_rcu_dereference+0x9d/0xa6
> > > [  984.724740]  [<ffffffff810fc785>] fcheck_files+0xb1/0xc9
> > > [  984.724745]  [<ffffffff810fc7f5>] fget_light+0x35/0xab
> > > [  984.724751]  [<ffffffff81433e1b>] ? sock_poll_wait+0x13/0x18
> > > [  984.724755]  [<ffffffff81433e39>] ? unix_poll+0x19/0x95
> > > [  984.724762]  [<ffffffff8110aa95>] do_sys_poll+0x1ff/0x3e5
> > > [  984.724766]  [<ffffffff8110a19e>] ? __pollwait+0x0/0xc7
> > > [  984.724771]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724776]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724780]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724784]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724788]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724793]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724797]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724802]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724806]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > [  984.724812]  [<ffffffff8110ae0f>] sys_poll+0x50/0xbb
> > > [  984.724818]  [<ffffffff81009d82>] system_call_fastpath+0x16/0x1b
> > 
> > Hmmm...  I am not convinced that this is a false positive.  Couldn't
> > there be a multi-threaded process where one thread is invoking poll()
> > on a UNIX socket just as another thread is calling close() on it?
> > 
> > The current fcheck_files() logic requires that the caller either (1) be in
> > an RCU read-side critical section, (2) hold ->files_lock, or (3) passing
> > in a files_struct with ->count equal to 1 (initialization or cleanup).
> > 
> > So I don't feel comfortable just slapping an RCU read-side critical
> > section around this one, at least not unless someone who understands
> > the locking says that doing so is OK.
> > 
> > 		
> 
> Its a single threaded program.
> 
> So fget_light() calls fcheck_files(files, fd); without rcu lock,
> but some /proc/pid/fd/... user temporarly raised files->count just
> before we perform the condition check.

So I should add a single-threaded check.  My first thought was to use
current_is_single_threaded(), but the bit about scanning the full list
of processes does give me pause.  However, thread_group_empty() looks
like a much lighter-weight alternative.

I believe that it is possible for a pair of single-threaded processes
to share a file descriptor, but that should not be a problem, as both
of them would need to close it for it to go away.

But what happens if someone does a clone() with CLONE_FILES, as some
of the AIO stuff seems to do?  Won't that allow one of the resulting
processes to close the file for both of them, even though both are
otherwise single-threaded?  And the ->count seems to be the only
distinction between these two cases.

And AIO does CLONE_VM as well as CLONE_FILES, but that seems to mean that
the check must scan the processes with current_is_single_threaded().
Besides which, a user could invoke clone() with only CLONE_FILES
specified, right?

Or am I just confused here?

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v3)
From: Vlad Yasevich @ 2010-04-28 20:16 UTC (permalink / raw)
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <20100428193755.GF4818@hmsreliant.think-freely.org>



Neil Horman wrote:

... snip description ...

> 
> 
>  include/net/sctp/structs.h |    1 
>  net/sctp/sm_make_chunk.c   |   70 +++++++++++++++++++++++++++++++++++++++------
>  2 files changed, 62 insertions(+), 9 deletions(-)
> 
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index ff30177..597f8e2 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -778,6 +778,7 @@ int sctp_user_addto_chunk(struct sctp_chunk *chunk, int off, int len,
>  			  struct iovec *data);
>  void sctp_chunk_free(struct sctp_chunk *);
>  void  *sctp_addto_chunk(struct sctp_chunk *, int len, const void *data);
> +void  *sctp_addto_chunk_fixed(struct sctp_chunk *, int len, const void *data);
>  struct sctp_chunk *sctp_chunkify(struct sk_buff *,
>  				 const struct sctp_association *,
>  				 struct sock *);
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index f592163..9623112 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -107,7 +107,7 @@ static const struct sctp_paramhdr prsctp_param = {
>  	cpu_to_be16(sizeof(struct sctp_paramhdr)),
>  };
>  
> -/* A helper to initialize to initialize an op error inside a
> +/* A helper to initialize an op error inside a
>   * provided chunk, as most cause codes will be embedded inside an
>   * abort chunk.
>   */
> @@ -124,6 +124,29 @@ void  sctp_init_cause(struct sctp_chunk *chunk, __be16 cause_code,
>  	chunk->subh.err_hdr = sctp_addto_chunk(chunk, sizeof(sctp_errhdr_t), &err);
>  }
>  
> +/* A helper to initialize an op error inside a
> + * provided chunk, as most cause codes will be embedded inside an
> + * abort chunk.  Differs from sctp_init_cause in that it won't oops
> + * if there isn't enough space in the op error chunk
> + */
> +int sctp_init_cause_fixed(struct sctp_chunk *chunk, __be16 cause_code,
> +		      size_t paylen)
> +{
> +	sctp_errhdr_t err;
> +	__u16 len;
> +
> +	/* Cause code constants are now defined in network order.  */
> +	err.cause = cause_code;
> +	len = sizeof(sctp_errhdr_t) + paylen;
> +	err.length  = htons(len);
> +
> +	if (skb_tailroom(chunk->skb) >  len)
> +		return -ENOSPC;
> +	chunk->subh.err_hdr = sctp_addto_chunk_fixed(chunk,
> +						     sizeof(sctp_errhdr_t),
> +						     &err);
> +	return 0;
> +}
>  /* 3.3.2 Initiation (INIT) (1)
>   *
>   * This chunk is used to initiate a SCTP association between two
> @@ -1131,6 +1154,24 @@ nodata:
>  	return retval;
>  }
>  
> +/* Create an Operation Error chunk of a fixed size,
> + * specifically, max(asoc->pathmtu, SCTP_DEFAULT_MAXSEGMENT)
> + * This is a helper function to allocate an error chunk for
> + * for those invalid parameter codes in which we may not want
> + * to report all the errors, if the incomming chunk is large
> + */
> +static inline struct sctp_chunk *sctp_make_op_error_fixed(
> +	const struct sctp_association *asoc,
> +	const struct sctp_chunk *chunk)
> +{
> +	size_t size = asoc ? asoc->pathmtu : 0;
> +
> +	if (size > SCTP_DEFAULT_MAXSEGMENT)
> +		size = SCTP_DEFAULT_MAXSEGMENT;
> +

This doesn't look right.  If you don't have an association or if pmtu is
not initialized, you will end up with a size of 0.  I think you simply
want to a
	if (!size)

there.


> +	return sctp_make_op_error_space(asoc, chunk, size);
> +}
> +
>  /* Create an Operation Error chunk.  */
>  struct sctp_chunk *sctp_make_op_error(const struct sctp_association *asoc,
>  				 const struct sctp_chunk *chunk,
> @@ -1373,6 +1414,18 @@ void *sctp_addto_chunk(struct sctp_chunk *chunk, int len, const void *data)
>  	return target;
>  }
>  
> +/* Append bytes to the end of a chunk. Returns NULL if there isn't sufficient
> + * space in the chunk
> + */
> +void *sctp_addto_chunk_fixed(struct sctp_chunk *chunk,
> +			     int len, const void *data)
> +{
> +	if (skb_tailroom(chunk->skb) > len)
> +		return sctp_addto_chunk(chunk, len, data);
> +	else
> +		return NULL;
> +}
> +
>  /* Append bytes from user space to the end of a chunk.  Will panic if
>   * chunk is not big enough.
>   * Returns a kernel err value.
> @@ -1793,9 +1846,9 @@ static int sctp_process_missing_param(const struct sctp_association *asoc,
>  	if (*errp) {
>  		report.num_missing = htonl(1);
>  		report.type = paramtype;
> -		sctp_init_cause(*errp, SCTP_ERROR_MISS_PARAM,
> -				sizeof(report));
> -		sctp_addto_chunk(*errp, sizeof(report), &report);
> +		sctp_init_cause_fixed(*errp, SCTP_ERROR_MISS_PARAM,
> +				      sizeof(report));
> +		sctp_addto_chunk_fixed(*errp, sizeof(report), &report);
>  	}
>  
>  	/* Stop processing this chunk. */
> @@ -1813,7 +1866,7 @@ static int sctp_process_inv_mandatory(const struct sctp_association *asoc,
>  		*errp = sctp_make_op_error_space(asoc, chunk, 0);
>  
>  	if (*errp)
> -		sctp_init_cause(*errp, SCTP_ERROR_INV_PARAM, 0);
> +		sctp_init_cause_fixed(*errp, SCTP_ERROR_INV_PARAM, 0);
>  
>  	/* Stop processing this chunk. */
>  	return 0;

I don't think missing or mandatory parameters are effected by this.  They are a
once and done deal.  There don't report multiple errors and we don't add any
more error after them.

> @@ -1976,13 +2029,12 @@ static sctp_ierror_t sctp_process_unk_param(const struct sctp_association *asoc,
>  		 * returning multiple unknown parameters.
>  		 */
>  		if (NULL == *errp)
> -			*errp = sctp_make_op_error_space(asoc, chunk,
> -					ntohs(chunk->chunk_hdr->length));
> +			*errp = sctp_make_op_error_fixed(asoc, chunk);
>  
>  		if (*errp) {
> -			sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> +			sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
>  					WORD_ROUND(ntohs(param.p->length)));
> -			sctp_addto_chunk(*errp,
> +			sctp_addto_chunk_fixed(*errp,
>  					WORD_ROUND(ntohs(param.p->length)),
>  					param.v);
>  		} else {
> 

So we completely get rid of variable size error chunk in this case, which I can
live with.  It simplifies the code.

-vlad

^ permalink raw reply

* Re: 2.6.34-rc5-git7 (plus all patches) -- another suspicious rcu_dereference_check() usage.
From: Eric Dumazet @ 2010-04-28 20:18 UTC (permalink / raw)
  To: paulmck
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg, shemminger
In-Reply-To: <20100428200904.GS2540@linux.vnet.ibm.com>

Le mercredi 28 avril 2010 à 13:09 -0700, Paul E. McKenney a écrit :
> On Wed, Apr 28, 2010 at 09:38:11PM +0200, Eric Dumazet wrote:
> > Le mercredi 28 avril 2010 à 10:54 -0700, Paul E. McKenney a écrit :
> > > On Mon, Apr 26, 2010 at 08:51:06PM -0400, Miles Lane wrote:
> > > > This one occurred during the wakeup from suspend to RAM.
> > > > 
> > > > [  984.724697] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > > [  984.724700] ---------------------------------------------------
> > > > [  984.724703] include/linux/fdtable.h:88 invoked
> > > > rcu_dereference_check() without protection!
> > > > [  984.724706]
> > > > [  984.724707] other info that might help us debug this:
> > > > [  984.724708]
> > > > [  984.724711]
> > > > [  984.724711] rcu_scheduler_active = 1, debug_locks = 1
> > > > [  984.724714] no locks held by dbus-daemon/4680.
> > > > [  984.724717]
> > > > [  984.724717] stack backtrace:
> > > > [  984.724721] Pid: 4680, comm: dbus-daemon Not tainted 2.6.34-rc5-git7 #33
> > > > [  984.724724] Call Trace:
> > > > [  984.724734]  [<ffffffff81074556>] lockdep_rcu_dereference+0x9d/0xa6
> > > > [  984.724740]  [<ffffffff810fc785>] fcheck_files+0xb1/0xc9
> > > > [  984.724745]  [<ffffffff810fc7f5>] fget_light+0x35/0xab
> > > > [  984.724751]  [<ffffffff81433e1b>] ? sock_poll_wait+0x13/0x18
> > > > [  984.724755]  [<ffffffff81433e39>] ? unix_poll+0x19/0x95
> > > > [  984.724762]  [<ffffffff8110aa95>] do_sys_poll+0x1ff/0x3e5
> > > > [  984.724766]  [<ffffffff8110a19e>] ? __pollwait+0x0/0xc7
> > > > [  984.724771]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724776]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724780]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724784]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724788]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724793]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724797]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724802]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724806]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > [  984.724812]  [<ffffffff8110ae0f>] sys_poll+0x50/0xbb
> > > > [  984.724818]  [<ffffffff81009d82>] system_call_fastpath+0x16/0x1b
> > > 
> > > Hmmm...  I am not convinced that this is a false positive.  Couldn't
> > > there be a multi-threaded process where one thread is invoking poll()
> > > on a UNIX socket just as another thread is calling close() on it?
> > > 
> > > The current fcheck_files() logic requires that the caller either (1) be in
> > > an RCU read-side critical section, (2) hold ->files_lock, or (3) passing
> > > in a files_struct with ->count equal to 1 (initialization or cleanup).
> > > 
> > > So I don't feel comfortable just slapping an RCU read-side critical
> > > section around this one, at least not unless someone who understands
> > > the locking says that doing so is OK.
> > > 
> > > 		
> > 
> > Its a single threaded program.
> > 
> > So fget_light() calls fcheck_files(files, fd); without rcu lock,
> > but some /proc/pid/fd/... user temporarly raised files->count just
> > before we perform the condition check.
> 
> So I should add a single-threaded check.  My first thought was to use
> current_is_single_threaded(), but the bit about scanning the full list
> of processes does give me pause.  However, thread_group_empty() looks
> like a much lighter-weight alternative.
> 
> I believe that it is possible for a pair of single-threaded processes
> to share a file descriptor, but that should not be a problem, as both
> of them would need to close it for it to go away.
> 
> But what happens if someone does a clone() with CLONE_FILES, as some
> of the AIO stuff seems to do?  Won't that allow one of the resulting
> processes to close the file for both of them, even though both are
> otherwise single-threaded?  And the ->count seems to be the only
> distinction between these two cases.
> 
> And AIO does CLONE_VM as well as CLONE_FILES, but that seems to mean that
> the check must scan the processes with current_is_single_threaded().
> Besides which, a user could invoke clone() with only CLONE_FILES
> specified, right?
> 
> Or am I just confused here?
> 
> 							Thanx, Paul

If a program is mono threaded, and doing a fget_light() syscall, it
cannot possibly do a clone() in // ;)

If we want to be picky, we could add a user provided condition, aka "we
are sure we are allowed to do this because we are the owner of the files
struct".






diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 6da962c..027f5e1 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -2694,7 +2694,7 @@ void __do_SAK(struct tty_struct *tty)
 			spin_lock(&p->files->file_lock);
 			fdt = files_fdtable(p->files);
 			for (i = 0; i < fdt->max_fds; i++) {
-				filp = fcheck_files(p->files, i);
+				filp = fcheck_files(p->files, i, false);
 				if (!filp)
 					continue;
 				if (filp->f_op->read == tty_read &&
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 452d02f..dabf4d8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -119,7 +119,7 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
 		int retval = oldfd;
 
 		rcu_read_lock();
-		if (!fcheck_files(files, oldfd))
+		if (!fcheck_files(files, oldfd, false))
 			retval = -EBADF;
 		rcu_read_unlock();
 		return retval;
diff --git a/fs/file_table.c b/fs/file_table.c
index 32d12b7..2865f72 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -274,7 +274,7 @@ struct file *fget(unsigned int fd)
 	struct files_struct *files = current->files;
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	file = fcheck_files(files, fd, false);
 	if (file) {
 		if (!atomic_long_inc_not_zero(&file->f_count)) {
 			/* File object ref couldn't be taken */
@@ -303,10 +303,10 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 
 	*fput_needed = 0;
 	if (likely((atomic_read(&files->count) == 1))) {
-		file = fcheck_files(files, fd);
+		file = fcheck_files(files, fd, true);
 	} else {
 		rcu_read_lock();
-		file = fcheck_files(files, fd);
+		file = fcheck_files(files, fd, false);
 		if (file) {
 			if (atomic_long_inc_not_zero(&file->f_count))
 				*fput_needed = 1;
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8418fcc..0e89448 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1716,7 +1716,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
 		 * hold ->file_lock.
 		 */
 		spin_lock(&files->file_lock);
-		file = fcheck_files(files, fd);
+		file = fcheck_files(files, fd, false);
 		if (file) {
 			if (path) {
 				*path = file->f_path;
@@ -1755,7 +1755,7 @@ static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
 		files = get_files_struct(task);
 		if (files) {
 			rcu_read_lock();
-			if (fcheck_files(files, fd)) {
+			if (fcheck_files(files, fd, false)) {
 				rcu_read_unlock();
 				put_files_struct(files);
 				if (task_dumpable(task)) {
@@ -1813,7 +1813,7 @@ static struct dentry *proc_fd_instantiate(struct inode *dir,
 	 * hold ->file_lock.
 	 */
 	spin_lock(&files->file_lock);
-	file = fcheck_files(files, fd);
+	file = fcheck_files(files, fd, false);
 	if (!file)
 		goto out_unlock;
 	if (file->f_mode & FMODE_READ)
@@ -1899,7 +1899,7 @@ static int proc_readfd_common(struct file * filp, void * dirent,
 				char name[PROC_NUMBUF];
 				int len;
 
-				if (!fcheck_files(files, fd))
+				if (!fcheck_files(files, fd, false))
 					continue;
 				rcu_read_unlock();
 
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 013dc52..76423ad 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -57,11 +57,12 @@ struct files_struct {
 	struct file * fd_array[NR_OPEN_DEFAULT];
 };
 
-#define rcu_dereference_check_fdtable(files, fdtfd) \
+#define rcu_dereference_check_fdtable(files, fdtfd, cond) \
 	(rcu_dereference_check((fdtfd), \
 			       rcu_read_lock_held() || \
 			       lockdep_is_held(&(files)->file_lock) || \
-			       atomic_read(&(files)->count) == 1))
+			       atomic_read(&(files)->count) == 1 || \
+			       cond))
 
 #define files_fdtable(files) \
 		(rcu_dereference_check_fdtable((files), (files)->fdt))
@@ -79,13 +80,13 @@ static inline void free_fdtable(struct fdtable *fdt)
 	call_rcu(&fdt->rcu, free_fdtable_rcu);
 }
 
-static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
+static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd, bool cond)
 {
 	struct file * file = NULL;
 	struct fdtable *fdt = files_fdtable(files);
 
 	if (fd < fdt->max_fds)
-		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
+		file = rcu_dereference_check_fdtable(files, fdt->fd[fd], cond);
 	return file;
 }
 



^ permalink raw reply related

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)
From: Neil Horman @ 2010-04-28 20:30 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <4BD897B6.5040405@hp.com>

Ok, version 4

Change Notes:
1) Minor cleanups, from Vlads notes

Summary:


Hey-
	Recently, it was reported to me that the kernel could oops in the
following way:

<5> kernel BUG at net/core/skbuff.c:91!
<5> invalid operand: 0000 [#1]
<5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
mptbase sd_mod scsi_mod
<5> CPU:    0
<5> EIP:    0060:[<c02bff27>]    Not tainted VLI
<5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
<5> EIP is at skb_over_panic+0x1f/0x2d
<5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
<5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
<5> ds: 007b   es: 007b   ss: 0068
<5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
<5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
e0c2947d 
<5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
df653490 
<5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
00000004 
<5> Call Trace:
<5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
<5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
<5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
<5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
<5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
<5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
<5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
<5>  [<c01555a4>] cache_grow+0x140/0x233
<5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
<5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
<5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
<5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
<5>  [<c02d005e>] nf_iterate+0x40/0x81
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
<5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
<5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
<5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5>  [<c02e103e>] ip_rcv+0x334/0x3b4
<5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
<5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
<5>  [<c02c67a4>] process_backlog+0x6c/0xd9
<5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
<5>  [<c012a7b1>] __do_softirq+0x35/0x79
<5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
<5>  [<c01094de>] do_softirq+0x46/0x4d

Its an skb_over_panic BUG halt that results from processing an init chunk in
which too many of its variable length parameters are in some way malformed.

The problem is in sctp_process_unk_param:
if (NULL == *errp)
	*errp = sctp_make_op_error_space(asoc, chunk,
					 ntohs(chunk->chunk_hdr->length));

	if (*errp) {
		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
				 WORD_ROUND(ntohs(param.p->length)));
		sctp_addto_chunk(*errp,
			WORD_ROUND(ntohs(param.p->length)),
				  param.v);

When we allocate an error chunk, we assume that the worst case scenario requires
that we have chunk_hdr->length data allocated, which would be correct nominally,
given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
chunk, so the worst case situation in which all parameters are in violation
requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.

The result of this error is that a deliberately malformed packet sent to a
listening host can cause a remote DOS, described in CVE-2010-1173:
http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173

I've tested the below fix and confirmed that it fixes the issue.  We move to a
strategy whereby we allocate a fixed size error chunk and ignore errors we don't
have space to report.  Tested by me successfully

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 include/net/sctp/structs.h |    1 
 net/sctp/sm_make_chunk.c   |   62 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 58 insertions(+), 5 deletions(-)


diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index ff30177..597f8e2 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -778,6 +778,7 @@ int sctp_user_addto_chunk(struct sctp_chunk *chunk, int off, int len,
 			  struct iovec *data);
 void sctp_chunk_free(struct sctp_chunk *);
 void  *sctp_addto_chunk(struct sctp_chunk *, int len, const void *data);
+void  *sctp_addto_chunk_fixed(struct sctp_chunk *, int len, const void *data);
 struct sctp_chunk *sctp_chunkify(struct sk_buff *,
 				 const struct sctp_association *,
 				 struct sock *);
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index f592163..2971731 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -107,7 +107,7 @@ static const struct sctp_paramhdr prsctp_param = {
 	cpu_to_be16(sizeof(struct sctp_paramhdr)),
 };
 
-/* A helper to initialize to initialize an op error inside a
+/* A helper to initialize an op error inside a
  * provided chunk, as most cause codes will be embedded inside an
  * abort chunk.
  */
@@ -124,6 +124,29 @@ void  sctp_init_cause(struct sctp_chunk *chunk, __be16 cause_code,
 	chunk->subh.err_hdr = sctp_addto_chunk(chunk, sizeof(sctp_errhdr_t), &err);
 }
 
+/* A helper to initialize an op error inside a
+ * provided chunk, as most cause codes will be embedded inside an
+ * abort chunk.  Differs from sctp_init_cause in that it won't oops
+ * if there isn't enough space in the op error chunk
+ */
+int sctp_init_cause_fixed(struct sctp_chunk *chunk, __be16 cause_code,
+		      size_t paylen)
+{
+	sctp_errhdr_t err;
+	__u16 len;
+
+	/* Cause code constants are now defined in network order.  */
+	err.cause = cause_code;
+	len = sizeof(sctp_errhdr_t) + paylen;
+	err.length  = htons(len);
+
+	if (skb_tailroom(chunk->skb) >  len)
+		return -ENOSPC;
+	chunk->subh.err_hdr = sctp_addto_chunk_fixed(chunk,
+						     sizeof(sctp_errhdr_t),
+						     &err);
+	return 0;
+}
 /* 3.3.2 Initiation (INIT) (1)
  *
  * This chunk is used to initiate a SCTP association between two
@@ -1131,6 +1154,24 @@ nodata:
 	return retval;
 }
 
+/* Create an Operation Error chunk of a fixed size,
+ * specifically, max(asoc->pathmtu, SCTP_DEFAULT_MAXSEGMENT)
+ * This is a helper function to allocate an error chunk for
+ * for those invalid parameter codes in which we may not want
+ * to report all the errors, if the incomming chunk is large
+ */
+static inline struct sctp_chunk *sctp_make_op_error_fixed(
+	const struct sctp_association *asoc,
+	const struct sctp_chunk *chunk)
+{
+	size_t size = asoc ? asoc->pathmtu : 0;
+
+	if (!size)
+		size = SCTP_DEFAULT_MAXSEGMENT;
+
+	return sctp_make_op_error_space(asoc, chunk, size);
+}
+
 /* Create an Operation Error chunk.  */
 struct sctp_chunk *sctp_make_op_error(const struct sctp_association *asoc,
 				 const struct sctp_chunk *chunk,
@@ -1373,6 +1414,18 @@ void *sctp_addto_chunk(struct sctp_chunk *chunk, int len, const void *data)
 	return target;
 }
 
+/* Append bytes to the end of a chunk. Returns NULL if there isn't sufficient
+ * space in the chunk
+ */
+void *sctp_addto_chunk_fixed(struct sctp_chunk *chunk,
+			     int len, const void *data)
+{
+	if (skb_tailroom(chunk->skb) > len)
+		return sctp_addto_chunk(chunk, len, data);
+	else
+		return NULL;
+}
+
 /* Append bytes from user space to the end of a chunk.  Will panic if
  * chunk is not big enough.
  * Returns a kernel err value.
@@ -1976,13 +2029,12 @@ static sctp_ierror_t sctp_process_unk_param(const struct sctp_association *asoc,
 		 * returning multiple unknown parameters.
 		 */
 		if (NULL == *errp)
-			*errp = sctp_make_op_error_space(asoc, chunk,
-					ntohs(chunk->chunk_hdr->length));
+			*errp = sctp_make_op_error_fixed(asoc, chunk);
 
 		if (*errp) {
-			sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
+			sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
 					WORD_ROUND(ntohs(param.p->length)));
-			sctp_addto_chunk(*errp,
+			sctp_addto_chunk_fixed(*errp,
 					WORD_ROUND(ntohs(param.p->length)),
 					param.v);
 		} else {

^ permalink raw reply related

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)
From: Vlad Yasevich @ 2010-04-28 20:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: sri, linux-sctp, eteo, netdev, davem, security
In-Reply-To: <20100428203059.GG4818@hmsreliant.think-freely.org>


Looks good.

Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>

-vlad

Neil Horman wrote:
> Ok, version 4
> 
> Change Notes:
> 1) Minor cleanups, from Vlads notes
> 
> Summary:
> 
> 
> Hey-
> 	Recently, it was reported to me that the kernel could oops in the
> following way:
> 
> <5> kernel BUG at net/core/skbuff.c:91!
> <5> invalid operand: 0000 [#1]
> <5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
> ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
> vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
> ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
> snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
> pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
> mptbase sd_mod scsi_mod
> <5> CPU:    0
> <5> EIP:    0060:[<c02bff27>]    Not tainted VLI
> <5> EFLAGS: 00010216   (2.6.9-89.0.25.EL) 
> <5> EIP is at skb_over_panic+0x1f/0x2d
> <5> eax: 0000002c   ebx: c033f461   ecx: c0357d96   edx: c040fd44
> <5> esi: c033f461   edi: df653280   ebp: 00000000   esp: c040fd40
> <5> ds: 007b   es: 007b   ss: 0068
> <5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
> <5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
> e0c2947d 
> <5>        00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
> df653490 
> <5>        00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
> 00000004 
> <5> Call Trace:
> <5>  [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
> <5>  [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
> <5>  [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
> <5>  [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
> <5>  [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
> <5>  [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
> <5>  [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
> <5>  [<c01555a4>] cache_grow+0x140/0x233
> <5>  [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
> <5>  [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
> <5>  [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
> <5>  [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
> <5>  [<c02d005e>] nf_iterate+0x40/0x81
> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> <5>  [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
> <5>  [<c02d0362>] nf_hook_slow+0x83/0xb5
> <5>  [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
> <5>  [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
> <5>  [<c02e103e>] ip_rcv+0x334/0x3b4
> <5>  [<c02c66fd>] netif_receive_skb+0x320/0x35b
> <5>  [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
> <5>  [<c02c67a4>] process_backlog+0x6c/0xd9
> <5>  [<c02c690f>] net_rx_action+0xfe/0x1f8
> <5>  [<c012a7b1>] __do_softirq+0x35/0x79
> <5>  [<c0107efb>] handle_IRQ_event+0x0/0x4f
> <5>  [<c01094de>] do_softirq+0x46/0x4d
> 
> Its an skb_over_panic BUG halt that results from processing an init chunk in
> which too many of its variable length parameters are in some way malformed.
> 
> The problem is in sctp_process_unk_param:
> if (NULL == *errp)
> 	*errp = sctp_make_op_error_space(asoc, chunk,
> 					 ntohs(chunk->chunk_hdr->length));
> 
> 	if (*errp) {
> 		sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> 				 WORD_ROUND(ntohs(param.p->length)));
> 		sctp_addto_chunk(*errp,
> 			WORD_ROUND(ntohs(param.p->length)),
> 				  param.v);
> 
> When we allocate an error chunk, we assume that the worst case scenario requires
> that we have chunk_hdr->length data allocated, which would be correct nominally,
> given that we call sctp_addto_chunk for the violating parameter.  Unfortunately,
> we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
> chunk, so the worst case situation in which all parameters are in violation
> requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
> 
> The result of this error is that a deliberately malformed packet sent to a
> listening host can cause a remote DOS, described in CVE-2010-1173:
> http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
> 
> I've tested the below fix and confirmed that it fixes the issue.  We move to a
> strategy whereby we allocate a fixed size error chunk and ignore errors we don't
> have space to report.  Tested by me successfully
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> 
>  include/net/sctp/structs.h |    1 
>  net/sctp/sm_make_chunk.c   |   62 +++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 58 insertions(+), 5 deletions(-)
> 
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index ff30177..597f8e2 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -778,6 +778,7 @@ int sctp_user_addto_chunk(struct sctp_chunk *chunk, int off, int len,
>  			  struct iovec *data);
>  void sctp_chunk_free(struct sctp_chunk *);
>  void  *sctp_addto_chunk(struct sctp_chunk *, int len, const void *data);
> +void  *sctp_addto_chunk_fixed(struct sctp_chunk *, int len, const void *data);
>  struct sctp_chunk *sctp_chunkify(struct sk_buff *,
>  				 const struct sctp_association *,
>  				 struct sock *);
> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> index f592163..2971731 100644
> --- a/net/sctp/sm_make_chunk.c
> +++ b/net/sctp/sm_make_chunk.c
> @@ -107,7 +107,7 @@ static const struct sctp_paramhdr prsctp_param = {
>  	cpu_to_be16(sizeof(struct sctp_paramhdr)),
>  };
>  
> -/* A helper to initialize to initialize an op error inside a
> +/* A helper to initialize an op error inside a
>   * provided chunk, as most cause codes will be embedded inside an
>   * abort chunk.
>   */
> @@ -124,6 +124,29 @@ void  sctp_init_cause(struct sctp_chunk *chunk, __be16 cause_code,
>  	chunk->subh.err_hdr = sctp_addto_chunk(chunk, sizeof(sctp_errhdr_t), &err);
>  }
>  
> +/* A helper to initialize an op error inside a
> + * provided chunk, as most cause codes will be embedded inside an
> + * abort chunk.  Differs from sctp_init_cause in that it won't oops
> + * if there isn't enough space in the op error chunk
> + */
> +int sctp_init_cause_fixed(struct sctp_chunk *chunk, __be16 cause_code,
> +		      size_t paylen)
> +{
> +	sctp_errhdr_t err;
> +	__u16 len;
> +
> +	/* Cause code constants are now defined in network order.  */
> +	err.cause = cause_code;
> +	len = sizeof(sctp_errhdr_t) + paylen;
> +	err.length  = htons(len);
> +
> +	if (skb_tailroom(chunk->skb) >  len)
> +		return -ENOSPC;
> +	chunk->subh.err_hdr = sctp_addto_chunk_fixed(chunk,
> +						     sizeof(sctp_errhdr_t),
> +						     &err);
> +	return 0;
> +}
>  /* 3.3.2 Initiation (INIT) (1)
>   *
>   * This chunk is used to initiate a SCTP association between two
> @@ -1131,6 +1154,24 @@ nodata:
>  	return retval;
>  }
>  
> +/* Create an Operation Error chunk of a fixed size,
> + * specifically, max(asoc->pathmtu, SCTP_DEFAULT_MAXSEGMENT)
> + * This is a helper function to allocate an error chunk for
> + * for those invalid parameter codes in which we may not want
> + * to report all the errors, if the incomming chunk is large
> + */
> +static inline struct sctp_chunk *sctp_make_op_error_fixed(
> +	const struct sctp_association *asoc,
> +	const struct sctp_chunk *chunk)
> +{
> +	size_t size = asoc ? asoc->pathmtu : 0;
> +
> +	if (!size)
> +		size = SCTP_DEFAULT_MAXSEGMENT;
> +
> +	return sctp_make_op_error_space(asoc, chunk, size);
> +}
> +
>  /* Create an Operation Error chunk.  */
>  struct sctp_chunk *sctp_make_op_error(const struct sctp_association *asoc,
>  				 const struct sctp_chunk *chunk,
> @@ -1373,6 +1414,18 @@ void *sctp_addto_chunk(struct sctp_chunk *chunk, int len, const void *data)
>  	return target;
>  }
>  
> +/* Append bytes to the end of a chunk. Returns NULL if there isn't sufficient
> + * space in the chunk
> + */
> +void *sctp_addto_chunk_fixed(struct sctp_chunk *chunk,
> +			     int len, const void *data)
> +{
> +	if (skb_tailroom(chunk->skb) > len)
> +		return sctp_addto_chunk(chunk, len, data);
> +	else
> +		return NULL;
> +}
> +
>  /* Append bytes from user space to the end of a chunk.  Will panic if
>   * chunk is not big enough.
>   * Returns a kernel err value.
> @@ -1976,13 +2029,12 @@ static sctp_ierror_t sctp_process_unk_param(const struct sctp_association *asoc,
>  		 * returning multiple unknown parameters.
>  		 */
>  		if (NULL == *errp)
> -			*errp = sctp_make_op_error_space(asoc, chunk,
> -					ntohs(chunk->chunk_hdr->length));
> +			*errp = sctp_make_op_error_fixed(asoc, chunk);
>  
>  		if (*errp) {
> -			sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
> +			sctp_init_cause_fixed(*errp, SCTP_ERROR_UNKNOWN_PARAM,
>  					WORD_ROUND(ntohs(param.p->length)));
> -			sctp_addto_chunk(*errp,
> +			sctp_addto_chunk_fixed(*errp,
>  					WORD_ROUND(ntohs(param.p->length)),
>  					param.v);
>  		} else {
> 

^ permalink raw reply

* Re: 2.6.34-rc5-git7 (plus all patches) -- another suspicious rcu_dereference_check() usage.
From: Paul E. McKenney @ 2010-04-28 20:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, netdev, Jens Axboe, Gui Jianfeng,
	Li Zefan, Johannes Berg, shemminger
In-Reply-To: <1272485923.2201.22.camel@edumazet-laptop>

On Wed, Apr 28, 2010 at 10:18:43PM +0200, Eric Dumazet wrote:
> Le mercredi 28 avril 2010 à 13:09 -0700, Paul E. McKenney a écrit :
> > On Wed, Apr 28, 2010 at 09:38:11PM +0200, Eric Dumazet wrote:
> > > Le mercredi 28 avril 2010 à 10:54 -0700, Paul E. McKenney a écrit :
> > > > On Mon, Apr 26, 2010 at 08:51:06PM -0400, Miles Lane wrote:
> > > > > This one occurred during the wakeup from suspend to RAM.
> > > > > 
> > > > > [  984.724697] [ INFO: suspicious rcu_dereference_check() usage. ]
> > > > > [  984.724700] ---------------------------------------------------
> > > > > [  984.724703] include/linux/fdtable.h:88 invoked
> > > > > rcu_dereference_check() without protection!
> > > > > [  984.724706]
> > > > > [  984.724707] other info that might help us debug this:
> > > > > [  984.724708]
> > > > > [  984.724711]
> > > > > [  984.724711] rcu_scheduler_active = 1, debug_locks = 1
> > > > > [  984.724714] no locks held by dbus-daemon/4680.
> > > > > [  984.724717]
> > > > > [  984.724717] stack backtrace:
> > > > > [  984.724721] Pid: 4680, comm: dbus-daemon Not tainted 2.6.34-rc5-git7 #33
> > > > > [  984.724724] Call Trace:
> > > > > [  984.724734]  [<ffffffff81074556>] lockdep_rcu_dereference+0x9d/0xa6
> > > > > [  984.724740]  [<ffffffff810fc785>] fcheck_files+0xb1/0xc9
> > > > > [  984.724745]  [<ffffffff810fc7f5>] fget_light+0x35/0xab
> > > > > [  984.724751]  [<ffffffff81433e1b>] ? sock_poll_wait+0x13/0x18
> > > > > [  984.724755]  [<ffffffff81433e39>] ? unix_poll+0x19/0x95
> > > > > [  984.724762]  [<ffffffff8110aa95>] do_sys_poll+0x1ff/0x3e5
> > > > > [  984.724766]  [<ffffffff8110a19e>] ? __pollwait+0x0/0xc7
> > > > > [  984.724771]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724776]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724780]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724784]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724788]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724793]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724797]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724802]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724806]  [<ffffffff8110a265>] ? pollwake+0x0/0x4f
> > > > > [  984.724812]  [<ffffffff8110ae0f>] sys_poll+0x50/0xbb
> > > > > [  984.724818]  [<ffffffff81009d82>] system_call_fastpath+0x16/0x1b
> > > > 
> > > > Hmmm...  I am not convinced that this is a false positive.  Couldn't
> > > > there be a multi-threaded process where one thread is invoking poll()
> > > > on a UNIX socket just as another thread is calling close() on it?
> > > > 
> > > > The current fcheck_files() logic requires that the caller either (1) be in
> > > > an RCU read-side critical section, (2) hold ->files_lock, or (3) passing
> > > > in a files_struct with ->count equal to 1 (initialization or cleanup).
> > > > 
> > > > So I don't feel comfortable just slapping an RCU read-side critical
> > > > section around this one, at least not unless someone who understands
> > > > the locking says that doing so is OK.
> > > > 
> > > > 		
> > > 
> > > Its a single threaded program.
> > > 
> > > So fget_light() calls fcheck_files(files, fd); without rcu lock,
> > > but some /proc/pid/fd/... user temporarly raised files->count just
> > > before we perform the condition check.
> > 
> > So I should add a single-threaded check.  My first thought was to use
> > current_is_single_threaded(), but the bit about scanning the full list
> > of processes does give me pause.  However, thread_group_empty() looks
> > like a much lighter-weight alternative.
> > 
> > I believe that it is possible for a pair of single-threaded processes
> > to share a file descriptor, but that should not be a problem, as both
> > of them would need to close it for it to go away.
> > 
> > But what happens if someone does a clone() with CLONE_FILES, as some
> > of the AIO stuff seems to do?  Won't that allow one of the resulting
> > processes to close the file for both of them, even though both are
> > otherwise single-threaded?  And the ->count seems to be the only
> > distinction between these two cases.
> > 
> > And AIO does CLONE_VM as well as CLONE_FILES, but that seems to mean that
> > the check must scan the processes with current_is_single_threaded().
> > Besides which, a user could invoke clone() with only CLONE_FILES
> > specified, right?
> > 
> > Or am I just confused here?
> > 
> > 							Thanx, Paul
> 
> If a program is mono threaded, and doing a fget_light() syscall, it
> cannot possibly do a clone() in // ;)

The sequence of events that I am worried about is as follows:

1.	Single-threaded process does clone(CLONE_FILES).  The
	result is a pair of single-threaded processes that share
	file descriptors.

2.	One of these processes does files_fdtable(i) at the same
	time as the other process closes file descriptor i.

So, clone and -then- do fget_light().

> If we want to be picky, we could add a user provided condition, aka "we
> are sure we are allowed to do this because we are the owner of the files
> struct".

But yes, if I understand your trick below, the race conditions from
the above sequence of events would simply force the processes off
of the fget_light() path, which should be OK.

						Thanx, Paul

> diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
> index 6da962c..027f5e1 100644
> --- a/drivers/char/tty_io.c
> +++ b/drivers/char/tty_io.c
> @@ -2694,7 +2694,7 @@ void __do_SAK(struct tty_struct *tty)
>  			spin_lock(&p->files->file_lock);
>  			fdt = files_fdtable(p->files);
>  			for (i = 0; i < fdt->max_fds; i++) {
> -				filp = fcheck_files(p->files, i);
> +				filp = fcheck_files(p->files, i, false);
>  				if (!filp)
>  					continue;
>  				if (filp->f_op->read == tty_read &&
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 452d02f..dabf4d8 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -119,7 +119,7 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
>  		int retval = oldfd;
> 
>  		rcu_read_lock();
> -		if (!fcheck_files(files, oldfd))
> +		if (!fcheck_files(files, oldfd, false))
>  			retval = -EBADF;
>  		rcu_read_unlock();
>  		return retval;
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 32d12b7..2865f72 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -274,7 +274,7 @@ struct file *fget(unsigned int fd)
>  	struct files_struct *files = current->files;
> 
>  	rcu_read_lock();
> -	file = fcheck_files(files, fd);
> +	file = fcheck_files(files, fd, false);
>  	if (file) {
>  		if (!atomic_long_inc_not_zero(&file->f_count)) {
>  			/* File object ref couldn't be taken */
> @@ -303,10 +303,10 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
> 
>  	*fput_needed = 0;
>  	if (likely((atomic_read(&files->count) == 1))) {
> -		file = fcheck_files(files, fd);
> +		file = fcheck_files(files, fd, true);
>  	} else {
>  		rcu_read_lock();
> -		file = fcheck_files(files, fd);
> +		file = fcheck_files(files, fd, false);
>  		if (file) {
>  			if (atomic_long_inc_not_zero(&file->f_count))
>  				*fput_needed = 1;
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 8418fcc..0e89448 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1716,7 +1716,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
>  		 * hold ->file_lock.
>  		 */
>  		spin_lock(&files->file_lock);
> -		file = fcheck_files(files, fd);
> +		file = fcheck_files(files, fd, false);
>  		if (file) {
>  			if (path) {
>  				*path = file->f_path;
> @@ -1755,7 +1755,7 @@ static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
>  		files = get_files_struct(task);
>  		if (files) {
>  			rcu_read_lock();
> -			if (fcheck_files(files, fd)) {
> +			if (fcheck_files(files, fd, false)) {
>  				rcu_read_unlock();
>  				put_files_struct(files);
>  				if (task_dumpable(task)) {
> @@ -1813,7 +1813,7 @@ static struct dentry *proc_fd_instantiate(struct inode *dir,
>  	 * hold ->file_lock.
>  	 */
>  	spin_lock(&files->file_lock);
> -	file = fcheck_files(files, fd);
> +	file = fcheck_files(files, fd, false);
>  	if (!file)
>  		goto out_unlock;
>  	if (file->f_mode & FMODE_READ)
> @@ -1899,7 +1899,7 @@ static int proc_readfd_common(struct file * filp, void * dirent,
>  				char name[PROC_NUMBUF];
>  				int len;
> 
> -				if (!fcheck_files(files, fd))
> +				if (!fcheck_files(files, fd, false))
>  					continue;
>  				rcu_read_unlock();
> 
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index 013dc52..76423ad 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -57,11 +57,12 @@ struct files_struct {
>  	struct file * fd_array[NR_OPEN_DEFAULT];
>  };
> 
> -#define rcu_dereference_check_fdtable(files, fdtfd) \
> +#define rcu_dereference_check_fdtable(files, fdtfd, cond) \
>  	(rcu_dereference_check((fdtfd), \
>  			       rcu_read_lock_held() || \
>  			       lockdep_is_held(&(files)->file_lock) || \
> -			       atomic_read(&(files)->count) == 1))
> +			       atomic_read(&(files)->count) == 1 || \
> +			       cond))
> 
>  #define files_fdtable(files) \
>  		(rcu_dereference_check_fdtable((files), (files)->fdt))
> @@ -79,13 +80,13 @@ static inline void free_fdtable(struct fdtable *fdt)
>  	call_rcu(&fdt->rcu, free_fdtable_rcu);
>  }
> 
> -static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
> +static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd, bool cond)
>  {
>  	struct file * file = NULL;
>  	struct fdtable *fdt = files_fdtable(files);
> 
>  	if (fd < fdt->max_fds)
> -		file = rcu_dereference_check_fdtable(files, fdt->fd[fd]);
> +		file = rcu_dereference_check_fdtable(files, fdt->fd[fd], cond);
>  	return file;
>  }
> 
> 
> 

^ permalink raw reply

* [PATCHv7] add mergeable buffers support to vhost_net
From: David L Stevens @ 2010-04-28 20:57 UTC (permalink / raw)
  To: mst; +Cc: netdev, kvm, virtualization

This patch adds mergeable receive buffer support to vhost_net.

Signed-off-by: David L Stevens <dlstevens@us.ibm.com>

diff -ruNp net-next-v0/drivers/vhost/net.c net-next-v7/drivers/vhost/net.c
--- net-next-v0/drivers/vhost/net.c	2010-04-24 21:36:54.000000000 -0700
+++ net-next-v7/drivers/vhost/net.c	2010-04-28 12:26:18.000000000 -0700
@@ -74,6 +74,23 @@ static int move_iovec_hdr(struct iovec *
 	}
 	return seg;
 }
+/* Copy iovec entries for len bytes from iovec. Return segments used. */
+static int copy_iovec_hdr(const struct iovec *from, struct iovec *to,
+			  size_t len, int iovcount)
+{
+	int seg = 0;
+	size_t size;
+	while (len && seg < iovcount) {
+		size = min(from->iov_len, len);
+		to->iov_base = from->iov_base;
+		to->iov_len = size;
+		len -= size;
+		++from;
+		++to;
+		++seg;
+	}
+	return seg;
+}
 
 /* Caller must have TX VQ lock */
 static void tx_poll_stop(struct vhost_net *net)
@@ -109,7 +126,7 @@ static void handle_tx(struct vhost_net *
 	};
 	size_t len, total_len = 0;
 	int err, wmem;
-	size_t hdr_size;
+	size_t vhost_hlen;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock)
 		return;
@@ -128,13 +145,13 @@ static void handle_tx(struct vhost_net *
 
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
-	hdr_size = vq->hdr_size;
+	vhost_hlen = vq->vhost_hlen;
 
 	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 NULL, NULL);
+		head = vhost_get_desc(&net->dev, vq, vq->iov,
+				      ARRAY_SIZE(vq->iov),
+				      &out, &in,
+				      NULL, NULL);
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
 			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
@@ -155,20 +172,20 @@ static void handle_tx(struct vhost_net *
 			break;
 		}
 		/* Skip header. TODO: support TSO. */
-		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
+		s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
 		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
 			       "%zd expected %zd\n",
-			       iov_length(vq->hdr, s), hdr_size);
+			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
-			vhost_discard_vq_desc(vq);
+			vhost_discard_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
@@ -187,12 +204,25 @@ static void handle_tx(struct vhost_net *
 	unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
+{
+	struct sk_buff *head;
+	int len = 0;
+
+	lock_sock(sk);
+	head = skb_peek(&sk->sk_receive_queue);
+	if (head)
+		len = head->len + vq->sock_hlen;
+	release_sock(sk);
+	return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned head, out, in, log, s;
+	unsigned in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -203,14 +233,14 @@ static void handle_rx(struct vhost_net *
 		.msg_flags = MSG_DONTWAIT,
 	};
 
-	struct virtio_net_hdr hdr = {
-		.flags = 0,
-		.gso_type = VIRTIO_NET_HDR_GSO_NONE
+	struct virtio_net_hdr_mrg_rxbuf hdr = {
+		.hdr.flags = 0,
+		.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
 
 	size_t len, total_len = 0;
-	int err;
-	size_t hdr_size;
+	int err, headcount, datalen;
+	size_t vhost_hlen;
 	struct socket *sock = rcu_dereference(vq->private_data);
 	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
 		return;
@@ -218,18 +248,19 @@ static void handle_rx(struct vhost_net *
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
-	hdr_size = vq->hdr_size;
+	vhost_hlen = vq->vhost_hlen;
 
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	for (;;) {
-		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-					 ARRAY_SIZE(vq->iov),
-					 &out, &in,
-					 vq_log, &log);
+	while ((datalen = vhost_head_len(vq, sock->sk))) {
+		headcount = vhost_get_desc_n(vq, vq->heads,
+					     datalen + vhost_hlen,
+					     &in, vq_log, &log);
+		if (headcount < 0)
+			break;
 		/* OK, now we need to know about added descriptors. */
-		if (head == vq->num) {
+		if (!headcount) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -241,46 +272,53 @@ static void handle_rx(struct vhost_net *
 			break;
 		}
 		/* We don't need to be notified again. */
-		if (out) {
-			vq_err(vq, "Unexpected descriptor format for RX: "
-			       "out %d, int %d\n",
-			       out, in);
-			break;
-		}
-		/* Skip header. TODO: support TSO/mergeable rx buffers. */
-		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
+		if (vhost_hlen)
+			/* Skip header. TODO: support TSO. */
+			s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
+		else
+			s = copy_iovec_hdr(vq->iov, vq->hdr, vq->sock_hlen, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
 			       "%zd expected %zd\n",
-			       iov_length(vq->hdr, s), hdr_size);
+			       iov_length(vq->hdr, s), vhost_hlen);
 			break;
 		}
 		err = sock->ops->recvmsg(NULL, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
-			vhost_discard_vq_desc(vq);
+			vhost_discard_desc(vq, headcount);
 			break;
 		}
-		/* TODO: Should check and handle checksum. */
-		if (err > len) {
-			pr_err("Discarded truncated rx packet: "
-			       " len %d > %zd\n", err, len);
-			vhost_discard_vq_desc(vq);
+		if (err != datalen) {
+			pr_err("Discarded rx packet: "
+			       " len %d, expected %zd\n", err, datalen);
+			vhost_discard_desc(vq, headcount);
 			continue;
 		}
 		len = err;
-		err = memcpy_toiovec(vq->hdr, (unsigned char *)&hdr, hdr_size);
-		if (err) {
-			vq_err(vq, "Unable to write vnet_hdr at addr %p: %d\n",
-			       vq->iov->iov_base, err);
+		if (vhost_hlen &&
+		    memcpy_toiovecend(vq->hdr, (unsigned char *)&hdr, 0,
+				      vhost_hlen)) {
+			vq_err(vq, "Unable to write vnet_hdr at addr %p\n",
+			       vq->iov->iov_base);
 			break;
 		}
-		len += hdr_size;
-		vhost_add_used_and_signal(&net->dev, vq, head, len);
+		/* TODO: Should check and handle checksum. */
+		if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF) &&
+		    memcpy_toiovecend(vq->hdr, (unsigned char *)&headcount,
+				      offsetof(typeof(hdr), num_buffers),
+				      sizeof(hdr.num_buffers))) {
+			vq_err(vq, "Failed num_buffers write");
+			vhost_discard_desc(vq, headcount);
+			break;
+		}
+		len += vhost_hlen;
+		vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
+					    headcount);
 		if (unlikely(vq_log))
 			vhost_log_write(vq, vq_log, log, len);
 		total_len += len;
@@ -561,9 +599,21 @@ done:
 
 static int vhost_net_set_features(struct vhost_net *n, u64 features)
 {
-	size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ?
-		sizeof(struct virtio_net_hdr) : 0;
+	size_t vhost_hlen, sock_hlen, hdr_len;
 	int i;
+
+	hdr_len = (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ?
+			sizeof(struct virtio_net_hdr_mrg_rxbuf) :
+			sizeof(struct virtio_net_hdr);
+	if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) {
+		/* vhost provides vnet_hdr */
+		vhost_hlen = hdr_len;
+		sock_hlen = 0;
+	} else {
+		/* socket provides vnet_hdr */
+		vhost_hlen = 0;
+		sock_hlen = hdr_len;
+	}
 	mutex_lock(&n->dev.mutex);
 	if ((features & (1 << VHOST_F_LOG_ALL)) &&
 	    !vhost_log_access_ok(&n->dev)) {
@@ -574,7 +624,8 @@ static int vhost_net_set_features(struct
 	smp_wmb();
 	for (i = 0; i < VHOST_NET_VQ_MAX; ++i) {
 		mutex_lock(&n->vqs[i].mutex);
-		n->vqs[i].hdr_size = hdr_size;
+		n->vqs[i].vhost_hlen = vhost_hlen;
+		n->vqs[i].sock_hlen = sock_hlen;
 		mutex_unlock(&n->vqs[i].mutex);
 	}
 	vhost_net_flush(n);
diff -ruNp net-next-v0/drivers/vhost/vhost.c net-next-v7/drivers/vhost/vhost.c
--- net-next-v0/drivers/vhost/vhost.c	2010-04-22 11:31:57.000000000 -0700
+++ net-next-v7/drivers/vhost/vhost.c	2010-04-28 11:16:13.000000000 -0700
@@ -114,7 +114,8 @@ static void vhost_vq_reset(struct vhost_
 	vq->used_flags = 0;
 	vq->log_used = false;
 	vq->log_addr = -1ull;
-	vq->hdr_size = 0;
+	vq->vhost_hlen = 0;
+	vq->sock_hlen = 0;
 	vq->private_data = NULL;
 	vq->log_base = NULL;
 	vq->error_ctx = NULL;
@@ -861,6 +862,53 @@ static unsigned get_indirect(struct vhos
 	return 0;
 }
 
+/* This is a multi-buffer version of vhost_get_desc
+ * @vq		- the relevant virtqueue
+ * datalen	- data length we'll be reading
+ * @iovcount	- returned count of io vectors we fill
+ * @log		- vhost log
+ * @log_num	- log offset
+ *	returns number of buffer heads allocated, negative on error
+ */
+int vhost_get_desc_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+		     int datalen, unsigned *iovcount, struct vhost_log *log,
+		     unsigned int *log_num)
+{
+	unsigned int out, in;
+	int seg = 0;
+	int headcount = 0;
+	int r;
+
+	while (datalen > 0) {
+		if (headcount >= VHOST_NET_MAX_SG) {
+			r = -ENOBUFS;
+			goto err;
+		}
+		heads[headcount].id = vhost_get_desc(vq->dev, vq, vq->iov + seg,
+					      ARRAY_SIZE(vq->iov) - seg, &out,
+					      &in, log, log_num);
+		if (heads[headcount].id == vq->num) {
+			r = 0;
+			goto err;
+		}
+		if (out || in <= 0) {
+			vq_err(vq, "unexpected descriptor format for RX: "
+				"out %d, in %d\n", out, in);
+			r = -EINVAL;
+			goto err;
+		}
+		heads[headcount].len = iov_length(vq->iov + seg, in);
+		datalen -= heads[headcount].len;
+		++headcount;
+		seg += in;
+	}
+	*iovcount = seg;
+	return headcount;
+err:
+	vhost_discard_desc(vq, headcount);
+	return r;
+}
+
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
@@ -868,7 +916,7 @@ static unsigned get_indirect(struct vhos
  *
  * This function returns the descriptor number found, or vq->num (which
  * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+unsigned vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 			   struct iovec iov[], unsigned int iov_size,
 			   unsigned int *out_num, unsigned int *in_num,
 			   struct vhost_log *log, unsigned int *log_num)
@@ -986,9 +1034,9 @@ unsigned vhost_get_vq_desc(struct vhost_
 }
 
 /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
-void vhost_discard_vq_desc(struct vhost_virtqueue *vq)
+void vhost_discard_desc(struct vhost_virtqueue *vq, int n)
 {
-	vq->last_avail_idx--;
+	vq->last_avail_idx -= n;
 }
 
 /* After we've used one of their buffers, we tell them about it.  We'll then
@@ -1033,6 +1081,68 @@ int vhost_add_used(struct vhost_virtqueu
 	return 0;
 }
 
+static void vhost_log_used(struct vhost_virtqueue *vq,
+			   struct vring_used_elem __user *used)
+{
+	/* Make sure data is seen before log. */
+	smp_wmb();
+	/* Log used ring entry write. */
+	log_write(vq->log_base,
+		  vq->log_addr +
+		   ((void __user *)used - (void __user *)vq->used),
+		  sizeof *used);
+	/* Log used index update. */
+	log_write(vq->log_base,
+		  vq->log_addr + offsetof(struct vring_used, idx),
+		  sizeof vq->used->idx);
+	if (vq->log_ctx)
+		eventfd_signal(vq->log_ctx, 1);
+}
+
+static int __vhost_add_used_n(struct vhost_virtqueue *vq,
+			    struct vring_used_elem *heads,
+			    unsigned count)
+{
+	struct vring_used_elem __user *used;
+	int start;
+
+	start = vq->last_used_idx % vq->num;
+	used = vq->used->ring + start;
+	if (copy_to_user(used, heads, count * sizeof *used)) {
+		vq_err(vq, "Failed to write used");
+		return -EFAULT;
+	}
+	/* Make sure buffer is written before we update index. */
+	smp_wmb();
+	if (put_user(vq->last_used_idx + count, &vq->used->idx)) {
+		vq_err(vq, "Failed to increment used idx");
+		return -EFAULT;
+	}
+	if (unlikely(vq->log_used))
+		vhost_log_used(vq, used);
+	vq->last_used_idx += count;
+	return 0;
+}
+
+/* After we've used one of their buffers, we tell them about it.  We'll then
+ * want to notify the guest, using eventfd. */
+int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
+		     unsigned count)
+{
+	int start, n, r;
+
+	start = vq->last_used_idx % vq->num;
+	n = vq->num - start;
+	if (n < count) {
+		r = __vhost_add_used_n(vq, heads, n);
+		if (r < 0)
+			return r;
+		heads += n;
+		count -= n;
+	}
+	return __vhost_add_used_n(vq, heads, count);
+}
+
 /* This actually signals the guest, using eventfd. */
 void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
@@ -1062,6 +1172,15 @@ void vhost_add_used_and_signal(struct vh
 	vhost_signal(dev, vq);
 }
 
+/* multi-buffer version of vhost_add_used_and_signal */
+void vhost_add_used_and_signal_n(struct vhost_dev *dev,
+				 struct vhost_virtqueue *vq,
+				 struct vring_used_elem *heads, unsigned count)
+{
+	vhost_add_used_n(vq, heads, count);
+	vhost_signal(dev, vq);
+}
+
 /* OK, now we need to know about added descriptors. */
 bool vhost_enable_notify(struct vhost_virtqueue *vq)
 {
@@ -1086,7 +1205,7 @@ bool vhost_enable_notify(struct vhost_vi
 		return false;
 	}
 
-	return avail_idx != vq->last_avail_idx;
+	return avail_idx != vq->avail_idx;
 }
 
 /* We don't need to be notified again. */
diff -ruNp net-next-v0/drivers/vhost/vhost.h net-next-v7/drivers/vhost/vhost.h
--- net-next-v0/drivers/vhost/vhost.h	2010-04-24 21:37:41.000000000 -0700
+++ net-next-v7/drivers/vhost/vhost.h	2010-04-26 10:35:25.000000000 -0700
@@ -84,7 +84,9 @@ struct vhost_virtqueue {
 	struct iovec indirect[VHOST_NET_MAX_SG];
 	struct iovec iov[VHOST_NET_MAX_SG];
 	struct iovec hdr[VHOST_NET_MAX_SG];
-	size_t hdr_size;
+	size_t vhost_hlen;
+	size_t sock_hlen;
+	struct vring_used_elem heads[VHOST_NET_MAX_SG];
 	/* We use a kind of RCU to access private pointer.
 	 * All readers access it from workqueue, which makes it possible to
 	 * flush the workqueue instead of synchronize_rcu. Therefore readers do
@@ -120,16 +122,23 @@ long vhost_dev_ioctl(struct vhost_dev *,
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
 
-unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+int vhost_get_desc_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
+		     int datalen, unsigned int *iovcount, struct vhost_log *log,
+		     unsigned int *log_num);
+unsigned vhost_get_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
 			   struct vhost_log *log, unsigned int *log_num);
-void vhost_discard_vq_desc(struct vhost_virtqueue *);
+void vhost_discard_desc(struct vhost_virtqueue *, int);
 
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
+int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
+		     unsigned count);
 void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
-			       unsigned int head, int len);
+			       unsigned int id, int len);
+void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
+			       struct vring_used_elem *heads, unsigned count);
+void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
 void vhost_disable_notify(struct vhost_virtqueue *);
 bool vhost_enable_notify(struct vhost_virtqueue *);
 
@@ -149,7 +158,8 @@ enum {
 	VHOST_FEATURES = (1 << VIRTIO_F_NOTIFY_ON_EMPTY) |
 			 (1 << VIRTIO_RING_F_INDIRECT_DESC) |
 			 (1 << VHOST_F_LOG_ALL) |
-			 (1 << VHOST_NET_F_VIRTIO_NET_HDR),
+			 (1 << VHOST_NET_F_VIRTIO_NET_HDR) |
+			 (1 << VIRTIO_NET_F_MRG_RXBUF),
 };
 
 static inline int vhost_has_feature(struct vhost_dev *dev, int bit)



^ permalink raw reply

* [net-next-2.6 PATCH] igb: Clean up left over prototype of igb_get_hw_dev_name()
From: Jeff Kirsher @ 2010-04-28 20:59 UTC (permalink / raw)
  To: davem; +Cc: netdev, gospo, Emil Tantilov, Jeff Kirsher

From: Emil Tantilov <emil.s.tantilov@intel.com>

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---

 drivers/net/igb/igb.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 096a526..735ede9 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -338,7 +338,6 @@ enum igb_boards {
 extern char igb_driver_name[];
 extern char igb_driver_version[];
 
-extern char *igb_get_hw_dev_name(struct e1000_hw *hw);
 extern int igb_up(struct igb_adapter *);
 extern void igb_down(struct igb_adapter *);
 extern void igb_reinit_locked(struct igb_adapter *);


^ permalink raw reply related

* Re: vlan performance issue on outgoing traffic (solved)
From: R. Weinedel @ 2010-04-28 21:13 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: netdev@vger.kernel.org
In-Reply-To: <alpine.WNT.2.00.1004271110300.4368@jbrandeb-desk1.amr.corp.intel.com>

I made some further test with kernel 2.6.26 and I recognized that my cpu
is really very busy (90-100 % sw-irq's) during offloads.

After that I build and install a 2.6.33 kernel and the problem seems to
be solved. Now I got 946 Mbit/s offload (iperf) !.

Ralf Weinedel


^ permalink raw reply

* Re: pull request: wireless-2.6 2010-04-15
From: Hauke Mehrtens @ 2010-04-28 21:20 UTC (permalink / raw)
  To: David Miller
  Cc: linville-2XuSBdqkA4R54TAoqtyWWQ,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20100415.142907.68448693.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

Hi David,

in your merge in 5c01d5669356e13f0fb468944c1dd4c6a7e978ad you added "int
i;" into wl1271_main.c which is unused in that function.

This patch fixes the merge problem:

Signed-off-by: Hauke Mehrtens <hauke-5/S+JYg5SzeELgA04lAiVw@public.gmane.org>

--- a/drivers/net/wireless/wl12xx/wl1271_main.c
+++ b/drivers/net/wireless/wl12xx/wl1271_main.c
@@ -1311,7 +1311,6 @@
 	struct wl1271_filter_params *fp;
 	struct netdev_hw_addr *ha;
 	struct wl1271 *wl = hw->priv;
-	int i;

 	if (unlikely(wl->state == WL1271_STATE_OFF))
 		return 0;

David Miller wrote:
> From: "John W. Linville" <linville-2XuSBdqkA4R54TAoqtyWWQ@public.gmane.org>
> Date: Thu, 15 Apr 2010 16:03:31 -0400
> 
>> Another fix intended for 2.6.34...without it some firmware wierdness can
>> induce the driver into hanging the box... :-(
>>
>> Please let me know if there are problems!
> 
> Pulled, thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: pull request: wireless-2.6 2010-04-15
From: David Miller @ 2010-04-28 21:23 UTC (permalink / raw)
  To: hauke-5/S+JYg5SzeELgA04lAiVw
  Cc: linville-2XuSBdqkA4R54TAoqtyWWQ,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <4BD8A6B4.20603-5/S+JYg5SzeELgA04lAiVw@public.gmane.org>

From: Hauke Mehrtens <hauke-5/S+JYg5SzeELgA04lAiVw@public.gmane.org>
Date: Wed, 28 Apr 2010 23:20:52 +0200

> Hi David,
> 
> in your merge in 5c01d5669356e13f0fb468944c1dd4c6a7e978ad you added "int
> i;" into wl1271_main.c which is unused in that function.
> 
> This patch fixes the merge problem:
> 
> Signed-off-by: Hauke Mehrtens <hauke-5/S+JYg5SzeELgA04lAiVw@public.gmane.org>

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)
From: David Miller @ 2010-04-28 21:23 UTC (permalink / raw)
  To: vladislav.yasevich; +Cc: nhorman, sri, linux-sctp, eteo, netdev, security
In-Reply-To: <4BD89C70.6080406@hp.com>

From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Wed, 28 Apr 2010 16:37:04 -0400

> 
> Looks good.
> 
> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>

Applied, thanks Neil and Vlad.

^ permalink raw reply

* Re: [PATCH] forcedeth: Stay in NAPI as long as there's work
From: David Miller @ 2010-04-28 21:25 UTC (permalink / raw)
  To: shemminger; +Cc: joe, therbert, netdev, aabdulla
In-Reply-To: <20100428112528.01277670@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Wed, 28 Apr 2010 11:25:28 -0700

> The following does the same thing without the extra overhead
> of testing all the registers. It also handles the out of memory
> case.
> 
> Compile tested only...
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Tom can you test this version?

^ permalink raw reply

* Re: [net-next-2.6 PATCH] igb: Clean up left over prototype of igb_get_hw_dev_name()
From: David Miller @ 2010-04-28 21:26 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo, emil.s.tantilov
In-Reply-To: <20100428205949.30209.69599.stgit@localhost.localdomain>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 28 Apr 2010 13:59:53 -0700

> From: Emil Tantilov <emil.s.tantilov@intel.com>
> 
> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v2] net/sb1250: register mdio bus in probe
From: David Miller @ 2010-04-28 21:32 UTC (permalink / raw)
  To: sebastian; +Cc: ralf, netdev
In-Reply-To: <20100428195701.GA3461@Chamillionaire.breakpoint.cc>

From: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Date: Wed, 28 Apr 2010 21:57:01 +0200

> "ifconfig eth0 up && ifconfig eth0 down" triggers:
 ...
> mdiobus_register() calls device_register() which initializes the kobj of
> the device. mdiobus_unregister() calls only device_del() so we have one
> reference left. That one is leaving with mdiobus_free() which is only
> called on remove.
> Since I don't see any reason why mdiobus_register()/mdiobus_unregister()
> should happen in ->open()/->close() I move them to probe & exit.
> 
> Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>

This looks a lot better, applied, thanks Sebastian!

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: speedup udp receive path
From: David Miller @ 2010-04-28 21:36 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hadi, xiaosuo, therbert, shemminger, netdev, eilong, bmb
In-Reply-To: <1272463605.2267.70.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 28 Apr 2010 16:06:45 +0200

> [PATCH net-next-2.6] net: speedup udp receive path
> 
> Since commit 95766fff ([UDP]: Add memory accounting.), 
> each received packet needs one extra sock_lock()/sock_release() pair.
> 
> This added latency because of possible backlog handling. Then later,
> ticket spinlocks added yet another latency source in case of DDOS.
> 
> This patch introduces lock_sock_bh() and unlock_sock_bh()
> synchronization primitives, avoiding one atomic operation and backlog
> processing.
> 
> skb_free_datagram_locked() uses them instead of full blown
> lock_sock()/release_sock(). skb is orphaned inside locked section for
> proper socket memory reclaim, and finally freed outside of it.
> 
> UDP receive path now take the socket spinlock only once.
> 
> Signed-off-by: Eric DUmazet <eric.dumazet@gmail.com>

Clever, let's see what this breaks :-)

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-28 21:37 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <p2ka44ae5cd1004281358n86ce29d2tbece16b2fb974dab@mail.gmail.com>

On Wed, Apr 28, 2010 at 04:58:54PM -0400, Miles Lane wrote:
> On Sat, Apr 24, 2010 at 10:34 PM, Paul E. McKenney <
> paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Apr 23, 2010 at 06:59:12PM -0400, Miles Lane wrote:

[ . . . ]

> > commit 0868dd631def762ba00c2f0f397a53c5cdf24ae2
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Sat Apr 24 19:23:30 2010 -0700
> >
> >    block-cgroup: fix RCU-lockdep splat in blkiocg_add_blkio_group()
> >
> >    It is necessary to be in an RCU read-side critical section when invoking
> >    css_id(), so this patch adds one to blkiocg_add_blkio_group().  This is
> >    actually a false positive, because this is called at initialization
> > time,
> >    and hence always refers to the root cgroup, which cannot go away.
> >
> >    Located-by: Miles Lane <miles.lane@gmail.com>
> >    Suggested-by: Vivek Goyal <vgoyal@redhat.com>
> >    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 5fe03de..55c8c73 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -71,7 +71,9 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
> >
> >        spin_lock_irqsave(&blkcg->lock, flags);
> >        rcu_assign_pointer(blkg->key, key);
> > +       rcu_read_lock();
> >        blkg->blkcg_id = css_id(&blkcg->css);
> > +       rcu_read_unlock();
> >        hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
> >        spin_unlock_irqrestore(&blkcg->lock, flags);
> >  #ifdef CONFIG_DEBUG_BLK_CGROUP
> >
> 
> Hi Paul,
> Did this patch make it into your patch set?  Has your patch set gone into
> the Linus tree?
> I just tested 2.6.34-rc5-git8 and hit one of the issues again.  I think this
> patch is intended to correct this issue?

I replaced the above with an improved patch from Vivek Goyal, which has
not yet reached mainline.  I will resend my patch stack.

							Thanx, Paul

> [    2.289598] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    2.289604] ---------------------------------------------------
> [    2.289610] kernel/cgroup.c:4438 invoked rcu_dereference_check() without
> protection!
> [    2.289615]
> [    2.289617] other info that might help us debug this:
> [    2.289619]
> [    2.289624]
> [    2.289626] rcu_scheduler_active = 1, debug_locks = 1
> [    2.289632] 2 locks held by async/1/575:
> [    2.289637]  #0:  (&shost->scan_mutex){+.+.+.}, at: [<c121063d>]
> __scsi_add_device+0x5b/0xab
> [    2.289660]  #1:  (&(&blkcg->lock)->rlock){......}, at: [<c1143acb>]
> blkiocg_add_blkio_group+0x1a/0x73
> [    2.289678]
> [    2.289680] stack backtrace:
> [    2.289688] Pid: 575, comm: async/1 Not tainted 2.6.34-rc5-git8 #17
> [    2.289693] Call Trace:
> [    2.289704]  [<c12ee273>] ? printk+0xf/0x14
> [    2.289715]  [<c1050fbd>] lockdep_rcu_dereference+0x74/0x7d
> [    2.289725]  [<c106227d>] css_id+0x37/0x46
> [    2.289734]  [<c1143adc>] blkiocg_add_blkio_group+0x2b/0x73
> [    2.289744]  [<c1146a19>] cfq_init_queue+0xd6/0x2a3
> [    2.289755]  [<c120d657>] ? scsi_request_fn+0x0/0x3ea
> [    2.289764]  [<c113a316>] elevator_init+0xa1/0xd5
> [    2.289774]  [<c113bc3f>] blk_init_queue_node+0x103/0x109
> [    2.289783]  [<c113bc50>] blk_init_queue+0xb/0xd
> [    2.289792]  [<c120da58>] __scsi_alloc_queue+0x17/0xef
> [    2.289802]  [<c120db40>] scsi_alloc_queue+0x10/0x49
> [    2.289811]  [<c120f381>] scsi_alloc_sdev+0x14f/0x1ef
> [    2.289821]  [<c120f617>] scsi_probe_and_add_lun+0xb5/0x7ed
> [    2.289831]  [<c10517dc>] ? trace_hardirqs_on_caller+0x119/0x141
> [    2.289843]  [<c1210669>] __scsi_add_device+0x87/0xab
> [    2.289854]  [<c1232dfe>] ata_scsi_scan_host+0x64/0x136
> [    2.289865]  [<c12312c3>] async_port_probe+0x9e/0xa4
> [    2.289876]  [<c10479c8>] async_thread+0xf0/0x1d4
> [    2.289887]  [<c102b474>] ? default_wake_function+0x0/0xd
> [    2.289896]  [<c10478d8>] ? async_thread+0x0/0x1d4
> [    2.289906]  [<c1041a2a>] kthread+0x6a/0x6f
> [    2.289916]  [<c10419c0>] ? kthread+0x0/0x6f
> [    2.289926]  [<c1003742>] kernel_thread_helper+0x6/0x1a
> [    2.289934]
> [    2.289935] ===================================================
> [    2.289941] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    2.289946] ---------------------------------------------------
> [    2.289951] kernel/cgroup.c:1651 invoked rcu_dereference_check() without
> protection!
> [    2.289957]
> [    2.289958] other info that might help us debug this:
> [    2.289961]
> [    2.289966]
> [    2.289967] rcu_scheduler_active = 1, debug_locks = 1
> [    2.289973] 1 lock held by async/1/575:
> [    2.289978]  #0:  (&shost->scan_mutex){+.+.+.}, at: [<c121063d>]
> __scsi_add_device+0x5b/0xab
> [    2.289995]
> [    2.289996] stack backtrace:
> [    2.290003] Pid: 575, comm: async/1 Not tainted 2.6.34-rc5-git8 #17
> [    2.290008] Call Trace:
> [    2.290016]  [<c12ee273>] ? printk+0xf/0x14
> [    2.290025]  [<c1050fbd>] lockdep_rcu_dereference+0x74/0x7d
> [    2.290035]  [<c106423a>] cgroup_path+0x4a/0x110
> [    2.290045]  [<c1143b12>] blkiocg_add_blkio_group+0x61/0x73
> [    2.290055]  [<c1146a19>] cfq_init_queue+0xd6/0x2a3
> [    2.290065]  [<c120d657>] ? scsi_request_fn+0x0/0x3ea
> [    2.290074]  [<c113a316>] elevator_init+0xa1/0xd5
> [    2.290083]  [<c113bc3f>] blk_init_queue_node+0x103/0x109
> [    2.290093]  [<c113bc50>] blk_init_queue+0xb/0xd
> [    2.290102]  [<c120da58>] __scsi_alloc_queue+0x17/0xef
> [    2.290111]  [<c120db40>] scsi_alloc_queue+0x10/0x49
> [    2.290120]  [<c120f381>] scsi_alloc_sdev+0x14f/0x1ef
> [    2.290131]  [<c120f617>] scsi_probe_and_add_lun+0xb5/0x7ed
> [    2.290140]  [<c10517dc>] ? trace_hardirqs_on_caller+0x119/0x141
> [    2.290152]  [<c1210669>] __scsi_add_device+0x87/0xab
> [    2.290162]  [<c1232dfe>] ata_scsi_scan_host+0x64/0x136
> [    2.290172]  [<c12312c3>] async_port_probe+0x9e/0xa4
> [    2.290182]  [<c10479c8>] async_thread+0xf0/0x1d4
> [    2.290192]  [<c102b474>] ? default_wake_function+0x0/0xd
> [    2.290202]  [<c10478d8>] ? async_thread+0x0/0x1d4
> [    2.290211]  [<c1041a2a>] kthread+0x6a/0x6f
> [    2.290221]  [<c10419c0>] ? kthread+0x0/0x6f
> [    2.290230]  [<c1003742>] kernel_thread_helper+0x6/0x1a

^ permalink raw reply

* Re: [PATCH]: sctp: Fix skb_over_panic resulting from multiple invalid parameter errors (CVE-2010-1173) (v4)
From: Neil Horman @ 2010-04-28 21:50 UTC (permalink / raw)
  To: David Miller; +Cc: vladislav.yasevich, sri, linux-sctp, eteo, netdev, security
In-Reply-To: <20100428.142339.45883821.davem@davemloft.net>

On Wed, Apr 28, 2010 at 02:23:39PM -0700, David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Wed, 28 Apr 2010 16:37:04 -0400
> 
> > 
> > Looks good.
> > 
> > Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> 
> Applied, thanks Neil and Vlad.
> 
Thanks!
Neil


^ permalink raw reply

* [PATCH net-next-2.6] net: ip_queue_rcv_skb() helper
From: Eric Dumazet @ 2010-04-28 22:22 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20100428.143610.232922250.davem@davemloft.net>

Le mercredi 28 avril 2010 à 14:36 -0700, David Miller a écrit :

> 
> Clever, let's see what this breaks :-)
> 
> Applied, thanks Eric.

Thanks ;)

Let's respin an old work about dst, with a first small work unit :

Next patch will try to not touch dst refcount in input path (previously
attempted in July 2009)
Ref : http://kerneltrap.org/mailarchive/linux-netdev/2009/7/22/6248753


[PATCH net-next-2.6] net: ip_queue_rcv_skb() helper

When queueing a skb to socket, we can immediately release its dst if
target socket do not use IP_CMSG_PKTINFO.

tcp_data_queue() can drop dst too.

This to benefit from a hot cache line and avoid the receiver, possibly
on another cpu, to dirty this cache line himself.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h       |    1 +
 net/ipv4/ip_sockglue.c |   16 ++++++++++++++++
 net/ipv4/raw.c         |    2 +-
 net/ipv4/tcp_input.c   |    1 +
 net/ipv4/udp.c         |    2 +-
 net/ipv6/raw.c         |    2 +-
 net/ipv6/udp.c         |    2 +-
 7 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index a84ceb6..8149b77 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -393,6 +393,7 @@ extern int ip_options_rcv_srr(struct sk_buff *skb);
  *	Functions provided by ip_sockglue.c
  */
 
+extern int	ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
 extern void	ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 extern int	ip_cmsg_send(struct net *net,
 			     struct msghdr *msg, struct ipcm_cookie *ipc);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index b0aa054..ce23178 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -954,6 +954,22 @@ e_inval:
 	return -EINVAL;
 }
 
+/**
+ * ip_queue_rcv_skb - Queue an skb into sock receive queue
+ * @sk: socket
+ * @skb: buffer
+ *
+ * Queues an skb into socket receive queue. If IP_CMSG_PKTINFO option
+ * is not set, we drop skb dst entry now, while dst cache line is hot.
+ */
+int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
+		skb_dst_drop(skb);
+	return sock_queue_rcv_skb(sk, skb);
+}
+EXPORT_SYMBOL(ip_queue_rcv_skb);
+
 int ip_setsockopt(struct sock *sk, int level,
 		int optname, char __user *optval, unsigned int optlen)
 {
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index cc6f097..52ef5af 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -290,7 +290,7 @@ static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
 {
 	/* Charge it to the socket. */
 
-	if (sock_queue_rcv_skb(sk, skb) < 0) {
+	if (ip_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ae3ec15..e82162c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4367,6 +4367,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
 		goto drop;
 
+	skb_dst_drop(skb);
 	__skb_pull(skb, th->doff * 4);
 
 	TCP_ECN_accept_cwr(tp, skb);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 63eb56b..8591398 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1264,7 +1264,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (inet_sk(sk)->inet_daddr)
 		sock_rps_save_rxhash(sk, skb->rxhash);
 
-	rc = sock_queue_rcv_skb(sk, skb);
+	rc = ip_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 8562738..0e3d2dd 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -381,7 +381,7 @@ static inline int rawv6_rcv_skb(struct sock * sk, struct sk_buff * skb)
 	}
 
 	/* Charge it to the socket. */
-	if (sock_queue_rcv_skb(sk, skb) < 0) {
+	if (ip_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 3ead20a..aa0e47a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -514,7 +514,7 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	if ((rc = sock_queue_rcv_skb(sk, skb)) < 0) {
+	if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
 		/* Note that an ENOMEM error is charged twice */
 		if (rc == -ENOMEM)
 			UDP6_INC_STATS_BH(sock_net(sk),



^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: ip_queue_rcv_skb() helper
From: David Miller @ 2010-04-28 22:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1272493364.2201.67.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Apr 2010 00:22:44 +0200

> Next patch will try to not touch dst refcount in input path (previously
> attempted in July 2009)
> Ref : http://kerneltrap.org/mailarchive/linux-netdev/2009/7/22/6248753

Yes, I remember this.

> [PATCH net-next-2.6] net: ip_queue_rcv_skb() helper
> 
> When queueing a skb to socket, we can immediately release its dst if
> target socket do not use IP_CMSG_PKTINFO.
> 
> tcp_data_queue() can drop dst too.
> 
> This to benefit from a hot cache line and avoid the receiver, possibly
> on another cpu, to dirty this cache line himself.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Pretty soon the whole receive path will be "read mostly" :-)

Applied, thanks Eric.

^ permalink raw reply

* Re: [net-next-2.6 PATCH 2/2] add ndo_set_port_profile op support for enic dynamic vnics
From: Scott Feldman @ 2010-04-28 22:38 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: davem, netdev, chrisw
In-Reply-To: <201004282116.14601.arnd@arndb.de>

On 4/28/10 12:16 PM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Wednesday 28 April 2010, Scott Feldman wrote:

>>> Passing just the slave device however would not work in the general case, as
>>> I tried to point out in the mail you replied to. If the slave interface is
>>> owned by a guest using PCI passthrough, or it sits below a stack of nested
>>> interfaces (vlan, bridge, tap, vhost, ...), it's impossible to know what
>>> interface is responsible for setting up the slave.
>> 
>> For port-profile, we want to pass the device that is to be "plugged-in" to
>> the network based on port-profile association.  This is the device that
>> gives basic connectivity to the guest interface, regardless of how the guest
>> interface is wired to the device.  It could be direct PCI pass-thru, macvtap
>> stack, some yet-to-be-invented kernel-bypass stack, etc.
> 
> But if the device is already passed to the guest using pass-thru or
> containers, you would no longer to query or change the port profile,
> because it is no longer visible in the host, right?

Drats, I made a mistake here.  You're right, in the pass-thru case the host
lost control of the device, so we need another device to proxy the
port-profile for the pass-thru device.  I had this in the second patch
submission where I was trying to extend the SR-IOV if_link cmds to included
port-profile, but that got mired down in the VF discussions.
  
>>> Note that you cannot perform the association
>>> through the slave interface itself because the remote switch would discard
>>> any traffic originating from an unassociated interface.
>> 
>> That's not a limitation of our device/switch.
> 
> This seems to contradict what you write above, at least when you drop the
> assumption that the protocol is implemented in the NIC firmware.
> The switch obviously does not care about the interface name in Linux or
> any of its data structures. What it cares about instead is the traffic on
> the wire and which of its ports this takes place on.
> 
> When you create a new dynamic enic device or a macvtap port but not assoicate
> it, the switch cannot allow this device to send or receive any traffic itself,
> as you write above (not 'plugged in'). The application (or firmware, for that
> matter) therefore needs to talk to the switch over an interface that is
> already
> associated. With VDP, this is the base device that a VEPA port is created
> from,
> i.e. the one that talks LLDP to the switch, i.e. the one that comes up at boot
> time when you have no virtualization and plug into a dumb switch.
> 
> I assumed that this was a specific PF in your NIC,  but it now sounds like it
> could be an internal device that is only visible in your firmware and not
> exposed
> as a network interface in Linux, right?

Yes, that's right.  Without going into implementation details, assume any
enic has firmware with private mgmt channel to switch to do the equivalent
of your base device->LLDP->switch.

> Your firmware can obviously find out the right communication channel for
> a associating a dynamic interface with the switch, but when this is implemnted
> in software, we cannot generally know that and rely on getting access to the
> interface that lets us talk to the switch. The information which interface
> is getting associated however is completely useless to an implementation like
> this.

So we're kind of back to where we were with iovnl.  We need to specify both
devices, the base device that has access to the switch and the target device
to associate the port-profile with.  Something like:

   ip port_profile set DEVICE [ base DEVICE ] [ { pre_associate |
                                                  pre_associate_rr } ]
                              { name PORT-PROFILE | vsi MGR:VTID:VER }
                              mac LLADDR
                              [ vlan VID ]
                              [ host_uuid HOST_UUID ]
                              [ client_uuid CLIENT_UUID ]
                              [ client_name CLIENT_NAME ]
   ip port_profile del DEVICE [ base DEVICE ] [ mac LLADDR [ vlan VID ] ]
   ip port_profile show DEVICE [ base DEVICE ]

The netdev ops are (when netlink msg handled in kernel):

    ndo_set_port_profile(netdev *target, ...)
    ndo_get_port_profile(netdev *target, ...)
    ndo_del_port_profile(netdev *target, ...)

Base device is optional.  If base device is not given, then target device
gets netdev ops.  If base device is given, then base device gets netdev ops
and *target refers to target device.  This covers the following cases:

1. Current enic where base == target since target can communicate directly
with switch to associate port-profile.  This will not work for the enic
pass-thru case as noted earlier.  We get:

    ip port_profile set eth0 name joes-garage ...

And

    eth0:ndo_set_port_profile(NULL, ...)

2. Future enic for pass-thru case where base != target.  We get:

    ip port_profile set eth1 base eth0 name joes-garage ...

And

    eth0:ndi_set_port_profile(eth1, ...)

3. Future VEPA, we get:

    ip port_profile set eth11 base eth10 vsi 1:23456:7

And (here netlink msg handled in user-space):

    VDP msg sent on eth10 to set port-profile on eth11 using vis tuple
    

Does this work?  I want to get agreement before coding up patch attempt #4.

-scott



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox