Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: ipgre rss is broken since gro
From: Dmitry Kravkov @ 2012-12-09 20:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev@vger.kernel.org
In-Reply-To: <1355018645.3113.0.camel@lb-tlvb-dmitry.il.broadcom.com>

On Sun, 2012-12-09 at 04:04 +0200, Dmitry Kravkov wrote:
> On Sat, 2012-12-08 at 18:01 -0800, Eric Dumazet wrote:
> > On Sat, Dec 8, 2012 at 3:31 PM, Dmitry Kravkov <dmitry@broadcom.com> wrote:
> > > Here is the trace for a while:
> > >
> > > BUG: unable to handle kernel NULL pointer dereference at           (null)
> > > IP: [<ffffffff8144f35e>] skb_gro_receive+0xbe/0x5a0
> > 
> > Hi Dmitry
> > 
> > NULL pointer deref probably fixed on net tree (or Linus tree) by
> > 
> > http://git.kernel.org/?p=linux/kernel/git/davem/net.git;a=commit;h=c3c7c254b2e8cd99b0adf288c2a1bddacd7ba255

This resolved the deref. Thanks.

> > For the GRO stuff and RSS, I wonder if skbs have a property that makes
> > them dropped somewhere, you might try drop_monitor / drop_watch

for this item: drop_watch does not show any drops (i've disable all
other interfaces for clear env)
I will explain a little bit more the setup:
bnx2x device (under testing) is configured for RSS for IPGRE packets.
Sending multiple (3) TCP_STREAM causes ip_gre interface to disappear
packets (even ICMP).
This is not happening with single TCP_STREAM, or before gro_cell
introduction.

i was searching for the drops by print-out in ip_gre.c but disappeared
packets completes this code:

static int ipgre_rcv(struct sk_buff *skb)
(snip)
	        printk("%s:%d\n", __FUNCTION__, __LINE__);

                tstats = this_cpu_ptr(tunnel->dev->tstats);
                u64_stats_update_begin(&tstats->syncp);
                tstats->rx_packets++;
                tstats->rx_bytes += skb->len;
                u64_stats_update_end(&tstats->syncp);

                gro_cells_receive(&tunnel->gro_cells, skb);
                return 0;

> > 
> 
> I will try both and update ... 
> Thanks

^ permalink raw reply

* [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: Neal Cardwell @ 2012-12-09 21:09 UTC (permalink / raw)
  To: David Miller; +Cc: edumazet, netdev, Neal Cardwell

Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.

Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
[Patch was generated relative to the previous 3-patch inet_diag patch series
 from last night.]
 net/ipv4/inet_diag.c |   30 +++++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 95f1a45..67746f5 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -557,6 +557,16 @@ static bool valid_hostcond(const struct inet_diag_bc_op *op, int len,
 	return true;
 }
 
+/* Validate a port comparison operator. */
+static inline bool valid_port_comparison(const struct inet_diag_bc_op *op,
+					 int len, int *min_len)
+{
+	/* Port comparisons put the port in a follow-on inet_diag_bc_op. */
+	*min_len += sizeof(struct inet_diag_bc_op);
+	if (len < *min_len)
+		return false;
+}
+
 static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
 {
 	const void *bc = bytecode;
@@ -572,24 +582,30 @@ static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
 		case INET_DIAG_BC_D_COND:
 			if (!valid_hostcond(bc, len, &min_len))
 				return -EINVAL;
-			/* fall through */
-		case INET_DIAG_BC_AUTO:
+			break;
 		case INET_DIAG_BC_S_GE:
 		case INET_DIAG_BC_S_LE:
 		case INET_DIAG_BC_D_GE:
 		case INET_DIAG_BC_D_LE:
-		case INET_DIAG_BC_JMP:
-			if (op->no < min_len || op->no > len + 4 || op->no & 3)
-				return -EINVAL;
-			if (op->no < len &&
-			    !valid_cc(bytecode, bytecode_len, len - op->no))
+			if (!valid_port_comparison(bc, len, &min_len))
 				return -EINVAL;
 			break;
+		case INET_DIAG_BC_AUTO:
+		case INET_DIAG_BC_JMP:
 		case INET_DIAG_BC_NOP:
 			break;
 		default:
 			return -EINVAL;
 		}
+
+		if (op->code != INET_DIAG_BC_NOP) {
+			if (op->no < min_len || op->no > len + 4 || op->no & 3)
+				return -EINVAL;
+			if (op->no < len &&
+			    !valid_cc(bytecode, bytecode_len, len - op->no))
+				return -EINVAL;
+		}
+
 		if (op->yes < min_len || op->yes > len + 4 || op->yes & 3)
 			return -EINVAL;
 		bc  += op->yes;
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: Neal Cardwell @ 2012-12-09 21:03 UTC (permalink / raw)
  To: David Miller; +Cc: edumazet, netdev, Neal Cardwell

Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.

Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
[Patch was generated relative to the previous 3-patch inet_diag patch series
 from last night.]
 net/ipv4/inet_diag.c |   30 +++++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 95f1a45..67746f5 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -557,6 +557,16 @@ static bool valid_hostcond(const struct inet_diag_bc_op *op, int len,
 	return true;
 }
 
+/* Validate a port comparison operator. */
+static inline bool valid_port_comparison(const struct inet_diag_bc_op *op,
+					 int len, int *min_len)
+{
+	/* Port comparisons put the port in a follow-on inet_diag_bc_op. */
+	*min_len += sizeof(struct inet_diag_bc_op);
+	if (len < *min_len)
+		return false;
+}
+
 static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
 {
 	const void *bc = bytecode;
@@ -572,24 +582,30 @@ static int inet_diag_bc_audit(const void *bytecode, int bytecode_len)
 		case INET_DIAG_BC_D_COND:
 			if (!valid_hostcond(bc, len, &min_len))
 				return -EINVAL;
-			/* fall through */
-		case INET_DIAG_BC_AUTO:
+			break;
 		case INET_DIAG_BC_S_GE:
 		case INET_DIAG_BC_S_LE:
 		case INET_DIAG_BC_D_GE:
 		case INET_DIAG_BC_D_LE:
-		case INET_DIAG_BC_JMP:
-			if (op->no < min_len || op->no > len + 4 || op->no & 3)
-				return -EINVAL;
-			if (op->no < len &&
-			    !valid_cc(bytecode, bytecode_len, len - op->no))
+			if (!valid_port_comparison(bc, len, &min_len))
 				return -EINVAL;
 			break;
+		case INET_DIAG_BC_AUTO:
+		case INET_DIAG_BC_JMP:
 		case INET_DIAG_BC_NOP:
 			break;
 		default:
 			return -EINVAL;
 		}
+
+		if (op->code != INET_DIAG_BC_NOP) {
+			if (op->no < min_len || op->no > len + 4 || op->no & 3)
+				return -EINVAL;
+			if (op->no < len &&
+			    !valid_cc(bytecode, bytecode_len, len - op->no))
+				return -EINVAL;
+		}
+
 		if (op->yes < min_len || op->yes > len + 4 || op->yes & 3)
 			return -EINVAL;
 		bc  += op->yes;
-- 
1.7.7.3

^ permalink raw reply related

* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-09 21:15 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <CADVnQy=PJezr79ib=JFh9drEYVa35PbgyAahP+TDKKuD3HC81Q@mail.gmail.com>

On Sun, Dec 9, 2012 at 1:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Sun, Dec 9, 2012 at 12:46 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Thanks a lot for working on a complete fix for these problems, I'll
>> review these patches soon.
>
> Thanks, David! I appreciate it.

I noticed another related validation issue, and submitted a separate
patch for that one, based on those previous three:

   http://patchwork.ozlabs.org/patch/204786/

Please let me know if you'd like be to regenerate them all as a single
4-patch series instead.

thanks,
neal

ps: please excuse the duplicate send of that last patch... the first
"git send-email" seemed unsuccessful, so I retried, but apparently the
first attempt actually succeeded...

^ permalink raw reply

* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: David Miller @ 2012-12-09 21:21 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <CADVnQyk0Hx-LVmo+fkMzHy9cC7MQx0ACALRXVinLhto-kpv32g@mail.gmail.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun, 9 Dec 2012 16:15:07 -0500

> On Sun, Dec 9, 2012 at 1:01 AM, Neal Cardwell <ncardwell@google.com> wrote:
>> On Sun, Dec 9, 2012 at 12:46 AM, David Miller <davem@davemloft.net> wrote:
>>>
>>> Thanks a lot for working on a complete fix for these problems, I'll
>>> review these patches soon.
>>
>> Thanks, David! I appreciate it.
> 
> I noticed another related validation issue, and submitted a separate
> patch for that one, based on those previous three:
> 
>    http://patchwork.ozlabs.org/patch/204786/
> 
> Please let me know if you'd like be to regenerate them all as a single
> 4-patch series instead.

What you did is fine, thanks Neal.

^ permalink raw reply

* Re: ipgre rss is broken since gro
From: Eric Dumazet @ 2012-12-09 23:27 UTC (permalink / raw)
  To: Dmitry Kravkov; +Cc: netdev@vger.kernel.org
In-Reply-To: <1355086161.3113.9.camel@lb-tlvb-dmitry.il.broadcom.com>

On Sun, Dec 9, 2012 at 12:49 PM, Dmitry Kravkov <dmitry@broadcom.com> wrote:

> for this item: drop_watch does not show any drops (i've disable all
> other interfaces for clear env)
> I will explain a little bit more the setup:
> bnx2x device (under testing) is configured for RSS for IPGRE packets.
> Sending multiple (3) TCP_STREAM causes ip_gre interface to disappear
> packets (even ICMP).
> This is not happening with single TCP_STREAM, or before gro_cell
> introduction.
>
> i was searching for the drops by print-out in ip_gre.c but disappeared
> packets completes this code:
>
> static int ipgre_rcv(struct sk_buff *skb)
> (snip)
>                 printk("%s:%d\n", __FUNCTION__, __LINE__);
>
>                 tstats = this_cpu_ptr(tunnel->dev->tstats);
>                 u64_stats_update_begin(&tstats->syncp);
>                 tstats->rx_packets++;
>                 tstats->rx_bytes += skb->len;
>                 u64_stats_update_end(&tstats->syncp);
>
>                 gro_cells_receive(&tunnel->gro_cells, skb);
>                 return 0;

I dont know, I tried a bnx2x setup, and 100 tcp flows, no special problem.

If you receive a lot of packets on a single RX queue, they might be
dropped because cpu cant cope with the load
(This has nothing to do with GRE or GRO )

cat /proc/net/softnet_stat

^ permalink raw reply

* Re: [PATCHv6] virtio-spec: virtio network device multiqueue support
From: Rusty Russell @ 2012-12-09 23:48 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang; +Cc: bhutchings, netdev, kvm, virtualization
In-Reply-To: <20121207141856.GA17901@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> writes:
> Add multiqueue support to virtio network device.
> Add a new feature flag VIRTIO_NET_F_MQ for this feature, a new
> configuration field max_virtqueue_pairs to detect supported number of
> virtqueues as well as a new command VIRTIO_NET_CTRL_MQ to program
> packet steering for unidirectional protocols.

One trivial change: alter "8000h" to "0x8000" for consistency in the
text.

Could I have a Signed-off-by so I can apply it please?

Thanks,
Rusty.

^ permalink raw reply

* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: David Miller @ 2012-12-10  0:01 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355031803-14547-1-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 00:43:21 -0500

> Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
> instantiated for IPv4 traffic and in the SYN-RECV state were actually
> created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
> means that for such connections inet6_rsk(req) returns a pointer to a
> random spot in memory up to roughly 64KB beyond the end of the
> request_sock.
> 
> With this bug, for a server using AF_INET6 TCP sockets and serving
> IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
> inet_diag_fill_req() causing an oops or the export to user space of 16
> bytes of kernel memory as a garbage IPv6 address, depending on where
> the garbage inet6_rsk(req) pointed.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 2/3] inet_diag: validate byte code to prevent oops in inet_diag_bc_run()
From: David Miller @ 2012-12-10  0:01 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355031803-14547-2-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 00:43:22 -0500

> Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
> operations.
> 
> Previously we did not validate the inet_diag_hostcond, address family,
> address length, and prefix length. So a malicious user could make the
> kernel read beyond the end of the bytecode array by claiming to have a
> whole inet_diag_hostcond when the bytecode was not long enough to
> contain a whole inet_diag_hostcond of the given address family. Or
> they could make the kernel read up to about 27 bytes beyond the end of
> a connection address by passing a prefix length that exceeded the
> length of addresses of the given family.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net 3/3] inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()
From: David Miller @ 2012-12-10  0:01 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355031803-14547-3-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 00:43:23 -0500

> Add logic to check the address family of the user-supplied conditional
> and the address family of the connection entry. We now do not do
> prefix matching of addresses from different address families (AF_INET
> vs AF_INET6), except for the previously existing support for having an
> IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
> maintains as-is).
> 
> This change is needed for two reasons:
> 
> (1) The addresses are different lengths, so comparing a 128-bit IPv6
> prefix match condition to a 32-bit IPv4 connection address can cause
> us to unwittingly walk off the end of the IPv4 address and read
> garbage or oops.
> 
> (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
> simple bit-wise comparison of the prefixes is not meaningful, and
> would lead to bogus results (except for the IPv4-mapped IPv6 case,
> which this commit maintains).
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: David Miller @ 2012-12-10  0:02 UTC (permalink / raw)
  To: ncardwell; +Cc: edumazet, netdev
In-Reply-To: <1355086980-30438-1-git-send-email-ncardwell@google.com>

From: Neal Cardwell <ncardwell@google.com>
Date: Sun,  9 Dec 2012 16:03:00 -0500

> Add logic to verify that a port comparison byte code operation
> actually has the second inet_diag_bc_op from which we read the port
> for such operations.
> 
> Previously the code blindly referenced op[1] without first checking
> whether a second inet_diag_bc_op struct could fit there. So a
> malicious user could make the kernel read 4 bytes beyond the end of
> the bytecode array by claiming to have a whole port comparison byte
> code (2 inet_diag_bc_op structs) when in fact the bytecode was not
> long enough to hold both.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Haste makes waste...

> +/* Validate a port comparison operator. */
> +static inline bool valid_port_comparison(const struct inet_diag_bc_op *op,
> +					 int len, int *min_len)
> +{
> +	/* Port comparisons put the port in a follow-on inet_diag_bc_op. */
> +	*min_len += sizeof(struct inet_diag_bc_op);
> +	if (len < *min_len)
> +		return false;
> +}
> +

I added the missing "return true" at the end of this new function.

Applied, but please be more careful and at least look at the compiler
warnings when you submit your changes.

Thanks.

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the pci tree
From: Stephen Rothwell @ 2012-12-10  1:08 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: linux-next, linux-kernel, Bjorn Helgaas, Emmanuel Grumbach,
	Johannes Berg

[-- Attachment #1: Type: text/plain, Size: 2120 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in
drivers/net/wireless/iwlwifi/pcie/trans.c between commit b9d146e30a2d
("iwlwifi: collapse wrapper for pcie_capability_read_word()") from the
pci tree and commit 7afe3705cd4e ("iwlwifi: continue clean up -
pcie/trans.c") from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/wireless/iwlwifi/pcie/trans.c
index 1dfa6be,f6c21e7..0000000
--- a/drivers/net/wireless/iwlwifi/pcie/trans.c
+++ b/drivers/net/wireless/iwlwifi/pcie/trans.c
@@@ -670,8 -94,10 +94,8 @@@ static void iwl_pcie_set_pwr_vmain(stru
  
  /* PCI registers */
  #define PCI_CFG_RETRY_TIMEOUT	0x041
 -#define PCI_CFG_LINK_CTRL_VAL_L0S_EN	0x01
 -#define PCI_CFG_LINK_CTRL_VAL_L1_EN	0x02
  
- static void iwl_apm_config(struct iwl_trans *trans)
+ static void iwl_pcie_apm_config(struct iwl_trans *trans)
  {
  	struct iwl_trans_pcie *trans_pcie = IWL_TRANS_GET_PCIE_TRANS(trans);
  	u16 lctl;
@@@ -684,20 -110,19 +108,17 @@@
  	 * If not (unlikely), enable L0S, so there is at least some
  	 *    power savings, even without L1.
  	 */
- 
  	pcie_capability_read_word(trans_pcie->pci_dev, PCI_EXP_LNKCTL, &lctl);
 -
 -	if ((lctl & PCI_CFG_LINK_CTRL_VAL_L1_EN) ==
 -				PCI_CFG_LINK_CTRL_VAL_L1_EN) {
 +	if (lctl & PCI_EXP_LNKCTL_ASPM_L1) {
  		/* L1-ASPM enabled; disable(!) L0S */
  		iwl_set_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
- 		dev_printk(KERN_INFO, trans->dev,
- 			   "L1 Enabled; Disabling L0S\n");
+ 		dev_info(trans->dev, "L1 Enabled; Disabling L0S\n");
  	} else {
  		/* L1-ASPM disabled; enable(!) L0S */
  		iwl_clear_bit(trans, CSR_GIO_REG, CSR_GIO_REG_VAL_L0S_ENABLED);
- 		dev_printk(KERN_INFO, trans->dev,
- 			   "L1 Disabled; Enabling L0S\n");
+ 		dev_info(trans->dev, "L1 Disabled; Enabling L0S\n");
  	}
 -	trans->pm_support = !(lctl & PCI_CFG_LINK_CTRL_VAL_L0S_EN);
 +	trans->pm_support = !(lctl & PCI_EXP_LNKCTL_ASPM_L0S);
  }
  
  /*

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [Suggestion] net/atm : for sprintf, need check the total write length whether larger than a page.
From: Chen Gang @ 2012-12-10  1:39 UTC (permalink / raw)
  To: chas williams - CONTRACTOR; +Cc: David Miller, netdev
In-Reply-To: <50C1415C.2060104@asianux.com>

Hello Chas Williams:

  Excuse me:
    I am a reporter (not a reviewer), and not suitable to review another's patch.

  so:
    I suggest you to send your patch (as your willing) to the reviewers, directly.
    and need not cc to me (I am not a reviewer).

  thanks.

gchen.


于 2012年12月07日 09:07, Chen Gang 写道:
> 于 2012年12月06日 22:08, chas williams - CONTRACTOR 写道:
>> On Thu, 06 Dec 2012 09:15:10 +0800
>> Chen Gang <gang.chen@asianux.com> wrote:
>>
>>> 于 2012年12月05日 22:55, chas williams - CONTRACTOR 写道:
>>
>>>> did you mean '\0' instead of '\n'?  scnprintf() considers the trailing
>>>> '\0' when formatting.
>>>
>>>   no, originally, the end is "\n\0".
>>>
>>>   I prefer we still compatible "\n" when the contents are very large.
>>>   if count already == (PAGE_SIZE - 1), we have no chance to append "\n" to the end.
>>>
>>> -		pos += sprintf(pos, "\n");
>>> +		count += scnprintf(buf + count, PAGE_SIZE - count, "\n");
>>
>> i would make the code a bit messy to do this for not much gain.  again,
>> it isnt likely you would run into this in a normal situation.
> 
>   surely.
> 
>   thanks.
> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> 
> 


-- 
Chen Gang

Asianux Corporation

^ permalink raw reply

* linux-next: manual merge of the virtio tree with the net-next tree
From: Stephen Rothwell @ 2012-12-10  2:28 UTC (permalink / raw)
  To: Rusty Russell
  Cc: linux-next, linux-kernel, Jason Wang, David Miller, netdev,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 6360 bytes --]

Hi Rusty,

Today's linux-next merge of the virtio tree got a conflict in
drivers/net/virtio_net.c between commit e9d7417b97f4 ("virtio-net:
separate fields of sending/receiving queue from virtnet_info") and
986a4f4d452d ("virtio_net: multiqueue support") from the net-next tree
and commit a89f05573fa2 ("virtio-net: remove unused skb_vnet_hdr->num_sg
field"), 2c6d439a7316 ("virtio-net: correct capacity math on ring full"),
e794093a52cd ("virtio_net: don't rely on virtqueue_add_buf() returning
capacity") and 7dc5f95d9b6c ("virtio: net: make it clear that
virtqueue_add_buf() no longer returns > 0") from the virtio tree.

I fixed it up (I think - see below) and can carry the fix as necessary
(no action is required).

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

diff --cc drivers/net/virtio_net.c
index a644eeb,6289891..0000000
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@@ -523,20 -464,21 +522,21 @@@ static bool try_fill_recv(struct receiv
  
  	do {
  		if (vi->mergeable_rx_bufs)
 -			err = add_recvbuf_mergeable(vi, gfp);
 +			err = add_recvbuf_mergeable(rq, gfp);
  		else if (vi->big_packets)
 -			err = add_recvbuf_big(vi, gfp);
 +			err = add_recvbuf_big(rq, gfp);
  		else
 -			err = add_recvbuf_small(vi, gfp);
 +			err = add_recvbuf_small(rq, gfp);
  
  		oom = err == -ENOMEM;
- 		if (err < 0)
+ 		if (err)
  			break;
 -		++vi->num;
 -	} while (vi->rvq->num_free);
 +		++rq->num;
- 	} while (err > 0);
++	} while (rq->vq->num_free);
+ 
 -	if (unlikely(vi->num > vi->max))
 -		vi->max = vi->num;
 -	virtqueue_kick(vi->rvq);
 +	if (unlikely(rq->num > rq->max))
 +		rq->max = rq->num;
 +	virtqueue_kick(rq->vq);
  	return !oom;
  }
  
@@@ -625,29 -557,13 +625,29 @@@ again
  	return received;
  }
  
 -static void free_old_xmit_skbs(struct virtnet_info *vi)
 +static int virtnet_open(struct net_device *dev)
 +{
 +	struct virtnet_info *vi = netdev_priv(dev);
 +	int i;
 +
 +	for (i = 0; i < vi->max_queue_pairs; i++) {
 +		/* Make sure we have some buffers: if oom use wq. */
 +		if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 +			schedule_delayed_work(&vi->refill, 0);
 +		virtnet_napi_enable(&vi->rq[i]);
 +	}
 +
 +	return 0;
 +}
 +
- static unsigned int free_old_xmit_skbs(struct send_queue *sq)
++static void free_old_xmit_skbs(struct send_queue *sq)
  {
  	struct sk_buff *skb;
- 	unsigned int len, tot_sgs = 0;
+ 	unsigned int len;
 +	struct virtnet_info *vi = sq->vq->vdev->priv;
  	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
  
 -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 +	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
  		pr_debug("Sent skb %p\n", skb);
  
  		u64_stats_update_begin(&stats->tx_syncp);
@@@ -655,17 -571,15 +655,16 @@@
  		stats->tx_packets++;
  		u64_stats_update_end(&stats->tx_syncp);
  
- 		tot_sgs += skb_vnet_hdr(skb)->num_sg;
  		dev_kfree_skb_any(skb);
  	}
- 	return tot_sgs;
  }
  
 -static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 +static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
  {
  	struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+ 	unsigned num_sg;
 +	struct virtnet_info *vi = sq->vq->vdev->priv;
  
  	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
  
@@@ -700,42 -614,32 +699,35 @@@
  
  	/* Encode metadata header at front. */
  	if (vi->mergeable_rx_bufs)
 -		sg_set_buf(vi->tx_sg, &hdr->mhdr, sizeof hdr->mhdr);
 +		sg_set_buf(sq->sg, &hdr->mhdr, sizeof hdr->mhdr);
  	else
 -		sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
 +		sg_set_buf(sq->sg, &hdr->hdr, sizeof hdr->hdr);
  
- 	hdr->num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
- 	return virtqueue_add_buf(sq->vq, sq->sg, hdr->num_sg,
 -	num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
 -	return virtqueue_add_buf(vi->svq, vi->tx_sg, num_sg,
++	num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
++	return virtqueue_add_buf(sq->vq, sq->sg, num_sg,
  				 0, skb, GFP_ATOMIC);
  }
  
  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
  {
  	struct virtnet_info *vi = netdev_priv(dev);
 +	int qnum = skb_get_queue_mapping(skb);
 +	struct send_queue *sq = &vi->sq[qnum];
- 	int capacity;
+ 	int err;
  
  	/* Free up any pending old buffers before queueing new ones. */
 -	free_old_xmit_skbs(vi);
 +	free_old_xmit_skbs(sq);
  
  	/* Try to transmit */
- 	capacity = xmit_skb(sq, skb);
- 
- 	/* This can happen with OOM and indirect buffers. */
- 	if (unlikely(capacity < 0)) {
- 		if (likely(capacity == -ENOMEM)) {
- 			if (net_ratelimit())
- 				dev_warn(&dev->dev,
- 					 "TXQ (%d) failure: out of memory\n",
- 					 qnum);
- 		} else {
- 			dev->stats.tx_fifo_errors++;
- 			if (net_ratelimit())
- 				dev_warn(&dev->dev,
- 					 "Unexpected TXQ (%d) failure: %d\n",
- 					 qnum, capacity);
- 		}
 -	err = xmit_skb(vi, skb);
++	err = xmit_skb(sq, skb);
+ 
+ 	/* This should not happen! */
+ 	if (unlikely(err)) {
+ 		dev->stats.tx_fifo_errors++;
+ 		if (net_ratelimit())
+ 			dev_warn(&dev->dev,
 -				 "Unexpected TX queue failure: %d\n", err);
++				 "Unexpected TXQ (%d) failure: %d\n",
++				 qnum, err);
  		dev->stats.tx_dropped++;
  		kfree_skb(skb);
  		return NETDEV_TX_OK;
@@@ -748,14 -652,14 +740,13 @@@
  
  	/* Apparently nice girls don't return TX_BUSY; stop the queue
  	 * before it gets out of hand.  Naturally, this wastes entries. */
- 	if (capacity < 2+MAX_SKB_FRAGS) {
 -	if (vi->svq->num_free < 2+MAX_SKB_FRAGS) {
 -		netif_stop_queue(dev);
 -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
++	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 +		netif_stop_subqueue(dev, qnum);
 +		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
  			/* More just got used, free them then recheck. */
- 			capacity += free_old_xmit_skbs(sq);
- 			if (capacity >= 2+MAX_SKB_FRAGS) {
 -			free_old_xmit_skbs(vi);
 -			if (vi->svq->num_free >= 2+MAX_SKB_FRAGS) {
 -				netif_start_queue(dev);
 -				virtqueue_disable_cb(vi->svq);
++			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 +				netif_start_subqueue(dev, qnum);
 +				virtqueue_disable_cb(sq->vq);
  			}
  		}
  	}

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* linux-next: build failure after merge of the virtio tree
From: Stephen Rothwell @ 2012-12-10  2:37 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-next, linux-kernel, Jason Wang, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]

Hi Rusty,

After merging the virtio tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/net/virtio_net.c: In function 'vq2txq':
drivers/net/virtio_net.c:150:2: error: implicit declaration of function 'virtqueue_get_queue_index' [-Werror=implicit-function-declaration]

Caused by commit 986a4f4d452d ("virtio_net: multiqueue support") from the
net-next tree interacting with commit 105e892960e1 ("virtio: move
queue_index and num_free fields into core struct virtqueue") from the
virtio tree.

I applied the patch below and can carry it as necessary.

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 10 Dec 2012 13:33:57 +1100
Subject: [PATCH] virtnet: for up for virtqueue_get_queue_index removal

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 drivers/net/virtio_net.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 33d6f6f..8afe32d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -147,7 +147,7 @@ struct padded_vnet_hdr {
  */
 static int vq2txq(struct virtqueue *vq)
 {
-	return (virtqueue_get_queue_index(vq) - 1) / 2;
+	return (vq->index - 1) / 2;
 }
 
 static int txq2vq(int txq)
@@ -157,7 +157,7 @@ static int txq2vq(int txq)
 
 static int vq2rxq(struct virtqueue *vq)
 {
-	return virtqueue_get_queue_index(vq) / 2;
+	return vq->index / 2;
 }
 
 static int rxq2vq(int rxq)
-- 
1.7.10.280.gaa39

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related

* Re: [PATCH v4] iproute2: add mdb sub-command to bridge
From: Cong Wang @ 2012-12-10  3:13 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, bridge, Herbert Xu, Jesper Dangaard Brouer, Thomas Graf,
	David S. Miller
In-Reply-To: <1a884874-7123-401f-a53c-cc8e301b1ffe@tahiti.vyatta.com>

On Fri, 2012-12-07 at 09:07 -0800, Stephen Hemminger wrote:
> 
> > 
> > 	# ./bridge/bridge mdb
> > 	bridge br0:
> > 	port eth0, group 224.8.8.9
> > 	port eth1, group 224.8.8.8
> 
> I like the ability to see what is going on. It would
> be good if there was also a monitor like hook to see new entries
> created and destroyed (later version).

I am working on a patch to implement RTM_NEWMDB and RTM_DELMDB, so we
will have this.

> 
> The output format should be one line per entry for easier
> parsing by utilities. Something similar to 'ip neigh show'
> 

Right, I will update this in v5.

Thanks!

^ permalink raw reply

* Re: [PATCH net 1/3] inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV state
From: Neal Cardwell @ 2012-12-10  3:40 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121209.190129.2104270281417362831.davem@davemloft.net>

On Sun, Dec 9, 2012 at 7:01 PM, David Miller <davem@davemloft.net> wrote:
> From: Neal Cardwell <ncardwell@google.com>
> Date: Sun,  9 Dec 2012 00:43:21 -0500
>
>> Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
>> instantiated for IPv4 traffic and in the SYN-RECV state were actually
>> created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
>> means that for such connections inet6_rsk(req) returns a pointer to a
>> random spot in memory up to roughly 64KB beyond the end of the
>> request_sock.
>>
>> With this bug, for a server using AF_INET6 TCP sockets and serving
>> IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
>> inet_diag_fill_req() causing an oops or the export to user space of 16
>> bytes of kernel memory as a garbage IPv6 address, depending on where
>> the garbage inet6_rsk(req) pointed.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>
> Applied.

Thanks, David.

neal

^ permalink raw reply

* [PATCH v5] iproute2: add mdb sub-command to bridge
From: Cong Wang @ 2012-12-10  3:47 UTC (permalink / raw)
  To: netdev
  Cc: bridge, Cong Wang, Herbert Xu, Stephen Hemminger, David S. Miller,
	Thomas Graf, Jesper Dangaard Brouer

From: Cong Wang <amwang@redhat.com>

V5: make the output pretty

V4: fix filter_dev
    remove some useless #include

V3: improve the output, display router info only for -d
    fix router parsing code

V2: sync with the kernel patch
    handle IPv6 addr
    a few cleanup

Sample output:

	# ./bridge/bridge mdb show dev br0
	bridge br0 port eth1 group 224.8.8.9
	bridge br0 port eth0 group 224.8.8.8

	# ./bridge/bridge -d mdb show dev br0
	bridge br0 port eth1 group 224.8.8.9
	bridge br0 port eth0 group 224.8.8.8
	router ports on br0: eth0 

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
---
 bridge/Makefile    |    2 +-
 bridge/br_common.h |    3 +-
 bridge/bridge.c    |    1 +
 bridge/mdb.c       |  172 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 176 insertions(+), 2 deletions(-)

diff --git a/bridge/Makefile b/bridge/Makefile
index 9a6743e..67aceb4 100644
--- a/bridge/Makefile
+++ b/bridge/Makefile
@@ -1,4 +1,4 @@
-BROBJ = bridge.o fdb.o monitor.o link.o
+BROBJ = bridge.o fdb.o monitor.o link.o mdb.o
 
 include ../Config
 
diff --git a/bridge/br_common.h b/bridge/br_common.h
index ce11a0b..71f7caf 100644
--- a/bridge/br_common.h
+++ b/bridge/br_common.h
@@ -5,11 +5,12 @@ extern int print_fdb(const struct sockaddr_nl *who,
 		     struct nlmsghdr *n, void *arg);
 
 extern int do_fdb(int argc, char **argv);
+extern int do_mdb(int argc, char **argv);
 extern int do_monitor(int argc, char **argv);
 
 extern int preferred_family;
 extern int show_stats;
-extern int show_detail;
+extern int show_details;
 extern int timestamp;
 extern struct rtnl_handle rth;
 
diff --git a/bridge/bridge.c b/bridge/bridge.c
index e2c33b0..1fcd365 100644
--- a/bridge/bridge.c
+++ b/bridge/bridge.c
@@ -43,6 +43,7 @@ static const struct cmd {
 	int (*func)(int argc, char **argv);
 } cmds[] = {
 	{ "fdb", 	do_fdb },
+	{ "mdb", 	do_mdb },
 	{ "monitor",	do_monitor },
 	{ "help",	do_help },
 	{ 0 }
diff --git a/bridge/mdb.c b/bridge/mdb.c
new file mode 100644
index 0000000..390d7f6
--- /dev/null
+++ b/bridge/mdb.c
@@ -0,0 +1,172 @@
+/*
+ * Get mdb table with netlink
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <linux/if_bridge.h>
+#include <linux/if_ether.h>
+#include <string.h>
+#include <arpa/inet.h>
+
+#include "libnetlink.h"
+#include "br_common.h"
+#include "rt_names.h"
+#include "utils.h"
+
+#ifndef MDBA_RTA
+#define MDBA_RTA(r) \
+	((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct br_port_msg))))
+#endif
+
+int filter_index;
+
+static void usage(void)
+{
+	fprintf(stderr, "       bridge mdb {show} [ dev DEV ]\n");
+	exit(-1);
+}
+
+static void br_print_router_ports(FILE *f, struct rtattr *attr)
+{
+	uint32_t *port_ifindex;
+	struct rtattr *i;
+	int rem;
+
+	rem = RTA_PAYLOAD(attr);
+	for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+		port_ifindex = RTA_DATA(i);
+		fprintf(f, "%s ", ll_index_to_name(*port_ifindex));
+	}
+
+	fprintf(f, "\n");
+}
+
+static void print_mdb_entry(FILE *f, int ifindex, struct br_mdb_entry *e)
+{
+	SPRINT_BUF(abuf);
+
+	if (e->addr.proto == htons(ETH_P_IP))
+		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+			ll_index_to_name(e->ifindex),
+			inet_ntop(AF_INET, &e->addr.u.ip4, abuf, sizeof(abuf)));
+	else
+		fprintf(f, "bridge %s port %s group %s\n", ll_index_to_name(ifindex),
+			ll_index_to_name(e->ifindex),
+			inet_ntop(AF_INET6, &e->addr.u.ip6, abuf, sizeof(abuf)));
+}
+
+static void br_print_mdb_entry(FILE *f, int ifindex, struct rtattr *attr)
+{
+	struct rtattr *i;
+	int rem;
+	struct br_mdb_entry *e;
+
+	rem = RTA_PAYLOAD(attr);
+	for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+		e = RTA_DATA(i);
+		print_mdb_entry(f, ifindex, e);
+	}
+}
+
+int print_mdb(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
+{
+	FILE *fp = arg;
+	struct br_port_msg *r = NLMSG_DATA(n);
+	int len = n->nlmsg_len;
+	struct rtattr * tb[MDBA_MAX+1];
+
+	if (n->nlmsg_type != RTM_GETMDB) {
+		fprintf(stderr, "Not RTM_GETMDB: %08x %08x %08x\n",
+			n->nlmsg_len, n->nlmsg_type, n->nlmsg_flags);
+
+		return 0;
+	}
+
+	len -= NLMSG_LENGTH(sizeof(*r));
+	if (len < 0) {
+		fprintf(stderr, "BUG: wrong nlmsg len %d\n", len);
+		return -1;
+	}
+
+	if (filter_index && filter_index != r->ifindex)
+		return 0;
+
+	parse_rtattr(tb, MDBA_MAX, MDBA_RTA(r), n->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
+
+	if (tb[MDBA_MDB]) {
+		struct rtattr *i;
+		int rem = RTA_PAYLOAD(tb[MDBA_MDB]);
+
+		for (i = RTA_DATA(tb[MDBA_MDB]); RTA_OK(i, rem); i = RTA_NEXT(i, rem))
+			br_print_mdb_entry(fp, r->ifindex, i);
+	}
+
+	if (tb[MDBA_ROUTER]) {
+		if (show_details) {
+			fprintf(fp, "router ports on %s: ", ll_index_to_name(r->ifindex));
+			br_print_router_ports(fp, tb[MDBA_ROUTER]);
+		}
+	}
+
+	return 0;
+}
+
+static int mdb_show(int argc, char **argv)
+{
+	char *filter_dev = NULL;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			if (filter_dev)
+				duparg("dev", *argv);
+			filter_dev = *argv;
+		}
+		argc--; argv++;
+	}
+
+	if (filter_dev) {
+		filter_index = if_nametoindex(filter_dev);
+		if (filter_index == 0) {
+			fprintf(stderr, "Cannot find device \"%s\"\n",
+				filter_dev);
+			return -1;
+		}
+	}
+
+	if (rtnl_wilddump_request(&rth, PF_BRIDGE, RTM_GETMDB) < 0) {
+		perror("Cannot send dump request");
+		exit(1);
+	}
+
+	if (rtnl_dump_filter(&rth, print_mdb, stdout) < 0) {
+		fprintf(stderr, "Dump terminated\n");
+		exit(1);
+	}
+
+	return 0;
+}
+
+int do_mdb(int argc, char **argv)
+{
+	ll_init_map(&rth);
+
+	if (argc > 0) {
+		if (matches(*argv, "show") == 0 ||
+		    matches(*argv, "lst") == 0 ||
+		    matches(*argv, "list") == 0)
+			return mdb_show(argc-1, argv+1);
+		if (matches(*argv, "help") == 0)
+			usage();
+	} else
+		return mdb_show(0, NULL);
+
+	fprintf(stderr, "Command \"%s\" is unknown, try \"bridge mdb help\".\n", *argv);
+	exit(-1);
+}

^ permalink raw reply related

* Re: [PATCH net] inet_diag: validate port comparison byte code to prevent unsafe reads
From: Neal Cardwell @ 2012-12-10  5:17 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Netdev
In-Reply-To: <20121209.190223.1179297442063515404.davem@davemloft.net>

On Sun, Dec 9, 2012 at 7:02 PM, David Miller <davem@davemloft.net> wrote:
> I added the missing "return true" at the end of this new function.
>
> Applied, but please be more careful and at least look at the compiler
> warnings when you submit your changes.
>
> Thanks.

Roger that. Oof. Can't believe I missed that.

thanks,
neal

^ permalink raw reply

* Re: [PATCHv6] virtio-spec: virtio network device multiqueue support
From: Jason Wang @ 2012-12-10  5:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: bhutchings, netdev, kvm, virtualization
In-Reply-To: <20121207141856.GA17901@redhat.com>

On Friday, December 07, 2012 04:18:56 PM Michael S. Tsirkin wrote:
> Add multiqueue support to virtio network device.
> Add a new feature flag VIRTIO_NET_F_MQ for this feature, a new
> configuration field max_virtqueue_pairs to detect supported number of
> virtqueues as well as a new command VIRTIO_NET_CTRL_MQ to program
> packet steering for unidirectional protocols.
> 
> ---
> 
> Changes in v6:
> - rename RFS -> multiqueue to avoid confusion with RFS in linux
>   mention automatic receive steering as Rusty suggested
> 
> Changes in v5:
> - Address Rusty's comments.
>   Changes are only in the text, not the ideas.
> - Some minor formatting changes.
> 
> Changes in v4:
> - address Jason's comments
> - have configuration specify the number of VQ pairs and not pairs - 1
> 
> Changes in v3:
> - rename multiqueue -> rfs this is what we support
> - Be more explicit about what driver should do.
> - Simplify layout making VQs functionality depend on feature.
> - Remove unused commands, only leave in programming # of queues
> 
> Changes in v2:
> Address Jason's comments on v2:
> - Changed STEERING_HOST to STEERING_RX_FOLLOWS_TX:
>    this is both clearer and easier to support.
>    It does not look like we need a separate steering command
>    since host can just watch tx packets as they go.
> - Moved RX and TX steering sections near each other.
> - Add motivation for other changes in v2
> 
> Changes in v1 (from Jason's rfc):
> - reserved vq 3: this makes all rx vqs even and tx vqs odd, which
>    looks nicer to me.
> - documented packet steering, added a generalized steering programming
>    command. Current modes are single queue and host driven multiqueue,
>    but I envision support for guest driven multiqueue in the future.
> - make default vqs unused when in mq mode - this wastes some memory
>    but makes it more efficient to switch between modes as
>    we can avoid this causing packet reordering.
> 
> Rusty, could you please take a look and comment soon?
> If this looks OK to everyone, we can proceed with finalizing the
> implementation. Would be nice to try and put it in 3.8.
> 
> diff --git a/virtio-spec.lyx b/virtio-spec.lyx
> index 83f2771..c5b32c4 100644
> --- a/virtio-spec.lyx
> +++ b/virtio-spec.lyx
> @@ -59,6 +59,7 @@
>  \author -608949062 "Rusty Russell,,,"
>  \author -385801441 "Cornelia Huck" cornelia.huck@de.ibm.com
>  \author 1531152142 "Paolo Bonzini,,,"
> +\author 1986246365 "Michael S. Tsirkin"
>  \end_header
[...]
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594263
> +
> +#define VIRTIO_NET_CTRL_MQ    1

Should be 4 here.
> +\end_layout
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594273
> +
> + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET        0
> +\end_layout
> +
> +\begin_layout Plain Layout
> +
> +\change_inserted 1986246365 1353594273
> +
> + #define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN        1
> +\end_layout
> +
[...]

^ permalink raw reply

* Re: [PATCH net-next 03/10] tipc: sk_recv_queue size check only for connectionless sockets
From: Ying Xue @ 2012-12-10  6:27 UTC (permalink / raw)
  To: Neil Horman; +Cc: Jon Maloy, Paul Gortmaker, David Miller, netdev, Ying Xue
In-Reply-To: <20121209165020.GA4362@neilslaptop.think-freely.org>

Neil Horman wrote:
> On Fri, Dec 07, 2012 at 05:30:11PM -0500, Jon Maloy wrote:
>   
>> On 12/07/2012 02:20 PM, Neil Horman wrote:
>>     
>>> On Fri, Dec 07, 2012 at 09:28:11AM -0500, Paul Gortmaker wrote:
>>>       
>>>> From: Ying Xue <ying.xue@windriver.com>
>>>>
>>>> The sk_receive_queue limit control is currently performed for
>>>> all arriving messages, disregarding socket and message type.
>>>> But for connected sockets this check is redundant, since the protocol
>>>> flow control already makes queue overflow impossible.
>>>>
>>>>         
>>> Can you explain where that occurs?  
>>>       
>> It happens in the functions port_dispatcher_sigh() and tipc_send(), 
>> among other places. Both are to be found in the file port.c, which 
>> was supposed to contain the 'generic' (i.e., API independent) part 
>> of the send/receive code.
>> Now that we have only one API left, the socket API, we are 
>> planning to merge the code in socket.c and port.c, and get rid of 
>> some code overhead.
>>
>> The flow control in TIPC is message based, where the sender
>> requires to receive an explicit acknowledge message for each 
>> 512 message the receiver reads to user space.
>> If the sender has more than 1024 messages outstanding without having
>> received an acknowledge he will be suspended or receive EAGAIN until 
>> he does.
>> The plan going forward is to replace this mechanism with a more 
>> standard looking byte based flow control, while maintaining 
>> backwards compatibility.
>>
>>     
> Ok, That makes more sense, thank you.  Although I still don't think this is
> safe (but the problem may not be solely introduced by this patch).  Using a
> global limit that assumes the sender will block when the congestion window is
> reached just doesn't seem sane to me.  It clearly works with the Linux
> implementation, as it conforms to your expectations, but an alternate
> implementation could create a DOS situation by simply ignoring the window limit,
> and continuing to send.  I see that we drop frames over the global limit in
> filter_rcv, but the check in rx_queue_full bumps up that limit based on the
> value of msg_importance(msg), but that threshold is ignored if the value of
> msg_importance is invalid.  All a sender needs to do is flood a receiver with
> frames containing an invalid set of message importance bits, and you will queue
> frames indefinately.  In fact that will also happen if you send message of
> CRITICAL importance as well, so you don't even need to supply an invalid value
> here.
>
>   

You are absolutely right. I will correct these drawbacks in next version.

>>> I see where the tipc dispatch function calls
>>> sk_add_backlog, which checks the per socket recieve queue (regardless of weather
>>> the receiving socket is connection oriented or connectionless), but if the
>>> receiver doesn't call receive very often, This just adds a check against your
>>> global limit, doing nothing for your per-socket limits. 
>>>       
>> OVERLOAD_LIMIT_BASE is tested against a per-socket message counter, so it _is_
>> our per-socket limit. In fact, TIPC connectionless overflow control currently 
>> is a kind of a hybrid, based on a message counter when the socket is not locked, 
>> and based on sk_rcv_queue's byte limit when a message has to be added to the 
>> backlog.
>> We are planning to fix this inconsistency too.
>>     
> Good, thank you,  that was seeming quite wrong to me.
>
>   
>>  In fact it seems to
>>     
>>> repeat the same check twice, as in the worst case of the incomming message being
>>> TIPC_LOW_IMPORTANCE, its just going to check that the global limit is exactly
>>> OVERLOAD_LIMIT_BASE/2 again.
>>>       
>> Yes, you are right. The intention is that only the first test, 
>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2)){..}
>> will be run for the vast majority of messages, since we must assume
>> that there is no overload most of the time.
>> An inelegant optimization, perhaps, but not logically wrong.
>>     
> No, not logically wrong, but not an optimization either.  With this change,
> your only use of rx_queue_full passes OVERLOAD_LIMIT_BASE/2 as the base value to
> rx_queue_full, and then you do some multiplication based on that.  If you really
> want to optimize this, leave OVERLOAD_LIMIT_BASE where it is (rather than
> doubling it like this patch series does), mark rx_queue_full as inline, and just
> pass OVERLOAD_LIMIT_BASE as the argument, it will save you a division opration,
> the conditional branch and a call instruction.  If you add a multiplication
> factor table, you can eliminate the if/else clauses in rx_queue_full as well.
>
>   

Good suggestion with a factor table. Maybe it's unnecessary to 
explicitly mark rx_queue_full as inline. Currently it sounds like we let 
complier decide whether a function is defined as inline or not.

Regards,
Ying

> Neil
>
>   
>> ///jon
>>
>>     
>>> Neil
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>       
>>     
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>   

^ permalink raw reply

* RE: linux-next: manual merge of the net-next tree with the pci tree
From: Grumbach, Emmanuel @ 2012-12-10  6:44 UTC (permalink / raw)
  To: Stephen Rothwell, David Miller, netdev@vger.kernel.org
  Cc: linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Bjorn Helgaas, Berg, Johannes
In-Reply-To: <20121210120825.49096bafd378139e98becb8d@canb.auug.org.au>

> Today's linux-next merge of the net-next tree got a conflict in
> drivers/net/wireless/iwlwifi/pcie/trans.c between commit b9d146e30a2d
> ("iwlwifi: collapse wrapper for pcie_capability_read_word()") from the pci
> tree and commit 7afe3705cd4e ("iwlwifi: continue clean up -
> pcie/trans.c") from the net-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary (no action is
> required).
> 
Looks right - thanks Stephen!
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

^ permalink raw reply

* [PATCH] net: Allow DCBnl to use other namespaces besides init_net
From: John Fastabend @ 2012-12-10  6:48 UTC (permalink / raw)
  To: netdev; +Cc: ebiederm

Allow DCB and net namespace to work together. This is useful if you
have containers that are bound to 'phys' interfaces that want to
also manage their DCB attributes.

The net namespace is taken from sock_net(skb->sk) of the netlink skb.

CC: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/dcb/dcbnl.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index b07c75d..1b588e2 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1665,9 +1665,6 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	if ((nlh->nlmsg_type == RTM_SETDCB) && !capable(CAP_NET_ADMIN))
 		return -EPERM;
 
-	if (!net_eq(net, &init_net))
-		return -EINVAL;
-
 	ret = nlmsg_parse(nlh, sizeof(*dcb), tb, DCB_ATTR_MAX,
 			  dcbnl_rtnl_policy);
 	if (ret < 0)
@@ -1684,7 +1681,7 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 	if (!tb[DCB_ATTR_IFNAME])
 		return -EINVAL;
 
-	netdev = dev_get_by_name(&init_net, nla_data(tb[DCB_ATTR_IFNAME]));
+	netdev = dev_get_by_name(net, nla_data(tb[DCB_ATTR_IFNAME]));
 	if (!netdev)
 		return -ENODEV;
 
@@ -1708,7 +1705,7 @@ static int dcb_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 
 	nlmsg_end(reply_skb, reply_nlh);
 
-	ret = rtnl_unicast(reply_skb, &init_net, portid);
+	ret = rtnl_unicast(reply_skb, net, portid);
 out:
 	dev_put(netdev);
 	return ret;

^ permalink raw reply related

* Re: ixgbe:  pci_get_device() call without counterpart call of pci_dev_put()
From: Jeff Kirsher @ 2012-12-10  7:17 UTC (permalink / raw)
  To: Elena Gurevich; +Cc: netdev, e1000-devel, Greg Rose, Don Skidmore
In-Reply-To: <01b301cdd5f2$3a9b2360$afd16a20$@toganetworks.com>

[-- Attachment #1: Type: text/plain, Size: 2020 bytes --]

On 12/09/2012 01:47 AM, Elena Gurevich wrote:
> Hi all,
> I am pioneer in linux device drivers here and using Intel 82599 NIC as
> reference model, 
> During investigation to drivers sources I found the suspicious code:  
>
> Is code sequence (1)   and (2) the possible device reference count leakage
> ???
>
>
> Thanks a lot in advance
> Lena
>
> --------snipped from ixgbe_main.c  file function ixgbe_io_error_detected()
> -----------
>
> . . . 
> 		/* Find the pci device of the offending VF */
> 		vfdev = pci_get_device(PCI_VENDOR_ID_INTEL, device_id,
> NULL);
> 		while (vfdev) {
> 			if (vfdev->devfn == (req_id & 0xFF))
> 				break;
> <------------------------------      (1) leaves the loop with successful get
> call  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
>
> 			vfdev = pci_get_device(PCI_VENDOR_ID_INTEL,
> 					       device_id, vfdev);
> 		}
> 		/*
> 		 * There's a slim chance the VF could have been hot plugged,
> 		 * so if it is no longer present we don't need to issue the
> 		 * VFLR.  Just clean up the AER in that case.
> 		 */
> 		if (vfdev) {
> 			e_dev_err("Issuing VFLR to VF %d\n", vf);
> 			pci_write_config_dword(vfdev, 0xA8, 0x00008000);
> 		}
>
> 		pci_cleanup_aer_uncorrect_error_status(pdev);
> 	}
>
> 	/*
> 	 * Even though the error may have occurred on the other port
> 	 * we still need to increment the vf error reference count for
> 	 * both ports because the I/O resume function will be called
> 	 * for both of them.
> 	 */
> 	adapter->vferr_refcount++;
>
> 	return PCI_ERS_RESULT_RECOVERED;
> <--------------------------------------------  (2) leaves the function
> without put call !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> ----------------------------------  snipped
> -----------------------------------
>
Added Greg Rose (ixgbevf maintainer) & Don Skidmore (ixgbe maintainer)

Since the code you have questions about is dealing with the vf, Greg
will most likely be able to shed some light to your questions.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 899 bytes --]

^ permalink raw reply

* [PATCH 1/1] net: ethernet: davinci_cpdma: Add boundary for rx and tx descriptors
From: Mugunthan V N @ 2012-12-10  7:37 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-arm-kernel, linux-omap, s.hauer, Mugunthan V N

When there is heavy transmission traffic in the CPDMA, then Rx descriptors
memory is also utilized as tx desc memory this leads to reduced rx desc memory
which leads to poor performance.

This patch adds boundary for tx and rx descriptors in bd ram dividing the
descriptor memory to ensure that during heavy transmission tx doesn't use
rx descriptors.

This patch is already applied to davinci_emac driver, since CPSW and
davici_dmac uses the same CPDMA, moving the boundry seperation from
Davinci EMAC driver to CPDMA driver which was done in the following
commit

commit 86d8c07ff2448eb4e860e50f34ef6ee78e45c40c
Author: Sascha Hauer <s.hauer@pengutronix.de>
Date:   Tue Jan 3 05:27:47 2012 +0000

    net/davinci: do not use all descriptors for tx packets

    The driver uses a shared pool for both rx and tx descriptors.
    During open it queues fixed number of 128 descriptors for receive
    packets. For each received packet it tries to queue another
    descriptor. If this fails the descriptor is lost for rx.
    The driver has no limitation on tx descriptors to use, so it
    can happen during a nmap / ping -f attack that the driver
    allocates all descriptors for tx and looses all rx descriptors.
    The driver stops working then.
    To fix this limit the number of tx descriptors used to half of
    the descriptors available, the rx path uses the other half.

    Tested on a custom board using nmap / ping -f to the board from
    two different hosts.

Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
---
 drivers/net/ethernet/ti/davinci_cpdma.c |   20 ++++++++++++++------
 drivers/net/ethernet/ti/davinci_emac.c  |    8 --------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index 4995673..d37f546 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -105,13 +105,13 @@ struct cpdma_ctlr {
 };
 
 struct cpdma_chan {
+	struct cpdma_desc __iomem	*head, *tail;
+	void __iomem			*hdp, *cp, *rxfree;
 	enum cpdma_state		state;
 	struct cpdma_ctlr		*ctlr;
 	int				chan_num;
 	spinlock_t			lock;
-	struct cpdma_desc __iomem	*head, *tail;
 	int				count;
-	void __iomem			*hdp, *cp, *rxfree;
 	u32				mask;
 	cpdma_handler_fn		handler;
 	enum dma_data_direction		dir;
@@ -217,7 +217,7 @@ desc_from_phys(struct cpdma_desc_pool *pool, dma_addr_t dma)
 }
 
 static struct cpdma_desc __iomem *
-cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
+cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc, bool is_rx)
 {
 	unsigned long flags;
 	int index;
@@ -225,8 +225,14 @@ cpdma_desc_alloc(struct cpdma_desc_pool *pool, int num_desc)
 
 	spin_lock_irqsave(&pool->lock, flags);
 
-	index = bitmap_find_next_zero_area(pool->bitmap, pool->num_desc, 0,
-					   num_desc, 0);
+	if (is_rx) {
+		index = bitmap_find_next_zero_area(pool->bitmap,
+				pool->num_desc/2, 0, num_desc, 0);
+	 } else {
+		index = bitmap_find_next_zero_area(pool->bitmap,
+				pool->num_desc, pool->num_desc/2, num_desc, 0);
+	}
+
 	if (index < pool->num_desc) {
 		bitmap_set(pool->bitmap, index, num_desc);
 		desc = pool->iomap + pool->desc_size * index;
@@ -660,6 +666,7 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 	unsigned long			flags;
 	u32				mode;
 	int				ret = 0;
+	bool                            is_rx;
 
 	spin_lock_irqsave(&chan->lock, flags);
 
@@ -668,7 +675,8 @@ int cpdma_chan_submit(struct cpdma_chan *chan, void *token, void *data,
 		goto unlock_ret;
 	}
 
-	desc = cpdma_desc_alloc(ctlr->pool, 1);
+	is_rx = (chan->rxfree != 0);
+	desc = cpdma_desc_alloc(ctlr->pool, 1, is_rx);
 	if (!desc) {
 		chan->stats.desc_alloc_fail++;
 		ret = -ENOMEM;
diff --git a/drivers/net/ethernet/ti/davinci_emac.c b/drivers/net/ethernet/ti/davinci_emac.c
index fce89a0..f349273 100644
--- a/drivers/net/ethernet/ti/davinci_emac.c
+++ b/drivers/net/ethernet/ti/davinci_emac.c
@@ -120,7 +120,6 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
 #define EMAC_DEF_TX_CH			(0) /* Default 0th channel */
 #define EMAC_DEF_RX_CH			(0) /* Default 0th channel */
 #define EMAC_DEF_RX_NUM_DESC		(128)
-#define EMAC_DEF_TX_NUM_DESC		(128)
 #define EMAC_DEF_MAX_TX_CH		(1) /* Max TX channels configured */
 #define EMAC_DEF_MAX_RX_CH		(1) /* Max RX channels configured */
 #define EMAC_POLL_WEIGHT		(64) /* Default NAPI poll weight */
@@ -342,7 +341,6 @@ struct emac_priv {
 	u32 mac_hash2;
 	u32 multicast_hash_cnt[EMAC_NUM_MULTICAST_BITS];
 	u32 rx_addr_type;
-	atomic_t cur_tx;
 	const char *phy_id;
 #ifdef CONFIG_OF
 	struct device_node *phy_node;
@@ -1050,9 +1048,6 @@ static void emac_tx_handler(void *token, int len, int status)
 {
 	struct sk_buff		*skb = token;
 	struct net_device	*ndev = skb->dev;
-	struct emac_priv	*priv = netdev_priv(ndev);
-
-	atomic_dec(&priv->cur_tx);
 
 	if (unlikely(netif_queue_stopped(ndev)))
 		netif_start_queue(ndev);
@@ -1101,9 +1096,6 @@ static int emac_dev_xmit(struct sk_buff *skb, struct net_device *ndev)
 		goto fail_tx;
 	}
 
-	if (atomic_inc_return(&priv->cur_tx) >= EMAC_DEF_TX_NUM_DESC)
-		netif_stop_queue(ndev);
-
 	return NETDEV_TX_OK;
 
 fail_tx:
-- 
1.7.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox