Netdev List
* Re: pktgen: tricks
From: Eric Dumazet @ 2009-09-24 10:32 UTC (permalink / raw)
  To: Denys Fedoryschenko
  Cc: Stephen Hemminger, Jesper Dangaard Brouer, Robert Olsson, netdev
In-Reply-To: <200909241310.05297.denys@visp.net.lb>

Denys Fedoryschenko wrote:
> On Thursday 24 September 2009 03:41:41 Stephen Hemminger wrote:
>> Other kernel config help:
>>   - turn off lock dependency checker, kmemcheck, page alloc debug
>>     basically anything that slows stuff down
>>   - turn off control group scheduler
> Maybe, but I'm not sure (I can't test it):
> Disable randomize VA space? On embedded boards it was helping.
> In some cases disabling SMP helped, when various SMP locks were involved, but maybe
> not for pktgen.
> 
>

pktgen is a kernel module, and is not affected by randomize VA space.

But of course, disabling SMP must help, as long as your machine needs
one cpu only :)



* Re: r8169 chips on some Intel D945GSEJT boards fail to work after PXE boot
From: Simon Farnsworth @ 2009-09-24 11:12 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev
In-Reply-To: <20090923205723.GA28058@electric-eye.fr.zoreil.com>

Francois Romieu wrote:
> Simon Farnsworth <simon.farnsworth@onelan.com> :
> [...]
>> Some boards are good, and just work, whether I boot via PXE or boot from
>> the local disk; dmesg.working and lspci.working are from a good board.
>>
>> Some boards are bad; they work fine if I boot from local disk (including
>> network), but the kernel cannot detect link, or send or receive data if
>> I PXE boot. dmesg.broken and lspci.broken are from a bad board.
> 
> No cunning theory in sight, but does reducing the amount of memory on a
> bad board from 1 GB to 512 MB turn it into a good one?
> 
We've tried this, and we've tried 2GB and 1GB modules; the failure to
boot sticks with the board, not with the memory module. On my most
recent attempt, the failing board isn't showing a correctable error
status, so I've not yet tried your patch, on the assumption that it just
clears the error status.

Is my assumption wrong? If not, is there anything else I can do that
would help you diagnose this?

> The failing board exhibits a correctable error status bit. Clearing it
> is the least we can do.
> 
> diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
> index 50c6a3c..79bc4ab 100644
> --- a/drivers/net/r8169.c
> +++ b/drivers/net/r8169.c
> @@ -2200,6 +2200,11 @@ rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  	tp->pcie_cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
>  	if (!tp->pcie_cap && netif_msg_probe(tp))
>  		dev_info(&pdev->dev, "no PCI Express capability\n");
> +	else {
> +		pci_write_config_word(pdev, tp->pcie_cap + PCI_EXP_DEVSTA,
> +				      PCI_EXP_DEVSTA_CED | PCI_EXP_DEVSTA_NFED |
> +				      PCI_EXP_DEVSTA_FED | PCI_EXP_DEVSTA_URD);
> +	}
>  
>  	RTL_W16(IntrMask, 0x0000);
>  

-- 
Simon Farnsworth
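
Illustrative sketch, not from the original message: the error bits in the PCI Express Device Status register are write-1-to-clear (RW1C), which is why the quoted patch writes the very bits it wants cleared. The userspace model below only mimics that behaviour. The helper `rw1c_write` is hypothetical, and the bit values are assumed to mirror the kernel's `PCI_EXP_DEVSTA_*` definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match the kernel's PCI_EXP_DEVSTA_* bit values. */
#define DEVSTA_CED  0x0001 /* correctable error detected */
#define DEVSTA_NFED 0x0002 /* non-fatal error detected */
#define DEVSTA_FED  0x0004 /* fatal error detected */
#define DEVSTA_URD  0x0008 /* unsupported request detected */
#define DEVSTA_RW1C_MASK (DEVSTA_CED | DEVSTA_NFED | DEVSTA_FED | DEVSTA_URD)

/*
 * Hypothetical model of a config write to a register whose error bits
 * are RW1C: writing 1 clears a bit, writing 0 leaves it untouched.
 * Returns the register value after the write.
 */
static uint16_t rw1c_write(uint16_t reg, uint16_t val)
{
	return (uint16_t)(reg & ~(val & DEVSTA_RW1C_MASK));
}
```

Under this model, writing all four bits leaves a register that had only the correctable-error bit set at zero, which matches the assumption that the patch does nothing beyond clearing the status.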



* Re: [PATCH] ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2
From: Sebastian Haas @ 2009-09-24 11:52 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, Wolfgang Grandegger
In-Reply-To: <4ABB2FBF.8080906-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Wolfgang,

Wolfgang Grandegger wrote:
> Hi Sebastian,
> 
> Sebastian Haas wrote:
>> The driver mapped only 128 bytes of the CAN controller address space when a
>> CPC-PCI v2 was detected (incl. CPC-104P). This patch will fix it by always
>> mapping the whole address space (4096 bytes on all boards) of the
>> corresponding PCI BAR.
>>
>> Signed-off-by: Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
>> ---
>>
>>  drivers/net/can/sja1000/ems_pci.c |    8 +++++---
>>  1 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/can/sja1000/ems_pci.c b/drivers/net/can/sja1000/ems_pci.c
>> index 7d84b8a..ba98063 100644
>> --- a/drivers/net/can/sja1000/ems_pci.c
>> +++ b/drivers/net/can/sja1000/ems_pci.c
>> @@ -94,12 +94,14 @@ struct ems_pci_card {
>>  #define EMS_PCI_CDR             (CDR_CBP | CDR_CLKOUT_MASK)
>>  
>>  #define EMS_PCI_V1_BASE_BAR     1
>> -#define EMS_PCI_V1_MEM_SIZE     4096
>> +#define EMS_PCI_V1_MEM_SIZE     4096 /* size of PITA control area */
>>  #define EMS_PCI_V2_BASE_BAR     2
>> -#define EMS_PCI_V2_MEM_SIZE     128
>> +#define EMS_PCI_V2_MEM_SIZE     128 /* size of PLX control area */
>>  #define EMS_PCI_CAN_BASE_OFFSET 0x400 /* offset where the controllers starts */
>>  #define EMS_PCI_CAN_CTRL_SIZE   0x200 /* memory size for each controller */
>>  
>> +#define EMS_PCI_CONTR_MEM_SIZE  4096 /* size of controller area */
>> +
>>  static struct pci_device_id ems_pci_tbl[] = {
>>  	/* CPC-PCI v1 */
>>  	{PCI_VENDOR_ID_SIEMENS, 0x2104, PCI_ANY_ID, PCI_ANY_ID,},
>> @@ -266,7 +268,7 @@ static int __devinit ems_pci_add_card(struct pci_dev *pdev,
>>  		goto failure_cleanup;
>>  	}
>>  
>> -	card->base_addr = pci_iomap(pdev, base_bar, mem_size);
>> +	card->base_addr = pci_iomap(pdev, base_bar, EMS_PCI_CONTR_MEM_SIZE);
>>  	if (card->base_addr == NULL) {
>>  		err = -ENOMEM;
>>  		goto failure_cleanup;
> 
> I see. To avoid confusion I suggest renaming some variables and defines:
> 
> s/EMS_PCI_V1_MEM_SIZE/EMS_PCI_V1_CONF_SIZE/
> s/EMS_PCI_V2_MEM_SIZE/EMS_PCI_V2_CONF_SIZE/
> s/mem_size/conf_size/
> s/EMS_PCI_CONTR_MEM_SIZE/EMS_PCI_BASE_SIZE/
> 
> Would that not be more appropriate?
Okay, I just wanted to minimize changes. Will change it and resubmit.

Sebastian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkq7XZQACgkQpqRB8PJG7XwjeQCfZZJP1FDD7TLBf2hK3dV64GgX
3oMAn2jIkdOx9Euc1UihiK/BXLSIUp06
=W9tL
-----END PGP SIGNATURE-----
-- 
EMS Dr. Thomas Wuensche e.K.
Sonnenhang 3
85304 Ilmmuenster
HRA Neuburg a.d. Donau, HR-Nr. 70.106
Phone: +49-8441-490260
Fax  : +49-8441-81860
http://www.ems-wuensche.com


* [PATCH] net: fix htmldocs sunrpc, clnt.c
From: Jaswinder Singh Rajput @ 2009-09-24 12:19 UTC (permalink / raw)
  To: Ricardo Labiaga, Benny Halevy, Andy Adamson, Trond Myklebust,
	Randy Dunlap <randy.


  DOCPROC Documentation/DocBook/networking.xml
  Warning(net/sunrpc/clnt.c:647): No description found for parameter 'req'
  Warning(net/sunrpc/clnt.c:647): No description found for parameter 'tk_ops'
  Warning(net/sunrpc/clnt.c:647): Excess function parameter 'ops' description in 'rpc_run_bc_task'

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Miller <davem@davemloft.net>
---
 net/sunrpc/clnt.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index a417d5a..38829e2 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -640,10 +640,11 @@ EXPORT_SYMBOL_GPL(rpc_call_async);
 /**
  * rpc_run_bc_task - Allocate a new RPC task for backchannel use, then run
  * rpc_execute against it
- * @ops: RPC call ops
+ * @req: RPC request
+ * @tk_ops: RPC call ops
  */
 struct rpc_task *rpc_run_bc_task(struct rpc_rqst *req,
-					const struct rpc_call_ops *tk_ops)
+				const struct rpc_call_ops *tk_ops)
 {
 	struct rpc_task *task;
 	struct xdr_buf *xbufp = &req->rq_snd_buf;
-- 
1.6.0.6




* [PATCH v2] ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2
From: Sebastian Haas @ 2009-09-24 13:55 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, wg-5Yr1BZd7O62+XT7JhA+gdA

The driver mapped only 128 bytes of the CAN controller address space when a
CPC-PCI v2 was detected (incl. CPC-104P). This patch will fix it by always
mapping the whole address space (4096 bytes on all boards) of the
corresponding PCI BAR.

Signed-off-by: Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
---

 drivers/net/can/sja1000/ems_pci.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/can/sja1000/ems_pci.c b/drivers/net/can/sja1000/ems_pci.c
index 7d84b8a..fd04789 100644
--- a/drivers/net/can/sja1000/ems_pci.c
+++ b/drivers/net/can/sja1000/ems_pci.c
@@ -94,12 +94,14 @@ struct ems_pci_card {
 #define EMS_PCI_CDR             (CDR_CBP | CDR_CLKOUT_MASK)
 
 #define EMS_PCI_V1_BASE_BAR     1
-#define EMS_PCI_V1_MEM_SIZE     4096
+#define EMS_PCI_V1_CONF_SIZE    4096 /* size of PITA control area */
 #define EMS_PCI_V2_BASE_BAR     2
-#define EMS_PCI_V2_MEM_SIZE     128
+#define EMS_PCI_V2_CONF_SIZE    128 /* size of PLX control area */
 #define EMS_PCI_CAN_BASE_OFFSET 0x400 /* offset where the controllers starts */
 #define EMS_PCI_CAN_CTRL_SIZE   0x200 /* memory size for each controller */
 
+#define EMS_PCI_BASE_SIZE  4096 /* size of controller area */
+
 static struct pci_device_id ems_pci_tbl[] = {
 	/* CPC-PCI v1 */
 	{PCI_VENDOR_ID_SIEMENS, 0x2104, PCI_ANY_ID, PCI_ANY_ID,},
@@ -224,7 +226,7 @@ static int __devinit ems_pci_add_card(struct pci_dev *pdev,
 	struct sja1000_priv *priv;
 	struct net_device *dev;
 	struct ems_pci_card *card;
-	int max_chan, mem_size, base_bar;
+	int max_chan, conf_size, base_bar;
 	int err, i;
 
 	/* Enabling PCI device */
@@ -251,22 +253,22 @@ static int __devinit ems_pci_add_card(struct pci_dev *pdev,
 		card->version = 2; /* CPC-PCI v2 */
 		max_chan = EMS_PCI_V2_MAX_CHAN;
 		base_bar = EMS_PCI_V2_BASE_BAR;
-		mem_size = EMS_PCI_V2_MEM_SIZE;
+		conf_size = EMS_PCI_V2_CONF_SIZE;
 	} else {
 		card->version = 1; /* CPC-PCI v1 */
 		max_chan = EMS_PCI_V1_MAX_CHAN;
 		base_bar = EMS_PCI_V1_BASE_BAR;
-		mem_size = EMS_PCI_V1_MEM_SIZE;
+		conf_size = EMS_PCI_V1_CONF_SIZE;
 	}
 
 	/* Remap configuration space and controller memory area */
-	card->conf_addr = pci_iomap(pdev, 0, mem_size);
+	card->conf_addr = pci_iomap(pdev, 0, conf_size);
 	if (card->conf_addr == NULL) {
 		err = -ENOMEM;
 		goto failure_cleanup;
 	}
 
-	card->base_addr = pci_iomap(pdev, base_bar, mem_size);
+	card->base_addr = pci_iomap(pdev, base_bar, EMS_PCI_BASE_SIZE);
 	if (card->base_addr == NULL) {
 		err = -ENOMEM;
 		goto failure_cleanup;

-- 
EMS Dr. Thomas Wuensche e.K.
Sonnenhang 3
85304 Ilmmuenster
HRA Neuburg a.d. Donau, HR-Nr. 70.106
Phone: +49-8441-490260
Fax  : +49-8441-81860
http://www.ems-wuensche.com


* Re: [PATCH v2] ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2
From: Wolfgang Grandegger @ 2009-09-24 14:17 UTC (permalink / raw)
  To: Sebastian Haas
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20090924135505.13453.61811.stgit-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>

Sebastian Haas wrote:
> The driver mapped only 128 bytes of the CAN controller address space when a
> CPC-PCI v2 was detected (incl. CPC-104P). This patch will fix it by always
> mapping the whole address space (4096 bytes on all boards) of the
> corresponding PCI BAR.
> 
> Signed-off-by: Sebastian Haas <haas-zsNKPWJ8Pib6hrUXjxyGrA@public.gmane.org>
Signed-off-by: Wolfgang Grandegger <wg-5Yr1BZd7O62+XT7JhA+gdA@public.gmane.org>

Thanks,

Wolfgang.
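
Illustrative sketch, not from the original thread: the two defines quoted in the patch already show why a 128-byte mapping cannot reach the CAN controllers. The helper name `ems_required_map_size` is hypothetical, and it assumes controller i is accessed at base + EMS_PCI_CAN_BASE_OFFSET + i * EMS_PCI_CAN_CTRL_SIZE, as the define comments suggest. The channel counts are parameters here because the thread does not show the values of EMS_PCI_V1_MAX_CHAN and EMS_PCI_V2_MAX_CHAN.

```c
#include <assert.h>
#include <stddef.h>

#define EMS_PCI_CAN_BASE_OFFSET 0x400 /* offset where the controllers start */
#define EMS_PCI_CAN_CTRL_SIZE   0x200 /* memory size for each controller */

/*
 * Hypothetical helper: number of bytes of the BAR that must be mapped
 * so that the register windows of `channels` controllers are reachable.
 */
static size_t ems_required_map_size(unsigned int channels)
{
	return EMS_PCI_CAN_BASE_OFFSET +
	       (size_t)channels * EMS_PCI_CAN_CTRL_SIZE;
}
```

Even a single controller needs 0x600 (1536) bytes under this assumption, well beyond 128, while four controllers need 0xC00 (3072) bytes and still fit in the 4096 bytes the patch maps.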


* [PATCH net-next-2.6] ixgbe: correct the parameter description
From: Jiri Pirko @ 2009-09-24 14:36 UTC (permalink / raw)
  To: davem; +Cc: netdev

ccffad25b5136958d4769ed6de5e87992dd9c65c changed parameters for function
ixgbe_update_uc_addr_list_generic but parameter description was not updated.
This patch corrects it.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

diff --git a/drivers/net/ixgbe/ixgbe_common.c b/drivers/net/ixgbe/ixgbe_common.c
index 6621e17..2c7db17 100644
--- a/drivers/net/ixgbe/ixgbe_common.c
+++ b/drivers/net/ixgbe/ixgbe_common.c
@@ -1355,9 +1355,7 @@ static void ixgbe_add_uc_addr(struct ixgbe_hw *hw, u8 *addr, u32 vmdq)
 /**
  *  ixgbe_update_uc_addr_list_generic - Updates MAC list of secondary addresses
  *  @hw: pointer to hardware structure
- *  @addr_list: the list of new addresses
- *  @addr_count: number of addresses
- *  @next: iterator function to walk the address list
+ *  @uc_list: the list of new addresses
  *
  *  The given list replaces any existing list.  Clears the secondary addrs from
  *  receive address registers.  Uses unused receive address registers for the


* Re: [PATCH] net: fix htmldocs sunrpc, clnt.c
From: Randy Dunlap @ 2009-09-24 16:19 UTC (permalink / raw)
  To: Jaswinder Singh Rajput
  Cc: Ricardo Labiaga, Benny Halevy, Andy Adamson, Trond Myklebust,
	David Miller, linux-nfs-u79uwXL29TY76Z2rM5mHXA, netdev, LKML
In-Reply-To: <1253794781.5860.29.camel-6Ww87KsxWewAvxtiuMwx3w@public.gmane.org>

Jaswinder Singh Rajput wrote:
>   DOCPROC Documentation/DocBook/networking.xml
>   Warning(net/sunrpc/clnt.c:647): No description found for parameter 'req'
>   Warning(net/sunrpc/clnt.c:647): No description found for parameter 'tk_ops'
>   Warning(net/sunrpc/clnt.c:647): Excess function parameter 'ops' description in 'rpc_run_bc_task'
> 
> Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

Ack.  Already sent, but possibly lost.

> Cc: Ricardo Labiaga <Ricardo.Labiaga-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> Cc: Benny Halevy <bhalevy-C4P08NqkoRlBDgjK7y7TUQ@public.gmane.org>
> Cc: Andy Adamson <andros-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> Cc: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> Cc: Randy Dunlap <randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Cc: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
> ---
>  net/sunrpc/clnt.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index a417d5a..38829e2 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -640,10 +640,11 @@ EXPORT_SYMBOL_GPL(rpc_call_async);
>  /**
>   * rpc_run_bc_task - Allocate a new RPC task for backchannel use, then run
>   * rpc_execute against it
> - * @ops: RPC call ops
> + * @req: RPC request
> + * @tk_ops: RPC call ops
>   */
>  struct rpc_task *rpc_run_bc_task(struct rpc_rqst *req,
> -					const struct rpc_call_ops *tk_ops)
> +				const struct rpc_call_ops *tk_ops)
>  {
>  	struct rpc_task *task;
>  	struct xdr_buf *xbufp = &req->rq_snd_buf;



* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Rick Jones @ 2009-09-24 16:30 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4ABAD48C.9010808@candelatech.com>

Ben Greear wrote:
> Rick Jones wrote:
>> Ben Greear wrote:
>>
>>> When LRO is enabled, the received packet and byte counters represent the
>>> LRO'd packets, not the packets/bytes on the wire. 
>>
>>
>> When LRO is enabled, are all the bytes on the wire actually 
>> transferred into the host?
> 
> No...the ethernet, IP and TCP headers and such are not, for packets that 
> are combined into a single large SKB.
> 
> That is why the driver counts them wrong.  The bytes are off by a few 
> percentage points, but the packet count is off by an order of magnitude.

An overly philosophical question perhaps, but are ethtool stats supposed to 
represent what was on the wire, or what entered the host?

rick


* Re: [PATCH] net: fix htmldocs sunrpc, clnt.c
From: Benny Halevy @ 2009-09-24 16:46 UTC (permalink / raw)
  To: Randy Dunlap, Jaswinder Singh Rajput
  Cc: Ricardo Labiaga, Andy Adamson, Trond Myklebust, David Miller,
	linux-nfs, netdev, LKML
In-Reply-To: <4ABB9C17.3020307@oracle.com>

On Sep. 24, 2009, 19:19 +0300, Randy Dunlap <randy.dunlap@oracle.com> wrote:
> Jaswinder Singh Rajput wrote:
>>   DOCPROC Documentation/DocBook/networking.xml
>>   Warning(net/sunrpc/clnt.c:647): No description found for parameter 'req'
>>   Warning(net/sunrpc/clnt.c:647): No description found for parameter 'tk_ops'
>>   Warning(net/sunrpc/clnt.c:647): Excess function parameter 'ops' description in 'rpc_run_bc_task'
>>
>> Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
> 
> Ack.  Already sent, but possibly lost.

Ack.  thanks!

> 
>> Cc: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
>> Cc: Benny Halevy <bhalevy@panasas.com>
>> Cc: Andy Adamson <andros@netapp.com>
>> Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
>> Cc: Randy Dunlap <randy.dunlap@oracle.com>
>> Cc: David Miller <davem@davemloft.net>
>> ---
>>  net/sunrpc/clnt.c |    5 +++--
>>  1 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
>> index a417d5a..38829e2 100644
>> --- a/net/sunrpc/clnt.c
>> +++ b/net/sunrpc/clnt.c
>> @@ -640,10 +640,11 @@ EXPORT_SYMBOL_GPL(rpc_call_async);
>>  /**
>>   * rpc_run_bc_task - Allocate a new RPC task for backchannel use, then run
>>   * rpc_execute against it
>> - * @ops: RPC call ops
>> + * @req: RPC request
>> + * @tk_ops: RPC call ops
>>   */
>>  struct rpc_task *rpc_run_bc_task(struct rpc_rqst *req,
>> -					const struct rpc_call_ops *tk_ops)
>> +				const struct rpc_call_ops *tk_ops)
>>  {
>>  	struct rpc_task *task;
>>  	struct xdr_buf *xbufp = &req->rq_snd_buf;
> 


* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Ben Greear @ 2009-09-24 17:07 UTC (permalink / raw)
  To: Rick Jones; +Cc: NetDev
In-Reply-To: <4ABB9EB3.1000307@hp.com>

On 09/24/2009 09:30 AM, Rick Jones wrote:
> Ben Greear wrote:
>> Rick Jones wrote:
>>> Ben Greear wrote:
>>>
>>>> When LRO is enabled, the received packet and byte counters represent
>>>> the
>>>> LRO'd packets, not the packets/bytes on the wire.
>>>
>>>
>>> When LRO is enabled, are all the bytes on the wire actually
>>> transferred into the host?
>>
>> No...the ethernet, IP and TCP headers and such are not, for packets
>> that are combined into a single large SKB.
>>
>> That is why the driver counts them wrong. The bytes are off by a few
>> percentage points, but the packet count is off by an order of magnitude.
>
> An overly philosophical question perhaps, but are ethtool stats supposed
> to represent what was on the wire, or what entered the host?

They report whatever they report: you get to set custom labels for the values,
and every NIC/driver may be different, so only humans and crazy code like mine that does
specific things based on the driver reported by ethtool should use them.

A more interesting question to me is what netdev-stats tx/rx byte counters should report?

My opinions:
ethernet header (yes)
ethernet CRC  (yes)
ethernet preamble (no)
ethernet frame gap (no)

I think many don't count the CRC, but I haven't looked recently.

Some didn't even report the ethernet header properly a few years ago, but
I think most do now.

When LRO is enabled, it's hard to say if we should report the LRO pkt
stats or the stats on the wire for the netdev-stats.  At least in my case,
I want to report the stats on the wire, but it's also good to see the
LRO stats because you can easily tell that LRO is actually working if you
see low pkts-per-second counters vs. high bits-per-sec.


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
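
Illustrative sketch, not from the original message: the arithmetic behind the byte-counter opinions above and the LRO discrepancy discussed earlier in the thread. All function names are hypothetical. The per-frame constants are the standard Ethernet sizes, and the 40-byte IPv4 + TCP header (no options) and 1460-byte segments are assumptions chosen for the example.

```c
#include <assert.h>

enum {
	ETH_HDR      = 14, /* destination MAC, source MAC, EtherType */
	ETH_FCS      = 4,  /* CRC */
	ETH_PREAMBLE = 8,  /* preamble plus start-of-frame delimiter */
	ETH_IFG      = 12, /* minimum inter-frame gap */
	L3L4_HDR     = 40, /* assumed: IPv4 + TCP, no options */
};

/* Bytes counted per frame if header and CRC count, preamble and gap do not. */
static unsigned long counted_bytes(unsigned long l2_payload)
{
	return ETH_HDR + l2_payload + ETH_FCS;
}

/* Bytes of wire time one frame occupies, including preamble and gap. */
static unsigned long wire_slot_bytes(unsigned long l2_payload)
{
	return counted_bytes(l2_payload) + ETH_PREAMBLE + ETH_IFG;
}

/* Bytes (without CRC) of nsegs separate TCP segments as seen on the wire. */
static unsigned long segs_bytes(unsigned int nsegs, unsigned long tcp_payload)
{
	return nsegs * (ETH_HDR + L3L4_HDR + tcp_payload);
}

/* Bytes of the single aggregated packet after LRO merges those segments. */
static unsigned long lro_bytes(unsigned int nsegs, unsigned long tcp_payload)
{
	return ETH_HDR + L3L4_HDR + nsegs * tcp_payload;
}
```

With ten 1460-byte segments this gives 15140 bytes on the wire against 14654 bytes in the merged packet, a difference of about 3%, while the packet count differs by a factor of ten. That is consistent with "a few percentage points" for bytes and "an order of magnitude" for packets.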



* Re: [PATCH 3/3 V2] i2400m-sdio: select IWMC3200TOP in Kconfig
From: Inaky Perez-Gonzalez @ 2009-09-24 17:25 UTC (permalink / raw)
  To: Winkler, Tomas
  Cc: davem@davemloft.net, linville@tuxdriver.com,
	netdev@vger.kernel.org, linux-wireless@vger.kernel.org,
	linux-mmc@vger.kernel.org, Zhu, Yi, Kao, Cindy H, Cohen, Guy,
	Rindjunsky, Ron
In-Reply-To: <1253779256-17058-1-git-send-email-tomas.winkler@intel.com>

On Thu, 2009-09-24 at 02:00 -0600, Winkler, Tomas wrote:
> i2400m-sdio requires iwmc3200top for its operation
> 
> create separate config option to separate 3200 specifics
> from eventual further wimax sdio HW.
> 
> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>

Acked-by: Inaky Perez-Gonzalez <inaky@linux.intel.com>

I'll merge this into the WiMAX tree.

-- 
-- Inaky




* question on raw sockets and source IP address validation
From: Chris Friesen @ 2009-09-24 17:53 UTC (permalink / raw)
  To: Linux Network Development list

Hi all,

Normally when sending a packet on a SOCK_RAW socket the source IP
address is validated against the addresses configured on the host.  If
the address isn't configured, the packet isn't sent.

This can be avoided by setting IP_HDRINCL, but then the app needs to
handle all the fragmentation itself.

Is there any way to bypass the source address validation without IP_HDRINCL?

Thanks,

Chris
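
Illustrative sketch, not from the original message, and not an answer to the question: it only shows the IP_HDRINCL path the message mentions, where the application supplies the IPv4 header (and hence the source address) itself. The helper names are hypothetical, the addresses used with it would be the caller's choice, and opening the raw socket needs CAP_NET_RAW. On Linux, raw(7) documents that the kernel fills in the total length and checksum itself, so the checksum computed here is mainly for illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* RFC 1071 Internet checksum over an even number of bytes. */
static uint16_t inet_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Fill a 20-byte IPv4 header with a caller-chosen source address. */
static int build_ipv4_header(uint8_t hdr[20], const char *src, const char *dst,
			     uint8_t proto, uint16_t payload_len)
{
	struct in_addr s, d;
	uint16_t tot = (uint16_t)(20 + payload_len);
	uint16_t csum;

	if (inet_pton(AF_INET, src, &s) != 1 || inet_pton(AF_INET, dst, &d) != 1)
		return -1;

	memset(hdr, 0, 20);
	hdr[0] = 0x45;                  /* version 4, header length 5 words */
	hdr[2] = (uint8_t)(tot >> 8);   /* total length, network byte order */
	hdr[3] = (uint8_t)(tot & 0xff);
	hdr[8] = 64;                    /* TTL */
	hdr[9] = proto;
	memcpy(&hdr[12], &s, 4);        /* source address as given by caller */
	memcpy(&hdr[16], &d, 4);        /* destination address */
	csum = inet_csum(hdr, 20);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
	return 0;
}

/* Open a raw socket that expects the caller to supply the IP header. */
static int open_hdrincl_socket(int proto)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_RAW, proto);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_IP, IP_HDRINCL, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would build the header, append the payload, and pass the whole buffer to sendto() on the socket returned by open_hdrincl_socket(). Fragmentation remains the application's job, which is exactly the cost the question is trying to avoid.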


* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-24 18:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABB1D44.5000007@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 24877 bytes --]

Avi Kivity wrote:
> On 09/24/2009 12:15 AM, Gregory Haskins wrote:
>>
>>>> There are various aspects about designing high-performance virtual
>>>> devices such as providing the shortest paths possible between the
>>>> physical resources and the consumers.  Conversely, we also need to
>>>> ensure that we meet proper isolation/protection guarantees at the same
>>>> time.  What this means is there are various aspects to any
>>>> high-performance PV design that must be placed in-kernel to
>>>> maximize the performance yet properly isolate the guest.
>>>>
>>>> For instance, you are required to have your signal-path (interrupts and
>>>> hypercalls), your memory-path (gpa translation), and
>>>> addressing/isolation model in-kernel to maximize performance.
>>>>
>>>>        
>>> Exactly.  That's what vhost puts into the kernel and nothing more.
>>>      
>> Actually, no.  Generally, _KVM_ puts those things into the kernel, and
>> vhost consumes them.  Without KVM (or something equivalent), vhost is
>> incomplete.  One of my goals with vbus is to generalize the "something
>> equivalent" part here.
>>    
> 
> I don't really see how vhost and vbus are different here.  vhost expects
> signalling to happen through a couple of eventfds and requires someone
> to supply them and implement kernel support (if needed).  vbus requires
> someone to write a connector to provide the signalling implementation. 
> Neither will work out-of-the-box when implementing virtio-net over
> falling dominos, for example.

I realize in retrospect that my choice of words above implies vbus _is_
complete, but this is not what I was saying.  What I was trying to
convey is that vbus is _more_ complete.  Yes, in either case some kind
of glue needs to be written.  The difference is that vbus implements
more of the glue generally, and leaves less required to be customized
for each iteration.
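
Illustrative sketch, not from the original message: the eventfd mechanism that the quoted text says vhost relies on for signalling, shown in plain userspace. The helper names `doorbell_kick` and `doorbell_drain` are hypothetical; only eventfd(2), read(2) and write(2) are real interfaces. How vhost or KVM wire such descriptors together is not shown and is not implied by this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical "kick": add n to the eventfd counter to signal the peer. */
static int doorbell_kick(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Consume pending kicks and return the accumulated count. Returns 0 if
 * nothing is pending on a descriptor opened with EFD_NONBLOCK.
 */
static uint64_t doorbell_drain(int efd)
{
	uint64_t n = 0;

	if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
		return 0;
	return n;
}
```

Two kicks followed by one drain yield a count of 2, since a plain eventfd accumulates writes and resets on read. The descriptor itself carries no addressing or discovery information, which is the "glue" both sides of the argument agree has to come from somewhere else.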

Going back to our stack diagrams, you could think of a vhost solution
like this:

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| ? undefined-1 ?
--------------------------
| vhost
--------------------------

and you could think of a vbus solution like this:

--------------------------
| virtio-net
--------------------------
| virtio-ring
--------------------------
| virtio-bus
--------------------------
| bus-interface
--------------------------
| ? undefined-2 ?
--------------------------
| bus-model
--------------------------
| virtio-net-device (vhost ported to vbus model? :)
--------------------------


So the difference between vhost and vbus in this particular context is
that you need to have "undefined-1" do device discovery/hotswap,
config-space, address-decode/isolation, signal-path routing, memory-path
routing, etc.  Today this function is filled by things like virtio-pci,
pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
lguest, but presumably it is filled there by components like
virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
more contemporary examples, we might have virtio-domino, domino-bus,
domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
and ira-launcher.

Contrast this to the vbus stack:  The bus-X components (when optionally
employed by the connector designer) do device-discovery, hotswap,
config-space, address-decode/isolation, signal-path and memory-path
routing, etc in a general (and pv-centric) way. The "undefined-2"
portion is the "connector", and just needs to convey messages like
"DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the stack.

So to answer your question, the difference is that the part that has to
be customized in vbus should be a fraction of what needs to be
customized with vhost because it defines more of the stack.  And, as
alluded to in my diagram, both virtio-net and vhost (with some
modifications to fit into the vbus framework) are potentially
complementary, not competitors.

> 
>>>> Vbus accomplishes its in-kernel isolation model by providing a
>>>> "container" concept, where objects are placed into this container by
>>>> userspace.  The host kernel enforces isolation/protection by using a
>>>> namespace to identify objects that is only relevant within a specific
>>>> container's context (namely, a "u32 dev-id").  The guest addresses the
>>>> objects by its dev-id, and the kernel ensures that the guest can't
>>>> access objects outside of its dev-id namespace.
>>>>
>>>>        
>>> vhost manages to accomplish this without any kernel support.
>>>      
>> No, vhost manages to accomplish this because of KVM's kernel support
>> (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is
>> merely a kind of "tuntap"-like clone signalled by eventfds.
>>    
> 
> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
> either.

Mostly covered above...

However, I was addressing your assertion that vhost somehow magically
accomplishes this "container/addressing" function without any specific
kernel support.  This is incorrect.  I contend that this kernel support
is required and present.  The difference is that it's defined elsewhere
(and typically in a transport/arch specific way).

IOW: You can basically think of the programmed PIO addresses as forming
its "container".  Only addresses explicitly added are visible, and
everything else is inaccessible.  This whole discussion is merely a
question of what's been generalized versus what needs to be
re-implemented each time.
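
Illustrative sketch, not from the original message and not vbus code: a minimal model of the "container" idea described above, where a dev-id is only meaningful inside one container and anything not explicitly added is unreachable. Every name here (`struct vdev`, `struct container`, `container_add`, `container_devcall`) is hypothetical, and a fixed-size table stands in for whatever lookup structure a real implementation would use.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CONTAINER_MAX_DEVS 8 /* arbitrary limit for the sketch */

struct vdev {
	int (*call)(struct vdev *dev, uint32_t func);
	int calls; /* number of calls delivered so far */
};

struct container {
	struct vdev *devs[CONTAINER_MAX_DEVS]; /* index is the dev-id */
};

/* Make a device visible under `devid` inside this container only. */
static int container_add(struct container *c, uint32_t devid, struct vdev *dev)
{
	if (devid >= CONTAINER_MAX_DEVS)
		return -EINVAL;
	if (c->devs[devid])
		return -EEXIST;
	c->devs[devid] = dev;
	return 0;
}

/* Decode a DEVCALL-like message: dev-ids not in the container do not exist. */
static int container_devcall(struct container *c, uint32_t devid, uint32_t func)
{
	struct vdev *dev;

	if (devid >= CONTAINER_MAX_DEVS)
		return -ENOENT;
	dev = c->devs[devid];
	if (!dev)
		return -ENOENT;
	return dev->call(dev, func);
}

/* Example device operation: count how many calls were delivered. */
static int count_call(struct vdev *dev, uint32_t func)
{
	(void)func;
	return ++dev->calls;
}
```

A device added to one container under dev-id 3 answers calls there, while the same dev-id in a second, empty container fails with -ENOENT. The isolation comes from the lookup itself, whatever transport carried the message.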


> Both vhost and vbus need an interface,

Agreed

> vhost's is just narrower since it doesn't do configuration or enumeration.

I would say that makes the vhost solution's interface wider, not narrower.
With the vbus kvm-connector, simple vbus device instantiation implicitly
registers it in the address/enumeration namespace, and transmits a
devadd event.  It does all that with no more interface complexity than
instantiating a vhost device.  However, vhost has to then also
separately configure its address/enumeration space with other subsystems
(e.g. pci, ioeventfd, msi, etc), and define its config-space twice.

This means something in userspace has to proxy and/or refactor requests.
 This also means that the userspace component has to have some knowledge
of _how_ to proxy/refactor said requests (i.e. splitting the design),
which is another example of where the current vhost model really falls
apart IMO.

> 
>> This goes directly to my rebuttal of your claim that vbus places too
>> much in the kernel.  I state that, one way or the other, address decode
>> and isolation _must_ be in the kernel for performance.  Vbus does this
>> with a devid/container scheme.  vhost+virtio-pci+kvm does it with
>> pci+pio+ioeventfd.
>>    
> 
> vbus doesn't do kvm guest address decoding for the fast path.  It's
> still done by ioeventfd.

That is not correct.  vbus does its own native address decoding in the
fast path, such as here:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331

The connector delivers a SHMSIGNAL(id) message, and it's decoded
generically by an RCU-protected radix tree.

I think what you are thinking of is that my KVM-connector in AlacrityVM
uses PIO/ioeventfd (*) as part of its transport to deliver that
SHMSIGNAL message.  In this sense, I am doing two address-decodes (one
for the initial pio, one for the subsequent shmsignal), but this is an
implementation detail of the KVM connector.

(Also note that it's an implementation detail that the KVM maintainer
forced me into ;)  The original vbus design utilized a global hypercall
in place of the PIO, and thus the shmsignal was the only real decode
occurring)

(*) actually I dropped ioeventfd in my latest tree, but this is a
separate topic.  I still use KVM's pio-bus, however.

> 
>>>   The guest
>>> simply has no access to any vhost resources other than the guest->host
>>> doorbell, which is handed to the guest outside vhost (so it's somebody
>>> else's problem, in userspace).
>>>      
>> You mean _controlled_ by userspace, right?  Obviously, the other side of
>> the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
>> vhost would be pointless: e.g. just use vanilla tuntap if you don't need
>> fast in-kernel decoding.
>>    
> 
> Yes (though for something like level-triggered interrupts we're probably
> keeping it in userspace, enjoying the benefits of vhost data path while
> paying more for signalling).

That's fine.  I am primarily interested in the high-performance IO, so
low-perf/legacy components can fall back to something else like
userspace if that best serves them.

> 
>>>> All that is required is a way to transport a message with a "devid"
>>>> attribute as an address (such as DEVCALL(devid)) and the framework
>>>> provides the rest of the decode+execute function.
>>>>
>>>>        
>>> vhost avoids that.
>>>      
>> No, it doesn't avoid it.  It just doesn't specify how it's done, and
>> relies on something else to do it on its behalf.
>>    
> 
> That someone else can be in userspace, apart from the actual fast path.

No, this "devcall" like decoding _is_ fast path and it can't be in
userspace if you care about performance.  And if you don't care about
performance, you can use existing facilities (like QEMU+tuntap) so vhost
and vbus alike would become unnecessary in that scenario.

> 
>> Conversely, vbus specifies how it's done, but not how to transport the
>> verb "across the wire".  That is the role of the vbus-connector
>> abstraction.
>>    
> 
> So again, vbus does everything in the kernel (since it's so easy and
> cheap) but expects a vbus-connector.  vhost does configuration in
> userspace (since it's so clunky and fragile) but expects a couple of
> eventfds.

Well, we are talking about fast-path here, so I am not sure why
config-space is coming up in this context.  I digress. I realize you are
being sarcastic, but your easy+cheap/clunky+fragile assessment is more
accurate than you perhaps realize.

You keep extolling that vhost does most things in userspace and that is
an advantage.  But the simple fact is that they both functionally do
almost the same amount in-kernel, because they _have_ to.  This includes
the obvious stuff like signal and memory routing, but also the less
obvious stuff like most of config-space.  Ultimately, most of
config-space needs to terminate at the device-model (the one exception
is perhaps "read-only attributes", like "MACQUERY").  Therefore, even if
you use a vhost-like model, most of your parameters will invariably be a
translation from one config space to another and passed on (e.g. pci
config-cycle to ioctl()).

The disparity of in-kernel vs userspace functionality that remains
between the two implementations is basically the enumeration/hotswap
and read-only attribute functions.  These functions are prohibitively
complex in the vhost+virtio-pci+kvm model (full ICH/pci chipset
emulation, etc), so I understand why we wouldn't want to move those
in-kernel.  However, vbus was designed from scratch specifically for PV
to be flexible and simple.  As a result, the remaining functions in the
kvm-connector take advantage of this simplicity and just ride on the
existing model that we needed for fast-path anyway.  What this means is
that there is no significant consequence to doing these few minor
details in-kernel, other than this long discussion.

In fact, it's actually a simpler design to unify things this way because
you avoid splitting the device model up. Consider how painful the vhost
implementation would be if it didn't already have the userspace
virtio-net to fall back on.  This is effectively what we face for new
devices going forward if that model is to persist.


> 
>>>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
>>>>
>>>>        
>>> It's the wrong name.  vhost implements only the data path.
>>>      
>> Understood, but vhost+virtio-pci is what I am contrasting, and I use
>> "vhost" for short from that point on because I am too lazy to type the
>> whole name over and over ;)
>>    
> 
> If you #define A A+B+C don't expect intelligent conversation afterwards.

Fair enough, but I did attempt to declare the definition before using
it.  Sorry again for the confusion.

> 
>>>> It is not immune to requiring in-kernel addressing support either, but
>>>> rather it just does it differently (and its not as you might expect via
>>>> qemu).
>>>>
>>>> Vhost relies on QEMU to render PCI objects to the guest, which the
>>>> guest
>>>> assigns resources (such as BARs, interrupts, etc).
>>>>        
>>> vhost does not rely on qemu.  It relies on its user to handle
>>> configuration.  In one important case it's qemu+pci.  It could just as
>>> well be the lguest launcher.
>>>      
>> I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
>>
>> The point I am making specifically is that vhost in general relies on
>> other in-kernel components to function.  I.e. It cannot function without
>> having something like the PCI model to build an IO namespace.  That
>> namespace (in this case, pio addresses+data tuples) are used for the
>> in-kernel addressing function under KVM + virtio-pci.
>>
>> The case of the lguest launcher is a good one to highlight.  Yes, you
>> can presumably also use lguest with vhost, if the requisite facilities
>> are exposed to lguest-bus, and some eventfd based thing like ioeventfd
>> is written for the host (if it doesnt exist already).
>>
>> And when the next virt design "foo" comes out, it can make a "foo-bus"
>> model, and implement foo-eventfd on the backend, etc, etc.
>>    
> 
> It's exactly the same with vbus needing additional connectors for
> additional transports.

No, see my reply above.

> 
>> Ira can make ira-bus, and ira-eventfd, etc, etc.
>>
>> Each iteration will invariably introduce duplicated parts of the stack.
>>    
> 
> Invariably?

As in "always"

>  Use libraries (virtio-shmem.ko, libvhost.so).

What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
don't need libvhost.so per se since you can just use standard kernel
interfaces (like configfs/sysfs).  I could create an .so going forward
for the new ioctl-based interface, I suppose.

> 
> 
>>> For the N+1th time, no.  vhost is perfectly usable without pci.  Can we
>>> stop raising and debunking this point?
>>>      
>> Again, I understand vhost is decoupled from PCI, and I don't mean to
>> imply anything different.  I use PCI as an example here because a) its
>> the only working example of vhost today (to my knowledge), and b) you
>> have stated in the past that PCI is the only "right" way here, to
>> paraphrase.  Perhaps you no longer feel that way, so I apologize if you
>> feel you already recanted your position on PCI and I missed it.
>>    
> 
> For kvm/x86 pci definitely remains king.

For full virtualization, sure.  I agree.  However, we are talking about
PV here.  For PV, PCI is not a requirement and is a technical dead-end IMO.

KVM seems to be the only virt solution that thinks otherwise (*), but I
believe that is primarily a condition of its maturity.  I aim to help
advance things here.

(*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
vmi-esque thing (I forget what it's called) to name a few.  Love 'em or
hate 'em, most other hypervisors do something along these lines.  I'd
like to try to create one for KVM, but to unify them all (at least for
the Linux-based host designs).

>  I was talking about the two
> lguest users and Ira.
> 
>> I digress.  My point here isn't PCI.  The point here is the missing
>> component for when PCI is not present.  The component that is partially
>> satisfied by vbus's devid addressing scheme.  If you are going to use
>> vhost, and you don't have PCI, you've gotta build something to replace
>> it.
>>    
> 
> Yes, that's why people have keyboards.  They'll write that glue code if
> they need it.  If it turns out to be a hit and people start having virtio
> transport module writing parties, they'll figure out a way to share code.

Sigh...  The party has already started.  I tried to invite you months ago...

> 
>>>> All you really need is a simple decode+execute mechanism, and a way to
>>>> program it from userspace control.  vbus tries to do just that:
>>>> commoditize it so all you need is the transport of the control messages
>>>> (like DEVCALL()), but the decode+execute itself is reuseable, even
>>>> across various environments (like KVM or Iras rig).
>>>>
>>>>        
>>> If you think it should be "commoditized", write libvhostconfig.so.
>>>      
>> I know you are probably being facetious here, but what do you propose
>> for the parts that must be in-kernel?
>>    
> 
> On the guest side, virtio-shmem.ko can unify the ring access.  It
> probably makes sense even today.  On the host side eventfd is the
> kernel interface and libvhostconfig.so can provide the configuration
> when an existing ABI is not imposed.

That won't cut it.  For one, creating an eventfd is only part of the
equation.  I.e. you need to have it originate/terminate somewhere
interesting (and in-kernel, otherwise use tuntap).

> 
>>>> And your argument, I believe, is that vbus allows both to be
>>>> implemented
>>>> in the kernel (though to reiterate, its optional) and is therefore a
>>>> bad
>>>> design, so lets discuss that.
>>>>
>>>> I believe the assertion is that things like config-space are best left
>>>> to userspace, and we should only relegate fast-path duties to the
>>>> kernel.  The problem is that, in my experience, a good deal of
>>>> config-space actually influences the fast-path and thus needs to
>>>> interact with the fast-path mechanism eventually anyway.
>>>> Whats left
>>>> over that doesn't fall into this category may cheaply ride on existing
>>>> plumbing, so its not like we created something new or unnatural just to
>>>> support this subclass of config-space.
>>>>
>>>>        
>>> Flexibility is reduced, because changing code in the kernel is more
>>> expensive than in userspace, and kernel/user interfaces aren't typically
>>> as wide as pure userspace interfaces.  Security is reduced, since a bug
>>> in the kernel affects the host, while a bug in userspace affects just one
>>> guest.
>>>      
>> For a mac-address attribute?  Thats all we are really talking about
>> here.  These points you raise, while true of any kernel code I suppose,
>> are a bit of a stretch in this context.
>>    
> 
> Look at the virtio-net feature negotiation.  There's a lot more there
> than the MAC address, and it's going to grow.

Agreed, but note that makes my point.  That feature negotiation almost
invariably influences the device-model, not some config-space shim.
IOW: terminating config-space at some userspace shim is pointless.  The
model ultimately needs the result of whatever transpires during that
negotiation anyway.

> 
>>> Example: feature negotiation.  If it happens in userspace, it's easy to
>>> limit what features we expose to the guest.
>>>      
>> Its not any harder in the kernel.  I do this today.
>>
>> And when you are done negotiating said features, you will generally have
>> to turn around and program the feature into the backend anyway (e.g.
>> ioctl() to vhost module).  Now you have to maintain some knowledge of
>> that particular feature and how to program it in two places.
>>    
> 
> No, you can leave it enabled unconditionally in vhost (the guest won't
> use what it doesn't know about).

Perhaps, but IMO sending a "feature-mask"-like object down is far easier
than proxying/refactoring config-space and sending that down.  I'd still
chalk the win here to the vbus model used in AlacrityVM.

FWIW: venet has the ability to enable/disable features on the host side,
so clearly userspace config-space is not required for the basic premise.
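
A minimal sketch of the "feature-mask" idea, for illustration only.
This is not venet or vbus code: negotiate_features() and the FEAT_*
bits are made-up names, and the only assumptions are that features are
bits in a 32-bit word and that userspace hands the host-side model an
administrative mask.

```c
#include <stdint.h>

/* Hypothetical feature bits; these are not venet's real definitions. */
#define FEAT_SG		(1u << 0)
#define FEAT_CSUM	(1u << 1)
#define FEAT_TSO	(1u << 2)

/*
 * supported:  what the host-side device model implements
 * admin_mask: what userspace allows this guest to see
 * requested:  what the guest driver asks for
 *
 * The result is the set both sides may use from now on.  The model
 * keeps it directly, so nothing needs to be programmed a second time
 * through a separate path.
 */
static uint32_t negotiate_features(uint32_t supported, uint32_t admin_mask,
				   uint32_t requested)
{
	return supported & admin_mask & requested;
}
```

With all three bits supported, an admin mask of FEAT_SG | FEAT_CSUM and
a guest asking for FEAT_CSUM | FEAT_TSO, only FEAT_CSUM is negotiated.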

> 
>> Conversely, I am eliminating the (unnecessary) middleman by letting the
>> feature negotiating take place directly between the two entities that
>> will consume it.
>>    
> 
> The middleman is necessary, if you want to support live migration

Orchestrating live-migration has nothing to do with whether config-space
is serviced by a middle-man or not.  It shouldn't be required to have
device-specific knowledge at all beyond what was initially needed to
create/config the object at boot time.

IOW, the orchestrator merely needs to know that a device-model object is
present and a method to serialize and reconstitute its state (if
appropriate).

> , or to
> restrict a guest to a subset of your features.

No, that is incorrect.  We are not talking about directly exposing
something like HW cpuid here.  These are all virtual models, and they
can optionally expose as much or as little as we want.  They do this
under administrative control by userspace, and independent of the
location of the config-space handler.

> 
>>>   If it happens in the
>>> kernel, we need to add an interface to let the kernel know which
>>> features it should expose to the guest.
>>>      
>> You need this already either way for both models anyway.  As an added
>> bonus, vbus has generalized that interface using sysfs attributes, so
>> all models are handled in a similar and community accepted way.
>>    
> 
> vhost doesn't need it since userspace takes care of it.

Ok, but see my related reply above.

> 
>>>   We also need to add an
>>> interface to let userspace know which features were negotiated, if we
>>> want to implement live migration.  Something fairly trivial bloats
>>> rapidly.
>>>      
>> Can you elaborate on the requirements for live-migration?  Wouldnt an
>> opaque save/restore model work here? (e.g. why does userspace need to be
>> able to interpret the in-kernel state, just pass it along as a blob to
>> the new instance).
>>    
> 
> A blob would work, if you commit to forward and backward compatibility
> in the kernel side (i.e. an older kernel must be able to accept a blob
> from a newer one).

That's understood and acceptable.

> I don't like blobs though, they tie you to the implementation.

What would you suggest otherwise?

> 
>>> As you can see above, userspace needs to be involved in this, and the
>>> number of interfaces required is smaller if it's in userspace:
>>>      
>> Actually, no.  My experience has been the opposite.  Anytime I sat down
>> and tried to satisfy your request to move things to the userspace,
>> things got ugly and duplicative really quick.  I suspect part of the
>> reason you may think its easier because you already have part of
>> virtio-net in userspace and its surrounding support, but that is not the
>> case moving forward for new device types.
>>    
> 
> I can't comment on your experience, but we'll definitely build on
> existing code for new device types.

Fair enough.  I'll build on my experience, either reusing existing code
or implementing new designs where appropriate.  If you or anyone else
want to join me in my efforts, the more the merrier.

> 
>>> you only
>>> need to know which features the kernel supports (they can be enabled
>>> unconditionally, just not exposed).
>>>
>>> Further, some devices are perfectly happy to be implemented in
>>> userspace, so we need userspace configuration support anyway.  Why
>>> reimplement it in the kernel?
>>>      
>> That's fine.  vbus is targeted for high-performance IO.  So if you have
>> a robust userspace (like KVM+QEMU) and low-performance constraints (say,
>> for a console or something), put it in userspace and vbus is not
>> involved.  I don't care.
>>    
> 
> So now the hypothetical non-pci hypervisor needs to support two busses.

No.  The hypothetical hypervisor only needs to decide where
low-performance devices should live.  If that is best served by
making/reusing a unique bus for them, I have no specific problem with
that.  Systems are typically composed of multiple buses anyway.

Conversely, there is nothing wrong with putting low-performance devices
on a bus designed for high-performance either, and vbus can accommodate
both types.  The latter is what I would advocate for simplicity's sake,
but it's not a requirement.

Kind Regards,
-Greg




^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Gregory Haskins @ 2009-09-24 18:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABB27B9.4050904@redhat.com>


Avi Kivity wrote:
> On 09/23/2009 10:37 PM, Avi Kivity wrote:
>>
>> Example: feature negotiation.  If it happens in userspace, it's easy
>> to limit what features we expose to the guest.  If it happens in the
>> kernel, we need to add an interface to let the kernel know which
>> features it should expose to the guest.  We also need to add an
>> interface to let userspace know which features were negotiated, if we
>> want to implement live migration.  Something fairly trivial bloats
>> rapidly.
> 
> btw, we have this issue with kvm reporting cpuid bits to the guest. 
> Instead of letting kvm talk directly to the hardware and the guest, kvm
> gets the cpuid bits from the hardware, strips away features it doesn't
> support, exposes that to userspace, and expects userspace to program the
> cpuid bits it wants to expose to the guest (which may be different than
> what kvm exposed to userspace, and different from guest to guest).
> 

This issue doesn't exist in the model I am referring to, as these are
all virtual-devices anyway.  See my last reply.

-Greg



^ permalink raw reply

* Re: [PATCH net-next-2.6] ixgbe: correct the parameter description
From: Peter P Waskiewicz Jr @ 2009-09-24 18:26 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20090924143627.GD2919@psychotron.redhat.com>

On Thu, 2009-09-24 at 07:36 -0700, Jiri Pirko wrote:
> ccffad25b5136958d4769ed6de5e87992dd9c65c changed parameters for function
> ixgbe_update_uc_addr_list_generic but parameter description was not updated.
> This patch corrects it.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>

Thanks Jiri,

Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>


^ permalink raw reply

* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Peter P Waskiewicz Jr @ 2009-09-24 18:28 UTC (permalink / raw)
  To: Ben Greear; +Cc: NetDev
In-Reply-To: <4ABAA2D0.4030608@candelatech.com>

On Wed, 2009-09-23 at 15:36 -0700, Ben Greear wrote:
> When LRO is enabled, the received packet and byte counters represent the
> LRO'd packets, not the packets/bytes on the wire.  The Intel 82599 NIC has
> registers that keep count of the physical packets.  Add these counters to
> the ethtool stats.  The byte counters are 36-bit, but the high 4 bits were
> being ignored in the 2.6.31 ixgbe driver:  Read those as well to allow
> longer time between polling the stats to detect wraps.
> 
> Signed-off-by: Ben Greear <greearb@candelatech.com>
> 
> 
> Please do not apply this until the ixgbe authors ACK it.  There may
> have been reasons for not reading the high 4 bits, or they may dislike
> this approach entirely.

Aside from the trivial line-wrap on the comments, I'm fine with this
patch.  There is no issue I could find with the hardware that would
limit you from reading the high 4 bits.  And since we're reading it
already to clear the register, we might as well use the value we get
from it.
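
For anyone following along, a rough sketch of what using the high 4
bits buys.  This is not the driver code: the function names are
hypothetical, the two register halves are passed in as plain integers,
and the registers are assumed to clear on read.  At 10 Gb/s (at most
1.25e9 bytes/s) a 32-bit byte counter can overflow in roughly 3.4
seconds, while a 36-bit one takes roughly 55 seconds.

```c
#include <stdint.h>

/* Combine the low 32 bits with the 4 valid bits of the high register. */
static uint64_t ctr36_combine(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)(hi & 0xF) << 32) | lo;
}

/*
 * Add one poll's worth of bytes to a running 64-bit total.  Assumes the
 * hardware counter was cleared by the previous read, so each sample is
 * the byte count since the last poll.
 */
static void ctr36_accumulate(uint64_t *total, uint32_t lo, uint32_t hi)
{
	*total += ctr36_combine(lo, hi);
}
```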

Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>


^ permalink raw reply

* [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Anton Vorontsov @ 2009-09-24 18:31 UTC (permalink / raw)
  To: David Miller; +Cc: Rafael J. Wysocki, linux-pm, netdev

The following trace pops up if we try to suspend with a 3c59x ethernet NIC
brought down:

  root@b1:~# ifconfig eth16 down
  root@b1:~# echo mem > /sys/power/state
  ...
  3c59x 0000:00:10.0: suspend
  3c59x 0000:00:10.0: PME# disabled
  Trying to free already-free IRQ 48
  ------------[ cut here ]------------
  Badness at c00554e4 [verbose debug info unavailable]
  NIP: c00554e4 LR: c00554e4 CTR: c019a098
  REGS: c7975c60 TRAP: 0700   Not tainted  (2.6.31-rc4)
  MSR: 00021032 <ME,CE,IR,DR>  CR: 28242422  XER: 20000000
  TASK = c79cb0c0[1746] 'bash' THREAD: c7974000
  ...
  NIP [c00554e4] __free_irq+0x108/0x1b0
  LR [c00554e4] __free_irq+0x108/0x1b0
  Call Trace:
  [c7975d10] [c00554e4] __free_irq+0x108/0x1b0 (unreliable)
  [c7975d30] [c005559c] free_irq+0x10/0x24
  [c7975d40] [c01e21ec] vortex_suspend+0x70/0xc4
  [c7975d60] [c017e584] pci_legacy_suspend+0x58/0x100

This is because the driver manages interrupts without checking for
netif_running().

Though, there are a few other issues with suspend/resume in this driver.
The intention of calling free_irq() in suspend() was to avoid any
possible spurious interrupts (see commit 5b039e681b8c5f30aac9cc04385
"3c59x PM fixes"). But,

- On resume, the driver was requesting IRQ just after pci_set_master(),
  but before vortex_up() (which actually resets 3c59x chips).

- Issuing free_irq() on a shared IRQ doesn't guarantee that a buggy
  HW won't trigger spurious interrupts in another driver that
  requested the same interrupt. So, if we want to protect from
  unexpected interrupts, then on suspend we should issue disable_irq(),
  not free_irq().

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
---
 drivers/net/3c59x.c |   12 +++---------
 1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/net/3c59x.c b/drivers/net/3c59x.c
index c34aee9..7cdd4b0 100644
--- a/drivers/net/3c59x.c
+++ b/drivers/net/3c59x.c
@@ -807,10 +807,10 @@ static int vortex_suspend(struct pci_dev *pdev, pm_message_t state)
 		if (netif_running(dev)) {
 			netif_device_detach(dev);
 			vortex_down(dev, 1);
+			disable_irq(dev->irq);
 		}
 		pci_save_state(pdev);
 		pci_enable_wake(pdev, pci_choose_state(pdev, state), 0);
-		free_irq(dev->irq, dev);
 		pci_disable_device(pdev);
 		pci_set_power_state(pdev, pci_choose_state(pdev, state));
 	}
@@ -833,18 +833,12 @@ static int vortex_resume(struct pci_dev *pdev)
 			return err;
 		}
 		pci_set_master(pdev);
-		if (request_irq(dev->irq, vp->full_bus_master_rx ?
-				&boomerang_interrupt : &vortex_interrupt, IRQF_SHARED, dev->name, dev)) {
-			pr_warning("%s: Could not reserve IRQ %d\n", dev->name, dev->irq);
-			pci_disable_device(pdev);
-			return -EBUSY;
-		}
 		if (netif_running(dev)) {
 			err = vortex_up(dev);
 			if (err)
 				return err;
-			else
-				netif_device_attach(dev);
+			enable_irq(dev->irq);
+			netif_device_attach(dev);
 		}
 	}
 	return 0;
-- 
1.6.3.3

^ permalink raw reply related

* [PATCH] inet_peer: Optimize inet_getid()
From: Eric Dumazet @ 2009-09-24 19:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

While investigating network latencies, I found inet_getid() was a contention point
for some workloads.

Fix is straightforward, using cmpxchg() instead of
a spin_lock_bh()/spin_unlock_bh() pair on a central lock.

Another possibility was to use an atomic_t and atomic_add_return() but
the size of struct inet_peer object would have doubled on x86_64 because of
SLAB_HWCACHE_ALIGN constraint.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inetpeer.h |   16 ++++++++--------
 net/ipv4/inetpeer.c    |    3 ---
 2 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/include/net/inetpeer.h b/include/net/inetpeer.h
index 15e1f8f..952f0ad 100644
--- a/include/net/inetpeer.h
+++ b/include/net/inetpeer.h
@@ -37,17 +37,17 @@ struct inet_peer	*inet_getpeer(__be32 daddr, int create);
 /* can be called from BH context or outside */
 extern void inet_putpeer(struct inet_peer *p);
 
-extern spinlock_t inet_peer_idlock;
 /* can be called with or without local BH being disabled */
 static inline __u16	inet_getid(struct inet_peer *p, int more)
 {
-	__u16 id;
-
-	spin_lock_bh(&inet_peer_idlock);
-	id = p->ip_id_count;
-	p->ip_id_count += 1 + more;
-	spin_unlock_bh(&inet_peer_idlock);
-	return id;
+	__u16 old;
+
+	while (1) {
+		old = p->ip_id_count;
+		if (cmpxchg(&p->ip_id_count, old, old + 1 + more) == old)
+			break;
+	}
+	return old;
 }
 
 #endif /* _NET_INETPEER_H */
diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index b1fbe18..5dc29b8 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -67,9 +67,6 @@
  *		ip_id_count: idlock
  */
 
-/* Exported for inet_getid inline function.  */
-DEFINE_SPINLOCK(inet_peer_idlock);
-
 static struct kmem_cache *peer_cachep __read_mostly;
 
 #define node_height(x) x->avl_height
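
For readers who want to try the idea outside the kernel, the same retry
loop can be sketched in userspace with C11 atomics.  This is only an
illustration, not the patch itself: getid_sketch() and the bare
_Atomic uint16_t counter are assumptions standing in for inet_getid()
and p->ip_id_count.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Returns the old counter value and advances it by 1 + more, lock-free. */
static uint16_t getid_sketch(_Atomic uint16_t *count, int more)
{
	uint16_t old = atomic_load(count);

	/*
	 * On failure, atomic_compare_exchange_weak() stores the current
	 * value into 'old', so the loop retries with fresh data.
	 */
	while (!atomic_compare_exchange_weak(count, &old,
					     (uint16_t)(old + 1 + more)))
		;
	return old;
}
```

The counter wraps modulo 65536, the same as the 16-bit field it stands
in for.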

^ permalink raw reply related

* Re: [PATCH net-next-2.6] ixgbe: correct the parameter description
From: Jeff Kirsher @ 2009-09-24 19:08 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr
  Cc: Jiri Pirko, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1253816772.3153.2.camel@localhost.localdomain>

On Thu, Sep 24, 2009 at 11:26, Peter P Waskiewicz Jr
<peter.p.waskiewicz.jr@intel.com> wrote:
> On Thu, 2009-09-24 at 07:36 -0700, Jiri Pirko wrote:
>> ccffad25b5136958d4769ed6de5e87992dd9c65c changed parameters for function
>> ixgbe_update_uc_addr_list_generic but parameter description was not updated.
>> This patch corrects it.
>>
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>
> Thanks Jiri,
>
> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>

Jiri/Dave-

I already have a few ixgbe patches in my queue, so I will add this
patch to my queue and push it along with my other patches to
Dave/netdev.

-- 
Cheers,
Jeff

^ permalink raw reply

* Re: ixgbe patch to provide NIC's tx/rx counters via ethtool
From: Jeff Kirsher @ 2009-09-24 19:10 UTC (permalink / raw)
  To: Peter P Waskiewicz Jr; +Cc: Ben Greear, NetDev
In-Reply-To: <1253816908.3153.4.camel@localhost.localdomain>

On Thu, Sep 24, 2009 at 11:28, Peter P Waskiewicz Jr
<peter.p.waskiewicz.jr@intel.com> wrote:
> On Wed, 2009-09-23 at 15:36 -0700, Ben Greear wrote:
>> When LRO is enabled, the received packet and byte counters represent the
>> LRO'd packets, not the packets/bytes on the wire.  The Intel 82599 NIC has
>> registers that keep count of the physical packets.  Add these counters to
>> the ethtool stats.  The byte counters are 36-bit, but the high 4 bits were
>> being ignored in the 2.6.31 ixgbe driver:  Read those as well to allow
>> longer time between polling the stats to detect wraps.
>>
>> Signed-off-by: Ben Greear <greearb@candelatech.com>
>>
>>
>> Please do not apply this until the ixgbe authors ACK it.  There may
>> have been reasons for not reading the high 4 bits, or they may dislike
>> this approach entirely.
>
> Aside from the trivial line-wrap on the comments, I'm fine with this
> patch.  There is no issue I could find with the hardware that would
> limit you from reading the high 4 bits.  And since we're reading it
> already to clear the register, we might as well use the value we get
> from it.
>
> Acked-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
>

I have added this patch to my tree and will push along with my other
ixgbe patches to dave/netdev.  Thanks.

-- 
Cheers,
Jeff

^ permalink raw reply

* Re: question on raw sockets and source IP address validation
From: Neil Horman @ 2009-09-24 19:26 UTC (permalink / raw)
  To: Chris Friesen; +Cc: Linux Network Development list
In-Reply-To: <4ABBB223.8090700@nortel.com>

On Thu, Sep 24, 2009 at 11:53:39AM -0600, Chris Friesen wrote:
> Hi all,
> 
> Normally when sending a packet on a SOCK_RAW socket the source IP
> address is validated against the addresses configured on the host.  If
> the address isn't configured, the packet isn't sent.
> 
> This can be avoided by setting IP_HDRINCL, but then the app needs to
> handle all the fragmentation itself.
> 
> Is there any way to bypass the source address validation without IP_HDRINCL?
> 
Nope, not with socket(AF_INET, SOCK_RAW, ...).  It's an IPv4 socket, so you get
ipv4 routing.  If you don't want the ipv4 behavior, you can always use
AF_PACKET to send raw frames directly to network interfaces.  Of course, that's
going to imply that you do all your ip level fragmentation yourself as well.

That said, it's not doing source validation; your socket is actually doing a
route lookup on the flow from your specified source address to your destination
address.  So you should be able to fool the socket into doing the lookup by
adding a route to your routing table from your source address to your
destination address via the interface that you want to send the frames out of.
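
If you do end up going the IP_HDRINCL route, the header itself is not
much code.  A minimal sketch, assuming a 20-byte header with no options;
build_ip_header() and ip_checksum() are names made up for this example,
and fragmentation is left out entirely:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* RFC 1071 Internet checksum over 'len' bytes, returned in host order. */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Fill a 20-byte IPv4 header (no options) into hdr[].  src and dst are
 * the four address bytes in network order; src can be any address,
 * which is the whole point of IP_HDRINCL.
 */
static void build_ip_header(uint8_t hdr[20], const uint8_t src[4],
			    const uint8_t dst[4], uint8_t proto,
			    uint16_t payload_len)
{
	uint16_t total = (uint16_t)(20 + payload_len);
	uint16_t csum;

	memset(hdr, 0, 20);
	hdr[0] = 0x45;			/* version 4, IHL 5 words */
	hdr[2] = (uint8_t)(total >> 8);	/* total length, big endian */
	hdr[3] = (uint8_t)(total & 0xFF);
	hdr[8] = 64;			/* TTL */
	hdr[9] = proto;
	memcpy(&hdr[12], src, 4);
	memcpy(&hdr[16], dst, 4);

	csum = ip_checksum(hdr, 20);	/* checksum field is still zero */
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xFF);
}
```

The buffer (header plus payload) would then go to sendto() on a
socket(AF_INET, SOCK_RAW, ...) with IP_HDRINCL set.  Per raw(7), Linux
fills in the checksum and total length itself, so computing them here is
harmless but not strictly needed.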

Regards
Neil

> Thanks,
> 
> Chris

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Ira W. Snyder @ 2009-09-24 19:27 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Gregory Haskins, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABB1D44.5000007@redhat.com>

On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote:
> On 09/24/2009 12:15 AM, Gregory Haskins wrote:
> >
> >>> There are various aspects about designing high-performance virtual
> >>> devices such as providing the shortest paths possible between the
> >>> physical resources and the consumers.  Conversely, we also need to
> >>> ensure that we meet proper isolation/protection guarantees at the same
> >>> time.  What this means is there are various aspects to any
> >>> high-performance PV design that require to be placed in-kernel to
> >>> maximize the performance yet properly isolate the guest.
> >>>
> >>> For instance, you are required to have your signal-path (interrupts and
> >>> hypercalls), your memory-path (gpa translation), and
> >>> addressing/isolation model in-kernel to maximize performance.
> >>>
> >>>        
> >> Exactly.  That's what vhost puts into the kernel and nothing more.
> >>      
> > Actually, no.  Generally, _KVM_ puts those things into the kernel, and
> > vhost consumes them.  Without KVM (or something equivalent), vhost is
> > incomplete.  One of my goals with vbus is to generalize the "something
> > equivalent" part here.
> >    
> 
> I don't really see how vhost and vbus are different here.  vhost expects 
> signalling to happen through a couple of eventfds and requires someone 
> to supply them and implement kernel support (if needed).  vbus requires 
> someone to write a connector to provide the signalling implementation.  
> Neither will work out-of-the-box when implementing virtio-net over 
> falling dominos, for example.
> 
> >>> Vbus accomplishes its in-kernel isolation model by providing a
> >>> "container" concept, where objects are placed into this container by
> >>> userspace.  The host kernel enforces isolation/protection by using a
> >>> namespace to identify objects that is only relevant within a specific
> >>> container's context (namely, a "u32 dev-id").  The guest addresses the
> >>> objects by its dev-id, and the kernel ensures that the guest can't
> >>> access objects outside of its dev-id namespace.
> >>>
> >>>        
> >> vhost manages to accomplish this without any kernel support.
> >>      
> > No, vhost manages to accomplish this because of KVMs kernel support
> > (ioeventfd, etc).   Without a KVM-like in-kernel support, vhost is a
> > merely a kind of "tuntap"-like clone signalled by eventfds.
> >    
> 
> Without a vbus-connector-falling-dominos, vbus-venet can't do anything 
> either.  Both vhost and vbus need an interface, vhost's is just narrower 
> since it doesn't do configuration or enumeration.
> 
> > This goes directly to my rebuttal of your claim that vbus places too
> > much in the kernel.  I state that, one way or the other, address decode
> > and isolation _must_ be in the kernel for performance.  Vbus does this
> > with a devid/container scheme.  vhost+virtio-pci+kvm does it with
> > pci+pio+ioeventfd.
> >    
> 
> vbus doesn't do kvm guest address decoding for the fast path.  It's 
> still done by ioeventfd.
> 
> >>   The guest
> >> simply has not access to any vhost resources other than the guest->host
> >> doorbell, which is handed to the guest outside vhost (so it's somebody
> >> else's problem, in userspace).
> >>      
> > You mean _controlled_ by userspace, right?  Obviously, the other side of
> > the kernel still needs to be programmed (ioeventfd, etc).  Otherwise,
> > vhost would be pointless: e.g. just use vanilla tuntap if you don't need
> > fast in-kernel decoding.
> >    
> 
> Yes (though for something like level-triggered interrupts we're probably 
> keeping it in userspace, enjoying the benefits of vhost data path while 
> paying more for signalling).
> 
> >>> All that is required is a way to transport a message with a "devid"
> >>> attribute as an address (such as DEVCALL(devid)) and the framework
> >>> provides the rest of the decode+execute function.
> >>>
> >>>        
> >> vhost avoids that.
> >>      
> > No, it doesn't avoid it.  It just doesn't specify how its done, and
> > relies on something else to do it on its behalf.
> >    
> 
> That someone else can be in userspace, apart from the actual fast path.
> 
> > Conversely, vbus specifies how its done, but not how to transport the
> > verb "across the wire".  That is the role of the vbus-connector abstraction.
> >    
> 
> So again, vbus does everything in the kernel (since it's so easy and 
> cheap) but expects a vbus-connector.  vhost does configuration in 
> userspace (since it's so clunky and fragile) but expects a couple of 
> eventfds.
> 
> >>> Contrast this to vhost+virtio-pci (called simply "vhost" from here).
> >>>
> >>>        
> >> It's the wrong name.  vhost implements only the data path.
> >>      
> > Understood, but vhost+virtio-pci is what I am contrasting, and I use
> > "vhost" for short from that point on because I am too lazy to type the
> > whole name over and over ;)
> >    
> 
> If you #define A A+B+C don't expect intelligent conversation afterwards.
> 
> >>> It is not immune to requiring in-kernel addressing support either, but
> >>> rather it just does it differently (and it's not as you might expect via
> >>> qemu).
> >>>
> >>> Vhost relies on QEMU to render PCI objects to the guest, which the guest
> >>> assigns resources (such as BARs, interrupts, etc).
> >>>        
> >> vhost does not rely on qemu.  It relies on its user to handle
> >> configuration.  In one important case it's qemu+pci.  It could just as
> >> well be the lguest launcher.
> >>      
> > I meant vhost=vhost+virtio-pci here.  Sorry for the confusion.
> >
> > The point I am making specifically is that vhost in general relies on
> > other in-kernel components to function.  I.e. It cannot function without
> > having something like the PCI model to build an IO namespace.  That
> > namespace (in this case, pio addresses+data tuples) is used for the
> > in-kernel addressing function under KVM + virtio-pci.
> >
> > The case of the lguest launcher is a good one to highlight.  Yes, you
> > can presumably also use lguest with vhost, if the requisite facilities
> > are exposed to lguest-bus, and some eventfd based thing like ioeventfd
> > is written for the host (if it doesn't exist already).
> >
> > And when the next virt design "foo" comes out, it can make a "foo-bus"
> > model, and implement foo-eventfd on the backend, etc, etc.
> >    
> 
> It's exactly the same with vbus needing additional connectors for 
> additional transports.
> 
> > Ira can make ira-bus, and ira-eventfd, etc, etc.
> >
> > Each iteration will invariably introduce duplicated parts of the stack.
> >    
> 
> Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).
> 

Referencing libraries that don't yet exist doesn't seem like a good
argument against vbus from my point of view. I'm not specifically
advocating for vbus; I'm just letting you know how it looks to another
developer in the trenches.

If you'd like to see the amount of duplication present, look at the code
I'm currently working on. It mostly works at this point, though I
haven't finished my userspace, nor figured out how to actually transfer
data.

The current question I have (just to let you know where I am in
development) is:

I have the physical address of the remote data, but how do I get it into
a userspace buffer, so I can pass it to tun?

http://www.mmarray.org/~iws/virtio-phys/

Ira
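
[Editorial sketch] The quoted discussion repeatedly names eventfds as
the signalling primitive vhost expects (ioeventfd for the guest->host
doorbell). For readers unfamiliar with how an eventfd behaves, here is
a minimal userspace sketch using only the Linux eventfd(2) call. It
involves neither KVM nor vhost, and the function name ring_and_drain()
is hypothetical, chosen for illustration only.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Ring an eventfd "doorbell" `kicks` times, then drain it with one read.
 * Returns the value the read reported, or -1 on any error (including
 * the case where nothing was written and the non-blocking read fails).
 */
int64_t ring_and_drain(unsigned int kicks)
{
    uint64_t one = 1;
    uint64_t seen = 0;
    unsigned int i;
    int fd = eventfd(0, EFD_NONBLOCK);   /* counter starts at 0 */

    if (fd < 0)
        return -1;

    /* Each 8-byte write adds its value to the kernel-side counter. */
    for (i = 0; i < kicks; i++) {
        if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            close(fd);
            return -1;
        }
    }

    /* Without EFD_SEMAPHORE, one read returns the sum and resets it. */
    if (read(fd, &seen, sizeof(seen)) != (ssize_t)sizeof(seen)) {
        close(fd);
        return -1;
    }

    close(fd);
    return (int64_t)seen;
}
```

Several kicks coalesce into a single wakeup on the consumer side, which
is the behaviour one usually wants from a doorbell: the consumer learns
that work is pending, not how to decode it.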


^ permalink raw reply

* Re: [PATCH] inet_peer: Optimize inet_getid()
From: Stephen Hemminger @ 2009-09-24 19:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <4ABBC2D8.2040901@gmail.com>

On Thu, 24 Sep 2009 21:04:56 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> While investigating network latencies, I found inet_getid() was a contention point
> for some workloads.
> 
> Fix is straightforward, using cmpxchg() instead of
> a spin_lock_bh()/spin_unlock_bh() pair on a central lock.
> 
> Another possibility was to use an atomic_t and atomic_add_return() but
> the size of struct inet_peer object would have doubled on x86_64 because of
> SLAB_HWCACHE_ALIGN constraint.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I thought cmpxchg was not available on all architectures.
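
[Editorial sketch] To make the technique under discussion concrete,
here is a rough userspace illustration of reserving a block of IDs with
a compare-and-swap loop instead of a lock. It is not the patch itself:
it uses C11 <stdatomic.h> rather than the kernel's cmpxchg(), and the
names struct peer and peer_getid() are hypothetical. The 16-bit counter
and the "more" argument are assumptions modelled on the description
above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-peer state; only the ID counter. */
struct peer {
    _Atomic uint16_t ip_id_count;
};

/*
 * Reserve (more + 1) consecutive IDs and return the first one.
 * No lock is taken: if another thread changed the counter between
 * our load and our compare-and-swap, we retry with the fresh value.
 */
uint16_t peer_getid(struct peer *p, unsigned int more)
{
    uint16_t old = atomic_load(&p->ip_id_count);
    uint16_t next;

    do {
        /* Wraps modulo 2^16, as a 16-bit ID counter would. */
        next = (uint16_t)(old + more + 1);
        /* On failure, `old` is reloaded with the current value. */
    } while (!atomic_compare_exchange_weak(&p->ip_id_count, &old, next));

    return old;
}
```

Starting from a counter of 100, successive calls with more = 0, 2, 0
return 100, 101 and 104. Whether a compare-and-swap primitive is
available on every target is exactly the portability question raised
above; this sketch sidesteps it by relying on the C11 library.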

^ permalink raw reply

* Re: question on raw sockets and source IP address validation
From: Chris Friesen @ 2009-09-24 19:37 UTC (permalink / raw)
  To: Neil Horman; +Cc: Linux Network Development list
In-Reply-To: <20090924192651.GC19787@hmsreliant.think-freely.org>

On 09/24/2009 01:26 PM, Neil Horman wrote:

> That said, it's not doing source validation; your socket is actually doing a
> route lookup on the flow from your specified source address to your destination
> address.  So you should be able to fool the socket into doing the lookup by
> adding a route to your routing table from your source address to your
> destination address via the interface that you want to send the frames out of.

Hmm...that's an interesting point.  Worth investigating for sure.

Thanks,

Chris


^ permalink raw reply

