Netdev List
 help / color / mirror / Atom feed
* Re: [net-next PATCH v2 00/12] Flow API
From: Or Gerlitz @ 2015-01-14 15:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, simon.horman, Scott Feldman, Linux Netdev List,
	Jamal Hadi Salim, Andy Gospodarek, David Miller,
	Alexei Starovoitov
In-Reply-To: <54B680EB.8040404@gmail.com>

On Wed, Jan 14, 2015 at 4:44 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 01/13/2015 10:29 PM, Or Gerlitz wrote:
>>
>> On Tue, Jan 13, 2015 at 11:35 PM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>>
>>> I tried to roll in all the feedback from v1 into this series annotated
>>> here,

>> Can you please drop V2 to your
>> https://github.com/jrfastab/rocker-net-next.git tree?

> Should be there now, but as Alexei noted I seem to have missed updating
> the patch commit messages so they still reference the old names.

Oh, I see it now.. yep, with the old names and dates to Dec 30th...

^ permalink raw reply

* Re: [REMINDER] iproute2 ss: Some thoughts about additional info output layout
From: Vadim Kochan @ 2015-01-14 14:57 UTC (permalink / raw)
  To: netdev; +Cc: vadim4j
In-Reply-To: <20150108221240.GA23636@angus-think.lan>

On Fri, Jan 09, 2015 at 12:12:40AM +0200, Vadim Kochan wrote:
> Hi,
> 
> I think that current output of ss utility visually looks little weird
> when additional info options were specified, so I feel that a lot of
> yours will say that it will break existing scripts of ss output parsing
> but I will try, so thats how I think ss output would looks better in
> case if additionally info options were specified (I am not sure how it
> would looks in the email) :
> 
> 
> u_str  ESTAB      0      0                                                   * 14875                                                             * 14876
> users:("terminator",pid=30585,fd=8),
>       ("terminator",pid=29450,fd=8),
>       ("firefox",pid=21863,fd=8),
>       ("terminator",pid=18927,fd=8),
>       ("terminator",pid=17955,fd=8),
>       ("terminator",pid=9620,fd=8),
>       ("terminator",pid=7894,fd=8),
>       ("terminator",pid=7245,fd=8),
>       ("terminator",pid=2542,fd=8),
>       ("qtile",pid=654,fd=8)
> skmem:(r0,rb212992,t0,tb212992,f0,w0,o0,bl0) <-> 
> 
> u_str  ESTAB      0      0                                                   * 7550221                                                           * 7550222      
> users:("terminator",pid=9620,fd=12)
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0) <-> 
> 
> udp    ESTAB      0      0                                           10.3.5.18:ipproto-58744                                           192.168.1.1:swipe        
> users:("firefox",pid=21863,fd=54) 
> uid:1000 ino:7328962 sk:ffff880212f77bc0 <->
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0)
> 
> udp    ESTAB      0      0                           fe80::6e88:14ff:feac:51e4:ipproto-33582                                     fd6b:a256:8c2e::1:swipe        
> users:("firefox",pid=21863,fd=62) 
> uid:1000 ino:7328974 sk:ffff880205a76e80 <->
> skmem:(r0,rb4194304,t0,tb4194304,f0,w0,o0,bl0)
> 
> tcp    ESTAB      0      0                                       192.168.1.198:57851                                               173.194.113.205:https        
> users:("firefox",pid=21863,fd=67) 
> uid:1000 ino:7601516 sk:ffff880102e94d00 <->
> skmem:(r0,rb372480,t0,tb87040,f0,w0,o0,bl0)
> ts sack cubic
> wscale:7,11
> rto:253.333 rtt:50.609/18.209 ato:40
> mss:1448 cwnd:10
> send 2.3Mbps lastsnd:46636 lastrcv:46570 lastack:46593 pacing_rate 4.6Mbps rcv_rtt:73.333 rcv_space:42340
> 
> tcp    ESTAB      0      0                                       192.168.1.198:47385                                                74.125.143.104:https        
> users:("firefox",pid=21863,fd=63)
> uid:1000 ino:7602727 sk:ffff880007a1db00 <->
> skmem:(r0,rb372480,t0,tb87040,f0,w0,o0,bl0)
> ts sack cubic wscale:7,11 rto:296.666 rtt:95.496/22.756 ato:40
> mss:1418 cwnd:10
> send 1.2Mbps lastsnd:58640 lastrcv:58550 lastack:58550 pacing_rate 2.4Mbps rcv_rtt:96.666 rcv_space:42340
> 
> Regards,
> Vadim Kochan

Jut reminder may be it was missed within a lots of emails.

Thanks,

^ permalink raw reply

* Re: [PATCH 1/2] net/macb: Adding comments to various #defs to make interpretation easier
From: Nicolas Ferre @ 2015-01-14 15:10 UTC (permalink / raw)
  To: Xander Huff
  Cc: jaeden.amero, rich.tollerton, ben.shelton, brad.mouring, netdev,
	linux-kernel
In-Reply-To: <1421187351-27279-1-git-send-email-xander.huff@ni.com>

Le 13/01/2015 23:15, Xander Huff a écrit :
> This change is to help improve at-a-glace knowledge of the purpose of the
> various Cadence MACB/GEM registers. Comments are more helpful for human
> readability than short acronyms.
> 
> Describe various #define varibles Cadence MACB/GEM registers as documented
> in Xilinix's "Zynq-7000 All Programmable SoC TechnicalReference Manual, v1.9.1
> (UG-585)"
> 
> Signed-off-by: Xander Huff <xander.huff@ni.com>

For the record:
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>

> ---
>  drivers/net/ethernet/cadence/macb.h | 269 ++++++++++++++++++++++--------------
>  1 file changed, 162 insertions(+), 107 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 084191b..8e8c3c9 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -15,20 +15,20 @@
>  #define MACB_MAX_QUEUES 8
>  
>  /* MACB register offsets */
> -#define MACB_NCR				0x0000
> -#define MACB_NCFGR				0x0004
> -#define MACB_NSR				0x0008
> +#define MACB_NCR				0x0000 /* Network Control */
> +#define MACB_NCFGR				0x0004 /* Network Config */
> +#define MACB_NSR				0x0008 /* Network Status */
>  #define MACB_TAR				0x000c /* AT91RM9200 only */
>  #define MACB_TCR				0x0010 /* AT91RM9200 only */
> -#define MACB_TSR				0x0014
> -#define MACB_RBQP				0x0018
> -#define MACB_TBQP				0x001c
> -#define MACB_RSR				0x0020
> -#define MACB_ISR				0x0024
> -#define MACB_IER				0x0028
> -#define MACB_IDR				0x002c
> -#define MACB_IMR				0x0030
> -#define MACB_MAN				0x0034
> +#define MACB_TSR				0x0014 /* Transmit Status */
> +#define MACB_RBQP				0x0018 /* RX Q Base Address */
> +#define MACB_TBQP				0x001c /* TX Q Base Address */
> +#define MACB_RSR				0x0020 /* Receive Status */
> +#define MACB_ISR				0x0024 /* Interrupt Status */
> +#define MACB_IER				0x0028 /* Interrupt Enable */
> +#define MACB_IDR				0x002c /* Interrupt Disable */
> +#define MACB_IMR				0x0030 /* Interrupt Mask */
> +#define MACB_MAN				0x0034 /* PHY Maintenance */
>  #define MACB_PTR				0x0038
>  #define MACB_PFR				0x003c
>  #define MACB_FTO				0x0040
> @@ -68,27 +68,27 @@
>  #define MACB_MID				0x00fc
>  
>  /* GEM register offsets. */
> -#define GEM_NCFGR				0x0004
> -#define GEM_USRIO				0x000c
> -#define GEM_DMACFG				0x0010
> -#define GEM_HRB					0x0080
> -#define GEM_HRT					0x0084
> -#define GEM_SA1B				0x0088
> -#define GEM_SA1T				0x008C
> -#define GEM_SA2B				0x0090
> -#define GEM_SA2T				0x0094
> -#define GEM_SA3B				0x0098
> -#define GEM_SA3T				0x009C
> -#define GEM_SA4B				0x00A0
> -#define GEM_SA4T				0x00A4
> -#define GEM_OTX					0x0100
> -#define GEM_DCFG1				0x0280
> -#define GEM_DCFG2				0x0284
> -#define GEM_DCFG3				0x0288
> -#define GEM_DCFG4				0x028c
> -#define GEM_DCFG5				0x0290
> -#define GEM_DCFG6				0x0294
> -#define GEM_DCFG7				0x0298
> +#define GEM_NCFGR				0x0004 /* Network Config */
> +#define GEM_USRIO				0x000c /* User IO */
> +#define GEM_DMACFG				0x0010 /* DMA Configuration */
> +#define GEM_HRB					0x0080 /* Hash Bottom */
> +#define GEM_HRT					0x0084 /* Hash Top */
> +#define GEM_SA1B				0x0088 /* Specific1 Bottom */
> +#define GEM_SA1T				0x008C /* Specific1 Top */
> +#define GEM_SA2B				0x0090 /* Specific2 Bottom */
> +#define GEM_SA2T				0x0094 /* Specific2 Top */
> +#define GEM_SA3B				0x0098 /* Specific3 Bottom */
> +#define GEM_SA3T				0x009C /* Specific3 Top */
> +#define GEM_SA4B				0x00A0 /* Specific4 Bottom */
> +#define GEM_SA4T				0x00A4 /* Specific4 Top */
> +#define GEM_OTX					0x0100 /* Octets transmitted */
> +#define GEM_DCFG1				0x0280 /* Design Config 1 */
> +#define GEM_DCFG2				0x0284 /* Design Config 2 */
> +#define GEM_DCFG3				0x0288 /* Design Config 3 */
> +#define GEM_DCFG4				0x028c /* Design Config 4 */
> +#define GEM_DCFG5				0x0290 /* Design Config 5 */
> +#define GEM_DCFG6				0x0294 /* Design Config 6 */
> +#define GEM_DCFG7				0x0298 /* Design Config 7 */
>  
>  #define GEM_ISR(hw_q)				(0x0400 + ((hw_q) << 2))
>  #define GEM_TBQP(hw_q)				(0x0440 + ((hw_q) << 2))
> @@ -98,67 +98,73 @@
>  #define GEM_IMR(hw_q)				(0x0640 + ((hw_q) << 2))
>  
>  /* Bitfields in NCR */
> -#define MACB_LB_OFFSET				0
> +#define MACB_LB_OFFSET				0 /* reserved */
>  #define MACB_LB_SIZE				1
> -#define MACB_LLB_OFFSET				1
> +#define MACB_LLB_OFFSET				1 /* Loop back local */
>  #define MACB_LLB_SIZE				1
> -#define MACB_RE_OFFSET				2
> +#define MACB_RE_OFFSET				2 /* Receive enable */
>  #define MACB_RE_SIZE				1
> -#define MACB_TE_OFFSET				3
> +#define MACB_TE_OFFSET				3 /* Transmit enable */
>  #define MACB_TE_SIZE				1
> -#define MACB_MPE_OFFSET				4
> +#define MACB_MPE_OFFSET				4 /* Management port enable */
>  #define MACB_MPE_SIZE				1
> -#define MACB_CLRSTAT_OFFSET			5
> +#define MACB_CLRSTAT_OFFSET			5 /* Clear stats regs */
>  #define MACB_CLRSTAT_SIZE			1
> -#define MACB_INCSTAT_OFFSET			6
> +#define MACB_INCSTAT_OFFSET			6 /* Incremental stats regs */
>  #define MACB_INCSTAT_SIZE			1
> -#define MACB_WESTAT_OFFSET			7
> +#define MACB_WESTAT_OFFSET			7 /* Write enable stats regs */
>  #define MACB_WESTAT_SIZE			1
> -#define MACB_BP_OFFSET				8
> +#define MACB_BP_OFFSET				8 /* Back pressure */
>  #define MACB_BP_SIZE				1
> -#define MACB_TSTART_OFFSET			9
> +#define MACB_TSTART_OFFSET			9 /* Start transmission */
>  #define MACB_TSTART_SIZE			1
> -#define MACB_THALT_OFFSET			10
> +#define MACB_THALT_OFFSET			10 /* Transmit halt */
>  #define MACB_THALT_SIZE				1
> -#define MACB_NCR_TPF_OFFSET			11
> +#define MACB_NCR_TPF_OFFSET			11 /* Transmit pause frame */
>  #define MACB_NCR_TPF_SIZE			1
> -#define MACB_TZQ_OFFSET				12
> +#define MACB_TZQ_OFFSET				12 /* Transmit zero quantum
> +						    * pause frame
> +						    */
>  #define MACB_TZQ_SIZE				1
>  
>  /* Bitfields in NCFGR */
> -#define MACB_SPD_OFFSET				0
> +#define MACB_SPD_OFFSET				0 /* Speed */
>  #define MACB_SPD_SIZE				1
> -#define MACB_FD_OFFSET				1
> +#define MACB_FD_OFFSET				1 /* Full duplex */
>  #define MACB_FD_SIZE				1
> -#define MACB_BIT_RATE_OFFSET			2
> +#define MACB_BIT_RATE_OFFSET			2 /* Discard non-VLAN frames */
>  #define MACB_BIT_RATE_SIZE			1
> -#define MACB_JFRAME_OFFSET			3
> +#define MACB_JFRAME_OFFSET			3 /* reserved */
>  #define MACB_JFRAME_SIZE			1
> -#define MACB_CAF_OFFSET				4
> +#define MACB_CAF_OFFSET				4 /* Copy all frames */
>  #define MACB_CAF_SIZE				1
> -#define MACB_NBC_OFFSET				5
> +#define MACB_NBC_OFFSET				5 /* No broadcast */
>  #define MACB_NBC_SIZE				1
> -#define MACB_NCFGR_MTI_OFFSET			6
> +#define MACB_NCFGR_MTI_OFFSET			6 /* Multicast hash enable */
>  #define MACB_NCFGR_MTI_SIZE			1
> -#define MACB_UNI_OFFSET				7
> +#define MACB_UNI_OFFSET				7 /* Unicast hash enable */
>  #define MACB_UNI_SIZE				1
> -#define MACB_BIG_OFFSET				8
> +#define MACB_BIG_OFFSET				8 /* Receive 1536 byte frames */
>  #define MACB_BIG_SIZE				1
> -#define MACB_EAE_OFFSET				9
> +#define MACB_EAE_OFFSET				9 /* External address match
> +						   * enable
> +						   */
>  #define MACB_EAE_SIZE				1
>  #define MACB_CLK_OFFSET				10
>  #define MACB_CLK_SIZE				2
> -#define MACB_RTY_OFFSET				12
> +#define MACB_RTY_OFFSET				12 /* Retry test */
>  #define MACB_RTY_SIZE				1
> -#define MACB_PAE_OFFSET				13
> +#define MACB_PAE_OFFSET				13 /* Pause enable */
>  #define MACB_PAE_SIZE				1
>  #define MACB_RM9200_RMII_OFFSET			13 /* AT91RM9200 only */
>  #define MACB_RM9200_RMII_SIZE			1  /* AT91RM9200 only */
> -#define MACB_RBOF_OFFSET			14
> +#define MACB_RBOF_OFFSET			14 /* Receive buffer offset */
>  #define MACB_RBOF_SIZE				2
> -#define MACB_RLCE_OFFSET			16
> +#define MACB_RLCE_OFFSET			16 /* Length field error frame
> +						    * discard
> +						    */
>  #define MACB_RLCE_SIZE				1
> -#define MACB_DRFCS_OFFSET			17
> +#define MACB_DRFCS_OFFSET			17 /* FCS remove */
>  #define MACB_DRFCS_SIZE				1
>  #define MACB_EFRHD_OFFSET			18
>  #define MACB_EFRHD_SIZE				1
> @@ -166,111 +172,160 @@
>  #define MACB_IRXFCS_SIZE			1
>  
>  /* GEM specific NCFGR bitfields. */
> -#define GEM_GBE_OFFSET				10
> +#define GEM_GBE_OFFSET				10 /* Gigabit mode enable */
>  #define GEM_GBE_SIZE				1
> -#define GEM_CLK_OFFSET				18
> +#define GEM_CLK_OFFSET				18 /* MDC clock division */
>  #define GEM_CLK_SIZE				3
> -#define GEM_DBW_OFFSET				21
> +#define GEM_DBW_OFFSET				21 /* Data bus width */
>  #define GEM_DBW_SIZE				2
>  #define GEM_RXCOEN_OFFSET			24
>  #define GEM_RXCOEN_SIZE				1
>  
>  /* Constants for data bus width. */
> -#define GEM_DBW32				0
> -#define GEM_DBW64				1
> -#define GEM_DBW128				2
> +#define GEM_DBW32				0 /* 32 bit AMBA AHB data bus
> +						   * width
> +						   */
> +#define GEM_DBW64				1 /* 64 bit AMBA AHB data bus
> +						   * width
> +						   */
> +#define GEM_DBW128				2 /* 128 bit AMBA AHB data bus
> +						   * width
> +						   */
>  
>  /* Bitfields in DMACFG. */
> -#define GEM_FBLDO_OFFSET			0
> +#define GEM_FBLDO_OFFSET			0 /* AHB fixed burst length for
> +						   * DMA data operations
> +						   */
>  #define GEM_FBLDO_SIZE				5
> -#define GEM_ENDIA_OFFSET			7
> +#define GEM_ENDIA_OFFSET			7 /* AHB endian swap mode enable
> +						   * for packet data accesses
> +						   */
>  #define GEM_ENDIA_SIZE				1
> -#define GEM_RXBMS_OFFSET			8
> +#define GEM_RXBMS_OFFSET			8 /* Receiver packet buffer
> +						   * memory size select
> +						   */
>  #define GEM_RXBMS_SIZE				2
> -#define GEM_TXPBMS_OFFSET			10
> +#define GEM_TXPBMS_OFFSET			10 /* Transmitter packet buffer
> +						    * memory size select
> +						    */
>  #define GEM_TXPBMS_SIZE				1
> -#define GEM_TXCOEN_OFFSET			11
> +#define GEM_TXCOEN_OFFSET			11 /* Transmitter IP, TCP and
> +						    * UDP checksum generation
> +						    * offload enable
> +						    */
>  #define GEM_TXCOEN_SIZE				1
> -#define GEM_RXBS_OFFSET				16
> +#define GEM_RXBS_OFFSET				16 /* DMA receive buffer size in
> +						    * AHB system memory
> +						    */
>  #define GEM_RXBS_SIZE				8
> -#define GEM_DDRP_OFFSET				24
> +#define GEM_DDRP_OFFSET				24 /* disc_when_no_ahb */
>  #define GEM_DDRP_SIZE				1
>  
>  
>  /* Bitfields in NSR */
> -#define MACB_NSR_LINK_OFFSET			0
> +#define MACB_NSR_LINK_OFFSET			0 /* pcs_link_state */
>  #define MACB_NSR_LINK_SIZE			1
> -#define MACB_MDIO_OFFSET			1
> +#define MACB_MDIO_OFFSET			1 /* status of the mdio_in
> +						   * pin
> +						   */
>  #define MACB_MDIO_SIZE				1
> -#define MACB_IDLE_OFFSET			2
> +#define MACB_IDLE_OFFSET			2 /* The PHY management logic is
> +						   * idle (i.e. has completed)
> +						   */
>  #define MACB_IDLE_SIZE				1
>  
>  /* Bitfields in TSR */
> -#define MACB_UBR_OFFSET				0
> +#define MACB_UBR_OFFSET				0 /* Used bit read */
>  #define MACB_UBR_SIZE				1
> -#define MACB_COL_OFFSET				1
> +#define MACB_COL_OFFSET				1 /* Collision occurred */
>  #define MACB_COL_SIZE				1
> -#define MACB_TSR_RLE_OFFSET			2
> +#define MACB_TSR_RLE_OFFSET			2 /* Retry limit exceeded */
>  #define MACB_TSR_RLE_SIZE			1
> -#define MACB_TGO_OFFSET				3
> +#define MACB_TGO_OFFSET				3 /* Transmit go */
>  #define MACB_TGO_SIZE				1
> -#define MACB_BEX_OFFSET				4
> +#define MACB_BEX_OFFSET				4 /* Transmit frame corruption
> +						   * due to AHB error
> +						   */
>  #define MACB_BEX_SIZE				1
>  #define MACB_RM9200_BNQ_OFFSET			4 /* AT91RM9200 only */
>  #define MACB_RM9200_BNQ_SIZE			1 /* AT91RM9200 only */
> -#define MACB_COMP_OFFSET			5
> +#define MACB_COMP_OFFSET			5 /* Trnasmit complete */
>  #define MACB_COMP_SIZE				1
> -#define MACB_UND_OFFSET				6
> +#define MACB_UND_OFFSET				6 /* Trnasmit under run */
>  #define MACB_UND_SIZE				1
>  
>  /* Bitfields in RSR */
> -#define MACB_BNA_OFFSET				0
> +#define MACB_BNA_OFFSET				0 /* Buffer not available */
>  #define MACB_BNA_SIZE				1
> -#define MACB_REC_OFFSET				1
> +#define MACB_REC_OFFSET				1 /* Frame received */
>  #define MACB_REC_SIZE				1
> -#define MACB_OVR_OFFSET				2
> +#define MACB_OVR_OFFSET				2 /* Receive overrun */
>  #define MACB_OVR_SIZE				1
>  
>  /* Bitfields in ISR/IER/IDR/IMR */
> -#define MACB_MFD_OFFSET				0
> +#define MACB_MFD_OFFSET				0 /* Management frame sent */
>  #define MACB_MFD_SIZE				1
> -#define MACB_RCOMP_OFFSET			1
> +#define MACB_RCOMP_OFFSET			1 /* Receive complete */
>  #define MACB_RCOMP_SIZE				1
> -#define MACB_RXUBR_OFFSET			2
> +#define MACB_RXUBR_OFFSET			2 /* RX used bit read */
>  #define MACB_RXUBR_SIZE				1
> -#define MACB_TXUBR_OFFSET			3
> +#define MACB_TXUBR_OFFSET			3 /* TX used bit read */
>  #define MACB_TXUBR_SIZE				1
> -#define MACB_ISR_TUND_OFFSET			4
> +#define MACB_ISR_TUND_OFFSET			4 /* Enable trnasmit buffer
> +						   * under run interrupt
> +						   */
>  #define MACB_ISR_TUND_SIZE			1
> -#define MACB_ISR_RLE_OFFSET			5
> +#define MACB_ISR_RLE_OFFSET			5 /* Enable retry limit exceeded
> +						   * or late collision interrupt
> +						   */
>  #define MACB_ISR_RLE_SIZE			1
> -#define MACB_TXERR_OFFSET			6
> +#define MACB_TXERR_OFFSET			6 /* Enable transmit frame
> +						   * corruption due to AHB error
> +						   * interrupt
> +						   */
>  #define MACB_TXERR_SIZE				1
> -#define MACB_TCOMP_OFFSET			7
> +#define MACB_TCOMP_OFFSET			7 /* Enable transmit complete
> +						   * interrupt
> +						   */
>  #define MACB_TCOMP_SIZE				1
> -#define MACB_ISR_LINK_OFFSET			9
> +#define MACB_ISR_LINK_OFFSET			9 /* Enable link change
> +						   * interrupt
> +						   */
>  #define MACB_ISR_LINK_SIZE			1
> -#define MACB_ISR_ROVR_OFFSET			10
> +#define MACB_ISR_ROVR_OFFSET			10 /* Enable receive overrun
> +						    * interrupt
> +						    */
>  #define MACB_ISR_ROVR_SIZE			1
> -#define MACB_HRESP_OFFSET			11
> +#define MACB_HRESP_OFFSET			11 /* Enable hrsep not OK
> +						    * interrupt
> +						    */
>  #define MACB_HRESP_SIZE				1
> -#define MACB_PFR_OFFSET				12
> +#define MACB_PFR_OFFSET				12 /* Enable pause frame with
> +						    * non-zero pause quantum
> +						    * interrupt
> +						    */
>  #define MACB_PFR_SIZE				1
> -#define MACB_PTZ_OFFSET				13
> +#define MACB_PTZ_OFFSET				13 /* Enable pause time zero
> +						    * interrupt
> +						    */
>  #define MACB_PTZ_SIZE				1
>  
>  /* Bitfields in MAN */
> -#define MACB_DATA_OFFSET			0
> +#define MACB_DATA_OFFSET			0 /* data */
>  #define MACB_DATA_SIZE				16
> -#define MACB_CODE_OFFSET			16
> +#define MACB_CODE_OFFSET			16 /* Must be written to 10 */
>  #define MACB_CODE_SIZE				2
> -#define MACB_REGA_OFFSET			18
> +#define MACB_REGA_OFFSET			18 /* Register address */
>  #define MACB_REGA_SIZE				5
> -#define MACB_PHYA_OFFSET			23
> +#define MACB_PHYA_OFFSET			23 /* PHY address */
>  #define MACB_PHYA_SIZE				5
> -#define MACB_RW_OFFSET				28
> +#define MACB_RW_OFFSET				28 /* Operation. 10 is read. 01
> +						    * is write.
> +						    */
>  #define MACB_RW_SIZE				2
> -#define MACB_SOF_OFFSET				30
> +#define MACB_SOF_OFFSET				30 /* Must be written to 1 for
> +						    * Clause 22 operation
> +						    */
>  #define MACB_SOF_SIZE				2
>  
>  /* Bitfields in USRIO (AVR32) */
> @@ -286,7 +341,7 @@
>  /* Bitfields in USRIO (AT91) */
>  #define MACB_RMII_OFFSET			0
>  #define MACB_RMII_SIZE				1
> -#define GEM_RGMII_OFFSET			0	/* GEM gigabit mode */
> +#define GEM_RGMII_OFFSET			0 /* GEM gigabit mode */
>  #define GEM_RGMII_SIZE				1
>  #define MACB_CLKEN_OFFSET			1
>  #define MACB_CLKEN_SIZE				1
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [PATCH 1/2] net/macb: Adding comments to various #defs to make interpretation easier
From: Brad Mouring @ 2015-01-14 15:11 UTC (permalink / raw)
  To: David Miller
  Cc: xander.huff, nicolas.ferre, jaeden.amero, rich.tollerton,
	ben.shelton, brad.mouring, netdev, linux-kernel
In-Reply-To: <20150114.002609.1903467089677912749.davem@davemloft.net>

On Wed, Jan 14, 2015 at 12:26:09AM -0500, David Miller wrote:
> From: Xander Huff <xander.huff@ni.com>
> Date: Tue, 13 Jan 2015 16:15:50 -0600
> 
> > This change is to help improve at-a-glace knowledge of the purpose of the
> > various Cadence MACB/GEM registers. Comments are more helpful for human
> > readability than short acronyms.
> > 
> > Describe various #define varibles Cadence MACB/GEM registers as documented
> > in Xilinix's "Zynq-7000 All Programmable SoC TechnicalReference Manual, v1.9.1
s/Xilinix/Xilinx/. Sorry for the previous html-spam. Didn't follow the rule to always get coffee in the system prior to responding.
> > (UG-585)"
> > 
> > Signed-off-by: Xander Huff <xander.huff@ni.com>
> 
> Applied.

^ permalink raw reply

* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Zhou, Danny @ 2015-01-14 15:26 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Network Development, Neil Horman, Daniel Borkmann, Ronciak, John,
	Hannes Frederic Sowa, brouer@redhat.com
In-Reply-To: <CA+FuTSdCV8puZVe-6aWQt5mPk0i3_CBK7hecOgviLc0GpdUmNw@mail.gmail.com>



> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Wednesday, January 14, 2015 2:53 AM
> To: John Fastabend
> Cc: Network Development; Zhou, Danny; Neil Horman; Daniel Borkmann; Ronciak, John; Hannes Frederic Sowa;
> brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> On Mon, Jan 12, 2015 at 11:35 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > This patch adds net_device ops to split off a set of driver queues
> > from the driver and map the queues into user space via mmap. This
> > allows the queues to be directly manipulated from user space. For
> > raw packet interface this removes any overhead from the kernel network
> > stack.
> 
> Can you elaborate how packet payload mapping is handled?
> Processes are still responsible for translating from user virtual to
> physical (and bus) addresses, correct? The IOMMU is only there
> to restrict the physical address ranges that may be written.
> 

User space processes have to use the IOVA returned from af_packet to fill 
NIC's Rx (as well as Tx) descriptors. When a DMA request is trigged for transferring a 
coming packet from the NIC to host memory, the device ID(specified by PCIe device' B:N:F) 
field in the DMA request will be used by IOMMU to find the device address translation
structure for this domain/device. Then the IOMMU will use the IOVA field in the 
DMA request as the match field to look up the per-device address translation structure 
to get the corresponding physical address pointing to where packet should be transferred to.

If an invalid IOVA address (e.g. arbitrary address or physical address) is filled in NIC's descriptors, 
IOMMU would prevent DMA from happening due to above lookup operation returns failure.

> >
> > With these operations we bypass the network stack and packet_type
> > handlers that would typically send traffic to an af_packet socket.
> > This means hardware must do the forwarding. To do this ew can use
> > the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> > currently supported by multiple drivers including sfc, mlx4, niu,
> > ixgbe, and i40e. Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support this interface.
> >
> > A follow on patch adds support for ixgbe but we expect at least
> > the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> > implemented later.
> >
> > The high level flow, leveraging the af_packet control path, looks
> > like:
> >
> >         bind(fd, &sockaddr, sizeof(sockaddr));
> >
> >         /* Get the device type and info */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> >                    &optlen);
> >
> >         /* With device info we can look up descriptor format */
> >
> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> >
> >         /* Now we have some user space queues to read/write to*/
> >
> > There is one critical difference when running with these interfaces
> > vs running without them. In the normal case the af_packet module
> > uses a standard descriptor format exported by the af_packet user
> > space headers. In this model because we are working directly with
> > driver queues the descriptor format maps to the descriptor format
> > used by the device. User space applications can learn device
> > information from the socket option PACKET_DEV_DESC_INFO. These
> > are described by giving the vendor/deviceid and a descriptor layout
> > in offset/length/width/alignment/byte_ordering.
> 
> Raising the issue of exposed vs. virtualized interface just once
> more. I wonder if it is possible to keep the virtual to physical
> translation in the kernel while avoiding syscall latency, by doing
> the translation in a kernel thread on a coupled hyperthread that
> waits with mwait on the virtual queue producer index. The page
> table operations that Neil proposed in v1 of this patch may work
> even better.
> 

This is one shot request during initialization, so should be ok from latency
prospective. The NIC requests physically contiguous host memory region
to be used as rx/tx packet buffer, so the physical address is provided for af_packet
or the NIC driver to do this check. Otherwise, it is hard to check it for given
virtual address and size of the memory regions.

> > To protect against arbitrary DMA writes IOMMU devices put memory
> > in a single domain to stop arbitrary DMA to memory. Note it would
> > be possible to dma into another sockets pages because most NIC
> > devices only support a single domain. This would require being
> > able to guess another sockets page layout. However the socket
> > operation does require CAP_NET_ADMIN privileges.
> >
> > Additionally we have a set of DPDK patches to enable DPDK with this
> > interface. DPDK can be downloaded @ dpdk.org although as I hope is
> > clear from above DPDK is just our paticular test environment we
> > expect other libraries could be built on this interface.
> >
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  include/linux/netdevice.h      |   79 ++++++++
> >  include/uapi/linux/if_packet.h |   88 +++++++++
> >  net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
> >  net/packet/internal.h          |   10 +
> >  4 files changed, 573 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 679e6e9..b71c97d 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -52,6 +52,8 @@
> >  #include <linux/neighbour.h>
> >  #include <uapi/linux/netdevice.h>
> >
> > +#include <linux/if_packet.h>
> > +
> >  struct netpoll_info;
> >  struct device;
> >  struct phy_device;
> > @@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
> >   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
> >   *     Called to notify switch device port of bridge port STP
> >   *     state change.
> > + *
> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be handed to the
> > + *     callee for management. After this returns the driver will not use the
> > + *     queues.
> > + *
> > + * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int *qpairs_start_from,
> > + *                              unsigned int *qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to get the location of queues that have been split for user
> > + *     space to use. The socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs successfully.
> > + *
> > + * int (*ndo_return_queue_pairs) (struct net_device *dev,
> > + *                               struct sock *sk)
> > + *     Called to return a set of queues identified by sock to the driver. The
> > + *     socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs for this action to be performed.
> > + *
> > + * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
> > + *                             struct tpacket_dev_qpair_map_region_info *info)
> > + *     Called to return mapping of queue memory region.
> > + *
> > + * int (*ndo_get_device_desc_info) (struct net_device *dev,
> > + *                                 struct tpacket_dev_info *dev_info)
> > + *     Called to get device specific information. This should uniquely identify
> > + *     the hardware so that descriptor formats can be learned by the stack/user
> > + *     space.
> > + *
> > + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> > + *                                  struct net_device *dev)
> > + *     Called to map queue pair range from split_queue_pairs into mmap region.
> > + *
> > + * int (*ndo_direct_validate_dma_mem_region_map)
> > + *                                     (struct net_device *dev,
> > + *                                      struct tpacket_dma_mem_region *region,
> > + *                                      struct sock *sk)
> > + *     Called to validate DMA address remaping for userspace memory region
> > + *
> > + * int (*ndo_get_dma_region_info)
> > + *                              (struct net_device *dev,
> > + *                               struct tpacket_dma_mem_region *region,
> > + *                               struct sock *sk)
> > + *     Called to get dma region' information such as iova.
> >   */
> >  struct net_device_ops {
> >         int                     (*ndo_init)(struct net_device *dev);
> > @@ -1190,6 +1240,35 @@ struct net_device_ops {
> >         int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
> >                                                               u8 state);
> >  #endif
> > +       int                     (*ndo_split_queue_pairs)(struct net_device *dev,
> > +                                        unsigned int qpairs_start_from,
> > +                                        unsigned int qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_split_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        unsigned int *qpairs_start_from,
> > +                                        unsigned int *qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_return_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_device_qpair_map_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_qpair_map_region_info *info);
> > +       int                     (*ndo_get_device_desc_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_info *dev_info);
> > +       int                     (*ndo_direct_qpair_page_map)
> > +                                       (struct vm_area_struct *vma,
> > +                                        struct net_device *dev);
> > +       int                     (*ndo_validate_dma_mem_region_map)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_dma_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> >  };
> >
> >  /**
> > diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> > index da2d668..eb7a727 100644
> > --- a/include/uapi/linux/if_packet.h
> > +++ b/include/uapi/linux/if_packet.h
> > @@ -54,6 +54,13 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT                  18
> >  #define PACKET_TX_HAS_OFF              19
> >  #define PACKET_QDISC_BYPASS            20
> > +#define PACKET_RXTX_QPAIRS_SPLIT       21
> > +#define PACKET_RXTX_QPAIRS_RETURN      22
> > +#define PACKET_DEV_QPAIR_MAP_REGION_INFO       23
> > +#define PACKET_DEV_DESC_INFO           24
> > +#define PACKET_DMA_MEM_REGION_MAP       25
> > +#define PACKET_DMA_MEM_REGION_RELEASE   26
> > +
> >
> >  #define PACKET_FANOUT_HASH             0
> >  #define PACKET_FANOUT_LB               1
> > @@ -64,6 +71,87 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT_FLAG_ROLLOVER    0x1000
> >  #define PACKET_FANOUT_FLAG_DEFRAG      0x8000
> >
> > +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> > +#define PACKET_MAX_NUM_DESC_FORMATS      8
> > +#define PACKET_MAX_NUM_DESC_FIELDS       64
> > +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> > +               .seqn = (__u8)fseq,                             \
> > +               .offset = (__u8)foffset,                        \
> > +               .width = (__u8)fwidth,                          \
> > +               .align = (__u8)falign,                          \
> > +               .byte_order = (__u8)fbo
> > +
> > +#define MAX_MAP_MEMORY_REGIONS 64
> > +
> > +/* setsockopt takes addr, size ,direction parametner, getsockopt takes
> > + * iova, size, direction.
> > + * */
> > +struct tpacket_dma_mem_region {
> > +       void *addr;             /* userspace virtual address */
> > +       __u64 phys_addr;        /* physical address */
> > +       __u64 iova;             /* IO virtual address used for DMA */
> > +       unsigned long size;     /* size of region */
> > +       int direction;          /* dma data direction */
> > +};
> > +
> > +struct tpacket_dev_qpair_map_region_info {
> > +       unsigned int tp_dev_bar_sz;             /* size of BAR */
> > +       unsigned int tp_dev_sysm_sz;            /* size of systerm memory */
> > +       /* number of contiguous memory on BAR mapping to user space */
> > +       unsigned int tp_num_map_regions;
> > +       /* number of contiguous memory on system mapping to user apce */
> > +       unsigned int tp_num_sysm_map_regions;
> > +       struct map_page_region {
> > +               unsigned page_offset;   /* offset to start of region */
> > +               unsigned page_sz;       /* size of page */
> > +               unsigned page_cnt;      /* number of pages */
> > +       } tp_regions[MAX_MAP_MEMORY_REGIONS];
> > +};
> > +
> > +struct tpacket_dev_qpairs_info {
> > +       unsigned int tp_qpairs_start_from;      /* qpairs index to start from */
> > +       unsigned int tp_qpairs_num;             /* number of qpairs */
> > +};
> > +
> > +enum tpack_desc_byte_order {
> > +       BO_NATIVE = 0,
> > +       BO_NETWORK,
> > +       BO_BIG_ENDIAN,
> > +       BO_LITTLE_ENDIAN,
> > +};
> > +
> > +struct tpacket_nic_desc_fld {
> > +       __u8 seqn;      /* Sequency index of descriptor field */
> > +       __u8 offset;    /* Offset to start */
> > +       __u8 width;     /* Width of field */
> > +       __u8 align;     /* Alignment in bits */
> > +       enum tpack_desc_byte_order byte_order;  /* Endian flag */
> > +};
> > +
> > +struct tpacket_nic_desc_expr {
> > +       __u8 version;           /* Version number */
> > +       __u8 size;              /* Descriptor size in bytes */
> > +       enum tpack_desc_byte_order byte_order;          /* Endian flag */
> > +       __u8 num_of_fld;        /* Number of valid fields */
> > +       /* List of each descriptor field */
> > +       struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
> > +};
> > +
> > +struct tpacket_dev_info {
> > +       __u16   tp_device_id;
> > +       __u16   tp_vendor_id;
> > +       __u16   tp_subsystem_device_id;
> > +       __u16   tp_subsystem_vendor_id;
> > +       __u32   tp_numa_node;
> > +       __u32   tp_revision_id;
> > +       __u32   tp_num_total_qpairs;
> > +       __u32   tp_num_inuse_qpairs;
> > +       __u32   tp_num_rx_desc_fmt;
> > +       __u32   tp_num_tx_desc_fmt;
> > +       struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +       struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +};
> > +
> >  struct tpacket_stats {
> >         unsigned int    tp_packets;
> >         unsigned int    tp_drops;
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 6880f34..8cd17da 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
> >  static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
> >                 struct tpacket3_hdr *);
> >  static void packet_flush_mclist(struct sock *sk);
> > +static int umem_release(struct net_device *dev, struct packet_sock *po);
> > +static int get_umem_pages(struct tpacket_dma_mem_region *region,
> > +                         struct packet_umem_region *umem);
> >
> >  struct packet_skb_cb {
> >         unsigned int origlen;
> > @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
> >         sock_prot_inuse_add(net, sk->sk_prot, -1);
> >         preempt_enable();
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (dev) {
> > +                       dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +                       umem_release(dev, po);
> > +               }
> > +       }
> > +
> >         spin_lock(&po->bind_lock);
> >         unregister_prot_hook(sk, false);
> >         packet_cached_dev_reset(po);
> > @@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
> >         po->num = proto;
> >         po->xmit = dev_queue_xmit;
> >
> > +       INIT_LIST_HEAD(&po->umem_list);
> > +
> >         err = packet_alloc_pending(po);
> >         if (err)
> >                 goto out2;
> > @@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
> >  }
> >
> >  static int
> > +get_umem_pages(struct tpacket_dma_mem_region *region,
> > +              struct packet_umem_region *umem)
> > +{
> > +       struct page **page_list;
> > +       unsigned long npages;
> > +       unsigned long offset;
> > +       unsigned long base;
> > +       unsigned long i;
> > +       int ret;
> > +       dma_addr_t phys_base;
> > +
> > +       phys_base = (region->phys_addr) & PAGE_MASK;
> > +       base = ((unsigned long)region->addr) & PAGE_MASK;
> > +       offset = ((unsigned long)region->addr) & (~PAGE_MASK);
> > +       npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
> > +
> > +       npages = min_t(unsigned long, npages, umem->nents);
> > +       sg_init_table(umem->sglist, npages);
> > +
> > +       umem->nmap = 0;
> > +       page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > +       if (!page_list)
> > +               return -ENOMEM;
> > +
> > +       while (npages) {
> > +               unsigned long min = min_t(unsigned long, npages,
> > +                                         PAGE_SIZE / sizeof(struct page *));
> > +
> > +               ret = get_user_pages(current, current->mm, base, min,
> > +                                    1, 0, page_list, NULL);
> > +               if (ret < 0)
> > +                       break;
> > +
> > +               base += ret * PAGE_SIZE;
> > +               npages -= ret;
> > +
> > +               /* validate if the memory region is physically contigenous */
> > +               for (i = 0; i < ret; i++) {
> > +                       unsigned int page_index =
> > +                               (page_to_phys(page_list[i]) - phys_base) /
> > +                               PAGE_SIZE;
> > +
> > +                       if (page_index != umem->nmap + i) {
> > +                               int j;
> > +
> > +                               for (j = 0; j < (umem->nmap + i); j++)
> > +                                       put_page(sg_page(&umem->sglist[j]));
> > +
> > +                               free_page((unsigned long)page_list);
> > +                               return -EFAULT;
> > +                       }
> > +
> > +                       sg_set_page(&umem->sglist[umem->nmap + i],
> > +                                   page_list[i], PAGE_SIZE, 0);
> > +               }
> > +
> > +               umem->nmap += ret;
> > +       }
> > +
> > +       free_page((unsigned long)page_list);
> > +       return 0;
> > +}
> > +
> > +static int
> > +umem_release(struct net_device *dev, struct packet_sock *po)
> > +{
> > +       struct packet_umem_region *umem, *tmp;
> > +       int i;
> > +
> > +       list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int
> >  packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
> >  {
> >         struct sock *sk = sock->sk;
> > @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
> >                 po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
> >                 return 0;
> >         }
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
> > +                       return -EFAULT;
> > +
> > +               /* Only allow one set of queues to be owned by userspace */
> > +               if (po->tp_owns_queue_pairs)
> > +                       return -EBUSY;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  ops->ndo_split_queue_pairs(dev,
> > +                                                 qpairs.tp_qpairs_start_from,
> > +                                                 qpairs.tp_qpairs_num, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = true;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_RXTX_QPAIRS_RETURN:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = false;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region region;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               struct packet_umem_region *umem;
> > +               unsigned long npages;
> > +               unsigned long offset;
> > +               unsigned long i;
> > +               int err;
> > +
> > +               if (optlen != sizeof(region))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&region, optval, sizeof(region)))
> > +                       return -EFAULT;
> > +               if ((region.direction != DMA_BIDIRECTIONAL) &&
> > +                   (region.direction != DMA_TO_DEVICE) &&
> > +                   (region.direction != DMA_FROM_DEVICE))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               offset = ((unsigned long)region.addr) & (~PAGE_MASK);
> > +               npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
> > +
> > +               umem = vzalloc(sizeof(*umem) +
> > +                              sizeof(struct scatterlist) * npages);
> > +               if (!umem)
> > +                       return -ENOMEM;
> > +
> > +               umem->nents = npages;
> > +               umem->direction = region.direction;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               if (get_umem_pages(&region, umem) < 0) {
> > +                       ret = -EFAULT;
> > +                       goto exit;
> > +               }
> > +
> > +               if ((umem->nmap == npages) &&
> > +                   (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> > +                                    umem->nmap, region.direction))) {
> > +                       region.iova = sg_dma_address(umem->sglist) + offset;
> > +
> > +                       ops = dev->netdev_ops;
> > +                       if (!ops->ndo_validate_dma_mem_region_map) {
> > +                               ret = -EOPNOTSUPP;
> > +                               goto unmap;
> > +                       }
> > +
> > +                       /* use driver to validate mapping of dma memory */
> > +                       err = ops->ndo_validate_dma_mem_region_map(dev,
> > +                                                                  &region,
> > +                                                                  sk);
> > +                       if (!err) {
> > +                               list_add_tail(&umem->list, &po->umem_list);
> > +                               ret = 0;
> > +                               goto exit;
> > +                       }
> > +               }
> > +
> > +unmap:
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +exit:
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_RELEASE:
> > +       {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               ret = umem_release(dev, po);
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         case PACKET_QDISC_BYPASS:
> >                 val = packet_use_direct_xmit(po);
> >                 break;
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               int err;
> > +
> > +               if (len != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only work after a successful queue pairs split-off
> > +                * operation via setsockopt()
> > +                */
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> > +                                       &qpairs_info.tp_qpairs_start_from,
> > +                                       &qpairs_info.tp_qpairs_num, sk);
> > +
> > +               lv = sizeof(qpairs_info);
> > +               data = &qpairs_info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_QPAIR_MAP_REGION_INFO:
> > +       {
> > +               struct tpacket_dev_qpair_map_region_info info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_get_device_qpair_map_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_qpair_map_region_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_DESC_INFO:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_info info;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_get_device_desc_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region info;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                               return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                               return -EFAULT;
> > +
> > +               /* This call only work after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               if (!dev->netdev_ops->ndo_get_dma_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dma_mem_region);
> > +               data = &info;
> > +               break;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         return 0;
> >  }
> >
> > -
> >  static int packet_notifier(struct notifier_block *this,
> >                            unsigned long msg, void *ptr)
> >  {
> > @@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         struct packet_sock *po = pkt_sk(sk);
> >         unsigned long size, expected_size;
> >         struct packet_ring_buffer *rb;
> > +       const struct net_device_ops *ops;
> > +       struct net_device *dev;
> >         unsigned long start;
> >         int err = -EINVAL;
> >         int i;
> > @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> > +       dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +       if (!dev)
> > +               return -EINVAL;
> > +
> >         mutex_lock(&po->pg_vec_lock);
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               ops = dev->netdev_ops;
> > +               err = ops->ndo_direct_qpair_page_map(vma, dev);
> > +               if (err)
> > +                       goto out;
> > +               goto done;
> > +       }
> > +
> >         expected_size = 0;
> >         for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
> >                 if (rb->pg_vec) {
> > @@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >                 }
> >         }
> >
> > +done:
> >         atomic_inc(&po->mapped);
> >         vma->vm_ops = &packet_mmap_ops;
> >         err = 0;
> > diff --git a/net/packet/internal.h b/net/packet/internal.h
> > index cdddf6a..55d2fce 100644
> > --- a/net/packet/internal.h
> > +++ b/net/packet/internal.h
> > @@ -90,6 +90,14 @@ struct packet_fanout {
> >         struct packet_type      prot_hook ____cacheline_aligned_in_smp;
> >  };
> >
> > +struct packet_umem_region {
> > +       struct list_head        list;
> > +       int                     nents;
> > +       int                     nmap;
> > +       int                     direction;
> > +       struct scatterlist      sglist[0];
> > +};
> > +
> >  struct packet_sock {
> >         /* struct sock has to be the first member of packet_sock */
> >         struct sock             sk;
> > @@ -97,6 +105,7 @@ struct packet_sock {
> >         union  tpacket_stats_u  stats;
> >         struct packet_ring_buffer       rx_ring;
> >         struct packet_ring_buffer       tx_ring;
> > +       struct list_head        umem_list;
> >         int                     copy_thresh;
> >         spinlock_t              bind_lock;
> >         struct mutex            pg_vec_lock;
> > @@ -113,6 +122,7 @@ struct packet_sock {
> >         unsigned int            tp_reserve;
> >         unsigned int            tp_loss:1;
> >         unsigned int            tp_tx_has_off:1;
> > +       unsigned int            tp_owns_queue_pairs:1;
> >         unsigned int            tp_tstamp;
> >         struct net_device __rcu *cached_dev;
> >         int                     (*xmit)(struct sk_buff *skb);
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
From: Zhou, Danny @ 2015-01-14 15:28 UTC (permalink / raw)
  To: David Miller, David.Laight@ACULAB.COM
  Cc: john.fastabend@gmail.com, dborkman@redhat.com,
	hannes@stressinduktion.org, netdev@vger.kernel.org,
	nhorman@tuxdriver.com, Ronciak, John, brouer@redhat.com
In-Reply-To: <20150113.122733.1459000731112321026.davem@davemloft.net>



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, January 14, 2015 1:28 AM
> To: David.Laight@ACULAB.COM
> Cc: john.fastabend@gmail.com; dborkman@redhat.com; hannes@stressinduktion.org; netdev@vger.kernel.org; Zhou, Danny;
> nhorman@tuxdriver.com; Ronciak, John; brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 13 Jan 2015 17:15:30 +0000
> 
> > How about something like:
> >
> > struct tpacket_dma_mem_region {
> >     __u64 addr;        /* userspace virtual address */
> >     __u64 phys_addr;    /* physical address */
> >     __u64 iova;        /* IO virtual address used for DMA */
> >     __u64 size;    /* size of region */
> >     int direction;        /* dma data direction */
> > } aligned(8);
> >
> > So that it is independant of 32/64 bits.
> > It is a shame that gcc has no way of defining a 64bit 'void *' on 32bit systems.
> > You can use a union, but you still need to zero extend the value on LE (worse on BE).
> 
> We have an __aligned_u64, please use that.

Thanks, will do.

^ permalink raw reply

* [bisected regression] e1000e: "Detected Hardware Unit Hang"
From: Thomas Jarosch @ 2015-01-14 15:32 UTC (permalink / raw)
  To: 'Linux Netdev List'; +Cc: Eric Dumazet, Jeff Kirsher, e1000-devel

Hello,

after updating a good bunch of production level machines
from kernel 3.4.101 to kernel 3.14.25, a few of them started
to show serious trouble when there was a lot of network traffic.

---------------------------------------------------------------
Jan 14 11:14:57 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
Jan 14 11:14:57 intrartc kernel:  TDH                  <3b>
Jan 14 11:14:57 intrartc kernel:  TDT                  <76>
Jan 14 11:14:57 intrartc kernel:  next_to_use          <76>
Jan 14 11:14:57 intrartc kernel:  next_to_clean        <31>
Jan 14 11:14:57 intrartc kernel: buffer_info[next_to_clean]:
Jan 14 11:14:57 intrartc kernel:  time_stamp           <ffff328c>
Jan 14 11:14:57 intrartc kernel:  next_to_watch        <3b>
Jan 14 11:14:57 intrartc kernel:  jiffies              <ffff33b9>
Jan 14 11:14:57 intrartc kernel:  next_to_watch.status <0>
Jan 14 11:14:57 intrartc kernel: MAC Status             <40080083>
Jan 14 11:14:57 intrartc kernel: PHY Status             <796d>
Jan 14 11:14:57 intrartc kernel: PHY 1000BASE-T Status  <3800>
Jan 14 11:14:57 intrartc kernel: PHY Extended Status    <3000>
Jan 14 11:14:57 intrartc kernel: PCI Status             <10>
Jan 14 11:14:59 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
..
---------------------------------------------------------------

All of those troubled machines use an Intel DH61CR board and
are driven by the e1000e driver. Kernels 3.7.0 to 3.19-rc4 are affected.

The problem vanishes when you disable TSO. This is the
recommended "solution" on serverfault and others.
http://ehc.ac/p/e1000/bugs/378/
http://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang

I have a test setup that can trigger the problem within seconds
and bisected it down to this commit (hi Eric!):
---------------------------------------------------------------
commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Sep 26 06:46:57 2012 +0000

    net: use bigger pages in __netdev_alloc_frag

    We currently use percpu order-0 pages in __netdev_alloc_frag
    to deliver fragments used by __netdev_alloc_skb()

    Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
    be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096

    Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :

    - Better filling of space (the ending hole overhead is less an issue)

    - Less calls to page allocator or accesses to page->_count

    - Could allow struct skb_shared_info futures changes without major
    performance impact.

    This patch implements a transparent fallback to smaller
    pages in case of memory pressure.

    It also uses a standard "struct page_frag" instead of a custom one.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Duyck <alexander.h.duyck@intel.com>
    Cc: Benjamin LaHaise <bcrl@kvack.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
---------------------------------------------------------------

Reverting the commit f.e. in kernel 3.7.0  solves the issue.
I've done some more tests:

    3.18.0 32bit + PAE: broken
    3.6.0 32bit + PAE: works
    3.7.0 32bit + PAE: broken
    3.7.0 32bit + PAE + revert 69b08f62e17439ee3d436faf0b9a7ca6fffb78db -> works

    3.7.0 32bit (without PAE) -> broken
    3.7.0 32bit + "GFP_COMP" flag removed in __netdev_alloc_frag(): broken
    3.7.0 32bit + "GFP_COMP" flag replaced with
                              "GFP_DMA" in __netdev_alloc_frag(): works!
    3.7.0 32bit + "GFP_COMP" flag + "GFP_DMA" flag: broken
    3.19-rc4 32bit: broken


The problem is triggered only when the traffic is forwarded to another client.
(this client is behind NAT). Generating traffic directly
on the system did not trigger the issue.

To me it looks like Eric's change uncovered a memory allocation
issue in the e1000e driver: It probably uses a memory address
unsuitable for DMA or so. This is just a guess though.

Funny fact: I have another Intel DH61CR board that does not show the problem.
I've borrowed (...) the mainboard from one affected box for my bisect test setup.

Please CC: comments. Thanks.

Best regards,
Thomas

^ permalink raw reply

* non-OVS based vxlan config broken on 3.19-rc ?!
From: Or Gerlitz @ 2015-01-14 15:18 UTC (permalink / raw)
  To: tom Herbert, thomas Graf, Marcelo Leitner, Jesse Gross
  Cc: netdev@vger.kernel.org

Guys, just realized that non-OVS based vxlan config is broken with 
3.19-rc... I see that it works for me on 3.18.2 and breaks on 3.19-rc3 
(Linus tree). Tested over mlx4 (both offloaded and non offloaded modes) 
and igb, see below the simplest form I can see it breaks on 3.19-rcand 
works on 3.18

Looking on tcpdump and stats, the arp reply arrives to the 3.19-rc host 
NIC driver but is dropped along the stack beforehanded to the vxlan 
driver, not sure where and why...

Or.

> $ tcpdump -nni vxlan42 -e
> 16:48:40.961852 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
> 16:48:41.961864 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
> 16:48:42.979948 8e:c2:13:4f:97:6e > ff:ff:ff:ff:ff:ff, ethertype ARP 
> (0x0806), length 42: Request who-has 192.168.92.18 tell 192.168.92.17, 
> length 28
>
> $ tcpdump -nni eth3 -e
> 16:48:46.993870 00:02:c9:e9:bf:32 > 01:00:5e:00:00:2a, ethertype IPv4 
> (0x0800), length 92: 192.168.31.17.33434 > 239.0.0.42.4789: UDP, length 50
> 16:48:46.993905 f4:52:14:01:da:82 > 00:02:c9:e9:bf:32, ethertype IPv4 
> (0x0800), length 92: 192.168.31.18.39155 > 192.168.31.17.4789: UDP, 
> length 50
> 16:48:47.993855 00:02:c9:e9:bf:32 > 01:00:5e:00:00:2a, ethertype IPv4 
> (0x0800), length 92: 192.168.31.17.33434 > 239.0.0.42.4789: UDP, length 50
> 16:48:47.993881 f4:52:14:01:da:82 > 00:02:c9:e9:bf:32, ethertype IPv4 
> (0x0800), length 92: 192.168.31.18.39155 > 192.168.31.17.4789: UDP, 
> length 50
>
> $ nstat
> Wed Jan 14 16:54:04 2015
>
> #kernel
> IpInReceives                    2                  0.0
> IpInDelivers                    2                  0.0
> IpOutRequests                   4                  0.0
> IcmpOutErrors                   2                  0.0
> IcmpOutEchoReps                 2                  0.0
> IcmpMsgOutType8                 2                  0.0
> TcpInSegs                       1                  0.0
> TcpOutSegs                      1                  0.0
> UdpInDatagrams                  1                  0.0
> TcpExtTCPPureAcks               1                  0.0
> TcpExtTCPOrigDataSent           1                  0.0
> IpExtOutMcastPkts               1                  0.0
> IpExtInOctets                   124                0.0
> IpExtOutOctets                  670                0.0
> IpExtOutMcastOctets             78                 0.0
> IpExtInNoECTPkts                2                  0.0
>
> Every 1.0s: netstat -s -w
> Wed Jan 14 16:54:56 2015
>
> Ip:
>     1077125083 total packets received
>     182 with invalid addresses
>     0 forwarded
>     0 incoming packets discarded
>     1077124621 incoming packets delivered
>     545614475 requests sent out
> Icmp:
>     67936 ICMP messages received
>     60 input ICMP message failed.
>     ICMP input histogram:
>         destination unreachable: 3319
>         echo requests: 528
>         echo replies: 64089
>     71890 ICMP messages sent
>     0 ICMP messages failed
>     ICMP output histogram:
>         destination unreachable: 3542
>         echo request: 67821
>         echo replies: 527
> IcmpMsg:
>         InType0: 64089
>         InType3: 3319
>         InType8: 528
>         OutType0: 527
>         OutType3: 3542
>         OutType8: 67821
> UdpLite:
> IpExt:
>     InMcastPkts: 827
>     OutMcastPkts: 2993
>     InBcastPkts: 1424
>     InOctets: 2102729994314
>     OutOctets: 35397269627
>     InMcastOctets: 31854
>     OutMcastOctets: 237154
>     InBcastOctets: 453473
>     InNoECTPkts: 1440876685



# host A with IP address 192.168.31.17

IP=ip

# mlx4
ETH=eth0

$IP link del vxlan42
$IP link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dstport 
4789 dev $ETH
$IP link set vxlan42 up
ifconfig vxlan42 192.168.92.17/24
ifconfig vxlan42 mtu 1450

# plain ping
ping 192.168.31.18 -c 2
# encaped  ping <-- BREAKS
ping 192.168.92.18 -c 2

# host B with IP address 192.168.31.18

IP=ip

# mlx4
ETH=eth0

$IP link del vxlan42
$IP link add vxlan42 type vxlan id 42 group 239.0.0.42 ttl 10 dstport 
4789 dev $ETH
$IP link set vxlan42 up
ifconfig vxlan42 192.168.92.18/24
ifconfig vxlan42 mtu 1450

ping 192.168.31.17 -c 2
ping 192.168.92.17 -c 2

^ permalink raw reply

* Re: [patch net-next 1/2 v3] tc: add BPF based action
From: Alexei Starovoitov @ 2015-01-14 15:39 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jiri Pirko, Network Development, David S. Miller,
	Jamal Hadi Salim, Hannes Frederic Sowa
In-Reply-To: <54B66F08.2010305@redhat.com>

On Wed, Jan 14, 2015 at 5:28 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>
> I'm still wondering about the drop semantics ... wouldn't it be more
> intuitive to use 0 for drops in this context?

good point.
I think it must be 0 to match behavior of socket filters, etc.
If program tries to access beyond packet size or does divide
by zero if will be terminated and will return 0.
So zero should be the safest action from caller point of view.

^ permalink raw reply

* Re: Investment
From: Suklee Peck @ 2015-01-13 19:57 UTC (permalink / raw)
  To: Recipients

I'm contacting you on behalf of an investment placed under management 5 years ago by Shui bian. He needs assistance in investing these funds. If you are interested, you can write to his private email ( saitt01@qq.com ) for further details.
gesalpe
Best Regards,
Suklee Peck

^ permalink raw reply

* Re: non-OVS based vxlan config broken on 3.19-rc ?!
From: thomas Graf @ 2015-01-14 15:52 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: tom Herbert, Marcelo Leitner, Jesse Gross, netdev@vger.kernel.org
In-Reply-To: <54B688D9.8030101@mellanox.com>

On 01/14/15 at 05:18pm, Or Gerlitz wrote:
> Guys, just realized that non-OVS based vxlan config is broken with
> 3.19-rc... I see that it works for me on 3.18.2 and breaks on 3.19-rc3
> (Linus tree). Tested over mlx4 (both offloaded and non offloaded modes) and
> igb, see below the simplest form I can see it breaks on 3.19-rcand works on
> 3.18
> 
> Looking on tcpdump and stats, the arp reply arrives to the 3.19-rc host NIC
> driver but is dropped along the stack beforehanded to the vxlan driver, not
> sure where and why...

As additional data point: I tested the VXLAN-GBP with iproute2 based tunnels
on net-next and that works fine. Driver used was a e1000 in KVM.

^ permalink raw reply

* Re: [PATCH 2/2] net/macb: improved ethtool statistics support
From: Nicolas Ferre @ 2015-01-14 15:53 UTC (permalink / raw)
  To: Xander Huff, netdev, David Miller
  Cc: jaeden.amero, rich.tollerton, ben.shelton, brad.mouring,
	linux-kernel, Cyrille Pitchen
In-Reply-To: <1421187351-27279-2-git-send-email-xander.huff@ni.com>

Le 13/01/2015 23:15, Xander Huff a écrit :
> Currently `ethtool -S` simply returns "no stats available". It
> would be more useful to see what the various ethtool statistics
> registers' values are. This change implements get_ethtool_stats,
> get_strings, and get_sset_count functions to accomplish this.
> 
> Read all GEM statistics registers and sum them into
> macb.ethtool_stats. Add the necessary infrastructure to make this
> accessible via `ethtool -S`.
> 
> Update gem_update_stats to utilize ethtool_stats.
> 
> Signed-off-by: Xander Huff <xander.huff@ni.com>

David,

I see some issues with this patch: can you hold it a little bit please
(aka NAK)?

Remarks enclosed:

> ---
>  drivers/net/ethernet/cadence/macb.c |  55 +++++++-
>  drivers/net/ethernet/cadence/macb.h | 256 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 307 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
> index 3767271..dd8c202 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -1827,12 +1827,23 @@ static int macb_close(struct net_device *dev)
>  
>  static void gem_update_stats(struct macb *bp)
>  {
> -	u32 __iomem *reg = bp->regs + GEM_OTX;
> +	int i;
>  	u32 *p = &bp->hw_stats.gem.tx_octets_31_0;
> -	u32 *end = &bp->hw_stats.gem.rx_udp_checksum_errors + 1;
>  
> -	for (; p < end; p++, reg++)
> -		*p += __raw_readl(reg);
> +	for (i = 0; i < GEM_STATS_LEN; ++i, ++p) {
> +		u32 offset = gem_statistics[i].offset;
> +		u64 val = __raw_readl(bp->regs+offset);
> +
> +		bp->ethtool_stats[i] += val;
> +		*p += val;
> +
> +		if (offset == GEM_OCTTXL || offset == GEM_OCTRXL) {
> +			/* Add GEM_OCTTXH, GEM_OCTRXH */
> +			val = __raw_readl(bp->regs+offset+4);

style: whitespace around '+'

> +			bp->ethtool_stats[i] += ((u64)val)<<32;

style: ditto

> +			*(++p) += val;
> +		}
> +	}
>  }
>  
>  static struct net_device_stats *gem_get_stats(struct macb *bp)
> @@ -1873,6 +1884,39 @@ static struct net_device_stats *gem_get_stats(struct macb *bp)
>  	return nstat;
>  }
>  
> +static void gem_get_ethtool_stats(struct net_device *dev,
> +				  struct ethtool_stats *stats, u64 *data)
> +{
> +	struct macb *bp;
> +
> +	bp = netdev_priv(dev);
> +	gem_update_stats(bp);
> +	memcpy(data, &bp->ethtool_stats, sizeof(u64)*GEM_STATS_LEN);

style: ditto

> +}
> +
> +static int gem_get_sset_count(struct net_device *dev, int sset)
> +{
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		return GEM_STATS_LEN;
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
> +static void gem_get_ethtool_strings(struct net_device *dev, u32 sset, u8 *p)
> +{
> +	int i;
> +
> +	switch (sset) {
> +	case ETH_SS_STATS:
> +		for (i = 0; i < GEM_STATS_LEN; i++, p += ETH_GSTRING_LEN)
> +			memcpy(p, gem_statistics[i].stat_string,
> +			       ETH_GSTRING_LEN);
> +		break;
> +	}
> +}
> +
>  struct net_device_stats *macb_get_stats(struct net_device *dev)
>  {
>  	struct macb *bp = netdev_priv(dev);
> @@ -1988,6 +2032,9 @@ const struct ethtool_ops macb_ethtool_ops = {
>  	.get_regs		= macb_get_regs,
>  	.get_link		= ethtool_op_get_link,
>  	.get_ts_info		= ethtool_op_get_ts_info,
> +	.get_ethtool_stats	= gem_get_ethtool_stats,
> +	.get_strings		= gem_get_ethtool_strings,
> +	.get_sset_count		= gem_get_sset_count,

I think that the 10/100 macb version of this IP doesn't have the same
statistic possibilities: so you shouldn't register these functions for
all the variants of the IP.
Can you please verify this and only register these functions in the proper
if (macb_is_gem(bp)) alternative?


>  };
>  EXPORT_SYMBOL_GPL(macb_ethtool_ops);
>  
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index 8e8c3c9..378b218 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -82,6 +82,159 @@
>  #define GEM_SA4B				0x00A0 /* Specific4 Bottom */
>  #define GEM_SA4T				0x00A4 /* Specific4 Top */
>  #define GEM_OTX					0x0100 /* Octets transmitted */
> +#define GEM_OCTTXL				0x0100 /* Octets transmitted
> +							* [31:0]
> +							*/

Well please place the comment on a single line. Even if it exceeds 80
characters, it can be an exception for this.
However, check with David but personally, I feel that this formatting is
not good...

> +#define GEM_OCTTXH				0x0104 /* Octets transmitted
> +							* [47:32]
> +							*/
> +#define GEM_TXCNT				0x0108 /* Error-free Frames
> +							* Transmitted counter
> +							*/
> +#define GEM_TXBCCNT				0x010c /* Error-free Broadcast
> +							* Frames counter
> +							*/
> +#define GEM_TXMCCNT				0x0110 /* Error-free Multicast
> +							* Frames counter
> +							*/
> +#define GEM_TXPAUSECNT				0x0114 /* Pause Frames
> +							* Transmitted Counter
> +							*/
> +#define GEM_TX64CNT				0x0118 /* Error-free 64 byte
> +							* Frames Transmitted
> +							* counter
> +							*/

... particularly when it comes to reach 3 lines like above ^^^^^^

> +#define GEM_TX65CNT				0x011c /* Error-free 65-127 byte
> +							* Frames Transmitted
> +							* counter
> +							*/
> +#define GEM_TX128CNT				0x0120 /* Error-free 128-255
> +							* byte Frames
> +							* Transmitted counter
> +							*/
> +#define GEM_TX256CNT				0x0124 /* Error-free 256-511
> +							* byte Frames
> +							* transmitted counter
> +							*/
> +#define GEM_TX512CNT				0x0128 /* Error-free 512-1023
> +							* byte Frames
> +							* transmitted counter
> +							*/
> +#define GEM_TX1024CNT				0x012c /* Error-free 1024-1518
> +							* byte Frames
> +							* transmitted counter
> +							*/
> +#define GEM_TX1519CNT				0x0130 /* Error-free larger than
> +							* 1519 byte Frames
> +							* tranmitted counter
> +							*/
> +#define GEM_TXURUNCNT				0x0134 /* TX under run error
> +							* counter
> +							*/
> +#define GEM_SNGLCOLLCNT				0x0138 /* Single Collision Frame
> +							* Counter
> +							*/
> +#define GEM_MULTICOLLCNT			0x013c /* Multiple Collision
> +							* Frame Counter
> +							*/
> +#define GEM_EXCESSCOLLCNT			0x0140 /* Excessive Collision
> +							* Frame Counter
> +							*/
> +#define GEM_LATECOLLCNT				0x0144 /* Late Collision Frame
> +							* Counter
> +							*/
> +#define GEM_TXDEFERCNT				0x0148 /* Deferred Transmission
> +							* Frame Counter
> +							*/
> +#define GEM_TXCSENSECNT				0x014c /* Carrier Sense Error
> +							* Counter
> +							*/
> +#define GEM_ORX					0x0150 /* Octets received */
> +#define GEM_OCTRXL				0x0150 /* Octets received
> +							* [31:0]
> +							*/
> +#define GEM_OCTRXH				0x0154 /* Octets received
> +							* [47:32]
> +							*/
> +#define GEM_RXCNT				0x0158 /* Error-free Frames
> +							* Received Counter
> +							*/
> +#define GEM_RXBROADCNT				0x015c /* Error-free Broadcast
> +							* Frames Received
> +							* Counter
> +							*/
> +#define GEM_RXMULTICNT				0x0160 /* Error-free Multicast
> +							* Frames Received
> +							* Counter
> +							*/
> +#define GEM_RXPAUSECNT				0x0164 /* Error-free Pause
> +							* Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX64CNT				0x0168 /* Error-free 64 byte
> +							* Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX65CNT				0x016c /* Error-free 65-127 byte
> +							* Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX128CNT				0x0170 /* Error-free 128-255
> +							* byte Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX256CNT				0x0174 /* Error-free 256-511
> +							* byte Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX512CNT				0x0178 /* Error-free 512-1023
> +							* byte Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX1024CNT				0x017c /* Error-free 1024-1518
> +							* byte Frames Received
> +							* Counter
> +							*/
> +#define GEM_RX1519CNT				0x0180 /* Error-free larger than
> +							* 1519 Frames Received
> +							* Counter
> +							*/
> +#define GEM_RXUNDRCNT				0x0184 /* Undersize Frames
> +							* Received Counter
> +							*/
> +#define GEM_RXOVRCNT				0x0188 /* Oversize Frames
> +							* Received Counter
> +							*/
> +#define GEM_RXJABCNT				0x018c /* Jabbers Received
> +							* Counter
> +							*/
> +#define GEM_RXFCSCNT				0x0190 /* Frame Check Sequence
> +							* Error Counter
> +							*/
> +#define GEM_RXLENGTHCNT				0x0194 /* Length Field Error
> +							* Counter
> +							*/
> +#define GEM_RXSYMBCNT				0x0198 /* Symbol Error
> +							* Counter
> +							*/
> +#define GEM_RXALIGNCNT				0x019c /* Alignment Error
> +							* Counter
> +							*/
> +#define GEM_RXRESERRCNT				0x01a0 /* Receive Resource Error
> +							* Counter
> +							*/
> +#define GEM_RXORCNT				0x01a4 /* Receive Overrun
> +							* Counter
> +							*/
> +#define GEM_RXIPCCNT				0x01a8 /* IP header Checksum
> +							* Error Counter
> +							*/
> +#define GEM_RXTCPCCNT				0x01ac /* TCP Checksum Error
> +							* Counter
> +							*/
> +#define GEM_RXUDPCCNT				0x01b0 /* UDP Checksum Error
> +							* Counter
> +							*/
>  #define GEM_DCFG1				0x0280 /* Design Config 1 */
>  #define GEM_DCFG2				0x0284 /* Design Config 2 */
>  #define GEM_DCFG3				0x0288 /* Design Config 3 */
> @@ -650,6 +803,107 @@ struct gem_stats {
>  	u32	rx_udp_checksum_errors;
>  };
>  
> +/* Describes the name and offset of an individual statistic register, as

style: should be like this:
/*
 * bla bla bla
 */

> + * returned by `ethtool -S`. Also describes which net_device_stats statistics
> + * this register should contribute to.
> + */
> +struct gem_statistic {
> +	char stat_string[ETH_GSTRING_LEN];
> +	int offset;
> +	u32 stat_bits;
> +};
> +
> +/* Bitfield defs for net_device_stat statistics */
> +#define GEM_NDS_RXERR_OFFSET		0
> +#define GEM_NDS_RXLENERR_OFFSET		1
> +#define GEM_NDS_RXOVERERR_OFFSET	2
> +#define GEM_NDS_RXCRCERR_OFFSET		3
> +#define GEM_NDS_RXFRAMEERR_OFFSET	4
> +#define GEM_NDS_RXFIFOERR_OFFSET	5
> +#define GEM_NDS_TXERR_OFFSET		6
> +#define GEM_NDS_TXABORTEDERR_OFFSET	7
> +#define GEM_NDS_TXCARRIERERR_OFFSET	8
> +#define GEM_NDS_TXFIFOERR_OFFSET	9
> +#define GEM_NDS_COLLISIONS_OFFSET	10
> +
> +#define GEM_STAT_TITLE(name, title) GEM_STAT_TITLE_BITS(name, title, 0)
> +#define GEM_STAT_TITLE_BITS(name, title, bits) {	\
> +	.stat_string = title,				\
> +	.offset = GEM_##name,				\
> +	.stat_bits = bits				\
> +}
> +
> +/* list of gem statistic registers. The names MUST match the
> + * corresponding GEM_* definitions.
> + */
> +static const struct gem_statistic gem_statistics[] = {
> +	GEM_STAT_TITLE(OCTTXL, "tx_octets"), /* OCTTXH combined with OCTTXL */
> +	GEM_STAT_TITLE(TXCNT, "tx_frames"),
> +	GEM_STAT_TITLE(TXBCCNT, "tx_broadcast_frames"),
> +	GEM_STAT_TITLE(TXMCCNT, "tx_multicast_frames"),
> +	GEM_STAT_TITLE(TXPAUSECNT, "tx_pause_frames"),
> +	GEM_STAT_TITLE(TX64CNT, "tx_64_byte_frames"),
> +	GEM_STAT_TITLE(TX65CNT, "tx_65_127_byte_frames"),
> +	GEM_STAT_TITLE(TX128CNT, "tx_128_255_byte_frames"),
> +	GEM_STAT_TITLE(TX256CNT, "tx_256_511_byte_frames"),
> +	GEM_STAT_TITLE(TX512CNT, "tx_512_1023_byte_frames"),
> +	GEM_STAT_TITLE(TX1024CNT, "tx_1024_1518_byte_frames"),
> +	GEM_STAT_TITLE(TX1519CNT, "tx_greater_than_1518_byte_frames"),
> +	GEM_STAT_TITLE_BITS(TXURUNCNT, "tx_underrun",
> +			    GEM_BIT(NDS_TXERR)|GEM_BIT(NDS_TXFIFOERR)),
> +	GEM_STAT_TITLE_BITS(SNGLCOLLCNT, "tx_single_collision_frames",
> +			    GEM_BIT(NDS_TXERR)|GEM_BIT(NDS_COLLISIONS)),
> +	GEM_STAT_TITLE_BITS(MULTICOLLCNT, "tx_multiple_collision_frames",
> +			    GEM_BIT(NDS_TXERR)|GEM_BIT(NDS_COLLISIONS)),
> +	GEM_STAT_TITLE_BITS(EXCESSCOLLCNT, "tx_excessive_collisions",
> +			    GEM_BIT(NDS_TXERR)|
> +			    GEM_BIT(NDS_TXABORTEDERR)|
> +			    GEM_BIT(NDS_COLLISIONS)),
> +	GEM_STAT_TITLE_BITS(LATECOLLCNT, "tx_late_collisions",
> +			    GEM_BIT(NDS_TXERR)|GEM_BIT(NDS_COLLISIONS)),
> +	GEM_STAT_TITLE(TXDEFERCNT, "tx_deferred_frames"),
> +	GEM_STAT_TITLE_BITS(TXCSENSECNT, "tx_carrier_sense_errors",
> +			    GEM_BIT(NDS_TXERR)|GEM_BIT(NDS_COLLISIONS)),
> +	GEM_STAT_TITLE(OCTRXL, "rx_octets"), /* OCTRXH combined with OCTRXL */
> +	GEM_STAT_TITLE(RXCNT, "rx_frames"),
> +	GEM_STAT_TITLE(RXBROADCNT, "rx_broadcast_frames"),
> +	GEM_STAT_TITLE(RXMULTICNT, "rx_multicast_frames"),
> +	GEM_STAT_TITLE(RXPAUSECNT, "rx_pause_frames"),
> +	GEM_STAT_TITLE(RX64CNT, "rx_64_byte_frames"),
> +	GEM_STAT_TITLE(RX65CNT, "rx_65_127_byte_frames"),
> +	GEM_STAT_TITLE(RX128CNT, "rx_128_255_byte_frames"),
> +	GEM_STAT_TITLE(RX256CNT, "rx_256_511_byte_frames"),
> +	GEM_STAT_TITLE(RX512CNT, "rx_512_1023_byte_frames"),
> +	GEM_STAT_TITLE(RX1024CNT, "rx_1024_1518_byte_frames"),
> +	GEM_STAT_TITLE(RX1519CNT, "rx_greater_than_1518_byte_frames"),
> +	GEM_STAT_TITLE_BITS(RXUNDRCNT, "rx_undersized_frames",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXLENERR)),
> +	GEM_STAT_TITLE_BITS(RXOVRCNT, "rx_oversize_frames",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXLENERR)),
> +	GEM_STAT_TITLE_BITS(RXJABCNT, "rx_jabbers",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXLENERR)),
> +	GEM_STAT_TITLE_BITS(RXFCSCNT, "rx_frame_check_sequence_errors",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXCRCERR)),
> +	GEM_STAT_TITLE_BITS(RXLENGTHCNT, "rx_length_field_frame_errors",
> +			    GEM_BIT(NDS_RXERR)),
> +	GEM_STAT_TITLE_BITS(RXSYMBCNT, "rx_symbol_errors",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXFRAMEERR)),
> +	GEM_STAT_TITLE_BITS(RXALIGNCNT, "rx_alignment_errors",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXOVERERR)),
> +	GEM_STAT_TITLE_BITS(RXRESERRCNT, "rx_resource_errors",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXOVERERR)),
> +	GEM_STAT_TITLE_BITS(RXORCNT, "rx_overruns",
> +			    GEM_BIT(NDS_RXERR)|GEM_BIT(NDS_RXFIFOERR)),
> +	GEM_STAT_TITLE_BITS(RXIPCCNT, "rx_ip_header_checksum_errors",
> +			    GEM_BIT(NDS_RXERR)),
> +	GEM_STAT_TITLE_BITS(RXTCPCCNT, "rx_tcp_checksum_errors",
> +			    GEM_BIT(NDS_RXERR)),
> +	GEM_STAT_TITLE_BITS(RXUDPCCNT, "rx_udp_checksum_errors",
> +			    GEM_BIT(NDS_RXERR)),
> +};
> +
> +#define GEM_STATS_LEN ARRAY_SIZE(gem_statistics)
> +
>  struct macb;
>  
>  struct macb_or_gem_ops {
> @@ -728,6 +982,8 @@ struct macb {
>  	dma_addr_t skb_physaddr;		/* phys addr from pci_map_single */
>  	int skb_length;				/* saved skb length for pci_unmap_single */
>  	unsigned int		max_tx_length;
> +
> +	u64			ethtool_stats[GEM_STATS_LEN];
>  };
>  
>  extern const struct ethtool_ops macb_ethtool_ops;
> 


-- 
Nicolas Ferre

^ permalink raw reply

* Re: [patch net-next 1/2 v3] tc: add BPF based action
From: Jiri Pirko @ 2015-01-14 15:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Network Development, David S. Miller,
	Jamal Hadi Salim, Hannes Frederic Sowa
In-Reply-To: <CAMEtUux+HOOegzyi82fYT_JMDX_+d0dZCQ=Zc5GWC-awQmJu1A@mail.gmail.com>

Wed, Jan 14, 2015 at 04:39:34PM CET, ast@plumgrid.com wrote:
>On Wed, Jan 14, 2015 at 5:28 AM, Daniel Borkmann <dborkman@redhat.com> wrote:
>>
>> I'm still wondering about the drop semantics ... wouldn't it be more
>> intuitive to use 0 for drops in this context?
>
>good point.
>I think it must be 0 to match behavior of socket filters, etc.
>If program tries to access beyond packet size or does divide
>by zero if will be terminated and will return 0.
>So zero should be the safest action from caller point of view.


Will do. Thanks!

^ permalink raw reply

* Re: [PATCH net 0/3]tg3: synchronize_irq() should be called without taking locks
From: Peter Hurley @ 2015-01-14 15:57 UTC (permalink / raw)
  To: Prashant Sreedharan, davem; +Cc: netdev, mchan
In-Reply-To: <1421217039-11689-1-git-send-email-prashant@broadcom.com>

On 01/14/2015 01:30 AM, Prashant Sreedharan wrote:
> Prashant Sreedharan (3):
>   tg3_timer() should grab tp->lock before checking for tp->irq_sync
>   tg3_reset_task() needs to use rtnl_lock to synchronize
>   Release tp->lock before invoking synchronize_irq()

Thanks!

For series:

Reported-by: Peter Hurley <peter@hurleysoftware.com>
Tested-by: Peter Hurley <peter@hurleysoftware.com>

But maybe one of these patches should reference that this fixes
BUG: sleeping function... so that others can quickly find this
fix (if they're bisecting or whatever). For the same reason, it
might be useful for this series to be just one patch.

Regards,
Peter Hurley

^ permalink raw reply

* Re: [3.19-rc3] tg3: BUG: sleeping function called from invalid context
From: Peter Hurley @ 2015-01-14 16:06 UTC (permalink / raw)
  To: Prashant Sreedharan; +Cc: Michael Chan, netdev, Linux kernel
In-Reply-To: <1421116255.16485.14.camel@prashant>

On 01/12/2015 09:30 PM, Prashant Sreedharan wrote:
> On Mon, 2015-01-12 at 19:59 -0500, Peter Hurley wrote:
>> On 3.19-rc3, I'm seeing this might_sleep() warning [1] from the tg3_open()
>> call stack. Let me know if I need to bisect this.
>>
>> Regards,
>> Peter Hurley
>>
>> [1]
>>
>> [   17.203009] BUG: sleeping function called from invalid context at /home/peter/src/kernels/mainline/kernel/irq/manage.c:104
>> [   17.203067] in_atomic(): 1, irqs_disabled(): 0, pid: 1106, name: ip
>> [   17.203092] 2 locks held by ip/1106:
>> [   17.205255]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816adf1f>] rtnetlink_rcv+0x1f/0x40
>> [   17.207445]  #1:  (&(&tp->lock)->rlock){+.....}, at: [<ffffffffa01073e6>] tg3_start+0xc06/0x11f0 [tg3]
>> [   17.209725] CPU: 2 PID: 1106 Comm: ip Not tainted 3.19.0-rc3+wip-xeon+lockdep #rc3+wip
>> [   17.211900] Hardware name: Dell Inc. Precision WorkStation T5400  /0RW203, BIOS A11 04/30/2012
>> [   17.214086]  0000000000000068 ffff8802ac823498 ffffffff817af7e8 0000000000000005
>> [   17.216265]  ffffffff81a9be78 ffff8802ac8234a8 ffffffff810998a5 ffff8802ac8234d8
>> [   17.218446]  ffffffff8109991a ffff8802ac8234c8 ffff8802af0aae00 ffffffffa00ed000
>> [   17.220636] Call Trace:
>> [   17.222743]  [<ffffffff817af7e8>] dump_stack+0x4f/0x7b
>> [   17.224808]  [<ffffffff810998a5>] ___might_sleep+0x105/0x140
>> [   17.226842]  [<ffffffff8109991a>] __might_sleep+0x3a/0xa0
>> [   17.228869]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.230939]  [<ffffffff810d7d78>] synchronize_irq+0x38/0xa0
>> [   17.232967]  [<ffffffffa00ed000>] ? 0xffffffffa00ed000
>> [   17.234991]  [<ffffffffa010105f>] tg3_chip_reset+0x13f/0x9c0 [tg3]
>> [   17.236988]  [<ffffffffa01020ae>] tg3_reset_hw+0x7e/0x2d20 [tg3]
>> [   17.238996]  [<ffffffff813bfaff>] ? __udelay+0x2f/0x40
>> [   17.241007]  [<ffffffffa00ef2f7>] ? _tw32_flush+0x47/0x80 [tg3]
>> [   17.243066]  [<ffffffffa0104dac>] tg3_init_hw+0x5c/0x70 [tg3]
>> [   17.245438]  [<ffffffffa010740b>] tg3_start+0xc2b/0x11f0 [tg3]
>> [   17.247444]  [<ffffffffa0107ad7>] ? tg3_open+0x107/0x2e0 [tg3]
>> [   17.249556]  [<ffffffff810c338d>] ? trace_hardirqs_on+0xd/0x10
>> [   17.251581]  [<ffffffff8107806f>] ? __local_bh_enable_ip+0x6f/0x100
>> [   17.253710]  [<ffffffffa0107af8>] tg3_open+0x128/0x2e0 [tg3]
>> [   17.255758]  [<ffffffff816ba3f5>] ? netpoll_poll_disable+0x5/0xa0
>> [   17.257932]  [<ffffffff816a14af>] __dev_open+0xbf/0x140
>> [   17.260091]  [<ffffffff816a17c1>] __dev_change_flags+0xa1/0x160
>> [   17.262222]  [<ffffffff816a18a9>] dev_change_flags+0x29/0x60
>> [   17.264360]  [<ffffffff816b0e02>] do_setlink+0x2f2/0xa30
>> [   17.266431]  [<ffffffff816b1b7f>] rtnl_newlink+0x51f/0x750
>> [   17.268485]  [<ffffffff816b1749>] ? rtnl_newlink+0xe9/0x750
>> [   17.270483]  [<ffffffff811869c2>] ? free_pages_prepare+0x1d2/0x270
>> [   17.272507]  [<ffffffff810c32bd>] ? trace_hardirqs_on_caller+0x11d/0x1e0
>> [   17.274531]  [<ffffffff813dd1b2>] ? nla_parse+0x32/0x120
>> [   17.276531]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.278514]  [<ffffffff816adfd5>] rtnetlink_rcv_msg+0x95/0x250
>> [   17.280485]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
>> [   17.282448]  [<ffffffff817b4a02>] ? mutex_lock_nested+0x382/0x530
>> [   17.284402]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
>> [   17.286290]  [<ffffffff816adf1f>] ? rtnetlink_rcv+0x1f/0x40
>> [   17.288142]  [<ffffffff816adf40>] ? rtnetlink_rcv+0x40/0x40
>> [   17.290031]  [<ffffffff816cedc1>] netlink_rcv_skb+0xc1/0xe0
>> [   17.291836]  [<ffffffff816adf2e>] rtnetlink_rcv+0x2e/0x40
>> [   17.293615]  [<ffffffff816ce473>] netlink_unicast+0xf3/0x1d0
>> [   17.295420]  [<ffffffff816ce863>] netlink_sendmsg+0x313/0x690
>> [   17.297132]  [<ffffffff811ada4f>] ? might_fault+0x5f/0xb0
>> [   17.298799]  [<ffffffff8168253c>] do_sock_sendmsg+0x8c/0x100
>> [   17.300493]  [<ffffffff81681e3e>] ? copy_msghdr_from_user+0x15e/0x1f0
>> [   17.302173]  [<ffffffff81682aeb>] ___sys_sendmsg+0x30b/0x320
>> [   17.303798]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.305431]  [<ffffffff810bdee0>] ? cpuacct_account_field+0x80/0xb0
>> [   17.307085]  [<ffffffff81021ab5>] ? native_sched_clock+0x35/0xa0
>> [   17.308744]  [<ffffffff810a4f35>] ? sched_clock_local+0x25/0x90
>> [   17.310375]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
>> [   17.311948]  [<ffffffff810a5198>] ? sched_clock_cpu+0xb8/0xe0
>> [   17.313509]  [<ffffffff810bf8be>] ? put_lock_stats.isra.26+0xe/0x30
>> [   17.315069]  [<ffffffff810c007e>] ? lock_release_holdtime.part.27+0x12e/0x1b0
>> [   17.316618]  [<ffffffff810a5dc1>] ? vtime_account_user+0x91/0xa0
>> [   17.318162]  [<ffffffff8109f5d1>] ? get_parent_ip+0x11/0x50
>> [   17.319703]  [<ffffffff8109f699>] ? preempt_count_sub+0x49/0x50
>> [   17.321235]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
>> [   17.322732]  [<ffffffff811807e5>] ? context_tracking_user_exit+0x55/0x130
>> [   17.324197]  [<ffffffff816834f2>] __sys_sendmsg+0x42/0x80
>> [   17.325634]  [<ffffffff81683542>] SyS_sendmsg+0x12/0x20
>> [   17.327048]  [<ffffffff817ba12d>] system_call_fastpath+0x16/0x1b
> 
> Please bisect, there hasn't been tg3 code changes in this path that
> might cause this.

What triggers this is the new debugging code added to catch nested
sleeps; specifically e22b886 ("sched/wait: Add might_sleep() checks").

Regards,
Peter Hurley

^ permalink raw reply

* Re: [PATCH iproute2 0/3] ip netns: Run over all netns
From: Vadim Kochan @ 2015-01-14 16:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Vadim Kochan, netdev
In-Reply-To: <20150113172837.793f0969@urahara>

On Tue, Jan 13, 2015 at 05:28:37PM -0800, Stephen Hemminger wrote:
> On Wed,  7 Jan 2015 13:04:19 +0200
> Vadim Kochan <vadim4j@gmail.com> wrote:
> 
> > From: Vadim Kochan <vadim4j@gmail.com>
> > 
> > Allow 'ip netns del' and 'ip netns exec' run over each network namespace names.
> > 
> > 'ip netns exec' executes command forcely on eacn nsname.
> > 
> > Vadim Kochan (3):
> >   lib: Exec func on each netns
> >   ip netns: Allow exec on each netns
> >   ip netns: Delete all netns
> > 
> >  include/namespace.h |  6 ++++
> >  include/utils.h     |  5 +++
> >  ip/ipnetns.c        | 96 ++++++++++++++++++++++++++++++++---------------------
> >  lib/namespace.c     | 22 ++++++++++++
> >  lib/utils.c         | 28 ++++++++++++++++
> >  man/man8/ip-netns.8 | 26 ++++++++++++---
> >  6 files changed, 141 insertions(+), 42 deletions(-)
> > 
> 
> It is a useful concept but as others have pointed out the idea of reserving
> a keyword 'all' at this point is likely to cause somebody to break.
> So sorry.
OK, then what about additional option like:

    $ ip -all netns exec ...

?

Thanks,

^ permalink raw reply

* Re: [PATCH v4 20/20] kbuild: add a new kselftest_install make target to install selftests
From: Shuah Khan @ 2015-01-14 16:32 UTC (permalink / raw)
  To: mmarek, masami.hiramatsu.pt
  Cc: gregkh, akpm, rostedt, mingo, davem, keescook, tranmanphong, mpe,
	cov, dh.herrmann, hughd, bobby.prani, serge.hallyn, ebiederm,
	tim.bird, josh, koct9i, linux-kbuild, linux-kernel, linux-api,
	netdev
In-Reply-To: <2c5a28faaa79d9c2415854a08817ada509fcb943.1420571615.git.shuahkh@osg.samsung.com>

On 01/06/2015 12:43 PM, Shuah Khan wrote:
> Add a new make target to install to install kernel selftests.
> This new target will build and install selftests. kselftest
> target now depends on kselftest_install and runs the generated
> kselftest script to reduce duplicate work and for common look
> and feel when running tests.
> 
> make kselftest_target:
> -- exports kselftest INSTALL_KSFT_PATH
>    default $(INSTALL_MOD_PATH)/lib/kselftest/$(KERNELRELEASE)
> -- exports INSTALL_KSFT_PATH
> -- runs selftests make install target
> 
> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
> ---
>  Makefile | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)

Hi Marek,

Could you please Ack this patch, if this version looks good,
so I can take this through ksefltest tree.

thanks,
-- Shuah


-- 
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978

^ permalink raw reply

* Re: [PATCH net-next v13 3/3] net: hisilicon: new hip04 ethernet driver
From: Eric Dumazet @ 2015-01-14 16:34 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: arnd, robh+dt, davem, grant.likely, agraf, sergei.shtylyov,
	linux-arm-kernel, xuwei5, zhangfei.gao, netdev, devicetree, linux
In-Reply-To: <1421217254-12008-4-git-send-email-dingtianhong@huawei.com>

On Wed, 2015-01-14 at 14:34 +0800, Ding Tianhong wrote:
> Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
> The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
> 
> v13: Fix the problem of alignment parameters for function and checkpatch warming.
> 
> v12: According Alex's suggestion, modify the changelog and add MODULE_DEVICE_TABLE
>      for hip04 ethernet.
> 
> v11: Add ethtool support for tx coalecse getting and setting, the xmit_more
>      is not supported for this patch, but I think it could work for hip04,
>      will support it later after some tests for performance better.
> 
>      Here are some performance test results by ping and iperf(add tx_coalesce_frames/users),
>      it looks that the performance and latency is more better by tx_coalesce_frames/usecs.
> 
>      - Before:
>      $ ping 192.168.1.1 ...
>      === 192.168.1.1 ping statistics ===
>      24 packets transmitted, 24 received, 0% packet loss, time 22999ms
>      rtt min/avg/max/mdev = 0.180/0.202/0.403/0.043 ms
> 
>      $ iperf -c 192.168.1.1 ...
>      [ ID] Interval       Transfer     Bandwidth
>      [  3]  0.0- 1.0 sec   115 MBytes   945 Mbits/sec
> 
>      - After:
>      $ ping 192.168.1.1 ...
>      === 192.168.1.1 ping statistics ===
>      24 packets transmitted, 24 received, 0% packet loss, time 22999ms
>      rtt min/avg/max/mdev = 0.178/0.190/0.380/0.041 ms
> 
>      $ iperf -c 192.168.1.1 ...
>      [ ID] Interval       Transfer     Bandwidth
>      [  3]  0.0- 1.0 sec   115 MBytes   965 Mbits/sec
> 
> v10: According David Miller and Arnd Bergmann's suggestion, add some modification
>      for v9 version
>      - drop the workqueue
>      - batch cleanup based on tx_coalesce_frames/usecs for better throughput
>      - use a reasonable default tx timeout (200us, could be shorted
>        based on measurements) with a range timer
>      - fix napi poll function return value
>      - use a lockless queue for cleanup
> 
> Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
> ---
>  drivers/net/ethernet/hisilicon/Makefile    |   2 +-
>  drivers/net/ethernet/hisilicon/hip04_eth.c | 969 +++++++++++++++++++++++++++++
>  2 files changed, 970 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/hisilicon/hip04_eth.c
> 
> diff --git a/drivers/net/ethernet/hisilicon/Makefile b/drivers/net/ethernet/hisilicon/Makefile
> index 40115a7..6c14540 100644
> --- a/drivers/net/ethernet/hisilicon/Makefile
> +++ b/drivers/net/ethernet/hisilicon/Makefile
> @@ -3,4 +3,4 @@
>  #
>  
>  obj-$(CONFIG_HIX5HD2_GMAC) += hix5hd2_gmac.o
> -obj-$(CONFIG_HIP04_ETH) += hip04_mdio.o
> +obj-$(CONFIG_HIP04_ETH) += hip04_mdio.o hip04_eth.o
> diff --git a/drivers/net/ethernet/hisilicon/hip04_eth.c b/drivers/net/ethernet/hisilicon/hip04_eth.c
> new file mode 100644
> index 0000000..525214e
> --- /dev/null
> +++ b/drivers/net/ethernet/hisilicon/hip04_eth.c
> @@ -0,0 +1,969 @@
> +
> +/* Copyright (c) 2014 Linaro Ltd.
> + * Copyright (c) 2014 Hisilicon Limited.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/etherdevice.h>
> +#include <linux/platform_device.h>
> +#include <linux/interrupt.h>
> +#include <linux/ktime.h>
> +#include <linux/of_address.h>
> +#include <linux/phy.h>
> +#include <linux/of_mdio.h>
> +#include <linux/of_net.h>
> +#include <linux/mfd/syscon.h>
> +#include <linux/regmap.h>
> +
> +#define PPE_CFG_RX_ADDR			0x100
> +#define PPE_CFG_POOL_GRP		0x300
> +#define PPE_CFG_RX_BUF_SIZE		0x400
> +#define PPE_CFG_RX_FIFO_SIZE		0x500
> +#define PPE_CURR_BUF_CNT		0xa200
> +
> +#define GE_DUPLEX_TYPE			0x08
> +#define GE_MAX_FRM_SIZE_REG		0x3c
> +#define GE_PORT_MODE			0x40
> +#define GE_PORT_EN			0x44
> +#define GE_SHORT_RUNTS_THR_REG		0x50
> +#define GE_TX_LOCAL_PAGE_REG		0x5c
> +#define GE_TRANSMIT_CONTROL_REG		0x60
> +#define GE_CF_CRC_STRIP_REG		0x1b0
> +#define GE_MODE_CHANGE_REG		0x1b4
> +#define GE_RECV_CONTROL_REG		0x1e0
> +#define GE_STATION_MAC_ADDRESS		0x210
> +#define PPE_CFG_CPU_ADD_ADDR		0x580
> +#define PPE_CFG_MAX_FRAME_LEN_REG	0x408
> +#define PPE_CFG_BUS_CTRL_REG		0x424
> +#define PPE_CFG_RX_CTRL_REG		0x428
> +#define PPE_CFG_RX_PKT_MODE_REG		0x438
> +#define PPE_CFG_QOS_VMID_GEN		0x500
> +#define PPE_CFG_RX_PKT_INT		0x538
> +#define PPE_INTEN			0x600
> +#define PPE_INTSTS			0x608
> +#define PPE_RINT			0x604
> +#define PPE_CFG_STS_MODE		0x700
> +#define PPE_HIS_RX_PKT_CNT		0x804
> +
> +/* REG_INTERRUPT */
> +#define RCV_INT				BIT(10)
> +#define RCV_NOBUF			BIT(8)
> +#define RCV_DROP			BIT(7)
> +#define TX_DROP				BIT(6)
> +#define DEF_INT_ERR			(RCV_NOBUF | RCV_DROP | TX_DROP)
> +#define DEF_INT_MASK			(RCV_INT | DEF_INT_ERR)
> +
> +/* TX descriptor config */
> +#define TX_FREE_MEM			BIT(0)
> +#define TX_READ_ALLOC_L3		BIT(1)
> +#define TX_FINISH_CACHE_INV		BIT(2)
> +#define TX_CLEAR_WB			BIT(4)
> +#define TX_L3_CHECKSUM			BIT(5)
> +#define TX_LOOP_BACK			BIT(11)
> +
> +/* RX error */
> +#define RX_PKT_DROP			BIT(0)
> +#define RX_L2_ERR			BIT(1)
> +#define RX_PKT_ERR			(RX_PKT_DROP | RX_L2_ERR)
> +
> +#define SGMII_SPEED_1000		0x08
> +#define SGMII_SPEED_100			0x07
> +#define SGMII_SPEED_10			0x06
> +#define MII_SPEED_100			0x01
> +#define MII_SPEED_10			0x00
> +
> +#define GE_DUPLEX_FULL			BIT(0)
> +#define GE_DUPLEX_HALF			0x00
> +#define GE_MODE_CHANGE_EN		BIT(0)
> +
> +#define GE_TX_AUTO_NEG			BIT(5)
> +#define GE_TX_ADD_CRC			BIT(6)
> +#define GE_TX_SHORT_PAD_THROUGH		BIT(7)
> +
> +#define GE_RX_STRIP_CRC			BIT(0)
> +#define GE_RX_STRIP_PAD			BIT(3)
> +#define GE_RX_PAD_EN			BIT(4)
> +
> +#define GE_AUTO_NEG_CTL			BIT(0)
> +
> +#define GE_RX_INT_THRESHOLD		BIT(6)
> +#define GE_RX_TIMEOUT			0x04
> +
> +#define GE_RX_PORT_EN			BIT(1)
> +#define GE_TX_PORT_EN			BIT(2)
> +
> +#define PPE_CFG_STS_RX_PKT_CNT_RC	BIT(12)
> +
> +#define PPE_CFG_RX_PKT_ALIGN		BIT(18)
> +#define PPE_CFG_QOS_VMID_MODE		BIT(14)
> +#define PPE_CFG_QOS_VMID_GRP_SHIFT	8
> +
> +#define PPE_CFG_RX_FIFO_FSFU		BIT(11)
> +#define PPE_CFG_RX_DEPTH_SHIFT		16
> +#define PPE_CFG_RX_START_SHIFT		0
> +#define PPE_CFG_RX_CTRL_ALIGN_SHIFT	11
> +
> +#define PPE_CFG_BUS_LOCAL_REL		BIT(14)
> +#define PPE_CFG_BUS_BIG_ENDIEN		BIT(0)
> +
> +#define RX_DESC_NUM			128
> +#define TX_DESC_NUM			256
> +#define TX_NEXT(N)			(((N) + 1) & (TX_DESC_NUM-1))
> +#define RX_NEXT(N)			(((N) + 1) & (RX_DESC_NUM-1))
> +
> +#define GMAC_PPE_RX_PKT_MAX_LEN		379
> +#define GMAC_MAX_PKT_LEN		1516
> +#define GMAC_MIN_PKT_LEN		31
> +#define RX_BUF_SIZE			1600
> +#define RESET_TIMEOUT			1000
> +#define TX_TIMEOUT			(6 * HZ)
> +
> +#define DRV_NAME			"hip04-ether"
> +#define DRV_VERSION			"v1.0"
> +
> +#define HIP04_MAX_TX_COALESCE_USECS	200
> +#define HIP04_MIN_TX_COALESCE_USECS	100
> +#define HIP04_MAX_TX_COALESCE_FRAMES	200
> +#define HIP04_MIN_TX_COALESCE_FRAMES	100
> +
> +struct tx_desc {
> +	u32 send_addr;

	__be32 send_adddr; ?

> +	u32 send_size;

	__be32

> +	u32 next_addr;
	__be32

> +	u32 cfg;
	__be32

> +	u32 wb_addr;
	__be32 wb_addr ?

> +} __aligned(64);
> +
> +struct rx_desc {
> +	u16 reserved_16;
> +	u16 pkt_len;
> +	u32 reserve1[3];
> +	u32 pkt_err;
> +	u32 reserve2[4];
> +};
> +
> +struct hip04_priv {
> +	void __iomem *base;
> +	int phy_mode;
> +	int chan;
> +	unsigned int port;
> +	unsigned int speed;
> +	unsigned int duplex;
> +	unsigned int reg_inten;
> +
> +	struct napi_struct napi;
> +	struct net_device *ndev;
> +
> +	struct tx_desc *tx_desc;
> +	dma_addr_t tx_desc_dma;
> +	struct sk_buff *tx_skb[TX_DESC_NUM];
> +	dma_addr_t tx_phys[TX_DESC_NUM];

This is not an efficient way to store skb/phys, as for each skb, info
will be store in 2 separate cache lines.

It would be better to use a 

struct hip04_tx_desc {
   struct sk_buff   *skb;
   dma_addr_t       phys;
} 

> +	unsigned int tx_head;
> +
> +	int tx_coalesce_frames;
> +	int tx_coalesce_usecs;
> +	struct hrtimer tx_coalesce_timer;
> +
> +	unsigned char *rx_buf[RX_DESC_NUM];
> +	dma_addr_t rx_phys[RX_DESC_NUM];

Same thing here : Use a struct to get better data locality.

> +	unsigned int rx_head;
> +	unsigned int rx_buf_size;
> +
> +	struct device_node *phy_node;
> +	struct phy_device *phy;
> +	struct regmap *map;
> +	struct work_struct tx_timeout_task;
> +
> +	/* written only by tx cleanup */
> +	unsigned int tx_tail ____cacheline_aligned_in_smp;
> +};
> +
> +static inline unsigned int tx_count(unsigned int head, unsigned int tail)
> +{
> +	return (head - tail) % (TX_DESC_NUM - 1);
> +}
> +
> +static void hip04_config_port(struct net_device *ndev, u32 speed, u32 duplex)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	u32 val;
> +
> +	priv->speed = speed;
> +	priv->duplex = duplex;
> +
> +	switch (priv->phy_mode) {
> +	case PHY_INTERFACE_MODE_SGMII:
> +		if (speed == SPEED_1000)
> +			val = SGMII_SPEED_1000;
> +		else if (speed == SPEED_100)
> +			val = SGMII_SPEED_100;
> +		else
> +			val = SGMII_SPEED_10;
> +		break;
> +	case PHY_INTERFACE_MODE_MII:
> +		if (speed == SPEED_100)
> +			val = MII_SPEED_100;
> +		else
> +			val = MII_SPEED_10;
> +		break;
> +	default:
> +		netdev_warn(ndev, "not supported mode\n");
> +		val = MII_SPEED_10;
> +		break;
> +	}
> +	writel_relaxed(val, priv->base + GE_PORT_MODE);
> +
> +	val = duplex ? GE_DUPLEX_FULL : GE_DUPLEX_HALF;
> +	writel_relaxed(val, priv->base + GE_DUPLEX_TYPE);
> +
> +	val = GE_MODE_CHANGE_EN;
> +	writel_relaxed(val, priv->base + GE_MODE_CHANGE_REG);
> +}
> +
> +static void hip04_reset_ppe(struct hip04_priv *priv)
> +{
> +	u32 val, tmp, timeout = 0;
> +
> +	do {
> +		regmap_read(priv->map, priv->port * 4 + PPE_CURR_BUF_CNT, &val);
> +		regmap_read(priv->map, priv->port * 4 + PPE_CFG_RX_ADDR, &tmp);
> +		if (timeout++ > RESET_TIMEOUT)
> +			break;
> +	} while (val & 0xfff);
> +}
> +
> +static void hip04_config_fifo(struct hip04_priv *priv)
> +{
> +	u32 val;
> +
> +	val = readl_relaxed(priv->base + PPE_CFG_STS_MODE);
> +	val |= PPE_CFG_STS_RX_PKT_CNT_RC;
> +	writel_relaxed(val, priv->base + PPE_CFG_STS_MODE);
> +
> +	val = BIT(priv->port);
> +	regmap_write(priv->map, priv->port * 4 + PPE_CFG_POOL_GRP, val);
> +
> +	val = priv->port << PPE_CFG_QOS_VMID_GRP_SHIFT;
> +	val |= PPE_CFG_QOS_VMID_MODE;
> +	writel_relaxed(val, priv->base + PPE_CFG_QOS_VMID_GEN);
> +
> +	val = RX_BUF_SIZE;
> +	regmap_write(priv->map, priv->port * 4 + PPE_CFG_RX_BUF_SIZE, val);
> +
> +	val = RX_DESC_NUM << PPE_CFG_RX_DEPTH_SHIFT;
> +	val |= PPE_CFG_RX_FIFO_FSFU;
> +	val |= priv->chan << PPE_CFG_RX_START_SHIFT;
> +	regmap_write(priv->map, priv->port * 4 + PPE_CFG_RX_FIFO_SIZE, val);
> +
> +	val = NET_IP_ALIGN << PPE_CFG_RX_CTRL_ALIGN_SHIFT;
> +	writel_relaxed(val, priv->base + PPE_CFG_RX_CTRL_REG);
> +
> +	val = PPE_CFG_RX_PKT_ALIGN;
> +	writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_MODE_REG);
> +
> +	val = PPE_CFG_BUS_LOCAL_REL | PPE_CFG_BUS_BIG_ENDIEN;
> +	writel_relaxed(val, priv->base + PPE_CFG_BUS_CTRL_REG);
> +
> +	val = GMAC_PPE_RX_PKT_MAX_LEN;
> +	writel_relaxed(val, priv->base + PPE_CFG_MAX_FRAME_LEN_REG);
> +
> +	val = GMAC_MAX_PKT_LEN;
> +	writel_relaxed(val, priv->base + GE_MAX_FRM_SIZE_REG);
> +
> +	val = GMAC_MIN_PKT_LEN;
> +	writel_relaxed(val, priv->base + GE_SHORT_RUNTS_THR_REG);
> +
> +	val = readl_relaxed(priv->base + GE_TRANSMIT_CONTROL_REG);
> +	val |= GE_TX_AUTO_NEG | GE_TX_ADD_CRC | GE_TX_SHORT_PAD_THROUGH;
> +	writel_relaxed(val, priv->base + GE_TRANSMIT_CONTROL_REG);
> +
> +	val = GE_RX_STRIP_CRC;
> +	writel_relaxed(val, priv->base + GE_CF_CRC_STRIP_REG);
> +
> +	val = readl_relaxed(priv->base + GE_RECV_CONTROL_REG);
> +	val |= GE_RX_STRIP_PAD | GE_RX_PAD_EN;
> +	writel_relaxed(val, priv->base + GE_RECV_CONTROL_REG);
> +
> +	val = GE_AUTO_NEG_CTL;
> +	writel_relaxed(val, priv->base + GE_TX_LOCAL_PAGE_REG);
> +}
> +
> +static void hip04_mac_enable(struct net_device *ndev)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	u32 val;
> +
> +	/* enable tx & rx */
> +	val = readl_relaxed(priv->base + GE_PORT_EN);
> +	val |= GE_RX_PORT_EN | GE_TX_PORT_EN;
> +	writel_relaxed(val, priv->base + GE_PORT_EN);
> +
> +	/* clear rx int */
> +	val = RCV_INT;
> +	writel_relaxed(val, priv->base + PPE_RINT);
> +
> +	/* config recv int */
> +	val = GE_RX_INT_THRESHOLD | GE_RX_TIMEOUT;
> +	writel_relaxed(val, priv->base + PPE_CFG_RX_PKT_INT);
> +
> +	/* enable interrupt */
> +	priv->reg_inten = DEF_INT_MASK;
> +	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +}
> +
> +static void hip04_mac_disable(struct net_device *ndev)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	u32 val;
> +
> +	/* disable int */
> +	priv->reg_inten &= ~(DEF_INT_MASK);
> +	writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +
> +	/* disable tx & rx */
> +	val = readl_relaxed(priv->base + GE_PORT_EN);
> +	val &= ~(GE_RX_PORT_EN | GE_TX_PORT_EN);
> +	writel_relaxed(val, priv->base + GE_PORT_EN);
> +}
> +
> +static void hip04_set_xmit_desc(struct hip04_priv *priv, dma_addr_t phys)
> +{
> +	writel(phys, priv->base + PPE_CFG_CPU_ADD_ADDR);
> +}
> +
> +static void hip04_set_recv_desc(struct hip04_priv *priv, dma_addr_t phys)
> +{
> +	regmap_write(priv->map, priv->port * 4 + PPE_CFG_RX_ADDR, phys);
> +}
> +
> +static u32 hip04_recv_cnt(struct hip04_priv *priv)
> +{
> +	return readl(priv->base + PPE_HIS_RX_PKT_CNT);
> +}
> +
> +static void hip04_update_mac_address(struct net_device *ndev)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +
> +	writel_relaxed(((ndev->dev_addr[0] << 8) | (ndev->dev_addr[1])),
> +		       priv->base + GE_STATION_MAC_ADDRESS);
> +	writel_relaxed(((ndev->dev_addr[2] << 24) | (ndev->dev_addr[3] << 16) |
> +			(ndev->dev_addr[4] << 8) | (ndev->dev_addr[5])),
> +		       priv->base + GE_STATION_MAC_ADDRESS + 4);
> +}
> +
> +static int hip04_set_mac_address(struct net_device *ndev, void *addr)
> +{
> +	eth_mac_addr(ndev, addr);
> +	hip04_update_mac_address(ndev);
> +	return 0;
> +}
> +
> +static int hip04_tx_reclaim(struct net_device *ndev, bool force)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	unsigned tx_tail = priv->tx_tail;
> +	struct tx_desc *desc;
> +	unsigned int bytes_compl = 0, pkts_compl = 0;
> +	unsigned int count;
> +
> +	smp_rmb();
> +	count = tx_count(ACCESS_ONCE(priv->tx_head), tx_tail);
> +	if (count == 0)
> +		goto out;
> +
> +	while (count) {
> +		desc = &priv->tx_desc[tx_tail];
> +		if (desc->send_addr != 0) {
> +			if (force)
> +				desc->send_addr = 0;
> +			else
> +				break;
> +		}
> +
> +		if (priv->tx_phys[tx_tail]) {
> +			dma_unmap_single(&ndev->dev, priv->tx_phys[tx_tail],
> +					 priv->tx_skb[tx_tail]->len,
> +					 DMA_TO_DEVICE);
> +			priv->tx_phys[tx_tail] = 0;
> +		}
> +		pkts_compl++;
> +		bytes_compl += priv->tx_skb[tx_tail]->len;
> +		dev_kfree_skb(priv->tx_skb[tx_tail]);
> +		priv->tx_skb[tx_tail] = NULL;
> +		tx_tail = TX_NEXT(tx_tail);
> +		count--;
> +	}
> +
> +	priv->tx_tail = tx_tail;
> +	smp_wmb(); /* Ensure tx_tail visible to xmit */
> +
> +out:
> +	if (pkts_compl || bytes_compl)

Testing bytes_compl should be enough : There is no way pkt_compl could
be 0 if bytes_compl is not 0.

> +		netdev_completed_queue(ndev, pkts_compl, bytes_compl);
> +
> +	if (unlikely(netif_queue_stopped(ndev)) && (count < (TX_DESC_NUM - 1)))
> +		netif_wake_queue(ndev);
> +
> +	return count;
> +}
> +
> +static int hip04_mac_start_xmit(struct sk_buff *skb, struct net_device *ndev)
> +{
> +	struct hip04_priv *priv = netdev_priv(ndev);
> +	struct net_device_stats *stats = &ndev->stats;
> +	unsigned int tx_head = priv->tx_head, count;
> +	struct tx_desc *desc = &priv->tx_desc[tx_head];
> +	dma_addr_t phys;
> +
> +	smp_rmb();
> +	count = tx_count(tx_head, ACCESS_ONCE(priv->tx_tail));
> +	if (count == (TX_DESC_NUM - 1)) {
> +		netif_stop_queue(ndev);
> +		return NETDEV_TX_BUSY;
> +	}
> +
> +	phys = dma_map_single(&ndev->dev, skb->data, skb->len, DMA_TO_DEVICE);
> +	if (dma_mapping_error(&ndev->dev, phys)) {
> +		dev_kfree_skb(skb);
> +		return NETDEV_TX_OK;
> +	}
> +
> +	priv->tx_skb[tx_head] = skb;
> +	priv->tx_phys[tx_head] = phys;
> +	desc->send_addr = cpu_to_be32(phys);
> +	desc->send_size = cpu_to_be32(skb->len);
> +	desc->cfg = cpu_to_be32(TX_CLEAR_WB | TX_FINISH_CACHE_INV);
> +	phys = priv->tx_desc_dma + tx_head * sizeof(struct tx_desc);
> +	desc->wb_addr = cpu_to_be32(phys);
> +	skb_tx_timestamp(skb);
> +
> +	hip04_set_xmit_desc(priv, phys);
> +	priv->tx_head = TX_NEXT(tx_head);
> +	count++;

Starting from this point, skb might already have been freed by TX
completion.

Its racy to access skb->len 

> +	netdev_sent_queue(ndev, skb->len);
> +
> +	stats->tx_bytes += skb->len;
> +	stats->tx_packets++;
> +
> +	/* Ensure tx_head update visible to tx reclaim */
> +	smp_wmb();
> +
> +	/* queue is getting full, better start cleaning up now */
> +	if (count >= priv->tx_coalesce_frames) {
> +		if (napi_schedule_prep(&priv->napi)) {
> +			/* disable rx interrupt and timer */
> +			priv->reg_inten &= ~(RCV_INT);
> +			writel_relaxed(DEF_INT_MASK & ~RCV_INT,
> +				       priv->base + PPE_INTEN);
> +			hrtimer_cancel(&priv->tx_coalesce_timer);
> +			__napi_schedule(&priv->napi);
> +		}
> +	} else if (!hrtimer_is_queued(&priv->tx_coalesce_timer)) {
> +		/* cleanup not pending yet, start a new timer */
> +		hrtimer_start_expires(&priv->tx_coalesce_timer,
> +				      HRTIMER_MODE_REL);
> +	}
> +
> +	return NETDEV_TX_OK;
> +}
> +
> +static int hip04_rx_poll(struct napi_struct *napi, int budget)
> +{
> +	struct hip04_priv *priv = container_of(napi, struct hip04_priv, napi);
> +	struct net_device *ndev = priv->ndev;
> +	struct net_device_stats *stats = &ndev->stats;
> +	unsigned int cnt = hip04_recv_cnt(priv);
> +	struct rx_desc *desc;
> +	struct sk_buff *skb;
> +	unsigned char *buf;
> +	bool last = false;
> +	dma_addr_t phys;
> +	int rx = 0;
> +	int tx_remaining;
> +	u16 len;
> +	u32 err;
> +
> +	while (cnt && !last) {
> +		buf = priv->rx_buf[priv->rx_head];
> +		skb = build_skb(buf, priv->rx_buf_size);
> +		if (unlikely(!skb))
> +			net_dbg_ratelimited("build_skb failed\n");

Well, is skb is NULL, you're crashing later...
You really have to address a memory allocation error much better than
that !

> +
> +		dma_unmap_single(&ndev->dev, priv->rx_phys[priv->rx_head],
> +				 RX_BUF_SIZE, DMA_FROM_DEVICE);
> +		priv->rx_phys[priv->rx_head] = 0;
> +
> +		desc = (struct rx_desc *)skb->data;
> +		len = be16_to_cpu(desc->pkt_len);
> +		err = be32_to_cpu(desc->pkt_err);
> +
> +		if (0 == len) {
> +			dev_kfree_skb_any(skb);
> +			last = true;
> +		} else if ((err & RX_PKT_ERR) || (len >= GMAC_MAX_PKT_LEN)) {
> +			dev_kfree_skb_any(skb);
> +			stats->rx_dropped++;
> +			stats->rx_errors++;
> +		} else {
> +			skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
> +			skb_put(skb, len);
> +			skb->protocol = eth_type_trans(skb, ndev);
> +			napi_gro_receive(&priv->napi, skb);
> +			stats->rx_packets++;
> +			stats->rx_bytes += len;
> +			rx++;
> +		}
> +
> +		buf = netdev_alloc_frag(priv->rx_buf_size);
> +		if (!buf)
> +			goto done;

Same problem here : In case of memory allocation error, your driver is
totally screwed.

> +		phys = dma_map_single(&ndev->dev, buf,
> +				      RX_BUF_SIZE, DMA_FROM_DEVICE);
> +		if (dma_mapping_error(&ndev->dev, phys))
> +			goto done;

Same problem here : You really have to recover properly.

> +		priv->rx_buf[priv->rx_head] = buf;
> +		priv->rx_phys[priv->rx_head] = phys;
> +		hip04_set_recv_desc(priv, phys);
> +
> +		priv->rx_head = RX_NEXT(priv->rx_head);
> +		if (rx >= budget)
> +			goto done;
> +
> +		if (--cnt == 0)
> +			cnt = hip04_recv_cnt(priv);
> +	}
> +
> +	if (!(priv->reg_inten & RCV_INT)) {
> +		/* enable rx interrupt */
> +		priv->reg_inten |= RCV_INT;
> +		writel_relaxed(priv->reg_inten, priv->base + PPE_INTEN);
> +	}
> +	napi_complete(napi);
> +done:
> +	/* clean up tx descriptors and start a new timer if necessary */
> +	tx_remaining = hip04_tx_reclaim(ndev, false);
> +	if (rx < budget && tx_remaining)
> +		hrtimer_start_expires(&priv->tx_coalesce_timer, HRTIMER_MODE_REL);
> +
> +	return rx;
> +}
> +

^ permalink raw reply

* [patch-net-next v2 2/3] net: ethernet: cpsw: split out IRQ handler
From: Felipe Balbi @ 2015-01-14 16:58 UTC (permalink / raw)
  To: davem
  Cc: Tony Lindgren, Linux OMAP Mailing List, mugunthanvnm, netdev,
	Felipe Balbi
In-Reply-To: <1421254729-10602-1-git-send-email-balbi@ti.com>

Now we can introduce dedicated IRQ handlers
for each of the IRQ events. This helps with
cleaning up a little bit of the clutter in
cpsw_interrupt() while also making sure that
TX IRQs will try to handle TX buffers while
RX IRQs will try to handle RX buffers.

Signed-off-by: Felipe Balbi <balbi@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 8e1af51e4b76..c6c483f3e49f 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -754,18 +754,36 @@ requeue:
 		dev_kfree_skb_any(new_skb);
 }
 
-static irqreturn_t cpsw_interrupt(int irq, void *dev_id)
+static irqreturn_t cpsw_dummy_interrupt(int irq, void *dev_id)
 {
 	struct cpsw_priv *priv = dev_id;
 	int value = irq - priv->irqs_table[0];
 
-	/* NOTICE: Ending IRQ here. The trick with the 'value' variable above
-	 * is to make sure we will always write the correct value to the EOI
-	 * register. Namely 0 for RX_THRESH Interrupt, 1 for RX Interrupt, 2
-	 * for TX Interrupt and 3 for MISC Interrupt.
-	 */
 	cpdma_ctlr_eoi(priv->dma, value);
 
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t cpsw_tx_interrupt(int irq, void *dev_id)
+{
+	struct cpsw_priv *priv = dev_id;
+
+	cpdma_ctlr_eoi(priv->dma, CPDMA_EOI_TX);
+	cpdma_chan_process(priv->txch, 128);
+
+	priv = cpsw_get_slave_priv(priv, 1);
+	if (priv)
+		cpdma_chan_process(priv->txch, 128);
+
+	return IRQ_HANDLED;
+}
+
+static irqreturn_t cpsw_rx_interrupt(int irq, void *dev_id)
+{
+	struct cpsw_priv *priv = dev_id;
+
+	cpdma_ctlr_eoi(priv->dma, CPDMA_EOI_RX);
+
 	cpsw_intr_disable(priv);
 	if (priv->irq_enabled == true) {
 		cpsw_disable_irq(priv);
@@ -1617,7 +1635,8 @@ static void cpsw_ndo_poll_controller(struct net_device *ndev)
 
 	cpsw_intr_disable(priv);
 	cpdma_ctlr_int_ctrl(priv->dma, false);
-	cpsw_interrupt(ndev->irq, priv);
+	cpsw_rx_interrupt(priv->irq[1], priv);
+	cpsw_tx_interrupt(priv->irq[2], priv);
 	cpdma_ctlr_int_ctrl(priv->dma, true);
 	cpsw_intr_enable(priv);
 }
@@ -2351,7 +2370,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 
 	priv->irqs_table[0] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_dummy_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
 		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
@@ -2363,7 +2382,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 
 	priv->irqs_table[1] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_rx_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
 		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
@@ -2375,7 +2394,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 
 	priv->irqs_table[2] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_tx_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
 		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
@@ -2387,7 +2406,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 
 	priv->irqs_table[3] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_dummy_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
 		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
-- 
2.2.0


^ permalink raw reply related

* [patch-net-next v2 1/3] net: ethernet: cpsw: unroll IRQ request loop
From: Felipe Balbi @ 2015-01-14 16:58 UTC (permalink / raw)
  To: davem
  Cc: Tony Lindgren, Linux OMAP Mailing List, mugunthanvnm, netdev,
	Felipe Balbi

This patch is in preparation for a nicer IRQ
handling scheme where we use different IRQ
handlers for each IRQ line (as it should be).

Later, we will also drop IRQs offset 0 and 3
because they are always disabled in this driver.

Signed-off-by: Felipe Balbi <balbi@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 62 ++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index e61ee8351272..8e1af51e4b76 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2156,7 +2156,8 @@ static int cpsw_probe(struct platform_device *pdev)
 	void __iomem			*ss_regs;
 	struct resource			*res, *ss_res;
 	u32 slave_offset, sliver_offset, slave_size;
-	int ret = 0, i, k = 0;
+	int ret = 0, i;
+	int irq;
 
 	ndev = alloc_etherdev(sizeof(struct cpsw_priv));
 	if (!ndev) {
@@ -2345,24 +2346,55 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 	}
 
-	while ((res = platform_get_resource(priv->pdev, IORESOURCE_IRQ, k))) {
-		if (k >= ARRAY_SIZE(priv->irqs_table)) {
-			ret = -EINVAL;
-			goto clean_ale_ret;
-		}
+	irq = platform_get_irq(pdev, 0);
+	if (irq < 0)
+		goto clean_ale_ret;
 
-		ret = devm_request_irq(&pdev->dev, res->start, cpsw_interrupt,
-				       0, dev_name(&pdev->dev), priv);
-		if (ret < 0) {
-			dev_err(priv->dev, "error attaching irq (%d)\n", ret);
-			goto clean_ale_ret;
-		}
+	priv->irqs_table[0] = irq;
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+			       0, dev_name(&pdev->dev), priv);
+	if (ret < 0) {
+		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
+		goto clean_ale_ret;
+	}
 
-		priv->irqs_table[k] = res->start;
-		k++;
+	irq = platform_get_irq(pdev, 1);
+	if (irq < 0)
+		goto clean_ale_ret;
+
+	priv->irqs_table[1] = irq;
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+			       0, dev_name(&pdev->dev), priv);
+	if (ret < 0) {
+		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
+		goto clean_ale_ret;
+	}
+
+	irq = platform_get_irq(pdev, 2);
+	if (irq < 0)
+		goto clean_ale_ret;
+
+	priv->irqs_table[2] = irq;
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+			       0, dev_name(&pdev->dev), priv);
+	if (ret < 0) {
+		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
+		goto clean_ale_ret;
+	}
+
+	irq = platform_get_irq(pdev, 3);
+	if (irq < 0)
+		goto clean_ale_ret;
+
+	priv->irqs_table[3] = irq;
+	ret = devm_request_irq(&pdev->dev, irq, cpsw_interrupt,
+			       0, dev_name(&pdev->dev), priv);
+	if (ret < 0) {
+		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
+		goto clean_ale_ret;
 	}
 
-	priv->num_irqs = k;
+	priv->num_irqs = 4;
 
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
-- 
2.2.0

^ permalink raw reply related

* [patch-net-next v2 3/3] net: ethernet: cpsw: don't requests IRQs we don't use
From: Felipe Balbi @ 2015-01-14 16:58 UTC (permalink / raw)
  To: davem
  Cc: Tony Lindgren, Linux OMAP Mailing List, mugunthanvnm, netdev,
	Felipe Balbi
In-Reply-To: <1421254729-10602-1-git-send-email-balbi@ti.com>

CPSW never uses RX_THRESHOLD or MISC interrupts. In
fact, they are always kept masked in their appropriate
IRQ Enable register.

Instead of allocating an IRQ that never fires, it's best
to remove that code altogether and let future patches
implement it if anybody needs those.

Signed-off-by: Felipe Balbi <balbi@ti.com>
---
 drivers/net/ethernet/ti/cpsw.c | 55 ++++++++++++------------------------------
 1 file changed, 15 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c6c483f3e49f..ba09ff3c1695 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -754,16 +754,6 @@ requeue:
 		dev_kfree_skb_any(new_skb);
 }
 
-static irqreturn_t cpsw_dummy_interrupt(int irq, void *dev_id)
-{
-	struct cpsw_priv *priv = dev_id;
-	int value = irq - priv->irqs_table[0];
-
-	cpdma_ctlr_eoi(priv->dma, value);
-
-	return IRQ_HANDLED;
-}
-
 static irqreturn_t cpsw_tx_interrupt(int irq, void *dev_id)
 {
 	struct cpsw_priv *priv = dev_id;
@@ -1635,8 +1625,8 @@ static void cpsw_ndo_poll_controller(struct net_device *ndev)
 
 	cpsw_intr_disable(priv);
 	cpdma_ctlr_int_ctrl(priv->dma, false);
-	cpsw_rx_interrupt(priv->irq[1], priv);
-	cpsw_tx_interrupt(priv->irq[2], priv);
+	cpsw_rx_interrupt(priv->irq[0], priv);
+	cpsw_tx_interrupt(priv->irq[1], priv);
 	cpdma_ctlr_int_ctrl(priv->dma, true);
 	cpsw_intr_enable(priv);
 }
@@ -2358,30 +2348,27 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_dma_ret;
 	}
 
-	ndev->irq = platform_get_irq(pdev, 0);
+	ndev->irq = platform_get_irq(pdev, 1);
 	if (ndev->irq < 0) {
 		dev_err(priv->dev, "error getting irq resource\n");
 		ret = -ENOENT;
 		goto clean_ale_ret;
 	}
 
-	irq = platform_get_irq(pdev, 0);
-	if (irq < 0)
-		goto clean_ale_ret;
-
-	priv->irqs_table[0] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_dummy_interrupt,
-			       0, dev_name(&pdev->dev), priv);
-	if (ret < 0) {
-		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
-		goto clean_ale_ret;
-	}
+	/* Grab RX and TX IRQs. Note that we also have RX_THRESHOLD and
+	 * MISC IRQs which are always kept disabled with this driver so
+	 * we will not request them.
+	 *
+	 * If anyone wants to implement support for those, make sure to
+	 * first request and append them to irqs_table array.
+	 */
 
+	/* RX IRQ */
 	irq = platform_get_irq(pdev, 1);
 	if (irq < 0)
 		goto clean_ale_ret;
 
-	priv->irqs_table[1] = irq;
+	priv->irqs_table[0] = irq;
 	ret = devm_request_irq(&pdev->dev, irq, cpsw_rx_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
@@ -2389,31 +2376,19 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ale_ret;
 	}
 
+	/* TX IRQ */
 	irq = platform_get_irq(pdev, 2);
 	if (irq < 0)
 		goto clean_ale_ret;
 
-	priv->irqs_table[2] = irq;
+	priv->irqs_table[1] = irq;
 	ret = devm_request_irq(&pdev->dev, irq, cpsw_tx_interrupt,
 			       0, dev_name(&pdev->dev), priv);
 	if (ret < 0) {
 		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
 		goto clean_ale_ret;
 	}
-
-	irq = platform_get_irq(pdev, 3);
-	if (irq < 0)
-		goto clean_ale_ret;
-
-	priv->irqs_table[3] = irq;
-	ret = devm_request_irq(&pdev->dev, irq, cpsw_dummy_interrupt,
-			       0, dev_name(&pdev->dev), priv);
-	if (ret < 0) {
-		dev_err(priv->dev, "error attaching irq (%d)\n", ret);
-		goto clean_ale_ret;
-	}
-
-	priv->num_irqs = 4;
+	priv->num_irqs = 2;
 
 	ndev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
-- 
2.2.0

^ permalink raw reply related

* [patch net] team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
From: Jiri Pirko @ 2015-01-14 17:15 UTC (permalink / raw)
  To: netdev; +Cc: davem, jbenc

This patch is fixing a race condition that may cause setting
count_pending to -1, which results in unwanted big bulk of arp messages
(in case of "notify peers").

Consider following scenario:

count_pending == 2
   CPU0                                           CPU1
					team_notify_peers_work
					  atomic_dec_and_test (dec count_pending to 1)
					  schedule_delayed_work
 team_notify_peers
   atomic_add (adding 1 to count_pending)
					team_notify_peers_work
					  atomic_dec_and_test (dec count_pending to 1)
					  schedule_delayed_work
					team_notify_peers_work
					  atomic_dec_and_test (dec count_pending to 0)
   schedule_delayed_work
					team_notify_peers_work
					  atomic_dec_and_test (dec count_pending to -1)

Fix this race by using atomic_dec_if_positive - that will prevent
count_pending running under 0.

Fixes: fc423ff00df3a1955441 ("team: add peer notification")
Fixes: 492b200efdd20b8fcfd  ("team: add support for sending multicast rejoins")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
---
 drivers/net/team/team.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 43bcfff..4b2bfc5 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -622,6 +622,7 @@ static int team_change_mode(struct team *team, const char *kind)
 static void team_notify_peers_work(struct work_struct *work)
 {
 	struct team *team;
+	int val;
 
 	team = container_of(work, struct team, notify_peers.dw.work);
 
@@ -629,9 +630,14 @@ static void team_notify_peers_work(struct work_struct *work)
 		schedule_delayed_work(&team->notify_peers.dw, 0);
 		return;
 	}
+	val = atomic_dec_if_positive(&team->notify_peers.count_pending);
+	if (val < 0) {
+		rtnl_unlock();
+		return;
+	}
 	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, team->dev);
 	rtnl_unlock();
-	if (!atomic_dec_and_test(&team->notify_peers.count_pending))
+	if (val)
 		schedule_delayed_work(&team->notify_peers.dw,
 				      msecs_to_jiffies(team->notify_peers.interval));
 }
@@ -662,6 +668,7 @@ static void team_notify_peers_fini(struct team *team)
 static void team_mcast_rejoin_work(struct work_struct *work)
 {
 	struct team *team;
+	int val;
 
 	team = container_of(work, struct team, mcast_rejoin.dw.work);
 
@@ -669,9 +676,14 @@ static void team_mcast_rejoin_work(struct work_struct *work)
 		schedule_delayed_work(&team->mcast_rejoin.dw, 0);
 		return;
 	}
+	val = atomic_dec_if_positive(&team->mcast_rejoin.count_pending);
+	if (val < 0) {
+		rtnl_unlock();
+		return;
+	}
 	call_netdevice_notifiers(NETDEV_RESEND_IGMP, team->dev);
 	rtnl_unlock();
-	if (!atomic_dec_and_test(&team->mcast_rejoin.count_pending))
+	if (val)
 		schedule_delayed_work(&team->mcast_rejoin.dw,
 				      msecs_to_jiffies(team->mcast_rejoin.interval));
 }
-- 
1.9.3

^ permalink raw reply related

* Re: [bisected regression] e1000e: "Detected Hardware Unit Hang"
From: Eric Dumazet @ 2015-01-14 17:20 UTC (permalink / raw)
  To: Thomas Jarosch
  Cc: 'Linux Netdev List', Eric Dumazet, Jeff Kirsher,
	e1000-devel
In-Reply-To: <1719052.SGOfRAJhfQ@storm>

On Wed, 2015-01-14 at 16:32 +0100, Thomas Jarosch wrote:
> Hello,
> 
> after updating a good bunch of production level machines
> from kernel 3.4.101 to kernel 3.14.25, a few of them started
> to show serious trouble when there was a lot of network traffic.
> 
> ---------------------------------------------------------------
> Jan 14 11:14:57 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
> Jan 14 11:14:57 intrartc kernel:  TDH                  <3b>
> Jan 14 11:14:57 intrartc kernel:  TDT                  <76>
> Jan 14 11:14:57 intrartc kernel:  next_to_use          <76>
> Jan 14 11:14:57 intrartc kernel:  next_to_clean        <31>
> Jan 14 11:14:57 intrartc kernel: buffer_info[next_to_clean]:
> Jan 14 11:14:57 intrartc kernel:  time_stamp           <ffff328c>
> Jan 14 11:14:57 intrartc kernel:  next_to_watch        <3b>
> Jan 14 11:14:57 intrartc kernel:  jiffies              <ffff33b9>
> Jan 14 11:14:57 intrartc kernel:  next_to_watch.status <0>
> Jan 14 11:14:57 intrartc kernel: MAC Status             <40080083>
> Jan 14 11:14:57 intrartc kernel: PHY Status             <796d>
> Jan 14 11:14:57 intrartc kernel: PHY 1000BASE-T Status  <3800>
> Jan 14 11:14:57 intrartc kernel: PHY Extended Status    <3000>
> Jan 14 11:14:57 intrartc kernel: PCI Status             <10>
> Jan 14 11:14:59 intrartc kernel: e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang:
> ..
> ---------------------------------------------------------------
> 
> All of those troubled machines use an Intel DH61CR board and
> are driven by the e1000e driver. Kernels 3.7.0 to 3.19-rc4 are affected.
> 
> The problem vanishes when you disable TSO. This is the
> recommended "solution" on serverfault and others.
> http://ehc.ac/p/e1000/bugs/378/
> http://serverfault.com/questions/616485/e1000e-reset-adapter-unexpectedly-detected-hardware-unit-hang
> 
> I have a test setup that can trigger the problem within seconds
> and bisected it down to this commit (hi Eric!):
> ---------------------------------------------------------------
> commit 69b08f62e17439ee3d436faf0b9a7ca6fffb78db
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Sep 26 06:46:57 2012 +0000
> 
>     net: use bigger pages in __netdev_alloc_frag
> 
>     We currently use percpu order-0 pages in __netdev_alloc_frag
>     to deliver fragments used by __netdev_alloc_skb()
> 
>     Depending on NIC driver and arch being 32 or 64 bit, it allows a page to
>     be split in several fragments (between 1 and 8), assuming PAGE_SIZE=4096
> 
>     Switching to bigger pages (32768 bytes for PAGE_SIZE=4096 case) allows :
> 
>     - Better filling of space (the ending hole overhead is less an issue)
> 
>     - Less calls to page allocator or accesses to page->_count
> 
>     - Could allow struct skb_shared_info futures changes without major
>     performance impact.
> 
>     This patch implements a transparent fallback to smaller
>     pages in case of memory pressure.
> 
>     It also uses a standard "struct page_frag" instead of a custom one.
> 
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Cc: Alexander Duyck <alexander.h.duyck@intel.com>
>     Cc: Benjamin LaHaise <bcrl@kvack.org>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> ---------------------------------------------------------------
> 
> Reverting the commit f.e. in kernel 3.7.0  solves the issue.
> I've done some more tests:
> 
>     3.18.0 32bit + PAE: broken
>     3.6.0 32bit + PAE: works
>     3.7.0 32bit + PAE: broken
>     3.7.0 32bit + PAE + revert 69b08f62e17439ee3d436faf0b9a7ca6fffb78db -> works
> 
>     3.7.0 32bit (without PAE) -> broken
>     3.7.0 32bit + "GFP_COMP" flag removed in __netdev_alloc_frag(): broken
>     3.7.0 32bit + "GFP_COMP" flag replaced with
>                               "GFP_DMA" in __netdev_alloc_frag(): works!
>     3.7.0 32bit + "GFP_COMP" flag + "GFP_DMA" flag: broken
>     3.19-rc4 32bit: broken
> 
> 
> The problem is triggered only when the traffic is forwarded to another client.
> (this client is behind NAT). Generating traffic directly
> on the system did not trigger the issue.
> 
> To me it looks like Eric's change uncovered a memory allocation
> issue in the e1000e driver: It probably uses a memory address
> unsuitable for DMA or so. This is just a guess though.
> 
> Funny fact: I have another Intel DH61CR board that does not show the problem.
> I've borrowed (...) the mainboard from one affected box for my bisect test setup.
> 
> Please CC: comments. Thanks.

I would try to use lower data per txd. I am not sure 24KB is really
supported.

( check commit d821a4c4d11ad160925dab2bb009b8444beff484 for details)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index e14fd85f64eb..8d973f7edfbd 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3897,7 +3897,7 @@ void e1000e_reset(struct e1000_adapter *adapter)
 	 * limit of 24KB due to receive synchronization limitations.
 	 */
 	adapter->tx_fifo_limit = min_t(u32, ((er32(PBA) >> 16) << 10) - 96,
-				       24 << 10);
+				       8 << 10);
 
 	/* Disable Adaptive Interrupt Moderation if 2 full packets cannot
 	 * fit in receive buffer.

^ permalink raw reply related

* [PATCH v3 02/16] virtio/9p: verify device has config space
From: Michael S. Tsirkin @ 2015-01-14 17:27 UTC (permalink / raw)
  To: linux-kernel, virtualization
  Cc: Eric Van Hensbergen, netdev, v9fs-developer, Ron Minnich,
	David S. Miller
In-Reply-To: <1421256142-11512-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/9p needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/9p/trans_virtio.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index daa749c..d8e376a 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -524,6 +524,12 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	int err;
 	struct virtio_chan *chan;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	chan = kmalloc(sizeof(struct virtio_chan), GFP_KERNEL);
 	if (!chan) {
 		pr_err("Failed to allocate virtio 9P channel\n");
-- 
MST

^ permalink raw reply related

* [PATCH v3 05/16] virtio/net: verify device has config space
From: Michael S. Tsirkin @ 2015-01-14 17:27 UTC (permalink / raw)
  To: linux-kernel, virtualization
  Cc: Rusty Russell, cornelia.huck, virtualization, netdev
In-Reply-To: <1421256142-11512-1-git-send-email-mst@redhat.com>

Some devices might not implement config space access
(e.g. remoteproc used not to - before 3.9).
virtio/net needs config space access so make it
fail gracefully if not there.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5ca9771..9bc1072 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1713,6 +1713,12 @@ static int virtnet_probe(struct virtio_device *vdev)
 	struct virtnet_info *vi;
 	u16 max_queue_pairs;
 
+	if (!vdev->config->get) {
+		dev_err(&vdev->dev, "%s failure: config access disabled\n",
+			__func__);
+		return -EINVAL;
+	}
+
 	if (!virtnet_validate_features(vdev))
 		return -EINVAL;
 
-- 
MST

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox