Netdev List
 help / color / mirror / Atom feed
* request_threaded_irq()
From: Mark Ryden @ 2010-11-12 12:27 UTC (permalink / raw)
  To: netdev

Hello netdev,

grepping under net-next-2.6/drivers/net for request_threaded_irq() ,
shows that it appears only in 3 drivers:

can/mcp251x.c
wireless/b43/main.c
wireless/rt2x00/rt2x00pci.c

I was wondering: when thinking about performance, is it worthwhile to use this
API instead of ordinary request_irq() . It seems to me that
request_threaded_irq()  might
be better in some cases than NAPI polling forr network drivers (or at
list it might be so in some systems, maybe multicore ?)


Best,
Mark Ryden

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Wolfgang Grandegger @ 2010-11-12 11:45 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, Samuel Ortiz,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Marc Kleine-Budde, Christian Pellegrin,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <004a01cb825a$3a8bd060$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>

On 11/12/2010 12:10 PM, Tomoya MORINAGA wrote:
> On Saturday, October 30, 2010 4:32 AM, Wolfgang Grandegger wrote :
> 
> Sorry, for my late.
> 
>>>> +
>>>> +#define PCH_RX_OK 0x00000010
>>>> +#define PCH_TX_OK 0x00000008
>>>> +#define PCH_BUS_OFF 0x00000080
>>>> +#define PCH_EWARN 0x00000040
>>>> +#define PCH_EPASSIV 0x00000020
>>>> +#define PCH_LEC0 0x00000001
>>>> +#define PCH_LEC1 0x00000002
>>>> +#define PCH_LEC2 0x00000004
>>>
>>> These are just single set bit, please use BIT()
>>> Consider adding the name of the corresponding register to the define's
>>> name.
> 
> I agree.
> 
>>>
>>>> +#define PCH_LEC_ALL (PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
>>>> +#define PCH_STUF_ERR PCH_LEC0
>>>> +#define PCH_FORM_ERR PCH_LEC1
>>>> +#define PCH_ACK_ERR (PCH_LEC0 | PCH_LEC1)
>>>> +#define PCH_BIT1_ERR PCH_LEC2
>>>> +#define PCH_BIT0_ERR (PCH_LEC0 | PCH_LEC2)
>>>> +#define PCH_CRC_ERR (PCH_LEC1 | PCH_LEC2)
>>
>> This is an enumeration:
>>
>> enum {
>> PCH_STUF_ERR = 1,
>> PCH_FORM_ERR,
>> PCH_ACK_ERR,
>> PCH_BIT1_ERR;
>> PCH_BIT0_ERR,
>> PCH_CRC_ERR,
>> PCH_LEC_ALL;
>> }
> 
> No, 
> LEC is for bit assignment.
> Thus, "enum" can't be used.

Why? For me it's a classical enum because the value matters, and *not*
the individual bit. Do you agree?

>>> I suggest to convert to a if-bit-set because there might be more than
>>> one bit set.
>>
>> Marc, what do you mean here. It's an enumeraton. Maybe the following
>> code is more clear:
>>
>> lec = status & PCH_LEC_ALL;
>> if (lec > 0) {
>> switch (lec) {
> 
> No.
> LEC is not enum.

See also my sub-sequent comment here:

http://marc.info/?l=linux-netdev&m=128880088907148&w=2

>>>> + case PCH_STUF_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_STUFF;
>>>> + break;
>>>> + case PCH_FORM_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_FORM;
>>>> + break;
>>>> + case PCH_ACK_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
>>>> +        CAN_ERR_PROT_LOC_ACK_DEL;
>>
>> Could you check what that type of bus error that is? Usually it's a ack
>> lost error.
> 
> I will modify.
> BTW, I can see ti_hecc also has the above the same code.

Yes, also the AT91 driver uses a somehow incorrect error mask. I will
check and provide a patch a.s.a.p.

>>
>>>> + break;
>>>> + case PCH_BIT1_ERR:
>>>> + case PCH_BIT0_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_BIT;
>>>> + break;
>>>> + case PCH_CRC_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
>>>> +        CAN_ERR_PROT_LOC_CRC_DEL;
>>>> + break;
>>>> + default:
>>>> + iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
>>>> + break;
>>>> + }
>>>> +
>>>> + }
>>
>> Also, could you please add the TEC and REC:
>>
>> cf->data[6] = ioread32(&priv->regs->errc) & CAN_TEC;
>> cf->data[7] = (ioread32(&priv->regs->errc) & CAN_REC) >> 8;
> 
> I will add.

BTW: it could be done with one I/O call:

  errc = ioread32(&priv->regs->errc);
  cf->data[6] = errc & CAN_TEC;
  cf->data[7] = (errc & CAN_REC) >> 8;

> But I couldn't find 

Don't understand? It's also implemented for the SJA1000 driver:

http://lxr.linux.no/#linux+v2.6.36/drivers/net/can/sja1000/sja1000.c#L466

Wolfgang.

^ permalink raw reply

* [net-next] stmmac: update the driver documentation
From: Giuseppe CAVALLARO @ 2010-11-12 11:37 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 Documentation/networking/stmmac.txt |   48 +++++++++++++++++++++++++++--------
 1 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index 7ee770b..80a7a34 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -7,7 +7,7 @@ This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
 (Synopsys IP blocks); it has been fully tested on STLinux platforms.
 
 Currently this network device driver is for all STM embedded MAC/GMAC
-(7xxx SoCs).
+(7xxx SoCs). Other platforms start using it i.e. ARM SPEAr.
 
 DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100
 Universal version 4.0 have been used for developing the first code
@@ -95,9 +95,14 @@ Several information came from the platform; please refer to the
 driver's Header file in include/linux directory.
 
 struct plat_stmmacenet_data {
-        int bus_id;
-        int pbl;
-        int has_gmac;
+	int bus_id;
+	int pbl;
+	int clk_csr;
+	int has_gmac;
+	int enh_desc;
+	int tx_coe;
+	int bugged_jumbo;
+	int pmt;
         void (*fix_mac_speed)(void *priv, unsigned int speed);
         void (*bus_setup)(unsigned long ioaddr);
 #ifdef CONFIG_STM_DRIVERS
@@ -114,6 +119,12 @@ Where:
   registers (on STM platforms);
 - has_gmac: GMAC core is on board (get it at run-time in the next step);
 - bus_id: bus identifier.
+- tx_coe: core is able to perform the tx csum in HW.
+- enh_desc: if sets the MAC will use the enhanced descriptor structure.
+- clk_csr: CSR Clock range selection.
+- bugged_jumbo: some HWs are not able to perform the csum in HW for
+  over-sized frames due to limited buffer sizes. Setting this
+  flag the csum will be done in SW on JUMBO frames.
 
 struct plat_stmmacphy_data {
         int bus_id;
@@ -131,13 +142,28 @@ Where:
 - interface: physical MII interface mode;
 - phy_reset: hook to reset HW function.
 
+SOURCES:
+- Kconfig
+- Makefile
+- stmmac_main.c: main network device driver;
+- stmmac_mdio.c: mdio functions;
+- stmmac_ethtool.c: ethtool support;
+- stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts
+  Only tested on ST40 platforms based.
+- stmmac.h: private driver structure;
+- common.h: common definitions and VFTs;
+- descs.h: descriptor structure definitions;
+- dwmac1000_core.c: GMAC core functions;
+- dwmac1000_dma.c:  dma functions for the GMAC chip;
+- dwmac1000.h: specific header file for the GMAC;
+- dwmac100_core: MAC 100 core and dma code;
+- dwmac100_dma.c: dma funtions for the MAC chip;
+- dwmac1000.h: specific header file for the MAC;
+- dwmac_lib.c: generic DMA functions shared among chips
+- enh_desc.c: functions for handling enhanced descriptors
+- norm_desc.c: functions for handling normal descriptors
+
 TODO:
-- Continue to make the driver more generic and suitable for other Synopsys
-  Ethernet controllers used on other architectures (i.e. ARM).
-- 10G controllers are not supported.
-- MAC uses Normal descriptors and GMAC uses enhanced ones.
-  This is a limit that should be reviewed. MAC could want to
-  use the enhanced structure.
-- Checksumming: Rx/Tx csum is done in HW in case of GMAC only.
+- XGMAC controller is not supported.
 - Review the timer optimisation code to use an embedded device that seems to be
   available in new chip generations.
-- 
1.5.5.6


^ permalink raw reply related

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12 11:25 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Américo Wang, linux-kernel, netdev
In-Reply-To: <AANLkTikHjGkq5FDoXHaUbRkpurmT3mSLiu8toqqRL4Gi@mail.gmail.com>

Le vendredi 12 novembre 2010 à 19:10 +0800, Cypher Wu a écrit :

> I used to using that way, just seperate the call internal and
> external, with external one hold lock then call the internal one. But
> in that case ip_check_mc() is called indirectly from igmpv3_sendpack()
> and is not very clear how to give the different paramter?

I said that I was preparing a RCU patch, dont bother with this ;)

Should be ready in a couple of minutes.

Thanks

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12 11:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, linux-kernel, netdev
In-Reply-To: <1289546874.17691.1774.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 3:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> >>> other platforms. It have a priority for write lock that when tried it
>> >>> will block the following read lock even if read lock is hold by
>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a schem in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >> and Linus himself if I remember well.
>> >>
>> >It seems not a problem that read_lock() can be nested or not since
>> >rwlock doesn't have 'owner', it's just that should we give
>> >write_lock() a priority than read_lock() since if there have a lot
>> >read_lock()s then they'll starve write_lock().
>> >We should work out a well defined behavior so all the
>> >platform-dependent raw_rwlock has to design under that principle.
>>
>
> AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>>
>
> Agreed.
>
>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is proper for this case you described. So, try RCU lock.
>
> In the IGMP case, it should be easy for the task owning a read_lock() to
> pass a parameter to the called function saying 'I already own the
> read_lock(), dont try to re-acquire it'

I used to using that way, just seperate the call internal and
external, with external one hold lock then call the internal one. But
in that case ip_check_mc() is called indirectly from igmpv3_sendpack()
and is not very clear how to give the different paramter?

>
> A RCU conversion is far more complex.
>
>
>
>

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Tomoya MORINAGA @ 2010-11-12 11:10 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w, Marc Kleine-Budde,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CCB213A.2020206@grandegger.com>

On Saturday, October 30, 2010 4:32 AM, Wolfgang Grandegger wrote :

Sorry, for my late.

> >> +
> >> +#define PCH_RX_OK 0x00000010
> >> +#define PCH_TX_OK 0x00000008
> >> +#define PCH_BUS_OFF 0x00000080
> >> +#define PCH_EWARN 0x00000040
> >> +#define PCH_EPASSIV 0x00000020
> >> +#define PCH_LEC0 0x00000001
> >> +#define PCH_LEC1 0x00000002
> >> +#define PCH_LEC2 0x00000004
> > 
> > These are just single set bit, please use BIT()
> > Consider adding the name of the corresponding register to the define's
> > name.

I agree.

> > 
> >> +#define PCH_LEC_ALL (PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
> >> +#define PCH_STUF_ERR PCH_LEC0
> >> +#define PCH_FORM_ERR PCH_LEC1
> >> +#define PCH_ACK_ERR (PCH_LEC0 | PCH_LEC1)
> >> +#define PCH_BIT1_ERR PCH_LEC2
> >> +#define PCH_BIT0_ERR (PCH_LEC0 | PCH_LEC2)
> >> +#define PCH_CRC_ERR (PCH_LEC1 | PCH_LEC2)
> 
> This is an enumeration:
> 
> enum {
> PCH_STUF_ERR = 1,
> PCH_FORM_ERR,
> PCH_ACK_ERR,
> PCH_BIT1_ERR;
> PCH_BIT0_ERR,
> PCH_CRC_ERR,
> PCH_LEC_ALL;
> }

No, 
LEC is for bit assignment.
Thus, "enum" can't be used.


> >> +#define PCH_FIFO_THRESH 16
> >> +
> >> +enum pch_can_mode {
> >> + PCH_CAN_ENABLE,
> >> + PCH_CAN_DISABLE,
> >> + PCH_CAN_ALL,
> >> + PCH_CAN_NONE,
> >> + PCH_CAN_STOP,
> >> + PCH_CAN_RUN,
> >> +};
> >> +
> >> +struct pch_can_regs {
> >> + u32 cont;
> >> + u32 stat;
> >> + u32 errc;
> >> + u32 bitt;
> >> + u32 intr;
> >> + u32 opt;
> >> + u32 brpe;
> >> + u32 reserve1;
> > 
> > VVVV
> >> + u32 if1_creq;
> >> + u32 if1_cmask;
> >> + u32 if1_mask1;
> >> + u32 if1_mask2;
> >> + u32 if1_id1;
> >> + u32 if1_id2;
> >> + u32 if1_mcont;
> >> + u32 if1_dataa1;
> >> + u32 if1_dataa2;
> >> + u32 if1_datab1;
> >> + u32 if1_datab2;
> > ^^^^
> > 
> > these registers and....
> > 
> >> + u32 reserve2;
> >> + u32 reserve3[12];
> > 
> > ...and these
> > 
> > VVVV
> >> + u32 if2_creq;
> >> + u32 if2_cmask;
> >> + u32 if2_mask1;
> >> + u32 if2_mask2;
> >> + u32 if2_id1;
> >> + u32 if2_id2;
> >> + u32 if2_mcont;
> >> + u32 if2_dataa1;
> >> + u32 if2_dataa2;
> >> + u32 if2_datab1;
> >> + u32 if2_datab2;
> > 
> > ^^^^
> > 
> > ...are identical. I suggest to make a struct defining a complete
> > "Message Interface Register Set". If you include the correct number of
> > reserved bytes in the struct, you can have an array of two of these
> > structs in the struct pch_can_regs.
> 
> Yep, that would be nice. Using it consequently would also allow to
> remove duplicated code efficiently. I will name it "struct pch_can_if"
> for latter references.

I will modify like above.

> 
> > 
> >> + u32 reserve4;
> >> + u32 reserve5[20];
> >> + u32 treq1;
> >> + u32 treq2;
> >> + u32 reserve6[2];
> >> + u32 reserve7[56];
> >> + u32 reserve8[3];
> 
> Why not just one reserveX ?

I will modify to a reserveX.

> 
> >> + u32 srst;
> >> +};
> >> +
> >> +struct pch_can_priv {
> >> + struct can_priv can;
> >> + struct pci_dev *dev;
> >> + unsigned int tx_enable[MAX_MSG_OBJ];
> >> + unsigned int rx_enable[MAX_MSG_OBJ];
> >> + unsigned int rx_link[MAX_MSG_OBJ];
> >> + unsigned int int_enables;
> >> + unsigned int int_stat;
> >> + struct net_device *ndev;
> >> + spinlock_t msgif_reg_lock; /* Message Interface Registers Access Lock*/
> >                                                                             ^^^
> > please add a whitespace

I will modify.


> > 
> > IMHO the function name is missleading, if I understand the code
> > correctly, this functions triggers the transmission of the message.
> > After this it checks for busy, but 
> > 
> >> +static void pch_can_check_if_busy(u32 __iomem *creq_addr, u32 num)
> >                                      ^^^^
> > 
> > that should probaby be a void
> 
> With separate structs for if1 and i2, a pointer to the relevant "struct
> pch_can_if" could be passed instead.

I will modify

> >> +
> >> + if (set == 1) {
> >> + /* Setting the MsgVal and RxIE bits */
> >> + pch_can_bit_set(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >> + pch_can_bit_set(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >> +
> >> + } else if (set == 0) {
> >> + /* Resetting the MsgVal and RxIE bits */
> >> + pch_can_bit_clear(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >> + pch_can_bit_clear(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >> + }
> 
> Why not just?
> 
> if (set)
> else

I will modify.


> >> +static void pch_can_set_tx_enable(struct pch_can_priv *priv, u32 buff_num,
> >> + u32 set)
> >> +{
> >> + unsigned long flags;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + /* Reading the Msg buffer from Message RAM to Interface2 registers. */
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> +
> >> + /* Setting the IF2CMASK register for accessing the
> >> + MsgVal and TxIE bits */
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >> + &priv->regs->if2_cmask);
> >> +
> >> + if (set == 1) {
> >> + /* Setting the MsgVal and TxIE bits */
> >> + pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >> + pch_can_bit_set(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >> + } else if (set == 0) {
> >> + /* Resetting the MsgVal and TxIE bits. */
> >> + pch_can_bit_clear(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >> + pch_can_bit_clear(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >> + }
> >> +
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +}
> 
> That function is almost identical to pch_can_set_rx_enable. Just if2 is
> used instead of if1 and CAN_IF_MCONT_TXIE instead of CAN_IF_MCONT_RXIE.
> With separate "struct  pch_can_if" for if1 and if2, it could be handled
> by a common function.

I will modify.

> 
> >> +static void pch_can_tx_enable_all(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> +
> >> + /* Traversing to obtain the object configured as transmit object. */
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >> + pch_can_set_tx_enable(priv, i, 1);
> >> +}
> >> +
> >> +static void pch_can_tx_disable_all(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> +
> >> + /* Traversing to obtain the object configured as transmit object. */
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >> + pch_can_set_tx_enable(priv, i, 0);
> >> +}
> 
> I think there is no need for separate functions for enable and disable.
> Just pass "enable" 0 or 1 like you do with "set" above.

I will modify

> 
> >> +static int pch_can_int_pending(struct pch_can_priv *priv)
> >           ^^^
> > 
> > make it u32 as it returns a register value, or a u16 as you only use
> > the 16 lower bits.

I will modify.

> > 
> >> +{
> >> + return ioread32(&priv->regs->intr) & 0xffff;
> >> +}
> >> +
> >> +static void pch_can_clear_buffers(struct pch_can_priv *priv)
> >> +{
> >> + int i; /* Msg Obj ID (1~32) */
> >> +
> >> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> > 
> > IMHO the readability would be improved if you define something like
> > PCH_RX_OBJ_START and PCH_RX_OBJ_END.

I will modify.


> > 
> >> + iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if1_cmask);
> >> + iowrite32(0xffff, &priv->regs->if1_mask1);
> >> + iowrite32(0xffff, &priv->regs->if1_mask2);
> >> + iowrite32(0x0, &priv->regs->if1_id1);
> >> + iowrite32(0x0, &priv->regs->if1_id2);
> >> + iowrite32(0x0, &priv->regs->if1_mcont);
> >> + iowrite32(0x0, &priv->regs->if1_dataa1);
> >> + iowrite32(0x0, &priv->regs->if1_dataa2);
> >> + iowrite32(0x0, &priv->regs->if1_datab1);
> >> + iowrite32(0x0, &priv->regs->if1_datab2);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK |
> >> +   CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >> +   &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, i);
> >> + }
> >> +
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++) {
> >                  ^^^^^^^^^^^^^^^^^^
> > dito for TX objects
> > 
> >> + iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if2_cmask);
> >> + iowrite32(0xffff, &priv->regs->if2_mask1);
> >> + iowrite32(0xffff, &priv->regs->if2_mask2);
> >> + iowrite32(0x0, &priv->regs->if2_id1);
> >> + iowrite32(0x0, &priv->regs->if2_id2);
> >> + iowrite32(0x0, &priv->regs->if2_mcont);
> >> + iowrite32(0x0, &priv->regs->if2_dataa1);
> >> + iowrite32(0x0, &priv->regs->if2_dataa2);
> >> + iowrite32(0x0, &priv->regs->if2_datab1);
> >> + iowrite32(0x0, &priv->regs->if2_datab2);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK | CAN_CMASK_ARB |
> >> +   CAN_CMASK_CTRL, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, i);
> 
> This is almost the same code as above, just if2 instead of if1. With
> separate "struct  pch_can_if" for if1 and i2, it could be handled by a
> common function.

I will modify.

> 
> >> + }
> >> +}
> >> +
> >> +static void pch_can_config_rx_tx_buffers(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> + unsigned long flags;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> +
> >> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, i);
> > 
> > If I understand the code correctly, the about function triggers a
> > transfer. Why do you first trigger a transfer, then set the message contents....

For read-modify-write.


> >> + }
> >> +
> >> + if (status & PCH_LEC_ALL) {
> >> + priv->can.can_stats.bus_error++;
> >> + stats->rx_errors++;
> >> + switch (status & PCH_LEC_ALL) {
> > 
> > I suggest to convert to a if-bit-set because there might be more than
> > one bit set.
> 
> Marc, what do you mean here. It's an enumeraton. Maybe the following
> code is more clear:
> 
> lec = status & PCH_LEC_ALL;
> if (lec > 0) {
> switch (lec) {

No.
LEC is not enum.


> 
> >> + case PCH_STUF_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_STUFF;
> >> + break;
> >> + case PCH_FORM_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_FORM;
> >> + break;
> >> + case PCH_ACK_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
> >> +        CAN_ERR_PROT_LOC_ACK_DEL;
> 
> Could you check what that type of bus error that is? Usually it's a ack
> lost error.

I will modify.
BTW, I can see ti_hecc also has the above the same code.

> 
> >> + break;
> >> + case PCH_BIT1_ERR:
> >> + case PCH_BIT0_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_BIT;
> >> + break;
> >> + case PCH_CRC_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> >> +        CAN_ERR_PROT_LOC_CRC_DEL;
> >> + break;
> >> + default:
> >> + iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
> >> + break;
> >> + }
> >> +
> >> + }
> 
> Also, could you please add the TEC and REC:
> 
> cf->data[6] = ioread32(&priv->regs->errc) & CAN_TEC;
> cf->data[7] = (ioread32(&priv->regs->errc) & CAN_REC) >> 8;

I will add.
But I couldn't find 

> >> +
> >> +static int pch_can_rx_msg_lost(struct net_device *ndev, int obj_id)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + struct net_device_stats *stats = &(priv->ndev->stats);
> >> + struct sk_buff *skb;
> >> + struct can_frame *cf;
> >> +
> >> + dev_err(&priv->ndev->dev, "Msg Obj is overwritten.\n");
> 
> Please use dev_dbg or even remove the line above.

I will modify.


> 
> >> + pch_can_bit_clear(&priv->regs->if1_mcont,
> >> +   CAN_IF_MCONT_MSGLOST);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL,
> >> +   &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, obj_id);
> 
> I think the if busy checks could be improved. Why do you need to wait here?

Sorry, I can' understand.
This is for clear MSGLOSG bit.

> 
> >> +
> >> + skb = alloc_can_err_skb(ndev, &cf);
> >> + if (!skb)
> >> + return -ENOMEM;
> >> +
> >> + priv->can.can_stats.error_passive++;
> >> + priv->can.state = CAN_STATE_ERROR_PASSIVE;
> 
> Please remove the above two bogus lines.

I will remove.


> >> +
> >> + can_get_echo_skb(ndev, int_stat - PCH_RX_OBJ_NUM - 1);
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET | CAN_CMASK_CLRINTPND,
> >> +   &priv->regs->if2_cmask);
> >> + dlc = ioread32(&priv->regs->if2_mcont) & CAN_IF_MCONT_DLC;
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, int_stat);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + if (dlc > 8)
> >> + dlc = 8;
> > 
> > use get_can_dlc

I will use

> > 
> >> + stats->tx_bytes += dlc;
> >> + stats->tx_packets++;
> >> +}
> >> +
> >> +static int pch_can_rx_poll(struct napi_struct *napi, int quota)
> >> +{
> >> + struct net_device *ndev = napi->dev;
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + u32 int_stat;
> >> + int rcv_pkts = 0;
> >> + u32 reg_stat;
> >> + unsigned long flags;
> >> +
> >> + int_stat = pch_can_int_pending(priv);
> >> + if (!int_stat)
> >> + goto END;
> 
> Labels should be lowercase as well.

I will modify

> 
> >> +
> >> + if ((int_stat == CAN_STATUS_INT) && (quota > 0)) {
> >> + reg_stat = ioread32(&priv->regs->stat);
> >> + if (reg_stat & (PCH_BUS_OFF | PCH_LEC_ALL)) {
> >> + if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL) {
> >> + pch_can_error(ndev, reg_stat);
> >> + quota--;
> >> + }
> >> + }
> >> +
> >> + if (reg_stat & PCH_TX_OK) {
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq,
> >> +        ioread32(&priv->regs->intr));
> >                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Isn't this "int_stat". Might it be possilbe that regs->intr changes
> > between the pch_can_int_pending and here?
> > 
> > What should this transfer do?
> > 
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + pch_can_bit_clear(&priv->regs->stat, PCH_TX_OK);
> >> + }
> >> +
> >> + if (reg_stat & PCH_RX_OK)
> >> + pch_can_bit_clear(&priv->regs->stat, PCH_RX_OK);
> >> +
> >> + int_stat = pch_can_int_pending(priv);
> >> + }
> >> +
> >> + if (quota == 0)
> >> + goto END;
> >> +
> >> + if ((int_stat >= 1) && (int_stat <= PCH_RX_OBJ_NUM)) {
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + rcv_pkts += pch_can_rx_normal(ndev, int_stat, quota);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + quota -= rcv_pkts;
> >> + if (rcv_pkts < 0)
> > 
> > how can this happen?

I will modify to quota.


> > 
> >> + goto END;
> >> + } else if ((int_stat > PCH_RX_OBJ_NUM) && (int_stat <= PCH_OBJ_NUM)) {
> >> + /* Handle transmission interrupt */
> >> + pch_can_tx_complete(ndev, int_stat);
> >> + }
> >> +
> >> +END:
> >> + napi_complete(napi);
> >> + pch_can_set_int_enables(priv, PCH_CAN_ALL);
> >> +
> >> + return rcv_pkts;
> >> +}
> >> +
> >> +static int pch_set_bittiming(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + const struct can_bittiming *bt = &priv->can.bittiming;
> >> + u32 canbit;
> >> + u32 bepe;
> >> +
> >> + /* Setting the CCE bit for accessing the Can Timing register. */
> >> + pch_can_bit_set(&priv->regs->cont, CAN_CTRL_CCE);
> >> +
> >> + canbit = (bt->brp - 1) & MSK_BITT_BRP;
> >> + canbit |= (bt->sjw - 1) << BIT_BITT_SJW;
> >> + canbit |= (bt->phase_seg1 + bt->prop_seg - 1) << BIT_BITT_TSEG1;
> >> + canbit |= (bt->phase_seg2 - 1) << BIT_BITT_TSEG2;
> >> + bepe = ((bt->brp - 1) & MSK_BRPE_BRPE) >> BIT_BRPE_BRPE;
> >> + iowrite32(canbit, &priv->regs->bitt);
> >> + iowrite32(bepe, &priv->regs->brpe);
> >> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_CCE);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static void pch_can_start(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + if (priv->can.state != CAN_STATE_STOPPED)
> >> + pch_can_reset(priv);
> >> +
> >> + pch_set_bittiming(ndev);
> >> + pch_can_set_optmode(priv);
> >> +
> >> + pch_can_tx_enable_all(priv);
> >> + pch_can_rx_enable_all(priv);
> >> +
> >> + /* Setting the CAN to run mode. */
> >> + pch_can_set_run_mode(priv, PCH_CAN_RUN);
> >> +
> >> + priv->can.state = CAN_STATE_ERROR_ACTIVE;
> >> +
> >> + return;
> >> +}
> >> +
> >> +static int pch_can_do_set_mode(struct net_device *ndev, enum can_mode mode)
> >> +{
> >> + int ret = 0;
> >> +
> >> + switch (mode) {
> >> + case CAN_MODE_START:
> >> + pch_can_start(ndev);
> >> + netif_wake_queue(ndev);
> >> + break;
> >> + default:
> >> + ret = -EOPNOTSUPP;
> >> + break;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int pch_can_open(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + int retval;
> >> +
> >> + /* Regsitering the interrupt. */
> 
> Typo!

I will modify.

> 
> >> + retval = request_irq(priv->dev->irq, pch_can_interrupt, IRQF_SHARED,
> >> +      ndev->name, ndev);
> >> + if (retval) {
> >> + dev_err(&ndev->dev, "request_irq failed.\n");
> >> + goto req_irq_err;
> >> + }
> >> +
> >> + /* Open common can device */
> >> + retval = open_candev(ndev);
> >> + if (retval) {
> >> + dev_err(ndev->dev.parent, "open_candev() failed %d\n", retval);
> >> + goto err_open_candev;
> >> + }
> >> +
> >> + pch_can_init(priv);
> >> + pch_can_start(ndev);
> >> + napi_enable(&priv->napi);
> >> + netif_start_queue(ndev);
> >> +
> >> + return 0;
> >> +
> >> +err_open_candev:
> >> + free_irq(priv->dev->irq, ndev);
> >> +req_irq_err:
> >> + pch_can_release(priv);
> >> +
> >> + return retval;
> >> +}
> >> +
> >> +static int pch_close(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + netif_stop_queue(ndev);
> >> + napi_disable(&priv->napi);
> >> + pch_can_release(priv);
> >> + free_irq(priv->dev->irq, ndev);
> >> + close_candev(ndev);
> >> + priv->can.state = CAN_STATE_STOPPED;
> >> + return 0;
> >> +}
> >> +
> >> +static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
> >> +{
> >> + unsigned long flags;
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + struct can_frame *cf = (struct can_frame *)skb->data;
> >> + int tx_buffer_avail = 0;
> > 
> > What I'm totally missing is the TX flow controll. Your driver has to
> > ensure that the package leave the controller in the order that come
> > into the xmit function. Further you have to stop your xmit queue if
> > you're out of tx objects and reenable if you have a object free.
> > 
> > Use netif_stop_queue() and netif_wake_queue() for this.

I will add flow control.

> >> + }
> >> + if (cf->can_dlc > 2) {
> >> + u32 data1 = *((u16 *)&cf->data[2]);
> >> + iowrite32(data1, &priv->regs->if2_dataa2);
> >> + }
> >> + if (cf->can_dlc > 4) {
> >> + u32 data1 = *((u16 *)&cf->data[4]);
> >> + iowrite32(data1, &priv->regs->if2_datab1);
> >> + }
> >> + if (cf->can_dlc > 6) {
> >> + u32 data1 = *((u16 *)&cf->data[6]);
> >> + iowrite32(data1, &priv->regs->if2_datab2);
> >> + }
> 
> Could be handled by a loop.

Pending.


> 
> >> + can_put_echo_skb(skb, ndev, tx_buffer_avail - PCH_RX_OBJ_NUM - 1);
> >> +
> >> + /* Set the size of the data. */
> >> + iowrite32(cf->can_dlc, &priv->regs->if2_mcont);
> >> +
> >> + /* Update if2_mcont */
> >> + pch_can_bit_set(&priv->regs->if2_mcont,
> >> + CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_TXRQXT |
> >> + CAN_IF_MCONT_TXIE);
> > 
> > pleae first perpare your value, then write to hardware.
> > 
> >> +
> >> + if (tx_buffer_avail == PCH_RX_OBJ_NUM) /* If points tail of FIFO  */
> >> + pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_EOB);
> > 
> > dito
> > 
> > Is EOB relevant for TX objects?
> > 
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, tx_buffer_avail);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +
> >> + return NETDEV_TX_OK;
> >> +}
> >> +
> >> +static const struct net_device_ops pch_can_netdev_ops = {
> >> + .ndo_open = pch_can_open,
> >> + .ndo_stop = pch_close,
> >> + .ndo_start_xmit = pch_xmit,
> >> +};
> >> +
> >> +static void __devexit pch_can_remove(struct pci_dev *pdev)
> >> +{
> >> + struct net_device *ndev = pci_get_drvdata(pdev);
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + unregister_candev(priv->ndev);
> >> + pci_iounmap(pdev, priv->regs);
> >> + if (priv->use_msi)
> >> + pci_disable_msi(priv->dev);
> >> + pci_release_regions(pdev);
> >> + pci_disable_device(pdev);
> >> + pci_set_drvdata(pdev, NULL);
> >> + free_candev(priv->ndev);
> >> +}
> >> +
> >> +#ifdef CONFIG_PM
> >> +static void pch_can_set_int_custom(struct pch_can_priv *priv)
> >> +{
> >> + /* Clearing the IE, SIE and EIE bits of Can control register. */
> >> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_IE_SIE_EIE);
> >> +
> >> + /* Appropriately setting them. */
> >> + pch_can_bit_set(&priv->regs->cont,
> >> + ((priv->int_enables & MSK_CTRL_IE_SIE_EIE) << 1));
> >> +}
> >> +
> >> +/* This function retrieves interrupt enabled for the CAN device. */
> >> +static u32 pch_can_get_int_enables(struct pch_can_priv *priv)
> >> +{
> >> + /* Obtaining the status of IE, SIE and EIE interrupt bits. */
> >> + return (ioread32(&priv->regs->cont) & CAN_CTRL_IE_SIE_EIE) >> 1;
> >> +}
> >> +
> >> +static u32 pch_can_get_rx_enable(struct pch_can_priv *priv, u32 buff_num)
> >> +{
> >> + unsigned long flags;
> >> + u32 enable;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, buff_num);
> >> +
> >> + if (((ioread32(&priv->regs->if1_id2)) & CAN_ID_MSGVAL) &&
> >> + ((ioread32(&priv->regs->if1_mcont)) &
> >> + CAN_IF_MCONT_RXIE))
> >> + enable = 1;
> >> + else
> >> + enable = 0;
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + return enable;
> >> +}
> >> +
> >> +static u32 pch_can_get_tx_enable(struct pch_can_priv *priv, u32 buff_num)
> >> +{
> >> + unsigned long flags;
> >> + u32 enable;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> +
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> + if (((ioread32(&priv->regs->if2_id2)) & CAN_ID_MSGVAL) &&
> >> + ((ioread32(&priv->regs->if2_mcont)) &
> >> + CAN_IF_MCONT_TXIE)) {
> >> + enable = 1;
> >> + } else {
> >> + enable = 0;
> >> + }
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +
> >> + return enable;
> >> +}
> 
> The above two functions could be handled by a common one passing "struct
> pch_can_if". See similar comments above.

I will modify like the same.


> As the driver has already been merged. Please provide incremental
> patches against the net-2.6 branch. Also, it would be nice if you could
> check in-order transmission and reception, e.g., with the can-utils
> program canfdtest:
> 
> http://svn.berlios.de/wsvn/socketcan/trunk/can-utils/canfdtest.c
> 

Thank you for your information.


------
Thanks,

Tomoya MORINAGA
OKI SEMICONDUCTOR CO., LTD.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12 11:06 UTC (permalink / raw)
  To: Américo Wang; +Cc: Yong Zhang, Eric Dumazet, linux-kernel, netdev
In-Reply-To: <20101112091818.GB5949@cr0.nay.redhat.com>

2010/11/12 Américo Wang <xiyou.wangcong@gmail.com>:
> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
>>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>>> >>
>>>>> >> Hi
>>>>> >>
>>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>>> >>
>>>>> >>
>>>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>>> >>> other platforms. It have a priority for write lock that when tried it
>>>>> >>> will block the following read lock even if read lock is hold by
>>>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>>> >>> arch/tile/lib/spinlock_32.c.
>>>>> >>
>>>>> >> This seems a bug to me.
>>>>> >>
>>>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>>>> >> (it can re-enter itself),
>>>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>>>> >> and Linus himself if I remember well.
>>>>> >>
>>>>> >It seems not a problem that read_lock() can be nested or not since
>>>>> >rwlock doesn't have 'owner', it's just that should we give
>>>>> >write_lock() a priority than read_lock() since if there have a lot
>>>>> >read_lock()s then they'll starve write_lock().
>>>>> >We should work out a well defined behavior so all the
>>>>> >platform-dependent raw_rwlock has to design under that principle.
>>>>>
>>>>
>>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>>
>>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>>
>>>>
>>>>Agreed.
>>>>
>>>
>>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>>> this somewhat reduces the write-starvation problem.
>>
>>Are you sure(on x86)?
>>
>>It seems that we never realize writer-bias rwlock.
>>
>
> Try
>
> % grep RW_LOCK_BIAS -nr arch/x86
>
> *And* read the code to see how it works. :)
>
> Note, on Tile, it uses a little different algorithm.
>

It seems that rwlock on x86 and tile have different behavior, x86 use
RW_LOCK_BIAS, when read_lock() it will test if the lock is 0, and if
so then the read_lock() have to 'spinning', otherwise it dec the lock;
when write_lock() tried it first check if lock is It seems that rwlock
on x86 and tile have different behavior, x86 use RW_LOCK_BIAS and if
so, set lock to 0 and continue, otherwise it will 'spinning'.
I'm not very familiar with x86 architecture, but the code seems like
working that way.

^ permalink raw reply

* [PATCH] r8169: fix checksum broken
From: Shan Wei @ 2010-11-12 10:15 UTC (permalink / raw)
  To: romieu; +Cc: netdev@vger.kernel.org, David Miller

If r8196 received packets with invalid sctp/igmp(not tcp, udp) checksum, r8196 set skb->ip_summed
wit CHECKSUM_UNNECESSARY. This cause that upper protocol don't check checksum field.

I am not family with r8196 driver. I try to guess the meaning of RxProtoIP and IPFail.
RxProtoIP stands for received IPv4 packet that upper protocol is not tcp and udp. 
!(opts1 & IPFail) is true means that driver correctly to check checksum in IPv4 header.

If it's right, I think we should not set ip_summed wit CHECKSUM_UNNECESSARY for my sctp packets
with invalid checksum. 

If it's not right, please tell me. 


Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 drivers/net/r8169.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index d88ce9f..2cf6c2e 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -4440,8 +4440,7 @@ static inline void rtl8169_rx_csum(struct sk_buff *skb, u32 opts1)
 	u32 status = opts1 & RxProtoMask;
 
 	if (((status == RxProtoTCP) && !(opts1 & TCPFail)) ||
-	    ((status == RxProtoUDP) && !(opts1 & UDPFail)) ||
-	    ((status == RxProtoIP) && !(opts1 & IPFail)))
+	    ((status == RxProtoUDP) && !(opts1 & UDPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
 		skb_checksum_none_assert(skb);
-- 
1.6.3.3


^ permalink raw reply related

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  9:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, Cypher Wu, linux-kernel, netdev
In-Reply-To: <1289553759.3185.1.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 10:22:39AM +0100, Eric Dumazet wrote:
>Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>
>> >A RCU conversion is far more complex.
>> >
>> 
>> Yup.
>
>
>Well, actually this is easy in this case.
>
>I'll post a patch to do this RCU conversion.
>

Cool! Please keep me in Cc.

Some conversions from rwlock to RCU are indeed complex,
don't know about this case.

Anyway, patch please. :)

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12  9:22 UTC (permalink / raw)
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112081945.GA5949@cr0.nay.redhat.com>

Le vendredi 12 novembre 2010 à 16:19 +0800, Américo Wang a écrit :
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:

> >A RCU conversion is far more complex.
> >
> 
> Yup.


Well, actually this is easy in this case.

I'll post a patch to do this RCU conversion.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  9:18 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Américo Wang, Eric Dumazet, Cypher Wu, linux-kernel, netdev
In-Reply-To: <AANLkTik-KDvc2fgH91vBHGT6vqxbZrv=9DoQknujPWy2@mail.gmail.com>

On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>>> >>
>>>> >> Hi
>>>> >>
>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>> >>
>>>> >>
>>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>>> >>> other platforms. It have a priority for write lock that when tried it
>>>> >>> will block the following read lock even if read lock is hold by
>>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>>> >>> arch/tile/lib/spinlock_32.c.
>>>> >>
>>>> >> This seems a bug to me.
>>>> >>
>>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>>> >> (it can re-enter itself),
>>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>>> >> and Linus himself if I remember well.
>>>> >>
>>>> >It seems not a problem that read_lock() can be nested or not since
>>>> >rwlock doesn't have 'owner', it's just that should we give
>>>> >write_lock() a priority than read_lock() since if there have a lot
>>>> >read_lock()s then they'll starve write_lock().
>>>> >We should work out a well defined behavior so all the
>>>> >platform-dependent raw_rwlock has to design under that principle.
>>>>
>>>
>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>
>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>
>>>
>>>Agreed.
>>>
>>
>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>> this somewhat reduces the write-starvation problem.
>
>Are you sure(on x86)?
>
>It seems that we never realize writer-bias rwlock.
>

Try

% grep RW_LOCK_BIAS -nr arch/x86

*And* read the code to see how it works. :)

Note, on Tile, it uses a little different algorithm.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Yong Zhang @ 2010-11-12  9:09 UTC (permalink / raw)
  To: Américo Wang; +Cc: Eric Dumazet, Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112081945.GA5949@cr0.nay.redhat.com>

On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>> >>
>>> >> Hi
>>> >>
>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>> >>
>>> >>
>>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>> >>> other platforms. It have a priority for write lock that when tried it
>>> >>> will block the following read lock even if read lock is hold by
>>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>>> >>> arch/tile/lib/spinlock_32.c.
>>> >>
>>> >> This seems a bug to me.
>>> >>
>>> >> read_lock() can be nested. We used such a schem in the past in iptables
>>> >> (it can re-enter itself),
>>> >> and we used instead a spinlock(), but with many discussions with lkml
>>> >> and Linus himself if I remember well.
>>> >>
>>> >It seems not a problem that read_lock() can be nested or not since
>>> >rwlock doesn't have 'owner', it's just that should we give
>>> >write_lock() a priority than read_lock() since if there have a lot
>>> >read_lock()s then they'll starve write_lock().
>>> >We should work out a well defined behavior so all the
>>> >platform-dependent raw_rwlock has to design under that principle.
>>>
>>
>>AFAIK, Lockdep allows read_lock() to be nested.
>>
>>> It is a known weakness of rwlock, it is designed like that. :)
>>>
>>
>>Agreed.
>>
>
> Just for record, both Tile and X86 implement rwlock with a write-bias,
> this somewhat reduces the write-starvation problem.

Are you sure(on x86)?

It seems that we never realize writer-bias rwlock.

Thanks,
Yong
>
>
>>> The solution is to use RCU or seqlock, but I don't think seqlock
>>> is proper for this case you described. So, try RCU lock.
>>
>>In the IGMP case, it should be easy for the task owning a read_lock() to
>>pass a parameter to the called function saying 'I already own the
>>read_lock(), dont try to re-acquire it'
>>
>>A RCU conversion is far more complex.
>>
>
> Yup.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] macvlan: lockless tx path
From: Patrick McHardy @ 2010-11-12  8:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ben Greear, Ben Hutchings
In-Reply-To: <1289459644.17691.1011.camel@edumazet-laptop>

On 11.11.2010 08:14, Eric Dumazet wrote:
> macvlan is a stacked device, like tunnels. We should use the lockless
> mechanism we are using in tunnels and loopback.
> 
> This patch completely removes locking in TX path.
> 
> tx stat counters are added into existing percpu stat structure, renamed
> from rx_stats to pcpu_stats.
> 
> Note : this reverts commit 2c11455321f37 (macvlan: add multiqueue
> capability)
> 
> Note : rx_errors converted to a 32bit counter, like tx_dropped, since
> they dont need 64bit range.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Ben Greear <greearb@candelatech.com>
> Cc: Ben Hutchings <bhutchings@solarflare.com>
> ---
> V2: correct kerneldoc
>     u32 for tx_dropped and rx_errors

Looks good to me.

Acked-by: Patrick McHardy <kaber@trash.net>



^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  8:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, Cypher Wu, linux-kernel, netdev
In-Reply-To: <1289546874.17691.1774.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> >>> other platforms. It have a priority for write lock that when tried it
>> >>> will block the following read lock even if read lock is hold by
>> >>> others. Its code can be read in Linux Kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a schem in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >> and Linus himself if I remember well.
>> >>
>> >It seems not a problem that read_lock() can be nested or not since
>> >rwlock doesn't have 'owner', it's just that should we give
>> >write_lock() a priority than read_lock() since if there have a lot
>> >read_lock()s then they'll starve write_lock().
>> >We should work out a well defined behavior so all the
>> >platform-dependent raw_rwlock has to design under that principle.
>> 
>
>AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>> 
>
>Agreed.
>

Just for record, both Tile and X86 implement rwlock with a write-bias,
this somewhat reduces the write-starvation problem.


>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is proper for this case you described. So, try RCU lock.
>
>In the IGMP case, it should be easy for the task owning a read_lock() to
>pass a parameter to the called function saying 'I already own the
>read_lock(), dont try to re-acquire it'
>
>A RCU conversion is far more complex.
>

Yup.

^ permalink raw reply

* Re: [PATCH] netfilter: ipv6: fix overlap check for fragments
From: Patrick McHardy @ 2010-11-12  7:52 UTC (permalink / raw)
  To: Shan Wei
  Cc: nicolas.dichtel, David Miller, netdev@vger.kernel.org,
	netfilter-devel
In-Reply-To: <4CD3F0F8.5030205@cn.fujitsu.com>

On 05.11.2010 12:56, Shan Wei wrote:
> The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
> and offset is int.
> 
> Without this patch, type conversion occurred to this expression, when
> (FRAG6_CB(prev)->offset + prev->len) is less than offset.

Applied, thanks Shan.

^ permalink raw reply

* Re: [PATCH] netfilter: ipv6: fix overlap check for fragments
From: Shan Wei @ 2010-11-12  7:47 UTC (permalink / raw)
  To: nicolas.dichtel, netdev@vger.kernel.org, netfilter-devel,
	Patrick McHardy
  Cc: David Miller
In-Reply-To: <4CD3F0F8.5030205@cn.fujitsu.com>

Hi Patrick:
 Would you apply this to your tree. The similar patch for ipv6 has been
applied by David . 
See http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=f46421416fb6b91513fb687d6503142cd99034a5

Shan Wei wrote, at 11/05/2010 07:56 PM:
> The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
> and offset is int.
> 
> Without this patch, type conversion occurred to this expression, when
> (FRAG6_CB(prev)->offset + prev->len) is less than offset.
> 
> 
> Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
> ---
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 3a3f129..79d43aa 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -286,7 +286,7 @@ found:
>  
>  	/* Check for overlap with preceding fragment. */
>  	if (prev &&
> -	    (NFCT_FRAG6_CB(prev)->offset + prev->len) - offset > 0)
> +	    (NFCT_FRAG6_CB(prev)->offset + prev->len) > offset)
>  		goto discard_fq;
>  
>  	/* Look for overlap with succeeding segment. */


-- 

Best Regards
-----
Shan Wei

^ permalink raw reply

* Re: [PATCH] rtnetlink: Fix message size calculation for link messages
From: Patrick McHardy @ 2010-11-12  7:43 UTC (permalink / raw)
  To: davem, netdev
In-Reply-To: <20101112014759.GA8491@canuck.infradead.org>

On 12.11.2010 02:47, Thomas Graf wrote:
> nlmsg_total_size() calculates the length of a netlink message
> including header and alignment. nla_total_size() calculates the
> space an individual attribute consumes which was meant to be used
> in this context.
> 
> Also, ensure to account for the attribute header for the
> IFLA_INFO_XSTATS attribute as implementations of get_xstats_size()
> seem to assume that we do so.
> 
> The addition of two message headers minus the missing attribute
> header resulted in a calculated message size that was larger than
> required. Therefore we never risked running out of skb tailroom.
> 
> Signed-off-by: Thomas Graf <tgraf@infradead.org>
> Cc: Patrick McHardy <kaber@trash.net>

Looks good to me, thanks Thomas.

Acked-by: Patrick McHardy <kaber@trash.net>

^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Patrick McHardy @ 2010-11-12  7:36 UTC (permalink / raw)
  To: Hua Zhong
  Cc: 'Eric Paris', netdev, linux-kernel, davem, kuznet, pekkas,
	jmorris, yoshfuji
In-Reply-To: <00c201cb81eb$84e18160$8ea48420$@com>

On 11.11.2010 22:58, Hua Zhong wrote:
>> Yes, I realize this is little different than if the
>> SYN was dropped in the first network device, but it is different
>> because we know what happened!  We know that connect() call failed
>> and that there isn't anything coming back.
> 
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).

It sends an ICMP error or TCP reset. Interpretation is up to TCP.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12  7:27 UTC (permalink / raw)
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112071323.GB5660@cr0.nay.redhat.com>

Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
> >>
> >> Hi
> >>
> >> CC netdev, since you ask questions about network stuff _and_ rwlock
> >>
> >>
> >>> I'm using TILEPro and its rwlock in kernel is a liitle different than
> >>> other platforms. It have a priority for write lock that when tried it
> >>> will block the following read lock even if read lock is hold by
> >>> others. Its code can be read in Linux Kernel 2.6.36 in
> >>> arch/tile/lib/spinlock_32.c.
> >>
> >> This seems a bug to me.
> >>
> >> read_lock() can be nested. We used such a schem in the past in iptables
> >> (it can re-enter itself),
> >> and we used instead a spinlock(), but with many discussions with lkml
> >> and Linus himself if I remember well.
> >>
> >It seems not a problem that read_lock() can be nested or not since
> >rwlock doesn't have 'owner', it's just that should we give
> >write_lock() a priority than read_lock() since if there have a lot
> >read_lock()s then they'll starve write_lock().
> >We should work out a well defined behavior so all the
> >platform-dependent raw_rwlock has to design under that principle.
> 

AFAIK, Lockdep allows read_lock() to be nested.

> It is a known weakness of rwlock, it is designed like that. :)
> 

Agreed.

> The solution is to use RCU or seqlock, but I don't think seqlock
> is proper for this case you described. So, try RCU lock.

In the IGMP case, it should be easy for the task owning a read_lock() to
pass a parameter to the called function saying 'I already own the
read_lock(), dont try to re-acquire it'

A RCU conversion is far more complex.




^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Eric Dumazet @ 2010-11-12  7:23 UTC (permalink / raw)
  To: Dan Rosenberg
  Cc: David Miller, socketcan, kuznet, urs.thuermann, yoshfuji, kaber,
	jmorris, remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	lizf, joe, shemminger, hadi, ebiederm, adobriyan, jpirko,
	johannes.berg, daniel.lezcano, xemul, socketcan-core, netdev,
	linux-sctp, torvalds
In-Reply-To: <1289529269.3090.207.camel@Dan>

Le jeudi 11 novembre 2010 à 21:34 -0500, Dan Rosenberg a écrit :
> > I want whatever you replace it with to be equivalent for
> > object tracking purposes.
> 
> In nearly all of the cases I fixed, the socket inode is already
> provided, which serves as a perfectly good unique identifier.  Would you
> prefer I include that information twice?

Oh well. Please read this answer carefuly.

Some facts to feed your next patch. I am very pleased you changed your
mind and that we keep useful information in kernel log.

1) Inode numbers are not guaranteed to be unique. Its a 32bit seq
number, and we dont check another socket inode use the same inode number
(after 2^32 allocations it can happens)

2) /proc/net/ files can deliver same "line" of information several
times, because of their implementation.

3) Because of SLAB_DESTROY_BY_RCU, same 'kernel socket pointer' can be
seen several times in /proc/net/tcp & /proc/net/udp, but really on
different "sockets"

4) Some good applications use both the socket pointer and inode number
(tuple) to filter out the [2] problem. Dont break them, please ?
Anything that might break an application must be at the very least
tunable.

In my opinion, a good thing would be :

- Use a special printf() format , aka "secure pointer", as Thomas
suggested.

- Make sure you print different opaque values for two different kernel
pointers. This is mandatory.

- Make sure the NULL pointer stay as a NULL pointer to not let the
hostile user know your secret, and to ease debugging stuff.

- Have security experts advice to chose a nice crypto function, maybe
jenkin hash. Not too slow would be nice.


static unsigned long securize_kpointers_rnd;

At boot time, stick a random value in this variable.
(Maybe make sure the 5 low order bits are 0)

unsigned long opacify_kptr(unsigned long ptr)
{
	if (ptr == 0)
		return ptr;
	if (capable(CAP_NET_ADMIN))
		return ptr;

	return some_crypto_hash(ptr, &securize_kpointers_rnd);
}

At least, use a central point, so that we can easily add/change the
logic if needed.

Please provide this patch in kernel/printk.c for initial review, then
once everybody is OK, you can send one patch for net tree.

No need to send 10 patches if we dont agree on the general principle.




^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  7:13 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Eric Dumazet, linux-kernel, netdev
In-Reply-To: <AANLkTik93QUQxSDmwd4Qj-gXQiWWPzd68JPAYAHBAsHR@mail.gmail.com>

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>>
>> Hi
>>
>> CC netdev, since you ask questions about network stuff _and_ rwlock
>>
>>
>>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>>> other platforms. It have a priority for write lock that when tried it
>>> will block the following read lock even if read lock is hold by
>>> others. Its code can be read in Linux Kernel 2.6.36 in
>>> arch/tile/lib/spinlock_32.c.
>>
>> This seems a bug to me.
>>
>> read_lock() can be nested. We used such a schem in the past in iptables
>> (it can re-enter itself),
>> and we used instead a spinlock(), but with many discussions with lkml
>> and Linus himself if I remember well.
>>
>It seems not a problem that read_lock() can be nested or not since
>rwlock doesn't have 'owner', it's just that should we give
>write_lock() a priority than read_lock() since if there have a lot
>read_lock()s then they'll starve write_lock().
>We should work out a well defined behavior so all the
>platform-dependent raw_rwlock has to design under that principle.

It is a known weakness of rwlock, it is designed like that. :)

The solution is to use RCU or seqlock, but I don't think seqlock
is proper for this case you described. So, try RCU lock.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  6:28 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Eric Dumazet, linux-kernel, netdev
In-Reply-To: <AANLkTik93QUQxSDmwd4Qj-gXQiWWPzd68JPAYAHBAsHR@mail.gmail.com>

On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>
>> Is TILE using ticket spinlocks ?
>>

Eric, yes.


>
>What ticket spinlocks means? Could you explain a little, pls:) I'm not
>familiar with Linux kernel, I'm trying to get more understanding of it
>since I'm working on Linux platform now.
>

You might want to search "ticket spinlock" with google. :)

Briefly speaking, it is introduced to solve unfairness
of the old spinlock implementation. In linux kernel, not all
arch implement this. X86 implementation is done by Nick with
asm code, while tile uses C to implement it.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12  3:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev
In-Reply-To: <1289489007.17691.1310.camel@edumazet-laptop>

On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>
> Hi
>
> CC netdev, since you ask questions about network stuff _and_ rwlock
>
>
>> I'm using TILEPro and its rwlock in kernel is a liitle different than
>> other platforms. It have a priority for write lock that when tried it
>> will block the following read lock even if read lock is hold by
>> others. Its code can be read in Linux Kernel 2.6.36 in
>> arch/tile/lib/spinlock_32.c.
>
> This seems a bug to me.
>
> read_lock() can be nested. We used such a schem in the past in iptables
> (it can re-enter itself),
> and we used instead a spinlock(), but with many discussions with lkml
> and Linus himself if I remember well.
>
It seems not a problem that read_lock() can be nested or not since
rwlock doesn't have 'owner', it's just that should we give
write_lock() a priority than read_lock() since if there have a lot
read_lock()s then they'll starve write_lock().
We should work out a well defined behavior so all the
platform-dependent raw_rwlock has to design under that principle.
>
>>
>> That different could cause a deadlock in kernel if we join/leave
>> Multicast Group simultaneous and frequently on mutlicores. IGMP
>> message is sent by
>>
>> igmp_ifc_timer_expire() -> igmpv3_send_cr() -> igmpv3_sendpack()
>>
>> in timer interrupt, igmpv3_send_cr() will generate the sk_buff for
>> IGMP message with mc_list_lock read locked and then call
>> igmpv3_sendpack() with it unlocked.
>> But if we have so many join/leave messages have to generate and it
>> can't be sent in one sk_buff then igmpv3_send_cr() -> add_grec() will
>> call igmpv3_sendpack() to send it and reallocate a new buffer. When
>> the message is sent:
>>
>> __mkroute_output() -> ip_check_mc()
>>
>> will read lock mc_list_lock again. If there is another core is try
>> write lock mc_list_lock between the two read lock, then deadlock
>> ocurred.
>>
>> The rwlock on other platforms I've check, say, PowerPC, x86, ARM, is
>> just read lock shared and write_lock mutex, so if we've hold read lock
>> the write lock will just wait, and if there have a read lock again it
>> will success.
>>
>> So, What's the criteria of rwlock design in the Linux kernel? Is that
>> read lock re-hold of IGMP a design error in Linux kernel, or the read
>> lock has to be design like that?
>>
>
> Well, we try to get rid of all rwlocks in performance critical sections.
>
> I would say, if you believe one rwlock can justify the special TILE
> behavior you tried to make, then we should instead migrate this rwlock
> to a RCU + spinlock schem (so that all arches benefit from this work,
> not only TILE)

IGMP in not very performance critical so maybe rwlock is enough?

>
>> There is a other thing, that the timer interrupt will start timer on
>> the same in_dev, should that be optimized?
>>
>
> Not sure I understand what you mean.

Since mc_list & mc_tomb is shared list in the kernel we needn't to
start multiple timers on different cores for them, we can synchronize
it on one core.

>
>> BTW: If we have so many cores, say 64, is there other things we have
>> to think about spinlock? If there have collisions ocurred, should we
>> just read the shared memory again and again, or just a very little
>> 'delay' is better? I've seen relax() is called in the implementation
>> of spinlock on TILEPro platform.
>> --
>
> Is TILE using ticket spinlocks ?
>

What ticket spinlocks means? Could you explain a little, pls:) I'm not
familiar with Linux kernel, I'm trying to get more understanding of it
since I'm working on Linux platform now.

>
>

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: David Miller @ 2010-11-12  2:59 UTC (permalink / raw)
  To: drosenberg-PiUznwcHFHrqlBn2x/YWAg
  Cc: urs.thuermann-l29pVbxQd1IUtdQbppsyvg,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	linux-sctp-u79uwXL29TY76Z2rM5mHXA,
	shemminger-ZtmgI6mnKB3QT0dZR+AlfA, xemul-GEFAQzZX7r8dnm+yROfE0A,
	pekkas-UjJjq++bwZ7HOG6cAo2yLw,
	eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w,
	jmorris-gx6/JNMH7DfYtjvyW6yDsg, kuznet-v/Mj1YrvjDBInbfyfbPRSQ,
	adobriyan-Re5JQEeQqe8AvxtiuMwx3w, sri-r/Jw6+rmf7HQT0dZR+AlfA,
	johannes.berg-ral2JQCrhuEAvxtiuMwx3w, hadi-jkUAjuhPggJWk0Htik3J/w,
	vladislav.yasevich-VXdhtT5mjnY, lizf-BthXqXjhjHXQFUHtdCDX3A,
	tj-DgEjT+Ai2ygdnm+yROfE0A,
	remi.denis-courmont-xNZwKgViW5gAvxtiuMwx3w,
	daniel.lezcano-GANU6spQydw, jpirko-H+wXaHxf7aLQT0dZR+AlfA,
	yoshfuji-VfPWfsRibaP+Ru+s062T9g, socketcan-fJ+pQTUTwRTk1uMJSBkQmQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	joe-6d6DIl74uiNBDgjK7y7TUQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kaber-dcUjhNyLwpNeoWH0uzbU5w
In-Reply-To: <1289530264.3090.212.camel@Dan>

From: Dan Rosenberg <drosenberg-PiUznwcHFHrqlBn2x/YWAg@public.gmane.org>
Date: Thu, 11 Nov 2010 21:51:04 -0500

> This has already been suggested, and I agree it is a much better
> approach.  If I take this approach, and find some suitable substitute
> for those cases where the socket inode is not available, will you
> consider these changes?

I will consider an approach where the keys reported allow object
tracking equally as the actual pointers allow for right now.

I've said that this is my criteria about 3 times now.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12  2:51 UTC (permalink / raw)
  To: David Miller
  Cc: socketcan, kuznet, urs.thuermann, yoshfuji, kaber, jmorris,
	remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	eric.dumazet, lizf, joe, shemminger, hadi, ebiederm, adobriyan,
	jpirko, johannes.berg, daniel.lezcano, xemul, socketcan-core,
	netdev, linux-sctp, torvalds
In-Reply-To: <20101111.184902.233699247.davem@davemloft.net>


> > 
> >> I want whatever you replace it with to be equivalent for
> >> object tracking purposes.
> > 
> > In nearly all of the cases I fixed, the socket inode is already
> > provided, which serves as a perfectly good unique identifier.  Would you
> > prefer I include that information twice?
> 
> The problem is that the socket inode is not available in a certain
> subclass of cases, so the transformation is not equivalent.
> 
> Why not attack this at the heart of where your concern is, and hack
> the %p format handling to do whatever it is you like instead of
> patching code all over the tree?

This has already been suggested, and I agree it is a much better
approach.  If I take this approach, and find some suitable substitute
for those cases where the socket inode is not available, will you
consider these changes?

-Dan


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox