Netdev List
 help / color / mirror / Atom feed
* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Krishna Kumar2 @ 2010-11-09 15:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <20101109132239.GF22705@redhat.com>

"Michael S. Tsirkin" <mst@redhat.com> wrote on 11/09/2010 06:52:39 PM:

> > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > >
> > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > > >
> > > > Any feedback, comments, objections, issues or bugs about the
> > > > patches? Please let me know if something needs to be done.
> > > >
> > > > Some more test results:
> > > > _____________________________________________________
> > > >          Host->Guest BW (numtxqs=2)
> > > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > > _____________________________________________________
> > >
> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > I had to make a few changes to qemu (and a minor change in macvtap
> > driver) to get multiple TXQ support using macvtap working. The NIC
> > is a ixgbe card.
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> >
__________________________________________________________________________
> > 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> > 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> > 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> > 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> > 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> > 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> > 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> > 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495
(-10.4)
> > 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538
(-9.5)
> > 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403
(-5.1)
> > 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419
(-8.7)
> > 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463
(-8.8)
> > 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337
(-14.6)
> >
__________________________________________________________________________
> > Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> >
> >
__________________________________________________________________________
> >             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> >
__________________________________________________________________________
> > 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> > 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> > 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> > 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> > 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> > 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> > 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> > 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> > 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403
(-10.9)
> > 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> > 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824
(-15.4)
> > 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131
(-11.4)
> > 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873
(-18.8)
> >
__________________________________________________________________________
> > Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> >
> > Is there anything else you would like me to test/change, or shall
> > I submit the next version (with the above macvtap changes)?
> >
> > Thanks,
> >
> > - KK
>
> Something strange here, right?
> 1. You are consistently getting >10G/s here, and even with a single
stream?

Sorry, I should have mentioned this though I had stated in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1 for which I do 10 iteration
(sum across 10 iterations).  I started doing many more iterations
for 1 netperf after finding the issue earlier with single stream.
So the BW is only 4.5-7 Gbps.

> 2. With 2 streams, is where we get < 10G/s originally. Instead of
>    doubling that we get a marginal improvement with 2 queues and
>    about 30% worse with 1 queue.

(doubling happens consistently for guest -> host, but never for
remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
testing scenario. In first case, there is a slight improvement in
BW and good reduction in SD. In the second case, only SD improves
(though BW drops for 2 stream for some reason).  In both cases,
BW and SD improves as the number of sessions increase.

> Is your card MQ?

Yes, the card is MQ. ixgbe 10g card.

Thanks,

- KK


^ permalink raw reply

* Re: Netlink limitations
From: Thomas Graf @ 2010-11-09 14:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Jan Engelhardt, David S. Miller, pablo, netdev
In-Reply-To: <4CD91406.4000603@trash.net>

On Tue, Nov 09, 2010 at 10:27:34AM +0100, Patrick McHardy wrote:
> Am 08.11.2010 16:16, schrieb Thomas Graf:
> > On Sun, Nov 07, 2010 at 06:17:43PM +0100, Patrick McHardy wrote:
> >> On 07.11.2010 17:44, Jan Engelhardt wrote:
> >>> ...
> > 
> >> That's somewhat similar to the nlattr32 idea, but a length of 0 makes
> >> more sense since that's currently not used. In that case the length
> >> would be read from a second length field which has 32 bits.
> > 
> > I agree. This makes perfect sense. I was just thinking wheter we want
> > to encode more information into the attribute header if we extend it
> > anyways.
> 
> What kind of additional information are you thinking about?

We have tried to come up with ways of forwarding netlink messages to
other machines several times. It always failed due to the fact that
protocols encode attributes/data differently without having the
ability to specify the encoding. E.g. we have nfnetlink encoding
everything in network byte order, most others encode it in host byte
order. Therefore all parsers must be aware of all protocols if they
want to forward/do something with the data. A flag would help there.

I haven't given up on the idea of self describing netlink protocols
yet. For example we could encode the attribute type
(i8|u16|u32|u16|string) in additional to the existing nested attribute
flag.

Given this additional meta data to each attribute and given we limit
ourselves to only using strict netlink attribtes going forward, i.e.
we no longer encode whole structs in attributes we'd be given netlink
protocols which can be forwarded, saved, audited while still being
fairly efficient.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Marc Kleine-Budde @ 2010-11-09 14:47 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	Oliver Hartkopp, LKML, socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, Wolfgang Grandegger,
	netdev-u79uwXL29TY76Z2rM5mHXA, joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	David S. Miller, Christian Pellegrin,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <00fd01cb8009$5efc5fd0$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2312 bytes --]

On 11/09/2010 01:26 PM, Tomoya MORINAGA wrote:
>> It's just a question if the driver for an Intel Chipset should/could ignore
>> the endian problematic.
>>
>> If this is intended the driver should only appear in Kconfig depending on X86
>> or little endian systems.
> 
> This driver is for only x86 processor.
> I have no intention to support processor except x86.

Fair enough, if you do the endianess conversion correct, either with my
proposed cpu_to_be16 or with the simpler and probably unnoticeable
slower version from Oliver. With Olivers version it's a bit harder to
build a loop that just copies the number of bytes specified in dlc.

>> Besides this remark, the struct pch_can_regs could also be defined in a way
>> that every single CAN payload data byte could be accessed directly to allow
>> e.g. this code in pch_can_rx_normal():
>>
>> cf->data[0] = ioread8(&priv->regs->if1_data0);
>> cf->data[1] = ioread8(&priv->regs->if1_data1);
>> cf->data[2] = ioread8(&priv->regs->if1_data2);
>> (..)
>> cf->data[6] = ioread8(&priv->regs->if1_data6);
>> cf->data[7] = ioread8(&priv->regs->if1_data7);
>>
>> This is easy to understand and additionally endian aware.
> 
> I agree. Thease codes are very simple.
> I will modify like above.
> 
>>
>> In opposite to this:
>>
>> + if (cf->can_dlc > 2) {
>> + u32 data1 = *((u16 *)&cf->data[2]);
>> + iowrite32(data1, &priv->regs->if2_dataa2);
>> + }
>>
>> Which puts a little? endian u16 into the big endian register layout.
>> Sorry i'm just getting curious on this register access implementation.
> 
> I think cf->data is like below
> MSB----------LSB
> D3-D2-D1-D0
> D7-D6-D5-D4

No, cf->data is an array of 8 u8, so it looks this way:

D0 D1 D2 D3  D4 D5 D6 D7

> *((u16 *)&cf->data[2]) means "D3-D2".

No, it's D2 D3. This is why you need the endianess conversion.

> This order coprresponds to register order.
> data register is like below
> MSB----------LSB
> **-**-D3-D2
> 
> ** means reserved area.

cheers, Marc


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* Re: Loopback performance from kernel 2.6.12 to 2.6.37
From: Jesper Dangaard Brouer @ 2010-11-09 14:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1289312178.17448.20.camel@traveldev.cxnet.dk>

On Tue, 2010-11-09 at 15:16 +0100, Jesper Dangaard Brouer wrote:
> On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> > ...
> > > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > > 
> > > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> 
> > > Their network test is basically :
> > > 
> > > netcat -l 9999 >/dev/null &
> > > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> > 
> > Should it not be:
> >  netcat -l -p 9999 >/dev/null &
> > 
> > When I run the commands "dd | netcat", netcat never finish/exits, I have
> > to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> To fix this I added "-q 0" to netcat.  Thus my working commands are:
> 
>  netcat -l -p 9999 >/dev/null &
>  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 
> Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> kernel version 2.6.32-5-amd64 (which I usually don't use)
> The results are 7.487 sec

Using kernel 2.6.35.8-comx01+ (which is 35-stable with some minor
patches of my own) on the same type of hardware (our preprod server).
The result is 12 sec.

time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 12,0805 s, 868 MB/s

real    0m12.082s
user    0m0.311s
sys     0m15.896s

BUT perf top reveals that its probably related to the function
'find_busiest_group' ... any kernel config hints how I get rid of that?


              samples  pcnt function                    DSO
             _______ _____ ___________________________ ______________

             4152.00 12.8% copy_user_generic_string    [kernel]      
             1802.00  5.6% find_busiest_group          [kernel]      
              852.00  2.6% __clear_user                [kernel]      
              836.00  2.6% _raw_spin_lock_bh           [kernel]      
              819.00  2.5% ipt_do_table                [ip_tables]   
              628.00  1.9% rebalance_domains           [kernel]      
              564.00  1.7% _raw_spin_lock              [kernel]      
              562.00  1.7% _raw_spin_lock_irqsave      [kernel]      
              522.00  1.6% schedule                    [kernel]      
              441.00  1.4% find_next_bit               [kernel]      
              413.00  1.3% _raw_spin_unlock_irqrestore [kernel]      
              394.00  1.2% tcp_sendmsg                 [kernel]      
              391.00  1.2% tcp_packet                  [nf_conntrack]
              368.00  1.1% do_select                   [kernel]      



> Using vmstat I see approx 400000 context switches per sec.
>
Previous:
> Perf top says:
>             samples  pcnt function                  DSO
>              _______ _____ _________________________ ___________
> 
>              6442.00 16.3% copy_user_generic_string  [kernel]   
>              2226.00  5.6% __clear_user              [kernel]   
>               912.00  2.3% _spin_lock_irqsave        [kernel]   
>               773.00  2.0% _spin_lock_bh             [kernel]   
>               736.00  1.9% schedule                  [kernel]   
>               582.00  1.5% ipt_do_table              [ip_tables]
>               569.00  1.4% _spin_lock                [kernel]   
>               505.00  1.3% get_page_from_freelist    [kernel]   
>               451.00  1.1% _spin_unlock_irqrestore   [kernel]   
>               434.00  1.1% do_select                 [kernel]   
>               354.00  0.9% tcp_sendmsg               [kernel]   
>               348.00  0.9% tick_nohz_stop_sched_tick [kernel]   
>               347.00  0.9% tcp_transmit_skb          [kernel]   
>               345.00  0.9% zero_fd_set               [kernel]   


-- 
Jesper Dangaard Brouer
ComX Networks A/S


^ permalink raw reply

* Re: Loopback performance from kernel 2.6.12 to 2.6.37
From: Eric Dumazet @ 2010-11-09 14:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev
In-Reply-To: <1289312178.17448.20.camel@traveldev.cxnet.dk>

Le mardi 09 novembre 2010 à 15:16 +0100, Jesper Dangaard Brouer a
écrit :
> On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> > On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> > ...
> > > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > > 
> > > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> 
> > > Their network test is basically :
> > > 
> > > netcat -l 9999 >/dev/null &
> > > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> > 
> > Should it not be:
> >  netcat -l -p 9999 >/dev/null &
> > 
> > When I run the commands "dd | netcat", netcat never finish/exits, I have
> > to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> To fix this I added "-q 0" to netcat.  Thus my working commands are:
> 
>  netcat -l -p 9999 >/dev/null &
>  time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 
> Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
> kernel version 2.6.32-5-amd64 (which I usually don't use)
> The results are 7.487 sec:
> 
> time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 7,48562 s, 1,4 GB/s
> 
> real    0m7.487s
> user    0m0.224s
> sys     0m9.785s

So far, so good. These are the expected numbers. Now we have to
understand why corei7 gets 38 seconds instead of 8 :)




^ permalink raw reply

* SEASON LOAN!
From: PRIVATE HOME LENDER INC @ 2010-11-09 14:18 UTC (permalink / raw)
  To: aitierfamily

Apply for a loan to establish your business.
Our interest is very affordable and our loan process is very fast as well as
percentage rate of 2.5% yea rly from $5 000.00 Min. To $100 000 000.00 M a x.
Contact us via email if you are interested with the following details


NAME:
PHONE:
DURATION:
ADDRESS:
AMOUNT:

Regards,
Private Money Lender Inc
phl111@kkwl.ac.th.


^ permalink raw reply

* Re: Loopback performance from kernel 2.6.12 to 2.6.37
From: Jesper Dangaard Brouer @ 2010-11-09 14:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1289311159.17448.9.camel@traveldev.cxnet.dk>

On Tue, 2010-11-09 at 14:59 +0100, Jesper Dangaard Brouer wrote:
> On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> ...
> > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > 
> > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6

> > Their network test is basically :
> > 
> > netcat -l 9999 >/dev/null &
> > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 
> Should it not be:
>  netcat -l -p 9999 >/dev/null &
> 
> When I run the commands "dd | netcat", netcat never finish/exits, I have
> to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?

To fix this I added "-q 0" to netcat.  Thus my working commands are:

 netcat -l -p 9999 >/dev/null &
 time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999

Running this on my "big" 10G testlab system, Dual Xeon 5550 2.67GHz,
kernel version 2.6.32-5-amd64 (which I usually don't use)
The results are 7.487 sec:

time dd if=/dev/zero bs=1M count=10000 | netcat -q0 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 7,48562 s, 1,4 GB/s

real    0m7.487s
user    0m0.224s
sys     0m9.785s

Using vmstat I see approx 400000 context switches per sec.

Perf top says:
            samples  pcnt function                  DSO
             _______ _____ _________________________ ___________

             6442.00 16.3% copy_user_generic_string  [kernel]   
             2226.00  5.6% __clear_user              [kernel]   
              912.00  2.3% _spin_lock_irqsave        [kernel]   
              773.00  2.0% _spin_lock_bh             [kernel]   
              736.00  1.9% schedule                  [kernel]   
              582.00  1.5% ipt_do_table              [ip_tables]
              569.00  1.4% _spin_lock                [kernel]   
              505.00  1.3% get_page_from_freelist    [kernel]   
              451.00  1.1% _spin_unlock_irqrestore   [kernel]   
              434.00  1.1% do_select                 [kernel]   
              354.00  0.9% tcp_sendmsg               [kernel]   
              348.00  0.9% tick_nohz_stop_sched_tick [kernel]   
              347.00  0.9% tcp_transmit_skb          [kernel]   
              345.00  0.9% zero_fd_set               [kernel]   

Perf also complains about it cannot resolve netcat debug symbols.


-- 
Jesper Dangaard Brouer
ComX Networks A/S


^ permalink raw reply

* Re: Loopback performance from kernel 2.6.12 to 2.6.37
From: Jesper Dangaard Brouer @ 2010-11-09 13:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1289228785.2820.203.camel@edumazet-laptop>

On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
...
> > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > 
> > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> >

> CC netdev, if you dont mind.

No problem :-)

> Their network test is basically :
> 
> netcat -l 9999 >/dev/null &
> time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999

Should it not be:
 netcat -l -p 9999 >/dev/null &

When I run the commands "dd | netcat", netcat never finish/exits, I have
to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?

"dd" reports 17.54 sec, and but the "time" measurement cannot be uses as
the netcat just hangs/waits for more data...

time dd if=/dev/zero bs=1M count=10000 | netcat 127.0.0.1 9999
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17,5419 s, 598 MB/s

When I run the commands, I see 261682 context switches per sec...

This quick test were on a Core i7 920 using kernel
2.6.32-rc3-net-next-dev-mp2t.

-- 
Jesper Dangaard Brouer
ComX Networks A/S


^ permalink raw reply

* Re: Loopback performance from kernel 2.6.12 to 2.6.37
From: Eric Dumazet @ 2010-11-09 14:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev
In-Reply-To: <1289311159.17448.9.camel@traveldev.cxnet.dk>

Le mardi 09 novembre 2010 à 14:59 +0100, Jesper Dangaard Brouer a
écrit :
> On Mon, 2010-11-08 at 16:06 +0100, Eric Dumazet wrote:
> ...
> > > > > > I noticed that the loopback performance has gotten quite bad:
> > > > > > 
> > > > > > http://www.phoronix.com/scan.php?page=article&item=linux_2612_2637&num=6
> > >
> 
> > CC netdev, if you dont mind.
> 
> No problem :-)
> 
> > Their network test is basically :
> > 
> > netcat -l 9999 >/dev/null &
> > time dd if=/dev/zero bs=1M count=10000 | netcat  127.0.0.1 9999
> 
> Should it not be:
>  netcat -l -p 9999 >/dev/null &
> 

It depends on netcat version. On yours, yes, you need the "-p"

> When I run the commands "dd | netcat", netcat never finish/exits, I have
> to press Ctrl-C to stop it.  What am I doing wrong? Any tricks?
> 
> "dd" reports 17.54 sec, and but the "time" measurement cannot be uses as
> the netcat just hangs/waits for more data...
> 

Your netcat version needs a "-c" switch

time dd if=/dev/zero bs=1M count=10000 | netcat -c 127.0.0.1 9999


> time dd if=/dev/zero bs=1M count=10000 | netcat 127.0.0.1 9999
> 10000+0 records in
> 10000+0 records out
> 10485760000 bytes (10 GB) copied, 17,5419 s, 598 MB/s
> 
> When I run the commands, I see 261682 context switches per sec...
> 
> This quick test were on a Core i7 920 using kernel
> 2.6.32-rc3-net-next-dev-mp2t.
> 

32 or 64bit kernel ?

Thanks !




^ permalink raw reply

* Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
From: Michael S. Tsirkin @ 2010-11-09 13:22 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: anthony, arnd, avi, davem, eric.dumazet, kvm, netdev, rusty
In-Reply-To: <OF529E89EE.84DB37B8-ON652577D6.001665CF-652577D6.00192784@in.ibm.com>

On Tue, Nov 09, 2010 at 10:08:21AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote on 10/26/2010 02:27:09 PM:
> 
> > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > Krishna Kumar2/India/IBM@IBMIN wrote on 10/20/2010 02:24:52 PM:
> > >
> > > Any feedback, comments, objections, issues or bugs about the
> > > patches? Please let me know if something needs to be done.
> > >
> > > Some more test results:
> > > _____________________________________________________
> > >          Host->Guest BW (numtxqs=2)
> > > #       BW%     CPU%    RCPU%   SD%     RSD%
> > > _____________________________________________________
> >
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> I had to make a few changes to qemu (and a minor change in macvtap
> driver) to get multiple TXQ support using macvtap working. The NIC
> is a ixgbe card.
> 
> __________________________________________________________________________
>             Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> #      BW1     BW2 (%)       SD1    SD2 (%)        RSD1    RSD2 (%)
> __________________________________________________________________________
> 1      14367   13142 (-8.5)  56     62 (10.7)      8        8 (0)
> 2      3652    3855 (5.5)    37     35 (-5.4)      7        6 (-14.2)
> 4      12529   12059 (-3.7)  65     77 (18.4)      35       35 (0)
> 8      13912   14668 (5.4)   288    332 (15.2)     175      184 (5.1)
> 16     13433   14455 (7.6)   1218   1321 (8.4)     920      943 (2.5)
> 24     12750   13477 (5.7)   2876   2985 (3.7)     2514     2348 (-6.6)
> 32     11729   12632 (7.6)   5299   5332 (.6)      4934     4497 (-8.8)
> 40     11061   11923 (7.7)   8482   8364 (-1.3)    8374     7495 (-10.4)
> 48     10624   11267 (6.0)   12329  12258 (-.5)    12762    11538 (-9.5)
> 64     10524   10596 (.6)    21689  22859 (5.3)    23626    22403 (-5.1)
> 80     9856    10284 (4.3)   35769  36313 (1.5)    39932    36419 (-8.7)
> 96     9691    10075 (3.9)   52357  52259 (-.1)    58676    53463 (-8.8)
> 128    9351    9794 (4.7)    114707 94275 (-17.8)  114050   97337 (-14.6)
> __________________________________________________________________________
> Avg:      BW: (3.3)      SD: (-7.3)      RSD: (-11.0)
> 
> __________________________________________________________________________
>             Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> #      BW1      BW2 (%)       SD1   SD2 (%)        RSD1   RSD2 (%)
> __________________________________________________________________________
> 1      16509    15985 (-3.1)  45    47 (4.4)       7       7 (0)
> 2      6963     4499 (-35.3)  17    51 (200.0)     7       7 (0)
> 4      12932    11080 (-14.3) 49    74 (51.0)      35      35 (0)
> 8      13878    14095 (1.5)   223   292 (30.9)     175     181 (3.4)
> 16     13440    13698 (1.9)   980   1131 (15.4)    926     942 (1.7)
> 24     12680    12927 (1.9)   2387  2463 (3.1)     2526    2342 (-7.2)
> 32     11714    12261 (4.6)   4506  4486 (-.4)     4941    4463 (-9.6)
> 40     11059    11651 (5.3)   7244  7081 (-2.2)    8349    7437 (-10.9)
> 48     10580    11095 (4.8)   10811 10500 (-2.8)   12809   11403 (-10.9)
> 64     10569    10566 (0)     19194 19270 (.3)     23648   21717 (-8.1)
> 80     9827     10753 (9.4)   31668 29425 (-7.0)   39991   33824 (-15.4)
> 96     10043    10150 (1.0)   45352 44227 (-2.4)   57766   51131 (-11.4)
> 128    9360     9979 (6.6)    92058 79198 (-13.9)  114381  92873 (-18.8)
> __________________________________________________________________________
> Avg:      BW: (-.5)      SD: (-7.5)      RSD: (-14.7)
> 
> Is there anything else you would like me to test/change, or shall
> I submit the next version (with the above macvtap changes)?
> 
> Thanks,
> 
> - KK

Something strange here, right?
1. You are consistently getting >10G/s here, and even with a single stream?
2. With 2 streams, is where we get < 10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.

Is your card MQ?

-- 
MST

^ permalink raw reply

* Re: [PATCH] virtio_net: Fix queue full check
From: Michael S. Tsirkin @ 2010-11-09 13:15 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Rusty Russell, davem, netdev, yvugenfi
In-Reply-To: <OF5977B5FB.669A3250-ON652577D6.00168221-652577D6.00180738@in.ibm.com>

On Tue, Nov 09, 2010 at 09:56:03AM +0530, Krishna Kumar2 wrote:
> Rusty Russell <rusty@rustcorp.com.au> wrote on 11/08/2010 04:38:47 AM:
> 
> > Re: [PATCH] virtio_net: Fix queue full check
> >
> > On Thu, 4 Nov 2010 10:54:24 pm Michael S. Tsirkin wrote:
> > > I thought about this some more.  I think the original
> > > code is actually correct in returning ENOSPC: indirect
> > > buffers are nice, but it's a mistake
> > > to rely on them as a memory allocation might fail.
> > >
> > > And if you look at virtio-net, it is dropping packets
> > > under memory pressure which is not really a happy outcome:
> > > the packet will get freed, reallocated and we get another one,
> > > adding pressure on the allocator instead of releasing it
> > > until we free up some buffers.
> > >
> > > So I now think we should calculate the capacity
> > > assuming non-indirect entries, and if we manage to
> > > use indirect, all the better.
> >
> > I've long said it's a weakness in the network stack that it insists
> > drivers stop the tx queue before they *might* run out of room, leading to
> > worst-case assumptions and underutilization of the tx ring.
> >
> > However, I lost that debate, and so your patch is the way it's supposed
> to
> > work.  The other main indirect user (block) doesn't care as its queue
> > allows for post-attempt blocking.
> >
> > I enhanced your commentry a little:
> >
> > Subject: virtio: return correct capacity to users
> > Date: Thu, 4 Nov 2010 14:24:24 +0200
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> >
> > We can't rely on indirect buffers for capacity
> > calculations because they need a memory allocation
> > which might fail.  In particular, virtio_net can get
> > into this situation under stress, and it drops packets
> > and performs badly.
> >
> > So return the number of buffers we can guarantee users.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> > Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> 
> I have tested this patch for 3-4 hours but so far I have not got the tx
> full
> error. I am not sure if "Tested-By" applies to this situation, but just in
> case:
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Reported-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> Tested-By: Krishna Kumar2 <krkumar2@in.ibm.com>
> 
> I think both this patch and the original patch I submitted
> are needed? That patch removes ENOMEM check and the increment
> of dev->stats.tx_fifo_errors, and reports "memory failure".
> 
> Thanks,
> 
> - KK

So I think your patch on top of this one would be wrong:
we actually make sure out of memory does not affect TX path
at all, so any error would be unexpected.

Incrementing tx fifo errors is probably also helpful for debugging.

If you like, we could kill the special handling for ENOMEM.
Not sure whether it matters.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Marc Kleine-Budde @ 2010-11-09 12:59 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, Wolfgang Grandegger,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <00fe01cb8009$62e11410$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 3570 bytes --]

On 11/09/2010 01:26 PM, Tomoya MORINAGA wrote:
>>>> Can you please explain me your locking sheme? If I understand the
>>>> documenation correctly the two message interfaces can be used mutual.
>>>> And you use one for rx the other one for tx.
>>>
>>> I show our locking scheme.
>>> When CPU accesses MessageRAM via IF1, CPU protect until read-modify-write
>>> so that IF2 access not occurred, vice versa.
>>
>> Why is that needed?
> 
> For MessageRAM data consistency.

As far as I understand the datasheet the access to IF1 and IF2 is
completely independent. Why do you lock here?

[...]

>>>> Please use just "debug" level not warning here. Consider to use
>>>> netdev_dbg() instead. IMHO the __func__ can be dropped and the
>>>> "official" name for the error is "Error Warning".
>>>
>>> I want to know the reason.
>>> Why is it not dev_warn but netdev_dbg ?
>>
>> If you use warning level it would end up on the console or and in the
>> syslog. It's quite complicated (for programs) to get information from
>> there. This is why we send CAN error frames. They hold the same
>> information but int a binary form, thus it's easier to process.
> 
> I understand the reason.
> BTW, Why do you say not dev_dbg but netdev_dbg ?

Sorry - netdev_dbg() is easier to use, its first argument is the
netdevice, while dev_dbg needs a device and that's deeply hidden in the
netdevice.

[...]

>>>>> +static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
>>>>> +{
>>>>> + unsigned long flags;
>>>>> + struct pch_can_priv *priv = netdev_priv(ndev);
>>>>> + struct can_frame *cf = (struct can_frame *)skb->data;
>>>>> + int tx_buffer_avail = 0;
>>>>
>>>> What I'm totally missing is the TX flow controll. Your driver has to
>>>> ensure that the package leave the controller in the order that come
>>>> into the xmit function. Further you have to stop your xmit queue if
>>>> you're out of tx objects and reenable if you have a object free.
>>>>
>>>> Use netif_stop_queue() and netif_wake_queue() for this.
>>>
>>> In this code, I think "out of tx objects" cannot be  occurred.
>>
>> It's not a matter of code it's the hardware. You cannot put more than a
>> certain number of CAN frames into the hardware. If you have a CAN bus at
>> a certain speed, you can only send a certain number of CAN frames in a
>> second. So you cannot push more than this amount of frames/s into the
>> hardware.
>>
>>> Nevertheless, are netif_stop_queue() and netif_wake_queue() is necessary ?
>>
>> Yes.
> 
> I can' understand your issue.
> Please can you hear my opinion?
> 
> Please see the head of pch_xmit.
> 
>>> + if (priv->tx_obj == (PCH_OBJ_NUM + 1)) { /* Point tail Obj + 1 */
>>> +  while (ioread32(&priv->regs->treq2) & 0xfc00)
>>> +   udelay(1);
> 
> When points tail of Tx message object,
> this driver waits until completion of all tx messaeg objects.

Looping busy it not an option here.

> Thus, application/driver ought not to be able to put Tx object exceed the number of tx message object.
> Thus I think these code(netif_stop_queue/netif_wake_queue) are completely redundant.

Nope - please remove the waiting completely and convert your flow
control to netif_stop_queue/netif_wake_queue.

cheers, Marc
-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 188 bytes --]

_______________________________________________
Socketcan-core mailing list
Socketcan-core-0fE9KPoRgkgATYTw5x5z8w@public.gmane.org
https://lists.berlios.de/mailman/listinfo/socketcan-core

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Tomoya MORINAGA @ 2010-11-09 12:26 UTC (permalink / raw)
  To: Marc Kleine-Budde
  Cc: Wolfgang Grandegger, David S. Miller, Wolfram Sang,
	Christian Pellegrin, Barry Song, Samuel Ortiz, socketcan-core,
	netdev, LKML, andrew.chih.howe.khor, qi.wang, margie.foster,
	yong.y.wang, Masayuki Ohtake, kok.howg.ewe, joel.clark
In-Reply-To: <005301cb7ffa$5b63cd90$66f8800a@maildom.okisemi.com>

Sorry, for late response.
----- Original Message ----- 
From: "Tomoya MORINAGA" <tomoya-linux@dsn.okisemi.com>
To: "Tomoya MORINAGA" <tomoya-linux@dsn.okisemi.com>
Sent: Tuesday, November 09, 2010 7:39 PM
Subject: Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings


> On 11/02/2010 11:27 AM, Tomoya MORINAGA wrote:
> > On Saturday, October 30, 2010 1:23 AM,  Marc Kleine-Budde wrote:
> > 
> >>> The driver has already been merged. Please send incremental patches
> >>> against david's net-2.6 branch.
> > 
> > I agree.
> > 
> >>
> >> Here a review, find comments inline. Lets talk about my remarks, please
> >> answer inline and don't delete the code.
> >>
> >> Can you please explain me your locking sheme? If I understand the
> >> documenation correctly the two message interfaces can be used mutual.
> >> And you use one for rx the other one for tx.
> > 
> > I show our locking scheme.
> > When CPU accesses MessageRAM via IF1, CPU protect until read-modify-write
> > so that IF2 access not occurred, vice versa.
> 
> Why is that needed?

For MessageRAM data consistency.

> 
> > 
> >>
> >> Please use netdev_<level> instead of dev_<level> for debug.
> > 
> > I agree.
> > 
> >>
> >>> --- /dev/null
> >>> +++ b/drivers/net/can/pch_can.c
> >>> @@ -0,0 +1,1436 @@
> >>> +/*
> >>> + * Copyright (C) 1999 - 2010 Intel Corporation.
> >>> + * Copyright (C) 2010 OKI SEMICONDUCTOR CO., LTD.
> >>> + *
> >>> + * This program is free software; you can redistribute it and/or modify
> >>> + * it under the terms of the GNU General Public License as published by
> >>> + * the Free Software Foundation; version 2 of the License.
> >>> + *
> >>> + * This program is distributed in the hope that it will be useful,
> >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >>> + * GNU General Public License for more details.
> >>> + *
> >>> + * You should have received a copy of the GNU General Public License
> >>> + * along with this program; if not, write to the Free Software
> >>> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307, USA.
> >>> + */
> >>> +
> >>> +#include <linux/interrupt.h>
> >>> +#include <linux/delay.h>
> >>> +#include <linux/io.h>
> >>> +#include <linux/module.h>
> >>> +#include <linux/sched.h>
> >>> +#include <linux/pci.h>
> >>> +#include <linux/init.h>
> >>> +#include <linux/kernel.h>
> >>> +#include <linux/types.h>
> >>> +#include <linux/errno.h>
> >>> +#include <linux/netdevice.h>
> >>> +#include <linux/skbuff.h>
> >>> +#include <linux/can.h>
> >>> +#include <linux/can/dev.h>
> >>> +#include <linux/can/error.h>
> >>> +
> >>> +#define MAX_MSG_OBJ  32
> >>> +#define MSG_OBJ_RX  0 /* The receive message object flag. */
> >>> +#define MSG_OBJ_TX  1 /* The transmit message object flag. */
> >>> +
> >>> +#define CAN_CTRL_INIT  0x0001 /* The INIT bit of CANCONT register. */
> >>> +#define CAN_CTRL_IE  0x0002 /* The IE bit of CAN control register */
> >>> +#define CAN_CTRL_IE_SIE_EIE 0x000e
> >>> +#define CAN_CTRL_CCE  0x0040
> >>> +#define CAN_CTRL_OPT  0x0080 /* The OPT bit of CANCONT register. */
> >>> +#define CAN_OPT_SILENT  0x0008 /* The Silent bit of CANOPT reg. */
> >>> +#define CAN_OPT_LBACK  0x0010 /* The LoopBack bit of CANOPT reg. */
> >>> +#define CAN_CMASK_RX_TX_SET 0x00f3
> >>> +#define CAN_CMASK_RX_TX_GET 0x0073
> >>> +#define CAN_CMASK_ALL  0xff
> >>> +#define CAN_CMASK_RDWR  0x80
> >>> +#define CAN_CMASK_ARB  0x20
> >>> +#define CAN_CMASK_CTRL  0x10
> >>> +#define CAN_CMASK_MASK  0x40
> >>> +#define CAN_CMASK_NEWDAT 0x04
> >>> +#define CAN_CMASK_CLRINTPND 0x08
> >>> +
> >>> +#define CAN_IF_MCONT_NEWDAT 0x8000
> >>> +#define CAN_IF_MCONT_INTPND 0x2000
> >>> +#define CAN_IF_MCONT_UMASK 0x1000
> >>> +#define CAN_IF_MCONT_TXIE 0x0800
> >>> +#define CAN_IF_MCONT_RXIE 0x0400
> >>> +#define CAN_IF_MCONT_RMTEN 0x0200
> >>> +#define CAN_IF_MCONT_TXRQXT 0x0100
> >>> +#define CAN_IF_MCONT_EOB 0x0080
> >>> +#define CAN_IF_MCONT_DLC 0x000f
> >>> +#define CAN_IF_MCONT_MSGLOST 0x4000
> >>> +#define CAN_MASK2_MDIR_MXTD 0xc000
> >>> +#define CAN_ID2_DIR  0x2000
> >>> +#define CAN_ID_MSGVAL  0x8000
> >>> +
> >>> +#define CAN_STATUS_INT  0x8000
> >>> +#define CAN_IF_CREQ_BUSY 0x8000
> >>> +#define CAN_ID2_XTD  0x4000
> >>> +
> >>> +#define CAN_REC   0x00007f00
> >>> +#define CAN_TEC   0x000000ff
> >>
> >> A prefix for like PCH_ instead of CAN_ for all those define above would
> >> be fine to avoid namespace clashes and/or confusion with the defines from the socketcan framework.
> >>
> > 
> > I agree.
> > 
> >>> +
> >>> +#define PCH_RX_OK  0x00000010
> >>> +#define PCH_TX_OK  0x00000008
> >>> +#define PCH_BUS_OFF  0x00000080
> >>> +#define PCH_EWARN  0x00000040
> >>> +#define PCH_EPASSIV  0x00000020
> >>> +#define PCH_LEC0  0x00000001
> >>> +#define PCH_LEC1  0x00000002
> >>> +#define PCH_LEC2  0x00000004
> >>
> >> These are just single set bit, please use BIT()
> >> Consider adding the name of the corresponding register to the define's
> >> name.
> > 
> > I agree.
> > 
> >>
> >>> +#define PCH_LEC_ALL  (PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
> >>> +#define PCH_STUF_ERR  PCH_LEC0
> >>> +#define PCH_FORM_ERR  PCH_LEC1
> >>> +#define PCH_ACK_ERR  (PCH_LEC0 | PCH_LEC1)
> >>> +#define PCH_BIT1_ERR  PCH_LEC2
> >>> +#define PCH_BIT0_ERR  (PCH_LEC0 | PCH_LEC2)
> >>> +#define PCH_CRC_ERR  (PCH_LEC1 | PCH_LEC2)
> >>> +
> >>> +/* bit position of certain controller bits. */
> >>> +#define BIT_BITT_BRP  0
> >>> +#define BIT_BITT_SJW  6
> >>> +#define BIT_BITT_TSEG1  8
> >>> +#define BIT_BITT_TSEG2  12
> >>> +#define BIT_IF1_MCONT_RXIE 10
> >>> +#define BIT_IF2_MCONT_TXIE 11
> >>> +#define BIT_BRPE_BRPE  6
> >>> +#define BIT_ES_TXERRCNT  0
> >>> +#define BIT_ES_RXERRCNT  8
> >>
> >> these are usually called SHIFT
> > 
> > I agree.  Is the below TRUE ?
> > e.g.#define PCH_SHIFT_BITT_BRP 0
> 
> I would put the SHIFT at the end, YMMV
> 
> #define PCH_BIT_BRP_SHIFT

I agree.

> 
> > 
> >>
> >>> +#define MSK_BITT_BRP  0x3f
> >>> +#define MSK_BITT_SJW  0xc0
> >>> +#define MSK_BITT_TSEG1  0xf00
> >>> +#define MSK_BITT_TSEG2  0x7000
> >>> +#define MSK_BRPE_BRPE  0x3c0
> >>> +#define MSK_BRPE_GET  0x0f
> >>> +#define MSK_CTRL_IE_SIE_EIE 0x07
> >>> +#define MSK_MCONT_TXIE  0x08
> >>> +#define MSK_MCONT_RXIE  0x10
> >>
> >> MSK or MASK is okay, however the last two are just single bits.
> >>
> >> Please add a PCH_ prefix here, too.
> > 
> > I agree.
> > 
> >>
> >>> +#define PCH_CAN_NO_TX_BUFF 1
> >>> +#define COUNTER_LIMIT  10
> >>
> >> dito
> > 
> > I agree.
> > 
> >>
> >>> +
> >>> +#define PCH_CAN_CLK  50000000 /* 50MHz */
> >>> +
> >>> +/*
> >>> + * Define the number of message object.
> >>> + * PCH CAN communications are done via Message RAM.
> >>> + * The Message RAM consists of 32 message objects.
> >>> + */
> >>> +#define PCH_RX_OBJ_NUM  26  /* 1~ PCH_RX_OBJ_NUM is Rx*/
> >>> +#define PCH_TX_OBJ_NUM  6  /* PCH_RX_OBJ_NUM is RX ~ Tx*/
> >>> +#define PCH_OBJ_NUM  (PCH_TX_OBJ_NUM + PCH_RX_OBJ_NUM)
> >>
> >> You define MAX_MSG_OBJ earlier, seems like two names for the same value.
> > 
> > In case, a use uses all message objects(=32), you are right.
> > But user does not alway use all message object.
> 
> No one will change these values if the driver isn't buggy. And it
> doesn't make any sense to not use all objects.

I see.
I will delete PCH_OBJ_NUM or MAX_MSG_OBJ.

> 
> > 
> > 
> >>
> >>> +
> >>> +#define PCH_FIFO_THRESH  16
> >>> +
> >>> +enum pch_can_mode {
> >>> + PCH_CAN_ENABLE,
> >>> + PCH_CAN_DISABLE,
> >>> + PCH_CAN_ALL,
> >>> + PCH_CAN_NONE,
> >>> + PCH_CAN_STOP,
> >>> + PCH_CAN_RUN,
> >>> +};
> >>> +
> >>> +struct pch_can_regs {
> >>> + u32 cont;
> >>> + u32 stat;
> >>> + u32 errc;
> >>> + u32 bitt;
> >>> + u32 intr;
> >>> + u32 opt;
> >>> + u32 brpe;
> >>> + u32 reserve1;
> >>
> >> VVVV
> >>> + u32 if1_creq;
> >>> + u32 if1_cmask;
> >>> + u32 if1_mask1;
> >>> + u32 if1_mask2;
> >>> + u32 if1_id1;
> >>> + u32 if1_id2;
> >>> + u32 if1_mcont;
> >>> + u32 if1_dataa1;
> >>> + u32 if1_dataa2;
> >>> + u32 if1_datab1;
> >>> + u32 if1_datab2;
> >> ^^^^
> >>
> >> these registers and....
> >>
> >>> + u32 reserve2;
> >>> + u32 reserve3[12];
> >>
> >> ...and these
> >>
> >> VVVV
> >>> + u32 if2_creq;
> >>> + u32 if2_cmask;
> >>> + u32 if2_mask1;
> >>> + u32 if2_mask2;
> >>> + u32 if2_id1;
> >>> + u32 if2_id2;
> >>> + u32 if2_mcont;
> >>> + u32 if2_dataa1;
> >>> + u32 if2_dataa2;
> >>> + u32 if2_datab1;
> >>> + u32 if2_datab2;
> >>
> >> ^^^^
> >>
> >> ...are identical. I suggest to make a struct defining a complete
> >> "Message Interface Register Set". If you include the correct number of
> >> reserved bytes in the struct, you can have an array of two of these
> >> structs in the struct pch_can_regs.
> > 
> > To me, IMHOHO, it looks insignificant point.
> > Please show the merit ?
> 
> See Wolfgangs comments. You can get rid of duplicated code....

Using this method, I can't image to be able to reduce code size, now.
However I will try it.


> 
> > 
> >>
> >>> + u32 reserve4;
> >>> + u32 reserve5[20];
> >>> + u32 treq1;
> >>> + u32 treq2;
> >>> + u32 reserve6[2];
> >>> + u32 reserve7[56];
> >>> + u32 reserve8[3];
> >>> + u32 srst;
> >>> +};
> >>> +
> >>> +struct pch_can_priv {
> >>> + struct can_priv can;
> >>> + struct pci_dev *dev;
> >>> + unsigned int tx_enable[MAX_MSG_OBJ];
> >>> + unsigned int rx_enable[MAX_MSG_OBJ];
> >>> + unsigned int rx_link[MAX_MSG_OBJ];
> >>> + unsigned int int_enables;
> >>> + unsigned int int_stat;
> >>> + struct net_device *ndev;
> >>> + spinlock_t msgif_reg_lock; /* Message Interface Registers Access Lock*/
> >>                                                                             ^^^
> >> please add a whitespace
> > 
> > I agree.
> > 
> >>
> >>> + unsigned int msg_obj[MAX_MSG_OBJ];
> >>> + struct pch_can_regs __iomem *regs;
> >>> + struct napi_struct napi;
> >>> + unsigned int tx_obj; /* Point next Tx Obj index */
> >>> + unsigned int use_msi;
> >>> +};
> >>> +
> >>> +static struct can_bittiming_const pch_can_bittiming_const = {
> >>> + .name = "pch_can",
> >>> + .tseg1_min = 1,
> >>> + .tseg1_max = 16,
> >>> + .tseg2_min = 1,
> >>> + .tseg2_max = 8,
> >>> + .sjw_max = 4,
> >>> + .brp_min = 1,
> >>> + .brp_max = 1024, /* 6bit + extended 4bit */
> >>> + .brp_inc = 1,
> >>> +};
> >>> +
> >>> +static DEFINE_PCI_DEVICE_TABLE(pch_pci_tbl) = {
> >>> + {PCI_VENDOR_ID_INTEL, 0x8818, PCI_ANY_ID, PCI_ANY_ID,},
> >>> + {0,}
> >>> +};
> >>> +MODULE_DEVICE_TABLE(pci, pch_pci_tbl);
> >>> +
> >>> +static inline void pch_can_bit_set(u32 *addr, u32 mask)
> >>                                       ^^^^^
> >>
> >> that should be an void __iomem *, see mail I've send the other day.
> >> Please use sparse to check for this kinds of errors.
> >> (Compile the driver with C=2, i.e.: make drivers/net/can/pch_can.ko C=2)
> >>
> > 
> > I agree.
> > 
> >>> +{
> >>> + iowrite32(ioread32(addr) | mask, addr);
> >>> +}
> >>> +
> >>> +static inline void pch_can_bit_clear(u32 *addr, u32 mask)
> >>                                         ^^^^^
> >>
> >> dito
> > 
> > I agree.
> > 
> >>
> >>> +{
> >>> + iowrite32(ioread32(addr) & ~mask, addr);
> >>> +}
> >>> +
> >>> +static void pch_can_set_run_mode(struct pch_can_priv *priv,
> >>> +     enum pch_can_mode mode)
> >>> +{
> >>> + switch (mode) {
> >>> + case PCH_CAN_RUN:
> >>> +  pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_INIT);
> >>> +  break;
> >>> +
> >>> + case PCH_CAN_STOP:
> >>> +  pch_can_bit_set(&priv->regs->cont, CAN_CTRL_INIT);
> >>> +  break;
> >>> +
> >>> + default:
> >>> +  dev_err(&priv->ndev->dev, "%s -> Invalid Mode.\n", __func__);
> >>> +  break;
> >>> + }
> >>> +}
> >>> +
> >>> +static void pch_can_set_optmode(struct pch_can_priv *priv)
> >>> +{
> >>> + u32 reg_val = ioread32(&priv->regs->opt);
> >>> +
> >>> + if (priv->can.ctrlmode & CAN_CTRLMODE_LISTENONLY)
> >>> +  reg_val |= CAN_OPT_SILENT;
> >>> +
> >>> + if (priv->can.ctrlmode & CAN_CTRLMODE_LOOPBACK)
> >>> +  reg_val |= CAN_OPT_LBACK;
> >>> +
> >>> + pch_can_bit_set(&priv->regs->cont, CAN_CTRL_OPT);
> >>> + iowrite32(reg_val, &priv->regs->opt);
> >>> +}
> >>> +
> >>
> >> IMHO the function name is missleading, if I understand the code
> >> correctly, this functions triggers the transmission of the message.
> >> After this it checks for busy, 
> > 
> > Yes, your understanding is TRUE.
> > 
> >> but
> >>
> >>> +static void pch_can_check_if_busy(u32 __iomem *creq_addr, u32 num)
> >>                                      ^^^^
> >>
> > 
> > Yes, me too.
> > I will rename the function name.
> > 
> > How about "pch_can_rw_msg_obj"
> > 
> >> that should probaby be a void
> > What't the above mean ?
> > pch_can_check_if_busy is already "void" function.
> 
> >>> +static void pch_can_check_if_busy(u32 __iomem *creq_addr, u32 num)
> >>                                     ^^^^
> 
> That u32 should be a void.

I agree.

> 
> BTW: Does the Intel chip support x64? If so, have you tested the driver
> on a 64 bit kernel.
> 
> > 
> >>
> >>> +{
> >>> + u32 counter = COUNTER_LIMIT;
> >>> + u32 ifx_creq;
> >>> +
> >>> + iowrite32(num, creq_addr);
> >>> + while (counter) {
> >>> +  ifx_creq = ioread32(creq_addr) & CAN_IF_CREQ_BUSY;
> >>> +  if (!ifx_creq)
> >>> +   break;
> >>> +  counter--;
> >>> +  udelay(1);
> >>> + }
> >>> + if (!counter)
> >>> +  pr_err("%s:IF1 BUSY Flag is set forever.\n", __func__);
> >>> +}
> >>> +
> >>> +static void pch_can_set_int_enables(struct pch_can_priv *priv,
> >>> +        enum pch_can_mode interrupt_no)
> >>> +{
> >>> + switch (interrupt_no) {
> >>> + case PCH_CAN_ENABLE:
> >>> +  pch_can_bit_set(&priv->regs->cont, CAN_CTRL_IE);
> >>
> >> noone uses this case.
> > 
> > I agree.
> > 
> >>
> >>> +  break;
> >>> +
> >>> + case PCH_CAN_DISABLE:
> >>> +  pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_IE);
> >>> +  break;
> >>> +
> >>> + case PCH_CAN_ALL:
> >>> +  pch_can_bit_set(&priv->regs->cont, CAN_CTRL_IE_SIE_EIE);
> >>> +  break;
> >>> +
> >>> + case PCH_CAN_NONE:
> >>> +  pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_IE_SIE_EIE);
> >>> +  break;
> >>> +
> >>> + default:
> >>> +  dev_err(&priv->ndev->dev, "Invalid interrupt number.\n");
> >>> +  break;
> >>> + }
> >>> +}
> >>> +
> >>> +static void pch_can_set_rx_enable(struct pch_can_priv *priv, u32 buff_num,
> >>> +      int set)
> >>> +{
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + /* Reading the receive buffer data from RAM to Interface1 registers */
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buff_num);
> >>> +
> >>> + /* Setting the IF1MASK1 register to access MsgVal and RxIE bits */
> >>> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >>> +    &priv->regs->if1_cmask);
> >>> +
> >>> + if (set == 1) {
> >>> +  /* Setting the MsgVal and RxIE bits */
> >>> +  pch_can_bit_set(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >>> +  pch_can_bit_set(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >>> +
> >>> + } else if (set == 0) {
> >>> +  /* Resetting the MsgVal and RxIE bits */
> >>> +  pch_can_bit_clear(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >>> +  pch_can_bit_clear(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >>> + }
> >>> +
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buff_num);
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +}
> >>> +
> >>> +static void pch_can_rx_enable_all(struct pch_can_priv *priv)
> >>> +{
> >>> + int i;
> >>> +
> >>> + /* Traversing to obtain the object configured as receivers. */
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++)
> >>> +  pch_can_set_rx_enable(priv, i, 1);
> >>> +}
> >>> +
> >>> +static void pch_can_rx_disable_all(struct pch_can_priv *priv)
> >>> +{
> >>> + int i;
> >>> +
> >>> + /* Traversing to obtain the object configured as receivers. */
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++)
> >>> +  pch_can_set_rx_enable(priv, i, 0);
> >>> +}
> >>> +
> >>> +static void pch_can_set_tx_enable(struct pch_can_priv *priv, u32 buff_num,
> >>> +     u32 set)
> >>> +{
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + /* Reading the Msg buffer from Message RAM to Interface2 registers. */
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >>> +
> >>> + /* Setting the IF2CMASK register for accessing the
> >>> +  MsgVal and TxIE bits */
> >>> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >>> +   &priv->regs->if2_cmask);
> >>> +
> >>> + if (set == 1) {
> >>> +  /* Setting the MsgVal and TxIE bits */
> >>> +  pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >>> +  pch_can_bit_set(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >>> + } else if (set == 0) {
> >>> +  /* Resetting the MsgVal and TxIE bits. */
> >>> +  pch_can_bit_clear(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >>> +  pch_can_bit_clear(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >>> + }
> >>> +
> >>> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +}
> >>> +
> >>> +static void pch_can_tx_enable_all(struct pch_can_priv *priv)
> >>> +{
> >>> + int i;
> >>> +
> >>> + /* Traversing to obtain the object configured as transmit object. */
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >>> +  pch_can_set_tx_enable(priv, i, 1);
> >>> +}
> >>> +
> >>> +static void pch_can_tx_disable_all(struct pch_can_priv *priv)
> >>> +{
> >>> + int i;
> >>> +
> >>> + /* Traversing to obtain the object configured as transmit object. */
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >>> +  pch_can_set_tx_enable(priv, i, 0);
> >>> +}
> >>> +
> >>> +static int pch_can_int_pending(struct pch_can_priv *priv)
> >>           ^^^
> >>
> >> make it u32 as it returns a register value, or a u16 as you only use
> >> the 16 lower bits.
> > 
> > I agree. I will modify to u32.
> > 
> >>
> >>> +{
> >>> + return ioread32(&priv->regs->intr) & 0xffff;
> >>> +}
> >>> +
> >>> +static void pch_can_clear_buffers(struct pch_can_priv *priv)
> >>> +{
> >>> + int i; /* Msg Obj ID (1~32) */
> >>> +
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> >>
> >> IMHO the readability would be improved if you define something like
> >> PCH_RX_OBJ_START and PCH_RX_OBJ_END.
> > 
> > I agree.
> > 
> >>
> >>> +  iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if1_cmask);
> >>> +  iowrite32(0xffff, &priv->regs->if1_mask1);
> >>> +  iowrite32(0xffff, &priv->regs->if1_mask2);
> >>> +  iowrite32(0x0, &priv->regs->if1_id1);
> >>> +  iowrite32(0x0, &priv->regs->if1_id2);
> >>> +  iowrite32(0x0, &priv->regs->if1_mcont);
> >>> +  iowrite32(0x0, &priv->regs->if1_dataa1);
> >>> +  iowrite32(0x0, &priv->regs->if1_dataa2);
> >>> +  iowrite32(0x0, &priv->regs->if1_datab1);
> >>> +  iowrite32(0x0, &priv->regs->if1_datab2);
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK |
> >>> +     CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >>> +     &priv->regs->if1_cmask);
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, i);
> >>> + }
> >>> +
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++) {
> >>                  ^^^^^^^^^^^^^^^^^^
> >> dito for TX objects
> > 
> > I agree.
> > 
> >>
> >>> +  iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if2_cmask);
> >>> +  iowrite32(0xffff, &priv->regs->if2_mask1);
> >>> +  iowrite32(0xffff, &priv->regs->if2_mask2);
> >>> +  iowrite32(0x0, &priv->regs->if2_id1);
> >>> +  iowrite32(0x0, &priv->regs->if2_id2);
> >>> +  iowrite32(0x0, &priv->regs->if2_mcont);
> >>> +  iowrite32(0x0, &priv->regs->if2_dataa1);
> >>> +  iowrite32(0x0, &priv->regs->if2_dataa2);
> >>> +  iowrite32(0x0, &priv->regs->if2_datab1);
> >>> +  iowrite32(0x0, &priv->regs->if2_datab2);
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK | CAN_CMASK_ARB |
> >>> +     CAN_CMASK_CTRL, &priv->regs->if2_cmask);
> >>> +  pch_can_check_if_busy(&priv->regs->if2_creq, i);
> >>> + }
> >>> +}
> >>> +
> >>> +static void pch_can_config_rx_tx_buffers(struct pch_can_priv *priv)
> >>> +{
> >>> + int i;
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> +
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> >>> +  iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, i);
> >>
> >> If I understand the code correctly, the about function triggers a
> >> transfer. Why do you first trigger a transfer, then set the message contents....
> > 
> > For doing Read-Modify-Write.
> > As to fixed parameter of message object, it doesn't be modified every access.
> 
> I see.
> 
> > We will modify to write only.
> > 
> >>> +
> >>> +  iowrite32(0x0, &priv->regs->if1_id1);
> >>> +  iowrite32(0x0, &priv->regs->if1_id2);
> >>> +
> >>> +  pch_can_bit_set(&priv->regs->if1_mcont, CAN_IF_MCONT_UMASK);
> >>
> >>     Why do you set the "Use acceptance mask" bit? We want to receive
> >>     all can messages.
> > 
> > Without "Use acceptance mask" means received packet matched ID[28:0] only.
> > As a result, filter is enabled.
> > 
> > With "Use acceptance mask" and setting Msk[0:28]=all 1, all packets can be received(=No filter state) 
> 
> Thanks for the explenation.
> 
> > 
> >>
> >>> +
> >>> +  /* Set FIFO mode set to 0 except last Rx Obj*/
> >>> +  pch_can_bit_clear(&priv->regs->if1_mcont, CAN_IF_MCONT_EOB);
> >>> +  /* In case FIFO mode, Last EoB of Rx Obj must be 1 */
> >>> +  if (i == (PCH_RX_OBJ_NUM - 1))
> >>> +   pch_can_bit_set(&priv->regs->if1_mcont,
> >>> +     CAN_IF_MCONT_EOB);
> >>
> >>     Make it if () { } else { }, please.
> > 
> > Sorry, I can't understand.
> > else {} is not necessary.
> 
> Please look at the code block above, again. You frist clean the bit
> unconditionally, then you set the bit in the if. Please make it:
> 
> if (last)
>  set_bit
> else
>  clear_bit

I understand.
I will modify like above.
Thanks.

> 
> > 
> >>
> >>> +
> >>> +  iowrite32(0, &priv->regs->if1_mask1);
> >>> +  pch_can_bit_clear(&priv->regs->if1_mask2,
> >>> +      0x1fff | CAN_MASK2_MDIR_MXTD);
> >>> +
> >>> +  /* Setting CMASK for writing */
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK | CAN_CMASK_ARB |
> >>> +     CAN_CMASK_CTRL, &priv->regs->if1_cmask);
> >>> +
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, i);
> >>
> >> ...and then trigger the transfer again?
> > 
> > This means Read-Modify-Write.
> 
> ic
> 
> > 
> >>
> >>> + }
> >>> +
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++) {
> >>> +  iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >>> +  pch_can_check_if_busy(&priv->regs->if2_creq, i);
> >>
> >> same question about triggering the transfer 2 times applied here, too
> > 
> > ditto.
> > 
> >>> +
> >>> +  /* Resetting DIR bit for reception */
> >>> +  iowrite32(0x0, &priv->regs->if2_id1);
> >>> +  iowrite32(0x0, &priv->regs->if2_id2);
> >>> +  pch_can_bit_set(&priv->regs->if2_id2, CAN_ID2_DIR);
> >>
> >> Can you combine the two accesses to >if2_id2 into one?
> > 
> > I agree.
> > 
> >>
> >>> +
> >>> +  /* Setting EOB bit for transmitter */
> >>> +  iowrite32(CAN_IF_MCONT_EOB, &priv->regs->if2_mcont);
> >>> +
> >>> +  pch_can_bit_set(&priv->regs->if2_mcont,
> >>> +    CAN_IF_MCONT_UMASK);
> >>
> >> dito for if2_mcont
> > 
> > ditto.
> > 
> >>
> >>> +
> >>> +  iowrite32(0, &priv->regs->if2_mask1);
> >>> +  pch_can_bit_clear(&priv->regs->if2_mask2, 0x1fff);
> >>> +
> >>> +  /* Setting CMASK for writing */
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK | CAN_CMASK_ARB |
> >>> +     CAN_CMASK_CTRL, &priv->regs->if2_cmask);
> >>> +
> >>> +  pch_can_check_if_busy(&priv->regs->if2_creq, i);
> >>> + }
> >>> +
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +}
> >>> +
> >>> +static void pch_can_init(struct pch_can_priv *priv)
> >>> +{
> >>> + /* Stopping the Can device. */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_STOP);
> >>> +
> >>> + /* Clearing all the message object buffers. */
> >>> + pch_can_clear_buffers(priv);
> >>> +
> >>> + /* Configuring the respective message object as either rx/tx object. */
> >>> + pch_can_config_rx_tx_buffers(priv);
> >>> +
> >>> + /* Enabling the interrupts. */
> >>> + pch_can_set_int_enables(priv, PCH_CAN_ALL);
> >>> +}
> >>> +
> >>> +static void pch_can_release(struct pch_can_priv *priv)
> >>> +{
> >>> + /* Stooping the CAN device. */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_STOP);
> >>> +
> >>> + /* Disabling the interrupts. */
> >>> + pch_can_set_int_enables(priv, PCH_CAN_NONE);
> >>> +
> >>> + /* Disabling all the receive object. */
> >>> + pch_can_rx_disable_all(priv);
> >>> +
> >>> + /* Disabling all the transmit object. */
> >>> + pch_can_tx_disable_all(priv);
> >>> +}
> >>> +
> >>> +/* This function clears interrupt(s) from the CAN device. */
> >>> +static void pch_can_int_clr(struct pch_can_priv *priv, u32 mask)
> >>> +{
> >>> + if (mask == CAN_STATUS_INT) {
> >>
> >> is this a valid case?
> > 
> > This "if" is always false.
> > I will delete this condition.
> > 
> >>
> >>> +  ioread32(&priv->regs->stat);
> >>> +  return;
> >>> + }
> >>> +
> >>> + /* Clear interrupt for transmit object */
> >>> + if ((mask >= 1) && (mask <= PCH_RX_OBJ_NUM)) {
> >>> +  /* Setting CMASK for clearing the reception interrupts. */
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL | CAN_CMASK_ARB,
> >>> +     &priv->regs->if1_cmask);
> >>> +
> >>> +  /* Clearing the Dir bit. */
> >>> +  pch_can_bit_clear(&priv->regs->if1_id2, CAN_ID2_DIR);
> >>> +
> >>> +  /* Clearing NewDat & IntPnd */
> >>> +  pch_can_bit_clear(&priv->regs->if1_mcont,
> >>> +      CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_INTPND);
> >>> +
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, mask);
> >>> + } else if ((mask > PCH_RX_OBJ_NUM) && (mask <= PCH_OBJ_NUM)) {
> >>> +  /* Setting CMASK for clearing interrupts for
> >>> +     frame transmission. */
> >>
> >> /*
> >>  * this is the prefered style of multi line comments,
> >>  * please adjust you comments
> >>  */
> > 
> > I understand.
> > 
> >>
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL | CAN_CMASK_ARB,
> >>> +     &priv->regs->if2_cmask);
> >>> +
> >>> +  /* Resetting the ID registers. */
> >>> +  pch_can_bit_set(&priv->regs->if2_id2,
> >>> +          CAN_ID2_DIR | (0x7ff << 2));
> >>> +  iowrite32(0x0, &priv->regs->if2_id1);
> >>> +
> >>> +  /* Claring NewDat, TxRqst & IntPnd */
> >>> +  pch_can_bit_clear(&priv->regs->if2_mcont,
> >>> +      CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_INTPND |
> >>> +      CAN_IF_MCONT_TXRQXT);
> >>> +  pch_can_check_if_busy(&priv->regs->if2_creq, mask);
> >>> + }
> >>> +}
> >>> +
> >>> +static u32 pch_can_get_buffer_status(struct pch_can_priv *priv)
> >>> +{
> >>> + return (ioread32(&priv->regs->treq1) & 0xffff) |
> >>> +        ((ioread32(&priv->regs->treq2) & 0xffff) << 16);
> >>
> >> the second 0xffff is not needed, as the return value is u32 and you shift by 16.
> > 
> > I agree.
> > 
> >>
> >>> +}
> >>> +
> >>> +static void pch_can_reset(struct pch_can_priv *priv)
> >>> +{
> >>> + /* write to sw reset register */
> >>> + iowrite32(1, &priv->regs->srst);
> >>> + iowrite32(0, &priv->regs->srst);
> >>> +}
> >>> +
> >>> +static void pch_can_error(struct net_device *ndev, u32 status)
> >>> +{
> >>> + struct sk_buff *skb;
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + struct can_frame *cf;
> >>> + u32 errc;
> >>> + struct net_device_stats *stats = &(priv->ndev->stats);
> >>> + enum can_state state = priv->can.state;
> >>> +
> >>> + skb = alloc_can_err_skb(ndev, &cf);
> >>> + if (!skb)
> >>> +  return;
> >>> +
> >>> + if (status & PCH_BUS_OFF) {
> >>> +  pch_can_tx_disable_all(priv);
> >>> +  pch_can_rx_disable_all(priv);
> >>> +  state = CAN_STATE_BUS_OFF;
> >>> +  cf->can_id |= CAN_ERR_BUSOFF;
> >>> +  can_bus_off(ndev);
> >>> + }
> >>> +
> >>> + /* Warning interrupt. */
> >>> + if (status & PCH_EWARN) {
> >>> +  state = CAN_STATE_ERROR_WARNING;
> >>> +  priv->can.can_stats.error_warning++;
> >>> +  cf->can_id |= CAN_ERR_CRTL;
> >>> +  errc = ioread32(&priv->regs->errc);
> >>> +  if (((errc & CAN_REC) >> 8) > 96)
> >>> +   cf->data[1] |= CAN_ERR_CRTL_RX_WARNING;
> >>> +  if ((errc & CAN_TEC) > 96)
> >>> +   cf->data[1] |= CAN_ERR_CRTL_TX_WARNING;
> >>> +  dev_warn(&ndev->dev,
> >>> +   "%s -> Error Counter is more than 96.\n", __func__);
> >>
> >> Please use just "debug" level not warning here. Consider to use
> >> netdev_dbg() instead. IMHO the __func__ can be dropped and the
> >> "official" name for the error is "Error Warning".
> > 
> > I want to know the reason.
> > Why is it not dev_warn but netdev_dbg ?
> 
> If you use warning level it would end up on the console or and in the
> syslog. It's quite complicated (for programs) to get information from
> there. This is why we send CAN error frames. They hold the same
> information but int a binary form, thus it's easier to process.

I understand the reason.
BTW, Why do you say not dev_dbg but netdev_dbg ?

> 
> > 
> >>
> >>> + }
> >>> + /* Error passive interrupt. */
> >>> + if (status & PCH_EPASSIV) {
> >>> +  priv->can.can_stats.error_passive++;
> >>> +  state = CAN_STATE_ERROR_PASSIVE;
> >>> +  cf->can_id |= CAN_ERR_CRTL;
> >>> +  errc = ioread32(&priv->regs->errc);
> >>> +  if (((errc & CAN_REC) >> 8) > 127)
> >>> +   cf->data[1] |= CAN_ERR_CRTL_RX_PASSIVE;
> >>> +  if ((errc & CAN_TEC) > 127)
> >>> +   cf->data[1] |= CAN_ERR_CRTL_TX_PASSIVE;
> >>> +  dev_err(&ndev->dev,
> >>> +   "%s -> CAN controller is ERROR PASSIVE .\n", __func__);
> >>
> >> dito
> > 
> > ditto
> > 
> >>
> >>> + }
> >>> +
> >>> + if (status & PCH_LEC_ALL) {
> >>> +  priv->can.can_stats.bus_error++;
> >>> +  stats->rx_errors++;
> >>> +  switch (status & PCH_LEC_ALL) {
> >>
> >> I suggest to convert to a if-bit-set because there might be more than
> >> one bit set.
> > 
> > I agree.
> > 
> >>
> >>> +  case PCH_STUF_ERR:
> >>> +   cf->data[2] |= CAN_ERR_PROT_STUFF;
> >>> +   break;
> >>> +  case PCH_FORM_ERR:
> >>> +   cf->data[2] |= CAN_ERR_PROT_FORM;
> >>> +   break;
> >>> +  case PCH_ACK_ERR:
> >>> +   cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
> >>> +           CAN_ERR_PROT_LOC_ACK_DEL;
> >>> +   break;
> >>> +  case PCH_BIT1_ERR:
> >>> +  case PCH_BIT0_ERR:
> >>> +   cf->data[2] |= CAN_ERR_PROT_BIT;
> >>> +   break;
> >>> +  case PCH_CRC_ERR:
> >>> +   cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> >>> +           CAN_ERR_PROT_LOC_CRC_DEL;
> >>> +   break;
> >>> +  default:
> >>> +   iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
> >>> +   break;
> >>> +  }
> >>> +
> >>> + }
> >>> +
> >>> + priv->can.state = state;
> >>> + netif_receive_skb(skb);
> >>> +
> >>> + stats->rx_packets++;
> >>> + stats->rx_bytes += cf->can_dlc;
> >>> +}
> >>> +
> >>> +static irqreturn_t pch_can_interrupt(int irq, void *dev_id)
> >>> +{
> >>> + struct net_device *ndev = (struct net_device *)dev_id;
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> +
> >>> + pch_can_set_int_enables(priv, PCH_CAN_NONE);
> >>> + napi_schedule(&priv->napi);
> >>> +
> >>> + return IRQ_HANDLED;
> >>> +}
> >>> +
> >>> +static void pch_fifo_thresh(struct pch_can_priv *priv, int obj_id)
> >>> +{
> >>> + if (obj_id < PCH_FIFO_THRESH) {
> >>> +  iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL |
> >>> +     CAN_CMASK_ARB, &priv->regs->if1_cmask);
> >>> +
> >>> +  /* Clearing the Dir bit. */
> >>> +  pch_can_bit_clear(&priv->regs->if1_id2, CAN_ID2_DIR);
> >>> +
> >>> +  /* Clearing NewDat & IntPnd */
> >>> +  pch_can_bit_clear(&priv->regs->if1_mcont,
> >>> +      CAN_IF_MCONT_INTPND);
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, obj_id);
> >>> + } else if (obj_id > PCH_FIFO_THRESH) {
> >>> +  pch_can_int_clr(priv, obj_id);
> >>> + } else if (obj_id == PCH_FIFO_THRESH) {
> >>> +  int cnt;
> >>> +  for (cnt = 0; cnt < PCH_FIFO_THRESH; cnt++)
> >>> +   pch_can_int_clr(priv, cnt+1);
> >>> + }
> >>> +}
> >>> +
> >>> +static int pch_can_rx_msg_lost(struct net_device *ndev, int obj_id)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + struct net_device_stats *stats = &(priv->ndev->stats);
> >>> + struct sk_buff *skb;
> >>> + struct can_frame *cf;
> >>> +
> >>> + dev_err(&priv->ndev->dev, "Msg Obj is overwritten.\n");
> >>> + pch_can_bit_clear(&priv->regs->if1_mcont,
> >>> +     CAN_IF_MCONT_MSGLOST);
> >>> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL,
> >>> +    &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, obj_id);
> >>> +
> >>> + skb = alloc_can_err_skb(ndev, &cf);
> >>> + if (!skb)
> >>> +  return -ENOMEM;
> >>> +
> >>> + priv->can.can_stats.error_passive++;
> >>> + priv->can.state = CAN_STATE_ERROR_PASSIVE;
> >>> + cf->can_id |= CAN_ERR_CRTL;
> >>> + cf->data[1] = CAN_ERR_CRTL_RX_OVERFLOW;
> >>> + stats->rx_over_errors++;
> >>> + stats->rx_errors++;
> >>> +
> >>> + netif_receive_skb(skb);
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int pch_can_rx_normal(struct net_device *ndev, u32 obj_num, int quota)
> >>> +{
> >>> + u32 reg;
> >>> + canid_t id;
> >>> + u32 ide;
> >>> + u32 rtr;
> >>> + int rcv_pkts = 0;
> >>> + int rtn;
> >>> + int next_flag = 0;
> >>> + struct sk_buff *skb;
> >>> + struct can_frame *cf;
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + struct net_device_stats *stats = &(priv->ndev->stats);
> >>> +
> >>> + /* Reading the messsage object from the Message RAM */
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, obj_num);
> >>> +
> >>> + /* Reading the MCONT register. */
> >>> + reg = ioread32(&priv->regs->if1_mcont);
> >>> + reg &= 0xffff;
> >>> +
> >>> + for (; (!(reg & CAN_IF_MCONT_EOB)) && (quota > 0);
> >>> +      obj_num++, next_flag = 0) {
> >>> +  /* If MsgLost bit set. */
> >>> +  if (reg & CAN_IF_MCONT_MSGLOST) {
> >>> +   rtn = pch_can_rx_msg_lost(ndev, obj_num);
> >>> +   if (!rtn)
> >>> +    return rtn;
> >>> +   rcv_pkts++;
> >>> +   quota--;
> >>> +   next_flag = 1;
> >>> +  } else if (!(reg & CAN_IF_MCONT_NEWDAT))
> >>> +   next_flag = 1;
> >>> +
> >>
> >> after rearanging the code (see below..) you should be able to use a continue here.
> >>
> >>> +  if (!next_flag) {
> >>> +   skb = alloc_can_skb(priv->ndev, &cf);
> >>> +   if (!skb)
> >>> +    return -ENOMEM;
> >>> +
> >>> +   /* Get Received data */
> >>> +   ide = ((ioread32(&priv->regs->if1_id2)) & CAN_ID2_XTD);
> >>> +   if (ide) {
> >>> +    id = (ioread32(&priv->regs->if1_id1) & 0xffff);
> >>> +    id |= (((ioread32(&priv->regs->if1_id2)) &
> >>> +          0x1fff) << 16);
> >>> +    cf->can_id = (id & CAN_EFF_MASK) | CAN_EFF_FLAG;
> >>                                               ^^^^^^^^^^^^^^^^^
> >>
> >> is the mask needed, you mask the if1_id{1,2} already
> > 
> > I will delete
> > 
> >>
> >>> +   } else {
> >>> +    id = (((ioread32(&priv->regs->if1_id2)) &
> >>> +        (CAN_SFF_MASK << 2)) >> 2);
> >>> +    cf->can_id = (id & CAN_SFF_MASK);
> >>
> >> one mask can go away
> > 
> > I agree.
> > 
> >>
> >>> +   }
> >>> +
> >>> +   rtr = ioread32(&priv->regs->if1_id2) &  CAN_ID2_DIR;
> >>                                                               ^^
> >>
> >> remove one space
> > 
> > I agree.
> > 
> >>
> >>> +
> >>> +   if (rtr)
> >>> +    cf->can_id |= CAN_RTR_FLAG;
> >>> +
> >>> +   cf->can_dlc = get_can_dlc((ioread32(&priv->regs->
> >>> +         if1_mcont)) & 0xF);
> >>> +   *(u16 *)(cf->data + 0) = ioread16(&priv->regs->
> >>> +         if1_dataa1);
> >>> +   *(u16 *)(cf->data + 2) = ioread16(&priv->regs->
> >>> +         if1_dataa2);
> >>> +   *(u16 *)(cf->data + 4) = ioread16(&priv->regs->
> >>> +         if1_datab1);
> >>> +   *(u16 *)(cf->data + 6) = ioread16(&priv->regs->
> >>> +         if1_datab2);
> >>
> >> are you sure, the bytes in the can package a in the correct order.
> >> Please test your pch_can against a non pch_can system.
> > 
> > Unfortunately, we don't have non pch_can system.
> 
> Have a look a the driver/net/can/usb subdir and buy one of those. It
> really hard to find bugs if you test against your own driver.

We can't buy this right now for few budget.
But I heard we have "CANalyzer".
Using this, we can test this endian concern.
But we may need much time for studying how to use.

> 
> > 
> >>
> >>> +
> >>> +   netif_receive_skb(skb);
> >>> +   rcv_pkts++;
> >>> +   stats->rx_packets++;
> >>> +   quota--;
> >>> +   stats->rx_bytes += cf->can_dlc;
> >>> +
> >>> +   pch_fifo_thresh(priv, obj_num);
> >>> +  }
> >>> +
> >>> +  /* Reading the messsage object from the Message RAM */
> >>> +  iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> +  pch_can_check_if_busy(&priv->regs->if1_creq, obj_num + 1);
> >>> +  reg = ioread32(&priv->regs->if1_mcont);
> >>
> >> this is almost the same code as before the the loop, can you rearange
> >> the code to avoid duplication?
> > 
> > I agree.
> > 
> >>
> >>> + }
> >>> +
> >>> + return rcv_pkts;
> >>> +}
> >>> +
> >>> +static void pch_can_tx_complete(struct net_device *ndev, u32 int_stat)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + struct net_device_stats *stats = &(priv->ndev->stats);
> >>> + unsigned long flags;
> >>> + u32 dlc;
> >>> +
> >>> + can_get_echo_skb(ndev, int_stat - PCH_RX_OBJ_NUM - 1);
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + iowrite32(CAN_CMASK_RX_TX_GET | CAN_CMASK_CLRINTPND,
> >>> +    &priv->regs->if2_cmask);
> >>> + dlc = ioread32(&priv->regs->if2_mcont) & CAN_IF_MCONT_DLC;
> >>> + pch_can_check_if_busy(&priv->regs->if2_creq, int_stat);
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> + if (dlc > 8)
> >>> +  dlc = 8;
> >>
> >> use get_can_dlc
> > 
> > I agree.
> > 
> >>
> >>> + stats->tx_bytes += dlc;
> >>> + stats->tx_packets++;
> >>> +}
> >>> +
> >>> +static int pch_can_rx_poll(struct napi_struct *napi, int quota)
> >>> +{
> >>> + struct net_device *ndev = napi->dev;
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + u32 int_stat;
> >>> + int rcv_pkts = 0;
> >>> + u32 reg_stat;
> >>> + unsigned long flags;
> >>> +
> >>> + int_stat = pch_can_int_pending(priv);
> >>> + if (!int_stat)
> >>> +  goto END;
> >>> +
> >>> + if ((int_stat == CAN_STATUS_INT) && (quota > 0)) {
> >>> +  reg_stat = ioread32(&priv->regs->stat);
> >>> +  if (reg_stat & (PCH_BUS_OFF | PCH_LEC_ALL)) {
> >>> +   if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL) {
> >>> +    pch_can_error(ndev, reg_stat);
> >>> +    quota--;
> >>> +   }
> >>> +  }
> >>> +
> >>> +  if (reg_stat & PCH_TX_OK) {
> >>> +   spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> +   iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >>> +   pch_can_check_if_busy(&priv->regs->if2_creq,
> >>> +            ioread32(&priv->regs->intr));
> >>                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >>
> >> Isn't this "int_stat". Might it be possilbe that regs->intr changes
> >> between the pch_can_int_pending and here?
> > 
> > This code was mistake.
> > This condition, message object is not acccessed.
> > Thus, pch_can_check_if_busy can be deleted.
> > 
> >>
> >> What should this transfer do?
> >>
> >>> +   spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +   pch_can_bit_clear(&priv->regs->stat, PCH_TX_OK);
> >>> +  }
> >>> +
> >>> +  if (reg_stat & PCH_RX_OK)
> >>> +   pch_can_bit_clear(&priv->regs->stat, PCH_RX_OK);
> >>> +
> >>> +  int_stat = pch_can_int_pending(priv);
> >>> + }
> >>> +
> >>> + if (quota == 0)
> >>> +  goto END;
> >>> +
> >>> + if ((int_stat >= 1) && (int_stat <= PCH_RX_OBJ_NUM)) {
> >>> +  spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> +  rcv_pkts += pch_can_rx_normal(ndev, int_stat, quota);
> >>> +  spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +  quota -= rcv_pkts;
> >>> +  if (rcv_pkts < 0)
> >>
> >> how can this happen?
> > 
> > My mistake.
> > if (quota < 0) is TRUE.
> > 
> > 
> >>
> >>> +   goto END;
> >>> + } else if ((int_stat > PCH_RX_OBJ_NUM) && (int_stat <= PCH_OBJ_NUM)) {
> >>> +  /* Handle transmission interrupt */
> >>> +  pch_can_tx_complete(ndev, int_stat);
> >>> + }
> >>> +
> >>> +END:
> >>> + napi_complete(napi);
> >>> + pch_can_set_int_enables(priv, PCH_CAN_ALL);
> >>> +
> >>> + return rcv_pkts;
> >>> +}
> >>> +
> >>> +static int pch_set_bittiming(struct net_device *ndev)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + const struct can_bittiming *bt = &priv->can.bittiming;
> >>> + u32 canbit;
> >>> + u32 bepe;
> >>> +
> >>> + /* Setting the CCE bit for accessing the Can Timing register. */
> >>> + pch_can_bit_set(&priv->regs->cont, CAN_CTRL_CCE);
> >>> +
> >>> + canbit = (bt->brp - 1) & MSK_BITT_BRP;
> >>> + canbit |= (bt->sjw - 1) << BIT_BITT_SJW;
> >>> + canbit |= (bt->phase_seg1 + bt->prop_seg - 1) << BIT_BITT_TSEG1;
> >>> + canbit |= (bt->phase_seg2 - 1) << BIT_BITT_TSEG2;
> >>> + bepe = ((bt->brp - 1) & MSK_BRPE_BRPE) >> BIT_BRPE_BRPE;
> >>> + iowrite32(canbit, &priv->regs->bitt);
> >>> + iowrite32(bepe, &priv->regs->brpe);
> >>> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_CCE);
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static void pch_can_start(struct net_device *ndev)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> +
> >>> + if (priv->can.state != CAN_STATE_STOPPED)
> >>> +  pch_can_reset(priv);
> >>> +
> >>> + pch_set_bittiming(ndev);
> >>> + pch_can_set_optmode(priv);
> >>> +
> >>> + pch_can_tx_enable_all(priv);
> >>> + pch_can_rx_enable_all(priv);
> >>> +
> >>> + /* Setting the CAN to run mode. */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_RUN);
> >>> +
> >>> + priv->can.state = CAN_STATE_ERROR_ACTIVE;
> >>> +
> >>> + return;
> >>> +}
> >>> +
> >>> +static int pch_can_do_set_mode(struct net_device *ndev, enum can_mode mode)
> >>> +{
> >>> + int ret = 0;
> >>> +
> >>> + switch (mode) {
> >>> + case CAN_MODE_START:
> >>> +  pch_can_start(ndev);
> >>> +  netif_wake_queue(ndev);
> >>> +  break;
> >>> + default:
> >>> +  ret = -EOPNOTSUPP;
> >>> +  break;
> >>> + }
> >>> +
> >>> + return ret;
> >>> +}
> >>> +
> >>> +static int pch_can_open(struct net_device *ndev)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + int retval;
> >>> +
> >>> + /* Regsitering the interrupt. */
> >>> + retval = request_irq(priv->dev->irq, pch_can_interrupt, IRQF_SHARED,
> >>> +        ndev->name, ndev);
> >>> + if (retval) {
> >>> +  dev_err(&ndev->dev, "request_irq failed.\n");
> >>> +  goto req_irq_err;
> >>> + }
> >>> +
> >>> + /* Open common can device */
> >>> + retval = open_candev(ndev);
> >>> + if (retval) {
> >>> +  dev_err(ndev->dev.parent, "open_candev() failed %d\n", retval);
> >>> +  goto err_open_candev;
> >>> + }
> >>> +
> >>> + pch_can_init(priv);
> >>> + pch_can_start(ndev);
> >>> + napi_enable(&priv->napi);
> >>> + netif_start_queue(ndev);
> >>> +
> >>> + return 0;
> >>> +
> >>> +err_open_candev:
> >>> + free_irq(priv->dev->irq, ndev);
> >>> +req_irq_err:
> >>> + pch_can_release(priv);
> >>> +
> >>> + return retval;
> >>> +}
> >>> +
> >>> +static int pch_close(struct net_device *ndev)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> +
> >>> + netif_stop_queue(ndev);
> >>> + napi_disable(&priv->napi);
> >>> + pch_can_release(priv);
> >>> + free_irq(priv->dev->irq, ndev);
> >>> + close_candev(ndev);
> >>> + priv->can.state = CAN_STATE_STOPPED;
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
> >>> +{
> >>> + unsigned long flags;
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> + struct can_frame *cf = (struct can_frame *)skb->data;
> >>> + int tx_buffer_avail = 0;
> >>
> >> What I'm totally missing is the TX flow controll. Your driver has to
> >> ensure that the package leave the controller in the order that come
> >> into the xmit function. Further you have to stop your xmit queue if
> >> you're out of tx objects and reenable if you have a object free.
> >>
> >> Use netif_stop_queue() and netif_wake_queue() for this.
> > 
> > In this code, I think "out of tx objects" cannot be  occurred.
> 
> It's not a matter of code it's the hardware. You cannot put more than a
> certain number of CAN frames into the hardware. If you have a CAN bus at
> a certain speed, you can only send a certain number of CAN frames in a
> second. So you cannot push more than this amount of frames/s into the
> hardware.
> 
> > Nevertheless, are netif_stop_queue() and netif_wake_queue() is necessary ?
> 
> Yes.

I can' understand your issue.
Please can you hear my opinion?

Please see the head of pch_xmit.

> > + if (priv->tx_obj == (PCH_OBJ_NUM + 1)) { /* Point tail Obj + 1 */
> > +  while (ioread32(&priv->regs->treq2) & 0xfc00)
> > +   udelay(1);

When points tail of Tx message object,
this driver waits until completion of all tx messaeg objects.
Thus, application/driver ought not to be able to put Tx object exceed the number of tx message object.
Thus I think these code(netif_stop_queue/netif_wake_queue) are completely redundant.

> 
> > 
> >>
> >>> +
> >>> + if (can_dropped_invalid_skb(ndev, skb))
> >>> +  return NETDEV_TX_OK;
> >>> +
> >>> + if (priv->tx_obj == (PCH_OBJ_NUM + 1)) { /* Point tail Obj + 1 */
> >>> +  while (ioread32(&priv->regs->treq2) & 0xfc00)
> >>> +   udelay(1);
> >>
> >> please no (possible) infinite delays!
> > 
> > I will add break processing.
> > 
> >>
> >>> +  priv->tx_obj = PCH_RX_OBJ_NUM + 1; /* Point head of Tx Obj ID */
> >>> + }
> >>
> >>> +
> >>> + tx_buffer_avail = priv->tx_obj;
> >>
> >> why has the "object" become a "buffer" now? :)
> > 
> > You are right.
> > I will modify the name.
> > 
> >>
> >>> + priv->tx_obj++;
> >>> +
> >>> + /* Attaining the lock. */
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> +
> >>> + /* Setting the CMASK register to set value*/
> >>                                                  ^^^
> >>
> >> pleas add a whitespace
> > 
> > I agree.
> > 
> >>
> >>> + iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if2_cmask);
> >>> +
> >>> + /* If ID extended is set. */
> >>> + if (cf->can_id & CAN_EFF_FLAG) {
> >>> +  iowrite32(cf->can_id & 0xffff, &priv->regs->if2_id1);
> >>> +  iowrite32(((cf->can_id >> 16) & 0x1fff) | CAN_ID2_XTD,
> >>> +       &priv->regs->if2_id2);
> >>> + } else {
> >>> +  iowrite32(0, &priv->regs->if2_id1);
> >>> +  iowrite32((cf->can_id & CAN_SFF_MASK) << 2,
> >>> +      &priv->regs->if2_id2);
> >>> + }
> >>> +
> >>> + pch_can_bit_set(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >>
> >> Do you need to do a read-modify-write of the hardware register? Please
> >> prepare the values you want to write to hardware, then do it.
> > 
> > Current design policy for read/write message object,
> > the driver is designed with Read-Modify-Write.
> > 
> > I will modify to Write only for reducing accessing Message RAM.
> > 
> >>
> >>> +
> >>> + /* If remote frame has to be transmitted.. */
> >>> + if (!(cf->can_id & CAN_RTR_FLAG))
> >>> +  pch_can_bit_set(&priv->regs->if2_id2, CAN_ID2_DIR);
> >> dito
> >>> + /* If remote frame has to be transmitted.. */
> >>> + if (cf->can_id & CAN_RTR_FLAG)
> >>> +  pch_can_bit_clear(&priv->regs->if2_id2, CAN_ID2_DIR);
> >> dito
> >>> +
> >>> + /* Copy data to register */
> >>> + if (cf->can_dlc > 0) {
> >>> +  u32 data1 = *((u16 *)&cf->data[0]);
> >>> +  iowrite32(data1, &priv->regs->if2_dataa1);
> >>
> >> do you think you send the bytes in correct order?
> > 
> > Let me study this endianess.
> > 
> >>
> >>> + }
> >>> + if (cf->can_dlc > 2) {
> >>> +  u32 data1 = *((u16 *)&cf->data[2]);
> >>> +  iowrite32(data1, &priv->regs->if2_dataa2);
> >>> + }
> >>> + if (cf->can_dlc > 4) {
> >>> +  u32 data1 = *((u16 *)&cf->data[4]);
> >>> +  iowrite32(data1, &priv->regs->if2_datab1);
> >>> + }
> >>> + if (cf->can_dlc > 6) {
> >>> +  u32 data1 = *((u16 *)&cf->data[6]);
> >>> +  iowrite32(data1, &priv->regs->if2_datab2);
> >>> + }
> >>> +
> >>> + can_put_echo_skb(skb, ndev, tx_buffer_avail - PCH_RX_OBJ_NUM - 1);
> >>> +
> >>> + /* Set the size of the data. */
> >>> + iowrite32(cf->can_dlc, &priv->regs->if2_mcont);
> >>> +
> >>> + /* Update if2_mcont */
> >>> + pch_can_bit_set(&priv->regs->if2_mcont,
> >>> +   CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_TXRQXT |
> >>> +   CAN_IF_MCONT_TXIE);
> >>
> >> pleae first perpare your value, then write to hardware.
> > 
> > ditto.
> > 
> >>
> >>> +
> >>> + if (tx_buffer_avail == PCH_RX_OBJ_NUM) /* If points tail of FIFO  */
> >>> +  pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_EOB);
> >>
> >> dito
> >>
> >> Is EOB relevant for TX objects?
> > 
> > This is mistake. No meaning for tx.
> > I will modify.
> > 
> >>
> >>> + pch_can_check_if_busy(&priv->regs->if2_creq, tx_buffer_avail);
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +
> >>> + return NETDEV_TX_OK;
> >>> +}
> >>> +
> >>> +static const struct net_device_ops pch_can_netdev_ops = {
> >>> + .ndo_open  = pch_can_open,
> >>> + .ndo_stop  = pch_close,
> >>> + .ndo_start_xmit  = pch_xmit,
> >>> +};
> >>> +
> >>> +static void __devexit pch_can_remove(struct pci_dev *pdev)
> >>> +{
> >>> + struct net_device *ndev = pci_get_drvdata(pdev);
> >>> + struct pch_can_priv *priv = netdev_priv(ndev);
> >>> +
> >>> + unregister_candev(priv->ndev);
> >>> + pci_iounmap(pdev, priv->regs);
> >>> + if (priv->use_msi)
> >>> +  pci_disable_msi(priv->dev);
> >>> + pci_release_regions(pdev);
> >>> + pci_disable_device(pdev);
> >>> + pci_set_drvdata(pdev, NULL);
> >>> + free_candev(priv->ndev);
> >>> +}
> >>> +
> >>> +#ifdef CONFIG_PM
> >>> +static void pch_can_set_int_custom(struct pch_can_priv *priv)
> >>> +{
> >>> + /* Clearing the IE, SIE and EIE bits of Can control register. */
> >>> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_IE_SIE_EIE);
> >>> +
> >>> + /* Appropriately setting them. */
> >>> + pch_can_bit_set(&priv->regs->cont,
> >>> +   ((priv->int_enables & MSK_CTRL_IE_SIE_EIE) << 1));
> >>> +}
> >>> +
> >>> +/* This function retrieves interrupt enabled for the CAN device. */
> >>> +static u32 pch_can_get_int_enables(struct pch_can_priv *priv)
> >>> +{
> >>> + /* Obtaining the status of IE, SIE and EIE interrupt bits. */
> >>> + return (ioread32(&priv->regs->cont) & CAN_CTRL_IE_SIE_EIE) >> 1;
> >>> +}
> >>> +
> >>> +static u32 pch_can_get_rx_enable(struct pch_can_priv *priv, u32 buff_num)
> >>> +{
> >>> + unsigned long flags;
> >>> + u32 enable;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buff_num);
> >>> +
> >>> + if (((ioread32(&priv->regs->if1_id2)) & CAN_ID_MSGVAL) &&
> >>> +   ((ioread32(&priv->regs->if1_mcont)) &
> >>> +   CAN_IF_MCONT_RXIE))
> >>> +  enable = 1;
> >>> + else
> >>> +  enable = 0;
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> + return enable;
> >>> +}
> >>> +
> >>> +static u32 pch_can_get_tx_enable(struct pch_can_priv *priv, u32 buff_num)
> >>> +{
> >>> + unsigned long flags;
> >>> + u32 enable;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> +
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >>> + if (((ioread32(&priv->regs->if2_id2)) & CAN_ID_MSGVAL) &&
> >>> +   ((ioread32(&priv->regs->if2_mcont)) &
> >>> +   CAN_IF_MCONT_TXIE)) {
> >>> +  enable = 1;
> >>> + } else {
> >>> +  enable = 0;
> >>> + }
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +
> >>> + return enable;
> >>> +}
> >>> +
> >>> +static void pch_can_set_rx_buffer_link(struct pch_can_priv *priv,
> >>> +           u32 buffer_num, u32 set)
> >>> +{
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buffer_num);
> >>> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL, &priv->regs->if1_cmask);
> >>> + if (set == 1)
> >>> +  pch_can_bit_clear(&priv->regs->if1_mcont, CAN_IF_MCONT_EOB);
> >>> + else
> >>> +  pch_can_bit_set(&priv->regs->if1_mcont, CAN_IF_MCONT_EOB);
> >>> +
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buffer_num);
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> +}
> >>> +
> >>> +static u32 pch_can_get_rx_buffer_link(struct pch_can_priv *priv, u32 buffer_num)
> >>> +{
> >>> + unsigned long flags;
> >>> + u32 link;
> >>> +
> >>> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >>> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >>> + pch_can_check_if_busy(&priv->regs->if1_creq, buffer_num);
> >>> +
> >>> + if (ioread32(&priv->regs->if1_mcont) & CAN_IF_MCONT_EOB)
> >>> +  link = 0;
> >>> + else
> >>> +  link = 1;
> >>> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >>> + return link;
> >>> +}
> >>> +
> >>> +static int pch_can_suspend(struct pci_dev *pdev, pm_message_t state)
> >>> +{
> >>> + int i;
> >>> + int retval;
> >>> + u32 buf_stat; /* Variable for reading the transmit buffer status. */
> >>> + u32 counter = COUNTER_LIMIT;
> >>> +
> >>> + struct net_device *dev = pci_get_drvdata(pdev);
> >>> + struct pch_can_priv *priv = netdev_priv(dev);
> >>> +
> >>> + /* Stop the CAN controller */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_STOP);
> >>> +
> >>> + /* Indicate that we are aboutto/in suspend */
> >>> + priv->can.state = CAN_STATE_STOPPED;
> >>> +
> >>> + /* Waiting for all transmission to complete. */
> >>> + while (counter) {
> >>> +  buf_stat = pch_can_get_buffer_status(priv);
> >>> +  if (!buf_stat)
> >>> +   break;
> >>> +  counter--;
> >>> +  udelay(1);
> >>> + }
> >>> + if (!counter)
> >>> +  dev_err(&pdev->dev, "%s -> Transmission time out.\n", __func__);
> >>> +
> >>> + /* Save interrupt configuration and then disable them */
> >>> + priv->int_enables = pch_can_get_int_enables(priv);
> >>> + pch_can_set_int_enables(priv, PCH_CAN_DISABLE);
> >>> +
> >>> + /* Save Tx buffer enable state */
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >>> +  priv->tx_enable[i] = pch_can_get_tx_enable(priv, i);
> >>> +
> >>> + /* Disable all Transmit buffers */
> >>> + pch_can_tx_disable_all(priv);
> >>> +
> >>> + /* Save Rx buffer enable state */
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> >>> +  priv->rx_enable[i] = pch_can_get_rx_enable(priv, i);
> >>> +  priv->rx_link[i] = pch_can_get_rx_buffer_link(priv, i);
> >>> + }
> >>> +
> >>> + /* Disable all Receive buffers */
> >>> + pch_can_rx_disable_all(priv);
> >>> + retval = pci_save_state(pdev);
> >>> + if (retval) {
> >>> +  dev_err(&pdev->dev, "pci_save_state failed.\n");
> >>> + } else {
> >>> +  pci_enable_wake(pdev, PCI_D3hot, 0);
> >>> +  pci_disable_device(pdev);
> >>> +  pci_set_power_state(pdev, pci_choose_state(pdev, state));
> >>> + }
> >>> +
> >>> + return retval;
> >>> +}
> >>> +
> >>> +static int pch_can_resume(struct pci_dev *pdev)
> >>> +{
> >>> + int i;
> >>> + int retval;
> >>> + struct net_device *dev = pci_get_drvdata(pdev);
> >>> + struct pch_can_priv *priv = netdev_priv(dev);
> >>> +
> >>> + pci_set_power_state(pdev, PCI_D0);
> >>> + pci_restore_state(pdev);
> >>> + retval = pci_enable_device(pdev);
> >>> + if (retval) {
> >>> +  dev_err(&pdev->dev, "pci_enable_device failed.\n");
> >>> +  return retval;
> >>> + }
> >>> +
> >>> + pci_enable_wake(pdev, PCI_D3hot, 0);
> >>> +
> >>> + priv->can.state = CAN_STATE_ERROR_ACTIVE;
> >>> +
> >>> + /* Disabling all interrupts. */
> >>> + pch_can_set_int_enables(priv, PCH_CAN_DISABLE);
> >>> +
> >>> + /* Setting the CAN device in Stop Mode. */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_STOP);
> >>> +
> >>> + /* Configuring the transmit and receive buffers. */
> >>> + pch_can_config_rx_tx_buffers(priv);
> >>> +
> >>> + /* Restore the CAN state */
> >>> + pch_set_bittiming(dev);
> >>> +
> >>> + /* Listen/Active */
> >>> + pch_can_set_optmode(priv);
> >>> +
> >>> + /* Enabling the transmit buffer. */
> >>> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++)
> >>> +  pch_can_set_tx_enable(priv, i, priv->tx_enable[i]);
> >>> +
> >>> + /* Configuring the receive buffer and enabling them. */
> >>> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++) {
> >>> +  /* Restore buffer link */
> >>> +  pch_can_set_rx_buffer_link(priv, i, priv->rx_link[i]);
> >>> +
> >>> +  /* Restore buffer enables */
> >>> +  pch_can_set_rx_enable(priv, i, priv->rx_enable[i]);
> >>> + }
> >>> +
> >>> + /* Enable CAN Interrupts */
> >>> + pch_can_set_int_custom(priv);
> >>> +
> >>> + /* Restore Run Mode */
> >>> + pch_can_set_run_mode(priv, PCH_CAN_RUN);
> >>> +
> >>> + return retval;
> >>> +}
> >>> +#else
> >>> +#define pch_can_suspend NULL
> >>> +#define pch_can_resume NULL
> >>> +#endif
> >>> +
> >>> +static int pch_can_get_berr_counter(const struct net_device *dev,
> >>> +        struct can_berr_counter *bec)
> >>> +{
> >>> + struct pch_can_priv *priv = netdev_priv(dev);
> >>> +
> >>> + bec->txerr = ioread32(&priv->regs->errc) & CAN_TEC;
> >>> + bec->rxerr = (ioread32(&priv->regs->errc) & CAN_REC) >> 8;
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static int __devinit pch_can_probe(struct pci_dev *pdev,
> >>> +       const struct pci_device_id *id)
> >>> +{
> >>> + struct net_device *ndev;
> >>> + struct pch_can_priv *priv;
> >>> + int rc;
> >>> + void __iomem *addr;
> >>> +
> >>> + rc = pci_enable_device(pdev);
> >>> + if (rc) {
> >>> +  dev_err(&pdev->dev, "Failed pci_enable_device %d\n", rc);
> >>> +  goto probe_exit_endev;
> >>> + }
> >>> +
> >>> + rc = pci_request_regions(pdev, KBUILD_MODNAME);
> >>> + if (rc) {
> >>> +  dev_err(&pdev->dev, "Failed pci_request_regions %d\n", rc);
> >>> +  goto probe_exit_pcireq;
> >>> + }
> >>> +
> >>> + addr = pci_iomap(pdev, 1, 0);
> >>> + if (!addr) {
> >>> +  rc = -EIO;
> >>> +  dev_err(&pdev->dev, "Failed pci_iomap\n");
> >>> +  goto probe_exit_ipmap;
> >>> + }
> >>> +
> >>> + ndev = alloc_candev(sizeof(struct pch_can_priv), PCH_TX_OBJ_NUM);
> >>> + if (!ndev) {
> >>> +  rc = -ENOMEM;
> >>> +  dev_err(&pdev->dev, "Failed alloc_candev\n");
> >>> +  goto probe_exit_alloc_candev;
> >>> + }
> >>> +
> >>> + priv = netdev_priv(ndev);
> >>> + priv->ndev = ndev;
> >>> + priv->regs = addr;
> >>> + priv->dev = pdev;
> >>> + priv->can.bittiming_const = &pch_can_bittiming_const;
> >>> + priv->can.do_set_mode = pch_can_do_set_mode;
> >>> + priv->can.do_get_berr_counter = pch_can_get_berr_counter;
> >>> + priv->can.ctrlmode_supported = CAN_CTRLMODE_LISTENONLY |
> >>> +           CAN_CTRLMODE_LOOPBACK;
> >>> + priv->tx_obj = PCH_RX_OBJ_NUM + 1; /* Point head of Tx Obj */
> >>> +
> >>> + ndev->irq = pdev->irq;
> >>> + ndev->flags |= IFF_ECHO;
> >>> +
> >>> + pci_set_drvdata(pdev, ndev);
> >>> + SET_NETDEV_DEV(ndev, &pdev->dev);
> >>> + ndev->netdev_ops = &pch_can_netdev_ops;
> >>> + priv->can.clock.freq = PCH_CAN_CLK; /* Hz */
> >>> +
> >>> + netif_napi_add(ndev, &priv->napi, pch_can_rx_poll, PCH_RX_OBJ_NUM);
> >>> +
> >>> + rc = pci_enable_msi(priv->dev);
> >>> + if (rc) {
> >>> +  dev_info(&ndev->dev, "PCH CAN opened without MSI\n");
> >>> +  priv->use_msi = 0;
> >>> + } else {
> >>> +  dev_info(&ndev->dev, "PCH CAN opened with MSI\n");
> >>> +  priv->use_msi = 1;
> >>> + }
> >>> +
> >>> + rc = register_candev(ndev);
> >>> + if (rc) {
> >>> +  dev_err(&pdev->dev, "Failed register_candev %d\n", rc);
> >>> +  goto probe_exit_reg_candev;
> >>> + }
> >>> +
> >>> + return 0;
> >>> +
> >>> +probe_exit_reg_candev:
> >>> + free_candev(ndev);
> >>> +probe_exit_alloc_candev:
> >>> + pci_iounmap(pdev, addr);
> >>> +probe_exit_ipmap:
> >>> + pci_release_regions(pdev);
> >>> +probe_exit_pcireq:
> >>> + pci_disable_device(pdev);
> >>> +probe_exit_endev:
> >>> + return rc;
> >>> +}
> >>> +
> >>> +static struct pci_driver pch_can_pcidev = {
> >>> + .name = "pch_can",
> >>> + .id_table = pch_pci_tbl,
> >>> + .probe = pch_can_probe,
> >>> + .remove = __devexit_p(pch_can_remove),
> >>> + .suspend = pch_can_suspend,
> >>> + .resume = pch_can_resume,
> >>> +};
> >>> +
> >>> +static int __init pch_can_pci_init(void)
> >>> +{
> >>> + return pci_register_driver(&pch_can_pcidev);
> >>> +}
> >>> +module_init(pch_can_pci_init);
> >>> +
> >>> +static void __exit pch_can_pci_exit(void)
> >>> +{
> >>> + pci_unregister_driver(&pch_can_pcidev);
> >>> +}
> >>> +module_exit(pch_can_pci_exit);
> >>> +
> >>> +MODULE_DESCRIPTION("Intel EG20T PCH CAN(Controller Area Network) Driver");
> >>> +MODULE_LICENSE("GPL v2");
> >>> +MODULE_VERSION("0.94");
> >>
> >> cheers, Marc
> >>
> >> -- 
> >> Pengutronix e.K.                  | Marc Kleine-Budde           |
> >> Industrial Linux Solutions        | Phone: +49-231-2826-924     |
> >> Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
> >> Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |
> >>
> >>
> > 
> > Thanks, Tomoya(OKI SEMICONDUCTOR CO., LTD.)
> 
> cheers, Marc
> 
> -- 
> Pengutronix e.K.                  | Marc Kleine-Budde           |
> Industrial Linux Solutions        | Phone: +49-231-2826-924     |
> Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
> Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |
> 
>

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Tomoya MORINAGA @ 2010-11-09 12:26 UTC (permalink / raw)
  To: Oliver Hartkopp
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, Wolfgang Grandegger,
	Marc Kleine-Budde, joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	David S. Miller, Christian Pellegrin,
	qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CCAFA68.2030903@hartkopp.net>

Sorry, for late response.
On Saturday, October 30, 2010 1:46 AM,  Oliver Hartkopp wrote:

> 
> It's just a question if the driver for an Intel Chipset should/could ignore
> the endian problematic.
> 
> If this is intended the driver should only appear in Kconfig depending on X86
> or little endian systems.

This driver is for only x86 processor.
I have no intention to support processor except x86.

> 
> Besides this remark, the struct pch_can_regs could also be defined in a way
> that every single CAN payload data byte could be accessed directly to allow
> e.g. this code in pch_can_rx_normal():
> 
> cf->data[0] = ioread8(&priv->regs->if1_data0);
> cf->data[1] = ioread8(&priv->regs->if1_data1);
> cf->data[2] = ioread8(&priv->regs->if1_data2);
> (..)
> cf->data[6] = ioread8(&priv->regs->if1_data6);
> cf->data[7] = ioread8(&priv->regs->if1_data7);
> 
> This is easy to understand and additionally endian aware.

I agree. Thease codes are very simple.
I will modify like above.

> 
> In opposite to this:
> 
> + if (cf->can_dlc > 2) {
> + u32 data1 = *((u16 *)&cf->data[2]);
> + iowrite32(data1, &priv->regs->if2_dataa2);
> + }
> 
> Which puts a little? endian u16 into the big endian register layout.
> Sorry i'm just getting curious on this register access implementation.

I think cf->data is like below
MSB----------LSB
D3-D2-D1-D0
D7-D6-D5-D4

*((u16 *)&cf->data[2]) means "D3-D2".
This order coprresponds to register order.
data register is like below
MSB----------LSB
**-**-D3-D2

** means reserved area.

Thanks,

Tomoya MORINAGA
OKI SEMICONDUCTOR CO., LTD.

^ permalink raw reply

* Re: Netlink limitations
From: Patrick McHardy @ 2010-11-09 12:24 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: David S. Miller, Holger Eitzenberger, Pablo Neira Ayuso, netdev
In-Reply-To: <alpine.LNX.2.01.1011091303250.21752@obet.zrqbmnf.qr>

Am 09.11.2010 13:10, schrieb Jan Engelhardt:
> 
> On Sunday 2010-11-07 18:17, Patrick McHardy wrote:
>> On 07.11.2010 17:44, Jan Engelhardt wrote:
>>> we mentioned it only briefly at the Netfilter workshop a few weeks ago, 
>>> but as I am trying to figure out how to use Netlink in Xtables, 
>>> Netlink's limitations really start ruining my day.
>>>
>>> The well-known issue is that NL messages[sic] the kernel is supposed to 
>>> receive have a max size of 64K, due to nlmsghdr's use of uint16_t. This 
>>> is very problematic because attributes can easily amass more than 64K. 
>>> Think of a chain full of rules, represented by a top-level attribute 
>>> that nests attributes. The problem is bidirectional, a table 
>>> dump has the same problem.
>>
>> Messages are not limited to 64k, individual attributes are. Holger
>> started working on a nlattr32, which uses 32 bit for the length
>> value.
> 
> Does he have a format specification available?

As I said, the basic idea is to use a length value of zero to indicate
that the length should be read from a second length member. Basically:

struct nlattr32 {
	__u16 nla_len;
	__u16 nla_type;
	__u32 nla_len2;
};


^ permalink raw reply

* Re: Netlink limitations
From: Jan Engelhardt @ 2010-11-09 12:10 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David S. Miller, Holger Eitzenberger, Pablo Neira Ayuso, netdev
In-Reply-To: <4CD6DF37.6010800@trash.net>


On Sunday 2010-11-07 18:17, Patrick McHardy wrote:
>On 07.11.2010 17:44, Jan Engelhardt wrote:
>> we mentioned it only briefly at the Netfilter workshop a few weeks ago, 
>> but as I am trying to figure out how to use Netlink in Xtables, 
>> Netlink's limitations really start ruining my day.
>> 
>> The well-known issue is that NL messages[sic] the kernel is supposed to 
>> receive have a max size of 64K, due to nlmsghdr's use of uint16_t. This 
>> is very problematic because attributes can easily amass more than 64K. 
>> Think of a chain full of rules, represented by a top-level attribute 
>> that nests attributes. The problem is bidirectional, a table 
>> dump has the same problem.
>
>Messages are not limited to 64k, individual attributes are. Holger
>started working on a nlattr32, which uses 32 bit for the length
>value.

Does he have a format specification available?

^ permalink raw reply

* Re: Netlink limitations
From: Jan Engelhardt @ 2010-11-09 11:58 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Patrick McHardy, David S. Miller, pablo, netdev
In-Reply-To: <20101108151635.GA20629@canuck.infradead.org>


On Monday 2010-11-08 16:16, Thomas Graf wrote:
>
>Also, it is not required to pack everything in attributes. Your protocol
>may specify that the whole message payload consists of chained attributes.

Yeah but that does not work well with nla_parse and nfnetlink.c
unserializing the byte stream into a tb[] array.

Does it hurt to not use nfnetlink? The only issue I see is that
af_netlink.c's nl_table is an array, when it should preferably be a
hash or tree if we are going to add more subsystems.

^ permalink raw reply

* Re: Netlink limitations
From: Patrick McHardy @ 2010-11-09  9:27 UTC (permalink / raw)
  To: Jan Engelhardt, David S. Miller, pablo, netdev
In-Reply-To: <20101108151635.GA20629@canuck.infradead.org>

Am 08.11.2010 16:16, schrieb Thomas Graf:
> On Sun, Nov 07, 2010 at 06:17:43PM +0100, Patrick McHardy wrote:
>> On 07.11.2010 17:44, Jan Engelhardt wrote:
>>> ...
> 
>> That's somewhat similar to the nlattr32 idea, but a length of 0 makes
>> more sense since that's currently not used. In that case the length
>> would be read from a second length field which has 32 bits.
> 
> I agree. This makes perfect sense. I was just thinking wheter we want
> to encode more information into the attribute header if we extend it
> anyways.

What kind of additional information are you thinking about?

^ permalink raw reply

* [PATCH v15 16/17] An example how to modifiy NIC driver to use napi_gro_frags() interface
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

This example is made on ixgbe driver.
It provides API is_rx_buffer_mapped_as_page() to indicate
if the driver use napi_gro_frags() interface or not.
The example allocates 2 pages for DMA for one ring descriptor
using netdev_alloc_page(). When packets is coming, using
napi_gro_frags() to allocate skb and to receive the packets.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/net/ixgbe/ixgbe.h      |    3 +
 drivers/net/ixgbe/ixgbe_main.c |  163 +++++++++++++++++++++++++++++++---------
 2 files changed, 131 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe.h b/drivers/net/ixgbe/ixgbe.h
index 9e15eb9..89367ca 100644
--- a/drivers/net/ixgbe/ixgbe.h
+++ b/drivers/net/ixgbe/ixgbe.h
@@ -131,6 +131,9 @@ struct ixgbe_rx_buffer {
 	struct page *page;
 	dma_addr_t page_dma;
 	unsigned int page_offset;
+	u16 mapped_as_page;
+	struct page *page_skb;
+	unsigned int page_skb_offset;
 };
 
 struct ixgbe_queue_stats {
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index e32af43..a4a5263 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1029,6 +1029,12 @@ static inline void ixgbe_release_rx_desc(struct ixgbe_hw *hw,
 	IXGBE_WRITE_REG(hw, IXGBE_RDT(rx_ring->reg_idx), val);
 }
 
+static bool is_rx_buffer_mapped_as_page(struct ixgbe_rx_buffer *bi,
+					struct net_device *dev)
+{
+	return true;
+}
+
 /**
  * ixgbe_alloc_rx_buffers - Replace used receive buffers; packet split
  * @adapter: address of board private structure
@@ -1045,13 +1051,17 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 	i = rx_ring->next_to_use;
 	bi = &rx_ring->rx_buffer_info[i];
 
+
 	while (cleaned_count--) {
 		rx_desc = IXGBE_RX_DESC_ADV(*rx_ring, i);
 
+		bi->mapped_as_page =
+			is_rx_buffer_mapped_as_page(bi, adapter->netdev);
+
 		if (!bi->page_dma &&
 		    (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED)) {
 			if (!bi->page) {
-				bi->page = alloc_page(GFP_ATOMIC);
+				bi->page = netdev_alloc_page(adapter->netdev);
 				if (!bi->page) {
 					adapter->alloc_rx_page_failed++;
 					goto no_buffers;
@@ -1068,7 +1078,7 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 						    DMA_FROM_DEVICE);
 		}
 
-		if (!bi->skb) {
+		if (!bi->mapped_as_page && !bi->skb) {
 			struct sk_buff *skb;
 			/* netdev_alloc_skb reserves 32 bytes up front!! */
 			uint bufsz = rx_ring->rx_buf_len + SMP_CACHE_BYTES;
@@ -1088,6 +1098,19 @@ static void ixgbe_alloc_rx_buffers(struct ixgbe_adapter *adapter,
 			                         rx_ring->rx_buf_len,
 						 DMA_FROM_DEVICE);
 		}
+
+		if (bi->mapped_as_page && !bi->page_skb) {
+			bi->page_skb = netdev_alloc_page(adapter->netdev);
+			if (!bi->page_skb) {
+				adapter->alloc_rx_page_failed++;
+				goto no_buffers;
+			}
+			bi->page_skb_offset = 0;
+			bi->dma = dma_map_page(&pdev->dev, bi->page_skb,
+					bi->page_skb_offset,
+					(PAGE_SIZE / 2),
+					PCI_DMA_FROMDEVICE);
+		}
 		/* Refresh the desc even if buffer_addrs didn't change because
 		 * each write-back erases this info. */
 		if (rx_ring->flags & IXGBE_RING_RX_PS_ENABLED) {
@@ -1165,6 +1188,13 @@ struct ixgbe_rsc_cb {
 	bool delay_unmap;
 };
 
+static bool is_no_buffer(struct ixgbe_rx_buffer *rx_buffer_info)
+{
+	return (!rx_buffer_info->skb ||
+		!rx_buffer_info->page_skb) &&
+		!rx_buffer_info->page;
+}
+
 #define IXGBE_RSC_CB(skb) ((struct ixgbe_rsc_cb *)(skb)->cb)
 
 static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
@@ -1174,6 +1204,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct net_device *netdev = adapter->netdev;
 	struct pci_dev *pdev = adapter->pdev;
+	struct napi_struct *napi = &q_vector->napi;
 	union ixgbe_adv_rx_desc *rx_desc, *next_rxd;
 	struct ixgbe_rx_buffer *rx_buffer_info, *next_buffer;
 	struct sk_buff *skb;
@@ -1211,32 +1242,68 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			len = le16_to_cpu(rx_desc->wb.upper.length);
 		}
 
+		if (is_no_buffer(rx_buffer_info))
+			break;
 		cleaned = true;
-		skb = rx_buffer_info->skb;
-		prefetch(skb->data);
-		rx_buffer_info->skb = NULL;
 
-		if (rx_buffer_info->dma) {
-			if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
-			    (!(staterr & IXGBE_RXD_STAT_EOP)) &&
-				 (!(skb->prev))) {
-				/*
-				 * When HWRSC is enabled, delay unmapping
-				 * of the first packet. It carries the
-				 * header information, HW may still
-				 * access the header after the writeback.
-				 * Only unmap it when EOP is reached
-				 */
-				IXGBE_RSC_CB(skb)->delay_unmap = true;
-				IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
-			} else {
-				dma_unmap_single(&pdev->dev,
-				                 rx_buffer_info->dma,
-				                 rx_ring->rx_buf_len,
-				                 DMA_FROM_DEVICE);
+		if (!rx_buffer_info->mapped_as_page) {
+			skb = rx_buffer_info->skb;
+			prefetch(skb->data);
+			rx_buffer_info->skb = NULL;
+
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
+						(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+						(!(skb->prev))) {
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->delay_unmap = true;
+					IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
+				} else
+					dma_unmap_single(&pdev->dev,
+							rx_buffer_info->dma,
+							rx_ring->rx_buf_len,
+							DMA_FROM_DEVICE);
+				rx_buffer_info->dma = 0;
+				skb_put(skb, len);
+			}
+		} else {
+			skb = napi_get_frags(napi);
+			prefetch(rx_buffer_info->page_skb_offset);
+			rx_buffer_info->skb = NULL;
+			if (rx_buffer_info->dma) {
+				if ((adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) &&
+						(!(staterr & IXGBE_RXD_STAT_EOP)) &&
+						(!(skb->prev))) {
+					/*
+					 * When HWRSC is enabled, delay unmapping
+					 * of the first packet. It carries the
+					 * header information, HW may still
+					 * access the header after the writeback.
+					 * Only unmap it when EOP is reached
+					 */
+					IXGBE_RSC_CB(skb)->delay_unmap = true;
+					IXGBE_RSC_CB(skb)->dma = rx_buffer_info->dma;
+				} else
+					dma_unmap_page(&pdev->dev, rx_buffer_info->dma,
+							PAGE_SIZE / 2,
+							PCI_DMA_FROMDEVICE);
+				rx_buffer_info->dma = 0;
+				skb_fill_page_desc(skb,
+						skb_shinfo(skb)->nr_frags,
+						rx_buffer_info->page_skb,
+						rx_buffer_info->page_skb_offset,
+						len);
+				rx_buffer_info->page_skb = NULL;
+				skb->len += len;
+				skb->data_len += len;
+				skb->truesize += len;
 			}
-			rx_buffer_info->dma = 0;
-			skb_put(skb, len);
 		}
 
 		if (upper_len) {
@@ -1283,10 +1350,16 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				skb = ixgbe_transform_rsc_queue(skb, &(rx_ring->rsc_count));
 			if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) {
 				if (IXGBE_RSC_CB(skb)->delay_unmap) {
-					dma_unmap_single(&pdev->dev,
-							 IXGBE_RSC_CB(skb)->dma,
-					                 rx_ring->rx_buf_len,
-							 DMA_FROM_DEVICE);
+					if (!rx_buffer_info->mapped_as_page)
+						dma_unmap_single(&pdev->dev,
+								IXGBE_RSC_CB(skb)->dma,
+								rx_ring->rx_buf_len,
+								DMA_FROM_DEVICE);
+					else
+						dma_unmap_page(&pdev->dev,
+								IXGBE_RSC_CB(skb)->dma,
+								PAGE_SIZE / 2,
+								DMA_FROM_DEVICE);
 					IXGBE_RSC_CB(skb)->dma = 0;
 					IXGBE_RSC_CB(skb)->delay_unmap = false;
 				}
@@ -1304,6 +1377,11 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				rx_buffer_info->dma = next_buffer->dma;
 				next_buffer->skb = skb;
 				next_buffer->dma = 0;
+				if (rx_buffer_info->mapped_as_page) {
+					rx_buffer_info->page_skb =
+							next_buffer->page_skb;
+					next_buffer->page_skb = NULL;
+				}
 			} else {
 				skb->next = next_buffer->skb;
 				skb->next->prev = skb;
@@ -1323,7 +1401,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		total_rx_bytes += skb->len;
 		total_rx_packets++;
 
-		skb->protocol = eth_type_trans(skb, adapter->netdev);
+		if (!rx_buffer_info->mapped_as_page)
+			skb->protocol = eth_type_trans(skb, adapter->netdev);
 #ifdef IXGBE_FCOE
 		/* if ddp, not passing to ULD unless for FCP_RSP or error */
 		if (adapter->flags & IXGBE_FLAG_FCOE_ENABLED) {
@@ -1332,7 +1411,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 				goto next_desc;
 		}
 #endif /* IXGBE_FCOE */
-		ixgbe_receive_skb(q_vector, skb, staterr, rx_ring, rx_desc);
+
+		if (!rx_buffer_info->mapped_as_page)
+			ixgbe_receive_skb(q_vector, skb, staterr,
+						rx_ring, rx_desc);
+		else {
+			skb_record_rx_queue(skb, rx_ring->queue_index);
+			napi_gro_frags(napi);
+		}
 
 next_desc:
 		rx_desc->wb.upper.status_error = 0;
@@ -3622,9 +3708,16 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 
 		rx_buffer_info = &rx_ring->rx_buffer_info[i];
 		if (rx_buffer_info->dma) {
-			dma_unmap_single(&pdev->dev, rx_buffer_info->dma,
-			                 rx_ring->rx_buf_len,
-					 DMA_FROM_DEVICE);
+			if (!rx_buffer_info->mapped_as_page)
+				dma_unmap_single(&pdev->dev, rx_buffer_info->dma,
+						rx_ring->rx_buf_len,
+						PCI_DMA_FROMDEVICE);
+			else {
+				dma_unmap_page(&pdev->dev, rx_buffer_info->dma,
+						PAGE_SIZE / 2,
+						PCI_DMA_FROMDEVICE);
+				rx_buffer_info->page_skb = NULL;
+			}
 			rx_buffer_info->dma = 0;
 		}
 		if (rx_buffer_info->skb) {
@@ -3651,7 +3744,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_adapter *adapter,
 				       PAGE_SIZE / 2, DMA_FROM_DEVICE);
 			rx_buffer_info->page_dma = 0;
 		}
-		put_page(rx_buffer_info->page);
+		netdev_free_page(adapter->netdev, rx_buffer_info->page);
 		rx_buffer_info->page = NULL;
 		rx_buffer_info->page_offset = 0;
 	}
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 15/17] Provides multiple submits and asynchronous notifications.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

    The vhost-net backend now only supports synchronous send/recv
    operations. The patch provides multiple submits and asynchronous
    notifications. This is needed for zero-copy case.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  355 +++++++++++++++++++++++++++++++++++++++++++++----
 drivers/vhost/vhost.c |   78 +++++++++++
 drivers/vhost/vhost.h |   15 ++-
 3 files changed, 423 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7c80082..17c599a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -32,6 +34,7 @@
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
+static struct kmem_cache *notify_cache;
 
 enum {
 	VHOST_NET_VQ_RX = 0,
@@ -49,6 +52,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -109,11 +113,184 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+		struct vhost_virtqueue *vq,
+		struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		if (!iocb->ki_left) {
+			vhost_add_used_and_signal(&net->dev, vq,
+					iocb->ki_pos, iocb->ki_nbytes);
+			size = iocb->ki_nbytes;
+			head = iocb->ki_pos;
+			rx_total_len += iocb->ki_nbytes;
+
+			if (iocb->ki_dtor)
+				iocb->ki_dtor(iocb);
+			kmem_cache_free(net->cache, iocb);
+
+			/* when log is enabled, recomputing the log is needed,
+			 * since these buffers are in async queue, may not get
+			 * the log info before.
+			 */
+			if (unlikely(vq_log)) {
+				if (!log)
+					__vhost_get_vq_desc(&net->dev, vq,
+							vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+				vhost_log_write(vq, vq_log, log, size);
+			}
+			if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+				vhost_poll_queue(&vq->poll);
+				break;
+			}
+		} else {
+			int i = 0;
+			int count = iocb->ki_left;
+			int hc = count;
+			while (count--) {
+				if (iocb) {
+					vq->heads[i].id = iocb->ki_pos;
+					vq->heads[i].len = iocb->ki_nbytes;
+					size = iocb->ki_nbytes;
+					head = iocb->ki_pos;
+					rx_total_len += iocb->ki_nbytes;
+
+					if (iocb->ki_dtor)
+						iocb->ki_dtor(iocb);
+					kmem_cache_free(net->cache, iocb);
+
+					if (unlikely(vq_log)) {
+						if (!log)
+							__vhost_get_vq_desc(
+							&net->dev, vq, vq->iov,
+							ARRAY_SIZE(vq->iov),
+							&out, &in, vq_log,
+							&log, head);
+						vhost_log_write(
+							vq, vq_log, log, size);
+					}
+				} else
+					break;
+
+				i++;
+				if (count)
+					iocb = notify_dequeue(vq);
+			}
+			vhost_add_used_and_signal_n(
+					&net->dev, vq, vq->heads, hc);
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+		struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	struct list_head *entry, *tmp;
+	unsigned long flags;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_for_each_safe(entry, tmp, &vq->notifier) {
+		iocb = list_entry(entry,
+				struct kiocb, ki_list);
+		if (!iocb->ki_flags)
+			continue;
+		list_del(&iocb->ki_list);
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+		struct vhost_virtqueue *vq,
+		unsigned head)
+{
+	struct kiocb *iocb = NULL;
+
+	if (!is_async_vq(vq))
+		return NULL;
+
+	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+	if (!iocb)
+		return NULL;
+	iocb->private = vq;
+	iocb->ki_pos = head;
+	iocb->ki_dtor = handle_iocb;
+	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX])
+		iocb->ki_user_data = vq->num;
+	iocb->ki_iovec = vq->hdr;
+	return iocb;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned out, in, s;
 	int head;
 	struct msghdr msg = {
@@ -146,6 +323,10 @@ static void handle_tx(struct vhost_net *net)
 	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
 	hdr_size = vq->vhost_hlen;
+	if (!vq->vhost_hlen && is_async_vq(vq))
+		hdr_size = vq->sock_hlen;
+
+	handle_async_tx_events_notify(net, vq);
 
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
@@ -157,11 +338,14 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			wmem = atomic_read(&sock->sk->sk_wmem_alloc);
-			if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
-				tx_poll_start(net, sock);
-				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
-				break;
+			if (!is_async_vq(vq)) {
+				wmem = atomic_read(&sock->sk->sk_wmem_alloc);
+				if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
+					tx_poll_start(net, sock);
+					set_bit(SOCK_ASYNC_NOSPACE,
+					&sock->flags);
+					break;
+				}
 			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
@@ -178,6 +362,13 @@ static void handle_tx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
 		len = iov_length(vq->iov, out);
+		/* if async operations supported */
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for TX: "
@@ -186,12 +377,18 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_vq_desc(vq, 1);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
@@ -203,6 +400,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -396,7 +595,8 @@ static void handle_rx_big(struct vhost_net *net)
 static void handle_rx_mergeable(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-	unsigned uninitialized_var(in), log;
+	unsigned uninitialized_var(in), log, out;
+	struct kiocb *iocb;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -417,28 +617,44 @@ static void handle_rx_mergeable(struct vhost_net *net)
 	size_t vhost_hlen, sock_hlen;
 	size_t vhost_len, sock_len;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+		      !is_async_vq(vq)))
 		return;
-
 	use_mm(net->dev.mm);
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
 	vhost_hlen = vq->vhost_hlen;
 	sock_hlen = vq->sock_hlen;
 
+	/* In async cases, when write log is enabled, in case the submitted
+	 * buffers did not get log info before the log enabling, so we'd
+	 * better recompute the log info when needed. We do this in
+	 * handle_async_rx_events_notify().
+	 */
+
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
-	while ((sock_len = peek_head_len(sock->sk))) {
-		sock_len += sock_hlen;
-		vhost_len = sock_len + vhost_hlen;
-		headcount = get_rx_bufs(vq, vq->heads, vhost_len,
+	handle_async_rx_events_notify(net, vq, sock);
+
+	while (is_async_vq(vq) || (sock_len = peek_head_len(sock->sk))) {
+		if (is_async_vq(vq))
+			headcount = vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						      ARRAY_SIZE(vq->iov),
+						      &out, &in,
+						      vq->log, &log);
+		else {
+			sock_len += sock_hlen;
+			vhost_len = sock_len + vhost_hlen;
+			headcount = get_rx_bufs(vq, vq->heads, vhost_len,
 					&in, vq_log, &log);
+		}
 		/* On error, stop handling until the next kick. */
 		if (unlikely(headcount < 0))
 			break;
 		/* OK, now we need to know about added descriptors. */
-		if (!headcount) {
+		if ((!headcount && !is_async_vq(vq)) ||
+			(headcount == vq->num && is_async_vq(vq))) {
 			if (unlikely(vhost_enable_notify(vq))) {
 				/* They have slipped one in as we were
 				 * doing that: check again. */
@@ -450,16 +666,41 @@ static void handle_rx_mergeable(struct vhost_net *net)
 			break;
 		}
 		/* We don't need to be notified again. */
-		if (unlikely((vhost_hlen)))
-			/* Skip header. TODO: support TSO. */
-			move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, in);
-		else
-			/* Copy the header for use in VIRTIO_NET_F_MRG_RXBUF:
-			 * needed because sendmsg can modify msg_iov. */
-			copy_iovec_hdr(vq->iov, vq->hdr, sock_hlen, in);
+		if (unlikely((vhost_hlen))) {
+			if (is_async_vq(vq))
+				vq->hdr[0].iov_len = vhost_hlen;
+			else
+				/* Skip header. TODO: support TSO. */
+				move_iovec_hdr(vq->iov, vq->hdr,
+						vhost_hlen, in);
+		} else {
+			if (is_async_vq(vq))
+				vq->hdr[0].iov_len = sock_hlen;
+			else
+				/* Copy the header for use in
+				 * VIRTIO_NET_F_MRG_RXBUF:
+				 * needed because sendmsg can
+				 * modify msg_iov. */
+				copy_iovec_hdr(vq->iov, vq->hdr,
+						sock_hlen, in);
+		}
 		msg.msg_iovlen = in;
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, headcount);
+			if (!iocb)
+				break;
+		}
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 sock_len, MSG_DONTWAIT | MSG_TRUNC);
+		if (is_async_vq(vq)) {
+			if (err < 0) {
+				kmem_cache_free(net->cache, iocb);
+				vhost_discard_vq_desc(vq, headcount);
+				break;
+			}
+			continue;
+		}
+
 		/* Userspace might have consumed the packet meanwhile:
 		 * it's not supposed to do this usually, but might be hard
 		 * to prevent. Discard data we got (if any) and keep going. */
@@ -496,6 +737,8 @@ static void handle_rx_mergeable(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -561,6 +804,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 
 	f->private_data = n;
 
@@ -624,6 +868,21 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+	/* clean the notifier */
+	struct vhost_virtqueue *vq;
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		vq = &n->dev.vqs[VHOST_NET_VQ_TX];
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -640,6 +899,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_async_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -691,21 +951,61 @@ static struct socket *get_tap_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+				 enum vhost_vq_link_state *state)
 {
 	struct socket *sock;
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
+
+	*state = VHOST_VQ_LINK_SYNC;
+
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	/* If we dont' have notify_cache, then dont do mpassthru */
+	if (!notify_cache)
+		return ERR_PTR(-ENOTSOCK);
+	/* If we don't have mergeable buffer then dont do mpassthru */
+	if (vhost_has_feature(vq->dev, VIRTIO_NET_F_MRG_RXBUF)) {
+		sock = get_mp_socket(fd);
+		if (!IS_ERR(sock)) {
+			*state = VHOST_VQ_LINK_ASYNC;
+			return sock;
+		}
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache)
+			n->cache = notify_cache;
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -729,12 +1029,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(vq, fd, &vq->link_state);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock != oldsock) {
@@ -879,6 +1181,9 @@ static struct miscdevice vhost_net_misc = {
 
 static int vhost_net_init(void)
 {
+	notify_cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
 	return misc_register(&vhost_net_misc);
 }
 module_init(vhost_net_init);
@@ -886,6 +1191,8 @@ module_init(vhost_net_init);
 static void vhost_net_exit(void)
 {
 	misc_deregister(&vhost_net_misc);
+	if (notify_cache)
+		kmem_cache_destroy(notify_cache);
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dd3d6f7..295d9ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1015,6 +1015,84 @@ static int get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
+/* To recompute the log */
+int __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			struct iovec iov[], unsigned int iov_size,
+			unsigned int *out_num, unsigned int *in_num,
+			struct vhost_log *log, unsigned int *log_num,
+			unsigned int head)
+{
+	struct vring_desc desc;
+	unsigned int i, found = 0;
+	int ret;
+
+	/* When we start there are none of either input nor output. */
+	*out_num = *in_num = 0;
+	if (unlikely(log))
+		*log_num = 0;
+
+	i = head;
+	do {
+		unsigned iov_count = *in_num + *out_num;
+		if (unlikely(i >= vq->num)) {
+			vq_err(vq, "Desc index is %u > %u, head = %u",
+					i, vq->num, head);
+			return -EINVAL;
+		}
+		if (unlikely(++found > vq->num)) {
+			vq_err(vq, "Loop detected: last one at %u "
+					"vq size %u head %u\n",
+					i, vq->num, head);
+			return -EINVAL;
+		}
+		ret = copy_from_user(&desc, vq->desc + i, sizeof desc);
+		if (unlikely(ret)) {
+			vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+					i, vq->desc + i);
+			return -EFAULT;
+		}
+		if (desc.flags & VRING_DESC_F_INDIRECT) {
+			ret = get_indirect(dev, vq, iov, iov_size,
+					out_num, in_num,
+					log, log_num, &desc);
+			if (unlikely(ret < 0)) {
+				vq_err(vq, "Failure detected "
+				       "in indirect descriptor at idx %d\n", i);
+				return ret;
+			}
+			continue;
+		}
+
+		ret = translate_desc(dev, desc.addr, desc.len, iov + iov_count,
+				iov_size - iov_count);
+		if (unlikely(ret < 0)) {
+			vq_err(vq, "Translation failure %d descriptor idx %d\n",
+					ret, i);
+			return ret;
+		}
+		if (desc.flags & VRING_DESC_F_WRITE) {
+			/* If this is an input descriptor,
+			 * increment that count. */
+			*in_num += ret;
+			if (unlikely(log)) {
+				log[*log_num].addr = desc.addr;
+				log[*log_num].len = desc.len;
+				++*log_num;
+			}
+		} else {
+			/* If it's an output descriptor, they're all supposed
+			 * to come before any input descriptors. */
+			if (unlikely(*in_num)) {
+				vq_err(vq, "Descriptor has out after in: "
+						"idx %d\n", i);
+				return -EINVAL;
+			}
+			*out_num += ret;
+		}
+	} while ((i = next_desc(&desc)) != -1);
+
+	return head;
+}
 /* This looks in the virtqueue and for the first available buffer, and converts
  * it to an iovec for convenient access.  Since descriptors consist of some
  * number of output then some number of input descriptors, it's actually two
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index afd7729..915336d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -55,6 +55,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 0,
+	VHOST_VQ_LINK_ASYNC = 1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -110,6 +115,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
 };
 
 struct vhost_dev {
@@ -136,7 +145,11 @@ void vhost_dev_cleanup(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
 int vhost_vq_access_ok(struct vhost_virtqueue *vq);
 int vhost_log_access_ok(struct vhost_dev *);
-
+int __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			  struct iovec iov[], unsigned int iov_count,
+			  unsigned int *out_num, unsigned int *in_num,
+			  struct vhost_log *log, unsigned int *log_num,
+			  unsigned int head);
 int vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
 		      unsigned int *out_num, unsigned int *in_num,
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 14/17]Add a kconfig entry and make entry for mp device.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/Kconfig  |   10 ++++++++++
 drivers/vhost/Makefile |    2 ++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.
 
+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support, we call it as mediate passthru to
+	  be distiguish with hardare passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 13/17] Add mp(mediate passthru) device.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The patch add mp(mediate passthru) device, which now
based on vhost-net backend driver and provides proto_ops
to send/receive guest buffers data from/to guest vitio-net
driver.
It also exports async functions which can be used by other
drivers like macvtap to utilize zero-copy too.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 drivers/vhost/mpassthru.c | 1515 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 1515 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..492430c
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1515 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+#include "../net/bonding/bonding.h"
+
+struct mp_struct {
+	struct mp_file		*mfile;
+	struct net_device       *dev;
+	struct page_pool	*pool;
+	struct socket           socket;
+	struct socket_wq	wq;
+	struct mm_struct	*mm;
+};
+
+struct mp_file {
+	atomic_t count;
+	struct mp_struct *mp;
+	struct net *net;
+};
+
+struct mp_sock {
+	struct sock		sk;
+	struct mp_struct	*mp;
+};
+
+/* The main function to allocate external buffers */
+static struct skb_ext_page *page_ctor(struct mp_port *port,
+				      struct sk_buff *skb,
+				      int npages)
+{
+	int i;
+	unsigned long flags;
+	struct page_pool *pool;
+	struct page_info *info = NULL;
+
+	if (npages != 1)
+		BUG();
+	pool = container_of(port, struct page_pool, port);
+
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq, struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	if (!info)
+		return NULL;
+
+	for (i = 0; i < info->pnum; i++)
+		get_page(info->pages[i]);
+	info->skb = skb;
+	return &info->ext_page;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page);
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info);
+
+static struct skb_ext_page *mp_lookup(struct net_device *dev,
+				      struct page *page)
+{
+	struct mp_struct *mp =
+		container_of(dev->mp_port->sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = mp->pool;
+	struct page_info *info;
+
+	info = mp_hash_lookup(pool, page);
+	if (!info)
+		return NULL;
+	return &info->ext_page;
+}
+
+struct page_pool *page_pool_create(struct net_device *dev,
+				  struct socket *sock)
+{
+	struct page_pool *pool;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+	int rc;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	/* How to deal with bonding device:
+	 * check if all the slaves are capable of zero-copy.
+	 * if not, fail.
+	 */
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			rc = netdev_mp_port_prep(slave->dev, &pool->port);
+			if (rc)
+				break;
+		}
+		read_unlock(&bond->lock);
+	} else
+		rc = netdev_mp_port_prep(dev, &pool->port);
+	if (rc)
+		goto fail;
+
+	INIT_LIST_HEAD(&pool->readq);
+	spin_lock_init(&pool->read_lock);
+	pool->hash_table =
+		kzalloc(sizeof(struct page_info *) * HASH_BUCKETS, GFP_KERNEL);
+	if (!pool->hash_table)
+		goto fail;
+
+	pool->dev = dev;
+	pool->port.ctor = page_ctor;
+	pool->port.sock = sock;
+	pool->port.hash = mp_lookup;
+	pool->locked_pages = 0;
+	pool->cur_pages = 0;
+	pool->orig_locked_vm = 0;
+
+	/* for bonding device, assign all the slaves the same page_pool */
+	if (master) {
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			dev_hold(slave->dev);
+			slave->dev->mp_port = &pool->port;
+		}
+		read_unlock(&bond->lock);
+	} else {
+		dev_hold(dev);
+		dev->mp_port = &pool->port;
+	}
+
+	return pool;
+fail:
+	kfree(pool);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(page_pool_create);
+
+void dev_bond_hold(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			if (slave->dev != dev)
+				dev_hold(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	}
+}
+
+void dev_bond_put(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			if (slave->dev != dev)
+				dev_put(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	}
+}
+
+void dev_change_state(struct net_device *dev)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	master = dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			dev_change_flags(slave->dev,
+					 slave->dev->flags & (~IFF_UP));
+			dev_change_flags(slave->dev,
+					 slave->dev->flags | IFF_UP);
+		}
+		read_unlock(&bond->lock);
+	} else {
+		dev_change_flags(dev, dev->flags & (~IFF_UP));
+		dev_change_flags(dev, dev->flags | IFF_UP);
+	}
+}
+EXPORT_SYMBOL_GPL(dev_change_state);
+
+static int mp_page_pool_attach(struct mp_struct *mp, struct page_pool *pool)
+{
+	int rc = 0;
+	/* should be protected by mp_mutex */
+	if (mp->pool) {
+		rc = -EBUSY;
+		goto fail;
+	}
+	if (mp->dev != pool->dev) {
+		rc = -EFAULT;
+		goto fail;
+	}
+	mp->pool = pool;
+	return 0;
+fail:
+	kfree(pool->hash_table);
+	kfree(pool);
+	return rc;
+}
+
+struct page_info *info_dequeue(struct page_pool *pool)
+{
+	unsigned long flags;
+	struct page_info *info = NULL;
+	spin_lock_irqsave(&pool->read_lock, flags);
+	if (!list_empty(&pool->readq)) {
+		info = list_first_entry(&pool->readq,
+				struct page_info, list);
+		list_del(&info->list);
+	}
+	spin_unlock_irqrestore(&pool->read_lock, flags);
+	return info;
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+	struct page_info *info = (struct page_info *)(iocb->private);
+	int i;
+
+	if (info->flags == INFO_READ) {
+		for (i = 0; i < info->pnum; i++) {
+			if (info->pages[i]) {
+				set_page_dirty_lock(info->pages[i]);
+				put_page(info->pages[i]);
+			}
+		}
+		mp_hash_delete(info->pool, info);
+		if (info->skb) {
+			info->skb->destructor = NULL;
+			kfree_skb(info->skb);
+		}
+	}
+	/* Decrement the number of locked pages */
+	info->pool->cur_pages -= info->pnum;
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+	struct kiocb *iocb = NULL;
+
+	iocb = info->iocb;
+	if (!iocb)
+		return iocb;
+	iocb->ki_flags = 0;
+	iocb->ki_users = 1;
+	iocb->ki_key = 0;
+	iocb->ki_ctx = NULL;
+	iocb->ki_cancel = NULL;
+	iocb->ki_retry = NULL;
+	iocb->ki_eventfd = NULL;
+	iocb->ki_pos = info->desc_pos;
+	iocb->ki_nbytes = size;
+	iocb->ki_dtor(iocb);
+	iocb->private = (void *)info;
+	iocb->ki_dtor = mp_ki_dtor;
+
+	return iocb;
+}
+
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool)
+{
+	struct page_info *info;
+	struct net_device *master;
+	struct slave *slave;
+	struct bonding *bond;
+	int i;
+
+	if (!pool)
+		return;
+
+	while ((info = info_dequeue(pool))) {
+		for (i = 0; i < info->pnum; i++)
+			if (info->pages[i])
+				put_page(info->pages[i]);
+		create_iocb(info, 0);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	down_write(&mm->mmap_sem);
+	mm->locked_vm -= pool->locked_pages;
+	up_write(&mm->mmap_sem);
+
+	master = pool->dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i) {
+			slave->dev->mp_port = NULL;
+			dev_put(slave->dev);
+		}
+		read_unlock(&bond->lock);
+	} else {
+		pool->dev->mp_port = NULL;
+		dev_put(pool->dev);
+	}
+
+	kfree(pool->hash_table);
+	kfree(pool);
+}
+EXPORT_SYMBOL_GPL(page_pool_destroy);
+
+static void mp_page_pool_detach(struct mp_struct *mp)
+{
+	/* locked by mp_mutex */
+	if (mp->pool) {
+		page_pool_destroy(mp->mm, mp->pool);
+		mp->pool = NULL;
+	}
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+	struct net_device *master;
+	struct bonding *bond;
+	struct slave *slave;
+	int i;
+
+	mp->mfile = NULL;
+	master = mp->dev->master;
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i)
+			dev_change_flags(slave->dev,
+					 slave->dev->flags & ~IFF_UP);
+		read_unlock(&bond->lock);
+	} else
+		dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+	mp_page_pool_detach(mp);
+	if (master) {
+		bond = netdev_priv(master);
+		read_lock(&bond->lock);
+		bond_for_each_slave(bond, slave, i)
+			dev_change_flags(slave->dev,
+					 slave->dev->flags | IFF_UP);
+		read_unlock(&bond->lock);
+	} else
+		dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+	mutex_lock(&mp_mutex);
+	__mp_detach(mp);
+	mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+	struct mp_struct *mp = NULL;
+	if (atomic_inc_not_zero(&mfile->count))
+		mp = mfile->mp;
+
+	return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+	if (atomic_dec_and_test(&mfile->count)) {
+		if (mfile->mp)
+			return;
+		if (!rtnl_is_locked()) {
+			rtnl_lock();
+			mp_detach(mfile->mp);
+			rtnl_unlock();
+		} else
+			mp_detach(mfile->mp);
+	}
+}
+
+static void iocb_tag(struct kiocb *iocb)
+{
+	iocb->ki_flags = 1;
+}
+
+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_ext_page *ext_page)
+{
+	struct page_info *info;
+	struct page_pool *pool;
+	struct sock *sk;
+	struct sk_buff *skb;
+
+	if (!ext_page)
+		return;
+	info = container_of(ext_page, struct page_info, ext_page);
+	if (!info)
+		return;
+	pool = info->pool;
+	skb = info->skb;
+
+	if (info->flags == INFO_READ) {
+		create_iocb(info, 0);
+		return;
+	}
+
+	/* For transmit, we should wait for the DMA finish by hardware.
+	 * Queue the notifier to wake up the backend driver
+	 */
+
+	iocb_tag(info->iocb);
+	sk = pool->port.sock->sk;
+	sk->sk_write_space(sk);
+
+	return;
+}
+
+/* For small exteranl buffers transmit, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_pool *pool,
+		struct kiocb *iocb, int total)
+{
+	struct page_info *info =
+		kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->ext_page.dtor = page_dtor;
+	info->pool = pool;
+	info->flags = INFO_WRITE;
+	info->iocb = iocb;
+	info->pnum = 0;
+	return info;
+}
+
+typedef u32 key_mp_t;
+static inline key_mp_t mp_hash(struct page *page, int buckets)
+{
+	key_mp_t k;
+#if BITS_PER_LONG == 64
+	k = ((((unsigned long)page << 32UL) >> 32UL) /
+			sizeof(struct page)) % buckets ;
+#elif BITS_PER_LONG == 32
+	k = ((unsigned long)page / sizeof(struct page)) % buckets;
+#endif
+
+	return k;
+}
+
+static void mp_hash_insert(struct page_pool *pool,
+		struct page *page, struct page_info *page_info)
+{
+	struct page_info *tmp;
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	if (!pool->hash_table[key]) {
+		pool->hash_table[key] = page_info;
+		return;
+	}
+
+	tmp = pool->hash_table[key];
+	while (tmp->next)
+		tmp = tmp->next;
+
+	tmp->next = page_info;
+	page_info->prev = tmp;
+	return;
+}
+
+static struct page_info *mp_hash_delete(struct page_pool *pool,
+					struct page_info *info)
+{
+	key_mp_t key = mp_hash(info->pages[0], HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		if (tmp == info) {
+			if (!tmp->prev) {
+				pool->hash_table[key] = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = NULL;
+			} else {
+				tmp->prev->next = tmp->next;
+				if (tmp->next)
+					tmp->next->prev = tmp->prev;
+			}
+			return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+static struct page_info *mp_hash_lookup(struct page_pool *pool,
+					struct page *page)
+{
+	key_mp_t key = mp_hash(page, HASH_BUCKETS);
+	struct page_info *tmp = NULL;
+
+	int i;
+	tmp = pool->hash_table[key];
+	while (tmp) {
+		for (i = 0; i < tmp->pnum; i++) {
+			if (tmp->pages[i] == page)
+				return tmp;
+		}
+		tmp = tmp->next;
+	}
+	return tmp;
+}
+
+/* The main function to transform the guest user space address
+ * to host kernel address via get_user_pages(). Thus the hardware
+ * can do DMA directly to the external buffer address.
+ */
+static struct page_info *alloc_page_info(struct page_pool *pool,
+		struct kiocb *iocb, struct iovec *iov,
+		int count, struct frag *frags,
+		int npages, int total)
+{
+	int rc;
+	int i, j, n = 0;
+	int len;
+	unsigned long base;
+	struct page_info *info = NULL;
+
+	if (pool->cur_pages + count > pool->locked_pages) {
+		printk(KERN_INFO "Exceed memory lock rlimt.");
+		return NULL;
+	}
+
+	info = kmem_cache_alloc(ext_page_info_cache, GFP_KERNEL);
+
+	if (!info)
+		return NULL;
+	info->skb = NULL;
+	info->next = info->prev = NULL;
+
+	for (i = j = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+
+		if (!len)
+			continue;
+		n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+		rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+				&info->pages[j]);
+		if (rc != n)
+			goto failed;
+
+		while (n--) {
+			frags[j].offset = base & ~PAGE_MASK;
+			frags[j].size = min_t(int, len,
+					PAGE_SIZE - frags[j].offset);
+			len -= frags[j].size;
+			base += frags[j].size;
+			j++;
+		}
+	}
+
+#ifdef CONFIG_HIGHMEM
+	if (npages && !(dev->features & NETIF_F_HIGHDMA)) {
+		for (i = 0; i < j; i++) {
+			if (PageHighMem(info->pages[i]))
+				goto failed;
+		}
+	}
+#endif
+
+	info->ext_page.dtor = page_dtor;
+	info->ext_page.page = info->pages[0];
+	info->pool = pool;
+	info->pnum = j;
+	info->iocb = iocb;
+	if (!npages)
+		info->flags = INFO_WRITE;
+	else
+		info->flags = INFO_READ;
+
+	if (info->flags == INFO_READ) {
+		if (frags[0].offset == 0 && iocb->ki_iovec[0].iov_len) {
+			frags[0].offset = iocb->ki_iovec[0].iov_len;
+			pool->port.vnet_hlen = iocb->ki_iovec[0].iov_len;
+		}
+		for (i = 0; i < j; i++)
+			mp_hash_insert(pool, info->pages[i], info);
+	}
+	/* increment the number of locked pages */
+	pool->cur_pages += j;
+	return info;
+
+failed:
+	for (i = 0; i < j; i++)
+		put_page(info->pages[i]);
+
+	kmem_cache_free(ext_page_info_cache, info);
+
+	return NULL;
+}
+
+static void mp_sock_destruct(struct sock *sk)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+	wait_queue_head_t *wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_sync_poll(wqueue, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+	wait_queue_head_t *wqueue = sk_sleep(sk);
+	if (wqueue && waitqueue_active(wqueue))
+		wake_up_interruptible_sync_poll(wqueue, POLLOUT);
+}
+
+void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	struct sk_buff *skb = NULL;
+	struct page_info *info = NULL;
+	int len;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		struct page *page;
+		int off;
+		int size = 0, i = 0;
+		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct skb_ext_page *ext_page =
+			(struct skb_ext_page *)(shinfo->destructor_arg);
+		struct virtio_net_hdr_mrg_rxbuf hdr = {
+			.hdr.flags = 0,
+			.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
+		};
+
+		if (!ext_page) {
+			kfree_skb(skb);
+			continue;
+		}
+		if (skb->ip_summed == CHECKSUM_COMPLETE)
+			printk(KERN_INFO "Complete checksum occurs\n");
+
+		if (shinfo->frags[0].page == ext_page->page) {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			if (shinfo->nr_frags)
+				hdr.num_buffers = shinfo->nr_frags;
+			else
+				hdr.num_buffers = shinfo->nr_frags + 1;
+		} else {
+			info = container_of(ext_page,
+					    struct page_info,
+					    ext_page);
+			hdr.num_buffers = shinfo->nr_frags + 1;
+		}
+		skb_push(skb, ETH_HLEN);
+
+		if (skb_is_gso(skb)) {
+			hdr.hdr.hdr_len = skb_headlen(skb);
+			hdr.hdr.gso_size = shinfo->gso_size;
+			if (shinfo->gso_type & SKB_GSO_TCPV4)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (shinfo->gso_type & SKB_GSO_TCPV6)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (shinfo->gso_type & SKB_GSO_UDP)
+				hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+			if (shinfo->gso_type & SKB_GSO_TCP_ECN)
+				hdr.hdr.gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+
+		} else
+			hdr.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			hdr.hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			hdr.hdr.csum_start =
+				skb->csum_start - skb_headroom(skb);
+			hdr.hdr.csum_offset = skb->csum_offset;
+		}
+
+		off = info->hdr[0].iov_len;
+		len = memcpy_toiovec(info->iov, (unsigned char *)&hdr, off);
+		if (len) {
+			pr_debug("Unable to write vnet_hdr at addr '%p': '%d'\n",
+				info->iov, len);
+			goto clean;
+		}
+
+		memcpy_toiovec(info->iov, skb->data, skb_headlen(skb));
+
+		info->iocb->ki_left = hdr.num_buffers;
+		if (shinfo->frags[0].page == ext_page->page) {
+			size = shinfo->frags[0].size +
+				shinfo->frags[0].page_offset - off;
+			i = 1;
+		} else {
+			size = skb_headlen(skb);
+			i = 0;
+		}
+		create_iocb(info, off + size);
+		for (i = i; i < shinfo->nr_frags; i++) {
+			page = shinfo->frags[i].page;
+			info = mp_hash_lookup(pool, shinfo->frags[i].page);
+			create_iocb(info, shinfo->frags[i].size);
+		}
+		info->skb = skb;
+		shinfo->nr_frags = 0;
+		shinfo->destructor_arg = NULL;
+		continue;
+clean:
+		kfree_skb(skb);
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	return;
+}
+EXPORT_SYMBOL_GPL(async_data_ready);
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+	struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool = NULL;
+
+	pool = mp->pool;
+	if (!pool)
+		return;
+	return async_data_ready(sk, pool);
+}
+
+static inline struct sk_buff *mp_alloc_skb(struct sock *sk, size_t prepad,
+					   size_t len, size_t linear,
+					   int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+			err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+static int mp_skb_from_vnet_hdr(struct sk_buff *skb,
+		struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+int async_sendmsg(struct sock *sk, struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count)
+{
+	struct virtio_net_hdr vnet_hdr = {0};
+	int hdr_len = 0;
+	struct page_info *info = NULL;
+	struct frag frags[MAX_SKB_FRAGS];
+	struct sk_buff *skb;
+	int total = 0, header, n, i, len, rc;
+	unsigned long base;
+
+	total = iov_length(iov, count);
+
+	if (total < ETH_HLEN)
+		return -EINVAL;
+
+	if (total <= COPY_THRESHOLD)
+		goto copy;
+
+	n = 0;
+	for (i = 0; i < count; i++) {
+		base = (unsigned long)iov[i].iov_base;
+		len = iov[i].iov_len;
+		if (!len)
+			continue;
+		n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		if (n > MAX_SKB_FRAGS)
+			return -EINVAL;
+	}
+
+copy:
+	hdr_len = sizeof(vnet_hdr);
+	if ((total - iocb->ki_iovec[0].iov_len) < 0)
+		return -EINVAL;
+
+	rc = memcpy_fromiovecend((void *)&vnet_hdr, iocb->ki_iovec, 0, hdr_len);
+	if (rc < 0)
+		return -EINVAL;
+
+	if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+			vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+			vnet_hdr.hdr_len)
+		vnet_hdr.hdr_len = vnet_hdr.csum_start +
+			vnet_hdr.csum_offset + 2;
+
+	if (vnet_hdr.hdr_len > total)
+		return -EINVAL;
+
+	header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+	skb = mp_alloc_skb(sk, NET_IP_ALIGN, header,
+			iocb->ki_iovec[0].iov_len, 1, &rc);
+	if (!skb)
+		goto drop;
+
+	skb_set_network_header(skb, ETH_HLEN);
+	memcpy_fromiovec(skb->data, iov, header);
+
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	rc = mp_skb_from_vnet_hdr(skb, &vnet_hdr);
+	if (rc)
+		goto drop;
+
+	if (header == total) {
+		rc = total;
+		info = alloc_small_page_info(pool, iocb, total);
+	} else {
+		info = alloc_page_info(pool, iocb, iov, count, frags, 0, total);
+		if (info)
+			for (i = 0; i < info->pnum; i++) {
+				skb_add_rx_frag(skb, i, info->pages[i],
+						frags[i].offset, frags[i].size);
+				info->pages[i] = NULL;
+			}
+	}
+	if (!pool->cur_pages)
+		sk->sk_state_change(sk);
+
+	if (info != NULL) {
+		info->desc_pos = iocb->ki_pos;
+		info->skb = skb;
+		skb_shinfo(skb)->destructor_arg = &info->ext_page;
+		skb->dev = pool->dev->master ? pool->dev->master : pool->dev;
+		create_iocb(info, total);
+		dev_queue_xmit(skb);
+		return 0;
+	}
+drop:
+	kfree_skb(skb);
+	if (info) {
+		for (i = 0; i < info->pnum; i++)
+			put_page(info->pages[i]);
+		kmem_cache_free(ext_page_info_cache, info);
+	}
+	pool->dev->stats.tx_dropped++;
+	return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(async_sendmsg);
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+
+	pool = mp->pool;
+	if (!pool)
+		return -ENODEV;
+	return async_sendmsg(sock->sk, iocb, pool, iov, count);
+}
+
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags)
+{
+	int npages, payload;
+	struct page_info *info;
+	struct frag frags[MAX_SKB_FRAGS];
+	unsigned long base;
+	int i, len;
+	unsigned long flag;
+
+	if (!(flags & MSG_DONTWAIT))
+		return -EINVAL;
+
+	if (!pool)
+		return -EINVAL;
+
+	/* Error detections in case invalid external buffer */
+	if (count > 2 && iov[1].iov_len < pool->port.hdr_len &&
+			pool->dev->features & NETIF_F_SG) {
+		return -EINVAL;
+	}
+
+	npages = pool->port.npages;
+	payload = pool->port.data_len;
+
+	/* If KVM guest virtio-net FE driver use SG feature */
+	if (count > 2) {
+		for (i = 2; i < count; i++) {
+			base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+			len = iov[i].iov_len;
+			if (npages == 1)
+				len = min_t(int, len, PAGE_SIZE - base);
+			else if (base)
+				break;
+			payload -= len;
+			if (payload <= 0)
+				goto proceed;
+			if (npages == 1 || (len & ~PAGE_MASK))
+				break;
+		}
+	}
+
+	if ((((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+				- NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+		goto proceed;
+
+	return -EINVAL;
+proceed:
+	/* skip the virtnet head */
+	if (count > 1) {
+		iov++;
+		count--;
+	}
+
+	/* Translate address to kernel */
+	info = alloc_page_info(pool, iocb, iov, count, frags, npages, 0);
+	if (!info)
+		return -ENOMEM;
+	info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+	info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+	iocb->ki_iovec[0].iov_len = 0;
+	iocb->ki_left = 0;
+	info->desc_pos = iocb->ki_pos;
+
+	if (count > 1) {
+		iov--;
+		count++;
+	}
+
+	memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+	spin_lock_irqsave(&pool->read_lock, flag);
+	list_add_tail(&info->list, &pool->readq);
+	spin_unlock_irqrestore(&pool->read_lock, flag);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(async_recvmsg);
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *m, size_t total_len,
+		int flags)
+{
+	struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+	struct page_pool *pool;
+	struct iovec *iov = m->msg_iov;
+	int count = m->msg_iovlen;
+
+	pool = mp->pool;
+	if (!pool)
+		return -EINVAL;
+
+	return async_recvmsg(iocb, pool, iov, count, flags);
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+	.sendmsg = mp_sendmsg,
+	.recvmsg = mp_recvmsg,
+};
+
+static struct proto mp_proto = {
+	.name           = "mp",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+	struct mp_file *mfile;
+	cycle_kernel_lock();
+
+	pr_debug("mp: mp_chr_open\n");
+	mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+	if (!mfile)
+		return -ENOMEM;
+	atomic_set(&mfile->count, 0);
+	mfile->mp = NULL;
+	mfile->net = get_net(current->nsproxy->net_ns);
+	file->private_data = mfile;
+	return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	int err;
+
+	netif_tx_lock_bh(mp->dev);
+
+	err = -EINVAL;
+
+	if (mfile->mp)
+		goto out;
+
+	err = -EBUSY;
+	if (mp->mfile)
+		goto out;
+
+	err = 0;
+	mfile->mp = mp;
+	mp->mfile = mfile;
+	mp->socket.file = file;
+	sock_hold(mp->socket.sk);
+	atomic_inc(&mfile->count);
+
+out:
+	netif_tx_unlock_bh(mp->dev);
+	return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+	struct mp_struct *mp = mp_get(mfile);
+
+	if (mp) {
+		mp_detach(mp);
+		sock_put(mp->socket.sk);
+	}
+	mp_put(mfile);
+	return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+		unsigned long arg)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+	struct net_device *dev;
+	struct page_pool *pool;
+	void __user* argp = (void __user *)arg;
+	unsigned long  __user *limitp = argp;
+	struct ifreq ifr;
+	struct sock *sk;
+	unsigned long limit, locked, lock_limit;
+	int ret;
+
+	ret = -EINVAL;
+
+	switch (cmd) {
+	case MPASSTHRU_BINDDEV:
+		ret = -EFAULT;
+		if (copy_from_user(&ifr, argp, sizeof ifr))
+			break;
+
+		ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+		ret = -ENODEV;
+
+		rtnl_lock();
+		dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+		if (!dev) {
+			rtnl_unlock();
+			break;
+		}
+		dev_bond_hold(dev);
+		mutex_lock(&mp_mutex);
+
+		ret = -EBUSY;
+
+		/* the device can be only bind once */
+		if (dev_is_mpassthru(dev))
+			goto err_dev_put;
+
+		ret = -EFAULT;
+
+		if (!(dev->features & NETIF_F_SG)) {
+			pr_debug("The device has no SG features.\n");
+			goto err_dev_put;
+		}
+		mp = mfile->mp;
+		if (mp)
+			goto err_dev_put;
+
+		mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+		if (!mp) {
+			ret = -ENOMEM;
+			goto err_dev_put;
+		}
+		mp->dev = dev;
+		mp->mm = get_task_mm(current);
+		ret = -ENOMEM;
+
+		sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+		if (!sk)
+			goto err_free_mp;
+
+		mp->socket.wq = &mp->wq;
+		init_waitqueue_head(&mp->wq.wait);
+		mp->socket.ops = &mp_socket_ops;
+		sock_init_data(&mp->socket, sk);
+		sk->sk_sndbuf = INT_MAX;
+		container_of(sk, struct mp_sock, sk)->mp = mp;
+
+		sk->sk_destruct = mp_sock_destruct;
+		sk->sk_data_ready = mp_sock_data_ready;
+		sk->sk_write_space = mp_sock_write_space;
+		sk->sk_state_change = mp_sock_state_change;
+
+		pool = page_pool_create(dev, &mp->socket);
+		if (!pool) {
+			ret = -EFAULT;
+			goto err_free_sk;
+		}
+
+		ret = mp_attach(mp, file);
+		if (ret < 0)
+			goto err_free_sk;
+
+		ret = mp_page_pool_attach(mp, pool);
+		if (ret < 0)
+			goto err_free_sk;
+		dev_bond_put(dev);
+		dev_put(dev);
+		dev_change_state(dev);
+out:
+		mutex_unlock(&mp_mutex);
+		rtnl_unlock();
+		break;
+err_free_sk:
+		sk_free(sk);
+err_free_mp:
+		mfile->mp = NULL;
+		kfree(mp);
+err_dev_put:
+		dev_bond_put(dev);
+		dev_put(dev);
+		goto out;
+
+	case MPASSTHRU_UNBINDDEV:
+		rtnl_lock();
+		ret = do_unbind(mfile);
+		rtnl_unlock();
+		break;
+
+	case MPASSTHRU_SET_MEM_LOCKED:
+		ret = copy_from_user(&limit, limitp, sizeof limit);
+		if (ret < 0)
+			return ret;
+
+		mp = mp_get(mfile);
+		if (!mp)
+			return -ENODEV;
+
+		mutex_lock(&mp_mutex);
+		if (mp->mm != current->mm) {
+			mutex_unlock(&mp_mutex);
+			return -EPERM;
+		}
+
+		limit = PAGE_ALIGN(limit) >> PAGE_SHIFT;
+		down_write(&mp->mm->mmap_sem);
+		if (!mp->pool->locked_pages)
+			mp->pool->orig_locked_vm = mp->mm->locked_vm;
+		locked = limit + mp->pool->orig_locked_vm;
+		lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+		if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+			up_write(&mp->mm->mmap_sem);
+			mutex_unlock(&mp_mutex);
+			mp_put(mfile);
+			return -ENOMEM;
+		}
+		mp->mm->locked_vm = locked;
+		up_write(&mp->mm->mmap_sem);
+
+		mp->pool->locked_pages = limit;
+		mutex_unlock(&mp_mutex);
+
+		mp_put(mfile);
+		return 0;
+
+	case MPASSTHRU_GET_MEM_LOCKED_NEED:
+		limit = DEFAULT_NEED;
+		return copy_to_user(limitp, &limit, sizeof limit);
+
+
+	default:
+		break;
+	}
+	return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp = mp_get(mfile);
+	struct sock *sk;
+	unsigned int mask = 0;
+
+	if (!mp)
+		return POLLERR;
+
+	sk = mp->socket.sk;
+
+	poll_wait(file, &mp->wq.wait, wait);
+
+	if (!skb_queue_empty(&sk->sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(sk) ||
+		(!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+			 sock_writeable(sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+	if (mp->dev->reg_state != NETREG_REGISTERED)
+		mask = POLLERR;
+
+	mp_put(mfile);
+	return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct mp_struct *mp = mp_get(file->private_data);
+	struct sock *sk = mp->socket.sk;
+	struct sk_buff *skb;
+	int len, err;
+	ssize_t result = 0;
+
+	if (!mp)
+		return -EBADFD;
+
+	/* currently, async is not supported.
+	 * but we may support real async aio from user application,
+	 * maybe qemu virtio-net backend.
+	 */
+	if (!is_sync_kiocb(iocb))
+		return -EFAULT;
+
+	len = iov_length(iov, count);
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+				  file->f_flags & O_NONBLOCK, &err);
+
+	if (!skb)
+		return -ENOMEM;
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, len);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+		kfree_skb(skb);
+		return -EAGAIN;
+	}
+
+	skb->protocol = eth_type_trans(skb, mp->dev);
+	skb->dev = mp->dev;
+
+	dev_queue_xmit(skb);
+
+	mp_put(file->private_data);
+	return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+
+	/*
+	 * Ignore return value since an error only means there was nothing to
+	 * do
+	 */
+	rtnl_lock();
+	do_unbind(mfile);
+	rtnl_unlock();
+	put_net(mfile->net);
+	kfree(mfile);
+
+	return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+				unsigned long arg)
+{
+	return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+	.owner  = THIS_MODULE,
+	.llseek = no_llseek,
+	.write  = do_sync_write,
+	.aio_write = mp_chr_aio_write,
+	.poll   = mp_chr_poll,
+	.unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl = mp_chr_compat_ioctl,
+#endif
+	.open   = mp_chr_open,
+	.release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "mp",
+	.nodename = "net/mp",
+	.fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+		unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct mp_port *port;
+	struct mp_struct *mp = NULL;
+	struct socket *sock = NULL;
+	struct sock *sk;
+
+	port = dev->mp_port;
+	if (port == NULL)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		sock = dev->mp_port->sock;
+		mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+		do_unbind(mp->mfile);
+		break;
+	case NETDEV_CHANGE:
+		sk = dev->mp_port->sock->sk;
+		sk->sk_state_change(sk);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+	.notifier_call  = mp_device_event,
+};
+
+static int mp_init(void)
+{
+	int err = 0;
+
+	ext_page_info_cache = kmem_cache_create("skb_page_info",
+						sizeof(struct page_info),
+						0, SLAB_HWCACHE_ALIGN, NULL);
+	if (!ext_page_info_cache)
+		return -ENOMEM;
+
+	err = misc_register(&mp_miscdev);
+	if (err) {
+		printk(KERN_ERR "mp: Can't register misc device\n");
+		kmem_cache_destroy(ext_page_info_cache);
+	} else {
+		printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+				mp_miscdev.minor);
+		register_netdevice_notifier(&mp_notifier_block);
+	}
+	return err;
+}
+
+void mp_exit(void)
+{
+	unregister_netdevice_notifier(&mp_notifier_block);
+	misc_deregister(&mp_miscdev);
+	kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+	struct mp_file *mfile = file->private_data;
+	struct mp_struct *mp;
+
+	if (file->f_op != &mp_fops)
+		return ERR_PTR(-EINVAL);
+	mp = mp_get(mfile);
+	if (!mp)
+		return ERR_PTR(-EBADFD);
+	mp_put(mfile);
+	return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 12/17] Add header file for mp device.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/mpassthru.h |  133 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 133 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..1115f55
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,133 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/ioctl.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV      _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV    _IO('M', 214)
+#define MPASSTHRU_SET_MEM_LOCKED       _IOW('M', 215, unsigned long)
+#define MPASSTHRU_GET_MEM_LOCKED_NEED  _IOR('M', 216, unsigned long)
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+#define DEFAULT_NEED   ((8192*2*2)*4096)
+
+struct frag {
+	u16     offset;
+	u16     size;
+};
+
+#define HASH_BUCKETS    (8192*2)
+struct page_info {
+	struct list_head        list;
+	struct page_info        *next;
+	struct page_info        *prev;
+	struct page             *pages[MAX_SKB_FRAGS];
+	struct sk_buff          *skb;
+	struct page_pool        *pool;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a external allocated skb or kernel
+	 */
+	struct skb_ext_page    ext_page;
+	/* flag to indicate read or write */
+#define INFO_READ                      0
+#define INFO_WRITE                     1
+	unsigned                flags;
+	/* exact number of locked pages */
+	unsigned                pnum;
+
+	/* The fields after that is for backend
+	 * driver, now for vhost-net.
+	 */
+	/* the kiocb structure related to */
+	struct kiocb            *iocb;
+	/* the ring descriptor index */
+	unsigned int            desc_pos;
+	/* the iovec coming from backend, we only
+	 * need few of them */
+	struct iovec            hdr[2];
+	struct iovec            iov[2];
+};
+
+struct page_pool {
+	/* the queue for rx side */
+	struct list_head        readq;
+	/* the lock to protect readq */
+	spinlock_t              read_lock;
+	/* record the orignal rlimit */
+	struct rlimit           o_rlim;
+	/* userspace wants to locked */
+	int                     locked_pages;
+	/* currently locked pages */
+	int                     cur_pages;
+	/* the memory locked before */
+	unsigned long		orig_locked_vm;
+	/* the device according to */
+	struct net_device       *dev;
+	/* the mp_port according to dev */
+	struct mp_port          port;
+	/* the hash_table list to find each locked page */
+	struct page_info        **hash_table;
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+struct page_pool *page_pool_create(struct net_device *dev,
+				   struct socket *sock);
+int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		  struct iovec *iov, int count, int flags);
+int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		  struct page_pool *pool, struct iovec *iov,
+		  int count);
+void async_data_ready(struct sock *sk, struct page_pool *pool);
+void dev_change_state(struct net_device *dev);
+void page_pool_destroy(struct mm_struct *mm, struct page_pool *pool);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline struct page_pool *page_pool_create(struct net_device *dev,
+		struct socket *sock)
+{
+	return ERR_PTR(-EINVAL);
+}
+static inline int async_recvmsg(struct kiocb *iocb, struct page_pool *pool,
+		struct iovec *iov, int count, int flags)
+{
+	return -EINVAL;
+}
+static inline int async_sendmsg(struct sock *sk, struct kiocb *iocb,
+		struct page_pool *pool, struct iovec *iov,
+		int count)
+{
+	return -EINVAL;
+}
+static inline void async_data_ready(struct sock *sk, struct page_pool *pool)
+{
+	return;
+}
+static inline void dev_change_state(struct net_device *dev)
+{
+	return;
+}
+static inline void page_pool_destroy(struct mm_struct *mm,
+				     struct page_pool *pool)
+{
+	return;
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 11/17] Add a hook to intercept external buffers from NIC driver.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The hook is called in __netif_receive_skb().

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81new@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 net/core/dev.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 84fbb83..bdad1c8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2814,6 +2814,40 @@ int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 }
 EXPORT_SYMBOL(__skb_bond_should_drop);
 
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru(zero-copy) packets,
+ * and insert it to the socket queue owned by mp_port specially.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+					       struct packet_type **pt_prev,
+					       int *ret,
+					       struct net_device *orig_dev)
+{
+	struct mp_port *mp_port = NULL;
+	struct sock *sk = NULL;
+
+	if (!dev_is_mpassthru(skb->dev) && !dev_is_mpassthru(orig_dev))
+		return skb;
+	if (dev_is_mpassthru(skb->dev))
+		mp_port = skb->dev->mp_port;
+	else if (orig_dev->master == skb->dev && dev_is_mpassthru(orig_dev))
+		mp_port = orig_dev->mp_port;
+
+	if (*pt_prev) {
+		*ret = deliver_skb(skb, *pt_prev, orig_dev);
+		*pt_prev = NULL;
+	}
+
+	sk = mp_port->sock->sk;
+	skb_queue_tail(&sk->sk_receive_queue, skb);
+	sk->sk_state_change(sk);
+
+	return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev)     (skb)
+#endif
+
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
@@ -2891,6 +2925,11 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
+	/* To intercept mediate passthru(zero-copy) packets here */
+	skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		goto out;
+
 	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
@@ -2983,6 +3022,7 @@ err:
 EXPORT_SYMBOL(netdev_mp_port_prep);
 #endif
 
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.7.3

^ permalink raw reply related

* [PATCH v15 10/17]If device is in zero-copy mode first, bonding will fail.
From: xiaohui.xin @ 2010-11-09  9:03 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, herbert, jdike; +Cc: Xin Xiaohui
In-Reply-To: <fc6e95d63a2c62aaf77f8ded22fc43ccefcdbbff.1289280885.git.xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If device is in this zero-copy mode first, we cannot handle this,
so fail it. This patch is for this.

If bonding is created first, and one of the device will be in zero-copy
mode, this will be handled by mp device. It will first check if all the
slaves have the zero-copy capability. If no, fail too. Otherwise make
all the slaves in zero-copy mode.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/net/bonding/bond_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 3b16f62..dfb6a2c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1428,6 +1428,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 			   bond_dev->name);
 	}
 
+	/* if the device is in zero-copy mode before bonding, fail it. */
+	if (dev_is_mpassthru(slave_dev))
+		return -EBUSY;
+
 	/* already enslaved */
 	if (slave_dev->flags & IFF_SLAVE) {
 		pr_debug("Error, Device was already enslaved\n");
-- 
1.7.3

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox