Netdev List


* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-25  2:34 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <m2xa44ae5cd1004231559hcf90671asf146a43b4748c2c3@mail.gmail.com>

On Fri, Apr 23, 2010 at 06:59:12PM -0400, Miles Lane wrote:
> On Fri, Apr 23, 2010 at 3:42 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Fri, Apr 23, 2010 at 08:50:59AM -0400, Miles Lane wrote:
> >> Hi Paul,
> >> There has been a bit of back and forth, and I am not sure what patches
> >> I should test now.
> >> Could you send me a bundle of whatever needs testing now?
> >
> > Hello, Miles,
> >
> > I am posting my set as replies to this message.  There are a couple
> > of KVM fixes that are going up via Avi's tree, and a number of networking
> > fixes that are going up via Dave Miller's tree -- a number of these
> > are against quickly changing code, so it didn't make sense for me to
> > keep them separately.
> >
> > I believe that the two splats below are addressed by this patch set
> > carried in the networking tree:
> >
> >        https://patchwork.kernel.org/patch/90754/
> 
> With your twelve patches and the one linked to above applied to
> 2.6.34-rc5-git3, here are the warnings I see:
> 
> [    0.173969] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    0.174097] ---------------------------------------------------
> [    0.174226] include/linux/cgroup.h:534 invoked
> rcu_dereference_check() without protection!
> [    0.174429]
> [    0.174430] other info that might help us debug this:
> [    0.174431]
> [    0.174792]
> [    0.174793] rcu_scheduler_active = 1, debug_locks = 1
> [    0.175037] no locks held by watchdog/0/5.
> [    0.175162]
> [    0.175163] stack backtrace:
> [    0.175405] Pid: 5, comm: watchdog/0 Not tainted 2.6.34-rc5-git3 #22
> [    0.175534] Call Trace:
> [    0.175666]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [    0.175799]  [<ffffffff8102d678>] task_subsys_state+0x59/0x70
> [    0.175931]  [<ffffffff810328fa>] __sched_setscheduler+0x19d/0x300
> [    0.176064]  [<ffffffff8102b477>] ? need_resched+0x1e/0x28
> [    0.176196]  [<ffffffff813cd401>] ? schedule+0x5c3/0x66e
> [    0.176327]  [<ffffffff81091943>] ? watchdog+0x0/0x8c
> [    0.176457]  [<ffffffff81032a78>] sched_setscheduler+0xe/0x10
> [    0.176587]  [<ffffffff8109196d>] watchdog+0x2a/0x8c
> [    0.176677]  [<ffffffff81091943>] ? watchdog+0x0/0x8c
> [    0.176808]  [<ffffffff81057152>] kthread+0x89/0x91
> [    0.176939]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    0.177073]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    0.177204]  [<ffffffff813cfc40>] ? restore_args+0x0/0x30
> [    0.177334]  [<ffffffff810570c9>] ? kthread+0x0/0x91
> [    0.177463]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

According to Documentation/cgroups/cgroups.txt, we must hold cgroup_mutex,
hold the task's alloc_lock (via task_lock()), or be in an RCU read-side
critical section.  We are doing none of these.

I would argue that sched_setscheduler() should take care of
synchronization, but I am not sure which of these three is appropriate
for sched_setscheduler() to acquire.  Peter, thoughts?
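
The rule being checked here can be modelled in userspace as a rough sketch.
This is not the kernel's lockdep code: struct protection_state and the
*_sketch helpers are hypothetical names invented for illustration, and the
booleans merely stand in for "this lock is currently held".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what the caller currently holds. */
struct protection_state {
	bool cgroup_mutex_held;   /* stands in for holding cgroup_mutex */
	bool task_lock_held;      /* stands in for holding the task's lock */
	int rcu_read_nesting;     /* depth of nested RCU read-side sections */
};

/* Enter a (modelled) RCU read-side critical section. */
void rcu_read_lock_sketch(struct protection_state *s)
{
	s->rcu_read_nesting++;
}

/* Leave a (modelled) RCU read-side critical section. */
void rcu_read_unlock_sketch(struct protection_state *s)
{
	assert(s->rcu_read_nesting > 0);  /* unbalanced unlock is a bug */
	s->rcu_read_nesting--;
}

/* The access is legal if any one of the three protections is in place. */
bool cgroup_access_is_protected(const struct protection_state *s)
{
	return s->cgroup_mutex_held ||
	       s->task_lock_held ||
	       s->rcu_read_nesting > 0;
}
```

In this model the watchdog thread in the trace above corresponds to a state
with all three fields clear, so the check fails and the splat is printed.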

> [    3.173419] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    3.173419] ---------------------------------------------------
> [    3.173419] kernel/cgroup.c:4438 invoked rcu_dereference_check()
> without protection!
> [    3.173419]
> [    3.173419] other info that might help us debug this:
> [    3.173419]
> [    3.173419]
> [    3.173419] rcu_scheduler_active = 1, debug_locks = 1
> [    3.173419] 2 locks held by async/0/668:
> [    3.173419]  #0:  (&shost->scan_mutex){+.+.+.}, at:
> [<ffffffff812df020>] __scsi_add_device+0x83/0xe4
> [    3.173419]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> [<ffffffff811f2df9>] blkiocg_add_blkio_group+0x29/0x7f
> [    3.173419]
> [    3.173419] stack backtrace:
> [    3.173419] Pid: 668, comm: async/0 Not tainted 2.6.34-rc5-git3 #22
> [    3.173419] Call Trace:
> [    3.173419]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [    3.173419]  [<ffffffff8107f9ad>] css_id+0x3f/0x51
> [    3.173419]  [<ffffffff811f2e08>] blkiocg_add_blkio_group+0x38/0x7f
> [    3.173419]  [<ffffffff811f4dd0>] cfq_init_queue+0xdf/0x2dc
> [    3.173419]  [<ffffffff811e33b1>] elevator_init+0xba/0xf5
> [    3.173419]  [<ffffffff812dbfaa>] ? scsi_request_fn+0x0/0x451
> [    3.173419]  [<ffffffff811e68d7>] blk_init_queue_node+0x12f/0x135
> [    3.173419]  [<ffffffff811e68e9>] blk_init_queue+0xc/0xe
> [    3.173419]  [<ffffffff812dc41c>] __scsi_alloc_queue+0x21/0x111
> [    3.173419]  [<ffffffff812dc524>] scsi_alloc_queue+0x18/0x64
> [    3.173419]  [<ffffffff812de520>] scsi_alloc_sdev+0x19e/0x256
> [    3.173419]  [<ffffffff812de6be>] scsi_probe_and_add_lun+0xe6/0x9c5
> [    3.173419]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.173419]  [<ffffffff813ce056>] ? __mutex_lock_common+0x3e4/0x43a
> [    3.173419]  [<ffffffff812df020>] ? __scsi_add_device+0x83/0xe4
> [    3.173419]  [<ffffffff812d09dc>] ? transport_setup_classdev+0x0/0x17
> [    3.173419]  [<ffffffff812df020>] ? __scsi_add_device+0x83/0xe4
> [    3.173419]  [<ffffffff812df055>] __scsi_add_device+0xb8/0xe4
> [    3.173419]  [<ffffffff812ea945>] ata_scsi_scan_host+0x74/0x16e
> [    3.173419]  [<ffffffff81057699>] ? autoremove_wake_function+0x0/0x34
> [    3.173419]  [<ffffffff812e8de4>] async_port_probe+0xab/0xb7
> [    3.173419]  [<ffffffff8105e1b1>] ? async_thread+0x0/0x1f4
> [    3.173419]  [<ffffffff8105e2b6>] async_thread+0x105/0x1f4
> [    3.173419]  [<ffffffff81033d8e>] ? default_wake_function+0x0/0xf
> [    3.173419]  [<ffffffff8105e1b1>] ? async_thread+0x0/0x1f4
> [    3.173419]  [<ffffffff81057152>] kthread+0x89/0x91
> [    3.173419]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [    3.173419]  [<ffffffff81003994>] kernel_thread_helper+0x4/0x10
> [    3.173419]  [<ffffffff813cfc40>] ? restore_args+0x0/0x30
> [    3.173419]  [<ffffffff810570c9>] ? kthread+0x0/0x91
> [    3.173419]  [<ffffffff81003990>] ? kernel_thread_helper+0x0/0x10

Please see below for a patch for this based on my earlier conversation
with Vivek Goyal.  (Vivek, if you are already pushing a fix elsewhere,
please let me know, and I will drop my patch in favor of yours.)

> [   32.905446] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   32.905449] ---------------------------------------------------
> [   32.905453] net/core/dev.c:1993 invoked rcu_dereference_check()
> without protection!
> [   32.905456]
> [   32.905457] other info that might help us debug this:
> [   32.905458]
> [   32.905461]
> [   32.905462] rcu_scheduler_active = 1, debug_locks = 1
> [   32.905466] 2 locks held by canberra-gtk-pl/4182:
> [   32.905469]  #0:  (sk_lock-AF_INET){+.+.+.}, at:
> [<ffffffff81394f7d>] inet_stream_connect+0x3a/0x24d
> [   32.905483]  #1:  (rcu_read_lock_bh){.+....}, at:
> [<ffffffff8134a789>] dev_queue_xmit+0x14e/0x4b8
> [   32.905495]
> [   32.905496] stack backtrace:
> [   32.905500] Pid: 4182, comm: canberra-gtk-pl Not tainted 2.6.34-rc5-git3 #22
> [   32.905504] Call Trace:
> [   32.905512]  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   32.905518]  [<ffffffff8134a894>] dev_queue_xmit+0x259/0x4b8
> [   32.905524]  [<ffffffff8134a789>] ? dev_queue_xmit+0x14e/0x4b8
> [   32.905531]  [<ffffffff81041c66>] ? _local_bh_enable_ip+0xcd/0xda
> [   32.905538]  [<ffffffff813536da>] neigh_resolve_output+0x234/0x285
> [   32.905544]  [<ffffffff8136f69f>] ip_finish_output2+0x257/0x28c
> [   32.905549]  [<ffffffff8136f73c>] ip_finish_output+0x68/0x6a
> [   32.905554]  [<ffffffff81370433>] T.866+0x52/0x59
> [   32.905559]  [<ffffffff8137067e>] ip_output+0xaa/0xb4
> [   32.905565]  [<ffffffff8136eb38>] ip_local_out+0x20/0x24
> [   32.905571]  [<ffffffff8136f184>] ip_queue_xmit+0x309/0x368
> [   32.905578]  [<ffffffff810e4226>] ? __kmalloc_track_caller+0x111/0x155
> [   32.905585]  [<ffffffff8138316f>] ? tcp_connect+0x223/0x3d3
> [   32.905591]  [<ffffffff813818f1>] tcp_transmit_skb+0x707/0x745
> [   32.905597]  [<ffffffff813832c2>] tcp_connect+0x376/0x3d3
> [   32.905604]  [<ffffffff81268a43>] ? secure_tcp_sequence_number+0x55/0x6f
> [   32.905610]  [<ffffffff81387270>] tcp_v4_connect+0x3df/0x455
> [   32.905617]  [<ffffffff8133cb59>] ? lock_sock_nested+0xf3/0x102
> [   32.905623]  [<ffffffff81394fe7>] inet_stream_connect+0xa4/0x24d
> [   32.905629]  [<ffffffff8133b398>] sys_connect+0x90/0xd0
> [   32.905636]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> [   32.905642]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [   32.905649]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   32.905655]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

A fix for the above is already in Dave Miller's tree.

> [   51.912282] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   51.912285] ---------------------------------------------------
> [   51.912289] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   51.912293]
> [   51.912293] other info that might help us debug this:
> [   51.912295]
> [   51.912298]
> [   51.912298] rcu_scheduler_active = 1, debug_locks = 1
> [   51.912302] no locks held by wpa_supplicant/3951.
> [   51.912305]
> [   51.912306] stack backtrace:
> [   51.912310] Pid: 3951, comm: wpa_supplicant Not tainted 2.6.34-rc5-git3 #22
> [   51.912314] Call Trace:
> [   51.912317]  <IRQ>  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   51.912345]  [<ffffffffa014f9ae>]
> ieee80211_find_sta_by_hw+0x46/0x10f [mac80211]
> [   51.912358]  [<ffffffffa014fa8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   51.912373]  [<ffffffffa01e50f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   51.912380]  [<ffffffff8106842b>] ? mark_lock+0x2d/0x235
> [   51.912391]  [<ffffffffa0252f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   51.912399]  [<ffffffff8120a353>] ? is_swiotlb_buffer+0x2e/0x3b
> [   51.912407]  [<ffffffffa024bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   51.912414]  [<ffffffff81068904>] ? trace_hardirqs_on_caller+0xfa/0x13f
> [   51.912422]  [<ffffffffa024c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   51.912429]  [<ffffffff810411f3>] tasklet_action+0xa7/0x10f
> [   51.912435]  [<ffffffff81042205>] __do_softirq+0x144/0x252
> [   51.912442]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   51.912447]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   51.912452]  [<ffffffff81041cd2>] irq_exit+0x45/0x94
> [   51.912457]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
> [   51.912463]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912470]  [<ffffffff813cfb93>] ret_from_intr+0x0/0xf
> [   51.912474]  <EOI>  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912484]  [<ffffffff8106a75d>] ? lock_release+0x208/0x215
> [   51.912490]  [<ffffffff810cbc1c>] might_fault+0xac/0xb3
> [   51.912495]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> [   51.912501]  [<ffffffff812025e3>] __clear_user+0x15/0x59
> [   51.912508]  [<ffffffff8100b2bc>] save_i387_xstate+0x9c/0x1bc
> [   51.912515]  [<ffffffff81002276>] do_signal+0x240/0x686
> [   51.912521]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> [   51.912527]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> [   51.912533]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   51.912539]  [<ffffffff810026e3>] do_notify_resume+0x27/0x5f
> [   51.912545]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [   51.912551]  [<ffffffff81002e86>] int_signal+0x12/0x17

This is a repeat from last time that confused me at the time.  I could
do a hacky "fix" by putting an RCU read-side critical section around
the for_each_sta_info() in ieee80211_find_sta_by_hw(), but I do not
understand this code well enough to feel comfortable doing so.

Johannes, any enlightenment?

> [   51.929529] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   51.929532] ---------------------------------------------------
> [   51.929536] net/mac80211/sta_info.c:886 invoked
> rcu_dereference_check() without protection!
> [   51.929540]
> [   51.929541] other info that might help us debug this:
> [   51.929542]
> [   51.929545]
> [   51.929546] rcu_scheduler_active = 1, debug_locks = 1
> [   51.929550] 1 lock held by Xorg/4013:
> [   51.929553]  #0:  (clock-AF_UNIX){++.+..}, at: [<ffffffff8133cebd>]
> sock_def_readable+0x19/0x62
> [   51.929567]
> [   51.929568] stack backtrace:
> [   51.929573] Pid: 4013, comm: Xorg Not tainted 2.6.34-rc5-git3 #22
> [   51.929576] Call Trace:
> [   51.929579]  <IRQ>  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> [   51.929603]  [<ffffffffa014f9fe>]
> ieee80211_find_sta_by_hw+0x96/0x10f [mac80211]
> [   51.929615]  [<ffffffffa014fa8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> [   51.929631]  [<ffffffffa01e50f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> [   51.929642]  [<ffffffffa0252f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> [   51.929649]  [<ffffffff81068685>] ? mark_held_locks+0x52/0x70
> [   51.929656]  [<ffffffff813cf46c>] ? _raw_spin_unlock_irqrestore+0x3a/0x69
> [   51.929662]  [<ffffffff8120a353>] ? is_swiotlb_buffer+0x2e/0x3b
> [   51.929671]  [<ffffffffa024bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> [   51.929680]  [<ffffffffa024c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> [   51.929687]  [<ffffffff810411f3>] tasklet_action+0xa7/0x10f
> [   51.929693]  [<ffffffff81042205>] __do_softirq+0x144/0x252
> [   51.929700]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> [   51.929705]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> [   51.929711]  [<ffffffff81041cd2>] irq_exit+0x45/0x94
> [   51.929717]  [<ffffffff81019b10>] smp_apic_timer_interrupt+0x87/0x95
> [   51.929724]  [<ffffffff81003553>] apic_timer_interrupt+0x13/0x20
> [   51.929727]  <EOI>  [<ffffffff813cf46e>] ?
> _raw_spin_unlock_irqrestore+0x3c/0x69
> [   51.929739]  [<ffffffff8102d3fb>] __wake_up_sync_key+0x49/0x52
> [   51.929745]  [<ffffffff8133cee7>] sock_def_readable+0x43/0x62
> [   51.929751]  [<ffffffff813b1c61>] unix_stream_sendmsg+0x243/0x2e2
> [   51.929758]  [<ffffffff8133b912>] ? sock_aio_write+0x0/0xcf
> [   51.929764]  [<ffffffff81339342>] __sock_sendmsg+0x59/0x64
> [   51.929770]  [<ffffffff8133b9cd>] sock_aio_write+0xbb/0xcf
> [   51.929777]  [<ffffffff810e9909>] do_sync_readv_writev+0xbc/0xfb
> [   51.929785]  [<ffffffff811c1792>] ? selinux_file_permission+0xa2/0xaf
> [   51.929790]  [<ffffffff810e9690>] ? copy_from_user+0x2a/0x2c
> [   51.929797]  [<ffffffff811baff1>] ? security_file_permission+0x11/0x13
> [   51.929804]  [<ffffffff810ea6a6>] do_readv_writev+0xa2/0x122
> [   51.929810]  [<ffffffff810ead93>] ? fcheck_files+0x8f/0xc9
> [   51.929816]  [<ffffffff810ea764>] vfs_writev+0x3e/0x49
> [   51.929821]  [<ffffffff810ea84a>] sys_writev+0x45/0x8e
> [   51.929828]  [<ffffffff81002b6b>] system_call_fastpath+0x16/0x1b

Ditto.

						Thanx, Paul

------------------------------------------------------------------------

commit 0868dd631def762ba00c2f0f397a53c5cdf24ae2
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Sat Apr 24 19:23:30 2010 -0700

    block-cgroup: fix RCU-lockdep splat in blkiocg_add_blkio_group()
    
    It is necessary to be in an RCU read-side critical section when invoking
    css_id(), so this patch adds one to blkiocg_add_blkio_group().  This is
    actually a false positive, because this is called at initialization time,
    and hence always refers to the root cgroup, which cannot go away.
    
    Located-by: Miles Lane <miles.lane@gmail.com>
    Suggested-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5fe03de..55c8c73 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -71,7 +71,9 @@ void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 
 	spin_lock_irqsave(&blkcg->lock, flags);
 	rcu_assign_pointer(blkg->key, key);
+	rcu_read_lock();
 	blkg->blkcg_id = css_id(&blkcg->css);
+	rcu_read_unlock();
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	spin_unlock_irqrestore(&blkcg->lock, flags);
 #ifdef CONFIG_DEBUG_BLK_CGROUP

^ permalink raw reply related

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Paul E. McKenney @ 2010-04-25  2:36 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <u2qa44ae5cd1004232235i2e1cd2a0g634fc1d5d8c3f7c2@mail.gmail.com>

On Sat, Apr 24, 2010 at 01:35:01AM -0400, Miles Lane wrote:
> 2.6.34-rc5-git5 with all of your patches applied.
> 
> I reconfigured my kernel build options and got the following new issue:
> 
> [    2.686515] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    2.686519] ---------------------------------------------------
> [    2.686523] kernel/cgroup.c:4438 invoked rcu_dereference_check()
> without protection!
> [    2.686526]
> [    2.686527] other info that might help us debug this:
> [    2.686529]
> [    2.686532]
> [    2.686533] rcu_scheduler_active = 1, debug_locks = 1
> [    2.686537] 2 locks held by swapper/1:
> [    2.686540]  #0:  (mtd_table_mutex){+.+.+.}, at:
> [<ffffffff812d7714>] register_mtd_blktrans+0xa2/0x25e
> [    2.686555]  #1:  (&(&blkcg->lock)->rlock){......}, at:
> [<ffffffff811ca7bd>] blkiocg_add_blkio_group+0x29/0x7f
> [    2.686566]
> [    2.686567] stack backtrace:
> [    2.686572] Pid: 1, comm: swapper Not tainted 2.6.34-rc5-git5 #25
> [    2.686576] Call Trace:
> [    2.686584]  [<ffffffff810642da>] lockdep_rcu_dereference+0x9d/0xa5
> [    2.686591]  [<ffffffff8107af54>] css_id+0x3f/0x52
> [    2.686597]  [<ffffffff811ca7cc>] blkiocg_add_blkio_group+0x38/0x7f
> [    2.686603]  [<ffffffff811cc593>] cfq_init_queue+0xdf/0x2dc
> [    2.686609]  [<ffffffff811bb858>] elevator_init+0xba/0xf5
> [    2.686616]  [<ffffffff812d7046>] ? mtd_blktrans_request+0x0/0x1c
> [    2.686623]  [<ffffffff811c0b62>] blk_init_queue_node+0x12f/0x135
> [    2.686629]  [<ffffffff811c0b74>] blk_init_queue+0xc/0xe
> [    2.686635]  [<ffffffff812d7777>] register_mtd_blktrans+0x105/0x25e
> [    2.686642]  [<ffffffff818c0de9>] ? init_mtdblock+0x0/0x2c
> [    2.686648]  [<ffffffff818c0e13>] init_mtdblock+0x2a/0x2c
> [    2.686656]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> [    2.686663]  [<ffffffff818986a6>] kernel_init+0x160/0x1ea
> [    2.686669]  [<ffffffff81003814>] kernel_thread_helper+0x4/0x10
> [    2.686677]  [<ffffffff8140d77c>] ? restore_args+0x0/0x30
> [    2.686683]  [<ffffffff81898546>] ? kernel_init+0x0/0x1ea
> [    2.686688]  [<ffffffff81003810>] ? kernel_thread_helper+0x0/0x10
> [    2.687683] mtdoops: mtd device (mtddev=name/number) must be supplied

This should be covered by the patch I sent with my previous email.

And thank you again, Miles, for all the testing!!!

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  2:58 UTC (permalink / raw)
  To: alan; +Cc: e1000-devel, netdev
In-Reply-To: <20100424121127.084b9766@linux.intel.com>

From: Alan Cox <alan@linux.intel.com>
Date: Sat, 24 Apr 2010 12:11:27 +0100

> No idea why it won't apply - I guess net has diverged from -next in
> this area. Other problem is not typing "stg ref" before "stg export"

It has; the debug print statements above the lines you are changing are
completely different.

Please generate this patch against net-2.6 so I can apply it, thanks
Alan.

------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  3:00 UTC (permalink / raw)
  To: alan; +Cc: netdev, e1000-devel
In-Reply-To: <20100424113629.0ad3569b@linux.intel.com>

From: Alan Cox <alan@linux.intel.com>
Date: Sat, 24 Apr 2010 11:36:29 +0100

> Puzzling as it came from a building -next tree. Will see whats
> happened next week if I get time, but I'm afraid net stuff isn't a
> priority - in fact its disappointing that having diagnosed a bug
> months ago (which was the hard bit) and posted a test patch months
> ago the maintainers haven't fixed it.

It's disappointing to me that someone as experienced and skilled
as yourself can't generate a clean patch which is 1) against
the appropriate tree for a bug fix and 2) actually compiles.

Or is this too much to ask? :-)

^ permalink raw reply

* Re: [PATCH 2/2] sky2: add support for receive hashing (v3)
From: David Miller @ 2010-04-25  3:04 UTC (permalink / raw)
  To: shemminger; +Cc: jeff, netdev
In-Reply-To: <20100424162239.1aae32e0@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 24 Apr 2010 16:22:39 -0700

> Subject: sky2: add support for receive hashing
> 
> Sky2 hardware supports hardware receive hash calculation.
> Now that Receive Packet Steering is available, add support
> to enable it.
> 
> This version does not depend on CONFIG_RPS. Also set_flags rejects
> all values except RXHASH, so driver won't have to change next time
> somebody adds a new one.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks Stephen.
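
The "reject everything except RXHASH" behaviour described in the changelog
could look roughly like the sketch below.  This is a userspace illustration,
not the sky2 code: set_flags_sketch() and ETH_FLAG_RXHASH_SKETCH are
hypothetical names, and the bit value is only a placeholder assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder for the receive-hash flag bit (assumed value). */
#define ETH_FLAG_RXHASH_SKETCH (1u << 28)

/*
 * Accept a flags word only if no bit other than the receive-hash bit
 * is set.  Any flag added later is rejected automatically, so this
 * function does not need to change when new flags appear.
 */
int set_flags_sketch(uint32_t *features, uint32_t data)
{
	assert(features != NULL);

	if (data & ~ETH_FLAG_RXHASH_SKETCH)
		return -EOPNOTSUPP;

	*features = data;
	return 0;
}
```

Masking with the complement of the one supported bit is what makes the check
forward-compatible: unknown bits fail without the driver having to name them.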

^ permalink raw reply

* 2.6.33.2 networking regression
From: Dave Jones @ 2010-04-25  3:16 UTC (permalink / raw)
  To: netdev; +Cc: stable

Something odd happened when I upgraded my router
from 2.6.33.1 to .33.2.  Its internal NIC (a VIA Velocity)
stopped receiving packets.

dmesg was getting flooded with..

[  188.919957] via-velocity 0000:00:0e.0: BAR 0: set to [io  0xf800-0xf8ff] (PCI address [0xf800-0xf8ff]
[  188.920002] via-velocity 0000:00:0e.0: BAR 1: set to [mem 0xfdffe000-0xfdffe0ff] (PCI address [0xfdffe000-0xfdffe0ff]
[  203.913967] via-velocity 0000:00:0e.0: BAR 0: set to [io  0xf800-0xf8ff] (PCI address [0xf800-0xf8ff]
[  203.914181] via-velocity 0000:00:0e.0: BAR 1: set to [mem 0xfdffe000-0xfdffe0ff] (PCI address [0xfdffe000-0xfdffe0ff]

every so often for some reason.

rebooting back to .1, it works fine.

There don't appear to be any direct changes to via-velocity.c in the
diff, so I'm really confused. Any clues ? I'll bisect it, but it
probably won't be until Monday..

	Dave

^ permalink raw reply

* Re: 2.6.33.2 networking regression
From: David Miller @ 2010-04-25  3:19 UTC (permalink / raw)
  To: davej; +Cc: netdev, stable
In-Reply-To: <20100425031656.GA27598@redhat.com>

From: Dave Jones <davej@redhat.com>
Date: Sat, 24 Apr 2010 23:16:57 -0400

> There don't appear to be any direct changes to via-velocity.c in the
> diff, so I'm really confused. Any clues ? I'll bisect it, but it
> probably won't be until Monday..

Looks like some x86/PCI/ACPI change causes this, rather than a
networking change.

^ permalink raw reply

* Re: [PATCH] e100: Fix the TX workqueue race
From: David Miller @ 2010-04-25  4:10 UTC (permalink / raw)
  To: alan; +Cc: e1000-devel, netdev
In-Reply-To: <20100424.195859.193729555.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sat, 24 Apr 2010 19:58:59 -0700 (PDT)

> Please generate this patch against net-2.6 so I can apply it, thanks
> Alan.

Nevermind, I took care of this for you.


^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: David Miller @ 2010-04-25  5:51 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <1272122227-13070-1-git-send-email-xiaosuo@gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Sat, 24 Apr 2010 23:17:07 +0800

> optimize rps_get_cpu().
> 
> don't initialize ports when we can get the ports. one memory access for
> ports rather than two.
> 
> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

Applied, thanks.

We can load both addresses in one go on 64-bit btw.

It seems we're just duplicating, one by one, the optimizations
we already do in INET_COMBINED_PORTS() and INET_ADDR_COOKIE().
:-)
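
As a rough userspace illustration of "one memory access rather than two" for
the ports (not the actual rps_get_cpu() change): the source and destination
ports sit next to each other at the start of the transport header, so both
can be fetched with a single 32-bit load.  struct port_pair and the two
load_ports_* helpers are hypothetical names, and memcpy() is used so that the
sketch makes no assumption about the alignment of the header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two adjacent 16-bit ports, kept in the byte order found on the wire. */
struct port_pair {
	uint16_t sport;
	uint16_t dport;
};

static_assert(sizeof(struct port_pair) == 4, "port pair must be 4 bytes");

/* Baseline: two separate 16-bit loads from the transport header. */
struct port_pair load_ports_separately(const unsigned char *th)
{
	struct port_pair p;

	memcpy(&p.sport, th, sizeof(p.sport));
	memcpy(&p.dport, th + 2, sizeof(p.dport));
	return p;
}

/* One 32-bit load covering both ports, then reinterpreted as a pair. */
struct port_pair load_ports_combined(const unsigned char *th)
{
	uint32_t both;
	struct port_pair p;

	memcpy(&both, th, sizeof(both));
	memcpy(&p, &both, sizeof(p));
	return p;
}
```

Both helpers return the same pair for the same header bytes; the combined
form simply gives the compiler the chance to emit a single load.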

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: Changli Gao @ 2010-04-25  6:48 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <20100424.225128.52181685.davem@davemloft.net>

On Sun, Apr 25, 2010 at 1:51 PM, David Miller <davem@davemloft.net> wrote:
> From: Changli Gao <xiaosuo@gmail.com>
> Date: Sat, 24 Apr 2010 23:17:07 +0800
>
>> optimize rps_get_cpu().
>>
>> don't initialize ports when we can get the ports. one memory access for
>> ports rather than two.
>>
>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>
> Applied, thanks.
>
> We can load both addresses in one go on 64-bit btw.
>

Are they always aligned to 64-bit boundary? I don't think so.



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: David Miller @ 2010-04-25  7:38 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <z2h412e6f7f1004242348ibd5f96a2i7cd0557c37bb0921@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Sun, 25 Apr 2010 14:48:49 +0800

> Are they always aligned to 64-bit boundary? I don't think so.

If not, then the TCP stack should have been crashing for the past 15 years.

^ permalink raw reply

* [PATCH net-next-2.6] netns: rename unregister_pernet_subsys parameter
From: Jiri Pirko @ 2010-04-25  7:41 UTC (permalink / raw)
  To: netdev; +Cc: davem, ebiederm

Stay consistent with other functions and with the comment, and name the
pernet_operations parameter properly.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index bd8c471..69a20bf 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -469,10 +469,10 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
  *	addition run the exit method for all existing network
  *	namespaces.
  */
-void unregister_pernet_subsys(struct pernet_operations *module)
+void unregister_pernet_subsys(struct pernet_operations *ops)
 {
 	mutex_lock(&net_mutex);
-	unregister_pernet_operations(module);
+	unregister_pernet_operations(ops);
 	mutex_unlock(&net_mutex);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_subsys);

^ permalink raw reply related

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: Johannes Berg @ 2010-04-25  7:45 UTC (permalink / raw)
  To: paulmck
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100425023455.GM2440@linux.vnet.ibm.com>

On Sat, 2010-04-24 at 19:34 -0700, Paul E. McKenney wrote:

> > [   51.912282] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   51.912285] ---------------------------------------------------
> > [   51.912289] net/mac80211/sta_info.c:886 invoked
> > rcu_dereference_check() without protection!
> > [   51.912293]
> > [   51.912293] other info that might help us debug this:
> > [   51.912295]
> > [   51.912298]
> > [   51.912298] rcu_scheduler_active = 1, debug_locks = 1
> > [   51.912302] no locks held by wpa_supplicant/3951.
> > [   51.912305]
> > [   51.912306] stack backtrace:
> > [   51.912310] Pid: 3951, comm: wpa_supplicant Not tainted 2.6.34-rc5-git3 #22
> > [   51.912314] Call Trace:
> > [   51.912317]  <IRQ>  [<ffffffff81067fbe>] lockdep_rcu_dereference+0x9d/0xa5
> > [   51.912345]  [<ffffffffa014f9ae>]
> > ieee80211_find_sta_by_hw+0x46/0x10f [mac80211]
> > [   51.912358]  [<ffffffffa014fa8e>] ieee80211_find_sta+0x17/0x19 [mac80211]
> > [   51.912373]  [<ffffffffa01e50f2>] iwl_tx_queue_reclaim+0xdb/0x1b1 [iwlcore]
> > [   51.912380]  [<ffffffff8106842b>] ? mark_lock+0x2d/0x235
> > [   51.912391]  [<ffffffffa0252f1c>] iwl5000_rx_reply_tx+0x4a9/0x556 [iwlagn]
> > [   51.912399]  [<ffffffff8120a353>] ? is_swiotlb_buffer+0x2e/0x3b
> > [   51.912407]  [<ffffffffa024bbf4>] iwl_rx_handle+0x163/0x2b5 [iwlagn]
> > [   51.912414]  [<ffffffff81068904>] ? trace_hardirqs_on_caller+0xfa/0x13f
> > [   51.912422]  [<ffffffffa024c3ac>] iwl_irq_tasklet+0x2bb/0x3c0 [iwlagn]
> > [   51.912429]  [<ffffffff810411f3>] tasklet_action+0xa7/0x10f
> > [   51.912435]  [<ffffffff81042205>] __do_softirq+0x144/0x252
> > [   51.912442]  [<ffffffff81003a8c>] call_softirq+0x1c/0x34
> > [   51.912447]  [<ffffffff810050e4>] do_softirq+0x38/0x80
> > [   51.912452]  [<ffffffff81041cd2>] irq_exit+0x45/0x94
> > [   51.912457]  [<ffffffff81004829>] do_IRQ+0xad/0xc4
> > [   51.912463]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> > [   51.912470]  [<ffffffff813cfb93>] ret_from_intr+0x0/0xf
> > [   51.912474]  <EOI>  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> > [   51.912484]  [<ffffffff8106a75d>] ? lock_release+0x208/0x215
> > [   51.912490]  [<ffffffff810cbc1c>] might_fault+0xac/0xb3
> > [   51.912495]  [<ffffffff810cbbd3>] ? might_fault+0x63/0xb3
> > [   51.912501]  [<ffffffff812025e3>] __clear_user+0x15/0x59
> > [   51.912508]  [<ffffffff8100b2bc>] save_i387_xstate+0x9c/0x1bc
> > [   51.912515]  [<ffffffff81002276>] do_signal+0x240/0x686
> > [   51.912521]  [<ffffffff81002b9c>] ? sysret_check+0x27/0x62
> > [   51.912527]  [<ffffffff8106891e>] ? trace_hardirqs_on_caller+0x114/0x13f
> > [   51.912533]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [   51.912539]  [<ffffffff810026e3>] do_notify_resume+0x27/0x5f
> > [   51.912545]  [<ffffffff813cec80>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > [   51.912551]  [<ffffffff81002e86>] int_signal+0x12/0x17
> 
> This is a repeat from last time that confused me at the time.  I could
> do a hacky "fix" by putting an RCU read-side critical section around
> the for_each_sta_info() in ieee80211_find_sta_by_hw(), but I do not
> understand this code well enough to feel comfortable doing so.
> 
> Johannes, any enlightenment?

The station locking is a tad confusing, but I've added the right
annotations already, should be coming to a kernel near you soon (i.e.
are in net-2.6 right now).

johannes


^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: David Miller @ 2010-04-25  7:48 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <20100425.003834.147984813.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Sun, 25 Apr 2010 00:38:34 -0700 (PDT)

> From: Changli Gao <xiaosuo@gmail.com>
> Date: Sun, 25 Apr 2010 14:48:49 +0800
> 
>> Are they always aligned to 64-bit boundary? I don't think so.
> 
> > If not, then the TCP stack should have been crashing for the past 15 years.

Nevermind, currently we only depend upon the addresses in struct sock
being 64-bit aligned not the protocol headers.

It shouldn't be hard to make the protocol header addresses 64-bit
aligned too.  Simply setting the default NET_IP_ALIGN to '6' instead
of '2' ought to be sufficient.

skb->data upon alloc_skb() is 64-bit aligned.

So if we skb_reserve(NET_IP_ALIGN '6'), then we have the ethernet
header (14 bytes).  And since 'saddr' is 12 bytes into struct iphdr it
will be (6 + 14 + 12) == 32 bytes in from the original 64-bit aligned
skb->data.

Therefore, since skb->data is 64-bit aligned, skb->data plus a
multiple of 8 (which 32 is) will also be 64-bit aligned, and that
means iph->saddr will be 64-bit aligned.
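
As a rough check of the arithmetic above, here is a minimal user-space
sketch. The constant and function names are made up for illustration and
are not kernel identifiers; the only inputs are the numbers quoted in this
mail (14-byte ethernet header, 'saddr' 12 bytes into struct iphdr), and it
assumes skb->data starts out 64-bit aligned as stated.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; the names are local to this sketch. */
enum {
	SKETCH_ETH_HLEN     = 14, /* ethernet header length in bytes */
	SKETCH_SADDR_OFFSET = 12, /* offset of saddr within struct iphdr */
};

/* Byte offset of iph->saddr from the (assumed) 64-bit aligned
 * skb->data, for a given skb_reserve() pad. */
static size_t saddr_offset(size_t ip_align)
{
	return ip_align + SKETCH_ETH_HLEN + SKETCH_SADDR_OFFSET;
}

/* Nonzero if that offset is a multiple of 8 bytes. */
static int saddr_is_64bit_aligned(size_t ip_align)
{
	return saddr_offset(ip_align) % 8 == 0;
}

/* Encodes the two cases discussed: a pad of 6 versus a pad of 2. */
static void check_untagged_frames(void)
{
	assert(saddr_offset(6) == 32);
	assert(saddr_is_64bit_aligned(6));
	assert(saddr_offset(2) == 28);
	assert(!saddr_is_64bit_aligned(2));
}
```

With a pad of 2 the offset is 28, which is only 32-bit aligned; a pad of 6
moves it to 32, a multiple of 8.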

^ permalink raw reply

* Re: [PATCH] RCU: don't turn off lockdep when find suspicious rcu_dereference_check() usage
From: David Miller @ 2010-04-25  7:49 UTC (permalink / raw)
  To: johannes
  Cc: paulmck, miles.lane, vgoyal, eparis, laijs, mingo, peterz,
	linux-kernel, nauman, eric.dumazet, netdev, jens.axboe,
	guijianfeng, lizf
In-Reply-To: <1272181534.3614.1.camel@jlt3.sipsolutions.net>

From: Johannes Berg <johannes@sipsolutions.net>
Date: Sun, 25 Apr 2010 09:45:34 +0200

> The station locking is a tad confusing, but I've added the right
> annotations already, should be coming to a kernel near you soon (i.e.
> are in net-2.6 right now).

Linus took in everything I have so it should be in Linus's tree
by now.

^ permalink raw reply

* Re: [PATCH net-next-2.6] netns: rename unregister_pernet_subsys parameter
From: David Miller @ 2010-04-25  7:50 UTC (permalink / raw)
  To: jpirko; +Cc: netdev, ebiederm
In-Reply-To: <20100425074138.GA2866@psychotron.redhat.com>

From: Jiri Pirko <jpirko@redhat.com>
Date: Sun, 25 Apr 2010 09:41:39 +0200

> Stay consistent with other functions and with comment also and name
> pernet_operations parameter properly.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>

Applied, thanks Jiri.

^ permalink raw reply

* Re: [PATCH v3] rps: optimize rps_get_cpu()
From: Changli Gao @ 2010-04-25  8:03 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, eric.dumazet, netdev
In-Reply-To: <20100425.004842.225645379.davem@davemloft.net>

On Sun, Apr 25, 2010 at 3:48 PM, David Miller <davem@davemloft.net> wrote:
>
> Never mind, currently we only depend upon the addresses in struct sock
> being 64-bit aligned, not the protocol headers.
>
> It shouldn't be hard to make the protocol header addresses 64-bit
> aligned too.  Simply setting the default NET_IP_ALIGN to '6' instead
> of '2' ought to be sufficient.
>
> skb->data upon alloc_skb() is 64-bit aligned.
>
> So if we skb_reserve(NET_IP_ALIGN '6'), then we have the ethernet
> header (14 bytes).  And since 'saddr' is 12 bytes into struct iphdr it
> will be (6 + 14 + 12) == 32 bytes in from the original 64-bit aligned
> skb->data.
>
> Therefore, since skb->data is 64-bit aligned, skb->data plus a
> multiple of 8 (which 32 is) will also be 64-bit aligned, and that
> means iph->saddr will be 64-bit aligned.
>

But if there is a VLAN header, an extra 4 bytes are added to the
ethernet header, so the addresses aren't aligned to a 64-bit boundary
when we set NET_IP_ALIGN to 6.
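
The same arithmetic with a VLAN tag can be sketched in user space as
follows. The names here are hypothetical and only illustrate the point;
the 4-byte tag size and the 14/12-byte figures come from this thread, and
skb->data is assumed to start 64-bit aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; the names are local to this sketch. */
enum {
	VSKETCH_ETH_HLEN     = 14, /* ethernet header length in bytes */
	VSKETCH_VLAN_HLEN    = 4,  /* extra bytes per VLAN tag */
	VSKETCH_SADDR_OFFSET = 12, /* offset of saddr within struct iphdr */
};

/* Byte offset of iph->saddr from the (assumed) 64-bit aligned
 * skb->data, for a given pad and number of VLAN tags. */
static size_t tagged_saddr_offset(size_t ip_align, unsigned int vlan_tags)
{
	return ip_align + VSKETCH_ETH_HLEN +
	       (size_t)vlan_tags * VSKETCH_VLAN_HLEN + VSKETCH_SADDR_OFFSET;
}

/* Encodes the untagged and single-tagged cases for pads of 6 and 2. */
static void check_tagged_frames(void)
{
	assert(tagged_saddr_offset(6, 0) % 8 == 0); /* 32 bytes in */
	assert(tagged_saddr_offset(6, 1) == 36);
	assert(tagged_saddr_offset(6, 1) % 8 == 4); /* no longer 64-bit aligned */
	assert(tagged_saddr_offset(2, 1) % 8 == 0); /* 32 bytes in */
}
```

So a pad of 6 aligns only untagged frames and a pad of 2 aligns only
single-tagged ones; under these assumptions no single fixed pad covers
both.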


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [RFC][PATCH v4 04/18] Add a function make external buffer owner to query capability.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-3-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

The external buffer owner can use this function to query
the capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    2 +
 net/core/dev.c            |   51 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3a1583b..2f9a4f2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
 extern int netdev_mp_port_attach(struct net_device *dev,
 				 struct mpassthru_port *port);
 extern void netdev_mp_port_detach(struct net_device *dev);
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mpassthru_port *port);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6a73fc7..4972bc4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2492,6 +2492,57 @@ void netdev_mp_port_detach(struct net_device *dev)
 }
 EXPORT_SYMBOL(netdev_mp_port_detach);
 
+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we query the driver for the capabilities it can provide,
+ * especially for packet split mode. For now we only query
+ * the header size and the payload a descriptor may carry.
+ * If a driver does not implement the hook to export these,
+ * we fall back to default values, currently taken from the
+ * IGB driver. For now, this is only called by the mpassthru
+ * device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+			struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	/* needed by packet split */
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v4 18/18] Provides multiple submits and async notifications
From: xiaohui.xin @ 2010-04-25  9:20 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-17-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

    Provides multiple submits and asynchronous notifications.

    The vhost-net backend currently supports only synchronous send/recv
    operations. The patch provides multiple submits and asynchronous
    notifications. This is needed for the zero-copy case.

    Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
---
 drivers/vhost/net.c   |  236 +++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.c |  120 ++++++++++++++-----------
 drivers/vhost/vhost.h |   14 +++
 3 files changed, 314 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 38989d1..18f6c41 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -23,6 +23,8 @@
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
 #include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>
 
 #include <net/sock.h>
 
@@ -48,6 +50,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache       *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -92,11 +95,138 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }
 
+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+	return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (!is_async_vq(vq))
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+		vq->log : NULL;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		/* when log is enabled, recomputing the log info is needed,
+		 * since these buffers are in async queue, and may not get
+		 * the log info before.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						    ARRAY_SIZE(vq->iov),
+						    &out, &in, vq_log,
+						    &log, head);
+			vhost_log_write(vq, vq_log, log, size);
+		}
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (!is_async_vq(vq))
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+				 struct vhost_virtqueue *vq,
+				 unsigned head)
+{
+	struct kiocb *iocb = NULL;
+
+	if (!is_async_vq(vq))
+		return NULL;
+
+	iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+	if (!iocb)
+		return NULL;
+	iocb->private = vq;
+	iocb->ki_pos = head;
+	iocb->ki_dtor = handle_iocb;
+	if (vq == &net->dev.vqs[VHOST_NET_VQ_RX]) {
+		iocb->ki_user_data = vq->num;
+		iocb->ki_iovec = vq->hdr;
+	}
+	return iocb;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, s;
 	struct msghdr msg = {
 		.msg_name = NULL,
@@ -129,6 +259,8 @@ static void handle_tx(struct vhost_net *net)
 		tx_poll_stop(net);
 	hdr_size = vq->hdr_size;
 
+	handle_async_tx_events_notify(net, vq);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -156,6 +288,13 @@ static void handle_tx(struct vhost_net *net)
 		/* Skip header. TODO: support TSO. */
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
 		msg.msg_iovlen = out;
+
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		len = iov_length(vq->iov, out);
 		/* Sanity check */
 		if (!len) {
@@ -165,12 +304,18 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
-		err = sock->ops->sendmsg(NULL, sock, &msg, len);
+		err = sock->ops->sendmsg(iocb, sock, &msg, len);
 		if (unlikely(err < 0)) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_vq_desc(vq);
 			tx_poll_start(net, sock);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		if (err != len)
 			pr_err("Truncated TX packet: "
 			       " len %d != %zd\n", err, len);
@@ -182,6 +327,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_tx_events_notify(net, vq);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -191,6 +338,7 @@ static void handle_tx(struct vhost_net *net)
 static void handle_rx(struct vhost_net *net)
 {
 	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
 	unsigned head, out, in, log, s;
 	struct vhost_log *vq_log;
 	struct msghdr msg = {
@@ -211,7 +359,8 @@ static void handle_rx(struct vhost_net *net)
 	int err;
 	size_t hdr_size;
 	struct socket *sock = rcu_dereference(vq->private_data);
-	if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+	if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+			vq->link_state == VHOST_VQ_LINK_SYNC))
 		return;
 
 	use_mm(net->dev.mm);
@@ -219,9 +368,17 @@ static void handle_rx(struct vhost_net *net)
 	vhost_disable_notify(vq);
 	hdr_size = vq->hdr_size;
 
+	/* In the async case, buffers submitted before write logging was
+	 * enabled have no log info, so when logging is enabled we
+	 * recompute the log info as needed. We do this in
+	 * handle_async_rx_events_notify().
+	 */
+
 	vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
 		vq->log : NULL;
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	for (;;) {
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
@@ -250,6 +407,13 @@ static void handle_rx(struct vhost_net *net)
 		s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
 		msg.msg_iovlen = in;
 		len = iov_length(vq->iov, in);
+
+		if (is_async_vq(vq)) {
+			iocb = create_iocb(net, vq, head);
+			if (!iocb)
+				break;
+		}
+
 		/* Sanity check */
 		if (!len) {
 			vq_err(vq, "Unexpected header len for RX: "
@@ -257,13 +421,20 @@ static void handle_rx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
-		err = sock->ops->recvmsg(NULL, sock, &msg,
+
+		err = sock->ops->recvmsg(iocb, sock, &msg,
 					 len, MSG_DONTWAIT | MSG_TRUNC);
 		/* TODO: Check specific error and bomb out unless EAGAIN? */
 		if (err < 0) {
+			if (is_async_vq(vq))
+				kmem_cache_free(net->cache, iocb);
 			vhost_discard_vq_desc(vq);
 			break;
 		}
+
+		if (is_async_vq(vq))
+			continue;
+
 		/* TODO: Should check and handle checksum. */
 		if (err > len) {
 			pr_err("Discarded truncated rx packet: "
@@ -289,6 +460,8 @@ static void handle_rx(struct vhost_net *net)
 		}
 	}
 
+	handle_async_rx_events_notify(net, vq, sock);
+
 	mutex_unlock(&vq->mutex);
 	unuse_mm(net->dev.mm);
 }
@@ -342,6 +515,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
 	vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
 	n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+	n->cache = NULL;
 
 	f->private_data = n;
 
@@ -405,6 +579,18 @@ static void vhost_net_flush(struct vhost_net *n)
 	vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
 }
 
+static void vhost_async_cleanup(struct vhost_net *n)
+{
+	/* clean the notifier */
+	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+	struct kiocb *iocb = NULL;
+	if (n->cache) {
+		while ((iocb = notify_dequeue(vq)) != NULL)
+			kmem_cache_free(n->cache, iocb);
+		kmem_cache_destroy(n->cache);
+	}
+}
+
 static int vhost_net_release(struct inode *inode, struct file *f)
 {
 	struct vhost_net *n = f->private_data;
@@ -421,6 +607,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_net_flush(n);
+	vhost_async_cleanup(n);
 	kfree(n);
 	return 0;
 }
@@ -472,21 +659,58 @@ static struct socket *get_tap_socket(int fd)
 	return sock;
 }
 
-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+	struct file *file = fget(fd);
+	struct socket *sock;
+	if (!file)
+		return ERR_PTR(-EBADF);
+	sock = mp_get_socket(file);
+	if (IS_ERR(sock))
+		fput(file);
+	return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+				 enum vhost_vq_link_state *state)
 {
 	struct socket *sock;
 	/* special case to disable backend */
 	if (fd == -1)
 		return NULL;
+
+	*state = VHOST_VQ_LINK_SYNC;
+
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
+	sock = get_mp_socket(fd);
+	if (!IS_ERR(sock)) {
+		*state = VHOST_VQ_LINK_ASYNC;
+		return sock;
+	}
 	return ERR_PTR(-ENOTSOCK);
 }
 
+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+	struct vhost_virtqueue *vq = n->vqs + index;
+
+	WARN_ON(!mutex_is_locked(&vq->mutex));
+	if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+		INIT_LIST_HEAD(&vq->notifier);
+		spin_lock_init(&vq->notify_lock);
+		if (!n->cache) {
+			n->cache = kmem_cache_create("vhost_kiocb",
+					sizeof(struct kiocb), 0,
+					SLAB_HWCACHE_ALIGN, NULL);
+		}
+	}
+}
+
 static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 {
 	struct socket *sock, *oldsock;
@@ -510,12 +734,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = -EFAULT;
 		goto err_vq;
 	}
-	sock = get_socket(fd);
+	sock = get_socket(vq, fd, &vq->link_state);
 	if (IS_ERR(sock)) {
 		r = PTR_ERR(sock);
 		goto err_vq;
 	}
 
+	vhost_init_link_state(n, index);
+
 	/* start polling new socket */
 	oldsock = vq->private_data;
 	if (sock == oldsock)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 3f10194..b39e47c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -860,61 +860,17 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 	return 0;
 }
 
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
-			   struct iovec iov[], unsigned int iov_size,
-			   unsigned int *out_num, unsigned int *in_num,
-			   struct vhost_log *log, unsigned int *log_num)
+/* This computes the log info according to the index of buffer */
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			     struct iovec iov[], unsigned int iov_size,
+			     unsigned int *out_num, unsigned int *in_num,
+			     struct vhost_log *log, unsigned int *log_num,
+			     unsigned int head)
 {
 	struct vring_desc desc;
-	unsigned int i, head, found = 0;
-	u16 last_avail_idx;
+	unsigned int i = head, found = 0;
 	int ret;
 
-	/* Check it isn't doing very strange things with descriptor numbers. */
-	last_avail_idx = vq->last_avail_idx;
-	if (get_user(vq->avail_idx, &vq->avail->idx)) {
-		vq_err(vq, "Failed to access avail idx at %p\n",
-		       &vq->avail->idx);
-		return vq->num;
-	}
-
-	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
-		vq_err(vq, "Guest moved used index from %u to %u",
-		       last_avail_idx, vq->avail_idx);
-		return vq->num;
-	}
-
-	/* If there's nothing new since last we looked, return invalid. */
-	if (vq->avail_idx == last_avail_idx)
-		return vq->num;
-
-	/* Only get avail ring entries after they have been exposed by guest. */
-	smp_rmb();
-
-	/* Grab the next descriptor number they're advertising, and increment
-	 * the index we've seen. */
-	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
-		vq_err(vq, "Failed to read head: idx %d address %p\n",
-		       last_avail_idx,
-		       &vq->avail->ring[last_avail_idx % vq->num]);
-		return vq->num;
-	}
-
-	/* If their number is silly, that's an error. */
-	if (head >= vq->num) {
-		vq_err(vq, "Guest says index %u > %u is available",
-		       head, vq->num);
-		return vq->num;
-	}
-
-	/* When we start there are none of either input nor output. */
 	*out_num = *in_num = 0;
 	if (unlikely(log))
 		*log_num = 0;
@@ -978,8 +934,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 			*out_num += ret;
 		}
 	} while ((i = next_desc(&desc)) != -1);
+	return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access.  Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+			   struct iovec iov[], unsigned int iov_size,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num)
+{
+	struct vring_desc desc;
+	unsigned int i, head, found = 0;
+	u16 last_avail_idx;
+	int ret;
+
+	/* Check it isn't doing very strange things with descriptor numbers. */
+	last_avail_idx = vq->last_avail_idx;
+	if (get_user(vq->avail_idx, &vq->avail->idx)) {
+		vq_err(vq, "Failed to access avail idx at %p\n",
+		       &vq->avail->idx);
+		return vq->num;
+	}
+
+	if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+		vq_err(vq, "Guest moved used index from %u to %u",
+		       last_avail_idx, vq->avail_idx);
+		return vq->num;
+	}
+
+	/* If there's nothing new since last we looked, return invalid. */
+	if (vq->avail_idx == last_avail_idx)
+		return vq->num;
+
+	/* Only get avail ring entries after they have been exposed by guest. */
+	smp_rmb();
+
+	/* Grab the next descriptor number they're advertising, and increment
+	 * the index we've seen. */
+	if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+		vq_err(vq, "Failed to read head: idx %d address %p\n",
+		       last_avail_idx,
+		       &vq->avail->ring[last_avail_idx % vq->num]);
+		return vq->num;
+	}
+
+	/* If their number is silly, that's an error. */
+	if (head >= vq->num) {
+		vq_err(vq, "Guest says index %u > %u is available",
+		       head, vq->num);
+		return vq->num;
+	}
+
+	ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+				  out_num, in_num,
+				  log, log_num, head);
 
 	/* On success, increment avail index. */
+	if (ret == vq->num)
+		return ret;
 	vq->last_avail_idx++;
 	return head;
 }
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..3c9cbce 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
 	u64 len;
 };
 
+enum vhost_vq_link_state {
+	VHOST_VQ_LINK_SYNC = 0,
+	VHOST_VQ_LINK_ASYNC = 1,
+};
+
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
 	struct vhost_dev *dev;
@@ -96,6 +101,10 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log log[VHOST_NET_MAX_SG];
+	/* Differentiate async socket for 0-copy from normal */
+	enum vhost_vq_link_state link_state;
+	struct list_head notifier;
+	spinlock_t notify_lock;
 };
 
 struct vhost_dev {
@@ -124,6 +133,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
 			   struct iovec iov[], unsigned int iov_count,
 			   unsigned int *out_num, unsigned int *in_num,
 			   struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+			   struct iovec iov[], unsigned int iov_count,
+			   unsigned int *out_num, unsigned int *in_num,
+			   struct vhost_log *log, unsigned int *log_num,
+			   unsigned int head);
 void vhost_discard_vq_desc(struct vhost_virtqueue *);
 
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v4 01/18] Add a new struct for device to manipulate external buffer.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |   19 ++++++++++++++++++-
 1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c79a88b..bf79756 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
 	unsigned long		tx_dropped;
 } ____cacheline_aligned_in_smp;
 
+/* Add a structure to struct net_device; the new field is
+ * named mp_port. It is for mediate passthru (zero-copy).
+ * It contains the capabilities of the net device driver,
+ * a socket, and an external buffer creator. "External" means
+ * the skb buffers belonging to the device may not be
+ * allocated from kernel space.
+ */
+struct mpassthru_port	{
+	int		hdr_len;
+	int		data_len;
+	int		npages;
+	unsigned	flags;
+	struct socket	*sock;
+	struct skb_external_page *(*ctor)(struct mpassthru_port *,
+				struct sk_buff *, int);
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
 	struct macvlan_port	*macvlan_port;
 	/* GARP */
 	struct garp_port	*garp_port;
-
+	/* mpassthru */
+	struct mpassthru_port	*mp_port;
 	/* class/net/name entry */
 	struct device		dev;
 	/* space for optional device, statistics, and wireless sysfs groups */
-- 
1.5.4.4

^ permalink raw reply related

* [RFC][PATCH v4 02/18] Export 2 func for device to assign/dassign new structure.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-1-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Export two functions for a device to assign/deassign the new structure.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    3 +++
 net/core/dev.c            |   28 ++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf79756..5c473fb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  gro_result_t ret);
 extern struct sk_buff *	napi_frags_skb(struct napi_struct *napi);
 extern gro_result_t	napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+				 struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index e5972f7..6a73fc7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2464,6 +2464,34 @@ void netif_nit_deliver(struct sk_buff *skb)
 	rcu_read_unlock();
 }
 
+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+			struct mpassthru_port *port)
+{
+	/* locked by mp_mutex */
+	if (rcu_dereference(dev->mp_port))
+		return -EBUSY;
+
+	rcu_assign_pointer(dev->mp_port, port);
+
+	return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+	/* locked by mp_mutex */
+	if (!rcu_dereference(dev->mp_port))
+		return;
+
+	rcu_assign_pointer(dev->mp_port, NULL);
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
 /**
  *	netif_receive_skb - process receive buffer from network
  *	@skb: buffer to process
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v4 03/18] Add a ndo_mp_port_prep pointer to net_device_ops.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-2-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

If the driver wants to allocate external buffers,
it can export its capabilities, such as the skb
buffer header length and the page length available for DMA.
The owner of the external buffers may use this.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c473fb..3a1583b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
 	int			(*ndo_fcoe_get_wwn)(struct net_device *dev,
 						    u64 *wwn, int type);
 #endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+	int			(*ndo_mp_port_prep)(struct net_device *dev,
+						struct mpassthru_port *port);
+#endif
 };

 /*
-- 
1.5.4.4

^ permalink raw reply related

* [RFC][PATCH v4 05/18] Add a function to indicate if device use external buffer.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-4-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/netdevice.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2f9a4f2..a1a2aaf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,13 @@ extern void netdev_mp_port_detach(struct net_device *dev);
 int netdev_mp_port_prep(struct net_device *dev,
 			struct mpassthru_port *port);
 
+static inline int dev_is_mpassthru(struct net_device *dev)
+{
+	if (dev && dev->mp_port)
+		return 1;
+	return 0;
+}
+
 static inline void napi_free_frags(struct napi_struct *napi)
 {
 	kfree_skb(napi->skb);
-- 
1.5.4.4


^ permalink raw reply related

* [RFC][PATCH v4 06/18] Add interface to get external buffers.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-5-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Currently, it can get external buffers from the mp device.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---
 include/linux/skbuff.h |   12 ++++++++++++
 net/core/skbuff.c      |   16 ++++++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3104e7d..96799f5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1525,6 +1525,18 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
 	__free_page(page);
 }
 
+extern struct skb_external_page *netdev_alloc_external_pages(
+					struct net_device *dev,
+					struct sk_buff *skb, int npages);
+
+static inline struct skb_external_page *netdev_alloc_external_page(
+		struct net_device *dev,
+		struct sk_buff *skb, unsigned int size)
+{
+	return netdev_alloc_external_pages(dev, skb,
+					   DIV_ROUND_UP(size, PAGE_SIZE));
+}
+
 /**
  *	skb_clone_writable - is the header of a clone writable
  *	@skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..6345acc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -278,6 +278,22 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(__netdev_alloc_page);
 
+struct skb_external_page *netdev_alloc_external_pages(struct net_device *dev,
+			struct sk_buff *skb, int npages)
+{
+	struct mpassthru_port *port;
+	struct skb_external_page *ext_page = NULL;
+
+	port = rcu_dereference(dev->mp_port);
+	if (!port)
+		goto out;
+	BUG_ON(npages > port->npages);
+	ext_page = port->ctor(port, skb, npages);
+out:
+	return ext_page;
+}
+EXPORT_SYMBOL(netdev_alloc_external_pages);
+
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {
-- 
1.5.4.4


* [RFC][PATCH v4 07/18] Make __alloc_skb() to get external buffer.
From: xiaohui.xin @ 2010-04-25  9:19 UTC (permalink / raw)
  To: netdev, kvm, linux-kernel, mst, mingo, davem, jdike; +Cc: Xin Xiaohui
In-Reply-To: <1272187206-18534-6-git-send-email-xiaohui.xin@intel.com>

From: Xin Xiaohui <xiaohui.xin@intel.com>

Add a dev parameter to __alloc_skb(). When an external buffer is
obtained, skb->data points to it, skb->head is recomputed, the
shinfo of the external buffer is maintained, and the external
buffer info is recorded in the destructor_arg field.

Signed-off-by: Xin Xiaohui <xiaohui.xin@intel.com>
Signed-off-by: Zhao Yu <yzhao81@gmail.com>
Reviewed-by: Jeff Dike <jdike@linux.intel.com>
---

	__alloc_skb() cleanup by

	Jeff Dike <jdike@linux.intel.com>

 include/linux/skbuff.h |    7 ++++---
 net/core/skbuff.c      |   43 +++++++++++++++++++++++++++++++++++++------
 2 files changed, 41 insertions(+), 9 deletions(-)
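
As an illustration only, the "try the external buffer first, then fall
back to a kernel allocation" flow that this patch adds to __alloc_skb()
can be sketched in userspace as follows. All names here (struct ext_buf,
struct buf, buf_alloc(), buf_free(), the toy providers) are hypothetical,
malloc() stands in for kmalloc_node_track_caller(), and the shinfo
handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct skb_external_page. */
struct ext_buf {
	unsigned char *start;
	size_t size;
};

/* Hypothetical provider callback; returns NULL when it has no buffer. */
typedef struct ext_buf *(*ext_alloc_fn)(size_t size);

struct buf {
	unsigned char *data;
	size_t size;
	struct ext_buf *ext;	/* non-NULL only when data is external */
};

/* Try the provider first, then fall back to malloc(). 0 on success. */
static int buf_alloc(struct buf *b, size_t size, ext_alloc_fn provider)
{
	assert(b != NULL);
	b->data = NULL;
	b->size = size;
	b->ext = NULL;

	if (provider) {
		struct ext_buf *ext = provider(size);

		if (ext) {
			b->data = ext->start;
			b->size = ext->size;	/* buffer may be larger */
			b->ext = ext;		/* remember the origin */
		}
	}
	if (!b->data)
		b->data = malloc(size);
	return b->data ? 0 : -1;
}

/* Only free memory we allocated; external memory stays with its owner. */
static void buf_free(struct buf *b)
{
	if (!b->ext)
		free(b->data);
	b->data = NULL;
	b->ext = NULL;
}

/* Toy providers for experimenting with both paths. */
static unsigned char toy_storage[8192];
static struct ext_buf toy_ext = { toy_storage, sizeof(toy_storage) };

static struct ext_buf *toy_provider(size_t size)
{
	return size <= toy_ext.size ? &toy_ext : NULL;
}

static struct ext_buf *empty_provider(size_t size)
{
	(void)size;
	return NULL;
}
```

Recording the origin in the buffer itself plays the same role here as
storing ext_page in shinfo->destructor_arg does in the patch: the free
path needs to know who owns the memory.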

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 96799f5..8949b15 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -448,17 +448,18 @@ extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
-				   gfp_t priority, int fclone, int node);
+				   gfp_t priority, int fclone,
+				   int node, struct net_device *dev);
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, -1, NULL);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, -1, NULL);
 }
 
 extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6345acc..ae223d2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -161,7 +161,8 @@ EXPORT_SYMBOL(skb_under_panic);
  *	@fclone: allocate from fclone cache instead of head cache
  *		and allocate a cloned (child) skb
  *	@node: numa node to allocate memory on
- *
+ *	@dev: device that owns the skb if the skb tries to get an external
+ *		buffer, otherwise %NULL.
  *	Allocate a new &sk_buff. The returned buffer has no headroom and a
  *	tail room of size bytes. The object has a reference count of one.
  *	The return is the buffer. On a failure the return is %NULL.
@@ -170,12 +171,13 @@ EXPORT_SYMBOL(skb_under_panic);
  *	%GFP_ATOMIC.
  */
 struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
-			    int fclone, int node)
+			    int fclone, int node, struct net_device *dev)
 {
 	struct kmem_cache *cache;
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
-	u8 *data;
+	u8 *data = NULL;
+	struct skb_external_page *ext_page = NULL;
 
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
 
@@ -185,8 +187,23 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 
 	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+
+	/* If the device wants to do mediate passthru (zero-copy),
+	 * the skb may try to get external buffers from outside.
+	 * If that fails, fall back to allocating buffers from the kernel.
+	 */
+	if (dev && dev->mp_port) {
+		ext_page = netdev_alloc_external_page(dev, skb, size);
+		if (ext_page) {
+			data = ext_page->start;
+			size = ext_page->size;
+		}
+	}
+
+	if (!data)
+		data = kmalloc_node_track_caller(
+				size + sizeof(struct skb_shared_info),
+				gfp_mask, node);
 	if (!data)
 		goto nodata;
 
@@ -208,6 +225,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	skb->mac_header = ~0U;
 #endif
 
+	/* If the skb got external buffers successfully, the shinfo is
+	 * at the end of the buffer, so keep a copy of it in case we
+	 * need it later.
+	 */
+	if (ext_page) {
+		skb->head = skb->data - NET_IP_ALIGN - NET_SKB_PAD;
+		memcpy(ext_page->ushinfo, skb_shinfo(skb),
+		       sizeof(struct skb_shared_info));
+	}
 	/* make sure we initialize shinfo sequentially */
 	shinfo = skb_shinfo(skb);
 	atomic_set(&shinfo->dataref, 1);
@@ -231,6 +257,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	/* Record the external buffer info in this field. This is not ideal,
+	 * but there is no other obvious place for it.
+	 */
+	shinfo->destructor_arg = ext_page;
+
 out:
 	return skb;
 nodata:
@@ -259,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node, dev);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
-- 
1.5.4.4
