* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Chris Wright @ 2009-09-29 20:30 UTC (permalink / raw)
To: Bhavesh Davda
Cc: Chris Wright, Shreyas Bhatewara, pv-drivers@vmware.com,
netdev@vger.kernel.org, Stephen Hemminger,
linux-kernel@vger.kernel.org, virtualization, Anthony Liguori,
Greg Kroah-Hartman, Andrew Morton, Jeff Garzik, David S. Miller
In-Reply-To: <8B1F619C9F5F454E81D90D3C161698D7017DB76576@EXCH-MBX-3.vmware.com>
* Bhavesh Davda (bhavesh@vmware.com) wrote:
> Hi Chris,
>
> Thanks a bunch for your really thorough review! I'll answer some of your questions here. Shreyas can respond to your comments about some of the coding style/comments/etc. in a separate mail.
The style is less important at this stage, but certainly eases review
to make it more consistent w/ Linux code. The StudlyCaps, extra macros
(screaming caps) and inconistent space/tabs are visual distractions,
that's all.
> > > INTx, MSI, MSI-X (25 vectors) interrupts
> > > 16 Rx queues, 8 Tx queues
> >
> > Driver doesn't appear to actually support more than a single MSI-X
> > interrupt.
> > What is your plan for doing real multiqueue?
>
> When we first wrote the driver a couple of years ago, Linux lacked proper multiqueue support, hence we chose to use only a single queue though the emulated device does support 16 Rx and 8 Tx queues, and 25 MSI-X vectors: 16 for Rx, 8 for Tx and 1 for other asynchronous event notifications, by design. Actually a driver can repurpose any of the 25 vectors for any notifications; just explaining the rationale for desiging the device with 25 MSI-X vectors.
I see, thanks.
> We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work before calling it production ready.
I'd expect once you switch to alloc_etherdev_mq(), make napi work per
rx queue, and fix MSI-X allocation (all needed for 4/4), you should
have enough to support the max of 16/8 (IOW, 4/4 still sounds like an
aritificial limitation).
> > How about GRO conversion?
>
> Looks attractive, and we'll work on that in a subsequent patch. Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't exist in Linux.
OK, shouldn't be too much work.
Another thing I forgot to mention is that net_device now has
net_device_stats in it. So you shouldn't need net_device_stats in
vmxnet3_adapter.
> > Also, heavy use of BUG_ON() (counted 51 of them), are you sure that
> > none
> > of them can be triggered by guest or remote (esp. the ones that happen
> > in interrupt context)? Some initial thoughts below.
>
> We'll definitely audit all the BUG_ONs again to make sure they can't be exploited.
>
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/upt1_defs.h
> > > +#define UPT1_MAX_TX_QUEUES 64
> > > +#define UPT1_MAX_RX_QUEUES 64
> >
> > This is different than the 16/8 described above (and seemingly all moot
> > since it becomes a single queue device).
>
> Nice catch! Those are not even used and are from the earliest days of our driver development. We'll nuke those.
Could you describe the UPT layer a bit? There were a number of
constants that didn't appear to be used.
> > > +/* interrupt moderation level */
> > > +#define UPT1_IML_NONE 0 /* no interrupt moderation */
> > > +#define UPT1_IML_HIGHEST 7 /* least intr generated */
> > > +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */
> >
> > enum? also only appears to support adaptive mode?
>
> Yes, the Linux driver currently only asks for adaptive mode, but the device supports 8 interrupt moderation levels.
>
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/vmxnet3_defs.h
> > > +struct Vmxnet3_MiscConf {
> > > + struct Vmxnet3_DriverInfo driverInfo;
> > > + uint64_t uptFeatures;
> > > + uint64_t ddPA; /* driver data PA */
> > > + uint64_t queueDescPA; /* queue descriptor table
> > PA */
> > > + uint32_t ddLen; /* driver data len */
> > > + uint32_t queueDescLen; /* queue desc. table len
> > in bytes */
> > > + uint32_t mtu;
> > > + uint16_t maxNumRxSG;
> > > + uint8_t numTxQueues;
> > > + uint8_t numRxQueues;
> > > + uint32_t reserved[4];
> > > +};
> >
> > should this be packed (or others that are shared w/ device)? i assume
> > you've already done 32 vs 64 here
> >
>
> No need for packing since the fields are naturally 64-bit aligned. True for all structures shared between the driver and device.
I had quickly looked and thought I saw a hole that would lead to
inconsistent layout for 32-bit vs 64-bit. I figured I'd be wrong
there ;-)
> > > +#define VMXNET3_MAX_TX_QUEUES 8
> > > +#define VMXNET3_MAX_RX_QUEUES 16
> >
> > different to UPT, I must've missed some layering here
>
> These are the authoritiative #defines. Ignore the UPT ones.
>
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > > + VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx *
> > 8, 0);
> >
> > writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)
> > seems just as clear to me.
>
> Fair enough. We were just trying to clearly show which register accesses go to BAR 0 versus BAR 1.
>
> > only ever num_intrs=1, so there's some plan to bump this up and make
> > these wrappers useful?
>
> Yes.
>
> > > +static void
> > > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
> >
> > Should be trivial to break out to it's own MSI-X vector, basically set
> > up to do that already.
>
> Yes, and the device is configurable to use any vector for any "events", but didn't see any compelling reason to do so. "ECR" events are extremely rare and we've got a shadow copy of the ECR register that avoids an expensive round trip to the device, stored in adapter->shared->ecr. So we can cheaply handle events on the hot Tx/Rx path with minimal overhead. But if you really see a compelling reason to allocate a separate MSI-X vector for events, we can certainly do that.
Nah, just thinking outloud while trying to understand the driver. I
figured it'd be the + 1 vector (16 + 8 + 1).
thanks,
-chris
^ permalink raw reply
* Re: Very strange issues with ethernet wake on lan
From: Rafael J. Wysocki @ 2009-09-29 20:28 UTC (permalink / raw)
To: Maxim Levitsky
Cc: netdev@vger.kernel.org, linux-pm@lists.linux-foundation.org
In-Reply-To: <1250394174.32268.13.camel@localhost.localdomain>
On Sunday 16 August 2009, Maxim Levitsky wrote:
> Hi,
>
> I have recently put back the davicom dm9009 ethernet card into my
> computer.
>
> Some long time ago, I have written its suspend/resume routines.
> Now I see that few things have changed, like I need to enable wake in
> sysfs or better patch the code to do so, some nice helpers like
> pci_prepare_to_sleep have arrived, etc.
>
>
> I narrowed the strange issue down to following situation:
>
> I reload dmfe.ko (and networkmanager is disabled)
> I don't ifup the device, thus pretty much no hardware initialization
> takes place (but this appears not to matter anyway)
>
> I then suspend the system, and WOL doesn't work (I have patched the
> driver to enable WOL automaticly)
>
> I then, suspend again. WOL works, and continues to work as long as I
> don't reload the driver. If I do, same situation repeats.
>
> Also, after a boot, WOL works, so a reload cycle triggers that issue.
>
> And most importantly, if I don't do a
>
> pci_set_power_state(pci_dev, pci_choose_state (pci_dev, state));
>
> in .suspend, then WOL always works.
>
> and I have even tried to set state manually to PCI_D3hot or PCI_D3cold,
>
> I also tried to use pci_save_state
>
>
> I also have 2 copies of this card, and both have this issue.
> I also tried 2 pci slots.
>
> Kernel is vanilla 2.6.31-rc5
Please check if this still happens with 2.6.32-rc1.
Best,
Rafael
^ permalink raw reply
* Re: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: David Miller @ 2009-09-29 19:55 UTC (permalink / raw)
To: bhavesh
Cc: arnd, chrisw, pv-drivers, netdev, linux-kernel, virtualization,
greg, anthony, jgarzik, akpm, shemminger
In-Reply-To: <8B1F619C9F5F454E81D90D3C161698D7017DB76582@EXCH-MBX-3.vmware.com>
From: Bhavesh Davda <bhavesh@vmware.com>
Date: Tue, 29 Sep 2009 12:52:05 -0700
>> One thing that should possibly be fixed is the naming of identifiers,
>> e.g.
>> 's/Vmxnet3_MiscConf/vmxnet3_misc_conf/g', unless these header files are
>> shared with the host implementation.
>
> These header files are indeed shared with the host implementation, as you've guessed. If it's not a big deal, we would like to keep the names the same, just for our own sanity's sake?
No. This isn't your source tree, it's everyone's. So you should
adhere to basic naming conventions and coding standards of the
tree regardless of what you happen to use or need to use internally.
^ permalink raw reply
* RE: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Bhavesh Davda @ 2009-09-29 19:52 UTC (permalink / raw)
To: Arnd Bergmann, Chris Wright
Cc: pv-drivers@vmware.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, virtualization, Greg Kroah-Hartman,
Anthony Liguori, Jeff Garzik, Andrew Morton, Stephen Hemminger,
David S. Miller
In-Reply-To: <200909291505.50961.arnd@arndb.de>
Hi Arnd,
> On Tuesday 29 September 2009, Chris Wright wrote:
> > > +struct Vmxnet3_MiscConf {
> > > + struct Vmxnet3_DriverInfo driverInfo;
> > > + uint64_t uptFeatures;
> > > + uint64_t ddPA; /* driver data PA */
> > > + uint64_t queueDescPA; /* queue descriptor
> table PA */
> > > + uint32_t ddLen; /* driver data len */
> > > + uint32_t queueDescLen; /* queue desc. table len
> in bytes */
> > > + uint32_t mtu;
> > > + uint16_t maxNumRxSG;
> > > + uint8_t numTxQueues;
> > > + uint8_t numRxQueues;
> > > + uint32_t reserved[4];
> > > +};
> >
> > should this be packed (or others that are shared w/ device)? i
> assume
> > you've already done 32 vs 64 here
>
> I would not mark it packed, because it already is well-defined on all
> systems. You should add __packed only to the fields where you screwed
> up, but not to structures that already work fine.
You're exactly right; I reiterated as much in my response to Chris.
> One thing that should possibly be fixed is the naming of identifiers,
> e.g.
> 's/Vmxnet3_MiscConf/vmxnet3_misc_conf/g', unless these header files are
> shared with the host implementation.
These header files are indeed shared with the host implementation, as you've guessed. If it's not a big deal, we would like to keep the names the same, just for our own sanity's sake?
Thanks!
- Bhavesh
>
> Arnd <><
^ permalink raw reply
* RE: [Pv-drivers] [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Bhavesh Davda @ 2009-09-29 19:45 UTC (permalink / raw)
To: Chris Wright, Shreyas Bhatewara
Cc: pv-drivers@vmware.com, netdev@vger.kernel.org, Stephen Hemminger,
linux-kernel@vger.kernel.org, virtualization, Anthony Liguori,
Greg Kroah-Hartman, Andrew Morton, Jeff Garzik, David S. Miller
In-Reply-To: <20090929085333.GC3958@sequoia.sous-sol.org>
Hi Chris,
Thanks a bunch for your really thorough review! I'll answer some of your questions here. Shreyas can respond to your comments about some of the coding style/comments/etc. in a separate mail.
> > INTx, MSI, MSI-X (25 vectors) interrupts
> > 16 Rx queues, 8 Tx queues
>
> Driver doesn't appear to actually support more than a single MSI-X
> interrupt.
> What is your plan for doing real multiqueue?
When we first wrote the driver a couple of years ago, Linux lacked proper multiqueue support, hence we chose to use only a single queue though the emulated device does support 16 Rx and 8 Tx queues, and 25 MSI-X vectors: 16 for Rx, 8 for Tx and 1 for other asynchronous event notifications, by design. Actually a driver can repurpose any of the 25 vectors for any notifications; just explaining the rationale for desiging the device with 25 MSI-X vectors.
We do have an internal prototype of a Linux vmxnet3 driver with 4 Tx queues and 4 Rx queues, using 9 MSI-X vectors, but it needs some work before calling it production ready.
> How about GRO conversion?
Looks attractive, and we'll work on that in a subsequent patch. Again, when we first wrote the driver, the NETIF_F_GRO stuff didn't exist in Linux.
> Also, heavy use of BUG_ON() (counted 51 of them), are you sure that
> none
> of them can be triggered by guest or remote (esp. the ones that happen
> in interrupt context)? Some initial thoughts below.
We'll definitely audit all the BUG_ONs again to make sure they can't be exploited.
> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/upt1_defs.h
> > +#define UPT1_MAX_TX_QUEUES 64
> > +#define UPT1_MAX_RX_QUEUES 64
>
> This is different than the 16/8 described above (and seemingly all moot
> since it becomes a single queue device).
Nice catch! Those are not even used and are from the earliest days of our driver development. We'll nuke those.
> > +/* interrupt moderation level */
> > +#define UPT1_IML_NONE 0 /* no interrupt moderation */
> > +#define UPT1_IML_HIGHEST 7 /* least intr generated */
> > +#define UPT1_IML_ADAPTIVE 8 /* adpative intr moderation */
>
> enum? also only appears to support adaptive mode?
Yes, the Linux driver currently only asks for adaptive mode, but the device supports 8 interrupt moderation levels.
> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_defs.h
> > +struct Vmxnet3_MiscConf {
> > + struct Vmxnet3_DriverInfo driverInfo;
> > + uint64_t uptFeatures;
> > + uint64_t ddPA; /* driver data PA */
> > + uint64_t queueDescPA; /* queue descriptor table
> PA */
> > + uint32_t ddLen; /* driver data len */
> > + uint32_t queueDescLen; /* queue desc. table len
> in bytes */
> > + uint32_t mtu;
> > + uint16_t maxNumRxSG;
> > + uint8_t numTxQueues;
> > + uint8_t numRxQueues;
> > + uint32_t reserved[4];
> > +};
>
> should this be packed (or others that are shared w/ device)? i assume
> you've already done 32 vs 64 here
>
No need for packing since the fields are naturally 64-bit aligned. True for all structures shared between the driver and device.
> > +#define VMXNET3_MAX_TX_QUEUES 8
> > +#define VMXNET3_MAX_RX_QUEUES 16
>
> different to UPT, I must've missed some layering here
These are the authoritiative #defines. Ignore the UPT ones.
> > --- /dev/null
> > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > + VMXNET3_WRITE_BAR0_REG(adapter, VMXNET3_REG_IMR + intr_idx *
> 8, 0);
>
> writel(0, adapter->hw_addr0 + VMXNET3_REG_IMR + intr_idx * 8)
> seems just as clear to me.
Fair enough. We were just trying to clearly show which register accesses go to BAR 0 versus BAR 1.
> only ever num_intrs=1, so there's some plan to bump this up and make
> these wrappers useful?
Yes.
> > +static void
> > +vmxnet3_process_events(struct vmxnet3_adapter *adapter)
>
> Should be trivial to break out to it's own MSI-X vector, basically set
> up to do that already.
Yes, and the device is configurable to use any vector for any "events", but didn't see any compelling reason to do so. "ECR" events are extremely rare and we've got a shadow copy of the ECR register that avoids an expensive round trip to the device, stored in adapter->shared->ecr. So we can cheaply handle events on the hot Tx/Rx path with minimal overhead. But if you really see a compelling reason to allocate a separate MSI-X vector for events, we can certainly do that.
>
> Plan to switch to GRO?
Already answered.
Thanks
- Bhavesh
^ permalink raw reply
* Re: mac80211: NOHZ: local_softirq_pending 08
From: John W. Linville @ 2009-09-29 19:29 UTC (permalink / raw)
To: Oliver Hartkopp
Cc: Michael Buesch, Kalle Valo, linux-wireless, netdev, Johannes Berg
In-Reply-To: <4AABCF28.6090505@hartkopp.net>
On Sat, Sep 12, 2009 at 06:41:12PM +0200, Oliver Hartkopp wrote:
> i cooked a patch that introduces netif_rx_ti() and fixes up the problems in
> mac80211 and the CAN subsystem.
Oliver,
Are you going to send this patch to Dave? If you want me to carry
it instead, please resend it with a proper changelog including a
Signed-off-by line. For that matter, Dave will most certainly want
that as well...
Thanks!
John
--
John W. Linville Someday the world will need a hero, and you
linville@tuxdriver.com might be all we have. Be ready.
^ permalink raw reply
* Re: [Fwd: Re: Bug#538372: header failure including netlink.h (or uio.h)]
From: Jarek Poplawski @ 2009-09-29 19:21 UTC (permalink / raw)
Cc: Manuel Prinz, netdev
In-Reply-To: <20090929092727.GA7025@ff.dom.local>
Jarek Poplawski wrote, On 09/29/2009 11:27 AM:
> On 28-09-2009 13:24, Manuel Prinz wrote:
>> Hi everyone,
>>
>> I'm forwarding this bug in Debian (http://bugs.debian.org/538372) as
>> requested by the Debian kernel team. A patch is available. Applying just
>> the first hunk fixes the issue for me. I've not enough kernel knowledge
>> to judge if this fix is a proper solution, though.
>>
>> It would be really great if someone could have a look at it. Thanks in
>> advance! (And please CC me in replies. Thanks!)
>
> I've tried it with current include/linux and it works OK. Replacing
> uio.h on Debian really was not enough, but it looks like missing
> compiler.h entries could be the reason. Otherwise, please send your
> compile error log.
Hmm... Probably you noticed it yourself, so a little update for the
record. My method wasn't the proper way, but repeating it with
sanitized headers from the current kernel works even better: replacing
Debian's uio.h only seems to be enough yet.
Jarek P.
^ permalink raw reply
* [net-2.6 PATCH 5/5] qlge: Fix error exit for probe call.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge_main.c | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index fbef305..c8a9efe 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3832,11 +3832,14 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
return err;
}
+ qdev->ndev = ndev;
+ qdev->pdev = pdev;
+ pci_set_drvdata(pdev, ndev);
pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
if (pos <= 0) {
dev_err(&pdev->dev, PFX "Cannot find PCI Express capability, "
"aborting.\n");
- goto err_out;
+ return pos;
} else {
pci_read_config_word(pdev, pos + PCI_EXP_DEVCTL, &val16);
val16 &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
@@ -3849,7 +3852,7 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
err = pci_request_regions(pdev, DRV_NAME);
if (err) {
dev_err(&pdev->dev, "PCI region request failed.\n");
- goto err_out;
+ return err;
}
pci_set_master(pdev);
@@ -3867,7 +3870,6 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
goto err_out;
}
- pci_set_drvdata(pdev, ndev);
qdev->reg_base =
ioremap_nocache(pci_resource_start(pdev, 1),
pci_resource_len(pdev, 1));
@@ -3887,8 +3889,6 @@ static int __devinit ql_init_device(struct pci_dev *pdev,
goto err_out;
}
- qdev->ndev = ndev;
- qdev->pdev = pdev;
err = ql_get_board_info(qdev);
if (err) {
dev_err(&pdev->dev, "Register access failed.\n");
--
1.6.0.2
^ permalink raw reply related
* [net-2.6 PATCH 4/5] qlge: Protect reset recovery with rtnl_lock().
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>
Move the call to rtnl_lock() to before the internal call to
ql_adapter_down()/ql_adapter_up(). This prevents collisions that can
happen when recovering from an asic error.
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge_main.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index b05300d..fbef305 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3703,7 +3703,7 @@ static void ql_asic_reset_work(struct work_struct *work)
struct ql_adapter *qdev =
container_of(work, struct ql_adapter, asic_reset_work.work);
int status;
-
+ rtnl_lock();
status = ql_adapter_down(qdev);
if (status)
goto error;
@@ -3711,12 +3711,12 @@ static void ql_asic_reset_work(struct work_struct *work)
status = ql_adapter_up(qdev);
if (status)
goto error;
-
+ rtnl_unlock();
return;
error:
QPRINTK(qdev, IFUP, ALERT,
"Driver up/down cycle failed, closing device\n");
- rtnl_lock();
+
set_bit(QL_ADAPTER_UP, &qdev->flags);
dev_close(qdev->ndev);
rtnl_unlock();
--
1.6.0.2
^ permalink raw reply related
* [net-2.6 PATCH 2/5] qlge: Fix out of sync hardware semaphore.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>
ql_clear_routing_entries() takes/gives it's own hardware semaphore since
it is called from more than one place. ql_route_initialize() should
make this call and THEN take it's own semaphore before doing it's work.
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge_main.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 2205292..e4b756d 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3142,14 +3142,14 @@ static int ql_route_initialize(struct ql_adapter *qdev)
{
int status = 0;
- status = ql_sem_spinlock(qdev, SEM_RT_IDX_MASK);
+ /* Clear all the entries in the routing table. */
+ status = ql_clear_routing_entries(qdev);
if (status)
return status;
- /* Clear all the entries in the routing table. */
- status = ql_clear_routing_entries(qdev);
+ status = ql_sem_spinlock(qdev, SEM_RT_IDX_MASK);
if (status)
- goto exit;
+ return status;
status = ql_set_routing_reg(qdev, RT_IDX_ALL_ERR_SLOT, RT_IDX_ERR, 1);
if (status) {
--
1.6.0.2
^ permalink raw reply related
* [net-2.6 PATCH 1/5] qlge: Fix bad bit definitions.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge.h | 18 +++++++++---------
1 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index a9845a2..30d5585 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -1381,15 +1381,15 @@ struct intr_context {
/* adapter flags definitions. */
enum {
- QL_ADAPTER_UP = (1 << 0), /* Adapter has been brought up. */
- QL_LEGACY_ENABLED = (1 << 3),
- QL_MSI_ENABLED = (1 << 3),
- QL_MSIX_ENABLED = (1 << 4),
- QL_DMA64 = (1 << 5),
- QL_PROMISCUOUS = (1 << 6),
- QL_ALLMULTI = (1 << 7),
- QL_PORT_CFG = (1 << 8),
- QL_CAM_RT_SET = (1 << 9),
+ QL_ADAPTER_UP = 0, /* Adapter has been brought up. */
+ QL_LEGACY_ENABLED = 1,
+ QL_MSI_ENABLED = 2,
+ QL_MSIX_ENABLED = 3,
+ QL_DMA64 = 4,
+ QL_PROMISCUOUS = 5,
+ QL_ALLMULTI = 6,
+ QL_PORT_CFG = 7,
+ QL_CAM_RT_SET = 8,
};
/* link_status bit definitions */
--
1.6.0.2
^ permalink raw reply related
* [net-2.6 PATCH 3/5] qlge: Fix spin_lock warning.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <1254249565-16381-1-git-send-email-ron.mercer@qlogic.com>
Remove the unnecessary locking around the call to ql_adapter_reset().
Sep 25 08:17:29 localhost kernel: SOFTIRQ-ON-W at:
Sep 25 08:17:29 localhost kernel: [<c0000000000a2964>] .lock_acquire+0x10c/0x158
Sep 25 08:17:29 localhost kernel: [<c0000000004542e0>] ._spin_lock+0x34/0x58
Sep 25 08:17:29 localhost kernel: [<d000000006723070>] .ql_adapter_down+0x40c/0x4a0 [qlge]
Sep 25 08:17:29 localhost kernel: [<d0000000067256d8>] .qlge_close+0x38/0x58 [qlge]
Sep 25 08:17:29 localhost kernel: [<c0000000003ada6c>] .dev_close+0xdc/0x118
Sep 25 08:17:29 localhost kernel: [<c0000000003adb48>] .rollback_registered+0xa0/0x158
Sep 25 08:17:29 localhost kernel: [<c0000000003adc50>] .unregister_netdevice+0x50/0x7c
Sep 25 08:17:29 localhost kernel: [<c0000000003adca0>] .unregister_netdev+0x24/0x40
Sep 25 08:17:29 localhost kernel: [<d00000000672e0c0>] .qlge_remove+0x28/0x64 [qlge]
Sep 25 08:17:29 localhost kernel: [<c000000000253fdc>] .pci_device_remove+0x50/0x90
Sep 25 08:17:29 localhost kernel: [<c0000000002f5434>] .__device_release_driver+0x94/0xf8
Sep 25 08:17:29 localhost kernel: [<c0000000002f5560>] .driver_detach+0xc8/0xfc
Sep 25 08:17:29 localhost kernel: [<c0000000002f3fd8>] .bus_remove_driver+0xb4/0x114
Sep 25 08:17:29 localhost kernel: [<c0000000002f5d4c>] .driver_unregister+0x80/0xa4
Sep 25 08:17:29 localhost kernel: [<c00000000025421c>] .pci_unregister_driver+0x50/0xc8
Sep 25 08:17:29 localhost kernel: [<d00000000672e044>] .qlge_exit+0x1c/0x34 [qlge]
Sep 25 08:17:29 localhost kernel: [<c0000000000ac8b0>] .SyS_delete_module+0x234/0x2d0
Sep 25 08:17:29 localhost kernel: [<c000000000008554>] syscall_exit+0x0/0x40
Sep 25 08:17:29 localhost kernel: INITIAL USE at:
Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
drivers/net/qlge/qlge_main.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index e4b756d..b05300d 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -3380,12 +3380,10 @@ static int ql_adapter_down(struct ql_adapter *qdev)
ql_free_rx_buffers(qdev);
- spin_lock(&qdev->hw_lock);
status = ql_adapter_reset(qdev);
if (status)
QPRINTK(qdev, IFDOWN, ERR, "reset(func #%d) FAILED!\n",
qdev->func);
- spin_unlock(&qdev->hw_lock);
return status;
}
--
1.6.0.2
^ permalink raw reply related
* [net-2.6 PATCH 0/5] qlge: Bug fixes for qlge.
From: Ron Mercer @ 2009-09-29 18:39 UTC (permalink / raw)
To: davem; +Cc: netdev, ron.mercer
^ permalink raw reply
* Re: [PATCH] /proc/net/tcp, overhead removed
From: Yakov Lerner @ 2009-09-29 17:34 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, davem
In-Reply-To: <20090929084534.41274f66@nehalam>
On Tue, Sep 29, 2009 at 18:45, Stephen Hemminger <shemminger@vyatta.com> wrote:
> On Tue, 29 Sep 2009 11:55:18 +0300
> Yakov Lerner <iler.ml@gmail.com> wrote:
>
>> On Tue, Sep 29, 2009 at 10:56, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > Yakov Lerner a écrit :
>> > > Take 2.
>> > >
>> > > "Sharp improvement in performance of /proc/net/tcp when number of
>> > > sockets is large and hashsize is large.
>> > > O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
>> > > processors, speed difference can be x100 and more."
>> > >
>> > > I must say that I'm not fully satisfied with my choice of "st->sbucket"
>> > > for the new preserved index. The better name would be "st->snum".
>> > > Re-using "st->sbucket" saves 4 bytes, and keeps the patch to one sourcefile.
>> > > But "st->sbucket" has different meaning in OPENREQ and LISTEN states;
>> > > this can be confusing.
>> > > Maybe better add "snum" member to struct tcp_iter_state ?
>> > >
>> > > Shall I change subject when sending "take N+1", or keep the old subject ?
>> > >
>> > > Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
>> > > ---
>> > > net/ipv4/tcp_ipv4.c | 35 +++++++++++++++++++++++++++++++++--
>> > > 1 files changed, 33 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> > > index 7cda24b..e4c4f19 100644
>> > > --- a/net/ipv4/tcp_ipv4.c
>> > > +++ b/net/ipv4/tcp_ipv4.c
>> > > @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>> > > hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>> > > }
>> > >
>> > > -static void *established_get_first(struct seq_file *seq)
>> > > +static void *established_get_first_after(struct seq_file *seq, int bucket)
>> > > {
>> > > struct tcp_iter_state *st = seq->private;
>> > > struct net *net = seq_file_net(seq);
>> > > void *rc = NULL;
>> > >
>> > > - for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
>> > > + for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
>> > > + ++st->bucket) {
>> > > struct sock *sk;
>> > > struct hlist_nulls_node *node;
>> > > struct inet_timewait_sock *tw;
>> > > @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
>> > > if (empty_bucket(st))
>> > > continue;
>> > >
>> > > + st->sbucket = st->num;
>> > > +
>> > > spin_lock_bh(lock);
>> > > sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
>> > > if (sk->sk_family != st->family ||
>> > > @@ -2036,6 +2039,11 @@ out:
>> > > return rc;
>> > > }
>> > >
>> > > +static void *established_get_first(struct seq_file *seq)
>> > > +{
>> > > + return established_get_first_after(seq, 0);
>> > > +}
>> > > +
>> > > static void *established_get_next(struct seq_file *seq, void *cur)
>> > > {
>> > > struct sock *sk = cur;
>> > > @@ -2064,6 +2072,9 @@ get_tw:
>> > > while (++st->bucket < tcp_hashinfo.ehash_size &&
>> > > empty_bucket(st))
>> > > ;
>> > > +
>> > > + st->sbucket = st->num;
>> > > +
>> > > if (st->bucket >= tcp_hashinfo.ehash_size)
>> > > return NULL;
>> > >
>> > > @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>> > >
>> > > if (!rc) {
>> > > st->state = TCP_SEQ_STATE_ESTABLISHED;
>> > > + st->sbucket = 0;
>> > > rc = established_get_idx(seq, pos);
>> > > }
>> > >
>> > > @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>> > > static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
>> > > {
>> > > struct tcp_iter_state *st = seq->private;
>> > > +
>> > > + if (*pos && *pos >= st->sbucket &&
>> > > + (st->state == TCP_SEQ_STATE_ESTABLISHED ||
>> > > + st->state == TCP_SEQ_STATE_TIME_WAIT)) {
>> > > + void *cur;
>> > > + int nskip;
>> > > +
>> > > + /* for states estab and tw, st->sbucket is index (*pos) */
>> > > + /* corresponding to the beginning of bucket st->bucket */
>> > > +
>> > > + st->num = st->sbucket;
>> > > + /* jump to st->bucket, then skip (*pos - st->sbucket) items */
>> > > + st->state = TCP_SEQ_STATE_ESTABLISHED;
>> > > + cur = established_get_first_after(seq, st->bucket);
>> > > + for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
>> > > + cur = established_get_next(seq, cur);
>> > > + return cur;
>> > > + }
>> > > +
>> > > st->state = TCP_SEQ_STATE_LISTENING;
>> > > st->num = 0;
>> > > return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
>> >
>> > Just in case you are working on "take 3" of the patch, there is a fondamental problem.
>> >
>> > All the scalability problems come from the fact that tcp_seq_start()
>> > *has* to rescan all the tables from the begining, because of lseek() capability
>> > on /proc/net/tcp file
>> >
>> > We probably could disable llseek() (on other positions than start of the file),
>> > and rely only on internal state (listening/established hashtable, hash bucket, position in chain)
>> >
>> > I cannot imagine how an application could rely on lseek() on >0 position in this file.
>>
>>
>> I thought /proc/net/tcp can both be fast and allow lseek;
>> (1) when no lseek was issued since last read
>> (we can detect this), /proc/net/tcp can jump to the
>> last known bucket (common case), vs
>> (2) switch to slow mode (scan from the beginning of hash)
>> when lseek was used , no ?
>
> If you look at fib_hash and fib_trie, they already do the same thing.
> * fib_hash records last hash chain to avoid overhead of rescan.
> * fib_trie records last route and does fast lookup to restart from there.
Thanks for the pointer.
^ permalink raw reply
* Re: [PATCH] [RFC] IPv4 TCP fails to send window scale option when window scale is zero
From: Eric Dumazet @ 2009-09-29 17:19 UTC (permalink / raw)
To: Gilad Ben-Yossef; +Cc: netdev, Ori Finkalman
In-Reply-To: <4AC22250.7060301@codefidence.com>
Gilad Ben-Yossef a écrit :
> From: Ori Finkalman <ori@comsleep.com>
>
>
> Acknowledge TCP window scale support by inserting the proper option in
> SYN/ACK header
> even if our window scale is zero.
>
>
> This fixes the following observed behavior:
>
>
> 1. Client sends a SYN with TCP window scaling option and non zero window
> scale value to a Linux box.
>
> 2. Linux box notes large receive window from client.
>
> 3. Linux decides on a zero value of window scale for its part.
>
> 4. Due to compare against requested window scale size option, Linux does
> not to send windows scale
>
> TCP option header on SYN/ACK at all.
>
>
> Result:
>
>
> Client box thinks TCP window scaling is not supported, since SYN/ACK had
> no TCP window scale option,
> while Linux thinks that TCP window scaling is supported (and scale might
> be non zero), since SYN had
>
> TCP window scale option and we have a mismatched idea between the client
> and server regarding window sizes.
>
>
> Please comment and/or apply.
>
>
> ---
>
>
> Bug reported and patch written by Ori Finkalman from Comsleep Ltd. I'm
> just helping mainline it.
>
>
> The behavior was observed with a Windows box as the client and latest
> Debian kernel but for the best
> of my understanding this can happen with latest kernel versions and
> other client OS (probably also Linux)
>
> as well.
>
>
>
> Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
> Signed-off-by: Ori Finkelman <ori@comsleep.com>
>
>
> Index: net/ipv4/tcp_output.c
> ===================================================================
> --- net/ipv4/tcp_output.c (revision 46)
> +++ net/ipv4/tcp_output.c (revision 210)
> @@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
> #define OPTION_SACK_ADVERTISE (1 << 0)
> #define OPTION_TS (1 << 1)
> #define OPTION_MD5 (1 << 2)
> +#define OPTION_WSCALE (1 << 3)
>
> struct tcp_out_options {
> u8 options; /* bit field of OPTION_* */
> @@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
> TCPOLEN_SACK_PERM);
> }
>
> - if (unlikely(opts->ws)) {
> + if (unlikely(OPTION_WSCALE & opts->options)) {
> *ptr++ = htonl((TCPOPT_NOP << 24) |
> (TCPOPT_WINDOW << 16) |
> (TCPOLEN_WINDOW << 8) |
> @@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
>
> if (likely(ireq->wscale_ok)) {
> opts->ws = ireq->rcv_wscale;
> - if(likely(opts->ws))
> - size += TCPOLEN_WSCALE_ALIGNED;
> + opts->options |= OPTION_WSCALE;
> + size += TCPOLEN_WSCALE_ALIGNED;
> }
> if (likely(doing_ts)) {
> opts->options |= OPTION_TS;
>
>
>
Seems not the more logical places to put this logic...
How about this instead ?
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5200aab..b78c084 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -216,6 +216,11 @@ void tcp_select_initial_window(int __space, __u32 mss,
space >>= 1;
(*rcv_wscale)++;
}
+ /*
+ * Set a minimum wscale of 1
+ */
+ if (*rcv_wscale == 0)
+ *rcv_wscale = 1;
}
/* Set initial window to value enough for senders,
^ permalink raw reply related
* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Evgeniy Polyakov @ 2009-09-29 17:08 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Christian Borntraeger, Evgeny Polyakov, Scott James Remnant,
Linux Kernel, Matt Helsley, David S. Miller, netdev
In-Reply-To: <20090929153631.GA12699@redhat.com>
On Tue, Sep 29, 2009 at 05:36:31PM +0200, Oleg Nesterov (oleg@redhat.com) wrote:
> > Doesn't it only check pgid while patch intention was to send
> > notification about sid?
>
> If the proposed sid already was the session id, then prgp shouldn't
> be empty.
>
> but this doesn't really matter, we also check ->signal->leader
> (not sure, but afaics this check is not strictly necessary because
> of PIDTYPE_PGID check)
Ok, I see, thanks.
--
Evgeniy Polyakov
^ permalink raw reply
* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Evgeniy Polyakov @ 2009-09-29 17:07 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Oleg Nesterov, Scott James Remnant, Linux Kernel, Matt Helsley,
David S. Miller, netdev
In-Reply-To: <200909291712.33099.borntraeger@de.ibm.com>
On Tue, Sep 29, 2009 at 05:12:33PM +0200, Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> just in case the discussion concludes that my patch is fine,
> here is a fixed version.
>
> [PATCH] connector: Fix sid connector
>
> The sid connector gives the following warning:
> Badness at kernel/softirq.c:143
> [...]
> Call Trace:
> ([<000000013fe04100>] 0x13fe04100)
> [<000000000048a946>] sk_filter+0x9a/0xd0
> [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
> [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
> [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
> [<0000000000142604>] __set_special_pids+0x58/0x90
> [<0000000000159938>] sys_setsid+0xb4/0xd8
> [<00000000001187fe>] sysc_noemu+0x10/0x16
> [<00000041616cb266>] 0x41616cb266
>
> The warning is
> ---> WARN_ON_ONCE(in_irq() || irqs_disabled());
>
> The network code must not be called with disabled interrupts but
> sys_setsid holds the tasklist_lock with spinlock_irq while calling
> the connector. We can safely move proc_sid_connector from
> __set_special_pids to sys_setsid.
>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Looks good, thank you.
Ack.
--
Evgeniy Polyakov
^ permalink raw reply
* Re: [PATCH] dev_alloc_skb: avoid using GFP_ATOMIC
From: Stephen Hemminger @ 2009-09-29 16:58 UTC (permalink / raw)
To: swalter; +Cc: davem, netdev, swalter, Steven Walter
In-Reply-To: <1254242593-2279-1-git-send-email-swalter@lexmark.com>
On Tue, 29 Sep 2009 12:43:13 -0400
swalter@lexmark.com wrote:
> From: swalter <swalter@swalter-d630.(none)>
>
>
> Signed-off-by: Steven Walter <swalter@lpdev.prtdev.lexmark.com>
> ---
> net/core/skbuff.c | 6 +++++-
> 1 files changed, 5 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9e0597d..58ec625 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -306,7 +306,11 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
> * There is more code here than it seems:
> * __dev_alloc_skb is an inline
> */
> - return __dev_alloc_skb(length, GFP_ATOMIC);
> + if (in_interrupt() || in_atomic() || irqs_disabled()) {
> + return __dev_alloc_skb(length, GFP_ATOMIC);
> + } else {
> + return __dev_alloc_skb(length, GFP_KERNEL);
> + }
> }
> EXPORT_SYMBOL(dev_alloc_skb);
>
No, this should be fixed by caller (using netdev_alloc_skb)
also, it may break cases like swap over NFS that want to get memory
when memory pool is low
--
^ permalink raw reply
* [PATCH] dev_alloc_skb: avoid using GFP_ATOMIC
From: swalter @ 2009-09-29 16:43 UTC (permalink / raw)
To: davem, netdev; +Cc: swalter, Steven Walter
From: swalter <swalter@swalter-d630.(none)>
Signed-off-by: Steven Walter <swalter@lpdev.prtdev.lexmark.com>
---
net/core/skbuff.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9e0597d..58ec625 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -306,7 +306,11 @@ struct sk_buff *dev_alloc_skb(unsigned int length)
* There is more code here than it seems:
* __dev_alloc_skb is an inline
*/
- return __dev_alloc_skb(length, GFP_ATOMIC);
+ if (in_interrupt() || in_atomic() || irqs_disabled()) {
+ return __dev_alloc_skb(length, GFP_ATOMIC);
+ } else {
+ return __dev_alloc_skb(length, GFP_KERNEL);
+ }
}
EXPORT_SYMBOL(dev_alloc_skb);
--
1.6.2.3.g5bbe6
^ permalink raw reply related
* RE: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC driver: vmxnet3
From: Shreyas Bhatewara @ 2009-09-29 16:37 UTC (permalink / raw)
To: David Miller
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
shemminger@linux-foundation.org, jgarzik@pobox.com,
anthony@codemonkey.ws, chrisw@sous-sol.org, greg@kroah.com,
akpm@linux-foundation.org,
virtualization@lists.linux-foundation.org, pv-drivers@vmware.com
In-Reply-To: <20090928.170821.124026517.davem@davemloft.net>
> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Monday, September 28, 2009 5:08 PM
> To: Shreyas Bhatewara
> Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org;
> shemminger@linux-foundation.org; jgarzik@pobox.com;
> anthony@codemonkey.ws; chrisw@sous-sol.org; greg@kroah.com; akpm@linux-
> foundation.org; virtualization@lists.linux-foundation.org; pv-
> drivers@vmware.com
> Subject: Re: [PATCH 2.6.31-rc9] net: VMware virtual Ethernet NIC
> driver: vmxnet3
>
> From: Shreyas Bhatewara <sbhatewara@vmware.com>
> Date: Mon, 28 Sep 2009 16:56:45 -0700
>
> > + uint32_t rxdIdx:12; /* Index of the RxDesc */
>
> Don't use uint32_t et al. sized types, use "u32" and friends
> throughout.
Sure, I will fix that.
->Shreyas
^ permalink raw reply
* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Oleg Nesterov @ 2009-09-29 16:28 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Scott James Remnant, Evgeniy Polyakov, Linux Kernel, Matt Helsley,
David S. Miller, netdev
In-Reply-To: <200909291712.33099.borntraeger@de.ibm.com>
On 09/29, Christian Borntraeger wrote:
>
>
> The network code must not be called with disabled interrupts but
> sys_setsid holds the tasklist_lock with spinlock_irq while calling
> the connector. We can safely move proc_sid_connector from
> __set_special_pids to sys_setsid.
>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>
> ---
> kernel/exit.c | 4 +---
> kernel/sys.c | 2 ++
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/kernel/exit.c
> ===================================================================
> --- linux-2.6.orig/kernel/exit.c
> +++ linux-2.6/kernel/exit.c
> @@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
> {
> struct task_struct *curr = current->group_leader;
>
> - if (task_session(curr) != pid) {
> + if (task_session(curr) != pid)
> change_pid(curr, PIDTYPE_SID, pid);
> - proc_sid_connector(curr);
> - }
>
> if (task_pgrp(curr) != pid)
> change_pid(curr, PIDTYPE_PGID, pid);
> Index: linux-2.6/kernel/sys.c
> ===================================================================
> --- linux-2.6.orig/kernel/sys.c
> +++ linux-2.6/kernel/sys.c
> @@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
> err = session;
> out:
> write_unlock_irq(&tasklist_lock);
> + if (err > 0)
> + proc_sid_connector(sid);
> return err;
> }
Acked-by: Oleg Nesterov <oleg@redhat.com>
I'd suggest you to resend this patch to Andrew. Unless you know
another way to push it into Linus's tree ;)
Perhaps it makes sense to update the changelog, it should mention
the issues with daemonize().
Oleg.
^ permalink raw reply
* Re: [PATCH] /proc/net/tcp, overhead removed
From: Stephen Hemminger @ 2009-09-29 15:45 UTC (permalink / raw)
To: Yakov Lerner; +Cc: Eric Dumazet, netdev, davem
In-Reply-To: <f36b08ee0909290155u177b1983y2eafa8b353a143e0@mail.gmail.com>
On Tue, 29 Sep 2009 11:55:18 +0300
Yakov Lerner <iler.ml@gmail.com> wrote:
> On Tue, Sep 29, 2009 at 10:56, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > Yakov Lerner a écrit :
> > > Take 2.
> > >
> > > "Sharp improvement in performance of /proc/net/tcp when number of
> > > sockets is large and hashsize is large.
> > > O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
> > > processors, speed difference can be x100 and more."
> > >
> > > I must say that I'm not fully satisfied with my choice of "st->sbucket"
> > > for the new preserved index. The better name would be "st->snum".
> > > Re-using "st->sbucket" saves 4 bytes, and keeps the patch to one sourcefile.
> > > But "st->sbucket" has different meaning in OPENREQ and LISTEN states;
> > > this can be confusing.
> > > Maybe better add "snum" member to struct tcp_iter_state ?
> > >
> > > Shall I change subject when sending "take N+1", or keep the old subject ?
> > >
> > > Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
> > > ---
> > > net/ipv4/tcp_ipv4.c | 35 +++++++++++++++++++++++++++++++++--
> > > 1 files changed, 33 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 7cda24b..e4c4f19 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
> > > hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
> > > }
> > >
> > > -static void *established_get_first(struct seq_file *seq)
> > > +static void *established_get_first_after(struct seq_file *seq, int bucket)
> > > {
> > > struct tcp_iter_state *st = seq->private;
> > > struct net *net = seq_file_net(seq);
> > > void *rc = NULL;
> > >
> > > - for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
> > > + for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
> > > + ++st->bucket) {
> > > struct sock *sk;
> > > struct hlist_nulls_node *node;
> > > struct inet_timewait_sock *tw;
> > > @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
> > > if (empty_bucket(st))
> > > continue;
> > >
> > > + st->sbucket = st->num;
> > > +
> > > spin_lock_bh(lock);
> > > sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
> > > if (sk->sk_family != st->family ||
> > > @@ -2036,6 +2039,11 @@ out:
> > > return rc;
> > > }
> > >
> > > +static void *established_get_first(struct seq_file *seq)
> > > +{
> > > + return established_get_first_after(seq, 0);
> > > +}
> > > +
> > > static void *established_get_next(struct seq_file *seq, void *cur)
> > > {
> > > struct sock *sk = cur;
> > > @@ -2064,6 +2072,9 @@ get_tw:
> > > while (++st->bucket < tcp_hashinfo.ehash_size &&
> > > empty_bucket(st))
> > > ;
> > > +
> > > + st->sbucket = st->num;
> > > +
> > > if (st->bucket >= tcp_hashinfo.ehash_size)
> > > return NULL;
> > >
> > > @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
> > >
> > > if (!rc) {
> > > st->state = TCP_SEQ_STATE_ESTABLISHED;
> > > + st->sbucket = 0;
> > > rc = established_get_idx(seq, pos);
> > > }
> > >
> > > @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
> > > static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
> > > {
> > > struct tcp_iter_state *st = seq->private;
> > > +
> > > + if (*pos && *pos >= st->sbucket &&
> > > + (st->state == TCP_SEQ_STATE_ESTABLISHED ||
> > > + st->state == TCP_SEQ_STATE_TIME_WAIT)) {
> > > + void *cur;
> > > + int nskip;
> > > +
> > > + /* for states estab and tw, st->sbucket is index (*pos) */
> > > + /* corresponding to the beginning of bucket st->bucket */
> > > +
> > > + st->num = st->sbucket;
> > > + /* jump to st->bucket, then skip (*pos - st->sbucket) items */
> > > + st->state = TCP_SEQ_STATE_ESTABLISHED;
> > > + cur = established_get_first_after(seq, st->bucket);
> > > + for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
> > > + cur = established_get_next(seq, cur);
> > > + return cur;
> > > + }
> > > +
> > > st->state = TCP_SEQ_STATE_LISTENING;
> > > st->num = 0;
> > > return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
> >
> > Just in case you are working on "take 3" of the patch, there is a fondamental problem.
> >
> > All the scalability problems come from the fact that tcp_seq_start()
> > *has* to rescan all the tables from the begining, because of lseek() capability
> > on /proc/net/tcp file
> >
> > We probably could disable llseek() (on other positions than start of the file),
> > and rely only on internal state (listening/established hashtable, hash bucket, position in chain)
> >
> > I cannot imagine how an application could rely on lseek() on >0 position in this file.
>
>
> I thought /proc/net/tcp can both be fast and allow lseek;
> (1) when no lseek was issued since last read
> (we can detect this), /proc/net/tcp can jump to the
> last known bucket (common case), vs
> (2) switch to slow mode (scan from the beginning of hash)
> when lseek was used , no ?
If you look at fib_hash and fib_trie, they already do the same thing.
* fib_hash records last hash chain to avoid overhead of rescan.
* fib_trie records last route and does fast lookup to restart from there.
--
^ permalink raw reply
* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Oleg Nesterov @ 2009-09-29 15:36 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Christian Borntraeger, Evgeny Polyakov, Scott James Remnant,
Linux Kernel, Matt Helsley, David S. Miller, netdev
In-Reply-To: <20090929145413.GA26327@ioremap.net>
On 09/29, Evgeniy Polyakov wrote:
>
> On Tue, Sep 29, 2009 at 04:25:38PM +0200, Oleg Nesterov (oleg@redhat.com) wrote:
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -1090,6 +1090,7 @@ SYSCALL_DEFINE0(setsid)
> > > struct pid *sid = task_pid(group_leader);
> > > pid_t session = pid_vnr(sid);
> > > int err = -EPERM;
> > > + int send_cn = 0;
> > >
> > > write_lock_irq(&tasklist_lock);
> > > /* Fail if I am already a session leader */
> > > @@ -1104,12 +1105,18 @@ SYSCALL_DEFINE0(setsid)
> > >
> > > group_leader->signal->leader = 1;
> > > __set_special_pids(sid);
> > > + if (task_session(group_leader) != sid)
> > > + send_cn = 1;
> >
> > This is not right, task_session(group_leader) must be == sid after
> > __set_special_pids().
>
> Yeah, that check should be done before __set_special_pids().
>
> > And I don't think "int send_cn" is needed. sys_setsid() must not
> > succeed if the caller lived in session == task_pid(group_leader).
>
> Doesn't it only check pgid while patch intention was to send
> notification about sid?
If the proposed sid already was the session id, then prgp shouldn't
be empty.
but this doesn't really matter, we also check ->signal->leader
(not sure, but afaics this check is not strictly necessary because
of PIDTYPE_PGID check)
> I.e. setsid() succeeds, although nothing
> happens.
This shouldn't happen, or sys_setsid() is buggy. Look, the new session
id is task_pid(current). If sys_setsid() succeeds but we don't change
the session, this means we were already the leader. In that case we
should return -EPERM.
Oleg.
^ permalink raw reply
* [PATCH] [RFC] IPv4 TCP fails to send window scale option when window scale is zero
From: Gilad Ben-Yossef @ 2009-09-29 15:05 UTC (permalink / raw)
To: netdev; +Cc: Ori Finkalman
From: Ori Finkalman <ori@comsleep.com>
Acknowledge TCP window scale support by inserting the proper option in
SYN/ACK header
even if our window scale is zero.
This fixes the following observed behavior:
1. Client sends a SYN with TCP window scaling option and non zero window
scale value to a Linux box.
2. Linux box notes large receive window from client.
3. Linux decides on a zero value of window scale for its part.
4. Due to compare against requested window scale size option, Linux does
not to send windows scale
TCP option header on SYN/ACK at all.
Result:
Client box thinks TCP window scaling is not supported, since SYN/ACK had
no TCP window scale option,
while Linux thinks that TCP window scaling is supported (and scale might
be non zero), since SYN had
TCP window scale option and we have a mismatched idea between the client
and server regarding window sizes.
Please comment and/or apply.
---
Bug reported and patch written by Ori Finkalman from Comsleep Ltd. I'm
just helping mainline it.
The behavior was observed with a Windows box as the client and latest
Debian kernel but for the best
of my understanding this can happen with latest kernel versions and
other client OS (probably also Linux)
as well.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Index: net/ipv4/tcp_output.c
===================================================================
--- net/ipv4/tcp_output.c (revision 46)
+++ net/ipv4/tcp_output.c (revision 210)
@@ -353,6 +353,7 @@ static void tcp_init_nondata_skb(struct
#define OPTION_SACK_ADVERTISE (1 << 0)
#define OPTION_TS (1 << 1)
#define OPTION_MD5 (1 << 2)
+#define OPTION_WSCALE (1 << 3)
struct tcp_out_options {
u8 options; /* bit field of OPTION_* */
@@ -417,7 +418,7 @@ static void tcp_options_write(__be32 *pt
TCPOLEN_SACK_PERM);
}
- if (unlikely(opts->ws)) {
+ if (unlikely(OPTION_WSCALE & opts->options)) {
*ptr++ = htonl((TCPOPT_NOP << 24) |
(TCPOPT_WINDOW << 16) |
(TCPOLEN_WINDOW << 8) |
@@ -530,8 +531,8 @@ static unsigned tcp_synack_options(struc
if (likely(ireq->wscale_ok)) {
opts->ws = ireq->rcv_wscale;
- if(likely(opts->ws))
- size += TCPOLEN_WSCALE_ALIGNED;
+ opts->options |= OPTION_WSCALE;
+ size += TCPOLEN_WSCALE_ALIGNED;
}
if (likely(doing_ts)) {
opts->options |= OPTION_TS;
--
Gilad Ben-Yossef
Chief Coffee Drinker & CTO
Codefidence Ltd.
Web: http://codefidence.com
Cell: +972-52-8260388
Skype: gilad_codefidence
Tel: +972-8-9316883 ext. 201
Fax: +972-8-9316884
Email: gilad@codefidence.com
Check out our Open Source technology and training blog - http://tuxology.net
"Now the world has gone to bed
Darkness won't engulf my head
I can see by infra-red
How I hate the night."
^ permalink raw reply
* Re: [PATCH] connector: Fix sid connector (was: Badness at kernel/softirq.c:143...)
From: Christian Borntraeger @ 2009-09-29 15:12 UTC (permalink / raw)
To: Oleg Nesterov, Scott James Remnant
Cc: Evgeniy Polyakov, Linux Kernel, Matt Helsley, David S. Miller,
netdev
In-Reply-To: <20090929144554.GA10937@redhat.com>
Am Dienstag 29 September 2009 16:45:54 schrieb Oleg Nesterov:
> I think Christian's patch only needs the small fixup.
Oleg, Evgeniy,
just in case the discussion concludes that my patch is fine,
here is a fixed version.
[PATCH] connector: Fix sid connector
The sid connector gives the following warning:
Badness at kernel/softirq.c:143
[...]
Call Trace:
([<000000013fe04100>] 0x13fe04100)
[<000000000048a946>] sk_filter+0x9a/0xd0
[<000000000049d938>] netlink_broadcast+0x2c0/0x53c
[<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
[<00000000003baef0>] proc_sid_connector+0xc4/0xd4
[<0000000000142604>] __set_special_pids+0x58/0x90
[<0000000000159938>] sys_setsid+0xb4/0xd8
[<00000000001187fe>] sysc_noemu+0x10/0x16
[<00000041616cb266>] 0x41616cb266
The warning is
---> WARN_ON_ONCE(in_irq() || irqs_disabled());
The network code must not be called with disabled interrupts but
sys_setsid holds the tasklist_lock with spinlock_irq while calling
the connector. We can safely move proc_sid_connector from
__set_special_pids to sys_setsid.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
kernel/exit.c | 4 +---
kernel/sys.c | 2 ++
2 files changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -359,10 +359,8 @@ void __set_special_pids(struct pid *pid)
{
struct task_struct *curr = current->group_leader;
- if (task_session(curr) != pid) {
+ if (task_session(curr) != pid)
change_pid(curr, PIDTYPE_SID, pid);
- proc_sid_connector(curr);
- }
if (task_pgrp(curr) != pid)
change_pid(curr, PIDTYPE_PGID, pid);
Index: linux-2.6/kernel/sys.c
===================================================================
--- linux-2.6.orig/kernel/sys.c
+++ linux-2.6/kernel/sys.c
@@ -1110,6 +1110,8 @@ SYSCALL_DEFINE0(setsid)
err = session;
out:
write_unlock_irq(&tasklist_lock);
+ if (err > 0)
+ proc_sid_connector(sid);
return err;
}
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox