* Biweekly KVM Test report, kernel 94252... qemu 5cc3c...
From: Xu, Jiajun @ 2009-09-29 3:34 UTC
To: 'kvm-devel'
Hi All,
This is the biweekly KVM testing report against the latest kvm.git
94252a58662dc4ca6191eac479efb40e0716865c and qemu-kvm.git
5cc3cfb6c2254483ae324da407a13307fe7355f3.
The qemu-kvm tree build issue is fixed by qemu commit 781774b38c90797add71d029b7fbee43200c66d4.
No other new bugs were found in these two weeks. Seven old bugs remain open in the bug tracker.
Seven Old Issues:
================================================
1. Guest hang with exhausted IRQ sources error if 8 VFs assigned
https://sourceforge.net/tracker/?func=detail&aid=2847560&group_id=180599&atid=893831
2. Hot-added device is not visible in guest after migration
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2832416&group_id=180599
3. ltp diotest running time is 2.54 times what it was before
https://sourceforge.net/tracker/?func=detail&aid=2723366&group_id=180599&atid=893831
4. 32bits Rhel5/FC6 guest may fail to reboot after installation
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1991647&group_id=180599
5. OpenSuse10.2 can not be installed
http://sourceforge.net/tracker/index.php?func=detail&aid=2088475&group_id=180599&atid=893831
6. Fail to Save Restore Guest
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2175042&group_id=180599
7. perfctr wrmsr warning when booting 64bit RHEL5.3
https://sourceforge.net/tracker/?func=detail&aid=2721640&group_id=180599&atid=893831
Test environment
================================================
Platform A
Stoakley/Clovertown
CPU 4
Memory size 8G
Report Summary on IA32-pae
Summary Test Report of Last Session
=====================================================================
Total Pass Fail NoResult Crash
=====================================================================
control_panel 8 6 2 0 0
gtest 16 16 0 0 0
=====================================================================
control_panel 8 6 2 0 0
:KVM_256M_guest_PAE_gPAE 1 1 0 0 0
:KVM_linux_win_PAE_gPAE 1 1 0 0 0
:KVM_two_winxp_PAE_gPAE 1 1 0 0 0
:KVM_four_sguest_PAE_gPA 1 1 0 0 0
:KVM_1500M_guest_PAE_gPA 1 1 0 0 0
:KVM_LM_Continuity_PAE_g 1 1 0 0 0
:KVM_LM_SMP_PAE_gPAE 1 0 1 0 0
:KVM_SR_Continuity_PAE_g 1 0 1 0 0
gtest 16 16 0 0 0
:ltp_nightly_PAE_gPAE 1 1 0 0 0
:boot_up_acpi_PAE_gPAE 1 1 0 0 0
:boot_up_acpi_xp_PAE_gPA 1 1 0 0 0
:boot_up_vista_PAE_gPAE 1 1 0 0 0
:reboot_xp_PAE_gPAE 1 1 0 0 0
:boot_base_kernel_PAE_gP 1 1 0 0 0
:boot_up_acpi_win2k3_PAE 1 1 0 0 0
:boot_smp_acpi_win2k3_PA 1 1 0 0 0
:boot_smp_acpi_win2k_PAE 1 1 0 0 0
:boot_up_win2008_PAE_gPA 1 1 0 0 0
:boot_up_acpi_win2k_PAE_ 1 1 0 0 0
:boot_smp_acpi_xp_PAE_gP 1 1 0 0 0
:boot_up_noacpi_win2k_PA 1 1 0 0 0
:boot_smp_vista_PAE_gPAE 1 1 0 0 0
:boot_smp_win2008_PAE_gP 1 1 0 0 0
:kb_nightly_PAE_gPAE 1 1 0 0 0
=====================================================================
Total 24 22 2 0 0
Report Summary on IA32e
Summary Test Report of Last Session
=====================================================================
Total Pass Fail NoResult Crash
=====================================================================
control_panel 17 14 3 0 0
gtest 23 23 0 0 0
=====================================================================
control_panel 17 14 3 0 0
:KVM_4G_guest_64_g32e 1 1 0 0 0
:KVM_four_sguest_64_gPAE 1 1 0 0 0
:KVM_LM_SMP_64_g32e 1 1 0 0 0
:KVM_linux_win_64_gPAE 1 1 0 0 0
:KVM_LM_SMP_64_gPAE 1 1 0 0 0
:KVM_SR_Continuity_64_gP 1 0 1 0 0
:KVM_four_sguest_64_g32e 1 1 0 0 0
:KVM_four_dguest_64_gPAE 1 1 0 0 0
:KVM_SR_SMP_64_gPAE 1 0 1 0 0
:KVM_LM_Continuity_64_g3 1 1 0 0 0
:KVM_1500M_guest_64_gPAE 1 1 0 0 0
:KVM_LM_Continuity_64_gP 1 1 0 0 0
:KVM_1500M_guest_64_g32e 1 1 0 0 0
:KVM_SR_Continuity_64_g3 1 0 1 0 0
:KVM_two_winxp_64_gPAE 1 1 0 0 0
:KVM_256M_guest_64_gPAE 1 1 0 0 0
:KVM_256M_guest_64_g32e 1 1 0 0 0
gtest 23 23 0 0 0
:boot_up_acpi_64_gPAE 1 1 0 0 0
:boot_up_noacpi_xp_64_gP 1 1 0 0 0
:boot_base_kernel_64_gPA 1 1 0 0 0
:boot_up_vista_64_g32e 1 1 0 0 0
:boot_smp_acpi_win2k3_64 1 1 0 0 0
:boot_smp_acpi_win2k_64_ 1 1 0 0 0
:kb_nightly_64_gPAE 1 1 0 0 0
:boot_up_acpi_xp_64_g32e 1 1 0 0 0
:boot_up_noacpi_win2k_64 1 1 0 0 0
:boot_smp_acpi_xp_64_gPA 1 1 0 0 0
:boot_smp_acpi_xp_64_g32 1 1 0 0 0
:boot_smp_vista_64_gPAE 1 1 0 0 0
:boot_up_acpi_64_g32e 1 1 0 0 0
:boot_base_kernel_64_g32 1 1 0 0 0
:kb_nightly_64_g32e 1 1 0 0 0
:boot_up_acpi_win2k3_64_ 1 1 0 0 0
:boot_up_win2008_64_gPAE 1 1 0 0 0
:ltp_nightly_64_g32e 1 1 0 0 0
:boot_smp_win2008_64_g32 1 1 0 0 0
:boot_up_vista_64_gPAE 1 1 0 0 0
:ltp_nightly_64_gPAE 1 1 0 0 0
:boot_smp_acpi_win2k3_64 1 1 0 0 0
:boot_up_noacpi_win2k3_6 1 1 0 0 0
=====================================================================
Total 40 37 3 0 0
Test environment
================================================
Platform B
Nehalem
CPU 8
Memory size 4G
Report Summary on IA32-pae
Summary Test Report of Last Session
=====================================================================
Total Pass Fail NoResult Crash
=====================================================================
control_panel_ept_vpid 7 7 0 0 0
control_panel_ept 3 3 0 0 0
control_panel 4 4 0 0 0
control_panel_vpid 4 4 0 0 0
gtest_vpid 4 4 0 0 0
gtest_ept 2 2 0 0 0
gtest 4 4 0 0 0
gtest_ept_vpid 14 14 0 0 0
=====================================================================
control_panel_ept_vpid 7 7 0 0 0
:KVM_256M_guest_PAE_gPAE 1 1 0 0 0
:KVM_four_sguest_PAE_gPA 1 1 0 0 0
:KVM_1500M_guest_PAE_gPA 1 1 0 0 0
:KVM_linux_win_PAE_gPAE 1 1 0 0 0
:KVM_LM_Continuity_PAE_g 1 1 0 0 0
:KVM_LM_SMP_PAE_gPAE 1 1 0 0 0
:KVM_two_winxp_PAE_gPAE 1 1 0 0 0
control_panel_ept 3 3 0 0 0
:KVM_four_sguest_PAE_gPA 1 1 0 0 0
:KVM_1500M_guest_PAE_gPA 1 1 0 0 0
:KVM_linux_win_PAE_gPAE 1 1 0 0 0
control_panel 4 4 0 0 0
:KVM_four_sguest_PAE_gPA 1 1 0 0 0
:KVM_1500M_guest_PAE_gPA 1 1 0 0 0
:KVM_linux_win_PAE_gPAE 1 1 0 0 0
:KVM_LM_SMP_PAE_gPAE 1 1 0 0 0
control_panel_vpid 4 4 0 0 0
:KVM_four_sguest_PAE_gPA 1 1 0 0 0
:KVM_1500M_guest_PAE_gPA 1 1 0 0 0
:KVM_linux_win_PAE_gPAE 1 1 0 0 0
:KVM_LM_SMP_PAE_gPAE 1 1 0 0 0
gtest_vpid 4 4 0 0 0
:boot_smp_vista_PAE_gPAE 1 1 0 0 0
:reboot_xp_PAE_gPAE 1 1 0 0 0
:boot_base_kernel_PAE_gP 1 1 0 0 0
:boot_up_win2008_PAE_gPA 1 1 0 0 0
gtest_ept 2 2 0 0 0
:boot_smp_acpi_win2k3_PA 1 1 0 0 0
:boot_smp_acpi_win2k_PAE 1 1 0 0 0
gtest 4 4 0 0 0
:boot_smp_win2008_PAE_gP 1 1 0 0 0
:boot_up_vista_PAE_gPAE 1 1 0 0 0
:boot_smp_acpi_xp_PAE_gP 1 1 0 0 0
:boot_up_noacpi_win2k_PA 1 1 0 0 0
gtest_ept_vpid 14 14 0 0 0
:boot_up_acpi_PAE_gPAE 1 1 0 0 0
:ltp_nightly_PAE_gPAE 1 1 0 0 0
:boot_up_acpi_xp_PAE_gPA 1 1 0 0 0
:boot_up_vista_PAE_gPAE 1 1 0 0 0
:boot_base_kernel_PAE_gP 1 1 0 0 0
:boot_up_acpi_win2k3_PAE 1 1 0 0 0
:boot_smp_acpi_win2k3_PA 1 1 0 0 0
:boot_up_win2008_PAE_gPA 1 1 0 0 0
:boot_up_acpi_win2k_PAE_ 1 1 0 0 0
:boot_smp_acpi_win2k_PAE 1 1 0 0 0
:boot_smp_acpi_xp_PAE_gP 1 1 0 0 0
:boot_up_noacpi_win2k_PA 1 1 0 0 0
:boot_smp_win2008_PAE_gP 1 1 0 0 0
:kb_nightly_PAE_gPAE 1 1 0 0 0
=====================================================================
Total 42 42 0 0 0
Report Summary on IA32e
Summary Test Report of Last Session
=====================================================================
Total Pass Fail NoResult Crash
=====================================================================
control_panel_ept_vpid 16 13 3 0 0
control_panel_ept 5 5 0 0 0
control_panel 5 5 0 0 0
control_panel_vpid 7 6 1 0 0
gtest_vpid 4 4 0 0 0
gtest_ept 1 1 0 0 0
gtest 6 6 0 0 0
vtd 7 6 1 0 0
vtd_ept_vpid 12 10 2 0 0
gtest_ept_vpid 19 19 0 0 0
=====================================================================
control_panel_ept_vpid 16 13 3 0 0
:KVM_SR_SMP_64_gPAE 1 0 1 0 0
:KVM_LM_Continuity_64_g3 1 1 0 0 0
:KVM_four_sguest_64_gPAE 1 1 0 0 0
:KVM_linux_win_64_gPAE 1 1 0 0 0
:KVM_LM_SMP_64_g32e 1 1 0 0 0
:KVM_1500M_guest_64_gPAE 1 1 0 0 0
:KVM_LM_Continuity_64_gP 1 1 0 0 0
:KVM_SR_Continuity_64_gP 1 0 1 0 0
:KVM_LM_SMP_64_gPAE 1 1 0 0 0
:KVM_1500M_guest_64_g32e 1 1 0 0 0
:KVM_256M_guest_64_gPAE 1 1 0 0 0
:KVM_two_winxp_64_gPAE 1 1 0 0 0
:KVM_SR_Continuity_64_g3 1 0 1 0 0
:KVM_256M_guest_64_g32e 1 1 0 0 0
:KVM_four_sguest_64_g32e 1 1 0 0 0
:KVM_four_dguest_64_gPAE 1 1 0 0 0
control_panel_ept 5 5 0 0 0
:KVM_linux_win_64_g32e 1 1 0 0 0
:KVM_1500M_guest_64_g32e 1 1 0 0 0
:KVM_four_sguest_64_gPAE 1 1 0 0 0
:KVM_LM_SMP_64_g32e 1 1 0 0 0
:KVM_1500M_guest_64_gPAE 1 1 0 0 0
control_panel 5 5 0 0 0
:KVM_1500M_guest_64_g32e 1 1 0 0 0
:KVM_linux_win_64_gPAE 1 1 0 0 0
:KVM_four_sguest_64_g32e 1 1 0 0 0
:KVM_LM_SMP_64_g32e 1 1 0 0 0
:KVM_1500M_guest_64_gPAE 1 1 0 0 0
control_panel_vpid 7 6 1 0 0
:KVM_linux_win_64_g32e 1 1 0 0 0
:KVM_SR_SMP_64_gPAE 1 0 1 0 0
:KVM_1500M_guest_64_g32e 1 1 0 0 0
:KVM_four_sguest_64_gPAE 1 1 0 0 0
:KVM_two_winxp_64_gPAE 1 1 0 0 0
:KVM_LM_SMP_64_g32e 1 1 0 0 0
:KVM_1500M_guest_64_gPAE 1 1 0 0 0
gtest_vpid 4 4 0 0 0
:boot_smp_win2008_64_g32 1 1 0 0 0
:boot_up_vista_64_gPAE 1 1 0 0 0
:boot_smp_acpi_win2k3_64 1 1 0 0 0
:boot_smp_acpi_xp_64_g32 1 1 0 0 0
gtest_ept 1 1 0 0 0
:boot_smp_acpi_xp_64_g32 1 1 0 0 0
gtest 6 6 0 0 0
:boot_smp_win2008_64_g32 1 1 0 0 0
:boot_base_kernel_64_g32 1 1 0 0 0
:boot_smp_acpi_xp_64_gPA 1 1 0 0 0
:boot_smp_acpi_win2k_64_ 1 1 0 0 0
:boot_up_win2008_64_gPAE 1 1 0 0 0
:boot_smp_vista_64_g32e 1 1 0 0 0
vtd 7 6 1 0 0
:one_pcie_scp_64_g32e 1 1 0 0 0
:one_pcie_up_nomsi_64_g3 1 1 0 0 0
:one_pcie_up_xp_64_g32e 1 1 0 0 0
:one_pcie_up_64_g32e 1 1 0 0 0
:lm_pcie_up_64_g32e 1 0 1 0 0
:two_dev_up_64_g32e 1 1 0 0 0
:hp_pcie_up_64_g32e 1 1 0 0 0
vtd_ept_vpid 12 10 2 0 0
:one_pcie_up_nomsi_64_g3 1 1 0 0 0
:one_pcie_up_xp_64_g32e 1 1 0 0 0
:one_pcie_scp_64_gPAE 1 1 0 0 0
:one_pcie_up_64_g32e 1 1 0 0 0
:lm_pcie_up_64_g32e 1 0 1 0 0
:two_dev_up_64_g32e 1 0 1 0 0
:one_pcie_scp_64_g32e 1 1 0 0 0
:one_pcie_up_xp_64_gPAE 1 1 0 0 0
:hp_pcie_smp_64_g32e 1 1 0 0 0
:one_pcie_smp_64_g32e 1 1 0 0 0
:hp_pcie_up_64_g32e 1 1 0 0 0
:one_pcie_up_64_gPAE 1 1 0 0 0
gtest_ept_vpid 19 19 0 0 0
:boot_up_acpi_64_gPAE 1 1 0 0 0
:boot_up_noacpi_xp_64_gP 1 1 0 0 0
:boot_base_kernel_64_gPA 1 1 0 0 0
:boot_smp_acpi_win2k3_64 1 1 0 0 0
:boot_smp_acpi_win2k_64_ 1 1 0 0 0
:kb_nightly_64_gPAE 1 1 0 0 0
:boot_up_acpi_xp_64_g32e 1 1 0 0 0
:boot_up_noacpi_win2k_64 1 1 0 0 0
:boot_smp_acpi_xp_64_gPA 1 1 0 0 0
:boot_smp_acpi_xp_64_g32 1 1 0 0 0
:boot_up_acpi_64_g32e 1 1 0 0 0
:boot_base_kernel_64_g32 1 1 0 0 0
:kb_nightly_64_g32e 1 1 0 0 0
:boot_up_acpi_win2k3_64_ 1 1 0 0 0
:boot_up_win2008_64_gPAE 1 1 0 0 0
:ltp_nightly_64_g32e 1 1 0 0 0
:boot_smp_win2008_64_g32 1 1 0 0 0
:boot_smp_acpi_win2k3_64 1 1 0 0 0
:boot_up_noacpi_win2k3_6 1 1 0 0 0
=====================================================================
Total 82 75 7 0 0
Best Regards,
Jiajun
* Re: compat-wireless master-2009-09-28 breakage and suggested fixes
From: Hin-Tak Leung @ 2009-09-29 3:33 UTC
To: Luis R. Rodriguez; +Cc: Luis Rodriguez, linux-wireless
In-Reply-To: <20090929025509.GB6529@mosca>
On Tue, Sep 29, 2009 at 3:55 AM, Luis R. Rodriguez
<lrodriguez@atheros.com> wrote:
> On Mon, Sep 28, 2009 at 07:35:22PM -0700, Hin-Tak Leung wrote:
>> commit d0cf9c0dadcdc89a755bcb301cfc9c796eb28ccf
>> Author: Stephen Hemminger <shemminger@vyatta.com>
>> Date: Mon Aug 31 19:50:57 2009 +0000
>>
>> wireless: convert drivers to netdev_tx_t
>>
>> and the 2nd change due to this:
>>
>> commit 384912ed194e43c03ad1cdaa09b0b1e488c34d46
>> Author: Marcel Holtmann <marcel@holtmann.org>
>> Date: Mon Aug 31 21:08:19 2009 +0000
>>
>> net: Add DEVTYPE support for Ethernet based devices
>>
>> Both of these changes are traced back to changes in
>> <linux/netdevice.h> , which compat-wireless does not ship. What's your
>> policy on these kind of changes to compat-wireless?
>> (the 2nd SET_NETDEV_DEVTYPE change probably can be spanned by an
>> ifndef SET_NETDEV_DEVTYPE, and roll into
>> "compat/patches/01-netdev.patch"? Should the first kind of change also
>> go into compat/patches/01-netdev.patch?)
>
> I don't see this yet on wireless-testing but it is on 2.6.32.
> I backported this as follows. I'll push this out shortly.
>
> From: Luis R. Rodriguez <lrodriguez@atheros.com>
> Subject: [PATCH] Fix compilation against for 2.6.32 changes
>
> 2.6.32 added SET_NETDEV_DEVTYPE() and netdev_tx
Argh, I see what you are getting at: code that emulates "API changes
between 2.6.(x-1) and 2.6.(x)" goes into the file
compat/compat-2.6.(x).{h,c}.
The 'compatibility code for new kernel' wording wasn't too obvious :-).
>
> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
Acked.
> ---
> compat/compat-2.6.32.h | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/compat/compat-2.6.32.h b/compat/compat-2.6.32.h
> index f7081f2..418b521 100644
> --- a/compat/compat-2.6.32.h
> +++ b/compat/compat-2.6.32.h
> @@ -31,6 +31,18 @@
>
> #define dev_change_net_namespace(a, b, c) (-EOPNOTSUPP)
>
> +#define SET_NETDEV_DEVTYPE(netdev, type)
> +
> +#ifdef __KERNEL__
> +/* Driver transmit return codes */
> +enum netdev_tx {
> + BACKPORT_NETDEV_TX_OK = NETDEV_TX_OK, /* driver took care of packet */
> + BACKPORT_NETDEV_TX_BUSY = NETDEV_TX_BUSY, /* driver tx path was busy*/
> + BACKPORT_NETDEV_TX_LOCKED = NETDEV_TX_LOCKED, /* driver tx lock was already taken */
> +};
> +typedef enum netdev_tx netdev_tx_t;
> +#endif /* __KERNEL__ */
> +
> #endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,32)) */
>
> #endif /* LINUX_26_32_COMPAT_H */
> --
> 1.6.3.3
>
>
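The guard pattern discussed above (a version check for the whole compat header, plus the suggested "ifndef SET_NETDEV_DEVTYPE") can be sketched in userspace roughly as follows. This is only an illustration, not the compat-wireless code: the KERNEL_VERSION, LINUX_VERSION_CODE and NETDEV_TX_* definitions are stand-ins for what the kernel headers would provide, the target version 2.6.31 is an assumption, and demo_start_xmit() is a hypothetical driver hook.

```c
#include <assert.h>

/* Stand-ins for <linux/version.h>, so the sketch builds in userspace. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 6, 31) /* assumed target kernel */

/* Stand-ins for the pre-2.6.32 macros from <linux/netdevice.h>. */
#define NETDEV_TX_OK   0
#define NETDEV_TX_BUSY 1

#if (LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 32))

/* Only define the no-op if the kernel headers did not provide the macro. */
#ifndef SET_NETDEV_DEVTYPE
#define SET_NETDEV_DEVTYPE(netdev, type) do { } while (0)
#endif

/* Same shape as the backported typedef in the quoted patch. */
enum netdev_tx {
	BACKPORT_NETDEV_TX_OK = NETDEV_TX_OK,
	BACKPORT_NETDEV_TX_BUSY = NETDEV_TX_BUSY,
};
typedef enum netdev_tx netdev_tx_t;

#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,32)) */

/* Hypothetical transmit hook: written against the 2.6.32 return type,
 * yet it still compiles when the typedef comes from the compat header. */
static netdev_tx_t demo_start_xmit(int queue_full)
{
	assert(queue_full == 0 || queue_full == 1);
	return queue_full ? NETDEV_TX_BUSY : NETDEV_TX_OK;
}
```

With the guards in place, driver code converted to netdev_tx_t needs no per-driver changes when built against an older kernel.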
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-29 3:22 UTC
To: Nauman Rafique
Cc: linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf,
mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka,
guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya
In-Reply-To: <e98e18940909281737q142c788dpd20b8bdc05dd0eff@mail.gmail.com>
On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.
Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.
> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.
I have already gotten rid of the core of BFQ. What remains is the idle tree
and data structures. I will see how I can simplify it further.
> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.
Sure. Though I don't think a lot of the code is there because of its
hierarchical nature. If we solve the issue at the CFQ layer, we have to
maintain at least two levels: one for queues and the other for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.
Sure, we can start with CFQ only. But a bigger question we need to answer
is whether CFQ is the right place to solve the issue. Jens, do you think
that CFQ is the right place to solve the problem?
Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also.
I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.
Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size of IO or number of IOs.
If we implement some kind of time based solution at a higher layer, then
that higher layer should know how much disk time each group used. We
can probably do some kind of timestamping in the bio to get a sense of when
it got into the disk and when it finished. But on multi queue hardware there
can be multiple requests in the disk, either from the same queue or from
different queues, and with a pure timestamping based approach, so far I could
not think of how at a high level we would get an idea of who used how much time.
So this is the first point of contention: how do we want to provide
fairness? In terms of disk time used, or in terms of size of IO/number of
IOs?
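To make the timestamping problem concrete, here is a minimal userspace sketch (not kernel code). The struct io_req and both helper names are hypothetical. It compares the naive per-request sum of (finish - start) with the wall-clock time the device was actually busy. When requests overlap in the device, the naive sum overcounts, and the timestamps alone do not say how to split the busy time between the owners of the overlapping requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of when a request entered and left the device. */
struct io_req {
	unsigned long start_us;
	unsigned long end_us;
};

/* Naive accounting: charge every request its full in-device time. */
static unsigned long naive_disk_time(const struct io_req *r, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		assert(r[i].end_us >= r[i].start_us);
		total += r[i].end_us - r[i].start_us;
	}
	return total;
}

/* Wall-clock time the device had at least one request in flight.
 * Requests must be sorted by start_us. */
static unsigned long busy_disk_time(const struct io_req *r, size_t n)
{
	unsigned long total = 0, cur_start, cur_end;

	if (n == 0)
		return 0;
	cur_start = r[0].start_us;
	cur_end = r[0].end_us;
	for (size_t i = 1; i < n; i++) {
		assert(r[i].start_us >= r[i - 1].start_us);
		if (r[i].start_us <= cur_end) {
			/* Overlaps the current busy period: extend it. */
			if (r[i].end_us > cur_end)
				cur_end = r[i].end_us;
		} else {
			/* Gap: close the current busy period, open a new one. */
			total += cur_end - cur_start;
			cur_start = r[i].start_us;
			cur_end = r[i].end_us;
		}
	}
	return total + (cur_end - cur_start);
}
```

For two requests in flight over [0, 100] and [10, 110] microseconds, the naive sum is 200 while the device was busy for only 110.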
Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller, where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk?
Or do we want max bandwidth control, where a group is not allowed to use the
disk beyond its limit even if the disk is free?
Or do we need both? I would think that at some point we will need
both, but we can start with proportional bandwidth control first.
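The difference between the two policies can be sketched as below. This is an illustrative model only, not code from any of the patch sets: struct io_group, its fields and both function names are made up here. Under proportional control, only backlogged groups compete, so a lone group gets the whole disk. Under max bandwidth control, a group is capped even when nobody else is doing IO.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group settings. */
struct io_group {
	unsigned int weight;     /* proportional share weight */
	int backlogged;          /* non-zero if the group has pending IO */
	unsigned long max_kbps;  /* absolute cap, 0 means no cap */
};

/* Proportional control: divide disk_kbps among backlogged groups by weight. */
static unsigned long proportional_kbps(const struct io_group *g, size_t n,
				       size_t idx, unsigned long disk_kbps)
{
	unsigned long total_weight = 0;

	assert(idx < n);
	if (!g[idx].backlogged)
		return 0;
	for (size_t i = 0; i < n; i++)
		if (g[i].backlogged)
			total_weight += g[i].weight;
	if (total_weight == 0)
		return 0;
	return disk_kbps * g[idx].weight / total_weight;
}

/* Max bandwidth control: clamp to the cap regardless of contention. */
static unsigned long max_capped_kbps(const struct io_group *g,
				     unsigned long available_kbps)
{
	if (g->max_kbps && available_kbps > g->max_kbps)
		return g->max_kbps;
	return available_kbps;
}
```

With weights 200 and 100 on a 90000 KB/s disk, the groups get 60000 and 30000 when both are backlogged. If the second group goes idle, the first gets all 90000 under proportional control, but stays at its cap under max bandwidth control.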
Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also,
or is it sufficient to provide fairness at leaf nodes? Providing fairness
at leaf nodes can help us use the resources optimally, and in the process
we can get fairness at the higher level also in many of the cases.
But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of the underlying physical devices?
I think that for proportional bandwidth control, it should be ok to provide
fairness at leaf nodes, but for max bandwidth control it
might make more sense to provide fairness at the higher level. Consider a
case where a customer wants to limit a group to 30MB/s on a striped
device: with leaf node control, if every leaf node provides
30MB/s, it might accumulate to much more than the specified rate at the
logical device.
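As a worked version of the striped-device example: if each leaf enforces the limit independently, the worst-case rate at the logical device is the per-leaf limit times the number of leaves. The function name and the stripe count of 4 used below are assumptions for illustration only.

```c
#include <assert.h>

/* Worst-case rate seen at the logical device when each of n_leaves
 * enforces per_leaf_mbps on its own, with no coordination. */
static unsigned long aggregate_rate_mbps(unsigned long per_leaf_mbps,
					 unsigned int n_leaves)
{
	assert(n_leaves > 0);
	return per_leaf_mbps * n_leaves;
}
```

For a 30MB/s limit on a 4-way stripe, aggregate_rate_mbps(30, 4) gives 120, four times what the customer asked for.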
Latency Control and strong isolation between groups
===================================================
Do we want better latencies and stronger isolation between groups?
I think if the problem is solved at the IO scheduler level, we can achieve
better latency control and hence stronger isolation between groups.
Higher level solutions should find it hard to provide the same kind of latency
control and isolation between groups as an IO scheduler based solution.
Fairness for buffered writes
============================
Doing io control at any place below the page cache has the disadvantage that
the page cache might not dispatch more writes from a higher weight group, hence
the higher weight group might not see more IO done. Andrew says that we don't
have a solution to this problem in the kernel and he would like to see it
handled properly.
The only way to solve this seems to be to slow down the writers before they
write into the page cache. The IO throttling patch handled it by slowing down
a writer if it crossed the max specified rate. Other suggestions have come in
the form of a dirty_ratio per memory cgroup or a separate cgroup controller
altogether where some kind of per group write limit can be specified.
So if the solution is implemented at the IO scheduler layer or at the device
mapper layer, both will have to rely on another controller being co-mounted
to handle buffered writes properly.
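The idea of slowing down a writer before it dirties the page cache can be sketched as a simple rate check. This is not the IO throttling patch's actual code; the function name, parameters and the averaging-over-a-window approach are assumptions for illustration. Given how many bytes a writer has dirtied in the current window and how long that took, it returns how long the writer should sleep so that its average rate drops back to the limit.

```c
#include <assert.h>

/* How long (in ms) a writer should sleep so that bytes_written over the
 * window averages out to at most max_bytes_per_sec. Returns 0 if the
 * writer is already at or under the limit. */
static unsigned long long throttle_sleep_ms(unsigned long long bytes_written,
					    unsigned long long elapsed_ms,
					    unsigned long long max_bytes_per_sec)
{
	unsigned long long min_ms;

	assert(max_bytes_per_sec > 0);
	/* Minimum time this much data may take at the allowed rate. */
	min_ms = bytes_written * 1000ULL / max_bytes_per_sec;
	return min_ms > elapsed_ms ? min_ms - elapsed_ms : 0;
}
```

A writer that dirtied 10MB in 100ms against a 10MB/s limit would be asked to sleep 900ms; one that took 2 seconds to write 1MB against a 1MB/s limit would not sleep at all.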
Fairness within a group
=======================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within a group is not impacted, especially
making sure that we don't break the notion of ioprio of the
processes within the group.
The io throttling patch in particular was very bad in terms of prio within
a group: throttling treated everyone equally and the difference between
process prios disappeared.
Reads Vs Writes
===============
A higher level control most likely will change the ratio in which reads
and writes are dispatched to disk within a group. It used to be decided
by the IO scheduler so far, but with higher level groups doing throttling and
possibly buffering the bios and releasing them later, they will have to
come up with their own policy on the proportion in which reads and writes
should be dispatched. In case of IO scheduler based control, all the
queuing takes place at the IO scheduler and it still retains control of
the ratio in which reads and writes are dispatched.
Summary
=======
- An io scheduler based io controller can provide better latencies,
stronger isolation between groups, time based fairness and will not
interfere with io scheduler policies like class, ioprio and
reader vs writer issues.
But it cannot guarantee fairness at higher level logical devices.
Especially in case of max bw control, leaf node control does not sound
like the most appropriate thing.
- IO throttling provides max bw control in terms of absolute rate. It has
the advantage that it can provide control at higher level logical devices
and also control buffered writes without the need for an additional
controller to be co-mounted.
But it does only max bw control and not proportional control, so one might
not be using resources optimally. It loses the sense of task prio and class
within a group, as any of the tasks in the group can be throttled. Because
throttling does not kick in till you hit the max bw limit, it should find
it hard to provide the same latencies as io scheduler based control.
- dm-ioband also has the advantage that it can provide fairness at higher
level logical devices.
But fairness is provided only in terms of size of IO or number of IOs.
No time based fairness. It is very throughput oriented and does not
throttle a high speed group if another group is running a slow random reader.
This results in bad latencies for the random reader group and weaker
isolation between groups.
Also it does not provide fairness if a group is not continuously
backlogged. So if one is running 1-2 dd/sequential readers in the group,
one does not get fairness until the workload is increased to a point where
the group becomes continuously backlogged. This also results in poor
latencies and limited fairness.
At this point it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.
- Drop some of the requirements and go with one implementation which meets
the reduced set of requirements.
- Have more than one IO controller implementation in the kernel: one for lower
level control for better latencies, stronger isolation and optimal resource
usage, and another for fairness at higher level logical devices and max
bandwidth control.
And let the user decide which one to use based on his/her needs.
- Come up with a more intelligent way of doing IO control where a single
controller covers all the cases.
At this point, I am more inclined towards option 2 of having more
than one implementation in the kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen).
>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.
This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.
Thoughts/ideas/opinions are welcome...
Thanks
Vivek
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-29 3:22 UTC (permalink / raw)
To: Nauman Rafique
Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
paolo.valente, jmarchan, guijianfeng, fernando, mikew,
yoshikawa.takuya, jmoyer, mingo, m-ikeda, riel, lizf, fchecconi,
s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds
In-Reply-To: <e98e18940909281737q142c788dpd20b8bdc05dd0eff@mail.gmail.com>
On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.
Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.
> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.
Core of the BFQ I have gotten rid of already. The remaining part is idle tree
and data structures. I will see how can I simplify it further.
> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.
Sure. Though I don't think that a lot of code is there because of
hierarchical nature. If we solve the issue at CFQ layer, we have to
maintain atleast two levels. One for queue and other for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.
Sure, we can start with CFQ only. But a bigger question we need to answer
is that is CFQ the right place to solve the issue? Jens, do you think
that CFQ is the right place to solve the problem?
Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also.
I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.
Fairness in terms of size of IO or disk time used
=================================================
On a seeky media, fairness in terms of disk time can get us better results
instead fairness interms of size of IO or number of IO.
If we implement some kind of time based solution at higher layer, then
that higher layer should know who used how much of time each group used. We
can probably do some kind of timestamping in bio to get a sense when did it
get into disk and when did it finish. But on a multi queue hardware there
can be multiple requests in the disk either from same queue or from differnet
queues and with pure timestamping based apparoch, so far I could not think
how at high level we will get an idea who used how much of time.
So this is the first point of contention that how do we want to provide
fairness. In terms of disk time used or in terms of size of IO/number of
IO.
Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk.
Or we want max bandwidth control where a group is not allowed to use the
disk even if disk is free.
Or we need both? I would think that at some point of time we will need
both but we can start with proportional bandwidth control first.
Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also
or it is sufficient to provide fairness at leaf nodes. Providing fairness
at leaf nodes can help us use the resources optimally and in the process
we can get fairness at higher level also in many of the cases.
But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of unerlying phsical devices?
I think that for proportinal bandwidth control, it should be ok to provide
fairness at higher level logical device but for max bandwidth control it
might make more sense to provide fairness at higher level. Consider a
case where from a striped device a customer wants to limit a group to
30MB/s and in case of leaf node control, if every leaf node provides
30MB/s, it might accumulate to much more than specified rate at logical
device.
Latency Control and strong isolation between groups
===================================================
Do we want a good isolation between groups and better latencies and
stronger isolation between groups?
I think if problem is solved at IO scheduler level, we can achieve better
latency control and hence stronger isolation between groups.
Higher level solutions should find it hard to provide same kind of latency
control and isolation between groups as IO scheduler based solution.
Fairness for buffered writes
============================
Doing io control at any place below page cache has disadvantage that page
cache might not dispatch more writes from higher weight group hence higher
weight group might not see more IO done. Andrew says that we don't have
a solution to this problem in kernel and he would like to see it handled
properly.
Only way to solve this seems to be to slow down the writers before they
write into page cache. IO throttling patch handled it by slowing down
writer if it crossed max specified rate. Other suggestions have come in
the form of dirty_ratio per memory cgroup or a separate cgroup controller
al-together where some kind of per group write limit can be specified.
So if solution is implemented at IO scheduler layer or at device mapper
layer, both shall have to rely on another controller to be co-mounted
to handle buffered writes properly.
Fairness with-in group
======================
One of the issues with higher level controller is that how to do fair
throttling so that fairness with-in group is not impacted. Especially
the case of making sure that we don't break the notion of ioprio of the
processes with-in group.
Especially io throttling patch was very bad in terms of prio with-in
group where throttling treated everyone equally and difference between
process prio disappeared.
Reads Vs Writes
===============
A higher level control most likely will change the ratio in which reads
and writes are dispatched to disk with-in group. It used to be decided
by IO scheduler so far but with higher level groups doing throttling and
possibly buffering the bios and releasing them later, they will have to
come up with their own policy on in what proportion reads and writes
should be dispatched. In case of IO scheduler based control, all the
queuing takes place at IO scheduler and it still retains control of
in what ration reads and writes should be dispatched.
Summary
=======
- An io scheduler based io controller can provide better latencies,
stronger isolation between groups, time based fairness and will not
interfere with io schedulers policies like class, ioprio and
reader vs writer issues.
But it can gunrantee fairness at higher logical level devices.
Especially in case of max bw control, leaf node control does not sound
to be the most appropriate thing.
- IO throttling provides max bw control in terms of absolute rate. It has
the advantage that it can provide control at higher level logical device
and also control buffered writes without need of additional controller
co-mounted.
But it does only max bw control and not proportion control so one might
not be using resources optimally. It loses sense of task prio and class
with-in group as any of the task can be throttled with-in group. Because
throttling does not kick in till you hit the max bw limit, it should find
it hard to provide same latencies as io scheduler based control.
- dm-ioband also has the advantage that it can provide fairness at higher
level logical devices.
But, fairness is provided only in terms of size of IO or number of IO.
No time based fairness. It is very throughput oriented and does not
throttle high speed group if other group is running slow random reader.
This results in bad latencies for random reader group and weaker
isolation between groups.
Also it does not provide fairness if a group is not continuously
backlogged. So if one is running 1-2 dd/sequential readers in the group,
one does not get fairness until workload is increased to a point where
group becomes continuously backlogged. This also results in poor
latencies and limited fairness.
At this point of time it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.
- Drop some of the requirements and go with one implementation which meets
those reduced set of requirements.
- Have more than one IO controller implementation in kernel. One for lower
level control for better latencies, stronger isolation and optimal resource
usage and other one for fairness at higher level logical devices and max
bandwidth control.
And let user decide which one to use based on his/her needs.
- Come up with more intelligent way of doing IO control where single
controller covers all the cases.
At this point of time, I am more inclined towards option 2 of having more
than one implementation in kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen).
>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.
This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.
Thoughts/ideas/opinions are welcome...
Thanks
Vivek
^ permalink raw reply
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-29 3:22 UTC (permalink / raw)
To: Nauman Rafique
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
paolo.valente-rcYM44yAMweonA0d6jMUrA,
jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
In-Reply-To: <e98e18940909281737q142c788dpd20b8bdc05dd0eff-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.
Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.
> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.
Core of the BFQ I have gotten rid of already. The remaining part is idle tree
and data structures. I will see how can I simplify it further.
> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.
Sure. Though I don't think that a lot of code is there because of
hierarchical nature. If we solve the issue at CFQ layer, we have to
maintain atleast two levels. One for queue and other for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.
Sure, we can start with CFQ only. But a bigger question we need to answer
is that is CFQ the right place to solve the issue? Jens, do you think
that CFQ is the right place to solve the problem?
Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also.
I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.
Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size of IO or number of IO.
If we implement some kind of time based solution at a higher layer, then
that higher layer should know how much disk time each group used. We
can probably do some kind of timestamping in the bio to get a sense of when
it got into the disk and when it finished. But on multi queue hardware there
can be multiple requests in the disk, either from the same queue or from different
queues, and with a pure timestamping based approach, so far I could not think
of how at a high level we will get an idea of who used how much time.
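To make the attribution problem concrete, here is a minimal userspace sketch. It is not taken from any of the patches; the struct and function names are made up for illustration. It charges each group the (completion - dispatch) time of its requests, the way a pure timestamping scheme would, and compares that with the time the disk was actually busy.

```c
/* Hypothetical record of one request; times are in milliseconds. */
struct req_stamp {
	int group;           /* id of the owning group */
	unsigned long start; /* when the request was handed to the disk */
	unsigned long end;   /* when the request completed */
};

/* Naive accounting: charge a group (end - start) for each of its requests. */
static unsigned long naive_charge(const struct req_stamp *r, int n, int group)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < n; i++)
		if (r[i].group == group)
			total += r[i].end - r[i].start;
	return total;
}

/*
 * Time during which the disk had at least one request in flight.
 * Assumes the array is sorted by start time.
 */
static unsigned long busy_time(const struct req_stamp *r, int n)
{
	unsigned long busy = 0, cur_start, cur_end;
	int i;

	if (n == 0)
		return 0;
	cur_start = r[0].start;
	cur_end = r[0].end;
	for (i = 1; i < n; i++) {
		if (r[i].start <= cur_end) {
			/* Overlaps the current busy interval: extend it. */
			if (r[i].end > cur_end)
				cur_end = r[i].end;
		} else {
			/* Gap: close the current interval, start a new one. */
			busy += cur_end - cur_start;
			cur_start = r[i].start;
			cur_end = r[i].end;
		}
	}
	return busy + (cur_end - cur_start);
}
```

With one request from group 0 in flight from 0 to 10 ms and one from group 1 from 2 to 12 ms, each group is charged 10 ms although the disk was busy for only 12 ms in total. The timestamps alone do not say how those 12 ms should be split between the two groups.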
So this is the first point of contention: how do we want to provide
fairness, in terms of disk time used or in terms of size of IO/number of
IO?
Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk.
Or do we want max bandwidth control where a group is not allowed to exceed
its limit even if the disk is free?
Or we need both? I would think that at some point of time we will need
both but we can start with proportional bandwidth control first.
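The difference between the two policies can be sketched in a few lines. This is only an illustration under simple assumptions (integer weights, a known set of backlogged groups); the function names are hypothetical and do not correspond to any controller's code.

```c
/*
 * Proportional control: percentage of the disk that group idx gets,
 * computed only over groups that currently have IO pending. A group
 * that is alone on the disk gets all of it.
 */
static unsigned int prop_share_pct(const unsigned int *weight,
				   const int *backlogged, int n, int idx)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < n; i++)
		if (backlogged[i])
			total += weight[i];
	if (!backlogged[idx] || total == 0)
		return 0;
	return weight[idx] * 100 / total;
}

/*
 * Max bandwidth control: the group never gets more than its limit,
 * whether or not anybody else is using the disk.
 */
static unsigned int max_capped_rate(unsigned int demand, unsigned int limit)
{
	return demand < limit ? demand : limit;
}
```

With weights 100 and 200, the first group gets about a third of the disk while both groups are busy and all of it once the second group goes idle. Under a 30MB/s cap, a group that could do 80MB/s still gets 30MB/s even on an otherwise idle disk.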
Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also,
or is it sufficient to provide fairness at leaf nodes? Providing fairness
at leaf nodes can help us use the resources optimally and in the process
we can get fairness at higher level also in many of the cases.
But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of underlying physical devices?
I think that for proportional bandwidth control, it should be ok to provide
fairness at leaf nodes, but for max bandwidth control it
might make more sense to provide fairness at higher level. Consider a
case where from a striped device a customer wants to limit a group to
30MB/s and in case of leaf node control, if every leaf node provides
30MB/s, it might accumulate to much more than specified rate at logical
device.
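The arithmetic behind the striped device example, as a small sketch. The helper names are hypothetical, and the second helper assumes IO is spread perfectly evenly over the stripe, which real workloads will not guarantee.

```c
/*
 * Worst case rate seen at the logical device when every leaf device
 * enforces the same cap independently of the others.
 */
static unsigned int aggregate_rate_mbps(unsigned int per_leaf_limit,
					unsigned int nr_leaves)
{
	return per_leaf_limit * nr_leaves;
}

/*
 * Per-leaf cap that would keep the logical device at the requested
 * limit, assuming IO is striped evenly across all leaves.
 */
static unsigned int per_leaf_cap_even_stripe(unsigned int logical_limit,
					     unsigned int nr_leaves)
{
	return logical_limit / nr_leaves;
}
```

For a 3-disk stripe and a 30MB/s limit, independent leaf control can let the group reach 90MB/s at the logical device. Dividing the limit to 10MB/s per leaf only helps if the load is evenly striped; a group hitting a single disk would then be held to 10MB/s instead of 30MB/s.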
Latency Control and strong isolation between groups
===================================================
Do we want better latencies and stronger isolation between groups?
I think if problem is solved at IO scheduler level, we can achieve better
latency control and hence stronger isolation between groups.
Higher level solutions should find it hard to provide same kind of latency
control and isolation between groups as IO scheduler based solution.
Fairness for buffered writes
============================
Doing io control at any place below page cache has disadvantage that page
cache might not dispatch more writes from higher weight group hence higher
weight group might not see more IO done. Andrew says that we don't have
a solution to this problem in kernel and he would like to see it handled
properly.
Only way to solve this seems to be to slow down the writers before they
write into page cache. IO throttling patch handled it by slowing down
writer if it crossed max specified rate. Other suggestions have come in
the form of dirty_ratio per memory cgroup or a separate cgroup controller
altogether where some kind of per group write limit can be specified.
So if solution is implemented at IO scheduler layer or at device mapper
layer, both shall have to rely on another controller to be co-mounted
to handle buffered writes properly.
Fairness with-in group
======================
One of the issues with a higher level controller is how to do fair
throttling so that fairness with-in group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
with-in group.
The io throttling patch in particular was very bad in terms of prio with-in
group: throttling treated everyone equally and the difference between
process prios disappeared.
Reads Vs Writes
===============
A higher level control most likely will change the ratio in which reads
and writes are dispatched to disk with-in group. It used to be decided
by IO scheduler so far but with higher level groups doing throttling and
possibly buffering the bios and releasing them later, they will have to
come up with their own policy on the proportion in which reads and writes
should be dispatched. In case of IO scheduler based control, all the
queuing takes place at the IO scheduler and it still retains control of
the ratio in which reads and writes are dispatched.
Summary
=======
- An io scheduler based io controller can provide better latencies,
stronger isolation between groups, time based fairness and will not
interfere with io schedulers policies like class, ioprio and
reader vs writer issues.
But it cannot guarantee fairness at higher level logical devices.
Especially in case of max bw control, leaf node control does not sound
to be the most appropriate thing.
- IO throttling provides max bw control in terms of absolute rate. It has
the advantage that it can provide control at higher level logical device
and also control buffered writes without need of additional controller
co-mounted.
But it does only max bw control and not proportion control so one might
not be using resources optimally. It loses sense of task prio and class
with-in group as any of the task can be throttled with-in group. Because
throttling does not kick in till you hit the max bw limit, it should find
it hard to provide same latencies as io scheduler based control.
- dm-ioband also has the advantage that it can provide fairness at higher
level logical devices.
But, fairness is provided only in terms of size of IO or number of IO.
No time based fairness. It is very throughput oriented and does not
throttle high speed group if other group is running slow random reader.
This results in bad latencies for random reader group and weaker
isolation between groups.
Also it does not provide fairness if a group is not continuously
backlogged. So if one is running 1-2 dd/sequential readers in the group,
one does not get fairness until workload is increased to a point where
group becomes continuously backlogged. This also results in poor
latencies and limited fairness.
At this point of time it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.
- Drop some of the requirements and go with one implementation which meets
those reduced set of requirements.
- Have more than one IO controller implementation in kernel. One for lower
level control for better latencies, stronger isolation and optimal resource
usage and other one for fairness at higher level logical devices and max
bandwidth control.
And let user decide which one to use based on his/her needs.
- Come up with more intelligent way of doing IO control where single
controller covers all the cases.
At this point of time, I am more inclined towards option 2 of having more
than one implementation in kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen).
>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.
This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.
Thoughts/ideas/opinions are welcome...
Thanks
Vivek
^ permalink raw reply
* Re: hard lockup, followed by ext4_lookup: deleted inode referenced: 524788
From: Theodore Tso @ 2009-09-29 3:13 UTC (permalink / raw)
To: Andy Isaacson; +Cc: linux-kernel, linux-ext4
In-Reply-To: <20090928212838.GS12922@hexapodia.org>
On Mon, Sep 28, 2009 at 02:28:38PM -0700, Andy Isaacson wrote:
>
> I've attached the complete output from "fsck -n /dev/sda1" and "stat
> <%d>" on each inode reported to be deleted.
>
So the large numbers of multiply-claimed blocks message is definitely
a clue:
> Multiply-claimed block(s) in inode 919422: 3704637
> Multiply-claimed block(s) in inode 928410: 3704637
> Multiply-claimed block(s) in inode 928622: 3703283
> Multiply-claimed block(s) in inode 943927: 3703283
> Multiply-claimed block(s) in inode 933307: 3702930
> Multiply-claimed block(s) in inode 943902: 3702930
What this indicates to me is that an inode table block was written to
the wrong location on disk. In fact, given large numbers of inode
numbers involved, it looks like large numbers of inode table blocks
were written to the wrong location on disk.
So what happened with the file "/etc/rcS.d/S90mountdebugfs" is probably
_not_ that it was deleted on September 22nd, but rather sometime
recently the inode table block containing inode #524788 was
overwritten by another inode table block, containing a deleted inode
at that relative position in the inode table block.
This must have happened since the last successful boot, since with
/etc/rcS.d/S90mountdebugfs pointing at a deleted inode, any attempt to
boot the system after the corruption had taken place would have
resulted in catastrophe.
I'm surprised by how many inode table blocks apparently had gotten
mis-directed. Almost certainly there must have been some kind of
hardware failure that must have triggered this. I'm not sure what
caused it, but it does seem like your filesystem has been toasted
fairly badly.
At this point my advice to you would be to try to recover as much data
from the disk as you can, and to *not* try to run fsck or mount the
filesystem read/write until you are confident you have recovered all
of the critical files you care about, or have made an image copy of the
disk using dd to a backup hard drive first. If you're really curious
we could try to look at the dumpe2fs output and see if we can find the
pattern of what might have caused so many misdirected writes, but
there's no guarantee that we would be able to find the definitive root
cause, and from a recovery perspective, it's probably faster and less
risky to reinstall your system disk from scratch.
Good luck, and I'm sorry your file system had gotten so badly
disrupted.
- Ted
^ permalink raw reply
* RE: [PATCH] replace rdtsc emulation-vs-native xen boot option with per-domain (hypervisor part)
From: Dan Magenheimer @ 2009-09-29 3:13 UTC (permalink / raw)
To: Zhang, Xiantao, Ian Pratt, Jeremy Fitzhardinge
Cc: Xen-Devel (E-mail), Keir Fraser
In-Reply-To: <706158FABBBA044BAD4FE898A02E4BC201C9A18327@pdsmsx503.ccr.corp.intel.com>
> Any other necessary reasons to introduce it except fixing
> skew issue between cpus?
I think the reasons were discussed at length in a previous
thread. To summarize, some applications already do (or can
or will) use rdtsc as a legal non-privileged instruction
and expect to get the Intel SDM-defined behavior of the
instruction. There are some hardware+software environments
that do not correctly provide this behavior, including
some older SMP machines and early generation AMD desktop
multi-core machines. There are many more that DO correctly
provide this behavior, including all recent generation
Intel machines and nearly all recent generation AMD
machines... and, very importantly, any VM running on
VMware. Because rdtsc is "unsafe" on some machines
does not stop application programmers from using it
on machines where it is safe; it will be increasingly
likely that an application programmer may never
experience an "unsafe" hardware/software environment.
We cannot tell app programmers that they cannot use rdtsc,
only warn them that it is risky for some older machines.
Some will use it anyway and I have personally talked to
some that are. We cannot predict or legislate
how rdtsc will be used in the future. It is easy to
imagine a scenario where a transaction-oriented application
timestamps transactions with rdtsc and then, when an
infrequent error condition occurs, tries to replay the
transactions; this would work fine on any new hardware
and on VMware and even on Xen for awhile... but without
rdtsc-emulation could mysteriously fail and cause data
corruption, perhaps only after a migration (or two or
three) and only when the infrequent error condition
occurs after a few migrations. Would you want to be
on the support team that tries to diagnose that?
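A rough sketch of the failure mode described above, with made-up names and numbers. It assumes the application stamps each transaction with a raw TSC value in commit order and later relies on those stamps being non-decreasing. After a migration to a host whose TSC is behind, a later transaction can carry a smaller stamp than an earlier one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log entry stamped with a raw rdtsc value. */
struct txn {
	uint64_t tsc; /* timestamp read when the transaction committed */
	int id;       /* application-level transaction id */
};

/*
 * Count adjacent entries (in commit order) whose timestamps go
 * backwards. Any non-zero result means a replay ordered by timestamp
 * would not match the order in which the transactions happened.
 */
static size_t count_tsc_inversions(const struct txn *log, size_t n)
{
	size_t i, bad = 0;

	for (i = 1; i < n; i++)
		if (log[i].tsc < log[i - 1].tsc)
			bad++;
	return bad;
}
```

On a machine with a consistent TSC the count stays at zero, so the problem never shows up in testing. It would appear only once the counter moves backwards between two stamps.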
VMware does not have this problem. The cost for Xen is
some performance. I do not take loss of performance
lightly and have spent weeks now looking for a better
solution. I have not found one and am open to any
creative alternative. But pretending that apps, today
and in the future will NOT use rdtsc, or that they
will use it only following Xen-prescribed constraints,
is just wishful thinking and not appropriate for
hardware/software vendors selling to enterprise
customers (or selling to cloud providers that expect to
provide services for enterprise customers).
The patch provided records and reports frequency of
rdtsc emulations. If an administrator cares to improve
performance and can verify that all apps that are (now
or ever will be) running on this VM are using
rdtsc safely, a per-domain option can be specified
to get the performance back. The opposite is nearly
impossible to ascertain.
So I believe, for rdtsc, correctness is more important
than a small amount of performance.
Dan
^ permalink raw reply
* [PATCH][rc1] cgroup: catch bad css refcnt at css_put v2
From: KAMEZAWA Hiroyuki @ 2009-09-29 3:09 UTC (permalink / raw)
To: Li Zefan
Cc: linux-kernel, akpm@linux-foundation.org, mingo,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
menage@google.com
In-Reply-To: <4AC159D9.8020900@cn.fujitsu.com>
__css_put() doesn't check for the buggy case where refcnt goes negative.
This patch adds a check for it.
Changelog:
- using WARN_ON_ONCE() instead of WARN_ON()
Acked-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
kernel/cgroup.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6.32-rc1/kernel/cgroup.c
===================================================================
--- linux-2.6.32-rc1.orig/kernel/cgroup.c
+++ linux-2.6.32-rc1/kernel/cgroup.c
@@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
void __css_put(struct cgroup_subsys_state *css)
{
struct cgroup *cgrp = css->cgroup;
+ int val;
rcu_read_lock();
- if (atomic_dec_return(&css->refcnt) == 1) {
+ val = atomic_dec_return(&css->refcnt);
+ if (val == 1) {
if (notify_on_release(cgrp)) {
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
@@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
cgroup_wakeup_rmdir_waiter(cgrp);
}
rcu_read_unlock();
+ WARN_ON_ONCE(val < 1);
}
/*
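The check added by the patch can be modelled in userspace with C11 atomics. This is only a sketch with hypothetical names, not kernel code. It assumes the count starts at 1 as a base reference, which is what the `== 1` test in the patch suggests, so any value below 1 after a decrement means a put without a matching get.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the refcnt field of a css. */
struct ref {
	atomic_int refcnt; /* initialised to 1 by the owner */
};

static void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->refcnt, 1);
}

/*
 * Drop one reference. Returns true if the count fell below 1, which
 * is the condition the patch reports with WARN_ON_ONCE().
 */
static bool ref_put(struct ref *r)
{
	/*
	 * atomic_fetch_sub() returns the old value, while the kernel's
	 * atomic_dec_return() returns the new one, hence the "- 1".
	 */
	int val = atomic_fetch_sub(&r->refcnt, 1) - 1;

	return val < 1;
}
```

A get followed by a put leaves the count at 1 and reports nothing; one more put drops it to 0 and is flagged. As with the real patch, the flagged put is not necessarily the buggy one, only the first that pushes the count too low.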
^ permalink raw reply
* Re: tmio-mmc driver update
From: Ian Molton @ 2009-09-29 3:09 UTC (permalink / raw)
To: linux-mmc, Magnus Damm, sameo, pb
In-Reply-To: <c09aa50a0909251735w18f825efx38bdcdf95c6fb388@mail.gmail.com>
*prod*
After all the fuss about this, no one's going to comment?
I just found the last tmio-related bug I want to squash before I
consider this patchset 'good to go' so people actually updating their
drivers / sending me code would be much appreciated. I'd prefer this
went in with all the affected drivers tested and ASIC3 I _do not have
one of_ to test.
It's only a few lines - can we get it finished please?
-Ian
2009/9/26 Ian Molton <ian@mnementh.co.uk>:
> 2009/9/26 Ian Molton <ian@mnementh.co.uk>:
> Hi guys,
> This is the tmio based part of the tmio-mmc driver update that removes the
> CNF area (clock and power control) from the driver and abstracts it so that
> more users can share the MMC driver itself.
>
> Its compilable, runs on my hardware, and is checkpatch clean.
>
> I havent updated ASIC3 as I cannot test it here - please could someone with
> hardware do this. It should only require a few lines of code similar to that
> in the other drivers. I expect it can share the code in tmio-core.c
>
> Magnus, please base your driver on this patchset, since its tested and shows
> no regressions on my hardware (something else seems to have stopped my
> tc6393xb device reading some cards, but this must have happened prior to
> this patch (from a quick bisect). It will read other cards, and so I dont
> think this patch is related. I'll look into this when I get time, but dont
> want to delay other peoples drivers getting updated / merged.
>
> Once someone sends me a patch for ASIC3 I will prepare this patchset for
> submission upstream.
>
> --
> Ian Molton
> Linux, Automotive, and other hacking:
> http://www.mnementh.co.uk/
>
--
Ian Molton
Linux, Automotive, and other hacking:
http://www.mnementh.co.uk/
^ permalink raw reply
* Re: [PATCH][rc1] cgroup: catch bad css refcnt at css_put
From: KAMEZAWA Hiroyuki @ 2009-09-29 3:06 UTC (permalink / raw)
To: Paul Menage
Cc: linux-kernel, akpm@linux-foundation.org, mingo,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
lizf@cn.fujitsu.com
In-Reply-To: <6599ad830909280720u6f70e52cp33b89c02c7e389fa@mail.gmail.com>
On Mon, 28 Sep 2009 07:20:08 -0700
Paul Menage <menage@google.com> wrote:
> On Mon, Sep 28, 2009 at 2:13 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > This is a patch for checking css->refcnt's sanity at css_put().
> >
> > BTW, I noticed that...css->refcnt may overflow if used with memcg...
> > Now, refcnt is incremented per a page. Paul, do you have any idea ?
> > (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> > it will be big change..)
> >
> > ==
> > __css_put() doesn't check a bug as refcnt goes to minus.
> > I think it should be caught. This patch adds a check for it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Acked-by: Paul Menage <menage@google.com>
>
Thanks.
> Looks reasonable, although there's no guarantee that it will warn on a
> buggy release rather than a correct release that occur after a buggy
> release.
>
yes, it's a problem of refcnt.
Thanks,
-Kame
> > ---
> > kernel/cgroup.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.32-rc1/kernel/cgroup.c
> > ===================================================================
> > --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> > +++ linux-2.6.32-rc1/kernel/cgroup.c
> > @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> > void __css_put(struct cgroup_subsys_state *css)
> > {
> > struct cgroup *cgrp = css->cgroup;
> > + int val;
> > rcu_read_lock();
> > - if (atomic_dec_return(&css->refcnt) == 1) {
> > + val = atomic_dec_return(&css->refcnt);
> > + if (val == 1) {
> > if (notify_on_release(cgrp)) {
> > set_bit(CGRP_RELEASABLE, &cgrp->flags);
> > check_for_release(cgrp);
> > @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> > cgroup_wakeup_rmdir_waiter(cgrp);
> > }
> > rcu_read_unlock();
> > + WARN_ON(val < 1);
> > }
> >
> > /*
> >
> >
>
^ permalink raw reply
* Re: [PATCH] sdio: add MMC_CAP_VDD_165_195 host capability
From: Philip Langdale @ 2009-09-29 2:25 UTC (permalink / raw)
To: David Vrabel
Cc: Ohad Ben-Cohen, akpm, "p hilipl", ian, matt, pierre,
roberto.foglietta, linux-mmc
In-Reply-To: <4AC0FBF0.6090007@csr.com>
On Mon, 28 Sep 2009 19:09:52 +0100
David Vrabel <david.vrabel@csr.com> wrote:
> Ohad Ben-Cohen wrote:
> > From: Ohad Ben-Cohen <ohad@wizery.com>
> >
> > To allow the usage of MMC_VDD_165_195, host capability
> > MMC_CAP_VDD_165_195 is introduced. This is necessary
> > because MMC_VDD_165_195 is currently reserved/undefined.
>
> The host already reports what voltages it supports (in
> mmc_host::ocr_avail) so a separate MMC_CAP_* isn't needed.
>
> This interpretation of the reserved bits in the OCR should only be done
> for certain cards where the bits actually do mean 1.8 V operation
> (with v2.0 signalling) is possible.
>
> That's a fair amount of work so perhaps in the interim something like
> this:
>
> --- a/drivers/mmc/core/sdio.c
> +++ b/drivers/mmc/core/sdio.c
> @@ -494,6 +494,9 @@ int mmc_attach_sdio(struct mmc_host *host, u32 ocr)
> ocr &= ~0x7F;
> }
>
> + if (ocr & MMC_VDD_165_195)
> + printk(KERN_WARNING "%s: warning: card claims non-standard 1.65-1.95 V support"
> +
> host->ocr = mmc_select_voltage(host, ocr);
>
> /*
>
> And revisit this if these bits either: a) gain a (different) standard
> meaning; or b) some other card uses these bits in a different
> non-standard way. Neither seems likely.
Hi David,
Ok, that sounds reasonable, but my concern is a controller that
publishes support for MMC_VDD_165_195 for mmc cards but doesn't
claim support for SDIO cards - particularly considering the
signalling implications you mentioned. Now, maybe you don't see
this happening in the wild, but it seems to me that it has to
be possible. It seems that to guard against this, you'd need a
host cap that says "165_195 for SD" and if it's not present,
mask it out of the OCR when dealing with SD/IO cards.
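The guard suggested here could look roughly like the sketch below. MMC_VDD_165_195 has the value it has in include/linux/mmc/host.h; the capability bit and the helper are hypothetical and exist in no kernel tree.

```c
#include <stdint.h>

#define MMC_VDD_165_195		0x00000080u /* OCR bit 7, as in linux/mmc/host.h */

/* Hypothetical host capability: "1.65-1.95 V is also usable for SD/SDIO". */
#define HYP_CAP_SD_VDD_165_195	(1u << 0)

/*
 * Return the OCR to use for an SD/SDIO card. Unless the host has
 * explicitly claimed low-voltage support for SD, the 1.65-1.95 V bit
 * is masked out, so it cannot be selected for such a card.
 */
static uint32_t sdio_filter_ocr(uint32_t ocr, uint32_t host_caps)
{
	if (!(host_caps & HYP_CAP_SD_VDD_165_195))
		ocr &= ~MMC_VDD_165_195;
	return ocr;
}
```

A host that advertises the voltage range only for MMC cards would leave the capability bit clear, and the bit would then be ignored for SD/SDIO cards regardless of what the card reports.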
Am I being too paranoid?
--phil
^ permalink raw reply
* Re: udev problems on angstrom-2008.1
From: Denys Dmytriyenko @ 2009-09-29 3:01 UTC (permalink / raw)
To: openembedded-devel
In-Reply-To: <200909281752.11487.anarsoul@gmail.com>
On Mon, Sep 28, 2009 at 05:52:05PM +0300, Vasily Khoruzhick wrote:
> Hi, I'm experiencing problems with image for rx1950 PDA (s3c2442-based PDA
> with 32M of RAM) built by openembedded.
> For some reason, udev doesn't work, and complains about lack of free space on
> /dev (which is tmpfs) like this:
>
> udevd-event[1918]: mknod(/dev/vcS, 020660, (7,0) failed: No space left on
> device.
Been there, done that.
Short answer - you have too many requested devnodes for udev to handle within
32MB of RAM. If you absolutely need all the devices enabled in the kernel
(try disabling/reducing LEGACY_PTY), consider replacing udev with mdev (or
static devnodes). Or try moving udev's database from tmpfs into another
partition, maybe on flash.
Long answer - udev besides creating the actual devnodes also keeps its
database in the same tmpfs filesystem. It consists of an internal directory
structure with symlinks (a la sysfs) and is stored in /dev/.udev
Keeping all that in tmpfs is not efficient, as it wastes inodes and apparently
inodes in tmpfs/ramfs are the size of a memory page (4KB). By default tmpfs is
limited to half the RAM, which in your case is 16MB. That would allow roughly
4000 inodes. You definitely need more, because of the above messages. Even if
you fit all of them, using half of your available RAM just for udev is a total
waste. So, consider suggestions above.
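The estimate above works out as follows. The helper is only a sketch with a made-up name; it takes over the assumptions stated in this mail, namely that tmpfs defaults to half of RAM and that each inode costs one memory page.

```c
/*
 * Rough upper bound on the number of inodes a default-sized tmpfs can
 * hold, assuming one page per inode and a size limit of half the RAM.
 */
static unsigned long tmpfs_max_inodes(unsigned long ram_bytes,
				      unsigned long page_bytes)
{
	return (ram_bytes / 2) / page_bytes;
}
```

For 32MB of RAM and 4KB pages this gives 16MB / 4KB = 4096 inodes, the "roughly 4000" mentioned above, and that budget is shared between the device nodes and the udev database under /dev/.udev.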
--
Denys
^ permalink raw reply
* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
From: KAMEZAWA Hiroyuki @ 2009-09-29 3:14 UTC (permalink / raw)
To: Daisuke Nishimura; +Cc: linux-mm@kvack.org, balbir@linux.vnet.ibm.com
In-Reply-To: <20090929120348.0bcb17d1.nishimura@mxp.nes.nec.co.jp>
On Tue, 29 Sep 2009 12:03:48 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> Just to make sure.
>
> > > Maybe there is something I don't understand..
> > > IIUC, when page_remove_rmap() is called by do_wp_page(),
> > > there must be pte(s) which points to the page and a pte is guarded by
> > > page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap()
> > > because there must be a valid pte, at least.
> > >
> > > Can this scenario happen ?
> > I think so. I intended to mention this case :)
> > I'm sorry for my vague explanation.
> >
> > > ==
> > > Thread A. Thread B.
> > >
> > > do_wp_page() do_swap_page()
> > > PageAnon(oldpage)
> > > lock_page() lock_page()=> wait.
> > > reuse = false.
> > > unlock_page() get lock.
> > > do copy-on-write
> > > pte_same() == true
> > > page_remove_rmap(oldpage) (mapcount goes to -1)
> > > page_set_anon_rmap() (new anon rmap again)
> > > ==
> > > Then, oldpage's mapcount goes down to 0 and up to 1 immediately.
> > >
> I meant "process" not "thread".
Okay ;)
> I think this cannot happen in the case of threads, because these page_remove_rmap()
> and page_set_anon_rmap() are called under pte lock(they share the pte).
>
Anyway, I'll fix this patch.
But Balbir asked me to post batched_charge/uncharge first, so this clean up series
will be postponed.
I think..
1. post softlimit fixes.
2. batched uncharge/charge
3. post some fixes from this set.
I personally want to reorder all functions but it makes diff (between versions)
too big. So, I think I should avoid big reorganization.
I'll go a moderate way.
Hmm..but I'll do move percpu/perzone functions below definitions of structs.
Thanks,
-Kame
>
> Thanks,
> Daisuke Nishimura.
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
^ permalink raw reply
* initialisation freqency patch problem.
From: Ian Molton @ 2009-09-29 3:01 UTC (permalink / raw)
To: linux-mmc
Hi folks,
The commit 8dfd0374be84793360db7fff2e635d2cd3bbcb21 is causing one of
my MMC cards to fail to initialise.
Has anyone else seen initialisation failures since this patch?
I suspect the problem is in tmio-mmc but its a weird one - only one of
my two tc6393xb based hosts has this issue, and none of t7l66 or
tc6387, and only with one card.
I'll look into it tomorrow and see what actual clock frequency is
getting selected.
I'm off to bed now.
--
Ian Molton
Linux, Automotive, and other hacking:
http://www.mnementh.co.uk/
^ permalink raw reply
* Re: [Bonding-devel] [PATCH 4/4] bonding: add sysfs files to display tlb and alb hash table contents
From: Stephen Hemminger @ 2009-09-29 3:00 UTC (permalink / raw)
To: Andy Gospodarek; +Cc: Andy Gospodarek, netdev, fubar, bonding-devel
In-Reply-To: <20090929013713.GG4436@gospo.rdu.redhat.com>
On Mon, 28 Sep 2009 21:37:13 -0400
Andy Gospodarek <andy@greyhouse.net> wrote:
> On Mon, Sep 28, 2009 at 05:34:20PM -0700, Stephen Hemminger wrote:
> > On Mon, 28 Sep 2009 20:12:03 -0400
> > Andy Gospodarek <andy@greyhouse.net> wrote:
> >
> > > On Mon, Sep 28, 2009 at 04:22:37PM -0700, Stephen Hemminger wrote:
> > > > On Fri, 11 Sep 2009 17:13:17 -0400
> > > > Andy Gospodarek <andy@greyhouse.net> wrote:
> > > >
> > > > >
> > > > > bonding: add sysfs files to display tlb and alb hash table contents
> > > > >
> > > > > While debugging some problems with alb (mode 6) bonding I realized that
> > > > > being able to output the contents of both hash tables would be helpful.
> > > > > This is what the output looks like for the two files:
> > > > >
> > > > > device load
> > > > > eth1 491
> > > > > eth2 491
> > > > > hash device last device tx bytes load next previous
> > > > > 2 eth1 eth1 2254 491 0 0
> > > > > 3 eth2 eth2 2744 491 0 0
> > > > > 6 eth2 0 488 0 0
> > > > > 8 eth2 0 461698 0 0
> > > > > 1b eth2 0 249 0 0
> > > > > eb eth2 0 21 0 0
> > > > > ff eth2 0 22 0 0
> > > > >
> > > > > hash ip_src ip_dst mac_dst slave assign ntt
> > > > > 2 10.0.3.2 10.0.3.11 00:e0:81:71:ee:a9 eth1 1 0
> > > > > 3 10.0.3.2 10.0.3.10 00:e0:81:71:ee:a9 eth2 1 0
> > > > > 8 10.0.3.2 10.0.3.1 00:e0:81:71:ee:a9 eth2 1 0
> > > > >
> > > > > These were a great help debugging the fixes I have just posted and they
> > > > > might be helpful for others, so I decided to include them in my
> > > > > patchset.
> > > > >
> > > > > Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> > > >
> > > > No.
> > > >
> > > > Please don't put formatted output in sysfs, it is not meant to be
> > > > used like proc, there is supposed to be only one value per file.
> > >
> > > Then based on the over 300 files in /sys/ that are more than 1 line on
> > > my currently running kernel, it seems there is significant work to do.
> > >
> > > Seemingly arbitrary requests like this are extremely annoying when the
> > > current kernel violates them all over the place.
> > >
> >
> > The rules are documented in Documentation/sysfs-rules.txt. If you want
> > to change the rules, submit a change to the rules.
> >
>
> That specific request is actually in filesystems/sysfs.txt in the
> 'Attributes' section, but the fact that it's actually outlined somewhere
> makes the request seem less 'arbitrary.' ;-)
>
Ah, that is where the note is:
----------------------
Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
Attributes should be ASCII text files, preferably with only one value
per file. It is noted that it may not be efficient to contain only one
value per file, so it is socially acceptable to express an array of
values of the same type.
Mixing types, expressing multiple lines of data, and doing fancy
formatting of data is heavily frowned upon. Doing these things may get
you publically humiliated and your code rewritten without notice.
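As a rough userspace illustration of the "one value per file" convention quoted above: instead of one file carrying a formatted table, each attribute would render a single value. This is only a sketch under stated assumptions; `show_single_value` is a hypothetical helper, not part of the sysfs API or of the bonding patch, and the value 491 is simply the load figure from the example output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for an attribute "show" routine: it writes one
 * ASCII value followed by a newline into buf and returns snprintf()'s
 * result, which is the formatted length when buf is large enough.
 */
static int show_single_value(char *buf, size_t len, unsigned long value)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%lu\n", value);
}
```

With this shape, a per-slave load would be read from its own file rather than parsed out of a multi-column listing.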
--
^ permalink raw reply
* Re: Paravirtualization on VMware's Platform [VMI].
From: Alok Kataria @ 2009-09-29 3:00 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Gerd Hoffmann, Ingo Molnar, Thomas Gleixner,
the arch/x86 maintainers, LKML, Jeremy Fitzhardinge, Chris Wright,
Rusty Russell, virtualization@lists.osdl.org, Greg KH,
Linus Torvalds, Andrew Morton
In-Reply-To: <4AC17018.4040600@zytor.com>
On Mon, 2009-09-28 at 19:25 -0700, H. Peter Anvin wrote:
> On 09/28/2009 05:45 PM, Alok Kataria wrote:
> > + bool "VMI Guest support [will be deprecated soon]"
> > + default n
>
> This is incorrect use of the word "deprecated"... it's *already*
> deprecated (a word which pretty much means the opposite of "recommended".)
>
> As far as "default n" is concerned... this is usually not necessary; "n"
> is the default unless anything else is specified.
How about this? Thanks.
--
Mark VMI for removal in feature-removal-schedule.txt.
From: Alok N Kataria <akataria@vmware.com>
Add text to feature-removal-schedule.txt and also modify Kconfig to disable
VMI by default.
---
Documentation/feature-removal-schedule.txt | 30 ++++++++++++++++++++++++++++
arch/x86/Kconfig | 12 ++++++++---
2 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 89a47b5..d24c1af 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -451,3 +451,33 @@ Why: OSS sound_core grabs all legacy minors (0-255) of SOUND_MAJOR
will also allow making ALSA OSS emulation independent of
sound_core. The dependency will be broken then too.
Who: Tejun Heo <tj@kernel.org>
+
+----------------------------
+
+What: Support for VMware's guest paravirtualization technique [VMI] will be
+ dropped.
+When: 2.6.37 or earlier.
+Why: With the recent innovations in CPU hardware acceleration technologies
+ from Intel and AMD, VMware ran a few experiments to compare these
+ techniques to guest paravirtualization technique on VMware's platform.
+ These hardware assisted virtualization techniques have outperformed the
+ performance benefits provided by VMI in most of the workloads. VMware
+ expects that these hardware features will be ubiquitous in a couple of
+ years, as a result, VMware has started a phased retirement of this
+ feature from the hypervisor. We will be removing this feature from the
+ Kernel too. Right now we are targeting 2.6.37 but can retire earlier if
+ technical reasons (read: opportunity to remove a major chunk of pvops)
+ arise.
+
+ Please note that VMI has always been an optimization and non-VMI kernels
+ still work fine on VMware's platform.
+ The latest versions of VMware's products that support VMI are
+ Workstation 7.0 and vSphere 4.0 on the ESX side; future maintenance
+ releases for these products will continue supporting VMI.
+
+ For more details about VMI retirement take a look at this,
+ http://blogs.vmware.com/guestosguide/2009/09/vmi-retirement.html
+
+Who: Alok N Kataria <akataria@vmware.com>
+
+----------------------------
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f777aaf..44c1660 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -496,14 +496,20 @@ if PARAVIRT_GUEST
source "arch/x86/xen/Kconfig"
config VMI
- bool "VMI Guest support"
- select PARAVIRT
- depends on X86_32
+ bool "VMI Guest support [deprecated]"
+ depends on X86_32 && PARAVIRT
---help---
VMI provides a paravirtualized interface to the VMware ESX server
(it could be used by other hypervisors in theory too, but is not
at the moment), by linking the kernel to a GPL-ed ROM module
provided by the hypervisor.
+ As of September 2009, VMware has started a phased retirement of this
+ feature from VMware's products. Please see
+ feature-removal-schedule.txt for details.
+ If you are planning to enable this option, please note that you
+ cannot live migrate a VMI enabled VM to a future VMware product,
+ which doesn't support VMI. So if you expect your kernel to seamlessly
+ migrate to newer VMware products, keep this disabled.
config KVM_CLOCK
bool "KVM paravirtualized clock"
^ permalink raw reply related
* Re: [PATCH][rc1] cgroup: catch bad css refcnt at css_put
From: KAMEZAWA Hiroyuki @ 2009-09-29 2:55 UTC (permalink / raw)
To: Li Zefan
Cc: linux-kernel, akpm@linux-foundation.org, mingo,
balbir@linux.vnet.ibm.com, nishimura@mxp.nes.nec.co.jp,
menage@google.com
In-Reply-To: <4AC159D9.8020900@cn.fujitsu.com>
On Tue, 29 Sep 2009 08:50:33 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:
> KAMEZAWA Hiroyuki wrote:
> > This is a patch for checking css->refcnt's sanity at css_put().
> >
> > BTW, I noticed that...css->refcnt may overflow if used with memcg...
> > Now, refcnt is incremented per a page. Paul, do you have any idea ?
>
> atomic64_t ?
>
Maybe. atomic_long_t?
> But for 4K pagesize, it won't overflow until when the referenced
> memory is > 8T?
>
You're right. But there tend to be a few users in the world who use
unbelievable amounts of memory.
(Whether such users use memcg or not is another question ;)
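The 8T figure can be checked with a little arithmetic. A minimal sketch, assuming a 32-bit signed counter (31 usable bits) that is incremented once per 4K page; `overflow_bytes` is a hypothetical helper for illustration, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of referenced memory at which a per-page reference counter with
 * counter_bits usable (non-sign) bits would wrap, for a given page size.
 * Assumes counter_bits < 64 and that the product fits in 64 bits.
 */
static uint64_t overflow_bytes(unsigned int counter_bits, uint64_t page_size)
{
	assert(counter_bits < 64);
	return (UINT64_C(1) << counter_bits) * page_size;
}
```

For 31 bits and 4096-byte pages this gives 2^31 * 2^12 = 2^43 bytes, i.e. 8 TiB, which matches the estimate above.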
> > (Ah, yes. "don't use css->refcnt per page" is maybe reasonable but
> > it will be big change..)
> >
> > ==
> > __css_put() doesn't check a bug as refcnt goes to minus.
> > I think it should be caught. This patch adds a check for it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Acked-by: Li Zefan <lizf@cn.fujitsu.com>
>
> > ---
> > kernel/cgroup.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6.32-rc1/kernel/cgroup.c
> > ===================================================================
> > --- linux-2.6.32-rc1.orig/kernel/cgroup.c
> > +++ linux-2.6.32-rc1/kernel/cgroup.c
> > @@ -3708,8 +3708,10 @@ static void check_for_release(struct cgr
> > void __css_put(struct cgroup_subsys_state *css)
> > {
> > struct cgroup *cgrp = css->cgroup;
> > + int val;
> > rcu_read_lock();
> > - if (atomic_dec_return(&css->refcnt) == 1) {
> > + val = atomic_dec_return(&css->refcnt);
> > + if (val == 1) {
> > if (notify_on_release(cgrp)) {
> > set_bit(CGRP_RELEASABLE, &cgrp->flags);
> > check_for_release(cgrp);
> > @@ -3717,6 +3719,7 @@ void __css_put(struct cgroup_subsys_stat
> > cgroup_wakeup_rmdir_waiter(cgrp);
> > }
> > rcu_read_unlock();
> > + WARN_ON(val < 1);
>
> When we run into this, it'll probably fill up the syslog quickly,
> so I think WARN_ON_ONCE() is a bit better.
>
Hmm, ok. I'll rewrite.
Thanks,
-Kame
> > }
> >
>
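To illustrate why a warn-once check keeps the log quiet when the same bug fires repeatedly, here is a userspace imitation. It is only a sketch: `SKETCH_WARN_ON_ONCE`, `sketch_css_put` and `warnings_printed` are made-up names, the counter is a plain int rather than an atomic, and the real kernel macro does more (it dumps a stack trace, for example).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int warnings_printed;	/* counts messages actually emitted */

/* Userspace imitation of a warn-once check: one message per call site. */
#define SKETCH_WARN_ON_ONCE(cond)					\
	do {								\
		static bool warned;					\
		if ((cond) && !warned) {				\
			warned = true;					\
			warnings_printed++;				\
			fprintf(stderr, "warning: %s\n", #cond);	\
		}							\
	} while (0)

/* Drop one reference; a result below 1 means an unbalanced put. */
static int sketch_css_put(int *refcnt)
{
	int val;

	assert(refcnt != NULL);
	val = --(*refcnt);
	SKETCH_WARN_ON_ONCE(val < 1);
	return val;
}
```

Calling `sketch_css_put()` again after the count has already gone bad still decrements the counter, but no further message is printed.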
^ permalink raw reply
* Re: [RFC][PATCH 8/10] memcg: clean up charge/uncharge anon
From: Daisuke Nishimura @ 2009-09-29 3:03 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: nishimura, linux-mm@kvack.org, balbir@linux.vnet.ibm.com
In-Reply-To: <20090929111828.6f9148d6.nishimura@mxp.nes.nec.co.jp>
Just to make sure.
> > Maybe there is something I don't understand..
> > IIUC, when page_remove_rmap() is called by do_wp_page(),
> > there must be pte(s) which points to the page and a pte is guarded by
> > page table lock. So, I think page_mapcount() > 0 before calling page_remove_rmap()
> > because there must be a valid pte, at least.
> >
> > Can this scenario happen ?
> I think so. I intended to mention this case :)
> I'm sorry for my vague explanation.
>
> > ==
> > Thread A. Thread B.
> >
> > do_wp_page() do_swap_page()
> > PageAnon(oldpage)
> > lock_page() lock_page()=> wait.
> > reuse = false.
> > unlock_page() get lock.
> > do copy-on-write
> > pte_same() == true
> > page_remove_rmap(oldpage) (mapcount goes to -1)
> > page_set_anon_rmap() (new anon rmap again)
> > ==
> > Then, oldpage's mapcount goes down to 0 and up to 1 immediately.
> >
I meant "process", not "thread".
I think this cannot happen in the case of threads, because page_remove_rmap()
and page_set_anon_rmap() are called under the pte lock (they share the pte).
Thanks,
Daisuke Nishimura.
^ permalink raw reply
* Re: compat-wireless master-2009-09-28 breakage and suggested fixes
From: Luis R. Rodriguez @ 2009-09-29 2:55 UTC (permalink / raw)
To: Hin-Tak Leung; +Cc: Luis Rodriguez, linux-wireless
In-Reply-To: <3ace41890909281935k284402f3q90b54ae20a636eeb@mail.gmail.com>
On Mon, Sep 28, 2009 at 07:35:22PM -0700, Hin-Tak Leung wrote:
> commit d0cf9c0dadcdc89a755bcb301cfc9c796eb28ccf
> Author: Stephen Hemminger <shemminger@vyatta.com>
> Date: Mon Aug 31 19:50:57 2009 +0000
>
> wireless: convert drivers to netdev_tx_t
>
> and the 2nd change due to this:
>
> commit 384912ed194e43c03ad1cdaa09b0b1e488c34d46
> Author: Marcel Holtmann <marcel@holtmann.org>
> Date: Mon Aug 31 21:08:19 2009 +0000
>
> net: Add DEVTYPE support for Ethernet based devices
>
> Both of these changes are traced back to changes in
> <linux/netdevice.h> , which compat-wireless does not ship. What's your
> policy on these kind of changes to compat-wireless?
> (the 2nd SET_NETDEV_DEVTYPE change probably can be spanned by an
> ifndef SET_NETDEV_DEVTYPE, and roll into
> "compat/patches/01-netdev.patch"? Should the first kind of change also
> go into compat/patches/01-netdev.patch?)
I don't see this yet on wireless-testing but it is on 2.6.32.
I backported this as follows. I'll push this out shortly.
From: Luis R. Rodriguez <lrodriguez@atheros.com>
Subject: [PATCH] Fix compilation against 2.6.32 changes
2.6.32 added SET_NETDEV_DEVTYPE() and netdev_tx
Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com>
---
compat/compat-2.6.32.h | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/compat/compat-2.6.32.h b/compat/compat-2.6.32.h
index f7081f2..418b521 100644
--- a/compat/compat-2.6.32.h
+++ b/compat/compat-2.6.32.h
@@ -31,6 +31,18 @@
#define dev_change_net_namespace(a, b, c) (-EOPNOTSUPP)
+#define SET_NETDEV_DEVTYPE(netdev, type)
+
+#ifdef __KERNEL__
+/* Driver transmit return codes */
+enum netdev_tx {
+ BACKPORT_NETDEV_TX_OK = NETDEV_TX_OK, /* driver took care of packet */
+ BACKPORT_NETDEV_TX_BUSY = NETDEV_TX_BUSY, /* driver tx path was busy*/
+ BACKPORT_NETDEV_TX_LOCKED = NETDEV_TX_LOCKED, /* driver tx lock was already taken */
+};
+typedef enum netdev_tx netdev_tx_t;
+#endif /* __KERNEL__ */
+
#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(2,6,32)) */
#endif /* LINUX_26_32_COMPAT_H */
--
1.6.3.3
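A standalone sketch of how the backported type is meant to be used, assuming (as the BACKPORT_ prefix suggests) that kernels before 2.6.32 provide NETDEV_TX_OK and friends as preprocessor macros, so the enum members cannot reuse those names. The macro values below are stand-ins for illustration, and `sketch_start_xmit` is a hypothetical transmit hook, not code from compat-wireless:

```c
#include <assert.h>

/* Stand-ins for the pre-2.6.32 macros; the values are illustrative only. */
#define NETDEV_TX_OK		0
#define NETDEV_TX_BUSY		1
#define NETDEV_TX_LOCKED	(-1)

/* Same shape as the enum added to compat-2.6.32.h in the patch above. */
enum netdev_tx {
	BACKPORT_NETDEV_TX_OK = NETDEV_TX_OK,
	BACKPORT_NETDEV_TX_BUSY = NETDEV_TX_BUSY,
	BACKPORT_NETDEV_TX_LOCKED = NETDEV_TX_LOCKED,
};
typedef enum netdev_tx netdev_tx_t;

/* Hypothetical transmit hook written against the 2.6.32 return type. */
static netdev_tx_t sketch_start_xmit(int queue_full)
{
	return queue_full ? NETDEV_TX_BUSY : NETDEV_TX_OK;
}
```

Driver code can keep returning the familiar NETDEV_TX_* names while its prototype uses netdev_tx_t, which is what the converted drivers expect.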
^ permalink raw reply related
* Re: [Openipmi-developer] Subject:[RFC Patch 1/2]IPMI/ACPI: Locate the IPMI system interface in ACPI namespace
From: ykzhao @ 2009-09-29 2:49 UTC (permalink / raw)
To: Corey Minyard
Cc: linux-acpi@vger.kernel.org,
openipmi-developer@lists.sourceforge.net, lenb@kernel.org
In-Reply-To: <4AC0CFC9.5050705@acm.org>
On Mon, 2009-09-28 at 23:01 +0800, Corey Minyard wrote:
> I've been looking for something like this for a while, but I didn't have
> a system that supports this, so I didn't have a way to test it. Thanks
> for doing this.
> Now to the code.
>
> In general, the code is not consistent in the way it uses blank lines
> between functions, if statements, etc. Can you make it consistent (and
> consistent with the rest of the driver)?
Agree with what you said. I will make it cleaner.
>
> Can you name all the functions starting with acpi_device or something
> like that to make their function clear?
>
> You need to run this through the kernel checkpatch script, it has some
> coding style problems.
>
> Can the old ACPI code go away? I understand that it will be redundant
> with these additions, but I'm not 100% sure.
We can't delete the old ACPI code yet.
The IPMI system interface can be located through ACPI in the following
two ways:
1. Locate it in the SPMI table. This is done by the old ACPI code.
2. Locate it in the ACPI device tree, by enumerating the tree. This is
what my patch does.
>
> Why not fill out an info structure directly in check_bmc_device and then
> call try_device_init_acpi() directly from there instead of creating a
> new structure device, allocating it, saving it in a list, etc. That
> would save some code and simplify things a little.
What you suggest would also work. But I think it is clearer to divide this
into two steps: the first locates all the IPMI system interfaces in the ACPI
device tree, and the second registers them.
Another concern is that some mutex is held while acpi_walk_namespace() runs
(check_bmc_device() is the callback function used by acpi_walk_namespace()).
To avoid holding the mutex for too long, IMO it is reasonable to split the
work into two steps.
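The two-step idea can be sketched in plain userspace C: a cheap callback only records what it finds, and the slower registration runs afterwards, outside the walk. All names here (`found_dev`, `collect_dev`, `register_all`, `sketch_register`) are hypothetical and the list is a simple singly linked one, not the kernel's list_head.

```c
#include <assert.h>
#include <stdlib.h>

struct found_dev {
	struct found_dev *next;
	unsigned long addr;		/* base address noted during the walk */
};

struct found_list {
	struct found_dev *head;
	struct found_dev **tail;
	int count;
};

static void found_list_init(struct found_list *list)
{
	list->head = NULL;
	list->tail = &list->head;
	list->count = 0;
}

/* Step 1: called from the walk; only allocates and queues an entry. */
static int collect_dev(struct found_list *list, unsigned long addr)
{
	struct found_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return -1;
	dev->addr = addr;
	*list->tail = dev;
	list->tail = &dev->next;
	list->count++;
	return 0;
}

/* Example registration routine: rejects a zero address. */
static int sketch_register(unsigned long addr)
{
	return addr ? 0 : -1;
}

/* Step 2: after the walk, register each entry and free it. */
static int register_all(struct found_list *list, int (*reg)(unsigned long))
{
	struct found_dev *dev = list->head, *next;
	int registered = 0;

	for (; dev; dev = next) {
		next = dev->next;
		if (reg(dev->addr) == 0)
			registered++;
		free(dev);
	}
	found_list_init(list);
	return registered;
}
```

The cost is the extra allocation and list handling that Corey points out; the benefit is that nothing slow runs inside the callback.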
>
> + if (device_count > 1) {
> + printk(KERN_WARNING "More than one BMC device is found in "
> + "ACPI table\n");
> + printk(KERN_WARNING "Of course the BMC device will be "
> + "registered\n");
> + }
>
> It's legal (and possible) to have more than one BMC. I don't think this
> code is necessary.
Ok. I will delete this check.
>
> + if (resource->type == ACPI_RESOURCE_TYPE_MEMORY32 ||
> + resource->type == ACPI_RESOURCE_TYPE_MEMORY24 ||
> + resource->type == ACPI_RESOURCE_TYPE_FIXED_MEMORY32) {
> + printk(KERN_DEBUG
> + "Can't handle the Memory32/24/fixed32 type\n");
> + printk(KERN_DEBUG "please send acpidump to "
> + "linux-acpi@vger.kernel.org\n");
> + return AE_OK;
> + }
>
> I don't really understand this, but debug is probably not the
> appropriate printk level for this if the user needs to see it. Also,
> what is going on here? Why isn't this supported?
In fact the above resource is parsed from the _CRS object defined in the
IPMI device. In the following example the resource type is IO:
    // Returns the "Current Resources"
    Name(_CRS,
        ResourceTemplate() {
            IO(Decode16, 0xCA9, 0, 3) // Ports 0xCA9, 0xCAA & 0xCAB
        }
    )
For most IPMI system interfaces defined in the ACPI device tree, the address
type will be IO.
But the IPMI 2.0 spec has an example definition of an IPMI system
interface in which the address type is 64-bit memory.
So far I have not found a _CRS object whose resource type is
MEMORY32/24/FIXED_MEMORY32, and I don't know whether it is necessary to
support these three types. So when this message is printed, we can get
the acpidump and add support for it.
Of course, we could also add support for parsing the base address from the
MEMORY32/24/FIXED_MEMORY32 resource types now.
>
> + /*
> + * If the resource type is ACPI_RESOURCE_IRQ, it is not
> + * supported.
> + */
>
> Why not? Is there something else that should be logged or done? Also, wouldn't you put this in an IRQ function?
OK. I will try to add support for parsing the IRQ number when the
resource type is ACPI_RESOURCE_IRQ.
>
>
> + if (p_ipmi->interrupttype) {
> + /*
> + * If it already support the interrupt through GPE,
> + * it is unnecessary to get this interrupt again.
> + */
> + printk(KERN_DEBUG "Interrupt through GPE is already"
> + " supported.\n");
> + return AE_OK;
> + }
> + if (extended_irq->interrupt_count != 1) {
> + printk(KERN_DEBUG "Incorrect resource setting about "
> + "interrupt \n");
> + return AE_OK;
> + }
>
> I think the printks need to be a little clearer, and if the user needs
> to see them (like these are errors in the ACPI structures) they should
> be warnings or something like that.
Yes. KERN_WARNING should be used instead of KERN_DEBUG.
Thanks.
>
> Thanks,
>
> -corey
>
> yakui.zhao@intel.com wrote:
> > According to the IPMI 2.0 spec the IPMI system interface can be located with
> > ACPI. One is located in SPMI table(Service Processor Management Interface
> > table). Another is located in ACPI namespace.
> > This patch is to locate the IPMI system interface in ACPI namespace and
> > register it.
> > It includes the following two steps:
> > 1. enumerate the ACPI device tree to find the IPMI system interface
> > The IPMI device type is IPI0001. When the device is found, it
> > will continue to parse the corresponding resources.
> > For example:
> > interface type (KCS, BT, SMIC) (SSIF is not supported)
> > interrupt number and type (_GPE or GSI)
> > Memory or IO base address
> > 2. register the IPMI system interface.
> >
> >
> > Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
> > ---
> > drivers/char/ipmi/ipmi_si_intf.c | 360 +++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 360 insertions(+)
> >
> > Index: linux-2.6/drivers/char/ipmi/ipmi_si_intf.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/char/ipmi/ipmi_si_intf.c 2009-09-21 16:49:29.000000000 +0800
> > +++ linux-2.6/drivers/char/ipmi/ipmi_si_intf.c 2009-09-28 11:43:53.000000000 +0800
> > @@ -1813,6 +1813,35 @@
> > * are no more.
> > */
> > static int acpi_failure;
> > +static LIST_HEAD(acpi_ipmi);
> > +
> > +struct acpi_device_ipmi {
> > + struct list_head link;
> > + u8 interfacetype;
> > + /*
> > + * Bit 0 - SCI interrupt supported
> > + * Bit 1 - I/O APIC/SAPIC
> > + */
> > + u8 interrupttype;
> > + /*
> > + * If bit 0 of InterruptType is set, then this is the SCI
> > + * interrupt in the GPEx_STS register.
> > + */
> > + u8 gpe;
> > + /*
> > + * If bit 1 of InterruptType is set, then this is the I/O
> > + * APIC/SAPIC interrupt.
> > + */
> > + u32 global_interrupt;
> > +
> > + /* The actual register address. */
> > + struct acpi_generic_address addr;
> > + struct acpi_generic_address sm_addr;
> > +
> > + u8 ipmi_revision;
> > + u8 resource_count;
> > + struct device *dev;
> > +};
> >
> > /* For GPE-type interrupts. */
> > static u32 ipmi_acpi_gpe(void *context)
> > @@ -2001,7 +2030,337 @@
> >
> > return 0;
> > }
> > +static __devinit int try_init_acpi_device(struct acpi_device_ipmi *spmi)
> > +{
> > + struct smi_info *info;
> > + u8 addr_space;
> > +
> > + if (spmi->addr.space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY)
> > + addr_space = IPMI_MEM_ADDR_SPACE;
> > + else
> > + addr_space = IPMI_IO_ADDR_SPACE;
> > +
> > + info = kzalloc(sizeof(*info), GFP_KERNEL);
> > + if (!info) {
> > + printk(KERN_ERR "ipmi_si: Could not allocate SI data (3)\n");
> > + return -ENOMEM;
> > + }
> > +
> > + info->addr_source = "ACPI";
> > +
> > + /* Figure out the interface type. */
> > + switch (spmi->interfacetype) {
> > + case 1: /* KCS */
> > + info->si_type = SI_KCS;
> > + break;
> > + case 2: /* SMIC */
> > + info->si_type = SI_SMIC;
> > + break;
> > + case 3: /* BT */
> > + info->si_type = SI_BT;
> > + break;
> > + default:
> > + printk(KERN_INFO "ipmi_si: Unknown ACPI/SPMI SI type %d\n",
> > + spmi->interfacetype);
> > + kfree(info);
> > + return -EIO;
> > + }
> > +
> > + if (spmi->interrupttype & 1) {
> > + /* We've got a GPE interrupt. */
> > + info->irq = spmi->gpe;
> > + info->irq_setup = acpi_gpe_irq_setup;
> > + } else if (spmi->interrupttype & 2) {
> > + /* We've got an APIC/SAPIC interrupt. */
> > + info->irq = spmi->global_interrupt;
> > + info->irq_setup = std_irq_setup;
> > + } else {
> > + /* Use the default interrupt setting. */
> > + info->irq = 0;
> > + info->irq_setup = NULL;
> > + }
> > +
> > + if (spmi->addr.bit_width) {
> > + /* A (hopefully) properly formed register bit width. */
> > + info->io.regspacing = spmi->addr.bit_width / 8;
> > + } else {
> > + info->io.regspacing = DEFAULT_REGSPACING;
> > + }
> > + info->io.regsize = info->io.regspacing;
> > + info->io.regshift = spmi->addr.bit_offset;
> > +
> > + if (spmi->addr.space_id == ACPI_ADR_SPACE_SYSTEM_MEMORY) {
> > + info->io_setup = mem_setup;
> > + info->io.addr_type = IPMI_MEM_ADDR_SPACE;
> > + } else if (spmi->addr.space_id == ACPI_ADR_SPACE_SYSTEM_IO) {
> > + info->io_setup = port_setup;
> > + info->io.addr_type = IPMI_IO_ADDR_SPACE;
> > + } else {
> > + kfree(info);
> > + printk(KERN_WARNING
> > + "ipmi_si: Unknown ACPI I/O Address type\n");
> > + return -EIO;
> > + }
> > + info->io.addr_data = spmi->addr.address;
> > + info->dev = spmi->dev;
> > +
> > + try_smi_init(info);
> > +
> > + return 0;
> > +}
> > +static acpi_status
> > +bmc_parse_io_ports(struct acpi_resource *resource, void *context)
> > +{
> > + struct acpi_device_ipmi *p_ipmi = context;
> > +
> > + /*
> > + * If the resource type is ACPI_RESOURCE_IRQ, it is not
> > + * supported.
> > + */
> > + if (resource->type == ACPI_RESOURCE_TYPE_EXTENDED_IRQ) {
> > + struct acpi_resource_extended_irq *extended_irq;
> > + extended_irq = &resource->data.extended_irq;
> > + if (p_ipmi->interrupttype) {
> > + /*
> > + * If it already support the interrupt through GPE,
> > + * it is unnecessary to get this interrupt again.
> > + */
> > + printk(KERN_DEBUG "Interrupt through GPE is already"
> > + " supported.\n");
> > + return AE_OK;
> > + }
> > + if (extended_irq->interrupt_count != 1) {
> > + printk(KERN_DEBUG "Incorrect resource setting about "
> > + "interrupt \n");
> > + return AE_OK;
> > + }
> > + p_ipmi->global_interrupt = extended_irq->interrupts[0];
> > + if (p_ipmi->global_interrupt) {
> > + /* GSI interrupt type */
> > + p_ipmi->interrupttype |= 0x02;
> > + }
> > + return AE_OK;
> > + }
> > + if (resource->type == ACPI_RESOURCE_TYPE_IO ||
> > + resource->type == ACPI_RESOURCE_TYPE_FIXED_IO) {
> > + u16 address;
> > + struct acpi_resource_io *io;
> > + struct acpi_resource_fixed_io *fixed_io;
> > +
> > + fixed_io = &resource->data.fixed_io;
> > + if (p_ipmi->resource_count) {
> > + /*
> > + * Multiply definitions of IO/memory address are
> > + * obtained. It is incorrect. We will continue
> > + * to use the first IO/memory definition.
> > + * If not correct, please fix me.
> > + */
> > + return AE_OK;
> > + }
> > + if (resource->type == ACPI_RESOURCE_TYPE_IO) {
> > + io = &resource->data.io;
> > + if (!io->minimum) {
> > + /* when IO address is zero, return */
> > + return AE_OK;
> > + }
> > + address = io->minimum;
> > + } else {
> > + fixed_io = &resource->data.fixed_io;
> > + if (!fixed_io->address)
> > + return AE_OK;
> > + address = fixed_io->address;
> > + }
> > + p_ipmi->resource_count++;
> > + p_ipmi->addr.space_id = ACPI_ADR_SPACE_SYSTEM_IO;
> > + p_ipmi->addr.address = address;
> > + return AE_OK;
> > + }
> > +
> > + if (resource->type == ACPI_RESOURCE_TYPE_MEMORY32 ||
> > + resource->type == ACPI_RESOURCE_TYPE_MEMORY24 ||
> > + resource->type == ACPI_RESOURCE_TYPE_FIXED_MEMORY32) {
> > + printk(KERN_DEBUG
> > + "Can't handle the Memory32/24/fixed32 type\n");
> > + printk(KERN_DEBUG "please send acpidump to "
> > + "linux-acpi@vger.kernel.org\n");
> > + return AE_OK;
> > + }
> > + if (resource->type == ACPI_RESOURCE_TYPE_ADDRESS16 ||
> > + resource->type == ACPI_RESOURCE_TYPE_ADDRESS32 ||
> > + resource->type == ACPI_RESOURCE_TYPE_ADDRESS64) {
> > + struct acpi_resource_address64 address64;
> > + acpi_resource_to_address64(resource, &address64);
> > + if (p_ipmi->resource_count) {
> > + /*
> > + * Multiply definitions of IO/memory address are
> > + * obtained. It is incorrect. We will continue
> > + * to use the first IO/memory definition.
> > + * If not correct, please fix me.
> > + */
> > + return AE_OK;
> > + }
> > + if (address64.resource_type != ACPI_MEMORY_RANGE &&
> > + address64.resource_type != ACPI_IO_RANGE) {
> > + /* ignore the incorrect resource type */
> > + return AE_OK;
> > + }
> > + p_ipmi->addr.address = address64.minimum;
> > + p_ipmi->resource_count++;
> > + if (address64.resource_type == ACPI_MEMORY_RANGE)
> > + p_ipmi->addr.space_id = ACPI_ADR_SPACE_SYSTEM_MEMORY;
> > + else
> > + p_ipmi->addr.space_id = ACPI_ADR_SPACE_SYSTEM_IO;
> > +
> > + return AE_OK;
> > + }
> >
> > + return AE_OK;
> > +}
> > +
> > +/*
> > + * parse_bmc_resource -- parse the BMC resources from ACPI
> > + * @p_ipmi: the memory to store the BMC resource
> > + * @handle: ACPI device handle
> > + */
> > +static int parse_bmc_resource(struct acpi_device_ipmi *p_ipmi,
> > + acpi_handle handle)
> > +{
> > + int parse_ok = false;
> > + unsigned long long temp_data;
> > + acpi_status status;
> > +
> > + /* According to IPMI spec there should exist the _IFT method
> > + * for the IPMI device. So when there is no _IFT, it is regarded
> > + * as the incorrect BMC device and won't parse the resource again.
> > + */
> > + status = acpi_evaluate_integer(handle, "_IFT", NULL, &temp_data);
> > + if (ACPI_FAILURE(status))
> > + return parse_ok;
> > +
> > + p_ipmi->interfacetype = temp_data;
> > + /* Figure out the interface type. If the interface type is not
> > + * KCS/SMIC/BT, it is regarded as the incorrect IPMI device.
> > + * Of course the SSIF interface type is also defined, but we
> > + * can't handle it. So it is not supported */
> > + switch (temp_data) {
> > + case 1: /* KCS */
> > + case 2: /* SMIC */
> > + case 3: /* BT */
> > + break;
> > + default:
> > + printk(KERN_INFO "ipmi_si: Unknown ACPI/SPMI SI type %d\n",
> > + p_ipmi->interfacetype);
> > + return parse_ok;
> > + }
> > + /* check whether there exists the _GPE method. If it exists, it
> > + * means that interrupt through GPE is supported.
> > + */
> > + temp_data = 0;
> > + status = acpi_evaluate_integer(handle, "_GPE", NULL, &temp_data);
> > + if (ACPI_SUCCESS(status)) {
> > + p_ipmi->gpe = temp_data;
> > + /* set the GPE interrupt type */
> > + p_ipmi->interrupttype |= 0x01;
> > + }
> > + /* get the IPMI revision */
> > + temp_data = 0;
> > + status = acpi_evaluate_integer(handle, "_SRV", NULL, &temp_data);
> > + if (ACPI_SUCCESS(status))
> > + p_ipmi->ipmi_revision = temp_data;
> > +
> > + status = acpi_walk_resources(handle, METHOD_NAME__CRS,
> > + bmc_parse_io_ports, p_ipmi);
> > + if (ACPI_FAILURE(status)) {
> > + printk(KERN_WARNING "Can't parse the _CRS object \n");
> > + return parse_ok;
> > + }
> > + if (!p_ipmi->resource_count) {
> > + /* The incorrect IO/Memory address is parsed */
> > + printk(KERN_WARNING "Incorrect IO/Memory address is parsed\n");
> > + return parse_ok;
> > + }
> > + parse_ok = true;
> > +
> > + return parse_ok;
> > +}
> > +
> > +const struct acpi_device_id ipmi_ids[] = {
> > + {ACPI_VIDEO_HID, 0},
> > + {"", 0},
> > +};
> > +/*
> > + * check_bmc_device -- check whether @handle is a BMC device and then
> > + * get its corresponding resource. For example: IO/Mem
> > + * address, interface type
> > + * @handle: ACPI device handle
> > + * @level : depth in the ACPI namespace tree
> > + * @context: the number of bmc device. In theory there is not more than
> > + * one ACPI BMC device.
> > + * @rv: a return value to fill if desired (Not use)
> > + */
> > +static acpi_status
> > +check_bmc_device(acpi_handle handle, u32 level, void *context,
> > + void **return_value)
> > +{
> > + struct acpi_device *acpi_dev;
> > + struct acpi_device_ipmi *p_ipmi = NULL;
> > + int *count = (int *)context;
> > +
> > + acpi_dev = NULL;
> > + /* Get the acpi device for device handle */
> > + if (acpi_bus_get_device(handle, &acpi_dev) || !acpi_dev) {
> > + /* If there is no ACPI device for handle, return */
> > + return AE_OK;
> > + }
> > +
> > + if (acpi_match_device_ids(acpi_dev, ipmi_ids))
> > + return AE_OK;
> > +
> > + p_ipmi = kzalloc(sizeof(*p_ipmi), GFP_KERNEL);
> > + if (!p_ipmi) {
> > + printk(KERN_DEBUG "Can't allocate memory for IPMI device\n");
> > + return AE_OK;
> > + }
> > + p_ipmi->dev = &acpi_dev->dev;
> > + if (!parse_bmc_resource(p_ipmi, handle)) {
> > + kfree(p_ipmi);
> > + } else {
> > + list_add_tail(&p_ipmi->link, &acpi_ipmi);
> > + *count = *count + 1;
> > + }
> > +
> > + return AE_OK;
> > +}
> > +static __devinit void acpi_device_find_bmc(void)
> > +{
> > + acpi_status status;
> > + int device_count = 0;
> > + struct acpi_device_ipmi *p_ipmi, *p_ipmi2;
> > +
> > + if (acpi_disabled)
> > + return;
> > +
> > + status = acpi_walk_namespace(ACPI_TYPE_DEVICE, ACPI_ROOT_OBJECT,
> > + ACPI_UINT32_MAX,
> > + check_bmc_device, &device_count, NULL);
> > + if (!device_count) {
> > + /* when no IPMI device is found in ACPI namespace, return */
> > + return;
> > + }
> > + if (device_count > 1) {
> > + printk(KERN_WARNING "More than one BMC device is found in "
> > + "ACPI table\n");
> > + printk(KERN_WARNING "Of course the BMC device will be "
> > + "registered\n");
> > + }
> > + list_for_each_entry_safe(p_ipmi, p_ipmi2, &acpi_ipmi, link) {
> > + try_init_acpi_device(p_ipmi);
> > + list_del(&p_ipmi->link);
> > + kfree(p_ipmi);
> > + }
> > +
> > + return;
> > +}
> > static __devinit void acpi_find_bmc(void)
> > {
> > acpi_status status;
> > @@ -2022,6 +2381,7 @@
> >
> > try_init_acpi(spmi);
> > }
> > + acpi_device_find_bmc();
> > }
> > #endif
> >
> >
>
^ permalink raw reply
* Re: private ioctls in input driver
From: Barry Song @ 2009-09-29 2:50 UTC (permalink / raw)
To: Dmitry Torokhov; +Cc: Trilok Soni, linux-input
In-Reply-To: <5d5443650909281005u37503ab0rc6b62effe73f8046@mail.gmail.com>
On Tue, Sep 29, 2009 at 1:05 AM, Trilok Soni <soni.trilok@gmail.com> wrote:
>
> Hi Dmitry,
>
> On Mon, Sep 28, 2009 at 10:32 PM, Dmitry Torokhov
> <dmitry.torokhov@gmail.com> wrote:
> > On Fri, Sep 25, 2009 at 02:21:40PM +0530, Trilok Soni wrote:
> >> Hi Dmitry,
> >>
> >> On Fri, Sep 25, 2009 at 9:32 AM, Dmitry Torokhov
> >> <dmitry.torokhov@gmail.com> wrote:
> >> > Hi Trilok,
> >> >
> >> > On Thu, Sep 24, 2009 at 05:21:09PM +0530, Trilok Soni wrote:
> >> >> Hi Dmitry,
> >> >>
> >> >> Is there any way of creating private ioctls in the input driver? I see
> >> >> that all ioctls are handled by the framework itself and there is no way
> >> >> to call private ioctls if they don't match the standard ones.
> >> >>
> >> >
> >> > You are right, event devices only allow standard ioctls. What kind of
> >> > ioctl are you considering? Normally device-specific controls are done via
> >> > sysfs attached to the parent device (see atkbd, psmouse, etc).
> >>
> >> sysfs might be good when you can associate one file with each
> >> value, but for more data we can't simply create one file per item.
> >> Say five-finger touch data (I know we have MT_* support, but this
> >> is just an example): id, x, y, z, etc. per finger. We can't
> >> create one file for each of them.
> >
> > Maybe use configfs if sysfs is not suitable? I am not sure.
> >
> > I would prefer not to have driver-specific ioctls in the evdev/input core,
> > but rather keep them with the device/driver itself. The input core should
> > only have stuff that makes sense for multiple devices.
>
> I mean something along similar lines: we won't add any driver-specific
> ioctls to the evdev/input core, but just transfer control to the
> respective device/driver itself, perhaps similar to how v4l2 does it.
Sometimes we really do have this kind of requirement. There are more
and more kinds of input devices, and some have special controls that
do not belong in the generic input layer. How about adding a case
to evdev_do_ioctl() that handles these commands by calling the driver's
ioctl directly?
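
The dispatch idea above can be sketched in user space as follows. This is only an illustration, not kernel code: the `sketch_*` types, the command numbers and the `private_ioctl` hook are hypothetical names invented for the example, and the real evdev_do_ioctl() has a different signature and locking rules.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an input device with an optional private hook. */
struct sketch_input_dev {
	/* Driver-supplied handler for commands the core does not know. */
	long (*private_ioctl)(struct sketch_input_dev *dev,
			      unsigned int cmd, unsigned long arg);
	int version;
};

/* Hypothetical command numbers, not real EVIOC* values. */
enum { SKETCH_GET_VERSION = 1, SKETCH_DRV_CALIBRATE = 100 };

/* Core dispatcher: handle standard commands, forward everything else. */
static long sketch_do_ioctl(struct sketch_input_dev *dev,
			    unsigned int cmd, unsigned long arg)
{
	switch (cmd) {
	case SKETCH_GET_VERSION:
		return dev->version;
	default:
		/* Unknown to the core: let the driver decide, if it has a hook. */
		if (dev->private_ioctl)
			return dev->private_ioctl(dev, cmd, arg);
		return -EINVAL;
	}
}

/* Example driver hook that understands exactly one private command. */
static long sketch_drv_ioctl(struct sketch_input_dev *dev,
			     unsigned int cmd, unsigned long arg)
{
	(void)dev;
	if (cmd == SKETCH_DRV_CALIBRATE)
		return (long)arg * 2;
	return -ENOTTY;
}
```

The core keeps no knowledge of the private command set; a device without a hook simply rejects unknown commands as before.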
>
> --
> ---Trilok Soni
> http://triloksoni.wordpress.com
> http://www.linkedin.com/in/triloksoni
> --
> To unsubscribe from this list: send the line "unsubscribe linux-input" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* [PATCH] Bluetooth: Enable auto sleep mode for btmrvl driver
From: Bing Zhao @ 2009-09-29 2:50 UTC (permalink / raw)
To: linux-bluetooth; +Cc: Bing Zhao, Amitkumar Karwar
From: Amitkumar Karwar <akarwar@marvell.com>
The auto sleep mode for btmrvl driver is not enabled by default.
This patch enables auto sleep mode when card is probed.
Signed-off-by: Amitkumar Karwar <akarwar@marvell.com>
Signed-off-by: Bing Zhao <bzhao@marvell.com>
---
drivers/bluetooth/btmrvl_drv.h | 1 +
drivers/bluetooth/btmrvl_main.c | 56 +++++++++++++++++++++++---------------
drivers/bluetooth/btmrvl_sdio.c | 2 +
3 files changed, 37 insertions(+), 22 deletions(-)
diff --git a/drivers/bluetooth/btmrvl_drv.h b/drivers/bluetooth/btmrvl_drv.h
index 411c7a7..523d197 100644
--- a/drivers/bluetooth/btmrvl_drv.h
+++ b/drivers/bluetooth/btmrvl_drv.h
@@ -131,6 +131,7 @@ void btmrvl_check_evtpkt(struct btmrvl_private *priv, struct sk_buff *skb);
int btmrvl_process_event(struct btmrvl_private *priv, struct sk_buff *skb);
int btmrvl_send_module_cfg_cmd(struct btmrvl_private *priv, int subcmd);
+int btmrvl_enable_ps(struct btmrvl_private *priv);
int btmrvl_prepare_command(struct btmrvl_private *priv);
#ifdef CONFIG_DEBUG_FS
diff --git a/drivers/bluetooth/btmrvl_main.c b/drivers/bluetooth/btmrvl_main.c
index e605563..43b5b68 100644
--- a/drivers/bluetooth/btmrvl_main.c
+++ b/drivers/bluetooth/btmrvl_main.c
@@ -189,6 +189,39 @@ int btmrvl_send_module_cfg_cmd(struct btmrvl_private *priv, int subcmd)
}
EXPORT_SYMBOL_GPL(btmrvl_send_module_cfg_cmd);
+int btmrvl_enable_ps(struct btmrvl_private *priv)
+{
+ struct sk_buff *skb;
+ struct btmrvl_cmd *cmd;
+ int ret = 0;
+
+ skb = bt_skb_alloc(sizeof(*cmd), GFP_ATOMIC);
+ if (skb == NULL) {
+ BT_ERR("No free skb");
+ return -ENOMEM;
+ }
+
+ cmd = (struct btmrvl_cmd *) skb_put(skb, sizeof(*cmd));
+ cmd->ocf_ogf = cpu_to_le16(hci_opcode_pack(OGF,
+ BT_CMD_AUTO_SLEEP_MODE));
+ cmd->length = 1;
+
+ if (priv->btmrvl_dev.psmode)
+ cmd->data[0] = BT_PS_ENABLE;
+ else
+ cmd->data[0] = BT_PS_DISABLE;
+
+ bt_cb(skb)->pkt_type = MRVL_VENDOR_PKT;
+
+ skb->dev = (void *) priv->btmrvl_dev.hcidev;
+ skb_queue_head(&priv->adapter->tx_queue, skb);
+
+ BT_DBG("Queue PSMODE Command:%d", cmd->data[0]);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(btmrvl_enable_ps);
+
static int btmrvl_enable_hs(struct btmrvl_private *priv)
{
struct sk_buff *skb;
@@ -258,28 +291,7 @@ int btmrvl_prepare_command(struct btmrvl_private *priv)
if (priv->btmrvl_dev.pscmd) {
priv->btmrvl_dev.pscmd = 0;
-
- skb = bt_skb_alloc(sizeof(*cmd), GFP_ATOMIC);
- if (skb == NULL) {
- BT_ERR("No free skb");
- return -ENOMEM;
- }
-
- cmd = (struct btmrvl_cmd *) skb_put(skb, sizeof(*cmd));
- cmd->ocf_ogf = cpu_to_le16(hci_opcode_pack(OGF, BT_CMD_AUTO_SLEEP_MODE));
- cmd->length = 1;
-
- if (priv->btmrvl_dev.psmode)
- cmd->data[0] = BT_PS_ENABLE;
- else
- cmd->data[0] = BT_PS_DISABLE;
-
- bt_cb(skb)->pkt_type = MRVL_VENDOR_PKT;
-
- skb->dev = (void *) priv->btmrvl_dev.hcidev;
- skb_queue_head(&priv->adapter->tx_queue, skb);
-
- BT_DBG("Queue PSMODE Command:%d", cmd->data[0]);
+ btmrvl_enable_ps(priv);
}
if (priv->btmrvl_dev.hscmd) {
diff --git a/drivers/bluetooth/btmrvl_sdio.c b/drivers/bluetooth/btmrvl_sdio.c
index 5b33b85..d6aaf51 100644
--- a/drivers/bluetooth/btmrvl_sdio.c
+++ b/drivers/bluetooth/btmrvl_sdio.c
@@ -930,6 +930,8 @@ static int btmrvl_sdio_probe(struct sdio_func *func,
priv->hw_wakeup_firmware = btmrvl_sdio_wakeup_fw;
btmrvl_send_module_cfg_cmd(priv, MODULE_BRINGUP_REQ);
+ priv->btmrvl_dev.psmode = 1;
+ btmrvl_enable_ps(priv);
return 0;
--
1.5.4.3
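
For readers unfamiliar with the command that btmrvl_enable_ps() queues, here is a minimal user-space sketch of how its first bytes could be laid out. The bit layout matches the kernel's hci_opcode_pack() macro; the numeric values of the `SKETCH_*` constants are assumptions, since the patch only shows the symbolic names, and only the opcode, length and first data byte are modelled (the driver's struct btmrvl_cmd may reserve more data bytes).

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed values; the patch only shows the symbolic names. */
#define SKETCH_OGF                 0x3F  /* vendor-specific command group */
#define SKETCH_CMD_AUTO_SLEEP_MODE 0x23
#define SKETCH_PS_ENABLE           0x02
#define SKETCH_PS_DISABLE          0x03

/* Same bit layout as hci_opcode_pack(): OCF in bits 0-9, OGF in bits 10-15. */
static uint16_t sketch_opcode_pack(uint16_t ogf, uint16_t ocf)
{
	return (uint16_t)((ocf & 0x03ff) | (ogf << 10));
}

/* Serialise opcode (little endian), parameter length and one data byte. */
static size_t sketch_build_ps_cmd(uint8_t *buf, int psmode)
{
	uint16_t opcode = sketch_opcode_pack(SKETCH_OGF,
					     SKETCH_CMD_AUTO_SLEEP_MODE);

	buf[0] = (uint8_t)(opcode & 0xff);	/* low byte first, as cpu_to_le16() */
	buf[1] = (uint8_t)(opcode >> 8);
	buf[2] = 1;				/* one parameter byte follows */
	buf[3] = psmode ? SKETCH_PS_ENABLE : SKETCH_PS_DISABLE;
	return 4;
}
```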
^ permalink raw reply related
* Re: [Fastboot] kexec - 2.6.14 - loads BIOS again
From: Prabhakar K. @ 2009-09-29 2:48 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: fastboot, Kexec Mailing List
In-Reply-To: <m1tyympope.fsf@fess.ebiederm.org>
[-- Attachment #1.1: Type: text/plain, Size: 2119 bytes --]
From: Eric W. Biederman <ebiederm@xmission.com>
Subject: Re: [Fastboot] kexec - 2.6.14 - loads BIOS again
To: "Prabhakar K." <krishkar99@yahoo.com>
Cc: "Kexec Mailing List" <kexec@lists.infradead.org>, fastboot@lists.linux-foundation.org
Date: Tuesday, September 29, 2009, 6:11 AM
"Prabhakar K." <krishkar99@yahoo.com> writes:
>> Hi - After my previous posting I was able to make some progress.
>>
>> When I do
>>
>> kexec -l vmlinux --append="init 1 root=/dev/hda2" --console-serial
>>
>> and
>> kexec -e,
>>
>> I get message:
>>
>> Starting new kernel
>> I'm in purgatory
>>
>> And after that it still boots from BIOS (to the same first kernel).
>>
>> Whats the problem here?
>
> I would recommend instrumenting up purgatory and then your target kernel.
>
> It is quite possible the problem is in your backport.
>
>>>>>
> I have instrumented the purgatory code. when I run kexec -e, It hangs for a
> while at
> sha256_starts(&ctx); in the function verify_sha256_digest(), and then reboots
> to the first kernel (from BIOS). I have put debug prints in sha256_starts, but
> none of those are executed.
>
> Interestingly, the same function sha256_starts is executed during kexec -l,
> where it is called as part of update_purgatory() (when my_load() is called
> from kexec.c).
> So it is surprising that the same function causes problems when kexec -e is run.
Do you by any chance have a watchdog you are not petting?
>> Yes, we have both a software watchdog and a hardware watchdog. I disabled the software watchdog, and I think the reset to the first kernel can be explained by the hardware watchdog kicking in.
To try a different version of kexec, I took kexec-tools version 20080227 from http://www.kernel.org/pub/linux/kernel/people/horms/kexec-tools/ and instrumented that code as well. With this version, I could see the SHA-256 verification being done during kexec -e, as well as the purgatory code that runs after verification. It then hangs there (it does not reboot to the first kernel).
So, are there any conclusions based on this? Please CC me in your reply. Thanks!
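
One way to test the watchdog theory is to stop the watchdog before running kexec -e. The following is a minimal sketch, assuming the hardware watchdog is exposed through the standard Linux watchdog character device and that its driver honours the "magic close" character `V` (drivers built with the nowayout option do not). The function name is hypothetical, and note that opening the device may itself start the timer.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open the watchdog node, write the magic 'V' character and close it,
 * which asks a magic-close capable driver to stop the timer.
 * Returns 0 on success, -1 if the node cannot be opened or written.
 */
static int sketch_watchdog_magic_close(const char *path)
{
	int fd = open(path, O_WRONLY);
	int ok;

	if (fd < 0)
		return -1;

	ok = (write(fd, "V", 1) == 1);
	if (close(fd) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would typically pass "/dev/watchdog" and only proceed to kexec -e if the function returns 0.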
Eric
^ permalink raw reply
* [PATCH] Bluetooth: Removal of unused variable in btmrvl driver
From: Bing Zhao @ 2009-09-29 2:43 UTC (permalink / raw)
To: linux-bluetooth; +Cc: Bing Zhao, Rahul Tank
From: Rahul Tank <rahult@marvell.com>
This patch removes unused variable "drvdbg" from btmrvl_debugfs_data
structure.
Signed-off-by: Rahul Tank <rahult@marvell.com>
Signed-off-by: Bing Zhao <bzhao@marvell.com>
---
drivers/bluetooth/btmrvl_debugfs.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/drivers/bluetooth/btmrvl_debugfs.c b/drivers/bluetooth/btmrvl_debugfs.c
index 4617bd1..d43b5cb 100644
--- a/drivers/bluetooth/btmrvl_debugfs.c
+++ b/drivers/bluetooth/btmrvl_debugfs.c
@@ -29,7 +29,6 @@ struct btmrvl_debugfs_data {
struct dentry *root_dir, *config_dir, *status_dir;
/* config */
- struct dentry *drvdbg;
struct dentry *psmode;
struct dentry *pscmd;
struct dentry *hsmode;
--
1.5.4.3
^ permalink raw reply related
* compat-wireless master-2009-09-28 breakage and suggested fixes
From: Hin-Tak Leung @ 2009-09-29 2:35 UTC (permalink / raw)
To: Luis R. Rodriguez, linux-wireless
Hi Luis,
A couple of breakages against current wireless testing head:
compat-release - master-2009-09-23-1-gd1e5747
git-describe v2.6.32-rc1-39226-g63dbea4
master-tag master-2009-09-28
-----------
make[1]: Entering directory `/usr/src/kernels/2.6.30.8-67.fc11.x86_64'
CC [M] /home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/main.o
In file included from /home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/main.c:29:
/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/ieee80211_i.h:1053: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ieee80211_monitor_start_xmit’
/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/ieee80211_i.h:1055: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘ieee80211_subif_start_xmit’
make[3]: *** [/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/main.o] Error 1
make[2]: *** [/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211] Error 2
make[1]: *** [_module_/home/Hin-Tak/tmp-git/compat-wireless-2.6] Error 2
make[1]: Leaving directory `/usr/src/kernels/2.6.30.8-67.fc11.x86_64'
make: *** [modules] Error 2
----------
----------
CC [M] /home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/iface.o
/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/iface.c: In function ‘ieee80211_if_add’:
/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/iface.c:815: error: implicit declaration of function ‘SET_NETDEV_DEVTYPE’
make[3]: *** [/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211/iface.o] Error 1
make[2]: *** [/home/Hin-Tak/tmp-git/compat-wireless-2.6/net/mac80211] Error 2
make[1]: *** [_module_/home/Hin-Tak/tmp-git/compat-wireless-2.6] Error 2
----------
The first error can be fixed by inserting the netdev_tx_t enum:
----------------
$ diff -u ../wireless-testing/net/mac80211/ieee80211_i.h net/mac80211/ieee80211_i.h
--- ../wireless-testing/net/mac80211/ieee80211_i.h 2009-09-29 01:21:11.000000000 +0100
+++ net/mac80211/ieee80211_i.h 2009-09-29 02:43:26.000000000 +0100
@@ -28,6 +28,24 @@
#include "key.h"
#include "sta_info.h"
+#ifdef NETDEV_TX_OK
+#undef NETDEV_TX_OK
+#endif
+#ifdef NETDEV_TX_BUSY
+#undef NETDEV_TX_BUSY
+#endif
+#ifdef NETDEV_TX_LOCKED
+#undef NETDEV_TX_LOCKED
+#endif
+
+/* Driver transmit return codes */
+enum netdev_tx {
+ NETDEV_TX_OK = 0, /* driver took care of packet */
+ NETDEV_TX_BUSY, /* driver tx path was busy*/
+ NETDEV_TX_LOCKED = -1, /* driver tx lock was already taken */
+};
+typedef enum netdev_tx netdev_tx_t;
+
struct ieee80211_local;
/* Maximum number of broadcast/multicast frames to buffer when some of the
------------------
The 2nd by adding SET_NETDEV_DEVTYPE():
----------
--- ../wireless-testing/net/mac80211/iface.c 2009-09-29 01:21:11.000000000 +0100
+++ net/mac80211/iface.c 2009-09-29 02:50:54.000000000 +0100
@@ -22,6 +22,8 @@
#include "led.h"
#include "driver-ops.h"
+#define SET_NETDEV_DEVTYPE(net, devtype) ((net)->dev.type = (devtype))
+
/**
* DOC: Interface list locking
*
-------------
The first change is due to this:
commit d0cf9c0dadcdc89a755bcb301cfc9c796eb28ccf
Author: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon Aug 31 19:50:57 2009 +0000
wireless: convert drivers to netdev_tx_t
and the 2nd change due to this:
commit 384912ed194e43c03ad1cdaa09b0b1e488c34d46
Author: Marcel Holtmann <marcel@holtmann.org>
Date: Mon Aug 31 21:08:19 2009 +0000
net: Add DEVTYPE support for Ethernet based devices
Both of these changes trace back to changes in <linux/netdevice.h>,
which compat-wireless does not ship. What's your policy on these
kinds of changes to compat-wireless?
(The 2nd change, SET_NETDEV_DEVTYPE, can probably be wrapped in an
ifndef SET_NETDEV_DEVTYPE and rolled into
"compat/patches/01-netdev.patch". Should the first kind of change also
go into compat/patches/01-netdev.patch?)
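
As a rough illustration of the guarded approach, the two fixes could be combined into one compat fragment along these lines. This is a user-space sketch, not a tested compat-wireless patch: the `sketch_*` structures stand in for the kernel's net_device, and `SKETCH_HAVE_NETDEV_TX_T` is a hypothetical feature flag standing in for whatever check (for example a kernel version test) the compat layer would really use.

```c
#include <stddef.h>

/* Simulate an older <linux/netdevice.h> that only has the #defines. */
#define NETDEV_TX_OK 0
#define NETDEV_TX_BUSY 1
#define NETDEV_TX_LOCKED -1

/* Hypothetical stand-ins for the kernel structures the macro touches. */
struct sketch_device_type { const char *name; };
struct sketch_device { const struct sketch_device_type *type; };
struct sketch_net_device { struct sketch_device dev; };

/* ---- compat fragment: only define what the host headers lack ---- */
#ifndef SKETCH_HAVE_NETDEV_TX_T	/* hypothetical feature flag */
#undef NETDEV_TX_OK
#undef NETDEV_TX_BUSY
#undef NETDEV_TX_LOCKED
/* Driver transmit return codes, same values as the old #defines. */
enum netdev_tx {
	NETDEV_TX_OK = 0,	/* driver took care of packet */
	NETDEV_TX_BUSY,		/* driver tx path was busy */
	NETDEV_TX_LOCKED = -1,	/* driver tx lock was already taken */
};
typedef enum netdev_tx netdev_tx_t;
#endif

#ifndef SET_NETDEV_DEVTYPE
#define SET_NETDEV_DEVTYPE(net, devtype) ((net)->dev.type = (devtype))
#endif

/* A transmit hook written against the newer return type. */
static netdev_tx_t sketch_start_xmit(int queue_full)
{
	return queue_full ? NETDEV_TX_BUSY : NETDEV_TX_OK;
}
```

Because the enum keeps the numeric values of the old defines, code compiled against either variant agrees on what each return code means.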
^ permalink raw reply