* 2.6.22 -mm merge plans
@ 2007-04-30 23:20 Andrew Morton
From: Andrew Morton @ 2007-04-30 23:20 UTC
To: linux-kernel; +Cc: linux-mm
- If replying, please be sure to cc the appropriate individuals. Please
also consider rewriting the Subject: to something appropriate.
- I'll cc linux-mm on this - the memory-management situation is complicated.
- The overall stability in recent -mm's was not sufficiently high and we ran
out of time to find all the bugs. I shouldn't have merged all those patches
last week - they contained an exceptional amount of garbage.
This all means that more bugs than usual will probably leak into mainline,
and we'll have to fix them there.
- I've been ducking most non-bugfix patches recently. I have ~200 feature
and cleanup patches queued for later consideration, so people who sent those
will be hearing from me eventually.
extend-print_symbol-capability.patch
reiserfs-suppress-lockdep-warning.patch
rework-pm_ops-pm_disk_mode-kill-misuse.patch
power-management-remove-firmware-disk-mode.patch
power-management-implement-pm_opsvalid-for-everybody.patch
power-management-force-pm_opsvalid-callback-to-be.patch
add-kvasprintf.patch
pm-include-eio-from-errno-baseh.patch
Sent
ia64-race-flushing-icache-in-do_no_page-path.patch
People are still discussing this
zlib-backout.patch
A huge zlib revert patch. It's a last resort for bug #8405, which is still
being worked on. 2.6.20.x needs fixing, too.
networking-fix-sending-netlink-message-when-replace-route.patch
Will send to davem
slab-introduce-krealloc.patch
Will merge soon
exit-acpi-processor-module-gracefully-if-acpi-is-disabled.patch
Will send to Len
remove-unused-header-file-arch-arm-mach-s3c2410-basth.patch
iop13xx-msi-support-rev6.patch
arm-remove-useless-config-option-generic_bust_spinlock.patch
Will send to rmk
cifs-use-mutexdiff.patch
cifs-use-simple_prepare_write-to-zero-page-data.patch
Will send to sfrench
macintosh-mediabay-convert-to-kthread-api.patch
macintosh-adb-convert-to-the-kthread-api.patch
macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
powerpc-pseries-rtasd-convert-to-kthread-api.patch
powerpc-pseries-eeh-convert-to-kthread-api.patch
Will send to paulus (I already did - does Paul not handle the macintosh
driver?)
revert-gregkh-driver-remove-struct-subsystem-as-it-is-no-longer-needed.patch
This is here because Greg's tree wrecks Dmitry's tree. Will drop once they
sort it out.
idr-fix-obscure-bug-in-allocation-path.patch
idr-separate-out-idr_mark_full.patch
ida-implement-idr-based-id-allocator.patch
ida-implement-idr-based-id-allocator-fix.patch
These will go in via Greg's tree.
fix-sysfs-rom-file-creation-for-bios-rom-shadows.patch
more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
even-more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
even-even-more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
acpi-driver-model-flags-and-platform_enable_wake.patch
update-documentation-driver-model-platformtxt.patch
power-management-remove-some-useless-code-from-arm.patch
Will send to Greg for the driver tree
git-dvb.patch
dvb_en_50221-convert-to-kthread-api.patch
mm-only-saa7134-tvaudio-convert-to-kthread-api.patch
git-dvb-vs-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
For Mauro
i2c-tsl2550-support.patch
apple-smc-driver-hardware-monitoring-and-control.patch
For Jean
ia64-sn-xpc-convert-to-use-kthread-api.patch
ia64-sn-xpc-convert-to-use-kthread-api-fix.patch
ia64-sn-xpc-convert-to-use-kthread-api-fix-2.patch
spin_lock_unlocked-macro-cleanup-in-arch-ia64.patch
For Tony
sbp2-include-fixes.patch
ieee1394-iso-needs-schedh.patch
For Stephan
input-convert-from-class-devices-to-standard-devices.patch
input-evdev-implement-proper-locking.patch
mousedev-fix.patch
mousedev-fix-2.patch
Dmitry will merge these once Greg has merged the preparatory work. Except these
patches make the Vaio-of-doom crash in obscure circumstances, and we weren't
able to fix that?
wistron_btns-add-led-support.patch
input-ff-add-ff_raw-effect.patch
input-phantom-add-a-new-driver.patch
For Dmitry
kconfig-abort-configuration-with-recursive-dependencies.patch
kbuild-handle-compressed-cpio-initramfs-es.patch
For Sam and Roman
ahci-crash-fix.patch
libata-acpi-add-infrastructure-for-drivers-to-use.patch
pata_acpi-restore-driver.patch
optional-led-trigger-for-libata.patch
ata_timing-ensure-t-cycle-is-always-correct.patch
pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
pata_icside-driver.patch
ata stuff
sl82c105-switch-to-ref-counting-api.patch
For Bart
mmc-omap-add-missing-newline.patch
mmc-omap-fix-omap-to-use-mmc_power_on.patch
mmc-omap-clean-up-omap-set_ios-and-make-mmc_power_on.patch
Not sure. These hit three different subsystems: arm, omap and mmc. I might
just send them in.
nommu-present-backing-device-capabilities-for-mtd.patch
nommu-add-support-for-direct-mapping-through-mtdconcat.patch
nommu-generalise-the-handling-of-mtd-specific-superblocks.patch
nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch
romfs-printk-format-warnings.patch
dont-force-uclinux-mtd-map-to-be-root-dev.patch
For dwmw2 (again?)
8139too-force-media-setting-fix.patch
sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
forcedeth-improve-napi-logic.patch
ne-add-platform_driver.patch
ne-add-platform_driver-fix.patch
ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
ibmtr_cs-fix-hang-on-eject.patch
For netdev tree
2621-rc5-mm3-fix-e1000-compilation.patch
Will re-re-resend to Auke
ppp_generic-fix-lockdep-warning.patch
Jeff, I guess. It's not clear that this is correct.
input-rfkill-add-support-for-input-key-to-control-wireless-radio.patch
Will resend to davem once the preparatory bits are merged by Greg.
bluetooth-add-sco-work-around-for-the-broadcom.patch
Will resend to Marcel
fix-i-oat-for-kexec.patch
Will re-re-re-re-resend to Dan
auth_gss-unregister-gss_domain-when-unloading-module.patch
nfs-kill-the-obsolete-nfs_paranoia.patch
nfs-statfs-error-handling-fix.patch
nfs-use-__set_current_state.patch
nfs-suppress-warnings-about-nfs4err_old_stateid-in-nfs4_handle_exception.patch
For Trond
round_up-macro-cleanup-in-drivers-parisc.patch
Will re-re-resend to Kyle.
pcmcia-pccard-deadlock-fix.patch
pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
at91_cf-minor-fix.patch
add-new_id-to-pcmcia-drivers.patch
ide-cs-recognize-2gb-compactflash-from-transcend.patch
Dominik is busy. Will probably re-review and send these direct to Linus.
serial-driver-pmc-msp71xx.patch
rm9000-serial-driver.patch
serial-define-fixed_port-flag-for-serial_core.patch
serial-use-resource_size_t-for-serial-port-io-addresses.patch
mpsc-serial-driver-tx-locking.patch
serial-suppress-rts-assertion-with-disabled-crtscts.patch
8250_pci-fix-pci-must_checks.patch
Seems that I'm maintaining serial now. Will re-review, re-check with rmk then
send.
fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch
remove-pci_dac_dma_-apis.patch
round_up-macro-cleanup-in-drivers-pci.patch
pcie-remove-spin_lock_unlocked.patch
cpqphp-partially-convert-to-use-the-kthread-api.patch
ibmphp-partially-convert-to-use-the-kthreads-api.patch
cpci_hotplug-partially-convert-to-use-the-kthread-api.patch
msi-fix-arm-compile.patch
support-pci-mcfg-space-on-intel-i915-bridges.patch
pci-syscallc-switch-to-refcounting-api.patch
Stuff to (various levels of re-)send to Greg for the PCI tree. I'll probably
drop the kthread patches as they seemed a bit half-baked and I've lost track
of which ones have which levels of baking.
pci-device-ensure-sysdata-initialised-v2.patch
This is for Jeff's git-pciseg.patch which is sort-of on hold at present.
git-s390-vs-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
s390-scsi-zfcp_erp-partially-convert-to-use-the-kthread-api.patch
s390-qeth-convert-to-use-the-kthread-api.patch
s390-net-lcs-convert-to-the-kthread-api.patch
For Martin
round_up-macro-cleanup-in-arch-sh64-kernel-pci_sh5c.patch
For Paul
drivers-scsi-small-cleanups.patch
drivers-scsi-advansysc-cleanups.patch
megaraid-fix-warnings-when-config_proc_fs=n.patch
remove-unnecessary-check-in-drivers-scsi-sgc.patch
pci_module_init-convertion-in-tmscsimc.patch
drivers-scsi-ncr5380c-replacing-yield-with-a.patch
drivers-scsi-megaraidc-replacing-yield-with-a.patch
drivers-scsi-mca_53c9xc-save_flags-cli-removal.patch
sym53c8xx_2-claims-cpqarray-device.patch
drivers-scsi-wd33c93c-cleanups.patch
scsi-cover-up-bugs-fix-up-compiler-warnings-in-megaraid-driver.patch
drivers-scsi-qla4xxx-possible-cleanups.patch
make-seagate_st0x_detect-static.patch
scsi-fix-obvious-typo-spin_lock_irqrestore-in-gdthc.patch
drivers-scsi-aic7xxx_old-convert-to-generic-boolean-values.patch
cleanup-variable-usage-in-mesh-interrupt-handler.patch
fix--confusion-in-fusion-driver.patch
use-unchecked_isa_dma-in-sd_revalidate_disk.patch
fdomainc-get-rid-of-unused-stuff.patch
remove-the-broken-scsi_acornscsi_3-driver.patch
scsi-fix-config_scsi_wait_scan=m.patch
sas_scsi_host-partially-convert-to-use-the-kthread-api.patch
qla1280-use-dma_64bit_mask-instead-of-0ull.patch
pci-error-recovery-symbios-scsi-base-support.patch
pci-error-recovery-symbios-scsi-first-failure.patch
Will re^N-send to James.
sparc64-powerc-convert-to-use-the-kthread-api.patch
Might drop, might send to davem.
git-unionfs.patch
Does this have a future?
cxacru-add-documentation-file.patch
cxacru-cleanup-sysfs-attribute-code.patch
For Greg.
i386-map-enough-initial-memory-to-create-lowmem-mappings-fix.patch
fault-injection-disable-stacktrace-filter-for-x86-64.patch
i386-efi-fix-proc-iomem-type-for-kexec-tools.patch
fault-injection-enable-stacktrace-with-dwarf2-unwinder.patch
i386-__inquire_remote_apic-printk-warning-fix.patch
x86-msr-add-support-for-safe-variants.patch
For Andi
xfs-clean-up-shrinker-games.patch
xfs-fix-unmount-race.patch
For David
add-apply_to_page_range-which-applies-a-function-to-a-pte-range.patch
add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix.patch
safer-nr_node_ids-and-nr_node_ids-determination-and-initial.patch
use-zvc-counters-to-establish-exact-size-of-dirtyable-pages.patch
proper-prototype-for-hugetlb_get_unmapped_area.patch
mm-remove-gcc-workaround.patch
slab-ensure-cache_alloc_refill-terminates.patch
mm-more-rmap-checking.patch
mm-make-read_cache_page-synchronous.patch
fs-buffer-dont-pageuptodate-without-page-locked.patch
allow-oom_adj-of-saintly-processes.patch
introduce-config_has_dma.patch
mm-slabc-proper-prototypes.patch
mm-detach_vmas_to_be_unmapped-fix.patch
Misc MM things. Will merge.
add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch
add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch
split-the-free-lists-for-movable-and-unmovable-allocations.patch
choose-pages-from-the-per-cpu-list-based-on-migration-type.patch
add-a-configure-option-to-group-pages-by-mobility.patch
drain-per-cpu-lists-when-high-order-allocations-fail.patch
move-free-pages-between-lists-on-steal.patch
group-short-lived-and-reclaimable-kernel-allocations.patch
group-high-order-atomic-allocations.patch
do-not-group-pages-by-mobility-type-on-low-memory-systems.patch
bias-the-placement-of-kernel-pages-at-lower-pfns.patch
be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch
fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch
create-the-zone_movable-zone.patch
allow-huge-page-allocations-to-use-gfp_high_movable.patch
x86-specify-amount-of-kernel-memory-at-boot-time.patch
ppc-and-powerpc-specify-amount-of-kernel-memory-at-boot-time.patch
x86_64-specify-amount-of-kernel-memory-at-boot-time.patch
ia64-specify-amount-of-kernel-memory-at-boot-time.patch
add-documentation-for-additional-boot-parameter-and-sysctl.patch
handle-kernelcore=-boot-parameter-in-common-code-to-avoid-boot-problem-on-ia64.patch
Mel's moveable-zone work.
I don't believe that this has had sufficient review and I'm sure that it
hasn't had sufficient third-party testing. Most of the approbations thus far
have consisted of people liking the overall idea, based on the changelogs and
multi-year-old discussions.
For such a large and core change I'd have expected more detailed reviewing
effort and more third-party testing. And I STILL haven't made time to review
the code in detail myself.
So I'm a bit uncomfortable with moving ahead with these changes.
mm-simplify-filemap_nopage.patch
mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
convert-hugetlbfs-to-use-vm_ops-fault.patch
mm-remove-legacy-cruft.patch
mm-debug-check-for-the-fault-vs-invalidate-race.patch
mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
add-unitialized_var-macro-for-suppressing-gcc-warnings.patch
i386-add-ptep_test_and_clear_dirtyyoung.patch
i386-use-pte_update_defer-in-ptep_test_and_clear_dirtyyoung.patch
Miscish MM changes. Will merge, dependent upon what still applies and works
if the moveable-zone patches get stalled.
smaps-extract-pmd-walker-from-smaps-code.patch
smaps-add-pages-referenced-count-to-smaps.patch
smaps-add-clear_refs-file-to-clear-reference.patch
Referenced-page accounting in /proc/pid/smaps. Is related to the maps2
patches. Will merge.
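
As a rough illustration of how the referenced-page accounting could be
consumed from userspace, here is a minimal C sketch. It assumes the new count
shows up in /proc/pid/smaps as a per-mapping "Referenced:" line in kB (the
field name and format are assumptions here, not taken from the patches above),
and sum_smaps_field() is a hypothetical helper, not anything in the tree.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sum the kB values of every line in an smaps-style stream that starts
 * with `field` (for example "Referenced:").
 *
 * Returns the total in kB, or -1 if no matching line was found.
 * The "Referenced:" field name is an assumption about the output format.
 */
long sum_smaps_field(FILE *f, const char *field)
{
	char line[256];
	size_t flen;
	long total = 0, kb;
	int seen = 0;

	assert(f != NULL && field != NULL);
	flen = strlen(field);

	while (fgets(line, sizeof(line), f)) {
		/* Only lines that begin with the field name are of interest. */
		if (strncmp(line, field, flen) != 0)
			continue;
		/* The value follows the name, e.g. "Referenced:     4 kB". */
		if (sscanf(line + flen, "%ld", &kb) == 1) {
			total += kb;
			seen = 1;
		}
	}
	return seen ? total : -1;
}
```

A caller would pass a stream opened on /proc/self/smaps (or another pid's
file) with the field "Referenced:". If clear_refs behaves as its patch title
suggests, sampling this total before and after a write to that file would
give a crude working-set estimate.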
maps2-uninline-some-functions-in-the-page-walker.patch
maps2-eliminate-the-pmd_walker-struct-in-the-page-walker.patch
maps2-remove-vma-from-args-in-the-page-walker.patch
maps2-propagate-errors-from-callback-in-page-walker.patch
maps2-add-callbacks-for-each-level-to-page-walker.patch
maps2-move-the-page-walker-code-to-lib.patch
maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
maps2-move-clear_refs-code-to-task_mmuc.patch
maps2-regroup-task_mmu-by-interface.patch
maps2-make-proc-pid-smaps-optional-under-config_embedded.patch
maps2-make-proc-pid-clear_refs-option-under-config_embedded.patch
maps2-add-proc-pid-pagemap-interface.patch
maps2-add-proc-kpagemap-interface.patch
/proc/pid/pagemap and /proc/kpagemap. A fairly important and low-level way of
exposing memory state to userspace, for developers.
Matt still has a decent-sized todo list here. Might merge, might hold over
for 2.6.23.
lumpy-reclaim-v4.patch
This is in a similar situation to the moveable-zone work. Sounds great on
paper, but it needs considerable third-party testing and review. It is a
major change to core MM and, we hope, a significant advance. On paper.
add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
anti-fragmentation-switch-over-to-pfn_valid_within.patch
lumpy-move-to-using-pfn_valid_within.patch
More Mel things, and linkage between Mel-things and lumpy reclaim. It's here
where the patch ordering gets into a mess and things won't improve if
moveable-zones and lumpy-reclaim get deferred. Such a deferral would limit my
ability to queue more MM changes for 2.6.23.
readahead-improve-heuristic-detecting-sequential-reads.patch
readahead-code-cleanup.patch
Will merge.
bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch
remove-page_group_by_mobility.patch
dont-group-high-order-atomic-allocations.patch
More moveable-zone work.
mm-move-common-segment-checks-to-separate-helper-function-v7.patch
slab-use-num_possible_cpus-in-enable_cpucache.patch
slab-dont-allocate-empty-shared-caches.patch
slab-numa-kmem_cache-diet.patch
do-not-disable-interrupts-when-reading-min_free_kbytes.patch
slab-mark-set_up_list3s-__init.patch
mm-clean-up-and-kernelify-shrinker-registration.patch
fix-section-mismatch-of-memory-hotplug-related-code.patch
add-white-list-into-modpostc-for-memory-hotplug-code-and-ia64s-machvec-section.patch
split-mmap.patch
only-allow-nonlinear-vmas-for-ram-backed-filesystems.patch
cpusets-allow-tif_memdie-threads-to-allocate-anywhere.patch
More MM misc. Will merge those patches which survive other merge decisions.
i386-use-page-allocator-to-allocate-thread_info-structure.patch
slub-core.patch
slub. Or part thereof. This is another patch series which got messed up by
poor patch sequencing.
make-page-private-usable-in-compound-pages-v1.patch
optimize-compound_head-by-avoiding-a-shared-page.patch
add-virt_to_head_page-and-consolidate-code-in-slab-and-slub.patch
slub-fix-object-tracking.patch
slub-enable-tracking-of-full-slabs.patch
slub-validation-of-slabs-metadata-and-guard-zones.patch
slub-add-min_partial.patch
slub-add-ability-to-list-alloc--free-callers-per-slab.patch
slub-free-slabs-and-sort-partial-slab-lists-in-kmem_cache_shrink.patch
slub-remove-object-activities-out-of-checking-functions.patch
slub-user-documentation.patch
slub-add-slabinfo-tool.patch
Most of the rest of slub. Will merge it all.
quicklists-for-page-table-pages.patch
quicklist-support-for-ia64.patch
quicklist-support-for-x86_64.patch
quicklist-support-for-sparc64.patch
Will merge
slob-handle-slab_panic-flag.patch
include-kern_-constant-in-printk-calls-in-mm-slabc.patch
mm-madvise-avoid-exclusive-mmap_sem.patch
mm-remove-destroy_dirty_buffers-from-invalidate_bdev.patch
mm-optimize-kill_bdev.patch
mm-optimize-acorn-partition-truncate.patch
slab-allocators-remove-obsolete-slab_must_hwcache_align.patch
kmem_cache-simplify-slab-cache-creation.patch
slab-allocators-remove-multiple-alignment-specifications.patch
use-slab_panic-flag-cleanup.patch
fault-injection-fix-failslab-with-config_numa.patch
mm-document-fault_data-and-flags.patch
mm-fix-handling-of-panic_on_oom-when-cpusets-are-in-use.patch
oom-fix-constraint-deadlock.patch
More MM misc. Will merge.
get_unmapped_area-handles-map_fixed-on-powerpc.patch
get_unmapped_area-handles-map_fixed-on-alpha.patch
get_unmapped_area-handles-map_fixed-on-arm.patch
get_unmapped_area-handles-map_fixed-on-frv.patch
get_unmapped_area-handles-map_fixed-on-i386.patch
get_unmapped_area-handles-map_fixed-on-ia64.patch
get_unmapped_area-handles-map_fixed-on-parisc.patch
get_unmapped_area-handles-map_fixed-on-sparc64.patch
get_unmapped_area-handles-map_fixed-on-x86_64.patch
get_unmapped_area-handles-map_fixed-in-hugetlbfs.patch
get_unmapped_area-handles-map_fixed-in-generic-code.patch
get_unmapped_area-doesnt-need-hugetlbfs-hacks-anymore.patch
Will merge.
slub-exploit-page-mobility-to-increase-allocation-order.patch
Slub entanglement with moveable-zones. Will merge if moveable-zones is merged.
slab-allocators-remove-slab_debug_initial-flag.patch
slab-allocators-remove-slab_ctor_atomic.patch
slub-mm-only-make-slub-the-default-slab-allocator.patch
Various slab-related patches which are dependent upon multiple previous
patches.
slub-i386-support.patch
Will hold for a while.
lazy-freeing-of-memory-through-madv_free.patch
lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
restore-madv_dontneed-to-its-original-linux-behaviour.patch
I think the MADV_FREE changes need more work:
We need crystal-clear statements regarding the present functionality, the new
functionality and how these relate to the spec and to implementations in other
OSes. Once we have that info we are in a position to work out whether the
code can be merged as-is, or if additional changes are needed.
Because right now, I don't know where we are with respect to these things and
I doubt if many of our users know either. How can Michael write a manpage for
this if we don't tell him what it all does?
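
One piece of the "present functionality" can at least be checked from
userspace. The sketch below assumes a Linux system where MADV_DONTNEED on
private anonymous memory discards the pages, so that the next read sees
zero-filled memory. dontneed_zero_fills() is a hypothetical helper written
for illustration; it says nothing about MADV_FREE itself or about what other
OSes do.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Dirty one private anonymous page, apply MADV_DONTNEED, then read it back.
 *
 * Returns 1 if the page reads as zero afterwards, 0 if the old contents
 * survived, and -1 if mmap() or madvise() failed.
 */
int dontneed_zero_fills(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	void *map;
	int result;

	assert(psz > 0);

	map = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Make sure the page is really populated and dirty. */
	memset(map, 0xaa, (size_t)psz);

	if (madvise(map, (size_t)psz, MADV_DONTNEED) != 0) {
		munmap(map, (size_t)psz);
		return -1;
	}

	/* Touch the page again and see what comes back. */
	p = map;
	result = (p[0] == 0);

	munmap(map, (size_t)psz);
	return result;
}
```

A result of 1 corresponds to the discard-and-zero-fill behaviour; a result of
0 would indicate that the advice was treated as a hint that kept the data.
Writing down which of these each madvise flag is supposed to produce is the
kind of statement a manpage would need.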
implement-file-posix-capabilities.patch
file-capabilities-accomodate-future-64-bit-caps.patch
return-eperm-not-echild-on-security_task_wait-failure.patch
I think we're still waiting for the security guys to work out what to do with
this work.
blackfin-arch.patch
driver_bfin_serial_core.patch
blackfin-on-chip-ethernet-mac-controller-driver.patch
blackfin-patch-add-blackfin-support-in-smc91x.patch
blackfin-on-chip-rtc-controller-driver.patch
blackfin-blackfin-on-chip-spi-controller-driver.patch
convert-h8-300-to-generic-timekeeping.patch
h8300-generic-irq.patch
h8300-add-zimage-support.patch
round_up-macro-cleanup-in-arch-alpha-kernel-osf_sysc.patch
alpha-fix-bootp-image-creation.patch
alpha-prctl-macros.patch
srmcons-fix-kmallocgfp_kernel-inside-spinlock.patch
arm26-remove-useless-config-option-generic_bust_spinlock.patch
arch stuff. Will merge.
fix-refrigerator-vs-thaw_process-race.patch
swsusp-use-inline-functions-for-changing-page-flags.patch
swsusp-do-not-use-page-flags.patch
mm-remove-unused-page-flags.patch
swsusp-fix-error-paths-in-snapshot_open.patch
swsusp-use-gfp_kernel-for-creating-basic-data-structures.patch
freezer-remove-pf_nofreeze-from-handle_initrd.patch
swsusp-use-rbtree-for-tracking-allocated-swap.patch
freezer-fix-racy-usage-of-try_to_freeze-in-kswapd.patch
remove-software_suspend.patch
power-management-change-sys-power-disk-display.patch
kconfig-mentioneds-hibernation-not-just-swsusp.patch
swsusp-fix-snapshot_release.patch
swsusp-free-more-memory.patch
swsusp: will merge.
remove-unused-header-file-arch-m68k-atari-atasoundh.patch
spin_lock_unlocked-cleanup-in-arch-m68k.patch
remove-unused-header-file-drivers-serial-crisv10h.patch
cris-check-for-memory-allocation.patch
cris-remove-code-related-to-pre-22-kernel.patch
uml-delete-unused-code.patch
uml-formatting-fixes.patch
uml-host_info-tidying.patch
uml-mark-tt-mode-code-for-future-removal.patch
uml-print-coredump-limits.patch
uml-handle-block-device-hotplug-errors.patch
uml-driver-formatting-fixes.patch
uml-driver-formatting-fixes-fix.patch
uml-network-interface-hotplug-error-handling.patch
array_size-check-for-type.patch
uml-move-sigio-testing-to-sigioc.patch
uml-create-archh.patch
uml-create-as-layouth.patch
uml-move-remaining-useful-contents-of-user_utilh.patch
uml-remove-user_utilh.patch
uml-add-missing-__init-declarations.patch
remove-unused-header-file-arch-um-kernel-tt-include-mode_kern-tth.patch
uml-improve-checking-and-diagnostics-of-ethernet-macs.patch
uml-eliminate-temporary-buffer-in-eth_configure.patch
uml-replace-one-element-array-with-zero-element-array.patch
uml-fix-umid-in-xterm-titles.patch
uml-speed-up-exec.patch
uml-no-locking-needed-in-tlsc.patch
uml-tidy-processc.patch
uml-remove-page_size.patch
uml-kernel_thread-shouldnt-panic.patch
uml-tidy-fault-code.patch
uml-kernel-segfaults-should-dump-proper-registers.patch
uml-comment-early-boot-locking.patch
uml-irq-locking-commentary.patch
uml-delete-host_frame_size.patch
uml-drivers-get-release-methods.patch
uml-dump-registers-on-ptrace-or-wait-failure.patch
uml-speed-up-page-table-walking.patch
uml-remove-unused-x86_64-code.patch
uml-start-fixing-os_read_file-and-os_write_file.patch
uml-tidy-libc-code.patch
uml-convert-libc-layer-to-call-read-and-write.patch
uml-batch-i-o-requests.patch
uml-send-pointers-instead-of-structures-to-i-o-thread.patch
uml-dump-core-on-panic.patch
uml-dont-try-to-handle-signals-on-initial-process-stack.patch
uml-change-remaining-callers-of-os_read_write_file.patch
uml-formatting-fixes-around-os_read_write_file-callers.patch
uml-remove-debugging-remnants.patch
uml-rename-os_read_write_file_k-back-to-os_read_write_file.patch
uml-aio-deadlock-avoidance.patch
uml-speed-page-fault-path.patch
uml-eliminate-a-piece-of-debugging-code.patch
uml-more-page-fault-path-trimming.patch
uml-only-flush-areas-covered-by-vma.patch
uml-out-of-tmpfs-space-error-clarification.patch
uml-virtualized-time-fix.patch
v850-generic-timekeeping-conversion.patch
xtensa-strlcpy-is-smart-enough.patch
More arch things. Will merge.
deprecate-smbfs-in-favour-of-cifs.patch
Probably 2.6.23.
cpuset-remove-sched-domain-hooks-from-cpusets.patch
Hold.
# clone-flag-clone_parent_tidptr-leaves-invalid-results-in-memory.patch: Eric B had issues
clone-flag-clone_parent_tidptr-leaves-invalid-results-in-memory.patch
factor-outstanding-i-o-error-handling.patch
block_write_full_page-handle-enospc.patch
simplify-the-stacktrace-code.patch
filesystem-disk-errors-at-boot-time-caused-by-probe.patch
allow-access-to-proc-pid-fd-after-setuid.patch
ext2-3-4-fix-file-date-underflow-on-ext2-3-filesystems-on-64-bit-systems.patch
reduce-size-of-task_struct-on-64-bit-machines.patch
fix-quadratic-behavior-of-shrink_dcache_parent.patch
mm-shrink-parent-dentries-when-shrinking-slab.patch
ipmi-add-powerpc-openfirmware-sensing.patch
ipmi-allow-shared-interrupts.patch
ipmi-add-new-ipmi-nmi-watchdog-handling.patch
ipmi-add-pci-remove-handling.patch
freezer-task-exit_state-should-be-treated-as-bolean.patch
softlockup-trivial-s-99-max_rt_prio.patch
fix-constant-folding-and-poor-optimization-in-byte-swapping.patch
documentation-ask-driver-writers-to-provide-pm-support.patch
# fix-__d_path-for-lazy-unmounts-and-make-it-unambiguous.patch: Alan issues
use-symbolic-constants-in-generic-lseek-code.patch
use-use-seek_max-to-validate-user-lseek-arguments.patch
devpts-add-fsnotify-create-event.patch
tty-clarify-documentation-of-write.patch
drivers-char-hvc_consolec-cleanups.patch
is_power_of_2-in-fat.patch
is_power_of_2-in-fs-hfs.patch
is_power_of_2-in-fs-block_devc.patch
freevxfs-possible-null-pointer-dereference-fix.patch
reiserfs-possible-null-pointer-dereference-during-resize.patch
scripts-kernel-doc-whitespace-cleanup.patch
fix-section-mismatch-warning-in-lib-swiotlbc.patch
init-do_mountsc-proper-prepare_namespace-prototype.patch
fix-compilation-of-drivers-with-o0.patch
reiserfs-shrink-superblock-if-no-xattrs.patch
module-use-krealloc.patch
reiserfs-correct-misspelled-reiserfs_proc_info-to.patch
kconfig-centralize-the-selection-of-semaphore-debugging.patch
irq-add-__must_check-to-request_irq.patch
# use-stop_machine_run-in-the-intel-rng-driver.patch: needs re-review
use-stop_machine_run-in-the-intel-rng-driver.patch
cap-shmmax-at-int_max-in-compat-shminfo.patch
exec-fix-remove_arg_zero.patch
merge-sys_clone-sys_unshare-nsproxy-and-namespace.patch
rcutorture-mark-rcu_torture_init-as-__init.patch
init-dma-masks-in-pnp_dev.patch
optimize-timespec_trunc.patch
ext3-dirindex-error-pointer-issues.patch
the-scheduled-removal-of-obsolete_oss-options.patch
epoll-optimizations-and-cleanups.patch
oss-strlcpy-is-smart-enough.patch
add-filesystem-subtype-support.patch
fix-race-between-proc_get_inode-and-remove_proc_entry.patch
fix-race-between-proc_readdir-and-remove_proc_entry.patch
proc-remove-pathetic-deleted-warn_on.patch
vfs-remove-superflous-sb-==-null-checks.patch
nameic-remove-utterly-outdated-comment.patch
tpm_infineon-add-support-for-devices-in-mmio-space.patch
replace-pci_find_device-in-drivers-telephony-ixjc.patch
floppy-handle-device_create_file-failure-while-init.patch
drivers-macintosh-mac_hidc-make-code-static.patch
rocket-remove-modversions-include.patch
virtual_eisa_root_init-should-be-__init.patch
proc-maps-protection.patch
remove-unused-header-file-drivers-message-i2o-i2o_lanh.patch
remove-unused-header-file-drivers-char-digih.patch
drivers-char-synclinkc-check-kmalloc-return-value.patch
procfs-reorder-struct-pid_dentry-to-save-space-on-64bit-archs-and-constify-them.patch
add-file-position-info-to-proc.patch
vfs-delay-the-dentry-name-generation-on-sockets-and.patch
tty-i386-x86_64-arbitary-speed-support.patch
kprobes-make-kprobesymbol_name-const.patch
fix-cycladesh-for-x86_64-and-probably-others.patch
cyclades-remove-custom-types.patch
small-fixes-for-jsm-driver.patch
jsm-driver-fix-for-linuxpps-support.patch
as-fix-antic_expire-check.patch
rtc-add-rtc-rs5c313-driver.patch
# rtc-add-rtc-class-driver-for-the-maxim-max6900.patch: Jean requested updates
rtc-add-rtc-class-driver-for-the-maxim-max6900.patch
# fix-rmmod-read-write-races-in-proc-entries.patch: worrisome (Arjan)
fix-rmmod-read-write-races-in-proc-entries.patch
# getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch: wrong
getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch
futex-restartable-futex_wait.patch
proc-oom_score-oops-re-badness.patch
enlarge-console-name.patch
fixes-and-cleanups-for-earlyprintk-aka-boot-console.patch
tty-remove-unnecessary-export-of-proc_clear_tty.patch
tty-simplify-calling-of-put_pid.patch
tty-introduce-no_tty-and-use-it-in-selinux.patch
reiserfs-proc-support-requires-proc_fs.patch
kprobes-fix-sparse-null-warning.patch
add-ability-to-keep-track-of-callers-of-symbol_getput.patch
update-mtd-use-of-symbol_getput.patch
update-dvb-use-of-symbol_getput.patch
move-die-notifier-handling-to-common-code.patch
char-rocket-add-module_device_table.patch
char-cs5535_gpio-add-module_device_table.patch
remove-do_sync_file_range.patch
protect-tty-drivers-list-with-tty_mutex.patch
# more-scheduled-oss-driver-removal.patch: too early?
more-scheduled-oss-driver-removal.patch
schedule-obsolete-oss-drivers-for-removal-4th-round.patch
delete-unused-header-file-math-emu-extendedh.patch
fix-sscanf-%n-match-at-end-of-input-string.patch
make-remove_inode_dquot_ref-static.patch
fix-race-between-attach_task-and-cpuset_exit.patch
delete-unused-header-file-linux-awe_voiceh.patch
kernel-irq-procc-unprotected-iteration-over-the-irq-action-list-in-name_unique.patch
parport-dev-driver-model-support.patch
legacy-pc-parports-support-parport-dev.patch
layered-parport-code-uses-parport-dev.patch
cache-pipe-buf-page-address-for-non-highmem-arch.patch
add-support-for-deferrable-timers-respun.patch
add-a-new-deferrable-delayed-work-init.patch
linux-sysdevh-needs-to-include-linux-moduleh.patch
irq-check-for-percpu-flag-only-when-adding-first-irqaction.patch
# time-smp-friendly-alignment-of-struct-clocksource.patch: needs x86_64-move-__vgetcpu_mode-__jiffies-to-the-vsyscall_2-zone.patch
time-smp-friendly-alignment-of-struct-clocksource.patch
move-timekeeping-code-to-timekeepingc.patch
ignore-stolen-time-in-the-softlockup-watchdog.patch
add-touch_all_softlockup_watchdogs.patch
header-cleaning-dont-include-smp_lockh-when-not-used.patch
fix-82875-pci-setup.patch
unexport-pci_proc_attach_device.patch
make-dev-port-conditional-on-config-symbol.patch
remove-artificial-software-max_loop-limit.patch
kdump-kexec-calculate-note-size-at-compile-time.patch
fix-kevents-childs-priority-greediness.patch
display-all-possible-partitions-when-the-root-filesystem-failed-to-mount.patch
enhance-initcall_debug-measure-latency.patch
kprobes-print-details-of-kretprobe-on-assertion-failure.patch
reregister_binfmt-returns-with-ebusy.patch
pnpacpi-sets-pnpdev-devarchdata.patch
simplify-module_get_kallsym-by-dropping-length-arg.patch
fix-race-between-rmmod-and-cat-proc-kallsyms.patch
simplify-kallsyms_lookup.patch
fix-race-between-cat-proc-wchan-and-rmmod-et-al.patch
fix-race-between-cat-proc-slab_allocators-and-rmmod.patch
kernel-paramsc-fix-lying-comment-for-param_array.patch
replace-deprecated-sa_xxx-interrupt-flags.patch
deprecate-sa_xxx-interrupt-flags-v2.patch
# expose-range-checking-functions-from-arch-specific.patch: wrong? crap!
expose-range-checking-functions-from-arch-specific.patch
remove-hardcoding-of-hard_smp_processor_id-on-up.patch
use-the-apic-to-determine-the-hardware-processor-id-i386.patch
use-the-apic-to-determine-the-hardware-processor-id-x86_64.patch
always-ask-the-hardware-to-obtain-hardware-processor-id-ia64.patch
round_up-macro-cleanup-in-drivers-char-lpc.patch
i386-schedh-inclusion-from-moduleh-is-baack.patch
parport_serial-fix-pci-must_checks.patch
round_up-macro-cleanup-in-fs-selectcompatreaddirc.patch
round_up-macro-cleanup-in-fs-smbfs-requestc.patch
doc-kernel-parameters-use-x86-32-tag-instead-of-ia-32.patch
kernel-doc-handle-arrays-with-arithmetic-expressions-as.patch
merge-compat_ioctlh-into-compat_ioctlc.patch
lockdep-treats-down_write_trylock-like-regular-down_write.patch
pad-irq_desc-to-internode-cacheline-size.patch
partition-add-support-for-sysv68-partitions.patch
dtlk-fix-error-checks-in-module_init.patch
add-spaces-on-either-side-of-case-operator.patch
cleanup-compat-ioctl-handling.patch
partitions-check-the-return-value-of-kobject_add-etc.patch
kallsyms-cleanup-use-seq_release_private-where-appropriate.patch
proc-cleanup-use-seq_release_private-where-appropriate.patch
cciss-reformat-error-handling.patch
cciss-add-sg_io-ioctl-to-cciss.patch
cciss-set-rq-errors-more-correctly-in-driver.patch
generate-main-index-page-when-building-htmldocs.patch
alphabetically-sorted-entries-in.patch
fix-hotplug-for-legacy-platform-drivers.patch
# remove-redundant-check-from-proc_setattr: need sds ack
remove-redundant-check-from-proc_setattr.patch
remove-redundant-check-from-proc_sys_setattr.patch
make-iunique-use-a-do-while-loop-rather-than-its-obscure-goto-loop.patch
kernel-doc-html-mode-struct-highlights.patch
add-webpages-url-and-summarize-3-lines.patch
add-keyboard-blink-driver.patch
efi-warn-only-for-pre-100-system-tables.patch
apm-fix-incorrect-comment.patch
cciss-include-scsi-scsih-unconditionally.patch
highres-dyntick-prevent-xtime-lock-contention.patch
documentation-cciss-detecting-failed-drives.patch
spin_lock_unlocked-cleanup-in-init_taskh.patch
spin_lock_unlocked-cleanup-in-drivers-char-keyboard.patch
spin_lock_unlocked-cleanup-in-drivers-serial.patch
lockdep-lookup_chain_cache-comment-errata.patch
taskstats-fix-getdelays-usage-information.patch
smbfs-remove-unnecessary-allow_signal.patch
pnpbios-conert-to-use-the-kthread-api.patch
introduce-a-handy-list_first_entry-macro-v2.patch
document-spin_lock_unlocked-rw_lock_unlocked-deprecation.patch
getdelaysc-fix-overrun.patch
serial_txx9-use-assigned-device-numbers.patch
serial_txx9-zap-changelog-from-source-code.patch
cpu-time-limit-patch--setrlimitrlimit_cpu-0-cheat-fix.patch
ext3-copy-i_flags-to-inode-flags-on-write.patch
codingstyle-start-flamewar-about-use-of-braces.patch
upper-32-bits.patch
console-utf-8-fixes.patch
#report-that-kernel-is-tainted-if-there-were-an-oops-before.patch
clarify-the-creation-of-the-localversion_auto-string.patch
add-pci_try_set_mwi.patch
check-privileges-before-setting-mount-propagation.patch
jbd-check-for-error-returned-by-kthread_create-on-creating-journal-thread.patch
clean-up-mutex_trylock-noise.patch
the-scheduled-einval-for-invalid-timevals-in-setitimer.patch
reiserfs-use-__set_current_state.patch
drivers-char-use-__set_current_state.patch
kill-warnings-when-building-mandocs.patch
cleanup-mostly-unused-iospace-macros.patch
lockdep-removed-unused-ip-argument-in-mark_lock-mark_held_locks.patch
fat_dont-use_free_clusters-for-fat32.patch
copy-i_flags-to-ext2-inode-flags-on-write.patch
fix-chapter-reference-in-codingstyle.patch
sleep-during-spinlock-in-tpm-driver.patch
consolidate-asm-consth-to-linux-consth.patch
x86_64-kill-19000-sparse-warnings.patch
move-log_buf_shift-to-a-more-sensible-place.patch
w1-printk-format-warning.patch
w1-allow-bus-master-to-have-reset-and-byte-ops.patch
driver-for-the-maxim-ds1wm-a-1-wire-bus-master-asic-core.patch
dma_declare_coherent_memory-wrong-allocation.patch
deflate-inflate_dynamic-too.patch
fix-wrong-identifier-name-in-documentation-driver-model-devrestxt.patch
edd-switch-to-refcounting-pci-apis.patch
fix-vfat-compat-ioctls-on-64-bit-systems.patch
Misc. A few of these need rechecking by people who had comments. I'll
re-review these and will mostly-merge.
consolidate-generic_writepages-and-mpage_writepages.patch
Might merge. I forget what happened to this.
sync_sb_inodes-propagate-errors.patch
This still isn't right.
minor-spi_butterfly-cleanup.patch
dev-spidevbc-interface.patch
# mpc52xx-psc-spi-master-driver.patch: needs s-o-b
mpc52xx-psc-spi-master-driver.patch
Will merge.
mips-convert-to-use-shared-apm-emulation-fix.patch
Send to Ralf. Or drop. Not sure what it's doing here.
make-static-counters-in-new_inode-and-iunique-be-32-bits.patch
change-libfs-sb-creation-routines-to-avoid-collisions-with-their-root-inodes.patch
Will merge.
schedule_on_each_cpu-use-preempt_disable.patch
reimplement-flush_workqueue.patch
implement-flush_work.patch
flush_workqueue-use-preempt_disable-to-hold-off-cpu-hotplug.patch
flush_cpu_workqueue-dont-flush-an-empty-worklist.patch
aio-use-flush_work.patch
kblockd-use-flush_work.patch
relayfs-use-flush_keventd_work.patch
tg3-use-flush_keventd_work.patch
e1000-use-flush_keventd_work.patch
libata-use-flush_work.patch
phy-use-flush_work.patch
Will mostly-merge. Some can go via subsystem maintainers if/when the base
patches are in.
extend-notifier_call_chain-to-count-nr_calls-made.patch
define-and-use-new-eventscpu_lock_acquire-and-cpu_lock_release.patch
eliminate-lock_cpu_hotplug-in-kernel-schedc.patch
call-cpu_chain-with-cpu_down_failed-if-cpu_down_prepare-failed.patch
call-cpu_chain-with-cpu_down_failed-if-cpu_down_prepare-failed-vs-reduce-size-of-task_struct-on-64-bit-machines.patch
slab-use-cpu_lock_.patch
workqueue-fix-freezeable-workqueues-implementation.patch
workqueue-fix-flush_workqueue-vs-cpu_dead-race.patch
workqueue-dont-clear-cwq-thread-until-it-exits.patch
workqueue-dont-migrate-pending-works-from-the-dead-cpu.patch
workqueue-kill-run_scheduled_work.patch
workqueue-dont-save-interrupts-in-run_workqueue.patch
workqueue-make-cancel_rearming_delayed_workqueue-work-on-idle-dwork.patch
workqueue-introduce-cpu_singlethread_map.patch
workqueue-introduce-workqueue_struct-singlethread.patch
workqueue-make-init_workqueues-__init.patch
make-queue_delayed_work-friendly-to-flush_fork.patch
unify-queue_delayed_work-and-queue_delayed_work_on.patch
workqueue-introduce-wq_per_cpu-helper.patch
make-cancel_rearming_delayed_work-work-on-any-workqueue-not-just-keventd_wq.patch
ipvs-flush-defense_work-before-module-unload.patch
workqueue-kill-noautorel-works.patch
worker_thread-dont-play-with-signals.patch
worker_thread-fix-racy-try_to_freeze-usage.patch
zap_other_threads-remove-unneeded-exit_signal-change.patch
# slab-shutdown-cache_reaper-when-cpu-goes-down.patch
unify-flush_work-flush_work_keventd-and-rename-it-to-cancel_work_sync.patch
____call_usermodehelper-dont-flush_signals.patch
A lot of this is Oleg's workqueue rework which I deferred from 2.6.21. Will
merge.
freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch
freezer-close-theoretical-race-between-refrigerator-and-thaw_tasks.patch
freezer-remove-pf_nofreeze-from-rcutorture-thread.patch
freezer-remove-pf_nofreeze-from-bluetooth-threads.patch
freezer-add-try_to_freeze-calls-to-all-kernel-threads.patch
freezer-fix-vfork-problem.patch
freezer-take-kernel_execve-into-consideration.patch
kthread-dont-depend-on-work-queues-take-2.patch
change-reparent_to_init-to-reparent_to_kthreadd.patch
Freezer work - trying to get the freezer ready for use in CPU hotplug.
Will merge.
nlmclnt_recovery-dont-use-clone_sighand.patch
usbatm_heavy_init-dont-use-clone_sighand.patch
wait_for_helper-remove-unneeded-do_sigaction.patch
worker_thread-dont-play-with-sigchld-and-numa-policy.patch
change-kernel-threads-to-ignore-signals-instead-of-blocking-them.patch
fix-kthread_create-vs-freezer-theoretical-race.patch
fix-pf_nofreeze-and-freezeable-race-2.patch
freezer-document-task_lock-in-thaw_process.patch
move-frozen_process-to-kernel-power-processc.patch
remvoe-kthread_bind-call-from-_cpu_down.patch
Various core thread-management things. Will merge.
move-page-writeback-acounting-out-of-macros.patch
Will merge. Or might drop, dunno. I think it makes sense.
ext2-reservations.patch
This still awaits more testing.
make-drivers-isdn-capi-capiutilccdebbuf_alloc-static.patch
drivers-isdn-hardware-eicon-remove-unused-header-files.patch
fix-spinlock-usage-in-hysdn_log_close.patch
ISDN: will merge.
remove-obsolete-label-from-isdn4linux-v3.patch
This caused a lkml foodfight. Will drop.
remove-nfs4_acl_add_ace.patch
the-nfsv2-nfsv3-server-does-not-handle-zero-length-write.patch
knfsd-rename-sk_defer_lock-to-sk_lock.patch
nfsd-nfs4state-remove-unnecessary-daemonize-call.patch
rpc-add-wrapper-for-svc_reserve-to-account-for-checksum.patch
nfsd things - will merge after checking with Neil.
sched-fix-idle-load-balancing-in-softirqd-context.patch
sched-dynticks-idle-load-balancing-v3.patch
speedup-divides-by-cpu_power-in-scheduler.patch
sched-optimize-siblings-status-check-logic-in-wake_idle.patch
sched-redundant-reschedule-when-set_user_nice-boosts-a-prio-of-a-task-from-the-expired-array.patch
sched-align-rq-to-cacheline-boundary.patch
CPU scheduler: will merge.
rcutorture-use-array_size-macro-when-appropriate.patch
rcutorture-style-cleanup-avoid-=-null-in-boolean-tests.patch
rcutorture-remove-redundant-assignment-to-cur_ops-in.patch
Will merge.
utimensat-implementation.patch
Will merge.
rtc-remove-sys-class-rtc-dev.patch
rtc-rtc-interfaces-dont-use-class_device.patch
rtc-simplified-rtc-sysfs-attribute-handling.patch
rtc-simplified-proc-driver-rtc-handling.patch
rtc-remove-rest-of-class_device.patch
rtc-suspend-resume-restores-system-clock.patch
rtc-simplified-rtc-sysfs-attribute-handling-tidy.patch
rtc-update-to-class-device-removal-patches.patch
rtc-kconfig-cleanup.patch
rtc-update-vr41xx-alarm-handling.patch
rtc-cmos-wakeup-interface.patch
acpi-wakeup-hooks-for-rtc-cmos.patch
workaround-rtc-related-acpi-table-bugs.patch
revert-rtc-add-rtc_merge_alarm.patch
remove-rtc_alm_set-mode-bugs.patch
rtc-cmos-make-it-load-on-pnpbios-systems.patch
Will merge.
declare-struct-ktime.patch
futex-priority-based-wakeup.patch
make-futex_wait-use-an-hrtimer-for-timeout.patch
futex_requeue_pi-optimization.patch
Will merge.
kprobes-use-hlist_for_each_entry.patch
kprobes-codingstyle-cleanups.patch
kprobes-kretprobes-simplifcations.patch
kprobes-the-on-off-knob-thru-debugfs-updated.patch
Will merge.
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
atomich-complete-atomic_long-operations-in-asm-generic.patch
atomich-i386-type-safety-fix.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
local_t-architecture-independant-extension.patch
local_t-alpha-extension.patch
local_t-i386-extension.patch
local_t-ia64-extension.patch
local_t-mips-extension.patch
local_t-parisc-cleanup.patch
local_t-powerpc-extension.patch
local_t-sparc64-cleanup.patch
local_t-x86_64-extension.patch
linux-kernel-markers-kconfig-menus.patch
linux-kernel-markers-architecture-independant-code.patch
linux-kernel-markers-powerpc-optimization.patch
linux-kernel-markers-i386-optimization.patch
markers-add-instrumentation-markers-menus-to-avr32.patch
linux-kernel-markers-non-optimized-architectures.patch
markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
linux-kernel-markers-documentation.patch
markers-define-the-linker-macro-extra_rwdata.patch
markers-use-extra_rwdata-in-architectures.patch
Static markers. Will merge.
some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
no-longer-include-asm-kdebugh.patch
Will merge.
nfs-fix-congestion-control-use-atomic_longs.patch
Will merge.
udf-use-sector_t-and-loff_t-for-file-offsets.patch
udf-introduce-struct-extent_position.patch
udf-use-get_bh.patch
udf-add-assertions.patch
udf-support-files-larger-than-1g.patch
udf-fix-link-counts.patch
udf-possible-null-pointer-dereference-while-load_partition.patch
Will merge.
attach_pid-with-struct-pid-parameter.patch
statically-initialize-struct-pid-for-swapper.patch
explicitly-set-pgid-and-sid-of-init-process.patch
use-struct-pid-parameter-in-copy_process.patch
use-task_pgrp-task_session-in-copy_process.patch
kill-unused-sesssion-and-group-values-in-rocket-driver.patch
fix-some-coding-style-errors-in-autofs.patch
replace-pid_t-in-autofs-with-struct-pid-reference.patch
dont-init-pgrp-and-__session-in-init_signals.patch
Will merge.
signal-timer-event-fds-v9-anonymous-inode-source.patch
signal-timer-event-fds-v9-signalfd-core.patch
signal-timer-event-signalfd-wire-up-x86-arches.patch
signal-timer-event-fds-v9-signalfd-compat-code.patch
signal-timer-event-fds-v9-timerfd-core.patch
signal-timer-event-timerfd-wire-up-x86-arches.patch
signal-timer-event-fds-v9-timerfd-compat-code.patch
signal-timer-event-fds-v9-eventfd-core.patch
signal-timer-event-eventfd-wire-up-x86-arches.patch
signal-timer-event-fds-v9-kaio-eventfd-support-example.patch
epoll-use-anonymous-inodes.patch
Will merge.
epoll-cleanups-epoll-no-module.patch
epoll-cleanups-epoll-remove-static-pre-declarations-and-akpm-ize-the-code.patch
Will merge.
revoke-special-mmap-handling.patch
revoke-special-mmap-handling-vs-fault-vs-invalidate.patch
revoke-core-code.patch
revoke-core-code-misc-fixes.patch
revoke-core-code-fix-shared-mapping-revoke.patch
revoke-core-code-move-magic.patch
revoke-core-code-fs-revokec-cleanups-and-bugfix-for-64bit-systems.patch
revoke-core-code-revoke-no-revoke-for-nommu.patch
revoke-core-code-fix-shared-mapping-revoke-revoke-only-revoke-mappings-for-the-given-inode.patch
revoke-core-code-break-cow-for-private-mappings.patch
revoke-core-code-generic_file_revoke-stub-for-nommu.patch
revoke-core-code-break-cow-fixes.patch
revoke-core-code-mapping-revocation.patch
revoke-core-code-only-fput-unused-files.patch
revoke-core-code-slab-allocators-remove-slab_debug_initial-flag-revoke.patch
revoke-support-for-ext2-and-ext3.patch
revoke-add-documentation.patch
revoke-wire-up-i386-system-calls.patch
Hold. This is tricky stuff and I don't think we've seen sufficient reviewing,
testing and acking yet?
add-irqf_irqpoll-flag-common-code.patch
add-irqf_irqpoll-flag-on-x86_64.patch
add-irqf_irqpoll-flag-on-i386.patch
add-irqf_irqpoll-flag-on-ia64.patch
add-irqf_irqpoll-flag-on-sh.patch
add-irqf_irqpoll-flag-on-parisc.patch
add-irqf_irqpoll-flag-on-arm.patch
Merge.
char-cyclades-remove-pause.patch
char-cyclades-cy_readx-writex-cleanup.patch
char-cyclades-timer-cleanup.patch
char-cyclades-remove-volatiles.patch
char-cyclades-remove-useless-casts.patch
Merge.
pnp-notice-whether-we-have-pnp-devices-pnpbios-or-pnpacpi.patch
pnp-workaround-hp-bios-defect-that-leaves-smcf010-device-partly-enabled.patch
smsc-ircc2-tidy-up-module-parameter-checking.patch
smsc-ircc2-add-pnp-support.patch
x86-serial-convert-legacy-com-ports-to-platform-devices.patch
Misc stuff. Will merge.
lguest-the-guest-code.patch
lguest-vs-x86_64-mm-use-per-cpu-variables-for-gdt-pda.patch
lguest-the-guest-code-update-lguests-patch-code-for-new-paravirt-patch.patch
lguest-the-host-code.patch
lguest-the-host-code-vs-x86_64-mm-i386-separate-hardware-defined-tss-from-linux-additions.patch
lguest-the-host-code-fix-lguest-oops-when-guest-dies-while-receiving-i-o.patch
lguest-the-host-code-simplification-dont-pin-guest-trap-handlers.patch
lguest-the-asm-offsets.patch
lguest-the-makefile-and-kconfig.patch
lguest-the-console-driver.patch
lguest-the-net-driver.patch
lguest-the-block-driver.patch
lguest-the-documentation-example-launcher.patch
lguest-the-documentation-example-launcher-fix-lguest-documentation-error.patch
Will merge the rustyvisor.
fs-convert-core-functions-to-zero_user_page.patch
fs-convert-core-functions-to-zero_user_page-pass-kmap-type.patch
fs-convert-core-functions-to-zero_user_page-fix-2.patch
affs-use-zero_user_page.patch
ecryptfs-use-zero_user_page.patch
ext3-use-zero_user_page.patch
ext4-use-zero_user_page.patch
gfs2-use-zero_user_page.patch
nfs-use-zero_user_page.patch
ntfs-use-zero_user_page.patch
ntfs-use-zero_user_page-fix.patch
ocfs2-use-zero_user_page.patch
reiserfs-use-zero_user_page.patch
xfs-use-zero_user_page.patch
fs-deprecate-memclear_highpage_flush.patch
Merge.
char-cyclades-create-cy_init_ze.patch
char-cyclades-use-pci_iomap-unmap.patch
char-cyclades-init-ze-immediately.patch
char-cyclades-create-cy_pci_probe.patch
char-cyclades-move-card-entries-init-into-function.patch
char-cyclades-init-card-struct-immediately.patch
char-cyclades-remove-some-global-vars.patch
char-cyclades-cy_init-error-handling.patch
char-cyclades-tty_register_device-separately-for-each-device.patch
char-cyclades-clear-interrupts-before-releasing.patch
char-cyclades-allow-debug_shirq.patch
Merge
add-suspend-related-notifications-for-cpu-hotplug.patch
microcode-use-suspend-related-cpu-hotplug-notifications.patch
Merge.
vmstat-use-our-own-timer-events.patch
Merge.
readahead-kconfig-options.patch
radixtree-introduce-scan-hole-data-functions.patch
mm-introduce-probe_page.patch
mm-introduce-pg_readahead.patch
readahead-add-look-ahead-support-to-__do_page_cache_readahead.patch
readahead-insert-cond_resched-calls.patch
readahead-minmax_ra_pages.patch
readahead-events-accounting.patch
readahead-rescue_pages.patch
readahead-sysctl-parameters.patch
readahead-min-max-sizes.patch
readahead-state-based-method-aging-accounting.patch
readahead-state-based-method-routines.patch
readahead-state-based-method.patch
readahead-state-based-method-check-node-id.patch
readahead-state-based-method-decouple-readahead_ratio-from-growth_limit.patch
readahead-state-based-method-cancel-lookahead-gracefully.patch
readahead-context-based-method.patch
readahead-initial-method-guiding-sizes.patch
readahead-initial-method-thrashing-guard-size.patch
readahead-initial-method-user-recommended-size.patch
readahead-initial-method.patch
readahead-backward-prefetching-method.patch
readahead-thrashing-recovery-method.patch
readahead-thrashing-recovery-method-check-unbalanced-aging.patch
readahead-thrashing-recovery-method-refill-holes.patch
readahead-call-scheme.patch
readahead-call-scheme-cleanup.patch
readahead-call-scheme-catch-thrashing-on-lookahead-time.patch
readahead-laptop-mode.patch
readahead-loop-case.patch
readahead-nfsd-case.patch
readahead-remove-parameter-ra_max-from-thrashing_recovery_readahead.patch
readahead-remove-parameter-ra_max-from-adjust_rala.patch
readahead-state-based-method-protect-against-tiny-size.patch
readahead-rename-state_based_readahead-to-clock_based_readahead.patch
readahead-account-i-o-block-times-for-stock-readahead.patch
readahead-rescue_pages-updates.patch
readahead-remove-noaction-shrink-events.patch
readahead-remove-size-limit-on-read_ahead_kb.patch
readahead-remove-size-limit-of-max_sectors_kb-on-read_ahead_kb.patch
readahead-partial-sendfile-fix.patch
readahead-turn-on-by-default.patch
Hopefully Wu will be coming up with a much simpler best-of-readahead patch
soon. I don't think we can get these patches over the hump and they are
somewhat costly to maintain.
[93 random fbdev patches]
Will merge.
drivers-mdc-use-array_size-macro-when-appropriate.patch
md-cleanup-use-seq_release_private-where-appropriate.patch
md-remove-broken-sigkill-support.patch
Will merge after checking with Neil
md-dm-reduce-stack-usage-with-stacked-block-devices.patch
Will we ever fix this?
statistics-infrastructure-prerequisite-list.patch
statistics-infrastructure-prerequisite-parser.patch
statistics-infrastructure-prerequisite-parser-fix.patch
add-for_each_substring-and-match_substring.patch
statistics-infrastructure-prerequisite-timestamp.patch
statistics-infrastructure-make-printk_clock-a-generic-kernel-wide-nsec-resolution.patch
statistics-infrastructure-documentation.patch
statistics-infrastructure.patch
statistics-infrastructure-add-for_each_substring-and-match_substring-exploitation.patch
statistics-infrastructure-fix-parsing-of-statistics-type-attribute.patch
statistics-infrastructure-simplify-statistics-debugfs-write-function.patch
statistics-infrastructure-simplify-statistics-debugfs-read-functions.patch
statistics-infrastructure-fix-string-termination.patch
statistics-infrastructure-small-cleanup-in-debugfs-write-function.patch
statistics-infrastructure-fix-cpu-hot-unplug-related-memory-leak.patch
statistics-infrastructure-timer_stats-slimmed-down-statistics-prereq-labels.patch
statistics-infrastructure-timer_stats-slimmed-down-statistics-prereq-keys.patch
statistics-infrastructure-statistics-fix-sorted-list.patch
add-suspend-related-notifications-for-cpu-hotplug-statistics.patch
statistics-infrastructure-exploitation-zfcp.patch
timer_stats-slimmed-down-using-statistics-infrastucture.patch
We have a second user of the statistics infrastructure! If we have a third,
perhaps we can merge it. It's an unobvious call.
mprotect-patch-for-use-by-slim.patch
integrity-service-api-and-dummy-provider.patch
integrity-service-api-and-dummy-provider-integrity_dummy_verify_metadata.patch
slim-main-patch.patch
slim-main-lsm-getprocattr-hook-api-change.patch
slim-secfs-patch.patch
slim-make-and-config-stuff.patch
slim-debug-output.patch
slim-integrity-patch.patch
slim-documentation.patch
integrity-new-hooks.patch
integrity-new-hooks-fix.patch
integrity-fs-hook-placement.patch
integrity-evm-as-an-integrity-service-provider.patch
integrity-evm-as-an-integrity-service-provider-tidy.patch
integrity-evm-as-an-integrity-service-provider-tidy-fix.patch
integrity-evm-as-an-integrity-service-provider-tidy-fix-2.patch
integrity-ima-integrity_measure-support.patch
integrity-ima-integrity_measure-support-tidy.patch
integrity-ima-integrity_measure-support-fix.patch
integrity-ima-integrity_measure-support-fix-2.patch
integrity-ima-integrity_measure-support-ima-exit.patch
integrity-ima-integrity_measure-support-remove-spinlock.patch
integrity-ima-identifiers.patch
integrity-ima-cleanup.patch
integrity-tpm-internal-kernel-interface.patch
integrity-tpm-internal-kernel-interface-tidy.patch
ibac-patch.patch
Hold. This seems a long way from being mergeable.
use-menuconfig-objects-acpi.patch
use-menuconfig-objects-libata.patch
use-menuconfig-objects-block-layer.patch
use-menuconfig-objects-connector.patch
use-menuconfig-objects-crypto.patch
use-menuconfig-objects-crypto-hw.patch
use-menuconfig-objects-dccp.patch
use-menuconfig-objects-i2o.patch
use-menuconfig-objects-ide.patch
use-menuconfig-objects-ipvs.patch
use-menuconfig-objects-sctp.patch
use-menuconfig-objects-tipc.patch
use-menuconfig-objects-arcnet.patch
use-menuconfig-objects-phy.patch
use-menuconfig-objects-toeknring.patch
use-menuconfig-objects-netdev.patch
use-menuconfig-objects-oldcd.patch
use-menuconfig-objects-parport.patch
use-menuconfig-objects-pcmcia.patch
use-menuconfig-objects-pnp.patch
use-menuconfig-objects-w1.patch
Will merge sometime. Some need to go via subsystem maintainers.
w1-build-fix.patch
A gcc-4.3 maybe-fix. Still awaiting testing results.

* to something appropriate (was Re: 2.6.22 -mm merge plans)
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
@ 2007-04-30 23:48 ` Jeff Garzik
2007-05-01 0:07 ` Dave Jones
` (2 more replies)
2007-04-30 23:59 ` 2.6.22 -mm merge plans Bill Irwin
` (21 subsequent siblings)
22 siblings, 3 replies; 233+ messages in thread
From: Jeff Garzik @ 2007-04-30 23:48 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:
> ahci-crash-fix.patch
> libata-acpi-add-infrastructure-for-drivers-to-use.patch
> pata_acpi-restore-driver.patch
> optional-led-trigger-for-libata.patch
> ata_timing-ensure-t-cycle-is-always-correct.patch
> pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
> drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
> pata_icside-driver.patch
>
> ata stuff

Tejun helpfully posted a bunch of clashing patches for all the ACPI
stuff :) You might be better off dropping and getting a resend after
the dust settles.

That LED trigger patch seems technically correct, but also filling a
need that few have. IMO it craps up the hot path for little gain.

The other stuff should be in my mbox to be reviewed and applied

> 8139too-force-media-setting-fix.patch
> sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
> forcedeth-improve-napi-logic.patch
> ne-add-platform_driver.patch
> ne-add-platform_driver-fix.patch
> ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
> mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
> ibmtr_cs-fix-hang-on-eject.patch
>
> For netdev tree

8139too: needs review w/ attention paid to historical usage, and I
haven't had time for months. Not sure it's right.

sundance: I really think this is phy-dependent, and should not be
universally applied. In the standard MII PHY, phy 0 is a ghost of
another id.

forcedeth: TX NAPI wants more than that minimum effort

other stuff: in mbox

> ppp_generic-fix-lockdep-warning.patch
>
> Jeff, I guess. It's not clear that this is correct.

Usually PPP is paulus -> jgarzik -> linus, but you can bounce it
straight to me if Paulus doesn't respond

> pcmcia-pccard-deadlock-fix.patch
> pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> at91_cf-minor-fix.patch
> add-new_id-to-pcmcia-drivers.patch
> ide-cs-recognize-2gb-compactflash-from-transcend.patch
>
> Dominik is busy. Will probably re-review and send these direct to Linus.

I really wish "add ID" patches would not get buried in this tree.

Certainly they are trivial enough to go straight to Linus, but it would
be nice to go through subsystem maintainers, some of whom have also
picked up these new-id patches.

We don't send new-id patches for PCI drivers to GregKH, and we should
similarly /not/ direct PCMCIA id patches to the PCMCIA bus tree. It's
far more scalable to send new-id patches to the maintainers dealing with
the subsystem under which each driver falls (net, scsi, IDE, ...)

> pci-device-ensure-sysdata-initialised-v2.patch
>
> This is for Jeff's git-pciseg.patch which is sort-of on hold at present.

HP was kind enough to dig up another machine for me. Other machines are
appearing with x86 PCI domains (aka PCI segments) these days, so I need
to get this into upstream.

Jeff

* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
@ 2007-05-01 0:07 ` Dave Jones
2007-05-01 0:09 ` Andrew Morton
2007-05-01 9:49 ` Alan Cox
2 siblings, 0 replies; 233+ messages in thread
From: Dave Jones @ 2007-05-01 0:07 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel

On Mon, Apr 30, 2007 at 07:48:50PM -0400, Jeff Garzik wrote:

> > add-new_id-to-pcmcia-drivers.patch
> > Dominik is busy. Will probably re-review and send these direct to Linus.
> I really wish "add ID" patches would not get buried in this tree.

I don't think this is what you think it is (hint: look at the patch).
This is very much a pcmcia patch.

Dave

--
http://www.codemonkey.org.uk

* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
2007-05-01 0:07 ` Dave Jones
@ 2007-05-01 0:09 ` Andrew Morton
2007-05-01 0:24 ` Jeff Garzik
2007-05-01 9:49 ` Alan Cox
2 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 0:09 UTC (permalink / raw)
To: Jeff Garzik; +Cc: linux-kernel

> Subject: to something appropriate (was Re: 2.6.22 -mm merge plans)

smartypants.

On Mon, 30 Apr 2007 19:48:50 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> Andrew Morton wrote:
> > ahci-crash-fix.patch
> > libata-acpi-add-infrastructure-for-drivers-to-use.patch
> > pata_acpi-restore-driver.patch
> > optional-led-trigger-for-libata.patch
> > ata_timing-ensure-t-cycle-is-always-correct.patch
> > pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
> > drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
> > pata_icside-driver.patch
> >
> > ata stuff
>
> Tejun helpfully posted a bunch of clashing patches for all the ACPI
> stuff :) You might be better off dropping and getting a resend after
> the dust settles.
>
> That LED trigger patch seems technically correct, but also filling a
> need that few have. IMO it craps up the hot path for little gain.

OK, well I'll see what's recoverable after libata-all gets updated, will
blindly spam you with it as usual ;)

>
> > 8139too-force-media-setting-fix.patch
> > sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
> > forcedeth-improve-napi-logic.patch
> > ne-add-platform_driver.patch
> > ne-add-platform_driver-fix.patch
> > ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
> > mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
> > ibmtr_cs-fix-hang-on-eject.patch
> >
> > For netdev tree
>
> 8139too: needs review w/ attention paid to historical usage, and I
> haven't had time for months. Not sure it's right.
>
> sundance: I really think this is phy-dependent, and should not be
> universally applied. In the standard MII PHY, phy 0 is a ghost of
> another id.
>
> forcedeth: TX NAPI wants more than that minimum effort

Ditto.

> > ppp_generic-fix-lockdep-warning.patch
> >
> > Jeff, I guess. It's not clear that this is correct.
>
> Usually PPP is paulus -> jgarzik -> linus, but you can bounce it
> straight to me if Paulus doesn't respond
>

OK, I'll move it to the netdev queue and will keep sending until something
happens.

>
> > pcmcia-pccard-deadlock-fix.patch
> > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> > at91_cf-minor-fix.patch
> > add-new_id-to-pcmcia-drivers.patch
> > ide-cs-recognize-2gb-compactflash-from-transcend.patch
> >
> > Dominik is busy. Will probably re-review and send these direct to Linus.
>
> I really wish "add ID" patches would not get buried in this tree.
>
> Certainly they are trivial enough to go straight to Linus, but it would
> be nice to go through subsystem maintainers, some of whom have also
> picked up these new-id patches.
>
> We don't send new-id patches for PCI drivers to GregKH, and we should
> similarly /not/ direct PCMCIA id patches to the PCMCIA bus tree. It's
> far more scalable to send new-id patches to the maintainers dealing with
> the subsystem under which each driver falls (net, scsi, IDE, ...)
>

Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
hardware. I'll get all these into 2.6.22 by whatever means and will adopt
your advice in future.

Probably these should go into -stable too, but I don't know what
Greg&Chris's position is on new device IDs.

* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
2007-05-01 0:09 ` Andrew Morton
@ 2007-05-01 0:24 ` Jeff Garzik
2007-05-01 0:40 ` [stable] " Chris Wright
0 siblings, 1 reply; 233+ messages in thread
From: Jeff Garzik @ 2007-05-01 0:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, stable

Andrew Morton wrote:
> Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
> hardware. I'll get all these into 2.6.22 by whatever means and will adopt
> your advice in future.
>
> Probably these should go into -stable too, but I don't know what
> Greg&Chris's position is on new device IDs.

I don't know either. But a one-line ID patch is pretty painless
considering the gain, so I would vote for stable@kernel.org taking such
patches.

If it's more than one line added per ID though, NAK for -stable, IMO...

Jeff

* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
2007-05-01 0:24 ` Jeff Garzik
@ 2007-05-01 0:40 ` Chris Wright
2007-05-01 0:45 ` Jeff Garzik
0 siblings, 1 reply; 233+ messages in thread
From: Chris Wright @ 2007-05-01 0:40 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel, stable

* Jeff Garzik (jeff@garzik.org) wrote:
> Andrew Morton wrote:
> > Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
> > hardware. I'll get all these into 2.6.22 by whatever means and will adopt
> > your advice in future.
> >
> > Probably these should go into -stable too, but I don't know what
> > Greg&Chris's position is on new device IDs.
>
> I don't know either. But a one-line ID patch is pretty painless
> considering the gain, so I would vote for stable@kernel.org taking such
> patches.
>
> If it's more than one line added per ID though, NAK for -stable, IMO...

Well, there's 2 issues here. 1) the patch in question is not -stable
material (the patch name is a bit misleading). 2) you can add them
runtime in userspace (and for pcmcia too after patch in question is
applied), so we've historically avoided that kind of patch for -stable.

thanks,
-chris
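
For readers unfamiliar with the runtime mechanism Chris mentions: PCI drivers expose a new_id file in sysfs, and writing a hex vendor/device pair to it asks the driver to bind matching devices without a rebuild. The following is only a rough sketch of that idea, not anything posted in this thread. The helper names, the e1000 driver and the 8086:10f5 ID pair are illustrative assumptions, and only the two-field "vendor device" form is shown.

```python
from pathlib import Path


def format_new_id(vendor: int, device: int) -> str:
    """Build the 'VVVV DDDD' hex string written to a PCI driver's new_id file."""
    for name, value in (("vendor", vendor), ("device", device)):
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"{name} ID out of 16-bit range: {value:#x}")
    return f"{vendor:04x} {device:04x}"


def add_pci_id(driver: str, vendor: int, device: int,
               sysfs_root: Path = Path("/sys")) -> Path:
    """Write a vendor/device pair to <sysfs_root>/bus/pci/drivers/<driver>/new_id.

    On a real system this needs root, and the ID does not survive a reboot
    or a module reload. sysfs_root is a hypothetical parameter added here
    so the helper can be exercised against a scratch directory.
    """
    new_id = sysfs_root / "bus" / "pci" / "drivers" / driver / "new_id"
    new_id.write_text(format_new_id(vendor, device) + "\n")
    return new_id


if __name__ == "__main__":
    # Illustrative values only: print what would be written, touch nothing.
    print(format_new_id(0x8086, 0x10F5))
```

Because the ID is not persistent, something has to repeat the write on every boot.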
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 0:40 ` [stable] " Chris Wright @ 2007-05-01 0:45 ` Jeff Garzik 2007-05-01 4:58 ` Greg KH 0 siblings, 1 reply; 233+ messages in thread From: Jeff Garzik @ 2007-05-01 0:45 UTC (permalink / raw) To: Chris Wright; +Cc: Andrew Morton, linux-kernel, stable Chris Wright wrote: > 2) you can add them > runtime in userspace (and for pcmcia too after patch in question is > applied), so we've historically avoided that kind of patch for -stable. Due to distro installer environments, and very poor support for making dynamic PCI IDs persistent once added, what you describe is more of a goal than reality. Jeff ^ permalink raw reply [flat|nested] 233+ messages in thread
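For PCI drivers, the runtime route Chris refers to is the per-driver sysfs `new_id` file. What follows is a minimal sketch in C and is not taken from the thread: the helper names (`pci_new_id_path`, `pci_new_id_line`, `pci_add_dynamic_id`) are hypothetical, and any driver name or ID pair passed in is a placeholder. It needs root and a loaded driver, and, as Jeff points out, nothing here makes the ID persist across a reboot or exist inside an installer environment.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of a PCI driver's new_id file.
 * Returns 0 on success, -1 if the buffer is too small. */
int pci_new_id_path(char *buf, size_t len, const char *driver)
{
    int n = snprintf(buf, len, "/sys/bus/pci/drivers/%s/new_id", driver);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Format the "vendor device" line written to new_id (hex, no 0x prefix).
 * Returns 0 on success, -1 if the buffer is too small. */
int pci_new_id_line(char *buf, size_t len,
                    unsigned int vendor, unsigned int device)
{
    int n = snprintf(buf, len, "%04x %04x\n", vendor, device);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Ask an already-loaded driver to claim a vendor/device pair at runtime.
 * Returns 0 on success, -1 on any failure (no such driver, no permission,
 * write rejected). The ID is forgotten when the driver is unloaded. */
int pci_add_dynamic_id(const char *driver,
                       unsigned int vendor, unsigned int device)
{
    char path[256];
    char line[32];
    FILE *f;
    int rc;

    if (pci_new_id_path(path, sizeof(path), driver) != 0 ||
        pci_new_id_line(line, sizeof(line), vendor, device) != 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    rc = (fputs(line, f) == EOF) ? -1 : 0;
    if (fclose(f) == EOF)   /* the write may only be reported at close */
        rc = -1;
    return rc;
}
```

A call such as `pci_add_dynamic_id("e1000", 0x8086, 0x10f5)` (an illustrative pair only) would do roughly what echoing `8086 10f5` into that driver's `new_id` file does. Whether the driver actually works with the device is a separate question, which is part of why the one-line patch route was being argued for.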
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 0:45 ` Jeff Garzik @ 2007-05-01 4:58 ` Greg KH 2007-05-01 16:14 ` Chuck Ebbert 0 siblings, 1 reply; 233+ messages in thread From: Greg KH @ 2007-05-01 4:58 UTC (permalink / raw) To: Jeff Garzik; +Cc: Chris Wright, Andrew Morton, linux-kernel, stable On Mon, Apr 30, 2007 at 08:45:25PM -0400, Jeff Garzik wrote: > Chris Wright wrote: > > 2) you can add them > > runtime in userspace (and for pcmcia too after patch in question is > > applied), so we've historically avoided that kind of patch for -stable. > > > Due to distro installer environments, and very poor support for making > dynamic PCI IDs persistent once added, what you describe is more of a > goal than reality. But distros can easily add the device id to their kernel if needed, it isn't something that the -stable tree should be accepting. Otherwise, we will be swamped with those types of patches... thanks, greg k-h ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 4:58 ` Greg KH @ 2007-05-01 16:14 ` Chuck Ebbert 2007-05-01 16:40 ` Alan Cox 0 siblings, 1 reply; 233+ messages in thread From: Chuck Ebbert @ 2007-05-01 16:14 UTC (permalink / raw) To: Greg KH; +Cc: Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel, stable Greg KH wrote: > On Mon, Apr 30, 2007 at 08:45:25PM -0400, Jeff Garzik wrote: >> Chris Wright wrote: >>> 2) you can add them >>> runtime in userspace (and for pcmcia too after patch in question is >>> applied), so we've historically avoided that kind of patch for -stable. >> >> Due to distro installer environments, and very poor support for making >> dynamic PCI IDs persistent once added, what you describe is more of a >> goal than reality. > > But distros can easily add the device id to their kernel if needed, it > isn't something that the -stable tree should be accepting. Otherwise, we > will be swamped with those types of patches... > Oh sure, leave the distros swamped with them instead. :) And they all have to do it separately, meaning they don't stay in sync and they duplicate each other's work... ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 16:14 ` Chuck Ebbert @ 2007-05-01 16:40 ` Alan Cox 2007-05-01 23:34 ` Greg KH 0 siblings, 1 reply; 233+ messages in thread From: Alan Cox @ 2007-05-01 16:40 UTC (permalink / raw) To: Chuck Ebbert Cc: Greg KH, Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel, stable > > But distros can easily add the device id to their kernel if needed, it > > isn't something that the -stable tree should be accepting. Otherwise, we > > will be swamped with those types of patches... > > > > Oh sure, leave the distros swamped with them instead. :) > > > > And they all have to do it separately, meaning they don't stay in sync > > and they duplicate each other's work... Well they *don't* have to work that separately. They could set up some shared tree which would look suspiciously like what Greg is doing but with the ID updates.... ;) ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 16:40 ` Alan Cox @ 2007-05-01 23:34 ` Greg KH 2007-05-02 0:52 ` Chris Wright 0 siblings, 1 reply; 233+ messages in thread From: Greg KH @ 2007-05-01 23:34 UTC (permalink / raw) To: Alan Cox Cc: Chuck Ebbert, Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel, stable On Tue, May 01, 2007 at 05:40:33PM +0100, Alan Cox wrote: > > > But distros can easily add the device id to their kernel if needed, it > > > isn't something that the -stable tree should be accepting. Otherwise, we > > > will be swamped with those types of patches... > > > > > > > Oh sure, leave the distros swamped with them instead. :) > > > > And they all have to do it separately, meaning they don't stay in sync > > and they duplicate each other's work... > > Well they *don't* have to work that separately. They could set up some > shared tree which would look suspiciously like what Greg is doing but > with the ID updates.... ;) And is this really a problem? The whole goal of the -stable tree was to accommodate the users who relied on kernel.org kernels, and wanted bugfixes and security updates. It was not for new features or new hardware support. If people feel we should revisit this goal, then that's fine, and I have no objection to that. But until then, I think the rules that we have had in place for over the past 2 years should still remain in effect. thanks, greg k-h ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-01 23:34 ` Greg KH @ 2007-05-02 0:52 ` Chris Wright 2007-05-02 14:10 ` Chuck Ebbert 0 siblings, 1 reply; 233+ messages in thread From: Chris Wright @ 2007-05-02 0:52 UTC (permalink / raw) To: Greg KH Cc: Alan Cox, Chuck Ebbert, Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel, stable * Greg KH (greg@kroah.com) wrote: > And is this really a problem? The whole goal of the -stable tree was to > accommodate the users who relied on kernel.org kernels, and wanted > bugfixes and security updates. It was not for new features or new > hardware support. > > If people feel we should revisit this goal, then that's fine, and I have > no objection to that. But until then, I think the rules that we have > had in place for over the past 2 years should still remain in effect. I have to agree. I went back through my mbox and found vanishingly few pci_id update patches. So it's not clear there's even a big issue. thanks, -chris ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-05-02 0:52 ` Chris Wright @ 2007-05-02 14:10 ` Chuck Ebbert 0 siblings, 0 replies; 233+ messages in thread From: Chuck Ebbert @ 2007-05-02 14:10 UTC (permalink / raw) To: Chris Wright Cc: Greg KH, Alan Cox, Jeff Garzik, Andrew Morton, linux-kernel, stable Chris Wright wrote: > * Greg KH (greg@kroah.com) wrote: >> And is this really a problem? The whole goal of the -stable tree was to >> accommodate the users who relied on kernel.org kernels, and wanted >> bugfixes and security updates. It was not for new features or new >> hardware support. >> >> If people feel we should revisit this goal, then that's fine, and I have >> no objection to that. But until then, I think the rules that we have >> had in place for over the past 2 years should still remain in effect. > > I have to agree. I went back through my mbox and found vanishingly few > pci_id update patches. So it's not clear there's even a big issue. > Of course you didn't find many -- most people know that's not part of the -stable charter. If you asked for them you'd get them... ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: to something appropriate (was Re: 2.6.22 -mm merge plans) 2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik 2007-05-01 0:07 ` Dave Jones 2007-05-01 0:09 ` Andrew Morton @ 2007-05-01 9:49 ` Alan Cox 2 siblings, 0 replies; 233+ messages in thread From: Alan Cox @ 2007-05-01 9:49 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel > Tejun helpfully posted a bunch of clashing patches for all the ACPI > stuff :) You might be better off dropping and getting a resend after > the dust settles. Agree about the ACPI stuff. > That LED trigger patch seems technically correct, but also filling a > need that few have. IMO it craps up the hot path for little gain. It only touches the affected devices and if it's still an issue then it should be via an arch define as well. Can you fire something off to the submitter, Jeff, so we've got a direction? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton 2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik @ 2007-04-30 23:59 ` Bill Irwin 2007-05-01 0:09 ` nfsd/md patches " Neil Brown ` (20 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Bill Irwin @ 2007-04-30 23:59 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote: > proper-prototype-for-hugetlb_get_unmapped_area.patch ... > convert-hugetlbfs-to-use-vm_ops-fault.patch ... > get_unmapped_area-handles-map_fixed-in-hugetlbfs.patch ... > get_unmapped_area-doesnt-need-hugetlbfs-hacks-anymore.patch ... > Will merge. I've gone over these again and all are still good. The same holds for the get_unmapped_area() series in general where I've reviewed it for hugetlb relevance. -- wli ^ permalink raw reply [flat|nested] 233+ messages in thread
* nfsd/md patches Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton 2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik 2007-04-30 23:59 ` 2.6.22 -mm merge plans Bill Irwin @ 2007-05-01 0:09 ` Neil Brown 2007-05-01 9:08 ` Christoph Hellwig 2007-05-01 0:54 ` MADV_FREE functionality Rik van Riel ` (19 subsequent siblings) 22 siblings, 1 reply; 233+ messages in thread From: Neil Brown @ 2007-05-01 0:09 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm On Monday April 30, akpm@linux-foundation.org wrote: > > remove-nfs4_acl_add_ace.patch > the-nfsv2-nfsv3-server-does-not-handle-zero-length-write.patch > knfsd-rename-sk_defer_lock-to-sk_lock.patch > nfsd-nfs4state-remove-unnecessary-daemonize-call.patch > rpc-add-wrapper-for-svc_reserve-to-account-for-checksum.patch > > nfsd things - will merge after checking with Neil. > All acked, though that last one won't fix any oopses like the comment hopes for - I really should look into that. > > drivers-mdc-use-array_size-macro-when-appropriate.patch > md-cleanup-use-seq_release_private-where-appropriate.patch > md-remove-broken-sigkill-support.patch > > Will merge after checking with Neil NAK on md-remove-broken-sigkill-support.patch - I'll follow up the original mail. ACK on the other two. > > md-dm-reduce-stack-usage-with-stacked-block-devices.patch > > Will we ever fix this? > I think we have several votes for "just merge it". I don't think there are known problems with it. NeilBrown ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 0:09 ` nfsd/md patches " Neil Brown @ 2007-05-01 9:08 ` Christoph Hellwig 2007-05-01 9:15 ` Andrew Morton 2007-05-01 9:52 ` Neil Brown 0 siblings, 2 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 9:08 UTC (permalink / raw) To: Neil Brown; +Cc: Andrew Morton, linux-kernel, linux-mm apropos nfsd patches, what's the merge plans for my two export ops patch series? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 9:08 ` Christoph Hellwig @ 2007-05-01 9:15 ` Andrew Morton 2007-05-01 9:21 ` Christoph Hellwig 2007-05-01 9:52 ` Neil Brown 1 sibling, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-01 9:15 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Neil Brown, linux-kernel, linux-mm On Tue, 1 May 2007 10:08:43 +0100 Christoph Hellwig <hch@infradead.org> wrote: > apropos nfsd patches, what's the merge plans for my two export ops > patch series? box:/usr/src/25/patches> grep -l '^From:.*hch' $(cat-series ../series ) dvb_en_50221-convert-to-kthread-api.patch simplify-the-stacktrace-code.patch vfs-remove-superflous-sb-==-null-checks.patch nameic-remove-utterly-outdated-comment.patch move-die-notifier-handling-to-common-code.patch merge-compat_ioctlh-into-compat_ioctlc.patch cleanup-compat-ioctl-handling.patch kprobes-use-hlist_for_each_entry.patch kprobes-codingstyle-cleanups.patch kprobes-kretprobes-simplifcations.patch I give up. Where are they hiding? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 9:15 ` Andrew Morton @ 2007-05-01 9:21 ` Christoph Hellwig 0 siblings, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 9:21 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Hellwig, Neil Brown, linux-kernel, linux-mm On Tue, May 01, 2007 at 02:15:25AM -0700, Andrew Morton wrote: > On Tue, 1 May 2007 10:08:43 +0100 Christoph Hellwig <hch@infradead.org> wrote: > > > apropos nfsd patches, what's the merge plans for my two export ops > > patch series? This question was directed to Neil, sorry. > box:/usr/src/25/patches> grep -l '^From:.*hch' $(cat-series ../series ) > dvb_en_50221-convert-to-kthread-api.patch > simplify-the-stacktrace-code.patch > vfs-remove-superflous-sb-==-null-checks.patch > nameic-remove-utterly-outdated-comment.patch > move-die-notifier-handling-to-common-code.patch > merge-compat_ioctlh-into-compat_ioctlc.patch > cleanup-compat-ioctl-handling.patch > kprobes-use-hlist_for_each_entry.patch > kprobes-codingstyle-cleanups.patch > kprobes-kretprobes-simplifcations.patch > > I give up. Where are they hiding? Good question :) I sent them to the nfs list. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 9:08 ` Christoph Hellwig 2007-05-01 9:15 ` Andrew Morton @ 2007-05-01 9:52 ` Neil Brown 2007-05-01 10:15 ` Christoph Hellwig 1 sibling, 1 reply; 233+ messages in thread From: Neil Brown @ 2007-05-01 9:52 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm On Tuesday May 1, hch@infradead.org wrote: > apropos nfsd patches, what's the merge plans for my two export ops > patch series? Still sitting in my tree - I've had my mind on other things (nfs-utils, portmap....) and let them slip - sorry. I think also there was an unanswered question about the second series (the first I am completely happy with). > Date: Fri, 30 Mar 2007 13:34:53 +1000 > > My only question involves motivation. > > You say "less complex", but to me it just looks "different" - though > being very familiar with the original, that might be a biased view. > Can you say more about how it is less complex? > Maybe the extension to generic 64bit support will make that clear... > > But then generic 64bit support should just be an independent set of > helper functions that can be plugged in to the export_operations > structure. > I think I programmed myself to use a reply to that as my wake_up to forward them on, and forgot to register a timeout handler.... NeilBrown ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 9:52 ` Neil Brown @ 2007-05-01 10:15 ` Christoph Hellwig 2007-05-01 14:34 ` Trond Myklebust 0 siblings, 1 reply; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 10:15 UTC (permalink / raw) To: Neil Brown; +Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm On Tue, May 01, 2007 at 07:52:11PM +1000, Neil Brown wrote: > On Tuesday May 1, hch@infradead.org wrote: > > apropos nfsd patches, what's the merge plans for my two export ops > > patch series? > > Still sitting in my tree - I've had my mind on other things > (nfs-utils, portmap....) and let them slip - sorry. > > I think also there was an unanswered question about the second series > (the first I am completely happy with). Ah sorry, this mail got lost somewhere. I'll reply on the nfs list because we have a little more context there. (and due to the subscribers only policy I can't crosspost unfortunately) ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: nfsd/md patches Re: 2.6.22 -mm merge plans 2007-05-01 10:15 ` Christoph Hellwig @ 2007-05-01 14:34 ` Trond Myklebust 0 siblings, 0 replies; 233+ messages in thread From: Trond Myklebust @ 2007-05-01 14:34 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Neil Brown, Andrew Morton, linux-kernel, linux-mm On Tue, 2007-05-01 at 11:15 +0100, Christoph Hellwig wrote: > On Tue, May 01, 2007 at 07:52:11PM +1000, Neil Brown wrote: > > On Tuesday May 1, hch@infradead.org wrote: > > > apropos nfsd patches, what's the merge plans for my two export ops > > > patch series? > > > > Still sitting in my tree - I've had my mind on other things > > (nfs-utils, portmap....) and let them slip - sorry. > > > > I think also there was an unanswered question about the second series > > (the first I am completely happy with). > > Ah sorry, this mail got lost somewhere. I'll reply on the nfs list > because we have a little more context there. (and due to the subscribers > only policy I can't crosspost unfortunately) I thought we lifted the subscribers only policy quite a while back. There should be nothing preventing you from cross-posting. Cheers Trond ^ permalink raw reply [flat|nested] 233+ messages in thread
* MADV_FREE functionality
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
` (2 preceding siblings ...)
2007-05-01 0:09 ` nfsd/md patches " Neil Brown
@ 2007-05-01 0:54 ` Rik van Riel
2007-05-01 1:18 ` Andrew Morton
2007-05-01 1:23 ` Ulrich Drepper
2007-05-01 1:39 ` 2.6.22 -mm merge plans Stefan Richter
` (18 subsequent siblings)
22 siblings, 2 replies; 233+ messages in thread
From: Rik van Riel @ 2007-05-01 0:54 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:
> lazy-freeing-of-memory-through-madv_free.patch
> lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
> restore-madv_dontneed-to-its-original-linux-behaviour.patch
>
> I think the MADV_FREE changes need more work:
>
> We need crystal-clear statements regarding the present functionality, the new
> functionality and how these relate to the spec and to implementations in other
> OS'es. Once we have that info we are in a position to work out whether the
> code can be merged as-is, or if additional changes are needed.

There are two MADV variants that free pages, both do the exact
same thing with mapped file pages, but both do something slightly
different with anonymous pages.

MADV_DONTNEED will unmap file pages and free anonymous pages.
When a process accesses anonymous memory at an address that
was zapped with MADV_DONTNEED, it will return fresh zero filled
pages.

MADV_FREE will unmap file pages. MADV_FREE on anonymous pages
is interpreted as a signal that the application no longer needs
the data in the pages, and they can be thrown away if the kernel
needs the memory for something else. However, if the process
accesses the memory again before the kernel needs it, the process
will simply get the original pages back. If the kernel needed
the memory first, the process will get a fresh zero filled page
like with MADV_DONTNEED.

In short:
- both MADV_FREE and MADV_DONTNEED only unmap file pages
- after MADV_DONTNEED the application will always get back
  fresh zero filled anonymous pages when accessing the
  memory
- after MADV_FREE the application can either get back the
  original data (without a page fault) or zero filled
  anonymous memory

The Linux MADV_DONTNEED behavior is not POSIX compliant.
POSIX says that with MADV_DONTNEED the application's data
will be preserved.

Currently glibc simply ignores POSIX_MADV_DONTNEED requests
from applications on Linux. Changing the behaviour which
some Linux applications may rely on might not be the best
idea.

If you want POSIX_MADV_DONTNEED behaviour added, please let
me know and I'll whip up a patch.

> Because right now, I don't know where we are with respect to these things and
> I doubt if many of our users know either. How can Michael write a manpage for
> this if we don't tell him what it all does?

If you need any additional information, please let me know.

If you still think the MADV_FREE patches themselves should
not be merged yet, can we at least merge the #defines, so
the Fedora kernel can get the MADV_FREE functionality?

Again, I'd be more than willing to whip up a patch for that.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 233+ messages in thread
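The anonymous-memory half of this summary can be checked from userspace. Below is a minimal sketch, assuming Linux and a private anonymous mapping; `probe_anon_page` and the `RES_*` names are made up for illustration and are not part of any patch in this thread. With `MADV_DONTNEED` the page should read back as zeroes. With a lazily freeing advice such as `MADV_FREE` (where the installed headers define it) either outcome is legitimate, so callers must not rely on one.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for madvise() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

enum probe_result { RES_ERROR = -1, RES_KEPT = 0, RES_ZEROED = 1 };

/* Map one anonymous private page, fill it with 'x', apply `advice`
 * to it, then report what the first byte reads back as. */
int probe_anon_page(int advice)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    volatile unsigned char *vp;
    unsigned char *p;
    int result;

    if (pagesz <= 0)
        return RES_ERROR;

    p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return RES_ERROR;

    memset(p, 'x', (size_t)pagesz);
    vp = p;   /* volatile so the read below really goes to memory */

    if (madvise(p, (size_t)pagesz, advice) != 0)
        result = RES_ERROR;          /* advice not supported here */
    else if (vp[0] == 0)
        result = RES_ZEROED;         /* fresh zero filled page */
    else if (vp[0] == 'x')
        result = RES_KEPT;           /* original data still there */
    else
        result = RES_ERROR;

    munmap(p, (size_t)pagesz);
    return result;
}
```

Calling `probe_anon_page(MADV_DONTNEED)` is expected to return `RES_ZEROED` on Linux, which is exactly the behaviour existing applications may have come to depend on. A probe with `MADV_FREE` may return either `RES_KEPT` or `RES_ZEROED` depending on memory pressure, or `RES_ERROR` on a kernel without it.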
* Re: MADV_FREE functionality
2007-05-01 0:54 ` MADV_FREE functionality Rik van Riel
@ 2007-05-01 1:18 ` Andrew Morton
2007-05-01 1:23 ` Rik van Riel
2007-05-01 7:13 ` Jakub Jelinek
2007-05-01 1:23 ` Ulrich Drepper
1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 1:18 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm, Michael Kerrisk

On Mon, 30 Apr 2007 20:54:02 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
>
> > lazy-freeing-of-memory-through-madv_free.patch
> > lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
> > restore-madv_dontneed-to-its-original-linux-behaviour.patch
> >
> > I think the MADV_FREE changes need more work:
> >
> > We need crystal-clear statements regarding the present functionality, the new
> > functionality and how these relate to the spec and to implementations in other
> > OS'es. Once we have that info we are in a position to work out whether the
> > code can be merged as-is, or if additional changes are needed.
>
> There are two MADV variants that free pages, both do the exact
> same thing with mapped file pages, but both do something slightly
> different with anonymous pages.
>
> MADV_DONTNEED will unmap file pages and free anonymous pages.
> When a process accesses anonymous memory at an address that
> was zapped with MADV_DONTNEED, it will return fresh zero filled
> pages.
>
> MADV_FREE will unmap file pages. MADV_FREE on anonymous pages
> is interpreted as a signal that the application no longer needs
> the data in the pages, and they can be thrown away if the kernel
> needs the memory for something else. However, if the process
> accesses the memory again before the kernel needs it, the process
> will simply get the original pages back. If the kernel needed
> the memory first, the process will get a fresh zero filled page
> like with MADV_DONTNEED.
>
> In short:
> - both MADV_FREE and MADV_DONTNEED only unmap file pages
> - after MADV_DONTNEED the application will always get back
>   fresh zero filled anonymous pages when accessing the
>   memory
> - after MADV_FREE the application can either get back the
>   original data (without a page fault) or zero filled
>   anonymous memory
>
> The Linux MADV_DONTNEED behavior is not POSIX compliant.
> POSIX says that with MADV_DONTNEED the application's data
> will be preserved.
>
> Currently glibc simply ignores POSIX_MADV_DONTNEED requests
> from applications on Linux. Changing the behaviour which
> some Linux applications may rely on might not be the best
> idea.

OK, thanks. I stuck that in the changelog. Michael, do you think that's
enough to finalise a manpage?

> If you need any additional information, please let me know.

The patch doesn't update the various comments in madvise.c at all, which is
a surprise. Could you please check that they are all accurate and complete?

Also, where did we end up with the Solaris compatibility?

The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
which should be good.

Did we decide that the Solaris and Linux implementations of MADV_FREE are
compatible?

What about the Solaris and Linux MADV_DONTNEED implementations?
^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: MADV_FREE functionality
2007-05-01 1:18 ` Andrew Morton
@ 2007-05-01 1:23 ` Rik van Riel
2007-05-01 7:13 ` Jakub Jelinek
1 sibling, 0 replies; 233+ messages in thread
From: Rik van Riel @ 2007-05-01 1:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm, Michael Kerrisk

Andrew Morton wrote:

>> If you need any additional information, please let me know.
>
> The patch doesn't update the various comments in madvise.c at all, which is
> a surprise. Could you please check that they are all accurate and complete?

I'll take a look.

> Also, where did we end up with the Solaris compatibility?
>
> The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
> which should be good.
>
> Did we decide that the Solaris and Linux implementations of MADV_FREE are
> compatible?

Yes, the Linux, Solaris and FreeBSD implementations of
MADV_FREE appear to have equivalent semantics.

> What about the Solaris and Linux MADV_DONTNEED implementations?

This was never, and is still not, the same. Linux will
throw away the data in anonymous pages while POSIX says
we should simply move the data to swap. I assume Solaris
and FreeBSD will move the data to swap instead of throwing
it away.

For file backed pages I suspect they all behave the same.

This is the reason that inside glibc, POSIX_MADV_DONTNEED
is a noop.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
^ permalink raw reply [flat|nested] 233+ messages in thread
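The POSIX side of this can be sketched from userspace as well. The example below assumes a libc that, as described above for glibc, treats `POSIX_MADV_DONTNEED` as advice only and leaves the data alone; `data_survives_posix_dontneed` is a hypothetical helper name, and the sketch checks only the observable effect, not how the libc implements it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if an anonymous page keeps its contents across
 * posix_madvise(POSIX_MADV_DONTNEED), 0 if it does not,
 * and -1 if the page could not be set up. */
int data_survives_posix_dontneed(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    volatile unsigned char *vp;
    unsigned char *p;
    int rc, survived;

    if (pagesz <= 0)
        return -1;

    p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 'x', (size_t)pagesz);
    vp = p;   /* volatile so the reads below really go to memory */

    /* posix_madvise() returns an error number directly, not -1/errno. */
    rc = posix_madvise(p, (size_t)pagesz, POSIX_MADV_DONTNEED);

    survived = (rc == 0 &&
                vp[0] == 'x' &&
                vp[(size_t)pagesz - 1] == 'x');

    munmap(p, (size_t)pagesz);
    return survived;
}
```

The contrast with a raw `madvise(..., MADV_DONTNEED)` on the same kind of mapping, which discards the data, is the mismatch between the Linux flag and the POSIX one that the thread is discussing.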
* Re: MADV_FREE functionality
2007-05-01 1:18 ` Andrew Morton
2007-05-01 1:23 ` Rik van Riel
@ 2007-05-01 7:13 ` Jakub Jelinek
1 sibling, 0 replies; 233+ messages in thread
From: Jakub Jelinek @ 2007-05-01 7:13 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, linux-kernel, linux-mm, Michael Kerrisk

On Mon, Apr 30, 2007 at 06:18:39PM -0700, Andrew Morton wrote:
> > In short:
> > - both MADV_FREE and MADV_DONTNEED only unmap file pages
> > - after MADV_DONTNEED the application will always get back
> >   fresh zero filled anonymous pages when accessing the
> >   memory
> > - after MADV_FREE the application can either get back the
> >   original data (without a page fault) or zero filled
> >   anonymous memory
> >
> > The Linux MADV_DONTNEED behavior is not POSIX compliant.
> > POSIX says that with MADV_DONTNEED the application's data
> > will be preserved.
> >
> > Currently glibc simply ignores POSIX_MADV_DONTNEED requests
> > from applications on Linux. Changing the behaviour which
> > some Linux applications may rely on might not be the best
> > idea.
>
> OK, thanks. I stuck that in the changelog.

FYI, Solaris man page on MADV_FREE says:

    MADV_FREE
        Tells the kernel that contents in the specified address
        range are no longer important and the range will be
        overwritten. When there is demand for memory, the system
        will free pages associated with the specified address
        range. In this instance, the next time a page in the
        address range is referenced, it will contain all zeroes.
        Otherwise, it will contain the data that was there prior
        to the MADV_FREE call. References made to the address
        range will not make the system read from backing store
        (swap space) until the page is modified again.

        This value cannot be used on mappings that have
        underlying file objects.

The last paragraph seems to be just about the operation being
undefined, madvise MADV_FREE on MAP_SHARED file mapping returns 0
rather than flagging an error.

FreeBSD man page:

    MADV_FREE
        Gives the VM system the freedom to free pages, and tells
        the system that information in the specified page range
        is no longer important. This is an efficient way of
        allowing malloc(3) to free pages anywhere in the address
        space, while keeping the address space valid. The next
        time that the page is referenced, the page might be demand
        zeroed, or might contain the data that was there before
        the MADV_FREE call. References made to that address space
        range will not make the VM system page the information
        back in from backing store until the page is modified
        again.

> Also, where did we end up with the Solaris compatibility?
>
> The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
> which should be good.
>
> Did we decide that the Solaris and Linux implementations of MADV_FREE are
> compatible?

SPARC Solaris binary compatibility in Linux is in really bad shape,
madvise in Solaris is implemented using memcntl syscall (at least
according to truss(1)) and that syscall is
systbl.S:	.word solaris_unimplemented	/* memcntl 131 */
When/if anyone decides to put more effort into the Solaris binary
compatibility (I'm quite doubtful anyone will), codes which don't
match can be simply translated into other codes, ignored etc.,
we can't use sys_madvise to implement memcntl syscall anyway.

While Solaris MADV_FREE is the same as Linux MADV_FREE proposed by Rik
(except perhaps the documented undefined behavior with file mappings, on

    #include <sys/mman.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    int
    main (void)
    {
      getpid ();
      int fd = open ("test", O_RDWR);
      void *p = mmap (0, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      memset (p, ' ', 8192);
      madvise (p, 8192, MADV_FREE);
      return 0;
    }

on Solaris the spaces actually made it into the file), MADV_DONTNEED
is not, but that doesn't really matter except for arch/sparc*/solaris/
layer if anyone cares.

We certainly can't change current MADV_DONTNEED behavior, all we can
do is implement a new MADV_* code with a different behavior and
let glibc translate POSIX_MADV_* codes on posix_madvise to the Linux
specific MADV_*.

	Jakub
^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: MADV_FREE functionality
2007-05-01 0:54 ` MADV_FREE functionality Rik van Riel
2007-05-01 1:18 ` Andrew Morton
@ 2007-05-01 1:23 ` Ulrich Drepper
1 sibling, 0 replies; 233+ messages in thread
From: Ulrich Drepper @ 2007-05-01 1:23 UTC (permalink / raw)
To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm

On 4/30/07, Rik van Riel <riel@redhat.com> wrote:
> Andrew Morton wrote:
> > Because right now, I don't know where we are with respect to these things and
> > I doubt if many of our users know either. How can Michael write a manpage for
> > this if we don't tell him what it all does?

I think we've been very clear before and Rik's description here puts
it all nicely in one place. If you're worried about semantics you can
rest assured, it is all sound. If this is what is holding up the
patch then add it to your collection. Only if you have technical
objections should you hold it off. The patch makes sense (and has
been validated by being implemented in the same way on other OSes) and
it is really needed.
^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (3 preceding siblings ...) 2007-05-01 0:54 ` MADV_FREE functionality Rik van Riel @ 2007-05-01 1:39 ` Stefan Richter 2007-05-01 2:30 ` 2.6.22 -mm merge plans (RE: input) Dmitry Torokhov ` (17 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Stefan Richter @ 2007-05-01 1:39 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm Andrew Morton wrote: > sbp2-include-fixes.patch > ieee1394-iso-needs-schedh.patch > > For Stephan They were merged some hours ago. -- Stefan Richter -=====-=-=== -=-= ----= http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans (RE: input) 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (4 preceding siblings ...) 2007-05-01 1:39 ` 2.6.22 -mm merge plans Stefan Richter @ 2007-05-01 2:30 ` Dmitry Torokhov 2007-05-01 8:14 ` Jiri Slaby 2007-05-01 8:11 ` 2.6.22 -mm merge plans -- pfn_valid_within Andy Whitcroft ` (16 subsequent siblings) 22 siblings, 1 reply; 233+ messages in thread From: Dmitry Torokhov @ 2007-05-01 2:30 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, Éric Piel, Jiri Slaby On Monday 30 April 2007 19:20, Andrew Morton wrote: > > input-convert-from-class-devices-to-standard-devices.patch > input-evdev-implement-proper-locking.patch > mousedev-fix.patch > mousedev-fix-2.patch > > Dmitry will merge these once Greg has merged the preparatory work. Except these > patches make the Vaio-of-doom crash in obscure circumstances, and we weren't > able to fix that? > Would like to keep cooking in your tree till we get your Vaio going, if you don't mind. > wistron_btns-add-led-support.patch Will review once again and apply. > input-ff-add-ff_raw-effect.patch > input-phantom-add-a-new-driver.patch > It looks like Phantom will not be using input layer... > input-rfkill-add-support-for-input-key-to-control-wireless-radio.patch > > Will resend to davem once the preparatory bits are merged by Greg. > You mean me, right? I need to do some locking changes that DaveM pointed out. -- Dmitry ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans (RE: input)
From: Jiri Slaby @ 2007-05-01 8:14 UTC
To: Dmitry Torokhov; +Cc: Andrew Morton, linux-kernel, Jiri Slaby

Dmitry Torokhov wrote:
> On Monday 30 April 2007 19:20, Andrew Morton wrote:
>>  input-ff-add-ff_raw-effect.patch
>>  input-phantom-add-a-new-driver.patch
>>
>
> It looks like Phantom will not be using input layer...

Yes, I have a new version, planning to test it tomorrow when I reach the device
and then post it.

You don't want it in input/misc in that case, right? If yes, Andrew, please drop
both.

thanks,
--
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
* Re: 2.6.22 -mm merge plans (RE: input)
From: Dmitry Torokhov @ 2007-05-01 12:05 UTC
To: Jiri Slaby; +Cc: Andrew Morton, linux-kernel

On 5/1/07, Jiri Slaby <jirislaby@gmail.com> wrote:
> Dmitry Torokhov wrote:
> > On Monday 30 April 2007 19:20, Andrew Morton wrote:
> >>  input-ff-add-ff_raw-effect.patch
> >>  input-phantom-add-a-new-driver.patch
> >>
> >
> > It looks like Phantom will not be using input layer...
>
> Yes, I have a new version, planning to test it tomorrow when I reach the device
> and then post it.
>
> You don't want it in input/misc in that case, right?

Correct. Input/misc is only visible if CONFIG_INPUT is selected. But
if Phantom is not using input layer it should not depend on
CONFIG_INPUT. I'd put it into drivers/misc...

--
Dmitry
* Re: 2.6.22 -mm merge plans -- pfn_valid_within
From: Andy Whitcroft @ 2007-05-01 8:11 UTC
To: Andrew Morton; +Cc: linux-kernel, linux-mm, Mel Gorman

Andrew Morton wrote:

>  add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
>  anti-fragmentation-switch-over-to-pfn_valid_within.patch
>  lumpy-move-to-using-pfn_valid_within.patch
>
> More Mel things, and linkage between Mel-things and lumpy reclaim. It's here
> where the patch ordering gets into a mess and things won't improve if
> moveable-zones and lumpy-reclaim get deferred. Such a deferral would limit my
> ability to queue more MM changes for 2.6.23.

The first of these is really a cleanup and should slide into the stack
before Mobility and Lumpy. The other two should then join their
respective stacks anti-fragmentation-... to Mobility and lumpy-... to
Lumpy. I would not expect them to increase linkage that way.

-apw
* Re: 2.6.22 -mm merge plans -- pfn_valid_within
From: Andrew Morton @ 2007-05-01 8:19 UTC
To: Andy Whitcroft; +Cc: linux-kernel, linux-mm, Mel Gorman

On Tue, 01 May 2007 09:11:18 +0100 Andy Whitcroft <apw@shadowen.org> wrote:

> Andrew Morton wrote:
>
> >  add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
> >  anti-fragmentation-switch-over-to-pfn_valid_within.patch
> >  lumpy-move-to-using-pfn_valid_within.patch
> >
> > More Mel things, and linkage between Mel-things and lumpy reclaim. It's here
> > where the patch ordering gets into a mess and things won't improve if
> > moveable-zones and lumpy-reclaim get deferred. Such a deferral would limit my
> > ability to queue more MM changes for 2.6.23.
>
> The first of these is really a cleanup and should slide into the stack
> before Mobility and Lumpy. The other two should then join their
> respective stacks anti-fragmentation-... to Mobility and lumpy-... to
> Lumpy. I would not expect them to increase linkage that way.
>

yup, that improved things a bit, thanks.
* "partical" kthread conversion
From: Christoph Hellwig @ 2007-05-01 8:42 UTC
To: Andrew Morton; +Cc: linux-kernel, ebiederm

On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
>  macintosh-mediabay-convert-to-kthread-api.patch
>  macintosh-adb-convert-to-the-kthread-api.patch
>  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
>  powerpc-pseries-rtasd-convert-to-kthread-api.patch
>  powerpc-pseries-eeh-convert-to-kthread-api.patch
>
> Will send to paulus (I already did - does Paul not handle the macintosh
> driver?)

Please don't send out the partial kthread conversions, as they're not
that helpful. Depending on the way we'll let the API evolve a
kthread_create/run not paired by a kthread_stop might be actually harmful.

Please only send along patches that are paired or always built in so that
they don't require stopping at all.

Btw, many of the drivers above should probably go to benh.

There's probably a few more patches falling into this category, these
were just the first one the stick into my eye.
* Re: "partical" kthread conversion
From: Andrew Morton @ 2007-05-01 8:51 UTC
To: Christoph Hellwig; +Cc: linux-kernel, ebiederm

On Tue, 1 May 2007 09:42:45 +0100 Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
> >  macintosh-mediabay-convert-to-kthread-api.patch
> >  macintosh-adb-convert-to-the-kthread-api.patch
> >  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
> >  powerpc-pseries-rtasd-convert-to-kthread-api.patch
> >  powerpc-pseries-eeh-convert-to-kthread-api.patch
> >
> > Will send to paulus (I already did - does Paul not handle the macintosh
> > driver?)
>
> Please don't send out the partial kthread conversions, as they're not
> that helpful. Depending on the way we'll let the API evolve a
> kthread_create/run not paired by a kthread_stop might be actually harmful.
>
> Please only send along patches that are paired or always built in so that
> they don't require stopping at all.
>
> Btw, many of the drivers above should probably go to benh.
>
> There's probably a few more patches falling into this category, these
> were just the first one the stick into my eye.

Yes, I think I'll probably drop all of them - I've completely lost track of
which ones are complete, which ones need more work, etc.

I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
Tony, as people put quite a bit of review and test effort into that one.
* Re: "partical" kthread conversion
From: Dean Nelson @ 2007-05-02 14:01 UTC
To: Andrew Morton; +Cc: hch, ebiederm, linux-kernel

On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> On Tue, 1 May 2007 09:42:45 +0100 Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
> > >  macintosh-mediabay-convert-to-kthread-api.patch
> > >  macintosh-adb-convert-to-the-kthread-api.patch
> > >  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
> > >  powerpc-pseries-rtasd-convert-to-kthread-api.patch
> > >  powerpc-pseries-eeh-convert-to-kthread-api.patch
> > >
> > > Will send to paulus (I already did - does Paul not handle the macintosh
> > > driver?)
> >
> > Please don't send out the partial kthread conversions, as they're not
> > that helpful. Depending on the way we'll let the API evolve a
> > kthread_create/run not paired by a kthread_stop might be actually harmful.
> >
> > Please only send along patches that are paired or always built in so that
> > they don't require stopping at all.
> >
> > Btw, many of the drivers above should probably go to benh.
> >
> > There's probably a few more patches falling into this category, these
> > were just the first one the stick into my eye.
>
> Yes, I think I'll probably drop all of them - I've completely lost track of
> which ones are complete, which ones need more work, etc.
>
> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
> Tony, as people put quite a bit of review and test effort into that one.

Andrew, I would recommend holding off on sending these xpc patches to
Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
need to generate an additional patch after I've first sorted out how
best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.

Thanks,
Dean
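
The pairing being asked for here (every thread that is started must also be stopped and waited for, including each member of a pool) can be sketched in userspace C. This is only an analogy under stated assumptions: it uses POSIX threads rather than the kernel's kthread API, and `struct worker`, `worker_run()`, `worker_stop()`, `pool_start()` and `pool_stop()` are hypothetical names invented for the illustration. `worker_stop()` mimics the general shape of kthread_stop(): set a should-stop flag, then wait for the thread to exit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical userspace stand-in for one kernel thread in a pool. */
struct worker {
    pthread_t tid;
    atomic_bool should_stop;  /* plays the role of kthread_should_stop() */
    long iterations;          /* written by the worker, read after join */
};

/* Thread body: do a unit of "work", then poll the stop flag. */
static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };

    do {
        w->iterations++;
        nanosleep(&nap, NULL);
    } while (!atomic_load(&w->should_stop));
    return NULL;
}

/* Analogue of kthread_run(): create and start. Returns 0 on success. */
static int worker_run(struct worker *w)
{
    atomic_init(&w->should_stop, false);
    w->iterations = 0;
    return pthread_create(&w->tid, NULL, worker_fn, w);
}

/* Analogue of kthread_stop(): request exit, then wait for it. */
static int worker_stop(struct worker *w)
{
    atomic_store(&w->should_stop, true);
    return pthread_join(w->tid, NULL);
}

/* Start n workers; on failure, stop the ones already started. */
static int pool_start(struct worker *pool, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int err = worker_run(&pool[i]);
        if (err) {
            while (i-- > 0) {
                int rc = worker_stop(&pool[i]);
                assert(rc == 0);
                (void)rc;
            }
            return err;
        }
    }
    return 0;
}

/* Stop every worker in the pool; returns the first error, if any. */
static int pool_stop(struct worker *pool, size_t n)
{
    int first_err = 0;

    for (size_t i = 0; i < n; i++) {
        int err = worker_stop(&pool[i]);
        if (err && !first_err)
            first_err = err;
    }
    return first_err;
}
```

A caller would declare `struct worker pool[3];`, call `pool_start(pool, 3)` at initialisation and `pool_stop(pool, 3)` at teardown. The point of the sketch is that no thread outlives its owner: the start path unwinds on partial failure, and the stop path joins every thread it created.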
* Re: "partical" kthread conversion
From: Eric W. Biederman @ 2007-05-02 14:45 UTC
To: Dean Nelson; +Cc: Andrew Morton, hch, linux-kernel

Dean Nelson <dcn@sgi.com> writes:

> On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
>> > There's probably a few more patches falling into this category, these
>> > were just the first one the stick into my eye.
>>
>> Yes, I think I'll probably drop all of them - I've completely lost track of
>> which ones are complete, which ones need more work, etc.

Andrew as far as dropping them. If all you have is one of my dinky patches
that changes things to use kthread_run feel free, because of the general
necessity of calling kthread_stop I'm going to have to rework those anyway,
and I still have the originals.

If there is something more then we probably want to keep the patch because
someone has actually looked at it and done something useful.

I'm just now starting to work my way through them all again paying a
little closer attention, so I can do a thorough conversion.

>> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
>> Tony, as people put quite a bit of review and test effort into that one.
>
> Andrew, I would recommend holding off on sending these xpc patches to
> Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
> need to generate an additional patch after I've first sorted out how
> best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.

Ok. Dean give me a couple of a day or so. I think I have just worked
through how to directly create kthreads without too much pain. We are
still going to need kthreadd for spawning for a bit because I don't
expect all architectures to change over immediately, but I think
things can be done in a fairly simple low risk manner.

The changes to the kernel_thread replacement aren't going to be too
bad, pretty much just adding a couple of parameters. It is
copy_thread where things get sticky.

If we can spawn threads fast enough we don't need a thread pool, I
would rather do that.

Eric
* Re: "partical" kthread conversion
From: Dean Nelson @ 2007-05-02 15:37 UTC
To: Eric W. Biederman; +Cc: akpm, hch, linux-kernel

On Wed, May 02, 2007 at 08:45:54AM -0600, Eric W. Biederman wrote:
> Dean Nelson <dcn@sgi.com> writes:
>
> > On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> >> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
> >> Tony, as people put quite a bit of review and test effort into that one.
> >
> > Andrew, I would recommend holding off on sending these xpc patches to
> > Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
> > need to generate an additional patch after I've first sorted out how
> > best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.
>
> Ok. Dean give me a couple of a day or so. I think I have just worked
> through how to directly create kthreads without too much pain. We are
> still going to need kthreadd for spawning for a bit because I don't
> expect all architectures to change over immediately, but I think
> things can be done in a fairly simple low risk manner.
>
> The changes to the kernel_thread replacement aren't going to be too
> bad, pretty much just adding a couple of parameters. It is
> copy_thread where things get sticky.
>
> If we can spawn threads fast enough we don't need a thread pool, I
> would rather do that.

I'd typed up some questions for you about the new patch I need to create
which I'd just sent to you, so I won't repeat them here.

Before proceeding too far with your above changes, you might wait to see
the proposal that Robin Holt is putting together for a kthread pool.
I'm not sure how spawning a thread (which involves allocation of the
task_struct amongst other things, plus scheduling) can beat a wake_up()
of an already existing thread for cost time-wise.

Dean
* Re: "partical" kthread conversion
From: Eric W. Biederman @ 2007-05-02 15:49 UTC
To: Dean Nelson; +Cc: akpm, hch, linux-kernel

Dean Nelson <dcn@sgi.com> writes:

> I'd typed up some questions for you about the new patch I need to create
> which I'd just sent to you, so I won't repeat them here.
>
> Before proceeding too far with your above changes, you might wait to see
> the proposal that Robin Holt is putting together for a kthread pool.
> I'm not sure how spawning a thread (which involves allocation of the
> task_struct amongst other things, plus scheduling) can beat a wake_up()
> of an already existing thread for cost time-wise.

A reasonable point, although if you don't happen to sleep in the
allocations I suspect time wise it's pretty much a wash.

I have some other reasons I might need the capability to clone a thread
from a non-parent process, and it has the potential to simplify some
things in the kthread case so I'm going to finish investigating, since
I believe I have figured out a path to that target.

Eric
* Re: "partical" kthread conversion
From: Andrew Morton @ 2007-05-02 19:33 UTC
To: Eric W. Biederman; +Cc: Dean Nelson, hch, linux-kernel

On Wed, 02 May 2007 08:45:54 -0600 ebiederm@xmission.com (Eric W. Biederman) wrote:

> Dean Nelson <dcn@sgi.com> writes:
>
> > On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> >> > There's probably a few more patches falling into this category, these
> >> > were just the first one the stick into my eye.
> >>
> >> Yes, I think I'll probably drop all of them - I've completely lost track of
> >> which ones are complete, which ones need more work, etc.
>
> Andrew as far as dropping them. If all you have is one of my dinky patches
> that changes things to use kthread_run feel free, because of the general
> necessity of calling kthread_stop I'm going to have to rework those anyway,
> and I still have the originals.
>

I gave up and dropped them all - let's have another run at it. Possibly
some of them were complete and didn't deserve dropping, in which case you
can send them straight back at me.
* Re: "partical" kthread conversion
From: Eric W. Biederman @ 2007-05-02 20:38 UTC
To: Andrew Morton; +Cc: Dean Nelson, hch, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:

> I gave up and dropped them all - let's have another run at it. Possibly
> some of them were complete and didn't deserve dropping, in which case you
> can send them straight back at me.

Sounds like a plan.

Eric
* Re: 2.6.22 -mm merge plans -- vm bugfixes
From: Nick Piggin @ 2007-05-01 8:44 UTC
To: Andrew Morton
Cc: linux-kernel, linux-mm, Hugh Dickins, Andrea Arcangeli, Christoph Hellwig

Andrew Morton wrote:

>  mm-simplify-filemap_nopage.patch
>  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
>  mm-merge-nopfn-into-fault.patch
>  convert-hugetlbfs-to-use-vm_ops-fault.patch
>  mm-remove-legacy-cruft.patch
>  mm-debug-check-for-the-fault-vs-invalidate-race.patch
>  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> Miscish MM changes. Will merge, dependent upon what still applies and works
> if the moveable-zone patches get stalled.

These fix some bugs in the core vm, at least the former one we have
seen numerous people hitting in production...

I don't suppose you mean these are logically dependent on new features
sitting below them in your patch stack, just that you don't want to
spend time fixing a lot of rejects?

If so, I can help fix those up, but I don't think there is anything
major, IIRC the biggest annoyance is just that changing some GFP_types
throws some big hunks.

So, do you or anyone else have any problems with these patches going in
2.6.22? I haven't had much feedback for a while, but I was under the
impression that people are more-or-less happy with them?

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch

This patch fixes the core filemap_nopage vs invalidate_inode_pages2
race by having filemap_nopage return a locked page to do_no_page,
and removes the fairly complex (and inadequate) truncate_count
synchronisation logic.

There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.

mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
etc.

These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
into a single, unified interface. Although this strictly closes some
similar holes in nonlinear faults as well, they are very uncommon, so
I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
any reason not to, but at least they don't fix major bugs).

--
SUSE Labs, Novell Inc.
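
The idea of serialising the fault path against invalidation with the page lock can be illustrated with a toy userspace model. Everything here is an assumption made for illustration, not the kernel code: `struct toy_page`, `toy_fault()` and `toy_invalidate()` are hypothetical names, a pthread mutex stands in for the page lock, and the recheck of `mapping` under the lock is one plausible way such a scheme detects a page that was invalidated before the fault could lock it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_mapping { int id; };

struct toy_page {
    pthread_mutex_t lock;         /* models the page lock */
    struct toy_mapping *mapping;  /* NULL once the page is invalidated */
    bool mapped;                  /* models "a pte points at this page" */
};

enum fault_result { FAULT_OK, FAULT_RETRY };

static void toy_page_init(struct toy_page *p, struct toy_mapping *m)
{
    pthread_mutex_init(&p->lock, NULL);
    p->mapping = m;
    p->mapped = false;
}

/*
 * Fault path: take the page lock, recheck that the page still belongs
 * to the mapping, and only then install the "pte", all under the lock.
 */
static enum fault_result toy_fault(struct toy_page *p, struct toy_mapping *m)
{
    enum fault_result res = FAULT_RETRY;

    assert(p != NULL && m != NULL);
    pthread_mutex_lock(&p->lock);
    if (p->mapping == m) {
        p->mapped = true;
        res = FAULT_OK;
    }
    pthread_mutex_unlock(&p->lock);
    return res;
}

/* Invalidate path: under the same lock, unmap and detach the page. */
static void toy_invalidate(struct toy_page *p)
{
    pthread_mutex_lock(&p->lock);
    p->mapped = false;
    p->mapping = NULL;
    pthread_mutex_unlock(&p->lock);
}

/* Invariant the lock protects: a detached page is never left mapped. */
static bool toy_page_consistent(struct toy_page *p)
{
    pthread_mutex_lock(&p->lock);
    bool ok = p->mapping != NULL || !p->mapped;
    pthread_mutex_unlock(&p->lock);
    return ok;
}
```

Because both paths hold the same lock across their check and their update, only two orderings are possible: the fault completes first and the later invalidate unmaps the page, or the invalidate completes first and the fault sees the detached page and retries. Either way the page cannot end up detached from the mapping yet still mapped.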
* Re: 2.6.22 -mm merge plans -- vm bugfixes
From: Andrew Morton @ 2007-05-01 8:54 UTC
To: Nick Piggin
Cc: linux-kernel, linux-mm, Hugh Dickins, Andrea Arcangeli, Christoph Hellwig

On Tue, 01 May 2007 18:44:07 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> >
> >  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
>
> I don't suppose you mean these are logically dependent on new features
> sitting below them in your patch stack, just that you don't want to
> spend time fixing a lot of rejects?

It'll probably be OK - I just haven't checked yet. I'm fairly handy at
fixing rejects nowadays ;)

Nobody seems to be taking up this opportunity to provide us with review and
test results on the antifrag patches.
* Re: 2.6.22 -mm merge plans -- vm bugfixes
From: Hugh Dickins @ 2007-05-01 19:31 UTC
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
>
> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> >
> >  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> >
> > Miscish MM changes. Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
>
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
...
>
> So, do you or anyone else have any problems with these patches going in
> 2.6.22? I haven't had much feedback for a while, but I was under the
> impression that people are more-or-less happy with them?
>
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2
> race by having filemap_nopage return a locked page to do_no_page,
> and removes the fairly complex (and inadequate) truncate_count
> synchronisation logic.
>
> There were concerns that we could do this more cheaply, but I think it
> is important to start with a base that is simple and more likely to
> be correct and build on that. My testing didn't show any obvious
> problems with performance.

I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.

I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandability
of holding the page lock over the file fault.

But I was quite disappointed when
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
appeared, putting double unmap_mapping_range calls in. Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range. In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.

>
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
>
> These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
> into a single, unified interface. Although this strictly closes some
> similar holes in nonlinear faults as well, they are very uncommon, so
> I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
> any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others
were strongly in favour, with various people waiting upon them.

Hugh
* Re: 2.6.22 -mm merge plans -- vm bugfixes
From: Nick Piggin @ 2007-05-02 3:08 UTC
To: Hugh Dickins
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

Hugh Dickins wrote:
> On Tue, 1 May 2007, Nick Piggin wrote:

>>There were concerns that we could do this more cheaply, but I think it
>>is important to start with a base that is simple and more likely to
>>be correct and build on that. My testing didn't show any obvious
>>problems with performance.
>
>
> I don't see _problems_ with performance, but I do consistently see the
> same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
> and page fault tests on SMP, several machines, just as I did last year.

OK. I did run some tests at one stage which didn't show a regression
on my P4, however I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.

> I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
> you posted a set of 10 and a set of 7 patches I tried in versus out;
> at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
> versus out; with similar results.
>
> I did check the graphs on test.kernel.org, I couldn't see any bad
> behaviour there that correlated with this work; though each -mm
> has such a variety of new work in it, it's very hard to attribute.
> And nobody else has reported any regression from your patches.
>
> I'm inclined to write it off as poorer performance in some micro-
> benchmarks, against which we offset the improved understandability
> of holding the page lock over the file fault.
>
> But I was quite disappointed when
> mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> appeared, putting double unmap_mapping_range calls in. Certainly
> you were wrong to take the one out, but a pity to end up with two.
>
> Your comment says/said:
> The nopage vs invalidate race fix patch did not take care of truncating
> private COW pages. Mind you, I'm pretty sure this was previously racy
> even for regular truncate, not to mention vmtruncate_range.
>
> vmtruncate_range (holepunch) was deficient I agree, and though we
> can now take out your second unmap_mapping_range there, that's only
> because I've slipped one into shmem_truncate_range. In due course it
> needs to be properly handled by noting the range in shmem inode info.
>
> (I think you couldn't take that approach, noting invalid range in
> ->mapping while invalidating, because NFS has/had some cases of
> invalidate_whatever without i_mutex?)

Sorry, I didn't parse this?

But I wonder whether it is better to do
it in vmtruncate_range than the filesystem? Private COWed pages are
not really a filesystem "thing"...

> But I'm pretty sure (to use your words!) regular truncate was not racy
> before: I believe Andrea's sequence count was handling that case fine,
> without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.

> Well, I guess I've come to accept that, expensive as unmap_mapping_range
> may be, truncating files while they're mmap'ed is perverse behaviour:
> perhaps even deserving such punishment.
>
> But it is a shame, and leaves me wondering what you gained with the
> page lock there.
>
> One thing gained is ease of understanding, and if your later patches
> build an edifice upon the knowledge of holding that page lock while
> faulting, I've no wish to undermine that foundation.

It also fixes a bug, doesn't it? ;)

--
SUSE Labs, Novell Inc.
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-02 3:08 ` Nick Piggin
@ 2007-05-02 9:15 ` Nick Piggin
2007-05-02 14:00 ` Hugh Dickins
1 sibling, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-02 9:15 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

Nick Piggin wrote:
> Hugh Dickins wrote:
>
>> On Tue, 1 May 2007, Nick Piggin wrote:
>
>>> There were concerns that we could do this more cheaply, but I think it
>>> is important to start with a base that is simple and more likely to
>>> be correct and build on that. My testing didn't show any obvious
>>> problems with performance.
>>
>> I don't see _problems_ with performance, but I do consistently see the
>> same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
>> and page fault tests on SMP, several machines, just as I did last year.
>
> OK. I did run some tests at one stage which didn't show a regression
> on my P4, however I don't know that they were statistically significant.
> I'll try a couple more runs and post numbers.

I didn't have enough time tonight to get means/stddev, etc, but the runs
are pretty stable.

Patch tested was just the lock page one.

SMP kernel, tasks bound to 1 CPU:

P4 Xeon  pagefault  fork         exec
2.6.21   1.67-1.69  140.7-142.0  449.5-460.8
+patch   1.75-1.77  144.0-145.5  456.2-463.0

So it's taken on nearly 5% on pagefault, but looks like less than 2% on
fork, so not as bad as your numbers (phew).

G5       pagefault  fork         exec
2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
+patch   1.71-1.73  175.2-180.8  780.5-794.2

Bigger hit there.

Page faults can be improved a tiny bit by not using a test and clear op
in unlock_page (less barriers for the G5).

I don't think that's really a blocker problem for a merge, but I wonder
what we can do to improve it. Lockless pagecache shaves quite a bit of
straight line find_get_page performance there.

Going to a non-sleeping lock might be one way to go in the long term, but
it would require quite a lot of restructuring.

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 233+ messages in thread
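As a rough aid for reading the tables in the message above, the relative slowdown can be recomputed from the reported ranges. The sketch below is not from the thread: the helper names `range_mid` and `overhead_percent` are hypothetical, and summarising each low-high range by its midpoint is an assumption made only for illustration.

```c
#include <assert.h>

/* A benchmark result reported as a low-high range, as in the tables. */
struct range {
	double lo;
	double hi;
};

/* Assumption: summarise a range by its midpoint. */
static double range_mid(struct range r)
{
	return (r.lo + r.hi) / 2.0;
}

/* Relative slowdown of `patched` versus `base`, in percent. */
static double overhead_percent(struct range base, struct range patched)
{
	double b = range_mid(base);

	assert(b > 0.0);	/* a zero baseline would make the ratio meaningless */
	return (range_mid(patched) - b) / b * 100.0;
}
```

With the pagefault columns, this gives a little under 5% for the P4 Xeon (1.67-1.69 against 1.75-1.77) and a little under 15% for the G5 (1.49-1.51 against 1.71-1.73), which matches the "nearly 5%" and "bigger hit" remarks. A different summary statistic (for example the best run) would shift these figures somewhat.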
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-02 3:08 ` Nick Piggin
2007-05-02 9:15 ` Nick Piggin
@ 2007-05-02 14:00 ` Hugh Dickins
2007-05-03 1:32 ` Nick Piggin
2007-05-09 12:34 ` Nick Piggin
1 sibling, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 14:00 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

On Wed, 2 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> >
> > But I was quite disappointed when
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> > appeared, putting double unmap_mapping_range calls in. Certainly
> > you were wrong to take the one out, but a pity to end up with two.
> >
> > Your comment says/said:
> > The nopage vs invalidate race fix patch did not take care of truncating
> > private COW pages. Mind you, I'm pretty sure this was previously racy
> > even for regular truncate, not to mention vmtruncate_range.
> >
> > vmtruncate_range (holepunch) was deficient I agree, and though we
> > can now take out your second unmap_mapping_range there, that's only
> > because I've slipped one into shmem_truncate_range. In due course it
> > needs to be properly handled by noting the range in shmem inode info.
> >
> > (I think you couldn't take that approach, noting invalid range in
> > ->mapping while invalidating, because NFS has/had some cases of
> > invalidate_whatever without i_mutex?)
>
> Sorry, I didn't parse this?

I meant that i_size is used to protect against truncation races, but
we have no equivalent inval_start,inval_end in the struct inode or
struct address_space, such as could be used for similar protection
against races while invalidating. And that IIRC there are places
where NFS was doing the invalidation without i_mutex: so there could
be concurrent invalidations, so one inval_start,inval_end in the
structure wouldn't be enough anyway.

> But I wonder whether it is better to do
> it in vmtruncate_range than the filesystem? Private COWed pages are
> not really a filesystem "thing"...

It wasn't the thought of private COWed pages which made me put a
second unmap_mapping_range in shmem_truncate_range, it was its own
internal file<->swap consistency which needed that (as a quick fix).

The real fix to be having a trunc_start,trunc_end or whatever in the
shmem_inode_info (assuming it's not wanted in the common inode: might
be if holepunch spreads e.g. it's been mentioned with fallocate).

Re private COWed pages and holepunch: Miklos and I agree that really
it would be better for holepunch _not_ to remove them - but that's
rather off-topic.

More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation. But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.

>
> > But I'm pretty sure (to use your words!) regular truncate was not racy
> > before: I believe Andrea's sequence count was handling that case fine,
> > without a second unmap_mapping_range.
>
> OK, I think you're right. I _think_ it should also be OK with the
> lock_page version as well: we should not be able to have any pages
> after the first unmap_mapping_range call, because of the i_size
> write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.

> > But it is a shame, and leaves me wondering what you gained with the
> > page lock there.
> >
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
>
> It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...

Hugh

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-02 14:00 ` Hugh Dickins
@ 2007-05-03 1:32 ` Nick Piggin
2007-05-03 10:37 ` Christoph Hellwig
` (2 more replies)
2007-05-09 12:34 ` Nick Piggin
1 sibling, 3 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 1:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 3875 bytes --]

Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

> More on-topic, since you suggest doing more within vmtruncate_range
> than the filesystem: no, I'm afraid that's misdesigned, and I want
> to move almost all of it into the filesystem ->truncate_range.
> Because, if what vmtruncate_range is doing before it gets to the
> filesystem isn't to be just a waste of time, the filesystem needs
> to know what's going on in advance - just as notify_change warns
> the filesystem about a coming truncation. But easier than inventing
> some new notification is to move it all into the filesystem, with
> unmap_mapping_range+truncate_inode_pages_range its library helpers.

Well I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but I think if you do that then the filesystem should
call the vm to prepare / finish truncate, rather than open code
calls to unmap itself.

>>>But I'm pretty sure (to use your words!) regular truncate was not racy
>>>before: I believe Andrea's sequence count was handling that case fine,
>>>without a second unmap_mapping_range.
>>
>>OK, I think you're right. I _think_ it should also be OK with the
>>lock_page version as well: we should not be able to have any pages
>>after the first unmap_mapping_range call, because of the i_size
>>write. So if we have no pages, there is nothing to 'cow' from.
>
>
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll
revisit.

>>>But it is a shame, and leaves me wondering what you gained with the
>>>page lock there.
>>>
>>>One thing gained is ease of understanding, and if your later patches
>>>build an edifice upon the knowledge of holding that page lock while
>>>faulting, I've no wish to undermine that foundation.
>>
>>It also fixes a bug, doesn't it? ;)
>
>
> Well, I'd come to think that perhaps the bugs would be solved by
> that second unmap_mapping_range alone, so the pagelock changes
> just a misleading diversion.
>
> I'm not sure how I feel about that: calling unmap_mapping_range a
> second time feels such a cheat, but if (big if) it does solve the
> races, and the pagelock method is as expensive as your numbers
> now suggest...

Well aside from being terribly ugly, it means we can still drop
the dirty bit where we'd otherwise rather not, so I don't think
we can do that.

I think there may be some way we can do this without taking the
page lock, and I was going to look at it, but I think it is
quite neat to just lock the page...

I don't think performance is _that_ bad. On the P4 it is a couple
of % on the microbenchmarks. The G5 is worse, but even then I
don't think it is I'll try to improve that and get back to you.

The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5       pagefault  fork         exec
2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
+patch   1.71-1.73  175.2-180.8  780.5-794.2
+patch2  1.61-1.63  169.8-175.0  748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.

I think we could get further performance improvement by
implementing arch specific bitops for lock/unlock operations,
so we don't need to use things like smp_mb__before_clear_bit()
if they aren't needed or full barriers in the test_and_set_bit().

--
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-unlock-speedup.patch --]
[-- Type: text/plain, Size: 2228 bytes --]

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-04-24 10:39:56.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h	2007-05-03 08:38:53.000000000 +1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_waiters		20	/* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1	/* Used by some filesystems */
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-04-24 10:39:56.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-03 08:35:08.000000000 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page(page);
 }
 
@@ -152,7 +152,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page_nosync(page);
 }
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
+++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
@@ -532,11 +532,13 @@
  */
 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -568,6 +570,11 @@
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	set_bit(PG_waiters, &page->flags);
+	if (unlikely(!TestSetPageLocked(page))) {
+		clear_bit(PG_waiters, &page->flags);
+		return;
+	}
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 1:32 ` Nick Piggin
@ 2007-05-03 10:37 ` Christoph Hellwig
2007-05-03 12:56 ` Nick Piggin
2007-05-03 12:24 ` Hugh Dickins
2007-05-03 16:52 ` Andrew Morton
2 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 10:37 UTC (permalink / raw)
To: Nick Piggin
Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:
>
> G5       pagefault  fork         exec
> 2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
> +patch   1.71-1.73  175.2-180.8  780.5-794.2
> +patch2  1.61-1.63  169.8-175.0  748.6-757.0
>
> So that brings the fork/exec hits down to much less than 5%, and
> would likely speed up other things that lock the page, like write
> or page reclaim.

Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 10:37 ` Christoph Hellwig
@ 2007-05-03 12:56 ` Nick Piggin
2007-05-04 9:23 ` Nick Piggin
0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 12:56 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli

Christoph Hellwig wrote:
> On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
>
>>The attached patch gets performance up a bit by avoiding some
>>barriers and some cachelines:
>>
>>G5       pagefault  fork         exec
>>2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
>>+patch   1.71-1.73  175.2-180.8  780.5-794.2
>>+patch2  1.61-1.63  169.8-175.0  748.6-757.0
>>
>>So that brings the fork/exec hits down to much less than 5%, and
>>would likely speed up other things that lock the page, like write
>>or page reclaim.
>
>
> Is that every fork/exec or just under certain circumstances?
> A 5% regression on every fork/exec is not acceptable.

Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra
overhead in the file backed fault handler. For noop fork/execs,
I think that tends to be more pronounced, it is hard to see any
difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects
happening -- for example the exec numbers on the 2nd line are
disproportionately large.

It definitely isn't a good thing to drop performance anywhere
though, so I'll keep looking for improvements.

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 12:56 ` Nick Piggin
@ 2007-05-04 9:23 ` Nick Piggin
2007-05-04 9:43 ` Nick Piggin
2007-05-08 3:03 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04 9:23 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

Nick Piggin wrote:
> Christoph Hellwig wrote:

>> Is that every fork/exec or just under certain circumstances?
>> A 5% regression on every fork/exec is not acceptable.
>
>
> Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
> numbers will be improved as well with that patch. Then if we have
> specific lock/unlock bitops, I hope it should reduce that further.

OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
+patch   1.71-1.73  175.2-180.8  780.5-794.2
+patch2  1.61-1.63  169.8-175.0  748.6-757.0
+patch3  1.54-1.57  165.6-170.9  748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

--
SUSE Labs, Novell Inc.
[-- Attachment #2: lock-bitops.patch --] [-- Type: text/plain, Size: 10991 bytes --] Index: linux-2.6/include/asm-powerpc/bitops.h =================================================================== --- linux-2.6.orig/include/asm-powerpc/bitops.h 2007-05-04 16:08:20.000000000 +1000 +++ linux-2.6/include/asm-powerpc/bitops.h 2007-05-04 16:14:39.000000000 +1000 @@ -87,6 +87,24 @@ : "cc" ); } +static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr) +{ + unsigned long old; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( + LWSYNC_ON_SMP +"1:" PPC_LLARX "%0,0,%3 # clear_bit_unlock\n" + "andc %0,%0,%2\n" + PPC405_ERR77(0,%3) + PPC_STLCX "%0,0,%3\n" + "bne- 1b" + : "=&r" (old), "+m" (*p) + : "r" (mask), "r" (p) + : "cc" ); +} + static __inline__ void change_bit(int nr, volatile unsigned long *addr) { unsigned long old; @@ -126,6 +144,27 @@ return (old & mask) != 0; } +static __inline__ int test_and_set_bit_lock(unsigned long nr, + volatile unsigned long *addr) +{ + unsigned long old, t; + unsigned long mask = BITOP_MASK(nr); + unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr); + + __asm__ __volatile__( +"1:" PPC_LLARX "%0,0,%3 # test_and_set_bit_lock\n" + "or %1,%0,%2 \n" + PPC405_ERR77(0,%3) + PPC_STLCX "%1,0,%3 \n" + "bne- 1b" + ISYNC_ON_SMP + : "=&r" (old), "=&r" (t) + : "r" (mask), "r" (p) + : "cc", "memory"); + + return (old & mask) != 0; +} + static __inline__ int test_and_clear_bit(unsigned long nr, volatile unsigned long *addr) { Index: linux-2.6/include/linux/pagemap.h =================================================================== --- linux-2.6.orig/include/linux/pagemap.h 2007-05-04 16:14:36.000000000 +1000 +++ linux-2.6/include/linux/pagemap.h 2007-05-04 16:17:34.000000000 +1000 @@ -136,13 +136,18 @@ extern void FASTCALL(__wait_on_page_locked(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); +static inline int 
trylock_page(struct page *page) +{ + return (likely(!TestSetPageLocked_Lock(page))); +} + /* * lock_page may only be called if we have the page's inode pinned. */ static inline void lock_page(struct page *page) { might_sleep(); - if (unlikely(TestSetPageLocked(page))) + if (!trylock_page(page)) __lock_page(page); } @@ -153,7 +158,7 @@ static inline void lock_page_nosync(struct page *page) { might_sleep(); - if (unlikely(TestSetPageLocked(page))) + if (!trylock_page(page)) __lock_page_nosync(page); } Index: linux-2.6/drivers/scsi/sg.c =================================================================== --- linux-2.6.orig/drivers/scsi/sg.c 2007-04-12 14:35:08.000000000 +1000 +++ linux-2.6/drivers/scsi/sg.c 2007-05-04 16:23:27.000000000 +1000 @@ -1734,7 +1734,7 @@ */ flush_dcache_page(pages[i]); /* ?? Is locking needed? I don't think so */ - /* if (TestSetPageLocked(pages[i])) + /* if (!trylock_page(pages[i])) goto out_unlock; */ } Index: linux-2.6/fs/cifs/file.c =================================================================== --- linux-2.6.orig/fs/cifs/file.c 2007-04-12 14:35:09.000000000 +1000 +++ linux-2.6/fs/cifs/file.c 2007-05-04 16:23:36.000000000 +1000 @@ -1229,7 +1229,7 @@ if (first < 0) lock_page(page); - else if (TestSetPageLocked(page)) + else if (!trylock_page(page)) break; if (unlikely(page->mapping != mapping)) { Index: linux-2.6/fs/jbd/commit.c =================================================================== --- linux-2.6.orig/fs/jbd/commit.c 2007-04-12 14:35:09.000000000 +1000 +++ linux-2.6/fs/jbd/commit.c 2007-05-04 16:23:30.000000000 +1000 @@ -64,7 +64,7 @@ goto nope; /* OK, it's a truncated page */ - if (TestSetPageLocked(page)) + if (!trylock_page(page)) goto nope; page_cache_get(page); Index: linux-2.6/fs/jbd2/commit.c =================================================================== --- linux-2.6.orig/fs/jbd2/commit.c 2007-04-12 14:35:09.000000000 +1000 +++ linux-2.6/fs/jbd2/commit.c 2007-05-04 16:23:40.000000000 +1000 @@ -64,7 +64,7 @@ 
goto nope; /* OK, it's a truncated page */ - if (TestSetPageLocked(page)) + if (!trylock_page(page)) goto nope; page_cache_get(page); Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c 2007-03-05 15:17:25.000000000 +1100 +++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c 2007-05-04 16:23:33.000000000 +1000 @@ -601,7 +601,7 @@ } else pg_offset = PAGE_CACHE_SIZE; - if (page->index == tindex && !TestSetPageLocked(page)) { + if (page->index == tindex && trylock_page(page)) { len = xfs_probe_page(page, pg_offset, mapped); unlock_page(page); } @@ -685,7 +685,7 @@ if (page->index != tindex) goto fail; - if (TestSetPageLocked(page)) + if (!trylock_page(page)) goto fail; if (PageWriteback(page)) goto fail_unlock_page; Index: linux-2.6/include/linux/page-flags.h =================================================================== --- linux-2.6.orig/include/linux/page-flags.h 2007-05-03 08:38:53.000000000 +1000 +++ linux-2.6/include/linux/page-flags.h 2007-05-04 16:18:23.000000000 +1000 @@ -116,8 +116,12 @@ set_bit(PG_locked, &(page)->flags) #define TestSetPageLocked(page) \ test_and_set_bit(PG_locked, &(page)->flags) +#define TestSetPageLocked_Lock(page) \ + test_and_set_bit_lock(PG_locked, &(page)->flags) #define ClearPageLocked(page) \ clear_bit(PG_locked, &(page)->flags) +#define ClearPageLocked_Unlock(page) \ + clear_bit_unlock(PG_locked, &(page)->flags) #define TestClearPageLocked(page) \ test_and_clear_bit(PG_locked, &(page)->flags) Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2007-05-02 15:00:28.000000000 +1000 +++ linux-2.6/mm/memory.c 2007-05-04 16:19:12.000000000 +1000 @@ -1550,7 +1550,7 @@ * not dirty accountable. 
*/ if (PageAnon(old_page)) { - if (!TestSetPageLocked(old_page)) { + if (trylock_page(old_page)) { reuse = can_share_swap_page(old_page); unlock_page(old_page); } Index: linux-2.6/mm/migrate.c =================================================================== --- linux-2.6.orig/mm/migrate.c 2007-05-02 14:48:36.000000000 +1000 +++ linux-2.6/mm/migrate.c 2007-05-04 16:19:15.000000000 +1000 @@ -569,7 +569,7 @@ * establishing additional references. We are the only one * holding a reference to the new page at this point. */ - if (TestSetPageLocked(newpage)) + if (!trylock_page(newpage)) BUG(); /* Prepare mapping for the new page.*/ @@ -621,7 +621,7 @@ goto move_newpage; rc = -EAGAIN; - if (TestSetPageLocked(page)) { + if (!trylock_page(page)) { if (!force) goto move_newpage; lock_page(page); Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c 2007-04-13 20:48:04.000000000 +1000 +++ linux-2.6/mm/rmap.c 2007-05-04 16:19:18.000000000 +1000 @@ -426,7 +426,7 @@ referenced += page_referenced_anon(page); else if (is_locked) referenced += page_referenced_file(page); - else if (TestSetPageLocked(page)) + else if (!trylock_page(page)) referenced++; else { if (page->mapping) Index: linux-2.6/mm/shmem.c =================================================================== --- linux-2.6.orig/mm/shmem.c 2007-05-02 15:00:26.000000000 +1000 +++ linux-2.6/mm/shmem.c 2007-05-04 16:19:22.000000000 +1000 @@ -1155,7 +1155,7 @@ } /* We have to do this with page locked to prevent races */ - if (TestSetPageLocked(swappage)) { + if (!trylock_page(swappage)) { shmem_swp_unmap(entry); spin_unlock(&info->lock); wait_on_page_locked(swappage); @@ -1214,7 +1214,7 @@ shmem_swp_unmap(entry); filepage = find_get_page(mapping, idx); if (filepage && - (!PageUptodate(filepage) || TestSetPageLocked(filepage))) { + (!PageUptodate(filepage) || !trylock_page(filepage))) { spin_unlock(&info->lock); wait_on_page_locked(filepage); 
page_cache_release(filepage); Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c 2007-04-12 14:35:11.000000000 +1000 +++ linux-2.6/mm/swap.c 2007-05-04 16:19:28.000000000 +1000 @@ -412,7 +412,7 @@ for (i = 0; i < pagevec_count(pvec); i++) { struct page *page = pvec->pages[i]; - if (PagePrivate(page) && !TestSetPageLocked(page)) { + if (PagePrivate(page) && trylock_page(page)) { if (PagePrivate(page)) try_to_release_page(page, 0); unlock_page(page); Index: linux-2.6/mm/swap_state.c =================================================================== --- linux-2.6.orig/mm/swap_state.c 2007-04-24 10:39:57.000000000 +1000 +++ linux-2.6/mm/swap_state.c 2007-05-04 16:19:32.000000000 +1000 @@ -252,7 +252,7 @@ */ static inline void free_swap_cache(struct page *page) { - if (PageSwapCache(page) && !TestSetPageLocked(page)) { + if (PageSwapCache(page) && trylock_page(page)) { remove_exclusive_swap_page(page); unlock_page(page); } Index: linux-2.6/mm/swapfile.c =================================================================== --- linux-2.6.orig/mm/swapfile.c 2007-04-24 10:39:55.000000000 +1000 +++ linux-2.6/mm/swapfile.c 2007-05-04 16:19:25.000000000 +1000 @@ -401,7 +401,7 @@ if (p) { if (swap_entry_free(p, swp_offset(entry)) == 1) { page = find_get_page(&swapper_space, entry.val); - if (page && unlikely(TestSetPageLocked(page))) { + if (page && unlikely(!trylock_page(page))) { page_cache_release(page); page = NULL; } Index: linux-2.6/mm/truncate.c =================================================================== --- linux-2.6.orig/mm/truncate.c 2007-05-02 15:00:27.000000000 +1000 +++ linux-2.6/mm/truncate.c 2007-05-04 16:19:35.000000000 +1000 @@ -185,7 +185,7 @@ if (page_index > next) next = page_index; next++; - if (TestSetPageLocked(page)) + if (!trylock_page(page)) continue; if (PageWriteback(page)) { unlock_page(page); @@ -291,7 +291,7 @@ pgoff_t index; int lock_failed; - lock_failed = 
TestSetPageLocked(page); + lock_failed = !trylock_page(page); /* * We really shouldn't be looking at the ->index of an Index: linux-2.6/mm/vmscan.c =================================================================== --- linux-2.6.orig/mm/vmscan.c 2007-04-24 10:39:56.000000000 +1000 +++ linux-2.6/mm/vmscan.c 2007-05-04 16:19:38.000000000 +1000 @@ -466,7 +466,7 @@ page = lru_to_page(page_list); list_del(&page->lru); - if (TestSetPageLocked(page)) + if (!trylock_page(page)) goto keep; VM_BUG_ON(PageActive(page)); @@ -538,7 +538,7 @@ * A synchronous write - probably a ramdisk. Go * ahead and try to reclaim the page. */ - if (TestSetPageLocked(page)) + if (!trylock_page(page)) goto keep; if (PageDirty(page) || PageWriteback(page)) goto keep_locked; ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-04 9:23 ` Nick Piggin
@ 2007-05-04 9:43 ` Nick Piggin
2007-05-08 3:03 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04 9:43 UTC (permalink / raw)
Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Benjamin Herrenschmidt

Nick Piggin wrote:
> Nick Piggin wrote:
>
>> Christoph Hellwig wrote:
>
>
>>> Is that every fork/exec or just under certain circumstances?
>>> A 5% regression on every fork/exec is not acceptable.
>>
>>
>>
>> Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
>> numbers will be improved as well with that patch. Then if we have
>> specific lock/unlock bitops, I hope it should reduce that further.
>
>
> OK, with the races and missing barriers fixed from the previous patch,
> plus the attached one added (+patch3), numbers are better again (I'm not
> sure if I have the ppc barriers correct though).
>
> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.
>
> 2.6.21   1.49-1.51  164.6-170.8  741.8-760.3
> +patch   1.71-1.73  175.2-180.8  780.5-794.2
> +patch2  1.61-1.63  169.8-175.0  748.6-757.0
> +patch3  1.54-1.57  165.6-170.9  748.5-757.5
>
> So fault performance goes to under 5%, fork is in the noise, exec is
> still up 1%, but maybe that's noise or cache effects again.

OK, with my new lock/unlock_page, dd if=large (bigger than RAM) sparse
file of=/dev/null with an experimentally optimal block size (32K) goes
from 626MB/s to 683MB/s on 2 CPU G5 booted with maxcpus=1.

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-04 9:23 ` Nick Piggin
2007-05-04 9:43 ` Nick Piggin
@ 2007-05-08 3:03 ` Benjamin Herrenschmidt
1 sibling, 0 replies; 233+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-08 3:03 UTC (permalink / raw)
To: Nick Piggin
Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli

On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:

> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements).

There's a little bogon with the PG_waiters bit that you already know
about but apart from that it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place by providing precisely the expected semantics for bit
locks.

There are quite a few people who've been trying to do bit locks and I've
always been very worried by how easy it is to get the barriers wrong (or
too much barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit.

We might just ignore them though. In the case of the page lock, it's
only hitting the slow path, and I would expect other usage scenarii to
be similar.

Cheers,
Ben.

^ permalink raw reply [flat|nested] 233+ messages in thread
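The "explicit _lock/_unlock semantics" praised above can be illustrated outside the kernel with C11 atomics. What follows is a minimal userspace sketch, not the powerpc implementation from the patch: the `sketch_` names are hypothetical, and mapping the lock op to an acquire read-modify-write and the unlock op to a release one is an assumption about the intended ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set bit `nr` with acquire ordering and report whether it was already
 * set. Accesses after a successful lock cannot be reordered before it.
 */
static bool sketch_test_and_set_bit_lock(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;
	unsigned long old;

	old = atomic_fetch_or_explicit(addr, mask, memory_order_acquire);
	return (old & mask) != 0;
}

/*
 * Clear bit `nr` with release ordering. Accesses made while holding the
 * bit lock cannot be reordered after the clear. Other bits are untouched.
 */
static void sketch_clear_bit_unlock(unsigned int nr, atomic_ulong *addr)
{
	atomic_fetch_and_explicit(addr, ~(1UL << nr), memory_order_release);
}

/* Trylock built on the lock op: true means the caller now owns bit 0. */
static bool sketch_trylock(atomic_ulong *flags)
{
	return !sketch_test_and_set_bit_lock(0, flags);
}
```

The point of the split is that neither operation needs a full barrier on its own: the lock side only has to keep the critical section from floating upwards, and the unlock side only has to keep it from floating downwards. Callers that need stronger ordering, such as an unlock path that must then look at a separate waiters flag, still have to add it themselves.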
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 1:32 ` Nick Piggin
2007-05-03 10:37 ` Christoph Hellwig
@ 2007-05-03 12:24 ` Hugh Dickins
2007-05-03 12:43 ` Nick Piggin
2007-05-03 16:52 ` Andrew Morton
2 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 12:24 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

On Thu, 3 May 2007, Nick Piggin wrote:
>
> The problem is that lock/unlock_page is expensive on powerpc, and
> if we improve that, we improve more than just the fault handler...
>
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:

There's a strong whiff of raciness about this...
but I could very easily be wrong.

> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
> +++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
> @@ -532,11 +532,13 @@
>   */
>  void fastcall unlock_page(struct page *page)
>  {
> +	VM_BUG_ON(!PageLocked(page));
>  	smp_mb__before_clear_bit();
> -	if (!TestClearPageLocked(page))
> -		BUG();
> -	smp_mb__after_clear_bit();
> -	wake_up_page(page, PG_locked);
> +	ClearPageLocked(page);
> +	if (unlikely(test_bit(PG_waiters, &page->flags))) {
> +		clear_bit(PG_waiters, &page->flags);
> +		wake_up_page(page, PG_locked);
> +	}
>  }
>  EXPORT_SYMBOL(unlock_page);
>
> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>  {
>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> +	set_bit(PG_waiters, &page->flags);
> +	if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock? But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?

> +		clear_bit(PG_waiters, &page->flags);
> +		return;
> +	}
>  	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>  							TASK_UNINTERRUPTIBLE);
>  }

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 12:24 ` Hugh Dickins
@ 2007-05-03 12:43 ` Nick Piggin
2007-05-03 12:58 ` Hugh Dickins
0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 12:43 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

Hugh Dickins wrote:
> On Thu, 3 May 2007, Nick Piggin wrote:
>
>>The problem is that lock/unlock_page is expensive on powerpc, and
>>if we improve that, we improve more than just the fault handler...
>>
>>The attached patch gets performance up a bit by avoiding some
>>barriers and some cachelines:
>
>
> There's a strong whiff of raciness about this...
> but I could very easily be wrong.
>
>
>>Index: linux-2.6/mm/filemap.c
>>===================================================================
>>--- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
>>+++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
>>@@ -532,11 +532,13 @@
>>  */
>> void fastcall unlock_page(struct page *page)
>> {
>>+	VM_BUG_ON(!PageLocked(page));
>> 	smp_mb__before_clear_bit();
>>-	if (!TestClearPageLocked(page))
>>-		BUG();
>>-	smp_mb__after_clear_bit();
>>-	wake_up_page(page, PG_locked);
>>+	ClearPageLocked(page);
>>+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
>>+		clear_bit(PG_waiters, &page->flags);
>>+		wake_up_page(page, PG_locked);
>>+	}
>> }
>> EXPORT_SYMBOL(unlock_page);
>>
>>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>> {
>> 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>>
>>+	set_bit(PG_waiters, &page->flags);
>>+	if (unlikely(!TestSetPageLocked(page))) {
>
>
> What happens if another cpu is coming through __lock_page at the
> same time, did its set_bit, now finds PageLocked, and so proceeds
> to the __wait_on_bit_lock? But this cpu now clears PG_waiters,
> so this task's unlock_page won't wake the other?

You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.

>
>
>>+		clear_bit(PG_waiters, &page->flags);
>>+		return;
>>+	}
>> 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>> 							TASK_UNINTERRUPTIBLE);
>> }
>
>

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-03 12:43 ` Nick Piggin @ 2007-05-03 12:58 ` Hugh Dickins 2007-05-03 13:08 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-03 12:58 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig On Thu, 3 May 2007, Nick Piggin wrote: > >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!) > > > { > > > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); > > > > > >+ set_bit(PG_waiters, &page->flags); > > >+ if (unlikely(!TestSetPageLocked(page))) { > > > > What happens if another cpu is coming through __lock_page at the > > same time, did its set_bit, now finds PageLocked, and so proceeds > > to the __wait_on_bit_lock? But this cpu now clears PG_waiters, > > so this task's unlock_page won't wake the other? > > You're right, we can't clear the bit here. Doubt it mattered much anyway? Ah yes, that's a good easy answer. In fact, just remove this whole test and block (we already tried TestSetPageLocked outside just a short while ago, so this repeat won't often save anything). > > BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page > above... that barrier is in the slow path as well though, so it shouldn't > matter either. I vaguely wondered how such barriers had managed to dissolve away, but cranking my brain up to think about barriers takes far too long. > > >+ clear_bit(PG_waiters, &page->flags); > > >+ return; > > >+ } > > > __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, > > > TASK_UNINTERRUPTIBLE); > >> } ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 12:58 ` Hugh Dickins
@ 2007-05-03 13:08 ` Nick Piggin
0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 13:08 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

Hugh Dickins wrote:
> On Thu, 3 May 2007, Nick Piggin wrote:
>
>>>>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>>>>{
>>>> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>>>>
>>>>+ set_bit(PG_waiters, &page->flags);
>>>>+ if (unlikely(!TestSetPageLocked(page))) {
>>>
>>>What happens if another cpu is coming through __lock_page at the
>>>same time, did its set_bit, now finds PageLocked, and so proceeds
>>>to the __wait_on_bit_lock? But this cpu now clears PG_waiters,
>>>so this task's unlock_page won't wake the other?
>>
>>You're right, we can't clear the bit here. Doubt it mattered much anyway?
>
>
> Ah yes, that's a good easy answer. In fact, just remove this whole
> test and block (we already tried TestSetPageLocked outside just a
> short while ago, so this repeat won't often save anything).

Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. It is unfortunate that it uses
another page flag, but it seemed to give quite a bit of speedup on
unlock_page (probably from avoiding both the barriers and an extra
random cacheline load (from the hash)).

I guess it has to get good results from more benchmarks...

>>BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
>>above... that barrier is in the slow path as well though, so it shouldn't
>>matter either.
>
>
> I vaguely wondered how such barriers had managed to dissolve away,
> but cranking my brain up to think about barriers takes far too long.

That barrier was one too many :)

However I believe the fastpath barrier can go away because the PG_locked
operation is depending on the same cacheline as PG_waiters.

--
SUSE Labs, Novell Inc.
^ permalink raw reply [flat|nested] 233+ messages in thread
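[Editorial sketch, not part of the thread.] The waiters-bit idea discussed above can be illustrated in userspace C. This is only a rough model under stated assumptions: every `demo_*` name is hypothetical, C11 sequentially consistent atomics stand in for the kernel's bit operations and barriers, and a counter stands in for wake_up_page(). It shows why an uncontended unlock never needs to look at the wait queue; it is not a complete lock and makes no claim about the barrier question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PG_locked and PG_waiters. */
enum { DEMO_LOCKED = 1u << 0, DEMO_WAITERS = 1u << 1 };

struct demo_page {
	atomic_uint flags;	/* both bits live in one word, so one cacheline */
	unsigned wakeups;	/* counts slow-path wake-ups, illustration only */
};

/* Fast path: true if the lock bit was clear and now belongs to the caller. */
static bool demo_trylock(struct demo_page *p)
{
	return !(atomic_fetch_or(&p->flags, DEMO_LOCKED) & DEMO_LOCKED);
}

/*
 * Slow path entry: announce a waiter before sleeping.  Following the
 * thread, the waiter never clears the bit itself, because another
 * waiter may be relying on it.
 */
static void demo_mark_waiter(struct demo_page *p)
{
	atomic_fetch_or(&p->flags, DEMO_WAITERS);
}

/* Unlock: only touch the (simulated) wait queue when a waiter was flagged. */
static void demo_unlock(struct demo_page *p)
{
	unsigned old;

	/* Plays the role of VM_BUG_ON(!PageLocked(page)). */
	assert(atomic_load(&p->flags) & DEMO_LOCKED);

	old = atomic_fetch_and(&p->flags, ~(unsigned)DEMO_LOCKED);
	if (old & DEMO_WAITERS) {
		atomic_fetch_and(&p->flags, ~(unsigned)DEMO_WAITERS);
		p->wakeups++;	/* stands in for wake_up_page() */
	}
}
```

In this model the race Hugh describes corresponds to a second task calling a "clear DEMO_WAITERS" step after its own successful trylock: a later demo_unlock() would then see no waiters bit and skip the wake-up while another task is still asleep.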
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-03 1:32 ` Nick Piggin 2007-05-03 10:37 ` Christoph Hellwig 2007-05-03 12:24 ` Hugh Dickins @ 2007-05-03 16:52 ` Andrew Morton 2007-05-04 4:16 ` Nick Piggin 2 siblings, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-03 16:52 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > void fastcall unlock_page(struct page *page) > { > + VM_BUG_ON(!PageLocked(page)); > smp_mb__before_clear_bit(); > - if (!TestClearPageLocked(page)) > - BUG(); > - smp_mb__after_clear_bit(); > - wake_up_page(page, PG_locked); > + ClearPageLocked(page); > + if (unlikely(test_bit(PG_waiters, &page->flags))) { > + clear_bit(PG_waiters, &page->flags); > + wake_up_page(page, PG_locked); > + } > } Why is that significantly faster than plain old wake_up_page(), which tests waitqueue_active()? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes
2007-05-03 16:52 ` Andrew Morton
@ 2007-05-04 4:16 ` Nick Piggin
0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04 4:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Hugh Dickins, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig

Andrew Morton wrote:
> On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>
>> void fastcall unlock_page(struct page *page)
>> {
>>+ VM_BUG_ON(!PageLocked(page));
>> smp_mb__before_clear_bit();
>>- if (!TestClearPageLocked(page))
>>- BUG();
>>- smp_mb__after_clear_bit();
>>- wake_up_page(page, PG_locked);
>>+ ClearPageLocked(page);
>>+ if (unlikely(test_bit(PG_waiters, &page->flags))) {
>>+ clear_bit(PG_waiters, &page->flags);
>>+ wake_up_page(page, PG_locked);
>>+ }
>> }
>
>
> Why is that significantly faster than plain old wake_up_page(), which
> tests waitqueue_active()?

Because it needs fewer barriers and doesn't touch a random hash
cacheline in the fastpath.

--
SUSE Labs, Novell Inc.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-02 14:00 ` Hugh Dickins 2007-05-03 1:32 ` Nick Piggin @ 2007-05-09 12:34 ` Nick Piggin 2007-05-09 14:28 ` Hugh Dickins 1 sibling, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-09 12:34 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1280 bytes --] Hugh Dickins wrote: > On Wed, 2 May 2007, Nick Piggin wrote: >>>But I'm pretty sure (to use your words!) regular truncate was not racy >>>before: I believe Andrea's sequence count was handling that case fine, >>>without a second unmap_mapping_range. >> >>OK, I think you're right. I _think_ it should also be OK with the >>lock_page version as well: we should not be able to have any pages >>after the first unmap_mapping_range call, because of the i_size >>write. So if we have no pages, there is nothing to 'cow' from. > > > I'd be delighted if you can remove those later unmap_mapping_ranges. > As I recall, the important thing for the copy pages is to be holding > the page lock (or whatever other serialization) on the copied page > still while the copy page is inserted into pagetable: that looks > to be so in your __do_fault. Hmm, on second thoughts, I think I was right the first time, and do need the unmap after the pages are truncated. With the lock_page code, after the first unmap, we can get new ptes mapping pages, and subsequently they can be COWed and then the original pte zapped before the truncate loop checks it. However, I wonder if we can't test mapping_mapped before the spinlock, which would make most truncates cheaper? -- SUSE Labs, Novell Inc. 
[-- Attachment #2: mm-truncate-avoid-rmap-locks --] [-- Type: text/plain, Size: 1053 bytes --] Index: linux-2.6/mm/filemap.c =================================================================== --- linux-2.6.orig/mm/filemap.c 2007-04-24 15:02:51.000000000 +1000 +++ linux-2.6/mm/filemap.c 2007-05-09 17:30:47.000000000 +1000 @@ -2579,8 +2579,7 @@ if (rw == WRITE) { write_len = iov_length(iov, nr_segs); end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT; - if (mapping_mapped(mapping)) - unmap_mapping_range(mapping, offset, write_len, 0); + unmap_mapping_range(mapping, offset, write_len, 0); } retval = filemap_write_and_wait(mapping); Index: linux-2.6/mm/memory.c =================================================================== --- linux-2.6.orig/mm/memory.c 2007-05-09 17:25:28.000000000 +1000 +++ linux-2.6/mm/memory.c 2007-05-09 17:30:22.000000000 +1000 @@ -1956,6 +1956,9 @@ pgoff_t hba = holebegin >> PAGE_SHIFT; pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT; + if (!mapping_mapped(mapping)) + return; + /* Check for overflow. */ if (sizeof(holelen) > sizeof(hlen)) { long long holeend = ^ permalink raw reply [flat|nested] 233+ messages in thread
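[Editorial sketch, not part of the thread.] The second hunk of the attached patch adds an early return when nothing maps the file, so the lock is only taken when there is work to do. A minimal userspace model of that shape might look like the following. It assumes a toy spinlock in place of the kernel lock and an atomic counter in place of whatever mapping_mapped() inspects; all `demo_*` names are hypothetical, and ordering against other stores (such as the i_size update) is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space; not kernel code. */
struct demo_mapping {
	atomic_flag lock;	/* toy spinlock standing in for the rmap lock */
	atomic_int nr_vmas;	/* nonzero when the file is mmapped somewhere */
	unsigned lock_taken;	/* how often the slow path ran, illustration only */
};

/* Returns true if the locked slow path ran, false if it was skipped. */
static bool demo_unmap_range(struct demo_mapping *m)
{
	assert(m != NULL);

	/* Cheap unlocked test first, like the added mapping_mapped() check. */
	if (atomic_load(&m->nr_vmas) == 0)
		return false;

	while (atomic_flag_test_and_set(&m->lock))
		;	/* spin; a real kernel lock would not busy-wait like this */
	m->lock_taken++;
	/* ... the real function would walk the vmas and zap ptes here ... */
	atomic_flag_clear(&m->lock);
	return true;
}
```

With no mappings, a call returns without touching the lock at all, which is the saving Nick is after for the common truncate of an unmapped file.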
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-09 12:34 ` Nick Piggin @ 2007-05-09 14:28 ` Hugh Dickins 2007-05-09 14:45 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-09 14:28 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig On Wed, 9 May 2007, Nick Piggin wrote: > Hugh Dickins wrote: > > On Wed, 2 May 2007, Nick Piggin wrote: > > > > >But I'm pretty sure (to use your words!) regular truncate was not racy > > > >before: I believe Andrea's sequence count was handling that case fine, > > > >without a second unmap_mapping_range. > > > > > >OK, I think you're right. I _think_ it should also be OK with the > > >lock_page version as well: we should not be able to have any pages > > >after the first unmap_mapping_range call, because of the i_size > > >write. So if we have no pages, there is nothing to 'cow' from. > > > > I'd be delighted if you can remove those later unmap_mapping_ranges. > > As I recall, the important thing for the copy pages is to be holding > > the page lock (or whatever other serialization) on the copied page > > still while the copy page is inserted into pagetable: that looks > > to be so in your __do_fault. > > Hmm, on second thoughts, I think I was right the first time, and do > need the unmap after the pages are truncated. With the lock_page code, > after the first unmap, we can get new ptes mapping pages, and > subsequently they can be COWed and then the original pte zapped before > the truncate loop checks it. The filesystem (or page cache) allows pages beyond i_size to come in there? That wasn't a problem before, was it? But now it is? > > However, I wonder if we can't test mapping_mapped before the spinlock, > which would make most truncates cheaper? Slightly cheaper, yes, though I doubt it'd be much in comparison with actually doing any work in unmap_mapping_range or truncate_inode_pages. 
Suspect you'd need a barrier of some kind between the i_size_write and the mapping_mapped test? But that's a change we could have made at any time if we'd bothered, it's not really the issue here. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-09 14:28 ` Hugh Dickins @ 2007-05-09 14:45 ` Nick Piggin 2007-05-09 15:38 ` Hugh Dickins 0 siblings, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-09 14:45 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig Hugh Dickins wrote: > On Wed, 9 May 2007, Nick Piggin wrote: > >>Hugh Dickins wrote: >> >>>On Wed, 2 May 2007, Nick Piggin wrote: >> >>>>>But I'm pretty sure (to use your words!) regular truncate was not racy >>>>>before: I believe Andrea's sequence count was handling that case fine, >>>>>without a second unmap_mapping_range. >>>> >>>>OK, I think you're right. I _think_ it should also be OK with the >>>>lock_page version as well: we should not be able to have any pages >>>>after the first unmap_mapping_range call, because of the i_size >>>>write. So if we have no pages, there is nothing to 'cow' from. >>> >>>I'd be delighted if you can remove those later unmap_mapping_ranges. >>>As I recall, the important thing for the copy pages is to be holding >>>the page lock (or whatever other serialization) on the copied page >>>still while the copy page is inserted into pagetable: that looks >>>to be so in your __do_fault. >> >>Hmm, on second thoughts, I think I was right the first time, and do >>need the unmap after the pages are truncated. With the lock_page code, >>after the first unmap, we can get new ptes mapping pages, and >>subsequently they can be COWed and then the original pte zapped before >>the truncate loop checks it. > > > The filesystem (or page cache) allows pages beyond i_size to come > in there? That wasn't a problem before, was it? But now it is? The filesystem still doesn't, but if i_size is updated after the page is returned, we can have a problem that was previously taken care of with the truncate_count but now isn't. 
>>However, I wonder if we can't test mapping_mapped before the spinlock, >>which would make most truncates cheaper? > > > Slightly cheaper, yes, though I doubt it'd be much in comparison with > actually doing any work in unmap_mapping_range or truncate_inode_pages. But if we're supposing the common case for truncate is unmapped mappings, then the main cost there will be the locking, which I'm trying to avoid. Hopefully with this patch, most truncate workloads would get faster, even though truncate mapped files is going to be unavoidably slower. > Suspect you'd need a barrier of some kind between the i_size_write and > the mapping_mapped test? The unmap_mapping_range that runs after the truncate_inode_pages should run in the correct order, I believe. > But that's a change we could have made at > any time if we'd bothered, it's not really the issue here. I don't see how you could, because you need to increment truncate_count. But I believe this is fixing the issue, even if it does so in a peripheral manner, because it avoids the added cost for unmapped files. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-09 14:45 ` Nick Piggin @ 2007-05-09 15:38 ` Hugh Dickins 2007-05-09 22:24 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-09 15:38 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig On Thu, 10 May 2007, Nick Piggin wrote: > > > > The filesystem (or page cache) allows pages beyond i_size to come > > in there? That wasn't a problem before, was it? But now it is? > > The filesystem still doesn't, but if i_size is updated after the page > is returned, we can have a problem that was previously taken care of > with the truncate_count but now isn't. But... I thought the page lock was now taking care of that in your scheme? truncate_inode_pages has to wait for the page lock, then it finds the page is mapped and... ahh, it finds the copiee page is not mapped, so doesn't do its own little unmap_mapping_range, and the copied page squeaks through. Drat. I really think the truncate_count solution worked better, for truncation anyway. There may be persuasive reasons you need the page lock for invalidation: I gave up on trying to understand the required behaviour(s) for invalidation. So, bring back (the original use of, not my tree marker use of) truncate_count? Hmm, you probably don't want to do that, because there was some pleasure in removing the strange barriers associated with it. A second unmap_mapping_range is just one line of code - but it sure feels like a defeat to me, calling the whole exercise into question. (But then, you'd be right to say my perfectionism made it impossible for me to come up with any solution to the invalidation issues.) > > Suspect you'd need a barrier of some kind between the i_size_write and > > the mapping_mapped test? > > The unmap_mapping_range that runs after the truncate_inode_pages should > run in the correct order, I believe. 
Yes, if there's going to be that backup call, the first won't really need a barrier. > > But that's a change we could have made at > > any time if we'd bothered, it's not really the issue here. > > I don't see how you could, because you need to increment truncate_count. Though indeed we did so, I don't see that we needed to increment truncate_count in that case (nobody could be coming through do_no_page on that file, when there are no mappings of it). > But I believe this is fixing the issue, even if it does so in a peripheral > manner, because it avoids the added cost for unmapped files. It's a small improvement to your common case, I agree. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- vm bugfixes 2007-05-09 15:38 ` Hugh Dickins @ 2007-05-09 22:24 ` Nick Piggin 0 siblings, 0 replies; 233+ messages in thread From: Nick Piggin @ 2007-05-09 22:24 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli, Christoph Hellwig Hugh Dickins wrote: > On Thu, 10 May 2007, Nick Piggin wrote: > >>>The filesystem (or page cache) allows pages beyond i_size to come >>>in there? That wasn't a problem before, was it? But now it is? >> >>The filesystem still doesn't, but if i_size is updated after the page >>is returned, we can have a problem that was previously taken care of >>with the truncate_count but now isn't. > > > But... I thought the page lock was now taking care of that in your > scheme? truncate_inode_pages has to wait for the page lock, then > it finds the page is mapped and... ahh, it finds the copiee page > is not mapped, so doesn't do its own little unmap_mapping_range, > and the copied page squeaks through. Drat. > > I really think the truncate_count solution worked better, for > truncation anyway. There may be persuasive reasons you need the > page lock for invalidation: I gave up on trying to understand the > required behaviour(s) for invalidation. > > So, bring back (the original use of, not my tree marker use of) > truncate_count? Hmm, you probably don't want to do that, because > there was some pleasure in removing the strange barriers associated > with it. > > A second unmap_mapping_range is just one line of code - but it sure > feels like a defeat to me, calling the whole exercise into question. > (But then, you'd be right to say my perfectionism made it impossible > for me to come up with any solution to the invalidation issues.) Well we could bring back the truncate_count, but I think that sucks because that's moving work into the page fault handler in order to avoid a bit of work when truncating mapped files. 
>>>But that's a change we could have made at >>>any time if we'd bothered, it's not really the issue here. >> >>I don't see how you could, because you need to increment truncate_count. > > > Though indeed we did so, I don't see that we needed to increment > truncate_count in that case (nobody could be coming through > do_no_page on that file, when there are no mappings of it). Of course :P -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* pcmcia ioctl removal
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
` (8 preceding siblings ...)
2007-05-01 8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
@ 2007-05-01 8:46 ` Christoph Hellwig
2007-05-01 8:56 ` Russell King
` (3 more replies)
2007-05-01 8:48 ` pci hotplug patches Christoph Hellwig
` (12 subsequent siblings)
22 siblings, 4 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01 8:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm, linux-pcmcia

> pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch

...

> Dominik is busy. Will probably re-review and send these direct to Linus.

The patch above is the removal of cardmgr support. While I'd love to
see this cruft gone, it definitely needs maintainer judgement on whether
the time has come that no one relies on cardmgr anymore.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal
2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig
@ 2007-05-01 8:56 ` Russell King
2007-05-01 8:57 ` Willy Tarreau
` (2 subsequent siblings)
3 siblings, 0 replies; 233+ messages in thread
From: Russell King @ 2007-05-01 8:56 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia

On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
>
> ...
>
> > Dominik is busy. Will probably re-review and send these direct to Linus.
>
> The patch above is the removal of cardmgr support. While I'd love to
> see this cruft gone it definitively needs maintainer judgement on whether
> they time has come that no one relies on cardmgr anymore.

And I still run and use a platform where the GUI issues cardmgr ioctls.

A recent kernel upgrade (from 2.6.9ish to something more recent) broke
the "eject" GUI button applet due to the fscking with the cardmgr ioctls,
and it thinks the wireless card is always plugged in (and therefore the
signal strength meter remains.)

With all the ioctls gone I'll probably lose the signal strength meter.

And no, I don't have the resources (read: code) to fix and rebuild
userspace since I didn't snarf them when the CVS server was alive a
few years back.

That's the problem with API changes - things always break.

--
Russell King
 Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig 2007-05-01 8:56 ` Russell King @ 2007-05-01 8:57 ` Willy Tarreau 2007-05-01 9:08 ` Andrew Morton 2007-05-01 9:16 ` Robert P. J. Day 2007-05-09 12:54 ` Pavel Machek 3 siblings, 1 reply; 233+ messages in thread From: Willy Tarreau @ 2007-05-01 8:57 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia Hi Christoph, On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote: > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > ... > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > The patch above is the removal of cardmgr support. While I'd love to > see this cruft gone it definitively needs maintainer judgement on whether > they time has come that no one relies on cardmgr anymore. Well, I've not followed evolutions in this area for a long time. Here's what I get on my notebook : willy@wtap:~$ uname -r 2.6.20-wt3-wtap willy@wtap:~$ ps auxw|grep card root 1216 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] root 1221 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] root 1244 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] root 1251 0.0 0.0 0 0 ? Ss Apr28 0:00 /sbin/cardmgr What's the new recommended way of using PCMCIA cards when cardmgr is gone ? Thanks, Willy ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal
2007-05-01 8:57 ` Willy Tarreau
@ 2007-05-01 9:08 ` Andrew Morton
2007-05-01 14:46 ` Adrian Bunk
0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 9:08 UTC (permalink / raw)
To: Willy Tarreau; +Cc: Christoph Hellwig, linux-kernel, linux-mm, linux-pcmcia

On Tue, 1 May 2007 10:57:10 +0200 Willy Tarreau <w@1wt.eu> wrote:

> Hi Christoph,
>
> On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> >
> > ...
> >
> > > Dominik is busy. Will probably re-review and send these direct to Linus.
> >
> > The patch above is the removal of cardmgr support. While I'd love to
> > see this cruft gone it definitively needs maintainer judgement on whether
> > they time has come that no one relies on cardmgr anymore.
>
> Well, I've not followed evolutions in this area for a long time. Here's
> what I get on my notebook :
>
> willy@wtap:~$ uname -r
> 2.6.20-wt3-wtap
> willy@wtap:~$ ps auxw|grep card
> root 1216 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd]
> root 1221 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd]
> root 1244 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd]
> root 1251 0.0 0.0 0 0 ? Ss Apr28 0:00 /sbin/cardmgr
>

Yes, that seems premature. feature-removal.txt is pretty useless for
getting people off old tools. If we're ever to make this migration we'll
need loud and scary printks coming out of the kernel. Probably it'll take
another year or two to get there *once* we've done that.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 9:08 ` Andrew Morton @ 2007-05-01 14:46 ` Adrian Bunk 0 siblings, 0 replies; 233+ messages in thread From: Adrian Bunk @ 2007-05-01 14:46 UTC (permalink / raw) To: Andrew Morton Cc: Willy Tarreau, Christoph Hellwig, linux-kernel, linux-mm, linux-pcmcia On Tue, May 01, 2007 at 02:08:20AM -0700, Andrew Morton wrote: > On Tue, 1 May 2007 10:57:10 +0200 Willy Tarreau <w@1wt.eu> wrote: > > > Hi Christoph, > > > > On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote: > > > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > > > > > ... > > > > > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > > > > > The patch above is the removal of cardmgr support. While I'd love to > > > see this cruft gone it definitively needs maintainer judgement on whether > > > they time has come that no one relies on cardmgr anymore. > > > > Well, I've not followed evolutions in this area for a long time. Here's > > what I get on my notebook : > > > > willy@wtap:~$ uname -r > > 2.6.20-wt3-wtap > > willy@wtap:~$ ps auxw|grep card > > root 1216 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] > > root 1221 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] > > root 1244 0.0 0.0 0 0 ? S< Apr28 0:00 [pccardd] > > root 1251 0.0 0.0 0 0 ? Ss Apr28 0:00 /sbin/cardmgr > > > > Yes, that seems premature. feature-removal.txt is pretty useless for > getting poeple off old tools. If we're ever to make this migration we'll > need loud and scary printks coming out of the kernel. Probably it'll take > another year or two to get there *once* we've done that. You already said the same two years ago, and you forwarded a patch implementing exactly this nearly two years ago: commit c352ec8ab87b065cd2edda171811f49ac7d0d5cd Author: Dominik Brodowski <linux@dominikbrodowski.net> Date: Tue Sep 13 01:25:03 2005 -0700 [PATCH] pcmcia: warn on IOCTL usage More visible user information of scheduled feature removal. 
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/drivers/pcmcia/pcmcia_ioctl.c b/drivers/pcmcia/pcmcia_ioctl.c index 39ba640..80969f7 100644 --- a/drivers/pcmcia/pcmcia_ioctl.c +++ b/drivers/pcmcia/pcmcia_ioctl.c @@ -376,6 +376,7 @@ static int ds_open(struct inode *inode, struct file *file) socket_t i = iminor(inode); struct pcmcia_socket *s; user_info_t *user; + static int warning_printed = 0; ds_dbg(0, "ds_open(socket %d)\n", i); @@ -407,6 +408,17 @@ static int ds_open(struct inode *inode, struct file *file) s->user = user; file->private_data = user; + if (!warning_printed) { + printk(KERN_INFO "pcmcia: Detected deprecated PCMCIA ioctl " + "usage.\n"); + printk(KERN_INFO "pcmcia: This interface will soon be removed from " + "the kernel; please expect breakage unless you upgrade " + "to new tools.\n"); + printk(KERN_INFO "pcmcia: see http://www.kernel.org/pub/linux/" + "utils/kernel/pcmcia/pcmcia.html for details.\n"); + warning_printed = 1; + } + if (s->pcmcia_state.present) queue_event(user, CS_EVENT_CARD_INSERTION); return 0; cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply related [flat|nested] 233+ messages in thread
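[Editorial sketch, not part of the thread.] The commit quoted above uses a function-local static flag so the deprecation notice is printed the first time the legacy interface is opened and never again. The same warn-once shape can be shown in plain userspace C. This is only an illustration: `demo_open_legacy` and its message text are hypothetical, and like the quoted kernel code the flag is a plain variable, so two simultaneous first callers could in principle both print.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical entry point for a deprecated interface.  Returns 1 if the
 * warning was emitted by this call, 0 if it had already been printed.
 */
static int demo_open_legacy(FILE *log)
{
	/* Persists across calls, like warning_printed in ds_open(). */
	static bool warning_printed = false;

	assert(log != NULL);

	if (!warning_printed) {
		fprintf(log, "demo: deprecated interface in use; "
			     "it is scheduled for removal.\n");
		warning_printed = true;
		return 1;
	}
	return 0;
}
```

Calling demo_open_legacy(stderr) repeatedly prints the message only on the first call, which keeps the log readable while still reaching every user who actually exercises the old path.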
* Re: pcmcia ioctl removal 2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig 2007-05-01 8:56 ` Russell King 2007-05-01 8:57 ` Willy Tarreau @ 2007-05-01 9:16 ` Robert P. J. Day 2007-05-01 9:44 ` Willy Tarreau 2007-05-01 10:12 ` Jan Engelhardt 2007-05-09 12:54 ` Pavel Machek 3 siblings, 2 replies; 233+ messages in thread From: Robert P. J. Day @ 2007-05-01 9:16 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Tue, 1 May 2007, Christoph Hellwig wrote: > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > ... > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > The patch above is the removal of cardmgr support. While I'd love > to see this cruft gone it definitively needs maintainer judgement on > whether they time has come that no one relies on cardmgr anymore. since i was the one who submitted the original patch to remove that stuff, let me make an observation. when i submitted a patch to remove, for instance, the traffic shaper since it's clearly obsolete, i was told -- in no uncertain terms -- that that couldn't be done since there had been no warning about its impending removal. fair enough, i can accept that. on the other hand, the features removal file contains the following: ... What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) When: November 2005 ... in other words, the PCMCIA ioctl feature *has* been listed as obsolete for quite some time, and is already a *year and a half* overdue for removal. in short, it's annoying to take the position that stuff can't be deleted without warning, then turn around and be reluctant to remove stuff for which *more than ample warning* has already been given. doing that just makes a joke of the features removal file, and makes you wonder what its purpose is in the first place. a little consistency would be nice here, don't you think? 
rday -- ======================================================================== Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page ======================================================================== ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 9:16 ` Robert P. J. Day @ 2007-05-01 9:44 ` Willy Tarreau 2007-05-01 10:16 ` Robert P. J. Day 2007-05-01 10:26 ` Gabriel C 2007-05-01 10:12 ` Jan Engelhardt 1 sibling, 2 replies; 233+ messages in thread From: Willy Tarreau @ 2007-05-01 9:44 UTC (permalink / raw) To: Robert P. J. Day Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote: > On Tue, 1 May 2007, Christoph Hellwig wrote: > > > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > > > ... > > > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > > > The patch above is the removal of cardmgr support. While I'd love > > to see this cruft gone it definitively needs maintainer judgement on > > whether they time has come that no one relies on cardmgr anymore. > > since i was the one who submitted the original patch to remove that > stuff, let me make an observation. > > when i submitted a patch to remove, for instance, the traffic shaper > since it's clearly obsolete, i was told -- in no uncertain terms -- > that that couldn't be done since there had been no warning about its > impending removal. > > fair enough, i can accept that. > > on the other hand, the features removal file contains the following: > > ... > What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) > When: November 2005 > ... > > in other words, the PCMCIA ioctl feature *has* been listed as obsolete > for quite some time, and is already a *year and a half* overdue for > removal. > > in short, it's annoying to take the position that stuff can't be > deleted without warning, then turn around and be reluctant to remove > stuff for which *more than ample warning* has already been given. > doing that just makes a joke of the features removal file, and makes > you wonder what its purpose is in the first place. > > a little consistency would be nice here, don't you think? 
No, it just shows how useless this file is. What is needed is a big
warning during usage, not a file that nobody reads. Facts are :

  - 90% of people here do not even know that this file exists
  - 80% of the people who know about it do not consult it on a regular basis
  - 80% of those who consult it on a regular basis are not concerned
  - 75% of statistics are invented

=> only 20% of 20% of 10% of those who read LKML know that one feature
   they are concerned about will soon be removed = 0.4% of LKML readers.

If you put a warning in kernel messages (as I've seen for a long time
about tcpdump using obsolete AF_PACKET), close to 100% of the users
of the obsolete code who are likely to change their kernels will notice
it.

I'm sorry for your patch which may get delayed a lot. You would spend
less time stuffing warnings in areas affected by scheduled removal.

BTW, I'm not even against the end of cardmgr support, it's just that
I don't know what the alternative is, and I suspect that many users
do not either. A big warning would have brought them to Google, which
would have provided them with suggestions for alternatives.

Willy

^ permalink raw reply [flat|nested] 233+ messages in thread
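The 0.4% figure in the message above is a chain of three percentages (10% know the file exists, 20% of those consult it regularly, 20% of those are affected by a given entry). A minimal sketch of that arithmetic, using integer basis points to avoid floating point; the function name `combined_share_bp` is made up for illustration and is not from the thread:

```c
#include <assert.h>

/*
 * Combine three successive percentages into one share, expressed in
 * basis points (1 bp = 0.01%, so 10000 bp = 100%).
 *
 * The product of three percentages is a fraction out of 100^3;
 * dividing by 100 rescales it to a fraction out of 10000.
 */
static unsigned int combined_share_bp(unsigned int know_pct,
                                      unsigned int consult_pct,
                                      unsigned int concerned_pct)
{
    assert(know_pct <= 100 && consult_pct <= 100 && concerned_pct <= 100);
    return know_pct * consult_pct * concerned_pct / 100;
}
```

With the numbers from the message, `combined_share_bp(10, 20, 20)` gives 40 basis points, i.e. 0.4%.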
* Re: pcmcia ioctl removal 2007-05-01 9:44 ` Willy Tarreau @ 2007-05-01 10:16 ` Robert P. J. Day 2007-05-01 10:26 ` Gabriel C 1 sibling, 0 replies; 233+ messages in thread From: Robert P. J. Day @ 2007-05-01 10:16 UTC (permalink / raw) To: Willy Tarreau Cc: Christoph Hellwig, Andrew Morton, Linux kernel mailing list, linux-mm, linux-pcmcia On Tue, 1 May 2007, Willy Tarreau wrote: > On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote: ... snip ... > > in other words, the PCMCIA ioctl feature *has* been listed as > > obsolete for quite some time, and is already a *year and a half* > > overdue for removal. > > > > in short, it's annoying to take the position that stuff can't be > > deleted without warning, then turn around and be reluctant to remove > > stuff for which *more than ample warning* has already been given. > > doing that just makes a joke of the features removal file, and makes > > you wonder what its purpose is in the first place. > > > > a little consistency would be nice here, don't you think? > > No, it just shows how useless this file is. agreed. it's mildly entertaining to have watched this raging discussion over the last few days regarding bugs and emails and bugzilla and adrian's regressions, while the one feature that's meant to track aging and removable kernel features is essentially valueless, and no one seems to care. > What is needed is a big warning during usage, not a file that nobody > reads. agreed there as well. but short of that, it would still be nice if people took a minute, perused the feature removal file, and at least brought it up-to-date. 
if it's going to have any value, then:

  1) all proposed removal dates should be reviewed to make sure
     they're still meaningful,

  2) stuff that's overdue for removal should be either removed, or
     have its expiry date brought forward, and

  3) stuff in the kernel tree that is understood to be obsolete or
     nearly so should have an entry added to that file, so that the
     clock can at least *start* ticking for that stuff, and you can at
     least say you *tried* to warn current users.

as a start, i posted last month the results of running the simple
command:

  $ grep -iw obsolete $(find . -name Kconfig\*)

and some of what was printed is clearly misleading. (don't worry,
tilman -- we're not going to reopen that whole isdn4linux thing. :-)
i mean, what of the following is actually obsolete:

  * traffic policing
  * IP6 Userspace queueing via NETLINK
  * IP Userspace queueing via NETLINK
  * ebt: ulog support
  * Traffic Shaper

and so on (and there's that legacy PM thing as well).

> I'm sorry for your patch which may get delayed a lot.

obviously, leaving stuff like that in the kernel doesn't actually
*hurt* anything but, yeah, it's a tad annoying to invest a few minutes
to do some janitor work based on what should be killable, submit the
patch, then have people freak out about how that is still an essential
feature.

bottom line: if you want janitor folks to help out with cleanup, make
sure they know what can legitimately be cleaned, and stop wasting
people's time.

rday

p.s. now if there were only a way to, say, tag various kernel
features as "obsolete" or "deprecated" ... :-)

--
========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
========================================================================

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 9:44 ` Willy Tarreau 2007-05-01 10:16 ` Robert P. J. Day @ 2007-05-01 10:26 ` Gabriel C 2007-05-01 10:52 ` Willy Tarreau 1 sibling, 1 reply; 233+ messages in thread From: Gabriel C @ 2007-05-01 10:26 UTC (permalink / raw) To: Willy Tarreau Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia Willy Tarreau wrote: > On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote: > >> On Tue, 1 May 2007, Christoph Hellwig wrote: >> >> >>>> pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch >>>> >>> ... >>> >>> >>>> Dominik is busy. Will probably re-review and send these direct to Linus. >>>> >>> The patch above is the removal of cardmgr support. While I'd love >>> to see this cruft gone it definitively needs maintainer judgement on >>> whether they time has come that no one relies on cardmgr anymore. >>> >> since i was the one who submitted the original patch to remove that >> stuff, let me make an observation. >> >> when i submitted a patch to remove, for instance, the traffic shaper >> since it's clearly obsolete, i was told -- in no uncertain terms -- >> that that couldn't be done since there had been no warning about its >> impending removal. >> >> fair enough, i can accept that. >> >> on the other hand, the features removal file contains the following: >> >> ... >> What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) >> When: November 2005 >> ... >> >> in other words, the PCMCIA ioctl feature *has* been listed as obsolete >> for quite some time, and is already a *year and a half* overdue for >> removal. >> >> in short, it's annoying to take the position that stuff can't be >> deleted without warning, then turn around and be reluctant to remove >> stuff for which *more than ample warning* has already been given. >> doing that just makes a joke of the features removal file, and makes >> you wonder what its purpose is in the first place. 
>> >> a little consistency would be nice here, don't you think? >> > > No, it just shows how useless this file is. What is needed is a big > warning during usage, not a file that nobody reads. Facts are : > > - 90% of people here do not even know that this file exists > - 80% of the people who know about it do not consult it on a regular basis > - 80% of those who consult it on a regular basis are not concerned > - 75% of statistics are invented > > => only 20% of 20% of 10% of those who read LKML know that one feature > they are concerned about will soon be removed = 0.4% of LKML readers. > > If you put a warning in kernel messages (as I've seen for a long time > about tcpdump using obsolete AF_PACKET), close to 100% of the users > of the obsolete code who are likely to change their kernels will notice > it. > > Hmm ? There is already a fat warning in dmesg for a long time now. snip ... pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover. pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools. pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details. ... > I'm sorry for your patch which may get delayed a lot. You would spend > fewer time stuffing warnings in areas affected by scheduled removal. > > BTW, I'm not even against the end of cardmgr support, it's just that > I don't know what the alternative is, and I suspect that many users > do not either. A big warning would have brought them to google who > would have provided them with suggestions for alternatives. > > Willy > > > Regards, Gabriel ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 10:26 ` Gabriel C @ 2007-05-01 10:52 ` Willy Tarreau 0 siblings, 0 replies; 233+ messages in thread From: Willy Tarreau @ 2007-05-01 10:52 UTC (permalink / raw) To: Gabriel C Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Tue, May 01, 2007 at 12:26:48PM +0200, Gabriel C wrote: > Willy Tarreau wrote: > >On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote: > > > >>On Tue, 1 May 2007, Christoph Hellwig wrote: > >> > >> > >>>> pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > >>>> > >>>... > >>> > >>> > >>>>Dominik is busy. Will probably re-review and send these direct to > >>>>Linus. > >>>> > >>>The patch above is the removal of cardmgr support. While I'd love > >>>to see this cruft gone it definitively needs maintainer judgement on > >>>whether they time has come that no one relies on cardmgr anymore. > >>> > >>since i was the one who submitted the original patch to remove that > >>stuff, let me make an observation. > >> > >>when i submitted a patch to remove, for instance, the traffic shaper > >>since it's clearly obsolete, i was told -- in no uncertain terms -- > >>that that couldn't be done since there had been no warning about its > >>impending removal. > >> > >>fair enough, i can accept that. > >> > >>on the other hand, the features removal file contains the following: > >> > >>... > >>What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) > >>When: November 2005 > >>... > >> > >>in other words, the PCMCIA ioctl feature *has* been listed as obsolete > >>for quite some time, and is already a *year and a half* overdue for > >>removal. > >> > >>in short, it's annoying to take the position that stuff can't be > >>deleted without warning, then turn around and be reluctant to remove > >>stuff for which *more than ample warning* has already been given. 
> >>doing that just makes a joke of the features removal file, and makes > >>you wonder what its purpose is in the first place. > >> > >>a little consistency would be nice here, don't you think? > >> > > > >No, it just shows how useless this file is. What is needed is a big > >warning during usage, not a file that nobody reads. Facts are : > > > > - 90% of people here do not even know that this file exists > > - 80% of the people who know about it do not consult it on a regular > > basis > > - 80% of those who consult it on a regular basis are not concerned > > - 75% of statistics are invented > > > >=> only 20% of 20% of 10% of those who read LKML know that one feature > > they are concerned about will soon be removed = 0.4% of LKML readers. > > > >If you put a warning in kernel messages (as I've seen for a long time > >about tcpdump using obsolete AF_PACKET), close to 100% of the users > >of the obsolete code who are likely to change their kernels will notice > >it. > > > > > > Hmm ? There is already a fat warning in dmesg for a long time now. > > > snip > > ... > > pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover. > pcmcia: This interface will soon be removed from the kernel; please > expect breakage unless you upgrade to new tools. > pcmcia: see > http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details. > > ... Oh you're terribly right ! I have grepped my logs and found it lying there too in the middle of the memory probe messages and the IO probe ones. I never noticed it. 
I know it's my fault, but I think that "fat warning" is not really what
could characterize it though, because of the verbose context around it :

pcmcia: parent PCI bridge Memory window: 0x90000000 - 0x903fffff
pcmcia: parent PCI bridge Memory window: 0x30000000 - 0x3dffffff
pccard: PCMCIA card inserted into slot 0
cs: memory probe 0x90000000-0x903fffff: excluding 0x90000000-0x9003ffff 0x90080000-0x900bffff 0x90100000-0x9013ffff 0x90180000-0x901bffff 0x90200000-0x9023ffff 0x90280000-0x902bffff 0x90300000-0x9033ffff 0x90380000-0x903bffff
pcmcia: registering new device pcmcia0.0
pcmcia: Detected deprecated PCMCIA ioctl usage from process: cardmgr.
pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools.
pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
cs: IO port probe 0xc00-0xcff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0x800-0x80f: clean.
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x100-0x3af: excluding 0x140-0x14f 0x378-0x37f
cs: IO port probe 0xa00-0xaff: clean.

Now I have the URL for the details ;-)

BTW, I think we should standardize on some formatting to display messages
about deprecated/obsolete code. Maybe something like this would be more
noticeable :

cs: memory probe 0x90000000-0x903fffff: excluding 0x90000000-0x9003ffff 0x90080000-0x900bffff 0x90100000-0x9013ffff 0x90180000-0x901bffff 0x90200000-0x9023ffff 0x90280000-0x902bffff 0x90300000-0x9033ffff 0x90380000-0x903bffff
pcmcia: registering new device pcmcia0.0
WARNING !!! Detected DEPRECATED PCMCIA ioctl usage from process: cardmgr.
WARNING !!! This process may stop working past November 2005.
WARNING !!! see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
cs: IO port probe 0xc00-0xcff: clean.

Regards,
Willy

^ permalink raw reply [flat|nested] 233+ messages in thread
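A standardized format like the one proposed above could be produced by a single helper, so that every deprecation notice looks the same. The following is only a userspace sketch under that assumption: `format_deprecation_warning` and its parameters are hypothetical names, not an existing kernel interface, and a real kernel version would go through printk rather than a caller-supplied buffer.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: render the three-line "WARNING !!!" block
 * into buf. Returns what snprintf() returns, i.e. the number of
 * characters the full message needs (excluding the terminating NUL),
 * which is >= len when the output was truncated.
 */
static int format_deprecation_warning(char *buf, size_t len,
                                      const char *feature,
                                      const char *process,
                                      const char *removal_date,
                                      const char *url)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "WARNING !!! Detected DEPRECATED %s usage from process: %s.\n"
                    "WARNING !!! This process may stop working past %s.\n"
                    "WARNING !!! see %s for details.\n",
                    feature, process, removal_date, url);
}
```

Called with "PCMCIA ioctl", "cardmgr", "November 2005" and the pcmcia.html URL, this reproduces the three WARNING lines from the example log.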
* Re: pcmcia ioctl removal
  2007-05-01  9:16 ` Robert P. J. Day
  2007-05-01  9:44   ` Willy Tarreau
@ 2007-05-01 10:12   ` Jan Engelhardt
  2007-05-01 11:00     ` Willy Tarreau
  2007-05-01 19:10     ` Russell King
  1 sibling, 2 replies; 233+ messages in thread
From: Jan Engelhardt @ 2007-05-01 10:12 UTC (permalink / raw)
  To: Robert P. J. Day
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia

On May 1 2007 05:16, Robert P. J. Day wrote:
>
>on the other hand, the features removal file contains the following:
>
>...
>What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
>When: November 2005
>...
>
>in other words, the PCMCIA ioctl feature *has* been listed as obsolete
>for quite some time, and is already a *year and a half* overdue for
>removal.
>
>in short, it's annoying to take the position that stuff can't be
>deleted without warning, then turn around and be reluctant to remove
>stuff for which *more than ample warning* has already been given.
>doing that just makes a joke of the features removal file, and makes
>you wonder what its purpose is in the first place.
>
>a little consistency would be nice here, don't you think?

I think this could raise their attention...

init/Makefile
obj-y += obsolete.o

init/obsolete.c:
static __init int obsolete_init(void)
{
        printk("\e[1;31m""

The following stuff is gonna get removed \e[5;37m SOON: \e[0m
 - cardmgr
 - foobar
 - bweebol

");
        schedule_timeout(3 * HZ);
        return;
}

static __exit void obsolete_exit(void) {}

Jan
--

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 10:12 ` Jan Engelhardt @ 2007-05-01 11:00 ` Willy Tarreau 2007-05-01 12:06 ` Konstantin Münning 2007-05-01 13:56 ` Rogan Dawes 2007-05-01 19:10 ` Russell King 1 sibling, 2 replies; 233+ messages in thread From: Willy Tarreau @ 2007-05-01 11:00 UTC (permalink / raw) To: Jan Engelhardt Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote: > > On May 1 2007 05:16, Robert P. J. Day wrote: > > > >on the other hand, the features removal file contains the following: > > > >... > >What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) > >When: November 2005 > >... > > > >in other words, the PCMCIA ioctl feature *has* been listed as obsolete > >for quite some time, and is already a *year and a half* overdue for > >removal. > > > >in short, it's annoying to take the position that stuff can't be > >deleted without warning, then turn around and be reluctant to remove > >stuff for which *more than ample warning* has already been given. > >doing that just makes a joke of the features removal file, and makes > >you wonder what its purpose is in the first place. > > > >a little consistency would be nice here, don't you think? > > I think this could raise their attention... > > init/Makefile > obj-y += obsolete.o > > init/obsolete.c: > static __init int obsolete_init(void) > { > printk("\e[1;31m"" > > The following stuff is gonna get removed \e[5;37m SOON: \e[0m > - cardmgr > - foobar > - bweebol > > "); > schedule_timeout(3 * HZ); > return; > } > > static __exit void obsolete_exit(void) {} There's something I like here : the fact that all features are centralized and not hidden in the noise. Clearly we need some standard inside the kernel to manage obsolete code as well as we currently do by hand. Willy ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 11:00 ` Willy Tarreau @ 2007-05-01 12:06 ` Konstantin Münning 2007-05-01 13:56 ` Rogan Dawes 1 sibling, 0 replies; 233+ messages in thread From: Konstantin Münning @ 2007-05-01 12:06 UTC (permalink / raw) To: Willy Tarreau Cc: Jan Engelhardt, linux-pcmcia, linux-kernel, linux-mm, Robert P. J. Day, Andrew Morton Willy Tarreau wrote: > On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote: >> On May 1 2007 05:16, Robert P. J. Day wrote: >>> on the other hand, the features removal file contains the following: >>> >>> ... >>> What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) >>> When: November 2005 >>> ... >>> >>> in other words, the PCMCIA ioctl feature *has* been listed as obsolete >>> for quite some time, and is already a *year and a half* overdue for >>> removal. >>> >>> in short, it's annoying to take the position that stuff can't be >>> deleted without warning, then turn around and be reluctant to remove >>> stuff for which *more than ample warning* has already been given. >>> doing that just makes a joke of the features removal file, and makes >>> you wonder what its purpose is in the first place. >>> >>> a little consistency would be nice here, don't you think? >> I think this could raise their attention... >> >> init/Makefile >> obj-y += obsolete.o >> >> init/obsolete.c: >> static __init int obsolete_init(void) >> { >> printk("\e[1;31m"" >> >> The following stuff is gonna get removed \e[5;37m SOON: \e[0m >> - cardmgr >> - foobar >> - bweebol >> >> "); >> schedule_timeout(3 * HZ); >> return; >> } >> >> static __exit void obsolete_exit(void) {} > > There's something I like here : the fact that all features are centralized > and not hidden in the noise. Clearly we need some standard inside the kernel > to manage obsolete code as well as we currently do by hand. > > Willy What about something like the tainted flag which status can be displayed easily? 
And even better when a list of the used obsolete features can be displayed as well on request? This way you don't need to search the logs. A standardized obsolete function like the one above could do all the job. Just my 2 cents. -- Konstantin Münning ^ permalink raw reply [flat|nested] 233+ messages in thread
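The idea above, a taint-style flag plus a list of the obsolete features actually used, could look roughly like this. It is a userspace sketch only: the enum, the names and the two functions are invented for illustration and do not correspond to any existing kernel API.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical feature IDs; not an existing kernel enum. */
enum obsolete_feature {
    OBSOLETE_PCMCIA_IOCTL,
    OBSOLETE_TRAFFIC_SHAPER,
    OBSOLETE_FEATURE_COUNT
};

static const char *const obsolete_names[OBSOLETE_FEATURE_COUNT] = {
    "pcmcia-ioctl",
    "traffic-shaper",
};

/* One bit per feature, set the first time that feature is used. */
static unsigned long obsolete_used_mask;

/* Record a use. Returns 1 on first use (a good moment to log), else 0. */
static int obsolete_mark_used(enum obsolete_feature f)
{
    unsigned long bit;

    assert((int)f >= 0 && (int)f < (int)OBSOLETE_FEATURE_COUNT);
    bit = 1UL << (int)f;
    if (obsolete_used_mask & bit)
        return 0;
    obsolete_used_mask |= bit;
    return 1;
}

/*
 * Write a space-separated list of the features used so far into buf.
 * Returns how many names were written; names that do not fit are
 * left out rather than written partially.
 */
static int obsolete_list_used(char *buf, size_t len)
{
    size_t off = 0;
    int count = 0;
    int f;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (f = 0; f < (int)OBSOLETE_FEATURE_COUNT; f++) {
        int n;

        if (!(obsolete_used_mask & (1UL << f)))
            continue;
        n = snprintf(buf + off, len - off, "%s%s",
                     count ? " " : "", obsolete_names[f]);
        if (n < 0 || (size_t)n >= len - off) {
            buf[off] = '\0';    /* drop the partial entry */
            break;
        }
        off += (size_t)n;
        count++;
    }
    return count;
}
```

The mask plays the role of the flag that can be checked cheaply, while `obsolete_list_used` is the "display on request" part, so nobody has to search the logs.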
* Re: pcmcia ioctl removal 2007-05-01 11:00 ` Willy Tarreau 2007-05-01 12:06 ` Konstantin Münning @ 2007-05-01 13:56 ` Rogan Dawes 1 sibling, 0 replies; 233+ messages in thread From: Rogan Dawes @ 2007-05-01 13:56 UTC (permalink / raw) To: Willy Tarreau Cc: Jan Engelhardt, linux-pcmcia, linux-kernel, linux-mm, Robert P. J. Day, Andrew Morton Willy Tarreau wrote: > On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote: >> On May 1 2007 05:16, Robert P. J. Day wrote: >>> on the other hand, the features removal file contains the following: >>> >>> ... >>> What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) >>> When: November 2005 >>> ... >>> >>> in other words, the PCMCIA ioctl feature *has* been listed as obsolete >>> for quite some time, and is already a *year and a half* overdue for >>> removal. >>> >>> in short, it's annoying to take the position that stuff can't be >>> deleted without warning, then turn around and be reluctant to remove >>> stuff for which *more than ample warning* has already been given. >>> doing that just makes a joke of the features removal file, and makes >>> you wonder what its purpose is in the first place. >>> >>> a little consistency would be nice here, don't you think? >> I think this could raise their attention... >> >> init/Makefile >> obj-y += obsolete.o >> >> init/obsolete.c: >> static __init int obsolete_init(void) >> { >> printk("\e[1;31m"" >> >> The following stuff is gonna get removed \e[5;37m SOON: \e[0m >> - cardmgr >> - foobar >> - bweebol >> >> "); >> schedule_timeout(3 * HZ); >> return; >> } >> >> static __exit void obsolete_exit(void) {} > > There's something I like here : the fact that all features are centralized > and not hidden in the noise. Clearly we need some standard inside the kernel > to manage obsolete code as well as we currently do by hand. 
>
> Willy

The difference between this function and the PCAP/TCPDUMP warning is
that the warning only showed up when the obsolete functionality was
actually used.

Maybe a mechanism to automatically increase the severity of reporting
as the removal date approaches would be an idea?

i.e. for each new kernel that you build leading up to the removal date,
a severity is calculated based on the time until official removal, and
then, depending on the severity, the message can be logged in various
ways.

Rogan

^ permalink raw reply [flat|nested] 233+ messages in thread
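The escalation suggested above can be sketched as a function from the kernel build date and the scheduled removal date to a severity level. Everything here is an assumption for illustration: the enum, the function name `severity_for_build` and the 3-month and 12-month thresholds are arbitrary choices, not anything proposed in the thread.

```c
#include <assert.h>

/* Hypothetical severity levels, from quietest to loudest. */
enum removal_severity {
    SEVERITY_INFO,      /* removal is more than a year away */
    SEVERITY_NOTICE,    /* removal is within a year */
    SEVERITY_WARNING,   /* removal is within three months */
    SEVERITY_OVERDUE    /* removal month reached or already passed */
};

/*
 * Compare the month the kernel was built with the month the feature
 * is scheduled for removal. Months are 1..12.
 */
static enum removal_severity severity_for_build(int build_year, int build_month,
                                                int removal_year, int removal_month)
{
    int months_left;

    assert(build_month >= 1 && build_month <= 12);
    assert(removal_month >= 1 && removal_month <= 12);

    months_left = (removal_year * 12 + removal_month)
                - (build_year * 12 + build_month);

    if (months_left <= 0)
        return SEVERITY_OVERDUE;
    if (months_left <= 3)
        return SEVERITY_WARNING;
    if (months_left <= 12)
        return SEVERITY_NOTICE;
    return SEVERITY_INFO;
}
```

For the PCMCIA ioctl case discussed in this thread, a kernel built in May 2007 against a removal date of November 2005 comes out 18 months past the date, so it lands in the loudest level.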
* Re: pcmcia ioctl removal 2007-05-01 10:12 ` Jan Engelhardt 2007-05-01 11:00 ` Willy Tarreau @ 2007-05-01 19:10 ` Russell King 2007-05-01 20:41 ` Jan Engelhardt 1 sibling, 1 reply; 233+ messages in thread From: Russell King @ 2007-05-01 19:10 UTC (permalink / raw) To: Jan Engelhardt Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote: > init/obsolete.c: > static __init int obsolete_init(void) > { > printk("\e[1;31m"" > > The following stuff is gonna get removed \e[5;37m SOON: \e[0m > - cardmgr > - foobar > - bweebol > > "); > schedule_timeout(3 * HZ); > return; > } The kernel console isn't VT102 compatible. It doesn't understand any escape codes, at all. Neither does sysklogd. So the above will just end up as rubbish on your console. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 19:10 ` Russell King @ 2007-05-01 20:41 ` Jan Engelhardt 0 siblings, 0 replies; 233+ messages in thread From: Jan Engelhardt @ 2007-05-01 20:41 UTC (permalink / raw) To: Russell King Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On May 1 2007 20:10, Russell King wrote: >On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote: >> init/obsolete.c: >> static __init int obsolete_init(void) >> { >> printk("\e[1;31m"" >> >> The following stuff is gonna get removed \e[5;37m SOON: \e[0m >> - cardmgr >> - foobar >> - bweebol >> >> "); >> schedule_timeout(3 * HZ); >> return; >> } > >The kernel console isn't VT102 compatible. It doesn't understand any >escape codes, at all. Neither does sysklogd. So the above will just >end up as rubbish on your console. It will (should) at least show up as nicely as in the C source code. With escape codes, but largely readable. If someone knows how to directly spew it to tty0 (the current active console - most likely tty1), the better. Anyway, I just wanted to point out how to really highlight it for the user to see. Although then there would be the distros who obscure it with funky bootsplash screens. But hopefully, their users would not need to care too much for old stuff (gets updated through the distro's update mechanism) Jan -- ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig ` (2 preceding siblings ...) 2007-05-01 9:16 ` Robert P. J. Day @ 2007-05-09 12:54 ` Pavel Machek 2007-05-09 13:00 ` Robert P. J. Day 2007-05-09 13:03 ` Adrian Bunk 3 siblings, 2 replies; 233+ messages in thread From: Pavel Machek @ 2007-05-09 12:54 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia Hi! > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > ... > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > The patch above is the removal of cardmgr support. While I'd love to > see this cruft gone it definitively needs maintainer judgement on whether > they time has come that no one relies on cardmgr anymore. I remember needing cardmgr few months ago on sa-1100 arm system. I'm not sure this is obsolete-enough to kill. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-09 12:54 ` Pavel Machek @ 2007-05-09 13:00 ` Robert P. J. Day 2007-05-09 13:03 ` Adrian Bunk 1 sibling, 0 replies; 233+ messages in thread From: Robert P. J. Day @ 2007-05-09 13:00 UTC (permalink / raw) To: Pavel Machek Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Wed, 9 May 2007, Pavel Machek wrote: > Hi! > > > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > > > ... > > > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > > > The patch above is the removal of cardmgr support. While I'd love > > to see this cruft gone it definitively needs maintainer judgement > > on whether they time has come that no one relies on cardmgr > > anymore. > > I remember needing cardmgr few months ago on sa-1100 arm system. I'm > not sure this is obsolete-enough to kill. in that case, someone really should update feature-removal-schedule.txt, which currently reads: What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) When: November 2005 ... rday -- ======================================================================== Robert P. J. Day Linux Consulting, Training and Annoying Kernel Pedantry Waterloo, Ontario, CANADA http://fsdev.net/wiki/index.php?title=Main_Page ======================================================================== ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-09 12:54 ` Pavel Machek 2007-05-09 13:00 ` Robert P. J. Day @ 2007-05-09 13:03 ` Adrian Bunk 2007-05-09 19:11 ` Romano Giannetti 1 sibling, 1 reply; 233+ messages in thread From: Adrian Bunk @ 2007-05-09 13:03 UTC (permalink / raw) To: Pavel Machek Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote: > Hi! > > > > pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch > > > > ... > > > > > Dominik is busy. Will probably re-review and send these direct to Linus. > > > > The patch above is the removal of cardmgr support. While I'd love to > > see this cruft gone it definitively needs maintainer judgement on whether > > they time has come that no one relies on cardmgr anymore. > > I remember needing cardmgr few months ago on sa-1100 arm system. I'm > not sure this is obsolete-enough to kill. Why didn't pcmciautils work? > Pavel cu Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal 2007-05-09 13:03 ` Adrian Bunk @ 2007-05-09 19:11 ` Romano Giannetti 2007-05-10 12:40 ` Adrian Bunk 0 siblings, 1 reply; 233+ messages in thread From: Romano Giannetti @ 2007-05-09 19:11 UTC (permalink / raw) To: Adrian Bunk Cc: Pavel Machek, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia On Wed, 2007-05-09 at 15:03 +0200, Adrian Bunk wrote: > On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote: > relies on cardmgr anymore. > > > > I remember needing cardmgr few months ago on sa-1100 arm system. I'm > > not sure this is obsolete-enough to kill. > > Why didn't pcmciautils work? I have had a problem until 2.6.20 was out with pcmciautils (it did not recognise the second function of multi-functions pcmcia cards that needed a firmware .cis file), and the only way to use it was with cardmgr, way after Nov 2005 :-). Now it is solved (modulo that sometime the pcmcia modem is ttyS1, sometime ttyS2, but that's another history --- and probably my fault). But I wonder if similar problems are hidden away... what about put the ioctls under a normally-disabled option and let a kernel out with it? Romano ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pcmcia ioctl removal
  2007-05-09 19:11 ` Romano Giannetti
@ 2007-05-10 12:40   ` Adrian Bunk
  0 siblings, 0 replies; 233+ messages in thread
From: Adrian Bunk @ 2007-05-10 12:40 UTC (permalink / raw)
  To: Romano Giannetti
  Cc: Pavel Machek, Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, linux-pcmcia

On Wed, May 09, 2007 at 09:11:52PM +0200, Romano Giannetti wrote:
> On Wed, 2007-05-09 at 15:03 +0200, Adrian Bunk wrote:
> > On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote:
> > relies on cardmgr anymore.
> > >
> > > I remember needing cardmgr few months ago on sa-1100 arm system. I'm
> > > not sure this is obsolete-enough to kill.
> >
> > Why didn't pcmciautils work?
>
> I have had a problem until 2.6.20 was out with pcmciautils (it did not
> recognise the second function of multi-functions pcmcia cards that
> needed a firmware .cis file), and the only way to use it was with
> cardmgr, way after Nov 2005 :-).
>
> Now it is solved (modulo that sometime the pcmcia modem is ttyS1,
> sometime ttyS2, but that's another history --- and probably my fault).
> But I wonder if similar problems are hidden away... what about put the
> ioctls under a normally-disabled option and let a kernel out with it?

It already prints a runtime warning to the user since 2005.

And people won't notice a changed default when using "make oldconfig".

Are there any known regressions left?

Otherwise, the best way for getting problem reports for pcmciautils is
to remove the ioctl (that's an experience from similar cases in other
parts of the kernel)...

> Romano

cu
Adrian

--

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

^ permalink raw reply [flat|nested] 233+ messages in thread
* pci hotplug patches 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (9 preceding siblings ...) 2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig @ 2007-05-01 8:48 ` Christoph Hellwig 2007-05-02 3:57 ` Greg KH 2007-05-01 8:54 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Christoph Hellwig ` (11 subsequent siblings) 22 siblings, 1 reply; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 8:48 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, greg > fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch > remove-pci_dac_dma_-apis.patch > round_up-macro-cleanup-in-drivers-pci.patch > pcie-remove-spin_lock_unlocked.patch > cpqphp-partially-convert-to-use-the-kthread-api.patch > ibmphp-partially-convert-to-use-the-kthreads-api.patch > cpci_hotplug-partially-convert-to-use-the-kthread-api.patch > msi-fix-arm-compile.patch > support-pci-mcfg-space-on-intel-i915-bridges.patch > pci-syscallc-switch-to-refcounting-api.patch > > Stuff to (various levels of re-)send to Greg for the PCI tree. I'll probably > drop the kthread patches as they seemed a bit half-baked and I've lost track > of which ones have which levels of baking. All the partial kthread conversions were superseded by full conversions from me. I've only got feedback from the cpci maintainer, and he acked my patch together with a simple fix from him. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pci hotplug patches 2007-05-01 8:48 ` pci hotplug patches Christoph Hellwig @ 2007-05-02 3:57 ` Greg KH 2007-05-13 20:59 ` Christoph Hellwig 0 siblings, 1 reply; 233+ messages in thread From: Greg KH @ 2007-05-02 3:57 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm Cc: kristen.c.accardi On Tue, May 01, 2007 at 09:48:41AM +0100, Christoph Hellwig wrote: > > fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch > > remove-pci_dac_dma_-apis.patch > > round_up-macro-cleanup-in-drivers-pci.patch > > pcie-remove-spin_lock_unlocked.patch > > cpqphp-partially-convert-to-use-the-kthread-api.patch > > ibmphp-partially-convert-to-use-the-kthreads-api.patch > > cpci_hotplug-partially-convert-to-use-the-kthread-api.patch > > msi-fix-arm-compile.patch > > support-pci-mcfg-space-on-intel-i915-bridges.patch > > pci-syscallc-switch-to-refcounting-api.patch > > > > Stuff to (various levels of re-)send to Greg for the PCI tree. I'll probably > > drop the kthread patches as they seemed a bit half-baked and I've lost track > > of which ones have which levels of baking. > > All the partially kthread conversion were superceed with full conversion > from me. I've only got feedback from the cpci maintainer, and he acked > my patch together with a simple fix from him. Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't added them to my tree. It would probably be best for everyone involved to send them to her instead :) thanks, greg k-h ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pci hotplug patches 2007-05-02 3:57 ` Greg KH @ 2007-05-13 20:59 ` Christoph Hellwig 2007-05-14 11:48 ` Greg KH 0 siblings, 1 reply; 233+ messages in thread From: Christoph Hellwig @ 2007-05-13 20:59 UTC (permalink / raw) To: Greg KH Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, kristen.c.accardi On Tue, May 01, 2007 at 08:57:45PM -0700, Greg KH wrote: > Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't > added them to my tree. It would probably be best for everyone involved > to send them to her instead :) FYI: MAINTAINERS still lists you as the maintainer of the cpqphp driver. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: pci hotplug patches 2007-05-13 20:59 ` Christoph Hellwig @ 2007-05-14 11:48 ` Greg KH 0 siblings, 0 replies; 233+ messages in thread From: Greg KH @ 2007-05-14 11:48 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm, kristen.c.accardi On Sun, May 13, 2007 at 09:59:24PM +0100, Christoph Hellwig wrote: > On Tue, May 01, 2007 at 08:57:45PM -0700, Greg KH wrote: > > Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't > > added them to my tree. It would probably be best for everyone involved > > to send them to her instead :) > > FYI: MAINTAINERS still lists you as the maintainer of the cpqphp driver. Ick, I'll go fix that up, I don't even have the hardware anymore (donated it to a local university...) thanks, greg k-h ^ permalink raw reply [flat|nested] 233+ messages in thread
* cache-pipe-buf-page-address-for-non-highmem-arch.patch 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (10 preceding siblings ...) 2007-05-01 8:48 ` pci hotplug patches Christoph Hellwig @ 2007-05-01 8:54 ` Christoph Hellwig [not found] ` <20070501020441.10b6a003.akpm@linux-foundation.org> 2007-05-01 8:55 ` consolidate-generic_writepages-and-mpage_writepages.patch Christoph Hellwig ` (10 subsequent siblings) 22 siblings, 1 reply; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 8:54 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, kenchen > cache-pipe-buf-page-address-for-non-highmem-arch.patch I still don't like this one at all. If page_address on x86_64 is too slow we should fix the root cause. ^ permalink raw reply [flat|nested] 233+ messages in thread
[parent not found: <20070501020441.10b6a003.akpm@linux-foundation.org>]
* Re: cache-pipe-buf-page-address-for-non-highmem-arch.patch [not found] ` <20070501020441.10b6a003.akpm@linux-foundation.org> @ 2007-05-03 3:48 ` Ken Chen 0 siblings, 0 replies; 233+ messages in thread From: Ken Chen @ 2007-05-03 3:48 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Hellwig, linux-kernel, linux-mm, Andi Kleen On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote: > Fair enough, it is a bit of an ugly thing. And I see no measurements there > on what the overall speedup was for any workload. > > Ken, which memory model was in use? sparsemem? discontigmem with config_numa on. ^ permalink raw reply [flat|nested] 233+ messages in thread
* consolidate-generic_writepages-and-mpage_writepages.patch 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (11 preceding siblings ...) 2007-05-01 8:54 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Christoph Hellwig @ 2007-05-01 8:55 ` Christoph Hellwig 2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg ` (9 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 8:55 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm > consolidate-generic_writepages-and-mpage_writepages.patch > > Might merge. I forget what happened to this. ACK from me on this one. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (12 preceding siblings ...) 2007-05-01 8:55 ` consolidate-generic_writepages-and-mpage_writepages.patch Christoph Hellwig @ 2007-05-01 9:17 ` Pekka Enberg 2007-05-01 9:24 ` Christoph Hellwig ` (2 more replies) 2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman ` (8 subsequent siblings) 22 siblings, 3 replies; 233+ messages in thread From: Pekka Enberg @ 2007-05-01 9:17 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote: > revoke-special-mmap-handling.patch [snip] > Hold. This is tricky stuff and I don't think we've seen sufficient > reviewing, testing and acking yet? Agreed. While Peter and Nick have done some review of the patches, I would really like VFS maintainers to review them before merge. Christoph, have you had the chance to take a look at it? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg @ 2007-05-01 9:24 ` Christoph Hellwig 2007-05-01 9:37 ` Peter Zijlstra 2007-05-01 12:19 ` Andi Kleen 2 siblings, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-01 9:24 UTC (permalink / raw) To: Pekka Enberg Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra On Tue, May 01, 2007 at 12:17:28PM +0300, Pekka Enberg wrote: > On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > revoke-special-mmap-handling.patch > > [snip] > > >Hold. This is tricky stuff and I don't think we've seen sufficient > >reviewing, testing and acking yet? > > Agreed. While Peter and Nick have done some review of the patches, I > would really like VFS maintainers to review them before merge. > Christoph, have you had the chance to take a look at it? Not so far, but it's on my long list of highly useful things I want to review. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg 2007-05-01 9:24 ` Christoph Hellwig @ 2007-05-01 9:37 ` Peter Zijlstra 2007-05-01 12:19 ` Andi Kleen 2 siblings, 0 replies; 233+ messages in thread From: Peter Zijlstra @ 2007-05-01 9:37 UTC (permalink / raw) To: Pekka Enberg; +Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin On Tue, 2007-05-01 at 12:17 +0300, Pekka Enberg wrote: > On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > revoke-special-mmap-handling.patch > > [snip] > > > Hold. This is tricky stuff and I don't think we've seen sufficient > > reviewing, testing and acking yet? > > Agreed. While Peter and Nick have done some review of the patches, I > would really like VFS maintainers to review them before merge. > Christoph, have you had the chance to take a look at it? I'll have another look at it; also, I'll try to work through Mel's patches once again. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg 2007-05-01 9:24 ` Christoph Hellwig 2007-05-01 9:37 ` Peter Zijlstra @ 2007-05-01 12:19 ` Andi Kleen 2007-05-01 17:12 ` Pekka Enberg 2 siblings, 1 reply; 233+ messages in thread From: Andi Kleen @ 2007-05-01 12:19 UTC (permalink / raw) To: Pekka Enberg Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra "Pekka Enberg" <penberg@cs.helsinki.fi> writes: > On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote: > > revoke-special-mmap-handling.patch > > [snip] > > > Hold. This is tricky stuff and I don't think we've seen sufficient > > reviewing, testing and acking yet? > > Agreed. While Peter and Nick have done some review of the patches, I > would really like VFS maintainers to review them before merge. > Christoph, have you had the chance to take a look at it? Also have the cache performance concerns raised on the original review been addressed? -Andi ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-01 12:19 ` Andi Kleen @ 2007-05-01 17:12 ` Pekka Enberg 0 siblings, 0 replies; 233+ messages in thread From: Pekka Enberg @ 2007-05-01 17:12 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra Hi Andi, On 01 May 2007 14:19:45 +0200, Andi Kleen <andi@firstfloor.org> wrote: > Also have the cache performance concerns raised on the original review > been addressed? I am only aware of the fget_light() related issues Eric Dumazet raised but it's fixed. If you're thinking of something else, could you please remind me what it is? ^ permalink raw reply [flat|nested] 233+ messages in thread
* fragmentation avoidance Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (13 preceding siblings ...) 2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg @ 2007-05-01 10:16 ` Mel Gorman 2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft ` (3 more replies) 2007-05-01 12:17 ` Andi Kleen ` (7 subsequent siblings) 22 siblings, 4 replies; 233+ messages in thread From: Mel Gorman @ 2007-05-01 10:16 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, apw, clameter, y-goto On (30/04/07 16:20), Andrew Morton didst pronounce: > add-apply_to_page_range-which-applies-a-function-to-a-pte-range.patch > add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix.patch > safer-nr_node_ids-and-nr_node_ids-determination-and-initial.patch > use-zvc-counters-to-establish-exact-size-of-dirtyable-pages.patch > proper-prototype-for-hugetlb_get_unmapped_area.patch > mm-remove-gcc-workaround.patch > slab-ensure-cache_alloc_refill-terminates.patch > mm-more-rmap-checking.patch > mm-make-read_cache_page-synchronous.patch > fs-buffer-dont-pageuptodate-without-page-locked.patch > allow-oom_adj-of-saintly-processes.patch > introduce-config_has_dma.patch > mm-slabc-proper-prototypes.patch > mm-detach_vmas_to_be_unmapped-fix.patch > > Misc MM things. Will merge. After Andy's mail, I am guessing that the patch below is also going here in the stack as a cleanup. 
add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch > add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch > split-the-free-lists-for-movable-and-unmovable-allocations.patch > choose-pages-from-the-per-cpu-list-based-on-migration-type.patch > add-a-configure-option-to-group-pages-by-mobility.patch > drain-per-cpu-lists-when-high-order-allocations-fail.patch > move-free-pages-between-lists-on-steal.patch > group-short-lived-and-reclaimable-kernel-allocations.patch > group-high-order-atomic-allocations.patch > do-not-group-pages-by-mobility-type-on-low-memory-systems.patch > bias-the-placement-of-kernel-pages-at-lower-pfns.patch > be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch > fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch Plus the patch below from Andy's pfn_valid_within() series would be here: anti-fragmentation-switch-over-to-pfn_valid_within.patch These patches are the grouping pages by mobility patches. They get tested every time someone boots the machine from the perspective that they affect the page allocator. It is working to keep fragmentation problems to a minimum and being exercised. We have beaten it heavily here on tests with a variety of machines using the system that drives test.kernel.org for both functionality and performance testing. That covers x86, x86_64, ppc64 and occasionally IA64. Granted, there are corner-case machines out there or we'd never receive bug reports at all. They are currently being reviewed by Christoph Lameter. His feedback in the linux-mm thread "Antifrag patchset comments" has given me a TODO list which I'm currently working through. So far, there has been no fundamental mistake in my opinion and the additional work is logical extensions. 
The closest thing to a fundamental mistake was grouping pages by MAX_ORDER_NR_PAGES instead of an arbitrary order. What I did was fine for x86_64, i386 and ppc64 but not as useful for IA64 with 1GB worth of memory in MAX_ORDER_NR_PAGES. I also missed some temporary allocations as picked up in Christoph's review. > create-the-zone_movable-zone.patch > allow-huge-page-allocations-to-use-gfp_high_movable.patch > x86-specify-amount-of-kernel-memory-at-boot-time.patch > ppc-and-powerpc-specify-amount-of-kernel-memory-at-boot-time.patch > x86_64-specify-amount-of-kernel-memory-at-boot-time.patch > ia64-specify-amount-of-kernel-memory-at-boot-time.patch > add-documentation-for-additional-boot-parameter-and-sysctl.patch > handle-kernelcore=-boot-parameter-in-common-code-to-avoid-boot-problem-on-ia64.patch > > Mel's moveable-zone work. These patches are what create ZONE_MOVABLE. The last 6 patches should be collapsed into a single patch: handle-kernelcore=-generic I believe Yasunori Goto is looking at these from the perspective of memory hot-remove and has caught a few bugs in the past. Goto-san may be able to comment on whether they have been reviewed recently. The main complexity is in one function in patch one which determines where the PFN is in each node for ZONE_MOVABLE. Getting that right so that the requested amount of kernel memory is spread as evenly as possible is just not straight-forward. > I don't believe that this has had sufficient review and I'm sure that it > hasn't had sufficient third-party testing. Most of the approbations thus far > have consisted of people liking the overall idea, based on the changelogs and > multi-year-old discussions. > > For such a large and core change I'd have expected more detailed reviewing > effort and more third-party testing. And I STILL haven't made time to review > the code in detail myself. > > So I'm a bit uncomfortable with moving ahead with these changes. > Ok. 
It is getting reviewed by Christoph and I'm going through the TODO items it yielded. Andy has also been regularly reviewing them which is probably why they have had fewer public errors than you might expect from something like this. Christoph may like to comment more here. > <snip> > > lumpy-reclaim-v4.patch And I guess this patch also moves here lumpy-move-to-using-pfn_valid_within.patch > > This is in a similar situation to the moveable-zone work. Sounds great on > paper, but it needs considerable third-party testing and review. It is a > major change to core MM and, we hope, a significant advance. On paper. Andy will probably comment more here. Like the fragmentation stuff, we have beaten this heavily in tests. I'm not sure of its review situation. > More Mel things, and linkage between Mel-things and lumpy reclaim. It's here > where the patch ordering gets into a mess and things won't improve if > moveable-zones and lumpy-reclaim get deferred. Such a deferral would limit my > ability to queue more MM changes for 2.6.23. > This is where the three patches were originally. From the other thread, I am assuming these are sorted out. > <snip> > > bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch > remove-page_group_by_mobility.patch > dont-group-high-order-atomic-allocations.patch > > More moveable-zone work. > This is the MIGRATE_RESERVE patch and two patches that back out parts of the grouping pages by mobility stack. If possible, these patches should move to the end of that stack. To fix the ordering, would it be helpful to provide a fresh stack based on 2.6.21? That would delete 4 patches in all. The two that introduce configuration items and highorder atomic groupings and these two patches that subsequently remove them. > <SNIP> > > slub-exploit-page-mobility-to-increase-allocation-order.patch > > Slub entanglement with moveable-zones. Will merge if moveable-zones is merged. 
> Well, grouping pages by mobility is what it really depends on. The ZONE_MOVABLE is not required for SLUB. However, I get the point and agree with it. If the rest of SLUB gets merged, this patch could be moved to the end of the grouping by mobility stack. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 233+ messages in thread
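Mel's remark above, that spreading the requested amount of kernel memory "as evenly as possible" across nodes is not straightforward, can be illustrated with a small userspace sketch. This is not the code from the ZONE_MOVABLE patches. The function name spread_kernelcore() and its parameters are hypothetical, and it assumes only that each node has some number of present pages, that small nodes may be unable to hold an equal share, and that whatever a node does not give to the kernel would become its movable portion.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch only (hypothetical helper, not the patch's code).
 *
 * Split `kernelcore` pages as evenly as possible across nodes.
 *   node_pages[i]   - pages present on node i
 *   kernel_pages[i] - output: pages on node i kept for kernel allocations
 * Pages not assigned here (node_pages[i] - kernel_pages[i]) would form the
 * movable part of node i.
 *
 * Returns the number of requested pages that could not be placed,
 * which is 0 whenever the request fits in total memory.
 */
unsigned long spread_kernelcore(const unsigned long *node_pages,
                                unsigned long *kernel_pages,
                                size_t nr_nodes,
                                unsigned long kernelcore)
{
    unsigned long remaining = kernelcore;
    size_t i;

    assert(node_pages != NULL && kernel_pages != NULL);

    for (i = 0; i < nr_nodes; i++)
        kernel_pages[i] = 0;

    while (remaining > 0) {
        size_t usable = 0;
        unsigned long share;

        /* Count the nodes that still have room for kernel pages. */
        for (i = 0; i < nr_nodes; i++)
            if (kernel_pages[i] < node_pages[i])
                usable++;
        if (usable == 0)
            break;              /* request exceeds total memory */

        /* Equal share per usable node; leftovers go out one page at a time. */
        share = remaining / usable;
        if (share == 0)
            share = 1;

        for (i = 0; i < nr_nodes && remaining > 0; i++) {
            unsigned long room = node_pages[i] - kernel_pages[i];
            unsigned long take = room < share ? room : share;

            if (take > remaining)
                take = remaining;
            kernel_pages[i] += take;
            remaining -= take;
        }
    }
    return remaining;
}
```

With nodes of 1000, 100 and 1000 pages and a request of 900, the first pass fills the small node completely and the second pass redistributes its shortfall over the two large nodes. Real node layouts add holes, alignment and per-zone boundaries, none of which this sketch models.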
* Re: 2.6.22 -mm merge plans -- lumpy reclaim 2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman @ 2007-05-01 13:02 ` Andy Whitcroft 2007-05-01 18:03 ` Peter Zijlstra 2007-05-01 19:00 ` Andrew Morton 2007-05-01 14:54 ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter ` (2 subsequent siblings) 3 siblings, 2 replies; 233+ messages in thread From: Andy Whitcroft @ 2007-05-01 13:02 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, linux-kernel, linux-mm, clameter, y-goto, Peter Zijlstra Mel Gorman wrote: <snip> >> lumpy-reclaim-v4.patch > > And I guess this patch also moves here > > lumpy-move-to-using-pfn_valid_within.patch > >> This is in a similar situation to the moveable-zone work. Sounds great on >> paper, but it needs considerable third-party testing and review. It is a >> major change to core MM and, we hope, a significant advance. On paper. > > Andy will probably comment more here. Like the fragmentation stuff, we have > beaten this heavily in tests. With this stack the basic functionality for Lumpy reclaim is complete. Better integration with kswapd is desirable, but IMO that should be a separate change. In testing it has produced significant improvements in the likelihood of reclaiming a page (reclaim effectiveness) at very high orders (where the likelihood of success is least), and effectiveness at lower orders should be better again. In general -mm testing lumpy is triggered for any stalled allocation above order-0; it is common to see stack allocations triggering lumpy under higher load. kswapd also now utilises lumpy when required. As Mel has indicated a lot of automated testing has been done on these patches. As reclaim is only entered when low on memory, our testing focuses on pushing the system into a heavily fragmented state where reclaim is used heavily. This testing has not shown any regressions and shows improved effectiveness particularly under load. 
Effectiveness for regular reclaim is based on random distributions; as such it is only likely to successfully reclaim pages at lower orders. Lumpy reclaim improves on this by actively targeting reclaim on areas at the orders required and so succeeds at significantly higher order. Very high order allocations require better layout, from the mobility patches. I have some primitive stats patches which we have used for performance testing. Perhaps those could be brought up to date to provide better visibility into lumpy's operation. Again this would be a separate patch. > I'm not sure of it's review situation. As lumpy reclaim and grouping-by-mobility are complementary patch sets (in that they both assist at the highest order) we work pretty closely and I generally pass all my patches past Mel before general release. Early versions were based on patches from Peter Zijlstra who also reviewed earlier versions if memory serves. The changes since then have been reviewed by Mel and Andrew Morton only to my knowledge. Perhaps Peter would have some time to take a look over the latest stack as it appears in -mm when that releases; ping me for a patch kit if you want it before then :). <snip> -apw ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- lumpy reclaim 2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft @ 2007-05-01 18:03 ` Peter Zijlstra 2007-05-01 19:00 ` Andrew Morton 1 sibling, 0 replies; 233+ messages in thread From: Peter Zijlstra @ 2007-05-01 18:03 UTC (permalink / raw) To: Andy Whitcroft Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-mm, clameter, y-goto On Tue, 2007-05-01 at 14:02 +0100, Andy Whitcroft wrote: > Perhaps Peter would have some time to take a look over the latest stack > as it appears in -mm when that releases; ping me for a patch kit if you > want it before then :). Lumpy-reclaim -v7, as per the roll-up provided privately; Code is looking good, I like what you did to it :-) Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans -- lumpy reclaim 2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft 2007-05-01 18:03 ` Peter Zijlstra @ 2007-05-01 19:00 ` Andrew Morton 1 sibling, 0 replies; 233+ messages in thread From: Andrew Morton @ 2007-05-01 19:00 UTC (permalink / raw) To: Andy Whitcroft Cc: Mel Gorman, linux-kernel, linux-mm, clameter, y-goto, Peter Zijlstra On Tue, 01 May 2007 14:02:41 +0100 Andy Whitcroft <apw@shadowen.org> wrote: > I have some primitive stats patches which we have used performance > testing. Perhaps those could be brought up to date to provide better > visibility into lumpy's operation. Again this would be a separate patch. Feel free to add new counters in /proc/vmstat - perhaps per-order success and fail rates? Monitoring the ratio between those would show how effective lumpiness is being, perhaps. It's always nice to see what's going on in there. ^ permalink raw reply [flat|nested] 233+ messages in thread
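Andrew's suggestion above, per-order success and fail counters in /proc/vmstat with the ratio between them as the measure of effectiveness, could be consumed along the following lines. This is a minimal userspace sketch under stated assumptions: the counter names lumpy_orderN_success and lumpy_orderN_fail are hypothetical (no such counters are named in the thread), and the only thing assumed about the input is that it uses one "name value" pair per line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find a "name value" line in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the counter is absent.
 */
static int vmstat_lookup(const char *text, const char *name,
                         unsigned long *val)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Match the whole counter name, followed by the separating space. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%lu", val) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline */
    }
    return -1;
}

/*
 * Success rate, in percent, of reclaim at the given order.
 * The counter names are hypothetical; they only illustrate the idea of
 * per-order success/fail counters. Returns -1 when there is no data.
 */
int lumpy_success_percent(const char *text, unsigned int order)
{
    char name[64];
    unsigned long ok, fail;

    assert(text != NULL);

    snprintf(name, sizeof(name), "lumpy_order%u_success", order);
    if (vmstat_lookup(text, name, &ok) != 0)
        return -1;
    snprintf(name, sizeof(name), "lumpy_order%u_fail", order);
    if (vmstat_lookup(text, name, &fail) != 0)
        return -1;
    if (ok + fail == 0)
        return -1;              /* no attempts recorded at this order */

    return (int)(ok * 100 / (ok + fail));
}
```

A caller would read the counter file into a buffer and pass it in. Sampling twice and comparing the deltas, rather than the absolute totals, would show how effectiveness changes under load.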
* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans 2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman 2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft @ 2007-05-01 14:54 ` Christoph Lameter 2007-05-01 19:00 ` Mel Gorman 2007-05-01 18:57 ` Andrew Morton 2007-05-07 13:07 ` Yasunori Goto 3 siblings, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-01 14:54 UTC (permalink / raw) To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, y-goto On Tue, 1 May 2007, Mel Gorman wrote: > anti-fragmentation-switch-over-to-pfn_valid_within.patch > > These patches are the grouping pages by mobility patches. They get tested > every time someone boots the machine from the perspective that they affect > the page allocator. It is working to keep fragmentation problems to a > minimum and being exercised. We have beaten it heavily here on tests > with a variety of machines using the system that drives test.kernel.org > for both functionality and performance testing. That covers x86, x86_64, > ppc64 and occasionally IA64. Granted, there are corner-case machines out > there or we'd never receive bug reports at all. > > They are currently being reviewed by Christoph Lameter. His feedback in > the linux-mm thread "Antifrag patchset comments" has given me a TODO list > which I'm currently working through. So far, there has been no fundamental > mistake in my opinion and the additional work is logical extensions. I think we really urgently need a defragmentation solution in Linux in order to support higher page allocations for various purposes. SLUB f.e. would benefit from it and the large blocksize patches are not reasonable without such a method. However, the current code is not up to the task. I did not see a clean categorization of allocations nor a consistent handling of those. The cleanup work that would have to be done throughout the kernel is not there. It is spotty. 
There seems to be a series of heuristics driving this thing (I have to agree with Nick there). The temporary allocations that were missed are just a few that I found. The review of the rest of the kernel was not done. Mel said that he fixed up locations that showed up to be a problem in testing. That is another issue: Too much focus on testing instead of conceptual cleanness and clean code in the kernel. It looks like this is geared for a specific series of tests on specific platforms and also to a particular allocation size (max order sized huge pages). There are major technical problems with 1. Large Scale allocs. Multiple MAX_ORDER blocks as required by the antifrag patches may not exist on all platforms. Thus the antifrag patches will not be able to generate their MAX_ORDER sections. We could reduce MAX_ORDER on some platforms but that would have other implications like limiting the highest order allocation. 2. Small huge page size support. F.e. IA64 can support down to page size huge pages. The antifrag patches handle huge pages in a special way. They are categorized as movable. Small huge pages may therefore contaminate the movable area. 3. Defining the size of ZONE_MOVABLE. This was done to guarantee availability of movable memory but the practical effect is to guarantee that we panic when too many unreclaimable allocations have been done. I have already said during the review that IMHO the patches are not ready for merging. They are currently more like a prototype that explores ideas. The generalization steps are not done. How we could make progress: 1. Develop a useful categorization of allocations in the kernel whose utility goes beyond the antifrag patches. I.e. length of the object's existence and the method of reclaim could be useful in various contexts. 2. Have statistics of these various allocations. 3. Page allocator should gather statistics on how memory was allocated in the various categories. 4. 
The available data can then be used to drive more intelligent reclaim and develop methods of antifrag or defragmentation. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans 2007-05-01 14:54 ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter @ 2007-05-01 19:00 ` Mel Gorman 0 siblings, 0 replies; 233+ messages in thread From: Mel Gorman @ 2007-05-01 19:00 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, y-goto On Tue, 1 May 2007, Christoph Lameter wrote: > On Tue, 1 May 2007, Mel Gorman wrote: > >> anti-fragmentation-switch-over-to-pfn_valid_within.patch >> >> These patches are the grouping pages by mobility patches. They get tested >> every time someone boots the machine from the perspective that they affect >> the page allocator. It is working to keep fragmentation problems to a >> minimum and being exercised. We have beaten it heavily here on tests >> with a variety of machines using the system that drives test.kernel.org >> for both functionality and performance testing. That covers x86, x86_64, >> ppc64 and occasionally IA64. Granted, there are corner-case machines out >> there or we'd never receive bug reports at all. >> >> They are currently being reviewed by Christoph Lameter. His feedback in >> the linux-mm thread "Antifrag patchset comments" has given me a TODO list >> which I'm currently working through. So far, there has been no fundamental >> mistake in my opinion and the additional work is logical extensions. > > I think we really urgently need a defragmentation solution in Linux in > order to support higher page allocations for various purposes. SLUB f.e. > would benefit from it and the large blocksize patches are not reasonable > without such a method. > I continue to maintain that anti-fragmentation is a pre-requisite for any defragmentation mechanism to be effective without trashing overall performance. 
If allocation success rates are low when everything possible has been reclaimed as is the case without fragmentation avoidance, then defragmentation will not help unless the 1:1 phys:virt mapping is broken, which incurs its own considerable set of problems. > However, the current code is not up to the task. I did not see a clean > categorization of allocations nor a consistent handling of those. The > cleanup work that would have to be done throughout the kernel is not > there. The choice of mobility marker to use in each case was deliberate (even if I have made mistakes but what else is review for?). The choice by default is UNMOVABLE as it's the safe choice even if it may be sub-optimal. The description of the mobility types may not be the clearest. For example, buffers were placed beside page cache in MOVABLE because they can both be reclaimed in the same fashion - I consider moving it to disk to be as "movable" as any other definition of the word but in your world movable always means page migration which has led to some confusion. They could have been separated out as MOVABLE and BUFFERS for a conceptually cleaner split but it did not seem necessary because the more types there are, the bigger the memory and performance footprint becomes. Additional flag groupings like GFP_BUFFERS could be defined that alias to MOVABLE if you felt it would make the code clearer but functionally, the behaviour remains the same. This is similar to your feedback on the treatment of GFP_TEMPORARY. There can be as many alias mobility types as you wish but if more "real" types are required, you can have as many as you want as long as NR_PAGEBLOCK_BITS is increased properly and allocflags_to_migratetype() is able to translate GFP flags to the appropriate mobility type. It increases the performance and memory footprint though. > It is spotty. There seems to be a series of heuristic driving this > thing (I have to agree with Nick there). 
The temporary allocations that > were missed are just a few that I found. The review of the rest of the > kernel was not done. The review for temporary allocations was aimed at catching the most common callers, not every single one of them because a full review of every caller is a large undertaking. If anything, it makes more sense to do a review of all callers at the end when the core mechanism is finished. The default to treat them as UNMOVABLE is sensible. > Mel said that he fixed up locations that showed up to > be a problem in testing. That is another issue: Too much focus on testing > instead of conceptual cleanness and clean code in the kernel. The patches started as a thought experiment of what "should work". They were then tested to find flaws in the model and the results were fed back in. How is that a disadvantage exactly? > It looks > like this is geared for a specific series of tests on specific platforms > and also to a particular allocation size (max order sized huge pages). > Some series of tests had to be chosen and one combination was chosen that was known to be particularly hostile to external fragmentation - i.e. large numbers of kernel cache allocations at the same time as page cache allocations. No one has suggested an alternative test that would be more suitable. The platforms used were x86, x86_64 and ppc64 which are not exactly insignificant platforms. At the time, I didn't have an IA64 machine and franky the one I have now does not always boot so testing is not as thorough. Huge page sized pages were chosen because they were the hardest allocation to satisfy. If they could be allocated successfully, it stood to reason that smaller allocations at least as well. Hugepages and MAX_ORDER pages were close to the same size on x86, x86_64 and ppc64 which is why that figure was chosen. 
I point out that while IA64 can specify hugepagesz= to change the hugepage
size, it's not documented in Documentation/kernel-parameters.txt or I might
have spotted this sooner.

These decisions were not random.

> There are major technical problems with
>
> 1. Large Scale allocs. Multiple MAX_ORDER blocks as required by the
>    antifrag patches may not exist on all platforms. Thus the antifrag
>    patches will not be able to generate their MAX_ORDER sections. We
>    could reduce MAX_ORDER on some platforms but that would have other
>    implications like limiting the highest order allocation.

MAX_ORDER was a sensible choice on the three initial platforms. However, it
is not a fundamental value in the mechanism and is an easy assumption to
break. I've included a patch below based on your review that chooses a size
based on the value of HPAGE_SHIFT. It took 45 minutes to cobble together so
it's rough looking and I might have missed something but it has passed
stress tests on x86 without difficulty. Here is the dmesg output

[ 0.000000] Built 1 zonelists, mobility grouping on at order 5. Total pages: 16224

Voila, grouping on order 5 instead of 10 (I used 5 instead of HPAGE_SHIFT
for testing purposes). The order used can be any value >= 2 and < MAX_ORDER.

> 2. Small huge page size support. F.e. IA64 can support down to page size
>    huge pages. The antifrag patches handle huge page in a special way.
>    They are categorized as movable. Small huge pages may
>    therefore contaminate the movable area.

They are only categorised as movable when a sysctl is set. This has to be
the deliberate choice of the administrator and its intention was to allow
hugepages to be allocated from ZONE_MOVABLE. This was to allow flexible
sizing of the hugepage pool when that zone is configured until such time as
hugepages were really movable in 100% of situations.

> 3. Defining the size of ZONE_MOVABLE. This was done to guarantee
>    availability of movable memory but the practical effect is to
>    guarantee that we panic when too many unreclaimable allocations have
>    been done.
>

The size of ZONE_MOVABLE is determined at boot time, presumably by an
administrator who has identified the problem that is fixed by having this
zone available, and the zone is not required for grouping pages by mobility
to be effective. Furthermore, it would be done with the understanding of
what it means for OOM situations if the partition is made too small. The
expectation is that he has a solid understanding of his workload before
using this option.

> I have already said during the review that IMHO the patches are not ready
> for merging. They are currently more like a prototype that explores ideas.
> The generalization steps are not done.
>
> How we could make progress:
>
> 1. Develop a useful categorization of allocations in the kernel whose
>    utility goes beyond the antifrag patches. I.e. length of
>    the objects existence and the method of reclaim could be useful in
>    various contexts.
>

The length of an object's existence is something I am wary of because it
puts a big burden on the caller of the page allocator. The method of
reclaim is already implied by the existing categorisations. What may be
missing is clear documentation:

UNMOVABLE - You can't reclaim it

RECLAIMABLE - You need the help of another subsystem to reclaim objects
              within the page before the page is reclaimed or the
              allocation is short-lived. Even when reclaimable, there is
              no guarantee that reclaim will succeed.

MOVABLE - The page is directly reclaimable by kswapd or it may be migrated.
          Being able to reclaim is guaranteed except where mlock() is
          involved. mlock pages need to be migrated.

You've defined these better yourself in your review. Arguably, RECLAIMABLE
should be separate from TEMPORARY and page buffers should be away from
MOVABLE but this did not appear necessary when tested.
If this breakout is found to be required, it is trivial to implement.

> 2. Have statistics of these various allocations.
>
> 3. Page allocator should gather statistics on how memory was allocated in
>    the various categories.
>

Statistics gathering has been done before and it can be done again. They
were used earlier in the development of the patches and then I stopped
bringing them forward in the belief they would not be of general interest.
In a large part, they helped define the current mobility types. Gathering
statistics again is not a fundamental problem.

> 4. The available data can then be used to drive more intelligent reclaim
>    and develop methods of antifrag or defragmentation.
>

Once that data is available, it would help show how successful
fragmentation avoidance is as it currently stands and how it can be
improved. The lack of the statistics today does not seem a blocking issue
because there are no users of fragmentation avoidance that blow up if it's
not effective.

Patch for breaking the MAX_ORDER grouping is as follows. Again, it's 45
minutes coding so maybe I missed something but it survived a quick stress
test. Not signed off due to incompleteness (e.g. should use a constant if
the hugepage size is known at compile time, nr_pages_pageblock should be
__read_mostly, not checked everywhere etc) and lack of full regression
testing and verification. If I hadn't bothered updating comments or
printks, the patch would be fairly small.

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc7-mm2-004_temporary/include/linux/pageblock-flags.h linux-2.6.21-rc7-mm2-005_group_arbitrary/include/linux/pageblock-flags.h
--- linux-2.6.21-rc7-mm2-004_temporary/include/linux/pageblock-flags.h	2007-04-27 22:04:34.000000000 +0100
+++ linux-2.6.21-rc7-mm2-005_group_arbitrary/include/linux/pageblock-flags.h	2007-05-01 16:02:51.000000000 +0100
@@ -1,6 +1,6 @@
 /*
  * Macros for manipulating and testing flags related to a
- * MAX_ORDER_NR_PAGES block of pages.
+ * large contiguous block of pages.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -35,6 +35,10 @@ enum pageblock_bits {
 	NR_PAGEBLOCK_BITS
 };
 
+/* Each pages_per_mobility_block of pages has NR_PAGEBLOCK_BITS */
+extern unsigned long nr_pages_pageblock;
+extern int pageblock_order;
+
 /* Forward declaration */
 struct page;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc7-mm2-004_temporary/mm/page_alloc.c linux-2.6.21-rc7-mm2-005_group_arbitrary/mm/page_alloc.c
--- linux-2.6.21-rc7-mm2-004_temporary/mm/page_alloc.c	2007-04-27 22:04:34.000000000 +0100
+++ linux-2.6.21-rc7-mm2-005_group_arbitrary/mm/page_alloc.c	2007-05-01 19:54:18.000000000 +0100
@@ -58,6 +58,8 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+unsigned long nr_pages_pageblock;
+int pageblock_order;
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -721,7 +723,7 @@ static int fallbacks[MIGRATE_TYPES][MIGR
 
 /*
  * Move the free pages in a range to the free lists of the requested type.
- * Note that start_page and end_pages are not aligned in a MAX_ORDER_NR_PAGES
+ * Note that start_page and end_pages are not aligned in a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
 int move_freepages(struct zone *zone,
@@ -771,10 +773,10 @@ int move_freepages_block(struct zone *zo
 	struct page *start_page, *end_page;
 
 	start_pfn = page_to_pfn(page);
-	start_pfn = start_pfn & ~(MAX_ORDER_NR_PAGES-1);
+	start_pfn = start_pfn & ~(nr_pages_pageblock-1);
 	start_page = pfn_to_page(start_pfn);
-	end_page = start_page + MAX_ORDER_NR_PAGES - 1;
-	end_pfn = start_pfn + MAX_ORDER_NR_PAGES - 1;
+	end_page = start_page + nr_pages_pageblock - 1;
+	end_pfn = start_pfn + nr_pages_pageblock - 1;
 
 	/* Do not cross zone boundaries */
 	if (start_pfn < zone->zone_start_pfn)
@@ -838,14 +840,14 @@ static struct page *__rmqueue_fallback(s
 			 * back for a reclaimable kernel allocation, be more
 			 * agressive about taking ownership of free pages
 			 */
-			if (unlikely(current_order >= MAX_ORDER / 2) ||
+			if (unlikely(current_order >= pageblock_order / 2) ||
 					start_migratetype == MIGRATE_RECLAIMABLE) {
 				unsigned long pages;
 				pages = move_freepages_block(zone, page,
 								start_migratetype);
 
 				/* Claim the whole block if over half of it is free */
-				if ((pages << current_order) >= (1 << (MAX_ORDER-2)))
+				if ((pages << current_order) >= (1 << (pageblock_order-2)))
 					set_pageblock_migratetype(page,
 								start_migratetype);
 
@@ -858,7 +860,7 @@ static struct page *__rmqueue_fallback(s
 			__mod_zone_page_state(zone, NR_FREE_PAGES,
 							-(1UL << order));
 
-			if (current_order == MAX_ORDER - 1)
+			if (current_order == pageblock_order)
 				set_pageblock_migratetype(page,
 							start_migratetype);
 
@@ -2253,14 +2255,16 @@ void __meminit build_all_zonelists(void)
 	 * made on memory-hotadd so a system can start with mobility
 	 * disabled and enable it later
 	 */
-	if (vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES))
+	if (vm_total_pages < (nr_pages_pageblock * MIGRATE_TYPES))
 		page_group_by_mobility_disabled = 1;
 	else
 		page_group_by_mobility_disabled = 0;
 
-	printk("Built %i zonelists, mobility grouping %s. Total pages: %ld\n",
+	printk("Built %i zonelists, mobility grouping %s at order %d. "
+		"Total pages: %ld\n",
 			num_online_nodes(),
 			page_group_by_mobility_disabled ? "off" : "on",
+			pageblock_order,
 			vm_total_pages);
 }
 
@@ -2333,7 +2337,7 @@ static inline unsigned long wait_table_b
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
 
 /*
- * Mark a number of MAX_ORDER_NR_PAGES blocks as MIGRATE_RESERVE. The number
+ * Mark a number of pageblocks as MIGRATE_RESERVE. The number
  * of blocks reserved is based on zone->pages_min. The memory within the
  * reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
@@ -2348,9 +2352,10 @@ static void setup_zone_migrate_reserve(s
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, MAX_ORDER_NR_PAGES) >> (MAX_ORDER-1);
+	reserve = roundup(zone->pages_min, nr_pages_pageblock) >>
+							pageblock_order;
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += nr_pages_pageblock) {
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
@@ -2425,7 +2430,7 @@ void __meminit memmap_init_zone(unsigned
 		 * the start are marked MIGRATE_RESERVE by
 		 * setup_zone_migrate_reserve()
 		 */
-		if ((pfn & (MAX_ORDER_NR_PAGES-1)))
+		if ((pfn & (nr_pages_pageblock-1)))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
 
 		INIT_LIST_HEAD(&page->lru);
@@ -3129,8 +3134,8 @@ static void __meminit calculate_node_tot
 #ifndef CONFIG_SPARSEMEM
 /*
  * Calculate the size of the zone->blockflags rounded to an unsigned long
- * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
- * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per MAX_ORDER-1, finally
+ * Start by making sure zonesize is a multiple of pageblock_order by rounding up
+ * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally
  * round what is now in bits to nearest long in bits, then return it in
  * bytes.
  */
@@ -3138,8 +3143,8 @@ static unsigned long __init usemap_size(
 {
 	unsigned long usemapsize;
 
-	usemapsize = roundup(zonesize, MAX_ORDER_NR_PAGES);
-	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize = roundup(zonesize, nr_pages_pageblock);
+	usemapsize = usemapsize >> pageblock_order;
 	usemapsize *= NR_PAGEBLOCK_BITS;
 	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
 
@@ -3161,6 +3166,26 @@ static void inline setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize) {}
 #endif /* CONFIG_SPARSEMEM */
 
+/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
+void __init initonce_nr_pages_pageblock(void)
+{
+	/* There will never be a 1:1 mapping, it makes no sense */
+	if (nr_pages_pageblock)
+		return;
+
+#ifdef CONFIG_HUGETLB_PAGE
+	/*
+	 * Assume the largest contiguous order of interest is a huge page.
+	 * This value may be variable depending on boot parameters on IA64
+	 */
+	pageblock_order = HUGETLB_PAGE_ORDER;
+#else
+	/* If huge pages are not in use, group based on MAX_ORDER */
+	pageblock_order = MAX_ORDER-1;
+#endif
+	nr_pages_pageblock = 1 << pageblock_order;
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -3241,6 +3266,7 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;
 
+		initonce_nr_pages_pageblock();
 		setup_usemap(pgdat, zone, size);
 		ret = init_currently_empty_zone(zone, zone_start_pfn,
 						size, MEMMAP_EARLY);
@@ -4132,15 +4158,15 @@ static inline int pfn_to_bitidx(struct z
 {
 #ifdef CONFIG_SPARSEMEM
 	pfn &= (PAGES_PER_SECTION-1);
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #else
 	pfn = pfn - zone->zone_start_pfn;
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #endif /* CONFIG_SPARSEMEM */
 }
 
 /**
- * get_pageblock_flags_group - Return the requested group of flags for the MAX_ORDER_NR_PAGES block of pages
+ * get_pageblock_flags_group - Return the requested group of flags for the nr_pages_pageblock block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest to retrieve
  * @end_bitidx: The last bit of interest

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply [flat|nested] 233+ messages in thread
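The block arithmetic that the patch above switches from MAX_ORDER to pageblock_order can be tried outside the kernel. What follows is only a userspace sketch under stated assumptions, not the kernel code: every sketch_* name is hypothetical, SKETCH_MAX_ORDER and SKETCH_PAGEBLOCK_BITS are assumed values, and only the non-SPARSEMEM bit-index case is modelled.

```c
#include <assert.h>

/* Hypothetical stand-ins; the kernel derives the real values from its config. */
#define SKETCH_MAX_ORDER	11	/* assumed: orders 0..10 */
#define SKETCH_PAGEBLOCK_BITS	3	/* assumed value of NR_PAGEBLOCK_BITS */

/* Pages in one block: 1 << order, mirroring initonce_nr_pages_pageblock(). */
static unsigned long sketch_pages_per_block(int order)
{
	/* The mail states the order can be any value >= 2 and < MAX_ORDER. */
	assert(order >= 2 && order < SKETCH_MAX_ORDER);
	return 1UL << order;
}

/* Round a pfn down to the first pfn of its block, as move_freepages_block() does. */
static unsigned long sketch_block_start(unsigned long pfn, int order)
{
	return pfn & ~(sketch_pages_per_block(order) - 1);
}

/* Bit offset of a block's flags within the zone bitmap (non-SPARSEMEM case). */
static unsigned long sketch_bitidx(unsigned long pfn,
				   unsigned long zone_start_pfn, int order)
{
	return ((pfn - zone_start_pfn) >> order) * SKETCH_PAGEBLOCK_BITS;
}

/* Bytes of bitmap needed for a zone, following the steps in usemap_size(). */
static unsigned long sketch_usemap_bytes(unsigned long zonesize, int order)
{
	unsigned long per_block = sketch_pages_per_block(order);
	unsigned long bits_per_long = 8 * sizeof(unsigned long);
	unsigned long bits;

	bits = (zonesize + per_block - 1) / per_block;	/* blocks, rounded up */
	bits *= SKETCH_PAGEBLOCK_BITS;			/* flag bits per block */
	bits = (bits + bits_per_long - 1) / bits_per_long * bits_per_long;
	return bits / 8;
}
```

With the order 5 grouping from the dmesg line, a block holds 32 pages, so a smaller order means more blocks per zone and a proportionally larger flags bitmap.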
* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
2007-05-01 14:54 ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter
@ 2007-05-01 18:57 ` Andrew Morton
2007-05-07 13:07 ` Yasunori Goto
3 siblings, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 18:57 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-kernel, linux-mm, apw, clameter, y-goto

On Tue, 1 May 2007 11:16:51 +0100 mel@skynet.ie (Mel Gorman) wrote:

> OK, I did all the reorganisation which you recommended.
> Ok. It is getting reviewed by Christoph and I'm going through the TODO items
> it yielded. Andy has also been regularly reviewing them which is probably
> why they have had less public errors than you might expect from something
> like this.

Great. I'm a bit behind on my linux-mm reading.

> Christoph may like to comment more here.

That would be helpful.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
` (2 preceding siblings ...)
2007-05-01 18:57 ` Andrew Morton
@ 2007-05-07 13:07 ` Yasunori Goto
3 siblings, 0 replies; 233+ messages in thread
From: Yasunori Goto @ 2007-05-07 13:07 UTC (permalink / raw)
To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, clameter

Sorry for the late response. I was on vacation last week, and I'm now
under a mountain of unread mail....

> > Mel's moveable-zone work.
>
> These patches are what creates ZONE_MOVABLE. The last 6 patches should be
> collapsed into a single patch:
>
> handle-kernelcore=-generic
>
> I believe Yasunori Goto is looking at these from the perspective of memory
> hot-remove and has caught a few bugs in the past. Goto-san may be able to
> comment on whether they have been reviewed recently.

Hmm, I don't think my review is enough.
To be precise, I'm just one user/tester of ZONE_MOVABLE.
I have tried to make memory remove patches on top of Mel-san's
ZONE_MOVABLE patch, and the bugs are things that I found in the course of
that work.
(I'll post these patches in a few days.)

> The main complexity is in one function in patch one which determines where
> the PFN is in each node for ZONE_MOVABLE. Getting that right so that the
> requested amount of kernel memory spread as evenly as possible is just
> not straight-forward.

From the memory-hotplug point of view, ZONE_MOVABLE should be aligned to
the section size. But MAX_ORDER alignment is enough for others...

Bye.

-- 
Yasunori Goto

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
` (14 preceding siblings ...)
2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
@ 2007-05-01 12:17 ` Andi Kleen
2007-05-01 22:08 ` Mathieu Desnoyers
2007-05-02 0:31 ` Rusty Russell
2007-05-01 13:06 ` file capabilities and security_task_wait failure " Stephen Smalley
` (6 subsequent siblings)
22 siblings, 2 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-01 12:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, rusty, mathieu.desnoyers, wfg

Andrew Morton <akpm@linux-foundation.org> writes:

> Static markers. Will merge.

There don't seem to be any users of this. How do you know it hasn't
already bitrotted?

It seems quite overcomplicated to me. Has the complexity been justified?

>
> Will merge the rustyvisor.

IMHO the user code still doesn't belong in Documentation.
Also it needs another review round I guess. And some beta testing by more
people.

> Hopefully Wu will be coming up with a much simpler best-of-readahead patch
> soon. I don't think we can get these patches over the hump and they are
> somewhat costly to maintain.

Didn't he have one already? There was a relatively simple readahead patch
recently, although it was unclear what dependencies it needed.

IMHO this work has much potential so I hope the benchmarking-review process
can be done quickly.

-Andi

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-01 12:17 ` Andi Kleen
@ 2007-05-01 22:08 ` Mathieu Desnoyers
2007-05-02 10:44 ` Andi Kleen
2007-05-02 0:31 ` Rusty Russell
1 sibling, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-01 22:08 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, rusty, wfg

Hi Andi,

* Andi Kleen (andi@firstfloor.org) wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> >
> > Static markers. Will merge.
> There don't seem to be any users of this. How do you know it hasn't
> already bitrotted?
>

See the detailed explanation at :
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc7/2.6.21-rc7-mm2/broken-out/linux-kernel-markers-kconfig-menus.patch

Major points :

It is currently used as an instrumentation infrastructure for the LTTng
tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
WindRiver products. The SystemTAP project also plans to use this type of
infrastructure to trace sites hard to instrument. The Linux Kernel
Markers has the support of Frank C. Eigler, author of their current
marker alternative (which he wishes to drop in order to adopt the
markers infrastructure as soon as it hits mainline).

Quoting Jim Keniston <jkenisto@us.ibm.com> :

"kprobes remains a vital foundation for SystemTap. But markers are
attractive as an alternate source of trace/debug info. Here's why:
[...]"

> It seems quite overcomplicated to me. Has the complexity been justified?
>

To summarize the document at the URL above, here are the key goals of the
markers, showing the rationale behind the most important design choices :

- Almost non-perceivable impact on production machines when compiled in but
  markers are "disabled".
  - Use a separate section to keep the data to minimize d-cache thrashing.
  - Put the code (stack setup and function call) in unlikely branches of the
    if() condition to minimize i-cache impact.
  - Since it is required to allow instrumentation of variables within the
    body of a function, accept the impact on compiler's optimizations and
    let it keep the variables "live" sometimes longer than required. It is
    up to the person who puts the marker in the code to choose the
    location that will have a small impact in this aspect.
  - Allow per-architecture optimized versions which remove the need for
    a d-cache based branch (patch a "load immediate" instruction
    instead). It minimizes the d-cache impact of the disabled markers.
  - Accept the cost of an unlikely branch at the marker site because the
    gcc compiler does not give the ability to put "nops" instead of a
    branch generated from C code. Keep this in mind for future
    per-architecture optimizations.

- Instrumentation of challenging kernel sites
  - Instrumentation such as the one provided in the already existing
    Lock dependency checker (lockdep) and instrumentation of trap
    handlers implies being reentrant for such context. Therefore, the
    implementation must be lock-free and update the state in an atomic
    fashion (rcu-style). It must also give the programmer who describes
    a marker site the ability to specify what is forbidden in the probe
    that will be connected to the marker : can it generate a trap ? Can
    it call lockdep (irq disable, take any type of lock), can it call
    printk ? This is why flags can be passed to the _MARK() marker,
    while the MARK() marker has the default flags.

Please tell me if I forgot to explain the rationale behind some
implementation detail and I will be happy to explain in more depth.

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 233+ messages in thread
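The "unlikely branch at the marker site" goal listed above can be pictured with a small userspace sketch. This is not the markers patch and makes several assumptions: all sketch_* and SKETCH_* names are hypothetical, __builtin_expect is the GCC/Clang builtin standing in for the kernel's unlikely(), and the separate data section, the atomic rcu-style updates and the per-marker flags of the real design are not modelled.

```c
/* Hint to the compiler that the condition is almost always false. */
#define sketch_unlikely(x)	__builtin_expect(!!(x), 0)

/* One marker site: a name, an enable flag and the probe to call. */
struct sketch_marker {
	const char *name;
	int enabled;
	void (*probe)(const char *name, long arg);
};

/*
 * The marker itself: a disabled marker costs one load and one
 * not-taken branch; the call and its argument setup sit in the
 * unlikely path.
 */
#define SKETCH_MARK(m, arg)					\
	do {							\
		if (sketch_unlikely((m)->enabled))		\
			(m)->probe((m)->name, (arg));		\
	} while (0)

/* Hypothetical probe that only counts how often it was called. */
static int sketch_hits;

static void sketch_count_probe(const char *name, long arg)
{
	(void)name;
	(void)arg;
	sketch_hits++;
}

static struct sketch_marker sketch_sched_marker = {
	"sched_switch", 0, sketch_count_probe
};

/* An instrumented code path; the marker stays compiled in at all times. */
static void sketch_traced_path(long pid)
{
	SKETCH_MARK(&sketch_sched_marker, pid);
}
```

Calling sketch_traced_path() while enabled is 0 never reaches the probe; setting enabled to 1 connects it without touching the call site, which is the property the first goal in the list is after.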
* Re: 2.6.22 -mm merge plans
2007-05-01 22:08 ` Mathieu Desnoyers
@ 2007-05-02 10:44 ` Andi Kleen
2007-05-02 16:37 ` Frank Ch. Eigler
` (2 more replies)
0 siblings, 3 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-02 10:44 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Andi Kleen, Andrew Morton, linux-kernel, rusty, wfg

> It is currently used as an instrumentation infrastructure for the LTTng
> tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> WindRiver products. The SystemTAP project also plans to use this type of
> infrastructure to trace sites hard to instrument. The Linux Kernel
> Markers has the support of Frank C. Eigler, author of their current
> marker alternative (which he wishes to drop in order to adopt the
> markers infrastructure as soon as it hits mainline).

All of the above don't use mainline kernels.
That doesn't constitute using it.

> Quoting Jim Keniston <jkenisto@us.ibm.com> :
>
> "kprobes remains a vital foundation for SystemTap. But markers are
> attractive as an alternate source of trace/debug info. Here's why:
> [...]"

Talk is cheap. Do they have working code to use it?

> - Allow per-architecture optimized versions which remove the need for
>   a d-cache based branch (patch a "load immediate" instruction
>   instead). It minimizes the d-cache impact of the disabled markers.

That's a good idea in general, but should be generalized (available
independently), not hidden in your subsystem. I know a couple of places
that could use this successfully.

> - Accept the cost of an unlikely branch at the marker site because the
>   gcc compiler does not give the ability to put "nops" instead of a
>   branch generated from C code. Keep this in mind for future
>   per-architecture optimizations.

See upcoming paravirt code for a way to do this.

> - Instrumentation of challenging kernel sites
>   - Instrumentation such as the one provided in the already existing
>     Lock dependency checker (lockdep) and instrumentation of trap
>     handlers implies being reentrant for such context. Therefore, the
>     implementation must be lock-free and update the state in an atomic
>     fashion (rcu-style). It must also give the programmer who describes
>     a marker site the ability to specify what is forbidden in the probe
>     that will be connected to the marker : can it generate a trap ? Can
>     it call lockdep (irq disable, take any type of lock), can it call
>     printk ? This is why flags can be passed to the _MARK() marker,
>     while the MARK() marker has the default flags.

Why can't you just generally forbid probes from doing all of this?
It would greatly simplify your code, wouldn't it? Keep it simple please.

> Please tell me if I forgot to explain the rationale behind some
> implementation detail and I will be happy to explain in more depth.

Having lots of flags to optionally do things differently normally sets off
all the warning lights of early over-design. While Linux has this
sometimes, it is generally only in mature old subsystems. But when
something is freshly merged it shouldn't be like this. That is because code
tends to grow more complicated over its lifetime and when it is already
complicated at the beginning it will eventually fall over (you can study
current slab as a poster child of this).

-Andi

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 10:44 ` Andi Kleen
@ 2007-05-02 16:37 ` Frank Ch. Eigler
2007-05-02 16:47 ` Andrew Morton
2007-05-02 17:19 ` Mathieu Desnoyers
2 siblings, 0 replies; 233+ messages in thread
From: Frank Ch. Eigler @ 2007-05-02 16:37 UTC (permalink / raw)
To: Andi Kleen; +Cc: Mathieu Desnoyers, Andrew Morton, linux-kernel, rusty, wfg

Andi Kleen <andi@firstfloor.org> writes:

> [...] The SystemTAP project also plans to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative [...]
>
> All of the above don't use mainline kernels.
> That doesn't constitute using it.

Systemtap does run on mainline kernels.

> > "kprobes remains a vital foundation for SystemTap. But markers are
> > attractive as an alternate source of trace/debug info. Here's why:
> > [...]"
>
> Talk is cheap. Do they have working code to use it? [...]

We had been waiting on the chicken & egg semaphore. LTTNG has working
code yesterday (months ago); systemtap will have it "tomorrow" (a week
or few).

- FChE

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 10:44 ` Andi Kleen
2007-05-02 16:37 ` Frank Ch. Eigler
@ 2007-05-02 16:47 ` Andrew Morton
2007-05-02 17:29 ` Christoph Hellwig
2007-05-02 17:49 ` Andi Kleen
2007-05-02 17:19 ` Mathieu Desnoyers
2 siblings, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 16:47 UTC (permalink / raw)
To: Andi Kleen; +Cc: Mathieu Desnoyers, linux-kernel, rusty, wfg

On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products. The SystemTAP project also plans to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
>
> All of the above don't use mainline kernels.

That's because they have to add a markers patch!

> That doesn't constitute using it.

Andi, there was a huge amount of discussion about all this in September last
year (subjects: *markers* and *LTTng*). The outcome of all that was, I
believe, that the kernel should have a static marker infrastructure.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 16:47 ` Andrew Morton
@ 2007-05-02 17:29 ` Christoph Hellwig
2007-05-02 20:36 ` Mathieu Desnoyers
2007-05-02 17:49 ` Andi Kleen
1 sibling, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-02 17:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: Andi Kleen, Mathieu Desnoyers, linux-kernel, rusty, wfg

On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > That doesn't constitute using it.
>
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

Only when it's actually useable. A prerequisite for merging it is
having an actual trace transport infrastructure as well as a few actually
useful tracing modules in the kernel tree.

Let this count as a vote to merge the markers once we have the infrastructure
above ready; it'll be very useful then.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 17:29 ` Christoph Hellwig
@ 2007-05-02 20:36 ` Mathieu Desnoyers
2007-05-02 20:53 ` Andrew Morton
2007-05-03 8:08 ` Christoph Hellwig
0 siblings, 2 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-02 20:36 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg

* Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > That doesn't constitute using it.
> >
> > Andi, there was a huge amount of discussion about all this in September last
> > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > believe, that the kernel should have a static marker infrastructure.
>
> Only when it's actually useable. A prerequisite for merging it is
> having an actual trace transport infrastructure as well as a few actually
> useful tracing modules in the kernel tree.
>
> Let this count as a vote to merge the markers once we have the infrastructure
> above ready, it'll be very useful then.

Hi Christoph,

The idea is the following : either we integrate the infrastructure for
instrumentation / data serialization / buffer management / extraction of
data to user space in multiple different steps, which makes code review
easier for you guys, or we bring the main pieces of the LTTng project
altogether with the Linux Kernel Markers, which would result in a bigger
change.

Based on the premise that discussing logically distinct pieces of
infrastructure is easier and can be done more thoroughly when done
separately, we decided to submit the markers first, with the other
pieces planned in the near future.

I agree that it would be very useful to have the full tracing stack
available in the Linux kernel, but we inevitably face the argument :
"this change is too big" if we submit all LTTng modules at once or the
argument : "we want the whole tracing stack, not just part of it" if we
don't.

This is why we chose to push the tracing infrastructure chunk by chunk :
to make code review and criticism more efficient.

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 20:36 ` Mathieu Desnoyers
@ 2007-05-02 20:53 ` Andrew Morton
2007-05-02 23:11 ` Mathieu Desnoyers
2007-05-03 8:09 ` Christoph Hellwig
2007-05-03 8:08 ` Christoph Hellwig
1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 20:53 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

On Wed, 2 May 2007 16:36:27 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Christoph Hellwig (hch@infradead.org) wrote:
> > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > That doesn't constitute using it.
> > >
> > > Andi, there was a huge amount of discussion about all this in September last
> > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > believe, that the kernel should have a static marker infrastructure.
> >
> > Only when it's actually useable. A prerequisite for merging it is
> > having an actual trace transport infrastructure as well as a few actually
> > useful tracing modules in the kernel tree.
> >
> > Let this count as a vote to merge the markers once we have the infrastructure
> > above ready, it'll be very useful then.
>
> Hi Christoph,
>
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys, or we bring the main pieces of the LTTng project
> altogether with the Linux Kernel Markers, which would result in a bigger
> change.
>
> Based on the premise that discussing about logically distinct pieces of
> infrastructure is easier and can be done more thoroughly when done
> separately, we decided to submit the markers first, with the other
> pieces planned in a near future.
>
> I agree that it would be very useful to have the full tracing stack
> available in the Linux kernel, but we inevitably face the argument :
> "this change is too big" if we submit all LTTng modules at once or
> the argument : "we want the whole tracing stack, not just part of it"
> if we don't.
>
> This is why we chose to push the tracing infrastructure chunk by chunk :
> to make code review and criticism more efficient.
>

I didn't know that this was the plan.

The problem I have with this is that once we've merged one part, we're
committed to merging the other parts even though we haven't seen them yet.

What happens if there's a revolt over the next set of patches? Do we
remove the core markers patches again? We end up in a cant-go-forward,
cant-go-backward situation.

I thought the existing code was useful as-is for several projects, without
requiring additional patching to core kernel. If such additional patching
_is_ needed to make the markers code useful then I agree that we should
continue to buffer the markers code in -mm until the
use-markers-for-something patches have been eyeballed.

In which case we have:

atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
atomich-complete-atomic_long-operations-in-asm-generic.patch
atomich-i386-type-safety-fix.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
local_t-architecture-independant-extension.patch
local_t-alpha-extension.patch
local_t-i386-extension.patch
local_t-ia64-extension.patch
local_t-mips-extension.patch
local_t-parisc-cleanup.patch
local_t-powerpc-extension.patch
local_t-sparc64-cleanup.patch
local_t-x86_64-extension.patch

For 2.6.22

linux-kernel-markers-kconfig-menus.patch
linux-kernel-markers-architecture-independant-code.patch
linux-kernel-markers-powerpc-optimization.patch
linux-kernel-markers-i386-optimization.patch
markers-add-instrumentation-markers-menus-to-avr32.patch
linux-kernel-markers-non-optimized-architectures.patch
markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
linux-kernel-markers-documentation.patch
#
markers-define-the-linker-macro-extra_rwdata.patch
markers-use-extra_rwdata-in-architectures.patch
#
some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
no-longer-include-asm-kdebugh.patch

Hold.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 20:53 ` Andrew Morton
@ 2007-05-02 23:11 ` Mathieu Desnoyers
2007-05-02 23:21 ` Andrew Morton
` (2 more replies)
2007-05-03 8:09 ` Christoph Hellwig
1 sibling, 3 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-02 23:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Wed, 2 May 2007 16:36:27 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > * Christoph Hellwig (hch@infradead.org) wrote:
> > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > > That doesn't constitute using it.
> > > >
> > > > Andi, there was a huge amount of discussion about all this in September last
> > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > > believe, that the kernel should have a static marker infrastructure.
> > >
> > > Only when it's actually useable. A prerequisite for merging it is
> > > having an actual trace transport infrastructure as well as a few actually
> > > useful tracing modules in the kernel tree.
> > >
> > > Let this count as a vote to merge the markers once we have the infrastructure
> > > above ready, it'll be very useful then.
> >
> > Hi Christoph,
> >
> > The idea is the following : either we integrate the infrastructure for
> > instrumentation / data serialization / buffer management / extraction of
> > data to user space in multiple different steps, which makes code review
> > easier for you guys, or we bring the main pieces of the LTTng project
> > altogether with the Linux Kernel Markers, which would result in a bigger
> > change.
> >
> > Based on the premise that discussing about logically distinct pieces of
> > infrastructure is easier and can be done more thoroughly when done
> > separately, we decided to submit the markers first, with the other
> > pieces planned in a near future.
> >
> > I agree that it would be very useful to have the full tracing stack
> > available in the Linux kernel, but we inevitably face the argument :
> > "this change is too big" if we submit all LTTng modules at once or
> > the argument : "we want the whole tracing stack, not just part of it"
> > if we don't.
> >
> > This is why we chose to push the tracing infrastructure chunk by chunk :
> > to make code review and criticism more efficient.
> >
>
> I didn't know that this was the plan.
>
> The problem I have with this is that once we've merged one part, we're
> committed to merging the other parts even though we haven't seen them yet.
>
> What happens if there's a revolt over the next set of patches? Do we
> remove the core markers patches again? We end up in a cant-go-forward,
> cant-go-backward situation.
>
> I thought the existing code was useful as-is for several projects, without
> requiring additional patching to core kernel. If such additional patching
> _is_ needed to make the markers code useful then I agree that we should
> continue to buffer the markers code in -mm until the
> use-markers-for-something patches have been eyeballed.
>

My statement was probably not clear enough. The actual marker code is
useful as-is without any further kernel patching required : SystemTAP is
an example where they use external modules to load probes that can
connect either to markers or through kprobes. LTTng, in its current state,
has a mostly modular core that also uses the markers.

Although some, like Christoph and myself, think that it would benefit to
the kernel community to have a common infrastructure for more than just
markers (meaning common serialization and buffering mechanism), it does
not change the fact that the markers, being in mainline, are usable by
projects through additional kernel modules.

If we are looking at current "potential users" that are already in
mainline, we could change blktrace to make it use the markers.

Mathieu

> In which case we have:
>
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
>
> For 2.6.22
>
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch
>
> Hold.
>

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 23:11 ` Mathieu Desnoyers
@ 2007-05-02 23:21 ` Andrew Morton
2007-05-03 15:04 ` Mathieu Desnoyers
2007-05-03 8:06 ` Christoph Hellwig
2007-05-03 10:31 ` Andi Kleen
2 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 23:21 UTC (permalink / raw)
To: Mathieu Desnoyers; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

On Wed, 2 May 2007 19:11:04 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> > I didn't know that this was the plan.
> >
> > The problem I have with this is that once we've merged one part, we're
> > committed to merging the other parts even though we haven't seen them yet.
> >
> > What happens if there's a revolt over the next set of patches? Do we
> > remove the core markers patches again? We end up in a cant-go-forward,
> > cant-go-backward situation.
> >
> > I thought the existing code was useful as-is for several projects, without
> > requiring additional patching to core kernel. If such additional patching
> > _is_ needed to make the markers code useful then I agree that we should
> > continue to buffer the markers code in -mm until the
> > use-markers-for-something patches have been eyeballed.
> >
>
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

OK, that's what I thought.

> Although some, like Christoph and myself, think that it would benefit to
> the kernel community to have a common infrastructure for more than just
> markers (meaning common serialization and buffering mechanism), it does
> not change the fact that the markers, being in mainline, are usable by
> projects through additional kernel modules.
>
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

That'd be a useful demonstration.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-02 23:21 ` Andrew Morton
@ 2007-05-03 15:04 ` Mathieu Desnoyers
2007-05-03 15:12 ` Christoph Hellwig
0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 15:04 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

* Andrew Morton (akpm@linux-foundation.org) wrote:
> > Although some, like Christoph and myself, think that it would benefit to
> > the kernel community to have a common infrastructure for more than just
> > markers (meaning common serialization and buffering mechanism), it does
> > not change the fact that the markers, being in mainline, are usable by
> > projects through additional kernel modules.
> >
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
>
> That'd be a useful demonstration.

Here is a proof of concept patch, for demonstration purposes, of moving
blktrace to the markers.

A few remarks : this patch has the positive effect of removing some code
from the block io tracing hot paths, minimizing the i-cache impact in a
system where the io tracing is compiled in but inactive.

It also moves the blk tracing code from a header (and therefore from the
body of the instrumented functions) to a separate C file.

There, as soon as one device has to be traced, every device has to fall
into the tracing function call. This is slower than the previous inline
function which tested the condition quickly. If it becomes a show stopper,
it could be fixed by having the possibility to test a supplementary
condition, dependent on the marker context, at the marker site, just after
the enable/disable test.

It does not make the code smaller, since I left all the specialized
tracing functions for requests, bio, generic, remap, which would go away
once a generic infrastructure is in place to serialize the information
passed to the marker. This is mostly why I consider it a proof of concept.

Patch named "markers-port-blktrace-to-markers.patch", can be placed after
the marker patches in the 2.6.21-rc7-mm2 series.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Index: linux-2.6-lttng/block/elevator.c
===================================================================
--- linux-2.6-lttng.orig/block/elevator.c	2007-05-02 20:33:22.000000000 -0400
+++ linux-2.6-lttng/block/elevator.c	2007-05-02 20:33:49.000000000 -0400
@@ -32,7 +32,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/delay.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/hash.h>

 #include <asm/uaccess.h>
@@ -571,7 +571,7 @@
 	unsigned ordseq;
 	int unplug_it = 1;

-	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+	MARK(blk_request_insert, "%p %p", q, rq);

 	rq->q = q;

@@ -757,7 +757,7 @@
 			 * not be passed by new incoming requests
 			 */
 			rq->cmd_flags |= REQ_STARTED;
-			blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+			MARK(blk_request_issue, "%p %p", q, rq);
 		}

 		if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===================================================================
--- linux-2.6-lttng.orig/block/ll_rw_blk.c	2007-05-02 20:33:32.000000000 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c	2007-05-02 23:21:02.000000000 -0400
@@ -28,6 +28,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
+#include <linux/marker.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>

@@ -1551,7 +1552,7 @@

 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+		MARK(blk_plug_device, "%p %p %d", q, NULL, 0);
 	}
 }

@@ -1617,7 +1618,7 @@
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
 	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+		MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 					q->rq.count[READ] + q->rq.count[WRITE]);

 		q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);

-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+	MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);

 	q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
 	request_queue_t *q = (request_queue_t *)data;

-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+	MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);

 	kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@

 	rq_init(q, rq);

-	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+	MARK(blk_get_request, "%p %p %d", q, bio, rw);
 out:
 	return rq;
 }
@@ -2178,7 +2179,7 @@
 	if (!rq) {
 		struct io_context *ioc;

-		blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+		MARK(blk_sleep_request, "%p %p %d", q, bio, rw);

 		__generic_unplug_device(q);
 		spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
-	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+	MARK(blk_requeue, "%p %p", q, rq);

 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
 			if (!ll_back_merge_fn(q, req, bio))
 				break;

-			blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+			MARK(blk_bio_backmerge, "%p %p", q, bio);

 			req->biotail->bi_next = bio;
 			req->biotail = bio;
@@ -2954,7 +2955,7 @@
 			if (!ll_front_merge_fn(q, req, bio))
 				break;

-			blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+			MARK(blk_bio_frontmerge, "%p %p", q, bio);

 			bio->bi_next = req->bio;
 			req->bio = bio;
@@ -3184,10 +3185,10 @@
 		blk_partition_remap(bio);

 		if (old_sector != -1)
-			blk_add_trace_remap(q, bio, old_dev, bio->bi_sector,
-					    old_sector);
+			MARK(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev,
+				(u64)bio->bi_sector, (u64)old_sector);

-		blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+		MARK(blk_bio_queue, "%p %p", q, bio);

 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;
@@ -3329,7 +3330,7 @@
 	int total_bytes, bio_nbytes, error, next_idx = 0;
 	struct bio *bio;

-	blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+	MARK(blk_request_complete, "%p %p", req->q, req);

 	/*
 	 * extend uptodate bool to allow < 0 value to be direct io error
Index: linux-2.6-lttng/block/Kconfig
===================================================================
--- linux-2.6-lttng.orig/block/Kconfig	2007-05-02 20:34:30.000000000 -0400
+++ linux-2.6-lttng/block/Kconfig	2007-05-02 20:34:53.000000000 -0400
@@ -32,6 +32,7 @@
 	depends on SYSFS
 	select RELAY
 	select DEBUG_FS
+	select MARKERS
 	help
 	  Say Y here, if you want to be able to trace the block layer actions
 	  on a given queue. Tracing allows you to see any traffic happening
Index: linux-2.6-lttng/block/Makefile
===================================================================
--- linux-2.6-lttng.orig/block/Makefile	2007-05-02 21:20:30.000000000 -0400
+++ linux-2.6-lttng/block/Makefile	2007-05-02 21:20:46.000000000 -0400
@@ -9,4 +9,4 @@
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o

-obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
+obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o blk-probe.o
Index: linux-2.6-lttng/block/blk-probe.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/block/blk-probe.c	2007-05-02 23:43:44.000000000 -0400
@@ -0,0 +1,276 @@
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/crc32.h>
+#include <linux/marker.h>
+#include <linux/blktrace_api.h>
+
+
+/**
+ * blk_add_trace_rq - Add a trace for a request oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @rq:		the source request
+ *
+ * Description:
+ *     Records an action against a request. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_rq(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	int rw;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct request *rq;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	rq = va_arg(args, struct request *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rw = rq->cmd_flags & 0x03;
+
+	if (likely(!bt))
+		return;
+
+	if (blk_pc_request(rq)) {
+		what |= BLK_TC_ACT(BLK_TC_PC);
+		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+	} else {
+		what |= BLK_TC_ACT(BLK_TC_FS);
+		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+	}
+}
+
+/**
+ * blk_add_trace_bio - Add a trace for a bio oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ *
+ * Description:
+ *     Records an action against a bio. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_bio(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+/**
+ * blk_add_trace_generic - Add a trace for a generic action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @rw:		the data direction
+ *
+ * Description:
+ *     Records a simple trace
+ *
+ **/
+static void blk_add_trace_generic(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	int rw;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	rw = va_arg(args, int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		blk_add_trace_bio(mdata, "%p %p", q, bio);
+	else
+		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+/**
+ * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @pdu:	the integer payload
+ *
+ * Description:
+ *     Adds a trace with some integer payload. This might be an unplug
+ *     option given as the action, with the depth at unplug time given
+ *     as the payload
+ *
+ **/
+static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	unsigned int pdu;
+	__be64 rpdu;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	pdu = va_arg(args, unsigned int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rpdu = cpu_to_be64(pdu);
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
+	else
+		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+}
+
+/**
+ * blk_add_trace_remap - Add a trace for a remap operation
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @dev:	target device
+ * @from:	source sector
+ * @to:		target sector
+ *
+ * Description:
+ *     Device mapper or raid target sometimes need to split a bio because
+ *     it spans a stripe (or similar). Add a trace for that action.
+ *
+ **/
+static void blk_add_trace_remap(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	struct blk_io_trace_remap r;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	u64 dev, from, to;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	dev = va_arg(args, u64);
+	from = va_arg(args, u64);
+	to = va_arg(args, u64);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	r.device = cpu_to_be32(dev);
+	r.sector = cpu_to_be64(to);
+
+	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
+}
+
+#define FACILITY_NAME "blk"
+
+static struct blk_probe_data probe_array[] =
+{
+	{ "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio },
+	{ "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio },
+	{ "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio },
+	{ "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic },
+	{ "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ,
+		blk_add_trace_generic },
+	{ "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq },
+	{ "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq },
+	{ "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq },
+	{ "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic },
+	{ "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO,
+		blk_add_trace_pdu_int },
+	{ "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER,
+		blk_add_trace_pdu_int },
+	{ "blk_request_insert", "%p %p", BLK_TA_INSERT,
+		blk_add_trace_rq },
+	{ "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT,
+		blk_add_trace_pdu_int },
+	{ "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio },
+	{ "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP,
+		blk_add_trace_remap },
+};
+
+
+#define NUM_PROBES ARRAY_SIZE(probe_array)
+
+int blk_probe_connect(void)
+{
+	int result;
+	uint8_t i;
+
+	for (i = 0; i < NUM_PROBES; i++) {
+		result = marker_set_probe(probe_array[i].name,
+				probe_array[i].format,
+				probe_array[i].callback, &probe_array[i]);
+		if (!result)
+			printk(KERN_INFO
+				"blktrace unable to register probe %s\n",
+				probe_array[i].name);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_probe_connect);
+
+void blk_probe_disconnect(void)
+{
+	uint8_t i;
+
+	for (i = 0; i < NUM_PROBES; i++) {
+		marker_remove_probe(probe_array[i].name);
+	}
+	synchronize_sched();	/* Wait for probes to finish */
+}
+EXPORT_SYMBOL_GPL(blk_probe_disconnect);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION(FACILITY_NAME " probe");
Index: linux-2.6-lttng/block/blktrace.c
===================================================================
--- linux-2.6-lttng.orig/block/blktrace.c	2007-05-02 20:33:15.000000000 -0400
+++ linux-2.6-lttng/block/blktrace.c	2007-05-02 23:48:32.000000000 -0400
@@ -28,6 +28,10 @@
 static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
 static unsigned int blktrace_seq __read_mostly = 1;

+/* Global reference count of probes */
+static struct mutex blk_probe_mutex;
+static int blk_probes_ref = 0;
+
 /*
  * Send out a notify message.
  */
@@ -229,6 +233,12 @@
 	blk_remove_tree(bt->dir);
 	free_percpu(bt->sequence);
 	kfree(bt);
+	mutex_lock(&blk_probe_mutex);
+	if (blk_probes_ref == 1) {
+		blk_probe_disconnect();
+		blk_probes_ref--;
+	}
+	mutex_unlock(&blk_probe_mutex);
 }

 static int blk_trace_remove(request_queue_t *q)
@@ -386,6 +396,14 @@
 		goto err;
 	}

+	/* Connect probes to markers */
+	mutex_lock(&blk_probe_mutex);
+	if (!blk_probes_ref) {
+		blk_probe_connect();
+		blk_probes_ref++;
+	}
+	mutex_unlock(&blk_probe_mutex);
+
 	return 0;
 err:
 	if (dir)
@@ -552,6 +570,7 @@
 static __init int blk_trace_init(void)
 {
 	mutex_init(&blk_tree_mutex);
+	mutex_init(&blk_probe_mutex);
 	on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
 	blk_trace_set_ht_offsets();

Index: linux-2.6-lttng/include/linux/blktrace_api.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/blktrace_api.h	2007-05-02 20:45:58.000000000 -0400
+++ linux-2.6-lttng/include/linux/blktrace_api.h	2007-05-02 22:12:46.000000000 -0400
@@ -3,6 +3,7 @@

 #include <linux/blkdev.h>
 #include <linux/relay.h>
+#include <linux/marker.h>

 /*
  * Trace categories
@@ -142,149 +143,24 @@
 	u32 pid;
 };

+/* Probe data used for probe-marker connection */
+struct blk_probe_data {
+	const char *name;
+	const char *format;
+	u32 flags;
+	marker_probe_func *callback;
+};
+
 #if defined(CONFIG_BLK_DEV_IO_TRACE)
 extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
 extern void blk_trace_shutdown(request_queue_t *);
 extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *);
-
-/**
- * blk_add_trace_rq - Add a trace for a request oriented action
- * @q:		queue the io is for
- * @rq:		the source request
- * @what:	the action
- *
- * Description:
- *     Records an action against a request. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
-				    u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-	int rw = rq->cmd_flags & 0x03;
-
-	if (likely(!bt))
-		return;
-
-	if (blk_pc_request(rq)) {
-		what |= BLK_TC_ACT(BLK_TC_PC);
-		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
-	} else {
-		what |= BLK_TC_ACT(BLK_TC_FS);
-		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
-	}
-}
-
-/**
- * blk_add_trace_bio - Add a trace for a bio oriented action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @what:	the action
- *
- * Description:
- *     Records an action against a bio. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
-				     u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
-}
-
-/**
- * blk_add_trace_generic - Add a trace for a generic action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @rw:		the data direction
- * @what:	the action
- *
- * Description:
- *     Records a simple trace
- *
- **/
-static inline void blk_add_trace_generic(struct request_queue *q,
-					 struct bio *bio, int rw, u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		blk_add_trace_bio(q, bio, what);
-	else
-		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
-}
-
-/**
- * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
- * @q:		queue the io is for
- * @what:	the action
- * @bio:	the source bio
- * @pdu:	the integer payload
- *
- * Description:
- *     Adds a trace with some integer payload. This might be an unplug
- *     option given as the action, with the depth at unplug time given
- *     as the payload
- *
- **/
-static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what,
-					 struct bio *bio, unsigned int pdu)
-{
-	struct blk_trace *bt = q->blk_trace;
-	__be64 rpdu = cpu_to_be64(pdu);
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
-	else
-		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
-}
-
-/**
- * blk_add_trace_remap - Add a trace for a remap operation
- * @q:		queue the io is for
- * @bio:	the source bio
- * @dev:	target device
- * @from:	source sector
- * @to:		target sector
- *
- * Description:
- *     Device mapper or raid target sometimes need to split a bio because
- *     it spans a stripe (or similar). Add a trace for that action.
- *
- **/
-static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
-				       dev_t dev, sector_t from, sector_t to)
-{
-	struct blk_trace *bt = q->blk_trace;
-	struct blk_io_trace_remap r;
-
-	if (likely(!bt))
-		return;
-
-	r.device = cpu_to_be32(dev);
-	r.sector = cpu_to_be64(to);
-
-	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
-}
+extern int blk_probe_connect(void);
+extern void blk_probe_disconnect(void);

 #else /* !CONFIG_BLK_DEV_IO_TRACE */
 #define blk_trace_ioctl(bdev, cmd, arg)		(-ENOTTY)
 #define blk_trace_shutdown(q)			do { } while (0)
-#define blk_add_trace_rq(q, rq, what)		do { } while (0)
-#define blk_add_trace_bio(q, rq, what)		do { } while (0)
-#define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
-#define blk_add_trace_pdu_int(q, what, bio, pdu)	do { } while (0)
-#define blk_add_trace_remap(q, bio, dev, f, t)	do {} while (0)
 #endif /* CONFIG_BLK_DEV_IO_TRACE */

 #endif
Index: linux-2.6-lttng/mm/bounce.c
===================================================================
--- linux-2.6-lttng.orig/mm/bounce.c	2007-05-02 21:34:39.000000000 -0400
+++ linux-2.6-lttng/mm/bounce.c	2007-05-02 21:36:17.000000000 -0400
@@ -13,7 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>

 #define POOL_SIZE	64
@@ -237,7 +237,7 @@
 	if (!bio)
 		return;

-	blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);
+	MARK(blk_bio_bounce, "%p %p", q, *bio_orig);

 	/*
 	 * at least one page was bounced, fill in possible non-highmem
Index: linux-2.6-lttng/mm/highmem.c
===================================================================
--- linux-2.6-lttng.orig/mm/highmem.c	2007-05-02 21:36:27.000000000 -0400
+++ linux-2.6-lttng/mm/highmem.c	2007-05-02 21:36:39.000000000 -0400
@@ -26,7 +26,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>

 /*
Index: linux-2.6-lttng/fs/bio.c
===================================================================
--- linux-2.6-lttng.orig/fs/bio.c	2007-05-02 21:37:52.000000000 -0400
+++ linux-2.6-lttng/fs/bio.c	2007-05-02 21:40:30.000000000 -0400
@@ -25,7 +25,7 @@
 #include <linux/module.h>
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <scsi/sg.h>		/* for struct sg_iovec */

 #define BIO_POOL_SIZE 2
@@ -1081,7 +1081,7 @@
 	if (!bp)
 		return bp;

-	blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi,
+	MARK(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi,
 				bi->bi_sector + first_sectors);

 	BUG_ON(bi->bi_vcnt != 1);
Index: linux-2.6-lttng/drivers/block/cciss.c
===================================================================
--- linux-2.6-lttng.orig/drivers/block/cciss.c	2007-05-02 21:44:30.000000000 -0400
+++ linux-2.6-lttng/drivers/block/cciss.c	2007-05-02 21:45:08.000000000 -0400
@@ -37,7 +37,7 @@
 #include <linux/hdreg.h>
 #include <linux/spinlock.h>
 #include <linux/compat.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>

@@ -2502,7 +2502,7 @@
 	}
 	cmd->rq->data_len = 0;
 	cmd->rq->completion_data = cmd;
-	blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+	MARK(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq);
 	blk_complete_request(cmd->rq);
 }

Index: linux-2.6-lttng/drivers/md/dm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/md/dm.c	2007-05-02 21:44:41.000000000 -0400
+++ linux-2.6-lttng/drivers/md/dm.c	2007-05-02 21:47:19.000000000 -0400
@@ -19,7 +19,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/hdreg.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/smp_lock.h>

 #define DM_MSG_PREFIX "core"
@@ -485,8 +485,8 @@
 			wake_up(&io->md->wait);

 		if (io->error != DM_ENDIO_REQUEUE) {
-			blk_add_trace_bio(io->md->queue, io->bio,
-					  BLK_TA_COMPLETE);
+			MARK(blk_request_complete, "%p %p",
+				io->md->queue, io->bio);

 			bio_endio(io->bio, io->bio->bi_size, io->error);
 		}
@@ -582,10 +582,10 @@
 	r = ti->type->map(ti, clone, &tio->info);
 	if (r == DM_MAPIO_REMAPPED) {
 		/* the bio has been remapped so dispatch it */
-
-		blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
-				    tio->io->bio->bi_bdev->bd_dev, sector,
-				    clone->bi_sector);
+		MARK(blk_remap, "%p %p %u %llu %llu",
+			bdev_get_queue(clone->bi_bdev), clone,
+			(u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector,
+			(u64)clone->bi_sector);

 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-03 15:04 ` Mathieu Desnoyers
@ 2007-05-03 15:12 ` Christoph Hellwig
2007-05-03 17:16 ` Mathieu Desnoyers
0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 15:12 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote:
> -	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
> +	MARK(blk_request_insert, "%p %p", q, rq);

I don't really like the shouting MARK name very much. Can we have a
less generic, less shouting name, e.g. trace_marker? The above would
then be:

	trace_mark(blk_request_insert, "%p %p", q, rq);

> +#define NUM_PROBES ARRAY_SIZE(probe_array)

Just get rid of this and use ARRAY_SIZE directly below.

> +int blk_probe_connect(void)
> +{
> +	int result;
> +	uint8_t i;

Just use an int for for loops. It's easy to read and probably faster on
most systems (if the compiler isn't smart enough and promotes it to int
anyway during code generation).

> +void blk_probe_disconnect(void)
> +{
> +	uint8_t i;
> +
> +	for (i = 0; i < NUM_PROBES; i++) {
> +		marker_remove_probe(probe_array[i].name);
> +	}
> +	synchronize_sched();	/* Wait for probes to finish */

kprobes does this kind of synchronization internally, so the marker
wrapper probably should as well.

> +static int blk_probes_ref = 0;

No need to initialize this.

> /*
>  * Send out a notify message.
>  */
> @@ -229,6 +233,12 @@
>  	blk_remove_tree(bt->dir);
>  	free_percpu(bt->sequence);
>  	kfree(bt);
> +	mutex_lock(&blk_probe_mutex);
> +	if (blk_probes_ref == 1) {
> +		blk_probe_disconnect();
> +		blk_probes_ref--;
> +	}

	if (--blk_probes_ref == 0)
		blk_probe_disconnect();

would probably be a tad cleaner.

> +	if (!blk_probes_ref) {
> +		blk_probe_connect();
> +		blk_probes_ref++;
> +	}

Ditto here with a:

	if (!blk_probes_ref++)
		blk_probe_connect();

Also, the "connect" in the name seems rather odd; what about arm/disarm
instead?

> static __init int blk_trace_init(void)
> {
>  	mutex_init(&blk_tree_mutex);
> +	mutex_init(&blk_probe_mutex);

Both should use DEFINE_MUTEX for compile-time initialization instead.

Also, it's probably better to put the trace points into blktrace.c; that
means all blktrace code can be static and self-contained. And we can
probably do some additional cleanups by simplifying things later on.

^ permalink raw reply [flat|nested] 233+ messages in thread
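The reference-count idiom suggested in the review above can be tried out in
user space. What follows is only a minimal sketch, not the kernel code: the
names probe_arm(), probe_disarm(), trace_user_get() and trace_user_put() are
hypothetical stand-ins, and the locking (blk_probe_mutex in the patch) is
left to the caller. Note that the quoted original only ever moves the counter
between 0 and 1, so the suggested form also changes behaviour when two traces
are active at the same time.

```c
#include <assert.h>

/* Hypothetical stand-ins for blk_probe_arm()/blk_probe_disarm(). */
static int probes_armed;	/* 1 while the probes are connected */
static int arm_calls;		/* how often probe_arm() ran */
static int disarm_calls;	/* how often probe_disarm() ran */

static void probe_arm(void)
{
	probes_armed = 1;
	arm_calls++;
}

static void probe_disarm(void)
{
	probes_armed = 0;
	disarm_calls++;
}

/* Static storage is zero-initialized, so no "= 0" is needed. */
static int probes_ref;

/* Called when a trace is set up; the caller is assumed to hold the lock. */
static void trace_user_get(void)
{
	if (!probes_ref++)
		probe_arm();	/* first user connects the probes */
}

/* Called when a trace is torn down; the caller is assumed to hold the lock. */
static void trace_user_put(void)
{
	assert(probes_ref > 0);
	if (--probes_ref == 0)
		probe_disarm();	/* last user disconnects the probes */
}
```

With two users, the probes are armed once on the first trace_user_get() and
stay armed until the second trace_user_put().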
* Re: 2.6.22 -mm merge plans
2007-05-03 15:12 ` Christoph Hellwig
@ 2007-05-03 17:16 ` Mathieu Desnoyers
2007-05-03 17:25 ` Christoph Hellwig
0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 17:16 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg

Here is the reworked patch, addressing everything except one comment:

* Christoph Hellwig (hch@infradead.org) wrote:
> > +void blk_probe_disconnect(void)
> > +{
> > +	uint8_t i;
> > +
> > +	for (i = 0; i < NUM_PROBES; i++) {
> > +		marker_remove_probe(probe_array[i].name);
> > +	}
> > +	synchronize_sched();	/* Wait for probes to finish */
>
> kprobes does this kind of synchronization internally, so the marker
> wrapper probably should as well.
>

The problem appears on heavily loaded systems. Doing 50
synchronize_sched() calls in a row can take up to a few seconds on a
4-way machine. This is why I prefer to do it in the module to which
the callbacks belong.

Here is the revised patch. It depends on a newer version of markers I'll
send to Andrew soon.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Index: linux-2.6-lttng/block/elevator.c =================================================================== --- linux-2.6-lttng.orig/block/elevator.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/block/elevator.c 2007-05-03 12:54:58.000000000 -0400 @@ -32,7 +32,7 @@ #include <linux/init.h> #include <linux/compiler.h> #include <linux/delay.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <linux/hash.h> #include <asm/uaccess.h> @@ -571,7 +571,7 @@ unsigned ordseq; int unplug_it = 1; - blk_add_trace_rq(q, rq, BLK_TA_INSERT); + trace_mark(blk_request_insert, "%p %p", q, rq); rq->q = q; @@ -757,7 +757,7 @@ * not be passed by new incoming requests */ rq->cmd_flags |= REQ_STARTED; - blk_add_trace_rq(q, rq, BLK_TA_ISSUE); + trace_mark(blk_request_issue, "%p %p", q, rq); } if (!q->boundary_rq || q->boundary_rq == rq) { Index: linux-2.6-lttng/block/ll_rw_blk.c =================================================================== --- linux-2.6-lttng.orig/block/ll_rw_blk.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/block/ll_rw_blk.c 2007-05-03 12:54:58.000000000 -0400 @@ -28,6 +28,7 @@ #include <linux/task_io_accounting_ops.h> #include <linux/interrupt.h> #include <linux/cpu.h> +#include <linux/marker.h> #include <linux/blktrace_api.h> #include <linux/fault-inject.h> @@ -1551,7 +1552,7 @@ if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); + trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0); } } @@ -1617,7 +1618,7 @@ * devices don't necessarily have an ->unplug_fn defined */ if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1628,7 +1629,7 @@ { request_queue_t *q = container_of(work, request_queue_t, unplug_work); - 
blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, + trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); q->unplug_fn(q); @@ -1638,7 +1639,7 @@ { request_queue_t *q = (request_queue_t *)data; - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, + trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL, q->rq.count[READ] + q->rq.count[WRITE]); kblockd_schedule_work(&q->unplug_work); @@ -2148,7 +2149,7 @@ rq_init(q, rq); - blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ); + trace_mark(blk_get_request, "%p %p %d", q, bio, rw); out: return rq; } @@ -2178,7 +2179,7 @@ if (!rq) { struct io_context *ioc; - blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); + trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw); __generic_unplug_device(q); spin_unlock_irq(q->queue_lock); @@ -2252,7 +2253,7 @@ */ void blk_requeue_request(request_queue_t *q, struct request *rq) { - blk_add_trace_rq(q, rq, BLK_TA_REQUEUE); + trace_mark(blk_requeue, "%p %p", q, rq); if (blk_rq_tagged(rq)) blk_queue_end_tag(q, rq); @@ -2937,7 +2938,7 @@ if (!ll_back_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); + trace_mark(blk_bio_backmerge, "%p %p", q, bio); req->biotail->bi_next = bio; req->biotail = bio; @@ -2954,7 +2955,7 @@ if (!ll_front_merge_fn(q, req, bio)) break; - blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); + trace_mark(blk_bio_frontmerge, "%p %p", q, bio); bio->bi_next = req->bio; req->bio = bio; @@ -3184,10 +3185,10 @@ blk_partition_remap(bio); if (old_sector != -1) - blk_add_trace_remap(q, bio, old_dev, bio->bi_sector, - old_sector); + trace_mark(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev, + (u64)bio->bi_sector, (u64)old_sector); - blk_add_trace_bio(q, bio, BLK_TA_QUEUE); + trace_mark(blk_bio_queue, "%p %p", q, bio); old_sector = bio->bi_sector; old_dev = bio->bi_bdev->bd_dev; @@ -3329,7 +3330,7 @@ int total_bytes, bio_nbytes, error, next_idx = 0; struct bio *bio; - blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE); + 
trace_mark(blk_request_complete, "%p %p", req->q, req); /* * extend uptodate bool to allow < 0 value to be direct io error Index: linux-2.6-lttng/block/Kconfig =================================================================== --- linux-2.6-lttng.orig/block/Kconfig 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/block/Kconfig 2007-05-03 12:54:58.000000000 -0400 @@ -32,6 +32,7 @@ depends on SYSFS select RELAY select DEBUG_FS + select MARKERS help Say Y here, if you want to be able to trace the block layer actions on a given queue. Tracing allows you to see any traffic happening Index: linux-2.6-lttng/block/blktrace.c =================================================================== --- linux-2.6-lttng.orig/block/blktrace.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/block/blktrace.c 2007-05-03 13:05:30.000000000 -0400 @@ -23,11 +23,19 @@ #include <linux/mutex.h> #include <linux/debugfs.h> #include <linux/time.h> +#include <linux/marker.h> #include <asm/uaccess.h> static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, }; static unsigned int blktrace_seq __read_mostly = 1; +/* Global reference count of probes */ +static DEFINE_MUTEX(blk_probe_mutex); +static int blk_probes_ref; + +int blk_probe_arm(void); +void blk_probe_disarm(void); + /* * Send out a notify message. 
*/ @@ -179,7 +187,7 @@ EXPORT_SYMBOL_GPL(__blk_add_trace); static struct dentry *blk_tree_root; -static struct mutex blk_tree_mutex; +static DEFINE_MUTEX(blk_tree_mutex); static unsigned int root_users; static inline void blk_remove_root(void) @@ -229,6 +237,10 @@ blk_remove_tree(bt->dir); free_percpu(bt->sequence); kfree(bt); + mutex_lock(&blk_probe_mutex); + if (--blk_probes_ref == 0) + blk_probe_disarm(); + mutex_unlock(&blk_probe_mutex); } static int blk_trace_remove(request_queue_t *q) @@ -386,6 +398,11 @@ goto err; } + mutex_lock(&blk_probe_mutex); + if (!blk_probes_ref++) + blk_probe_arm(); + mutex_unlock(&blk_probe_mutex); + return 0; err: if (dir) @@ -549,9 +566,270 @@ #endif } +/** + * blk_add_trace_rq - Add a trace for a request oriented action + * Expected variable arguments : + * @q: queue the io is for + * @rq: the source request + * + * Description: + * Records an action against a request. Will log the bio offset + size. + * + **/ +static void blk_add_trace_rq(const struct __mark_marker_data *mdata, + const char *fmt, ...) +{ + va_list args; + u32 what; + struct blk_trace *bt; + int rw; + struct blk_probe_data *pinfo = mdata->pdata; + struct request_queue *q; + struct request *rq; + + va_start(args, fmt); + q = va_arg(args, struct request_queue *); + rq = va_arg(args, struct request *); + va_end(args); + + what = pinfo->flags; + bt = q->blk_trace; + rw = rq->cmd_flags & 0x03; + + if (likely(!bt)) + return; + + if (blk_pc_request(rq)) { + what |= BLK_TC_ACT(BLK_TC_PC); + __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd); + } else { + what |= BLK_TC_ACT(BLK_TC_FS); + __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL); + } +} + +/** + * blk_add_trace_bio - Add a trace for a bio oriented action + * Expected variable arguments : + * @q: queue the io is for + * @bio: the source bio + * + * Description: + * Records an action against a bio. Will log the bio offset + size. 
+ * + **/ +static void blk_add_trace_bio(const struct __mark_marker_data *mdata, + const char *fmt, ...) +{ + va_list args; + u32 what; + struct blk_trace *bt; + struct blk_probe_data *pinfo = mdata->pdata; + struct request_queue *q; + struct bio *bio; + + va_start(args, fmt); + q = va_arg(args, struct request_queue *); + bio = va_arg(args, struct bio *); + va_end(args); + + what = pinfo->flags; + bt = q->blk_trace; + + if (likely(!bt)) + return; + + __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL); +} + +/** + * blk_add_trace_generic - Add a trace for a generic action + * Expected variable arguments : + * @q: queue the io is for + * @bio: the source bio + * @rw: the data direction + * + * Description: + * Records a simple trace + * + **/ +static void blk_add_trace_generic(const struct __mark_marker_data *mdata, + const char *fmt, ...) +{ + va_list args; + struct blk_trace *bt; + u32 what; + struct blk_probe_data *pinfo = mdata->pdata; + struct request_queue *q; + struct bio *bio; + int rw; + + va_start(args, fmt); + q = va_arg(args, struct request_queue *); + bio = va_arg(args, struct bio *); + rw = va_arg(args, int); + va_end(args); + + what = pinfo->flags; + bt = q->blk_trace; + + if (likely(!bt)) + return; + + if (bio) + blk_add_trace_bio(mdata, "%p %p", q, bio); + else + __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL); +} + +/** + * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload + * Expected variable arguments : + * @q: queue the io is for + * @bio: the source bio + * @pdu: the integer payload + * + * Description: + * Adds a trace with some integer payload. This might be an unplug + * option given as the action, with the depth at unplug time given + * as the payload + * + **/ +static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata, + const char *fmt, ...) 
+{ + va_list args; + struct blk_trace *bt; + u32 what; + struct blk_probe_data *pinfo = mdata->pdata; + struct request_queue *q; + struct bio *bio; + unsigned int pdu; + __be64 rpdu; + + va_start(args, fmt); + q = va_arg(args, struct request_queue *); + bio = va_arg(args, struct bio *); + pdu = va_arg(args, unsigned int); + va_end(args); + + what = pinfo->flags; + bt = q->blk_trace; + rpdu = cpu_to_be64(pdu); + + if (likely(!bt)) + return; + + if (bio) + __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu); + else + __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu); +} + +/** + * blk_add_trace_remap - Add a trace for a remap operation + * Expected variable arguments : + * @q: queue the io is for + * @bio: the source bio + * @dev: target device + * @from: source sector + * @to: target sector + * + * Description: + * Device mapper or raid target sometimes need to split a bio because + * it spans a stripe (or similar). Add a trace for that action. + * + **/ +static void blk_add_trace_remap(const struct __mark_marker_data *mdata, + const char *fmt, ...) 
+{ + va_list args; + struct blk_trace *bt; + struct blk_io_trace_remap r; + u32 what; + struct blk_probe_data *pinfo = mdata->pdata; + struct request_queue *q; + struct bio *bio; + u64 dev, from, to; + + va_start(args, fmt); + q = va_arg(args, struct request_queue *); + bio = va_arg(args, struct bio *); + dev = va_arg(args, u64); + from = va_arg(args, u64); + to = va_arg(args, u64); + va_end(args); + + what = pinfo->flags; + bt = q->blk_trace; + + if (likely(!bt)) + return; + + r.device = cpu_to_be32(dev); + r.sector = cpu_to_be64(to); + + __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r); +} + +#define FACILITY_NAME "blk" + +static struct blk_probe_data probe_array[] = +{ + { "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio }, + { "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio }, + { "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio }, + { "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic }, + { "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ, + blk_add_trace_generic }, + { "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq }, + { "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq }, + { "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq }, + { "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic }, + { "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO, + blk_add_trace_pdu_int }, + { "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER, + blk_add_trace_pdu_int }, + { "blk_request_insert", "%p %p", BLK_TA_INSERT, + blk_add_trace_rq }, + { "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT, + blk_add_trace_pdu_int }, + { "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio }, + { "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP, + blk_add_trace_remap }, +}; + + +int blk_probe_arm(void) +{ + int result; + int i; + + for (i = 0; i < ARRAY_SIZE(probe_array); i++) { + result = 
marker_set_probe(probe_array[i].name, + probe_array[i].format, + probe_array[i].callback, &probe_array[i]); + if (!result) + printk(KERN_INFO + "blktrace unable to register probe %s\n", + probe_array[i].name); + } + return 0; +} + +void blk_probe_disarm(void) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(probe_array); i++) { + marker_remove_probe(probe_array[i].name); + } + synchronize_sched(); /* Wait for probes to finish */ +} + + static __init int blk_trace_init(void) { - mutex_init(&blk_tree_mutex); on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1); blk_trace_set_ht_offsets(); Index: linux-2.6-lttng/include/linux/blktrace_api.h =================================================================== --- linux-2.6-lttng.orig/include/linux/blktrace_api.h 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/include/linux/blktrace_api.h 2007-05-03 12:54:58.000000000 -0400 @@ -3,6 +3,7 @@ #include <linux/blkdev.h> #include <linux/relay.h> +#include <linux/marker.h> /* * Trace categories @@ -142,149 +143,24 @@ u32 pid; }; +/* Probe data used for probe-marker connection */ +struct blk_probe_data { + const char *name; + const char *format; + u32 flags; + marker_probe_func *callback; +}; + #if defined(CONFIG_BLK_DEV_IO_TRACE) extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *); extern void blk_trace_shutdown(request_queue_t *); extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *); - -/** - * blk_add_trace_rq - Add a trace for a request oriented action - * @q: queue the io is for - * @rq: the source request - * @what: the action - * - * Description: - * Records an action against a request. Will log the bio offset + size. 
- * - **/ -static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq, - u32 what) -{ - struct blk_trace *bt = q->blk_trace; - int rw = rq->cmd_flags & 0x03; - - if (likely(!bt)) - return; - - if (blk_pc_request(rq)) { - what |= BLK_TC_ACT(BLK_TC_PC); - __blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd); - } else { - what |= BLK_TC_ACT(BLK_TC_FS); - __blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL); - } -} - -/** - * blk_add_trace_bio - Add a trace for a bio oriented action - * @q: queue the io is for - * @bio: the source bio - * @what: the action - * - * Description: - * Records an action against a bio. Will log the bio offset + size. - * - **/ -static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio, - u32 what) -{ - struct blk_trace *bt = q->blk_trace; - - if (likely(!bt)) - return; - - __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL); -} - -/** - * blk_add_trace_generic - Add a trace for a generic action - * @q: queue the io is for - * @bio: the source bio - * @rw: the data direction - * @what: the action - * - * Description: - * Records a simple trace - * - **/ -static inline void blk_add_trace_generic(struct request_queue *q, - struct bio *bio, int rw, u32 what) -{ - struct blk_trace *bt = q->blk_trace; - - if (likely(!bt)) - return; - - if (bio) - blk_add_trace_bio(q, bio, what); - else - __blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL); -} - -/** - * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload - * @q: queue the io is for - * @what: the action - * @bio: the source bio - * @pdu: the integer payload - * - * Description: - * Adds a trace with some integer payload. 
This might be an unplug - * option given as the action, with the depth at unplug time given - * as the payload - * - **/ -static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what, - struct bio *bio, unsigned int pdu) -{ - struct blk_trace *bt = q->blk_trace; - __be64 rpdu = cpu_to_be64(pdu); - - if (likely(!bt)) - return; - - if (bio) - __blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu); - else - __blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu); -} - -/** - * blk_add_trace_remap - Add a trace for a remap operation - * @q: queue the io is for - * @bio: the source bio - * @dev: target device - * @from: source sector - * @to: target sector - * - * Description: - * Device mapper or raid target sometimes need to split a bio because - * it spans a stripe (or similar). Add a trace for that action. - * - **/ -static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio, - dev_t dev, sector_t from, sector_t to) -{ - struct blk_trace *bt = q->blk_trace; - struct blk_io_trace_remap r; - - if (likely(!bt)) - return; - - r.device = cpu_to_be32(dev); - r.sector = cpu_to_be64(to); - - __blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r); -} +extern int blk_probe_connect(void); +extern void blk_probe_disconnect(void); #else /* !CONFIG_BLK_DEV_IO_TRACE */ #define blk_trace_ioctl(bdev, cmd, arg) (-ENOTTY) #define blk_trace_shutdown(q) do { } while (0) -#define blk_add_trace_rq(q, rq, what) do { } while (0) -#define blk_add_trace_bio(q, rq, what) do { } while (0) -#define blk_add_trace_generic(q, rq, rw, what) do { } while (0) -#define blk_add_trace_pdu_int(q, what, bio, pdu) do { } while (0) -#define blk_add_trace_remap(q, bio, dev, f, t) do {} while (0) #endif /* CONFIG_BLK_DEV_IO_TRACE */ #endif Index: linux-2.6-lttng/mm/bounce.c =================================================================== --- 
linux-2.6-lttng.orig/mm/bounce.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/mm/bounce.c 2007-05-03 12:54:58.000000000 -0400 @@ -13,7 +13,7 @@ #include <linux/init.h> #include <linux/hash.h> #include <linux/highmem.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <asm/tlbflush.h> #define POOL_SIZE 64 @@ -237,7 +237,7 @@ if (!bio) return; - blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE); + trace_mark(blk_bio_bounce, "%p %p", q, *bio_orig); /* * at least one page was bounced, fill in possible non-highmem Index: linux-2.6-lttng/mm/highmem.c =================================================================== --- linux-2.6-lttng.orig/mm/highmem.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/mm/highmem.c 2007-05-03 12:54:58.000000000 -0400 @@ -26,7 +26,7 @@ #include <linux/init.h> #include <linux/hash.h> #include <linux/highmem.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <asm/tlbflush.h> /* Index: linux-2.6-lttng/fs/bio.c =================================================================== --- linux-2.6-lttng.orig/fs/bio.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/fs/bio.c 2007-05-03 12:54:58.000000000 -0400 @@ -25,7 +25,7 @@ #include <linux/module.h> #include <linux/mempool.h> #include <linux/workqueue.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <scsi/sg.h> /* for struct sg_iovec */ #define BIO_POOL_SIZE 2 @@ -1081,7 +1081,7 @@ if (!bp) return bp; - blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi, + trace_mark(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi, bi->bi_sector + first_sectors); BUG_ON(bi->bi_vcnt != 1); Index: linux-2.6-lttng/drivers/block/cciss.c =================================================================== --- linux-2.6-lttng.orig/drivers/block/cciss.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/drivers/block/cciss.c 2007-05-03 12:54:58.000000000 -0400 @@ -37,7 +37,7 @@ #include <linux/hdreg.h> 
#include <linux/spinlock.h> #include <linux/compat.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <asm/uaccess.h> #include <asm/io.h> @@ -2502,7 +2502,7 @@ } cmd->rq->data_len = 0; cmd->rq->completion_data = cmd; - blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE); + trace_mark(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq); blk_complete_request(cmd->rq); } Index: linux-2.6-lttng/drivers/md/dm.c =================================================================== --- linux-2.6-lttng.orig/drivers/md/dm.c 2007-05-03 12:27:12.000000000 -0400 +++ linux-2.6-lttng/drivers/md/dm.c 2007-05-03 12:54:58.000000000 -0400 @@ -19,7 +19,7 @@ #include <linux/slab.h> #include <linux/idr.h> #include <linux/hdreg.h> -#include <linux/blktrace_api.h> +#include <linux/marker.h> #include <linux/smp_lock.h> #define DM_MSG_PREFIX "core" @@ -485,8 +485,8 @@ wake_up(&io->md->wait); if (io->error != DM_ENDIO_REQUEUE) { - blk_add_trace_bio(io->md->queue, io->bio, - BLK_TA_COMPLETE); + trace_mark(blk_request_complete, "%p %p", + io->md->queue, io->bio); bio_endio(io->bio, io->bio->bi_size, io->error); } @@ -582,10 +582,10 @@ r = ti->type->map(ti, clone, &tio->info); if (r == DM_MAPIO_REMAPPED) { /* the bio has been remapped so dispatch it */ - - blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone, - tio->io->bio->bi_bdev->bd_dev, sector, - clone->bi_sector); + trace_mark(blk_remap, "%p %p %u %llu %llu", + bdev_get_queue(clone->bi_bdev), clone, + (u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector, + (u64)clone->bi_sector); generic_make_request(clone); } else if (r < 0 || r == DM_MAPIO_REQUEUE) { -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 233+ messages in thread
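The probe callbacks in the patch above recover their arguments with va_arg(),
so each call site has to pass exactly the types the probe reads back. The
following user-space sketch illustrates that contract under stated
assumptions: struct remap_event and probe_remap() are hypothetical names, and
the real marker callback signature differs. It may be worth checking the
blk_remap call site in ll_rw_blk.c against this rule, since it appears to
pass old_dev without the (u64) cast while the probe reads a u64.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record filled in by the probe, for illustration only. */
struct remap_event {
	const void *q;
	const void *bio;
	uint64_t dev;
	uint64_t from;
	uint64_t to;
};

static struct remap_event last_event;

/*
 * Variadic probe: the format string documents the expected arguments,
 * but va_arg() trusts the caller. Every argument must be passed with
 * the exact type read here, hence the casts at the call site.
 */
static void probe_remap(const char *fmt, ...)
{
	va_list args;

	assert(fmt != NULL);
	va_start(args, fmt);
	last_event.q = va_arg(args, const void *);
	last_event.bio = va_arg(args, const void *);
	last_event.dev = va_arg(args, uint64_t);
	last_event.from = va_arg(args, uint64_t);
	last_event.to = va_arg(args, uint64_t);
	va_end(args);
}
```

A call site then casts each integer explicitly, for example
probe_remap("%p %p %llu %llu %llu", (const void *)q, (const void *)bio,
(uint64_t)dev, (uint64_t)from, (uint64_t)to). Passing a narrower integer
where the probe reads a 64-bit one is undefined behaviour.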
* Re: 2.6.22 -mm merge plans
2007-05-03 17:16 ` Mathieu Desnoyers
@ 2007-05-03 17:25 ` Christoph Hellwig
2007-05-10 19:39 ` Mathieu Desnoyers
0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 17:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg

On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > kprobes does this kind of synchronization internally, so the marker
> > wrapper probably should as well.
> >
>
> The problem appears on heavily loaded systems. Doing 50
> synchronize_sched() calls in a row can take up to a few seconds on a
> 4-way machine. This is why I prefer to do it in the module to which
> the callbacks belong.

We recently had a discussion on a batch unregistration interface for
kprobes. I'm not very happy with having such different interfaces for
different kinds of probe registrations.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans
2007-05-03 17:25 ` Christoph Hellwig
@ 2007-05-10 19:39 ` Mathieu Desnoyers
2007-05-13 21:04 ` Christoph Hellwig
0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-10 19:39 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg

* Christoph Hellwig (hch@infradead.org) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker
> > > wrapper probably should as well.
> > >
> >
> > The problem appears on heavily loaded systems. Doing 50
> > synchronize_sched() calls in a row can take up to a few seconds on a
> > 4-way machine. This is why I prefer to do it in the module to which
> > the callbacks belong.
>
> We recently had a discussion on a batch unregistration interface for
> kprobes. I'm not very happy with having such different interfaces for
> different kinds of probe registrations.
>

OK, I've had a look at the kprobes batch registration mechanisms and,
well, they do not look well suited for the markers. Adding supplementary
data structures such as linked lists of probes does not look like a good
match.

However, I agree with you that providing a similar API is good.

Therefore, here is my proposal:

The goal is to do the synchronize just after we unregister the last probe
handler provided by a given module. Since the unregistration functions
iterate over every marker present in the kernel, we can keep a count of
how many probes provided by the same module are still present. If we see
that we unregistered the last probe pointing to this module, we issue a
synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there; we only have to do 2 passes over the marker structures:
the first one finds the module associated with the callback, and the
second disables the callbacks and keeps a count of the number of callbacks
associated with the module.

Mathieu P.S.: here is the code. Linux Kernel Markers - Architecture Independant code Provide internal synchronize_sched() in batch. The goal is to do the synchronize just after we unregister the last probe handler provided by a given module. Since the unregistration functions iterate on every marker present in the kernel, we can keep a count of how many probes provided by the same module are still present. If we see that we unregistered the last probe pointing to this module, we issue a synchronize_sched(). It adds no data structure and keeps the same order of complexity as what is already there, we only have to do 2 passes in the marker structures : the first one finds the module associated with the callback and the second disables the callbacks and keep a count of the number of callbacks associated with the module. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> --- kernel/module.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 52 insertions(+), 10 deletions(-) Index: linux-2.6-lttng/kernel/module.c =================================================================== --- linux-2.6-lttng.orig/kernel/module.c 2007-05-10 14:48:28.000000000 -0400 +++ linux-2.6-lttng/kernel/module.c 2007-05-10 15:38:27.000000000 -0400 @@ -404,8 +404,12 @@ } /* Sets a range of markers to a disabled state : unset the enable bit and - * provide the empty callback. */ + * provide the empty callback. + * Keep a count of other markers connected to the same module as the one + * provided as parameter. 
*/ static int marker_remove_probe_range(const char *name, + struct module *probe_module, + int *ref_count, const struct __mark_marker *begin, const struct __mark_marker *end) { @@ -413,12 +417,17 @@ int found = 0; for (iter = begin; iter < end; iter++) { - if (strcmp(name, iter->mdata->name) == 0) { - marker_set_enable(iter->enable, 0, - iter->mdata->flags); - iter->mdata->call = __mark_empty_function; - found++; + if (strcmp(name, iter->mdata->name) != 0) { + if (probe_module) + if (__module_text_address( + (unsigned long)iter->mdata->call) + == probe_module) + (*ref_count)++; + continue; } + marker_set_enable(iter->enable, 0, iter->mdata->flags); + iter->mdata->call = __mark_empty_function; + found++; } return found; } @@ -450,6 +459,29 @@ return found; } +/* Get the module to which the probe handler's text belongs. + * Called with module_mutex taken. + * Returns NULL if the probe handler is not in a module. */ +static struct module *__marker_get_probe_module(const char *name) +{ + struct module *mod; + const struct __mark_marker *iter; + + list_for_each_entry(mod, &modules, list) { + if (mod->taints) + continue; + for (iter = mod->markers; + iter < mod->markers+mod->num_markers; iter++) { + if (strcmp(name, iter->mdata->name) != 0) + continue; + if (iter->mdata->call) + return __module_text_address( + (unsigned long)iter->mdata->call); + } + } + return NULL; +} + /* Calls _marker_set_probe_range for the core markers and modules markers. * Marker enabling/disabling use the modlist_lock to synchronise. */ int _marker_set_probe(int flags, const char *name, const char *format, @@ -477,23 +509,33 @@ EXPORT_SYMBOL_GPL(_marker_set_probe); /* Calls _marker_remove_probe_range for the core markers and modules markers. - * Marker enabling/disabling use the modlist_lock to synchronise. */ + * Marker enabling/disabling use the modlist_lock to synchronise. 
+ * ref_count is the number of markers still connected to the same module + * as the one in which sits the probe handler currently removed, excluding the + * one currently removed. If the count is 0, we issue a synchronize_sched() to + * make sure the module can safely unload. */ int marker_remove_probe(const char *name) { - struct module *mod; + struct module *mod, *probe_module; int found = 0; + int ref_count = 0; mutex_lock(&module_mutex); + /* In what module is the probe handler ? */ + probe_module = __marker_get_probe_module(name); /* Core kernel markers */ - found += marker_remove_probe_range(name, + found += marker_remove_probe_range(name, probe_module, &ref_count, __start___markers, __stop___markers); /* Markers in modules. */ list_for_each_entry(mod, &modules, list) { if (!mod->taints) - found += marker_remove_probe_range(name, + found += marker_remove_probe_range(name, probe_module, + &ref_count, mod->markers, mod->markers+mod->num_markers); } mutex_unlock(&module_mutex); + if (!ref_count) + synchronize_sched(); return found; } EXPORT_SYMBOL_GPL(marker_remove_probe); -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 233+ messages in thread
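[Editor's sketch] The batching idea in the patch above can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the patch itself: `struct marker`, the integer `owner` id, `find_probe_owner()`, `remove_probe()` and `fake_synchronize_sched()` are all hypothetical stand-ins for the kernel's marker table, `__module_text_address()` and `synchronize_sched()`. It assumes one flat marker array rather than core markers plus per-module lists, and it takes no locks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a kernel marker entry; not the real layout. */
struct marker {
    const char *name; /* marker name, matched with strcmp() */
    int enabled;      /* stands in for the patched enable bit */
    int owner;        /* id of the module providing the probe, 0 = none */
};

/* Counts how often the (simulated) expensive synchronization ran. */
static int sync_calls;

static void fake_synchronize_sched(void)
{
    sync_calls++;
}

/* Pass 1: find the module whose probe is connected to marker `name`. */
static int find_probe_owner(const struct marker *m, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, m[i].name) == 0 && m[i].owner != 0)
            return m[i].owner;
    return 0;
}

/*
 * Pass 2: disconnect every marker called `name` and count the other
 * markers that still point into the same module. Synchronize only when
 * that count is zero, i.e. when the module's last probe was removed.
 */
static int remove_probe(struct marker *m, size_t n, const char *name)
{
    int owner, found = 0, ref_count = 0;

    assert(m != NULL && name != NULL);
    owner = find_probe_owner(m, n, name);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, m[i].name) != 0) {
            if (owner != 0 && m[i].owner == owner)
                ref_count++;
            continue;
        }
        m[i].enabled = 0;
        m[i].owner = 0; /* back to the "empty callback" state */
        found++;
    }
    if (ref_count == 0)
        fake_synchronize_sched();
    return found;
}
```

With two markers served by the same module, removing the first probe leaves `ref_count` at 1 and skips the synchronization; removing the second brings it to 0 and synchronizes once. That is the saving Mathieu describes over doing 50 `synchronize_sched()` calls in a row.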
* Re: 2.6.22 -mm merge plans 2007-05-10 19:39 ` Mathieu Desnoyers @ 2007-05-13 21:04 ` Christoph Hellwig 0 siblings, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-13 21:04 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote: > * Christoph Hellwig (hch@infradead.org) wrote: > > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote: > > > > kprobes does this kind of synchronization internally, so the marker > > > > wrapper should probabl aswell. > > > > > > > > > > The problem appears on heavily loaded systems. Doing 50 > > > synchronize_sched() calls in a row can take up to a few seconds on a > > > 4-way machine. This is why I prefer to do it in the module to which > > > the callbacks belong. > > > > We recently had a discussion on batch unreistration interface for > > kprobes. I'm not very happy with having so different interfaces for > > different kind of probe registrations. > > > > Ok, I've had a look at the kprobes batch registration mechanisms and.. > well, it does not look well suited for the markers. Adding > supplementary data structures such as linked lists of probes does not > look like a good match. > > However, I agree with you that providing a similar API is good. > > Therefore, here is my proposal : > > The goal is to do the synchronize just after we unregister the last > probe handler provided by a given module. Since the unregistration > functions iterate on every marker present in the kernel, we can keep a > count of how many probes provided by the same module are still present. > If we see that we unregistered the last probe pointing to this module, > we issue a synchronize_sched(). 
> > It adds no data structure and keeps the same order of complexity as what > is already there, we only have to do 2 passes in the marker structures : > the first one finds the module associated with the callback and the > second disables the callbacks and keep a count of the number of > callbacks associated with the module. > > Mathieu > > P.S.: here is the code. > > > Linux Kernel Markers - Architecture Independant code Provide internal > synchronize_sched() in batch. > > The goal is to do the synchronize just after we unregister the last > probe handler provided by a given module. Since the unregistration > functions iterate on every marker present in the kernel, we can keep a > count of how many probes provided by the same module are still present. > If we see that we unregistered the last probe pointing to this module, > we issue a synchronize_sched(). > > It adds no data structure and keeps the same order of complexity as what > is already there, we only have to do 2 passes in the marker structures : > the first one finds the module associated with the callback and the > second disables the callbacks and keep a count of the number of > callbacks associated with the module. Looks good to me; please incorporate this in the next round of the markers patch series. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 23:11 ` Mathieu Desnoyers 2007-05-02 23:21 ` Andrew Morton @ 2007-05-03 8:06 ` Christoph Hellwig 2007-05-03 14:43 ` Mathieu Desnoyers 2007-05-03 10:31 ` Andi Kleen 2 siblings, 1 reply; 233+ messages in thread From: Christoph Hellwig @ 2007-05-03 8:06 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > My statement was probably not clear enough. The actual marker code is > useful as-is without any further kernel patching required : SystemTAP is > an example where they use external modules to load probes that can > connect either to markers or through kprobes. LTTng, in its current state, > has a mostly modular core that also uses the markers. That just means you have to load an enormous amount of external crap that replaces the missing kernel functionality. It's exactly the situation we want to avoid. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-03 8:06 ` Christoph Hellwig @ 2007-05-03 14:43 ` Mathieu Desnoyers 0 siblings, 0 replies; 233+ messages in thread From: Mathieu Desnoyers @ 2007-05-03 14:43 UTC (permalink / raw) To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg * Christoph Hellwig (hch@infradead.org) wrote: > On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote: > > My statement was probably not clear enough. The actual marker code is > > useful as-is without any further kernel patching required : SystemTAP is > > an example where they use external modules to load probes that can > > connect either to markers or through kprobes. LTTng, in its current state, > > has a mostly modular core that also uses the markers. > > That just means you have to load an enormous amount of external crap > that replaces the missing kernel functionality. It's exactly the > situation we want to avoid. > It makes sense to use -mm to hold the whole usable infrastructure before submitting it to mainline. I will submit my core LTTng patches to Andrew in the coming weeks. There is no hurry, from the LTTng perspective, to merge the markers sooner, although they could be useful to other (external) projects meanwhile. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 23:11 ` Mathieu Desnoyers 2007-05-02 23:21 ` Andrew Morton 2007-05-03 8:06 ` Christoph Hellwig @ 2007-05-03 10:31 ` Andi Kleen 2007-05-03 14:49 ` Mathieu Desnoyers 2 siblings, 1 reply; 233+ messages in thread From: Andi Kleen @ 2007-05-03 10:31 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.
Ok, but do it step by step:
- split out useful pieces like the "patched in enable/disable flags" and submit them separately with an example user or two [I got a couple of candidates e.g. with some of the sysctls in VM or networking]
- post and merge that.
- don't implement anything initially that is not needed by blktrace
- post a minimal marker patch together with the blktrace conversion for review again on linux-kernel
- await review comments. This review would not cover the basic need of markers, just the specific implementation.
- then potentially incorporate review comments
- then merge
- later add features with individual review/discussion as new users in the kernel are added.
-Andi ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-03 10:31 ` Andi Kleen @ 2007-05-03 14:49 ` Mathieu Desnoyers 0 siblings, 0 replies; 233+ messages in thread From: Mathieu Desnoyers @ 2007-05-03 14:49 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, Christoph Hellwig, linux-kernel, rusty, wfg Hi Andi, This plan makes sense. I will split the "patched in enable/disable flags" part into a separate piece (good idea!) and then submit the LTTng core to Andrew. Christoph has a good point about wanting a usable infrastructure to go in. Regarding your plan, I must argue that blktrace is not a general purpose tracing infrastructure, but one dedicated to block I/O tracing. Therefore, it makes sense to bring in the generic infrastructure first and then convert blktrace to it. Mathieu * Andi Kleen (andi@firstfloor.org) wrote: > > If we are looking at current "potential users" that are already in > > mainline, we could change blktrace to make it use the markers. > > Ok, but do it step by step: > - split out useful pieces like the "patched in enable/disable flags" > and submit them separately with an example user or two > [I got a couple of candidates e.g. with some of the sysctls in VM or > networking] > - post and merge that. > - don't implement anything initially that is not needed by blktrace > - post a minimal marker patch together with the blktrace > conversion for review again on linux-kernel > - await review comments. This review would not cover the basic > need of markers, just the specific implementation. > - then potentially incorporate review comments > - then merge > - later add features with individual review/discussion as new users in the > kernel are added. > > -Andi -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 20:53 ` Andrew Morton 2007-05-02 23:11 ` Mathieu Desnoyers @ 2007-05-03 8:09 ` Christoph Hellwig 1 sibling, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-03 8:09 UTC (permalink / raw) To: Andrew Morton Cc: Mathieu Desnoyers, Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote: > In which case we have: > > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch > atomich-complete-atomic_long-operations-in-asm-generic.patch > atomich-i386-type-safety-fix.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch > atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch > atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch > local_t-architecture-independant-extension.patch > local_t-alpha-extension.patch > local_t-i386-extension.patch > local_t-ia64-extension.patch > local_t-mips-extension.patch > local_t-parisc-cleanup.patch > local_t-powerpc-extension.patch > local_t-sparc64-cleanup.patch > local_t-x86_64-extension.patch > > For 2.6.22 > > linux-kernel-markers-kconfig-menus.patch > linux-kernel-markers-architecture-independant-code.patch > linux-kernel-markers-powerpc-optimization.patch > linux-kernel-markers-i386-optimization.patch > markers-add-instrumentation-markers-menus-to-avr32.patch > linux-kernel-markers-non-optimized-architectures.patch > markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch > linux-kernel-markers-documentation.patch > # > markers-define-the-linker-macro-extra_rwdata.patch > markers-use-extra_rwdata-in-architectures.patch > # > 
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch > no-longer-include-asm-kdebugh.patch This is a plan I can fully agree with. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 20:36 ` Mathieu Desnoyers 2007-05-02 20:53 ` Andrew Morton @ 2007-05-03 8:08 ` Christoph Hellwig 1 sibling, 0 replies; 233+ messages in thread From: Christoph Hellwig @ 2007-05-03 8:08 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty, wfg On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote: > The idea is the following : either we integrate the infrastructure for > instrumentation / data serialization / buffer management / extraction of > data to user space in multiple different steps, which makes code review > easier for you guys, the staging area for that is -mm ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 16:47 ` Andrew Morton 2007-05-02 17:29 ` Christoph Hellwig @ 2007-05-02 17:49 ` Andi Kleen 2007-05-02 21:46 ` Tilman Schmidt 1 sibling, 1 reply; 233+ messages in thread From: Andi Kleen @ 2007-05-02 17:49 UTC (permalink / raw) To: Andrew Morton; +Cc: Andi Kleen, Mathieu Desnoyers, linux-kernel, rusty, wfg On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote: > On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <andi@firstfloor.org> wrote: > > > > It is currently used as an instrumentation infrastructure for the LTTng > > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > > > WindRiver products. The SystemTAP project also plan to use this type of > > > infrastructure to trace sites hard to instrument. The Linux Kernel > > > Markers has the support of Frank C. Eigler, author of their current > > > marker alternative (which he wishes to drop in order to adopt the > > > markers infrastructure as soon as it hits mainline). > > > > All of the above don't use mainline kernels. > > That's because they have to add a markers patch! I meant they use very old kernels. Their experiences don't apply to mainline bitrottyness. > > That doesn't constitute using it. > > Andi, there was a huge amount of discussion about all this in September last > year (subjects: *markers* and *LTTng*). The outcome of all that was, I > believe, that the kernel should have a static marker infrastructure. I have no problem with that in principle; just some doubts about the current proposed implementation: in particular its complexity. And also I think when something is merged it should have some users in tree. -Andi ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 17:49 ` Andi Kleen @ 2007-05-02 21:46 ` Tilman Schmidt 2007-05-03 10:12 ` Andi Kleen 0 siblings, 1 reply; 233+ messages in thread From: Tilman Schmidt @ 2007-05-02 21:46 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, Mathieu Desnoyers, linux-kernel, rusty, wfg On 02.05.2007 19:49, Andi Kleen wrote: > And also I think when something is merged it should have some users in tree. Isn't that a circular dependency? -- Tilman Schmidt E-Mail: tilman@imap.cc Bonn, Germany - Undetected errors are handled as if no error occurred. (IBM) - ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 21:46 ` Tilman Schmidt @ 2007-05-03 10:12 ` Andi Kleen 0 siblings, 0 replies; 233+ messages in thread From: Andi Kleen @ 2007-05-03 10:12 UTC (permalink / raw) To: Tilman Schmidt Cc: Andi Kleen, Andrew Morton, Mathieu Desnoyers, linux-kernel, rusty, wfg On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote: > On 02.05.2007 19:49, Andi Kleen wrote: > > And also I think when something is merged it should have some users in tree. > > Isn't that a circular dependency? The normal mode of operation is to merge the initial users and the subsystem at the same time. -Andi ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 10:44 ` Andi Kleen 2007-05-02 16:37 ` Frank Ch. Eigler 2007-05-02 16:47 ` Andrew Morton @ 2007-05-02 17:19 ` Mathieu Desnoyers 2 siblings, 0 replies; 233+ messages in thread From: Mathieu Desnoyers @ 2007-05-02 17:19 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, rusty, wfg * Andi Kleen (andi@firstfloor.org) wrote: > > It is currently used as an instrumentation infrastructure for the LTTng > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in > > WindRiver products. The SystemTAP project also plan to use this type of > > infrastructure to trace sites hard to instrument. The Linux Kernel > > Markers has the support of Frank C. Eigler, author of their current > > marker alternative (which he wishes to drop in order to adopt the > > markers infrastructure as soon as it hits mainline). > > All of the above don't use mainline kernels. > That doesn't constitute using it. > I am afraid this argument does not hold : - These companies are not shipping their products with mainline kernels to make sure things have time to stabilize. - They eventually get to the next version some time after it is not "head" anymore. They still want to benefit from the features of the newer versions. - All these companies would be really happy to have a marker infrastructure in mainline so they can stop applying a separate set of patches to provide this functionality. - Arguing that "they apply their set of patches anyway" goes against the advice I have received from Greg KH, which can be reworded as : please submit your patches to mainline instead of keeping your separate set of patches. See his various presentations about "mainlining" for more info about this. Because of these 4 arguments, I think that these companies can be considered as users and contributors of/to mainline kernels. > > Quoting Jim Keniston <jkenisto@us.ibm.com> : > > > > "kprobes remains a vital foundation for SystemTap. 
But markers are > > attractive as an alternate source of trace/debug info. Here's why: > > [...]" > > Talk is cheap. Do they have working code to use it? > LTTng has been using the markers for about 6 months now. SystemTAP is waiting on the "it hits mainline" signal before they switch from their STP_MARK() markers to this infrastructure. Give them a few days and they will proceed with the change. > > - Allow per-architecture optimized versions which removes the need for > > a d-cache based branch (patch a "load immediate" instruction > > instead). It minimized the d-cache impact of the disabled markers. > > That's a good idea in general, but should be generalized (available > independently), not hidden in your subsystem. I know a couple of places > who could use this successfully. > I agree that an efficient hooking mechanism is useful to many users, including at least security hooks and instrumentation for tracing. What other usage scenario do you have in mind that could not fit in my marker infrastructure ? I have tried to generalize this as much as possible, but if you see, within this, a piece of infrastructure that could be taken apart and used more widely, I will be happy to submit it separately to increase its usefulness. > > - Accept the cost of an unlikely branch at the marker site because the > > gcc compiler does not give the ability to put "nops" instead of a > > branch generated from C code. Keep this in mind for future > > per-architecture optimizations. > > See upcoming paravirt code for a way to do this. > I have looked at the paravirt code in Andrew's 2.6.21-rc7-mm2. A few reasons why I do not plan to use it : 1 - It requires specific arg setup for the calls to be crafted by hand, in assembly, for each and every number of parameters and each type, for each architecture. I use a variable argument list as a parameter to my marker to make sure that a single macro can be used for markup in a generic manner. 
Quoting : http://lkml.org/lkml/2007/4/4/577 "+ * Unfortunately there's no way to get gcc to generate the args setup + * for the call, and then allow the call itself to be generated by an + * inline asm. Because of this, we must do the complete arg setup and + * return value handling from within these macros. This is fairly + * cumbersome." 2 - I also provide an architecture independent "generic" version which does not depend on per-architecture assembly. From what I see, paravirt is only offered for i386 and x86_64. Are there any plans to support the other ~12 architectures ? Does it offer an architecture-agnostic fallback in the cases where it is not implemented for a given architecture ? 3 - It can only alter instructions "safely" in the UP case before the other CPUs are turned on. See my arch/i386/marker.c code patcher for XMC-safe instruction patching. Marker activation must be done at runtime, when the system is fully operational. Quoting 2.6.21 arch/i386/kernel/alternative.c "/* Replace instructions with better alternatives for this CPU type. This runs before SMP is initialized to avoid SMP problems with self modifying code. This implies that assymetric systems where APs have less capabilities than the boot processor are not handled. Tough. Make sure you disable such features by hand. */ void apply_alternatives(struct alt_instr *start, struct alt_instr *end)" 4 - paravirt does not offer the ability to replace a branch instruction, generated by gcc, through its mechanism. If I choose to use the paravirt mechanism, I must do the stack setup and function call by hand, which has been argued in points (1) and (2). GCC must itself generate the branch instruction to jump over the function call containing the variable argument list. > > - Instrumentation of challenging kernel sites > > - Instrumentation such as the one provided in the already existing > > Lock dependency checker (lockdep) and instrumentation of trap > > handlers implies being reentrant for such context. 
Therefore, the > > implementation must be lock-free and update the state in an atomic > > fashion (rcu-style). It must also let the programmer who describes > > a marker site the ability to specify what is forbidden in the probe > > that will be connected to the marker : can it generate a trap ? Can > > it call lockdep (irq disable, take any type of lock), can it call > > printk ? This is why flags can be passed to the _MARK() marker, > > while the MARK() marker has the default flags. > > Why can't you just generally forbid probes from doing all of this? > It would greatly simplify your code, wouldn't it? > > Keep it simple please. > An example, taken from the marker mechanism itself (no probe involved) shows how difficult it can be to "forbid all of this" : The optimized version patches code while the system is live. This implies cross modifying code in an SMP environment. It can be done safely on x86 and x86_64 by using a breakpoint during the code modification to make sure the CPU issues a serializing instruction between the moment a given CPU speculates the code execution and actually reaches it. It implies going through a trap, which does funny things such as enabling interrupts, which calls into lockdep. Therefore, adding a marker into the lockdep code cannot be done with a breakpoint-based marker on these architectures. We have to provide an alternative way to do this, less intrusive, which is exactly what the "generic" markers provide. The same applies to instrumentation of the breakpoint trap handler. I strongly doubt that _every_ user of the markers would be comfortable with the "write your code so it does not take any lock and does everything atomically" constraint. I have done it in LTTng so I could have a fully reentrant tracer, but even then you can be limited by the nature of where you want to send the data. 
Richard Purdie implemented a serial port based data relay as an alternative data relay mechanism connected to LTTng; he needed a spinlock because of the semantics of his port, so he has to accept the limitation regarding the sites that can and cannot be probed. Providing an explicit declaration of site limitations makes sense in this regard. On other architectures, it is the time source which requires a read seqlock. It is not atomic in the sense that a reader can nest over a writer (if coming from NMI context) and spin forever. I can list a lot of situations where we cannot _require_ the probe to run atomically in every aspect; so generally forbidding these actions does not seem to be a viable solution. In fact, this would be the best way to make sure the marker infrastructure is never used by early adopters because of the complexity level of writing probes, due to these "rules". > > Please tell me if I forgot to explain the rationale behind some > > implementation detail and I will be happy to explain in more depth. > > Having lots of flags to do things differently optionally normally > starts up all warning lights of early over design. While Linux > has this sometimes it is generally only in mature old subsystems. > But when something is freshly merged it shouldn't be like this. > That is because code tends to grow more complicated over its lifetime > and when it is already complicated at the beginning it will eventually > fall over (you can study current slab as a poster child of this) > > -Andi > Explicitly identifying "hard to instrument" sites is nothing new. It has been done in different manners in the past. Kprobes sprinkles "__kprobes" declarations before function declarations all over the place to specify which ones cannot be safely instrumented. It results in a visually less appealing source code and it limits the sites that can be probed. The goal of the marker infrastructure is exactly to instrument those sites. 
Therefore, the approach "we forbid instrumentation of sites hard to instrument" misses the point of this infrastructure. We can leverage the fact that the marker is put in a context known by the programmer; it makes sense to give him the ability to specify what are the restrictions on the probes connected to this marker with some level of granularity. Regards, Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 233+ messages in thread
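[Editor's sketch] The "generic" marker described in the message above (an unlikely branch at the marker site, a variable argument list, an empty default callback, and per-site flags restricting which probes may connect) can be sketched in plain C. This is an illustration under stated assumptions, not the code from the markers patch series: every `sketch_`/`SKETCH_` name is hypothetical, the single `SKETCH_MF_LOCKFREE_ONLY` flag and the rule that a probe must declare every restriction the site sets are invented for the example, and `__builtin_expect` assumes GCC or Clang. The real implementation also has to update `call` and `enable` safely against concurrent marker hits, which this sketch ignores.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical site restriction: only lock-free probes may connect. */
#define SKETCH_MF_LOCKFREE_ONLY 0x1

typedef void (*sketch_probe_fn)(const char *fmt, ...);

struct sketch_marker {
    const char *name;
    int flags;            /* what the marker site forbids or requires */
    volatile int enable;  /* tested on every pass through the site */
    sketch_probe_fn call; /* never NULL: defaults to the empty probe */
};

/* Default callback, so a disabled site always has a valid target. */
static void sketch_empty_probe(const char *fmt, ...)
{
    (void)fmt;
}

/* Marker site: one unlikely branch, then a varargs call if enabled. */
#define SKETCH_MARK(m, ...)                        \
    do {                                           \
        if (__builtin_expect((m)->enable, 0))      \
            (m)->call(__VA_ARGS__);                \
    } while (0)

/* Connect a probe only if it declares every restriction the site sets. */
static int sketch_set_probe(struct sketch_marker *m, sketch_probe_fn fn,
                            int probe_flags)
{
    assert(m != NULL && fn != NULL);
    if ((m->flags & ~probe_flags) != 0)
        return -1; /* probe does not promise what the site needs */
    m->call = fn;
    m->enable = 1;
    return 0;
}

/* Disconnect: clear the enable flag and restore the empty probe. */
static void sketch_remove_probe(struct sketch_marker *m)
{
    assert(m != NULL);
    m->enable = 0;
    m->call = sketch_empty_probe;
}

/* Example probe: expects a single int argument after the format. */
static int sketch_hits;
static int sketch_last;

static void sketch_count_probe(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    sketch_last = va_arg(ap, int);
    va_end(ap);
    sketch_hits++;
}
```

A site would be written as `SKETCH_MARK(&m, "%d", value);`. While the marker is disabled the cost is the load and the untaken branch, which is the trade-off the thread discusses against patching an immediate value into the instruction stream.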
* Re: 2.6.22 -mm merge plans 2007-05-01 12:17 ` Andi Kleen 2007-05-01 22:08 ` Mathieu Desnoyers @ 2007-05-02 0:31 ` Rusty Russell 2007-05-02 10:30 ` Andi Kleen 1 sibling, 1 reply; 233+ messages in thread From: Rusty Russell @ 2007-05-02 0:31 UTC (permalink / raw) To: Andi Kleen; +Cc: Andrew Morton, linux-kernel On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote: > Andrew Morton <akpm@linux-foundation.org> writes: > > Will merge the rustyvisor. > > IMHO the user code still doesn't belong into Documentation. > Also it needs another review round I guess. And some beta testing by > more people. Like any piece of code more review and more testing would be great. (Your earlier review was particularly useful!). But it's not clear that waiting for longer will achieve either. Look at kvm's experience for the reverse case: it went in, then got rewritten. As for the code in Documentation, my initial attempts tried to get around the need for a userspace part by putting everything in the kernel module. It meant you could launch a guest by writing a string to /dev/lguest (no real ABI burden there), but it's a worse solution than some user code in the kernel tree 8( Cheers, Rusty. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans 2007-05-02 0:31 ` Rusty Russell @ 2007-05-02 10:30 ` Andi Kleen 0 siblings, 0 replies; 233+ messages in thread From: Andi Kleen @ 2007-05-02 10:30 UTC (permalink / raw) To: Rusty Russell; +Cc: Andi Kleen, Andrew Morton, linux-kernel On Wed, May 02, 2007 at 10:31:10AM +1000, Rusty Russell wrote: > On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote: > > Andrew Morton <akpm@linux-foundation.org> writes: > > > Will merge the rustyvisor. > > > > IMHO the user code still doesn't belong into Documentation. > > Also it needs another review round I guess. And some beta testing by > > more people. > > Like any piece of code more review and more testing would be great. > (Your earlier review was particularly useful!). But it's not clear that > waiting for longer will achieve either. Not clear to me. Release a clear lguest patchkit with documentation on l-k several times and you'll probably get both reviewers and testers. Then confidence level will rise. > > Look at kvm's experience for the reverse case: it went in, then got > rewritten. They at least already had some user base at this point. > As for the code in Documentation, my initial attempts tried to get > around the need for a userspace part by putting everything in the kernel > module. It meant you could launch a guest by writing a string > to /dev/lguest (no real ABI burden there), but it's a worse solution > than some user code in the kernel tree 8( Just put it into a separate tarball. -Andi ^ permalink raw reply [flat|nested] 233+ messages in thread
* file capabilities and security_task_wait failure Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (15 preceding siblings ...) 2007-05-01 12:17 ` Andi Kleen @ 2007-05-01 13:06 ` Stephen Smalley 2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins ` (5 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Stephen Smalley @ 2007-05-01 13:06 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, James Morris, Eric Paris, Serge E. Hallyn, Chris Wright, linuxfs, Christoph Hellwig On Mon, 2007-04-30 at 16:20 -0700, Andrew Morton wrote: > implement-file-posix-capabilities.patch > file-capabilities-accomodate-future-64-bit-caps.patch > return-eperm-not-echild-on-security_task_wait-failure.patch > > I think we're still waiting for the security guys to work out what to do with > this work. return-eperm-not-echild-on-security_task_wait-failure.patch should be merged - it is effectively a bug fix. On the file capabilities support, have any of the filesystem folks (cc'd) looked at the code yet? -- Stephen Smalley National Security Agency ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (16 preceding siblings ...) 2007-05-01 13:06 ` file capabilities and security_task_wait failure " Stephen Smalley @ 2007-05-01 14:31 ` Hugh Dickins 2007-05-02 1:42 ` Nick Piggin 2007-05-01 16:56 ` 2.6.22 -mm merge plans Zan Lynx ` (4 subsequent siblings) 22 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-01 14:31 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm On Mon, 30 Apr 2007, Andrew Morton wrote: >... > mm-more-rmap-checking.patch >... > > Misc MM things. Will merge. Would Nick mind very much if I ask you to drop this one? You did CC me ages ago, but I've only just run across it. It's a small matter, but I'd prefer it dropped for now. >> Re-introduce rmap verification patches that Hugh removed when he removed >> PG_map_lock. PG_map_lock actually isn't needed to synchronise access to >> anonymous pages, because PG_locked and PTL together already do. >> >> These checks were important in discovering and fixing a rare rmap corruption >> in SLES9. It introduces some silly checks which were never in mainline, nor so far as I can tell in SLES9: I'm thinking of those + BUG_ON(address < vma->vm_start || address >= vma->vm_end); There are few callsites for these rmap functions, I don't think they need to be checking their arguments in that way. It also changes the inline page_dup_rmap (a single atomic increment) into a bugchecking out-of-line function: do we really want to slow down fork in that way, for 2.6.22 to fix a rare corruption in SLES9? 
What I really like about the patch is Nick's observation that my /* else checking page index and mapping is racy */ is no longer true: a change we made to the do_swap_page sequence some while ago has indeed cured that raciness, and I'm happy to reintroduce the check on mapping and index in page_add_anon_rmap, and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling out of fashion very recently). That becomes more important when I send the patches to free up PG_swapcache, using a PAGE_MAPPING_SWAP bit instead: so I was planning to include that part of Nick's patch in that series. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking 2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins @ 2007-05-02 1:42 ` Nick Piggin 2007-05-02 13:17 ` Hugh Dickins 0 siblings, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-02 1:42 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm Hugh Dickins wrote: > On Mon, 30 Apr 2007, Andrew Morton wrote: > >>... >> mm-more-rmap-checking.patch >>... >> >>Misc MM things. Will merge. > > > Would Nick mind very much if I ask you to drop this one? > You did CC me ages ago, but I've only just run across it. > It's a small matter, but I'd prefer it dropped for now. I guess I would prefer it to go under CONFIG_DEBUG_VM. Speaking of which, it would be nice to be able to turn that on unconditionally in -rc1. Although I may have put a few too many things under it, so it might slow down too much... >>>Re-introduce rmap verification patches that Hugh removed when he removed >>>PG_map_lock. PG_map_lock actually isn't needed to synchronise access to >>>anonymous pages, because PG_locked and PTL together already do. >>> >>>These checks were important in discovering and fixing a rare rmap corruption >>>in SLES9. > > > It introduces some silly checks which were never in mainline, > nor so far as I can tell in SLES9: I'm thinking of those > + BUG_ON(address < vma->vm_start || address >= vma->vm_end); Yes, but IIRC I put that in because there was another check in SLES9 that I actually couldn't put in, but used this one instead because it also caught the bug we saw. > There are few callsites for these rmap functions, I don't think > they need to be checking their arguments in that way. > > It also changes the inline page_dup_rmap (a single atomic increment) > into a bugchecking out-of-line function: do we really want to slow > down fork in that way, for 2.6.22 to fix a rare corruption in SLES9? 
This was actually a rare corruption that is also in 2.6.21, and as few rmap callsites as we have, it was never noticed until the SLES9 bug check was triggered. > What I really like about the patch is Nick's observation that my > /* else checking page index and mapping is racy */ > is no longer true: a change we made to the do_swap_page sequence > some while ago has indeed cured that raciness, and I'm happy to > reintroduce the check on mapping and index in page_add_anon_rmap, > and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling > out of fashion very recently). Hmm, I didn't notice the do_swap_page change, rather just derived its safety by looking at the current state of the code (which I guess must have been post-do_swap_page change)... Do you have a pointer to the patch, for my interest? -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking 2007-05-02 1:42 ` Nick Piggin @ 2007-05-02 13:17 ` Hugh Dickins 2007-05-03 0:18 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-02 13:17 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Nick Piggin wrote: > > Yes, but IIRC I put that in because there was another check in > SLES9 that I actually couldn't put in, but used this one instead > because it also caught the bug we saw. >... > This was actually a rare corruption that is also in 2.6.21, and > as few rmap callsites as we have, it was never noticed until the > SLES9 bug check was triggered. You are being very mysterious. Please describe this bug (privately if you think it's exploitable), and let's work on the patch to fix it, rather than this "debug" patch. > Hmm, I didn't notice the do_swap_page change, rather just derived > its safety by looking at the current state of the code (which I > guess must have been post-do_swap_page change)... Your addition of page_add_new_anon_rmap clarified the situation too. > Do you have a pointer to the patch, for my interest? The patch which changed do_swap_page? commit c475a8ab625d567eacf5e30ec35d6d8704558062 Author: Hugh Dickins <hugh@veritas.com> Date: Tue Jun 21 17:15:12 2005 -0700 [PATCH] can_share_swap_page: use page_mapcount Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch, which does assume PageLocked in page_add_anon_rmap? Yes, I can send you its current unsplit state if you like (but have higher priorities before splitting and commenting it for posting). Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking 2007-05-02 13:17 ` Hugh Dickins @ 2007-05-03 0:18 ` Nick Piggin 0 siblings, 0 replies; 233+ messages in thread From: Nick Piggin @ 2007-05-03 0:18 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm Hugh Dickins wrote: > On Wed, 2 May 2007, Nick Piggin wrote: > >>Yes, but IIRC I put that in because there was another check in >>SLES9 that I actually couldn't put in, but used this one instead >>because it also caught the bug we saw. >>... >>This was actually a rare corruption that is also in 2.6.21, and >>as few rmap callsites as we have, it was never noticed until the >>SLES9 bug check was triggered. > > > You are being very mysterious. Please describe this bug (privately > if you think it's exploitable), and let's work on the patch to fix it, > rather than this "debug" patch. It is exec-fix-remove_arg_zero.patch in Andrew's tree, it's exploitable in that it leaks memory, but it could also release corrupted pagetables into quicklists on those architectures that have them... Anyway, it quite likely would have gone unfixed for several more years if we didn't have the bug triggers in. Now you could argue that my patch obviously fixes all bugs in there (but I wouldn't :)), and being most complex of the few callsites, _now_ we can avoid the bug checks. However I'd prefer to keep them at least under CONFIG_DEBUG_VM. >>Hmm, I didn't notice the do_swap_page change, rather just derived >>its safety by looking at the current state of the code (which I >>guess must have been post-do_swap_page change)... > > > Your addition of page_add_new_anon_rmap clarified the situation too. > > >>Do you have a pointer to the patch, for my interest? > > > The patch which changed do_swap_page? > > commit c475a8ab625d567eacf5e30ec35d6d8704558062 > Author: Hugh Dickins <hugh@veritas.com> > Date: Tue Jun 21 17:15:12 2005 -0700 > [PATCH] can_share_swap_page: use page_mapcount Yeah, this one, thanks. 
I'm just interested. > Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch, > which does assume PageLocked in page_add_anon_rmap? > Yes, I can send you its current unsplit state if you like > (but have higher priorities before splitting and commenting > it for posting). I would like to see that too, but when you are ready :) Thanks, Nick -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
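[Editorial sketch, not part of the thread] The disagreement above is about whether the argument checks (such as the BUG_ON on the address range) should run unconditionally or only in debug builds. The following is a minimal userspace model of the "only under a debug option" variant that Nick prefers. The names DEBUG_VM_MODEL, VM_CHECK, vma_model and add_anon_rmap_model are hypothetical and exist only for this illustration; it is not the code from mm-more-rmap-checking.patch.

```c
#include <assert.h>

/*
 * Hypothetical switch standing in for a debug config option such as
 * CONFIG_DEBUG_VM. Build with -DDEBUG_VM_MODEL to enable the checks;
 * without it VM_CHECK() compiles away and costs nothing at runtime.
 */
#ifdef DEBUG_VM_MODEL
#define VM_CHECK(cond) assert(cond)
#else
#define VM_CHECK(cond) ((void)0)
#endif

/* Simplified stand-in for a vma: only the two bounds matter here. */
struct vma_model {
	unsigned long vm_start;	/* first address covered by the mapping */
	unsigned long vm_end;	/* first address past the mapping */
};

/* Returns 1 if address lies inside [vm_start, vm_end), else 0. */
int address_in_vma(const struct vma_model *vma, unsigned long address)
{
	return address >= vma->vm_start && address < vma->vm_end;
}

/*
 * Model of an rmap entry point: the range check is only evaluated in
 * debug builds, the mapcount increment always happens.
 */
void add_anon_rmap_model(int *mapcount, const struct vma_model *vma,
			 unsigned long address)
{
	VM_CHECK(address_in_vma(vma, address));
	(*mapcount)++;
}
```

With the macro undefined, the fast path is just the increment, which is the property Hugh is asking for in the fork path; with it defined, an out-of-range address trips the check.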
* Re: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (17 preceding siblings ...) 2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins @ 2007-05-01 16:56 ` Zan Lynx 2007-05-01 17:06 ` 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix Hugh Dickins ` (3 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Zan Lynx @ 2007-05-01 16:56 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 960 bytes --] On Mon, 2007-04-30 at 16:20 -0700, Andrew Morton wrote: [snip] > Mel's moveable-zone work. > > I don't believe that this has had sufficient review and I'm sure that it > hasn't had sufficient third-party testing. Most of the approbations thus far > have consisted of people liking the overall idea, based on the changelogs and > multi-year-old discussions. > > For such a large and core change I'd have expected more detailed reviewing > effort and more third-party testing. And I STILL haven't made time to review > the code in detail myself. [snip] I am a fan of this, but I hadn't really realized that it's in -mm, and that it has to be enabled with kernelcore= Now that I am, I'm running it on my laptop with kernelcore=256M (it wouldn't boot with 128M or less, weird initscript errors and OOMs). 1 GB single-core laptops are probably not the intended test audience :) But I'll see what happens. -- Zan Lynx <zlynx@acm.org> [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (18 preceding siblings ...) 2007-05-01 16:56 ` 2.6.22 -mm merge plans Zan Lynx @ 2007-05-01 17:06 ` Hugh Dickins 2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins ` (2 subsequent siblings) 22 siblings, 0 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-01 17:06 UTC (permalink / raw) To: Andrew Morton; +Cc: akuster, Ken Chen, linux-kernel, linux-mm On Mon, 30 Apr 2007, Andrew Morton wrote: > ... > mm-detach_vmas_to_be_unmapped-fix.patch > ... > Misc MM things. Will merge. No, I think that one is just drifting like flotsam towards mainline, because nobody at all has yet found time to look at it. And Mr Akuster appears not to have signed off on it yet. I've given it a quick look now, and it seems to be based on misdescription and misconception. > From: <akuster@mvista.com> > > Wolfgang Wander submitted a fix to address a mmap fragmentation issue. The > git patch ( 1363c3cd8603a913a27e2995dccbd70d5312d8e6 ) is somewhat different > and yields different results when running Wolfgang's test case leakme.c. Ken did a lot of the work on that I believe: I certainly wouldn't want to see this patch go in without his Ack. (I've never done any work on unmapped area heuristics, but detach_vmas_to_be_unmapped always catches my eye.) > > IMHO, the vm start and end address are swapped in arch_unmap_area and > arch_unmap_area_topdown functions. I disagree. > > Prior to this patch arch_unmap_area() used area->vm_start and > arch_unmap_area_topdown used area->vm_end Yes (where area is the vma being unmapped). > in the git patch the following change showed up. > > if (mm->unmap_area == arch_unmap_area) > addr = prev ? prev->vm_start : mm->mmap_base; > else > addr = vma ? vma->vm_end : mm->mmap_base; No, that's not what showed up in the git patch: that's what the patch below is trying to change it to. The git patch said addr = prev ? 
prev->vm_end : mm->mmap_base for the bottomup case i.e. setting the unmapped area to the end of the vma below; and addr = vma ? vma->vm_start: mm->mmap_base; for the topdown case i.e. setting the unmapped area to the beginning of the vma above. That seems to me consistent with what was done before, but pushing the bounds out across any hole, for presumably better behaviour. > > Using Wolfgang Wander's leakme.c test, I get the same results seen with his > original "Avoiding mmap fragmentation" patch as I do after swapping the start > & end address in the above code segment. The patch I submitted addresses this > typo issue. I'm pretty sure it is not a typo. I did a very hasty test with two aLLocator .c progs Wolfgang posted (one unnamed, one named leakme4.c), on x86_64, and got apparently the same successful result with and without the patch below. In my case, it's probably just slightly slowing down the algorithm, by demanding an additional find_vma() because it mispositions mm->free_area_cache to an occupied area. I don't see how it could ever be an improvement, but I've not spent long enough checking out that code. I bet there's improvements that could be made there, but this patch looks wrong - please don't rush it into 2.6.22 (personally I'd say drop it, but I'd rather Ken takes a look). Hugh > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/mmap.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff -puN mm/mmap.c~mm-detach_vmas_to_be_unmapped-fix mm/mmap.c > --- a/mm/mmap.c~mm-detach_vmas_to_be_unmapped-fix > +++ a/mm/mmap.c > @@ -1723,9 +1723,9 @@ detach_vmas_to_be_unmapped(struct mm_str > *insertion_point = vma; > tail_vma->vm_next = NULL; > if (mm->unmap_area == arch_unmap_area) > - addr = prev ? prev->vm_end : mm->mmap_base; > + addr = prev ? prev->vm_start : mm->mmap_base; > else > - addr = vma ? vma->vm_start : mm->mmap_base; > + addr = vma ? 
vma->vm_end : mm->mmap_base; > mm->unmap_area(mm, addr); > mm->mmap_cache = NULL; /* Kill the cache. */ > } ^ permalink raw reply [flat|nested] 233+ messages in thread
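[Editorial sketch, not part of the thread] Hugh's description of the existing behaviour can be modelled in a few lines: for the bottom-up layout the hint is the end of the mapping below the hole, for the top-down layout it is the start of the mapping above it, and mmap_base is the fallback when there is no such neighbour. The types and the function name unmap_hint are assumptions made for this illustration; only the two ternary expressions mirror what the message quotes from the git patch.

```c
#include <stddef.h>

/* Simplified stand-in for a vma: only the two bounds matter here. */
struct vma_model {
	unsigned long vm_start;	/* first address covered by the mapping */
	unsigned long vm_end;	/* first address past the mapping */
};

/* Hypothetical tag replacing the mm->unmap_area function-pointer test. */
enum layout { LAYOUT_BOTTOM_UP, LAYOUT_TOP_DOWN };

/*
 * below: mapping just under the unmapped range (NULL if none)
 * above: mapping just over the unmapped range (NULL if none)
 * Returns the address that would be handed to the unmap_area hook.
 */
unsigned long unmap_hint(enum layout layout,
			 const struct vma_model *below,
			 const struct vma_model *above,
			 unsigned long mmap_base)
{
	if (layout == LAYOUT_BOTTOM_UP)
		return below ? below->vm_end : mmap_base;
	return above ? above->vm_start : mmap_base;
}
```

In both cases the returned address sits on the edge of the hole, so a later search can start in free space. The proposed swap (vm_start of the mapping below, vm_end of the mapping above) would instead point inside or beyond an occupied mapping, which matches Hugh's remark about mispositioning mm->free_area_cache.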
* Re: 2.6.22 -mm merge plans: slub 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (19 preceding siblings ...) 2007-05-01 17:06 ` 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix Hugh Dickins @ 2007-05-01 18:10 ` Hugh Dickins 2007-05-01 19:25 ` Christoph Lameter 2007-05-01 19:55 ` Andrew Morton 2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar 2007-05-07 17:47 ` Josef Sipek 22 siblings, 2 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-01 18:10 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm On Mon, 30 Apr 2007, Andrew Morton wrote: > > i386-use-page-allocator-to-allocate-thread_info-structure.patch > slub-core.patch > > slub. Or part thereof. This is another patch series which got messed up by > poor patch sequencing. > > make-page-private-usable-in-compound-pages-v1.patch > optimize-compound_head-by-avoiding-a-shared-page.patch > add-virt_to_head_page-and-consolidate-code-in-slab-and-slub.patch > slub-fix-object-tracking.patch > slub-enable-tracking-of-full-slabs.patch > slub-validation-of-slabs-metadata-and-guard-zones.patch > slub-add-min_partial.patch > slub-add-ability-to-list-alloc--free-callers-per-slab.patch > slub-free-slabs-and-sort-partial-slab-lists-in-kmem_cache_shrink.patch > slub-remove-object-activities-out-of-checking-functions.patch > slub-user-documentation.patch > slub-add-slabinfo-tool.patch > > Most of the rest of slub. Will merge it all. Merging slub already? I'm surprised. That's a very key piece of infrastructure, and I doubt it's had the exposure it needs yet. Just what has it been widely tested on so far? x86_64. Not many of us have ia64, but I guess SGI people will have been trying it on that. Not i386, that's excluded. 
Not powerpc - hmm, I thought that was known, but looking I see no ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up, crashes in slab_free from pgtable_free_tlb from free_pte_range from free_pgd_range from free_pgtables from unmap_region from do_munmap. That's 2.6.21-rc7-mm2. slob has a justified place at the low end, but do we want some people running with slab and some with slub? I'd expected slub to stay in 2.6.22-mm, and have all the architectures cut over to it in that time, before advancing to mainline. I've nothing against slub in itself, though I'm wary of its cache merging (more scope for one corrupting another) (and sometimes I think Christoph spent one life uglifying slab for NUMA, then another life ripping that all out to make slub ;) Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins @ 2007-05-01 19:25 ` Christoph Lameter 2007-05-01 19:55 ` Andrew Morton 1 sibling, 0 replies; 233+ messages in thread From: Christoph Lameter @ 2007-05-01 19:25 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm On Tue, 1 May 2007, Hugh Dickins wrote: > > Most of the rest of slub. Will merge it all. > > Merging slub already? I'm surprised. That's a very key piece of > infrastructure, and I doubt it's had the exposure it needs yet. It's not the default. It's just an alternative like SLOB. It will take some time to test with various loads in order to see if it can really replace SLAB in all scenarios. > Just what has it been widely tested on so far? x86_64. Not many > of us have ia64, but I guess SGI people will have been trying it > on that. Not i386, that's excluded. There is an i386 patch pending and I have used it on i386 for a while. > Not powerpc - hmm, I thought that was known, but looking I see no > ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up, > crashes in slab_free from pgtable_free_tlb from free_pte_range from > free_pgd_range from free_pgtables from unmap_region from do_munmap. > That's 2.6.21-rc7-mm2. Hmmm... True, I have not spent any time with that platform. We can set ARCH_USES_SLAB_PAGE_STRUCT there to switch it off. SLUB is the default for mm so I am a bit surprised that this did not surface earlier. > I've nothing against slub in itself, though I'm wary of its > cache merging (more scope for one corrupting another) (and Yes but then SLUB has more diagnostics etc etc than SLAB to prevent any issues. In debug mode all slabs are separate. The merge feature is very stable these days and significantly reduces cache overhead problems that plague SLAB and require it to have a complex object expiration technique. As a result I was able to rip out all timers. SLUB has no cache reaper nor any timer. 
It's silent if not in use. > sometimes I think Christoph spent one life uglifying slab for > NUMA, then another life ripping that all out to make slub ;) SLAB has a certain paradigm of doing things (queues) and I had to work within that framework. It was a group effort. SLUB is an answer to those complaints and a result of the lessons learned through years of some painful slab debugging. SLUB makes debugging extremely easy (and also the design is very simple and comprehensible). No rebuilding of the kernel. Just pop in a debug option on the command line which can even be targeted to a slab cache if we know that things break there. ^ permalink raw reply [flat|nested] 233+ messages in thread
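[Editorial sketch, not part of the thread] The exchange above turns on cache merging: Hugh worries that merged caches give one user more scope to corrupt another, and Christoph answers that in debug mode all slabs are kept separate. The toy model below illustrates only that trade-off. The struct, its fields and the rule "merge when object sizes match and neither cache has debugging on" are simplifying assumptions for illustration; the real SLUB criteria are more involved and are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified description of a slab cache. */
struct cache_model {
	const char *name;	/* label only, not used in the decision */
	size_t object_size;	/* size of each object in bytes */
	bool debug;		/* true if debugging is enabled for it */
};

/*
 * Toy merge rule: two caches may share slabs only if neither has
 * debugging enabled and their objects have the same size.
 */
bool can_share_slabs(const struct cache_model *a,
		     const struct cache_model *b)
{
	if (a->debug || b->debug)
		return false;	/* debug mode keeps every cache separate */
	return a->object_size == b->object_size;
}
```

Under this model, switching debugging on for a suspect cache isolates it from its former neighbours, which is the behaviour Christoph points to when he says a debug option can be targeted at a single slab cache.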
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins 2007-05-01 19:25 ` Christoph Lameter @ 2007-05-01 19:55 ` Andrew Morton 2007-05-01 20:19 ` Hugh Dickins 1 sibling, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-01 19:55 UTC (permalink / raw) To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm On Tue, 1 May 2007 19:10:29 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote: > > Most of the rest of slub. Will merge it all. > > Merging slub already? I'm surprised. My thinking here is "does slub have a future". I think the answer is "yes", so we're reasonably safe getting it into mainline for the finishing work. The kernel.org kernel will still default to slab. Does that sound wrong? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 19:55 ` Andrew Morton @ 2007-05-01 20:19 ` Hugh Dickins 2007-05-01 20:36 ` Andrew Morton 2007-05-01 21:08 ` Christoph Lameter 0 siblings, 2 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-01 20:19 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm On Tue, 1 May 2007, Andrew Morton wrote: > On Tue, 1 May 2007 19:10:29 +0100 (BST) > Hugh Dickins <hugh@veritas.com> wrote: > > > > Most of the rest of slub. Will merge it all. > > > > Merging slub already? I'm surprised. > > My thinking here is "does slub have a future". > I think the answer is "yes", I think I agree with that, though it's a judgement I'd leave to you and others. > so we're reasonably safe getting it into mainline for the finishing > work. The kernel.org kernel will still default to slab. > > Does that sound wrong? Yes, to me it does. If it could be defaulted to on throughout the -rcs, on every architecture, then I'd say that's "finishing work"; and we'd be safe knowing we could go back to slab in a hurry if needed. But it hasn't reached that stage yet, I think. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 20:19 ` Hugh Dickins @ 2007-05-01 20:36 ` Andrew Morton 2007-05-01 20:46 ` Christoph Lameter 2007-05-02 12:54 ` Hugh Dickins 2007-05-01 21:08 ` Christoph Lameter 1 sibling, 2 replies; 233+ messages in thread From: Andrew Morton @ 2007-05-01 20:36 UTC (permalink / raw) To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm On Tue, 1 May 2007 21:19:09 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote: > On Tue, 1 May 2007, Andrew Morton wrote: > > On Tue, 1 May 2007 19:10:29 +0100 (BST) > > Hugh Dickins <hugh@veritas.com> wrote: > > > > > > Most of the rest of slub. Will merge it all. > > > > > > Merging slub already? I'm surprised. > > > > My thinking here is "does slub have a future". > > I think the answer is "yes", > > I think I agree with that, > though it's a judgement I'd leave to you and others. > > > so we're reasonably safe getting it into mainline for the finishing > > work. The kernel.org kernel will still default to slab. > > > > Does that sound wrong? > > Yes, to me it does. If it could be defaulted to on throughout the > -rcs, on every architecture, then I'd say that's "finishing work"; > and we'd be safe knowing we could go back to slab in a hurry if > needed. But it hasn't reached that stage yet, I think. > Given the current state and the current rate of development I'd expect slub to have reached the level of completion which you're describing around -rc2 or -rc3. I think we'd be pretty safe making that assumption. This is a bit unusual but there is of course some self-interest here: the patch dependencies are getting awful and having this hanging around out-of-tree will make 2.6.23 development harder for everyone. So on balance, given that we _do_ expect slub to have a future, I'm inclined to crash ahead with it. The worst that can happen will be a later rm mm/slub.c which would be pretty simple to do. 
otoh I could do some frantic patch mangling and make it easier to carry slub out-of-tree, but do we gain much from that? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 20:36 ` Andrew Morton @ 2007-05-01 20:46 ` Christoph Lameter 2007-05-01 21:09 ` Andrew Morton 2007-05-02 12:54 ` Hugh Dickins 1 sibling, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-01 20:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm On Tue, 1 May 2007, Andrew Morton wrote: > otoh I could do some frantic patch mangling and make it easier to carry > slub out-of-tree, but do we gain much from that? Then we may lose all the slab API cleanups? Yuck. I really do not want to redo those.... ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 20:46 ` Christoph Lameter @ 2007-05-01 21:09 ` Andrew Morton 0 siblings, 0 replies; 233+ messages in thread From: Andrew Morton @ 2007-05-01 21:09 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm On Tue, 1 May 2007 13:46:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Tue, 1 May 2007, Andrew Morton wrote: > > > otoh I could do some frantic patch mangling and make it easier to carry > > slub out-of-tree, but do we gain much from that? > > Then we may lose all the slab API cleanups? Yuck. I really do not want > to redo those.... No, I meant that I'd look at splitting those patches up into one-against-mainline and one-against-slub. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 20:36 ` Andrew Morton 2007-05-01 20:46 ` Christoph Lameter @ 2007-05-02 12:54 ` Hugh Dickins 2007-05-02 17:03 ` Christoph Lameter 2007-05-02 18:52 ` Siddha, Suresh B 1 sibling, 2 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-02 12:54 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm On Tue, 1 May 2007, Andrew Morton wrote: > > Given the current state and the current rate of development I'd expect slub > to have reached the level of completion which you're describing around -rc2 > or -rc3. I think we'd be pretty safe making that assumption. Its developer does show signs of being active! > > This is a bit unusual but there is of course some self-interest here: the > patch dependencies are getting awful and having this hanging around > out-of-tree will make 2.6.23 development harder for everyone. That is a very strong argument: a somewhat worrisome argument, but a very strong one. Maintaining your sanity is important. > > So on balance, given that we _do_ expect slub to have a future, I'm > inclined to crash ahead with it. The worst that can happen will be a later > rm mm/slub.c which would be pretty simple to do. Okay. And there's been no chorus to echo my concern. But if Linus' tree is to be better than a warehouse to avoid awkward merges, I still think we want it to default to on for all the architectures, and for most if not all -rcs. > > otoh I could do some frantic patch mangling and make it easier to carry > slub out-of-tree, but do we gain much from that? No, keep away from that. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 12:54 ` Hugh Dickins @ 2007-05-02 17:03 ` Christoph Lameter 2007-05-02 19:11 ` Andrew Morton 2007-05-02 18:52 ` Siddha, Suresh B 1 sibling, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 17:03 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Hugh Dickins wrote: > But if Linus' tree is to be better than a warehouse to avoid > awkward merges, I still think we want it to default to on for > all the architectures, and for most if not all -rcs. At some point I dream that SLUB could become the default but I thought this would take at least 6 months or so. If we want to force this now then I will certainly have some busy weeks ahead. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 17:03 ` Christoph Lameter @ 2007-05-02 19:11 ` Andrew Morton 2007-05-02 19:42 ` Christoph Lameter 0 siblings, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-02 19:11 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm On Wed, 2 May 2007 10:03:50 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > But if Linus' tree is to be better than a warehouse to avoid > > awkward merges, I still think we want it to default to on for > > all the architectures, and for most if not all -rcs. > > At some point I dream that SLUB could become the default but I thought > this would take at least 6 months or so. If we want to force this now then I > will certainly have some busy weeks ahead. s/dream/promise/ ;) Six months sounds reasonable - I was kind of hoping for less. Make it default-to-on in 2.6.23-rc1, see how it goes. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 19:11 ` Andrew Morton @ 2007-05-02 19:42 ` Christoph Lameter 2007-05-02 19:54 ` Sam Ravnborg 0 siblings, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 19:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm On Wed, 2 May 2007, Andrew Morton wrote: > > At some point I dream that SLUB could become the default but I thought > > this would take at least 6 months or so. If we want to force this now then I > > will certainly have some busy weeks ahead. > > s/dream/promise/ ;) > > Six months sounds reasonable - I was kind of hoping for less. Make it > default-to-on in 2.6.23-rc1, see how it goes.

Here is how I think the future could develop:

Cycle   SLAB          SLUB           SLOB       SLxB

2.6.22  API fixes     Stabilization  API fixes

Major event: SLUB availability as experimental

2.6.23  API upgrades  Perf. Valid.   EOL

Major events: SLUB performance validation. Switch off
        experimental (could even be the default)
        Slab allocators support targeted reclaim for at
        least one slab cache (dentry?)
        (vacate/move all objects in a slab)

2.6.24  Earliest EOL  Stable         -          Experiments

Major events: SLUB stable. Stable targeted reclaim for all
        major reclaimable slabs. Maybe experiments with another
        new allocator?

2.6.25  EOL           default        -          ?

Death of SLAB. SLUB default. Hopefully new ideas on the horizon.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 19:42 ` Christoph Lameter @ 2007-05-02 19:54 ` Sam Ravnborg 2007-05-02 20:14 ` Christoph Lameter 0 siblings, 1 reply; 233+ messages in thread From: Sam Ravnborg @ 2007-05-02 19:54 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, Hugh Dickins, linux-kernel, linux-mm On Wed, May 02, 2007 at 12:42:54PM -0700, Christoph Lameter wrote: > On Wed, 2 May 2007, Andrew Morton wrote: > > > > At some point I dream that SLUB could become the default but I thought > > > this would take at least 6 months or so. If we want to force this now then I > > > will certainly have some busy weeks ahead. > > > > s/dream/promise/ ;) > > > > Six months sounds reasonable - I was kind of hoping for less. Make it > > default-to-on in 2.6.23-rc1, see how it goes. > > Here is how I think the future could develop > > Cycle SLAB SLUB SLOB SLxB > > 2.6.22 API fixes Stabilization API fixes > > Major event: SLUB availability as experimental > > 2.6.23 API upgrades Perf. Valid. EOL > > Major events: SLUB performance validation. Switch off > experimental (could even be the default) > Slab allocators support targeted reclaim for at > least one slab cache (dentry?) > (vacate/move all objects in a slab) To facilitate this do NOT introduce CONFIG_SLAB until we decide that SLUB is the default. In this way we can make CONFIG_SLUB be the default and people will not continue with CONFIG_SLAB because they had it in their .config already. Or just rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED or something. The point is to make sure that SLUB becomes the default for people who do a make oldconfig (explicit or implicit). Sam ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 19:54 ` Sam Ravnborg
@ 2007-05-02 20:14 ` Christoph Lameter
0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 20:14 UTC (permalink / raw)
To: Sam Ravnborg; +Cc: Andrew Morton, Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007, Sam Ravnborg wrote:

> To facilitate this do NOT introduce CONFIG_SLAB until we decide
> that SLUB is the default. In this way we can make CONFIG_SLUB be default
> and people will not continue with CONFIG_SLAB because they had it in their
> .config already.

We already have CONFIG_SLAB. If you use your existing .config then you
will stay with SLAB.

> The point is to make sure that SLUB becomes default for people who do
> a make oldconfig (explicit or implicit).

Hmmmm... We can think about that when we actually want to make SLUB the
default.
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 12:54 ` Hugh Dickins
2007-05-02 17:03 ` Christoph Lameter
@ 2007-05-02 18:52 ` Siddha, Suresh B
2007-05-02 18:58 ` Christoph Lameter
1 sibling, 1 reply; 233+ messages in thread
From: Siddha, Suresh B @ 2007-05-02 18:52 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-mm

On Wed, May 02, 2007 at 05:54:53AM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2007, Andrew Morton wrote:
> > So on balance, given that we _do_ expect slub to have a future, I'm
> > inclined to crash ahead with it. The worst that can happen will be a later
> > rm mm/slub.c which would be pretty simple to do.
>
> Okay. And there's been no chorus to echo my concern.

I have been looking into "slub" recently to avoid some of the NUMA alien
cache issues that we were encountering on the regular slab.

I am having some stability issues with slub on an ia64 NUMA platform and
didn't have time to dig further. I am hoping to look into it soon
and share the data/findings with Christoph.

We also did a quick perf collection on x86_64 (at least I didn't hear of any
stability issues from our team on regular x86_64 SMP), which we will be
sharing shortly.

> But if Linus' tree is to be better than a warehouse to avoid
> awkward merges, I still think we want it to default to on for
> all the architectures, and for most if not all -rcs.

I would not suggest default-on at this point.

thanks,
suresh
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:52 ` Siddha, Suresh B @ 2007-05-02 18:58 ` Christoph Lameter 0 siblings, 0 replies; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 18:58 UTC (permalink / raw) To: Siddha, Suresh B; +Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Siddha, Suresh B wrote: > I have been looking into "slub" recently to avoid some of the NUMA alien > cache issues that we were encountering on the regular slab. Yes that is also our main concern. > I am having some stability issues with slub on an ia64 NUMA platform and > didn't have time to dig further. I am hoping to look into it soon > and share the data/findings with Christoph. There is at least one patch on top of 2.6.21-rc7-mm2 already in mm that may be necessary for you. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 20:19 ` Hugh Dickins 2007-05-01 20:36 ` Andrew Morton @ 2007-05-01 21:08 ` Christoph Lameter 2007-05-02 12:45 ` Hugh Dickins 1 sibling, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-01 21:08 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm On Tue, 1 May 2007, Hugh Dickins wrote: > Yes, to me it does. If it could be defaulted to on throughout the > -rcs, on every architecture, then I'd say that's "finishing work"; > and we'd be safe knowing we could go back to slab in a hurry if > needed. But it hasn't reached that stage yet, I think. Why would we need to go back to SLAB if we have not switched to SLUB? SLUB is marked experimental and not the default. The only problems that I am aware of is(or was) the issue with arches modifying page struct fields of slab pages that SLUB needs for its own operations. And I thought it was all fixed since the powerpc guys were quiet and the patch was in for i386. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-01 21:08 ` Christoph Lameter @ 2007-05-02 12:45 ` Hugh Dickins 2007-05-02 17:01 ` Christoph Lameter 2007-05-02 17:25 ` Christoph Lameter 0 siblings, 2 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-02 12:45 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm On Tue, 1 May 2007, Christoph Lameter wrote: > On Tue, 1 May 2007, Hugh Dickins wrote: > > > Yes, to me it does. If it could be defaulted to on throughout the > > -rcs, on every architecture, then I'd say that's "finishing work"; > > and we'd be safe knowing we could go back to slab in a hurry if > > needed. But it hasn't reached that stage yet, I think. > > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB > is marked experimental and not the default. I said above that I thought SLUB ought to be defaulted to on throughout the -rcs: if we don't do that, we're not going to learn much from having it in Linus' tree. And perhaps that line which appends "PREEMPT " to an oops report ought to append "SLUB " too, for so long as there's a choice. > The only problems that I am aware of is(or was) the issue with arches > modifying page struct fields of slab pages that SLUB needs for its own > operations. And I thought it was all fixed since the powerpc guys were > quiet and the patch was in for i386. You're forgetting your unions in struct page: in the SPLIT_PTLOCK case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl, which overlays SLUB's first_page and slab pointers. I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover edited to 8 cpus instead, and then no crash. I presume the answer is just to extend your quicklist work to powerpc's lowest level of pagetables. The only other architecture which is using kmem_cache for them is arm26, which has "#error SMP is not supported", so won't be giving this problem. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
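The overlap Hugh describes above can be sketched in C. This is a toy model under stated assumptions, not the kernel's definition of struct page: the field names ptl, slab and first_page come from the thread, while toy_spinlock_t, struct toy_page and the exact layout are hypothetical and only illustrate why a page that is both a pagetable page and a slab page cannot serve both users.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for spinlock_t; the real type is arch-specific. */
typedef struct { uint32_t raw; } toy_spinlock_t;

/* Toy model of the overlapping part of struct page (illustrative layout). */
struct toy_page {
	unsigned long flags;
	union {
		toy_spinlock_t ptl;		/* split page table lock */
		void *slab;			/* SLUB: owning cache */
		struct toy_page *first_page;	/* SLUB: head page */
	} u;
};

/* All union members start at the same byte of the page struct. */
_Static_assert(offsetof(struct toy_page, u.ptl) ==
	       offsetof(struct toy_page, u.slab), "ptl overlays slab");

/*
 * Returns 1 if storing to the page table lock changed the bytes that
 * the allocator would later read back as its slab pointer.
 */
int ptl_write_clobbers_slab(void)
{
	struct toy_page page;
	int cache = 0;	/* dummy object standing in for a kmem_cache */
	unsigned char before[sizeof(void *)], after[sizeof(void *)];

	memset(&page, 0, sizeof(page));
	page.u.slab = &cache;
	memcpy(before, &page.u.slab, sizeof(before));

	page.u.ptl.raw = 0xdeadbeefu;	/* pagetable code uses its lock */
	memcpy(after, &page.u.slab, sizeof(after));

	return memcmp(before, after, sizeof(before)) != 0;
}
```

In this model the allocator's pointer is corrupted as soon as the pagetable code touches its lock, which would be consistent with a crash on the free path once the slab pointer is dereferenced.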
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 12:45 ` Hugh Dickins @ 2007-05-02 17:01 ` Christoph Lameter 2007-05-02 18:08 ` Hugh Dickins 2007-05-02 17:25 ` Christoph Lameter 1 sibling, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 17:01 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Hugh Dickins wrote: > > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB > > is marked experimental and not the default. > > I said above that I thought SLUB ought to be defaulted to on throughout > the -rcs: if we don't do that, we're not going to learn much from having > it in Linus' tree. I'd rather be careful with that..... mm is enough for now. Why go to the extremes immediately. If it is an option then people can gradually start testing with it. > > The only problems that I am aware of is(or was) the issue with arches > > modifying page struct fields of slab pages that SLUB needs for its own > > operations. And I thought it was all fixed since the powerpc guys were > > quiet and the patch was in for i386. > > You're forgetting your unions in struct page: in the SPLIT_PTLOCK > case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl, > which overlays SLUB's first_page and slab pointers. Uhhh.... Right. So SLUB wont work if the lowest page table block is managed via slabs. > I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover > edited to 8 cpus instead, and then no crash. > > I presume the answer is just to extend your quicklist work to > powerpc's lowest level of pagetables. The only other architecture I am not sure how PowerPCs lower pagetable pages work. If they are of PAGE_SIZE then this is no problem. > which is using kmem_cache for them is arm26, which has > "#error SMP is not supported", so won't be giving this problem. Ahh. Good. But these are arch specific problems. We could use ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. 
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 17:01 ` Christoph Lameter @ 2007-05-02 18:08 ` Hugh Dickins 2007-05-02 18:28 ` Christoph Lameter 0 siblings, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-02 18:08 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Christoph Lameter wrote: > > But these are arch specific problems. We could use > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT diminishes the testing SLUB will get. If the idea is that we're going to support both SLAB and SLUB, some arches with one, some with another, some with either, for more than a single release, then I'm back to saying SLUB is being pushed in too early. I can understand people wanting pluggable schedulers, but pluggable slab allocators? Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:08 ` Hugh Dickins @ 2007-05-02 18:28 ` Christoph Lameter 2007-05-02 18:42 ` Andrew Morton 0 siblings, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 18:28 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm On Wed, 2 May 2007, Hugh Dickins wrote: > On Wed, 2 May 2007, Christoph Lameter wrote: > > > > But these are arch specific problems. We could use > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. > > As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT > diminishes the testing SLUB will get. If the idea is that we're > going to support both SLAB and SLUB, some arches with one, some > with another, some with either, for more than a single release, > then I'm back to saying SLUB is being pushed in too early. > I can understand people wanting pluggable schedulers, > but pluggable slab allocators? This is a sensitive piece of the kernel as you say and we better allow the running of two allocator for some time to make sure that it behaves in all load situations. The design is fundamentally different so its performance characteristics may diverge significantly and perhaps there will be corner cases for each where they do the best job. I have already reworked the slab API to allow for an easy implementation of alternate slab allocators (released with 2.6.20) which only covered SLAB and SLOB. This is continuing the cleanup work and adding a third one. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:28 ` Christoph Lameter @ 2007-05-02 18:42 ` Andrew Morton 2007-05-02 18:53 ` Christoph Lameter 0 siblings, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-02 18:42 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm On Wed, 2 May 2007 11:28:26 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > On Wed, 2 May 2007, Christoph Lameter wrote: > > > > > > But these are arch specific problems. We could use > > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms. > > > > As a quick hack, sure. But every ARCH_USES_SLAB_PAGE_STRUCT > > diminishes the testing SLUB will get. If the idea is that we're > > going to support both SLAB and SLUB, some arches with one, some > > with another, some with either, for more than a single release, > > then I'm back to saying SLUB is being pushed in too early. > > I can understand people wanting pluggable schedulers, > > but pluggable slab allocators? > > This is a sensitive piece of the kernel as you say and we better allow the > running of two allocator for some time to make sure that it behaves in all > load situations. The design is fundamentally different so its performance > characteristics may diverge significantly and perhaps there will be corner > cases for each where they do the best job. eek. We'd need to fix those corner cases then. Our endgame here really must be rm mm/slab.c. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 18:42 ` Andrew Morton
@ 2007-05-02 18:53 ` Christoph Lameter
0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007, Andrew Morton wrote:

> > This is a sensitive piece of the kernel as you say and we better allow the
> > running of two allocator for some time to make sure that it behaves in all
> > load situations. The design is fundamentally different so its performance
> > characteristics may diverge significantly and perhaps there will be corner
> > cases for each where they do the best job.
>
> eek. We'd need to fix those corner cases then. Our endgame
> here really must be rm mm/slab.c.

First we need to discover them and I doubt that mm covers much more than
development loads. I hope we can get to a point where we have SLUB be the
primary allocator soon but I would expect various performance issues to
show up.

On the other hand: I am pretty sure that SLUB can replace SLOB completely
given SLOB's limitations and SLUB's more efficient use of space. SLOB needs
8 bytes of overhead. SLUB needs none. We may just have to #ifdef out the
debugging support to make the code be of similar size to SLOB too. SLOB is
a general problem because its features are not compatible with SLAB. F.e. it
does not support DESTROY_BY_RCU and does not do reclaim the right way etc
etc. SLUB may turn out to be the ideal embedded slab allocator.
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 12:45 ` Hugh Dickins
2007-05-02 17:01 ` Christoph Lameter
@ 2007-05-02 17:25 ` Christoph Lameter
2007-05-02 18:36 ` Hugh Dickins
2007-05-03 8:15 ` Andrew Morton
1 sibling, 2 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 17:25 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> I presume the answer is just to extend your quicklist work to
> powerpc's lowest level of pagetables. The only other architecture
> which is using kmem_cache for them is arm26, which has
> "#error SMP is not supported", so won't be giving this problem.

In the meantime we would need something like this to disable SLUB in this
particular configuration. Note that I have not tested this and the <= for
the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
construct in a Kconfig file but it is needed here).



PowerPC: Disable SLUB for configurations in which slab page structs are modified

PowerPC uses the slab allocator to manage the lowest level of the page table.
In high cpu configurations we also use the page struct to split the page
table lock. Disallow the selection of SLUB for that case.

[Not tested: I am not familiar with powerpc build procedures etc]

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
+++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
@@ -117,6 +117,19 @@ config GENERIC_BUG
 	default y
 	depends on BUG
 
+#
+# Powerpc uses the slab allocator to manage its ptes and the
+# page structs of ptes are used for splitting the page table
+# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
+#
+# In that special configuration the page structs of slabs are modified.
+# This setting disables the selection of SLUB as a slab allocator.
+#
+config ARCH_USES_SLAB_PAGE_STRUCT
+	bool
+	default y
+	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
+
 config DEFAULT_UIMAGE
 	bool
 	help
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 17:25 ` Christoph Lameter
@ 2007-05-02 18:36 ` Hugh Dickins
2007-05-02 18:39 ` Christoph Lameter
2007-05-03 8:15 ` Andrew Morton
1 sibling, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 18:36 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Christoph Lameter wrote:
> On Wed, 2 May 2007, Hugh Dickins wrote:
>
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables. The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
>
> In the meantime we would need something like this to disable SLUB in this
> particular configuration. Note that I have not tested this and the <= for
> the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).

I'm astonished and impressed, both with Kconfig and your use of it:
that does seem to work. Though I don't dare go so far as to give
the patch an ack, and don't like this way out at all.

It needs a proper (quicklist) solution, and by the time that solution
comes along, all the powerpc people will have CONFIG_SLAB=y in their
.config, and "make oldconfig" will just perpetuate that status quo,
instead of the switching over to CONFIG_SLUB=y. I think.
Unless we keep changing the config option names, or go through
a phase with no option.

I'd much rather be testing a quicklist patch:
I'd better give that a try.

Hugh

>
>
>
> PowerPC: Disable SLUB for configurations in which slab page structs are modified
>
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
>
> [Not tested: I am not familiar with powerpc build procedures etc]
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>  	default y
>  	depends on BUG
>
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> +	bool
> +	default y
> +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +
>  config DEFAULT_UIMAGE
>  	bool
>  	help
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:36 ` Hugh Dickins @ 2007-05-02 18:39 ` Christoph Lameter 2007-05-02 18:57 ` Andrew Morton 0 siblings, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 18:39 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm On Wed, 2 May 2007, Hugh Dickins wrote: > I'm astonished and impressed, both with Kconfig and your use of it: Thanks! > I'd much rather be testing a quicklist patch: > I'd better give that a try. Great. But I certainly do not mind people use SLAB. I do not think that one approach should be there for all. Choice is the way to have multiple allocators compete. One reason that SLAB is so crusty is because it was the only solution for so long. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:39 ` Christoph Lameter @ 2007-05-02 18:57 ` Andrew Morton 2007-05-02 19:01 ` Christoph Lameter 0 siblings, 1 reply; 233+ messages in thread From: Andrew Morton @ 2007-05-02 18:57 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm On Wed, 2 May 2007 11:39:20 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Wed, 2 May 2007, Hugh Dickins wrote: > > > I'm astonished and impressed, both with Kconfig and your use of it: > > Thanks! > > > I'd much rather be testing a quicklist patch: > > I'd better give that a try. > > Great. But I certainly do not mind people use SLAB. I do not think that > one approach should be there for all. Choice is the way to have multiple > allocators compete. One reason that SLAB is so crusty is because it was > the only solution for so long. > noooo, we don't want competing slab allocators, please. We should get slub working well on all architectures then remove slab completely. Having to maintain both slab.c and slub.c would be awful. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub 2007-05-02 18:57 ` Andrew Morton @ 2007-05-02 19:01 ` Christoph Lameter 2007-05-02 19:18 ` Pekka Enberg 0 siblings, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-02 19:01 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm On Wed, 2 May 2007, Andrew Morton wrote: > noooo, we don't want competing slab allocators, please. We should get slub > working well on all architectures then remove slab completely. Having to > maintain both slab.c and slub.c would be awful. Owww... You throw my roadmap out of the window and may create too high expectations of SLUB. I am the one who has to maintain SLAB and SLUB it seems and I have been dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it will be much easier once the cleanups are in. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 19:01 ` Christoph Lameter
@ 2007-05-02 19:18 ` Pekka Enberg
2007-05-02 19:34 ` Christoph Lameter
2007-05-02 19:43 ` Christoph Lameter
0 siblings, 2 replies; 233+ messages in thread
From: Pekka Enberg @ 2007-05-02 19:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> Owww... You throw my roadmap out of the window and may create too
> high expectations of SLUB.

Me too!

On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> I am the one who has to maintain SLAB and SLUB it seems and I have been
> dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
> will be much easier once the cleanups are in.

And then there's patches such as kmemleak which would need to target
all three. Plus it doesn't really make sense for users to select
between three competing implementations. Please don't take away our
high hopes of getting rid of mm/slab.c Christoph =)

Pekka
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 19:18 ` Pekka Enberg
@ 2007-05-02 19:34 ` Christoph Lameter
2007-05-02 19:43 ` Christoph Lameter
1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:34 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Pekka Enberg wrote:

> On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> > I am the one who has to maintain SLAB and SLUB it seems and I have been
> > dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
> > will be much easier once the cleanups are in.
>
> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competing implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

SLUB supports kmemleak (actually it's quite improved). Switch debugging on
and try

	cat /sys/slab/kmalloc-128/alloc_calls
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 19:18 ` Pekka Enberg
2007-05-02 19:34 ` Christoph Lameter
@ 2007-05-02 19:43 ` Christoph Lameter
1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:43 UTC (permalink / raw)
To: Pekka Enberg
Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Pekka Enberg wrote:

> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competing implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

You too, Brutus ...
* Re: 2.6.22 -mm merge plans: slub
2007-05-02 17:25 ` Christoph Lameter
2007-05-02 18:36 ` Hugh Dickins
@ 2007-05-03 8:15 ` Andrew Morton
2007-05-03 8:27 ` William Lee Irwin III
2007-05-03 8:46 ` Hugh Dickins
1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-03 8:15 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
>
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables. The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
>
> In the meantime we would need something like this to disable SLUB in this
> particular configuration. Note that I have not tested this and the <= for
> the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).
>
>
>
> PowerPC: Disable SLUB for configurations in which slab page structs are modified
>
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
>
> [Not tested: I am not familiar with powerpc build procedures etc]
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>  	default y
>  	depends on BUG
>
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> +	bool
> +	default y
> +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +

That all seems to work as intended.

However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
machine early in boot.

Too early for netconsole, no serial console. Wedges up uselessly with
CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(

However I was able to glimpse some stuff as it flew past. Crash started in
flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know
how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
card, perhaps.
* Re: 2.6.22 -mm merge plans: slub 2007-05-03 8:15 ` Andrew Morton @ 2007-05-03 8:27 ` William Lee Irwin III 2007-05-03 16:30 ` Christoph Lameter 2007-05-03 8:46 ` Hugh Dickins 1 sibling, 1 reply; 233+ messages in thread From: William Lee Irwin III @ 2007-05-03 8:27 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Hugh Dickins, haveblue, linux-kernel, linux-mm On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote: > That all seems to work as intended. > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. I've seen this crash in flush_old_exec() before. ISTR it being due to slub vs. pagetable alignment or something on that order. -- wli ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub
2007-05-03 8:27 ` William Lee Irwin III
@ 2007-05-03 16:30 ` Christoph Lameter
0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-03 16:30 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Thu, 3 May 2007, William Lee Irwin III wrote:

> I've seen this crash in flush_old_exec() before. ISTR it being due to
> slub vs. pagetable alignment or something on that order.

From the other discussion regarding SLAB: It may be necessary for
powerpc to set the default alignment to 8 bytes on 32 bit powerpc
because it requires that alignment for 64 bit values.

SLUB will *not* disable debugging like SLAB if you do that.
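The alignment point above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming objects are laid out back to back in a slab with each object size rounded up to the allocator's default alignment; align_up and object_offset are hypothetical helper names, and this is not SLUB's actual layout code.

```c
#include <stddef.h>

/* Round size up to the next multiple of align (a power of two). */
size_t align_up(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}

/*
 * Offset of the n-th object in a slab when objects are packed back to
 * back and each one is padded to the default alignment (assumption).
 */
size_t object_offset(size_t n, size_t object_size, size_t align)
{
	return n * align_up(object_size, align);
}
```

With a 12 byte object and 4 byte alignment, the second object starts at offset 12, so a 64 bit field at its start would not be 8 byte aligned. With 8 byte alignment the stride becomes 16 and every object starts on an 8 byte boundary.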
* Re: 2.6.22 -mm merge plans: slub 2007-05-03 8:15 ` Andrew Morton 2007-05-03 8:27 ` William Lee Irwin III @ 2007-05-03 8:46 ` Hugh Dickins 2007-05-03 8:57 ` Andrew Morton 1 sibling, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-03 8:46 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm On Thu, 3 May 2007, Andrew Morton wrote: > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > + bool > > + default y > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > + > > That all seems to work as intended. > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > machine early in boot. I thought that if that worked as intended, you wouldn't even get the chance to choose SLUB=y? That was how it was working for me (but I realize I didn't try more than make oldconfig). > > Too early for netconsole, no serial console. Wedges up uselessly with > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :( > > However I was able to glimpse some stuff as it flew past. Crash started in > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial > card, perhaps. That sounds like what happens when SLUB's pagestruct use meets SPLIT_PTLOCK's pagestruct use. Does your .config really show CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub
2007-05-03 8:46 ` Hugh Dickins
@ 2007-05-03 8:57 ` Andrew Morton
2007-05-03 9:15 ` Hugh Dickins
2007-05-03 16:45 ` 2.6.22 -mm merge plans: slub Christoph Lameter
0 siblings, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-03 8:57 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Thu, 3 May 2007, Andrew Morton wrote:
> > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > +	bool
> > > +	default y
> > > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > +
> >
> > That all seems to work as intended.
> >
> > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > machine early in boot.
>
> I thought that if that worked as intended, you wouldn't even
> get the chance to choose SLUB=y? That was how it was working
> for me (but I realize I didn't try more than make oldconfig).

Right. This can be tested on x86 without a cross-compiler:

	ARCH=powerpc make mrproper
	ARCH=powerpc make fooconfig

> >
> > Too early for netconsole, no serial console. Wedges up uselessly with
> > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> > CONFIG_XMON=y. This is all fairly typical for a powerpc/G5 crash :(
> >
> > However I was able to glimpse some stuff as it flew past. Crash started in
> > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free. I don't know
> > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> > card, perhaps.
>
> That sounds like what happens when SLUB's pagestruct use meets
> SPLIT_PTLOCK's pagestruct use. Does your .config really show
> CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
>

Nope.

g5:/usr/src/25> grep SLUB .config
CONFIG_SLUB=y
g5:/usr/src/25> grep SLAB .config
# CONFIG_SLAB is not set
g5:/usr/src/25> grep CPUS .config
CONFIG_NR_CPUS=8
# CONFIG_CPUSETS is not set
# CONFIG_IRQ_ALL_CPUS is not set
CONFIG_SPLIT_PTLOCK_CPUS=4

It's in http://userweb.kernel.org/~akpm/config-g5.txt
* Re: 2.6.22 -mm merge plans: slub 2007-05-03 8:57 ` Andrew Morton @ 2007-05-03 9:15 ` Hugh Dickins 2007-05-03 21:04 ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins 2007-05-03 16:45 ` 2.6.22 -mm merge plans: slub Christoph Lameter 1 sibling, 1 reply; 233+ messages in thread From: Hugh Dickins @ 2007-05-03 9:15 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm On Thu, 3 May 2007, Andrew Morton wrote: > On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote: > > On Thu, 3 May 2007, Andrew Morton wrote: > > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > > > > > > > +config ARCH_USES_SLAB_PAGE_STRUCT > > > > + bool > > > > + default y > > > > + depends on SPLIT_PTLOCK_CPUS <= NR_CPUS > > > > + > > > > > > That all seems to work as intended. > > > > > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the > > > machine early in boot. > > > > I thought that if that worked as intended, you wouldn't even > > get the chance to choose SLUB=y? That was how it was working > > for me (but I realize I didn't try more than make oldconfig). > > > > That sounds like what happens when SLUB's pagestruct use meets > > SPLIT_PTLOCK's pagestruct use. Does your .config really show > > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y? > > Nope. > > g5:/usr/src/25> grep SLUB .config > CONFIG_SLUB=y > g5:/usr/src/25> grep SLAB .config > # CONFIG_SLAB is not set > g5:/usr/src/25> grep CPUS .config > CONFIG_NR_CPUS=8 > # CONFIG_CPUSETS is not set > # CONFIG_IRQ_ALL_CPUS is not set > CONFIG_SPLIT_PTLOCK_CPUS=4 > > It's in http://userweb.kernel.org/~akpm/config-g5.txt Seems we're all wrong in thinking Christoph's Kconfiggery worked as intended: maybe it just works some of the time. I'm not going to hazard a guess as to how to fix it up, will resume looking at the powerpc's quicklist potential later. 
Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
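For reference, the rule that the quoted Kconfig fragment seems meant to encode can be written out as a small C sketch. This only illustrates the intended logic and says nothing about how Kconfig itself evaluates the expression. The function names are made up for illustration, and the assumption that SLUB is offered only when ARCH_USES_SLAB_PAGE_STRUCT is unset is inferred from Hugh's question rather than shown in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Intended meaning of:
 *     depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
 * i.e. split page table locks are in use, so the arch needs the
 * struct page fields for itself. (Hypothetical helper, not kernel code.)
 */
static bool arch_uses_slab_page_struct(unsigned int split_ptlock_cpus,
				       unsigned int nr_cpus)
{
	assert(nr_cpus >= 1);	/* a kernel is built for at least one CPU */
	return split_ptlock_cpus <= nr_cpus;
}

/*
 * ASSUMPTION: SLUB is only selectable when the symbol above is unset.
 * This is inferred from the discussion, not quoted from a Kconfig file.
 */
static bool slub_selectable(unsigned int split_ptlock_cpus,
			    unsigned int nr_cpus)
{
	return !arch_uses_slab_page_struct(split_ptlock_cpus, nr_cpus);
}
```

Under this reading, slub_selectable(4, 8) is false, so a .config with NR_CPUS=8, SPLIT_PTLOCK_CPUS=4 and CONFIG_SLUB=y suggests the dependency was not being evaluated as intended.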
* Re: 2.6.22 -mm merge plans: slub on PowerPC
2007-05-03 9:15 ` Hugh Dickins
@ 2007-05-03 21:04 ` Hugh Dickins
2007-05-03 21:15 ` Christoph Lameter
2007-05-04 0:25 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 21:04 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Paul Mackerras, Benjamin Herrenschmidt, linux-kernel, linux-mm

On Thu, 3 May 2007, Hugh Dickins wrote:
>
> Seems we're all wrong in thinking Christoph's Kconfiggery worked
> as intended: maybe it just works some of the time. I'm not going
> to hazard a guess as to how to fix it up, will resume looking at
> the powerpc's quicklist potential later.

Here's the patch I've been testing on G5, with 4k and with 64k pages,
with SLAB and with SLUB. But, though it doesn't crash, the pgd
kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
for using highorder allocations where SLAB would stick to order 0:
under load, exec's mm_init gets page allocation failure on order 4
- SLUB's calculate_order may need some retuning. (I'd expect it to
be going for order 3 actually, I'm not sure how order 4 comes about.)

I don't know how offensive Ben and Paulus may find this patch:
the kmem_cache use was nicely done and this messes it up a little.


The SLUB allocator relies on struct page fields first_page and slab,
overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
be used for the lowest level of pagetable pages. This was obstructing
SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert
its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
want partpages, so continue to use kmem_caches for pmd, pud and pgd).
But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
 arch/powerpc/Kconfig          |    4 ++++
 arch/powerpc/mm/init_64.c     |   17 ++++++-----------
 include/asm-powerpc/pgalloc.h |   26 +++++++++++---------------
 3 files changed, 21 insertions(+), 26 deletions(-)

--- 2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-04-26 13:33:51.000000000 +0100
+++ linux/arch/powerpc/Kconfig	2007-05-03 20:45:12.000000000 +0100
@@ -31,6 +31,10 @@ config MMU
 	bool
 	default y

+config QUICKLIST
+	bool
+	default y
+
 config GENERIC_HARDIRQS
 	bool
 	default y
--- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c	2007-04-26 13:33:51.000000000 +0100
+++ linux/arch/powerpc/mm/init_64.c	2007-05-03 20:45:12.000000000 +0100
@@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct
 	memset(addr, 0, kmem_cache_size(cache));
 }

-#ifdef CONFIG_PPC_64K_PAGES
-static const unsigned int pgtable_cache_size[3] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pte_pmd_cache", "pmd_cache", "pgd_cache",
-};
-#else
 static const unsigned int pgtable_cache_size[2] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE
+	PGD_TABLE_SIZE, PMD_TABLE_SIZE
 };
 static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pgd_pte_cache", "pud_pmd_cache",
-};
+#ifdef CONFIG_PPC_64K_PAGES
+	"pgd_cache", "pmd_cache",
+#else
+	"pgd_cache", "pud_pmd_cache",
 #endif /* CONFIG_PPC_64K_PAGES */
+};

 #ifdef CONFIG_HUGETLB_PAGE
 /* Hugepages need one extra cache, initialized in hugetlbpage.c. We
--- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h	2007-02-04 18:44:54.000000000 +0000
+++ linux/include/asm-powerpc/pgalloc.h	2007-05-03 20:45:12.000000000 +0100
@@ -10,21 +10,15 @@
 #include <linux/slab.h>
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
+#include <linux/quicklist.h>

 extern struct kmem_cache *pgtable_cache[];

-#ifdef CONFIG_PPC_64K_PAGES
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PGD_CACHE_NUM	2
-#define HUGEPTE_CACHE_NUM 3
-#else
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PUD_CACHE_NUM	1
 #define PGD_CACHE_NUM	0
+#define PUD_CACHE_NUM	1
+#define PMD_CACHE_NUM	1
 #define HUGEPTE_CACHE_NUM 2
-#endif
+#define PTE_CACHE_NUM	3	/* from quicklist rather than kmem_cache */

 /*
  * This program is free software; you can redistribute it and/or
@@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }

 static inline struct page *pte_alloc_one(struct mm_struct *mm,
@@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one

 static inline void pte_free_kernel(pte_t *pte)
 {
-	kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte);
+	quicklist_free(0, NULL, pte);
 }

 static inline void pte_free(struct page *ptepage)
@@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_
 	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
 	int cachenum = pgf.val & PGF_CACHENUM_MASK;

-	kmem_cache_free(pgtable_cache[cachenum], p);
+	if (cachenum == PTE_CACHE_NUM)
+		quicklist_free(0, NULL, p);
+	else
+		kmem_cache_free(pgtable_cache[cachenum], p);
 }

 extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
@@ -153,7 +149,7 @@ extern void pgtable_free_tlb(struct mmu_
 		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
 #endif /* CONFIG_PPC_64K_PAGES */

-#define check_pgt_cache()	do { } while (0)
+#define check_pgt_cache()	quicklist_trim(0, NULL, 25, 16)

 #endif /* CONFIG_PPC64 */
 #endif /* __KERNEL__ */

^ permalink raw reply [flat|nested] 233+ messages in thread
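The pgtable_free() hunk in the patch above depends on a cache number being packed into the low bits of an aligned page table pointer. A minimal user-space sketch of that encoding follows, to show why PTE_CACHE_NUM still has to exist even though no kmem_cache backs it. The mask value 0x3 and the helper names (pgf_encode, pgf_pointer, pgf_cachenum, pgf_free_path) are assumptions for illustration: the excerpt does not show the definition of PGF_CACHENUM_MASK, and the kernel's own helpers may look different.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: two low bits carry the cache number; not shown in the patch. */
#define PGF_CACHENUM_MASK	((uintptr_t)0x3)
#define PTE_CACHE_NUM		3	/* value taken from the patch above */

typedef struct { uintptr_t val; } pgtable_free_t;

enum free_path { FREE_VIA_KMEM_CACHE, FREE_VIA_QUICKLIST };

/* Pack an aligned pointer and a small cache number into one word. */
static pgtable_free_t pgf_encode(void *p, unsigned int cachenum)
{
	pgtable_free_t pgf;
	uintptr_t addr = (uintptr_t)p;

	assert((addr & PGF_CACHENUM_MASK) == 0);	/* low bits must be free */
	assert(cachenum <= PGF_CACHENUM_MASK);		/* number must fit in the mask */
	pgf.val = addr | cachenum;
	return pgf;
}

/* Recover the pointer, as the patched pgtable_free() does. */
static void *pgf_pointer(pgtable_free_t pgf)
{
	return (void *)(pgf.val & ~PGF_CACHENUM_MASK);
}

/* Recover the cache number, as the patched pgtable_free() does. */
static unsigned int pgf_cachenum(pgtable_free_t pgf)
{
	return (unsigned int)(pgf.val & PGF_CACHENUM_MASK);
}

/* Mirror the dispatch the patch adds: pte pages go back to the quicklist. */
static enum free_path pgf_free_path(pgtable_free_t pgf)
{
	return pgf_cachenum(pgf) == PTE_CACHE_NUM ? FREE_VIA_QUICKLIST
						  : FREE_VIA_KMEM_CACHE;
}
```

If the mask really is two bits wide, the four values used by the patch (0 for pgd, 1 for pud/pmd, 2 for hugepte, 3 for pte) fill it exactly, which would explain why the pte level is given the otherwise unused number 3 rather than a new range.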
* Re: 2.6.22 -mm merge plans: slub on PowerPC 2007-05-03 21:04 ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins @ 2007-05-03 21:15 ` Christoph Lameter 2007-05-03 22:41 ` Hugh Dickins 2007-05-04 0:25 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 233+ messages in thread From: Christoph Lameter @ 2007-05-03 21:15 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Paul Mackerras, Benjamin Herrenschmidt, linux-kernel, linux-mm On Thu, 3 May 2007, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) There are SLUB patches pending (not in rc7-mm2 as far as I can recall) that reduce the default page order sizes to head off this issue. The defaults were initially too large (and they still default to large for testing if Mel's Antifrag work is detected to be active). > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > - GFP_KERNEL|__GFP_REPEAT); > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); __GFP_REPEAT is unusual here but this was carried over from the kmem_cache_alloc it seems. Hmm... There is some variance on how we do this between arches. Should we uniformly set or not set this flag? 
clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-ia64/*
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);

clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc arch/i386/mm/*
arch/i386/mm/pgtable.c: pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);

clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-sparc64/*
include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h: return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h: void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);

clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-x86_64/*
include/asm-x86_64/pgalloc.h: return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h: return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h: pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
include/asm-x86_64/pgalloc.h: return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h: void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub on PowerPC 2007-05-03 21:15 ` Christoph Lameter @ 2007-05-03 22:41 ` Hugh Dickins 0 siblings, 0 replies; 233+ messages in thread From: Hugh Dickins @ 2007-05-03 22:41 UTC (permalink / raw) To: Christoph Lameter Cc: Andrew Morton, Paul Mackerras, Benjamin Herrenschmidt, linux-kernel, linux-mm On Thu, 3 May 2007, Christoph Lameter wrote: > > There are SLUB patches pending (not in rc7-mm2 as far as I can recall) > that reduce the default page order sizes to head off this issue. The > defaults were initially too large (and they still default to large > for testing if Mel's Antifrag work is detected to be active). Sounds good. > > - return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM], > > - GFP_KERNEL|__GFP_REPEAT); > > + return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL); > > __GFP_REPEAT is unusual here but this was carried over from the > kmem_cache_alloc it seems. Hmm... There is some variance on how we do this > between arches. Should we uniformly set or not set this flag? Not something to get into in this patch, but it did surprise me too. I believe __GFP_REPEAT should be avoided, and I don't see justification for it here (but need to remember not to do a blind virt_to_page on the result in some places if it might return NULL - which IIRC it actually can do even if __GFP_REPEAT, when chosen for OOM kill). But I've a suspicion it got put in there for some good reason I don't know about. Hugh ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub on PowerPC 2007-05-03 21:04 ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins 2007-05-03 21:15 ` Christoph Lameter @ 2007-05-04 0:25 ` Benjamin Herrenschmidt 2007-05-04 0:54 ` Christoph Lameter 1 sibling, 1 reply; 233+ messages in thread From: Benjamin Herrenschmidt @ 2007-05-04 0:25 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Christoph Lameter, Paul Mackerras, linux-kernel, linux-mm On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote: > On Thu, 3 May 2007, Hugh Dickins wrote: > > > > Seems we're all wrong in thinking Christoph's Kconfiggery worked > > as intended: maybe it just works some of the time. I'm not going > > to hazard a guess as to how to fix it up, will resume looking at > > the powerpc's quicklist potential later. > > Here's the patch I've been testing on G5, with 4k and with 64k pages, > with SLAB and with SLUB. But, though it doesn't crash, the pgd > kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity > for using highorder allocations where SLAB would stick to order 0: > under load, exec's mm_init gets page allocation failure on order 4 > - SLUB's calculate_order may need some retuning. (I'd expect it to > be going for order 3 actually, I'm not sure how order 4 comes about.) > > I don't know how offensive Ben and Paulus may find this patch: > the kmem_cache use was nicely done and this messes it up a little. > > > The SLUB allocator relies on struct page fields first_page and slab, > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then > be used for the lowest level of pagetable pages. This was obstructing > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd > want partpages, so continue to use kmem_caches for pmd, pud and pgd). > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM. Interesting... I'll have a look asap. Ben. 
^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: 2.6.22 -mm merge plans: slub on PowerPC
2007-05-04 0:25 ` Benjamin Herrenschmidt
@ 2007-05-04 0:54 ` Christoph Lameter
0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-04 0:54 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Hugh Dickins, Andrew Morton, Paul Mackerras, linux-kernel, linux-mm

On Fri, 4 May 2007, Benjamin Herrenschmidt wrote:

> > The SLUB allocator relies on struct page fields first_page and slab,
> > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> > be used for the lowest level of pagetable pages. This was obstructing
> > SLUB on PowerPC, which uses kmem_caches for its pagetables. So convert
> > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> > want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.
>
> Interesting... I'll have a look asap.

I would also recommend looking at removing the constructors for the
remaining slabs. A constructor requires that SLUB never touch the object
(the same situation that results from enabling debugging). So it must
increase the object size in order to put the free pointer after the
object. In the case of an order of 2 cache this has the particularly bad
effect of doubling the object size.

If the objects can be overwritten on free (no constructor) then we can
use the first word of the object as a free pointer on kfree, meaning we
can use a hot cacheline, so no cache miss. On alloc we have already
touched the first cacheline, which also avoids a cacheline fetch there.
This is the optimal way of operation for SLUB.

Hmmm.... We could add an option to allow the use of a constructor while
keeping the free pointer at the beginning of the object? Then we would
have to zap the first word on alloc. Would work like quicklists.

Add SLAB_FREEPOINTER_MAY_OVERLAP?

^ permalink raw reply [flat|nested] 233+ messages in thread
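The size effect Christoph describes can be made concrete with a simplified model. The sketch below is not SLUB's actual size calculation. It assumes (hypothetically) that a cache with a constructor appends one pointer-sized free pointer after the object, that a cache without one reuses the object's first word, and that the result is rounded up to the cache alignment. It also reads "order of 2 cache" as a cache whose objects are a power of two in size and aligned to that size, which is an interpretation rather than something the message states. The names round_up_pow2 and slot_size are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round size up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/*
 * Simplified per-object footprint in a slab (illustrative model only):
 *  - with a constructor the allocator may not scribble on a free object,
 *    so the free pointer is appended after it;
 *  - without one the free pointer overlays the first word, so the object
 *    only needs to be at least pointer-sized.
 */
static size_t slot_size(size_t object_size, size_t align, bool has_ctor)
{
	size_t size = object_size;

	if (has_ctor)
		size += sizeof(void *);
	else if (size < sizeof(void *))
		size = sizeof(void *);

	return round_up_pow2(size, align);
}
```

In this model a 4096-byte table aligned to 4096 bytes occupies 4096 bytes without a constructor and 8192 bytes with one, which is the doubling the message refers to.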
* Re: 2.6.22 -mm merge plans: slub 2007-05-03 8:57 ` Andrew Morton 2007-05-03 9:15 ` Hugh Dickins @ 2007-05-03 16:45 ` Christoph Lameter 1 sibling, 0 replies; 233+ messages in thread From: Christoph Lameter @ 2007-05-03 16:45 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm Hmmmm...There are a gazillion configs to choose from. It works fine with cell_defconfig. If I switch to 2 processors I can enable SLUB if I switch to 4 I cannot. I saw some other config weirdness like being unable to set SMP if SLOB is enabled with some configs. This should not work and does not work but the menus are then vanishing and one can still configure lots of processors while not having enabled SMP. It works as far as I can tell... The rest is arch weirdness. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton ` (20 preceding siblings ...) 2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins @ 2007-05-03 15:54 ` Ingo Molnar 2007-05-03 16:15 ` Michal Piotrowski ` (2 more replies) 2007-05-07 17:47 ` Josef Sipek 22 siblings, 3 replies; 233+ messages in thread From: Ingo Molnar @ 2007-05-03 15:54 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel, linux-mm, Con Kolivas * Andrew Morton <akpm@linux-foundation.org> wrote: > - If replying, please be sure to cc the appropriate individuals. > Please also consider rewriting the Subject: to something > appropriate. i'm wondering about swap-prefetch: mm-implement-swap-prefetching.patch swap-prefetch-avoid-repeating-entry.patch add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch The swap-prefetch feature is relatively compact: 10 files changed, 745 insertions(+), 1 deletion(-) it is contained mostly to itself: mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++ i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback i've seen so far was positive. Time to have this upstream and time for a desktop-oriented distro to pick it up. I think this has been held back way too long. It's .config selectable and it is as ready for integration as it ever is going to be. So it's a win/win scenario. Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar @ 2007-05-03 16:15 ` Michal Piotrowski 2007-05-03 16:23 ` Michal Piotrowski 2007-05-03 22:14 ` Con Kolivas 2007-05-04 7:34 ` Nick Piggin 2 siblings, 1 reply; 233+ messages in thread From: Michal Piotrowski @ 2007-05-03 16:15 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas Hi, On 03/05/07, Ingo Molnar <mingo@elte.hu> wrote: > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > > i'm wondering about swap-prefetch: > > mm-implement-swap-prefetching.patch > swap-prefetch-avoid-repeating-entry.patch > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch > > The swap-prefetch feature is relatively compact: > > 10 files changed, 745 insertions(+), 1 deletion(-) > > it is contained mostly to itself: > > mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++ > > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) http://www.stardust.webpages.pl/files/tbf/euridica/2.6.17-mm1/mm-config and I don't recall _any_ problems. It's very stable for me. > > Acked-by: Ingo Molnar <mingo@elte.hu> > > Ingo Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-03 16:15 ` Michal Piotrowski @ 2007-05-03 16:23 ` Michal Piotrowski 0 siblings, 0 replies; 233+ messages in thread From: Michal Piotrowski @ 2007-05-03 16:23 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas On 03/05/07, Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote: > Hi, > > On 03/05/07, Ingo Molnar <mingo@elte.hu> wrote: > > > > * Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > - If replying, please be sure to cc the appropriate individuals. > > > Please also consider rewriting the Subject: to something > > > appropriate. > > > > i'm wondering about swap-prefetch: > > > > mm-implement-swap-prefetching.patch > > swap-prefetch-avoid-repeating-entry.patch > > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch > > > > The swap-prefetch feature is relatively compact: > > > > 10 files changed, 745 insertions(+), 1 deletion(-) > > > > it is contained mostly to itself: > > > > mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++ > > > > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > > i've seen so far was positive. Time to have this upstream and time for a > > desktop-oriented distro to pick it up. > > > > I think this has been held back way too long. It's .config selectable > > and it is as ready for integration as it ever is going to be. So it's a > > win/win scenario. > > I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs) since 2.6.14-mm2 :) http://lkml.org/lkml/2005/11/11/260 Regards, Michal -- Michal K. K. Piotrowski Kernel Monkeys (http://kernel.wikidot.com/start) ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar 2007-05-03 16:15 ` Michal Piotrowski @ 2007-05-03 22:14 ` Con Kolivas 2007-05-04 7:34 ` Nick Piggin 2 siblings, 0 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-03 22:14 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm On Friday 04 May 2007 01:54, Ingo Molnar wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: > > - If replying, please be sure to cc the appropriate individuals. > > Please also consider rewriting the Subject: to something > > appropriate. > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. > > Acked-by: Ingo Molnar <mingo@elte.hu> Thank you very much for code review, ack and support! -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar 2007-05-03 16:15 ` Michal Piotrowski 2007-05-03 22:14 ` Con Kolivas @ 2007-05-04 7:34 ` Nick Piggin 2007-05-04 8:52 ` Ingo Molnar 2 siblings, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-04 7:34 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas Ingo Molnar wrote: > * Andrew Morton <akpm@linux-foundation.org> wrote: > > >>- If replying, please be sure to cc the appropriate individuals. >> Please also consider rewriting the Subject: to something >> appropriate. > > > i'm wondering about swap-prefetch: Well I had some issues with it that I don't think were fully discussed, and Andrew prompted me to say something, but it went off list for a couple of posts (my fault, sorry). Posting it below with Andrew's permission... > mm-implement-swap-prefetching.patch > swap-prefetch-avoid-repeating-entry.patch > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch > > The swap-prefetch feature is relatively compact: > > 10 files changed, 745 insertions(+), 1 deletion(-) > > it is contained mostly to itself: > > mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++ > > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback > i've seen so far was positive. Time to have this upstream and time for a > desktop-oriented distro to pick it up. > > I think this has been held back way too long. It's .config selectable > and it is as ready for integration as it ever is going to be. So it's a > win/win scenario. Being able to config all these core heuristics changes is really not that much of a positive. The fact that we might _need_ to config something out, and double the configuration range isn't too pleasing. 
Here were some of my concerns, and where our discussion got up to. Andrew Morton wrote: > On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>Andrew Morton wrote: >> >>>istr you had issues with swap-prefetch? >>> >>>If so, now's a good time to reiterate them ;) >> >>1) It is said to help with the updatedb overnight problem, however it >> actually _doesn't_ prefetch swap when there are low free pages, which >> is how updatedb will leave the system. So this really puzzles me how >> it would work. However if updatedb is causing excessive swapout, I >> think we should try improving use-once algorithms first, for example. > > > Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with > normal system activity we get enough free pages to kick the thing off and > running. Perhaps updatedb itself has a lot of rss, for example. Could be, but I don't know. I'd think it unlikely to allow _much_ swapin, if huge amounts of the desktop have been swapped out. But maybe... as I said, nobody seems to have a recipe for these things. > Would be useful to see this claim substantiated with a real testcase, > description of results and an explanation of how and why it worked. Yes... and then try to first improve regular page reclaim and use-once handling. >>2) It is a _highly_ speculative operation, and in workloads where periods >> of low and high page usage with genuinely unused anonymous / tmpfs >> pages, it could waste power, memory bandwidth, bus bandwidth, disk >> bandwidth... > > > Yes. I suspect that's a matter of waiting for the corner-case reporters to > complain, then add more heuristics. Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is happy to do a _lot_ of work for these things which we have already decided are least likely to be used again. >>3) I haven't seen a single set of numbers out of it. Feedback seems to >> have mostly come from people who > > > Yup. But can we come up with a testcase? 
It's hard. I guess it is hard firstly because swapping is quite random to start with. But I haven't even seen basic things like "make -jhuge swapstorm has no regressions". >>4) If this is helpful, wouldn't it be equally important for things like >> mapped file pages? Seems like half a solution. > > > True. > > Without thinking about it, I almost wonder if one could do a userspace > implementation with something which pokes around in /proc/pid/pagemap and > /proc/pid/kpagemap, perhaps with some additional interfaces added to > do a swapcache read. (Give userspace the ability to get at swapcache > via a regular fd?) > > (otoh the akpm usersapce implementation is swapoff -a;swapon -a) Perhaps. You may need a few indicators to see whether the system is idle... but OTOH, we've already got a lot of indicators for memory, disk usage, etc. So, maybe :) >>5) New one: it is possibly going to interact badly with MADV_FREE lazy >> freeing. The more complex we make page reclaim, the worse IMO. > > > That's a bit vague. What sort of problems do you envisage? Well MADV_FREE pages aren't technically free, are they? So it might be possible for a significant number of them to build up and prevent swap prefetch from working. Maybe. >>...) I had a few issues with implementation, like interaction with >> cpusets. Don't know if these are all fixed or not. I sort of gave >> up looking at it. > > > Ah yes, I remember some mention of cpusets. I forget what it was though. I could be wrong, but IIRC there is no good way to know which cpuset to bring the page back into, (and I guess similarly it would be hard to know what container to account it to, if doing account-on-allocate). We could hope that users of these features would be mostly disjoint sets, but that's an evil road to start heading down, where we have various core bits of mm that don't play nice together. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 7:34 ` Nick Piggin @ 2007-05-04 8:52 ` Ingo Molnar 2007-05-04 9:09 ` Nick Piggin ` (2 more replies) 0 siblings, 3 replies; 233+ messages in thread From: Ingo Molnar @ 2007-05-04 8:52 UTC (permalink / raw) To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > i'm wondering about swap-prefetch: > Being able to config all these core heuristics changes is really not > that much of a positive. The fact that we might _need_ to config > something out, and double the configuration range isn't too pleasing. Well, to the desktop user this is a speculative performance feature that he is willing to potentially waste CPU and IO capacity, in expectation of better performance. On the conceptual level it is _precisely the same thing as regular file readahead_. (with the difference that to me swapahead seems to be quite a bit more intelligent than our current file readahead logic.) This feature has no API or ABI impact at all, it's a pure performance feature. (besides the trivial sysctl to turn it runtime on/off). > Here were some of my concerns, and where our discussion got up to. > > Yes. Perhaps it just doesn't help with the updatedb thing. Or > > maybe with normal system activity we get enough free pages to kick > > the thing off and running. Perhaps updatedb itself has a lot of > > rss, for example. > > Could be, but I don't know. I'd think it unlikely to allow _much_ > swapin, if huge amounts of the desktop have been swapped out. But > maybe... as I said, nobody seems to have a recipe for these things. can i take this one as a "no fundamental objection"? 
There are really only 2 maintainance options left: 1) either you can do it better or at least have a _very_ clearly described idea outlined about how to do it differently 2) or you should let others try it #1 you've not done for 2-3 years since swap-prefetch was waiting for integration so it's not an option at this stage anymore. Then you are pretty much obliged to do #2. ;-) And let me be really blunt about this, there is no option #3 to say: "I have no real better idea, I have no code, I have no time, but hey, lets not merge this because it 'could in theory' be possible to do it better" =B-) really, we are likely be better off by risking the merge of _bad_ code (which in the swap-prefetch case is the exact opposite of the truth), than to let code stagnate. People are clearly unhappy about certain desktop aspects of swapping, and the only way out of that is to let more people hack that code. Merging code involves more people. It will cause 'noise' and could cause regressions, but at least in this case the only impact is 'performance' and the feature is trivial to disable. The maintainance drag outside of swap_prefetch.c is essentially _zero_. If the feature doesnt work it ends up on Con's desk. If it turns out to not work at all (despite years of testing and happy users) it still only ends up on Con's desk. A clear win/win scenario for you i think :-) > > Would be useful to see this claim substantiated with a real > > testcase, description of results and an explanation of how and why > > it worked. > > Yes... and then try to first improve regular page reclaim and use-once > handling. agreed. Con, IIRC you wrote a testcase for this, right? Could you please send us the results of that testing? > >>2) It is a _highly_ speculative operation, and in workloads where periods > >> of low and high page usage with genuinely unused anonymous / tmpfs > >> pages, it could waste power, memory bandwidth, bus bandwidth, disk > >> bandwidth... > > > > Yes. 
I suspect that's a matter of waiting for the corner-case > > reporters to complain, then add more heuristics. > > Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch > is happy to do a _lot_ of work for these things which we have already > decided are least likely to be used again. i see no real problem here. We've had heuristics for a _long_ time in various areas of the code. Sometimes they work, sometimes they suck. the flow of this is really easy: distro looking for a feature edge turns it on and announces it, if the feature does not work out for users then user turns it off and complains to distro, if enough users complain then distro turns it off for next release, upstream forgets about this performance feature and eventually removes it once someone notices that it wouldnt even compile in the past 2 main releases. I see no problem here, we did that in the past too with performance features. The networking stack has literally dozens of such small tunable things which get experimented with, and whose defaults do get tuned carefully. Some of the knobs help bandwidth, some help latency. I do not even see any risk of "splitup of mindshare" - swap-prefetch is so clearly speculative that it's not really a different view about how to do swapping (which would split the tester base, etc.), it's simply a "do you want your system to speculate about the future or not" add-on decision. Every system has a pretty clear idea about that: desktops generally want to do it, clusters generally dont want to do it. > >>3) I haven't seen a single set of numbers out of it. Feedback seems to > >> have mostly come from people who > > > > Yup. But can we come up with a testcase? It's hard. i think Con has a testcase. > >>4) If this is helpful, wouldn't it be equally important for things like > >> mapped file pages? Seems like half a solution. [...] > > (otoh the akpm usersapce implementation is swapoff -a;swapon -a) > > Perhaps. 
You may need a few indicators to see whether the system is > idle... but OTOH, we've already got a lot of indicators for memory, > disk usage, etc. So, maybe :) The time has passed for this. Let others play too. Please :-) > I could be wrong, but IIRC there is no good way to know which cpuset > to bring the page back into, (and I guess similarly it would be hard > to know what container to account it to, if doing > account-on-allocate). (i think cpusets are totally uninteresting in this context: nobody in their right mind is going to use swap-prefetch on a big NUMA box. Nor can i see any fundamental impediment to making this more cpuset-aware, just like other subsystems were made cpuset-aware, once the requests from actual users came in and people started getting interested in it.) I think the "lack of testcase and numbers" is the only valid technical objection i've seen so far. Con might be able to help us with that? Ingo ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 8:52 ` Ingo Molnar @ 2007-05-04 9:09 ` Nick Piggin 2007-05-04 12:10 ` Con Kolivas 2007-05-07 14:18 ` Bill Davidsen 2 siblings, 0 replies; 233+ messages in thread From: Nick Piggin @ 2007-05-04 9:09 UTC (permalink / raw) To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>Here were some of my concerns, and where our discussion got up to. > > >>>Yes. Perhaps it just doesn't help with the updatedb thing. Or >>>maybe with normal system activity we get enough free pages to kick >>>the thing off and running. Perhaps updatedb itself has a lot of >>>rss, for example. >> >>Could be, but I don't know. I'd think it unlikely to allow _much_ >>swapin, if huge amounts of the desktop have been swapped out. But >>maybe... as I said, nobody seems to have a recipe for these things. > > > can i take this one as a "no fundamental objection"? There are really > only 2 maintainance options left: > > 1) either you can do it better or at least have a _very_ clearly > described idea outlined about how to do it differently > > 2) or you should let others try it > > #1 you've not done for 2-3 years since swap-prefetch was waiting for > integration so it's not an option at this stage anymore. Then you are > pretty much obliged to do #2. ;-) The burden is not on me to get someone else's feature merged. If it can be shown to work well and people's concerns addressed, then anything will get merged. The reason Linux is so good is because of what we don't merge, figuratively speaking. I wanted to see some basic regression tests to show that it hasn't caused obvious problems, and some basic scenarios where it helps, so that we can analyse them. It is really simple, but I haven't got any since first asking. And note that I don't think I ever explicitly "nacked" anything, just voiced my concerns. 
If my concerns had been addressed, then I couldn't have stopped anybody from merging anything. >>>>2) It is a _highly_ speculative operation, and in workloads where periods >>>> of low and high page usage with genuinely unused anonymous / tmpfs >>>> pages, it could waste power, memory bandwidth, bus bandwidth, disk >>>> bandwidth... >>> >>>Yes. I suspect that's a matter of waiting for the corner-case >>>reporters to complain, then add more heuristics. >> >>Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch >>is happy to do a _lot_ of work for these things which we have already >>decided are least likely to be used again. > > > i see no real problem here. We've had heuristics for a _long_ time in > various areas of the code. Sometimes they work, sometimes they suck. So that's one of my issues with the code. If all you have to support a merge is anecdotal evidence, then I find it interesting that you would easily discount something like this. >>>>4) If this is helpful, wouldn't it be equally important for things like >>>> mapped file pages? Seems like half a solution. > > [...] > >>>(otoh the akpm userspace implementation is swapoff -a;swapon -a) >> >>Perhaps. You may need a few indicators to see whether the system is >>idle... but OTOH, we've already got a lot of indicators for memory, >>disk usage, etc. So, maybe :) > > > The time has passed for this. Let others play too. Please :-) Play with what? Prefetching mmapped file pages as well? Sure. >>I could be wrong, but IIRC there is no good way to know which cpuset >>to bring the page back into, (and I guess similarly it would be hard >>to know what container to account it to, if doing >>account-on-allocate). > > > (i think cpusets are totally uninteresting in this context: nobody in > their right mind is going to use swap-prefetch on a big NUMA box. 
Nor > can i see any fundamental impediment to making this more cpuset-aware, > just like other subsystems were made cpuset-aware, once the requests > from actual users came in and people started getting interested in it.) OK, so make it more cpuset aware. This isn't a new issue, I raised it a long time ago. And trust me, it is a nightmare to just assume that nobody will use cpusets on a small box for example (AFAIK the resource control guys are looking at doing just that). All core VM features should play nicely with each other without *really* good reason. > I think the "lack of testcase and numbers" is the only valid technical > objection i've seen so far. Well you're entitled to your opinion too. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 8:52 ` Ingo Molnar 2007-05-04 9:09 ` Nick Piggin @ 2007-05-04 12:10 ` Con Kolivas 2007-05-05 8:42 ` Con Kolivas 2007-05-07 14:28 ` Bill Davidsen 2007-05-07 14:18 ` Bill Davidsen 2 siblings, 2 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-04 12:10 UTC (permalink / raw) To: Ingo Molnar; +Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 717 bytes --]

On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> agreed. Con, IIRC you wrote a testcase for this, right? Could you please
> send us the results of that testing?

Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
disabled and then enabled, swap prefetch saves ~5 seconds on average hardware
on this one test case. I had many users try this and the results were between
2 and 10 seconds, but always showed a saving on this testcase. This effect
easily occurs on printing a big picture, editing a large file, compressing an
iso image or whatever in real world workloads. Smaller, but much more frequent
effects of this over the course of a day obviously also occur and do add up.

-- 
-ck

[-- Attachment #2: swap_prefetch_tester.c --]
[-- Type: text/x-csrc, Size: 2067 bytes --]

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}

	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

size_t get_ram(void)
{
	unsigned long ramsize = 0;
	FILE *meminfo;
	char aux[256];

	if(!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");

	/* Skip lines until the MemTotal entry has been parsed */
	while( !feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize) )
		fgets(aux,sizeof(aux),meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	/* meminfo "kB" are units of 1024 bytes, so this slightly underestimates */
	return ramsize * 1000;
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000 );
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size, full_size = get_ram();
	int sleep_seconds = 600;

	/* Divide first so the product cannot overflow a 32bit size_t */
	size = full_size / 10 * 7;

	printf("Starting first malloc of %zu bytes\n", size);
	buf1 = malloc(size);
	/* malloc reports failure with NULL, not (char *)-1 */
	if (buf1 == NULL)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);
	printf("Completed first malloc and starting second malloc of %zu bytes\n", full_size);

	buf2 = malloc(full_size);
	if (buf2 == NULL)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, full_size);
	buf4 = malloc(1);
	if (buf4 == NULL)
		fatal("Failed to malloc scratch byte\n");
	/* Walk backwards from the last valid byte, not one past the end */
	for (buf3 = buf2 + full_size; buf3 > buf2; buf3--)
		*buf4 = *(buf3 - 1);
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	free(buf4);
	free(buf1);
	printf("Completed read and freeing of first malloc\n");
	time_diff = current_time - time_diff;
	printf("Timed portion %lu microseconds\n",time_diff);

	return 0;
}

^ permalink raw reply [flat|nested] 233+ messages in thread
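A note on the size calculation in the tester above: /proc/meminfo labels its values "kB", but the kernel reports them in units of 1024 bytes, so multiplying by 1000 undercounts by about 2.4%. What follows is only a minimal sketch of a parser that takes that into account. The name meminfo_bytes is hypothetical and does not come from the attachment, and it works on a text buffer rather than on the live file so that it can be checked without a particular machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the original attachment): look for
 * "key:" at the start of a line in a /proc/meminfo-style text buffer and
 * return the value converted to bytes. Returns 0 when the key is missing
 * or the number cannot be parsed.
 */
unsigned long long meminfo_bytes(const char *text, const char *key)
{
	size_t keylen;
	const char *line = text;

	assert(text != NULL && key != NULL);
	keylen = strlen(key);

	while (line && *line) {
		/* Require the colon so "Mem" does not match "MemTotal" */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
			unsigned long long kib;

			if (sscanf(line + keylen + 1, "%llu", &kib) == 1)
				return kib * 1024ULL;	/* meminfo kB == 1024 bytes */
			return 0;
		}
		/* Advance to the character after the next newline, if any */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}
```

To use it against a running system, one would read /proc/meminfo into a buffer with fread() and call meminfo_bytes(buf, "MemTotal") and meminfo_bytes(buf, "SwapTotal"). The tester's own figures were produced with the factor of 1000, so the byte counts it prints are not directly comparable with values computed this way.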
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 12:10 ` Con Kolivas @ 2007-05-05 8:42 ` Con Kolivas 2007-05-06 10:13 ` [ck] " Antonino Ingargiola ` (2 more replies) 2007-05-07 14:28 ` Bill Davidsen 1 sibling, 3 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-05 8:42 UTC (permalink / raw) To: Ingo Molnar, ck list; +Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 1785 bytes --] On Friday 04 May 2007 22:10, Con Kolivas wrote: > On Friday 04 May 2007 18:52, Ingo Molnar wrote: > > agreed. Con, IIRC you wrote a testcase for this, right? Could you please > > send us the results of that testing? > > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch > disabled and then enabled swap prefetch saves ~5 seconds on average > hardware on this one test case. I had many users try this and the results > were between 2 and 10 seconds, but always showed a saving on this testcase. > This effect easily occurs on printing a big picture, editing a large file, > compressing an iso image or whatever in real world workloads. Smaller, but > much more frequent effects of this over the course of a day obviously also > occur and do add up. Here's a better swap prefetch tester. Instructions in file. 
Machine with 2GB ram and 2GB swapfile

Prefetch disabled:
./sp_tester
Ram 2060352000 Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 809 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 53397 milliseconds

Enabled:
./sp_tester
Ram 2060352000 Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 676 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 26351 milliseconds

Note huge time difference.

-- 
-ck

[-- Attachment #2: sp_tester.c --]
[-- Type: text/x-csrc, Size: 2890 bytes --]

/*
sp_tester.c

Build with:
gcc -o sp_tester sp_tester.c -lrt -W -Wall -O2

How to use:
echo 1 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

then repeat with changed conditions eg
echo 0 > /proc/sys/vm/swap_prefetch

Each Test takes 10 minutes
*/

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}

	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

unsigned long ramsize, swapsize;

size_t get_ram(void)
{
	FILE *meminfo;
	char aux[256];

	if(!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");

	/* Skip lines until MemTotal, then until SwapTotal, has been parsed */
	while( !feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize) )
		fgets(aux,sizeof(aux),meminfo);
	while( !feof(meminfo) && !fscanf(meminfo, "SwapTotal: %lu kB", &swapsize) )
		fgets(aux,sizeof(aux),meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	/* meminfo "kB" are units of 1024 bytes, so these slightly underestimate */
	ramsize *= 1000;
	swapsize *= 1000;
	printf("Ram %lu Swap %lu\n", ramsize, swapsize);
	return ramsize + (swapsize / 2);
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000 );
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size = get_ram();
	int sleep_seconds = 600;

	/* Never ask for more than 1.5x physical ram in total */
	if (size > ramsize / 2 * 3)
		size = ramsize / 2 * 3;
	printf("Total ram to be malloced: %zu bytes\n", size);
	size /= 2;
	printf("Starting first malloc of %zu bytes\n", size);
	buf1 = malloc(size);
	buf4 = malloc(1);
	/* malloc reports failure with NULL, not (char *)-1 */
	if (buf1 == NULL || buf4 == NULL)
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);
	printf("Starting 1st read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Touching this much ram takes %lu milliseconds\n",time_diff / 1000);
	printf("Starting second malloc of %zu bytes\n", size);

	buf2 = malloc(size);
	if (buf2 == NULL)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, size);
	/* Walk backwards from the last valid byte, not one past the end */
	for (buf3 = buf2 + size; buf3 > buf2; buf3--)
		*buf4 = *(buf3 - 1);
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting reread of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Completed read of first malloc\n");
	printf("Timed portion %lu milliseconds\n",time_diff / 1000);

	free(buf1);
	free(buf4);

	return 0;
}

^ permalink raw reply [flat|nested] 233+ messages in thread
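The timed reread in sp_tester.c comes down to two things: reading the clock around a loop, and making sure the loop really does touch the memory. A hedged sketch of both follows. The names elapsed_ms and touch_pages are hypothetical, and neither function is part of the posted tester. The sketch assumes one read per page is enough to force a swap-in, whereas the original reads every byte, so timings from the two are not comparable. The volatile sink is there because a compiler at -O2 may otherwise be entitled to drop reads whose result is never used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: milliseconds between two timespec samples, done in
 * 64-bit arithmetic so the intermediate value cannot wrap on 32bit.
 */
int64_t elapsed_ms(const struct timespec *start, const struct timespec *end)
{
	int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
		   + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);

	return ns / 1000000;
}

/*
 * Hypothetical helper: read one byte every "stride" bytes of buf so that
 * each page is faulted in. Returns the number of reads performed.
 */
size_t touch_pages(const char *buf, size_t size, size_t stride)
{
	volatile char sink = 0;	/* volatile store keeps the reads alive */
	size_t off, touched = 0;

	assert(stride > 0);
	for (off = 0; off < size; off += stride) {
		sink = buf[off];
		touched++;
	}
	(void)sink;
	return touched;
}
```

In a tester, the two samples would typically come from clock_gettime() with CLOCK_MONOTONIC instead of CLOCK_REALTIME, so that a wall-clock adjustment during the 600 second sleep cannot skew the result, and the stride would be the page size from sysconf(_SC_PAGESIZE).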
* Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-05 8:42 ` Con Kolivas @ 2007-05-06 10:13 ` Antonino Ingargiola 2007-05-06 18:22 ` Jory A. Pratt 2007-05-09 23:28 ` Con Kolivas 2 siblings, 0 replies; 233+ messages in thread From: Antonino Ingargiola @ 2007-05-06 10:13 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Nick Piggin, Andrew Morton, linux-kernel, linux-mm 2007/5/5, Con Kolivas <kernel@kolivas.org>: [cut] > Here's a better swap prefetch tester. Instructions in file. The system should be left totally inactive for the test duration (10 minutes), right? Regards, ~ Antonio ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-05 8:42 ` Con Kolivas 2007-05-06 10:13 ` [ck] " Antonino Ingargiola @ 2007-05-06 18:22 ` Jory A. Pratt 2007-05-09 23:28 ` Con Kolivas 2 siblings, 0 replies; 233+ messages in thread From: Jory A. Pratt @ 2007-05-06 18:22 UTC (permalink / raw) To: Con Kolivas Cc: Ingo Molnar, ck list, Nick Piggin, Andrew Morton, linux-kernel, linux-mm Con Kolivas wrote: > Here's a better swap prefetch tester. Instructions in file. > > Machine with 2GB ram and 2GB swapfile > > Prefetch disabled: > ./sp_tester > Ram 2060352000 Swap 1973420000 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 809 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 53397 milliseconds > > Enabled: > ./sp_tester > Ram 2060352000 Swap 1973420000 > Total ram to be malloced: 3047062000 bytes > Starting first malloc of 1523531000 bytes > Starting 1st read of first malloc > Touching this much ram takes 676 milliseconds > Starting second malloc of 1523531000 bytes > Completed second malloc and free > Sleeping for 600 seconds > Important part - starting reread of first malloc > Completed read of first malloc > Timed portion 26351 milliseconds > echo 1 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a ./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 1655670000 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 937 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 15011 milliseconds echo 0 > /proc/sys/vm/overcommit_memory swapoff -a swapon -a 
./sp_tester Ram 1153644000 Swap 1004052000 Total ram to be malloced: 1655670000 bytes Starting first malloc of 827835000 bytes Starting 1st read of first malloc Touching this much ram takes 1125 milliseconds Starting second malloc of 827835000 bytes Completed second malloc and free Sleeping for 600 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 14611 milliseconds ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-05 8:42 ` Con Kolivas 2007-05-06 10:13 ` [ck] " Antonino Ingargiola 2007-05-06 18:22 ` Jory A. Pratt @ 2007-05-09 23:28 ` Con Kolivas 2007-05-10 0:05 ` Nick Piggin 2 siblings, 1 reply; 233+ messages in thread From: Con Kolivas @ 2007-05-09 23:28 UTC (permalink / raw) To: Ingo Molnar; +Cc: ck list, Nick Piggin, Andrew Morton, linux-kernel, linux-mm On Saturday 05 May 2007 18:42, Con Kolivas wrote: > On Friday 04 May 2007 22:10, Con Kolivas wrote: > > On Friday 04 May 2007 18:52, Ingo Molnar wrote: > > > agreed. Con, IIRC you wrote a testcase for this, right? Could you > > > please send us the results of that testing? > > > > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch > > disabled and then enabled swap prefetch saves ~5 seconds on average > > hardware on this one test case. I had many users try this and the results > > were between 2 and 10 seconds, but always showed a saving on this > > testcase. This effect easily occurs on printing a big picture, editing a > > large file, compressing an iso image or whatever in real world workloads. > > Smaller, but much more frequent effects of this over the course of a day > > obviously also occur and do add up. > > Here's a better swap prefetch tester. Instructions in file. > > Machine with 2GB ram and 2GB swapfile > > Prefetch disabled: > ./sp_tester > Timed portion 53397 milliseconds > > Enabled: > ./sp_tester > Timed portion 26351 milliseconds > > Note huge time difference. Well how about that? That was the difference with a swap _file_ as I said, but I went ahead and checked with a swap partition as I used to have. I didn't notice, but somewhere in the last few months, swap prefetch code itself being unchanged for a year, seems to have been broken by other changes in the vm and it doesn't even start up prefetching often and has stale swap entries in its list. Once it breaks like that it does nothing from then on. 
So that leaves me with a quandary now. Do I: 1. Go ahead and find whatever breakage was introduced and fix it with hopefully a trivial change 2. Do option 1. and then implement support for yet another kernel feature (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't believe you that cpusets have anything to do with normal users on a desktop ever; if it's used on a desktop it will only be by a kernel developer testing the cpusets code]. or 3. Dump swap prefetch forever and ignore that it ever worked and was helpful and was a lot of work to implement and so on. Given that even if I do 1 and/or 2 it'll still be blocked from ever going to mainline I think the choice is clear. Nick since you're personally the gatekeeper for this code, would you like to make a call? Just say 3 and put me out of my misery please. -- -ck P.S. Ingo, thanks (and sorry) for your involvement here. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-09 23:28 ` Con Kolivas @ 2007-05-10 0:05 ` Nick Piggin 2007-05-10 1:34 ` Con Kolivas 0 siblings, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-10 0:05 UTC (permalink / raw) To: Con Kolivas; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm Con Kolivas wrote: > Well how about that? That was the difference with a swap _file_ as I said, but > I went ahead and checked with a swap partition as I used to have. I didn't > notice, but somewhere in the last few months, swap prefetch code itself being > unchanged for a year, seems to have been broken by other changes in the vm > and it doesn't even start up prefetching often and has stale swap entries in > its list. Once it breaks like that it does nothing from then on. So that > leaves me with a quandry now. > > > Do I: > > 1. Go ahead and find whatever breakage was introduced and fix it with > hopefully a trivial change > > 2. Do option 1. and then implement support for yet another kernel feature > (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't > believe you that cpusets have anything to do with normal users on a desktop > ever; if it's used on a desktop it will only be by a kernel developer testing > the cpusets code]. > > or > > 3. Dump swap prefetch forever and ignore that it ever worked and was helpful > and was a lot of work to implement and so on. > > > Given that even if I do 1 and/or 2 it'll still be blocked from ever going to > mainline I think the choice is clear. > > Nick since you're personally the gatekeeper for this code, would you like to > make a call? Just say 3 and put me out of my misery please. I'm not the gatekeeper and it is completely up to you whether you want to work on something or not... but I'm sure you understand where I was coming from when I suggested it doesn't get merged yet. 
You may not believe this, but I agree that swap prefetching (and prefetching in general) has some potential to help desktop workloads :). But it still should go through the normal process of being tested and questioned and having a look at options for first improving existing code in those problematic cases. Once that process happens and it is shown to work nicely, etc., then I would not be able to (or want to) keep it from getting merged. As far as cpusets goes... if your code goes in last, then you have to make it work with what is there, as a rule. People are using cpusets for memory resource control, which would have uses on a desktop system. It is just a really bad precedent to set, having different parts of the VM not work correctly together. Even if you made them mutually exclusive CONFIG_ options, that is still not a very nice solution. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 0:05 ` Nick Piggin @ 2007-05-10 1:34 ` Con Kolivas 2007-05-10 1:56 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Con Kolivas @ 2007-05-10 1:34 UTC (permalink / raw) To: Nick Piggin; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm On Thursday 10 May 2007 10:05, Nick Piggin wrote: > Con Kolivas wrote: > > Well how about that? That was the difference with a swap _file_ as I > > said, but I went ahead and checked with a swap partition as I used to > > have. I didn't notice, but somewhere in the last few months, swap > > prefetch code itself being unchanged for a year, seems to have been > > broken by other changes in the vm and it doesn't even start up > > prefetching often and has stale swap entries in its list. Once it breaks > > like that it does nothing from then on. So that leaves me with a quandry > > now. > > > > > > Do I: > > > > 1. Go ahead and find whatever breakage was introduced and fix it with > > hopefully a trivial change > > > > 2. Do option 1. and then implement support for yet another kernel feature > > (cpusets) that will be used perhaps never with swap prefetch [No Nick I > > don't believe you that cpusets have anything to do with normal users on a > > desktop ever; if it's used on a desktop it will only be by a kernel > > developer testing the cpusets code]. > > > > or > > > > 3. Dump swap prefetch forever and ignore that it ever worked and was > > helpful and was a lot of work to implement and so on. > > > > > > Given that even if I do 1 and/or 2 it'll still be blocked from ever going > > to mainline I think the choice is clear. > > > > Nick since you're personally the gatekeeper for this code, would you like > > to make a call? Just say 3 and put me out of my misery please. > > I'm not the gatekeeper and it is completely up to you whether you want > to work on something or not... 
but I'm sure you understand where I was > coming from when I suggested it doesn't get merged yet. No matter how you spin it, you're the gatekeeper. > You may not believe this, but I agree that swap prefetching (and > prefetching in general) has some potential to help desktop workloads :). > But it still should go through the normal process of being tested and > questioned and having a look at options for first improving existing > code in those problematic cases. Not this again? Proof was there ages ago that it helped and no proof that it harmed could be found yet you cunningly pretend it never existed. It's been done to death and I'm sick of this. > Once that process happens and it is shown to work nicely, etc., then I > would not be able to (or want to) keep it from getting merged. > > As far as cpusets goes... if your code goes in last, then you have to > make it work with what is there, as a rule. People are using cpusets > for memory resource control, which would have uses on a desktop system. > It is just a really bad precedent to set, having different parts of the > VM not work correctly together. Even if you made them mutually > exclusive CONFIG_ options, that is still not a very nice solution. That's as close to a 3 as I'm likely to get out of you. Andrew you'll be relieved to know I would like you to throw swap prefetch and related patches into the bin. Thanks. -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 1:34 ` Con Kolivas @ 2007-05-10 1:56 ` Nick Piggin 2007-05-10 3:48 ` Ray Lee 0 siblings, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-10 1:56 UTC (permalink / raw) To: Con Kolivas; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm Con Kolivas wrote: > On Thursday 10 May 2007 10:05, Nick Piggin wrote: >>I'm not the gatekeeper and it is completely up to you whether you want >>to work on something or not... but I'm sure you understand where I was >>coming from when I suggested it doesn't get merged yet. > > > No matter how you spin it, you're the gatekeeper. If raising unaddressed issues means closing a gate, then OK. You can equally open it by answering them. >>You may not believe this, but I agree that swap prefetching (and >>prefetching in general) has some potential to help desktop workloads :). >>But it still should go through the normal process of being tested and >>questioned and having a look at options for first improving existing >>code in those problematic cases. > > > Not this again? Proof was there ages ago that it helped and no proof that it > harmed could be found yet you cunningly pretend it never existed. It's been > done to death and I'm sick of this. I said I know it can help. Do you know how many patches I have that help some workloads but are not merged? That's just the way it works. What I have seen is it helps the case where you force out a huge amount of swap. OK, that's nice -- the base case obviously works. You said it helped with the updatedb problem. That says we should look at why it is going bad first, and for example improve use-once algorithms. After we do that, then swap prefetching might still help, which is fine. >>Once that process happens and it is shown to work nicely, etc., then I >>would not be able to (or want to) keep it from getting merged. >> >>As far as cpusets goes... 
if your code goes in last, then you have to >>make it work with what is there, as a rule. People are using cpusets >>for memory resource control, which would have uses on a desktop system. >>It is just a really bad precedent to set, having different parts of the >>VM not work correctly together. Even if you made them mutually >>exclusive CONFIG_ options, that is still not a very nice solution. > > > That's as close to a 3 as I'm likely to get out of you. If you're not willing to try making it work with existing code, among other things, then yes it will be difficult to get it merged. That's not going to change. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 1:56 ` Nick Piggin @ 2007-05-10 3:48 ` Ray Lee 2007-05-10 3:56 ` Nick Piggin 2007-05-10 3:58 ` swap-prefetch: 2.6.22 -mm merge plans Con Kolivas 0 siblings, 2 replies; 233+ messages in thread From: Ray Lee @ 2007-05-10 3:48 UTC (permalink / raw) To: Nick Piggin Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > You said it helped with the updatedb problem. That says we should look at > why it is going bad first, and for example improve use-once algorithms. > After we do that, then swap prefetching might still help, which is fine. Nick, if you're volunteering to do that analysis, then great. If not, then you're just providing an airy hope with nothing to back up when or if that work would ever occur. Further, if you or someone else *does* do that work, then guess what, we still have the option to rip out the swap prefetching code after the hypothetical use-once improvements have been proven and merged. Which, by the way, I've watched people talk about since 2.4. That was, y'know, a *while* ago. So enough with the stop energy, okay? You're better than that. Con? He is right that the last feature to go in needs to work gracefully with what's there now. However, it's not unheard of for authors of other sections of code to help out with incompatibilities by answering politely phrased questions for guidance. Though the intersection of users between cpusets and desktop systems seems small indeed. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 3:48 ` Ray Lee @ 2007-05-10 3:56 ` Nick Piggin 2007-05-10 5:52 ` Ray Lee 2007-05-10 3:58 ` swap-prefetch: 2.6.22 -mm merge plans Con Kolivas 1 sibling, 1 reply; 233+ messages in thread From: Nick Piggin @ 2007-05-10 3:56 UTC (permalink / raw) To: Ray Lee Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm Ray Lee wrote: > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> You said it helped with the updatedb problem. That says we should look at >> why it is going bad first, and for example improve use-once algorithms. >> After we do that, then swap prefetching might still help, which is fine. > > > Nick, if you're volunteering to do that analysis, then great. If not, > then you're just providing a airy hope with nothing to back up when or > if that work would ever occur. I'd like to try helping. Tell me your problem. > Further, if you or someone else *does* do that work, then guess what, > we still have the option to rip out the swap prefetching code after > the hypothetical use-once improvements have been proven and merged. > Which, by the way, I've watched people talk about since 2.4. That was, > y'know, a *while* ago. What's wrong with the use-once we have? What improvements are you talking about? > So enough with the stop energy, okay? You're better than that. I don't think it is about energy or being mean, I'm just stating the issues I have with it. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 3:56 ` Nick Piggin @ 2007-05-10 5:52 ` Ray Lee 2007-05-10 7:04 ` Nick Piggin 0 siblings, 1 reply; 233+ messages in thread From: Ray Lee @ 2007-05-10 5:52 UTC (permalink / raw) To: Nick Piggin Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Ray Lee wrote: > > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > >> You said it helped with the updatedb problem. That says we should look at > >> why it is going bad first, and for example improve use-once algorithms. > >> After we do that, then swap prefetching might still help, which is fine. > > > > Nick, if you're volunteering to do that analysis, then great. If not, > > then you're just providing a airy hope with nothing to back up when or > > if that work would ever occur. > > I'd like to try helping. Tell me your problem. Huh? You already stated one version of it above, namely updatedb. But let's put this another way, shall we? A gedankenexperiment, if you will. Say we have a perfect swap-out algorithm that can choose exactly what needs to be evicted to disk. ('Perfect', of course, is dependent upon one's metric, but let's go with "maximizes overall system utilization and minimizes IO wait time." Arbitrary, but hey.) So, great, the right things got swapped out. Anything else that could have been chosen would have caused more overall IO Wait. Yay us. So what happens when those processes that triggered the swap-outs go away? (Firefox is closed, I stop hitting my local copy of a database, whatever.) Well, currently, nothing. What happens when I switch workspaces and try to use my email program? Swap-ins. Okay, so why didn't the system swap that stuff in preemptively? Why am I sitting there waiting for something that it could have already done in the background? 
A new swap-out algorithm, be it use-once, Clock-Pro, or perfect foreknowledge isn't going to change that issue. Swap prefetch does. > > Further, if you or someone else *does* do that work, then guess what, > > we still have the option to rip out the swap prefetching code after > > the hypothetical use-once improvements have been proven and merged. > > Which, by the way, I've watched people talk about since 2.4. That was, > > y'know, a *while* ago. > > What's wrong with the use-once we have? What improvements are you talking > about? You said, effectively: "Use-once could be improved to deal with updatedb". I said I've been reading emails from Rik and others talking about that for four years now, and we're still talking about it. Were it merely updatedb, I'd say us userspace folk should step up and rewrite the damn thing to amortize its work. However, I and others feel it's only an example -- glaring, obviously -- of a more pervasive issue. A small issue, to be sure!, but an issue nevertheless. In general, I/others are talking about improving the desktop experience of running too much on a RAM limited machine. (Which, in my case, is with a gig and a 2.2GHz processor.) Or restated: the desktop experience occasionally sucks for me, and I don't think I'm alone. There may be a heuristic, completely isolated from userspace (and so isn't an API the kernel has to support! -- if it doesn't work, we can rip it out again), that may mitigate the suckiness. Let's try it. > > So enough with the stop energy, okay? You're better than that. > > I don't think it is about energy or being mean, I'm just stating the > issues I have with it. Nick, I in no way think you're being mean, and I'm sorry if I've given you that impression. However, if you're just stating the issues you have with it, then can I assume that you won't lobby against having this experiment merged? ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 5:52 ` Ray Lee @ 2007-05-10 7:04 ` Nick Piggin 2007-05-10 7:20 ` William Lee Irwin III ` (2 more replies) 0 siblings, 3 replies; 233+ messages in thread From: Nick Piggin @ 2007-05-10 7:04 UTC (permalink / raw) To: Ray Lee Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm Ray Lee wrote: > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >> Ray Lee wrote: >> > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: >> > >> >> You said it helped with the updatedb problem. That says we should >> look at >> >> why it is going bad first, and for example improve use-once >> algorithms. >> >> After we do that, then swap prefetching might still help, which is >> fine. >> > >> > Nick, if you're volunteering to do that analysis, then great. If not, >> > then you're just providing a airy hope with nothing to back up when or >> > if that work would ever occur. >> >> I'd like to try helping. Tell me your problem. > > > Huh? You already stated one version of it above, namely updatedb. But So a swapping problem with updatedb should be unusual and we'd like to see if we can fix it without resorting to prefetching. I know the theory behind swap prefetching, and I'm not saying it doesn't work, so I'll snip the rest of that. >> What's wrong with the use-once we have? What improvements are you talking >> about? > > > You said, effectively: "Use-once could be improved to deal with > updatedb". I said I've been reading emails from Rik and others talking > about that for four years now, and we're still talking about it. Were > it merely updatedb, I'd say us userspace folk should step up and > rewrite the damn thing to amortize its work. However, I and others > feel it's only an example -- glaring, obviously -- of a more pervasive > issue. A small issue, to be sure!, but an issue nevertheless. It isn't going to get fixed unless people complain about it. 
If you cover the use-once problem with swap prefetching, then it will never get fixed. >> I don't think it is about energy or being mean, I'm just stating the >> issues I have with it. > > > Nick, I in no way think you're being mean, and I'm sorry if I've given > you that impression. However, if you're just stating the issues you > have with it, then can I assume that you won't lobby against having > this experiment merged? Anybody is free to merge anything into their kernel. And if somebody asks for my issues with the swap prefetching patch, then I'll give them :) -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 7:04 ` Nick Piggin @ 2007-05-10 7:20 ` William Lee Irwin III 2007-05-10 12:34 ` Ray Lee 2007-05-12 4:46 ` [PATCH] mm: swap prefetch improvements Con Kolivas 2 siblings, 0 replies; 233+ messages in thread From: William Lee Irwin III @ 2007-05-10 7:20 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm Ray Lee wrote: >> Huh? You already stated one version of it above, namely updatedb. But On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote: > So a swapping problem with updatedb should be unusual and we'd like to see > if we can fix it without resorting to prefetching. > I know the theory behind swap prefetching, and I'm not saying it doesn't > work, so I'll snip the rest of that. I've not run updatedb in years, so I have no idea what it does to a modern kernel. It used to be an unholy terror of slab fragmentation and displacing user memory. The case of streaming kernel metadata IO is probably not quite as easy as streaming file IO. Ray Lee wrote: >> You said, effectively: "Use-once could be improved to deal with >> updatedb". I said I've been reading emails from Rik and others talking >> about that for four years now, and we're still talking about it. Were >> it merely updatedb, I'd say us userspace folk should step up and >> rewrite the damn thing to amortize its work. However, I and others >> feel it's only an example -- glaring, obviously -- of a more pervasive >> issue. A small issue, to be sure!, but an issue nevertheless. On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote: > It isn't going to get fixed unless people complain about it. If you > cover the use-once problem with swap prefetching, then it will never > get fixed. The policy people need to clean this up once and for all at some point. clameter's targeted reclaim bits for slub look like a plausible tactic, but are by no means comprehensive. 
Things need to attempt to eat their own tails before eating everyone else alive. Maybe we need to take hits on things such as badari's dd's to resolve the pathologies. -- wli ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 7:04 ` Nick Piggin 2007-05-10 7:20 ` William Lee Irwin III @ 2007-05-10 12:34 ` Ray Lee 2007-05-12 4:46 ` [PATCH] mm: swap prefetch improvements Con Kolivas 2 siblings, 0 replies; 233+ messages in thread From: Ray Lee @ 2007-05-10 12:34 UTC (permalink / raw) To: Nick Piggin Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm On 5/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > Huh? You already stated one version of it above, namely updatedb. But > > So a swapping problem with updatedb should be unusual and we'd like to see > if we can fix it without resorting to prefetching. > > I know the theory behind swap prefetching, and I'm not saying it doesn't > work, so I'll snip the rest of that. updatedb is only part of the problem. The other part is that the kernel has an opportunity to preemptively return some of the evicted working set to RAM before I ask for it. No fancy use-once algorithm is going to address that, so your solution is provably incomplete for my problem. What's so hard to understand about that? ^ permalink raw reply [flat|nested] 233+ messages in thread
* [PATCH] mm: swap prefetch improvements 2007-05-10 7:04 ` Nick Piggin 2007-05-10 7:20 ` William Lee Irwin III 2007-05-10 12:34 ` Ray Lee @ 2007-05-12 4:46 ` Con Kolivas 2007-05-12 5:03 ` Paul Jackson 2007-05-21 10:03 ` [PATCH] " Ingo Molnar 2 siblings, 2 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-12 4:46 UTC (permalink / raw) To: Nick Piggin Cc: Ray Lee, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm It turns out that fixing swap prefetch was not that hard to fix and improve upon, and since Andrew hasn't dropped swap prefetch, instead here are a swag of fixes and improvements, including making it depend on !CPUSETS as Nick requested. These changes lead to dramatic improvements. Eg on a machine with 2GB ram and only 500MB swap: Prefetch disabled: ./sp_tester Ram 2060352000 Swap 522072000 Total ram to be malloced: 2321388000 bytes Starting first malloc of 1160694000 bytes Starting 1st read of first malloc Touching this much ram takes 529 milliseconds Starting second malloc of 1160694000 bytes Completed second malloc and free Sleeping for 300 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 6030 milliseconds Prefetch enabled: /sp_tester Ram 2060352000 Swap 522072000 Total ram to be malloced: 2321388000 bytes Starting first malloc of 1160694000 bytes Starting 1st read of first malloc Touching this much ram takes 528 milliseconds Starting second malloc of 1160694000 bytes Completed second malloc and free Sleeping for 300 seconds Important part - starting reread of first malloc Completed read of first malloc Timed portion 665 milliseconds Note that simply touching the ram took 528 ms so the time taken for the 230MB converted from major faults to minor faults took only 137ms instead of 5.5s. --- Numerous improvements to swap prefetch. It was possible for kprefetchd to go to sleep indefinitely before/after changing the /proc value of swap prefetch. Fix that. 
The cost of remove_from_swapped_list() can be removed from every page swapin by moving it to be done entirely by kprefetchd lazily. The call site for add_to_swapped_list need only be at one place. Wakeups can occur much less frequently if swap prefetch is disabled. Make it possible to enable swap prefetch explicitly via /proc when laptop_mode is enabled by changing the value of the sysctl to 2. The complicated iteration over every entry can be consolidated by using list_for_each_safe. Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS. Fix potential irq problem by converting read_lock_irq to irqsave etc. Code style fixes. Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure that bio requests are not starved if other I/O begins during prefetching. Signed-off-by: Con Kolivas <kernel@kolivas.org> --- Documentation/sysctl/vm.txt | 4 - init/Kconfig | 2 mm/page_io.c | 2 mm/swap_prefetch.c | 158 +++++++++++++++++++------------------------- mm/swap_state.c | 2 mm/vmscan.c | 1 6 files changed, 75 insertions(+), 94 deletions(-) Index: linux-2.6.21-mm1/mm/page_io.c =================================================================== --- linux-2.6.21-mm1.orig/mm/page_io.c 2007-02-05 22:52:04.000000000 +1100 +++ linux-2.6.21-mm1/mm/page_io.c 2007-05-12 14:30:52.000000000 +1000 @@ -17,6 +17,7 @@ #include <linux/bio.h> #include <linux/swapops.h> #include <linux/writeback.h> +#include <linux/swap-prefetch.h> #include <asm/pgtable.h> static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index, @@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st ret = -ENOMEM; goto out; } + add_to_swapped_list(page); if (wbc->sync_mode == WB_SYNC_ALL) rw |= (1 << BIO_RW_SYNC); count_vm_event(PSWPOUT); Index: linux-2.6.21-mm1/mm/swap_state.c =================================================================== --- linux-2.6.21-mm1.orig/mm/swap_state.c 2007-05-07 21:53:51.000000000 +1000 +++ linux-2.6.21-mm1/mm/swap_state.c 2007-05-12 
14:30:52.000000000 +1000 @@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa error = radix_tree_insert(&swapper_space.page_tree, entry.val, page); if (!error) { - remove_from_swapped_list(entry.val); page_cache_get(page); SetPageLocked(page); SetPageSwapCache(page); @@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page, int error; if (!swap_duplicate(entry)) { - remove_from_swapped_list(entry.val); INC_CACHE_INFO(noent_race); return -ENOENT; } Index: linux-2.6.21-mm1/mm/vmscan.c =================================================================== --- linux-2.6.21-mm1.orig/mm/vmscan.c 2007-05-07 21:53:51.000000000 +1000 +++ linux-2.6.21-mm1/mm/vmscan.c 2007-05-12 14:30:52.000000000 +1000 @@ -410,7 +410,6 @@ int remove_mapping(struct address_space if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; - add_to_swapped_list(page); __delete_from_swap_cache(page); write_unlock_irq(&mapping->tree_lock); swap_free(swap); Index: linux-2.6.21-mm1/mm/swap_prefetch.c =================================================================== --- linux-2.6.21-mm1.orig/mm/swap_prefetch.c 2007-05-07 21:53:51.000000000 +1000 +++ linux-2.6.21-mm1/mm/swap_prefetch.c 2007-05-12 14:30:52.000000000 +1000 @@ -27,7 +27,8 @@ * needs to be at least this duration of idle time meaning in practice it can * be much longer */ -#define PREFETCH_DELAY (HZ * 5) +#define PREFETCH_DELAY (HZ * 5) +#define DISABLED_PREFETCH_DELAY (HZ * 60) /* sysctl - enable/disable swap prefetching */ int swap_prefetch __read_mostly = 1; @@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void) } /* + * If laptop_mode is enabled don't prefetch to avoid hard drives + * doing unnecessary spin-ups unless swap_prefetch is explicitly + * set to a higher value. + */ +static inline int prefetch_enabled(void) +{ + if (swap_prefetch <= laptop_mode) + return 0; + return 1; +} + +static int wakeup_kprefetchd; + +/* * Drop behind accounting which keeps a list of the most recently used swap - * entries. 
+ * entries. Entries are removed lazily by kprefetchd. */ void add_to_swapped_list(struct page *page) { struct swapped_entry *entry; unsigned long index, flags; - int wakeup; - - if (!swap_prefetch) - return; - wakeup = 0; + if (!prefetch_enabled()) + goto out; spin_lock_irqsave(&swapped.lock, flags); if (swapped.count >= swapped.maxcount) { @@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa store_swap_entry_node(entry, page); if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) { - /* - * If this is the first entry, kprefetchd needs to be - * (re)started. - */ - if (!swapped.count) - wakeup = 1; list_add(&entry->swapped_list, &swapped.list); swapped.count++; } out_locked: spin_unlock_irqrestore(&swapped.lock, flags); - - /* Do the wakeup outside the lock to shorten lock hold time. */ - if (wakeup) +out: + if (wakeup_kprefetchd) wake_up_process(kprefetchd_task); - return; } @@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi spin_lock_irqsave(&swapped.lock, flags); entry = radix_tree_delete(&swapped.swap_tree, index); if (likely(entry)) { - list_del_init(&entry->swapped_list); + list_del(&entry->swapped_list); swapped.count--; kmem_cache_free(swapped.cache, entry); } @@ -153,18 +157,18 @@ enum trickle_return { }; struct node_stats { - unsigned long last_free; /* Free ram after a cycle of prefetching */ - unsigned long current_free; + unsigned long last_free; /* Free ram on this cycle of checking prefetch_suitable */ - unsigned long prefetch_watermark; + unsigned long current_free; /* Maximum amount we will prefetch to */ - unsigned long highfree[MAX_NR_ZONES]; + unsigned long prefetch_watermark; /* The amount of free ram before we start prefetching */ - unsigned long lowfree[MAX_NR_ZONES]; + unsigned long highfree[MAX_NR_ZONES]; /* The amount of free ram where we will stop prefetching */ - unsigned long *pointfree[MAX_NR_ZONES]; + unsigned long lowfree[MAX_NR_ZONES]; /* highfree or lowfree depending on whether we've hit a watermark 
*/ + unsigned long *pointfree[MAX_NR_ZONES]; }; /* @@ -172,10 +176,10 @@ struct node_stats { * determine if a node is suitable for prefetching into. */ struct prefetch_stats { - nodemask_t prefetch_nodes; /* Which nodes are currently suited to prefetching */ - unsigned long prefetched_pages; + nodemask_t prefetch_nodes; /* Total pages we've prefetched on this wakeup of kprefetchd */ + unsigned long prefetched_pages; struct node_stats node[MAX_NUMNODES]; }; @@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_ const int node) { enum trickle_return ret = TRICKLE_FAILED; + unsigned long flags; struct page *page; - read_lock_irq(&swapper_space.tree_lock); + read_lock_irqsave(&swapper_space.tree_lock, flags); /* Entry may already exist */ page = radix_tree_lookup(&swapper_space.page_tree, entry.val); - read_unlock_irq(&swapper_space.tree_lock); - if (page) { - remove_from_swapped_list(entry.val); + read_unlock_irqrestore(&swapper_space.tree_lock, flags); + if (page) goto out; - } /* * Get a new page to read from swap. We have already checked the @@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_ /* Add them to the tail of the inactive list to preserve LRU order */ lru_cache_add_tail(page); - if (unlikely(swap_readpage(NULL, page))) { - ret = TRICKLE_DELAY; + if (unlikely(swap_readpage(NULL, page))) goto out_release; - } sp_stat.prefetched_pages++; sp_stat.node[node].last_free--; @@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_ out_release: page_cache_release(page); out: + /* + * All entries are removed here lazily. This avoids the cost of + * remove_from_swapped_list during normal swapin. Thus there are + * usually many stale entries. + */ + remove_from_swapped_list(entry.val); return ret; } @@ -414,17 +421,6 @@ out: } /* - * Get previous swapped entry when iterating over all entries. swapped.lock - * should be held and we should already ensure that entry exists. 
- */ -static inline struct swapped_entry *prev_swapped_entry - (struct swapped_entry *entry) -{ - return list_entry(entry->swapped_list.prev->prev, - struct swapped_entry, swapped_list); -} - -/* * trickle_swap is the main function that initiates the swap prefetching. It * first checks to see if the busy flag is set, and does not prefetch if it * is, as the flag implied we are low on memory or swapping in currently. @@ -435,70 +431,49 @@ static inline struct swapped_entry *prev static enum trickle_return trickle_swap(void) { enum trickle_return ret = TRICKLE_DELAY; - struct swapped_entry *entry; + struct list_head *p, *next; unsigned long flags; - /* - * If laptop_mode is enabled don't prefetch to avoid hard drives - * doing unnecessary spin-ups - */ - if (!swap_prefetch || laptop_mode) + if (!prefetch_enabled()) return ret; examine_free_limits(); - entry = NULL; + if (!prefetch_suitable()) + return ret; + if (list_empty(&swapped.list)) + return TRICKLE_FAILED; - for ( ; ; ) { + spin_lock_irqsave(&swapped.lock, flags); + list_for_each_safe(p, next, &swapped.list) { + struct swapped_entry *entry; swp_entry_t swp_entry; int node; + spin_unlock_irqrestore(&swapped.lock, flags); + might_sleep(); if (!prefetch_suitable()) - break; + goto out_unlocked; spin_lock_irqsave(&swapped.lock, flags); - if (list_empty(&swapped.list)) { - ret = TRICKLE_FAILED; - spin_unlock_irqrestore(&swapped.lock, flags); - break; - } - - if (!entry) { - /* - * This sets the entry for the first iteration. It - * also is a safeguard against the entry disappearing - * while the lock is not held. - */ - entry = list_entry(swapped.list.prev, - struct swapped_entry, swapped_list); - } else if (entry->swapped_list.prev == swapped.list.next) { - /* - * If we have iterated over all entries and there are - * still entries that weren't swapped out there may - * be a reason we could not swap them back in so - * delay attempting further prefetching. 
- */ - spin_unlock_irqrestore(&swapped.lock, flags); - break; - } - + entry = list_entry(p, struct swapped_entry, swapped_list); node = get_swap_entry_node(entry); if (!node_isset(node, sp_stat.prefetch_nodes)) { /* * We found an entry that belongs to a node that is * not suitable for prefetching so skip it. */ - entry = prev_swapped_entry(entry); - spin_unlock_irqrestore(&swapped.lock, flags); continue; } swp_entry = entry->swp_entry; - entry = prev_swapped_entry(entry); spin_unlock_irqrestore(&swapped.lock, flags); if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY) - break; + goto out_unlocked; + spin_lock_irqsave(&swapped.lock, flags); } + spin_unlock_irqrestore(&swapped.lock, flags); +out_unlocked: if (sp_stat.prefetched_pages) { lru_add_drain(); sp_stat.prefetched_pages = 0; @@ -513,13 +488,14 @@ static int kprefetchd(void *__unused) sched_setscheduler(current, SCHED_BATCH, &param); set_user_nice(current, 19); /* Set ioprio to lowest if supported by i/o scheduler */ - sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE); + sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE); /* kprefetchd has nothing to do until it is woken up the first time */ + wakeup_kprefetchd = 1; set_current_state(TASK_INTERRUPTIBLE); schedule(); - do { + while (!kthread_should_stop()) { try_to_freeze(); /* @@ -527,13 +503,17 @@ static int kprefetchd(void *__unused) * a wakeup, and further delay the next one. 
*/ if (trickle_swap() == TRICKLE_FAILED) { + wakeup_kprefetchd = 1; set_current_state(TASK_INTERRUPTIBLE); schedule(); - } + } else + wakeup_kprefetchd = 0; clear_last_prefetch_free(); - schedule_timeout_interruptible(PREFETCH_DELAY); - } while (!kthread_should_stop()); - + if (!prefetch_enabled()) + schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY); + else + schedule_timeout_interruptible(PREFETCH_DELAY); + } return 0; } Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt =================================================================== --- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt 2007-05-07 21:53:00.000000000 +1000 +++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt 2007-05-12 14:31:26.000000000 +1000 @@ -229,7 +229,9 @@ swap_prefetch This enables or disables the swap prefetching feature. When the virtual memory subsystem has been extremely idle for at least 5 seconds it will start copying back pages from swap into the swapcache and keep a copy in swap. In -practice it can take many minutes before the vm is idle enough. +practice it can take many minutes before the vm is idle enough. A value of 0 +disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2 +enables it even in the presence of laptop_mode. The default value is 1. Index: linux-2.6.21-mm1/init/Kconfig =================================================================== --- linux-2.6.21-mm1.orig/init/Kconfig 2007-05-07 21:53:51.000000000 +1000 +++ linux-2.6.21-mm1/init/Kconfig 2007-05-12 14:30:52.000000000 +1000 @@ -107,7 +107,7 @@ config SWAP config SWAP_PREFETCH bool "Support for prefetching swapped memory" - depends on SWAP + depends on SWAP && !CPUSETS default y ---help--- This option will allow the kernel to prefetch swapped memory pages -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
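For readers following the patch description above, the lazy-removal idea (entries are added once at swap-out, ignored on ordinary swap-in, and dropped only when the prefetch thread walks the list) can be sketched in plain userspace C. This is an illustration under stated assumptions, not the kernel code: `struct entry`, `struct entry_list`, `list_add_entry()` and `prefetch_walk()` are hypothetical names, the `present[]` array stands in for the swap cache lookup, and all locking is omitted. The walk saves the next pointer before freeing the current entry, which is the same idea `list_for_each_safe` relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct entry {
	struct entry *prev, *next;
	unsigned long index;	/* stand-in for the swap entry value */
};

struct entry_list {
	struct entry head;	/* sentinel: head.next is the newest entry */
	size_t count;
};

static void list_init(struct entry_list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->count = 0;
}

/* Swap-out path: the single place where entries are added. */
static int list_add_entry(struct entry_list *l, unsigned long index)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->index = index;
	e->next = l->head.next;
	e->prev = &l->head;
	l->head.next->prev = e;
	l->head.next = e;
	l->count++;
	return 0;
}

/*
 * Prefetch walk: every visited entry is unlinked and freed, whether it was
 * prefetched now or turned out to be stale (already swapped in on demand).
 * present[] must have a slot for every index stored in the list.
 * Returns the number of entries removed; *prefetched counts real prefetches.
 */
static size_t prefetch_walk(struct entry_list *l, int *present,
			    size_t *prefetched)
{
	struct entry *p, *next;
	size_t removed = 0;

	/* Save 'next' before freeing 'p' so the walk survives the removal. */
	for (p = l->head.next, next = p->next; p != &l->head;
	     p = next, next = p->next) {
		if (!present[p->index]) {
			present[p->index] = 1;	/* pretend to read it back in */
			(*prefetched)++;
		}
		p->prev->next = p->next;
		p->next->prev = p->prev;
		free(p);
		l->count--;
		removed++;
	}
	return removed;
}
```

In this model a demand swap-in only sets `present[index]` and never touches the list, so its cost is moved entirely to the background walk; the price is that the list usually carries some stale entries, as the comment in the patch notes.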
* Re: [PATCH] mm: swap prefetch improvements 2007-05-12 4:46 ` [PATCH] mm: swap prefetch improvements Con Kolivas @ 2007-05-12 5:03 ` Paul Jackson 2007-05-12 5:15 ` Con Kolivas 2007-05-21 10:03 ` [PATCH] " Ingo Molnar 1 sibling, 1 reply; 233+ messages in thread From: Paul Jackson @ 2007-05-12 5:03 UTC (permalink / raw) To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm > Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS. Ok. Could you explain what it means to say "swap prefetch is not cpuset aware", or could you give a rough idea of what it would take to make it cpuset aware? I wouldn't go so far as to say that no one would ever want to prefetch and use cpusets at the same time, but I will grant that it's not a sufficiently important need that it should block a useful prefetch implementation on non-cpuset systems. One case that would be useful, however, is to handle prefetch in the case that cpusets are configured into ones kernel, but one is not making any real use of them ('number_of_cpusets' <= 1). That will actually be the most common case for the major distribution(s) that enable cpusets by default in their builds, for most arch's including the arch's popular on desktops. So what would it take to allow CONFIG'ing both prefetch and cpusets on, but having prefetch dynamically adapt to the presence of active cpuset usage, perhaps by basically shutting down if it can't easily do any better? I could certainly entertain requests to callout to some prefetch routine from the cpuset code, at the critical points that cpusets transitioned in or out of active use. Semi-separate issue -- is it just cpusets that aren't prefetch friendly, or is it also mm/mempolicy (mbind, set_mempolicy) as well? 
For that matter, even if neither mm/mempolicy nor cpusets are used, on systems with multiple memory nodes (not all memory equally distant from all CPUs, aka NUMA), could prefetch cause some sort of shuffling of memory placement, which might harm the performance of an HPC (High Performance Computing) application with carefully tuned memory placement. Granted, this -is- getting to be a corner case. Most HPC apps running on NUMA hardware are making at least some use of mm/mempolicy or cpusets. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.925.600.0401 ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements 2007-05-12 5:03 ` Paul Jackson @ 2007-05-12 5:15 ` Con Kolivas 2007-05-12 5:51 ` Paul Jackson 0 siblings, 1 reply; 233+ messages in thread From: Con Kolivas @ 2007-05-12 5:15 UTC (permalink / raw) To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm On Saturday 12 May 2007 15:03, Paul Jackson wrote: > > Swap prefetch is not cpuset aware so make the config option depend on > > !CPUSETS. > > Ok. > > Could you explain what it means to say "swap prefetch is not cpuset aware", > or could you give a rough idea of what it would take to make it cpuset > aware? Hmm I'm not really sure what it takes to make it cpuset aware; it was Nick that pointed out that it was not, so I'm not sure and still going off your original recommendation that there was no need to make it cpuset aware but at least honour node placement (see below). > I wouldn't go so far as to say that no one would ever want to prefetch and > use cpusets at the same time, but I will grant that it's not a sufficiently > important need that it should block a useful prefetch implementation on > non-cpuset systems. Thank you for agreeing on me there :) > One case that would be useful, however, is to handle prefetch in the case > that cpusets are configured into ones kernel, but one is not making any > real use of them ('number_of_cpusets' <= 1). That will actually be the > most common case for the major distribution(s) that enable cpusets by > default in their builds, for most arch's including the arch's popular > on desktops. > > So what would it take to allow CONFIG'ing both prefetch and cpusets on, > but having prefetch dynamically adapt to the presence of active cpuset > usage, perhaps by basically shutting down if it can't easily do any > better? I could certainly entertain requests to callout to some > prefetch routine from the cpuset code, at the critical points that > cpusets transitioned in or out of active use. 
It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1 in the prefetch_enabled() function. Would you like that? > Semi-separate issue -- is it just cpusets that aren't prefetch friendly, > or is it also mm/mempolicy (mbind, set_mempolicy) as well? > > For that matter, even if neither mm/mempolicy nor cpusets are used, on > systems with multiple memory nodes (not all memory equally distant from > all CPUs, aka NUMA), could prefetch cause some sort of shuffling of > memory placement, which might harm the performance of an HPC (High > Performance Computing) application with carefully tuned memory > placement. Granted, this -is- getting to be a corner case. Most HPC > apps running on NUMA hardware are making at least some use of > mm/mempolicy or cpusets. It is numa aware to some degree. It stores the node id and when it starts prefetching it only prefetches to nodes that are suitable for prefetching to (based on a number of arbitrary freeness arguments I invented). It uses the original node id it came from by allocating a page via: alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0); where "node" is the original node the swapped page came from. Thanks for comments. -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
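The check discussed above might look roughly like the following. This is only a sketch of the idea, not code from the posted patch: `prefetch_allowed()` is a hypothetical pure function, and the three values the kernel would read from globals (`swap_prefetch`, `laptop_mode`, `number_of_cpusets`) are passed as parameters so the logic can be exercised in userspace. It also assumes `laptop_mode` is 0 or 1; if the sysctl holds a larger value, the `<=` comparison would treat a `swap_prefetch` setting of 2 as disabled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the enable check.
 *   swap_prefetch:     0 = off, 1 = on unless laptop_mode, 2 = on regardless
 *   laptop_mode:       assumed here to be 0 or 1
 *   number_of_cpusets: values above 1 are taken to mean cpusets are in
 *                      active use
 */
static bool prefetch_allowed(int swap_prefetch, int laptop_mode,
			     int number_of_cpusets)
{
	/* Same comparison as prefetch_enabled() in the patch. */
	if (swap_prefetch <= laptop_mode)
		return false;
	/* Proposed addition: back off once cpusets are actually in use. */
	if (number_of_cpusets > 1)
		return false;
	return true;
}
```

With a check of this shape the Kconfig dependency on !CPUSETS would no longer be needed for correctness, since prefetching would simply stay idle on systems that create a second cpuset; whether that is sufficient is the question still open in this thread.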
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 5:15 ` Con Kolivas
@ 2007-05-12 5:51 ` Paul Jackson
2007-05-12 7:28 ` Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12 5:51 UTC (permalink / raw)
To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

Con wrote:
> Hmm I'm not really sure what it takes to make it cpuset aware;
> ...
> It is numa aware to some degree. It stores the node id and when it starts
> prefetching it only prefetches to nodes that are suitable for prefetching to
> ...
> It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1
> in the prefetch_enabled() function. Would you like that?

Hmmm ... it seems that we are shadow boxing here ... trying to pick a solution
to solve a problem when we aren't even sure we have a problem, much less
what the problem is.

That does not usually lead to the right path.

Could you put some more effort into characterizing what problems
can arise if one has prefetch and cpusets active at the same time?

My first wild guess is that the only incompatibility would have been that
prefetch might mess up NUMA placement (get pages on wrong nodes), which
it seems you have tried to address in your current patches. So it would
not surprise me if there was no problem here.

We may just have to lean on Nick some more, if he is the only one who
understands what the problem is, to try again to explain it to us.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 5:51 ` Paul Jackson
@ 2007-05-12 7:28 ` Con Kolivas
2007-05-12 8:14 ` Paul Jackson
0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-12 7:28 UTC (permalink / raw)
To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 15:51, Paul Jackson wrote:
> Con wrote:
> > Hmm I'm not really sure what it takes to make it cpuset aware;
> > ...
> > It is numa aware to some degree. It stores the node id and when it starts
> > prefetching it only prefetches to nodes that are suitable for prefetching
> > to ...
> > It would be absolutely trivial to add a check for 'number_of_cpusets' <=
> > 1 in the prefetch_enabled() function. Would you like that?
>
> Hmmm ... it seems that we are shadow boxing here ... trying to pick a solution
> to solve a problem when we aren't even sure we have a problem, much less
> what the problem is.
>
> That does not usually lead to the right path.
>
> Could you put some more effort into characterizing what problems
> can arise if one has prefetch and cpusets active at the same time?
>
> My first wild guess is that the only incompatibility would have been that
> prefetch might mess up NUMA placement (get pages on wrong nodes), which
> it seems you have tried to address in your current patches. So it would
> not surprise me if there was no problem here.

Ummm this is what I've been saying for over a year now but no one has been
listening.

> We may just have to lean on Nick some more, if he is the only one who
> understands what the problem is, to try again to explain it to us.

--
-ck
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 7:28 ` Con Kolivas
@ 2007-05-12 8:14 ` Paul Jackson
2007-05-12 8:21 ` Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12 8:14 UTC (permalink / raw)
To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

> Ummm this is what I've been saying for over a year now but no one has been
> listening.

Well ... if there is a problem using prefetch and cpusets together,
it doesn't look like the two of us are going to find it.

I should probably look at your patch to answer this next question,
but being a lazy retard, I'll just ask. Is there a way, on a running
system that has your prefetch patch configured in, to disable prefetch
-- perhaps writing to some magic /proc file or something?

If so, then how about you just remove the lines in the patch that
disable prefetch on kernels configured with CPUSETS, and we charge
ahead allowing both at the same time?

If some day in the future I find something about prefetch that harms
the HPC NUMA loads I care about, then I can just dynamically disable
prefetch.

If someone ever uncovers a real problem with prefetch and cpusets,
then we will deal with it then.

As to whether your patch is otherwise (other than cpusets) worthy
of further acceptance, that I will have to leave up to those who are
competent to make such judgements.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 8:14 ` Paul Jackson
@ 2007-05-12 8:21 ` Con Kolivas
2007-05-12 8:37 ` Paul Jackson
0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-12 8:21 UTC (permalink / raw)
To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 18:14, Paul Jackson wrote:
> > Ummm this is what I've been saying for over a year now but no one has been
> > listening.
>
> Well ... if there is a problem using prefetch and cpusets together,
> it doesn't look like the two of us are going to find it.
>
> I should probably look at your patch to answer this next question,
> but being a lazy retard, I'll just ask. Is there a way, on a running
> system that has your prefetch patch configured in, to disable prefetch
> -- perhaps writing to some magic /proc file or something?

Indeed:
/proc/sys/vm/swap_prefetch

> If so, then how about you just remove the lines in the patch that
> disable prefetch on kernels configured with CPUSETS, and we charge
> ahead allowing both at the same time?

Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled?
Sure, I can do that.

> If some day in the future I find something about prefetch that harms
> the HPC NUMA loads I care about, then I can just dynamically disable
> prefetch.
>
> If someone ever uncovers a real problem with prefetch and cpusets,
> then we will deal with it then.
>
> As to whether your patch is otherwise (other than cpusets) worthy
> of further acceptance, that I will have to leave up to those who are
> competent to make such judgements.

Thank you very much for your comments!

--
-ck
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 8:21 ` Con Kolivas
@ 2007-05-12 8:37 ` Paul Jackson
2007-05-12 8:57 ` [PATCH respin] " Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12 8:37 UTC (permalink / raw)
To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

Con wrote:
> Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled?

I don't see why that special case for cpusets is needed.

I'm suggesting making no special cases for CPUSETS at all, until and
unless we find reason to.

In other words, I'm suggesting simply removing the patch lines:

-	depends on SWAP
+	depends on SWAP && !CPUSETS

I see no other mention of cpusets in your patch. That's fine by me.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* [PATCH respin] mm: swap prefetch improvements
2007-05-12 8:37 ` Paul Jackson
@ 2007-05-12 8:57 ` Con Kolivas
0 siblings, 0 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-12 8:57 UTC (permalink / raw)
To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 18:37, Paul Jackson wrote:
> Con wrote:
> > Ok so change the default value for swap_prefetch to 0 when CPUSETS is
> > enabled?
>
> I don't see why that special case for cpusets is needed.
>
> I'm suggesting making no special cases for CPUSETS at all, until and
> unless we find reason to.
>
> In other words, I'm suggesting simply removing the patch lines:
>
> -	depends on SWAP
> +	depends on SWAP && !CPUSETS
>
> I see no other mention of cpusets in your patch. That's fine by me.

Excellent, I prefer that as well. Thanks very much for your comments!

Here's a respin without that hunk.

---
Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when
laptop_mode is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 Documentation/sysctl/vm.txt |    4 -
 mm/page_io.c                |    2
 mm/swap_prefetch.c          |  158 +++++++++++++++++++-------------------------
 mm/swap_state.c             |    2
 mm/vmscan.c                 |    1
 5 files changed, 74 insertions(+), 93 deletions(-)

Index: linux-2.6.21-mm1/mm/page_io.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_io.c	2007-02-05 22:52:04.000000000 +1100
+++ linux-2.6.21-mm1/mm/page_io.c	2007-05-12 14:30:52.000000000 +1000
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
 		ret = -ENOMEM;
 		goto out;
 	}
+	add_to_swapped_list(page);
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
Index: linux-2.6.21-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_state.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_state.c	2007-05-12 14:30:52.000000000 +1000
@@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
-			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page,
 	int error;
 
 	if (!swap_duplicate(entry)) {
-		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
Index: linux-2.6.21-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/vmscan.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/vmscan.c	2007-05-12 14:30:52.000000000 +1000
@@ -410,7 +410,6 @@ int remove_mapping(struct address_space
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
Index: linux-2.6.21-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_prefetch.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_prefetch.c	2007-05-12 14:30:52.000000000 +1000
@@ -27,7 +27,8 @@
  * needs to be at least this duration of idle time meaning in practice it can
  * be much longer
  */
-#define PREFETCH_DELAY	(HZ * 5)
+#define PREFETCH_DELAY		(HZ * 5)
+#define DISABLED_PREFETCH_DELAY	(HZ * 60)
 
 /* sysctl - enable/disable swap prefetching */
 int swap_prefetch __read_mostly = 1;
@@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void)
 }
 
 /*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+	if (swap_prefetch <= laptop_mode)
+		return 0;
+	return 1;
+}
+
+static int wakeup_kprefetchd;
+
+/*
  * Drop behind accounting which keeps a list of the most recently used swap
- * entries.
+ * entries. Entries are removed lazily by kprefetchd.
  */
 void add_to_swapped_list(struct page *page)
 {
 	struct swapped_entry *entry;
 	unsigned long index, flags;
-	int wakeup;
-
-	if (!swap_prefetch)
-		return;
 
-	wakeup = 0;
+	if (!prefetch_enabled())
+		goto out;
 
 	spin_lock_irqsave(&swapped.lock, flags);
 	if (swapped.count >= swapped.maxcount) {
@@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa
 	store_swap_entry_node(entry, page);
 
 	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
-		/*
-		 * If this is the first entry, kprefetchd needs to be
-		 * (re)started.
-		 */
-		if (!swapped.count)
-			wakeup = 1;
 		list_add(&entry->swapped_list, &swapped.list);
 		swapped.count++;
 	}
 
 out_locked:
 	spin_unlock_irqrestore(&swapped.lock, flags);
-
-	/* Do the wakeup outside the lock to shorten lock hold time. */
-	if (wakeup)
+out:
+	if (wakeup_kprefetchd)
 		wake_up_process(kprefetchd_task);
-
 	return;
 }
 
@@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi
 	spin_lock_irqsave(&swapped.lock, flags);
 	entry = radix_tree_delete(&swapped.swap_tree, index);
 	if (likely(entry)) {
-		list_del_init(&entry->swapped_list);
+		list_del(&entry->swapped_list);
 		swapped.count--;
 		kmem_cache_free(swapped.cache, entry);
 	}
@@ -153,18 +157,18 @@ enum trickle_return {
 };
 
 struct node_stats {
-	unsigned long	last_free;
 	/* Free ram after a cycle of prefetching */
-	unsigned long	current_free;
+	unsigned long	last_free;
 	/* Free ram on this cycle of checking prefetch_suitable */
-	unsigned long	prefetch_watermark;
+	unsigned long	current_free;
 	/* Maximum amount we will prefetch to */
-	unsigned long	highfree[MAX_NR_ZONES];
+	unsigned long	prefetch_watermark;
 	/* The amount of free ram before we start prefetching */
-	unsigned long	lowfree[MAX_NR_ZONES];
+	unsigned long	highfree[MAX_NR_ZONES];
 	/* The amount of free ram where we will stop prefetching */
-	unsigned long	*pointfree[MAX_NR_ZONES];
+	unsigned long	lowfree[MAX_NR_ZONES];
 	/* highfree or lowfree depending on whether we've hit a watermark */
+	unsigned long	*pointfree[MAX_NR_ZONES];
 };
 
 /*
@@ -172,10 +176,10 @@ struct node_stats {
  * determine if a node is suitable for prefetching into.
  */
 struct prefetch_stats {
-	nodemask_t	prefetch_nodes;
 	/* Which nodes are currently suited to prefetching */
-	unsigned long	prefetched_pages;
+	nodemask_t	prefetch_nodes;
 	/* Total pages we've prefetched on this wakeup of kprefetchd */
+	unsigned long	prefetched_pages;
 	struct node_stats node[MAX_NUMNODES];
 };
 
@@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_
 	const int node)
 {
 	enum trickle_return ret = TRICKLE_FAILED;
+	unsigned long flags;
 	struct page *page;
 
-	read_lock_irq(&swapper_space.tree_lock);
+	read_lock_irqsave(&swapper_space.tree_lock, flags);
 	/* Entry may already exist */
 	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
-	read_unlock_irq(&swapper_space.tree_lock);
-	if (page) {
-		remove_from_swapped_list(entry.val);
+	read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+	if (page)
 		goto out;
-	}
 
 	/*
 	 * Get a new page to read from swap. We have already checked the
@@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_
 
 	/* Add them to the tail of the inactive list to preserve LRU order */
 	lru_cache_add_tail(page);
-	if (unlikely(swap_readpage(NULL, page))) {
-		ret = TRICKLE_DELAY;
+	if (unlikely(swap_readpage(NULL, page)))
 		goto out_release;
-	}
 
 	sp_stat.prefetched_pages++;
 	sp_stat.node[node].last_free--;
@@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_
 out_release:
 	page_cache_release(page);
 out:
+	/*
+	 * All entries are removed here lazily. This avoids the cost of
+	 * remove_from_swapped_list during normal swapin. Thus there are
+	 * usually many stale entries.
+	 */
+	remove_from_swapped_list(entry.val);
 	return ret;
 }
 
@@ -414,17 +421,6 @@ out:
 }
 
 /*
- * Get previous swapped entry when iterating over all entries. swapped.lock
- * should be held and we should already ensure that entry exists.
- */
-static inline struct swapped_entry *prev_swapped_entry
-	(struct swapped_entry *entry)
-{
-	return list_entry(entry->swapped_list.prev->prev,
-		struct swapped_entry, swapped_list);
-}
-
-/*
  * trickle_swap is the main function that initiates the swap prefetching. It
  * first checks to see if the busy flag is set, and does not prefetch if it
  * is, as the flag implied we are low on memory or swapping in currently.
@@ -435,70 +431,49 @@ static inline struct swapped_entry *prev
 static enum trickle_return trickle_swap(void)
 {
 	enum trickle_return ret = TRICKLE_DELAY;
-	struct swapped_entry *entry;
+	struct list_head *p, *next;
 	unsigned long flags;
 
-	/*
-	 * If laptop_mode is enabled don't prefetch to avoid hard drives
-	 * doing unnecessary spin-ups
-	 */
-	if (!swap_prefetch || laptop_mode)
+	if (!prefetch_enabled())
 		return ret;
 
 	examine_free_limits();
-	entry = NULL;
+	if (!prefetch_suitable())
+		return ret;
+	if (list_empty(&swapped.list))
+		return TRICKLE_FAILED;
 
-	for ( ; ; ) {
+	spin_lock_irqsave(&swapped.lock, flags);
+	list_for_each_safe(p, next, &swapped.list) {
+		struct swapped_entry *entry;
 		swp_entry_t swp_entry;
 		int node;
 
+		spin_unlock_irqrestore(&swapped.lock, flags);
+		might_sleep();
 		if (!prefetch_suitable())
-			break;
+			goto out_unlocked;
 
 		spin_lock_irqsave(&swapped.lock, flags);
-		if (list_empty(&swapped.list)) {
-			ret = TRICKLE_FAILED;
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
-		if (!entry) {
-			/*
-			 * This sets the entry for the first iteration. It
-			 * also is a safeguard against the entry disappearing
-			 * while the lock is not held.
-			 */
-			entry = list_entry(swapped.list.prev,
-				struct swapped_entry, swapped_list);
-		} else if (entry->swapped_list.prev == swapped.list.next) {
-			/*
-			 * If we have iterated over all entries and there are
-			 * still entries that weren't swapped out there may
-			 * be a reason we could not swap them back in so
-			 * delay attempting further prefetching.
-			 */
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
+		entry = list_entry(p, struct swapped_entry, swapped_list);
 		node = get_swap_entry_node(entry);
 		if (!node_isset(node, sp_stat.prefetch_nodes)) {
 			/*
 			 * We found an entry that belongs to a node that is
 			 * not suitable for prefetching so skip it.
 			 */
-			entry = prev_swapped_entry(entry);
-			spin_unlock_irqrestore(&swapped.lock, flags);
 			continue;
 		}
 		swp_entry = entry->swp_entry;
-		entry = prev_swapped_entry(entry);
 		spin_unlock_irqrestore(&swapped.lock, flags);
 
 		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
-			break;
+			goto out_unlocked;
+		spin_lock_irqsave(&swapped.lock, flags);
 	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
 
+out_unlocked:
 	if (sp_stat.prefetched_pages) {
 		lru_add_drain();
 		sp_stat.prefetched_pages = 0;
@@ -513,13 +488,14 @@ static int kprefetchd(void *__unused)
 	sched_setscheduler(current, SCHED_BATCH, &param);
 	set_user_nice(current, 19);
 	/* Set ioprio to lowest if supported by i/o scheduler */
-	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);
 
 	/* kprefetchd has nothing to do until it is woken up the first time */
+	wakeup_kprefetchd = 1;
 	set_current_state(TASK_INTERRUPTIBLE);
 	schedule();
 
-	do {
+	while (!kthread_should_stop()) {
 		try_to_freeze();
 
 		/*
@@ -527,13 +503,17 @@ static int kprefetchd(void *__unused)
 		 * a wakeup, and further delay the next one.
 		 */
 		if (trickle_swap() == TRICKLE_FAILED) {
+			wakeup_kprefetchd = 1;
 			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
-		}
+		} else
+			wakeup_kprefetchd = 0;
 		clear_last_prefetch_free();
-		schedule_timeout_interruptible(PREFETCH_DELAY);
-	} while (!kthread_should_stop());
-
+		if (!prefetch_enabled())
+			schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY);
+		else
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+	}
 	return 0;
 }
 
Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt	2007-05-07 21:53:00.000000000 +1000
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt	2007-05-12 14:31:26.000000000 +1000
@@ -229,7 +229,9 @@ swap_prefetch
 This enables or disables the swap prefetching feature. When the virtual
 memory subsystem has been extremely idle for at least 5 seconds it will start
 copying back pages from swap into the swapcache and keep a copy in swap. In
-practice it can take many minutes before the vm is idle enough.
+practice it can take many minutes before the vm is idle enough. A value of 0
+disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2
+enables it even in the presence of laptop_mode.
 
 The default value is 1.
 
--
-ck
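The sysctl semantics in the respin (0 disables, 1 enables unless laptop_mode is on, 2 enables regardless) follow from a single comparison, and the same test selects how long kprefetchd sleeps. A minimal userspace sketch, assuming laptop_mode is either 0 or 1: the two values are passed as parameters here instead of being read from kernel globals, `HZ` is given an arbitrary value, and `next_delay` is a hypothetical helper that is not part of the patch.

```c
#define HZ 100				/* assumption: tick rate for this sketch */
#define PREFETCH_DELAY		(HZ * 5)
#define DISABLED_PREFETCH_DELAY	(HZ * 60)

/* Same comparison as the patch's prefetch_enabled(), with explicit inputs. */
static int prefetch_enabled(int swap_prefetch, int laptop_mode)
{
	if (swap_prefetch <= laptop_mode)
		return 0;
	return 1;
}

/* Hypothetical helper: the sleep kprefetchd picks at the end of a pass. */
static long next_delay(int swap_prefetch, int laptop_mode)
{
	if (!prefetch_enabled(swap_prefetch, laptop_mode))
		return DISABLED_PREFETCH_DELAY;
	return PREFETCH_DELAY;
}
```

With these inputs, (1, 0) and (2, 1) enable prefetching while (0, 0) and (1, 1) do not. Because the test is a plain `<=`, a laptop_mode value above 1 would need a correspondingly larger swap_prefetch value; the thread does not discuss that case.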
* Re: [PATCH] mm: swap prefetch improvements
2007-05-12 4:46 ` [PATCH] mm: swap prefetch improvements Con Kolivas
2007-05-12 5:03 ` Paul Jackson
@ 2007-05-21 10:03 ` Ingo Molnar
2007-05-21 13:44 ` Con Kolivas
1 sibling, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-21 10:03 UTC (permalink / raw)
To: Con Kolivas
Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> It turns out that fixing swap prefetch was not that hard to fix and
> improve upon, and since Andrew hasn't dropped swap prefetch, instead
> here are a swag of fixes and improvements, [...]

it's a reliable win on my testbox too:

# echo 1 > /proc/sys/vm/swap_prefetch
# ./sp_tester
Ram 1019540000 Swap 4096564000
Total ram to be malloced: 1529310000 bytes
Starting first malloc of 764655000 bytes
Starting 1st read of first malloc
Touching this much ram takes 4393 milliseconds
Starting second malloc of 764655000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 30279 milliseconds

versus:

# echo 0 > /proc/sys/vm/swap_prefetch
# ./sp_tester
[...]

Timed portion 36605 milliseconds

i've repeated these tests to make sure it's a stable win and indeed it
is:

# swap-prefetch-on:

Timed portion 29704 milliseconds

# swap-prefetch-off:

Timed portion 34863 milliseconds

Nice work Con!

A suggestion for improvement: right now swap-prefetch does a small bit
of swapin every 5 seconds and stays idle in between. Could this perhaps
be made more aggressive (optionally perhaps), if the system is not
swapping otherwise? If block-IO level instrumentation is needed to
determine idleness of block IO then that is justified too i think.

Another suggestion: swap-prefetch seems to be doing all the right
decisions in the sp_test.c case - so would it be possible to add
statistics so that it could be verified how much of the swapped-in pages
were indeed a 'hit' - and how many were recycled without them being
reused? That could give a reliable, objective metric about how efficient
swap-prefetch is in any workload.

Ingo
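For scale, the two pairs of timings above work out to roughly a 17% and a 15% shorter timed portion with prefetch enabled. The arithmetic as a small sketch; `percent_reduction` is a hypothetical helper name and the inputs are the millisecond figures quoted in the message.

```c
/* Percentage by which with_ms is shorter than without_ms. */
static double percent_reduction(double without_ms, double with_ms)
{
	return (without_ms - with_ms) / without_ms * 100.0;
}
```

Calling `percent_reduction(36605, 30279)` gives about 17.3, and `percent_reduction(34863, 29704)` gives about 14.8. These are two runs on one test box, so they say little about run-to-run variance.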
* Re: [PATCH] mm: swap prefetch improvements
2007-05-21 10:03 ` [PATCH] " Ingo Molnar
@ 2007-05-21 13:44 ` Con Kolivas
2007-05-21 16:00 ` Ingo Molnar
0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-21 13:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

On Monday 21 May 2007 20:03, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > It turns out that fixing swap prefetch was not that hard to fix and
> > improve upon, and since Andrew hasn't dropped swap prefetch, instead
> > here are a swag of fixes and improvements, [...]
>
> it's a reliable win on my testbox too:
>
> # echo 1 > /proc/sys/vm/swap_prefetch
> Timed portion 30279 milliseconds
>
> versus:
>
> # echo 0 > /proc/sys/vm/swap_prefetch
> # ./sp_tester
> [...]
>
> Timed portion 36605 milliseconds
>
> i've repeated these tests to make sure it's a stable win and indeed it
> is:
>
> # swap-prefetch-on:
>
> Timed portion 29704 milliseconds
>
> # swap-prefetch-off:
>
> Timed portion 34863 milliseconds
>
> Nice work Con!

Thanks!

>
> A suggestion for improvement: right now swap-prefetch does a small bit
> of swapin every 5 seconds and stays idle in between. Could this perhaps
> be made more aggressive (optionally perhaps), if the system is not
> swapping otherwise? If block-IO level instrumentation is needed to
> determine idleness of block IO then that is justified too i think.

Hmm.. The timer waits 5 seconds before trying to prefetch, but then only
stops if it detects any activity elsewhere. It doesn't actually try to go
idle in between but it doesn't take much activity to put it back to sleep,
hence detecting yet another "not quite idle" period and then it goes to sleep
again. I guess the sleep interval can actually be changed as another tunable
from 5 seconds to whatever the user wanted.

> Another suggestion: swap-prefetch seems to be doing all the right
> decisions in the sp_test.c case - so would it be possible to add
> statistics so that it could be verified how much of the swapped-in pages
> were indeed a 'hit' - and how many were recycled without them being
> reused? That could give a reliable, objective metric about how efficient
> swap-prefetch is in any workload.

Well the advantage is twofold potentially; 1. the pages that have been
prefetched and become minor faults when they would have been major faults,
and 2. those that become minor faults (via 1) and then become major faults
again (since a copy is kept on backing store with swap prefetch). The
sp_tester only tests for 1, although it would be easy enough to simply do
another big malloc at the end and see how fast it swapped out again as a
marker of 2. As for an in-kernel option, it could get kind of expensive
tracking pages that have done one or both of these. I'll think about an
affordable way to do this, perhaps it could be just done as a
debugging/testing patch, but it would be nice to make it cheap enough to have
there permanently as well. The pages end up in swap cache (in the reverse
direction pages normally get to swap cache) so the accounting could be done
somewhere around there.

> Ingo

Thanks for comments!

--
-ck
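The hit statistic Ingo asks for comes down to three counters and one ratio. The sketch below is purely hypothetical: `prefetch_account` and `hit_rate_percent` are invented names, nothing like them exists in the posted patch, and the hard part Con raises, tracking each page's fate cheaply, is left out entirely.

```c
/* Hypothetical counters; none of these exist in the posted patch. */
struct prefetch_account {
	unsigned long prefetched;	/* pages read in by the prefetcher */
	unsigned long hits;		/* prefetched pages later used by a task */
	unsigned long recycled;		/* prefetched pages reclaimed unused */
};

/* Hit rate in percent over pages whose fate is known; 0 if none yet. */
static unsigned long hit_rate_percent(const struct prefetch_account *a)
{
	unsigned long resolved = a->hits + a->recycled;

	if (resolved == 0)
		return 0;
	return a->hits * 100 / resolved;
}
```

Pages that were prefetched but are neither used nor reclaimed yet stay out of the ratio, so the figure only reflects pages with a known outcome.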
* Re: [PATCH] mm: swap prefetch improvements
2007-05-21 13:44 ` Con Kolivas
@ 2007-05-21 16:00 ` Ingo Molnar
2007-05-22 10:15 ` Antonino Ingargiola
0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-21 16:00 UTC (permalink / raw)
To: Con Kolivas
Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> > A suggestion for improvement: right now swap-prefetch does a small
> > bit of swapin every 5 seconds and stays idle in between. Could this
> > perhaps be made more aggressive (optionally perhaps), if the system
> > is not swapping otherwise? If block-IO level instrumentation is
> > needed to determine idleness of block IO then that is justified too
> > i think.
>
> Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> only stops if it detects any activity elsewhere. It doesn't actually
> try to go idle in between but it doesn't take much activity to put it
> back to sleep, hence detecting yet another "not quite idle" period and
> then it goes to sleep again. I guess the sleep interval can actually
> be changed as another tunable from 5 seconds to whatever the user
> wanted.

there was nothing else running on the system - so i suspect the swapin
activity flagged 'itself' as some 'other' activity and stopped? The
swapins happened in 4 bursts, separated by 5 seconds total idleness.

Ingo
* Re: [PATCH] mm: swap prefetch improvements
2007-05-21 16:00 ` Ingo Molnar
@ 2007-05-22 10:15 ` Antonino Ingargiola
2007-05-22 10:20 ` Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Antonino Ingargiola @ 2007-05-22 10:15 UTC (permalink / raw)
To: Ingo Molnar
Cc: Con Kolivas, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

2007/5/21, Ingo Molnar <mingo@elte.hu>:
>
> * Con Kolivas <kernel@kolivas.org> wrote:
>
> > > A suggestion for improvement: right now swap-prefetch does a small
> > > bit of swapin every 5 seconds and stays idle in between. Could this
> > > perhaps be made more aggressive (optionally perhaps), if the system
> > > is not swapping otherwise? If block-IO level instrumentation is
> > > needed to determine idleness of block IO then that is justified too
> > > i think.
> >
> > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > only stops if it detects any activity elsewhere. It doesn't actually
> > try to go idle in between but it doesn't take much activity to put it
> > back to sleep, hence detecting yet another "not quite idle" period and
> > then it goes to sleep again. I guess the sleep interval can actually
> > be changed as another tunable from 5 seconds to whatever the user
> > wanted.
>
> there was nothing else running on the system - so i suspect the swapin
> activity flagged 'itself' as some 'other' activity and stopped? The
> swapins happened in 4 bursts, separated by 5 seconds total idleness.

I've noted burst swapins separated by some seconds of pause in my
desktop system too (with sp_tester and an idle gnome).

Regards,

~ Antonio
* Re: [PATCH] mm: swap prefetch improvements
2007-05-22 10:15 ` Antonino Ingargiola
@ 2007-05-22 10:20 ` Con Kolivas
2007-05-22 10:25 ` Ingo Molnar
0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 10:20 UTC (permalink / raw)
To: Antonino Ingargiola
Cc: Ingo Molnar, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

On Tuesday 22 May 2007 20:15, Antonino Ingargiola wrote:
> 2007/5/21, Ingo Molnar <mingo@elte.hu>:
> > * Con Kolivas <kernel@kolivas.org> wrote:
> > > > A suggestion for improvement: right now swap-prefetch does a small
> > > > bit of swapin every 5 seconds and stays idle in between. Could this
> > > > perhaps be made more aggressive (optionally perhaps), if the system
> > > > is not swapping otherwise? If block-IO level instrumentation is
> > > > needed to determine idleness of block IO then that is justified too
> > > > i think.
> > >
> > > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > > only stops if it detects any activity elsewhere. It doesn't actually
> > > try to go idle in between but it doesn't take much activity to put it
> > > back to sleep, hence detecting yet another "not quite idle" period and
> > > then it goes to sleep again. I guess the sleep interval can actually
> > > be changed as another tunable from 5 seconds to whatever the user
> > > wanted.
> >
> > there was nothing else running on the system - so i suspect the swapin
> > activity flagged 'itself' as some 'other' activity and stopped? The
> > swapins happened in 4 bursts, separated by 5 seconds total idleness.
>
> I've noted burst swapins separated by some seconds of pause in my
> desktop system too (with sp_tester and an idle gnome).

That really is expected, as just about anything, including journal writeout,
would be enough to put it back to sleep for 5 more seconds.

--
-ck
* Re: [PATCH] mm: swap prefetch improvements
2007-05-22 10:20 ` Con Kolivas
@ 2007-05-22 10:25 ` Ingo Molnar
2007-05-22 10:37 ` Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 10:25 UTC (permalink / raw)
To: Con Kolivas
Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> > > there was nothing else running on the system - so i suspect the
> > > swapin activity flagged 'itself' as some 'other' activity and
> > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > total idleness.
> >
> > I've noted burst swapins separated by some seconds of pause in my
> > desktop system too (with sp_tester and an idle gnome).
>
> That really is expected, as just about anything, including journal
> writeout, would be enough to put it back to sleep for 5 more seconds.

note that nothing like that happened on my system - in the
swap-prefetch-off case there was _zero_ IO activity during the sleep
period.

Ingo
* Re: [PATCH] mm: swap prefetch improvements 2007-05-22 10:25 ` Ingo Molnar @ 2007-05-22 10:37 ` Con Kolivas 2007-05-22 10:46 ` Ingo Molnar 2007-05-22 20:42 ` Ash Milsted 0 siblings, 2 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-22 10:37 UTC (permalink / raw) To: Ingo Molnar Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm On Tuesday 22 May 2007 20:25, Ingo Molnar wrote: > * Con Kolivas <kernel@kolivas.org> wrote: > > > > there was nothing else running on the system - so i suspect the > > > > swapin activity flagged 'itself' as some 'other' activity and > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds > > > > total idleness. > > > > > > I've noted burst swapins separated by some seconds of pause in my > > > desktop system too (with sp_tester and an idle gnome). > > > > That really is expected, as just about anything, including journal > > writeout, would be enough to put it back to sleep for 5 more seconds. > > note that nothing like that happened on my system - in the > swap-prefetch-off case there was _zero_ IO activity during the sleep > period. Ok, granted it's _very_ conservative. I really don't want to risk its presence being a burden on anything, and the iowait it induces probably makes it turn itself off for another PREFETCH_DELAY (5s). I really don't want to cross the line to where it is detrimental in any way. Not dropping out on a cond_resched and perhaps making the delay tunable should be enough to make it a little less "sleepy". -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements 2007-05-22 10:37 ` Con Kolivas @ 2007-05-22 10:46 ` Ingo Molnar 2007-05-22 10:54 ` Con Kolivas 2007-05-22 20:18 ` [ck] " Michael Chang 2007-05-22 20:42 ` Ash Milsted 1 sibling, 2 replies; 233+ messages in thread From: Ingo Molnar @ 2007-05-22 10:46 UTC (permalink / raw) To: Con Kolivas Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm * Con Kolivas <kernel@kolivas.org> wrote: > On Tuesday 22 May 2007 20:25, Ingo Molnar wrote: > > * Con Kolivas <kernel@kolivas.org> wrote: > > > > > there was nothing else running on the system - so i suspect the > > > > > swapin activity flagged 'itself' as some 'other' activity and > > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds > > > > > total idleness. > > > > > > > > I've noted burst swapins separated by some seconds of pause in my > > > > desktop system too (with sp_tester and an idle gnome). > > > > > > That really is expected, as just about anything, including journal > > > writeout, would be enough to put it back to sleep for 5 more seconds. > > > > note that nothing like that happened on my system - in the > > swap-prefetch-off case there was _zero_ IO activity during the sleep > > period. > > Ok, granted it's _very_ conservative. [...] but your first reaction was "it should not have slept for 5 seconds": | Hmm.. The timer waits 5 seconds before trying to prefetch, but then | only stops if it detects any activity elsewhere. It doesn't actually | try to go idle in between It clearly should not consider 'itself' as IO activity. This suggests some bug in the 'detect activity' mechanism, agreed? I'm wondering whether you are seeing the same problem, or is all swap-prefetch IO on your system continuous until it's done [or some other IO comes inbetween]? Ingo ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements
2007-05-22 10:46 ` Ingo Molnar
@ 2007-05-22 10:54 ` Con Kolivas
2007-05-22 10:57 ` Ingo Molnar
2007-05-22 20:18 ` [ck] " Michael Chang
1 sibling, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 10:54 UTC (permalink / raw)
To: Ingo Molnar
Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> It clearly should not consider 'itself' as IO activity. This suggests
> some bug in the 'detect activity' mechanism, agreed? I'm wondering
> whether you are seeing the same problem, or is all swap-prefetch IO on
> your system continuous until it's done [or some other IO comes
> inbetween]?

When nothing else is happening anywhere on the system it reads in bursts and
goes to sleep during journal writeout.

--
-ck

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements
2007-05-22 10:54 ` Con Kolivas
@ 2007-05-22 10:57 ` Ingo Molnar
2007-05-22 11:04 ` Con Kolivas
0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 10:57 UTC (permalink / raw)
To: Con Kolivas
Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > It clearly should not consider 'itself' as IO activity. This
> > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > wondering whether you are seeing the same problem, or is all
> > swap-prefetch IO on your system continuous until it's done [or some
> > other IO comes inbetween]?
>
> When nothing else is happening anywhere on the system it reads in
> bursts and goes to sleep during journal writeout.

hm, what do you call 'journal writeout' here that would be happening on
my system?

Ingo

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements
2007-05-22 10:57 ` Ingo Molnar
@ 2007-05-22 11:04 ` Con Kolivas
[not found] ` <20070522111104.GA14950@elte.hu>
0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 11:04 UTC (permalink / raw)
To: Ingo Molnar
Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

On Tuesday 22 May 2007 20:57, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > > It clearly should not consider 'itself' as IO activity. This
> > > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > > wondering whether you are seeing the same problem, or is all
> > > swap-prefetch IO on your system continuous until it's done [or some
> > > other IO comes inbetween]?
> >
> > When nothing else is happening anywhere on the system it reads in
> > bursts and goes to sleep during journal writeout.
>
> hm, what do you call 'journal writeout' here that would be happening on
> my system?

Not really sure what you have in terms of fs, but here even with nothing going
on, ext3 writes to disk every 5 seconds with kjournald.

--
-ck

^ permalink raw reply [flat|nested] 233+ messages in thread
[parent not found: <20070522111104.GA14950@elte.hu>]
* Re: [PATCH] mm: swap prefetch improvements
[not found] ` <20070522111104.GA14950@elte.hu>
@ 2007-05-22 11:12 ` Ingo Molnar
0 siblings, 0 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 11:12 UTC (permalink / raw)
To: Con Kolivas
Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel, linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> > hm, what do you call 'journal writeout' here that would be happening
> > on my system?
>
> Not really sure what you have in terms of fs, but here even with
> nothing going on, ext3 writes to disk every 5 seconds with kjournald.

i have ext3, but it doesnt do that on my box. Also, i would have noticed
any IO activity in the 'swap prefetch off' case. When i said completely
idle, i really meant it ;-)

so swap-prefetch stops for 5 seconds for no apparent reason.

Ingo

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [ck] Re: [PATCH] mm: swap prefetch improvements 2007-05-22 10:46 ` Ingo Molnar 2007-05-22 10:54 ` Con Kolivas @ 2007-05-22 20:18 ` Michael Chang 2007-05-22 20:31 ` Ingo Molnar 1 sibling, 1 reply; 233+ messages in thread From: Michael Chang @ 2007-05-22 20:18 UTC (permalink / raw) To: Ingo Molnar Cc: Con Kolivas, Nick Piggin, Ray Lee, linux-kernel, ck list, linux-mm, Andrew Morton On 5/22/07, Ingo Molnar <mingo@elte.hu> wrote: > > * Con Kolivas <kernel@kolivas.org> wrote: > > > On Tuesday 22 May 2007 20:25, Ingo Molnar wrote: > > > * Con Kolivas <kernel@kolivas.org> wrote: > > > > > > there was nothing else running on the system - so i suspect the > > > > > > swapin activity flagged 'itself' as some 'other' activity and > > > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds > > > > > > total idleness. > > > > > > > > > > I've noted burst swapins separated by some seconds of pause in my > > > > > desktop system too (with sp_tester and an idle gnome). > > > > > > > > That really is expected, as just about anything, including journal > > > > writeout, would be enough to put it back to sleep for 5 more seconds. > > > > > > note that nothing like that happened on my system - in the > > > swap-prefetch-off case there was _zero_ IO activity during the sleep > > > period. > > > > Ok, granted it's _very_ conservative. [...] > > but your first reaction was "it should not have slept for 5 seconds": > > | Hmm.. The timer waits 5 seconds before trying to prefetch, but then > | only stops if it detects any activity elsewhere. It doesn't actually > | try to go idle in between > > It clearly should not consider 'itself' as IO activity. This suggests > some bug in the 'detect activity' mechanism, agreed? I'm wondering > whether you are seeing the same problem, or is all swap-prefetch IO on > your system continuous until it's done [or some other IO comes > inbetween]? 
The only "problem" I can see with this idea is in the potential case that it takes up all the IO activity, and so there is never enough IO activity from other programs to trigger the wait mechanism because they don't get a chance to run. That could probably be "fixed" by capping the IO, though... (with one of those oh-so-lovable "magic numbers" or a tunable)

That said, I don't think there are any issues with the code compensating for its own activity in the "detect activity" mechanism -- assuming there wasn't a major impact in e.g. maintainability or something.

As for the burstiness... considering the "no negative impact" stance, I can understand that. But it seems inefficient, at best...

--
Michael Chang

Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html Thank you.

^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [ck] Re: [PATCH] mm: swap prefetch improvements 2007-05-22 20:18 ` [ck] " Michael Chang @ 2007-05-22 20:31 ` Ingo Molnar 0 siblings, 0 replies; 233+ messages in thread From: Ingo Molnar @ 2007-05-22 20:31 UTC (permalink / raw) To: Michael Chang Cc: Con Kolivas, Nick Piggin, Ray Lee, linux-kernel, ck list, linux-mm, Andrew Morton * Michael Chang <thenewme91@gmail.com> wrote: > > It clearly should not consider 'itself' as IO activity. This > > suggests some bug in the 'detect activity' mechanism, agreed? I'm > > wondering whether you are seeing the same problem, or is all > > swap-prefetch IO on your system continuous until it's done [or some > > other IO comes inbetween]? > > The only "problem" I can see with this idea is in the potential case > that it takes up all the IO activity, and so there is never enough IO > activity from other progams to trigger the wait mechanism because they > don't get a chance to run. i dont understand what you mean. Any 'use only idle IO capacity' mechanism should immediately cease to be active the moment any other app tries to do IO - whether the IO subsystem is saturated or not. > That said, I don't think there are any issues with the code > compensating for its own activity in the "detect activity" mechanism > -- assuming there wasn't a major impact in e.g. maintainability or > something. > > As for the burstyness... considering the "no negative impact" stance, > I can understand that. But it seems inefficient, at best... well, it's a plain old bug (a not too serious one) in my book, i'm surprised that we are now at mail #7 about it :-) I reported it, and i guess Con will fix it eventually. There's really no need to deny that it exists or to try to talk it out of existence. Sheesh! :-) Ingo ^ permalink raw reply [flat|nested] 233+ messages in thread
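The disagreement above is easier to follow with a toy model. The sketch below is not the swap-prefetch code: `simulate`, its parameters and the one-second tick are all hypothetical, and only the 5 second delay (PREFETCH_DELAY) comes from the thread. It shows how a detector that mistakes the prefetcher's own reads for foreign activity produces bursts separated by idle gaps even on an otherwise idle system, while a detector that discounts its own IO runs continuously until it is done or until some other task does IO.

```python
PREFETCH_DELAY = 5  # seconds, the delay quoted in the thread


def simulate(pages, burst, counts_own_io, other_io=frozenset()):
    """Toy model: return the seconds at which prefetch reads happened.

    pages          -- number of pages waiting to be prefetched
    burst          -- pages read per active second (hypothetical)
    counts_own_io  -- if True, the detector treats the prefetcher's own
                      reads as foreign activity (the suspected bug)
    other_io       -- seconds at which some other task does IO
    """
    active_seconds = []
    now = 0
    sleep_until = PREFETCH_DELAY  # the timer waits once before starting
    saw_own_io = False
    while pages > 0:
        foreign = now in other_io
        if foreign or (counts_own_io and saw_own_io):
            # Any detected activity pushes the next attempt a full delay out.
            sleep_until = now + PREFETCH_DELAY
            saw_own_io = False
        if now >= sleep_until:
            pages -= min(burst, pages)
            active_seconds.append(now)
            saw_own_io = True
        now += 1
    return active_seconds


if __name__ == "__main__":
    print("own IO ignored:", simulate(4, 1, counts_own_io=False))
    print("own IO counted:", simulate(4, 1, counts_own_io=True))
```

With four pages and one page per second, the model with the suspected bug performs four separate reads with an idle gap between each, which is the shape of the behaviour Ingo reported; the model without it finishes in four consecutive seconds.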
* Re: [ck] Re: [PATCH] mm: swap prefetch improvements 2007-05-22 10:37 ` Con Kolivas 2007-05-22 10:46 ` Ingo Molnar @ 2007-05-22 20:42 ` Ash Milsted 2007-05-22 22:50 ` Con Kolivas 1 sibling, 1 reply; 233+ messages in thread From: Ash Milsted @ 2007-05-22 20:42 UTC (permalink / raw) To: ck; +Cc: linux-kernel, ck list On Tue, 22 May 2007 20:37:54 +1000 Con Kolivas <kernel@kolivas.org> wrote: > On Tuesday 22 May 2007 20:25, Ingo Molnar wrote: > > * Con Kolivas <kernel@kolivas.org> wrote: > > > > > there was nothing else running on the system - so i suspect the > > > > > swapin activity flagged 'itself' as some 'other' activity and > > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds > > > > > total idleness. > > > > > > > > I've noted burst swapins separated by some seconds of pause in my > > > > desktop system too (with sp_tester and an idle gnome). > > > > > > That really is expected, as just about anything, including journal > > > writeout, would be enough to put it back to sleep for 5 more seconds. > > > > note that nothing like that happened on my system - in the > > swap-prefetch-off case there was _zero_ IO activity during the sleep > > period. > > Ok, granted it's _very_ conservative. I really don't want to risk its presence > being a burden on anything, and the iowait it induces probably makes it turn > itself off for another PREFETCH_DELAY (5s). I really don't want to cross the > line to where it is detrimental in any way. Not dropping out on a > cond_resched and perhaps making the delay tunable should be enough to make it > a little less "sleepy". > > -- > -ck Hi. I just did some video encoding on my desktop and I was noticing (for the first time in a while) that running apps had to hit swap quite a lot when I switched to them (the encoding was going at full blast for most of the day, and most of the time other running apps were idle). 
Now, a good half of my RAM appeared to be free during all this, so I was thinking at the time that it would be nice if swap prefetch could be tunably more aggressive. I guess it would be ideal in this case if it could kick in during tunably low disk-IO periods, even if the CPU is rather busy. I'm sure you've considered this, so I only butt in here to cast a vote for it. :) Of course, I could be completely wrong about the possibility.. and I seem to remember that the disk cache can take up about half the ram by default without this showing up in 'gnome-system-monitor'... which I guess might happen during heavy encoding.. but even if it did, I could have set the limit lower, and would then have still appreciated prefetching. Ash ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements 2007-05-22 20:42 ` Ash Milsted @ 2007-05-22 22:50 ` Con Kolivas 2007-05-23 7:57 ` Ash Milsted 0 siblings, 1 reply; 233+ messages in thread From: Con Kolivas @ 2007-05-22 22:50 UTC (permalink / raw) To: ck; +Cc: Ash Milsted, linux-kernel On Wednesday 23 May 2007 06:42, Ash Milsted wrote: > Hi. I just did some video encoding on my desktop and I was noticing > (for the first time in a while) that running apps had to hit swap quite > a lot when I switched to them (the encoding was going at full blast for > most of the day, and most of the time other running apps were > idle). Now, a good half of my RAM appeared to be free during all this, > so I was thinking at the time that it would be nice if swap prefetch > could be tunably more aggressive. I guess it would be ideal in this > case if it could kick in during tunably low disk-IO periods, even if > the CPU is rather busy. I'm sure you've considered this, so I only butt > in here to cast a vote for it. :) In this case nicing the video encode should be enough to make it prefetch even during heavy cpu usage. It detects the total nice level rather than the cpu usage. > Of course, I could be completely wrong about the possibility.. and I > seem to remember that the disk cache can take up about half the ram by > default without this showing up in 'gnome-system-monitor'... which I > guess might happen during heavy encoding.. but even if it did, I could > have set the limit lower, and would then have still appreciated > prefetching. I plan to make it prefetch more aggressively by default soon and make it more tunable too. -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: [PATCH] mm: swap prefetch improvements 2007-05-22 22:50 ` Con Kolivas @ 2007-05-23 7:57 ` Ash Milsted 0 siblings, 0 replies; 233+ messages in thread From: Ash Milsted @ 2007-05-23 7:57 UTC (permalink / raw) To: Con Kolivas; +Cc: ck, linux-kernel On Wed, 23 May 2007 08:50:01 +1000 Con Kolivas <kernel@kolivas.org> wrote: > On Wednesday 23 May 2007 06:42, Ash Milsted wrote: > > Hi. I just did some video encoding on my desktop and I was noticing > > (for the first time in a while) that running apps had to hit swap quite > > a lot when I switched to them (the encoding was going at full blast for > > most of the day, and most of the time other running apps were > > idle). Now, a good half of my RAM appeared to be free during all this, > > so I was thinking at the time that it would be nice if swap prefetch > > could be tunably more aggressive. I guess it would be ideal in this > > case if it could kick in during tunably low disk-IO periods, even if > > the CPU is rather busy. I'm sure you've considered this, so I only butt > > in here to cast a vote for it. :) > > In this case nicing the video encode should be enough to make it prefetch even > during heavy cpu usage. It detects the total nice level rather than the cpu > usage. > Cunning, but I guess the regular (less than 5 seconds apart) reads/writes during the encoding process would cause prefetching to hold off, no? I had used nice and ionice to reduce the encoder priority, which made desktop apps pretty responsive, except when they had to hit swap. If swap prefetch is using the idle io-priority I suppose it would hardly affect performance if it kicked in during such use, since it would operate in between the encoder reads anyway (assuming the encoder is at higher ioprio), right ? > > I plan to make it prefetch more aggressively by default soon and make it more > tunable too. > 'Sounds good! ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-10 3:48 ` Ray Lee 2007-05-10 3:56 ` Nick Piggin @ 2007-05-10 3:58 ` Con Kolivas 1 sibling, 0 replies; 233+ messages in thread From: Con Kolivas @ 2007-05-10 3:58 UTC (permalink / raw) To: Ray Lee Cc: Nick Piggin, Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm On Thursday 10 May 2007 13:48, Ray Lee wrote: > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > You said it helped with the updatedb problem. That says we should look at > > why it is going bad first, and for example improve use-once algorithms. > > After we do that, then swap prefetching might still help, which is fine. > > Nick, if you're volunteering to do that analysis, then great. If not, > then you're just providing a airy hope with nothing to back up when or > if that work would ever occur. > > Further, if you or someone else *does* do that work, then guess what, > we still have the option to rip out the swap prefetching code after > the hypothetical use-once improvements have been proven and merged. > Which, by the way, I've watched people talk about since 2.4. That was, > y'know, a *while* ago. > > So enough with the stop energy, okay? You're better than that. > > Con? He is right about the last feature to go in needs to work > gracefully with what's there now. However, it's not unheard of for > authors of other sections of code to help out with incompatibilities > by answering politely phrased questions for guidance. Though the > intersection of users between cpusets and desktop systems seems small > indeed. Let's just set the record straight. I actually discussed cpusets over a year ago in this nonsense and was told by sgi folk there was no need to get my head around cpusets and honouring node placement should be enough which, by the way, swap prefetch does. So I by no means ignored this; we just hit an impasse on just how much more featured it should be for the sake of a goddamn home desktop pc feature. 
Anyway why the hell am I resurrecting this thread? The code is declared dead already. Leave it be. -- -ck ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 12:10 ` Con Kolivas 2007-05-05 8:42 ` Con Kolivas @ 2007-05-07 14:28 ` Bill Davidsen 1 sibling, 0 replies; 233+ messages in thread From: Bill Davidsen @ 2007-05-07 14:28 UTC (permalink / raw) To: linux-kernel; +Cc: linux-mm Con Kolivas wrote: > On Friday 04 May 2007 18:52, Ingo Molnar wrote: >> agreed. Con, IIRC you wrote a testcase for this, right? Could you please >> send us the results of that testing? > > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch > disabled and then enabled swap prefetch saves ~5 seconds on average hardware > on this one test case. I had many users try this and the results were between > 2 and 10 seconds, but always showed a saving on this testcase. This effect > easily occurs on printing a big picture, editing a large file, compressing an > iso image or whatever in real world workloads. Smaller, but much more > frequent effects of this over the course of a day obviously also occur and do > add up. > I'll try this when I get the scheduler stuff done, and also dig out the "resp1" stuff for "back when." I see the most recent datasets were comparing 2.5.43-mm2 responsiveness with 2.4.19-ck7, you know I always test your stuff ;-) Guess it might need a bit of polish for current hardware, I was testing on *small* machines, deliberately. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 233+ messages in thread
* Re: swap-prefetch: 2.6.22 -mm merge plans 2007-05-04 8:52 ` Ingo Molnar 2007-05-04 9:09 ` Nick Piggin 2007-05-04 12:10 ` Con Kolivas @ 2007-05-07 14:18 ` Bill Davidsen 2 siblings, 0 replies; 233+ messages in thread From: Bill Davidsen @ 2007-05-07 14:18 UTC (permalink / raw) To: linux-kernel; +Cc: linux-mm Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > >>> i'm wondering about swap-prefetch: > >> Being able to config all these core heuristics changes is really not >> that much of a positive. The fact that we might _need_ to config >> something out, and double the configuration range isn't too pleasing. > > Well, to the desktop user this is a speculative performance feature that > he is willing to potentially waste CPU and IO capacity, in expectation > of better performance. > > On the conceptual level it is _precisely the same thing as regular file > readahead_. (with the difference that to me swapahead seems to be quite > a bit more intelligent than our current file readahead logic.) > > This feature has no API or ABI impact at all, it's a pure performance > feature. (besides the trivial sysctl to turn it runtime on/off). > >> Here were some of my concerns, and where our discussion got up to. [...snip...] > i see no real problem here. We've had heuristics for a _long_ time in > various areas of the code. Sometimes they work, sometimes they suck. > > the flow of this is really easy: distro looking for a feature edge turns > it on and announces it, if the feature does not work out for users then > user turns it off and complains to distro, if enough users complain then > distro turns it off for next release, upstream forgets about this > performance feature and eventually removes it once someone notices that > it wouldnt even compile in the past 2 main releases. I see no problem > here, we did that in the past too with performance features. 
The > networking stack has literally dozens of such small tunable things which > get experimented with, and whose defaults do get tuned carefully. Some > of the knobs help bandwidth, some help latency. > I haven't looked at this code since it first came out and didn't impress me, but I think it would be good to get the current version in. However, when you say "user turns it off" I hope you mean "in /proc/sys with a switch or knob" and not by expecting people to recompile and install a kernel. Then it might take a little memory but wouldn't do something undesirable. Note: I had no bad effect from the code, it just didn't feel faster. On a low memory machine it might help. Of course I have wanted to have a hard limit on memory used for i/o buffers, just to avoid swapping programs to make room for i/o, so to some extent I feel as if this is a fix for a problem we shouldn't have. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot ^ permalink raw reply [flat|nested] 233+ messages in thread
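For the runtime switch requested above, a minimal sketch of flipping such a knob from user space follows. It assumes the switch is exposed as an integer file under /proc/sys; the default path `/proc/sys/vm/swap_prefetch` is an assumption and is not stated anywhere in this thread, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the knob; not confirmed anywhere in this thread.
DEFAULT_KNOB = Path("/proc/sys/vm/swap_prefetch")


def read_knob(path=DEFAULT_KNOB):
    """Return the current integer value of a sysctl-style file."""
    return int(Path(path).read_text().split()[0])


def write_knob(value, path=DEFAULT_KNOB):
    """Write a new integer value; a real /proc/sys entry needs root."""
    Path(path).write_text(f"{int(value)}\n")


def set_prefetch(enabled, path=DEFAULT_KNOB):
    """Turn the feature on (1) or off (0) and return the previous value."""
    previous = read_knob(path)
    write_knob(1 if enabled else 0, path)
    return previous
```

Passing the path as a parameter keeps the helpers usable against an ordinary file, so they can be tried without touching a running kernel.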
* Re: 2.6.22 -mm merge plans
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
` (21 preceding siblings ...)
2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
@ 2007-05-07 17:47 ` Josef Sipek
22 siblings, 0 replies; 233+ messages in thread
From: Josef Sipek @ 2007-05-07 17:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
...
> git-unionfs.patch
>
> Does this have a future?

Yes! There are many active users who use our unioning functionality.

Namespace unification consists of several major parts:

1) Duplicate elimination: This can be handled in the VFS. However, it would clutter up the VFS code with a lot of wrappers around key VFS functions to select the appropriate dentry/inode/etc. object from the underlying branch. (You also need to provide efficient and sane readdir/seekdir semantics which we do with our "On Disk Format" support.)

2) Copyup: Having a unified namespace by itself isn't enough. You also need copy-on-write functionality when the source file is on a read-only branch. This makes unioning much more useful and is one of the main attractions to unionfs users.

3) Whiteouts: Whiteouts are a key unioning construct. As was pointed out at OLS 2006, they are a property of the union and _NOT_ a branch. Therefore, they should not be stored persistently on a branch but rather in some "external" storage.

4) You also need unique and *persistent* inode numbers for network f/s exports and other unix tools.

5) You need to provide dynamic branch management functionality: adding, removing, and changing the mode of branches in an existing union.

We have considerable experience in unioning file systems for years now; we are currently working on the third generation of the code. All of the above features, and more, are USED by users, and are NEEDED by users.

We believe the right approach is the one we've taken, and is the least intrusive: a standalone (stackable) file system that doesn't clutter the VFS, with some small and gradual changes to the VFS to support stacking. As you may have noticed, we have been successfully submitting VFS patches to make the VFS more stacking friendly (not just to Unionfs, but also to eCryptfs which has been in since 2.6.19).

The older Union mounts, alas, try to put all that functionality into the VFS. We recognize that some people think that union mounts at the VFS level is the "elegant" approach, but we hope people will listen to us and learn from our experience: unioning may seem simple in principle, but it is difficult in practice. (See http://unionfs.fileystems.org/ for a lot more info.) So we don't think that is a viable long term approach to have all of the unioning functionality in the VFS for two main reasons:

(1) If you want users to use a VFS-level unioning functionality ala union-mounts, then you're going to have to implement *all* of the features we have implemented; the VFS clutter and complexity that will result will be very considerable, and we just don't think that it'd happen.

(2) Some may suggest to have a lightweight union mounts that only offers a subset of the functionality that's suitable for placing in the VFS. In that case, most unionfs users simply won't use it. You'd need union mounts to provide ALL of the functionality that we have TODAY, if you want users to use it.

As far as we can see the remaining stumbling block right now is cache coherency between the layers. Whether you provide unioning as a stackable f/s or shoved into the VFS, coherency will have to be addressed. In our upcoming paper and talk at OLS'07, we plan to bring up and discuss several ideas we've explored already on how to resolve this incoherency. Our ideas range from complex graph-based pointer management between objects of all sorts, to simple timestamp-based VFS hooks. (We've been experimenting with several approaches and so far we're leaning toward the simple timestamp-based one, again in the interest of keeping the VFS changes simple. We hope to have more results to report by OLS time.)

Josef "Jeff" Sipek, on behalf of the Unionfs team.

^ permalink raw reply [flat|nested] 233+ messages in thread
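The first three items in Josef Sipek's list (duplicate elimination, copyup, whiteouts) can be illustrated with a toy in-memory model. This is a sketch under stated assumptions only, not how Unionfs is implemented: branches are plain dicts, the `Union` class and its method names are hypothetical, and whiteouts live in a set outside every branch, following the remark that they belong to the union rather than to a branch.

```python
class Union:
    """Toy union of branches; branches[0] is the only writable one."""

    def __init__(self, *branches):
        self.branches = list(branches)  # each branch: dict of name -> bytes
        self.whiteouts = set()          # kept outside the branches

    def lookup(self, name):
        """Return content from the highest-priority branch, or None."""
        if name in self.whiteouts:
            return None
        for branch in self.branches:
            if name in branch:
                return branch[name]
        return None

    def readdir(self):
        """List each visible name once (duplicate elimination)."""
        seen = []
        for branch in self.branches:
            for name in branch:
                if name not in seen and name not in self.whiteouts:
                    seen.append(name)
        return sorted(seen)

    def append(self, name, data):
        """Copy-on-write: copy the file up to the top branch, then modify."""
        top = self.branches[0]
        if name not in top:
            current = self.lookup(name)
            top[name] = current if current is not None else b""  # copyup
        top[name] += data  # bytes are immutable, lower copies stay intact
        self.whiteouts.discard(name)

    def unlink(self, name):
        """Remove from the writable branch and mask any lower copies."""
        self.branches[0].pop(name, None)
        if any(name in branch for branch in self.branches[1:]):
            self.whiteouts.add(name)


if __name__ == "__main__":
    rw, ro = {}, {"a": b"lower", "b": b"keep"}
    u = Union(rw, ro)
    u.append("a", b"!")   # copied up, read-only branch untouched
    u.unlink("b")         # hidden by a whiteout, still present in ro
    print(u.readdir(), u.lookup("a"), ro)
```

Even this toy needs a wrapper around every lookup and directory listing, which hints at the VFS clutter the message warns about; persistent inode numbers, branch management and cache coherency are left out entirely.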
end of thread, other threads:[~2007-05-23 7:58 UTC | newest]
Thread overview: 233+ messages
2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
2007-05-01 0:07 ` Dave Jones
2007-05-01 0:09 ` Andrew Morton
2007-05-01 0:24 ` Jeff Garzik
2007-05-01 0:40 ` [stable] " Chris Wright
2007-05-01 0:45 ` Jeff Garzik
2007-05-01 4:58 ` Greg KH
2007-05-01 16:14 ` Chuck Ebbert
2007-05-01 16:40 ` Alan Cox
2007-05-01 23:34 ` Greg KH
2007-05-02 0:52 ` Chris Wright
2007-05-02 14:10 ` Chuck Ebbert
2007-05-01 9:49 ` Alan Cox
2007-04-30 23:59 ` 2.6.22 -mm merge plans Bill Irwin
2007-05-01 0:09 ` nfsd/md patches " Neil Brown
2007-05-01 9:08 ` Christoph Hellwig
2007-05-01 9:15 ` Andrew Morton
2007-05-01 9:21 ` Christoph Hellwig
2007-05-01 9:52 ` Neil Brown
2007-05-01 10:15 ` Christoph Hellwig
2007-05-01 14:34 ` Trond Myklebust
2007-05-01 0:54 ` MADV_FREE functionality Rik van Riel
2007-05-01 1:18 ` Andrew Morton
2007-05-01 1:23 ` Rik van Riel
2007-05-01 7:13 ` Jakub Jelinek
2007-05-01 1:23 ` Ulrich Drepper
2007-05-01 1:39 ` 2.6.22 -mm merge plans Stefan Richter
2007-05-01 2:30 ` 2.6.22 -mm merge plans (RE: input) Dmitry Torokhov
2007-05-01 8:14 ` Jiri Slaby
2007-05-01 12:05 ` Dmitry Torokhov
2007-05-01 8:11 ` 2.6.22 -mm merge plans -- pfn_valid_within Andy Whitcroft
2007-05-01 8:19 ` Andrew Morton
2007-05-01 8:42 ` "partical" kthread conversion Christoph Hellwig
2007-05-01 8:51 ` Andrew Morton
2007-05-02 14:01 ` Dean Nelson
2007-05-02 14:45 ` Eric W. Biederman
2007-05-02 15:37 ` Dean Nelson
2007-05-02 15:49 ` Eric W. Biederman
2007-05-02 19:33 ` Andrew Morton
2007-05-02 20:38 ` Eric W. Biederman
2007-05-01 8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
2007-05-01 8:54 ` Andrew Morton
2007-05-01 19:31 ` Hugh Dickins
2007-05-02 3:08 ` Nick Piggin
2007-05-02 9:15 ` Nick Piggin
2007-05-02 14:00 ` Hugh Dickins
2007-05-03 1:32 ` Nick Piggin
2007-05-03 10:37 ` Christoph Hellwig
2007-05-03 12:56 ` Nick Piggin
2007-05-04 9:23 ` Nick Piggin
2007-05-04 9:43 ` Nick Piggin
2007-05-08 3:03 ` Benjamin Herrenschmidt
2007-05-03 12:24 ` Hugh Dickins
2007-05-03 12:43 ` Nick Piggin
2007-05-03 12:58 ` Hugh Dickins
2007-05-03 13:08 ` Nick Piggin
2007-05-03 16:52 ` Andrew Morton
2007-05-04 4:16 ` Nick Piggin
2007-05-09 12:34 ` Nick Piggin
2007-05-09 14:28 ` Hugh Dickins
2007-05-09 14:45 ` Nick Piggin
2007-05-09 15:38 ` Hugh Dickins
2007-05-09 22:24 ` Nick Piggin
2007-05-01 8:46 ` pcmcia ioctl removal Christoph Hellwig
2007-05-01 8:56 ` Russell King
2007-05-01 8:57 ` Willy Tarreau
2007-05-01 9:08 ` Andrew Morton
2007-05-01 14:46 ` Adrian Bunk
2007-05-01 9:16 ` Robert P. J. Day
2007-05-01 9:44 ` Willy Tarreau
2007-05-01 10:16 ` Robert P. J. Day
2007-05-01 10:26 ` Gabriel C
2007-05-01 10:52 ` Willy Tarreau
2007-05-01 10:12 ` Jan Engelhardt
2007-05-01 11:00 ` Willy Tarreau
2007-05-01 12:06 ` Konstantin Münning
2007-05-01 13:56 ` Rogan Dawes
2007-05-01 19:10 ` Russell King
2007-05-01 20:41 ` Jan Engelhardt
2007-05-09 12:54 ` Pavel Machek
2007-05-09 13:00 ` Robert P. J. Day
2007-05-09 13:03 ` Adrian Bunk
2007-05-09 19:11 ` Romano Giannetti
2007-05-10 12:40 ` Adrian Bunk
2007-05-01 8:48 ` pci hotplug patches Christoph Hellwig
2007-05-02 3:57 ` Greg KH
2007-05-13 20:59 ` Christoph Hellwig
2007-05-14 11:48 ` Greg KH
2007-05-01 8:54 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Christoph Hellwig
[not found] ` <20070501020441.10b6a003.akpm@linux-foundation.org>
2007-05-03 3:48 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Ken Chen
2007-05-01 8:55 ` consolidate-generic_writepages-and-mpage_writepages.patch Christoph Hellwig
2007-05-01 9:17 ` 2.6.22 -mm merge plans Pekka Enberg
2007-05-01 9:24 ` Christoph Hellwig
2007-05-01 9:37 ` Peter Zijlstra
2007-05-01 12:19 ` Andi Kleen
2007-05-01 17:12 ` Pekka Enberg
2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
2007-05-01 13:02 ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
2007-05-01 18:03 ` Peter Zijlstra
2007-05-01 19:00 ` Andrew Morton
2007-05-01 14:54 ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter
2007-05-01 19:00 ` Mel Gorman
2007-05-01 18:57 ` Andrew Morton
2007-05-07 13:07 ` Yasunori Goto
2007-05-01 12:17 ` Andi Kleen
2007-05-01 22:08 ` Mathieu Desnoyers
2007-05-02 10:44 ` Andi Kleen
2007-05-02 16:37 ` Frank Ch. Eigler
2007-05-02 16:47 ` Andrew Morton
2007-05-02 17:29 ` Christoph Hellwig
2007-05-02 20:36 ` Mathieu Desnoyers
2007-05-02 20:53 ` Andrew Morton
2007-05-02 23:11 ` Mathieu Desnoyers
2007-05-02 23:21 ` Andrew Morton
2007-05-03 15:04 ` Mathieu Desnoyers
2007-05-03 15:12 ` Christoph Hellwig
2007-05-03 17:16 ` Mathieu Desnoyers
2007-05-03 17:25 ` Christoph Hellwig
2007-05-10 19:39 ` Mathieu Desnoyers
2007-05-13 21:04 ` Christoph Hellwig
2007-05-03 8:06 ` Christoph Hellwig
2007-05-03 14:43 ` Mathieu Desnoyers
2007-05-03 10:31 ` Andi Kleen
2007-05-03 14:49 ` Mathieu Desnoyers
2007-05-03 8:09 ` Christoph Hellwig
2007-05-03 8:08 ` Christoph Hellwig
2007-05-02 17:49 ` Andi Kleen
2007-05-02 21:46 ` Tilman Schmidt
2007-05-03 10:12 ` Andi Kleen
2007-05-02 17:19 ` Mathieu Desnoyers
2007-05-02 0:31 ` Rusty Russell
2007-05-02 10:30 ` Andi Kleen
2007-05-01 13:06 ` file capabilities and security_task_wait failure " Stephen Smalley
2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins
2007-05-02 1:42 ` Nick Piggin
2007-05-02 13:17 ` Hugh Dickins
2007-05-03 0:18 ` Nick Piggin
2007-05-01 16:56 ` 2.6.22 -mm merge plans Zan Lynx
2007-05-01 17:06 ` 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix Hugh Dickins
2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins
2007-05-01 19:25 ` Christoph Lameter
2007-05-01 19:55 ` Andrew Morton
2007-05-01 20:19 ` Hugh Dickins
2007-05-01 20:36 ` Andrew Morton
2007-05-01 20:46 ` Christoph Lameter
2007-05-01 21:09 ` Andrew Morton
2007-05-02 12:54 ` Hugh Dickins
2007-05-02 17:03 ` Christoph Lameter
2007-05-02 19:11 ` Andrew Morton
2007-05-02 19:42 ` Christoph Lameter
2007-05-02 19:54 ` Sam Ravnborg
2007-05-02 20:14 ` Christoph Lameter
2007-05-02 18:52 ` Siddha, Suresh B
2007-05-02 18:58 ` Christoph Lameter
2007-05-01 21:08 ` Christoph Lameter
2007-05-02 12:45 ` Hugh Dickins
2007-05-02 17:01 ` Christoph Lameter
2007-05-02 18:08 ` Hugh Dickins
2007-05-02 18:28 ` Christoph Lameter
2007-05-02 18:42 ` Andrew Morton
2007-05-02 18:53 ` Christoph Lameter
2007-05-02 17:25 ` Christoph Lameter
2007-05-02 18:36 ` Hugh Dickins
2007-05-02 18:39 ` Christoph Lameter
2007-05-02 18:57 ` Andrew Morton
2007-05-02 19:01 ` Christoph Lameter
2007-05-02 19:18 ` Pekka Enberg
2007-05-02 19:34 ` Christoph Lameter
2007-05-02 19:43 ` Christoph Lameter
2007-05-03 8:15 ` Andrew Morton
2007-05-03 8:27 ` William Lee Irwin III
2007-05-03 16:30 ` Christoph Lameter
2007-05-03 8:46 ` Hugh Dickins
2007-05-03 8:57 ` Andrew Morton
2007-05-03 9:15 ` Hugh Dickins
2007-05-03 21:04 ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins
2007-05-03 21:15 ` Christoph Lameter
2007-05-03 22:41 ` Hugh Dickins
2007-05-04 0:25 ` Benjamin Herrenschmidt
2007-05-04 0:54 ` Christoph Lameter
2007-05-03 16:45 ` 2.6.22 -mm merge plans: slub Christoph Lameter
2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
2007-05-03 16:15 ` Michal Piotrowski
2007-05-03 16:23 ` Michal Piotrowski
2007-05-03 22:14 ` Con Kolivas
2007-05-04 7:34 ` Nick Piggin
2007-05-04 8:52 ` Ingo Molnar
2007-05-04 9:09 ` Nick Piggin
2007-05-04 12:10 ` Con Kolivas
2007-05-05 8:42 ` Con Kolivas
2007-05-06 10:13 ` [ck] " Antonino Ingargiola
2007-05-06 18:22 ` Jory A. Pratt
2007-05-09 23:28 ` Con Kolivas
2007-05-10 0:05 ` Nick Piggin
2007-05-10 1:34 ` Con Kolivas
2007-05-10 1:56 ` Nick Piggin
2007-05-10 3:48 ` Ray Lee
2007-05-10 3:56 ` Nick Piggin
2007-05-10 5:52 ` Ray Lee
2007-05-10 7:04 ` Nick Piggin
2007-05-10 7:20 ` William Lee Irwin III
2007-05-10 12:34 ` Ray Lee
2007-05-12 4:46 ` [PATCH] mm: swap prefetch improvements Con Kolivas
2007-05-12 5:03 ` Paul Jackson
2007-05-12 5:15 ` Con Kolivas
2007-05-12 5:51 ` Paul Jackson
2007-05-12 7:28 ` Con Kolivas
2007-05-12 8:14 ` Paul Jackson
2007-05-12 8:21 ` Con Kolivas
2007-05-12 8:37 ` Paul Jackson
2007-05-12 8:57 ` [PATCH respin] " Con Kolivas
2007-05-21 10:03 ` [PATCH] " Ingo Molnar
2007-05-21 13:44 ` Con Kolivas
2007-05-21 16:00 ` Ingo Molnar
2007-05-22 10:15 ` Antonino Ingargiola
2007-05-22 10:20 ` Con Kolivas
2007-05-22 10:25 ` Ingo Molnar
2007-05-22 10:37 ` Con Kolivas
2007-05-22 10:46 ` Ingo Molnar
2007-05-22 10:54 ` Con Kolivas
2007-05-22 10:57 ` Ingo Molnar
2007-05-22 11:04 ` Con Kolivas
[not found] ` <20070522111104.GA14950@elte.hu>
2007-05-22 11:12 ` Ingo Molnar
2007-05-22 20:18 ` [ck] " Michael Chang
2007-05-22 20:31 ` Ingo Molnar
2007-05-22 20:42 ` Ash Milsted
2007-05-22 22:50 ` Con Kolivas
2007-05-23 7:57 ` Ash Milsted
2007-05-10 3:58 ` swap-prefetch: 2.6.22 -mm merge plans Con Kolivas
2007-05-07 14:28 ` Bill Davidsen
2007-05-07 14:18 ` Bill Davidsen
2007-05-07 17:47 ` Josef Sipek