2.6.22 -mm merge plans

From: Andrew Morton @ 2007-04-30 23:20 UTC
  To: linux-kernel; +Cc: linux-mm

- If replying, please be sure to cc the appropriate individuals.  Please
  also consider rewriting the Subject: to something appropriate.

- I'll cc linux-mm on this - the memory-management situation is complicated.

- The overall stability in recent -mm's was not sufficiently high and we ran
  out of time to find all the bugs.  I shouldn't have merged all those patches
  last week - they contained an exceptional amount of garbage.

  This all means that more bugs than usual will probably leak into mainline,
  and we'll have to fix them there.

- I've been ducking most non-bugfix patches recently.  I have ~200 feature
  and cleanup patches queued for later consideration, so people who sent those
  will be hearing from me eventually.

 extend-print_symbol-capability.patch
 reiserfs-suppress-lockdep-warning.patch
 rework-pm_ops-pm_disk_mode-kill-misuse.patch
 power-management-remove-firmware-disk-mode.patch
 power-management-implement-pm_opsvalid-for-everybody.patch
 power-management-force-pm_opsvalid-callback-to-be.patch
 add-kvasprintf.patch
 pm-include-eio-from-errno-baseh.patch

Sent

 ia64-race-flushing-icache-in-do_no_page-path.patch

People are still discussing this

 zlib-backout.patch

A huge zlib revert patch.  It's a last resort for bug #8405, which is still
being worked on.  2.6.20.x needs fixing, too.

 networking-fix-sending-netlink-message-when-replace-route.patch

Will send to davem

 slab-introduce-krealloc.patch

Will merge soon

 exit-acpi-processor-module-gracefully-if-acpi-is-disabled.patch

Will send to Len

 remove-unused-header-file-arch-arm-mach-s3c2410-basth.patch
 iop13xx-msi-support-rev6.patch
 arm-remove-useless-config-option-generic_bust_spinlock.patch

Will send to rmk

 cifs-use-mutexdiff.patch
 cifs-use-simple_prepare_write-to-zero-page-data.patch

Will send to sfrench

 macintosh-mediabay-convert-to-kthread-api.patch
 macintosh-adb-convert-to-the-kthread-api.patch
 macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
 powerpc-pseries-rtasd-convert-to-kthread-api.patch
 powerpc-pseries-eeh-convert-to-kthread-api.patch

Will send to paulus (I already did - does Paul not handle the macintosh
driver?)

 revert-gregkh-driver-remove-struct-subsystem-as-it-is-no-longer-needed.patch

This is here because Greg's tree wrecks Dmitry's tree.  Will drop once they
sort it out.

 idr-fix-obscure-bug-in-allocation-path.patch
 idr-separate-out-idr_mark_full.patch
 ida-implement-idr-based-id-allocator.patch
 ida-implement-idr-based-id-allocator-fix.patch

These will go in via Greg's tree.

 fix-sysfs-rom-file-creation-for-bios-rom-shadows.patch
 more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
 even-more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
 even-even-more-fix-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
 acpi-driver-model-flags-and-platform_enable_wake.patch
 update-documentation-driver-model-platformtxt.patch
 power-management-remove-some-useless-code-from-arm.patch

Will send to Greg for the driver tree

 git-dvb.patch
 dvb_en_50221-convert-to-kthread-api.patch
 mm-only-saa7134-tvaudio-convert-to-kthread-api.patch
 git-dvb-vs-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch

For Mauro

 i2c-tsl2550-support.patch
 apple-smc-driver-hardware-monitoring-and-control.patch

For Jean

 ia64-sn-xpc-convert-to-use-kthread-api.patch
 ia64-sn-xpc-convert-to-use-kthread-api-fix.patch
 ia64-sn-xpc-convert-to-use-kthread-api-fix-2.patch
 spin_lock_unlocked-macro-cleanup-in-arch-ia64.patch

For Tony

 sbp2-include-fixes.patch
 ieee1394-iso-needs-schedh.patch

For Stephan

 input-convert-from-class-devices-to-standard-devices.patch
 input-evdev-implement-proper-locking.patch
 mousedev-fix.patch
 mousedev-fix-2.patch

Dmitry will merge these once Greg has merged the preparatory work.  Except these
patches make the Vaio-of-doom crash in obscure circumstances, and we weren't
able to fix that?

 wistron_btns-add-led-support.patch
 input-ff-add-ff_raw-effect.patch
 input-phantom-add-a-new-driver.patch

For Dmitry

 kconfig-abort-configuration-with-recursive-dependencies.patch
 kbuild-handle-compressed-cpio-initramfs-es.patch

For Sam and Roman

 ahci-crash-fix.patch
 libata-acpi-add-infrastructure-for-drivers-to-use.patch
 pata_acpi-restore-driver.patch
 optional-led-trigger-for-libata.patch
 ata_timing-ensure-t-cycle-is-always-correct.patch
 pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
 drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
 pata_icside-driver.patch

ata stuff

 sl82c105-switch-to-ref-counting-api.patch

For Bart

 mmc-omap-add-missing-newline.patch
 mmc-omap-fix-omap-to-use-mmc_power_on.patch
 mmc-omap-clean-up-omap-set_ios-and-make-mmc_power_on.patch

Not sure.  These hit three different subsystems: arm, omap and mmc.  I might
just send them in.

 nommu-present-backing-device-capabilities-for-mtd.patch
 nommu-add-support-for-direct-mapping-through-mtdconcat.patch
 nommu-generalise-the-handling-of-mtd-specific-superblocks.patch
 nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch
 romfs-printk-format-warnings.patch
 dont-force-uclinux-mtd-map-to-be-root-dev.patch

For dwmw2 (again?)

 8139too-force-media-setting-fix.patch
 sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
 forcedeth-improve-napi-logic.patch
 ne-add-platform_driver.patch
 ne-add-platform_driver-fix.patch
 ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
 mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
 ibmtr_cs-fix-hang-on-eject.patch

For netdev tree

 2621-rc5-mm3-fix-e1000-compilation.patch

Will re-re-resend to Auke

 ppp_generic-fix-lockdep-warning.patch

Jeff, I guess.  It's not clear that this is correct.

 input-rfkill-add-support-for-input-key-to-control-wireless-radio.patch

Will resend to davem once the preparatory bits are merged by Greg.

 bluetooth-add-sco-work-around-for-the-broadcom.patch

Will resend to Marcel

 fix-i-oat-for-kexec.patch

Will re-re-re-re-resend to Dan

 auth_gss-unregister-gss_domain-when-unloading-module.patch
 nfs-kill-the-obsolete-nfs_paranoia.patch
 nfs-statfs-error-handling-fix.patch
 nfs-use-__set_current_state.patch
 nfs-suppress-warnings-about-nfs4err_old_stateid-in-nfs4_handle_exception.patch

For Trond

 round_up-macro-cleanup-in-drivers-parisc.patch

Will re-re-resend to Kyle.

 pcmcia-pccard-deadlock-fix.patch
 pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
 at91_cf-minor-fix.patch
 add-new_id-to-pcmcia-drivers.patch
 ide-cs-recognize-2gb-compactflash-from-transcend.patch

Dominik is busy.  Will probably re-review and send these direct to Linus.

 serial-driver-pmc-msp71xx.patch
 rm9000-serial-driver.patch
 serial-define-fixed_port-flag-for-serial_core.patch
 serial-use-resource_size_t-for-serial-port-io-addresses.patch
 mpsc-serial-driver-tx-locking.patch
 serial-suppress-rts-assertion-with-disabled-crtscts.patch
 8250_pci-fix-pci-must_checks.patch

Seems that I'm maintaining serial now.  Will re-review, re-check with rmk then
send.

 fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch
 remove-pci_dac_dma_-apis.patch
 round_up-macro-cleanup-in-drivers-pci.patch
 pcie-remove-spin_lock_unlocked.patch
 cpqphp-partially-convert-to-use-the-kthread-api.patch
 ibmphp-partially-convert-to-use-the-kthreads-api.patch
 cpci_hotplug-partially-convert-to-use-the-kthread-api.patch
 msi-fix-arm-compile.patch
 support-pci-mcfg-space-on-intel-i915-bridges.patch
 pci-syscallc-switch-to-refcounting-api.patch

Stuff to (various levels of re-)send to Greg for the PCI tree.  I'll probably
drop the kthread patches as they seemed a bit half-baked and I've lost track
of which ones have which levels of baking.

 pci-device-ensure-sysdata-initialised-v2.patch

This is for Jeff's git-pciseg.patch which is sort-of on hold at present.

 git-s390-vs-gregkh-driver-sysfs-kill-unnecessary-attribute-owner.patch
 s390-scsi-zfcp_erp-partially-convert-to-use-the-kthread-api.patch
 s390-qeth-convert-to-use-the-kthread-api.patch
 s390-net-lcs-convert-to-the-kthread-api.patch

For Martin

 round_up-macro-cleanup-in-arch-sh64-kernel-pci_sh5c.patch

For Paul

 drivers-scsi-small-cleanups.patch
 drivers-scsi-advansysc-cleanups.patch
 megaraid-fix-warnings-when-config_proc_fs=n.patch
 remove-unnecessary-check-in-drivers-scsi-sgc.patch
 pci_module_init-convertion-in-tmscsimc.patch
 drivers-scsi-ncr5380c-replacing-yield-with-a.patch
 drivers-scsi-megaraidc-replacing-yield-with-a.patch
 drivers-scsi-mca_53c9xc-save_flags-cli-removal.patch
 sym53c8xx_2-claims-cpqarray-device.patch
 drivers-scsi-wd33c93c-cleanups.patch
 scsi-cover-up-bugs-fix-up-compiler-warnings-in-megaraid-driver.patch
 drivers-scsi-qla4xxx-possible-cleanups.patch
 make-seagate_st0x_detect-static.patch
 scsi-fix-obvious-typo-spin_lock_irqrestore-in-gdthc.patch
 drivers-scsi-aic7xxx_old-convert-to-generic-boolean-values.patch
 cleanup-variable-usage-in-mesh-interrupt-handler.patch
 fix--confusion-in-fusion-driver.patch
 use-unchecked_isa_dma-in-sd_revalidate_disk.patch
 fdomainc-get-rid-of-unused-stuff.patch
 remove-the-broken-scsi_acornscsi_3-driver.patch
 scsi-fix-config_scsi_wait_scan=m.patch
 sas_scsi_host-partially-convert-to-use-the-kthread-api.patch
 qla1280-use-dma_64bit_mask-instead-of-0ull.patch
 pci-error-recovery-symbios-scsi-base-support.patch
 pci-error-recovery-symbios-scsi-first-failure.patch

Will re^N-send to James.

 sparc64-powerc-convert-to-use-the-kthread-api.patch

Might drop, might send to davem.

 git-unionfs.patch

Does this have a future?

 cxacru-add-documentation-file.patch
 cxacru-cleanup-sysfs-attribute-code.patch

For Greg.

 i386-map-enough-initial-memory-to-create-lowmem-mappings-fix.patch
 fault-injection-disable-stacktrace-filter-for-x86-64.patch
 i386-efi-fix-proc-iomem-type-for-kexec-tools.patch
 fault-injection-enable-stacktrace-with-dwarf2-unwinder.patch
 i386-__inquire_remote_apic-printk-warning-fix.patch
 x86-msr-add-support-for-safe-variants.patch

For Andi

 xfs-clean-up-shrinker-games.patch
 xfs-fix-unmount-race.patch

For David

 add-apply_to_page_range-which-applies-a-function-to-a-pte-range.patch
 add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix.patch
 safer-nr_node_ids-and-nr_node_ids-determination-and-initial.patch
 use-zvc-counters-to-establish-exact-size-of-dirtyable-pages.patch
 proper-prototype-for-hugetlb_get_unmapped_area.patch
 mm-remove-gcc-workaround.patch
 slab-ensure-cache_alloc_refill-terminates.patch
 mm-more-rmap-checking.patch
 mm-make-read_cache_page-synchronous.patch
 fs-buffer-dont-pageuptodate-without-page-locked.patch
 allow-oom_adj-of-saintly-processes.patch
 introduce-config_has_dma.patch
 mm-slabc-proper-prototypes.patch
 mm-detach_vmas_to_be_unmapped-fix.patch

Misc MM things.  Will merge.

 add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch
 add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch
 split-the-free-lists-for-movable-and-unmovable-allocations.patch
 choose-pages-from-the-per-cpu-list-based-on-migration-type.patch
 add-a-configure-option-to-group-pages-by-mobility.patch
 drain-per-cpu-lists-when-high-order-allocations-fail.patch
 move-free-pages-between-lists-on-steal.patch
 group-short-lived-and-reclaimable-kernel-allocations.patch
 group-high-order-atomic-allocations.patch
 do-not-group-pages-by-mobility-type-on-low-memory-systems.patch
 bias-the-placement-of-kernel-pages-at-lower-pfns.patch
 be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch
 fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch
 create-the-zone_movable-zone.patch
 allow-huge-page-allocations-to-use-gfp_high_movable.patch
 x86-specify-amount-of-kernel-memory-at-boot-time.patch
 ppc-and-powerpc-specify-amount-of-kernel-memory-at-boot-time.patch
 x86_64-specify-amount-of-kernel-memory-at-boot-time.patch
 ia64-specify-amount-of-kernel-memory-at-boot-time.patch
 add-documentation-for-additional-boot-parameter-and-sysctl.patch
 handle-kernelcore=-boot-parameter-in-common-code-to-avoid-boot-problem-on-ia64.patch

Mel's moveable-zone work.

I don't believe that this has had sufficient review and I'm sure that it
hasn't had sufficient third-party testing.  Most of the approbations thus far
have consisted of people liking the overall idea, based on the changelogs and
multi-year-old discussions.

For such a large and core change I'd have expected more detailed reviewing
effort and more third-party testing.  And I STILL haven't made time to review
the code in detail myself.

So I'm a bit uncomfortable with moving ahead with these changes.

 mm-simplify-filemap_nopage.patch
 mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
 mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
 mm-merge-nopfn-into-fault.patch
 convert-hugetlbfs-to-use-vm_ops-fault.patch
 mm-remove-legacy-cruft.patch
 mm-debug-check-for-the-fault-vs-invalidate-race.patch
 mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
 add-unitialized_var-macro-for-suppressing-gcc-warnings.patch
 i386-add-ptep_test_and_clear_dirtyyoung.patch
 i386-use-pte_update_defer-in-ptep_test_and_clear_dirtyyoung.patch

Miscish MM changes.  Will merge, dependent upon what still applies and works
if the moveable-zone patches get stalled.

 smaps-extract-pmd-walker-from-smaps-code.patch
 smaps-add-pages-referenced-count-to-smaps.patch
 smaps-add-clear_refs-file-to-clear-reference.patch

Referenced-page accounting in /proc/pid/smaps.  Is related to the maps2
patches.  Will merge.
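
As a rough illustration of how userspace might consume the referenced-page
accounting, here is a minimal sketch that totals the per-mapping counts.  It
assumes each mapping reports a line of the form "Referenced: <n> kB", which is
how later mainline kernels present it; the patches as queued here may differ.
The function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum every "Referenced: <n> kB" line in smaps-formatted text.
 * Assumption: the field name and kB unit follow later mainline kernels.
 */
unsigned long sum_referenced_kb(const char *smaps_text)
{
    unsigned long total = 0;
    const char *line = smaps_text;

    while (line && *line) {
        unsigned long kb;

        /* Lines that do not start with "Referenced:" simply fail to match. */
        if (sscanf(line, "Referenced: %lu kB", &kb) == 1)
            total += kb;

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return total;
}

/* Total referenced kB for one smaps file, or -1 if it cannot be opened. */
long referenced_kb_of(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[256];
    unsigned long total = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        total += sum_referenced_kb(line);
    fclose(f);
    return (long)total;
}
```

Calling referenced_kb_of("/proc/self/smaps") would give the figure for the
calling process on a kernel that exposes the field.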

 maps2-uninline-some-functions-in-the-page-walker.patch
 maps2-eliminate-the-pmd_walker-struct-in-the-page-walker.patch
 maps2-remove-vma-from-args-in-the-page-walker.patch
 maps2-propagate-errors-from-callback-in-page-walker.patch
 maps2-add-callbacks-for-each-level-to-page-walker.patch
 maps2-move-the-page-walker-code-to-lib.patch
 maps2-simplify-interdependence-of-proc-pid-maps-and-smaps.patch
 maps2-move-clear_refs-code-to-task_mmuc.patch
 maps2-regroup-task_mmu-by-interface.patch
 maps2-make-proc-pid-smaps-optional-under-config_embedded.patch
 maps2-make-proc-pid-clear_refs-option-under-config_embedded.patch
 maps2-add-proc-pid-pagemap-interface.patch
 maps2-add-proc-kpagemap-interface.patch

/proc/pid/pagemap and /proc/kpagemap.  A fairly important and low-level way of
exposing memory state to userspace, for developers.

Matt still has a decent-sized todo list here.  Might merge, might hold over
for 2.6.23.
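
To make "exposing memory state to userspace" concrete, here is a hedged sketch
of decoding one pagemap entry.  It assumes the layout documented for later
mainline kernels (one 64-bit entry per virtual page, bit 63 = present, bit 62 =
swapped, bits 0-54 = page frame number); the maps2 patches as queued here may
use a different format.  All names below are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed layout, taken from later mainline pagemap documentation. */
#define PM_PRESENT  (UINT64_C(1) << 63)
#define PM_SWAPPED  (UINT64_C(1) << 62)
#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1)

struct pagemap_info {
    int present;    /* page is resident in RAM */
    int swapped;    /* page is in swap */
    uint64_t pfn;   /* page frame number, meaningful only if present */
};

/* Split a raw 64-bit entry into its fields. */
struct pagemap_info pagemap_decode(uint64_t entry)
{
    struct pagemap_info info;

    info.present = (entry & PM_PRESENT) != 0;
    info.swapped = (entry & PM_SWAPPED) != 0;
    info.pfn = info.present ? (entry & PM_PFN_MASK) : 0;
    return info;
}

/* Byte offset of the entry describing vaddr: one 8-byte entry per page. */
uint64_t pagemap_offset(uintptr_t vaddr, long page_size)
{
    return (uint64_t)(vaddr / (uintptr_t)page_size) * sizeof(uint64_t);
}

/* Read the raw entry for an address in the calling process; 0 on success. */
int pagemap_read_self(const void *addr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pread(fd, entry, sizeof *entry,
              (off_t)pagemap_offset((uintptr_t)addr, page_size));
    close(fd);
    return n == (ssize_t)sizeof *entry ? 0 : -1;
}
```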

 lumpy-reclaim-v4.patch

This is in a similar situation to the moveable-zone work.  Sounds great on
paper, but it needs considerable third-party testing and review.  It is a
major change to core MM and, we hope, a significant advance.  On paper.

 add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
 anti-fragmentation-switch-over-to-pfn_valid_within.patch
 lumpy-move-to-using-pfn_valid_within.patch

More Mel things, and linkage between Mel-things and lumpy reclaim.  This is
where the patch ordering gets into a mess, and things won't improve if
moveable-zones and lumpy-reclaim get deferred.  Such a deferral would limit my
ability to queue more MM changes for 2.6.23.

 readahead-improve-heuristic-detecting-sequential-reads.patch
 readahead-code-cleanup.patch

Will merge.

 bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch
 remove-page_group_by_mobility.patch
 dont-group-high-order-atomic-allocations.patch

More moveable-zone work.

 mm-move-common-segment-checks-to-separate-helper-function-v7.patch
 slab-use-num_possible_cpus-in-enable_cpucache.patch
 slab-dont-allocate-empty-shared-caches.patch
 slab-numa-kmem_cache-diet.patch
 do-not-disable-interrupts-when-reading-min_free_kbytes.patch
 slab-mark-set_up_list3s-__init.patch
 mm-clean-up-and-kernelify-shrinker-registration.patch
 fix-section-mismatch-of-memory-hotplug-related-code.patch
 add-white-list-into-modpostc-for-memory-hotplug-code-and-ia64s-machvec-section.patch
 split-mmap.patch
 only-allow-nonlinear-vmas-for-ram-backed-filesystems.patch
 cpusets-allow-tif_memdie-threads-to-allocate-anywhere.patch

More MM misc.  Will merge those patches which survive other merge decisions.

 i386-use-page-allocator-to-allocate-thread_info-structure.patch
 slub-core.patch

slub.  Or part thereof.  This is another patch series which got messed up by
poor patch sequencing.

 make-page-private-usable-in-compound-pages-v1.patch
 optimize-compound_head-by-avoiding-a-shared-page.patch
 add-virt_to_head_page-and-consolidate-code-in-slab-and-slub.patch
 slub-fix-object-tracking.patch
 slub-enable-tracking-of-full-slabs.patch
 slub-validation-of-slabs-metadata-and-guard-zones.patch
 slub-add-min_partial.patch
 slub-add-ability-to-list-alloc--free-callers-per-slab.patch
 slub-free-slabs-and-sort-partial-slab-lists-in-kmem_cache_shrink.patch
 slub-remove-object-activities-out-of-checking-functions.patch
 slub-user-documentation.patch
 slub-add-slabinfo-tool.patch

Most of the rest of slub.  Will merge it all.

 quicklists-for-page-table-pages.patch
 quicklist-support-for-ia64.patch
 quicklist-support-for-x86_64.patch
 quicklist-support-for-sparc64.patch

Will merge

 slob-handle-slab_panic-flag.patch
 include-kern_-constant-in-printk-calls-in-mm-slabc.patch
 mm-madvise-avoid-exclusive-mmap_sem.patch
 mm-remove-destroy_dirty_buffers-from-invalidate_bdev.patch
 mm-optimize-kill_bdev.patch
 mm-optimize-acorn-partition-truncate.patch
 slab-allocators-remove-obsolete-slab_must_hwcache_align.patch
 kmem_cache-simplify-slab-cache-creation.patch
 slab-allocators-remove-multiple-alignment-specifications.patch
 use-slab_panic-flag-cleanup.patch
 fault-injection-fix-failslab-with-config_numa.patch
 mm-document-fault_data-and-flags.patch
 mm-fix-handling-of-panic_on_oom-when-cpusets-are-in-use.patch
 oom-fix-constraint-deadlock.patch

More MM misc.  Will merge.

 get_unmapped_area-handles-map_fixed-on-powerpc.patch
 get_unmapped_area-handles-map_fixed-on-alpha.patch
 get_unmapped_area-handles-map_fixed-on-arm.patch
 get_unmapped_area-handles-map_fixed-on-frv.patch
 get_unmapped_area-handles-map_fixed-on-i386.patch
 get_unmapped_area-handles-map_fixed-on-ia64.patch
 get_unmapped_area-handles-map_fixed-on-parisc.patch
 get_unmapped_area-handles-map_fixed-on-sparc64.patch
 get_unmapped_area-handles-map_fixed-on-x86_64.patch
 get_unmapped_area-handles-map_fixed-in-hugetlbfs.patch
 get_unmapped_area-handles-map_fixed-in-generic-code.patch
 get_unmapped_area-doesnt-need-hugetlbfs-hacks-anymore.patch

Will merge.

 slub-exploit-page-mobility-to-increase-allocation-order.patch

Slub entanglement with moveable-zones.  Will merge if moveable-zones is merged.

 slab-allocators-remove-slab_debug_initial-flag.patch
 slab-allocators-remove-slab_ctor_atomic.patch
 slub-mm-only-make-slub-the-default-slab-allocator.patch

Various slab-related patches which are dependent upon multiple previous
patches.

 slub-i386-support.patch

Will hold for a while.

 lazy-freeing-of-memory-through-madv_free.patch
 lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
 restore-madv_dontneed-to-its-original-linux-behaviour.patch

I think the MADV_FREE changes need more work:

We need crystal-clear statements regarding the present functionality, the new
functionality and how these relate to the spec and to implementations in other
OSes.  Once we have that info we are in a position to work out whether the
code can be merged as-is, or if additional changes are needed.

Because right now, I don't know where we are with respect to these things and
I doubt if many of our users know either.  How can Michael write a manpage for
this if we don't tell him what it all does?
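
One way to pin down "present functionality" is a small userspace probe.  The
sketch below assumes Linux and a private anonymous mapping, where
MADV_DONTNEED discards the pages and a later read sees zero-filled memory.  It
says nothing about what MADV_FREE would do; the function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Dirty an anonymous private mapping, advise MADV_DONTNEED, then read the
 * first byte back.  Returns that byte (0..255), or -1 if a call fails.
 * On Linux the expected result is 0: the dirtied pages are dropped and the
 * next access is satisfied with a fresh zero page.
 */
int byte_after_dontneed(size_t len)
{
    unsigned char *p;
    int result;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAA, len);               /* make every page dirty */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    result = p[0];                      /* touch the first page again */
    munmap(p, len);
    return result;
}
```

Running the same probe on other operating systems, or with a different advice
value, is the kind of comparison the statements asked for above would need to
spell out.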

 implement-file-posix-capabilities.patch
 file-capabilities-accomodate-future-64-bit-caps.patch
 return-eperm-not-echild-on-security_task_wait-failure.patch

I think we're still waiting for the security guys to work out what to do with
this work.

 blackfin-arch.patch
 driver_bfin_serial_core.patch
 blackfin-on-chip-ethernet-mac-controller-driver.patch
 blackfin-patch-add-blackfin-support-in-smc91x.patch
 blackfin-on-chip-rtc-controller-driver.patch
 blackfin-blackfin-on-chip-spi-controller-driver.patch

 convert-h8-300-to-generic-timekeeping.patch
 h8300-generic-irq.patch
 h8300-add-zimage-support.patch

 round_up-macro-cleanup-in-arch-alpha-kernel-osf_sysc.patch
 alpha-fix-bootp-image-creation.patch
 alpha-prctl-macros.patch
 srmcons-fix-kmallocgfp_kernel-inside-spinlock.patch

 arm26-remove-useless-config-option-generic_bust_spinlock.patch

arch stuff.  Will merge.

 fix-refrigerator-vs-thaw_process-race.patch
 swsusp-use-inline-functions-for-changing-page-flags.patch
 swsusp-do-not-use-page-flags.patch
 mm-remove-unused-page-flags.patch
 swsusp-fix-error-paths-in-snapshot_open.patch
 swsusp-use-gfp_kernel-for-creating-basic-data-structures.patch
 freezer-remove-pf_nofreeze-from-handle_initrd.patch
 swsusp-use-rbtree-for-tracking-allocated-swap.patch
 freezer-fix-racy-usage-of-try_to_freeze-in-kswapd.patch
 remove-software_suspend.patch
 power-management-change-sys-power-disk-display.patch
 kconfig-mentioneds-hibernation-not-just-swsusp.patch
 swsusp-fix-snapshot_release.patch
 swsusp-free-more-memory.patch

swsusp: will merge.

 remove-unused-header-file-arch-m68k-atari-atasoundh.patch
 spin_lock_unlocked-cleanup-in-arch-m68k.patch

 remove-unused-header-file-drivers-serial-crisv10h.patch
 cris-check-for-memory-allocation.patch
 cris-remove-code-related-to-pre-22-kernel.patch

 uml-delete-unused-code.patch
 uml-formatting-fixes.patch
 uml-host_info-tidying.patch
 uml-mark-tt-mode-code-for-future-removal.patch
 uml-print-coredump-limits.patch
 uml-handle-block-device-hotplug-errors.patch
 uml-driver-formatting-fixes.patch
 uml-driver-formatting-fixes-fix.patch
 uml-network-interface-hotplug-error-handling.patch
 array_size-check-for-type.patch
 uml-move-sigio-testing-to-sigioc.patch
 uml-create-archh.patch
 uml-create-as-layouth.patch
 uml-move-remaining-useful-contents-of-user_utilh.patch
 uml-remove-user_utilh.patch
 uml-add-missing-__init-declarations.patch
 remove-unused-header-file-arch-um-kernel-tt-include-mode_kern-tth.patch
 uml-improve-checking-and-diagnostics-of-ethernet-macs.patch
 uml-eliminate-temporary-buffer-in-eth_configure.patch
 uml-replace-one-element-array-with-zero-element-array.patch
 uml-fix-umid-in-xterm-titles.patch
 uml-speed-up-exec.patch
 uml-no-locking-needed-in-tlsc.patch
 uml-tidy-processc.patch
 uml-remove-page_size.patch
 uml-kernel_thread-shouldnt-panic.patch
 uml-tidy-fault-code.patch
 uml-kernel-segfaults-should-dump-proper-registers.patch
 uml-comment-early-boot-locking.patch
 uml-irq-locking-commentary.patch
 uml-delete-host_frame_size.patch
 uml-drivers-get-release-methods.patch
 uml-dump-registers-on-ptrace-or-wait-failure.patch
 uml-speed-up-page-table-walking.patch
 uml-remove-unused-x86_64-code.patch
 uml-start-fixing-os_read_file-and-os_write_file.patch
 uml-tidy-libc-code.patch
 uml-convert-libc-layer-to-call-read-and-write.patch
 uml-batch-i-o-requests.patch
 uml-send-pointers-instead-of-structures-to-i-o-thread.patch
 uml-dump-core-on-panic.patch
 uml-dont-try-to-handle-signals-on-initial-process-stack.patch
 uml-change-remaining-callers-of-os_read_write_file.patch
 uml-formatting-fixes-around-os_read_write_file-callers.patch
 uml-remove-debugging-remnants.patch
 uml-rename-os_read_write_file_k-back-to-os_read_write_file.patch
 uml-aio-deadlock-avoidance.patch
 uml-speed-page-fault-path.patch
 uml-eliminate-a-piece-of-debugging-code.patch
 uml-more-page-fault-path-trimming.patch
 uml-only-flush-areas-covered-by-vma.patch
 uml-out-of-tmpfs-space-error-clarification.patch
 uml-virtualized-time-fix.patch

 v850-generic-timekeeping-conversion.patch

 xtensa-strlcpy-is-smart-enough.patch

More arch things.  Will merge.

 deprecate-smbfs-in-favour-of-cifs.patch

Probably 2.6.23.

 cpuset-remove-sched-domain-hooks-from-cpusets.patch

Hold.

 # clone-flag-clone_parent_tidptr-leaves-invalid-results-in-memory.patch: Eric B had issues
 clone-flag-clone_parent_tidptr-leaves-invalid-results-in-memory.patch
 factor-outstanding-i-o-error-handling.patch
 block_write_full_page-handle-enospc.patch
 simplify-the-stacktrace-code.patch
 filesystem-disk-errors-at-boot-time-caused-by-probe.patch
 allow-access-to-proc-pid-fd-after-setuid.patch
 ext2-3-4-fix-file-date-underflow-on-ext2-3-filesystems-on-64-bit-systems.patch
 reduce-size-of-task_struct-on-64-bit-machines.patch
 fix-quadratic-behavior-of-shrink_dcache_parent.patch
 mm-shrink-parent-dentries-when-shrinking-slab.patch
 ipmi-add-powerpc-openfirmware-sensing.patch
 ipmi-allow-shared-interrupts.patch
 ipmi-add-new-ipmi-nmi-watchdog-handling.patch
 ipmi-add-pci-remove-handling.patch
 freezer-task-exit_state-should-be-treated-as-bolean.patch
 softlockup-trivial-s-99-max_rt_prio.patch
 fix-constant-folding-and-poor-optimization-in-byte-swapping.patch
 documentation-ask-driver-writers-to-provide-pm-support.patch
 # fix-__d_path-for-lazy-unmounts-and-make-it-unambiguous.patch: Alan issues
 use-symbolic-constants-in-generic-lseek-code.patch
 use-use-seek_max-to-validate-user-lseek-arguments.patch
 devpts-add-fsnotify-create-event.patch
 tty-clarify-documentation-of-write.patch
 drivers-char-hvc_consolec-cleanups.patch
 is_power_of_2-in-fat.patch
 is_power_of_2-in-fs-hfs.patch
 is_power_of_2-in-fs-block_devc.patch
 freevxfs-possible-null-pointer-dereference-fix.patch
 reiserfs-possible-null-pointer-dereference-during-resize.patch
 scripts-kernel-doc-whitespace-cleanup.patch
 fix-section-mismatch-warning-in-lib-swiotlbc.patch
 init-do_mountsc-proper-prepare_namespace-prototype.patch
 fix-compilation-of-drivers-with-o0.patch
 reiserfs-shrink-superblock-if-no-xattrs.patch
 module-use-krealloc.patch
 reiserfs-correct-misspelled-reiserfs_proc_info-to.patch
 kconfig-centralize-the-selection-of-semaphore-debugging.patch
 irq-add-__must_check-to-request_irq.patch
 # use-stop_machine_run-in-the-intel-rng-driver.patch: needs re-review
 use-stop_machine_run-in-the-intel-rng-driver.patch
 cap-shmmax-at-int_max-in-compat-shminfo.patch
 exec-fix-remove_arg_zero.patch
 merge-sys_clone-sys_unshare-nsproxy-and-namespace.patch
 rcutorture-mark-rcu_torture_init-as-__init.patch
 init-dma-masks-in-pnp_dev.patch
 optimize-timespec_trunc.patch
 ext3-dirindex-error-pointer-issues.patch
 the-scheduled-removal-of-obsolete_oss-options.patch
 epoll-optimizations-and-cleanups.patch
 oss-strlcpy-is-smart-enough.patch
 add-filesystem-subtype-support.patch
 fix-race-between-proc_get_inode-and-remove_proc_entry.patch
 fix-race-between-proc_readdir-and-remove_proc_entry.patch
 proc-remove-pathetic-deleted-warn_on.patch
 vfs-remove-superflous-sb-==-null-checks.patch
 nameic-remove-utterly-outdated-comment.patch
 tpm_infineon-add-support-for-devices-in-mmio-space.patch
 replace-pci_find_device-in-drivers-telephony-ixjc.patch
 floppy-handle-device_create_file-failure-while-init.patch
 drivers-macintosh-mac_hidc-make-code-static.patch
 rocket-remove-modversions-include.patch
 virtual_eisa_root_init-should-be-__init.patch
 proc-maps-protection.patch
 remove-unused-header-file-drivers-message-i2o-i2o_lanh.patch
 remove-unused-header-file-drivers-char-digih.patch
 drivers-char-synclinkc-check-kmalloc-return-value.patch
 procfs-reorder-struct-pid_dentry-to-save-space-on-64bit-archs-and-constify-them.patch
 add-file-position-info-to-proc.patch
 vfs-delay-the-dentry-name-generation-on-sockets-and.patch
 tty-i386-x86_64-arbitary-speed-support.patch
 kprobes-make-kprobesymbol_name-const.patch
 fix-cycladesh-for-x86_64-and-probably-others.patch
 cyclades-remove-custom-types.patch
 small-fixes-for-jsm-driver.patch
 jsm-driver-fix-for-linuxpps-support.patch
 as-fix-antic_expire-check.patch
 rtc-add-rtc-rs5c313-driver.patch
 # rtc-add-rtc-class-driver-for-the-maxim-max6900.patch: Jean requested updates
 rtc-add-rtc-class-driver-for-the-maxim-max6900.patch
 # fix-rmmod-read-write-races-in-proc-entries.patch: worrisome (Arjan)
 fix-rmmod-read-write-races-in-proc-entries.patch
 # getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch: wrong
 getrusage-fill-ru_inblock-and-ru_oublock-fields-if-possible.patch
 futex-restartable-futex_wait.patch
 proc-oom_score-oops-re-badness.patch
 enlarge-console-name.patch
 fixes-and-cleanups-for-earlyprintk-aka-boot-console.patch
 tty-remove-unnecessary-export-of-proc_clear_tty.patch
 tty-simplify-calling-of-put_pid.patch
 tty-introduce-no_tty-and-use-it-in-selinux.patch
 reiserfs-proc-support-requires-proc_fs.patch
 kprobes-fix-sparse-null-warning.patch
 add-ability-to-keep-track-of-callers-of-symbol_getput.patch
 update-mtd-use-of-symbol_getput.patch
 update-dvb-use-of-symbol_getput.patch
 move-die-notifier-handling-to-common-code.patch
 char-rocket-add-module_device_table.patch
 char-cs5535_gpio-add-module_device_table.patch
 remove-do_sync_file_range.patch
 protect-tty-drivers-list-with-tty_mutex.patch
 # more-scheduled-oss-driver-removal.patch: too early?
 more-scheduled-oss-driver-removal.patch
 schedule-obsolete-oss-drivers-for-removal-4th-round.patch
 delete-unused-header-file-math-emu-extendedh.patch
 fix-sscanf-%n-match-at-end-of-input-string.patch
 make-remove_inode_dquot_ref-static.patch
 fix-race-between-attach_task-and-cpuset_exit.patch
 delete-unused-header-file-linux-awe_voiceh.patch
 kernel-irq-procc-unprotected-iteration-over-the-irq-action-list-in-name_unique.patch
 parport-dev-driver-model-support.patch
 legacy-pc-parports-support-parport-dev.patch
 layered-parport-code-uses-parport-dev.patch
 cache-pipe-buf-page-address-for-non-highmem-arch.patch
 add-support-for-deferrable-timers-respun.patch
 add-a-new-deferrable-delayed-work-init.patch
 linux-sysdevh-needs-to-include-linux-moduleh.patch
 irq-check-for-percpu-flag-only-when-adding-first-irqaction.patch
 # time-smp-friendly-alignment-of-struct-clocksource.patch: needs x86_64-move-__vgetcpu_mode-__jiffies-to-the-vsyscall_2-zone.patch
 time-smp-friendly-alignment-of-struct-clocksource.patch
 move-timekeeping-code-to-timekeepingc.patch
 ignore-stolen-time-in-the-softlockup-watchdog.patch
 add-touch_all_softlockup_watchdogs.patch
 header-cleaning-dont-include-smp_lockh-when-not-used.patch
 fix-82875-pci-setup.patch
 unexport-pci_proc_attach_device.patch
 make-dev-port-conditional-on-config-symbol.patch
 remove-artificial-software-max_loop-limit.patch
 kdump-kexec-calculate-note-size-at-compile-time.patch
 fix-kevents-childs-priority-greediness.patch
 display-all-possible-partitions-when-the-root-filesystem-failed-to-mount.patch
 enhance-initcall_debug-measure-latency.patch
 kprobes-print-details-of-kretprobe-on-assertion-failure.patch
 reregister_binfmt-returns-with-ebusy.patch
 pnpacpi-sets-pnpdev-devarchdata.patch
 simplify-module_get_kallsym-by-dropping-length-arg.patch
 fix-race-between-rmmod-and-cat-proc-kallsyms.patch
 simplify-kallsyms_lookup.patch
 fix-race-between-cat-proc-wchan-and-rmmod-et-al.patch
 fix-race-between-cat-proc-slab_allocators-and-rmmod.patch
 kernel-paramsc-fix-lying-comment-for-param_array.patch
 replace-deprecated-sa_xxx-interrupt-flags.patch
 deprecate-sa_xxx-interrupt-flags-v2.patch
 # expose-range-checking-functions-from-arch-specific.patch: wrong? crap!
 expose-range-checking-functions-from-arch-specific.patch
 remove-hardcoding-of-hard_smp_processor_id-on-up.patch
 use-the-apic-to-determine-the-hardware-processor-id-i386.patch
 use-the-apic-to-determine-the-hardware-processor-id-x86_64.patch
 always-ask-the-hardware-to-obtain-hardware-processor-id-ia64.patch
 round_up-macro-cleanup-in-drivers-char-lpc.patch
 i386-schedh-inclusion-from-moduleh-is-baack.patch
 parport_serial-fix-pci-must_checks.patch
 round_up-macro-cleanup-in-fs-selectcompatreaddirc.patch
 round_up-macro-cleanup-in-fs-smbfs-requestc.patch
 doc-kernel-parameters-use-x86-32-tag-instead-of-ia-32.patch
 kernel-doc-handle-arrays-with-arithmetic-expressions-as.patch
 merge-compat_ioctlh-into-compat_ioctlc.patch
 lockdep-treats-down_write_trylock-like-regular-down_write.patch
 pad-irq_desc-to-internode-cacheline-size.patch
 partition-add-support-for-sysv68-partitions.patch
 dtlk-fix-error-checks-in-module_init.patch
 add-spaces-on-either-side-of-case-operator.patch
 cleanup-compat-ioctl-handling.patch
 partitions-check-the-return-value-of-kobject_add-etc.patch
 kallsyms-cleanup-use-seq_release_private-where-appropriate.patch
 proc-cleanup-use-seq_release_private-where-appropriate.patch
 cciss-reformat-error-handling.patch
 cciss-add-sg_io-ioctl-to-cciss.patch
 cciss-set-rq-errors-more-correctly-in-driver.patch
 generate-main-index-page-when-building-htmldocs.patch
 alphabetically-sorted-entries-in.patch
 fix-hotplug-for-legacy-platform-drivers.patch
 # remove-redundant-check-from-proc_setattr: need sds ack
 remove-redundant-check-from-proc_setattr.patch
 remove-redundant-check-from-proc_sys_setattr.patch
 make-iunique-use-a-do-while-loop-rather-than-its-obscure-goto-loop.patch
 kernel-doc-html-mode-struct-highlights.patch
 add-webpages-url-and-summarize-3-lines.patch
 add-keyboard-blink-driver.patch
 efi-warn-only-for-pre-100-system-tables.patch
 apm-fix-incorrect-comment.patch
 cciss-include-scsi-scsih-unconditionally.patch
 highres-dyntick-prevent-xtime-lock-contention.patch
 documentation-cciss-detecting-failed-drives.patch
 spin_lock_unlocked-cleanup-in-init_taskh.patch
 spin_lock_unlocked-cleanup-in-drivers-char-keyboard.patch
 spin_lock_unlocked-cleanup-in-drivers-serial.patch
 lockdep-lookup_chain_cache-comment-errata.patch
 taskstats-fix-getdelays-usage-information.patch
 smbfs-remove-unnecessary-allow_signal.patch
 pnpbios-conert-to-use-the-kthread-api.patch
 introduce-a-handy-list_first_entry-macro-v2.patch
 document-spin_lock_unlocked-rw_lock_unlocked-deprecation.patch
 getdelaysc-fix-overrun.patch
 serial_txx9-use-assigned-device-numbers.patch
 serial_txx9-zap-changelog-from-source-code.patch
 cpu-time-limit-patch--setrlimitrlimit_cpu-0-cheat-fix.patch
 ext3-copy-i_flags-to-inode-flags-on-write.patch
 codingstyle-start-flamewar-about-use-of-braces.patch
 upper-32-bits.patch
 console-utf-8-fixes.patch
 #report-that-kernel-is-tainted-if-there-were-an-oops-before.patch
 clarify-the-creation-of-the-localversion_auto-string.patch
 add-pci_try_set_mwi.patch
 check-privileges-before-setting-mount-propagation.patch
 jbd-check-for-error-returned-by-kthread_create-on-creating-journal-thread.patch
 clean-up-mutex_trylock-noise.patch
 the-scheduled-einval-for-invalid-timevals-in-setitimer.patch
 reiserfs-use-__set_current_state.patch
 drivers-char-use-__set_current_state.patch
 kill-warnings-when-building-mandocs.patch
 cleanup-mostly-unused-iospace-macros.patch
 lockdep-removed-unused-ip-argument-in-mark_lock-mark_held_locks.patch
 fat_dont-use_free_clusters-for-fat32.patch
 copy-i_flags-to-ext2-inode-flags-on-write.patch
 fix-chapter-reference-in-codingstyle.patch
 sleep-during-spinlock-in-tpm-driver.patch
 consolidate-asm-consth-to-linux-consth.patch
 x86_64-kill-19000-sparse-warnings.patch
 move-log_buf_shift-to-a-more-sensible-place.patch
 w1-printk-format-warning.patch
 w1-allow-bus-master-to-have-reset-and-byte-ops.patch
 driver-for-the-maxim-ds1wm-a-1-wire-bus-master-asic-core.patch
 dma_declare_coherent_memory-wrong-allocation.patch
 deflate-inflate_dynamic-too.patch
 fix-wrong-identifier-name-in-documentation-driver-model-devrestxt.patch
 edd-switch-to-refcounting-pci-apis.patch
 fix-vfat-compat-ioctls-on-64-bit-systems.patch

Misc.  A few of these need rechecking by people who had comments.  I'll
re-review these and will mostly-merge.

 consolidate-generic_writepages-and-mpage_writepages.patch

Might merge.  I forget what happened to this.

 sync_sb_inodes-propagate-errors.patch

This still isn't right.

 minor-spi_butterfly-cleanup.patch
 dev-spidevbc-interface.patch
 # mpc52xx-psc-spi-master-driver.patch: needs s-o-b
 mpc52xx-psc-spi-master-driver.patch

Will merge.

 mips-convert-to-use-shared-apm-emulation-fix.patch

Send to Ralf.  Or drop.  Not sure what it's doing here.

 make-static-counters-in-new_inode-and-iunique-be-32-bits.patch
 change-libfs-sb-creation-routines-to-avoid-collisions-with-their-root-inodes.patch

Will merge.

 schedule_on_each_cpu-use-preempt_disable.patch
 reimplement-flush_workqueue.patch
 implement-flush_work.patch
 flush_workqueue-use-preempt_disable-to-hold-off-cpu-hotplug.patch
 flush_cpu_workqueue-dont-flush-an-empty-worklist.patch
 aio-use-flush_work.patch
 kblockd-use-flush_work.patch
 relayfs-use-flush_keventd_work.patch
 tg3-use-flush_keventd_work.patch
 e1000-use-flush_keventd_work.patch
 libata-use-flush_work.patch
 phy-use-flush_work.patch

Will mostly-merge.  Some can go via subsystem maintainers if/when the base
patches are in.

 extend-notifier_call_chain-to-count-nr_calls-made.patch
 define-and-use-new-eventscpu_lock_acquire-and-cpu_lock_release.patch
 eliminate-lock_cpu_hotplug-in-kernel-schedc.patch
 call-cpu_chain-with-cpu_down_failed-if-cpu_down_prepare-failed.patch
 call-cpu_chain-with-cpu_down_failed-if-cpu_down_prepare-failed-vs-reduce-size-of-task_struct-on-64-bit-machines.patch
 slab-use-cpu_lock_.patch
 workqueue-fix-freezeable-workqueues-implementation.patch
 workqueue-fix-flush_workqueue-vs-cpu_dead-race.patch
 workqueue-dont-clear-cwq-thread-until-it-exits.patch
 workqueue-dont-migrate-pending-works-from-the-dead-cpu.patch
 workqueue-kill-run_scheduled_work.patch
 workqueue-dont-save-interrupts-in-run_workqueue.patch
 workqueue-make-cancel_rearming_delayed_workqueue-work-on-idle-dwork.patch
 workqueue-introduce-cpu_singlethread_map.patch
 workqueue-introduce-workqueue_struct-singlethread.patch
 workqueue-make-init_workqueues-__init.patch
 make-queue_delayed_work-friendly-to-flush_fork.patch
 unify-queue_delayed_work-and-queue_delayed_work_on.patch
 workqueue-introduce-wq_per_cpu-helper.patch
 make-cancel_rearming_delayed_work-work-on-any-workqueue-not-just-keventd_wq.patch
 ipvs-flush-defense_work-before-module-unload.patch
 workqueue-kill-noautorel-works.patch
 worker_thread-dont-play-with-signals.patch
 worker_thread-fix-racy-try_to_freeze-usage.patch
 zap_other_threads-remove-unneeded-exit_signal-change.patch
 # slab-shutdown-cache_reaper-when-cpu-goes-down.patch
 unify-flush_work-flush_work_keventd-and-rename-it-to-cancel_work_sync.patch
 ____call_usermodehelper-dont-flush_signals.patch

A lot of this is Oleg's workqueue rework which I deferred from 2.6.21.  Will
merge.

 freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch
 freezer-close-theoretical-race-between-refrigerator-and-thaw_tasks.patch
 freezer-remove-pf_nofreeze-from-rcutorture-thread.patch
 freezer-remove-pf_nofreeze-from-bluetooth-threads.patch
 freezer-add-try_to_freeze-calls-to-all-kernel-threads.patch
 freezer-fix-vfork-problem.patch
 freezer-take-kernel_execve-into-consideration.patch
 kthread-dont-depend-on-work-queues-take-2.patch
 change-reparent_to_init-to-reparent_to_kthreadd.patch

Freezer work - trying to get the freezer ready for use with CPU hotplug.
Will merge.

 nlmclnt_recovery-dont-use-clone_sighand.patch
 usbatm_heavy_init-dont-use-clone_sighand.patch
 wait_for_helper-remove-unneeded-do_sigaction.patch
 worker_thread-dont-play-with-sigchld-and-numa-policy.patch
 change-kernel-threads-to-ignore-signals-instead-of-blocking-them.patch
 fix-kthread_create-vs-freezer-theoretical-race.patch
 fix-pf_nofreeze-and-freezeable-race-2.patch
 freezer-document-task_lock-in-thaw_process.patch
 move-frozen_process-to-kernel-power-processc.patch
 remvoe-kthread_bind-call-from-_cpu_down.patch

Various core thread-management things.  Will merge.

 move-page-writeback-acounting-out-of-macros.patch

Will merge.  Or might drop, dunno.  I think it makes sense.

 ext2-reservations.patch

This still awaits more testing.

 make-drivers-isdn-capi-capiutilccdebbuf_alloc-static.patch
 drivers-isdn-hardware-eicon-remove-unused-header-files.patch
 fix-spinlock-usage-in-hysdn_log_close.patch

ISDN: will merge.

 remove-obsolete-label-from-isdn4linux-v3.patch

This caused a lkml foodfight.  Will drop.

 remove-nfs4_acl_add_ace.patch
 the-nfsv2-nfsv3-server-does-not-handle-zero-length-write.patch
 knfsd-rename-sk_defer_lock-to-sk_lock.patch
 nfsd-nfs4state-remove-unnecessary-daemonize-call.patch
 rpc-add-wrapper-for-svc_reserve-to-account-for-checksum.patch

nfsd things - will merge after checking with Neil.

 sched-fix-idle-load-balancing-in-softirqd-context.patch
 sched-dynticks-idle-load-balancing-v3.patch
 speedup-divides-by-cpu_power-in-scheduler.patch
 sched-optimize-siblings-status-check-logic-in-wake_idle.patch
 sched-redundant-reschedule-when-set_user_nice-boosts-a-prio-of-a-task-from-the-expired-array.patch
 sched-align-rq-to-cacheline-boundary.patch

CPU scheduler: will merge.

 rcutorture-use-array_size-macro-when-appropriate.patch
 rcutorture-style-cleanup-avoid-=-null-in-boolean-tests.patch
 rcutorture-remove-redundant-assignment-to-cur_ops-in.patch

Will merge.

 utimensat-implementation.patch

Will merge.

 rtc-remove-sys-class-rtc-dev.patch
 rtc-rtc-interfaces-dont-use-class_device.patch
 rtc-simplified-rtc-sysfs-attribute-handling.patch
 rtc-simplified-proc-driver-rtc-handling.patch
 rtc-remove-rest-of-class_device.patch
 rtc-suspend-resume-restores-system-clock.patch
 rtc-simplified-rtc-sysfs-attribute-handling-tidy.patch
 rtc-update-to-class-device-removal-patches.patch
 rtc-kconfig-cleanup.patch
 rtc-update-vr41xx-alarm-handling.patch
 rtc-cmos-wakeup-interface.patch
 acpi-wakeup-hooks-for-rtc-cmos.patch
 workaround-rtc-related-acpi-table-bugs.patch
 revert-rtc-add-rtc_merge_alarm.patch
 remove-rtc_alm_set-mode-bugs.patch
 rtc-cmos-make-it-load-on-pnpbios-systems.patch

Will merge.

 declare-struct-ktime.patch
 futex-priority-based-wakeup.patch
 make-futex_wait-use-an-hrtimer-for-timeout.patch
 futex_requeue_pi-optimization.patch

Will merge.

 kprobes-use-hlist_for_each_entry.patch
 kprobes-codingstyle-cleanups.patch
 kprobes-kretprobes-simplifcations.patch
 kprobes-the-on-off-knob-thru-debugfs-updated.patch

Will merge.

 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
 atomich-complete-atomic_long-operations-in-asm-generic.patch
 atomich-i386-type-safety-fix.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
 atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
 atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
 local_t-architecture-independant-extension.patch
 local_t-alpha-extension.patch
 local_t-i386-extension.patch
 local_t-ia64-extension.patch
 local_t-mips-extension.patch
 local_t-parisc-cleanup.patch
 local_t-powerpc-extension.patch
 local_t-sparc64-cleanup.patch
 local_t-x86_64-extension.patch
 linux-kernel-markers-kconfig-menus.patch
 linux-kernel-markers-architecture-independant-code.patch
 linux-kernel-markers-powerpc-optimization.patch
 linux-kernel-markers-i386-optimization.patch
 markers-add-instrumentation-markers-menus-to-avr32.patch
 linux-kernel-markers-non-optimized-architectures.patch
 markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
 linux-kernel-markers-documentation.patch
 markers-define-the-linker-macro-extra_rwdata.patch
 markers-use-extra_rwdata-in-architectures.patch

Static markers.  Will merge.

 some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
 no-longer-include-asm-kdebugh.patch

Will merge.

 nfs-fix-congestion-control-use-atomic_longs.patch

Will merge.

 udf-use-sector_t-and-loff_t-for-file-offsets.patch
 udf-introduce-struct-extent_position.patch
 udf-use-get_bh.patch
 udf-add-assertions.patch
 udf-support-files-larger-than-1g.patch
 udf-fix-link-counts.patch
 udf-possible-null-pointer-dereference-while-load_partition.patch

Will merge.

 attach_pid-with-struct-pid-parameter.patch
 statically-initialize-struct-pid-for-swapper.patch
 explicitly-set-pgid-and-sid-of-init-process.patch
 use-struct-pid-parameter-in-copy_process.patch
 use-task_pgrp-task_session-in-copy_process.patch
 kill-unused-sesssion-and-group-values-in-rocket-driver.patch
 fix-some-coding-style-errors-in-autofs.patch
 replace-pid_t-in-autofs-with-struct-pid-reference.patch
 dont-init-pgrp-and-__session-in-init_signals.patch

Will merge.

 signal-timer-event-fds-v9-anonymous-inode-source.patch
 signal-timer-event-fds-v9-signalfd-core.patch
 signal-timer-event-signalfd-wire-up-x86-arches.patch
 signal-timer-event-fds-v9-signalfd-compat-code.patch
 signal-timer-event-fds-v9-timerfd-core.patch
 signal-timer-event-timerfd-wire-up-x86-arches.patch
 signal-timer-event-fds-v9-timerfd-compat-code.patch
 signal-timer-event-fds-v9-eventfd-core.patch
 signal-timer-event-eventfd-wire-up-x86-arches.patch
 signal-timer-event-fds-v9-kaio-eventfd-support-example.patch
 epoll-use-anonymous-inodes.patch

Will merge.

 epoll-cleanups-epoll-no-module.patch
 epoll-cleanups-epoll-remove-static-pre-declarations-and-akpm-ize-the-code.patch

Will merge.

 revoke-special-mmap-handling.patch
 revoke-special-mmap-handling-vs-fault-vs-invalidate.patch
 revoke-core-code.patch
 revoke-core-code-misc-fixes.patch
 revoke-core-code-fix-shared-mapping-revoke.patch
 revoke-core-code-move-magic.patch
 revoke-core-code-fs-revokec-cleanups-and-bugfix-for-64bit-systems.patch
 revoke-core-code-revoke-no-revoke-for-nommu.patch
 revoke-core-code-fix-shared-mapping-revoke-revoke-only-revoke-mappings-for-the-given-inode.patch
 revoke-core-code-break-cow-for-private-mappings.patch
 revoke-core-code-generic_file_revoke-stub-for-nommu.patch
 revoke-core-code-break-cow-fixes.patch
 revoke-core-code-mapping-revocation.patch
 revoke-core-code-only-fput-unused-files.patch
 revoke-core-code-slab-allocators-remove-slab_debug_initial-flag-revoke.patch
 revoke-support-for-ext2-and-ext3.patch
 revoke-add-documentation.patch
 revoke-wire-up-i386-system-calls.patch

Hold.  This is tricky stuff and I don't think we've seen sufficient reviewing,
testing and acking yet?

 add-irqf_irqpoll-flag-common-code.patch
 add-irqf_irqpoll-flag-on-x86_64.patch
 add-irqf_irqpoll-flag-on-i386.patch
 add-irqf_irqpoll-flag-on-ia64.patch
 add-irqf_irqpoll-flag-on-sh.patch
 add-irqf_irqpoll-flag-on-parisc.patch
 add-irqf_irqpoll-flag-on-arm.patch

Merge.

 char-cyclades-remove-pause.patch
 char-cyclades-cy_readx-writex-cleanup.patch
 char-cyclades-timer-cleanup.patch
 char-cyclades-remove-volatiles.patch
 char-cyclades-remove-useless-casts.patch

Merge.

 pnp-notice-whether-we-have-pnp-devices-pnpbios-or-pnpacpi.patch
 pnp-workaround-hp-bios-defect-that-leaves-smcf010-device-partly-enabled.patch
 smsc-ircc2-tidy-up-module-parameter-checking.patch
 smsc-ircc2-add-pnp-support.patch
 x86-serial-convert-legacy-com-ports-to-platform-devices.patch

Misc stuff.  Will merge.

 lguest-the-guest-code.patch
 lguest-vs-x86_64-mm-use-per-cpu-variables-for-gdt-pda.patch
 lguest-the-guest-code-update-lguests-patch-code-for-new-paravirt-patch.patch
 lguest-the-host-code.patch
 lguest-the-host-code-vs-x86_64-mm-i386-separate-hardware-defined-tss-from-linux-additions.patch
 lguest-the-host-code-fix-lguest-oops-when-guest-dies-while-receiving-i-o.patch
 lguest-the-host-code-simplification-dont-pin-guest-trap-handlers.patch
 lguest-the-asm-offsets.patch
 lguest-the-makefile-and-kconfig.patch
 lguest-the-console-driver.patch
 lguest-the-net-driver.patch
 lguest-the-block-driver.patch
 lguest-the-documentation-example-launcher.patch
 lguest-the-documentation-example-launcher-fix-lguest-documentation-error.patch

Will merge the rustyvisor.

 fs-convert-core-functions-to-zero_user_page.patch
 fs-convert-core-functions-to-zero_user_page-pass-kmap-type.patch
 fs-convert-core-functions-to-zero_user_page-fix-2.patch
 affs-use-zero_user_page.patch
 ecryptfs-use-zero_user_page.patch
 ext3-use-zero_user_page.patch
 ext4-use-zero_user_page.patch
 gfs2-use-zero_user_page.patch
 nfs-use-zero_user_page.patch
 ntfs-use-zero_user_page.patch
 ntfs-use-zero_user_page-fix.patch
 ocfs2-use-zero_user_page.patch
 reiserfs-use-zero_user_page.patch
 xfs-use-zero_user_page.patch
 fs-deprecate-memclear_highpage_flush.patch

Merge.

 char-cyclades-create-cy_init_ze.patch
 char-cyclades-use-pci_iomap-unmap.patch
 char-cyclades-init-ze-immediately.patch
 char-cyclades-create-cy_pci_probe.patch
 char-cyclades-move-card-entries-init-into-function.patch
 char-cyclades-init-card-struct-immediately.patch
 char-cyclades-remove-some-global-vars.patch
 char-cyclades-cy_init-error-handling.patch
 char-cyclades-tty_register_device-separately-for-each-device.patch
 char-cyclades-clear-interrupts-before-releasing.patch
 char-cyclades-allow-debug_shirq.patch

Merge.

 add-suspend-related-notifications-for-cpu-hotplug.patch
 microcode-use-suspend-related-cpu-hotplug-notifications.patch

Merge.

 vmstat-use-our-own-timer-events.patch

Merge.

 readahead-kconfig-options.patch
 radixtree-introduce-scan-hole-data-functions.patch
 mm-introduce-probe_page.patch
 mm-introduce-pg_readahead.patch
 readahead-add-look-ahead-support-to-__do_page_cache_readahead.patch
 readahead-insert-cond_resched-calls.patch
 readahead-minmax_ra_pages.patch
 readahead-events-accounting.patch
 readahead-rescue_pages.patch
 readahead-sysctl-parameters.patch
 readahead-min-max-sizes.patch
 readahead-state-based-method-aging-accounting.patch
 readahead-state-based-method-routines.patch
 readahead-state-based-method.patch
 readahead-state-based-method-check-node-id.patch
 readahead-state-based-method-decouple-readahead_ratio-from-growth_limit.patch
 readahead-state-based-method-cancel-lookahead-gracefully.patch
 readahead-context-based-method.patch
 readahead-initial-method-guiding-sizes.patch
 readahead-initial-method-thrashing-guard-size.patch
 readahead-initial-method-user-recommended-size.patch
 readahead-initial-method.patch
 readahead-backward-prefetching-method.patch
 readahead-thrashing-recovery-method.patch
 readahead-thrashing-recovery-method-check-unbalanced-aging.patch
 readahead-thrashing-recovery-method-refill-holes.patch
 readahead-call-scheme.patch
 readahead-call-scheme-cleanup.patch
 readahead-call-scheme-catch-thrashing-on-lookahead-time.patch
 readahead-laptop-mode.patch
 readahead-loop-case.patch
 readahead-nfsd-case.patch
 readahead-remove-parameter-ra_max-from-thrashing_recovery_readahead.patch
 readahead-remove-parameter-ra_max-from-adjust_rala.patch
 readahead-state-based-method-protect-against-tiny-size.patch
 readahead-rename-state_based_readahead-to-clock_based_readahead.patch
 readahead-account-i-o-block-times-for-stock-readahead.patch
 readahead-rescue_pages-updates.patch
 readahead-remove-noaction-shrink-events.patch
 readahead-remove-size-limit-on-read_ahead_kb.patch
 readahead-remove-size-limit-of-max_sectors_kb-on-read_ahead_kb.patch
 readahead-partial-sendfile-fix.patch
 readahead-turn-on-by-default.patch

Hopefully Wu will be coming up with a much simpler best-of-readahead patch
soon.  I don't think we can get these patches over the hump and they are
somewhat costly to maintain.

 [93 random fbdev patches]

Will merge.

 drivers-mdc-use-array_size-macro-when-appropriate.patch
 md-cleanup-use-seq_release_private-where-appropriate.patch
 md-remove-broken-sigkill-support.patch

Will merge after checking with Neil.

 md-dm-reduce-stack-usage-with-stacked-block-devices.patch

Will we ever fix this?

 statistics-infrastructure-prerequisite-list.patch
 statistics-infrastructure-prerequisite-parser.patch
 statistics-infrastructure-prerequisite-parser-fix.patch
 add-for_each_substring-and-match_substring.patch
 statistics-infrastructure-prerequisite-timestamp.patch
 statistics-infrastructure-make-printk_clock-a-generic-kernel-wide-nsec-resolution.patch
 statistics-infrastructure-documentation.patch
 statistics-infrastructure.patch
 statistics-infrastructure-add-for_each_substring-and-match_substring-exploitation.patch
 statistics-infrastructure-fix-parsing-of-statistics-type-attribute.patch
 statistics-infrastructure-simplify-statistics-debugfs-write-function.patch
 statistics-infrastructure-simplify-statistics-debugfs-read-functions.patch
 statistics-infrastructure-fix-string-termination.patch
 statistics-infrastructure-small-cleanup-in-debugfs-write-function.patch
 statistics-infrastructure-fix-cpu-hot-unplug-related-memory-leak.patch
 statistics-infrastructure-timer_stats-slimmed-down-statistics-prereq-labels.patch
 statistics-infrastructure-timer_stats-slimmed-down-statistics-prereq-keys.patch
 statistics-infrastructure-statistics-fix-sorted-list.patch
 add-suspend-related-notifications-for-cpu-hotplug-statistics.patch
 statistics-infrastructure-exploitation-zfcp.patch
 timer_stats-slimmed-down-using-statistics-infrastucture.patch

We have a second user of the statistics infrastructure!  If we have a third,
perhaps we can merge it.  It's an unobvious call.

 mprotect-patch-for-use-by-slim.patch
 integrity-service-api-and-dummy-provider.patch
 integrity-service-api-and-dummy-provider-integrity_dummy_verify_metadata.patch
 slim-main-patch.patch
 slim-main-lsm-getprocattr-hook-api-change.patch
 slim-secfs-patch.patch
 slim-make-and-config-stuff.patch
 slim-debug-output.patch
 slim-integrity-patch.patch
 slim-documentation.patch
 integrity-new-hooks.patch
 integrity-new-hooks-fix.patch
 integrity-fs-hook-placement.patch
 integrity-evm-as-an-integrity-service-provider.patch
 integrity-evm-as-an-integrity-service-provider-tidy.patch
 integrity-evm-as-an-integrity-service-provider-tidy-fix.patch
 integrity-evm-as-an-integrity-service-provider-tidy-fix-2.patch
 integrity-ima-integrity_measure-support.patch
 integrity-ima-integrity_measure-support-tidy.patch
 integrity-ima-integrity_measure-support-fix.patch
 integrity-ima-integrity_measure-support-fix-2.patch
 integrity-ima-integrity_measure-support-ima-exit.patch
 integrity-ima-integrity_measure-support-remove-spinlock.patch
 integrity-ima-identifiers.patch
 integrity-ima-cleanup.patch
 integrity-tpm-internal-kernel-interface.patch
 integrity-tpm-internal-kernel-interface-tidy.patch
 ibac-patch.patch

Hold.  This seems a long way from being mergeable.

 use-menuconfig-objects-acpi.patch
 use-menuconfig-objects-libata.patch
 use-menuconfig-objects-block-layer.patch
 use-menuconfig-objects-connector.patch
 use-menuconfig-objects-crypto.patch
 use-menuconfig-objects-crypto-hw.patch
 use-menuconfig-objects-dccp.patch
 use-menuconfig-objects-i2o.patch
 use-menuconfig-objects-ide.patch
 use-menuconfig-objects-ipvs.patch
 use-menuconfig-objects-sctp.patch
 use-menuconfig-objects-tipc.patch
 use-menuconfig-objects-arcnet.patch
 use-menuconfig-objects-phy.patch
 use-menuconfig-objects-toeknring.patch
 use-menuconfig-objects-netdev.patch
 use-menuconfig-objects-oldcd.patch
 use-menuconfig-objects-parport.patch
 use-menuconfig-objects-pcmcia.patch
 use-menuconfig-objects-pnp.patch
 use-menuconfig-objects-w1.patch

Will merge sometime.  Some need to go via subsystem maintainers.

 w1-build-fix.patch

A gcc-4.3 maybe-fix.  Still awaiting testing results.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
@ 2007-04-30 23:48 ` Jeff Garzik
  2007-05-01  0:07   ` Dave Jones
                     ` (2 more replies)
  2007-04-30 23:59 ` 2.6.22 -mm merge plans Bill Irwin
                   ` (21 subsequent siblings)
  22 siblings, 3 replies; 233+ messages in thread
From: Jeff Garzik @ 2007-04-30 23:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:
>  ahci-crash-fix.patch
>  libata-acpi-add-infrastructure-for-drivers-to-use.patch
>  pata_acpi-restore-driver.patch
>  optional-led-trigger-for-libata.patch
>  ata_timing-ensure-t-cycle-is-always-correct.patch
>  pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
>  drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
>  pata_icside-driver.patch
> 
> ata stuff

Tejun helpfully posted a bunch of clashing patches for all the ACPI 
stuff :)  You might be better off dropping and getting a resend after 
the dust settles.

That LED trigger patch seems technically correct, but also filling a 
need that few have.  IMO it craps up the hot path for little gain.

The other stuff should be in my mbox to be reviewed and applied.

>  8139too-force-media-setting-fix.patch
>  sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
>  forcedeth-improve-napi-logic.patch
>  ne-add-platform_driver.patch
>  ne-add-platform_driver-fix.patch
>  ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
>  mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
>  ibmtr_cs-fix-hang-on-eject.patch
> 
> For netdev tree

8139too: needs review w/ attention paid to historical usage, and I 
haven't had time for months.  Not sure it's right.

sundance: I really think this is phy-dependent, and should not be 
universally applied.  In the standard MII PHY, phy 0 is a ghost of 
another id.
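
The concern about address 0 can be illustrated with a minimal sketch. It
assumes a caller-supplied `mdio_read(addr, reg)` function (hypothetical, not
taken from the sundance driver) and treats a status of 0x0000 or 0xffff as
"nothing there". Addresses 1 to 31 are probed first and address 0 last, so a
ghost at 0 is only chosen when no other PHY answers. This is one possible
probe order, not necessarily what the patch does.

```python
MII_BMSR = 1  # MII basic mode status register


def find_phys(mdio_read, max_phys=1):
    """Return PHY addresses that respond, probing address 0 last.

    mdio_read is a hypothetical callable (addr, reg) -> 16-bit value.
    """
    found = []
    for phy in range(1, 33):
        addr = phy & 0x1F  # 32 wraps to 0, so address 0 is tried last
        status = mdio_read(addr, MII_BMSR)
        if status in (0x0000, 0xFFFF):
            continue  # treat all-zeros or all-ones as "no PHY here"
        found.append(addr)
        if len(found) >= max_phys:
            break
    return found
```

With this order a board whose only PHY sits at address 0 is still detected,
while a board with a real PHY at 1 and its ghost at 0 picks 1.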

forcedeth: TX NAPI wants more than that minimum effort

other stuff: in mbox

>  ppp_generic-fix-lockdep-warning.patch
> 
> Jeff, I guess.  It's not clear that this is correct.

Usually PPP is paulus -> jgarzik -> linus, but you can bounce it 
straight to me if Paulus doesn't respond

>  pcmcia-pccard-deadlock-fix.patch
>  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
>  at91_cf-minor-fix.patch
>  add-new_id-to-pcmcia-drivers.patch
>  ide-cs-recognize-2gb-compactflash-from-transcend.patch
> 
> Dominik is busy.  Will probably re-review and send these direct to Linus.

I really wish "add ID" patches would not get buried in this tree.

Certainly they are trivial enough to go straight to Linus, but it would 
be nice to go through subsystem maintainers, some of whom have also 
picked up these new-id patches.

We don't send new-id patches for PCI drivers to GregKH, and we should 
similarly /not/ direct PCMCIA id patches to the PCMCIA bus tree.  It's 
far more scalable to send new-id patches to the maintainers dealing with 
the subsystem under which each driver falls (net, scsi, IDE, ...)

>  pci-device-ensure-sysdata-initialised-v2.patch
> 
> This is for Jeff's git-pciseg.patch which is sort-of on hold at present.

HP was kind enough to dig up another machine for me.  Other machines are 
appearing with x86 PCI domains (aka PCI segments) these days, so I need 
to get this into upstream.

	Jeff

* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
@ 2007-05-01  0:07   ` Dave Jones
  2007-05-01  0:09   ` Andrew Morton
  2007-05-01  9:49   ` Alan Cox
  2 siblings, 0 replies; 233+ messages in thread
From: Dave Jones @ 2007-05-01  0:07 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel

On Mon, Apr 30, 2007 at 07:48:50PM -0400, Jeff Garzik wrote:
 
 > >  add-new_id-to-pcmcia-drivers.patch
 > > Dominik is busy.  Will probably re-review and send these direct to Linus.
 > I really wish "add ID" patches would not get buried in this tree.

I don't think this is what you think it is (hint: look at the patch).
This is very much a pcmcia patch.

	Dave

-- 
http://www.codemonkey.org.uk

* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
  2007-05-01  0:07   ` Dave Jones
@ 2007-05-01  0:09   ` Andrew Morton
  2007-05-01  0:24     ` Jeff Garzik
  2007-05-01  9:49   ` Alan Cox
  2 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  0:09 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: linux-kernel


> Subject: to something appropriate (was Re: 2.6.22 -mm merge plans)

smartypants.

On Mon, 30 Apr 2007 19:48:50 -0400
Jeff Garzik <jeff@garzik.org> wrote:

> Andrew Morton wrote:
> >  ahci-crash-fix.patch
> >  libata-acpi-add-infrastructure-for-drivers-to-use.patch
> >  pata_acpi-restore-driver.patch
> >  optional-led-trigger-for-libata.patch
> >  ata_timing-ensure-t-cycle-is-always-correct.patch
> >  pata_pcmcia-recognize-2gb-compactflash-from-transcend.patch
> >  drivers-ata-remove-the-wildcard-from-sata_nv-driver.patch
> >  pata_icside-driver.patch
> > 
> > ata stuff
> 
> Tejun helpfully posted a bunch of clashing patches for all the ACPI 
> stuff :)  You might be better off dropping and getting a resend after 
> the dust settles.
> 
> That LED trigger patch seems technically correct, but also filling a 
> need that few have.  IMO it craps up the hot path for little gain.

OK, well I'll see what's recoverable after libata-all gets updated, will
blindly spam you with it as usual ;)

> 
> >  8139too-force-media-setting-fix.patch
> >  sundance-change-phy-address-search-from-phy=1-to-phy=0.patch
> >  forcedeth-improve-napi-logic.patch
> >  ne-add-platform_driver.patch
> >  ne-add-platform_driver-fix.patch
> >  ne-mips-use-platform_driver-for-ne-on-rbtx49xx.patch
> >  mips-drop-unnecessary-config_isa-from-rbtx49xx.patch
> >  ibmtr_cs-fix-hang-on-eject.patch
> > 
> > For netdev tree
> 
> 8139too: needs review w/ attention paid to historical usage, and I 
> haven't had time for months.  Not sure it's right.
> 
> sundance: I really think this is phy-dependent, and should not be 
> universally applied.  In the standard MII PHY, phy 0 is a ghost of 
> another id.
> 
> forcedeth: TX NAPI wants more than that minimum effort

Ditto.

> >  ppp_generic-fix-lockdep-warning.patch
> > 
> > Jeff, I guess.  It's not clear that this is correct.
> 
> Usually PPP is paulus -> jgarzik -> linus, but you can bounce it 
> straight to me if Paulus doesn't respond
> 

OK, I'll move it to the netdev queue and will keep sending until something
happens.

> 
> >  pcmcia-pccard-deadlock-fix.patch
> >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> >  at91_cf-minor-fix.patch
> >  add-new_id-to-pcmcia-drivers.patch
> >  ide-cs-recognize-2gb-compactflash-from-transcend.patch
> > 
> > Dominik is busy.  Will probably re-review and send these direct to Linus.
> 
> I really wish "add ID" patches would not get buried in this tree.
> 
> Certainly they are trivial enough to go straight to Linus, but it would 
> be nice to go through subsystem maintainers, some of whom have also 
> picked up these new-id patches.
> 
> We don't send new-id patches for PCI drivers to GregKH, and we should 
> similarly /not/ direct PCMCIA id patches to the PCMCIA bus tree.  It's 
> far more scalable to send new-id patches to the maintainers dealing with 
> the subsystem under which each driver falls (net, scsi, IDE, ...)
> 

Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
hardware.  I'll get all these into 2.6.22 by whatever means and will adopt
your advice in future.

Probably these should go into -stable too, but I don't know what
Greg&Chris's position is on new device IDs.


* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01  0:09   ` Andrew Morton
@ 2007-05-01  0:24     ` Jeff Garzik
  2007-05-01  0:40       ` [stable] " Chris Wright
  0 siblings, 1 reply; 233+ messages in thread
From: Jeff Garzik @ 2007-05-01  0:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, stable

Andrew Morton wrote:
> Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
> hardware.  I'll get all these into 2.6.22 by whatever means and will adopt
> your advice in future.
> 
> Probably these should go into -stable too, but I don't know what
> Greg&Chris's position is on new device IDs.

I don't know either.  But a one-line ID patch is pretty painless 
considering the gain, so I would vote for stable@kernel.org taking such 
patches.

If it's more than one line added per ID though, NAK for -stable, IMO...

	Jeff




* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01  0:24     ` Jeff Garzik
@ 2007-05-01  0:40       ` Chris Wright
  2007-05-01  0:45         ` Jeff Garzik
  0 siblings, 1 reply; 233+ messages in thread
From: Chris Wright @ 2007-05-01  0:40 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel, stable

* Jeff Garzik (jeff@garzik.org) wrote:
> Andrew Morton wrote:
> > Yeah, a new-id patch is a pretty critical bugfix if you happen to have that
> > hardware.  I'll get all these into 2.6.22 by whatever means and will adopt
> > your advice in future.
> > 
> > Probably these should go into -stable too, but I don't know what
> > Greg&Chris's position is on new device IDs.
> 
> I don't know either.  But a one-line ID patch is pretty painless 
> considering the gain, so I would vote for stable@kernel.org taking such 
> patches.
> 
> If it's more than one line added per ID though, NAK for -stable, IMO...

Well, there's 2 issues here.  1) the patch in question is not -stable
material (the patch name is a bit misleading).  2) you can add them
runtime in userspace (and for pcmcia too after patch in question is
applied), so we've historically avoided that kind of patch for -stable.
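
To illustrate the runtime route mentioned in point 2: a PCI driver
exposes a new_id file under /sys/bus/pci/drivers/<driver>/, and writing
a "vendor device" pair of hexadecimal IDs to it asks that driver to
claim matching devices.  A minimal C sketch, assuming that layout; the
function name add_dynamic_id() and whatever path and IDs are passed to
it are hypothetical, and an ID added this way does not survive a reboot.

```c
#include <assert.h>  /* assert(), for callers that check the result */
#include <stdio.h>

/*
 * Write a "vendor device" pair in hex to a driver's new_id file,
 * e.g. /sys/bus/pci/drivers/<driver>/new_id (path chosen by the caller).
 * Returns 0 on success and -1 on failure.
 */
int add_dynamic_id(const char *new_id_path, unsigned vendor, unsigned device)
{
    FILE *f = fopen(new_id_path, "w");
    int ok;

    if (f == NULL)
        return -1;

    /* The kernel parses the IDs as hexadecimal numbers. */
    ok = fprintf(f, "%04x %04x\n", vendor, device) > 0;

    /* The write only reaches the kernel on flush/close, so check that too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Writing to the real sysfs file needs root, and the driver must already
be loaded for the file to exist.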

thanks,
-chris


* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01  0:40       ` [stable] " Chris Wright
@ 2007-05-01  0:45         ` Jeff Garzik
  2007-05-01  4:58           ` Greg KH
  0 siblings, 1 reply; 233+ messages in thread
From: Jeff Garzik @ 2007-05-01  0:45 UTC (permalink / raw)
  To: Chris Wright; +Cc: Andrew Morton, linux-kernel, stable

Chris Wright wrote:
> 2) you can add them
> runtime in userspace (and for pcmcia too after patch in question is
> applied), so we've historically avoided that kind of patch for -stable.


Due to distro installer environments, and very poor support for making 
dynamic PCI IDs persistent once added, what you describe is more of a 
goal than reality.

	Jeff




* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01  0:45         ` Jeff Garzik
@ 2007-05-01  4:58           ` Greg KH
  2007-05-01 16:14             ` Chuck Ebbert
  0 siblings, 1 reply; 233+ messages in thread
From: Greg KH @ 2007-05-01  4:58 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Chris Wright, Andrew Morton, linux-kernel, stable

On Mon, Apr 30, 2007 at 08:45:25PM -0400, Jeff Garzik wrote:
> Chris Wright wrote:
> > 2) you can add them
> > runtime in userspace (and for pcmcia too after patch in question is
> > applied), so we've historically avoided that kind of patch for -stable.
> 
> 
> Due to distro installer environments, and very poor support for making 
> dynamic PCI IDs persistent once added, what you describe is more of a 
> goal than reality.

But distros can easily add the device id to their kernel if needed, it
isn't something that the -stable tree should be accepting.  Otherwise, we
will be swamped with those types of patches...

thanks,

greg k-h


* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01  4:58           ` Greg KH
@ 2007-05-01 16:14             ` Chuck Ebbert
  2007-05-01 16:40               ` Alan Cox
  0 siblings, 1 reply; 233+ messages in thread
From: Chuck Ebbert @ 2007-05-01 16:14 UTC (permalink / raw)
  To: Greg KH; +Cc: Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel, stable

Greg KH wrote:
> On Mon, Apr 30, 2007 at 08:45:25PM -0400, Jeff Garzik wrote:
>> Chris Wright wrote:
>>> 2) you can add them
>>> runtime in userspace (and for pcmcia too after patch in question is
>>> applied), so we've historically avoided that kind of patch for -stable.
>>
>> Due to distro installer environments, and very poor support for making 
>> dynamic PCI IDs persistent once added, what you describe is more of a 
>> goal than reality.
> 
> But distros can easily add the device id to their kernel if needed, it
> isn't something that the -stable tree should be accepting.  Otherwise, we
> will be swamped with those types of patches...
> 

Oh sure, leave the distros swamped with them instead. :)

And they all have to do it separately, meaning they don't stay in sync
and they duplicate each other's work...



* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01 16:14             ` Chuck Ebbert
@ 2007-05-01 16:40               ` Alan Cox
  2007-05-01 23:34                 ` Greg KH
  0 siblings, 1 reply; 233+ messages in thread
From: Alan Cox @ 2007-05-01 16:40 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Greg KH, Jeff Garzik, Chris Wright, Andrew Morton, linux-kernel,
	stable

> > But distros can easily add the device id to their kernel if needed, it
> > isn't something that the -stable tree should be accepting.  Otherwise, we
> > will be swamped with those types of patches...
> > 
> 
> Oh sure, leave the distros swamped with them instead. :)
> 
> And they all have to do it separately, meaning they don't stay in sync
> and they duplicate each other's work...

Well they *don't* have to work that separately. They could set up some
shared tree which would look suspiciously like what Greg is doing but
with the ID updates.... ;)


* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01 16:40               ` Alan Cox
@ 2007-05-01 23:34                 ` Greg KH
  2007-05-02  0:52                   ` Chris Wright
  0 siblings, 1 reply; 233+ messages in thread
From: Greg KH @ 2007-05-01 23:34 UTC (permalink / raw)
  To: Alan Cox
  Cc: Chuck Ebbert, Jeff Garzik, Chris Wright, Andrew Morton,
	linux-kernel, stable

On Tue, May 01, 2007 at 05:40:33PM +0100, Alan Cox wrote:
> > > But distros can easily add the device id to their kernel if needed, it
> > > isn't something that the -stable tree should be accepting.  Otherwise, we
> > > will be swamped with those types of patches...
> > > 
> > 
> > Oh sure, leave the distros swamped with them instead. :)
> > 
> > And they all have to do it separately, meaning they don't stay in sync
> > and they duplicate each other's work...
> 
> Well they *don't* have to work that separately. They could set up some
> shared tree which would look suspiciously like what Greg is doing but
> with the ID updates.... ;)

And is this really a problem?  The whole goal of the -stable tree was to
accommodate the users who relied on kernel.org kernels, and wanted
bugfixes and security updates.  It was not for new features or new
hardware support.

If people feel we should revisit this goal, then that's fine, and I have
no objection to that.  But until then, I think the rules that we have
had in place for over the past 2 years should still remain in effect.

thanks,

greg k-h


* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-01 23:34                 ` Greg KH
@ 2007-05-02  0:52                   ` Chris Wright
  2007-05-02 14:10                     ` Chuck Ebbert
  0 siblings, 1 reply; 233+ messages in thread
From: Chris Wright @ 2007-05-02  0:52 UTC (permalink / raw)
  To: Greg KH
  Cc: Alan Cox, Chuck Ebbert, Jeff Garzik, Chris Wright, Andrew Morton,
	linux-kernel, stable

* Greg KH (greg@kroah.com) wrote:
> And is this really a problem?  The whole goal of the -stable tree was to
> accommodate the users who relied on kernel.org kernels, and wanted
> bugfixes and security updates.  It was not for new features or new
> hardware support.
> 
> If people feel we should revisit this goal, then that's fine, and I have
> no objection to that.  But until then, I think the rules that we have
> had in place for over the past 2 years should still remain in effect.

I have to agree.  I went back through my mbox and found vanishingly few
pci_id update patches.   So it's not clear there's even a big issue.

thanks,
-chris


* Re: [stable] to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-05-02  0:52                   ` Chris Wright
@ 2007-05-02 14:10                     ` Chuck Ebbert
  0 siblings, 0 replies; 233+ messages in thread
From: Chuck Ebbert @ 2007-05-02 14:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Greg KH, Alan Cox, Jeff Garzik, Andrew Morton, linux-kernel,
	stable

Chris Wright wrote:
> * Greg KH (greg@kroah.com) wrote:
>> And is this really a problem?  The whole goal of the -stable tree was to
>> accommodate the users who relied on kernel.org kernels, and wanted
>> bugfixes and security updates.  It was not for new features or new
>> hardware support.
>>
>> If people feel we should revisit this goal, then that's fine, and I have
>> no objection to that.  But until then, I think the rules that we have
>> had in place for over the past 2 years should still remain in effect.
> 
> I have to agree.  I went back through my mbox and found vanishingly few
> pci_id update patches.   So it's not clear there's even a big issue.
> 

Of course you didn't find many -- most people know that's not part of
the -stable charter. If you asked for them you'd get them...


* Re: to something appropriate (was Re: 2.6.22 -mm merge plans)
  2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
  2007-05-01  0:07   ` Dave Jones
  2007-05-01  0:09   ` Andrew Morton
@ 2007-05-01  9:49   ` Alan Cox
  2 siblings, 0 replies; 233+ messages in thread
From: Alan Cox @ 2007-05-01  9:49 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Andrew Morton, linux-kernel

> Tejun helpfully posted a bunch of clashing patches for all the ACPI 
> stuff :)  You might be better off dropping and getting a resend after 
> the dust settles.

Agree about the ACPI stuff.
 
> That LED trigger patch seems technically correct, but also filling a 
> need that few have.  IMO it craps up the hot path for little gain.

It only touches the affected devices and if it's still an issue then it
should be via an arch define as well. Can you fire something off to the
submitter Jeff so we've got a direction ?



* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
  2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
@ 2007-04-30 23:59 ` Bill Irwin
  2007-05-01  0:09 ` nfsd/md patches " Neil Brown
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Bill Irwin @ 2007-04-30 23:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
>  proper-prototype-for-hugetlb_get_unmapped_area.patch
...
>  convert-hugetlbfs-to-use-vm_ops-fault.patch
...
>  get_unmapped_area-handles-map_fixed-in-hugetlbfs.patch
...
>  get_unmapped_area-doesnt-need-hugetlbfs-hacks-anymore.patch
...
> Will merge.

I've gone over these again and all are still good. The same holds for
the get_unmapped_area() series in general where I've reviewed it for
hugetlb relevance.


-- wli


* nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
  2007-04-30 23:48 ` to something appropriate (was Re: 2.6.22 -mm merge plans) Jeff Garzik
  2007-04-30 23:59 ` 2.6.22 -mm merge plans Bill Irwin
@ 2007-05-01  0:09 ` Neil Brown
  2007-05-01  9:08   ` Christoph Hellwig
  2007-05-01  0:54 ` MADV_FREE functionality Rik van Riel
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Neil Brown @ 2007-05-01  0:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Monday April 30, akpm@linux-foundation.org wrote:
> 
>  remove-nfs4_acl_add_ace.patch
>  the-nfsv2-nfsv3-server-does-not-handle-zero-length-write.patch
>  knfsd-rename-sk_defer_lock-to-sk_lock.patch
>  nfsd-nfs4state-remove-unnecessary-daemonize-call.patch
>  rpc-add-wrapper-for-svc_reserve-to-account-for-checksum.patch
> 
> nfsd things - will merge after checking with Neil.
> 

All acked, though that last one won't fix any oopses like the comment
hopes for - I really should look into that.


> 
>  drivers-mdc-use-array_size-macro-when-appropriate.patch
>  md-cleanup-use-seq_release_private-where-appropriate.patch
>  md-remove-broken-sigkill-support.patch
> 
> Will merge after checking with Neil

NAK on md-remove-broken-sigkill-support.patch - I'll follow up the
original mail.

ACK on the other two.


> 
>  md-dm-reduce-stack-usage-with-stacked-block-devices.patch
> 
> Will we ever fix this?
> 

I think we have several votes for "just merge it".  I don't think
there are known problems with it.

NeilBrown


* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01  0:09 ` nfsd/md patches " Neil Brown
@ 2007-05-01  9:08   ` Christoph Hellwig
  2007-05-01  9:15     ` Andrew Morton
  2007-05-01  9:52     ` Neil Brown
  0 siblings, 2 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  9:08 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andrew Morton, linux-kernel, linux-mm

apropos nfsd patches, what's the merge plans for my two export ops
patch series?


* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01  9:08   ` Christoph Hellwig
@ 2007-05-01  9:15     ` Andrew Morton
  2007-05-01  9:21       ` Christoph Hellwig
  2007-05-01  9:52     ` Neil Brown
  1 sibling, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  9:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Neil Brown, linux-kernel, linux-mm

On Tue, 1 May 2007 10:08:43 +0100 Christoph Hellwig <hch@infradead.org> wrote:

> apropos nfsd patches, what's the merge plans for my two export ops
> patch series?

box:/usr/src/25/patches> grep -l '^From:.*hch' $(cat-series ../series )
dvb_en_50221-convert-to-kthread-api.patch
simplify-the-stacktrace-code.patch
vfs-remove-superflous-sb-==-null-checks.patch
nameic-remove-utterly-outdated-comment.patch
move-die-notifier-handling-to-common-code.patch
merge-compat_ioctlh-into-compat_ioctlc.patch
cleanup-compat-ioctl-handling.patch
kprobes-use-hlist_for_each_entry.patch
kprobes-codingstyle-cleanups.patch
kprobes-kretprobes-simplifcations.patch

I give up.  Where are they hiding?


* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01  9:15     ` Andrew Morton
@ 2007-05-01  9:21       ` Christoph Hellwig
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  9:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Hellwig, Neil Brown, linux-kernel, linux-mm

On Tue, May 01, 2007 at 02:15:25AM -0700, Andrew Morton wrote:
> On Tue, 1 May 2007 10:08:43 +0100 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > apropos nfsd patches, what's the merge plans for my two export ops
> > patch series?

This question was directed to Neil, sorry.  

> box:/usr/src/25/patches> grep -l '^From:.*hch' $(cat-series ../series )
> dvb_en_50221-convert-to-kthread-api.patch
> simplify-the-stacktrace-code.patch
> vfs-remove-superflous-sb-==-null-checks.patch
> nameic-remove-utterly-outdated-comment.patch
> move-die-notifier-handling-to-common-code.patch
> merge-compat_ioctlh-into-compat_ioctlc.patch
> cleanup-compat-ioctl-handling.patch
> kprobes-use-hlist_for_each_entry.patch
> kprobes-codingstyle-cleanups.patch
> kprobes-kretprobes-simplifcations.patch
> 
> I give up.  Where are they hiding?

Good question :)  I sent them to the nfs list.


* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01  9:08   ` Christoph Hellwig
  2007-05-01  9:15     ` Andrew Morton
@ 2007-05-01  9:52     ` Neil Brown
  2007-05-01 10:15       ` Christoph Hellwig
  1 sibling, 1 reply; 233+ messages in thread
From: Neil Brown @ 2007-05-01  9:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm

On Tuesday May 1, hch@infradead.org wrote:
> apropos nfsd patches, what's the merge plans for my two export ops
> patch series?

Still sitting in my tree - I've had my mind on other things
(nfs-utils, portmap....) and let them slip - sorry.

I think also there was an unanswered question about the second series
(the first I am completely happy with).

> Date: Fri, 30 Mar 2007 13:34:53 +1000
> 
> My only question involves motivation.
> 
>   You say "less complex", but to me it just looks "different" - though
>   being very familiar with the original, that might be a biased view.
>   Can you say more about how it is less complex?
>   Maybe the extension to generic 64bit support will make that clear...
> 
>   But then generic 64bit support should just be an independent set of
>   helper functions that can be plugged in to the export_operations
>   structure.
> 

I think I programmed myself to use a reply to that as my wake_up
to forward them on, and forgot to register a timeout handler....

NeilBrown


* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01  9:52     ` Neil Brown
@ 2007-05-01 10:15       ` Christoph Hellwig
  2007-05-01 14:34         ` Trond Myklebust
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01 10:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm

On Tue, May 01, 2007 at 07:52:11PM +1000, Neil Brown wrote:
> On Tuesday May 1, hch@infradead.org wrote:
> > apropos nfsd patches, what's the merge plans for my two export ops
> > patch series?
> 
> Still sitting in my tree - I've had my mind on other things
> (nfs-utils, portmap....) and let them slip - sorry.
> 
> I think also there was an unanswered question about the second series
> (the first I am completely happy with).

Ah sorry, this mail got lost somewhere.  I'll reply on the nfs list
because we have a little more context there. (and due to the subscribers
only policy I can't crosspost unfortunately)



* Re: nfsd/md patches Re: 2.6.22 -mm merge plans
  2007-05-01 10:15       ` Christoph Hellwig
@ 2007-05-01 14:34         ` Trond Myklebust
  0 siblings, 0 replies; 233+ messages in thread
From: Trond Myklebust @ 2007-05-01 14:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Neil Brown, Andrew Morton, linux-kernel, linux-mm

On Tue, 2007-05-01 at 11:15 +0100, Christoph Hellwig wrote:
> On Tue, May 01, 2007 at 07:52:11PM +1000, Neil Brown wrote:
> > On Tuesday May 1, hch@infradead.org wrote:
> > > apropos nfsd patches, what's the merge plans for my two export ops
> > > patch series?
> > 
> > Still sitting in my tree - I've had my mind on other things
> > (nfs-utils, portmap....) and let them slip - sorry.
> > 
> > I think also there was an unanswered question about the second series
> > (the first I am completely happy with).
> 
> Ah sorry, this mail got lost somewhere.  I'll reply on the nfs list
> because we have a little more context there. (and due to the subscribers
> only policy I can't crosspost unfortunately)

I thought we lifted the subscribers only policy quite a while back. There
should be nothing preventing you from cross-posting.

Cheers
  Trond



* MADV_FREE functionality
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (2 preceding siblings ...)
  2007-05-01  0:09 ` nfsd/md patches " Neil Brown
@ 2007-05-01  0:54 ` Rik van Riel
  2007-05-01  1:18   ` Andrew Morton
  2007-05-01  1:23   ` Ulrich Drepper
  2007-05-01  1:39 ` 2.6.22 -mm merge plans Stefan Richter
                   ` (18 subsequent siblings)
  22 siblings, 2 replies; 233+ messages in thread
From: Rik van Riel @ 2007-05-01  0:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:

>  lazy-freeing-of-memory-through-madv_free.patch
>  lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
>  restore-madv_dontneed-to-its-original-linux-behaviour.patch
> 
> I think the MADV_FREE changes need more work:
> 
> We need crystal-clear statements regarding the present functionality, the new
> functionality and how these relate to the spec and to implementations in other
> OS'es.  Once we have that info we are in a position to work out whether the
> code can be merged as-is, or if additional changes are needed.

There are two MADV variants that free pages, both do the exact
same thing with mapped file pages, but both do something slightly
different with anonymous pages.

MADV_DONTNEED will unmap file pages and free anonymous pages.
When a process accesses anonymous memory at an address that
was zapped with MADV_DONTNEED, it will return fresh zero filled
pages.

MADV_FREE will unmap file pages.  MADV_FREE on anonymous pages
is interpreted as a signal that the application no longer needs
the data in the pages, and they can be thrown away if the kernel
needs the memory for something else.  However, if the process
accesses the memory again before the kernel needs it, the process
will simply get the original pages back.  If the kernel needed
the memory first, the process will get a fresh zero filled page
like with MADV_DONTNEED.

In short:
- both MADV_FREE and MADV_DONTNEED only unmap file pages
- after MADV_DONTNEED the application will always get back
   fresh zero filled anonymous pages when accessing the
   memory
- after MADV_FREE the application can either get back the
   original data (without a page fault) or zero filled
   anonymous memory
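
The summary above can be checked with a small C sketch for private
anonymous mappings.  It assumes Linux and a libc that defines MADV_FREE
(the second function returns -1 where the constant is missing or the
kernel rejects it); the function names and REGION_SIZE are illustrative
only and do not come from the patches.

```c
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS and madvise() on glibc */
#include <assert.h>       /* assert(), for callers that check the results */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE 8192  /* illustrative size, two 4K pages */

/* Map a private anonymous region and fill it with 'x'; NULL on failure. */
static unsigned char *map_filled_region(void)
{
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memset(p, 'x', REGION_SIZE);
    return p;
}

/*
 * MADV_DONTNEED: the old contents are dropped at once, so the next read
 * faults in a fresh zero-filled page.
 * Returns 1 if the first byte reads back as zero, 0 if not, -1 on error.
 */
int dontneed_zeroes_anonymous_page(void)
{
    unsigned char *p = map_filled_region();
    int result;

    if (p == NULL)
        return -1;
    if (madvise(p, REGION_SIZE, MADV_DONTNEED) != 0) {
        munmap(p, REGION_SIZE);
        return -1;
    }
    result = (p[0] == 0);
    munmap(p, REGION_SIZE);
    return result;
}

/*
 * MADV_FREE: the kernel may reclaim the pages later, so a read sees either
 * the original data or zeroes, depending on memory pressure.
 * Returns 1 if the first byte is 'x' or zero, 0 otherwise, -1 if MADV_FREE
 * is not available.
 */
int free_keeps_or_zeroes_anonymous_page(void)
{
#ifdef MADV_FREE
    unsigned char *p = map_filled_region();
    int result;

    if (p == NULL)
        return -1;
    if (madvise(p, REGION_SIZE, MADV_FREE) != 0) {
        munmap(p, REGION_SIZE);
        return -1;
    }
    result = (p[0] == 'x' || p[0] == 0);
    munmap(p, REGION_SIZE);
    return result;
#else
    return -1;
#endif
}
```

An application that reuses a region after MADV_FREE has to tolerate both
outcomes; one that needs guaranteed zeroes should use MADV_DONTNEED.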

The Linux MADV_DONTNEED behavior is not POSIX compliant.
POSIX says that with MADV_DONTNEED the application's data
will be preserved.

Currently glibc simply ignores POSIX_MADV_DONTNEED requests
from applications on Linux.  Changing the behaviour which
some Linux applications may rely on might not be the best
idea.

If you want POSIX_MADV_DONTNEED behaviour added, please let
me know and I'll whip up a patch.

> Because right now, I don't know where we are with respect to these things and
> I doubt if many of our users know either.  How can Michael write a manpage for
> this if we don't tell him what it all does?

If you need any additional information, please let me know.

If you still think the MADV_FREE patches themselves should
not be merged yet, can we at least merge the #defines, so
the Fedora kernel can get the MADV_FREE functionality?

Again, I'd be more than willing to whip up a patch for that.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: MADV_FREE functionality
  2007-05-01  0:54 ` MADV_FREE functionality Rik van Riel
@ 2007-05-01  1:18   ` Andrew Morton
  2007-05-01  1:23     ` Rik van Riel
  2007-05-01  7:13     ` Jakub Jelinek
  2007-05-01  1:23   ` Ulrich Drepper
  1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  1:18 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm, Michael Kerrisk

On Mon, 30 Apr 2007 20:54:02 -0400 Rik van Riel <riel@redhat.com> wrote:

> Andrew Morton wrote:
> 
> >  lazy-freeing-of-memory-through-madv_free.patch
> >  lazy-freeing-of-memory-through-madv_free-vs-mm-madvise-avoid-exclusive-mmap_sem.patch
> >  restore-madv_dontneed-to-its-original-linux-behaviour.patch
> > 
> > I think the MADV_FREE changes need more work:
> > 
> > We need crystal-clear statements regarding the present functionality, the new
> > functionality and how these relate to the spec and to implementations in other
> > OS'es.  Once we have that info we are in a position to work out whether the
> > code can be merged as-is, or if additional changes are needed.
> 
> There are two MADV variants that free pages, both do the exact
> same thing with mapped file pages, but both do something slightly
> different with anonymous pages.
> 
> MADV_DONTNEED will unmap file pages and free anonymous pages.
> When a process accesses anonymous memory at an address that
> was zapped with MADV_DONTNEED, it will return fresh zero filled
> pages.
> 
> MADV_FREE will unmap file pages.  MADV_FREE on anonymous pages
> is interpreted as a signal that the application no longer needs
> the data in the pages, and they can be thrown away if the kernel
> needs the memory for something else.  However, if the process
> accesses the memory again before the kernel needs it, the process
> will simply get the original pages back.  If the kernel needed
> the memory first, the process will get a fresh zero filled page
> like with MADV_DONTNEED.
> 
> In short:
> - both MADV_FREE and MADV_DONTNEED only unmap file pages
> - after MADV_DONTNEED the application will always get back
>    fresh zero filled anonymous pages when accessing the
>    memory
> - after MADV_FREE the application can either get back the
>    original data (without a page fault) or zero filled
>    anonymous memory
> 
> The Linux MADV_DONTNEED behavior is not POSIX compliant.
> POSIX says that with MADV_DONTNEED the application's data
> will be preserved.
> 
> Currently glibc simply ignores POSIX_MADV_DONTNEED requests
> from applications on Linux.  Changing the behaviour which
> some Linux applications may rely on might not be the best
> idea.

OK, thanks.  I stuck that in the changelog.

Michael, do you think that's enough to finalise a manpage?

> If you need any additional information, please let me know.

The patch doesn't update the various comments in madvise.c at all, which is
a surprise.  Could you please check that they are all accurate and complete?

Also, where did we end up with the Solaris compatibility?

The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
which should be good.

Did we decide that the Solaris and Linux implementations of MADV_FREE are
compatible?

What about the Solaris and Linux MADV_DONTNEED implementations?


* Re: MADV_FREE functionality
  2007-05-01  1:18   ` Andrew Morton
@ 2007-05-01  1:23     ` Rik van Riel
  2007-05-01  7:13     ` Jakub Jelinek
  1 sibling, 0 replies; 233+ messages in thread
From: Rik van Riel @ 2007-05-01  1:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Michael Kerrisk

Andrew Morton wrote:

>> If you need any additional information, please let me know.
> 
> The patch doesn't update the various comments in madvise.c at all, which is
> a surprise.  Could you please check that they are all accurate and complete?

I'll take a look.

> Also, where did we end up with the Solaris compatibility?
> 
> The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
> which should be good.
> 
> Did we decide that the Solaris and Linux implementations of MADV_FREE are
> compatible?

Yes, the Linux, Solaris and FreeBSD implementations of MADV_FREE
appear to have equivalent semantics.

> What about the Solaris and Linux MADV_DONTNEED implementations?

This was never, and is still not, the same.  Linux will throw
away the data in anonymous pages while POSIX says we should
simply move the data to swap.  I assume Solaris and FreeBSD
will move the data to swap instead of throwing it away.

For file backed pages I suspect they all behave the same.

This is the reason that inside glibc, POSIX_MADV_DONTNEED is
a noop.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


* Re: MADV_FREE functionality
  2007-05-01  1:18   ` Andrew Morton
  2007-05-01  1:23     ` Rik van Riel
@ 2007-05-01  7:13     ` Jakub Jelinek
  1 sibling, 0 replies; 233+ messages in thread
From: Jakub Jelinek @ 2007-05-01  7:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, linux-kernel, linux-mm, Michael Kerrisk

On Mon, Apr 30, 2007 at 06:18:39PM -0700, Andrew Morton wrote:
> > In short:
> > - both MADV_FREE and MADV_DONTNEED only unmap file pages
> > - after MADV_DONTNEED the application will always get back
> >    fresh zero filled anonymous pages when accessing the
> >    memory
> > - after MADV_FREE the application can either get back the
> >    original data (without a page fault) or zero filled
> >    anonymous memory
> > 
> > The Linux MADV_DONTNEED behavior is not POSIX compliant.
> > POSIX says that with MADV_DONTNEED the application's data
> > will be preserved.
> > 
> > Currently glibc simply ignores POSIX_MADV_DONTNEED requests
> > from applications on Linux.  Changing the behaviour which
> > some Linux applications may rely on might not be the best
> > idea.
> 
> OK, thanks.  I stuck that in the changelog.

FYI, Solaris man page on MADV_FREE says:

      MADV_FREE
            Tells  the  kernel  that  contents  in  the  specified
            address  range  are  no longer important and the range
            will be overwritten. When there is demand for  memory,
            the  system will free pages associated with the speci-
            fied address range. In this instance, the next time  a
            page  in the address range is referenced, it will
            contain all zeroes.  Otherwise, it will contain the  data
            that was there prior to the MADV_FREE call. References
            made to the address range will  not  make  the  system
            read from backing store (swap space) until the page is
            modified again.

            This value cannot be used on mappings that have under-
            lying file objects.

The last paragraph seems to be just about the operation being
undefined, madvise MADV_FREE on MAP_SHARED file mapping returns 0
rather than flagging an error.

FreeBSD man page:

        MADV_FREE        Gives the VM system the freedom to free pages, and tells
                         the system that information in the specified page range
                         is no longer important.  This is an efficient way of
                         allowing malloc(3) to free pages anywhere in the address
                         space, while keeping the address space valid.  The next
                         time that the page is referenced, the page might be
                         demand zeroed, or might contain the data that was there
                         before the MADV_FREE call.  References made to that
                         address space range will not make the VM system page the
                         information back in from backing store until the page is
                         modified again.

> Also, where did we end up with the Solaris compatibility?
> 
> The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
> which should be good.
> 
> Did we decide that the Solaris and Linux implementations of MADV_FREE are
> compatible?

SPARC Solaris binary compatibility in Linux is in really bad shape.  madvise
in Solaris is implemented using the memcntl syscall (at least according to
truss(1)), and that syscall is
systbl.S:       .word solaris_unimplemented     /* memcntl              131     */
When/if anyone decides to put more effort into the Solaris binary compatibility
(I'm quite doubtful anyone will), codes which don't match can simply be
translated into other codes, ignored etc.; we can't use sys_madvise to
implement the memcntl syscall anyway.

Solaris MADV_FREE is the same as the Linux MADV_FREE proposed by Rik, except
perhaps for the documented undefined behavior with file mappings.  With this
test program:

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int
main (void)
{
  getpid ();
  int fd = open ("test", O_RDWR);
  void *p = mmap (0, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  /* dirty the whole mapping with spaces */
  memset (p, ' ', 8192);
  madvise (p, 8192, MADV_FREE);
  return 0;
}

on Solaris the spaces actually made it into the file.  MADV_DONTNEED is not
the same, but that doesn't really matter except for the arch/sparc*/solaris/
layer, if anyone cares.  We certainly can't change the current MADV_DONTNEED
behavior; all we can do is implement a new MADV_* code with a different
behavior and let glibc translate the POSIX_MADV_* codes passed to
posix_madvise into the Linux-specific MADV_*.

	Jakub
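
The semantics described in the two man pages can be checked from userspace.
The following is a minimal sketch, not code from this thread: it assumes a
private anonymous mapping (the case both man pages cover) and a C library
that defines MADV_FREE, and the helper name madv_free_check is hypothetical.
After the hint, each byte read back should be either the old data or zero;
a kernel that rejects the advice simply leaves the data in place.

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS and MADV_FREE on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not from the thread: fill an anonymous private
 * mapping, hint MADV_FREE, then check what a later read can observe.
 * Returns 0 if every byte is either the old data or zero, 1 if some
 * other value shows up, and -1 if the mapping could not be created.
 */
int madv_free_check(size_t len)
{
    unsigned char *p;
    int result = 0;
    size_t i;

    assert(len > 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, ' ', len);

#ifdef MADV_FREE
    /*
     * Advice only: the return value is ignored on purpose, because a
     * kernel without MADV_FREE support fails the call and keeps the data.
     */
    (void) madvise(p, len, MADV_FREE);
#endif

    /* Each byte must be the old contents or a freshly zeroed page. */
    for (i = 0; i < len; i++) {
        if (p[i] != ' ' && p[i] != 0) {
            result = 1;
            break;
        }
    }

    munmap(p, len);
    return result;
}
```

Calling madv_free_check(8192) uses the same 8192-byte length as the test
program above, and is expected to return 0 whether or not any page was
reclaimed between the hint and the read.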


* Re: MADV_FREE functionality
  2007-05-01  0:54 ` MADV_FREE functionality Rik van Riel
  2007-05-01  1:18   ` Andrew Morton
@ 2007-05-01  1:23   ` Ulrich Drepper
  1 sibling, 0 replies; 233+ messages in thread
From: Ulrich Drepper @ 2007-05-01  1:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Andrew Morton, linux-kernel, linux-mm

On 4/30/07, Rik van Riel <riel@redhat.com> wrote:
> Andrew Morton wrote:
> > Because right now, I don't know where we are with respect to these things and
> > I doubt if many of our users know either.  How can Michael write a manpage for
> > this if we don't tell him what it all does?

I think we've been very clear before and Rik's description here puts
it all nicely in one place.  If you're worried about semantics you can
rest assured, it is all sound.  If this is what is holding up the
patch then add it to your collection.  Only if you have technical
objections should you hold it off.  The patch makes sense (and has
been validated by being implemented in the same way on other OSes) and
it is really needed.


* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (3 preceding siblings ...)
  2007-05-01  0:54 ` MADV_FREE functionality Rik van Riel
@ 2007-05-01  1:39 ` Stefan Richter
  2007-05-01  2:30 ` 2.6.22 -mm merge plans (RE: input) Dmitry Torokhov
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Stefan Richter @ 2007-05-01  1:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

Andrew Morton wrote:
>  sbp2-include-fixes.patch
>  ieee1394-iso-needs-schedh.patch
> 
> For Stephan

They were merged some hours ago.
-- 
Stefan Richter
-=====-=-=== -=-= ----=
http://arcgraph.de/sr/


* Re: 2.6.22 -mm merge plans (RE: input)
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (4 preceding siblings ...)
  2007-05-01  1:39 ` 2.6.22 -mm merge plans Stefan Richter
@ 2007-05-01  2:30 ` Dmitry Torokhov
  2007-05-01  8:14   ` Jiri Slaby
  2007-05-01  8:11 ` 2.6.22 -mm merge plans -- pfn_valid_within Andy Whitcroft
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Dmitry Torokhov @ 2007-05-01  2:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Éric Piel, Jiri Slaby

On Monday 30 April 2007 19:20, Andrew Morton wrote:
> 
>  input-convert-from-class-devices-to-standard-devices.patch
>  input-evdev-implement-proper-locking.patch
>  mousedev-fix.patch
>  mousedev-fix-2.patch
> 
> Dmitry will merge these once Greg has merged the preparatory work.  Except these
> patches make the Vaio-of-doom crash in obscure circumstances, and we weren't
> able to fix that?
> 

I would like to keep these cooking in your tree until we get your Vaio
going, if you don't mind.

>  wistron_btns-add-led-support.patch

Will review once again and apply.

>  input-ff-add-ff_raw-effect.patch
>  input-phantom-add-a-new-driver.patch
>

It looks like Phantom will not be using the input layer...

>  input-rfkill-add-support-for-input-key-to-control-wireless-radio.patch
> 
> Will resend to davem once the preparatory bits are merged by Greg.
>

You mean me, right? I need to do some locking changes that DaveM
pointed out.
 
-- 
Dmitry


* Re: 2.6.22 -mm merge plans (RE: input)
  2007-05-01  2:30 ` 2.6.22 -mm merge plans (RE: input) Dmitry Torokhov
@ 2007-05-01  8:14   ` Jiri Slaby
  2007-05-01 12:05     ` Dmitry Torokhov
  0 siblings, 1 reply; 233+ messages in thread
From: Jiri Slaby @ 2007-05-01  8:14 UTC (permalink / raw)
  To: Dmitry Torokhov; +Cc: Andrew Morton, linux-kernel, Jiri Slaby

Dmitry Torokhov napsal(a):
> On Monday 30 April 2007 19:20, Andrew Morton wrote:
>>  input-ff-add-ff_raw-effect.patch
>>  input-phantom-add-a-new-driver.patch
>>
> 
> It looks like Phantom will not be using the input layer...

Yes, I have a new version, planning to test it tomorrow when I reach the device
and then post it.

You don't want it in input/misc in that case, right? If yes, Andrew, please drop
both.

thanks,
-- 
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E


* Re: 2.6.22 -mm merge plans (RE: input)
  2007-05-01  8:14   ` Jiri Slaby
@ 2007-05-01 12:05     ` Dmitry Torokhov
  0 siblings, 0 replies; 233+ messages in thread
From: Dmitry Torokhov @ 2007-05-01 12:05 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: Andrew Morton, linux-kernel

On 5/1/07, Jiri Slaby <jirislaby@gmail.com> wrote:
> Dmitry Torokhov napsal(a):
> > On Monday 30 April 2007 19:20, Andrew Morton wrote:
> >>  input-ff-add-ff_raw-effect.patch
> >>  input-phantom-add-a-new-driver.patch
> >>
> >
> > It looks like Phantom will not be using the input layer...
>
> Yes, I have a new version, planning to test it tomorrow when I reach the device
> and then post it.
>
> You don't want it in input/misc in that case, right?

Correct. Input/misc is only visible if CONFIG_INPUT is selected. But
if Phantom is not using the input layer it should not depend on
CONFIG_INPUT. I'd put it into drivers/misc...

-- 
Dmitry


* Re: 2.6.22 -mm merge plans -- pfn_valid_within
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (5 preceding siblings ...)
  2007-05-01  2:30 ` 2.6.22 -mm merge plans (RE: input) Dmitry Torokhov
@ 2007-05-01  8:11 ` Andy Whitcroft
  2007-05-01  8:19   ` Andrew Morton
  2007-05-01  8:42 ` "partical" kthread conversion Christoph Hellwig
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Andy Whitcroft @ 2007-05-01  8:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Mel Gorman

Andrew Morton wrote:

>  add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
>  anti-fragmentation-switch-over-to-pfn_valid_within.patch
>  lumpy-move-to-using-pfn_valid_within.patch
> 
> More Mel things, and linkage between Mel-things and lumpy reclaim.  It's here
> where the patch ordering gets into a mess and things won't improve if
> moveable-zones and lumpy-reclaim get deferred.  Such a deferral would limit my
> ability to queue more MM changes for 2.6.23.

The first of these is really a cleanup and should slide into the stack
before Mobility and Lumpy.  The other two should then join their
respective stacks anti-fragmentation-... to Mobility and lumpy-... to
Lumpy.  I would not expect them to increase linkage that way.

-apw


* Re: 2.6.22 -mm merge plans -- pfn_valid_within
  2007-05-01  8:11 ` 2.6.22 -mm merge plans -- pfn_valid_within Andy Whitcroft
@ 2007-05-01  8:19   ` Andrew Morton
  0 siblings, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  8:19 UTC (permalink / raw)
  To: Andy Whitcroft; +Cc: linux-kernel, linux-mm, Mel Gorman

On Tue, 01 May 2007 09:11:18 +0100 Andy Whitcroft <apw@shadowen.org> wrote:

> Andrew Morton wrote:
> 
> >  add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch
> >  anti-fragmentation-switch-over-to-pfn_valid_within.patch
> >  lumpy-move-to-using-pfn_valid_within.patch
> > 
> > More Mel things, and linkage between Mel-things and lumpy reclaim.  It's here
> > where the patch ordering gets into a mess and things won't improve if
> > moveable-zones and lumpy-reclaim get deferred.  Such a deferral would limit my
> > ability to queue more MM changes for 2.6.23.
> 
> The first of these is really a cleanup and should slide into the stack
> before Mobility and Lumpy.  The other two should then join their
> respective stacks anti-fragmentation-... to Mobility and lumpy-... to
> Lumpy.  I would not expect them to increase linkage that way.
> 

yup, that improved things a bit, thanks.



* "partical" kthread conversion
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (6 preceding siblings ...)
  2007-05-01  8:11 ` 2.6.22 -mm merge plans -- pfn_valid_within Andy Whitcroft
@ 2007-05-01  8:42 ` Christoph Hellwig
  2007-05-01  8:51   ` Andrew Morton
  2007-05-01  8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  8:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, ebiederm

On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
>  macintosh-mediabay-convert-to-kthread-api.patch
>  macintosh-adb-convert-to-the-kthread-api.patch
>  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
>  powerpc-pseries-rtasd-convert-to-kthread-api.patch
>  powerpc-pseries-eeh-convert-to-kthread-api.patch
> 
> Will send to paulus (I already did - does Paul not handle the macintosh
> driver?)

Please don't send out the partial kthread conversions, as they're not
that helpful.  Depending on the way we'll let the API evolve a
kthread_create/run not paired by a kthread_stop might be actually harmful.

Please only send along patches that are paired or always built in so that
they don't require stopping at all.

Btw, many of the drivers above should probably go to benh.

There's probably a few more patches falling into this category, these
were just the first ones that stuck in my eye.
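
The pairing asked for here can be illustrated outside the kernel.  The sketch
below is a userspace analogy using POSIX threads, not the kernel kthread API:
worker_run, worker_stop and struct worker are hypothetical names standing in
for kthread_run, kthread_stop and the thread's state, and the should_stop flag
plays the role of kthread_should_stop().  The point is only that whoever
starts the thread also owns a matching stop that waits for it to exit.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a kernel thread and its state. */
struct worker {
    pthread_t tid;
    atomic_bool should_stop;    /* analogue of kthread_should_stop() */
    atomic_long iterations;     /* work counter, for demonstration only */
};

static void *worker_fn(void *arg)
{
    struct worker *w = arg;

    /* Keep working until the owner asks for a stop. */
    while (!atomic_load(&w->should_stop)) {
        atomic_fetch_add(&w->iterations, 1);
        sched_yield();
    }
    return NULL;
}

/* Analogue of kthread_run(): create the thread and let it run at once. */
int worker_run(struct worker *w)
{
    assert(w != NULL);
    atomic_init(&w->should_stop, false);
    atomic_init(&w->iterations, 0);
    return pthread_create(&w->tid, NULL, worker_fn, w);
}

/* Analogue of kthread_stop(): request the exit, then wait for it. */
int worker_stop(struct worker *w)
{
    assert(w != NULL);
    atomic_store(&w->should_stop, true);
    return pthread_join(w->tid, NULL);
}
```

A conversion that only adds the worker_run half leaves nobody responsible for
the thread's exit, which is the "partial" case objected to above; code that
is always built in and never needs to stop its thread is the stated exception.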


* Re: "partical" kthread conversion
  2007-05-01  8:42 ` "partical" kthread conversion Christoph Hellwig
@ 2007-05-01  8:51   ` Andrew Morton
  2007-05-02 14:01     ` Dean Nelson
  0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  8:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, ebiederm

On Tue, 1 May 2007 09:42:45 +0100 Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
> >  macintosh-mediabay-convert-to-kthread-api.patch
> >  macintosh-adb-convert-to-the-kthread-api.patch
> >  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
> >  powerpc-pseries-rtasd-convert-to-kthread-api.patch
> >  powerpc-pseries-eeh-convert-to-kthread-api.patch
> > 
> > Will send to paulus (I already did - does Paul not handle the macintosh
> > driver?)
> 
> Please don't send out the partial kthread conversions, as they're not
> that helpful.  Depending on the way we'll let the API evolve a
> kthread_create/run not paired by a kthread_stop might be actually harmful.
> 
> Please only send along patches that are paired or always built in so that
> they don't require stopping at all.
> 
> Btw, many of the drivers above should probably go to benh.
> 
> There's probably a few more patches falling into this category, these
> were just the first ones that stuck in my eye.

Yes, I think I'll probably drop all of them - I've completely lost track of
which ones are complete, which ones need more work, etc.

I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
Tony, as people put quite a bit of review and test effort into that one.



* Re: "partical" kthread conversion
  2007-05-01  8:51   ` Andrew Morton
@ 2007-05-02 14:01     ` Dean Nelson
  2007-05-02 14:45       ` Eric W. Biederman
  0 siblings, 1 reply; 233+ messages in thread
From: Dean Nelson @ 2007-05-02 14:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: hch, ebiederm, linux-kernel

On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> On Tue, 1 May 2007 09:42:45 +0100 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
> > >  macintosh-mediabay-convert-to-kthread-api.patch
> > >  macintosh-adb-convert-to-the-kthread-api.patch
> > >  macintosh-therm_pm72c-partially-convert-to-kthread-api.patch
> > >  powerpc-pseries-rtasd-convert-to-kthread-api.patch
> > >  powerpc-pseries-eeh-convert-to-kthread-api.patch
> > > 
> > > Will send to paulus (I already did - does Paul not handle the macintosh
> > > driver?)
> > 
> > Please don't send out the partial kthread conversions, as they're not
> > that helpful.  Depending on the way we'll let the API evolve a
> > kthread_create/run not paired by a kthread_stop might be actually harmful.
> > 
> > Please only send along patches that are paired or always built in so that
> > they don't require stopping at all.
> > 
> > Btw, many of the drivers above should probably go to benh.
> > 
> > There's probably a few more patches falling into this category, these
> > were just the first ones that stuck in my eye.
> 
> Yes, I think I'll probably drop all of them - I've completely lost track of
> which ones are complete, which ones need more work, etc.
> 
> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
> Tony, as people put quite a bit of review and test effort into that one.

Andrew, I would recommend holding off on sending these xpc patches to
Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
need to generate an additional patch after I've first sorted out how
best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.

Thanks,
Dean


* Re: "partical" kthread conversion
  2007-05-02 14:01     ` Dean Nelson
@ 2007-05-02 14:45       ` Eric W. Biederman
  2007-05-02 15:37         ` Dean Nelson
  2007-05-02 19:33         ` Andrew Morton
  0 siblings, 2 replies; 233+ messages in thread
From: Eric W. Biederman @ 2007-05-02 14:45 UTC (permalink / raw)
  To: Dean Nelson; +Cc: Andrew Morton, hch, linux-kernel

Dean Nelson <dcn@sgi.com> writes:

> On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
>> > There's probably a few more patches falling into this category, these
>> > were just the first ones that stuck in my eye.
>> 
>> Yes, I think I'll probably drop all of them - I've completely lost track of
>> which ones are complete, which ones need more work, etc.

Andrew, as far as dropping them goes: if all you have is one of my dinky
patches that changes things to use kthread_run, feel free.  Because of the
general necessity of calling kthread_stop I'm going to have to rework those
anyway, and I still have the originals.

If there is something more, then we probably want to keep the patch, because
someone has actually looked at it and done something useful.

I'm just now starting to work my way through them all again paying a little
closer attention, so I can do a thorough conversion.

>> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
>> Tony, as people put quite a bit of review and test effort into that one.
>
> Andrew, I would recommend holding off on sending these xpc patches to
> Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
> need to generate an additional patch after I've first sorted out how
> best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.

Ok.  Dean, give me a couple of days or so.  I think I have just worked
through how to directly create kthreads without too much pain.  We are
still going to need kthreadd for spawning for a bit because I don't
expect all architectures to change over immediately, but I think
things can be done in a fairly simple low risk manner.

The changes to the kernel_thread replacement aren't going to be too
bad, pretty much just adding a couple of parameters.   It is
copy_thread where things get sticky.

If we can spawn threads fast enough that we don't need a thread pool, I
would rather do that.

Eric


* Re: "partical" kthread conversion
  2007-05-02 14:45       ` Eric W. Biederman
@ 2007-05-02 15:37         ` Dean Nelson
  2007-05-02 15:49           ` Eric W. Biederman
  2007-05-02 19:33         ` Andrew Morton
  1 sibling, 1 reply; 233+ messages in thread
From: Dean Nelson @ 2007-05-02 15:37 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: akpm, hch, linux-kernel

On Wed, May 02, 2007 at 08:45:54AM -0600, Eric W. Biederman wrote:
> Dean Nelson <dcn@sgi.com> writes:
> 
> > On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> >> I might send ia64-sn-xpc-convert-to-use-kthread-api.patch+fixes off to
> >> Tony, as people put quite a bit of review and test effort into that one.
> >
> > Andrew, I would recommend holding off on sending these xpc patches to
> > Tony as the kthread_run()s aren't paired with kthread_stop()s yet. I
> > need to generate an additional patch after I've first sorted out how
> > best to deal with kthread_stop()'ng XPC's pool of kthreads with Eric.
> 
> Ok.  Dean, give me a couple of days or so.  I think I have just worked
> through how to directly create kthreads without too much pain.  We are
> still going to need kthreadd for spawning for a bit because I don't
> expect all architectures to change over immediately, but I think
> things can be done in a fairly simple low risk manner.
> 
> The changes to the kernel_thread replacement aren't going to be too
> bad, pretty much just adding a couple of parameters.   It is
> copy_thread where things get sticky.
> 
> If we can spawn threads fast enough that we don't need a thread pool, I
> would rather do that.

I'd typed up some questions for you about the new patch I need to create
which I'd just sent to you, so I won't repeat them here.

Before proceeding too far with your above changes, you might wait to see
the proposal that Robin Holt is putting together for a kthread pool.
I'm not sure how spawning a thread (which involves allocation of the
task_struct amongst other things, plus scheduling) can beat a wake_up()
of an already existing thread for cost time-wise.

Dean



* Re: "partical" kthread conversion
  2007-05-02 15:37         ` Dean Nelson
@ 2007-05-02 15:49           ` Eric W. Biederman
  0 siblings, 0 replies; 233+ messages in thread
From: Eric W. Biederman @ 2007-05-02 15:49 UTC (permalink / raw)
  To: Dean Nelson; +Cc: akpm, hch, linux-kernel

Dean Nelson <dcn@sgi.com> writes:

> I'd typed up some questions for you about the new patch I need to create
> which I'd just sent to you, so I won't repeat them here.
>
> Before proceeding too far with your above changes, you might wait to see
> the proposal that Robin Holt is putting together for a kthread pool.
> I'm not sure how spawning a thread (which involves allocation of the
> task_struct amongst other things, plus scheduling) can beat a wake_up()
> of an already existing thread for cost time-wise.

A reasonable point, although if you don't happen to sleep in the allocations
I suspect time wise it's pretty much a wash.

I have some other reasons I might need the capability to clone a thread
from a non-parent process, and it has the potential to simplify some things
in the kthread case so I'm going to finish investigating, since I believe
I have figured out a path to that target.

Eric


* Re: "partical" kthread conversion
  2007-05-02 14:45       ` Eric W. Biederman
  2007-05-02 15:37         ` Dean Nelson
@ 2007-05-02 19:33         ` Andrew Morton
  2007-05-02 20:38           ` Eric W. Biederman
  1 sibling, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 19:33 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Dean Nelson, hch, linux-kernel

On Wed, 02 May 2007 08:45:54 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Dean Nelson <dcn@sgi.com> writes:
> 
> > On Tue, May 01, 2007 at 01:51:41AM -0700, Andrew Morton wrote:
> >> > There's probably a few more patches falling into this category, these
> >> > were just the first ones that stuck in my eye.
> >> 
> >> Yes, I think I'll probably drop all of them - I've completely lost track of
> >> which ones are complete, which ones need more work, etc.
> 
> Andrew, as far as dropping them goes: if all you have is one of my dinky
> patches that changes things to use kthread_run, feel free.  Because of the
> general necessity of calling kthread_stop I'm going to have to rework those
> anyway, and I still have the originals.
> 

I gave up and dropped them all - let's have another run at it.  Possibly
some of them were complete and didn't deserve dropping, in which case you
can send them straight back at me.



* Re: "partical" kthread conversion
  2007-05-02 19:33         ` Andrew Morton
@ 2007-05-02 20:38           ` Eric W. Biederman
  0 siblings, 0 replies; 233+ messages in thread
From: Eric W. Biederman @ 2007-05-02 20:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dean Nelson, hch, linux-kernel

Andrew Morton <akpm@linux-foundation.org> writes:

> I gave up and dropped them all - let's have another run at it.  Possibly
> some of them were complete and didn't deserve dropping, in which case you
> can send them straight back at me.

Sounds like a plan.

Eric



* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (7 preceding siblings ...)
  2007-05-01  8:42 ` "partical" kthread conversion Christoph Hellwig
@ 2007-05-01  8:44 ` Nick Piggin
  2007-05-01  8:54   ` Andrew Morton
  2007-05-01 19:31   ` Hugh Dickins
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-01  8:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Hugh Dickins, Andrea Arcangeli,
	Christoph Hellwig

Andrew Morton wrote:

>  mm-simplify-filemap_nopage.patch
>  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
>  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
>  mm-merge-nopfn-into-fault.patch
>  convert-hugetlbfs-to-use-vm_ops-fault.patch
>  mm-remove-legacy-cruft.patch
>  mm-debug-check-for-the-fault-vs-invalidate-race.patch

>  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch

> Miscish MM changes.  Will merge, dependent upon what still applies and works
> if the moveable-zone patches get stalled.

These fix some bugs in the core vm, at least the former one we have
seen numerous people hitting in production...

I don't suppose you mean these are logically dependent on new features
sitting below them in your patch stack, just that you don't want to
spend time fixing a lot of rejects? If so, I can help fix those up, but
I don't think there is anything major, IIRC the biggest annoyance is
just that changing some GFP_types throws some big hunks.

So, do you or anyone else have any problems with these patches going in
2.6.22? I haven't had much feedback for a while, but I was under the
impression that people are more-or-less happy with them?

mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch

This patch fixes the core filemap_nopage vs invalidate_inode_pages2
race by having filemap_nopage return a locked page to do_no_page,
and removes the fairly complex (and inadequate) truncate_count
synchronisation logic.

There were concerns that we could do this more cheaply, but I think it
is important to start with a base that is simple and more likely to
be correct and build on that. My testing didn't show any obvious
problems with performance.

mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
mm-merge-nopfn-into-fault.patch
etc.

These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
into a single, unified interface. Although this strictly closes some
similar holes in nonlinear faults as well, they are very uncommon, so
I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
any reason not to, but at least they don't fix major bugs).

-- 
SUSE Labs, Novell Inc.


* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-01  8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
@ 2007-05-01  8:54   ` Andrew Morton
  2007-05-01 19:31   ` Hugh Dickins
  1 sibling, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  8:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, linux-mm, Hugh Dickins, Andrea Arcangeli,
	Christoph Hellwig

On Tue, 01 May 2007 18:44:07 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> 
> >  mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> 
> > Miscish MM changes.  Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
> 
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
> 
> I don't suppose you mean these are logically dependent on new features
> sitting below them in your patch stack, just that you don't want to
> spend time fixing a lot of rejects?

It'll probably be OK - I just haven't checked yet.  I'm fairly handy at
fixing rejects nowadays ;)


Nobody seems to be taking up this opportunity to provide us with review
and test results on the antifrag patches.



* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-01  8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
  2007-05-01  8:54   ` Andrew Morton
@ 2007-05-01 19:31   ` Hugh Dickins
  2007-05-02  3:08     ` Nick Piggin
  1 sibling, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-01 19:31 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Tue, 1 May 2007, Nick Piggin wrote:
> Andrew Morton wrote:
> 
> >  mm-simplify-filemap_nopage.patch
> >  mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> >  mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> >  mm-merge-nopfn-into-fault.patch
> >  convert-hugetlbfs-to-use-vm_ops-fault.patch
> >  mm-remove-legacy-cruft.patch
> >  mm-debug-check-for-the-fault-vs-invalidate-race.patch
> 
> > mm-fix-clear_page_dirty_for_io-vs-fault-race.patch
> 
> > Miscish MM changes.  Will merge, dependent upon what still applies and works
> > if the moveable-zone patches get stalled.
> 
> These fix some bugs in the core vm, at least the former one we have
> seen numerous people hitting in production...
...
> 
> So, do you or anyone else have any problems with these patches going in
> 2.6.22? I haven't had much feedback for a while, but I was under the
> impression that people are more-or-less happy with them?
> 
> mm-fix-fault-vs-invalidate-race-for-linear-mappings.patch
> 
> This patch fixes the core filemap_nopage vs invalidate_inode_pages2
> race by having filemap_nopage return a locked page to do_no_page,
> and removes the fairly complex (and inadequate) truncate_count
> synchronisation logic.
> 
> There were concerns that we could do this more cheaply, but I think it
> is important to start with a base that is simple and more likely to
> be correct and build on that. My testing didn't show any obvious
> problems with performance.

I don't see _problems_ with performance, but I do consistently see the
same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
and page fault tests on SMP, several machines, just as I did last year.

I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
you posted a set of 10 and a set of 7 patches I tried in versus out;
at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
versus out; with similar results.

I did check the graphs on test.kernel.org, I couldn't see any bad
behaviour there that correlated with this work; though each -mm
has such a variety of new work in it, it's very hard to attribute.
And nobody else has reported any regression from your patches.

I'm inclined to write it off as poorer performance in some micro-
benchmarks, against which we offset the improved understandability
of holding the page lock over the file fault.

But I was quite disappointed when 
mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
appeared, putting double unmap_mapping_range calls in.  Certainly
you were wrong to take the one out, but a pity to end up with two.

Your comment says/said:
The nopage vs invalidate race fix patch did not take care of truncating
private COW pages. Mind you, I'm pretty sure this was previously racy
even for regular truncate, not to mention vmtruncate_range.

vmtruncate_range (holepunch) was deficient I agree, and though we
can now take out your second unmap_mapping_range there, that's only
because I've slipped one into shmem_truncate_range.  In due course it
needs to be properly handled by noting the range in shmem inode info.

(I think you couldn't take that approach, noting invalid range in
->mapping while invalidating, because NFS has/had some cases of
invalidate_whatever without i_mutex?)

But I'm pretty sure (to use your words!) regular truncate was not racy
before: I believe Andrea's sequence count was handling that case fine,
without a second unmap_mapping_range.

Well, I guess I've come to accept that, expensive as unmap_mapping_range
may be, truncating files while they're mmap'ed is perverse behaviour:
perhaps even deserving such punishment.

But it is a shame, and leaves me wondering what you gained with the
page lock there.

One thing gained is ease of understanding, and if your later patches
build an edifice upon the knowledge of holding that page lock while
faulting, I've no wish to undermine that foundation.

> 
> mm-merge-populate-and-nopage-into-fault-fixes-nonlinear.patch
> mm-merge-nopfn-into-fault.patch
> etc.
> 
> These move ->nopage, ->populate, ->nopfn (and soon, ->page_mkwrite)
> into a single, unified interface. Although this strictly closes some
> similar holes in nonlinear faults as well, they are very uncommon, so
> I wouldn't be so upset if these aren't merged in 2.6.22 (I don't see
> any reason not to, but at least they don't fix major bugs).

I don't have an opinion on these, but I think BenH and others
were strongly in favour, with various people waiting upon them.

Hugh

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-01 19:31   ` Hugh Dickins
@ 2007-05-02  3:08     ` Nick Piggin
  2007-05-02  9:15       ` Nick Piggin
  2007-05-02 14:00       ` Hugh Dickins
  0 siblings, 2 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-02  3:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Hugh Dickins wrote:
> On Tue, 1 May 2007, Nick Piggin wrote:

>>There were concerns that we could do this more cheaply, but I think it
>>is important to start with a base that is simple and more likely to
>>be correct and build on that. My testing didn't show any obvious
>>problems with performance.
> 
> 
> I don't see _problems_ with performance, but I do consistently see the
> same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
> and page fault tests on SMP, several machines, just as I did last year.

OK. I did run some tests at one stage which didn't show a regression
on my P4, however I don't know that they were statistically significant.
I'll try a couple more runs and post numbers.


> I'm assuming this patch is the one responsible: at 2.6.20-rc4 time
> you posted a set of 10 and a set of 7 patches I tried in versus out;
> at 2.6.21-rc3-mm2 time you had a group of patches in -mm I tried in
> versus out; with similar results.
> 
> I did check the graphs on test.kernel.org, I couldn't see any bad
> behaviour there that correlated with this work; though each -mm
> has such a variety of new work in it, it's very hard to attribute.
> And nobody else has reported any regression from your patches.
> 
> I'm inclined to write it off as poorer performance in some micro-
> benchmarks, against which we offset the improved understandability
> of holding the page lock over the file fault.
> 
> But I was quite disappointed when 
> mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> appeared, putting double unmap_mapping_range calls in.  Certainly
> you were wrong to take the one out, but a pity to end up with two.
> 
> Your comment says/said:
> The nopage vs invalidate race fix patch did not take care of truncating
> private COW pages. Mind you, I'm pretty sure this was previously racy
> even for regular truncate, not to mention vmtruncate_range.
> 
> vmtruncate_range (holepunch) was deficient I agree, and though we
> can now take out your second unmap_mapping_range there, that's only
> because I've slipped one into shmem_truncate_range.  In due course it
> needs to be properly handled by noting the range in shmem inode info.
> 
> (I think you couldn't take that approach, noting invalid range in
> ->mapping while invalidating, because NFS has/had some cases of
> invalidate_whatever without i_mutex?)

Sorry, I didn't parse this? But I wonder whether it is better to do
it in vmtruncate_range than the filesystem? Private COWed pages are
not really a filesystem "thing"...


> But I'm pretty sure (to use your words!) regular truncate was not racy
> before: I believe Andrea's sequence count was handling that case fine,
> without a second unmap_mapping_range.

OK, I think you're right. I _think_ it should also be OK with the
lock_page version as well: we should not be able to have any pages
after the first unmap_mapping_range call, because of the i_size
write. So if we have no pages, there is nothing to 'cow' from.


> Well, I guess I've come to accept that, expensive as unmap_mapping_range
> may be, truncating files while they're mmap'ed is perverse behaviour:
> perhaps even deserving such punishment.
> 
> But it is a shame, and leaves me wondering what you gained with the
> page lock there.
> 
> One thing gained is ease of understanding, and if your later patches
> build an edifice upon the knowledge of holding that page lock while
> faulting, I've no wish to undermine that foundation.

It also fixes a bug, doesn't it? ;)

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-02  3:08     ` Nick Piggin
@ 2007-05-02  9:15       ` Nick Piggin
  2007-05-02 14:00       ` Hugh Dickins
  1 sibling, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-02  9:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Andrea Arcangeli, Christoph Hellwig

Nick Piggin wrote:
> Hugh Dickins wrote:
> 
>> On Tue, 1 May 2007, Nick Piggin wrote:
> 
> 
>>> There were concerns that we could do this more cheaply, but I think it
>>> is important to start with a base that is simple and more likely to
>>> be correct and build on that. My testing didn't show any obvious
>>> problems with performance.
>>
>>
>>
>> I don't see _problems_ with performance, but I do consistently see the
>> same kind of ~5% degradation in lmbench fork, exec, sh, mmap latency
>> and page fault tests on SMP, several machines, just as I did last year.
> 
> 
> OK. I did run some tests at one stage which didn't show a regression
> on my P4, however I don't know that they were statistically significant.
> I'll try a couple more runs and post numbers.

I didn't have enough time tonight to get means/stddev, etc, but the runs
are pretty stable.

Patch tested was just the lock page one.

SMP kernel, tasks bound to 1 CPU:

P4 Xeon
          pagefault   fork          exec
2.6.21   1.67-1.69   140.7-142.0   449.5-460.8
+patch   1.75-1.77   144.0-145.5   456.2-463.0

So it's taken on nearly 5% on pagefault, but looks like less than 2% on
fork, so not as bad as your numbers (phew).

G5
          pagefault   fork          exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2

Bigger hit there.
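
As a rough cross-check of the two tables above, the sketch below turns a pair of min-max ranges into a percentage slowdown. No means or standard deviations were posted, so using the midpoint of each range is an assumption, and `struct range`, `range_mid` and `overhead_pct` are names made up for this illustration.

```c
#include <assert.h>

/* A min-max range as reported in the tables above. */
struct range {
	double lo;
	double hi;
};

/* Midpoint of a range; a crude stand-in for the mean, which was not posted. */
static double range_mid(struct range r)
{
	return (r.lo + r.hi) / 2.0;
}

/* Percentage slowdown of 'patched' relative to 'base', using range midpoints. */
static double overhead_pct(struct range base, struct range patched)
{
	double b = range_mid(base);

	assert(b > 0.0);
	return (range_mid(patched) - b) / b * 100.0;
}
```

Fed the P4 pagefault ranges (1.67-1.69 against 1.75-1.77) this gives roughly 4.8%, and for the G5 pagefault ranges (1.49-1.51 against 1.71-1.73) roughly 14.7%.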

Page faults can be improved a tiny bit by not using a test and clear op
in unlock_page (fewer barriers for the G5).

I don't think that's really a blocker problem for a merge, but I wonder
what we can do to improve it. Lockless pagecache shaves quite a bit of
straight line find_get_page performance there.

Going to a non-sleeping lock might be one way to go in the long term, but
it would require quite a lot of restructuring.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-02  3:08     ` Nick Piggin
  2007-05-02  9:15       ` Nick Piggin
@ 2007-05-02 14:00       ` Hugh Dickins
  2007-05-03  1:32         ` Nick Piggin
  2007-05-09 12:34         ` Nick Piggin
  1 sibling, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 14:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Wed, 2 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > 
> > But I was quite disappointed when
> > mm-fix-fault-vs-invalidate-race-for-linear-mappings-fix.patch
> > appeared, putting double unmap_mapping_range calls in.  Certainly
> > you were wrong to take the one out, but a pity to end up with two.
> > 
> > Your comment says/said:
> > The nopage vs invalidate race fix patch did not take care of truncating
> > private COW pages. Mind you, I'm pretty sure this was previously racy
> > even for regular truncate, not to mention vmtruncate_range.
> > 
> > vmtruncate_range (holepunch) was deficient I agree, and though we
> > can now take out your second unmap_mapping_range there, that's only
> > because I've slipped one into shmem_truncate_range.  In due course it
> > needs to be properly handled by noting the range in shmem inode info.
> > 
> > (I think you couldn't take that approach, noting invalid range in
> > ->mapping while invalidating, because NFS has/had some cases of
> > invalidate_whatever without i_mutex?)
> 
> Sorry, I didn't parse this?

I meant that i_size is used to protect against truncation races, but
we have no equivalent inval_start,inval_end in the struct inode or
struct address_space, such as could be used for similar protection
against races while invalidating.

And that IIRC there are places where NFS was doing the invalidation
without i_mutex: so there could be concurrent invalidations, so one
inval_start,inval_end in the structure wouldn't be enough anyway.
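
To make the point concrete, here is a toy userspace model, not kernel code. The `struct inval_window` type and all function names are hypothetical; nothing like them exists in struct inode or struct address_space, which is exactly the gap described above. It shows an i_size-style check next to a single-window check, and how a second concurrent invalidation overwrites the first window.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mapping record of one in-progress invalidation. */
struct inval_window {
	bool active;
	unsigned long start;	/* first page index being invalidated */
	unsigned long end;	/* last page index being invalidated, inclusive */
};

/* Truncation-style check: a fault at or beyond the size (in pages) must fail. */
static bool fault_allowed_by_size(unsigned long pgoff, unsigned long size_in_pages)
{
	return pgoff < size_in_pages;
}

/* Analogous check against a single recorded invalidation window. */
static bool fault_allowed_by_window(const struct inval_window *w, unsigned long pgoff)
{
	if (!w->active)
		return true;
	return pgoff < w->start || pgoff > w->end;
}

/* Record a window; a second concurrent caller simply overwrites the first. */
static void inval_begin(struct inval_window *w, unsigned long start, unsigned long end)
{
	assert(start <= end);
	w->active = true;
	w->start = start;
	w->end = end;
}
```

After `inval_begin(&w, 10, 20)` a fault at index 15 is refused, but a following `inval_begin(&w, 100, 110)` with no common lock lets index 15 through again, even though the first invalidation may still be running.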

> But I wonder whether it is better to do
> it in vmtruncate_range than the filesystem? Private COWed pages are
> not really a filesystem "thing"...

It wasn't the thought of private COWed pages which made me put a
second unmap_mapping_range in shmem_truncate_range, it was its own
internal file<->swap consistency which needed that (as a quick fix).
The real fix would be to have a trunc_start,trunc_end or whatever in
the shmem_inode_info (assuming it's not wanted in the common inode:
might be if holepunch spreads e.g. it's been mentioned with fallocate).

Re private COWed pages and holepunch: Miklos and I agree that really
it would be better for holepunch _not_ to remove them - but that's
rather off-topic.

More on-topic, since you suggest doing more within vmtruncate_range
than the filesystem: no, I'm afraid that's misdesigned, and I want
to move almost all of it into the filesystem ->truncate_range.
Because, if what vmtruncate_range is doing before it gets to the
filesystem isn't to be just a waste of time, the filesystem needs
to know what's going on in advance - just as notify_change warns
the filesystem about a coming truncation.  But easier than inventing
some new notification is to move it all into the filesystem, with
unmap_mapping_range+truncate_inode_pages_range its library helpers.
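
The shape of that proposal can be sketched in a toy userspace model. Every `toy_` name below is invented for illustration, and the real kernel types and signatures differ; the only thing the sketch is meant to show is the call order when the filesystem method drives the library helpers itself.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins; the real kernel types and signatures differ. */
struct toy_inode;

struct toy_inode_ops {
	void (*truncate_range)(struct toy_inode *inode, long start, long end);
};

struct toy_inode {
	const struct toy_inode_ops *ops;
	char log[128];	/* records the order of calls for inspection */
};

static void toy_log(struct toy_inode *inode, const char *what)
{
	assert(strlen(inode->log) + strlen(what) + 1 < sizeof(inode->log));
	strcat(inode->log, what);
}

/* Library helpers that the filesystem calls itself. */
static void toy_unmap_mapping_range(struct toy_inode *inode, long start, long end)
{
	(void)start;
	(void)end;
	toy_log(inode, "unmap;");
}

static void toy_truncate_inode_pages_range(struct toy_inode *inode, long start, long end)
{
	(void)start;
	(void)end;
	toy_log(inode, "drop-pages;");
}

/* Filesystem method: it learns of the punch before any VM work starts. */
static void toy_fs_truncate_range(struct toy_inode *inode, long start, long end)
{
	toy_log(inode, "fs-prepare;");
	toy_unmap_mapping_range(inode, start, end);
	toy_truncate_inode_pages_range(inode, start, end);
	toy_log(inode, "fs-free-blocks;");
}

/* The generic entry point shrinks to a thin dispatcher. */
static void toy_vmtruncate_range(struct toy_inode *inode, long start, long end)
{
	assert(inode->ops && inode->ops->truncate_range);
	inode->ops->truncate_range(inode, start, end);
}
```

With this layout no separate advance notification is needed: the filesystem's own preparation step runs before the unmap and page-dropping helpers.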

> 
> > But I'm pretty sure (to use your words!) regular truncate was not racy
> > before: I believe Andrea's sequence count was handling that case fine,
> > without a second unmap_mapping_range.
> 
> OK, I think you're right. I _think_ it should also be OK with the
> lock_page version as well: we should not be able to have any pages
> after the first unmap_mapping_range call, because of the i_size
> write. So if we have no pages, there is nothing to 'cow' from.

I'd be delighted if you can remove those later unmap_mapping_ranges.
As I recall, the important thing for the copy pages is to be holding
the page lock (or whatever other serialization) on the copied page
still while the copy page is inserted into pagetable: that looks
to be so in your __do_fault.
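
A single-threaded toy model of that rule, under loud assumptions: every `toy_` name is invented here, the "lock" is a plain flag rather than a sleeping lock, and the real __do_fault is considerably more involved. The point illustrated is only that the validity checks and the insertion of the copy happen under one hold of the source page's lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping {
	unsigned long size_in_pages;	/* plays the role of i_size */
};

struct toy_page {
	struct toy_mapping *mapping;	/* NULL once truncated away */
	unsigned long index;
	bool locked;
};

static void toy_lock_page(struct toy_page *page)
{
	assert(!page->locked);	/* single-threaded model: no sleeping */
	page->locked = true;
}

static void toy_unlock_page(struct toy_page *page)
{
	assert(page->locked);
	page->locked = false;
}

/* Truncate side: publish the new size first, then detach the page under lock. */
static void toy_truncate(struct toy_mapping *m, struct toy_page *page,
			 unsigned long new_size_in_pages)
{
	m->size_in_pages = new_size_in_pages;
	if (page->mapping == m && page->index >= new_size_in_pages) {
		toy_lock_page(page);
		page->mapping = NULL;
		toy_unlock_page(page);
	}
}

/*
 * Fault side: returns true if a private copy was "inserted".  The source
 * page stays locked across both the validity checks and the insertion.
 */
static bool toy_cow_fault(struct toy_mapping *m, struct toy_page *page,
			  bool *pte_present)
{
	bool ok;

	toy_lock_page(page);
	ok = page->mapping == m && page->index < m->size_in_pages;
	if (ok)
		*pte_present = true;	/* copy installed while still locked */
	toy_unlock_page(page);
	return ok;
}
```

In this model a fault on index 5 of an 8-page file succeeds, and the same fault after truncating to 4 pages is refused without ever installing a copy.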

> > But it is a shame, and leaves me wondering what you gained with the
> > page lock there.
> > 
> > One thing gained is ease of understanding, and if your later patches
> > build an edifice upon the knowledge of holding that page lock while
> > faulting, I've no wish to undermine that foundation.
> 
> It also fixes a bug, doesn't it? ;)

Well, I'd come to think that perhaps the bugs would be solved by
that second unmap_mapping_range alone, so the pagelock changes
just a misleading diversion.

I'm not sure how I feel about that: calling unmap_mapping_range a
second time feels such a cheat, but if (big if) it does solve the
races, and the pagelock method is as expensive as your numbers
now suggest...

Hugh

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-02 14:00       ` Hugh Dickins
@ 2007-05-03  1:32         ` Nick Piggin
  2007-05-03 10:37           ` Christoph Hellwig
                             ` (2 more replies)
  2007-05-09 12:34         ` Nick Piggin
  1 sibling, 3 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-03  1:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 3875 bytes --]

Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

[snip]

> More on-topic, since you suggest doing more within vmtruncate_range
> than the filesystem: no, I'm afraid that's misdesigned, and I want
> to move almost all of it into the filesystem ->truncate_range.
> Because, if what vmtruncate_range is doing before it gets to the
> filesystem isn't to be just a waste of time, the filesystem needs
> to know what's going on in advance - just as notify_change warns
> the filesystem about a coming truncation.  But easier than inventing
> some new notification is to move it all into the filesystem, with
> unmap_mapping_range+truncate_inode_pages_range its library helpers.

Well I would prefer it to follow the same pattern as regular
truncate. I don't think it is misdesigned to call the filesystem
_first_, but I think if you do that then the filesystem should
call the vm to prepare / finish truncate, rather than open code
calls to unmap itself.


>>>But I'm pretty sure (to use your words!) regular truncate was not racy
>>>before: I believe Andrea's sequence count was handling that case fine,
>>>without a second unmap_mapping_range.
>>
>>OK, I think you're right. I _think_ it should also be OK with the
>>lock_page version as well: we should not be able to have any pages
>>after the first unmap_mapping_range call, because of the i_size
>>write. So if we have no pages, there is nothing to 'cow' from.
> 
> 
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Yeah, I think my thought process went wrong on those... I'll
revisit.


>>>But it is a shame, and leaves me wondering what you gained with the
>>>page lock there.
>>>
>>>One thing gained is ease of understanding, and if your later patches
>>>build an edifice upon the knowledge of holding that page lock while
>>>faulting, I've no wish to undermine that foundation.
>>
>>It also fixes a bug, doesn't it? ;)
> 
> 
> Well, I'd come to think that perhaps the bugs would be solved by
> that second unmap_mapping_range alone, so the pagelock changes
> just a misleading diversion.
> 
> I'm not sure how I feel about that: calling unmap_mapping_range a
> second time feels such a cheat, but if (big if) it does solve the
> races, and the pagelock method is as expensive as your numbers
> now suggest...

Well aside from being terribly ugly, it means we can still drop
the dirty bit where we'd otherwise rather not, so I don't think
we can do that.

I think there may be some way we can do this without taking the
page lock, and I was going to look at it, but I think it is
quite neat to just lock the page...

I don't think performance is _that_ bad. On the P4 it is a couple
of % on the microbenchmarks. The G5 is worse, but even then I
don't think it is that bad. I'll try to improve that and get back to you.

The problem is that lock/unlock_page is expensive on powerpc, and
if we improve that, we improve more than just the fault handler...

The attached patch gets performance up a bit by avoiding some
barriers and some cachelines:

G5
          pagefault   fork          exec
2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0

So that brings the fork/exec hits down to much less than 5%, and
would likely speed up other things that lock the page, like write
or page reclaim.

I think we could get further performance improvement by
implementing arch specific bitops for lock/unlock operations,
so we don't need to use things like smp_mb__before_clear_bit()
if they aren't needed or full barriers in the test_and_set_bit().
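
The idea can be sketched in portable C11 atomics rather than per-arch assembly. This is a userspace illustration only: `test_and_set_bit_acquire`, `clear_bit_release` and `bit_trylock` are names invented for this sketch, and mapping the lock side to acquire ordering and the unlock side to release ordering is the assumption being illustrated, not a statement of what any architecture's bitops do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set bit 'nr' with acquire ordering; returns the bit's previous value. */
static bool test_and_set_bit_acquire(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	assert(nr < sizeof(unsigned long) * 8);
	return (atomic_fetch_or_explicit(addr, mask, memory_order_acquire) & mask) != 0;
}

/* Clear bit 'nr' with release ordering; unlocking needs no return value. */
static void clear_bit_release(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	assert(nr < sizeof(unsigned long) * 8);
	atomic_fetch_and_explicit(addr, ~mask, memory_order_release);
}

/* Trylock built on the acquire op: true means the caller now owns the bit. */
static bool bit_trylock(unsigned int nr, atomic_ulong *addr)
{
	return !test_and_set_bit_acquire(nr, addr);
}
```

Neither operation asks for a full barrier, so the compiler is free to emit something cheaper than a sequentially consistent read-modify-write on architectures where that distinction matters.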

-- 
SUSE Labs, Novell Inc.


[-- Attachment #2: mm-unlock-speedup.patch --]
[-- Type: text/plain, Size: 2228 bytes --]

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-04-24 10:39:56.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h	2007-05-03 08:38:53.000000000 +1000
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_waiters		20	/* Page has PG_locked waiters */
+
 /* PG_owner_priv_1 users should have descriptive aliases */
 #define PG_checked		PG_owner_priv_1 /* Used by some filesystems */
 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-04-24 10:39:56.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-03 08:35:08.000000000 +1000
@@ -141,7 +141,7 @@
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page(page);
 }
 
@@ -152,7 +152,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (TestSetPageLocked(page))
+	if (unlikely(TestSetPageLocked(page)))
 		__lock_page_nosync(page);
 }
 	
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
+++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
@@ -532,11 +532,13 @@
  */
 void fastcall unlock_page(struct page *page)
 {
+	VM_BUG_ON(!PageLocked(page));
 	smp_mb__before_clear_bit();
-	if (!TestClearPageLocked(page))
-		BUG();
-	smp_mb__after_clear_bit(); 
-	wake_up_page(page, PG_locked);
+	ClearPageLocked(page);
+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
+		clear_bit(PG_waiters, &page->flags);
+		wake_up_page(page, PG_locked);
+	}
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -568,6 +570,11 @@
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
+	set_bit(PG_waiters, &page->flags);
+	if (unlikely(!TestSetPageLocked(page))) {
+		clear_bit(PG_waiters, &page->flags);
+		return;
+	}
 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
 							TASK_UNINTERRUPTIBLE);
 }

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03  1:32         ` Nick Piggin
@ 2007-05-03 10:37           ` Christoph Hellwig
  2007-05-03 12:56             ` Nick Piggin
  2007-05-03 12:24           ` Hugh Dickins
  2007-05-03 16:52           ` Andrew Morton
  2 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 10:37 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Andrea Arcangeli, Christoph Hellwig

On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:
> 
> G5
>          pagefault   fork          exec
> 2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
> +patch   1.71-1.73   175.2-180.8   780.5-794.2
> +patch2  1.61-1.63   169.8-175.0   748.6-757.0
> 
> So that brings the fork/exec hits down to much less than 5%, and
> would likely speed up other things that lock the page, like write
> or page reclaim.

Is that every fork/exec or just under certain circumstances?
A 5% regression on every fork/exec is not acceptable.


* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 10:37           ` Christoph Hellwig
@ 2007-05-03 12:56             ` Nick Piggin
  2007-05-04  9:23               ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 12:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm,
	Andrea Arcangeli

Christoph Hellwig wrote:
> On Thu, May 03, 2007 at 11:32:23AM +1000, Nick Piggin wrote:
> 
>>The attached patch gets performance up a bit by avoiding some
>>barriers and some cachelines:
>>
>>G5
>>         pagefault   fork          exec
>>2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
>>+patch   1.71-1.73   175.2-180.8   780.5-794.2
>>+patch2  1.61-1.63   169.8-175.0   748.6-757.0
>>
>>So that brings the fork/exec hits down to much less than 5%, and
>>would likely speed up other things that lock the page, like write
>>or page reclaim.
> 
> 
> Is that every fork/exec or just under certain circumstances?
> A 5% regression on every fork/exec is not acceptable.

Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
numbers will be improved as well with that patch. Then if we have
specific lock/unlock bitops, I hope it should reduce that further.

The overhead that is there should just be coming from the extra
overhead in the file backed fault handler. For noop fork/execs,
I think that tends to be more pronounced, it is hard to see any
difference on any non-micro benchmark.

The other thing is that I think there could be some cache effects
happening -- for example the exec numbers on the 2nd line are
disproportionately large.

It definitely isn't a good thing to drop performance anywhere
though, so I'll keep looking for improvements.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 12:56             ` Nick Piggin
@ 2007-05-04  9:23               ` Nick Piggin
  2007-05-04  9:43                 ` Nick Piggin
  2007-05-08  3:03                 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04  9:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel,
	linux-mm, Andrea Arcangeli, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1021 bytes --]

Nick Piggin wrote:
> Christoph Hellwig wrote:

>> Is that every fork/exec or just under certain circumstances?
>> A 5% regression on every fork/exec is not acceptable.
> 
> 
> Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
> numbers will be improved as well with that patch. Then if we have
> specific lock/unlock bitops, I hope it should reduce that further.

OK, with the races and missing barriers fixed from the previous patch,
plus the attached one added (+patch3), numbers are better again (I'm not
sure if I have the ppc barriers correct though).

These ops could also be put to use in bit spinlocks, buffer lock, and
probably a few other places too.

2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
+patch   1.71-1.73   175.2-180.8   780.5-794.2
+patch2  1.61-1.63   169.8-175.0   748.6-757.0
+patch3  1.54-1.57   165.6-170.9   748.5-757.5

So fault performance goes to under 5%, fork is in the noise, exec is
still up 1%, but maybe that's noise or cache effects again.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: lock-bitops.patch --]
[-- Type: text/plain, Size: 10991 bytes --]

Index: linux-2.6/include/asm-powerpc/bitops.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/bitops.h	2007-05-04 16:08:20.000000000 +1000
+++ linux-2.6/include/asm-powerpc/bitops.h	2007-05-04 16:14:39.000000000 +1000
@@ -87,6 +87,24 @@
 	: "cc" );
 }
 
+static __inline__ void clear_bit_unlock(int nr, volatile unsigned long *addr)
+{
+	unsigned long old;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+	LWSYNC_ON_SMP
+"1:"	PPC_LLARX "%0,0,%3	# clear_bit_unlock\n"
+	"andc	%0,%0,%2\n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%0,0,%3\n"
+	"bne-	1b"
+	: "=&r" (old), "+m" (*p)
+	: "r" (mask), "r" (p)
+	: "cc" );
+}
+
 static __inline__ void change_bit(int nr, volatile unsigned long *addr)
 {
 	unsigned long old;
@@ -126,6 +144,27 @@
 	return (old & mask) != 0;
 }
 
+static __inline__ int test_and_set_bit_lock(unsigned long nr,
+				       volatile unsigned long *addr)
+{
+	unsigned long old, t;
+	unsigned long mask = BITOP_MASK(nr);
+	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);
+
+	__asm__ __volatile__(
+"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit_lock\n"
+	"or	%1,%0,%2 \n"
+	PPC405_ERR77(0,%3)
+	PPC_STLCX "%1,0,%3 \n"
+	"bne-	1b"
+	ISYNC_ON_SMP
+	: "=&r" (old), "=&r" (t)
+	: "r" (mask), "r" (p)
+	: "cc", "memory");
+
+	return (old & mask) != 0;
+}
+
 static __inline__ int test_and_clear_bit(unsigned long nr,
 					 volatile unsigned long *addr)
 {
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-05-04 16:14:36.000000000 +1000
+++ linux-2.6/include/linux/pagemap.h	2007-05-04 16:17:34.000000000 +1000
@@ -136,13 +136,18 @@
 extern void FASTCALL(__wait_on_page_locked(struct page *page));
 extern void FASTCALL(unlock_page(struct page *page));
 
+static inline int trylock_page(struct page *page)
+{
+	return (likely(!TestSetPageLocked_Lock(page)));
+}
+
 /*
  * lock_page may only be called if we have the page's inode pinned.
  */
 static inline void lock_page(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page(page);
 }
 
@@ -153,7 +158,7 @@
 static inline void lock_page_nosync(struct page *page)
 {
 	might_sleep();
-	if (unlikely(TestSetPageLocked(page)))
+	if (!trylock_page(page))
 		__lock_page_nosync(page);
 }
 	
Index: linux-2.6/drivers/scsi/sg.c
===================================================================
--- linux-2.6.orig/drivers/scsi/sg.c	2007-04-12 14:35:08.000000000 +1000
+++ linux-2.6/drivers/scsi/sg.c	2007-05-04 16:23:27.000000000 +1000
@@ -1734,7 +1734,7 @@
                  */
 		flush_dcache_page(pages[i]);
 		/* ?? Is locking needed? I don't think so */
-		/* if (TestSetPageLocked(pages[i]))
+		/* if (!trylock_page(pages[i]))
 		   goto out_unlock; */
         }
 
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c	2007-04-12 14:35:09.000000000 +1000
+++ linux-2.6/fs/cifs/file.c	2007-05-04 16:23:36.000000000 +1000
@@ -1229,7 +1229,7 @@
 
 			if (first < 0)
 				lock_page(page);
-			else if (TestSetPageLocked(page))
+			else if (!trylock_page(page))
 				break;
 
 			if (unlikely(page->mapping != mapping)) {
Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c	2007-04-12 14:35:09.000000000 +1000
+++ linux-2.6/fs/jbd/commit.c	2007-05-04 16:23:30.000000000 +1000
@@ -64,7 +64,7 @@
 		goto nope;
 
 	/* OK, it's a truncated page */
-	if (TestSetPageLocked(page))
+	if (!trylock_page(page))
 		goto nope;
 
 	page_cache_get(page);
Index: linux-2.6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd2/commit.c	2007-04-12 14:35:09.000000000 +1000
+++ linux-2.6/fs/jbd2/commit.c	2007-05-04 16:23:40.000000000 +1000
@@ -64,7 +64,7 @@
 		goto nope;
 
 	/* OK, it's a truncated page */
-	if (TestSetPageLocked(page))
+	if (!trylock_page(page))
 		goto nope;
 
 	page_cache_get(page);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_aops.c	2007-03-05 15:17:25.000000000 +1100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_aops.c	2007-05-04 16:23:33.000000000 +1000
@@ -601,7 +601,7 @@
 			} else
 				pg_offset = PAGE_CACHE_SIZE;
 
-			if (page->index == tindex && !TestSetPageLocked(page)) {
+			if (page->index == tindex && trylock_page(page)) {
 				len = xfs_probe_page(page, pg_offset, mapped);
 				unlock_page(page);
 			}
@@ -685,7 +685,7 @@
 
 	if (page->index != tindex)
 		goto fail;
-	if (TestSetPageLocked(page))
+	if (!trylock_page(page))
 		goto fail;
 	if (PageWriteback(page))
 		goto fail_unlock_page;
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-05-03 08:38:53.000000000 +1000
+++ linux-2.6/include/linux/page-flags.h	2007-05-04 16:18:23.000000000 +1000
@@ -116,8 +116,12 @@
 		set_bit(PG_locked, &(page)->flags)
 #define TestSetPageLocked(page)		\
 		test_and_set_bit(PG_locked, &(page)->flags)
+#define TestSetPageLocked_Lock(page)		\
+		test_and_set_bit_lock(PG_locked, &(page)->flags)
 #define ClearPageLocked(page)		\
 		clear_bit(PG_locked, &(page)->flags)
+#define ClearPageLocked_Unlock(page)		\
+		clear_bit_unlock(PG_locked, &(page)->flags)
 #define TestClearPageLocked(page)	\
 		test_and_clear_bit(PG_locked, &(page)->flags)
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2007-05-02 15:00:28.000000000 +1000
+++ linux-2.6/mm/memory.c	2007-05-04 16:19:12.000000000 +1000
@@ -1550,7 +1550,7 @@
 	 * not dirty accountable.
 	 */
 	if (PageAnon(old_page)) {
-		if (!TestSetPageLocked(old_page)) {
+		if (trylock_page(old_page)) {
 			reuse = can_share_swap_page(old_page);
 			unlock_page(old_page);
 		}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-05-02 14:48:36.000000000 +1000
+++ linux-2.6/mm/migrate.c	2007-05-04 16:19:15.000000000 +1000
@@ -569,7 +569,7 @@
 	 * establishing additional references. We are the only one
 	 * holding a reference to the new page at this point.
 	 */
-	if (TestSetPageLocked(newpage))
+	if (!trylock_page(newpage))
 		BUG();
 
 	/* Prepare mapping for the new page.*/
@@ -621,7 +621,7 @@
 		goto move_newpage;
 
 	rc = -EAGAIN;
-	if (TestSetPageLocked(page)) {
+	if (!trylock_page(page)) {
 		if (!force)
 			goto move_newpage;
 		lock_page(page);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2007-04-13 20:48:04.000000000 +1000
+++ linux-2.6/mm/rmap.c	2007-05-04 16:19:18.000000000 +1000
@@ -426,7 +426,7 @@
 			referenced += page_referenced_anon(page);
 		else if (is_locked)
 			referenced += page_referenced_file(page);
-		else if (TestSetPageLocked(page))
+		else if (!trylock_page(page))
 			referenced++;
 		else {
 			if (page->mapping)
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-05-02 15:00:26.000000000 +1000
+++ linux-2.6/mm/shmem.c	2007-05-04 16:19:22.000000000 +1000
@@ -1155,7 +1155,7 @@
 		}
 
 		/* We have to do this with page locked to prevent races */
-		if (TestSetPageLocked(swappage)) {
+		if (!trylock_page(swappage)) {
 			shmem_swp_unmap(entry);
 			spin_unlock(&info->lock);
 			wait_on_page_locked(swappage);
@@ -1214,7 +1214,7 @@
 		shmem_swp_unmap(entry);
 		filepage = find_get_page(mapping, idx);
 		if (filepage &&
-		    (!PageUptodate(filepage) || TestSetPageLocked(filepage))) {
+		    (!PageUptodate(filepage) || !trylock_page(filepage))) {
 			spin_unlock(&info->lock);
 			wait_on_page_locked(filepage);
 			page_cache_release(filepage);
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-04-12 14:35:11.000000000 +1000
+++ linux-2.6/mm/swap.c	2007-05-04 16:19:28.000000000 +1000
@@ -412,7 +412,7 @@
 	for (i = 0; i < pagevec_count(pvec); i++) {
 		struct page *page = pvec->pages[i];
 
-		if (PagePrivate(page) && !TestSetPageLocked(page)) {
+		if (PagePrivate(page) && trylock_page(page)) {
 			if (PagePrivate(page))
 				try_to_release_page(page, 0);
 			unlock_page(page);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-04-24 10:39:57.000000000 +1000
+++ linux-2.6/mm/swap_state.c	2007-05-04 16:19:32.000000000 +1000
@@ -252,7 +252,7 @@
  */
 static inline void free_swap_cache(struct page *page)
 {
-	if (PageSwapCache(page) && !TestSetPageLocked(page)) {
+	if (PageSwapCache(page) && trylock_page(page)) {
 		remove_exclusive_swap_page(page);
 		unlock_page(page);
 	}
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2007-04-24 10:39:55.000000000 +1000
+++ linux-2.6/mm/swapfile.c	2007-05-04 16:19:25.000000000 +1000
@@ -401,7 +401,7 @@
 	if (p) {
 		if (swap_entry_free(p, swp_offset(entry)) == 1) {
 			page = find_get_page(&swapper_space, entry.val);
-			if (page && unlikely(TestSetPageLocked(page))) {
+			if (page && unlikely(!trylock_page(page))) {
 				page_cache_release(page);
 				page = NULL;
 			}
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-05-02 15:00:27.000000000 +1000
+++ linux-2.6/mm/truncate.c	2007-05-04 16:19:35.000000000 +1000
@@ -185,7 +185,7 @@
 			if (page_index > next)
 				next = page_index;
 			next++;
-			if (TestSetPageLocked(page))
+			if (!trylock_page(page))
 				continue;
 			if (PageWriteback(page)) {
 				unlock_page(page);
@@ -291,7 +291,7 @@
 			pgoff_t index;
 			int lock_failed;
 
-			lock_failed = TestSetPageLocked(page);
+			lock_failed = !trylock_page(page);
 
 			/*
 			 * We really shouldn't be looking at the ->index of an
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-04-24 10:39:56.000000000 +1000
+++ linux-2.6/mm/vmscan.c	2007-05-04 16:19:38.000000000 +1000
@@ -466,7 +466,7 @@
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
 
-		if (TestSetPageLocked(page))
+		if (!trylock_page(page))
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
@@ -538,7 +538,7 @@
 				 * A synchronous write - probably a ramdisk.  Go
 				 * ahead and try to reclaim the page.
 				 */
-				if (TestSetPageLocked(page))
+				if (!trylock_page(page))
 					goto keep;
 				if (PageDirty(page) || PageWriteback(page))
 					goto keep_locked;

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-04  9:23               ` Nick Piggin
@ 2007-05-04  9:43                 ` Nick Piggin
  2007-05-08  3:03                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04  9:43 UTC (permalink / raw)
  Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel,
	linux-mm, Andrea Arcangeli, Benjamin Herrenschmidt

Nick Piggin wrote:
> Nick Piggin wrote:
> 
>> Christoph Hellwig wrote:
> 
> 
>>> Is that every fork/exec or just under certain circumstances?
>>> A 5% regression on every fork/exec is not acceptable.
>>
>>
>>
>> Well after patch2, G5 fork is 3% and exec is 1%, I'd say the P4
>> numbers will be improved as well with that patch. Then if we have
>> specific lock/unlock bitops, I hope it should reduce that further.
> 
> 
> OK, with the races and missing barriers fixed from the previous patch,
> plus the attached one added (+patch3), numbers are better again (I'm not
> sure if I have the ppc barriers correct though).
> 
> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.
> 
> 2.6.21   1.49-1.51   164.6-170.8   741.8-760.3
> +patch   1.71-1.73   175.2-180.8   780.5-794.2
> +patch2  1.61-1.63   169.8-175.0   748.6-757.0
> +patch3  1.54-1.57   165.6-170.9   748.5-757.5
> 
> So the fault slowdown drops to under 5%, fork is in the noise, exec is
> still up 1%, but maybe that's noise or cache effects again.

OK, with my new lock/unlock_page, dd from a large (bigger than RAM) sparse
file to /dev/null, with an experimentally optimal block size (32K), goes
from 626MB/s to 683MB/s on a 2 CPU G5 booted with maxcpus=1.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-04  9:23               ` Nick Piggin
  2007-05-04  9:43                 ` Nick Piggin
@ 2007-05-08  3:03                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 233+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-08  3:03 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Christoph Hellwig, Hugh Dickins, Andrew Morton, linux-kernel,
	linux-mm, Andrea Arcangeli

On Fri, 2007-05-04 at 19:23 +1000, Nick Piggin wrote:

> These ops could also be put to use in bit spinlocks, buffer lock, and
> probably a few other places too.

Ok, the performance hit seems to be under control (especially with the
bigger benchmark showing actual improvements).

There's a little bogon with the PG_waiters bit that you already know
about but apart from that it should be ok.

I must say I absolutely _LOVE_ the bitops with explicit _lock/_unlock
semantics. That should allow us to remove a whole bunch of dodgy
barriers and smp_mb__before_whatever_magic_crap() things we have all
over the place by providing precisely the expected semantics for bit
locks.

There are quite a few people who've been trying to do bit locks and I've
always been very worried by how easy it is to get the barriers wrong (or
too many barriers in the fast path) with these.

There are a couple of things we might want to think about regarding the
actual API to bit locks... the API you propose is simple, but it might
not fit some of the most exotic usage requirements, which typically are
related to manipulating other bits along with the lock bit.

We might just ignore them though. In the case of the page lock, it's
only hitting the slow path, and I would expect other usage scenarios to
be similar.

Cheers,
Ben.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03  1:32         ` Nick Piggin
  2007-05-03 10:37           ` Christoph Hellwig
@ 2007-05-03 12:24           ` Hugh Dickins
  2007-05-03 12:43             ` Nick Piggin
  2007-05-03 16:52           ` Andrew Morton
  2 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 12:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Thu, 3 May 2007, Nick Piggin wrote:
> 
> The problem is that lock/unlock_page is expensive on powerpc, and
> if we improve that, we improve more than just the fault handler...
> 
> The attached patch gets performance up a bit by avoiding some
> barriers and some cachelines:

There's a strong whiff of raciness about this...
but I could very easily be wrong.

> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
> +++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
> @@ -532,11 +532,13 @@
>   */
>  void fastcall unlock_page(struct page *page)
>  {
> +	VM_BUG_ON(!PageLocked(page));
>  	smp_mb__before_clear_bit();
> -	if (!TestClearPageLocked(page))
> -		BUG();
> -	smp_mb__after_clear_bit(); 
> -	wake_up_page(page, PG_locked);
> +	ClearPageLocked(page);
> +	if (unlikely(test_bit(PG_waiters, &page->flags))) {
> +		clear_bit(PG_waiters, &page->flags);
> +		wake_up_page(page, PG_locked);
> +	}
>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> @@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>  {
>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>  
> +	set_bit(PG_waiters, &page->flags);
> +	if (unlikely(!TestSetPageLocked(page))) {

What happens if another cpu is coming through __lock_page at the
same time, did its set_bit, now finds PageLocked, and so proceeds
to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
so this task's unlock_page won't wake the other?

> +		clear_bit(PG_waiters, &page->flags);
> +		return;
> +	}
>  	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>  							TASK_UNINTERRUPTIBLE);
>  }

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 12:24           ` Hugh Dickins
@ 2007-05-03 12:43             ` Nick Piggin
  2007-05-03 12:58               ` Hugh Dickins
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 12:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Hugh Dickins wrote:
> On Thu, 3 May 2007, Nick Piggin wrote:
> 
>>The problem is that lock/unlock_page is expensive on powerpc, and
>>if we improve that, we improve more than just the fault handler...
>>
>>The attached patch gets performance up a bit by avoiding some
>>barriers and some cachelines:
> 
> 
> There's a strong whiff of raciness about this...
> but I could very easily be wrong.
> 
> 
>>Index: linux-2.6/mm/filemap.c
>>===================================================================
>>--- linux-2.6.orig/mm/filemap.c	2007-05-02 15:00:26.000000000 +1000
>>+++ linux-2.6/mm/filemap.c	2007-05-03 08:34:32.000000000 +1000
>>@@ -532,11 +532,13 @@
>>  */
>> void fastcall unlock_page(struct page *page)
>> {
>>+	VM_BUG_ON(!PageLocked(page));
>> 	smp_mb__before_clear_bit();
>>-	if (!TestClearPageLocked(page))
>>-		BUG();
>>-	smp_mb__after_clear_bit(); 
>>-	wake_up_page(page, PG_locked);
>>+	ClearPageLocked(page);
>>+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
>>+		clear_bit(PG_waiters, &page->flags);
>>+		wake_up_page(page, PG_locked);
>>+	}
>> }
>> EXPORT_SYMBOL(unlock_page);
>> 
>>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>> {
>> 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>> 
>>+	set_bit(PG_waiters, &page->flags);
>>+	if (unlikely(!TestSetPageLocked(page))) {
> 
> 
> What happens if another cpu is coming through __lock_page at the
> same time, did its set_bit, now finds PageLocked, and so proceeds
> to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
> so this task's unlock_page won't wake the other?

You're right, we can't clear the bit here. Doubt it mattered much anyway?

BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
above... that barrier is in the slow path as well though, so it shouldn't
matter either.

> 
> 
>>+		clear_bit(PG_waiters, &page->flags);
>>+		return;
>>+	}
>> 	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
>> 							TASK_UNINTERRUPTIBLE);
>> }
> 
> 


-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 12:43             ` Nick Piggin
@ 2007-05-03 12:58               ` Hugh Dickins
  2007-05-03 13:08                 ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 12:58 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Thu, 3 May 2007, Nick Piggin wrote:
> >>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
> > > {
> > >  DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> > >
> > >+	set_bit(PG_waiters, &page->flags);
> > >+	if (unlikely(!TestSetPageLocked(page))) {
> > 
> > What happens if another cpu is coming through __lock_page at the
> > same time, did its set_bit, now finds PageLocked, and so proceeds
> > to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
> > so this task's unlock_page won't wake the other?
> 
> You're right, we can't clear the bit here. Doubt it mattered much anyway?

Ah yes, that's a good easy answer.  In fact, just remove this whole
test and block (we already tried TestSetPageLocked outside just a
short while ago, so this repeat won't often save anything).

> 
> BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
> above... that barrier is in the slow path as well though, so it shouldn't
> matter either.

I vaguely wondered how such barriers had managed to dissolve away,
but cranking my brain up to think about barriers takes far too long.

> > >+		clear_bit(PG_waiters, &page->flags);
> > >+		return;
> > >+	}
> > >  __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
> > >        TASK_UNINTERRUPTIBLE);
> >> }

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 12:58               ` Hugh Dickins
@ 2007-05-03 13:08                 ` Nick Piggin
  0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-03 13:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Hugh Dickins wrote:
> On Thu, 3 May 2007, Nick Piggin wrote:
> 
>>>>@@ -568,6 +570,11 @@ __lock_page (diff -p would tell us!)
>>>>{
>>>> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>>>>
>>>>+	set_bit(PG_waiters, &page->flags);
>>>>+	if (unlikely(!TestSetPageLocked(page))) {
>>>
>>>What happens if another cpu is coming through __lock_page at the
>>>same time, did its set_bit, now finds PageLocked, and so proceeds
>>>to the __wait_on_bit_lock?  But this cpu now clears PG_waiters,
>>>so this task's unlock_page won't wake the other?
>>
>>You're right, we can't clear the bit here. Doubt it mattered much anyway?
> 
> 
> Ah yes, that's a good easy answer.  In fact, just remove this whole
> test and block (we already tried TestSetPageLocked outside just a
> short while ago, so this repeat won't often save anything).

Yeah, I was getting too clever for my own boots :)

I think the patch has merit though. It's unfortunate that it uses another page
flag, but it seemed to give quite a bit of speedup on unlock_page (probably
from avoiding both the barriers and an extra random cacheline load (from the hash)).

I guess it has to get good results from more benchmarks...


>>BTW. I also forgot an smp_mb__after_clear_bit() before the wake_up_page
>>above... that barrier is in the slow path as well though, so it shouldn't
>>matter either.
> 
> 
> I vaguely wondered how such barriers had managed to dissolve away,
> but cranking my brain up to think about barriers takes far too long.

That barrier was one too many :)

However I believe the fastpath barrier can go away because the PG_locked
operation is depending on the same cacheline as PG_waiters.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03  1:32         ` Nick Piggin
  2007-05-03 10:37           ` Christoph Hellwig
  2007-05-03 12:24           ` Hugh Dickins
@ 2007-05-03 16:52           ` Andrew Morton
  2007-05-04  4:16             ` Nick Piggin
  2 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-03 16:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>  void fastcall unlock_page(struct page *page)
>  {
> +	VM_BUG_ON(!PageLocked(page));
>  	smp_mb__before_clear_bit();
> -	if (!TestClearPageLocked(page))
> -		BUG();
> -	smp_mb__after_clear_bit(); 
> -	wake_up_page(page, PG_locked);
> +	ClearPageLocked(page);
> +	if (unlikely(test_bit(PG_waiters, &page->flags))) {
> +		clear_bit(PG_waiters, &page->flags);
> +		wake_up_page(page, PG_locked);
> +	}
>  }

Why is that significantly faster than plain old wake_up_page(), which
tests waitqueue_active()?

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-03 16:52           ` Andrew Morton
@ 2007-05-04  4:16             ` Nick Piggin
  0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04  4:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Andrew Morton wrote:
> On Thu, 03 May 2007 11:32:23 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>> void fastcall unlock_page(struct page *page)
>> {
>>+	VM_BUG_ON(!PageLocked(page));
>> 	smp_mb__before_clear_bit();
>>-	if (!TestClearPageLocked(page))
>>-		BUG();
>>-	smp_mb__after_clear_bit(); 
>>-	wake_up_page(page, PG_locked);
>>+	ClearPageLocked(page);
>>+	if (unlikely(test_bit(PG_waiters, &page->flags))) {
>>+		clear_bit(PG_waiters, &page->flags);
>>+		wake_up_page(page, PG_locked);
>>+	}
>> }
> 
> 
> Why is that significantly faster than plain old wake_up_page(), which
> tests waitqueue_active()?

Because it needs fewer barriers and doesn't touch a random hash
cacheline in the fastpath.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-02 14:00       ` Hugh Dickins
  2007-05-03  1:32         ` Nick Piggin
@ 2007-05-09 12:34         ` Nick Piggin
  2007-05-09 14:28           ` Hugh Dickins
  1 sibling, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-09 12:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 1280 bytes --]

Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:

>>>But I'm pretty sure (to use your words!) regular truncate was not racy
>>>before: I believe Andrea's sequence count was handling that case fine,
>>>without a second unmap_mapping_range.
>>
>>OK, I think you're right. I _think_ it should also be OK with the
>>lock_page version as well: we should not be able to have any pages
>>after the first unmap_mapping_range call, because of the i_size
>>write. So if we have no pages, there is nothing to 'cow' from.
> 
> 
> I'd be delighted if you can remove those later unmap_mapping_ranges.
> As I recall, the important thing for the copy pages is to be holding
> the page lock (or whatever other serialization) on the copied page
> still while the copy page is inserted into pagetable: that looks
> to be so in your __do_fault.

Hmm, on second thoughts, I think I was right the first time, and we do
need the unmap after the pages are truncated. With the lock_page code,
after the first unmap, we can get new ptes mapping pages, and
subsequently they can be COWed and then the original pte zapped before
the truncate loop checks it.

However, I wonder if we can't test mapping_mapped before the spinlock,
which would make most truncates cheaper?

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: mm-truncate-avoid-rmap-locks --]
[-- Type: text/plain, Size: 1053 bytes --]

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-04-24 15:02:51.000000000 +1000
+++ linux-2.6/mm/filemap.c	2007-05-09 17:30:47.000000000 +1000
@@ -2579,8 +2579,7 @@
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
 		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
-	       	if (mapping_mapped(mapping))
-			unmap_mapping_range(mapping, offset, write_len, 0);
+		unmap_mapping_range(mapping, offset, write_len, 0);
 	}
 
 	retval = filemap_write_and_wait(mapping);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2007-05-09 17:25:28.000000000 +1000
+++ linux-2.6/mm/memory.c	2007-05-09 17:30:22.000000000 +1000
@@ -1956,6 +1956,9 @@
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
+	if (!mapping_mapped(mapping))
+		return;
+
 	/* Check for overflow. */
 	if (sizeof(holelen) > sizeof(hlen)) {
 		long long holeend =

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-09 12:34         ` Nick Piggin
@ 2007-05-09 14:28           ` Hugh Dickins
  2007-05-09 14:45             ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-09 14:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Wed, 9 May 2007, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Wed, 2 May 2007, Nick Piggin wrote:
> 
> > > >But I'm pretty sure (to use your words!) regular truncate was not racy
> > > >before: I believe Andrea's sequence count was handling that case fine,
> > > >without a second unmap_mapping_range.
> > >
> > >OK, I think you're right. I _think_ it should also be OK with the
> > >lock_page version as well: we should not be able to have any pages
> > >after the first unmap_mapping_range call, because of the i_size
> > >write. So if we have no pages, there is nothing to 'cow' from.
> > 
> > I'd be delighted if you can remove those later unmap_mapping_ranges.
> > As I recall, the important thing for the copy pages is to be holding
> > the page lock (or whatever other serialization) on the copied page
> > still while the copy page is inserted into pagetable: that looks
> > to be so in your __do_fault.
> 
> Hmm, on second thoughts, I think I was right the first time, and we do
> need the unmap after the pages are truncated. With the lock_page code,
> after the first unmap, we can get new ptes mapping pages, and
> subsequently they can be COWed and then the original pte zapped before
> the truncate loop checks it.

The filesystem (or page cache) allows pages beyond i_size to come
in there?  That wasn't a problem before, was it?  But now it is?

> 
> However, I wonder if we can't test mapping_mapped before the spinlock,
> which would make most truncates cheaper?

Slightly cheaper, yes, though I doubt it'd be much in comparison with
actually doing any work in unmap_mapping_range or truncate_inode_pages.
Suspect you'd need a barrier of some kind between the i_size_write and
the mapping_mapped test?  But that's a change we could have made at
any time if we'd bothered, it's not really the issue here.

Hugh

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-09 14:28           ` Hugh Dickins
@ 2007-05-09 14:45             ` Nick Piggin
  2007-05-09 15:38               ` Hugh Dickins
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-09 14:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Hugh Dickins wrote:
> On Wed, 9 May 2007, Nick Piggin wrote:
> 
>>Hugh Dickins wrote:
>>
>>>On Wed, 2 May 2007, Nick Piggin wrote:
>>
>>>>>But I'm pretty sure (to use your words!) regular truncate was not racy
>>>>>before: I believe Andrea's sequence count was handling that case fine,
>>>>>without a second unmap_mapping_range.
>>>>
>>>>OK, I think you're right. I _think_ it should also be OK with the
>>>>lock_page version as well: we should not be able to have any pages
>>>>after the first unmap_mapping_range call, because of the i_size
>>>>write. So if we have no pages, there is nothing to 'cow' from.
>>>
>>>I'd be delighted if you can remove those later unmap_mapping_ranges.
>>>As I recall, the important thing for the copy pages is to be holding
>>>the page lock (or whatever other serialization) on the copied page
>>>still while the copy page is inserted into pagetable: that looks
>>>to be so in your __do_fault.
>>
>>Hmm, on second thoughts, I think I was right the first time, and we do
>>need the unmap after the pages are truncated. With the lock_page code,
>>after the first unmap, we can get new ptes mapping pages, and
>>subsequently they can be COWed and then the original pte zapped before
>>the truncate loop checks it.
> 
> 
> The filesystem (or page cache) allows pages beyond i_size to come
> in there?  That wasn't a problem before, was it?  But now it is?

The filesystem still doesn't, but if i_size is updated after the page
is returned, we can have a problem that was previously taken care of
with the truncate_count but now isn't.

>>However, I wonder if we can't test mapping_mapped before the spinlock,
>>which would make most truncates cheaper?
> 
> 
> Slightly cheaper, yes, though I doubt it'd be much in comparison with
> actually doing any work in unmap_mapping_range or truncate_inode_pages.

But if we're supposing the common case for truncate is unmapped mappings,
then the main cost there will be the locking, which I'm trying to avoid.
Hopefully with this patch, most truncate workloads would get faster, even
though truncating mapped files is going to be unavoidably slower.


> Suspect you'd need a barrier of some kind between the i_size_write and
> the mapping_mapped test?

The unmap_mapping_range that runs after the truncate_inode_pages should
run in the correct order, I believe.

>  But that's a change we could have made at
> any time if we'd bothered, it's not really the issue here.

I don't see how you could, because you need to increment truncate_count.

But I believe this is fixing the issue, even if it does so in a peripheral
manner, because it avoids the added cost for unmapped files.

-- 
SUSE Labs, Novell Inc.

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-09 14:45             ` Nick Piggin
@ 2007-05-09 15:38               ` Hugh Dickins
  2007-05-09 22:24                 ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-09 15:38 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

On Thu, 10 May 2007, Nick Piggin wrote:
> > 
> > The filesystem (or page cache) allows pages beyond i_size to come
> > in there?  That wasn't a problem before, was it?  But now it is?
> 
> The filesystem still doesn't, but if i_size is updated after the page
> is returned, we can have a problem that was previously taken care of
> with the truncate_count but now isn't.

But... I thought the page lock was now taking care of that in your
scheme?  truncate_inode_pages has to wait for the page lock, then
it finds the page is mapped and... ahh, it finds the copiee page
is not mapped, so doesn't do its own little unmap_mapping_range,
and the copied page squeaks through.  Drat.

I really think the truncate_count solution worked better, for
truncation anyway.  There may be persuasive reasons you need the
page lock for invalidation: I gave up on trying to understand the
required behaviour(s) for invalidation.

So, bring back (the original use of, not my tree marker use of)
truncate_count?  Hmm, you probably don't want to do that, because
there was some pleasure in removing the strange barriers associated
with it.

A second unmap_mapping_range is just one line of code - but it sure
feels like a defeat to me, calling the whole exercise into question.
(But then, you'd be right to say my perfectionism made it impossible
for me to come up with any solution to the invalidation issues.)

> > Suspect you'd need a barrier of some kind between the i_size_write and
> > the mapping_mapped test?
> 
> The unmap_mapping_range that runs after the truncate_inode_pages should
> run in the correct order, I believe.

Yes, if there's going to be that backup call, the first won't really
need a barrier.

> > But that's a change we could have made at
> > any time if we'd bothered, it's not really the issue here.
> 
> I don't see how you could, because you need to increment truncate_count.

Though indeed we did so, I don't see that we needed to increment
truncate_count in that case (nobody could be coming through
do_no_page on that file, when there are no mappings of it).

> But I believe this is fixing the issue, even if it does so in a peripheral
> manner, because it avoids the added cost for unmapped files.

It's a small improvement to your common case, I agree.

Hugh

* Re: 2.6.22 -mm merge plans -- vm bugfixes
  2007-05-09 15:38               ` Hugh Dickins
@ 2007-05-09 22:24                 ` Nick Piggin
  0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-09 22:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Andrea Arcangeli,
	Christoph Hellwig

Hugh Dickins wrote:
> On Thu, 10 May 2007, Nick Piggin wrote:
> 
>>>The filesystem (or page cache) allows pages beyond i_size to come
>>>in there?  That wasn't a problem before, was it?  But now it is?
>>
>>The filesystem still doesn't, but if i_size is updated after the page
>>is returned, we can have a problem that was previously taken care of
>>with the truncate_count but now isn't.
> 
> 
> But... I thought the page lock was now taking care of that in your
> scheme?  truncate_inode_pages has to wait for the page lock, then
> it finds the page is mapped and... ahh, it finds the copiee page
> is not mapped, so doesn't do its own little unmap_mapping_range,
> and the copied page squeaks through.  Drat.
> 
> I really think the truncate_count solution worked better, for
> truncation anyway.  There may be persuasive reasons you need the
> page lock for invalidation: I gave up on trying to understand the
> required behaviour(s) for invalidation.
> 
> So, bring back (the original use of, not my tree marker use of)
> truncate_count?  Hmm, you probably don't want to do that, because
> there was some pleasure in removing the strange barriers associated
> with it.
> 
> A second unmap_mapping_range is just one line of code - but it sure
> feels like a defeat to me, calling the whole exercise into question.
> (But then, you'd be right to say my perfectionism made it impossible
> for me to come up with any solution to the invalidation issues.)

Well we could bring back the truncate_count, but I think that sucks
because that's moving work into the page fault handler in order to
avoid a bit of work when truncating mapped files.


>>>But that's a change we could have made at
>>>any time if we'd bothered, it's not really the issue here.
>>
>>I don't see how you could, because you need to increment truncate_count.
> 
> 
> Though indeed we did so, I don't see that we needed to increment
> truncate_count in that case (nobody could be coming through
> do_no_page on that file, when there are no mappings of it).

Of course :P

-- 
SUSE Labs, Novell Inc.

* pcmcia ioctl removal
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (8 preceding siblings ...)
  2007-05-01  8:44 ` 2.6.22 -mm merge plans -- vm bugfixes Nick Piggin
@ 2007-05-01  8:46 ` Christoph Hellwig
  2007-05-01  8:56   ` Russell King
                     ` (3 more replies)
  2007-05-01  8:48 ` pci hotplug patches Christoph Hellwig
                   ` (12 subsequent siblings)
  22 siblings, 4 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  8:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, linux-pcmcia

>  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch

...

> Dominik is busy.  Will probably re-review and send these direct to Linus.

The patch above is the removal of cardmgr support.  While I'd love to
see this cruft gone it definitely needs maintainer judgement on whether
the time has come that no one relies on cardmgr anymore.


* Re: pcmcia ioctl removal
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
@ 2007-05-01  8:56   ` Russell King
  2007-05-01  8:57   ` Willy Tarreau
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 233+ messages in thread
From: Russell King @ 2007-05-01  8:56 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> 
> ...
> 
> > Dominik is busy.  Will probably re-review and send these direct to Linus.
> 
> The patch above is the removal of cardmgr support.  While I'd love to
> see this cruft gone it definitely needs maintainer judgement on whether
> the time has come that no one relies on cardmgr anymore.

And I still run and use a platform where the GUI issues cardmgr ioctls.
A recent kernel upgrade (from 2.6.9ish to something more recent) broke
the "eject" GUI button applet due to the fscking with the cardmgr ioctls,
and it thinks the wireless card is always plugged in (and therefore the
signal strength meter remains.)

With all the ioctls gone I'll probably lose the signal strength meter.

And no, I don't have the resources (read: code) to fix and rebuild
userspace since I didn't snarf them when the CVS server was alive a
few years back.

That's the problem with API changes - things always break.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
  2007-05-01  8:56   ` Russell King
@ 2007-05-01  8:57   ` Willy Tarreau
  2007-05-01  9:08     ` Andrew Morton
  2007-05-01  9:16   ` Robert P. J. Day
  2007-05-09 12:54   ` Pavel Machek
  3 siblings, 1 reply; 233+ messages in thread
From: Willy Tarreau @ 2007-05-01  8:57 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

Hi Christoph,

On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> 
> ...
> 
> > Dominik is busy.  Will probably re-review and send these direct to Linus.
> 
> The patch above is the removal of cardmgr support.  While I'd love to
> see this cruft gone it definitively needs maintainer judgement on whether
> they time has come that no one relies on cardmgr anymore.

Well, I've not followed evolutions in this area for a long time. Here's
what I get on my notebook :

willy@wtap:~$ uname -r
2.6.20-wt3-wtap
willy@wtap:~$ ps auxw|grep card   
root      1216  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
root      1221  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
root      1244  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
root      1251  0.0  0.0     0    0 ?        Ss   Apr28   0:00 /sbin/cardmgr

What's the new recommended way of using PCMCIA cards when cardmgr is gone ?

Thanks,
Willy


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  8:57   ` Willy Tarreau
@ 2007-05-01  9:08     ` Andrew Morton
  2007-05-01 14:46       ` Adrian Bunk
  0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01  9:08 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Christoph Hellwig, linux-kernel, linux-mm, linux-pcmcia

On Tue, 1 May 2007 10:57:10 +0200 Willy Tarreau <w@1wt.eu> wrote:

> Hi Christoph,
> 
> On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> > >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> > 
> > ...
> > 
> > > Dominik is busy.  Will probably re-review and send these direct to Linus.
> > 
> > The patch above is the removal of cardmgr support.  While I'd love to
> > see this cruft gone it definitively needs maintainer judgement on whether
> > they time has come that no one relies on cardmgr anymore.
> 
> Well, I've not followed evolutions in this area for a long time. Here's
> what I get on my notebook :
> 
> willy@wtap:~$ uname -r
> 2.6.20-wt3-wtap
> willy@wtap:~$ ps auxw|grep card   
> root      1216  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> root      1221  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> root      1244  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> root      1251  0.0  0.0     0    0 ?        Ss   Apr28   0:00 /sbin/cardmgr
> 

Yes, that seems premature.  feature-removal.txt is pretty useless for
getting people off old tools.  If we're ever to make this migration we'll
need loud and scary printks coming out of the kernel.  Probably it'll take
another year or two to get there *once* we've done that.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  9:08     ` Andrew Morton
@ 2007-05-01 14:46       ` Adrian Bunk
  0 siblings, 0 replies; 233+ messages in thread
From: Adrian Bunk @ 2007-05-01 14:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Willy Tarreau, Christoph Hellwig, linux-kernel, linux-mm,
	linux-pcmcia

On Tue, May 01, 2007 at 02:08:20AM -0700, Andrew Morton wrote:
> On Tue, 1 May 2007 10:57:10 +0200 Willy Tarreau <w@1wt.eu> wrote:
> 
> > Hi Christoph,
> > 
> > On Tue, May 01, 2007 at 09:46:23AM +0100, Christoph Hellwig wrote:
> > > >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> > > 
> > > ...
> > > 
> > > > Dominik is busy.  Will probably re-review and send these direct to Linus.
> > > 
> > > The patch above is the removal of cardmgr support.  While I'd love to
> > > see this cruft gone it definitively needs maintainer judgement on whether
> > > they time has come that no one relies on cardmgr anymore.
> > 
> > Well, I've not followed evolutions in this area for a long time. Here's
> > what I get on my notebook :
> > 
> > willy@wtap:~$ uname -r
> > 2.6.20-wt3-wtap
> > willy@wtap:~$ ps auxw|grep card   
> > root      1216  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> > root      1221  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> > root      1244  0.0  0.0     0    0 ?        S<   Apr28   0:00 [pccardd]
> > root      1251  0.0  0.0     0    0 ?        Ss   Apr28   0:00 /sbin/cardmgr
> > 
> 
> Yes, that seems premature.  feature-removal.txt is pretty useless for
> getting poeple off old tools.  If we're ever to make this migration we'll
> need loud and scary printks coming out of the kernel.  Probably it'll take
> another year or two to get there *once* we've done that.


You already said the same two years ago, and you forwarded a patch 
implementing exactly this nearly two years ago:


commit c352ec8ab87b065cd2edda171811f49ac7d0d5cd
Author: Dominik Brodowski <linux@dominikbrodowski.net>
Date:   Tue Sep 13 01:25:03 2005 -0700

    [PATCH] pcmcia: warn on IOCTL usage
    
    More visible user information of scheduled feature removal.
    
    Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/drivers/pcmcia/pcmcia_ioctl.c b/drivers/pcmcia/pcmcia_ioctl.c
index 39ba640..80969f7 100644
--- a/drivers/pcmcia/pcmcia_ioctl.c
+++ b/drivers/pcmcia/pcmcia_ioctl.c
@@ -376,6 +376,7 @@ static int ds_open(struct inode *inode, struct file *file)
     socket_t i = iminor(inode);
     struct pcmcia_socket *s;
     user_info_t *user;
+    static int warning_printed = 0;
 
     ds_dbg(0, "ds_open(socket %d)\n", i);
 
@@ -407,6 +408,17 @@ static int ds_open(struct inode *inode, struct file *file)
     s->user = user;
     file->private_data = user;
 
+    if (!warning_printed) {
+	    printk(KERN_INFO "pcmcia: Detected deprecated PCMCIA ioctl "
+			"usage.\n");
+	    printk(KERN_INFO "pcmcia: This interface will soon be removed from "
+			"the kernel; please expect breakage unless you upgrade "
+			"to new tools.\n");
+	    printk(KERN_INFO "pcmcia: see http://www.kernel.org/pub/linux/"
+			"utils/kernel/pcmcia/pcmcia.html for details.\n");
+	    warning_printed = 1;
+    }
+
     if (s->pcmcia_state.present)
 	queue_event(user, CS_EVENT_CARD_INSERTION);
     return 0;


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed


^ permalink raw reply related	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
  2007-05-01  8:56   ` Russell King
  2007-05-01  8:57   ` Willy Tarreau
@ 2007-05-01  9:16   ` Robert P. J. Day
  2007-05-01  9:44     ` Willy Tarreau
  2007-05-01 10:12     ` Jan Engelhardt
  2007-05-09 12:54   ` Pavel Machek
  3 siblings, 2 replies; 233+ messages in thread
From: Robert P. J. Day @ 2007-05-01  9:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Andrew Morton, linux-kernel, linux-mm, linux-pcmcia

On Tue, 1 May 2007, Christoph Hellwig wrote:

> >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
>
> ...
>
> > Dominik is busy.  Will probably re-review and send these direct to Linus.
>
> The patch above is the removal of cardmgr support.  While I'd love
> to see this cruft gone it definitively needs maintainer judgement on
> whether they time has come that no one relies on cardmgr anymore.

since i was the one who submitted the original patch to remove that
stuff, let me make an observation.

when i submitted a patch to remove, for instance, the traffic shaper
since it's clearly obsolete, i was told -- in no uncertain terms --
that that couldn't be done since there had been no warning about its
impending removal.

fair enough, i can accept that.

on the other hand, the features removal file contains the following:

...
What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
When:   November 2005
...

in other words, the PCMCIA ioctl feature *has* been listed as obsolete
for quite some time, and is already a *year and a half* overdue for
removal.

in short, it's annoying to take the position that stuff can't be
deleted without warning, then turn around and be reluctant to remove
stuff for which *more than ample warning* has already been given.
doing that just makes a joke of the features removal file, and makes
you wonder what its purpose is in the first place.

a little consistency would be nice here, don't you think?

rday
-- 
========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
========================================================================

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  9:16   ` Robert P. J. Day
@ 2007-05-01  9:44     ` Willy Tarreau
  2007-05-01 10:16       ` Robert P. J. Day
  2007-05-01 10:26       ` Gabriel C
  2007-05-01 10:12     ` Jan Engelhardt
  1 sibling, 2 replies; 233+ messages in thread
From: Willy Tarreau @ 2007-05-01  9:44 UTC (permalink / raw)
  To: Robert P. J. Day
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote:
> On Tue, 1 May 2007, Christoph Hellwig wrote:
> 
> > >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> >
> > ...
> >
> > > Dominik is busy.  Will probably re-review and send these direct to Linus.
> >
> > The patch above is the removal of cardmgr support.  While I'd love
> > to see this cruft gone it definitively needs maintainer judgement on
> > whether they time has come that no one relies on cardmgr anymore.
> 
> since i was the one who submitted the original patch to remove that
> stuff, let me make an observation.
> 
> when i submitted a patch to remove, for instance, the traffic shaper
> since it's clearly obsolete, i was told -- in no uncertain terms --
> that that couldn't be done since there had been no warning about its
> impending removal.
> 
> fair enough, i can accept that.
> 
> on the other hand, the features removal file contains the following:
> 
> ...
> What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
> When:   November 2005
> ...
> 
> in other words, the PCMCIA ioctl feature *has* been listed as obsolete
> for quite some time, and is already a *year and a half* overdue for
> removal.
> 
> in short, it's annoying to take the position that stuff can't be
> deleted without warning, then turn around and be reluctant to remove
> stuff for which *more than ample warning* has already been given.
> doing that just makes a joke of the features removal file, and makes
> you wonder what its purpose is in the first place.
> 
> a little consistency would be nice here, don't you think?

No, it just shows how useless this file is. What is needed is a big
warning during usage, not a file that nobody reads. Facts are :

  - 90% of people here do not even know that this file exists
  - 80% of the people who know about it do not consult it on a regular basis
  - 80% of those who consult it on a regular basis are not concerned
  - 75% of statistics are invented

=> only 20% of 20% of 10% of those who read LKML know that one feature
   they are concerned about will soon be removed = 0.4% of LKML readers.

If you put a warning in kernel messages (as I've seen for a long time
about tcpdump using obsolete AF_PACKET), close to 100% of the users
of the obsolete code who are likely to change their kernels will notice
it.

I'm sorry for your patch which may get delayed a lot. You would spend
less time stuffing warnings in areas affected by scheduled removal.

BTW, I'm not even against the end of cardmgr support, it's just that
I don't know what the alternative is, and I suspect that many users
do not either. A big warning would have brought them to google who
would have provided them with suggestions for alternatives.

Willy


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  9:44     ` Willy Tarreau
@ 2007-05-01 10:16       ` Robert P. J. Day
  2007-05-01 10:26       ` Gabriel C
  1 sibling, 0 replies; 233+ messages in thread
From: Robert P. J. Day @ 2007-05-01 10:16 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Christoph Hellwig, Andrew Morton, Linux kernel mailing list,
	linux-mm, linux-pcmcia

On Tue, 1 May 2007, Willy Tarreau wrote:

> On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote:
... snip ...
> > in other words, the PCMCIA ioctl feature *has* been listed as
> > obsolete for quite some time, and is already a *year and a half*
> > overdue for removal.
> >
> > in short, it's annoying to take the position that stuff can't be
> > deleted without warning, then turn around and be reluctant to remove
> > stuff for which *more than ample warning* has already been given.
> > doing that just makes a joke of the features removal file, and makes
> > you wonder what its purpose is in the first place.
> >
> > a little consistency would be nice here, don't you think?
>
> No, it just shows how useless this file is.

agreed.  it's mildly entertaining to have watched this raging
discussion over the last few days regarding bugs and emails and
bugzilla and adrian's regressions, while the one feature that's meant
to track aging and removable kernel features is essentially valueless,
and no one seems to care.

> What is needed is a big warning during usage, not a file that nobody
> reads.

agreed there as well.  but short of that, it would still be nice if
people took a minute, perused the feature removal file, and at least
brought it up-to-date.  if it's going to have any value, then:

1) all proposed removal dates should be reviewed to make sure they're
still meaningful,

2) stuff that's overdue for removal should be either removed, or have
its expiry date brought forward, and

3) stuff in the kernel tree that is understood to be obsolete or
nearly so should have an entry added to that file, so that the clock
can at least *start* ticking for that stuff, and you can at least say
you *tried* to warn current users.
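
A rough sketch of how point 2 could be checked mechanically, assuming
entries keep the "When:   <Month> <Year>" form quoted earlier in this
thread.  The helpers parse_when() and months_overdue() are hypothetical
names invented for this illustration; this is plain userspace C, not
anything from the kernel tree.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse a "When:" value such as "November 2005"
 * into a month (1-12) and a year.  Returns 0 on success, -1 otherwise.
 */
int parse_when(const char *when, int *month, int *year)
{
	static const char *const names[12] = {
		"January", "February", "March", "April", "May", "June",
		"July", "August", "September", "October", "November",
		"December"
	};
	char name[16];
	int i;

	if (sscanf(when, "%15s %d", name, year) != 2)
		return -1;
	for (i = 0; i < 12; i++) {
		if (strcmp(name, names[i]) == 0) {
			*month = i + 1;
			return 0;
		}
	}
	return -1;
}

/*
 * Hypothetical helper: whole months by which an entry is overdue as of
 * now_month/now_year.  A result <= 0 means the date has not passed yet.
 * Returns 0 on success, -1 if the "When:" value could not be parsed.
 */
int months_overdue(const char *when, int now_month, int now_year,
		   int *overdue)
{
	int month, year;

	if (parse_when(when, &month, &year) < 0)
		return -1;
	*overdue = (now_year - year) * 12 + (now_month - month);
	return 0;
}
```

For the PCMCIA entry, months_overdue("November 2005", 5, 2007, &n)
sets n to 18, which is the "year and a half" mentioned above.  Entries
whose "When:" field is free-form text would simply fail to parse and
need a human to look at them.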

as a start, i posted last month the results of running the simple
command:

  $ grep -iw obsolete $(find . -name Kconfig\*)

and some of what was printed is clearly misleading.  (don't worry,
tilman -- we're not going to reopen that whole isdn4linux thing. :-)

i mean, what of the following is actually obsolete:

  * traffic policing
  * IP6 Userspace queueing via NETLINK
  * IP Userspace queueing via NETLINK
  * ebt: ulog support
  * Traffic Shaper

and so on (and there's that legacy PM thing as well).

> I'm sorry for your patch which may get delayed a lot.

obviously, leaving stuff like that in the kernel doesn't actually
*hurt* anything but, yeah, it's a tad annoying to invest a few minutes
to do some janitor work based on what should be killable, submit the
patch, then have people freak out about how that is still an essential
feature.

bottom line:  if you want janitor folks to help out with cleanup, make
sure they know what can legitimately be cleaned, and stop wasting
people's time.

rday

p.s.  now if there were only a way to, say, tag various kernel
features as "obsolete" or "deprecated" ...  :-)

-- 
========================================================================
Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry
Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
========================================================================

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  9:44     ` Willy Tarreau
  2007-05-01 10:16       ` Robert P. J. Day
@ 2007-05-01 10:26       ` Gabriel C
  2007-05-01 10:52         ` Willy Tarreau
  1 sibling, 1 reply; 233+ messages in thread
From: Gabriel C @ 2007-05-01 10:26 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

Willy Tarreau wrote:
> On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote:
>   
>> On Tue, 1 May 2007, Christoph Hellwig wrote:
>>
>>     
>>>>  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
>>>>         
>>> ...
>>>
>>>       
>>>> Dominik is busy.  Will probably re-review and send these direct to Linus.
>>>>         
>>> The patch above is the removal of cardmgr support.  While I'd love
>>> to see this cruft gone it definitively needs maintainer judgement on
>>> whether they time has come that no one relies on cardmgr anymore.
>>>       
>> since i was the one who submitted the original patch to remove that
>> stuff, let me make an observation.
>>
>> when i submitted a patch to remove, for instance, the traffic shaper
>> since it's clearly obsolete, i was told -- in no uncertain terms --
>> that that couldn't be done since there had been no warning about its
>> impending removal.
>>
>> fair enough, i can accept that.
>>
>> on the other hand, the features removal file contains the following:
>>
>> ...
>> What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
>> When:   November 2005
>> ...
>>
>> in other words, the PCMCIA ioctl feature *has* been listed as obsolete
>> for quite some time, and is already a *year and a half* overdue for
>> removal.
>>
>> in short, it's annoying to take the position that stuff can't be
>> deleted without warning, then turn around and be reluctant to remove
>> stuff for which *more than ample warning* has already been given.
>> doing that just makes a joke of the features removal file, and makes
>> you wonder what its purpose is in the first place.
>>
>> a little consistency would be nice here, don't you think?
>>     
>
> No, it just shows how useless this file is. What is needed is a big
> warning during usage, not a file that nobody reads. Facts are :
>
>   - 90% of people here do not even know that this file exists
>   - 80% of the people who know about it do not consult it on a regular basis
>   - 80% of those who consult it on a regular basis are not concerned
>   - 75% of statistics are invented
>
> => only 20% of 20% of 10% of those who read LKML know that one feature
>    they are concerned about will soon be removed = 0.4% of LKML readers.
>
> If you put a warning in kernel messages (as I've seen for a long time
> about tcpdump using obsolete AF_PACKET), close to 100% of the users
> of the obsolete code who are likely to change their kernels will notice
> it.
>
>   

Hmm ? There is already a fat warning in dmesg for a long time now.


snip

...

pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover.
pcmcia: This interface will soon be removed from the kernel; please 
expect breakage unless you upgrade to new tools.
pcmcia: see 
http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.

...


> I'm sorry for your patch which may get delayed a lot. You would spend
> fewer time stuffing warnings in areas affected by scheduled removal.
>
> BTW, I'm not even against the end of cardmgr support, it's just that
> I don't know what the alternative is, and I suspect that many users
> do not either. A big warning would have brought them to google who
> would have provided them with suggestions for alternatives.
>
> Willy
>
>
>   

Regards,

Gabriel


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01 10:26       ` Gabriel C
@ 2007-05-01 10:52         ` Willy Tarreau
  0 siblings, 0 replies; 233+ messages in thread
From: Willy Tarreau @ 2007-05-01 10:52 UTC (permalink / raw)
  To: Gabriel C
  Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On Tue, May 01, 2007 at 12:26:48PM +0200, Gabriel C wrote:
> Willy Tarreau wrote:
> >On Tue, May 01, 2007 at 05:16:13AM -0400, Robert P. J. Day wrote:
> >  
> >>On Tue, 1 May 2007, Christoph Hellwig wrote:
> >>
> >>    
> >>>> pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> >>>>        
> >>>...
> >>>
> >>>      
> >>>>Dominik is busy.  Will probably re-review and send these direct to 
> >>>>Linus.
> >>>>        
> >>>The patch above is the removal of cardmgr support.  While I'd love
> >>>to see this cruft gone it definitively needs maintainer judgement on
> >>>whether they time has come that no one relies on cardmgr anymore.
> >>>      
> >>since i was the one who submitted the original patch to remove that
> >>stuff, let me make an observation.
> >>
> >>when i submitted a patch to remove, for instance, the traffic shaper
> >>since it's clearly obsolete, i was told -- in no uncertain terms --
> >>that that couldn't be done since there had been no warning about its
> >>impending removal.
> >>
> >>fair enough, i can accept that.
> >>
> >>on the other hand, the features removal file contains the following:
> >>
> >>...
> >>What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
> >>When:   November 2005
> >>...
> >>
> >>in other words, the PCMCIA ioctl feature *has* been listed as obsolete
> >>for quite some time, and is already a *year and a half* overdue for
> >>removal.
> >>
> >>in short, it's annoying to take the position that stuff can't be
> >>deleted without warning, then turn around and be reluctant to remove
> >>stuff for which *more than ample warning* has already been given.
> >>doing that just makes a joke of the features removal file, and makes
> >>you wonder what its purpose is in the first place.
> >>
> >>a little consistency would be nice here, don't you think?
> >>    
> >
> >No, it just shows how useless this file is. What is needed is a big
> >warning during usage, not a file that nobody reads. Facts are :
> >
> >  - 90% of people here do not even know that this file exists
> >  - 80% of the people who know about it do not consult it on a regular 
> >  basis
> >  - 80% of those who consult it on a regular basis are not concerned
> >  - 75% of statistics are invented
> >
> >=> only 20% of 20% of 10% of those who read LKML know that one feature
> >   they are concerned about will soon be removed = 0.4% of LKML readers.
> >
> >If you put a warning in kernel messages (as I've seen for a long time
> >about tcpdump using obsolete AF_PACKET), close to 100% of the users
> >of the obsolete code who are likely to change their kernels will notice
> >it.
> >
> >  
> 
> Hmm ? There is already a fat warning in dmesg for a long time now.
> 
> 
> snip
> 
> ...
> 
> pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover.
> pcmcia: This interface will soon be removed from the kernel; please 
> expect breakage unless you upgrade to new tools.
> pcmcia: see 
> http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
> 
> ...

Oh you're terribly right ! I have grepped my logs and found it lying
there too in the middle of the memory probe messages and the IO probe
ones. I never noticed it. I know it's my fault, but I don't think
"fat warning" really characterizes it, given the verbose context
around it :

pcmcia: parent PCI bridge Memory window: 0x90000000 - 0x903fffff
pcmcia: parent PCI bridge Memory window: 0x30000000 - 0x3dffffff
pccard: PCMCIA card inserted into slot 0
cs: memory probe 0x90000000-0x903fffff: excluding 0x90000000-0x9003ffff 0x90080000-0x900bffff 0x90100000-0x9013ffff 0x90180000-0x901bffff 0x90200000-0x9023ffff 0x90280000-0x902bffff 0x90300000-0x9033ffff 0x90380000-0x903bffff
pcmcia: registering new device pcmcia0.0
pcmcia: Detected deprecated PCMCIA ioctl usage from process: cardmgr.
pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools.
pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
cs: IO port probe 0xc00-0xcff: clean.
cs: IO port probe 0x820-0x8ff: clean.
cs: IO port probe 0x800-0x80f: clean.
cs: IO port probe 0x3e0-0x4ff: clean.
cs: IO port probe 0x100-0x3af: excluding 0x140-0x14f 0x378-0x37f
cs: IO port probe 0xa00-0xaff: clean.

Now I have the URL for the details ;-)

BTW, I think we should standardize on some formatting to display messages
about deprecated/obsolete code. Maybe something like this would be more
noticeable :

cs: memory probe 0x90000000-0x903fffff: excluding 0x90000000-0x9003ffff 0x90080000-0x900bffff 0x90100000-0x9013ffff 0x90180000-0x901bffff 0x90200000-0x9023ffff 0x90280000-0x902bffff 0x90300000-0x9033ffff 0x90380000-0x903bffff
pcmcia: registering new device pcmcia0.0
WARNING !!! Detected DEPRECATED PCMCIA ioctl usage from process: cardmgr.
WARNING !!!   This process may stop working past November 2005.
WARNING !!!   see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
cs: IO port probe 0xc00-0xcff: clean.
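
A minimal sketch of what a shared formatter for such messages might
look like, so that every subsystem emits the same three-line layout.
The function name format_deprecation() and its parameters are
assumptions made up for this example; it is written as userspace C
with snprintf() and does not correspond to any existing kernel
interface.

```c
#include <stdio.h>

/*
 * Hypothetical formatter: build the three "WARNING !!!" lines shown
 * above into buf.  Returns what snprintf() returns: the number of
 * characters that would have been written, excluding the final NUL.
 */
int format_deprecation(char *buf, size_t len, const char *what,
		       const char *process, const char *deadline,
		       const char *url)
{
	return snprintf(buf, len,
		"WARNING !!! Detected DEPRECATED %s usage from process: %s.\n"
		"WARNING !!!   This process may stop working past %s.\n"
		"WARNING !!!   see %s for details.\n",
		what, process, deadline, url);
}
```

Calling it with "PCMCIA ioctl", "cardmgr", "November 2005" and the
pcmcia.html URL reproduces the example block.  Keeping the prefix in
one place would also make the messages easy to grep for.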

Regards,
Willy

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01  9:16   ` Robert P. J. Day
  2007-05-01  9:44     ` Willy Tarreau
@ 2007-05-01 10:12     ` Jan Engelhardt
  2007-05-01 11:00       ` Willy Tarreau
  2007-05-01 19:10       ` Russell King
  1 sibling, 2 replies; 233+ messages in thread
From: Jan Engelhardt @ 2007-05-01 10:12 UTC (permalink / raw)
  To: Robert P. J. Day
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia


On May 1 2007 05:16, Robert P. J. Day wrote:
>
>on the other hand, the features removal file contains the following:
>
>...
>What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
>When:   November 2005
>...
>
>in other words, the PCMCIA ioctl feature *has* been listed as obsolete
>for quite some time, and is already a *year and a half* overdue for
>removal.
>
>in short, it's annoying to take the position that stuff can't be
>deleted without warning, then turn around and be reluctant to remove
>stuff for which *more than ample warning* has already been given.
>doing that just makes a joke of the features removal file, and makes
>you wonder what its purpose is in the first place.
>
>a little consistency would be nice here, don't you think?

I think this could raise their attention...

init/Makefile
obj-y += obsolete.o

init/obsolete.c:
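#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static __init int obsolete_init(void)
{
	/* Adjacent string literals; a literal cannot span raw newlines. */
	printk(KERN_WARNING "\e[1;31m\n"
	       "The following stuff is gonna get removed \e[5;37m SOON: \e[0m\n"
	       "\t- cardmgr\n"
	       "\t- foobar\n"
	       "\t- bweebol\n"
	       "\n");
	/* Sets the task state itself, so the 3 second delay really happens. */
	schedule_timeout_uninterruptible(3 * HZ);
	return 0;
}
__initcall(obsolete_init);

static __exit void obsolete_exit(void) {}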



Jan
-- 

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01 10:12     ` Jan Engelhardt
@ 2007-05-01 11:00       ` Willy Tarreau
  2007-05-01 12:06         ` Konstantin Münning
  2007-05-01 13:56         ` Rogan Dawes
  2007-05-01 19:10       ` Russell King
  1 sibling, 2 replies; 233+ messages in thread
From: Willy Tarreau @ 2007-05-01 11:00 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote:
> 
> On May 1 2007 05:16, Robert P. J. Day wrote:
> >
> >on the other hand, the features removal file contains the following:
> >
> >...
> >What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
> >When:   November 2005
> >...
> >
> >in other words, the PCMCIA ioctl feature *has* been listed as obsolete
> >for quite some time, and is already a *year and a half* overdue for
> >removal.
> >
> >in short, it's annoying to take the position that stuff can't be
> >deleted without warning, then turn around and be reluctant to remove
> >stuff for which *more than ample warning* has already been given.
> >doing that just makes a joke of the features removal file, and makes
> >you wonder what its purpose is in the first place.
> >
> >a little consistency would be nice here, don't you think?
> 
> I think this could raise their attention...
> 
> init/Makefile
> obj-y += obsolete.o
> 
> init/obsolete.c:
> static __init int obsolete_init(void)
> {
> 	printk("\e[1;31m""
> 
> The following stuff is gonna get removed \e[5;37m SOON: \e[0m
> 	- cardmgr
> 	- foobar
> 	- bweebol
> 
> ");
> 	schedule_timeout(3 * HZ);
> 	return;
> }
> 
> static __exit void obsolete_exit(void) {}

There's something I like here : the fact that all features are centralized
and not hidden in the noise. Clearly we need some standard inside the kernel
to manage obsolete code as well as we currently do by hand.

Willy


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01 11:00       ` Willy Tarreau
@ 2007-05-01 12:06         ` Konstantin Münning
  2007-05-01 13:56         ` Rogan Dawes
  1 sibling, 0 replies; 233+ messages in thread
From: Konstantin Münning @ 2007-05-01 12:06 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jan Engelhardt, linux-pcmcia, linux-kernel, linux-mm,
	Robert P. J. Day, Andrew Morton

Willy Tarreau wrote:
> On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote:
>> On May 1 2007 05:16, Robert P. J. Day wrote:
>>> on the other hand, the features removal file contains the following:
>>>
>>> ...
>>> What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
>>> When:   November 2005
>>> ...
>>>
>>> in other words, the PCMCIA ioctl feature *has* been listed as obsolete
>>> for quite some time, and is already a *year and a half* overdue for
>>> removal.
>>>
>>> in short, it's annoying to take the position that stuff can't be
>>> deleted without warning, then turn around and be reluctant to remove
>>> stuff for which *more than ample warning* has already been given.
>>> doing that just makes a joke of the features removal file, and makes
>>> you wonder what its purpose is in the first place.
>>>
>>> a little consistency would be nice here, don't you think?
>> I think this could raise their attention...
>>
>> init/Makefile
>> obj-y += obsolete.o
>>
>> init/obsolete.c:
>> static __init int obsolete_init(void)
>> {
>> 	printk("\e[1;31m""
>>
>> The following stuff is gonna get removed \e[5;37m SOON: \e[0m
>> 	- cardmgr
>> 	- foobar
>> 	- bweebol
>>
>> ");
>> 	schedule_timeout(3 * HZ);
>> 	return;
>> }
>>
>> static __exit void obsolete_exit(void) {}
> 
> There's something I like here : the fact that all features are centralized
> and not hidden in the noise. Clearly we need some standard inside the kernel
> to manage obsolete code as well as we currently do by hand.
> 
> Willy

What about something like the tainted flag, whose status can be displayed
easily? Even better if a list of the obsolete features in use could be
displayed on request as well. That way you don't need to search the
logs. A standardized obsolete function like the one above could do the
whole job.
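
To make the idea concrete, here is a small sketch of such a flag word,
written as userspace C.  Everything in it is hypothetical: the enum
values, obsolete_mark_used() and obsolete_list() are names invented
for this illustration, and a real kernel version would presumably need
atomic bit operations and some file to expose the list through.

```c
#include <stdio.h>

/* Hypothetical feature bits; the names are illustrative only. */
enum obsolete_feature {
	OBSOLETE_PCMCIA_IOCTL,
	OBSOLETE_TRAFFIC_SHAPER,
	OBSOLETE_NR
};

static const char *const obsolete_names[OBSOLETE_NR] = {
	"pcmcia-ioctl",
	"traffic-shaper",
};

/* One bit per obsolete feature that has been used since boot. */
static unsigned long obsolete_used;

/*
 * Record that a feature was used.  Returns 1 on the first use, so the
 * caller can print a one-off warning, and 0 on every later use.
 */
int obsolete_mark_used(enum obsolete_feature f)
{
	unsigned long bit = 1UL << f;

	if (obsolete_used & bit)
		return 0;
	obsolete_used |= bit;
	return 1;
}

/*
 * Write a comma-separated list of the features used so far into buf.
 * Returns the number of features listed; names that do not fit are
 * left out rather than truncated.
 */
int obsolete_list(char *buf, size_t len)
{
	size_t off = 0;
	int i, count = 0;

	if (len)
		buf[0] = '\0';
	for (i = 0; i < OBSOLETE_NR; i++) {
		int n;

		if (!(obsolete_used & (1UL << i)))
			continue;
		n = snprintf(buf + off, len - off, "%s%s",
			     count ? "," : "", obsolete_names[i]);
		if (n < 0 || (size_t)n >= len - off) {
			if (off < len)
				buf[off] = '\0';	/* drop the partial name */
			break;
		}
		off += n;
		count++;
	}
	return count;
}
```

The return value of obsolete_mark_used() gives the "warn only when
actually used" behaviour, while obsolete_list() covers the "display on
request" part without anyone having to dig through the logs.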

Just my 2 cents.
-- 
Konstantin Münning

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: pcmcia ioctl removal
  2007-05-01 11:00       ` Willy Tarreau
  2007-05-01 12:06         ` Konstantin Münning
@ 2007-05-01 13:56         ` Rogan Dawes
  1 sibling, 0 replies; 233+ messages in thread
From: Rogan Dawes @ 2007-05-01 13:56 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Jan Engelhardt, linux-pcmcia, linux-kernel, linux-mm,
	Robert P. J. Day, Andrew Morton

Willy Tarreau wrote:
> On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote:
>> On May 1 2007 05:16, Robert P. J. Day wrote:
>>> on the other hand, the features removal file contains the following:
>>>
>>> ...
>>> What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
>>> When:   November 2005
>>> ...
>>>
>>> in other words, the PCMCIA ioctl feature *has* been listed as obsolete
>>> for quite some time, and is already a *year and a half* overdue for
>>> removal.
>>>
>>> in short, it's annoying to take the position that stuff can't be
>>> deleted without warning, then turn around and be reluctant to remove
>>> stuff for which *more than ample warning* has already been given.
>>> doing that just makes a joke of the features removal file, and makes
>>> you wonder what its purpose is in the first place.
>>>
>>> a little consistency would be nice here, don't you think?
>> I think this could raise their attention...
>>
>> init/Makefile
>> obj-y += obsolete.o
>>
>> init/obsolete.c:
>> static __init int obsolete_init(void)
>> {
>> 	printk("\e[1;31m""
>>
>> The following stuff is gonna get removed \e[5;37m SOON: \e[0m
>> 	- cardmgr
>> 	- foobar
>> 	- bweebol
>>
>> ");
>> 	schedule_timeout(3 * HZ);
>> 	return;
>> }
>>
>> static __exit void obsolete_exit(void) {}
> 
> There's something I like here : the fact that all features are centralized
> and not hidden in the noise. Clearly we need some standard inside the kernel
> to manage obsolete code as well as we currently do by hand.
> 
> Willy

The difference between this function and the PCAP/TCPDUMP warning is 
that the warning only showed up when the obsolete functionality was 
actually used.

Maybe a mechanism to automatically increase the severity of reporting as 
the removal date approaches would be an idea? i.e. for each new kernel 
that you build leading up to the removal date, a severity is calculated
based on the time until official removal, and then, depending on the 
severity, the message can be logged in various ways.
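
A rough sketch of how such a severity could be computed, assuming dates
are tracked with month granularity. The thresholds (3 and 12 months) and
all names are made up for illustration; nothing like this exists in the
kernel.

```c
#include <assert.h>

/* Hypothetical severity levels, least to most urgent. */
enum obsolete_severity {
	OBSOLETE_INFO,		/* more than a year left */
	OBSOLETE_NOTICE,	/* a year or less left */
	OBSOLETE_WARNING,	/* three months or less left */
	OBSOLETE_OVERDUE	/* removal date reached or passed */
};

/* Months counted from year 0, so two dates compare as plain integers. */
static int month_index(int year, int month)
{
	assert(month >= 1 && month <= 12);
	return year * 12 + (month - 1);
}

/* Map the time left between kernel build date and removal date to a severity. */
static enum obsolete_severity
obsolete_severity_for(int build_year, int build_month,
		      int removal_year, int removal_month)
{
	int left = month_index(removal_year, removal_month) -
		   month_index(build_year, build_month);

	if (left <= 0)
		return OBSOLETE_OVERDUE;
	if (left <= 3)
		return OBSOLETE_WARNING;
	if (left <= 12)
		return OBSOLETE_NOTICE;
	return OBSOLETE_INFO;
}
```

With the PCMCIA ioctl entry (November 2005) and a kernel built in May
2007, this would return OBSOLETE_OVERDUE, which could then select the
loudest logging method.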

Rogan



* Re: pcmcia ioctl removal
  2007-05-01 10:12     ` Jan Engelhardt
  2007-05-01 11:00       ` Willy Tarreau
@ 2007-05-01 19:10       ` Russell King
  2007-05-01 20:41         ` Jan Engelhardt
  1 sibling, 1 reply; 233+ messages in thread
From: Russell King @ 2007-05-01 19:10 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote:
> init/obsolete.c:
> static __init int obsolete_init(void)
> {
> 	printk("\e[1;31m""
> 
> The following stuff is gonna get removed \e[5;37m SOON: \e[0m
> 	- cardmgr
> 	- foobar
> 	- bweebol
> 
> ");
> 	schedule_timeout(3 * HZ);
> 	return;
> }

The kernel console isn't VT102 compatible.  It doesn't understand any
escape codes, at all.  Neither does sysklogd.  So the above will just
end up as rubbish on your console.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: pcmcia ioctl removal
  2007-05-01 19:10       ` Russell King
@ 2007-05-01 20:41         ` Jan Engelhardt
  0 siblings, 0 replies; 233+ messages in thread
From: Jan Engelhardt @ 2007-05-01 20:41 UTC (permalink / raw)
  To: Russell King
  Cc: Robert P. J. Day, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On May 1 2007 20:10, Russell King wrote:
>On Tue, May 01, 2007 at 12:12:36PM +0200, Jan Engelhardt wrote:
>> init/obsolete.c:
>> static __init int obsolete_init(void)
>> {
>> 	printk("\e[1;31m""
>> 
>> The following stuff is gonna get removed \e[5;37m SOON: \e[0m
>> 	- cardmgr
>> 	- foobar
>> 	- bweebol
>> 
>> ");
>> 	schedule_timeout(3 * HZ);
>> 	return;
>> }
>
>The kernel console isn't VT102 compatible.  It doesn't understand any
>escape codes, at all.  Neither does sysklogd.  So the above will just
>end up as rubbish on your console.

It will (should) at least show up as nicely as in the C source code.
With escape codes, but largely readable.
If someone knows how to directly spew it to tty0 (the current active console -
most likely tty1), all the better.
Anyway, I just wanted to point out how to really highlight it for the user to
see. Although then there would be the distros who obscure it with funky
bootsplash screens. But hopefully, their users would not need to care too much
for old stuff (gets updated through the distro's update mechanism)

Jan
-- 


* Re: pcmcia ioctl removal
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
                     ` (2 preceding siblings ...)
  2007-05-01  9:16   ` Robert P. J. Day
@ 2007-05-09 12:54   ` Pavel Machek
  2007-05-09 13:00     ` Robert P. J. Day
  2007-05-09 13:03     ` Adrian Bunk
  3 siblings, 2 replies; 233+ messages in thread
From: Pavel Machek @ 2007-05-09 12:54 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

Hi!

> >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> 
> ...
> 
> > Dominik is busy.  Will probably re-review and send these direct to Linus.
> 
> The patch above is the removal of cardmgr support.  While I'd love to
> see this cruft gone it definitively needs maintainer judgement on whether
> they time has come that no one relies on cardmgr anymore.

I remember needing cardmgr a few months ago on an sa-1100 arm system. I'm
not sure this is obsolete-enough to kill.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: pcmcia ioctl removal
  2007-05-09 12:54   ` Pavel Machek
@ 2007-05-09 13:00     ` Robert P. J. Day
  2007-05-09 13:03     ` Adrian Bunk
  1 sibling, 0 replies; 233+ messages in thread
From: Robert P. J. Day @ 2007-05-09 13:00 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

On Wed, 9 May 2007, Pavel Machek wrote:

> Hi!
>
> > >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> >
> > ...
> >
> > > Dominik is busy.  Will probably re-review and send these direct to Linus.
> >
> > The patch above is the removal of cardmgr support.  While I'd love
> > to see this cruft gone it definitively needs maintainer judgement
> > on whether they time has come that no one relies on cardmgr
> > anymore.
>
> I remember needing cardmgr few months ago on sa-1100 arm system. I'm
> not sure this is obsolete-enough to kill.

in that case, someone really should update
feature-removal-schedule.txt, which currently reads:

What:   PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl])
When:   November 2005
...

rday
-- 
========================================================================
Robert P. J. Day Linux Consulting, Training and Annoying Kernel
Pedantry Waterloo, Ontario, CANADA

http://fsdev.net/wiki/index.php?title=Main_Page
========================================================================


* Re: pcmcia ioctl removal
  2007-05-09 12:54   ` Pavel Machek
  2007-05-09 13:00     ` Robert P. J. Day
@ 2007-05-09 13:03     ` Adrian Bunk
  2007-05-09 19:11       ` Romano Giannetti
  1 sibling, 1 reply; 233+ messages in thread
From: Adrian Bunk @ 2007-05-09 13:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	linux-pcmcia

On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote:
> Hi!
> 
> > >  pcmcia-delete-obsolete-pcmcia_ioctl-feature.patch
> > 
> > ...
> > 
> > > Dominik is busy.  Will probably re-review and send these direct to Linus.
> > 
> > The patch above is the removal of cardmgr support.  While I'd love to
> > see this cruft gone it definitively needs maintainer judgement on whether
> > they time has come that no one relies on cardmgr anymore.
> 
> I remember needing cardmgr few months ago on sa-1100 arm system. I'm
> not sure this is obsolete-enough to kill.

Why didn't pcmciautils work?

> 							Pavel

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: pcmcia ioctl removal
  2007-05-09 13:03     ` Adrian Bunk
@ 2007-05-09 19:11       ` Romano Giannetti
  2007-05-10 12:40         ` Adrian Bunk
  0 siblings, 1 reply; 233+ messages in thread
From: Romano Giannetti @ 2007-05-09 19:11 UTC (permalink / raw)
  To: Adrian Bunk
  Cc: Pavel Machek, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On Wed, 2007-05-09 at 15:03 +0200, Adrian Bunk wrote:
> On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote:
>  relies on cardmgr anymore.
> >
> > I remember needing cardmgr few months ago on sa-1100 arm system. I'm
> > not sure this is obsolete-enough to kill.
>
> Why didn't pcmciautils work?

Until 2.6.20 was out I had a problem with pcmciautils (it did not
recognise the second function of multi-function pcmcia cards that
needed a firmware .cis file), and the only way to use the card was with
cardmgr, way after Nov 2005 :-).

Now it is solved (modulo that sometimes the pcmcia modem is ttyS1,
sometimes ttyS2, but that's another story --- and probably my fault).
But I wonder if similar problems are hidden away... what about putting
the ioctls under a normally-disabled option and releasing a kernel with
it?

Romano

--
This communication contains confidential information. It is for the exclusive use of the intended addressee. If you are not the intended addressee, please note that any form of distribution, copying or use of this communication or the information in it is strictly prohibited by law. If you have received this communication in error, please immediately notify the sender by reply e-mail and destroy this message. Thank you for your cooperation.


* Re: pcmcia ioctl removal
  2007-05-09 19:11       ` Romano Giannetti
@ 2007-05-10 12:40         ` Adrian Bunk
  0 siblings, 0 replies; 233+ messages in thread
From: Adrian Bunk @ 2007-05-10 12:40 UTC (permalink / raw)
  To: Romano Giannetti
  Cc: Pavel Machek, Christoph Hellwig, Andrew Morton, linux-kernel,
	linux-mm, linux-pcmcia

On Wed, May 09, 2007 at 09:11:52PM +0200, Romano Giannetti wrote:
> On Wed, 2007-05-09 at 15:03 +0200, Adrian Bunk wrote:
> > On Wed, May 09, 2007 at 12:54:16PM +0000, Pavel Machek wrote:
> >  relies on cardmgr anymore.
> > >
> > > I remember needing cardmgr few months ago on sa-1100 arm system. I'm
> > > not sure this is obsolete-enough to kill.
> >
> > Why didn't pcmciautils work?
> 
> I have had a problem until 2.6.20 was out with pcmciautils (it did not
> recognise the second function of multi-functions pcmcia cards that
> needed a firmware .cis file), and the only way to use it was with
> cardmgr, way after Nov 2005 :-).
> 
> Now it is solved (modulo that sometime the pcmcia modem is ttyS1,
> sometime ttyS2, but that's another history --- and probably my fault).
> But I wonder if similar problems are hidden away... what about put the
> ioctls under a normally-disabled option and let a kernel out with it?

It has printed a runtime warning to the user since 2005.
And people won't notice a changed default when using "make oldconfig".

Are there any known regressions left?
Otherwise, the best way for getting problem reports for pcmciautils is 
to remove the ioctl (that's an experience from similar cases in other 
parts of the kernel)...

> Romano

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* pci hotplug patches
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (9 preceding siblings ...)
  2007-05-01  8:46 ` pcmcia ioctl removal Christoph Hellwig
@ 2007-05-01  8:48 ` Christoph Hellwig
  2007-05-02  3:57   ` Greg KH
  2007-05-01  8:54 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Christoph Hellwig
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  8:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, greg

>  fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch
>  remove-pci_dac_dma_-apis.patch
>  round_up-macro-cleanup-in-drivers-pci.patch
>  pcie-remove-spin_lock_unlocked.patch
>  cpqphp-partially-convert-to-use-the-kthread-api.patch
>  ibmphp-partially-convert-to-use-the-kthreads-api.patch
>  cpci_hotplug-partially-convert-to-use-the-kthread-api.patch
>  msi-fix-arm-compile.patch
>  support-pci-mcfg-space-on-intel-i915-bridges.patch
>  pci-syscallc-switch-to-refcounting-api.patch
> 
> Stuff to (various levels of re-)send to Greg for the PCI tree.  I'll probably
> drop the kthread patches as they seemed a bit half-baked and I've lost track
> of which ones have which levels of baking.

All the partial kthread conversions were superseded by full conversions
from me.  I've only got feedback from the cpci maintainer, and he acked
my patch together with a simple fix from him.


* Re: pci hotplug patches
  2007-05-01  8:48 ` pci hotplug patches Christoph Hellwig
@ 2007-05-02  3:57   ` Greg KH
  2007-05-13 20:59     ` Christoph Hellwig
  0 siblings, 1 reply; 233+ messages in thread
From: Greg KH @ 2007-05-02  3:57 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm
  Cc: kristen.c.accardi

On Tue, May 01, 2007 at 09:48:41AM +0100, Christoph Hellwig wrote:
> >  fix-gregkh-pci-pci-remove-the-broken-pci_multithread_probe-option.patch
> >  remove-pci_dac_dma_-apis.patch
> >  round_up-macro-cleanup-in-drivers-pci.patch
> >  pcie-remove-spin_lock_unlocked.patch
> >  cpqphp-partially-convert-to-use-the-kthread-api.patch
> >  ibmphp-partially-convert-to-use-the-kthreads-api.patch
> >  cpci_hotplug-partially-convert-to-use-the-kthread-api.patch
> >  msi-fix-arm-compile.patch
> >  support-pci-mcfg-space-on-intel-i915-bridges.patch
> >  pci-syscallc-switch-to-refcounting-api.patch
> > 
> > Stuff to (various levels of re-)send to Greg for the PCI tree.  I'll probably
> > drop the kthread patches as they seemed a bit half-baked and I've lost track
> > of which ones have which levels of baking.
> 
> All the partially kthread conversion were superceed with full conversion
> from me.  I've only got feedback from the cpci maintainer, and he acked
> my patch together with a simple fix from him.

Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't
added them to my tree.  It would probably be best for everyone involved
to send them to her instead :)

thanks,

greg k-h


* Re: pci hotplug patches
  2007-05-02  3:57   ` Greg KH
@ 2007-05-13 20:59     ` Christoph Hellwig
  2007-05-14 11:48       ` Greg KH
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-13 20:59 UTC (permalink / raw)
  To: Greg KH
  Cc: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	kristen.c.accardi

On Tue, May 01, 2007 at 08:57:45PM -0700, Greg KH wrote:
> Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't
> added them to my tree.  It would probably be best for everyone involved
> to send them to her instead :)

FYI: MAINTAINERS still lists you as the maintainer of the cpqphp driver.



* Re: pci hotplug patches
  2007-05-13 20:59     ` Christoph Hellwig
@ 2007-05-14 11:48       ` Greg KH
  0 siblings, 0 replies; 233+ messages in thread
From: Greg KH @ 2007-05-14 11:48 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, linux-kernel, linux-mm,
	kristen.c.accardi

On Sun, May 13, 2007 at 09:59:24PM +0100, Christoph Hellwig wrote:
> On Tue, May 01, 2007 at 08:57:45PM -0700, Greg KH wrote:
> > Hm, I'm no longer the PCI Hotplug maintainer, so that's why I haven't
> > added them to my tree.  It would probably be best for everyone involved
> > to send them to her instead :)
> 
> FYI: MAINTAINERS still lists you as the maintainer of the cpqphp driver.

Ick, I'll go fix that up, I don't even have the hardware anymore
(donated it to a local university...)

thanks,

greg k-h


* cache-pipe-buf-page-address-for-non-highmem-arch.patch
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (10 preceding siblings ...)
  2007-05-01  8:48 ` pci hotplug patches Christoph Hellwig
@ 2007-05-01  8:54 ` Christoph Hellwig
       [not found]   ` <20070501020441.10b6a003.akpm@linux-foundation.org>
  2007-05-01  8:55 ` consolidate-generic_writepages-and-mpage_writepages.patch Christoph Hellwig
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  8:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, kenchen

>  cache-pipe-buf-page-address-for-non-highmem-arch.patch

I still don't like this one at all.  If page_address on x86_64 is too
slow we should fix the root cause.



[parent not found: <20070501020441.10b6a003.akpm@linux-foundation.org>]

* Re: cache-pipe-buf-page-address-for-non-highmem-arch.patch
       [not found]   ` <20070501020441.10b6a003.akpm@linux-foundation.org>
@ 2007-05-03  3:48     ` Ken Chen
  0 siblings, 0 replies; 233+ messages in thread
From: Ken Chen @ 2007-05-03  3:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Hellwig, linux-kernel, linux-mm, Andi Kleen

On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> Fair enough, it is a bit of an ugly thing.  And I see no measurements there
> on what the overall speedup was for any workload.
>
> Ken, which memory model was in use?  sparsemem?

discontigmem with config_numa on.


* consolidate-generic_writepages-and-mpage_writepages.patch
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (11 preceding siblings ...)
  2007-05-01  8:54 ` cache-pipe-buf-page-address-for-non-highmem-arch.patch Christoph Hellwig
@ 2007-05-01  8:55 ` Christoph Hellwig
  2007-05-01  9:17 ` 2.6.22 -mm merge plans Pekka Enberg
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  8:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

>  consolidate-generic_writepages-and-mpage_writepages.patch
> 
> Might merge.  I forget what happened to this.

ACK from me on this one.



* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (12 preceding siblings ...)
  2007-05-01  8:55 ` consolidate-generic_writepages-and-mpage_writepages.patch Christoph Hellwig
@ 2007-05-01  9:17 ` Pekka Enberg
  2007-05-01  9:24   ` Christoph Hellwig
                     ` (2 more replies)
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
                   ` (8 subsequent siblings)
  22 siblings, 3 replies; 233+ messages in thread
From: Pekka Enberg @ 2007-05-01  9:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra

On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote:
>  revoke-special-mmap-handling.patch

[snip]

> Hold.  This is tricky stuff and I don't think we've seen sufficient
> reviewing, testing and acking yet?

Agreed. While Peter and Nick have done some review of the patches, I
would really like VFS maintainers to review them before merge.
Christoph, have you had the chance to take a look at it?


* Re: 2.6.22 -mm merge plans
  2007-05-01  9:17 ` 2.6.22 -mm merge plans Pekka Enberg
@ 2007-05-01  9:24   ` Christoph Hellwig
  2007-05-01  9:37   ` Peter Zijlstra
  2007-05-01 12:19   ` Andi Kleen
  2 siblings, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-01  9:24 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra

On Tue, May 01, 2007 at 12:17:28PM +0300, Pekka Enberg wrote:
> On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> > revoke-special-mmap-handling.patch
> 
> [snip]
> 
> >Hold.  This is tricky stuff and I don't think we've seen sufficient
> >reviewing, testing and acking yet?
> 
> Agreed. While Peter and Nick have done some review of the patches, I
> would really like VFS maintainers to review them before merge.
> Christoph, have you had the chance to take a look at it?

Not so far, but it's on my long list of highly useful things I want to review.


* Re: 2.6.22 -mm merge plans
  2007-05-01  9:17 ` 2.6.22 -mm merge plans Pekka Enberg
  2007-05-01  9:24   ` Christoph Hellwig
@ 2007-05-01  9:37   ` Peter Zijlstra
  2007-05-01 12:19   ` Andi Kleen
  2 siblings, 0 replies; 233+ messages in thread
From: Peter Zijlstra @ 2007-05-01  9:37 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin

On Tue, 2007-05-01 at 12:17 +0300, Pekka Enberg wrote:
> On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> >  revoke-special-mmap-handling.patch
> 
> [snip]
> 
> > Hold.  This is tricky stuff and I don't think we've seen sufficient
> > reviewing, testing and acking yet?
> 
> Agreed. While Peter and Nick have done some review of the patches, I
> would really like VFS maintainers to review them before merge.
> Christoph, have you had the chance to take a look at it?

I'll have another look at it; also, I'll try to work through Mel's
patches once again.



* Re: 2.6.22 -mm merge plans
  2007-05-01  9:17 ` 2.6.22 -mm merge plans Pekka Enberg
  2007-05-01  9:24   ` Christoph Hellwig
  2007-05-01  9:37   ` Peter Zijlstra
@ 2007-05-01 12:19   ` Andi Kleen
  2007-05-01 17:12     ` Pekka Enberg
  2 siblings, 1 reply; 233+ messages in thread
From: Andi Kleen @ 2007-05-01 12:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra

"Pekka Enberg" <penberg@cs.helsinki.fi> writes:

> On 5/1/07, Andrew Morton <akpm@linux-foundation.org> wrote:
> >  revoke-special-mmap-handling.patch
> 
> [snip]
> 
> > Hold.  This is tricky stuff and I don't think we've seen sufficient
> > reviewing, testing and acking yet?
> 
> Agreed. While Peter and Nick have done some review of the patches, I
> would really like VFS maintainers to review them before merge.
> Christoph, have you had the chance to take a look at it?

Also have the cache performance concerns raised on the original review
been addressed?

-Andi


* Re: 2.6.22 -mm merge plans
  2007-05-01 12:19   ` Andi Kleen
@ 2007-05-01 17:12     ` Pekka Enberg
  0 siblings, 0 replies; 233+ messages in thread
From: Pekka Enberg @ 2007-05-01 17:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, linux-kernel, linux-mm, hch, npiggin, a.p.zijlstra

Hi Andi,

On 01 May 2007 14:19:45 +0200, Andi Kleen <andi@firstfloor.org> wrote:
> Also have the cache performance concerns raised on the original review
> been addressed?

I am only aware of the fget_light() related issues Eric Dumazet raised,
but those are fixed. If you're thinking of something else, could you please
remind me what it is?


* fragmentation avoidance Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (13 preceding siblings ...)
  2007-05-01  9:17 ` 2.6.22 -mm merge plans Pekka Enberg
@ 2007-05-01 10:16 ` Mel Gorman
  2007-05-01 13:02   ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
                     ` (3 more replies)
  2007-05-01 12:17 ` Andi Kleen
                   ` (7 subsequent siblings)
  22 siblings, 4 replies; 233+ messages in thread
From: Mel Gorman @ 2007-05-01 10:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, apw, clameter, y-goto

On (30/04/07 16:20), Andrew Morton didst pronounce:
>  add-apply_to_page_range-which-applies-a-function-to-a-pte-range.patch
>  add-apply_to_page_range-which-applies-a-function-to-a-pte-range-fix.patch
>  safer-nr_node_ids-and-nr_node_ids-determination-and-initial.patch
>  use-zvc-counters-to-establish-exact-size-of-dirtyable-pages.patch
>  proper-prototype-for-hugetlb_get_unmapped_area.patch
>  mm-remove-gcc-workaround.patch
>  slab-ensure-cache_alloc_refill-terminates.patch
>  mm-more-rmap-checking.patch
>  mm-make-read_cache_page-synchronous.patch
>  fs-buffer-dont-pageuptodate-without-page-locked.patch
>  allow-oom_adj-of-saintly-processes.patch
>  introduce-config_has_dma.patch
>  mm-slabc-proper-prototypes.patch
>  mm-detach_vmas_to_be_unmapped-fix.patch
> 
> Misc MM things.  Will merge.

After Andy's mail, I am guessing that the patch below is also going here
in the stack as a cleanup.

add-pfn_valid_within-helper-for-sub-max_order-hole-detection.patch

>  add-a-bitmap-that-is-used-to-track-flags-affecting-a-block-of-pages.patch
>  add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated.patch
>  split-the-free-lists-for-movable-and-unmovable-allocations.patch
>  choose-pages-from-the-per-cpu-list-based-on-migration-type.patch
>  add-a-configure-option-to-group-pages-by-mobility.patch
>  drain-per-cpu-lists-when-high-order-allocations-fail.patch
>  move-free-pages-between-lists-on-steal.patch
>  group-short-lived-and-reclaimable-kernel-allocations.patch
>  group-high-order-atomic-allocations.patch
>  do-not-group-pages-by-mobility-type-on-low-memory-systems.patch
>  bias-the-placement-of-kernel-pages-at-lower-pfns.patch
>  be-more-agressive-about-stealing-when-migrate_reclaimable-allocations-fallback.patch
>  fix-corruption-of-memmap-on-ia64-sparsemem-when-mem_section-is-not-a-power-of-2.patch

Plus the patch below from Andy's pfn_valid_within() series would be here:

   anti-fragmentation-switch-over-to-pfn_valid_within.patch

These patches are the grouping pages by mobility patches. They get tested
every time someone boots the machine from the perspective that they affect
the page allocator. It is working to keep fragmentation problems to a
minimum and being exercised.  We have beaten it heavily here on tests
with a variety of machines using the system that drives test.kernel.org
for both functionality and performance testing. That covers x86, x86_64,
ppc64 and occasionally IA64. Granted, there are corner-case machines out
there or we'd never receive bug reports at all.

They are currently being reviewed by Christoph Lameter. His feedback in
the linux-mm thread "Antifrag patchset comments" has given me a TODO list
which I'm currently working through. So far, there has been no fundamental
mistake in my opinion and the additional work consists of logical extensions.

The closest thing to a fundamental mistake was grouping pages by
MAX_ORDER_NR_PAGES instead of an arbitrary order. What I did was fine for
x86_64, i386 and ppc64 but not as useful for IA64 with 1GB worth of memory
in MAX_ORDER_NR_PAGES. I also missed some temporary allocations, as picked
up in Christoph's review.

>  create-the-zone_movable-zone.patch
>  allow-huge-page-allocations-to-use-gfp_high_movable.patch
>  x86-specify-amount-of-kernel-memory-at-boot-time.patch
>  ppc-and-powerpc-specify-amount-of-kernel-memory-at-boot-time.patch
>  x86_64-specify-amount-of-kernel-memory-at-boot-time.patch
>  ia64-specify-amount-of-kernel-memory-at-boot-time.patch
>  add-documentation-for-additional-boot-parameter-and-sysctl.patch
>  handle-kernelcore=-boot-parameter-in-common-code-to-avoid-boot-problem-on-ia64.patch
> 
> Mel's moveable-zone work.

These patches are what creates ZONE_MOVABLE. The last 6 patches should be
collapsed into a single patch:

	handle-kernelcore=-generic

I believe Yasunori Goto is looking at these from the perspective of memory
hot-remove and has caught a few bugs in the past. Goto-san may be able to
comment on whether they have been reviewed recently.

The main complexity is in one function in patch one which determines where
the PFN is in each node for ZONE_MOVABLE. Getting that right so that the
requested amount of kernel memory is spread as evenly as possible is just
not straightforward.
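
To illustrate why the even spread is awkward, here is a simplified
userspace sketch. It is not the function from the patch: it works on
plain page counts instead of PFN ranges, ignores holes and alignment,
and the name spread_kernelcore() is invented for this example. The
awkward part it does show is that small nodes cannot take a full share,
so the remainder has to be redistributed over the nodes that still have
room.

```c
#include <assert.h>
#include <stddef.h>

/*
 * node_pages[i]   - pages present in node i (input)
 * kernel_pages[i] - pages of node i kept for the kernel (output); the
 *                   rest of the node would become movable
 * required        - total pages requested for kernel use
 *
 * Returns the number of requested pages that could not be placed, which
 * is 0 unless the request exceeds the total memory.
 */
static unsigned long spread_kernelcore(const unsigned long *node_pages,
				       unsigned long *kernel_pages,
				       size_t nr_nodes,
				       unsigned long required)
{
	size_t i;

	assert(nr_nodes == 0 || (node_pages && kernel_pages));

	for (i = 0; i < nr_nodes; i++)
		kernel_pages[i] = 0;

	while (required > 0) {
		unsigned long share;
		size_t usable = 0;

		/* Count the nodes that still have room. */
		for (i = 0; i < nr_nodes; i++)
			if (kernel_pages[i] < node_pages[i])
				usable++;
		if (usable == 0)
			break;

		/* Even share for this pass, at least one page. */
		share = required / usable;
		if (share == 0)
			share = 1;

		/* Give each node its share, capped by the room it has left. */
		for (i = 0; i < nr_nodes && required > 0; i++) {
			unsigned long room = node_pages[i] - kernel_pages[i];
			unsigned long take = room < share ? room : share;

			if (take > required)
				take = required;
			kernel_pages[i] += take;
			required -= take;
		}
	}
	return required;
}
```

For nodes of 100, 1000 and 1000 pages and a request of 900 pages, the
first pass places 100, 300 and 300 pages, and the second pass splits the
remaining 200 between the two large nodes, giving 100, 400 and 400.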

> I don't believe that this has had sufficient review and I'm sure that it
> hasn't had sufficient third-party testing.  Most of the approbations thus far
> have consisted of people liking the overall idea, based on the changelogs and
> multi-year-old discussions.
> 
> For such a large and core change I'd have expected more detailed reviewing
> effort and more third-party testing.  And I STILL haven't made time to review
> the code in detail myself.
> 
> So I'm a bit uncomfortable with moving ahead with these changes.
> 

Ok. It is getting reviewed by Christoph and I'm going through the TODO items
it yielded. Andy has also been regularly reviewing them which is probably
why they have had fewer public errors than you might expect from something
like this. Christoph may like to comment more here.

> <snip>
> 
>  lumpy-reclaim-v4.patch

And I guess this patch also moves here

lumpy-move-to-using-pfn_valid_within.patch

> 
> This is in a similar situation to the moveable-zone work.  Sounds great on
> paper, but it needs considerable third-party testing and review.  It is a
> major change to core MM and, we hope, a significant advance.  On paper.

Andy will probably comment more here. Like the fragmentation stuff, we have
beaten this heavily in tests.

I'm not sure of its review situation.

> More Mel things, and linkage between Mel-things and lumpy reclaim.  It's here
> where the patch ordering gets into a mess and things won't improve if
> moveable-zones and lumpy-reclaim get deferred.  Such a deferral would limit my
> ability to queue more MM changes for 2.6.23.
> 

This is where the three patches were originally. From the other thread,
I am assuming these are sorted out.

> <snip>
> 
>  bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.patch
>  remove-page_group_by_mobility.patch
>  dont-group-high-order-atomic-allocations.patch
> 
> More moveable-zone work.
> 

This is the MIGRATE_RESERVE patch and two patches that back out parts of the
grouping pages by mobility stack. If possible, these patches should move to
the end of that stack. To fix the ordering, would it be helpful to provide
a fresh stack based on 2.6.21? That would delete 4 patches in all: the two
that introduce configuration items and highorder atomic groupings, and these
two patches that subsequently remove them.

> <SNIP>
> 
>  slub-exploit-page-mobility-to-increase-allocation-order.patch
> 
> Slub entanglement with moveable-zones.  Will merge if moveable-zones is merged.
> 

Well, grouping pages by mobility is what it really depends on. The
ZONE_MOVABLE is not required for SLUB. However, I get the point and agree
with it. If the rest of SLUB gets merged, this patch could be moved to the
end of the grouping by mobility stack.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans -- lumpy reclaim
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
@ 2007-05-01 13:02   ` Andy Whitcroft
  2007-05-01 18:03     ` Peter Zijlstra
  2007-05-01 19:00     ` Andrew Morton
  2007-05-01 14:54   ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 233+ messages in thread
From: Andy Whitcroft @ 2007-05-01 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-kernel, linux-mm, clameter, y-goto,
	Peter Zijlstra

Mel Gorman wrote:

<snip>

>>  lumpy-reclaim-v4.patch
> 
> And I guess this patch also moves here
> 
> lumpy-move-to-using-pfn_valid_within.patch
> 
>> This is in a similar situation to the moveable-zone work.  Sounds great on
>> paper, but it needs considerable third-party testing and review.  It is a
>> major change to core MM and, we hope, a significant advance.  On paper.
> 
> Andy will probably comment more here. Like the fragmentation stuff, we have
> beaten this heavily in tests.

With this stack the basic functionality for Lumpy reclaim is complete.
Better integration with kswapd is desirable, but IMO that should be a
separate change.

In testing it has produced significant improvements in the likelihood of
reclaiming a page (reclaim effectiveness) at very high orders (where the
likelihood of success is least), and effectiveness at lower orders
should be better again.  In general -mm testing lumpy is triggered for
any stalled allocation above order-0; it is common to see stack
allocations triggering lumpy under higher load.  kswapd also now
utilises lumpy when required.

As Mel has indicated a lot of automated testing has been done on these
patches.  As reclaim is only entered when low on memory, our testing
focuses on pushing the system to a heavily fragmented state
where reclaim is used heavily.  This testing has not shown any
regressions and shows improved effectiveness particularly under load.

Effectiveness for regular reclaim is based on random distributions, as
such it is only likely to successfully reclaim pages at lower orders.
Lumpy reclaim improves on this by actively targeting reclaim on areas at
the orders required and so succeeds at significantly higher order.  Very
high order allocations require better layout, from the mobility patches.
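
As a rough illustration of what "targeting reclaim on areas at the orders
required" means: the candidate area for an order-N request is the naturally
aligned block of 2^N page frames around a chosen page. The sketch below only
shows that arithmetic; the struct and function names are hypothetical and are
not taken from the lumpy patches.

```c
#include <assert.h>

/* Hypothetical type for illustration; not part of the lumpy reclaim patches. */
struct sketch_pfn_range {
	unsigned long start;	/* first pfn of the aligned block */
	unsigned long end;	/* one past the last pfn of the block */
};

/*
 * Return the naturally aligned block of 2^order page frames that contains
 * pfn. A targeted reclaim pass would try to free every page in this range
 * instead of picking pages at random positions.
 */
static struct sketch_pfn_range sketch_lumpy_block(unsigned long pfn,
						  unsigned int order)
{
	struct sketch_pfn_range r;
	unsigned long nr_pages;

	assert(order < 8 * sizeof(unsigned long));
	nr_pages = 1UL << order;
	r.start = pfn & ~(nr_pages - 1);
	r.end = r.start + nr_pages;
	return r;
}
```

For example, for pfn 1234 at order 4 the block covers pfns 1232 to 1247.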

I have some primitive stats patches which we have used in performance
testing.  Perhaps those could be brought up to date to provide better
visibility into lumpy's operation.  Again this would be a separate patch.

> I'm not sure of its review situation.

As lumpy reclaim and grouping-by-mobility are complementary patch sets
(in that they both assist at the highest order) we work pretty closely
and I generally pass all my patches past Mel before general release.
Early versions were based on patches from Peter Zijlstra who also
reviewed earlier versions if memory serves.  The changes since then have
been reviewed by Mel and Andrew Morton only to my knowledge.

Perhaps Peter would have some time to take a look over the latest stack
as it appears in -mm when that releases; ping me for a patch kit if you
want it before then :).

<snip>

-apw

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans -- lumpy reclaim
  2007-05-01 13:02   ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
@ 2007-05-01 18:03     ` Peter Zijlstra
  2007-05-01 19:00     ` Andrew Morton
  1 sibling, 0 replies; 233+ messages in thread
From: Peter Zijlstra @ 2007-05-01 18:03 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Mel Gorman, Andrew Morton, linux-kernel, linux-mm, clameter,
	y-goto

On Tue, 2007-05-01 at 14:02 +0100, Andy Whitcroft wrote:

> Perhaps Peter would have some time to take a look over the latest stack
> as it appears in -mm when that releases; ping me for a patch kit if you
> want it before then :).

Lumpy-reclaim -v7, as per the roll-up provided privately;

Code is looking good, I like what you did to it :-)

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans -- lumpy reclaim
  2007-05-01 13:02   ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
  2007-05-01 18:03     ` Peter Zijlstra
@ 2007-05-01 19:00     ` Andrew Morton
  1 sibling, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 19:00 UTC (permalink / raw)
  To: Andy Whitcroft
  Cc: Mel Gorman, linux-kernel, linux-mm, clameter, y-goto,
	Peter Zijlstra

On Tue, 01 May 2007 14:02:41 +0100 Andy Whitcroft <apw@shadowen.org> wrote:

> I have some primitive stats patches which we have used in performance
> testing.  Perhaps those could be brought up to date to provide better
> visibility into lumpy's operation.  Again this would be a separate patch.

Feel free to add new counters in /proc/vmstat - perhaps per-order
success and fail rates?  Monitoring the ratio between those would show
how effective lumpiness is being, perhaps.

It's always nice to see what's going on in there.
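
A minimal userspace sketch of the per-order bookkeeping suggested above,
assuming one success and one fail counter per order. The names and the
11-order limit are assumptions for illustration; this is not the vmstat
interface.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 11	/* assumed number of orders, for illustration */

/* Hypothetical per-order counters; not an existing kernel structure. */
struct sketch_lumpy_stats {
	unsigned long success[SKETCH_MAX_ORDER];
	unsigned long fail[SKETCH_MAX_ORDER];
};

/* Record the outcome of one reclaim attempt at the given order. */
static void sketch_lumpy_account(struct sketch_lumpy_stats *s,
				 unsigned int order, int succeeded)
{
	assert(order < SKETCH_MAX_ORDER);
	if (succeeded)
		s->success[order]++;
	else
		s->fail[order]++;
}

/* Success rate in percent for one order, or -1 if nothing was recorded. */
static int sketch_lumpy_success_pct(const struct sketch_lumpy_stats *s,
				    unsigned int order)
{
	unsigned long total;

	assert(order < SKETCH_MAX_ORDER);
	total = s->success[order] + s->fail[order];
	if (!total)
		return -1;
	return (int)(s->success[order] * 100 / total);
}
```

Watching that percentage per order over time would be one way to see how
effective lumpiness is at each allocation size.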

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
  2007-05-01 13:02   ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
@ 2007-05-01 14:54   ` Christoph Lameter
  2007-05-01 19:00     ` Mel Gorman
  2007-05-01 18:57   ` Andrew Morton
  2007-05-07 13:07   ` Yasunori Goto
  3 siblings, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-01 14:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, y-goto

On Tue, 1 May 2007, Mel Gorman wrote:

>    anti-fragmentation-switch-over-to-pfn_valid_within.patch
> 
> These patches are the grouping pages by mobility patches. They get tested
> every time someone boots the machine from the perspective that they affect
> the page allocator. It is working to keep fragmentation problems to a
> minimum and being exercised.  We have beaten it heavily here on tests
> with a variety of machines using the system that drives test.kernel.org
> for both functionality and performance testing. That covers x86, x86_64,
> ppc64 and occasionally IA64. Granted, there are corner-case machines out
> there or we'd never receive bug reports at all.
> 
> They are currently being reviewed by Christoph Lameter. His feedback in
> the linux-mm thread "Antifrag patchset comments" has given me a TODO list
> which I'm currently working through. So far, there has been no fundamental
> mistake in my opinion and the additional work is logical extensions.

I think we really urgently need a defragmentation solution in Linux in 
order to support higher page allocations for various purposes. SLUB f.e. 
would benefit from it and the large blocksize patches are not reasonable 
without such a method.

However, the current code is not up to the task. I did not see a clean 
categorization of allocations nor a consistent handling of those. The 
cleanup work that would have to be done throughout the kernel is not 
there. It is spotty. There seems to be a series of heuristics driving this 
thing (I have to agree with Nick there). The temporary allocations that 
were missed are just a few that I found. The review of the rest of the 
kernel was not done. Mel said that he fixed up locations that showed up to 
be a problem in testing. That is another issue: Too much focus on testing 
instead of conceptual cleanness and clean code in the kernel. It looks 
like this is geared for a specific series of tests on specific platforms 
and also to a particular allocation size (max order sized huge pages).

There are major technical problems with

1. Large Scale allocs. Multiple MAX_ORDER blocks as required by the 
   antifrag patches may not exist on all platforms. Thus the antifrag 
   patches will not be able to generate their MAX_ORDER sections. We
   could reduce MAX_ORDER on some platforms but that would have other
   implications like limiting the highest order allocation.

2. Small huge page size support. F.e. IA64 can support down to page size
   huge pages. The antifrag patches handle huge page in a special way. 
   They are categorized as movable. Small huge pages may 
   therefore contaminate the movable area.

3. Defining the size of ZONE_MOVABLE. This was done to guarantee
   availability of movable memory but the practical effect is to
   guarantee that we panic when too many unreclaimable allocations have
   been done.

I have already said during the review that IMHO the patches are not ready 
for merging. They are currently more like a prototype that explores ideas. 
The generalization steps are not done.

How we could make progress:

1. Develop a useful categorization of allocations in the kernel whose
   utility goes beyond the antifrag patches. I.e. length of 
   the objects existence and the method of reclaim could be useful in 
   various contexts.

2. Have statistics of these various allocations.

3. Page allocator should gather statistics on how memory was allocated in
   the various categories.

4. The available data can then be used to drive more intelligent reclaim 
   and develop methods of antifrag or defragmentation.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
  2007-05-01 14:54   ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter
@ 2007-05-01 19:00     ` Mel Gorman
  0 siblings, 0 replies; 233+ messages in thread
From: Mel Gorman @ 2007-05-01 19:00 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, y-goto

On Tue, 1 May 2007, Christoph Lameter wrote:

> On Tue, 1 May 2007, Mel Gorman wrote:
>
>>    anti-fragmentation-switch-over-to-pfn_valid_within.patch
>>
>> These patches are the grouping pages by mobility patches. They get tested
>> every time someone boots the machine from the perspective that they affect
>> the page allocator. It is working to keep fragmentation problems to a
>> minimum and being exercised.  We have beaten it heavily here on tests
>> with a variety of machines using the system that drives test.kernel.org
>> for both functionality and performance testing. That covers x86, x86_64,
>> ppc64 and occasionally IA64. Granted, there are corner-case machines out
>> there or we'd never receive bug reports at all.
>>
>> They are currently being reviewed by Christoph Lameter. His feedback in
>> the linux-mm thread "Antifrag patchset comments" has given me a TODO list
>> which I'm currently working through. So far, there has been no fundamental
>> mistake in my opinion and the additional work is logical extensions.
>
> I think we really urgently need a defragmentation solution in Linux in
> order to support higher page allocations for various purposes. SLUB f.e.
> would benefit from it and the large blocksize patches are not reasonable
> without such a method.
>

I continue to maintain that anti-fragmentation is a pre-requisite for
any defragmentation mechanism to be effective without trashing overall
performance. If allocation success rates are low when everything possible
has been reclaimed as is the case without fragmentation avoidance, then
defragmentation will not help unless the 1:1 phys:virt mapping is broken,
which incurs its own considerable set of problems.

> However, the current code is not up to the task. I did not see a clean
> categorization of allocations nor a consistent handling of those. The
> cleanup work that would have to be done throughout the kernel is not
> there.

The choice of mobility marker to use in each case was deliberate (even if I
have made mistakes but what else is review for?). The choice by default is
UNMOVABLE as it's the safe choice even if it may be sub-optimal.  The description
of the mobility types may not be the clearest. For example, buffers were
placed beside page cache in MOVABLE because they can both be reclaimed in
the same fashion - I consider moving it to disk to be as "movable" as any
other definition of the word but in your world movable always means page
migration which has led to some confusion. They could have been separated
out as MOVABLE and BUFFERS for a conceptually cleaner split but it did not
seem necessary because the more types there are, the bigger the memory and
performance footprint becomes. Additional flag groupings like GFP_BUFFERS
could be defined that alias to MOVABLE if you felt it would make the code
clearer but functionally, the behaviour remains the same. This is similar
to your feedback on the treatment of GFP_TEMPORARY.

There can be as many alias mobility types as you wish but if more "real"
types are required, you can have as many as you want as long as NR_PAGEBLOCK_BITS
is increased properly and allocflags_to_migratetype() is able to translate
GFP flags to the appropriate mobility type. It increases the performance
and memory footprint though.
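
To make the alias idea concrete, here is a simplified userspace sketch of
translating allocation flags to a mobility type. The flag values, the enum
and the function name are all made up for illustration; only the shape of
the mapping (default UNMOVABLE, aliases costing nothing extra) follows the
description above.

```c
#include <assert.h>

/* Hypothetical flag bits; the real GFP flag values are different. */
#define SKETCH_GFP_RECLAIMABLE	0x1u
#define SKETCH_GFP_MOVABLE	0x2u
/* An alias adds no new "real" type, so it needs no extra pageblock bits. */
#define SKETCH_GFP_BUFFERS	SKETCH_GFP_MOVABLE

enum sketch_migratetype {
	SKETCH_UNMOVABLE,	/* the safe default */
	SKETCH_RECLAIMABLE,
	SKETCH_MOVABLE
};

/* Translate allocation flags to a mobility type. */
static enum sketch_migratetype sketch_flags_to_migratetype(unsigned int flags)
{
	unsigned int mask = SKETCH_GFP_RECLAIMABLE | SKETCH_GFP_MOVABLE;

	/* A caller must not claim to be both movable and reclaimable. */
	assert((flags & mask) != mask);

	if (flags & SKETCH_GFP_MOVABLE)
		return SKETCH_MOVABLE;
	if (flags & SKETCH_GFP_RECLAIMABLE)
		return SKETCH_RECLAIMABLE;
	return SKETCH_UNMOVABLE;
}
```

A caller passing SKETCH_GFP_BUFFERS ends up in the same group as any other
movable allocation, which is the "functionally the same" point made above.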

> It is spotty. There seems to be a series of heuristics driving this
> thing (I have to agree with Nick there). The temporary allocations that
> were missed are just a few that I found. The review of the rest of the
> kernel was not done.

The review for temporary allocations was aimed at catching the most common
callers, not every single one of them because a full review of every caller
is a large undertaking.  If anything, it makes more sense to do a review of
all callers at the end when the core mechanism is finished. The default to
treat them as UNMOVABLE is sensible.

> Mel said that he fixed up locations that showed up to
> be a problem in testing. That is another issue: Too much focus on testing
> instead of conceptual cleanness and clean code in the kernel.

The patches started as a thought experiment of what "should work". They
were then tested to find flaws in the model and the results were fed back
in. How is that a disadvantage exactly?

> It looks
> like this is geared for a specific series of tests on specific platforms
> and also to a particular allocation size (max order sized huge pages).
>

Some series of tests had to be chosen and one combination was chosen
that was known to be particularly hostile to external fragmentation -
i.e. large numbers of kernel cache allocations at the same time as page
cache allocations. No one has suggested an alternative test that would be
more suitable. The platforms used were x86, x86_64 and ppc64 which are not
exactly insignificant platforms. At the time, I didn't have an IA64 machine and
frankly the one I have now does not always boot so testing is not as thorough.

Huge page sized pages were chosen because they were the hardest allocation
to satisfy. If they could be allocated successfully, it stood to reason that
smaller allocations would succeed at least as well.

Hugepages and MAX_ORDER pages were close to the same size on x86, x86_64
and ppc64 which is why that figure was chosen. I point out that while IA64
can specify hugepagesz= to change the hugepage size, it's not documented
in Documentation/kernel-parameters.txt or I might have spotted this sooner.

These decisions were not random.

> There are major technical problems with
>
> 1. Large Scale allocs. Multiple MAX_ORDER blocks as required by the
>   antifrag patches may not exist on all platforms. Thus the antifrag
>   patches will not be able to generate their MAX_ORDER sections. We
>   could reduce MAX_ORDER on some platforms but that would have other
>   implications like limiting the highest order allocation.

MAX_ORDER was a sensible choice on the three initial platforms. However,
it is not a fundamental value in the mechanism and is an easy assumption to
break. I've included a patch below based on your review that chooses a size
based on the value of HPAGE_SHIFT. It took 45 minutes to cobble together
so it's rough looking and I might have missed something but it has passed
stress tests on x86 without difficulty. Here is the dmesg output

[    0.000000] Built 1 zonelists, mobility grouping on at order 5. Total pages: 16224

Voila, grouping on order 5 instead of 10 (I used 5 instead of HPAGE_SHIFT
for testing purposes).

The order used can be any value >= 2 and < MAX_ORDER.
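
For a feel of what the grouping order means for the machine in the dmesg
line above, the following sketch (plain userspace C, function names
hypothetical) counts how many blocks a given number of pages splits into.
With 16224 pages, order 5 gives 507 blocks of 32 pages, while order 10 gives
16 blocks of 1024 pages, the last one only partly populated.

```c
#include <assert.h>

/* Number of pages in one mobility block of the given order. */
static unsigned long sketch_pages_per_block(unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));
	return 1UL << order;
}

/* Number of blocks needed to cover total_pages, rounding up. */
static unsigned long sketch_nr_blocks(unsigned long total_pages,
				      unsigned int order)
{
	unsigned long per_block = sketch_pages_per_block(order);

	return (total_pages + per_block - 1) / per_block;
}
```

A smaller order means more, finer-grained blocks to tag with a mobility
type, at the cost of a larger bitmap.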

> 2. Small huge page size support. F.e. IA64 can support down to page size
>   huge pages. The antifrag patches handle huge page in a special way.
>   They are categorized as movable. Small huge pages may
>   therefore contaminate the movable area.

They are only categorised as movable when a sysctl is set. This has to be
the deliberate choice of the administrator and its intention was to allow
hugepages to be alloced from ZONE_MOVABLE. This was to allow flexible sizing
of the hugepage pool when that zone is configured until such time as hugepages
were really movable in 100% of situations.

> 3. Defining the size of ZONE_MOVABLE. This was done to guarantee
>   availability of movable memory but the practical effect is to
>   guarantee that we panic when too many unreclaimable allocations have
>   been done.
>

The size of ZONE_MOVABLE is determined at boot time, and the zone is not
required for grouping pages by mobility to be effective. Presumably it is
sized by an administrator who has identified the problem that is fixed by
having this zone available, and who understands what it means for OOM
situations if the partition is made too small. The expectation is that the
administrator has a solid understanding of the workload before using this
option.

> I have already said during the review that IMHO the patches are not ready
> for merging. They are currently more like a prototype that explores ideas.
> The generalization steps are not done.
>
> How we could make progress:
>
> 1. Develop a useful categorization of allocations in the kernel whose
>   utility goes beyond the antifrag patches. I.e. length of
>   the objects existence and the method of reclaim could be useful in
>   various contexts.
>

The length of objects existence is something I am wary of because it
puts a big burden on the caller of the page allocator. The method of
reclaim is already implied by the existing categorisations. What may be
missing is clear documentation:

UNMOVABLE - You can't reclaim it

RECLAIMABLE - You need the help of another subsystem to reclaim objects
	within the page before the page is reclaimed or the allocation
	is short-lived. Even when reclaimable, there is no guarantee that
	reclaim will succeed.

MOVABLE - The page is directly reclaimable by kswapd or it may be
	migrated. Being able to reclaim is guaranteed except where mlock()
	is involved. mlock pages need to be migrated.

You've defined these better yourself in your review. Arguably, RECLAIMABLE
should be separate from TEMPORARY and page buffers should be away from
MOVABLE but this did not appear necessary when tested.

If this breakout is found to be required, it is trivial to implement.

> 2. Have statistics of these various allocations.
>
> 3. Page allocator should gather statistics on how memory was allocated in
>   the various categories.
>

Statistics gathering has been done before and it can be done again. They were
used earlier in the development of the patches and then I stopped bringing
them forward in the belief they would not be of general interest. In a large
part, they helped define the current mobility types.  Gathering statistics
again is not a fundamental problem.

> 4. The available data can then be used to drive more intelligent reclaim
>   and develop methods of antifrag or defragmentation.
>

Once that data is available, it would help show how successful fragmentation
avoidance is as it currently stands and how it can be improved. The lack of the
statistics today does not seem a blocking issue because there are no users
of fragmentation avoidance that blow up if it's not effective.

Patch for breaking the MAX_ORDER grouping is as follows. Again, it's 45 minutes
coding so maybe I missed something but it survived a quick stress test.

Not signed off due to incompleteness (e.g. should use a constant if the
hugepage size is known at compile time, nr_pages_pageblock should be
__read_mostly, not checked everywhere etc) and lack of full regression
testing and verification. If I hadn't bothered updating comments or printks,
the patch would be fairly small.
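
For readers who want the arithmetic without reading the diff, here is a
standalone sketch of the two calculations the patch parameterises: the size
of the per-zone flags bitmap and the bit index of a page's block. It mirrors
the shape of usemap_size() and the non-sparsemem branch of pfn_to_bitidx(),
but it is userspace C with hypothetical names, and the value of 3 bits per
block is an assumption.

```c
#include <assert.h>

#define SKETCH_NR_PAGEBLOCK_BITS 3UL	/* assumed bits per block */

/* Round x up to the next multiple of m (m must be non-zero). */
static unsigned long sketch_roundup(unsigned long x, unsigned long m)
{
	return ((x + m - 1) / m) * m;
}

/* Bytes of bitmap needed for a zone of zonesize pages at this block order. */
static unsigned long sketch_usemap_size(unsigned long zonesize,
					unsigned int order)
{
	unsigned long bits;

	assert(order < 8 * sizeof(unsigned long));
	/* Whole blocks in the zone, counting a partial last block. */
	bits = sketch_roundup(zonesize, 1UL << order) >> order;
	bits *= SKETCH_NR_PAGEBLOCK_BITS;
	/* Pad to a whole number of longs, then convert bits to bytes. */
	bits = sketch_roundup(bits, 8 * sizeof(unsigned long));
	return bits / 8;
}

/* Index of the first flag bit for the block containing pfn. */
static unsigned long sketch_pfn_to_bitidx(unsigned long pfn,
					  unsigned long zone_start_pfn,
					  unsigned int order)
{
	assert(pfn >= zone_start_pfn);
	return ((pfn - zone_start_pfn) >> order) * SKETCH_NR_PAGEBLOCK_BITS;
}
```

Under those assumptions a 16224-page zone grouped at order 5 needs 192 bytes
of bitmap, against 8 bytes at order 10, which gives a feel for the memory
cost of finer grouping.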

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc7-mm2-004_temporary/include/linux/pageblock-flags.h linux-2.6.21-rc7-mm2-005_group_arbitrary/include/linux/pageblock-flags.h
--- linux-2.6.21-rc7-mm2-004_temporary/include/linux/pageblock-flags.h	2007-04-27 22:04:34.000000000 +0100
+++ linux-2.6.21-rc7-mm2-005_group_arbitrary/include/linux/pageblock-flags.h	2007-05-01 16:02:51.000000000 +0100
@@ -1,6 +1,6 @@
 /*
  * Macros for manipulating and testing flags related to a
- * MAX_ORDER_NR_PAGES block of pages.
+ * large contiguous block of pages.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -35,6 +35,10 @@ enum pageblock_bits {
 	NR_PAGEBLOCK_BITS
 };

+/* Each nr_pages_pageblock block of pages has NR_PAGEBLOCK_BITS */
+extern unsigned long nr_pages_pageblock;
+extern int pageblock_order;
+
 /* Forward declaration */
 struct page;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.21-rc7-mm2-004_temporary/mm/page_alloc.c linux-2.6.21-rc7-mm2-005_group_arbitrary/mm/page_alloc.c
--- linux-2.6.21-rc7-mm2-004_temporary/mm/page_alloc.c	2007-04-27 22:04:34.000000000 +0100
+++ linux-2.6.21-rc7-mm2-005_group_arbitrary/mm/page_alloc.c	2007-05-01 19:54:18.000000000 +0100
@@ -58,6 +58,8 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+unsigned long nr_pages_pageblock;
+int pageblock_order;

 static void __free_pages_ok(struct page *page, unsigned int order);

@@ -721,7 +723,7 @@ static int fallbacks[MIGRATE_TYPES][MIGR

 /*
  * Move the free pages in a range to the free lists of the requested type.
- * Note that start_page and end_pages are not aligned in a MAX_ORDER_NR_PAGES
+ * Note that start_page and end_pages are not aligned in a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
 int move_freepages(struct zone *zone,
@@ -771,10 +773,10 @@ int move_freepages_block(struct zone *zo
 	struct page *start_page, *end_page;

 	start_pfn = page_to_pfn(page);
-	start_pfn = start_pfn & ~(MAX_ORDER_NR_PAGES-1);
+	start_pfn = start_pfn & ~(nr_pages_pageblock-1);
 	start_page = pfn_to_page(start_pfn);
-	end_page = start_page + MAX_ORDER_NR_PAGES - 1;
-	end_pfn = start_pfn + MAX_ORDER_NR_PAGES - 1;
+	end_page = start_page + nr_pages_pageblock - 1;
+	end_pfn = start_pfn + nr_pages_pageblock - 1;

 	/* Do not cross zone boundaries */
 	if (start_pfn < zone->zone_start_pfn)
@@ -838,14 +840,14 @@ static struct page *__rmqueue_fallback(s
 			 * back for a reclaimable kernel allocation, be more
 			 * agressive about taking ownership of free pages
 			 */
-			if (unlikely(current_order >= MAX_ORDER / 2) ||
+			if (unlikely(current_order >= pageblock_order / 2) ||
 					start_migratetype == MIGRATE_RECLAIMABLE) {
 				unsigned long pages;
 				pages = move_freepages_block(zone, page,
 								start_migratetype);

 				/* Claim the whole block if over half of it is free */
-				if ((pages << current_order) >= (1 << (MAX_ORDER-2)))
+				if ((pages << current_order) >= (1 << (pageblock_order-1)))
 					set_pageblock_migratetype(page,
 								start_migratetype);

@@ -858,7 +860,7 @@ static struct page *__rmqueue_fallback(s
 			__mod_zone_page_state(zone, NR_FREE_PAGES,
 							-(1UL << order));

-			if (current_order == MAX_ORDER - 1)
+			if (current_order == pageblock_order)
 				set_pageblock_migratetype(page,
 							start_migratetype);

@@ -2253,14 +2255,16 @@ void __meminit build_all_zonelists(void)
 	 * made on memory-hotadd so a system can start with mobility
 	 * disabled and enable it later
 	 */
-	if (vm_total_pages < (MAX_ORDER_NR_PAGES * MIGRATE_TYPES))
+	if (vm_total_pages < (nr_pages_pageblock * MIGRATE_TYPES))
 		page_group_by_mobility_disabled = 1;
 	else
 		page_group_by_mobility_disabled = 0;

-	printk("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
+	printk("Built %i zonelists, mobility grouping %s at order %d. "
+		"Total pages: %ld\n",
 			num_online_nodes(),
 			page_group_by_mobility_disabled ? "off" : "on",
+			pageblock_order,
 			vm_total_pages);
 }

@@ -2333,7 +2337,7 @@ static inline unsigned long wait_table_b
 #define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))

 /*
- * Mark a number of MAX_ORDER_NR_PAGES blocks as MIGRATE_RESERVE. The number
+ * Mark a number of pageblocks as MIGRATE_RESERVE. The number
  * of blocks reserved is based on zone->pages_min. The memory within the
  * reserve will tend to store contiguous free pages. Setting min_free_kbytes
  * higher will lead to a bigger reserve which will get freed as contiguous
@@ -2348,9 +2352,10 @@ static void setup_zone_migrate_reserve(s
 	/* Get the start pfn, end pfn and the number of blocks to reserve */
 	start_pfn = zone->zone_start_pfn;
 	end_pfn = start_pfn + zone->spanned_pages;
-	reserve = roundup(zone->pages_min, MAX_ORDER_NR_PAGES) >> (MAX_ORDER-1);
+	reserve = roundup(zone->pages_min, nr_pages_pageblock) >>
+							pageblock_order;

-	for (pfn = start_pfn; pfn < end_pfn; pfn += MAX_ORDER_NR_PAGES) {
+	for (pfn = start_pfn; pfn < end_pfn; pfn += nr_pages_pageblock) {
 		if (!pfn_valid(pfn))
 			continue;
 		page = pfn_to_page(pfn);
@@ -2425,7 +2430,7 @@ void __meminit memmap_init_zone(unsigned
 		 * the start are marked MIGRATE_RESERVE by
 		 * setup_zone_migrate_reserve()
 		 */
-		if ((pfn & (MAX_ORDER_NR_PAGES-1)))
+		if ((pfn & (nr_pages_pageblock-1)))
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);

 		INIT_LIST_HEAD(&page->lru);
@@ -3129,8 +3134,8 @@ static void __meminit calculate_node_tot
 #ifndef CONFIG_SPARSEMEM
 /*
  * Calculate the size of the zone->blockflags rounded to an unsigned long
- * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
- * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per MAX_ORDER-1, finally
+ * Start by making sure zonesize is a multiple of pageblock_order by rounding up
+ * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per pageblock, finally
  * round what is now in bits to nearest long in bits, then return it in
  * bytes.
  */
@@ -3138,8 +3143,8 @@ static unsigned long __init usemap_size(
 {
 	unsigned long usemapsize;

-	usemapsize = roundup(zonesize, MAX_ORDER_NR_PAGES);
-	usemapsize = usemapsize >> (MAX_ORDER-1);
+	usemapsize = roundup(zonesize, nr_pages_pageblock);
+	usemapsize = usemapsize >> pageblock_order;
 	usemapsize *= NR_PAGEBLOCK_BITS;
 	usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));

@@ -3161,6 +3166,26 @@ static void inline setup_usemap(struct p
 				struct zone *zone, unsigned long zonesize) {}
 #endif /* CONFIG_SPARSEMEM */

+/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
+void __init initonce_nr_pages_pageblock(void)
+{
+	/* There will never be a 1:1 mapping, it makes no sense */
+	if (nr_pages_pageblock)
+		return;
+
+#ifdef CONFIG_HUGETLB_PAGE
+	/*
+	 * Assume the largest contiguous order of interest is a huge page.
+	 * This value may be variable depending on boot parameters on IA64
+	 */
+	pageblock_order = HUGETLB_PAGE_ORDER;
+#else
+	/* If huge pages are not in use, group based on MAX_ORDER */
+	pageblock_order = MAX_ORDER-1;
+#endif
+	nr_pages_pageblock = 1 << pageblock_order;
+}
+
 /*
  * Set up the zone data structures:
  *   - mark all pages reserved
@@ -3241,6 +3266,7 @@ static void __meminit free_area_init_cor
 		if (!size)
 			continue;

+		initonce_nr_pages_pageblock();
 		setup_usemap(pgdat, zone, size);
 		ret = init_currently_empty_zone(zone, zone_start_pfn,
 						size, MEMMAP_EARLY);
@@ -4132,15 +4158,15 @@ static inline int pfn_to_bitidx(struct z
 {
 #ifdef CONFIG_SPARSEMEM
 	pfn &= (PAGES_PER_SECTION-1);
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #else
 	pfn = pfn - zone->zone_start_pfn;
-	return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+	return (pfn >> pageblock_order) * NR_PAGEBLOCK_BITS;
 #endif /* CONFIG_SPARSEMEM */
 }

 /**
- * get_pageblock_flags_group - Return the requested group of flags for the MAX_ORDER_NR_PAGES block of pages
+ * get_pageblock_flags_group - Return the requested group of flags for the nr_pages_pageblock block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest to retrieve
  * @end_bitidx: The last bit of interest

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
  2007-05-01 13:02   ` 2.6.22 -mm merge plans -- lumpy reclaim Andy Whitcroft
  2007-05-01 14:54   ` fragmentation avoidance Re: 2.6.22 -mm merge plans Christoph Lameter
@ 2007-05-01 18:57   ` Andrew Morton
  2007-05-07 13:07   ` Yasunori Goto
  3 siblings, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 18:57 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-kernel, linux-mm, apw, clameter, y-goto

On Tue, 1 May 2007 11:16:51 +0100 mel@skynet.ie (Mel Gorman) wrote:

> 

OK, I did all the reorganisation which you recommended.

> Ok. It is getting reviewed by Christoph and I'm going through the TODO items
> it yielded. Andy has also been regularly reviewing them which is probably
> why they have had less public errors than you might expect from something
> like this.

Great.  I'm a bit behind on my linux-mm reading.

> Christoph may like to comment more here.

That would be helpful.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: fragmentation avoidance Re: 2.6.22 -mm merge plans
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
                     ` (2 preceding siblings ...)
  2007-05-01 18:57   ` Andrew Morton
@ 2007-05-07 13:07   ` Yasunori Goto
  3 siblings, 0 replies; 233+ messages in thread
From: Yasunori Goto @ 2007-05-07 13:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-kernel, linux-mm, apw, clameter


Sorry for the late response. I was on vacation last week.
And I'm under a mountain of unread mail now....

> > Mel's moveable-zone work.
> 
> These patches are what creates ZONE_MOVABLE. The last 6 patches should be
> collapsed into a single patch:
> 
> 	handle-kernelcore=-generic
> 
> I believe Yasunori Goto is looking at these from the perspective of memory
> hot-remove and has caught a few bugs in the past. Goto-san may be able to
> comment on whether they have been reviewed recently.

Hmm, I don't think my review is enough.
To be precise, I'm just one user/tester of ZONE_MOVABLE.
I have been writing memory-remove patches on top of Mel-san's
ZONE_MOVABLE patch, and the bugs are things that I found during that work.
(I'll post these patches in a few days.)

> The main complexity is in one function in patch one which determines where
> the PFN is in each node for ZONE_MOVABLE. Getting that right so that the
> requested amount of kernel memory spread as evenly as possible is just
> not straight-forward.

From the memory-hotplug point of view, ZONE_MOVABLE should be aligned
to the section size. MAX_ORDER alignment is enough for other users, though...
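
To illustrate the alignment point, here is a minimal user-space sketch
in C. It is not taken from the ZONE_MOVABLE patches: every name
prefixed sketch_ or SKETCH_ is hypothetical, and the two page counts
are placeholder values, since the real section size and MAX_ORDER
block size depend on the architecture and configuration.

```c
#include <assert.h>

/* Placeholder values for illustration only (assumptions, not kernel values). */
#define SKETCH_MAX_ORDER_NR_PAGES (1UL << 10) /* pages per MAX_ORDER block */
#define SKETCH_PAGES_PER_SECTION  (1UL << 16) /* pages per memory section */

/* Round pfn up to the next multiple of align; align must be a power of two. */
unsigned long sketch_round_up_pfn(unsigned long pfn, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (pfn + align - 1) & ~(align - 1);
}

/*
 * Pick the first PFN of the movable zone. With memory hot-remove in mind
 * the boundary is rounded up to a whole section; otherwise rounding to a
 * MAX_ORDER block is sufficient.
 */
unsigned long sketch_movable_start(unsigned long pfn, int want_hotremove)
{
	unsigned long align = want_hotremove ? SKETCH_PAGES_PER_SECTION
					     : SKETCH_MAX_ORDER_NR_PAGES;

	return sketch_round_up_pfn(pfn, align);
}
```

Because a section is a multiple of a MAX_ORDER block in this sketch, a
section-aligned boundary also satisfies the weaker requirement.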

Bye.

-- 
Yasunori Goto 




* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (14 preceding siblings ...)
  2007-05-01 10:16 ` fragmentation avoidance " Mel Gorman
@ 2007-05-01 12:17 ` Andi Kleen
  2007-05-01 22:08   ` Mathieu Desnoyers
  2007-05-02  0:31   ` Rusty Russell
  2007-05-01 13:06 ` file capabilities and security_task_wait failure " Stephen Smalley
                   ` (6 subsequent siblings)
  22 siblings, 2 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-01 12:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, rusty, mathieu.desnoyers, wfg

Andrew Morton <akpm@linux-foundation.org> writes:

> Static markers.  Will merge.
There don't seem to be any users of this. How do you know it hasn't
already bitrotted?

It seems quite overcomplicated to me. Has the complexity been justified?

> 
> Will merge the rustyvisor.

IMHO the user code still doesn't belong in Documentation.
Also it needs another review round I guess. And some beta testing by
more people.

> Hopefully Wu will be coming up with a much simpler best-of-readahead patch
> soon.  I don't think we can get these patches over the hump and they are
> somewhat costly to maintain.

Didn't he have one already? There was a relatively simple readahead
patch recently, although it was unclear what dependencies it needed.
IMHO this work has much potential so I hope the benchmarking-review process
can be done quickly.

-Andi


* Re: 2.6.22 -mm merge plans
  2007-05-01 12:17 ` Andi Kleen
@ 2007-05-01 22:08   ` Mathieu Desnoyers
  2007-05-02 10:44     ` Andi Kleen
  2007-05-02  0:31   ` Rusty Russell
  1 sibling, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-01 22:08 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, rusty, wfg

Hi Andi,

* Andi Kleen (andi@firstfloor.org) wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> 
> > Static markers.  Will merge.
> There don't seem to be any users of this. How do you know it hasn't
> already bitrotted?
> 

See the detailed explanation at :
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc7/2.6.21-rc7-mm2/broken-out/linux-kernel-markers-kconfig-menus.patch

Major points :

It is currently used as an instrumentation infrastructure for the LTTng
tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
WindRiver products.  The SystemTAP project also plans to use this type of
infrastructure to trace sites that are hard to instrument. The Linux Kernel
Markers have the support of Frank Ch. Eigler, author of their current
marker alternative (which he wishes to drop in order to adopt the
markers infrastructure as soon as it hits mainline).

Quoting Jim Keniston <jkenisto@us.ibm.com> :

"kprobes remains a vital foundation for SystemTap.  But markers are
attractive as an alternate source of trace/debug info.  Here's why:
[...]"

> It seems quite overcomplicated to me. Has the complexity been justified?
> 

To summarize the document at the URL above, where the full explanation
is given, here are the key goals of the markers, along with the rationale
behind the most important design choices :
- Almost imperceptible impact on production machines when compiled in
  but markers are "disabled".
  - Use a separate section to keep the data to minimize d-cache
    thrashing.
  - Put the code (stack setup and function call) in unlikely branches of the
    if() condition to minimize i-cache impact.
  - Since it is required to allow instrumentation of variables within
    the body of a function, accept the impact on compiler's
    optimizations and let it keep the variables "live" sometimes longer
    than required. It is up to the person who puts the marker in the
    code to choose the location that will have a small impact in this
    aspect.
  - Allow per-architecture optimized versions which remove the need for
    a d-cache based branch (patch a "load immediate" instruction
    instead). This minimizes the d-cache impact of the disabled markers.
  - Accept the cost of an unlikely branch at the marker site because the
    gcc compiler does not give the ability to put "nops" instead of a
    branch generated from C code. Keep this in mind for future
    per-architecture optimizations.
- Instrumentation of challenging kernel sites
  - Instrumentation such as the one provided in the already existing
    Lock dependency checker (lockdep) and instrumentation of trap
    handlers requires the markers to be reentrant in such contexts.
    Therefore, the implementation must be lock-free and update the state
    in an atomic fashion (rcu-style). It must also give the programmer who
    describes a marker site the ability to specify what is forbidden in the probe
    that will be connected to the marker : can it generate a trap ? Can
    it call lockdep (irq disable, take any type of lock), can it call
    printk ? This is why flags can be passed to the _MARK() marker,
    while the MARK() marker has the default flags.
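
The two central ideas in the list above (a disabled marker costs one
unlikely branch, and the probe is published with a single atomic store
rather than under a lock) can be sketched in user-space C as follows.
This is only an illustration under stated assumptions, not the code
from the markers patches: every sketch_/SKETCH_ name is hypothetical,
the macro signature differs from the real MARK(), the GCC builtins
__builtin_expect and __atomic_* stand in for the kernel's own
primitives, and the separate data section, the "load immediate"
patching and the per-marker flags are all left out.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*sketch_probe_fn)(const char *fmt, va_list ap);

struct sketch_marker {
	const char *name;
	const char *format;
	int enabled;           /* tested at the marker site */
	sketch_probe_fn probe; /* published with an atomic store */
};

/* Define the per-marker data at file scope (hypothetical helper). */
#define SKETCH_DEFINE_MARKER(mname, fmt) \
	struct sketch_marker sketch_marker_##mname = { #mname, fmt, 0, NULL }

/* Out-of-line slow path: argument setup and the indirect call live here. */
void sketch_marker_call(struct sketch_marker *m, ...)
{
	sketch_probe_fn fn = __atomic_load_n(&m->probe, __ATOMIC_ACQUIRE);
	va_list ap;

	if (!fn) /* probe was removed after the enabled test */
		return;
	va_start(ap, m);
	fn(m->format, ap);
	va_end(ap);
}

/* Marker site: one load and one branch hinted as unlikely when disabled. */
#define SKETCH_MARK(mname, ...)                                            \
	do {                                                               \
		if (__builtin_expect(__atomic_load_n(                      \
			&sketch_marker_##mname.enabled, __ATOMIC_RELAXED), 0)) \
			sketch_marker_call(&sketch_marker_##mname, __VA_ARGS__); \
	} while (0)

/* Connect (fn != NULL) or disconnect (fn == NULL) a probe without locks. */
void sketch_marker_set_probe(struct sketch_marker *m, sketch_probe_fn fn)
{
	assert(m != NULL);
	__atomic_store_n(&m->probe, fn, __ATOMIC_RELEASE);
	__atomic_store_n(&m->enabled, fn != NULL, __ATOMIC_RELEASE);
}

/* Example probe and instrumented function. */
char sketch_last[64];

void sketch_record_probe(const char *fmt, va_list ap)
{
	vsnprintf(sketch_last, sizeof(sketch_last), fmt, ap);
}

SKETCH_DEFINE_MARKER(demo_event, "%d %d");

int sketch_instrumented_add(int a, int b)
{
	SKETCH_MARK(demo_event, a, b);
	return a + b;
}
```

Calling sketch_marker_set_probe(&sketch_marker_demo_event,
sketch_record_probe) arms the marker; until then the instrumented
function only pays for the load and the untaken branch. A real
implementation must also wait for in-flight probe calls to finish
before a probe's code can be freed, which this sketch does not attempt.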

Please tell me if I forgot to explain the rationale behind some
implementation detail and I will be happy to explain in more depth.

Regards,

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: 2.6.22 -mm merge plans
  2007-05-01 22:08   ` Mathieu Desnoyers
@ 2007-05-02 10:44     ` Andi Kleen
  2007-05-02 16:37       ` Frank Ch. Eigler
                         ` (2 more replies)
  0 siblings, 3 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-02 10:44 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Andi Kleen, Andrew Morton, linux-kernel, rusty, wfg

> It is currently used as an instrumentation infrastructure for the LTTng
> tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> WindRiver products.  The SystemTAP project also plan to use this type of
> infrastructure to trace sites hard to instrument. The Linux Kernel
> Markers has the support of Frank C. Eigler, author of their current
> marker alternative (which he wishes to drop in order to adopt the
> markers infrastructure as soon as it hits mainline).

All of the above don't use mainline kernels.
That doesn't constitute using it.

> Quoting Jim Keniston <jkenisto@us.ibm.com> :
> 
> "kprobes remains a vital foundation for SystemTap.  But markers are
> attractive as an alternate source of trace/debug info.  Here's why:
> [...]"

Talk is cheap. Do they have working code to use it?

>   - Allow per-architecture optimized versions which removes the need for
>     a d-cache based branch (patch a "load immediate" instruction
>     instead). It minimized the d-cache impact of the disabled markers.

That's a good idea in general, but should be generalized (available
independently), not hidden in your subsystem. I know a couple of places
that could use this successfully.

>   - Accept the cost of an unlikely branch at the marker site because the
>     gcc compiler does not give the ability to put "nops" instead of a
>     branch generated from C code. Keep this in mind for future
>     per-architecture optimizations.

See the upcoming paravirt code for a way to do this.

> - Instrumentation of challenging kernel sites
>   - Instrumentation such as the one provided in the already existing
>     Lock dependency checker (lockdep) and instrumentation of trap
>     handlers implies being reentrant for such context. Therefore, the
>     implementation must be lock-free and update the state in an atomic
>     fashion (rcu-style). It must also let the programmer who describes
>     a marker site the ability to specify what is forbidden in the probe
>     that will be connected to the marker : can it generate a trap ? Can
>     it call lockdep (irq disable, take any type of lock), can it call
>     printk ? This is why flags can be passed to the _MARK() marker,
>     while the MARK() marker has the default flags.

Why can't you just generally forbid probes from doing all of this?
It would greatly simplify your code, wouldn't it?

Keep it simple please.

> Please tell me if I forgot to explain the rationale behind some
> implementation detail and I will be happy to explain in more depth.

Having lots of flags to optionally do things differently normally
sets off all the warning lights of early over-design. While Linux
sometimes has this, it is generally only in mature old subsystems.
But when something is freshly merged it shouldn't be like this.
That is because code tends to grow more complicated over its lifetime,
and when it is already complicated at the beginning it will eventually
fall over (you can study the current slab as a poster child of this).

-Andi



* Re: 2.6.22 -mm merge plans
  2007-05-02 10:44     ` Andi Kleen
@ 2007-05-02 16:37       ` Frank Ch. Eigler
  2007-05-02 16:47       ` Andrew Morton
  2007-05-02 17:19       ` Mathieu Desnoyers
  2 siblings, 0 replies; 233+ messages in thread
From: Frank Ch. Eigler @ 2007-05-02 16:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mathieu Desnoyers, Andrew Morton, linux-kernel, rusty, wfg


Andi Kleen <andi@firstfloor.org> writes:

> [...] The SystemTAP project also plan to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative [...]
> 
> All of the above don't use mainline kernels.
> That doesn't constitute using it.

Systemtap does run on mainline kernels.

> > "kprobes remains a vital foundation for SystemTap.  But markers are
> > attractive as an alternate source of trace/debug info.  Here's why:
> > [...]"
> 
> Talk is cheap. Do they have working code to use it? [...]

We had been waiting on the chicken & egg semaphore.  LTTNG has working
code yesterday (months ago); systemtap will have it "tomorrow" (a week
or few).


- FChE


* Re: 2.6.22 -mm merge plans
  2007-05-02 10:44     ` Andi Kleen
  2007-05-02 16:37       ` Frank Ch. Eigler
@ 2007-05-02 16:47       ` Andrew Morton
  2007-05-02 17:29         ` Christoph Hellwig
  2007-05-02 17:49         ` Andi Kleen
  2007-05-02 17:19       ` Mathieu Desnoyers
  2 siblings, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 16:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mathieu Desnoyers, linux-kernel, rusty, wfg

On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products.  The SystemTAP project also plan to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
> 
> All of the above don't use mainline kernels.

That's because they have to add a markers patch!

> That doesn't constitute using it.

Andi, there was a huge amount of discussion about all this in September last
year (subjects: *markers* and *LTTng*). The outcome of all that was, I
believe, that the kernel should have a static marker infrastructure.


* Re: 2.6.22 -mm merge plans
  2007-05-02 16:47       ` Andrew Morton
@ 2007-05-02 17:29         ` Christoph Hellwig
  2007-05-02 20:36           ` Mathieu Desnoyers
  2007-05-02 17:49         ` Andi Kleen
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-02 17:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Mathieu Desnoyers, linux-kernel, rusty, wfg

On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > That doesn't constitute using it.
> 
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

Only when it's actually useable.  A prerequisite for merging it is
having an actual trace transport infrastructure as well as a few actually
useful tracing modules in the kernel tree.

Let this count as a vote to merge the markers once we have the infrastructure
above ready, it'll be very useful then.


* Re: 2.6.22 -mm merge plans
  2007-05-02 17:29         ` Christoph Hellwig
@ 2007-05-02 20:36           ` Mathieu Desnoyers
  2007-05-02 20:53             ` Andrew Morton
  2007-05-03  8:08             ` Christoph Hellwig
  0 siblings, 2 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-02 20:36 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

* Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > That doesn't constitute using it.
> > 
> > Andi, there was a huge amount of discussion about all this in September last
> > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > believe, that the kernel should have a static marker infrastructure.
> 
> Only when it's actually useable.  A prerequisite for merging it is
> having an actual trace transport infrastructure as well as a few actually
> useful tracing modules in the kernel tree.
> 
> Let this count as a vote to merge the markers once we have the infrastructure
> above ready, it'll be very useful then.

Hi Christoph,

The idea is the following : either we integrate the infrastructure for
instrumentation / data serialization / buffer management / extraction of
data to user space in multiple different steps, which makes code review
easier for you guys, or we bring the main pieces of the LTTng project
all together with the Linux Kernel Markers, which would result in a bigger
change.

Based on the premise that discussing logically distinct pieces of
infrastructure is easier and can be done more thoroughly when done
separately, we decided to submit the markers first, with the other
pieces planned for the near future.

I agree that it would be very useful to have the full tracing stack
available in the Linux kernel, but we inevitably face the argument :
"this change is too big" if we submit all LTTng modules at once or
the argument : "we want the whole tracing stack, not just part of it"
if we don't.

This is why we chose to push the tracing infrastructure chunk by chunk :
to make code review and criticism more efficient.

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: 2.6.22 -mm merge plans
  2007-05-02 20:36           ` Mathieu Desnoyers
@ 2007-05-02 20:53             ` Andrew Morton
  2007-05-02 23:11               ` Mathieu Desnoyers
  2007-05-03  8:09               ` Christoph Hellwig
  2007-05-03  8:08             ` Christoph Hellwig
  1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 20:53 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

On Wed, 2 May 2007 16:36:27 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> * Christoph Hellwig (hch@infradead.org) wrote:
> > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > That doesn't constitute using it.
> > > 
> > > Andi, there was a huge amount of discussion about all this in September last
> > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > believe, that the kernel should have a static marker infrastructure.
> > 
> > Only when it's actually useable.  A prerequisite for merging it is
> > having an actual trace transport infrastructure as well as a few actually
> > useful tracing modules in the kernel tree.
> > 
> > Let this count as a vote to merge the markers once we have the infrastructure
> > above ready, it'll be very useful then.
> 
> Hi Christoph,
> 
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys, or we bring the main pieces of the LTTng project
> altogether with the Linux Kernel Markers, which would result in a bigger
> change.
> 
> Based on the premise that discussing about logically distinct pieces of
> infrastructure is easier and can be done more thoroughly when done
> separately, we decided to submit the markers first, with the other
> pieces planned in a near future.
> 
> I agree that it would be very useful to have the full tracing stack
> available in the Linux kernel, but we inevitably face the argument :
> "this change is too big" if we submit all LTTng modules at once or
> the argument : "we want the whole tracing stack, not just part of it"
> if we don't.
> 
> This is why we chose to push the tracing infrastructure chunk by chunk :
> to make code review and criticism more efficient.
> 

I didn't know that this was the plan.

The problem I have with this is that once we've merged one part, we're
committed to merging the other parts even though we haven't seen them yet.

What happens if there's a revolt over the next set of patches?  Do we
remove the core markers patches again?  We end up in a can't-go-forward,
can't-go-backward situation.

I thought the existing code was useful as-is for several projects, without
requiring additional patching to core kernel.  If such additional patching
_is_ needed to make the markers code useful then I agree that we should
continue to buffer the markers code in -mm until the
use-markers-for-something patches have been eyeballed.

In which case we have:

atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
atomich-complete-atomic_long-operations-in-asm-generic.patch
atomich-i386-type-safety-fix.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
local_t-architecture-independant-extension.patch
local_t-alpha-extension.patch
local_t-i386-extension.patch
local_t-ia64-extension.patch
local_t-mips-extension.patch
local_t-parisc-cleanup.patch
local_t-powerpc-extension.patch
local_t-sparc64-cleanup.patch
local_t-x86_64-extension.patch

  For 2.6.22

linux-kernel-markers-kconfig-menus.patch
linux-kernel-markers-architecture-independant-code.patch
linux-kernel-markers-powerpc-optimization.patch
linux-kernel-markers-i386-optimization.patch
markers-add-instrumentation-markers-menus-to-avr32.patch
linux-kernel-markers-non-optimized-architectures.patch
markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
linux-kernel-markers-documentation.patch
#
markers-define-the-linker-macro-extra_rwdata.patch
markers-use-extra_rwdata-in-architectures.patch
#
some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
no-longer-include-asm-kdebugh.patch

  Hold.



* Re: 2.6.22 -mm merge plans
  2007-05-02 20:53             ` Andrew Morton
@ 2007-05-02 23:11               ` Mathieu Desnoyers
  2007-05-02 23:21                 ` Andrew Morton
                                   ` (2 more replies)
  2007-05-03  8:09               ` Christoph Hellwig
  1 sibling, 3 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-02 23:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Wed, 2 May 2007 16:36:27 -0400
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> > * Christoph Hellwig (hch@infradead.org) wrote:
> > > On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> > > > > That doesn't constitute using it.
> > > > 
> > > > Andi, there was a huge amount of discussion about all this in September last
> > > > year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> > > > believe, that the kernel should have a static marker infrastructure.
> > > 
> > > Only when it's actually useable.  A prerequisite for merging it is
> > > having an actual trace transport infrastructure as well as a few actually
> > > useful tracing modules in the kernel tree.
> > > 
> > > Let this count as a vote to merge the markers once we have the infrastructure
> > > above ready, it'll be very useful then.
> > 
> > Hi Christoph,
> > 
> > The idea is the following : either we integrate the infrastructure for
> > instrumentation / data serialization / buffer management / extraction of
> > data to user space in multiple different steps, which makes code review
> > easier for you guys, or we bring the main pieces of the LTTng project
> > altogether with the Linux Kernel Markers, which would result in a bigger
> > change.
> > 
> > Based on the premise that discussing about logically distinct pieces of
> > infrastructure is easier and can be done more thoroughly when done
> > separately, we decided to submit the markers first, with the other
> > pieces planned in a near future.
> > 
> > I agree that it would be very useful to have the full tracing stack
> > available in the Linux kernel, but we inevitably face the argument :
> > "this change is too big" if we submit all LTTng modules at once or
> > the argument : "we want the whole tracing stack, not just part of it"
> > if we don't.
> > 
> > This is why we chose to push the tracing infrastructure chunk by chunk :
> > to make code review and criticism more efficient.
> > 
> 
> I didn't know that this was the plan.
> 
> The problem I have with this is that once we've merged one part, we're
> committed to merging the other parts even though we haven't seen them yet.
> 
> What happens if there's a revolt over the next set of patches?  Do we
> remove the core markers patches again?  We end up in a cant-go-forward,
> cant-go-backward situation.
> 
> I thought the existing code was useful as-is for several projects, without
> requiring additional patching to core kernel.  If such additional patching
> _is_ needed to make the markers code useful then I agree that we should
> continue to buffer the markers code in -mm until the
> use-markers-for-something patches have been eyeballed.
> 

My statement was probably not clear enough. The actual marker code is
useful as-is without any further kernel patching required : SystemTAP is
an example where they use external modules to load probes that can
connect either to markers or through kprobes. LTTng, in its current state,
has a mostly modular core that also uses the markers.

Although some, like Christoph and myself, think that it would benefit
the kernel community to have a common infrastructure for more than just
markers (meaning a common serialization and buffering mechanism), it does
not change the fact that the markers, once in mainline, are usable by
projects through additional kernel modules.

If we are looking at current "potential users" that are already in
mainline, we could change blktrace to make it use the markers.

Mathieu


> In which case we have:
> 
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
> 
>   For 2.6.22
> 
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch
> 
>   Hold.
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68


* Re: 2.6.22 -mm merge plans
  2007-05-02 23:11               ` Mathieu Desnoyers
@ 2007-05-02 23:21                 ` Andrew Morton
  2007-05-03 15:04                   ` Mathieu Desnoyers
  2007-05-03  8:06                 ` Christoph Hellwig
  2007-05-03 10:31                 ` Andi Kleen
  2 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 23:21 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

On Wed, 2 May 2007 19:11:04 -0400
Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:

> > I didn't know that this was the plan.
> > 
> > The problem I have with this is that once we've merged one part, we're
> > committed to merging the other parts even though we haven't seen them yet.
> > 
> > What happens if there's a revolt over the next set of patches?  Do we
> > remove the core markers patches again?  We end up in a cant-go-forward,
> > cant-go-backward situation.
> > 
> > I thought the existing code was useful as-is for several projects, without
> > requiring additional patching to core kernel.  If such additional patching
> > _is_ needed to make the markers code useful then I agree that we should
> > continue to buffer the markers code in -mm until the
> > use-markers-for-something patches have been eyeballed.
> > 
> 
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

OK, that's what I thought.

> Although some, like Christoph and myself, think that it would benefit to
> the kernel community to have a common infrastructure for more than just
> markers (meaning common serialization and buffering mechanism), it does
> not change the fact that the markers, being in mainline, are usable by
> projects through additional kernel modules.
> 
> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

That'd be a useful demonstration.


* Re: 2.6.22 -mm merge plans
  2007-05-02 23:21                 ` Andrew Morton
@ 2007-05-03 15:04                   ` Mathieu Desnoyers
  2007-05-03 15:12                     ` Christoph Hellwig
  0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 15:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Hellwig, Andi Kleen, linux-kernel, rusty, wfg

* Andrew Morton (akpm@linux-foundation.org) wrote:
> > Although some, like Christoph and myself, think that it would benefit to
> > the kernel community to have a common infrastructure for more than just
> > markers (meaning common serialization and buffering mechanism), it does
> > not change the fact that the markers, being in mainline, are usable by
> > projects through additional kernel modules.
> > 
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
> 
> That'd be a useful demonstration.

Here is a proof of concept patch, for demonstration purposes, of moving
blktrace to the markers.

A few remarks : this patch has the positive effect of removing some code
from the block io tracing hot paths, minimizing the i-cache impact in a
system where the io tracing is compiled in but inactive.

It also moves the blk tracing code from a header (and therefore from the
body of the instrumented functions) to a separate C file.

In that arrangement, as soon as one device has to be traced, every device
has to fall into the tracing function call. This is slower than the previous
inline function, which tested the condition quickly. If it becomes a
show stopper, it could be fixed by making it possible to test a
supplementary condition, dependent on the marker context, at the marker
site, just after the enable/disable test.
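
As a rough illustration of that last idea, here is a user-space sketch
of a marker site that checks a per-device condition right after the
global enable test. Nothing in it exists in the patch: the
SKETCH_MARK_COND macro, the sketch_queue structure and its "traced"
field are hypothetical stand-ins, and __builtin_expect is used where
the kernel would use unlikely().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a request queue with a per-device trace flag. */
struct sketch_queue {
	int traced;
};

int sketch_tracing_enabled; /* set as soon as any device is traced */
int sketch_probe_calls;     /* counts how often the slow path runs */

/* Out-of-line tracing function, reached only for traced devices. */
void sketch_trace_probe(struct sketch_queue *q)
{
	assert(q != NULL);
	sketch_probe_calls++;
}

/*
 * Marker site with a supplementary condition: cond is evaluated only
 * when the global enable test has already passed, so the disabled
 * case still costs a single unlikely branch.
 */
#define SKETCH_MARK_COND(cond, q)                                    \
	do {                                                         \
		if (__builtin_expect(sketch_tracing_enabled, 0) && (cond)) \
			sketch_trace_probe(q);                       \
	} while (0)

void sketch_submit(struct sketch_queue *q)
{
	SKETCH_MARK_COND(q->traced, q);
}
```

With tracing enabled for one device, untraced devices then skip the
function call again, at the price of one extra inline test per site.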

It does not make the code smaller, since I left all the specialized
tracing functions for requests, bio, generic, remap, which would go away
once a generic infrastructure is in place to serialize the information
passed to the marker. This is mostly why I consider it a proof of
concept.

Patch named "markers-port-blktrace-to-markers.patch", can be placed
after the marker patches in the 2.6.21-rc7-mm2 series.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

Index: linux-2.6-lttng/block/elevator.c
===================================================================
--- linux-2.6-lttng.orig/block/elevator.c	2007-05-02 20:33:22.000000000 -0400
+++ linux-2.6-lttng/block/elevator.c	2007-05-02 20:33:49.000000000 -0400
@@ -32,7 +32,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/delay.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/hash.h>
 
 #include <asm/uaccess.h>
@@ -571,7 +571,7 @@
 	unsigned ordseq;
 	int unplug_it = 1;
 
-	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+	MARK(blk_request_insert, "%p %p", q, rq);
 
 	rq->q = q;
 
@@ -757,7 +757,7 @@
 			 * not be passed by new incoming requests
 			 */
 			rq->cmd_flags |= REQ_STARTED;
-			blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+			MARK(blk_request_issue, "%p %p", q, rq);
 		}
 
 		if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===================================================================
--- linux-2.6-lttng.orig/block/ll_rw_blk.c	2007-05-02 20:33:32.000000000 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c	2007-05-02 23:21:02.000000000 -0400
@@ -28,6 +28,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
+#include <linux/marker.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 
@@ -1551,7 +1552,7 @@
 
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+		MARK(blk_plug_device, "%p %p %d", q, NULL, 0);
 	}
 }
 
@@ -1617,7 +1618,7 @@
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
 	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+		MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 					q->rq.count[READ] + q->rq.count[WRITE]);
 
 		q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+	MARK(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 	q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
 	request_queue_t *q = (request_queue_t *)data;
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+	MARK(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 	kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@
 	
 	rq_init(q, rq);
 
-	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+	MARK(blk_get_request, "%p %p %d", q, bio, rw);
 out:
 	return rq;
 }
@@ -2178,7 +2179,7 @@
 		if (!rq) {
 			struct io_context *ioc;
 
-			blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+			MARK(blk_sleep_request, "%p %p %d", q, bio, rw);
 
 			__generic_unplug_device(q);
 			spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
-	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+	MARK(blk_requeue, "%p %p", q, rq);
 
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
 			if (!ll_back_merge_fn(q, req, bio))
 				break;
 
-			blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+			MARK(blk_bio_backmerge, "%p %p", q, bio);
 
 			req->biotail->bi_next = bio;
 			req->biotail = bio;
@@ -2954,7 +2955,7 @@
 			if (!ll_front_merge_fn(q, req, bio))
 				break;
 
-			blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+			MARK(blk_bio_frontmerge, "%p %p", q, bio);
 
 			bio->bi_next = req->bio;
 			req->bio = bio;
@@ -3184,10 +3185,10 @@
 		blk_partition_remap(bio);
 
 		if (old_sector != -1)
-			blk_add_trace_remap(q, bio, old_dev, bio->bi_sector, 
-					    old_sector);
+			MARK(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev,
+					(u64)bio->bi_sector, (u64)old_sector);
 
-		blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+		MARK(blk_bio_queue, "%p %p", q, bio);
 
 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;
@@ -3329,7 +3330,7 @@
 	int total_bytes, bio_nbytes, error, next_idx = 0;
 	struct bio *bio;
 
-	blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+	MARK(blk_request_complete, "%p %p", req->q, req);
 
 	/*
 	 * extend uptodate bool to allow < 0 value to be direct io error
Index: linux-2.6-lttng/block/Kconfig
===================================================================
--- linux-2.6-lttng.orig/block/Kconfig	2007-05-02 20:34:30.000000000 -0400
+++ linux-2.6-lttng/block/Kconfig	2007-05-02 20:34:53.000000000 -0400
@@ -32,6 +32,7 @@
 	depends on SYSFS
 	select RELAY
 	select DEBUG_FS
+	select MARKERS
 	help
 	  Say Y here, if you want to be able to trace the block layer actions
 	  on a given queue. Tracing allows you to see any traffic happening
Index: linux-2.6-lttng/block/Makefile
===================================================================
--- linux-2.6-lttng.orig/block/Makefile	2007-05-02 21:20:30.000000000 -0400
+++ linux-2.6-lttng/block/Makefile	2007-05-02 21:20:46.000000000 -0400
@@ -9,4 +9,4 @@
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
 
-obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
+obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o blk-probe.o
Index: linux-2.6-lttng/block/blk-probe.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/block/blk-probe.c	2007-05-02 23:43:44.000000000 -0400
@@ -0,0 +1,276 @@
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/crc32.h>
+#include <linux/marker.h>
+#include <linux/blktrace_api.h>
+
+
+/**
+ * blk_add_trace_rq - Add a trace for a request oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @rq:		the source request
+ *
+ * Description:
+ *     Records an action against a request. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_rq(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	int rw;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct request *rq;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	rq = va_arg(args, struct request *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rw = rq->cmd_flags & 0x03;
+
+	if (likely(!bt))
+		return;
+
+	if (blk_pc_request(rq)) {
+		what |= BLK_TC_ACT(BLK_TC_PC);
+		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+	} else  {
+		what |= BLK_TC_ACT(BLK_TC_FS);
+		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+	}
+}
+
+/**
+ * blk_add_trace_bio - Add a trace for a bio oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ *
+ * Description:
+ *     Records an action against a bio. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_bio(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+/**
+ * blk_add_trace_generic - Add a trace for a generic action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @rw:		the data direction
+ *
+ * Description:
+ *     Records a simple trace
+ *
+ **/
+static void blk_add_trace_generic(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	int rw;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	rw = va_arg(args, int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		blk_add_trace_bio(mdata, "%p %p", q, bio);
+	else
+		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+/**
+ * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @pdu:	the integer payload
+ *
+ * Description:
+ *     Adds a trace with some integer payload. This might be an unplug
+ *     option given as the action, with the depth at unplug time given
+ *     as the payload
+ *
+ **/
+static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	unsigned int pdu;
+	__be64 rpdu;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	pdu = va_arg(args, unsigned int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rpdu = cpu_to_be64(pdu);
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
+	else
+		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+}
+
+/**
+ * blk_add_trace_remap - Add a trace for a remap operation
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @dev:	target device
+ * @from:	source sector
+ * @to:		target sector
+ *
+ * Description:
+ *     Device mapper or raid target sometimes need to split a bio because
+ *     it spans a stripe (or similar). Add a trace for that action.
+ *
+ **/
+static void blk_add_trace_remap(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	struct blk_io_trace_remap r;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	u64 dev, from, to;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	dev = va_arg(args, u64);
+	from = va_arg(args, u64);
+	to = va_arg(args, u64);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	r.device = cpu_to_be32(dev);
+	r.sector = cpu_to_be64(to);
+
+	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
+}
+
+#define FACILITY_NAME "blk"
+
+static struct blk_probe_data probe_array[] =
+{
+	{ "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio },
+	{ "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio },
+	{ "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio },
+	{ "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic },
+	{ "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ,
+		blk_add_trace_generic },
+	{ "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq },
+	{ "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq },
+	{ "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq },
+	{ "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic },
+	{ "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO,
+		blk_add_trace_pdu_int },
+	{ "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER,
+		blk_add_trace_pdu_int },
+	{ "blk_request_insert", "%p %p", BLK_TA_INSERT,
+		blk_add_trace_rq },
+	{ "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT,
+		blk_add_trace_pdu_int },
+	{ "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio },
+	{ "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP,
+		blk_add_trace_remap },
+};
+
+
+#define NUM_PROBES ARRAY_SIZE(probe_array)
+
+int blk_probe_connect(void)
+{
+	int result;
+	uint8_t i;
+
+	for (i = 0; i < NUM_PROBES; i++) {
+		result = marker_set_probe(probe_array[i].name,
+				probe_array[i].format,
+				probe_array[i].callback, &probe_array[i]);
+		if (!result)
+			printk(KERN_INFO
+				"blktrace unable to register probe %s\n",
+				probe_array[i].name);
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_probe_connect);
+
+void blk_probe_disconnect(void)
+{
+	uint8_t i;
+
+	for (i = 0; i < NUM_PROBES; i++) {
+		marker_remove_probe(probe_array[i].name);
+	}
+	synchronize_sched();	/* Wait for probes to finish */
+}
+EXPORT_SYMBOL_GPL(blk_probe_disconnect);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Mathieu Desnoyers");
+MODULE_DESCRIPTION(FACILITY_NAME " probe");
Index: linux-2.6-lttng/block/blktrace.c
===================================================================
--- linux-2.6-lttng.orig/block/blktrace.c	2007-05-02 20:33:15.000000000 -0400
+++ linux-2.6-lttng/block/blktrace.c	2007-05-02 23:48:32.000000000 -0400
@@ -28,6 +28,10 @@
 static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
 static unsigned int blktrace_seq __read_mostly = 1;
 
+/* Global reference count of probes */
+static struct mutex blk_probe_mutex;
+static int blk_probes_ref = 0;
+
 /*
  * Send out a notify message.
  */
@@ -229,6 +233,12 @@
 	blk_remove_tree(bt->dir);
 	free_percpu(bt->sequence);
 	kfree(bt);
+	mutex_lock(&blk_probe_mutex);
+	if (blk_probes_ref == 1) {
+		blk_probe_disconnect();
+		blk_probes_ref--;
+	}
+	mutex_unlock(&blk_probe_mutex);
 }
 
 static int blk_trace_remove(request_queue_t *q)
@@ -386,6 +396,14 @@
 		goto err;
 	}
 
+	/* Connect probes to markers */
+	mutex_lock(&blk_probe_mutex);
+	if (!blk_probes_ref) {
+		blk_probe_connect();
+		blk_probes_ref++;
+	}
+	mutex_unlock(&blk_probe_mutex);
+
 	return 0;
 err:
 	if (dir)
@@ -552,6 +570,7 @@
 static __init int blk_trace_init(void)
 {
 	mutex_init(&blk_tree_mutex);
+	mutex_init(&blk_probe_mutex);
 	on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
 	blk_trace_set_ht_offsets();
 
Index: linux-2.6-lttng/include/linux/blktrace_api.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/blktrace_api.h	2007-05-02 20:45:58.000000000 -0400
+++ linux-2.6-lttng/include/linux/blktrace_api.h	2007-05-02 22:12:46.000000000 -0400
@@ -3,6 +3,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/relay.h>
+#include <linux/marker.h>
 
 /*
  * Trace categories
@@ -142,149 +143,24 @@
 	u32 pid;
 };
 
+/* Probe data used for probe-marker connection */
+struct blk_probe_data {
+	const char *name;
+	const char *format;
+	u32 flags;
+	marker_probe_func *callback;
+};
+
 #if defined(CONFIG_BLK_DEV_IO_TRACE)
 extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
 extern void blk_trace_shutdown(request_queue_t *);
 extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *);
-
-/**
- * blk_add_trace_rq - Add a trace for a request oriented action
- * @q:		queue the io is for
- * @rq:		the source request
- * @what:	the action
- *
- * Description:
- *     Records an action against a request. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
-				    u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-	int rw = rq->cmd_flags & 0x03;
-
-	if (likely(!bt))
-		return;
-
-	if (blk_pc_request(rq)) {
-		what |= BLK_TC_ACT(BLK_TC_PC);
-		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
-	} else  {
-		what |= BLK_TC_ACT(BLK_TC_FS);
-		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
-	}
-}
-
-/**
- * blk_add_trace_bio - Add a trace for a bio oriented action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @what:	the action
- *
- * Description:
- *     Records an action against a bio. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
-				     u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
-}
-
-/**
- * blk_add_trace_generic - Add a trace for a generic action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @rw:		the data direction
- * @what:	the action
- *
- * Description:
- *     Records a simple trace
- *
- **/
-static inline void blk_add_trace_generic(struct request_queue *q,
-					 struct bio *bio, int rw, u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		blk_add_trace_bio(q, bio, what);
-	else
-		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
-}
-
-/**
- * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
- * @q:		queue the io is for
- * @what:	the action
- * @bio:	the source bio
- * @pdu:	the integer payload
- *
- * Description:
- *     Adds a trace with some integer payload. This might be an unplug
- *     option given as the action, with the depth at unplug time given
- *     as the payload
- *
- **/
-static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what,
-					 struct bio *bio, unsigned int pdu)
-{
-	struct blk_trace *bt = q->blk_trace;
-	__be64 rpdu = cpu_to_be64(pdu);
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
-	else
-		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
-}
-
-/**
- * blk_add_trace_remap - Add a trace for a remap operation
- * @q:		queue the io is for
- * @bio:	the source bio
- * @dev:	target device
- * @from:	source sector
- * @to:		target sector
- *
- * Description:
- *     Device mapper or raid target sometimes need to split a bio because
- *     it spans a stripe (or similar). Add a trace for that action.
- *
- **/
-static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
-				       dev_t dev, sector_t from, sector_t to)
-{
-	struct blk_trace *bt = q->blk_trace;
-	struct blk_io_trace_remap r;
-
-	if (likely(!bt))
-		return;
-
-	r.device = cpu_to_be32(dev);
-	r.sector = cpu_to_be64(to);
-
-	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
-}
+extern int blk_probe_connect(void);
+extern void blk_probe_disconnect(void);
 
 #else /* !CONFIG_BLK_DEV_IO_TRACE */
 #define blk_trace_ioctl(bdev, cmd, arg)		(-ENOTTY)
 #define blk_trace_shutdown(q)			do { } while (0)
-#define blk_add_trace_rq(q, rq, what)		do { } while (0)
-#define blk_add_trace_bio(q, rq, what)		do { } while (0)
-#define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
-#define blk_add_trace_pdu_int(q, what, bio, pdu)	do { } while (0)
-#define blk_add_trace_remap(q, bio, dev, f, t)	do {} while (0)
 #endif /* CONFIG_BLK_DEV_IO_TRACE */
 
 #endif
Index: linux-2.6-lttng/mm/bounce.c
===================================================================
--- linux-2.6-lttng.orig/mm/bounce.c	2007-05-02 21:34:39.000000000 -0400
+++ linux-2.6-lttng/mm/bounce.c	2007-05-02 21:36:17.000000000 -0400
@@ -13,7 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>
 
 #define POOL_SIZE	64
@@ -237,7 +237,7 @@
 	if (!bio)
 		return;
 
-	blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);
+	MARK(blk_bio_bounce, "%p %p", q, *bio_orig);
 
 	/*
 	 * at least one page was bounced, fill in possible non-highmem
Index: linux-2.6-lttng/mm/highmem.c
===================================================================
--- linux-2.6-lttng.orig/mm/highmem.c	2007-05-02 21:36:27.000000000 -0400
+++ linux-2.6-lttng/mm/highmem.c	2007-05-02 21:36:39.000000000 -0400
@@ -26,7 +26,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>
 
 /*
Index: linux-2.6-lttng/fs/bio.c
===================================================================
--- linux-2.6-lttng.orig/fs/bio.c	2007-05-02 21:37:52.000000000 -0400
+++ linux-2.6-lttng/fs/bio.c	2007-05-02 21:40:30.000000000 -0400
@@ -25,7 +25,7 @@
 #include <linux/module.h>
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <scsi/sg.h>		/* for struct sg_iovec */
 
 #define BIO_POOL_SIZE 2
@@ -1081,7 +1081,7 @@
 	if (!bp)
 		return bp;
 
-	blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi,
+	MARK(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi,
 				bi->bi_sector + first_sectors);
 
 	BUG_ON(bi->bi_vcnt != 1);
Index: linux-2.6-lttng/drivers/block/cciss.c
===================================================================
--- linux-2.6-lttng.orig/drivers/block/cciss.c	2007-05-02 21:44:30.000000000 -0400
+++ linux-2.6-lttng/drivers/block/cciss.c	2007-05-02 21:45:08.000000000 -0400
@@ -37,7 +37,7 @@
 #include <linux/hdreg.h>
 #include <linux/spinlock.h>
 #include <linux/compat.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 
@@ -2502,7 +2502,7 @@
 	}
 	cmd->rq->data_len = 0;
 	cmd->rq->completion_data = cmd;
-	blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+	MARK(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq);
 	blk_complete_request(cmd->rq);
 }
 
Index: linux-2.6-lttng/drivers/md/dm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/md/dm.c	2007-05-02 21:44:41.000000000 -0400
+++ linux-2.6-lttng/drivers/md/dm.c	2007-05-02 21:47:19.000000000 -0400
@@ -19,7 +19,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/hdreg.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/smp_lock.h>
 
 #define DM_MSG_PREFIX "core"
@@ -485,8 +485,8 @@
 			wake_up(&io->md->wait);
 
 		if (io->error != DM_ENDIO_REQUEUE) {
-			blk_add_trace_bio(io->md->queue, io->bio,
-					  BLK_TA_COMPLETE);
+			MARK(blk_request_complete, "%p %p",
+				io->md->queue, io->bio);
 
 			bio_endio(io->bio, io->bio->bi_size, io->error);
 		}
@@ -582,10 +582,10 @@
 	r = ti->type->map(ti, clone, &tio->info);
 	if (r == DM_MAPIO_REMAPPED) {
 		/* the bio has been remapped so dispatch it */
-
-		blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
-				    tio->io->bio->bi_bdev->bd_dev, sector,
-				    clone->bi_sector);
+		MARK(blk_remap, "%p %p %u %llu %llu",
+			bdev_get_queue(clone->bi_bdev), clone,
+			(u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector,
+			(u64)clone->bi_sector);
 
 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: 2.6.22 -mm merge plans
  2007-05-03 15:04                   ` Mathieu Desnoyers
@ 2007-05-03 15:12                     ` Christoph Hellwig
  2007-05-03 17:16                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 15:12 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty,
	wfg

On Thu, May 03, 2007 at 11:04:15AM -0400, Mathieu Desnoyers wrote:
> -	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
> +	MARK(blk_request_insert, "%p %p", q, rq);

I don't really like the shouting MARK name very much.  Can we have
a less generic, less shouting name, e.g. trace_mark?  The above would then
be:

	trace_mark(blk_request_insert, "%p %p", q, rq);

> +#define NUM_PROBES ARRAY_SIZE(probe_array)

Just get rid of this and use ARRAY_SIZE directly below.

> +int blk_probe_connect(void)
> +{
> +	int result;
> +	uint8_t i;

Just use an int for loop counters.  It's easier to read and probably faster
on most systems (if the compiler isn't smart enough to promote it
to int anyway during code generation).

> +void blk_probe_disconnect(void)
> +{
> +	uint8_t i;
> +
> +	for (i = 0; i < NUM_PROBES; i++) {
> +		marker_remove_probe(probe_array[i].name);
> +	}
> +	synchronize_sched();	/* Wait for probes to finish */

kprobes does this kind of synchronization internally, so the marker
wrapper should probably do so as well.

> +static int blk_probes_ref = 0;

no need to initialize this.

>  /*
>   * Send out a notify message.
>   */
> @@ -229,6 +233,12 @@
>  	blk_remove_tree(bt->dir);
>  	free_percpu(bt->sequence);
>  	kfree(bt);
> +	mutex_lock(&blk_probe_mutex);
> +	if (blk_probes_ref == 1) {
> +		blk_probe_disconnect();
> +		blk_probes_ref--;
> +	}

	if (--blk_probes_ref == 0)
		blk_probe_disconnect();

would probably be a tad cleaner.

> +	if (!blk_probes_ref) {
> +		blk_probe_connect();
> +		blk_probes_ref++;
> +	}

Ditto here with a:

	if (!blk_probes_ref++)
		blk_probe_connect();

Also, the "connect" in the name seems rather odd; what about arm/disarm instead?
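
For reference, the counted form suggested above can be sketched in userspace
as follows. The names trace_setup(), trace_cleanup(), probe_arm() and
probe_disarm() are stand-ins chosen for this example, and locking is left to
the caller. Note that in the posted patch the counter only ever moves between
0 and 1, so the first teardown would disconnect the probes even with a second
trace still active; with a real count they stay armed until the last user
goes away.

```c
#include <assert.h>

static int probes_armed;	/* 1 while the probes are connected */
static int probes_ref;		/* static storage, so it starts at 0 */

/* Hypothetical stand-ins for the arm/disarm helpers. */
static void probe_arm(void)    { probes_armed = 1; }
static void probe_disarm(void) { probes_armed = 0; }

/*
 * Both functions assume the caller holds the mutex protecting probes_ref
 * (blk_probe_mutex in the patch); locking is omitted in this sketch.
 */
static void trace_setup(void)
{
	if (!probes_ref++)		/* first user arms the probes */
		probe_arm();
}

static void trace_cleanup(void)
{
	assert(probes_ref > 0);		/* unbalanced cleanup is a bug */
	if (--probes_ref == 0)		/* last user disarms them */
		probe_disarm();
}
```

With two trace_setup() calls followed by one trace_cleanup(), the probes are
still armed; only the second trace_cleanup() disarms them.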

>  static __init int blk_trace_init(void)
>  {
>  	mutex_init(&blk_tree_mutex);
> +	mutex_init(&blk_probe_mutex);

Both should use DEFINE_MUTEX for compile-time initialization instead.

Also, it's probably better to put the trace points into blktrace.c;
that way all blktrace code can be static and self-contained.  And we
can probably do some additional cleanups by simplifying things later on.

* Re: 2.6.22 -mm merge plans
  2007-05-03 15:12                     ` Christoph Hellwig
@ 2007-05-03 17:16                       ` Mathieu Desnoyers
  2007-05-03 17:25                         ` Christoph Hellwig
  0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 17:16 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

Here is the reworked patch, with one comment first:

* Christoph Hellwig (hch@infradead.org) wrote:
> > +void blk_probe_disconnect(void)
> > +{
> > +	uint8_t i;
> > +
> > +	for (i = 0; i < NUM_PROBES; i++) {
> > +		marker_remove_probe(probe_array[i].name);
> > +	}
> > +	synchronize_sched();	/* Wait for probes to finish */
> 
> kprobes does this kind of synchronization internally, so the marker
> wrapper should probably do so as well.
> 

The problem appears on heavily loaded systems. Doing 50
synchronize_sched() calls in a row can take up to a few seconds on a
4-way machine. This is why I prefer to do it in the module to which
the callbacks belong.
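
To illustrate the cost argument, here is a small userspace model of the two
strategies. wait_for_probes() is a hypothetical stand-in for
synchronize_sched() that only counts how often the expensive wait would run;
remove_probe() and the array of three probes are likewise made up for the
example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PROBES 3

static int connected[NR_PROBES];
static int grace_periods;	/* number of expensive waits performed */

/* Stand-in for synchronize_sched(): only counts the calls. */
static void wait_for_probes(void)
{
	grace_periods++;
}

static void arm_all(void)
{
	size_t i;

	for (i = 0; i < NR_PROBES; i++)
		connected[i] = 1;
	grace_periods = 0;
}

/* Unhook one probe without waiting for running callbacks. */
static void remove_probe(size_t i)
{
	assert(i < NR_PROBES);
	connected[i] = 0;
}

/* Strategy 1: synchronize after every single removal. */
static int disarm_sync_each(void)
{
	size_t i;

	arm_all();
	for (i = 0; i < NR_PROBES; i++) {
		remove_probe(i);
		wait_for_probes();
	}
	return grace_periods;
}

/* Strategy 2: remove everything, then synchronize once. */
static int disarm_sync_once(void)
{
	size_t i;

	arm_all();
	for (i = 0; i < NR_PROBES; i++)
		remove_probe(i);
	wait_for_probes();
	return grace_periods;
}
```

The first variant waits once per probe, the second waits once in total,
which is the difference that matters when each wait can take a noticeable
amount of time on a loaded machine.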

Here is the revised patch. It depends on a newer version of markers
that I'll send to Andrew soon.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>


Index: linux-2.6-lttng/block/elevator.c
===================================================================
--- linux-2.6-lttng.orig/block/elevator.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/elevator.c	2007-05-03 12:54:58.000000000 -0400
@@ -32,7 +32,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/delay.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/hash.h>
 
 #include <asm/uaccess.h>
@@ -571,7 +571,7 @@
 	unsigned ordseq;
 	int unplug_it = 1;
 
-	blk_add_trace_rq(q, rq, BLK_TA_INSERT);
+	trace_mark(blk_request_insert, "%p %p", q, rq);
 
 	rq->q = q;
 
@@ -757,7 +757,7 @@
 			 * not be passed by new incoming requests
 			 */
 			rq->cmd_flags |= REQ_STARTED;
-			blk_add_trace_rq(q, rq, BLK_TA_ISSUE);
+			trace_mark(blk_request_issue, "%p %p", q, rq);
 		}
 
 		if (!q->boundary_rq || q->boundary_rq == rq) {
Index: linux-2.6-lttng/block/ll_rw_blk.c
===================================================================
--- linux-2.6-lttng.orig/block/ll_rw_blk.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/ll_rw_blk.c	2007-05-03 12:54:58.000000000 -0400
@@ -28,6 +28,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/interrupt.h>
 #include <linux/cpu.h>
+#include <linux/marker.h>
 #include <linux/blktrace_api.h>
 #include <linux/fault-inject.h>
 
@@ -1551,7 +1552,7 @@
 
 	if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
 		mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
-		blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
+		trace_mark(blk_plug_device, "%p %p %d", q, NULL, 0);
 	}
 }
 
@@ -1617,7 +1618,7 @@
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
 	if (q->unplug_fn) {
-		blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+		trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 					q->rq.count[READ] + q->rq.count[WRITE]);
 
 		q->unplug_fn(q);
@@ -1628,7 +1629,7 @@
 {
 	request_queue_t *q = container_of(work, request_queue_t, unplug_work);
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL,
+	trace_mark(blk_pdu_unplug_io, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 	q->unplug_fn(q);
@@ -1638,7 +1639,7 @@
 {
 	request_queue_t *q = (request_queue_t *)data;
 
-	blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL,
+	trace_mark(blk_pdu_unplug_timer, "%p %p %d", q, NULL,
 				q->rq.count[READ] + q->rq.count[WRITE]);
 
 	kblockd_schedule_work(&q->unplug_work);
@@ -2148,7 +2149,7 @@
 	
 	rq_init(q, rq);
 
-	blk_add_trace_generic(q, bio, rw, BLK_TA_GETRQ);
+	trace_mark(blk_get_request, "%p %p %d", q, bio, rw);
 out:
 	return rq;
 }
@@ -2178,7 +2179,7 @@
 		if (!rq) {
 			struct io_context *ioc;
 
-			blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ);
+			trace_mark(blk_sleep_request, "%p %p %d", q, bio, rw);
 
 			__generic_unplug_device(q);
 			spin_unlock_irq(q->queue_lock);
@@ -2252,7 +2253,7 @@
  */
 void blk_requeue_request(request_queue_t *q, struct request *rq)
 {
-	blk_add_trace_rq(q, rq, BLK_TA_REQUEUE);
+	trace_mark(blk_requeue, "%p %p", q, rq);
 
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);
@@ -2937,7 +2938,7 @@
 			if (!ll_back_merge_fn(q, req, bio))
 				break;
 
-			blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE);
+			trace_mark(blk_bio_backmerge, "%p %p", q, bio);
 
 			req->biotail->bi_next = bio;
 			req->biotail = bio;
@@ -2954,7 +2955,7 @@
 			if (!ll_front_merge_fn(q, req, bio))
 				break;
 
-			blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE);
+			trace_mark(blk_bio_frontmerge, "%p %p", q, bio);
 
 			bio->bi_next = req->bio;
 			req->bio = bio;
@@ -3184,10 +3185,10 @@
 		blk_partition_remap(bio);
 
 		if (old_sector != -1)
-			blk_add_trace_remap(q, bio, old_dev, bio->bi_sector, 
-					    old_sector);
+			trace_mark(blk_remap, "%p %p %u %llu %llu", q, bio, old_dev,
+					(u64)bio->bi_sector, (u64)old_sector);
 
-		blk_add_trace_bio(q, bio, BLK_TA_QUEUE);
+		trace_mark(blk_bio_queue, "%p %p", q, bio);
 
 		old_sector = bio->bi_sector;
 		old_dev = bio->bi_bdev->bd_dev;
@@ -3329,7 +3330,7 @@
 	int total_bytes, bio_nbytes, error, next_idx = 0;
 	struct bio *bio;
 
-	blk_add_trace_rq(req->q, req, BLK_TA_COMPLETE);
+	trace_mark(blk_request_complete, "%p %p", req->q, req);
 
 	/*
 	 * extend uptodate bool to allow < 0 value to be direct io error
Index: linux-2.6-lttng/block/Kconfig
===================================================================
--- linux-2.6-lttng.orig/block/Kconfig	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/Kconfig	2007-05-03 12:54:58.000000000 -0400
@@ -32,6 +32,7 @@
 	depends on SYSFS
 	select RELAY
 	select DEBUG_FS
+	select MARKERS
 	help
 	  Say Y here, if you want to be able to trace the block layer actions
 	  on a given queue. Tracing allows you to see any traffic happening
Index: linux-2.6-lttng/block/blktrace.c
===================================================================
--- linux-2.6-lttng.orig/block/blktrace.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/block/blktrace.c	2007-05-03 13:05:30.000000000 -0400
@@ -23,11 +23,19 @@
 #include <linux/mutex.h>
 #include <linux/debugfs.h>
 #include <linux/time.h>
+#include <linux/marker.h>
 #include <asm/uaccess.h>
 
 static DEFINE_PER_CPU(unsigned long long, blk_trace_cpu_offset) = { 0, };
 static unsigned int blktrace_seq __read_mostly = 1;
 
+/* Global reference count of probes */
+static DEFINE_MUTEX(blk_probe_mutex);
+static int blk_probes_ref;
+
+int blk_probe_arm(void);
+void blk_probe_disarm(void);
+
 /*
  * Send out a notify message.
  */
@@ -179,7 +187,7 @@
 EXPORT_SYMBOL_GPL(__blk_add_trace);
 
 static struct dentry *blk_tree_root;
-static struct mutex blk_tree_mutex;
+static DEFINE_MUTEX(blk_tree_mutex);
 static unsigned int root_users;
 
 static inline void blk_remove_root(void)
@@ -229,6 +237,10 @@
 	blk_remove_tree(bt->dir);
 	free_percpu(bt->sequence);
 	kfree(bt);
+	mutex_lock(&blk_probe_mutex);
+	if (--blk_probes_ref == 0)
+		blk_probe_disarm();
+	mutex_unlock(&blk_probe_mutex);
 }
 
 static int blk_trace_remove(request_queue_t *q)
@@ -386,6 +398,11 @@
 		goto err;
 	}
 
+	mutex_lock(&blk_probe_mutex);
+	if (!blk_probes_ref++)
+		blk_probe_arm();
+	mutex_unlock(&blk_probe_mutex);
+
 	return 0;
 err:
 	if (dir)
@@ -549,9 +566,270 @@
 #endif
 }
 
+/**
+ * blk_add_trace_rq - Add a trace for a request oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @rq:		the source request
+ *
+ * Description:
+ *     Records an action against a request. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_rq(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	int rw;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct request *rq;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	rq = va_arg(args, struct request *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rw = rq->cmd_flags & 0x03;
+
+	if (likely(!bt))
+		return;
+
+	if (blk_pc_request(rq)) {
+		what |= BLK_TC_ACT(BLK_TC_PC);
+		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
+	} else  {
+		what |= BLK_TC_ACT(BLK_TC_FS);
+		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
+	}
+}
+
+/**
+ * blk_add_trace_bio - Add a trace for a bio oriented action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ *
+ * Description:
+ *     Records an action against a bio. Will log the bio offset + size.
+ *
+ **/
+static void blk_add_trace_bio(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	u32 what;
+	struct blk_trace *bt;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
+}
+
+/**
+ * blk_add_trace_generic - Add a trace for a generic action
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @rw:		the data direction
+ *
+ * Description:
+ *     Records a simple trace
+ *
+ **/
+static void blk_add_trace_generic(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	int rw;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	rw = va_arg(args, int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		blk_add_trace_bio(mdata, "%p %p", q, bio);
+	else
+		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
+}
+
+/**
+ * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @pdu:	the integer payload
+ *
+ * Description:
+ *     Adds a trace with some integer payload. This might be an unplug
+ *     option given as the action, with the depth at unplug time given
+ *     as the payload
+ *
+ **/
+static void blk_add_trace_pdu_int(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	unsigned int pdu;
+	__be64 rpdu;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	pdu = va_arg(args, unsigned int);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+	rpdu = cpu_to_be64(pdu);
+
+	if (likely(!bt))
+		return;
+
+	if (bio)
+		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
+	else
+		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+}
+
+/**
+ * blk_add_trace_remap - Add a trace for a remap operation
+ * Expected variable arguments :
+ * @q:		queue the io is for
+ * @bio:	the source bio
+ * @dev:	target device
+ * @from:	source sector
+ * @to:		target sector
+ *
+ * Description:
+ *     Device mapper or raid target sometimes need to split a bio because
+ *     it spans a stripe (or similar). Add a trace for that action.
+ *
+ **/
+static void blk_add_trace_remap(const struct __mark_marker_data *mdata,
+	const char *fmt, ...)
+{
+	va_list args;
+	struct blk_trace *bt;
+	struct blk_io_trace_remap r;
+	u32 what;
+	struct blk_probe_data *pinfo = mdata->pdata;
+	struct request_queue *q;
+	struct bio *bio;
+	u64 dev, from, to;
+
+	va_start(args, fmt);
+	q = va_arg(args, struct request_queue *);
+	bio = va_arg(args, struct bio *);
+	dev = va_arg(args, u64);
+	from = va_arg(args, u64);
+	to = va_arg(args, u64);
+	va_end(args);
+
+	what = pinfo->flags;
+	bt = q->blk_trace;
+
+	if (likely(!bt))
+		return;
+
+	r.device = cpu_to_be32(dev);
+	r.sector = cpu_to_be64(to);
+
+	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
+}
+
+#define FACILITY_NAME "blk"
+
+static struct blk_probe_data probe_array[] =
+{
+	{ "blk_bio_queue", "%p %p", BLK_TA_QUEUE, blk_add_trace_bio },
+	{ "blk_bio_backmerge", "%p %p", BLK_TA_BACKMERGE, blk_add_trace_bio },
+	{ "blk_bio_frontmerge", "%p %p", BLK_TA_FRONTMERGE, blk_add_trace_bio },
+	{ "blk_get_request", "%p %p %d", BLK_TA_GETRQ, blk_add_trace_generic },
+	{ "blk_sleep_request", "%p %p %d", BLK_TA_SLEEPRQ,
+		blk_add_trace_generic },
+	{ "blk_requeue", "%p %p", BLK_TA_REQUEUE, blk_add_trace_rq },
+	{ "blk_request_issue", "%p %p", BLK_TA_ISSUE, blk_add_trace_rq },
+	{ "blk_request_complete", "%p %p", BLK_TA_COMPLETE, blk_add_trace_rq },
+	{ "blk_plug_device", "%p %p %d", BLK_TA_PLUG, blk_add_trace_generic },
+	{ "blk_pdu_unplug_io", "%p %p %d", BLK_TA_UNPLUG_IO,
+		blk_add_trace_pdu_int },
+	{ "blk_pdu_unplug_timer", "%p %p %d", BLK_TA_UNPLUG_TIMER,
+		blk_add_trace_pdu_int },
+	{ "blk_request_insert", "%p %p", BLK_TA_INSERT,
+		blk_add_trace_rq },
+	{ "blk_pdu_split", "%p %p %d", BLK_TA_SPLIT,
+		blk_add_trace_pdu_int },
+	{ "blk_bio_bounce", "%p %p", BLK_TA_BOUNCE, blk_add_trace_bio },
+	{ "blk_remap", "%p %p %u %llu %llu", BLK_TA_REMAP,
+		blk_add_trace_remap },
+};
+
+
+int blk_probe_arm(void)
+{
+	int result;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(probe_array); i++) {
+		result = marker_set_probe(probe_array[i].name,
+				probe_array[i].format,
+				probe_array[i].callback, &probe_array[i]);
+		if (!result)
+			printk(KERN_INFO
+				"blktrace unable to register probe %s\n",
+				probe_array[i].name);
+	}
+	return 0;
+}
+
+void blk_probe_disarm(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(probe_array); i++) {
+		marker_remove_probe(probe_array[i].name);
+	}
+	synchronize_sched();	/* Wait for probes to finish */
+}
+
+
 static __init int blk_trace_init(void)
 {
-	mutex_init(&blk_tree_mutex);
 	on_each_cpu(blk_trace_check_cpu_time, NULL, 1, 1);
 	blk_trace_set_ht_offsets();
 
Index: linux-2.6-lttng/include/linux/blktrace_api.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/blktrace_api.h	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/include/linux/blktrace_api.h	2007-05-03 12:54:58.000000000 -0400
@@ -3,6 +3,7 @@
 
 #include <linux/blkdev.h>
 #include <linux/relay.h>
+#include <linux/marker.h>
 
 /*
  * Trace categories
@@ -142,149 +143,24 @@
 	u32 pid;
 };
 
+/* Probe data used for probe-marker connection */
+struct blk_probe_data {
+	const char *name;
+	const char *format;
+	u32 flags;
+	marker_probe_func *callback;
+};
+
 #if defined(CONFIG_BLK_DEV_IO_TRACE)
 extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
 extern void blk_trace_shutdown(request_queue_t *);
 extern void __blk_add_trace(struct blk_trace *, sector_t, int, int, u32, int, int, void *);
-
-/**
- * blk_add_trace_rq - Add a trace for a request oriented action
- * @q:		queue the io is for
- * @rq:		the source request
- * @what:	the action
- *
- * Description:
- *     Records an action against a request. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_rq(struct request_queue *q, struct request *rq,
-				    u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-	int rw = rq->cmd_flags & 0x03;
-
-	if (likely(!bt))
-		return;
-
-	if (blk_pc_request(rq)) {
-		what |= BLK_TC_ACT(BLK_TC_PC);
-		__blk_add_trace(bt, 0, rq->data_len, rw, what, rq->errors, sizeof(rq->cmd), rq->cmd);
-	} else  {
-		what |= BLK_TC_ACT(BLK_TC_FS);
-		__blk_add_trace(bt, rq->hard_sector, rq->hard_nr_sectors << 9, rw, what, rq->errors, 0, NULL);
-	}
-}
-
-/**
- * blk_add_trace_bio - Add a trace for a bio oriented action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @what:	the action
- *
- * Description:
- *     Records an action against a bio. Will log the bio offset + size.
- *
- **/
-static inline void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
-				     u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), 0, NULL);
-}
-
-/**
- * blk_add_trace_generic - Add a trace for a generic action
- * @q:		queue the io is for
- * @bio:	the source bio
- * @rw:		the data direction
- * @what:	the action
- *
- * Description:
- *     Records a simple trace
- *
- **/
-static inline void blk_add_trace_generic(struct request_queue *q,
-					 struct bio *bio, int rw, u32 what)
-{
-	struct blk_trace *bt = q->blk_trace;
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		blk_add_trace_bio(q, bio, what);
-	else
-		__blk_add_trace(bt, 0, 0, rw, what, 0, 0, NULL);
-}
-
-/**
- * blk_add_trace_pdu_int - Add a trace for a bio with an integer payload
- * @q:		queue the io is for
- * @what:	the action
- * @bio:	the source bio
- * @pdu:	the integer payload
- *
- * Description:
- *     Adds a trace with some integer payload. This might be an unplug
- *     option given as the action, with the depth at unplug time given
- *     as the payload
- *
- **/
-static inline void blk_add_trace_pdu_int(struct request_queue *q, u32 what,
-					 struct bio *bio, unsigned int pdu)
-{
-	struct blk_trace *bt = q->blk_trace;
-	__be64 rpdu = cpu_to_be64(pdu);
-
-	if (likely(!bt))
-		return;
-
-	if (bio)
-		__blk_add_trace(bt, bio->bi_sector, bio->bi_size, bio->bi_rw, what, !bio_flagged(bio, BIO_UPTODATE), sizeof(rpdu), &rpdu);
-	else
-		__blk_add_trace(bt, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
-}
-
-/**
- * blk_add_trace_remap - Add a trace for a remap operation
- * @q:		queue the io is for
- * @bio:	the source bio
- * @dev:	target device
- * @from:	source sector
- * @to:		target sector
- *
- * Description:
- *     Device mapper or raid target sometimes need to split a bio because
- *     it spans a stripe (or similar). Add a trace for that action.
- *
- **/
-static inline void blk_add_trace_remap(struct request_queue *q, struct bio *bio,
-				       dev_t dev, sector_t from, sector_t to)
-{
-	struct blk_trace *bt = q->blk_trace;
-	struct blk_io_trace_remap r;
-
-	if (likely(!bt))
-		return;
-
-	r.device = cpu_to_be32(dev);
-	r.sector = cpu_to_be64(to);
-
-	__blk_add_trace(bt, from, bio->bi_size, bio->bi_rw, BLK_TA_REMAP, !bio_flagged(bio, BIO_UPTODATE), sizeof(r), &r);
-}
+extern int blk_probe_arm(void);
+extern void blk_probe_disarm(void);
 
 #else /* !CONFIG_BLK_DEV_IO_TRACE */
 #define blk_trace_ioctl(bdev, cmd, arg)		(-ENOTTY)
 #define blk_trace_shutdown(q)			do { } while (0)
-#define blk_add_trace_rq(q, rq, what)		do { } while (0)
-#define blk_add_trace_bio(q, rq, what)		do { } while (0)
-#define blk_add_trace_generic(q, rq, rw, what)	do { } while (0)
-#define blk_add_trace_pdu_int(q, what, bio, pdu)	do { } while (0)
-#define blk_add_trace_remap(q, bio, dev, f, t)	do {} while (0)
 #endif /* CONFIG_BLK_DEV_IO_TRACE */
 
 #endif
Index: linux-2.6-lttng/mm/bounce.c
===================================================================
--- linux-2.6-lttng.orig/mm/bounce.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/mm/bounce.c	2007-05-03 12:54:58.000000000 -0400
@@ -13,7 +13,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>
 
 #define POOL_SIZE	64
@@ -237,7 +237,7 @@
 	if (!bio)
 		return;
 
-	blk_add_trace_bio(q, *bio_orig, BLK_TA_BOUNCE);
+	trace_mark(blk_bio_bounce, "%p %p", q, *bio_orig);
 
 	/*
 	 * at least one page was bounced, fill in possible non-highmem
Index: linux-2.6-lttng/mm/highmem.c
===================================================================
--- linux-2.6-lttng.orig/mm/highmem.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/mm/highmem.c	2007-05-03 12:54:58.000000000 -0400
@@ -26,7 +26,7 @@
 #include <linux/init.h>
 #include <linux/hash.h>
 #include <linux/highmem.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/tlbflush.h>
 
 /*
Index: linux-2.6-lttng/fs/bio.c
===================================================================
--- linux-2.6-lttng.orig/fs/bio.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/fs/bio.c	2007-05-03 12:54:58.000000000 -0400
@@ -25,7 +25,7 @@
 #include <linux/module.h>
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <scsi/sg.h>		/* for struct sg_iovec */
 
 #define BIO_POOL_SIZE 2
@@ -1081,7 +1081,7 @@
 	if (!bp)
 		return bp;
 
-	blk_add_trace_pdu_int(bdev_get_queue(bi->bi_bdev), BLK_TA_SPLIT, bi,
+	trace_mark(blk_pdu_split, "%p %p %d", bdev_get_queue(bi->bi_bdev), bi,
 				bi->bi_sector + first_sectors);
 
 	BUG_ON(bi->bi_vcnt != 1);
Index: linux-2.6-lttng/drivers/block/cciss.c
===================================================================
--- linux-2.6-lttng.orig/drivers/block/cciss.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/drivers/block/cciss.c	2007-05-03 12:54:58.000000000 -0400
@@ -37,7 +37,7 @@
 #include <linux/hdreg.h>
 #include <linux/spinlock.h>
 #include <linux/compat.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 
@@ -2502,7 +2502,7 @@
 	}
 	cmd->rq->data_len = 0;
 	cmd->rq->completion_data = cmd;
-	blk_add_trace_rq(cmd->rq->q, cmd->rq, BLK_TA_COMPLETE);
+	trace_mark(blk_request_complete, "%p %p", cmd->rq->q, cmd->rq);
 	blk_complete_request(cmd->rq);
 }
 
Index: linux-2.6-lttng/drivers/md/dm.c
===================================================================
--- linux-2.6-lttng.orig/drivers/md/dm.c	2007-05-03 12:27:12.000000000 -0400
+++ linux-2.6-lttng/drivers/md/dm.c	2007-05-03 12:54:58.000000000 -0400
@@ -19,7 +19,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/hdreg.h>
-#include <linux/blktrace_api.h>
+#include <linux/marker.h>
 #include <linux/smp_lock.h>
 
 #define DM_MSG_PREFIX "core"
@@ -485,8 +485,8 @@
 			wake_up(&io->md->wait);
 
 		if (io->error != DM_ENDIO_REQUEUE) {
-			blk_add_trace_bio(io->md->queue, io->bio,
-					  BLK_TA_COMPLETE);
+			trace_mark(blk_request_complete, "%p %p",
+				io->md->queue, io->bio);
 
 			bio_endio(io->bio, io->bio->bi_size, io->error);
 		}
@@ -582,10 +582,10 @@
 	r = ti->type->map(ti, clone, &tio->info);
 	if (r == DM_MAPIO_REMAPPED) {
 		/* the bio has been remapped so dispatch it */
-
-		blk_add_trace_remap(bdev_get_queue(clone->bi_bdev), clone,
-				    tio->io->bio->bi_bdev->bd_dev, sector,
-				    clone->bi_sector);
+		trace_mark(blk_remap, "%p %p %u %llu %llu",
+			bdev_get_queue(clone->bi_bdev), clone,
+			(u64)tio->io->bio->bi_bdev->bd_dev, (u64)sector,
+			(u64)clone->bi_sector);
 
 		generic_make_request(clone);
 	} else if (r < 0 || r == DM_MAPIO_REQUEUE) {
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-03 17:16                       ` Mathieu Desnoyers
@ 2007-05-03 17:25                         ` Christoph Hellwig
  2007-05-10 19:39                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03 17:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > kprobes does this kind of synchronization internally, so the marker
> > wrapper should probably as well.
> > 
> 
> The problem appears on heavily loaded systems. Doing 50
> synchronize_sched() calls in a row can take up to a few seconds on a
> 4-way machine. This is why I prefer to do it in the module to which
> the callbacks belong.

We recently had a discussion on a batch unregistration interface for
kprobes.  I'm not very happy with having such different interfaces for
different kinds of probe registrations.
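
To make the cost being discussed concrete, here is a minimal userspace
sketch (not taken from any patch in this thread) that compares one
synchronization per removed probe with one synchronization per batch.
fake_synchronize(), struct probe and both removal helpers are
hypothetical stand-ins; fake_synchronize() only counts how often a real
synchronize_sched() would have been called.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for synchronize_sched(): it only counts calls. */
static int sync_calls;

static void fake_synchronize(void)
{
	sync_calls++;
}

/* Hypothetical probe record; this is not a kernel structure. */
struct probe {
	int armed;
};

/* Per-probe interface: every removal waits for its own grace period. */
static int remove_each_synced(struct probe *probes, size_t n)
{
	size_t i;

	sync_calls = 0;
	for (i = 0; i < n; i++) {
		probes[i].armed = 0;
		fake_synchronize();
	}
	return sync_calls;
}

/* Batch interface: disarm everything first, then wait a single time. */
static int remove_batch_synced(struct probe *probes, size_t n)
{
	size_t i;

	sync_calls = 0;
	for (i = 0; i < n; i++)
		probes[i].armed = 0;
	fake_synchronize();
	return sync_calls;
}
```

With 50 probes the first helper reports 50 waits and the second reports
one, which is the difference a batch interface is meant to capture.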


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-03 17:25                         ` Christoph Hellwig
@ 2007-05-10 19:39                           ` Mathieu Desnoyers
  2007-05-13 21:04                             ` Christoph Hellwig
  0 siblings, 1 reply; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-10 19:39 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

* Christoph Hellwig (hch@infradead.org) wrote:
> On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > kprobes does this kind of synchronization internally, so the marker
> > > wrapper should probably as well.
> > > 
> > 
> > The problem appears on heavily loaded systems. Doing 50
> > synchronize_sched() calls in a row can take up to a few seconds on a
> > 4-way machine. This is why I prefer to do it in the module to which
> > the callbacks belong.
> 
> We recently had a discussion on a batch unregistration interface for
> kprobes.  I'm not very happy with having such different interfaces for
> different kinds of probe registrations.
> 

Ok, I've had a look at the kprobes batch registration mechanisms and..
well, it does not look well suited for the markers. Adding
supplementary data structures such as linked lists of probes does not
look like a good match.

However, I agree with you that providing a similar API is good.

Therefore, here is my proposal :

The goal is to do the synchronize just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate on every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there; we only have to do 2 passes in the marker structures :
the first one finds the module associated with the callback and the
second disables the callbacks and keeps a count of the number of
callbacks associated with the module.
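
As a rough userspace model of the two passes described above (a sketch
only; the kernel patch follows in the P.S.), assume each marker records
an integer "owner" id for the module providing its probe, with 0 meaning
no module. The struct marker layout, the owner field and
fake_synchronize() are hypothetical and stand in for the
__module_text_address() lookup and synchronize_sched().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical marker record; owner is the id of the module that
 * provides the connected probe, 0 if there is none. */
struct marker {
	const char *name;
	int enabled;
	int owner;
};

/* Hypothetical stand-in for synchronize_sched(): it only counts calls. */
static int sync_calls;

static void fake_synchronize(void)
{
	sync_calls++;
}

/* Pass 1: find the module that owns the probe connected to 'name'. */
static int find_probe_owner(const struct marker *m, size_t n,
			    const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(name, m[i].name) == 0 && m[i].owner)
			return m[i].owner;
	return 0;
}

/* Pass 2: disable the matching markers and count the other markers
 * still connected to the same module. Synchronize only when none are
 * left. Returns the number of markers disabled. */
static int remove_probe(struct marker *m, size_t n, const char *name)
{
	int owner = find_probe_owner(m, n, name);
	int still_connected = 0;
	int found = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(name, m[i].name) != 0) {
			if (owner && m[i].owner == owner)
				still_connected++;
			continue;
		}
		m[i].enabled = 0;
		m[i].owner = 0;
		found++;
	}
	if (!still_connected)
		fake_synchronize();
	return found;
}
```

Removing the probes of one module one after another should then trigger
a single wait, on the last removal.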

Mathieu

P.S.: here is the code.


Linux Kernel Markers - Architecture Independent code: provide internal
synchronize_sched() in batch.

The goal is to do the synchronize just after we unregister the last
probe handler provided by a given module. Since the unregistration
functions iterate on every marker present in the kernel, we can keep a
count of how many probes provided by the same module are still present.
If we see that we unregistered the last probe pointing to this module,
we issue a synchronize_sched().

It adds no data structure and keeps the same order of complexity as what
is already there; we only have to do 2 passes in the marker structures :
the first one finds the module associated with the callback and the
second disables the callbacks and keeps a count of the number of
callbacks associated with the module.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 kernel/module.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 10 deletions(-)

Index: linux-2.6-lttng/kernel/module.c
===================================================================
--- linux-2.6-lttng.orig/kernel/module.c	2007-05-10 14:48:28.000000000 -0400
+++ linux-2.6-lttng/kernel/module.c	2007-05-10 15:38:27.000000000 -0400
@@ -404,8 +404,12 @@
 }
 
 /* Sets a range of markers to a disabled state : unset the enable bit and
- * provide the empty callback. */
+ * provide the empty callback.
+ * Keep a count of other markers connected to the same module as the one
+ * provided as parameter. */
 static int marker_remove_probe_range(const char *name,
+	struct module *probe_module,
+	int *ref_count,
 	const struct __mark_marker *begin,
 	const struct __mark_marker *end)
 {
@@ -413,12 +417,17 @@
 	int found = 0;
 
 	for (iter = begin; iter < end; iter++) {
-		if (strcmp(name, iter->mdata->name) == 0) {
-			marker_set_enable(iter->enable, 0,
-				iter->mdata->flags);
-			iter->mdata->call = __mark_empty_function;
-			found++;
+		if (strcmp(name, iter->mdata->name) != 0) {
+			if (probe_module)
+				if (__module_text_address(
+					(unsigned long)iter->mdata->call)
+						== probe_module)
+					(*ref_count)++;
+			continue;
 		}
+		marker_set_enable(iter->enable, 0, iter->mdata->flags);
+		iter->mdata->call = __mark_empty_function;
+		found++;
 	}
 	return found;
 }
@@ -450,6 +459,29 @@
 	return found;
 }
 
+/* Get the module to which the probe handler's text belongs.
+ * Called with module_mutex taken.
+ * Returns NULL if the probe handler is not in a module. */
+static struct module *__marker_get_probe_module(const char *name)
+{
+	struct module *mod;
+	const struct __mark_marker *iter;
+
+	list_for_each_entry(mod, &modules, list) {
+		if (mod->taints)
+			continue;
+		for (iter = mod->markers;
+			iter < mod->markers+mod->num_markers; iter++) {
+			if (strcmp(name, iter->mdata->name) != 0)
+				continue;
+			if (iter->mdata->call)
+				return __module_text_address(
+					(unsigned long)iter->mdata->call);
+		}
+	}
+	return NULL;
+}
+
 /* Calls _marker_set_probe_range for the core markers and modules markers.
  * Marker enabling/disabling use the modlist_lock to synchronise. */
 int _marker_set_probe(int flags, const char *name, const char *format,
@@ -477,23 +509,33 @@
 EXPORT_SYMBOL_GPL(_marker_set_probe);
 
 /* Calls _marker_remove_probe_range for the core markers and modules markers.
- * Marker enabling/disabling use the modlist_lock to synchronise. */
+ * Marker enabling/disabling use the modlist_lock to synchronise.
+ * ref_count is the number of markers still connected to the same module
+ * as the one in which sits the probe handler currently removed, excluding the
+ * one currently removed. If the count is 0, we issue a synchronize_sched() to
+ * make sure the module can safely unload. */
 int marker_remove_probe(const char *name)
 {
-	struct module *mod;
+	struct module *mod, *probe_module;
 	int found = 0;
+	int ref_count = 0;
 
 	mutex_lock(&module_mutex);
+	/* In what module is the probe handler ? */
+	probe_module = __marker_get_probe_module(name);
 	/* Core kernel markers */
-	found += marker_remove_probe_range(name,
+	found += marker_remove_probe_range(name, probe_module, &ref_count,
 			__start___markers, __stop___markers);
 	/* Markers in modules. */
 	list_for_each_entry(mod, &modules, list) {
 		if (!mod->taints)
-			found += marker_remove_probe_range(name,
+			found += marker_remove_probe_range(name, probe_module,
+				&ref_count,
 				mod->markers, mod->markers+mod->num_markers);
 	}
 	mutex_unlock(&module_mutex);
+	if (!ref_count)
+		synchronize_sched();
 	return found;
 }
 EXPORT_SYMBOL_GPL(marker_remove_probe);

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-10 19:39                           ` Mathieu Desnoyers
@ 2007-05-13 21:04                             ` Christoph Hellwig
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-13 21:04 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

On Thu, May 10, 2007 at 03:39:36PM -0400, Mathieu Desnoyers wrote:
> * Christoph Hellwig (hch@infradead.org) wrote:
> > On Thu, May 03, 2007 at 01:16:46PM -0400, Mathieu Desnoyers wrote:
> > > > kprobes does this kind of synchronization internally, so the marker
> > > > wrapper should probably as well.
> > > > 
> > > 
> > > The problem appears on heavily loaded systems. Doing 50
> > > synchronize_sched() calls in a row can take up to a few seconds on a
> > > 4-way machine. This is why I prefer to do it in the module to which
> > > the callbacks belong.
> > 
> > We recently had a discussion on a batch unregistration interface for
> > kprobes.  I'm not very happy with having such different interfaces for
> > different kinds of probe registrations.
> > 
> 
> Ok, I've had a look at the kprobes batch registration mechanisms and..
> well, it does not look well suited for the markers. Adding
> supplementary data structures such as linked lists of probes does not
> look like a good match.
> 
> However, I agree with you that providing a similar API is good.
> 
> Therefore, here is my proposal :
> 
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate on every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
> 
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do 2 passes in the marker structures :
> the first one finds the module associated with the callback and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.
> 
> Mathieu
> 
> P.S.: here is the code.
> 
> 
> Linux Kernel Markers - Architecture Independent code: provide internal
> synchronize_sched() in batch.
> 
> The goal is to do the synchronize just after we unregister the last
> probe handler provided by a given module. Since the unregistration
> functions iterate on every marker present in the kernel, we can keep a
> count of how many probes provided by the same module are still present.
> If we see that we unregistered the last probe pointing to this module,
> we issue a synchronize_sched().
> 
> It adds no data structure and keeps the same order of complexity as what
> is already there; we only have to do 2 passes in the marker structures :
> the first one finds the module associated with the callback and the
> second disables the callbacks and keeps a count of the number of
> callbacks associated with the module.

Looks good to me, please incorporate this in the next round of the
markers patch series.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 23:11               ` Mathieu Desnoyers
  2007-05-02 23:21                 ` Andrew Morton
@ 2007-05-03  8:06                 ` Christoph Hellwig
  2007-05-03 14:43                   ` Mathieu Desnoyers
  2007-05-03 10:31                 ` Andi Kleen
  2 siblings, 1 reply; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03  8:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty,
	wfg

On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> My statement was probably not clear enough. The actual marker code is
> useful as-is without any further kernel patching required : SystemTAP is
> an example where they use external modules to load probes that can
> connect either to markers or through kprobes. LTTng, in its current state,
> has a mostly modular core that also uses the markers.

That just means you have to load an enormous amount of external crap
that replaces the missing kernel functionality.  It's exactly the
situation we want to avoid.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-03  8:06                 ` Christoph Hellwig
@ 2007-05-03 14:43                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 14:43 UTC (permalink / raw)
  To: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

* Christoph Hellwig (hch@infradead.org) wrote:
> On Wed, May 02, 2007 at 07:11:04PM -0400, Mathieu Desnoyers wrote:
> > My statement was probably not clear enough. The actual marker code is
> > useful as-is without any further kernel patching required : SystemTAP is
> > an example where they use external modules to load probes that can
> > connect either to markers or through kprobes. LTTng, in its current state,
> > has a mostly modular core that also uses the markers.
> 
> That just means you have to load an enormous amount of external crap
> that replaces the missing kernel functionality.  It's exactly the
> situation we want to avoid.
> 

It makes sense to use -mm to hold the whole usable infrastructure before
submitting it to mainline. I will submit my core LTTng patches to Andrew
in the following weeks.  There is no hurry, from the LTTng perspective, to
merge the markers sooner, although they could be useful to other
(external) projects meanwhile.

Mathieu


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 23:11               ` Mathieu Desnoyers
  2007-05-02 23:21                 ` Andrew Morton
  2007-05-03  8:06                 ` Christoph Hellwig
@ 2007-05-03 10:31                 ` Andi Kleen
  2007-05-03 14:49                   ` Mathieu Desnoyers
  2 siblings, 1 reply; 233+ messages in thread
From: Andi Kleen @ 2007-05-03 10:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, Christoph Hellwig, Andi Kleen, linux-kernel, rusty,
	wfg

> If we are looking at current "potential users" that are already in
> mainline, we could change blktrace to make it use the markers.

Ok, but do it step by step:
- split out useful pieces like the "patched in enable/disable flags" 
and submit them separately with an example user or two 
[I got a couple of candidates e.g. with some of the sysctls in VM or 
networking] 
- post and merge that.
- don't implement anything initially that is not needed by blktrace
- post a minimal marker patch together with the blktrace
conversion for review again on linux-kernel
- await review comments. This review would not cover the basic
need of markers, just the specific implementation.
- then potentially incorporate review comments
- then merge
- later add features with individual review/discussion as new users in the 
kernel are added.

-Andi

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-03 10:31                 ` Andi Kleen
@ 2007-05-03 14:49                   ` Mathieu Desnoyers
  0 siblings, 0 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-03 14:49 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Christoph Hellwig, linux-kernel, rusty, wfg

Hi Andi,

This plan makes sense. I will split the "patched in enable/disable
flags" part into a separate piece (good idea!) and then submit the LTTng
core to Andrew. Christoph has a good point about wanting a usable
infrastructure to go in. Regarding your plan, I must argue that
blktrace is not a general purpose tracing infrastructure, but one
dedicated to block io tracing.  Therefore, it makes sense to bring in
the generic infrastructure first and then convert blktrace to it.

Mathieu

* Andi Kleen (andi@firstfloor.org) wrote:
> > If we are looking at current "potential users" that are already in
> > mainline, we could change blktrace to make it use the markers.
> 
> Ok, but do it step by step:
> - split out useful pieces like the "patched in enable/disable flags" 
> and submit them separately with an example user or two
> [I got a couple of candidates e.g. with some of the sysctls in VM or 
> networking] 
> - post and merge that.
> - don't implement anything initially that is not needed by blktrace
> - post a minimal marker patch together with the blktrace
> conversion for review again on linux-kernel
> - await review comments. This review would not cover the basic
> need of markers, just the specific implementation.
> - then potentially incorporate review comments
> - then merge
> - later add features with individual review/discussion as new users in the 
> kernel are added.
> 
> -Andi

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 20:53             ` Andrew Morton
  2007-05-02 23:11               ` Mathieu Desnoyers
@ 2007-05-03  8:09               ` Christoph Hellwig
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03  8:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mathieu Desnoyers, Christoph Hellwig, Andi Kleen, linux-kernel,
	rusty, wfg

On Wed, May 02, 2007 at 01:53:36PM -0700, Andrew Morton wrote:
> In which case we have:
> 
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-alpha.patch
> atomich-complete-atomic_long-operations-in-asm-generic.patch
> atomich-i386-type-safety-fix.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-ia64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-mips.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-parisc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-powerpc.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-sparc64.patch
> atomich-add-atomic64-cmpxchg-xchg-and-add_unless-to-x86_64.patch
> atomich-atomic_add_unless-as-inline-remove-systemh-atomich-circular-dependency.patch
> local_t-architecture-independant-extension.patch
> local_t-alpha-extension.patch
> local_t-i386-extension.patch
> local_t-ia64-extension.patch
> local_t-mips-extension.patch
> local_t-parisc-cleanup.patch
> local_t-powerpc-extension.patch
> local_t-sparc64-cleanup.patch
> local_t-x86_64-extension.patch
> 
>   For 2.6.22
> 
> linux-kernel-markers-kconfig-menus.patch
> linux-kernel-markers-architecture-independant-code.patch
> linux-kernel-markers-powerpc-optimization.patch
> linux-kernel-markers-i386-optimization.patch
> markers-add-instrumentation-markers-menus-to-avr32.patch
> linux-kernel-markers-non-optimized-architectures.patch
> markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch
> linux-kernel-markers-documentation.patch
> #
> markers-define-the-linker-macro-extra_rwdata.patch
> markers-use-extra_rwdata-in-architectures.patch
> #
> some-grammatical-fixups-and-additions-to-atomich-kernel-doc.patch
> no-longer-include-asm-kdebugh.patch

This is a plan I can fully agree with.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 20:36           ` Mathieu Desnoyers
  2007-05-02 20:53             ` Andrew Morton
@ 2007-05-03  8:08             ` Christoph Hellwig
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Hellwig @ 2007-05-03  8:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Hellwig, Andrew Morton, Andi Kleen, linux-kernel, rusty,
	wfg

On Wed, May 02, 2007 at 04:36:27PM -0400, Mathieu Desnoyers wrote:
> The idea is the following : either we integrate the infrastructure for
> instrumentation / data serialization / buffer management / extraction of
> data to user space in multiple different steps, which makes code review
> easier for you guys,

the staging area for that is -mm


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 16:47       ` Andrew Morton
  2007-05-02 17:29         ` Christoph Hellwig
@ 2007-05-02 17:49         ` Andi Kleen
  2007-05-02 21:46           ` Tilman Schmidt
  1 sibling, 1 reply; 233+ messages in thread
From: Andi Kleen @ 2007-05-02 17:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, Mathieu Desnoyers, linux-kernel, rusty, wfg

On Wed, May 02, 2007 at 09:47:07AM -0700, Andrew Morton wrote:
> On Wed, 2 May 2007 12:44:13 +0200 Andi Kleen <andi@firstfloor.org> wrote:
> 
> > > It is currently used as an instrumentation infrastructure for the LTTng
> > > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > > WindRiver products.  The SystemTAP project also plans to use this type of
> > > infrastructure to trace sites hard to instrument. The Linux Kernel
> > > Markers has the support of Frank C. Eigler, author of their current
> > > marker alternative (which he wishes to drop in order to adopt the
> > > markers infrastructure as soon as it hits mainline).
> > 
> > All of the above don't use mainline kernels.
> 
> That's because they have to add a markers patch!

I meant they use very old kernels. Their experiences don't apply
to mainline bitrottyness.

> > That doesn't constitute using it.
> 
> Andi, there was a huge amount of discussion about all this in September last
> year (subjects: *markers* and *LTTng*). The outcome of all that was, I
> believe, that the kernel should have a static marker infrastructure.

I have no problem with that in principle; just some doubts about
the current proposed implementation: in particular its complexity.
And also I think when something is merged it should have some users in tree.

-Andi


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 17:49         ` Andi Kleen
@ 2007-05-02 21:46           ` Tilman Schmidt
  2007-05-03 10:12             ` Andi Kleen
  0 siblings, 1 reply; 233+ messages in thread
From: Tilman Schmidt @ 2007-05-02 21:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, Mathieu Desnoyers, linux-kernel, rusty, wfg

On 02.05.2007 19:49, Andi Kleen wrote:
> And also I think when something is merged it should have some users in tree.

Isn't that a circular dependency?

-- 
Tilman Schmidt                          E-Mail: tilman@imap.cc
Bonn, Germany
- Undetected errors are handled as if no error occurred. (IBM) -


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 21:46           ` Tilman Schmidt
@ 2007-05-03 10:12             ` Andi Kleen
  0 siblings, 0 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-03 10:12 UTC (permalink / raw)
  To: Tilman Schmidt
  Cc: Andi Kleen, Andrew Morton, Mathieu Desnoyers, linux-kernel, rusty,
	wfg

On Wed, May 02, 2007 at 11:46:40PM +0200, Tilman Schmidt wrote:
> On 02.05.2007 19:49, Andi Kleen wrote:
> > And also I think when something is merged it should have some users in tree.
> 
> Isn't that a circular dependency?

The normal mode of operation is to merge the initial users and the 
subsystem at the same time.

-Andi

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02 10:44     ` Andi Kleen
  2007-05-02 16:37       ` Frank Ch. Eigler
  2007-05-02 16:47       ` Andrew Morton
@ 2007-05-02 17:19       ` Mathieu Desnoyers
  2 siblings, 0 replies; 233+ messages in thread
From: Mathieu Desnoyers @ 2007-05-02 17:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel, rusty, wfg

* Andi Kleen (andi@firstfloor.org) wrote:
> > It is currently used as an instrumentation infrastructure for the LTTng
> > tracer at IBM, Google, Autodesk, Sony, MontaVista and deployed in
> > WindRiver products.  The SystemTAP project also plans to use this type of
> > infrastructure to trace sites hard to instrument. The Linux Kernel
> > Markers has the support of Frank C. Eigler, author of their current
> > marker alternative (which he wishes to drop in order to adopt the
> > markers infrastructure as soon as it hits mainline).
> 
> All of the above don't use mainline kernels.
> That doesn't constitute using it.
> 

I am afraid this argument does not hold :

- These companies are not shipping their products with mainline kernels
  to make sure things have time to stabilize.
- They eventually get to the next version some time after it is not
  "head" anymore. They still want to benefit from the features of the
  newer versions.
- All these companies would be really happy to have a marker
  infrastructure in mainline so they can stop applying a separate set of
  patches to provide this functionality.
- Arguing the fact that "they apply their set of patches anyway" goes
  against the advice I have received from Greg KH, which can be
  reworded as: please submit your patches to mainline instead of
  keeping your separate set of patches. See his various presentations
  about "mainlining" for more info about this.

Because of these 4 arguments, I think that these companies can be
considered as users and contributors of/to mainline kernels.

> > Quoting Jim Keniston <jkenisto@us.ibm.com> :
> > 
> > "kprobes remains a vital foundation for SystemTap.  But markers are
> > attractive as an alternate source of trace/debug info.  Here's why:
> > [...]"
> 
> Talk is cheap. Do they have working code to use it?
> 

LTTng has been using the markers for about 6 months now. SystemTAP is
waiting on the "it hits mainline" signal before they switch from their
STP_MARK() markers to this infrastructure. Give them a few days and they
will proceed to the change.

> >   - Allow per-architecture optimized versions which removes the need for
> >     a d-cache based branch (patch a "load immediate" instruction
> >     instead). It minimizes the d-cache impact of the disabled markers.
> 
> That's a good idea in general, but should be generalized (available
> independently), not hidden in your subsystem. I know a couple of places
> who could use this successfully.
> 

I agree that an efficient hooking mechanism is useful to many; I can list
at least security hooks and instrumentation for tracing. What other
usage scenario do you have in mind that could not fit in my marker
infrastructure ? I have tried to generalize this as much as possible,
but if you see, within this, a piece of infrastructure that could be
taken apart and used more widely, I will be happy to submit it
separately to increase its usefulness.

> >   - Accept the cost of an unlikely branch at the marker site because the
> >     gcc compiler does not give the ability to put "nops" instead of a
> >     branch generated from C code. Keep this in mind for future
> >     per-architecture optimizations.
> 
> See upcoming paravirt code for a way to do this.
> 

I have looked at the paravirt code in Andrew's 2.6.21-rc7-mm2. A few
reasons why I do not plan to use it :

1 - It requires specific arg setup for the calls to be crafted by hand,
in assembly, for each and every number of parameters and each type, for
each architecture. I use a variable argument list as a parameter to my
marker to make sure that a single macro can be used for markup in a
generic manner.

Quoting : http://lkml.org/lkml/2007/4/4/577
"+ * Unfortunately there's no way to get gcc to generate the args setup
+ * for the call, and then allow the call itself to be generated by an
+ * inline asm.  Because of this, we must do the complete arg setup and
+ * return value handling from within these macros.  This is fairly
+ * cumbersome."

2 - I also provide an architecture independent "generic" version which
does not depend on per-architecture assembly. From what I see, paravirt
is only offered for i386 and x86_64. Are there any plans to support the
other ~12 architectures ? Does it offer an architecture-agnostic fallback
in the cases where it is not implemented for a given architecture ?

3 - It can only alter instructions "safely" in the UP case before the
other CPUs are turned on. See my arch/i386/marker.c code patcher for
XMC-safe instruction patching. Marker activation must be done at
runtime, when the system is fully operational.

Quoting 2.6.21 arch/i386/kernel/alternative.c
"/* Replace instructions with better alternatives for this CPU type.
   This runs before SMP is initialized to avoid SMP problems with
   self modifying code. This implies that assymetric systems where
   APs have less capabilities than the boot processor are not handled.
   Tough. Make sure you disable such features by hand. */

void apply_alternatives(struct alt_instr *start, struct alt_instr *end)"

4 - paravirt does not offer the ability to replace a branch instruction,
generated by gcc, through its mechanism. If I choose to use paravirt
mechanism, I must do the stack setup and function call by hand, which
has been argued in points (1) and (2). GCC must itself generate the
branch instruction to jump over the function call containing the
variable argument list.
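
To make the trade-off in points (1) to (4) concrete, here is a minimal
sketch, not the code from the marker patches, of a marker whose disabled
cost is one flag load and one unlikely branch, with the compiler itself
building the variable argument list. The names MARK_SKETCH, struct marker,
marker_call and record_probe are hypothetical, and the sketch assumes a
GCC-compatible compiler for __builtin_expect.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not the API of the marker patches. */
#define unlikely(x) __builtin_expect(!!(x), 0)

typedef void (*probe_fn)(const char *fmt, va_list ap);

struct marker {
    int enabled;     /* set at runtime when a probe is connected */
    probe_fn probe;  /* called only while enabled is non-zero */
};

/* Out-of-line call: the compiler sets up the variable arguments itself. */
static void marker_call(struct marker *m, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    m->probe(fmt, ap);
    va_end(ap);
}

/* A disabled marker costs one load and one branch predicted not taken. */
#define MARK_SKETCH(m, ...)                      \
    do {                                         \
        if (unlikely((m)->enabled))              \
            marker_call((m), __VA_ARGS__);       \
    } while (0)

static char last_event[64];

/* Example probe: formats the event into a static buffer. */
static void record_probe(const char *fmt, va_list ap)
{
    vsnprintf(last_event, sizeof(last_event), fmt, ap);
}
```

With enabled left at 0, MARK_SKETCH(&m, "irq %d", 7) skips the call
entirely; setting the flag routes the same site to the probe. The
per-architecture optimization discussed above would replace the flag load
with a patched immediate value instead of a d-cache access.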

> > - Instrumentation of challenging kernel sites
> >   - Instrumentation such as the one provided in the already existing
> >     Lock dependency checker (lockdep) and instrumentation of trap
> >     handlers implies being reentrant for such context. Therefore, the
> >     implementation must be lock-free and update the state in an atomic
> >     fashion (rcu-style). It must also let the programmer who describes
> >     a marker site the ability to specify what is forbidden in the probe
> >     that will be connected to the marker : can it generate a trap ? Can
> >     it call lockdep (irq disable, take any type of lock), can it call
> >     printk ? This is why flags can be passed to the _MARK() marker,
> >     while the MARK() marker has the default flags.
> 
> Why can't you just generally forbid probes from doing all of this?
> It would greatly simplify your code, wouldn't it?
> 
> Keep it simple please.
> 

An example, taken from the marker mechanism itself (no probe involved)
shows how difficult it can be to "forbid all of this" :

The optimized version patches code while the system is live. This
implies cross-modifying code in an SMP environment. It can be done safely
on x86 and x86_64 by using a breakpoint during the code modification to
make sure the CPU issues a serializing instruction between the moment a
given CPU speculates the code execution and actually reaches it. It
implies going through a trap, which does funny things such as enabling
interrupts, which calls into lockdep. Therefore, adding a marker into
the lockdep code cannot be done with a breakpoint-based marker on these
architectures. We have to provide an alternative way to do this, less
intrusive, which is exactly what the "generic" markers provide. The same
applies to instrumentation of the breakpoint trap handler.

I strongly doubt that _every_ user of the markers would be comfortable
with the "write your code so it does not take any lock and does
everything atomically" constraint. I have done it in LTTng so I could
have a fully reentrant tracer, but even then you can be limited by the
nature of where you want to send the data. Richard Purdie implemented a
serial port based data relay as an alternative data relay mechanism
connected to LTTng; he needed a spinlock because of the semantics of his
port, so he has to accept the limitation regarding the sites that can
and cannot be probed. Providing an explicit declaration of site
limitations makes sense in this regard.

On other architectures, it is the time source which requires a read
seqlock. It is not atomic in the sense that a reader can nest over a
writer (if coming from NMI context) and spin forever.
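
The nesting problem can be shown with a sketch of a sequence-counter
reader. This is an illustration under simplifying assumptions (C11
atomics, a single value, hypothetical names), not the kernel's seqlock
code. A reader retries while the counter is odd; if it runs in NMI context
on top of the writer it interrupted, the counter never becomes even, so an
unbounded loop would spin forever. The sketch bounds the retries so that
the failure becomes observable.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical time source protected by a sequence counter. */
struct seq_time {
    atomic_uint seq;   /* odd while a writer is in the middle of an update */
    atomic_ulong ns;   /* the protected value */
};

static void time_write(struct seq_time *t, unsigned long ns)
{
    atomic_fetch_add(&t->seq, 1);  /* counter becomes odd: update started */
    atomic_store(&t->ns, ns);
    atomic_fetch_add(&t->seq, 1);  /* counter becomes even: update done */
}

/* Bounded read: gives up after max_tries instead of spinning forever. */
static bool time_read(struct seq_time *t, unsigned long *out, int max_tries)
{
    while (max_tries-- > 0) {
        unsigned int start = atomic_load(&t->seq);
        unsigned long ns;

        if (start & 1)
            continue;              /* writer active: retry */
        ns = atomic_load(&t->ns);
        if (atomic_load(&t->seq) == start) {
            *out = ns;             /* no writer ran during the read */
            return true;
        }
    }
    return false;
}
```

Leaving the counter odd, as an interrupted writer would, makes time_read()
fail on every attempt. A probe that reads such a time source is therefore
not safe at a site reachable from NMI context.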

I can list a lot of situations where we cannot _require_ the probe to
run atomically in every aspect; so generally forbidding these actions
does not seem to be a viable solution. In fact, this would be the best
way to make sure the marker infrastructure is never used by early
adopters because of the complexity level of writing probes, due to these
"rules".

> > Please tell me if I forgot to explain the rationale behind some
> > implementation detail and I will be happy to explain in more depth.
> 
> Having lots of flags to do things differently optionally normally
> starts up all warning lights of early over-design. While Linux
> has this sometimes it is generally only in mature old subsystems.
> But when something is freshly merged it shouldn't be like this.
> That is because code tends to grow more complicated over its lifetime
> and when it is already complicated at the beginning it will eventually
> fall over (you can study current slab as a poster child of this)
> 
> -Andi
> 

Explicitly identifying "hard to instrument" sites is nothing new. It has
been done in different manners in the past. Kprobes sprinkles
"__kprobes" declarations before function declarations all over the place
to specify which ones cannot be safely instrumented. It results in a
visually less appealing source code and it limits the sites that can be
probed.

The goal of the marker infrastructure is exactly to instrument those
sites.  Therefore, the approach "we forbid instrumentation of sites hard
to instrument" misses the point of this infrastructure. We can leverage
the fact that the marker is put in a context known by the programmer; it
makes sense to give him the ability to specify what are the restrictions
on the probes connected to this marker with some level of granularity.
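
As a rough illustration of that granularity (a sketch with invented names,
not the flags actually passed to _MARK()), a site can declare what it
forbids, a probe can declare what it does, and the connection is refused
when the two overlap:

```c
#include <stdbool.h>

/* Hypothetical behaviour bits; the real marker flags may differ. */
enum {
    BEH_TRAP   = 1 << 0,  /* may generate a trap */
    BEH_LOCK   = 1 << 1,  /* may take a lock or disable interrupts */
    BEH_PRINTK = 1 << 2,  /* may call printk */
};

struct site_sketch {
    unsigned int forbidden;  /* declared by whoever places the marker */
};

struct probe_sketch {
    unsigned int uses;       /* declared by the probe author */
};

/* A probe may connect only if it uses nothing the site forbids. */
static bool probe_may_connect(const struct site_sketch *site,
                              const struct probe_sketch *probe)
{
    return (site->forbidden & probe->uses) == 0;
}
```

Under this scheme a marker inside lockdep would forbid BEH_TRAP and
BEH_LOCK, so a spinlock-based relay probe would be rejected there but
accepted at an ordinary site.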

Regards,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-01 12:17 ` Andi Kleen
  2007-05-01 22:08   ` Mathieu Desnoyers
@ 2007-05-02  0:31   ` Rusty Russell
  2007-05-02 10:30     ` Andi Kleen
  1 sibling, 1 reply; 233+ messages in thread
From: Rusty Russell @ 2007-05-02  0:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, linux-kernel

On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> > Will merge the rustyvisor.
> 
> IMHO the user code still doesn't belong in Documentation.
> Also it needs another review round I guess. And some beta testing by
> more people.

Like any piece of code more review and more testing would be great.
(Your earlier review was particularly useful!).  But it's not clear that
waiting for longer will achieve either.

Look at kvm's experience for the reverse case: it went in, then got
rewritten.

As for the code in Documentation, my initial attempts tried to get
around the need for a userspace part by putting everything in the kernel
module.  It meant you could launch a guest by writing a string
to /dev/lguest (no real ABI burden there), but it's a worse solution
than some user code in the kernel tree 8(

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-05-02  0:31   ` Rusty Russell
@ 2007-05-02 10:30     ` Andi Kleen
  0 siblings, 0 replies; 233+ messages in thread
From: Andi Kleen @ 2007-05-02 10:30 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Andi Kleen, Andrew Morton, linux-kernel

On Wed, May 02, 2007 at 10:31:10AM +1000, Rusty Russell wrote:
> On Tue, 2007-05-01 at 14:17 +0200, Andi Kleen wrote:
> > Andrew Morton <akpm@linux-foundation.org> writes:
> > > Will merge the rustyvisor.
> > 
> > IMHO the user code still doesn't belong in Documentation.
> > Also it needs another review round I guess. And some beta testing by
> > more people.
> 
> Like any piece of code more review and more testing would be great.
> (Your earlier review was particularly useful!).  But it's not clear that
> waiting for longer will achieve either.

Not clear to me. Release a clear lguest patchkit with documentation
on l-k several times and you'll probably get both reviewers and testers.
Then confidence level will rise.

> 
> Look at kvm's experience for the reverse case: it went in, then got
> rewritten.

They at least already had some user base at this point.

> As for the code in Documentation, my initial attempts tried to get
> around the need for a userspace part by putting everything in the kernel
> module.  It meant you could launch a guest by writing a string
> to /dev/lguest (no real ABI burden there), but it's a worse solution
> than some user code in the kernel tree 8(

Just put it into a separate tarball.

-Andi

^ permalink raw reply	[flat|nested] 233+ messages in thread

* file capabilities and security_task_wait failure Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (15 preceding siblings ...)
  2007-05-01 12:17 ` Andi Kleen
@ 2007-05-01 13:06 ` Stephen Smalley
  2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Stephen Smalley @ 2007-05-01 13:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, James Morris, Eric Paris, Serge E. Hallyn,
	Chris Wright, linuxfs, Christoph Hellwig

On Mon, 2007-04-30 at 16:20 -0700, Andrew Morton wrote:
>  implement-file-posix-capabilities.patch
>  file-capabilities-accomodate-future-64-bit-caps.patch
>  return-eperm-not-echild-on-security_task_wait-failure.patch
> 
> I think we're still waiting for the security guys to work out what to do with
> this work.

return-eperm-not-echild-on-security_task_wait-failure.patch should be
merged - it is effectively a bug fix.

On the file capabilities support, have any of the filesystem folks
(cc'd) looked at the code yet?

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (16 preceding siblings ...)
  2007-05-01 13:06 ` file capabilities and security_task_wait failure " Stephen Smalley
@ 2007-05-01 14:31 ` Hugh Dickins
  2007-05-02  1:42   ` Nick Piggin
  2007-05-01 16:56 ` 2.6.22 -mm merge plans Zan Lynx
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-01 14:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, linux-mm

On Mon, 30 Apr 2007, Andrew Morton wrote:
>... 
>  mm-more-rmap-checking.patch
>...
> 
> Misc MM things.  Will merge.

Would Nick mind very much if I ask you to drop this one?
You did CC me ages ago, but I've only just run across it.
It's a small matter, but I'd prefer it dropped for now.

>> Re-introduce rmap verification patches that Hugh removed when he removed
>> PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
>> anonymous pages, because PG_locked and PTL together already do.
>> 
>> These checks were important in discovering and fixing a rare rmap corruption
>> in SLES9.

It introduces some silly checks which were never in mainline,
nor so far as I can tell in SLES9: I'm thinking of those
+	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
There are few callsites for these rmap functions, I don't think
they need to be checking their arguments in that way.

It also changes the inline page_dup_rmap (a single atomic increment)
into a bugchecking out-of-line function: do we really want to slow
down fork in that way, for 2.6.22 to fix a rare corruption in SLES9?

What I really like about the patch is Nick's observation that my
	/* else checking page index and mapping is racy */
is no longer true: a change we made to the do_swap_page sequence
some while ago has indeed cured that raciness, and I'm happy to
reintroduce the check on mapping and index in page_add_anon_rmap,
and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling
out of fashion very recently).

That becomes more important when I send the patches to free up
PG_swapcache, using a PAGE_MAPPING_SWAP bit instead: so I was
planning to include that part of Nick's patch in that series.

Hugh

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
  2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins
@ 2007-05-02  1:42   ` Nick Piggin
  2007-05-02 13:17     ` Hugh Dickins
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-02  1:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Mon, 30 Apr 2007, Andrew Morton wrote:
> 
>>... 
>> mm-more-rmap-checking.patch
>>...
>>
>>Misc MM things.  Will merge.
> 
> 
> Would Nick mind very much if I ask you to drop this one?
> You did CC me ages ago, but I've only just run across it.
> It's a small matter, but I'd prefer it dropped for now.

I guess I would prefer it to go under CONFIG_DEBUG_VM. Speaking
of which, it would be nice to be able to turn that on unconditionally
in -rc1. Although I may have put a few too many things under it, so
it might slow down too much...
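
A minimal sketch of what putting the check under a debug option could look
like. All names here are hypothetical stand-ins: SKETCH_DEBUG_VM plays the
role of CONFIG_DEBUG_VM, and the structures are reduced to the fields the
check needs. It is not the code from mm-more-rmap-checking.patch.

```c
#include <assert.h>

/* SKETCH_DEBUG_VM stands in for CONFIG_DEBUG_VM; all names are hypothetical. */
#ifdef SKETCH_DEBUG_VM
#define SKETCH_VM_BUG_ON(cond) assert(!(cond))
#else
/* Compiled out: the condition is type-checked but never evaluated. */
#define SKETCH_VM_BUG_ON(cond) ((void)sizeof(cond))
#endif

struct vma_sketch {
    unsigned long vm_start;
    unsigned long vm_end;
};

struct page_sketch {
    int mapcount;  /* plain int here; the kernel uses an atomic counter */
};

/* Stays a single inline increment unless the debug option is enabled. */
static inline void page_dup_rmap_sketch(struct page_sketch *page,
                                        const struct vma_sketch *vma,
                                        unsigned long address)
{
    SKETCH_VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
    page->mapcount++;
}
```

Built without SKETCH_DEBUG_VM, the function reduces to the increment, so
the fork path pays nothing for the range check; built with it, an address
outside the vma aborts immediately.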


>>>Re-introduce rmap verification patches that Hugh removed when he removed
>>>PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
>>>anonymous pages, because PG_locked and PTL together already do.
>>>
>>>These checks were important in discovering and fixing a rare rmap corruption
>>>in SLES9.
> 
> 
> It introduces some silly checks which were never in mainline,
> nor so far as I can tell in SLES9: I'm thinking of those
> +	BUG_ON(address < vma->vm_start || address >= vma->vm_end);

Yes, but IIRC I put that in because there was another check in
SLES9 that I actually couldn't put in, but used this one instead
because it also caught the bug we saw.


> There are few callsites for these rmap functions, I don't think
> they need to be checking their arguments in that way.
> 
> It also changes the inline page_dup_rmap (a single atomic increment)
> into a bugchecking out-of-line function: do we really want to slow
> down fork in that way, for 2.6.22 to fix a rare corruption in SLES9?

This was actually a rare corruption that is also in 2.6.21, and
as few rmap callsites as we have, it was never noticed until the
SLES9 bug check was triggered.


> What I really like about the patch is Nick's observation that my
> 	/* else checking page index and mapping is racy */
> is no longer true: a change we made to the do_swap_page sequence
> some while ago has indeed cured that raciness, and I'm happy to
> reintroduce the check on mapping and index in page_add_anon_rmap,
> and his BUG_ON(!PageLocked(page)) there (despite BUG_ONs falling
> out of fashion very recently).

Hmm, I didn't notice the do_swap_page change, rather just derived
its safety by looking at the current state of the code (which I
guess must have been post-do_swap_page change)...

Do you have a pointer to the patch, for my interest?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
  2007-05-02  1:42   ` Nick Piggin
@ 2007-05-02 13:17     ` Hugh Dickins
  2007-05-03  0:18       ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 13:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Nick Piggin wrote:
> 
> Yes, but IIRC I put that in because there was another check in
> SLES9 that I actually couldn't put in, but used this one instead
> because it also caught the bug we saw.
>... 
> This was actually a rare corruption that is also in 2.6.21, and
> as few rmap callsites as we have, it was never noticed until the
> SLES9 bug check was triggered.

You are being very mysterious.  Please describe this bug (privately
if you think it's exploitable), and let's work on the patch to fix it,
rather than this "debug" patch.

> Hmm, I didn't notice the do_swap_page change, rather just derived
> its safety by looking at the current state of the code (which I
> guess must have been post-do_swap_page change)...

Your addition of page_add_new_anon_rmap clarified the situation too.

> Do you have a pointer to the patch, for my interest?

The patch which changed do_swap_page?

commit c475a8ab625d567eacf5e30ec35d6d8704558062
Author: Hugh Dickins <hugh@veritas.com>
Date:   Tue Jun 21 17:15:12 2005 -0700
[PATCH] can_share_swap_page: use page_mapcount

Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch,
which does assume PageLocked in page_add_anon_rmap?
Yes, I can send you its current unsplit state if you like
(but have higher priorities before splitting and commenting
it for posting).

Hugh

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: mm-more-rmap-checking
  2007-05-02 13:17     ` Hugh Dickins
@ 2007-05-03  0:18       ` Nick Piggin
  0 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-03  0:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

Hugh Dickins wrote:
> On Wed, 2 May 2007, Nick Piggin wrote:
> 
>>Yes, but IIRC I put that in because there was another check in
>>SLES9 that I actually couldn't put in, but used this one instead
>>because it also caught the bug we saw.
>>... 
>>This was actually a rare corruption that is also in 2.6.21, and
>>as few rmap callsites as we have, it was never noticed until the
>>SLES9 bug check was triggered.
> 
> 
> You are being very mysterious.  Please describe this bug (privately
> if you think it's exploitable), and let's work on the patch to fix it,
> rather than this "debug" patch.

It is exec-fix-remove_arg_zero.patch in Andrew's tree. It's exploitable
in that it leaks memory, but it could also release corrupted pagetables
into quicklists on those architectures that have them...

Anyway, it quite likely would have gone unfixed for several more years
if we didn't have the bug triggers in. Now you could argue that my
patch obviously fixes all bugs in there (but I wouldn't :)), and being
most complex of the few callsites, _now_ we can avoid the bug checks.
However I'd prefer to keep them at least under CONFIG_DEBUG_VM.


>>Hmm, I didn't notice the do_swap_page change, rather just derived
>>its safety by looking at the current state of the code (which I
>>guess must have been post-do_swap_page change)...
> 
> 
> Your addition of page_add_new_anon_rmap clarified the situation too.
> 
> 
>>Do you have a pointer to the patch, for my interest?
> 
> 
> The patch which changed do_swap_page?
> 
> commit c475a8ab625d567eacf5e30ec35d6d8704558062
> Author: Hugh Dickins <hugh@veritas.com>
> Date:   Tue Jun 21 17:15:12 2005 -0700
> [PATCH] can_share_swap_page: use page_mapcount


Yeah, this one, thanks. I'm just interested.


> Or my intended PG_swapcache to PAGE_MAPPING_SWAP patch,
> which does assume PageLocked in page_add_anon_rmap?
> Yes, I can send you its current unsplit state if you like
> (but have higher priorities before splitting and commenting
> it for posting).

I would like to see that too, but when you are ready :)

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (17 preceding siblings ...)
  2007-05-01 14:31 ` 2.6.22 -mm merge plans: mm-more-rmap-checking Hugh Dickins
@ 2007-05-01 16:56 ` Zan Lynx
  2007-05-01 17:06 ` 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix Hugh Dickins
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Zan Lynx @ 2007-05-01 16:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Mon, 2007-04-30 at 16:20 -0700, Andrew Morton wrote:
[snip]
> Mel's moveable-zone work.
> 
> I don't believe that this has had sufficient review and I'm sure that it
> hasn't had sufficient third-party testing.  Most of the approbations thus far
> have consisted of people liking the overall idea, based on the changelogs and
> multi-year-old discussions.
> 
> For such a large and core change I'd have expected more detailed reviewing
> effort and more third-party testing.  And I STILL haven't made time to review
> the code in detail myself.
[snip]

I am a fan of this, but I hadn't really realized that it's in -mm, and
that it has to be enabled with kernelcore=

Now that I am, I'm running it on my laptop with kernelcore=256M (it
wouldn't boot with 128M or less, weird initscript errors and OOMs).

1 GB single-core laptops are probably not the intended test audience :)
But I'll see what happens.
-- 
Zan Lynx <zlynx@acm.org>

* Re: 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (18 preceding siblings ...)
  2007-05-01 16:56 ` 2.6.22 -mm merge plans Zan Lynx
@ 2007-05-01 17:06 ` Hugh Dickins
  2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-01 17:06 UTC (permalink / raw)
  To: Andrew Morton; +Cc: akuster, Ken Chen, linux-kernel, linux-mm

On Mon, 30 Apr 2007, Andrew Morton wrote:
> ...
>  mm-detach_vmas_to_be_unmapped-fix.patch
> ...
> Misc MM things.  Will merge.

No, I think that one is just drifting like flotsam towards mainline,
because nobody at all has yet found time to look at it.  And Mr Akuster
appears not to have signed off on it yet.  I've given it a quick look
now, and it seems to be based on misdescription and misconception.

> From: <akuster@mvista.com>
> 
> Wolfgang Wander submitted a fix to address a mmap fragmentation issue.  The
> git patch ( 1363c3cd8603a913a27e2995dccbd70d5312d8e6 ) is somewhat different
> and yields different results when running Wolfgang's test case leakme.c.

Ken did a lot of the work on that I believe: I certainly wouldn't
want to see this patch go in without his Ack.  (I've never done
any work on unmapped area heuristics, but detach_vmas_to_be_unmapped
always catches my eye.)

> 
> IMHO, the vm start and end address are swapped in arch_unmap_area and
> arch_unmap_area_topdown functions.

I disagree.

> 
> Prior to this patch arch_unmap_area() used area->vm_start and
> arch_unmap_area_topdown used area->vm_end

Yes (where area is the vma being unmapped).

> in the git patch the following change showed up.
> 
> if (mm->unmap_area == arch_unmap_area)
>      addr = prev ? prev->vm_start : mm->mmap_base;
> else
>      addr = vma ?  vma->vm_end : mm->mmap_base;

No, that's not what showed up in the git patch: that's what the
patch below is trying to change it to.  The git patch said
	addr = prev ? prev->vm_end : mm->mmap_base
for the bottomup case i.e. setting the unmapped area to the
end of the vma below; and
	addr = vma ? vma->vm_start: mm->mmap_base;
for the topdown case i.e. setting the unmapped area to the
beginning of the vma above.

That seems to me consistent with what was done before, but pushing
the bounds out across any hole, for presumably better behaviour.
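
The rule described above can be sketched in a few lines of C. This is only an
illustration, not the code in mm/mmap.c: struct vma_sketch and unmap_hint()
are hypothetical names, and the two vma arguments stand for the neighbours
left on either side of the hole once the unmapped vmas have been detached.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's vm_area_struct (illustration only). */
struct vma_sketch {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_end;	/* first address past the mapping */
};

/*
 * Pick the address that would be handed to mm->unmap_area() after a range
 * has been detached.
 *   below: the vma under the hole (NULL if there is none)
 *   above: the vma over the hole  (NULL if there is none)
 * Bottom-up search restarts at the end of the vma below; top-down search
 * restarts at the start of the vma above. Either way the hint lands on the
 * edge of the free hole, never inside an occupied area.
 */
static unsigned long unmap_hint(int bottomup,
				const struct vma_sketch *below,
				const struct vma_sketch *above,
				unsigned long mmap_base)
{
	/* Sanity check: the neighbours must not overlap. */
	assert(!below || !above || below->vm_end <= above->vm_start);

	if (bottomup)
		return below ? below->vm_end : mmap_base;
	return above ? above->vm_start : mmap_base;
}
```

Swapping vm_end and vm_start in this sketch would return an address inside
one of the neighbouring mappings, which is the mispositioning objected to
further down.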

> 
> Using Wolfgang Wander's leakme.c test, I get the same results seen with his
> original "Avoiding mmap fragmentation" patch as I do after swapping the start
> & end address in the above code segment.  The patch I submitted addresses this
> typo issue.

I'm pretty sure it is not a typo.  I did a very hasty test with two
aLLocator .c progs Wolfgang posted (one unnamed, one named leakme4.c),
on x86_64, and got apparently the same successful result with and
without the patch below.  In my case, it's probably just slightly
slowing down the algorithm, by demanding an additional find_vma()
because it mispositions mm->free_area_cache to an occupied area.
I don't see how it could ever be an improvement, but I've not
spent long enough checking out that code.

I bet there's improvements that could be made there, but
this patch looks wrong - please don't rush it into 2.6.22
(personally I'd say drop it, but I'd rather Ken takes a look).

Hugh

> 
> 
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/mmap.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff -puN mm/mmap.c~mm-detach_vmas_to_be_unmapped-fix mm/mmap.c
> --- a/mm/mmap.c~mm-detach_vmas_to_be_unmapped-fix
> +++ a/mm/mmap.c
> @@ -1723,9 +1723,9 @@ detach_vmas_to_be_unmapped(struct mm_str
>  	*insertion_point = vma;
>  	tail_vma->vm_next = NULL;
>  	if (mm->unmap_area == arch_unmap_area)
> -		addr = prev ? prev->vm_end : mm->mmap_base;
> +		addr = prev ? prev->vm_start : mm->mmap_base;
>  	else
> -		addr = vma ?  vma->vm_start : mm->mmap_base;
> +		addr = vma ?  vma->vm_end : mm->mmap_base;
>  	mm->unmap_area(mm, addr);
>  	mm->mmap_cache = NULL;		/* Kill the cache. */
>  }

* Re: 2.6.22 -mm merge plans: slub
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (19 preceding siblings ...)
  2007-05-01 17:06 ` 2.6.22 -mm merge plans: mm-detach_vmas_to_be_unmapped-fix Hugh Dickins
@ 2007-05-01 18:10 ` Hugh Dickins
  2007-05-01 19:25   ` Christoph Lameter
  2007-05-01 19:55   ` Andrew Morton
  2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
  2007-05-07 17:47 ` Josef Sipek
  22 siblings, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-01 18:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Mon, 30 Apr 2007, Andrew Morton wrote:
> 
>  i386-use-page-allocator-to-allocate-thread_info-structure.patch
>  slub-core.patch
> 
> slub.  Or part thereof.  This is another patch series which got messed up by
> poor patch sequencing.
> 
>  make-page-private-usable-in-compound-pages-v1.patch
>  optimize-compound_head-by-avoiding-a-shared-page.patch
>  add-virt_to_head_page-and-consolidate-code-in-slab-and-slub.patch
>  slub-fix-object-tracking.patch
>  slub-enable-tracking-of-full-slabs.patch
>  slub-validation-of-slabs-metadata-and-guard-zones.patch
>  slub-add-min_partial.patch
>  slub-add-ability-to-list-alloc--free-callers-per-slab.patch
>  slub-free-slabs-and-sort-partial-slab-lists-in-kmem_cache_shrink.patch
>  slub-remove-object-activities-out-of-checking-functions.patch
>  slub-user-documentation.patch
>  slub-add-slabinfo-tool.patch
> 
> Most of the rest of slub.  Will merge it all.

Merging slub already?  I'm surprised.  That's a very key piece of
infrastructure, and I doubt it's had the exposure it needs yet.

Just what has it been widely tested on so far?  x86_64.  Not many
of us have ia64, but I guess SGI people will have been trying it
on that.  Not i386, that's excluded.

Not powerpc - hmm, I thought that was known, but looking I see no
ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up,
crashes in slab_free from pgtable_free_tlb from free_pte_range from
free_pgd_range from free_pgtables from unmap_region from do_munmap.
That's 2.6.21-rc7-mm2.

slob has a justified place at the low end, but do we want some
people running with slab and some with slub?  I'd expected slub
to stay in 2.6.22-mm, and have all the architectures cut over to
it in that time, before advancing to mainline.

I've nothing against slub in itself, though I'm wary of its
cache merging (more scope for one corrupting another) (and
sometimes I think Christoph spent one life uglifying slab for
NUMA, then another life ripping that all out to make slub ;)

Hugh

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins
@ 2007-05-01 19:25   ` Christoph Lameter
  2007-05-01 19:55   ` Andrew Morton
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-01 19:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Tue, 1 May 2007, Hugh Dickins wrote:

> > Most of the rest of slub.  Will merge it all.
> 
> Merging slub already?  I'm surprised.  That's a very key piece of
> infrastructure, and I doubt it's had the exposure it needs yet.

It's not the default. It's just an alternative like SLOB. It will take some 
time to test with various loads in order to see if it can really replace
SLAB in all scenarios.

> Just what has it been widely tested on so far?  x86_64.  Not many
> of us have ia64, but I guess SGI people will have been trying it
> on that.  Not i386, that's excluded.

There is an i386 patch pending and I have used it on i386 for a while.

> Not powerpc - hmm, I thought that was known, but looking I see no
> ARCH_USES_SLAB_PAGE_STRUCT there: just built and tried to run it up,
> crashes in slab_free from pgtable_free_tlb from free_pte_range from
> free_pgd_range from free_pgtables from unmap_region from do_munmap.
> That's 2.6.21-rc7-mm2.

Hmmm... True, I have not spent any time with that platform. We can set 
ARCH_USES_SLAB_PAGE_STRUCT there to switch it off. SLUB is the default for 
mm so I am a bit surprised that this did not surface earlier.

> I've nothing against slub in itself, though I'm wary of its
> cache merging (more scope for one corrupting another) (and

Yes but then SLUB has more diagnostics etc etc than SLAB to prevent any 
issues. In debug mode all slabs are separate. The merge feature is very 
stable these days and significantly reduces cache overhead problems 
that plague SLAB and require it to have a complex object expiration 
technique. As a result I was able to rip out all timers. SLUB has no cache 
reaper nor any timer. It's silent if not in use.

> sometimes I think Christoph spent one life uglifying slab for
> NUMA, then another life ripping that all out to make slub ;)

SLAB has a certain paradigm of doing things (queues) and I had to work 
within that framework. It was a group effort. SLUB is an answer to those 
complaints and a result of the lessons learned through years of some 
painful slab debugging. SLUB makes debugging extremely easy (and also the 
design is very simple and comprehensible). No rebuilding of the kernel. 
Just pop in a debug option on the command line which can even be targeted 
to a slab cache if we know that things break there.

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins
  2007-05-01 19:25   ` Christoph Lameter
@ 2007-05-01 19:55   ` Andrew Morton
  2007-05-01 20:19     ` Hugh Dickins
  1 sibling, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 19:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Tue, 1 May 2007 19:10:29 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> > Most of the rest of slub.  Will merge it all.
> 
> Merging slub already?  I'm surprised.

My thinking here is "does slub have a future".  I think the answer is
"yes", so we're reasonably safe getting it into mainline for the finishing
work.  The kernel.org kernel will still default to slab.

Does that sound wrong?


* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 19:55   ` Andrew Morton
@ 2007-05-01 20:19     ` Hugh Dickins
  2007-05-01 20:36       ` Andrew Morton
  2007-05-01 21:08       ` Christoph Lameter
  0 siblings, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-01 20:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Tue, 1 May 2007, Andrew Morton wrote:
> On Tue, 1 May 2007 19:10:29 +0100 (BST)
> Hugh Dickins <hugh@veritas.com> wrote:
> 
> > > Most of the rest of slub.  Will merge it all.
> > 
> > Merging slub already?  I'm surprised.
> 
> My thinking here is "does slub have a future".
> I think the answer is "yes",

I think I agree with that,
though it's a judgement I'd leave to you and others.

> so we're reasonably safe getting it into mainline for the finishing
> work.  The kernel.org kernel will still default to slab.
> 
> Does that sound wrong?

Yes, to me it does.  If it could be defaulted to on throughout the
-rcs, on every architecture, then I'd say that's "finishing work";
and we'd be safe knowing we could go back to slab in a hurry if
needed.  But it hasn't reached that stage yet, I think.

Hugh

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 20:19     ` Hugh Dickins
@ 2007-05-01 20:36       ` Andrew Morton
  2007-05-01 20:46         ` Christoph Lameter
  2007-05-02 12:54         ` Hugh Dickins
  2007-05-01 21:08       ` Christoph Lameter
  1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 20:36 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Tue, 1 May 2007 21:19:09 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> On Tue, 1 May 2007, Andrew Morton wrote:
> > On Tue, 1 May 2007 19:10:29 +0100 (BST)
> > Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > > > Most of the rest of slub.  Will merge it all.
> > > 
> > > Merging slub already?  I'm surprised.
> > 
> > My thinking here is "does slub have a future".
> > I think the answer is "yes",
> 
> I think I agree with that,
> though it's a judgement I'd leave to you and others.
> 
> > so we're reasonably safe getting it into mainline for the finishing
> > work.  The kernel.org kernel will still default to slab.
> > 
> > Does that sound wrong?
> 
> Yes, to me it does.  If it could be defaulted to on throughout the
> -rcs, on every architecture, then I'd say that's "finishing work";
> and we'd be safe knowing we could go back to slab in a hurry if
> needed.  But it hasn't reached that stage yet, I think.
> 

Given the current state and the current rate of development I'd expect slub
to have reached the level of completion which you're describing around -rc2
or -rc3.  I think we'd be pretty safe making that assumption.

This is a bit unusual but there is of course some self-interest here: the
patch dependencies are getting awful and having this hanging around
out-of-tree will make 2.6.23 development harder for everyone.

So on balance, given that we _do_ expect slub to have a future, I'm
inclined to crash ahead with it.  The worst that can happen will be a later
rm mm/slub.c which would be pretty simple to do.

otoh I could do some frantic patch mangling and make it easier to carry
slub out-of-tree, but do we gain much from that?

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 20:36       ` Andrew Morton
@ 2007-05-01 20:46         ` Christoph Lameter
  2007-05-01 21:09           ` Andrew Morton
  2007-05-02 12:54         ` Hugh Dickins
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-01 20:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Tue, 1 May 2007, Andrew Morton wrote:

> otoh I could do some frantic patch mangling and make it easier to carry
> slub out-of-tree, but do we gain much from that?

Then we may lose all the slab API cleanups? Yuck. I really do not want 
to redo those....


* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 20:46         ` Christoph Lameter
@ 2007-05-01 21:09           ` Andrew Morton
  0 siblings, 0 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-01 21:09 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Tue, 1 May 2007 13:46:26 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 1 May 2007, Andrew Morton wrote:
> 
> > otoh I could do some frantic patch mangling and make it easier to carry
> > slub out-of-tree, but do we gain much from that?
> 
> Then we may lose all the slab API cleanups? Yuck. I really do not want 
> to redo those....

No, I meant that I'd look at splitting those patches up into
one-against-mainline and one-against-slub.


* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 20:36       ` Andrew Morton
  2007-05-01 20:46         ` Christoph Lameter
@ 2007-05-02 12:54         ` Hugh Dickins
  2007-05-02 17:03           ` Christoph Lameter
  2007-05-02 18:52           ` Siddha, Suresh B
  1 sibling, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 12:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Tue, 1 May 2007, Andrew Morton wrote:
> 
> Given the current state and the current rate of development I'd expect slub
> to have reached the level of completion which you're describing around -rc2
> or -rc3.  I think we'd be pretty safe making that assumption.

Its developer does show signs of being active!

> 
> This is a bit unusual but there is of course some self-interest here: the
> patch dependencies are getting awful and having this hanging around
> out-of-tree will make 2.6.23 development harder for everyone.

That is a very strong argument: a somewhat worrisome argument,
but a very strong one.  Maintaining your sanity is important.

> 
> So on balance, given that we _do_ expect slub to have a future, I'm
> inclined to crash ahead with it.  The worst that can happen will be a later
> rm mm/slub.c which would be pretty simple to do.

Okay.  And there's been no chorus to echo my concern.

But if Linus' tree is to be better than a warehouse to avoid
awkward merges, I still think we want it to default to on for
all the architectures, and for most if not all -rcs.

> 
> otoh I could do some frantic patch mangling and make it easier to carry
> slub out-of-tree, but do we gain much from that?

No, keep away from that.

Hugh

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 12:54         ` Hugh Dickins
@ 2007-05-02 17:03           ` Christoph Lameter
  2007-05-02 19:11             ` Andrew Morton
  2007-05-02 18:52           ` Siddha, Suresh B
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 17:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> But if Linus' tree is to be better than a warehouse to avoid
> awkward merges, I still think we want it to default to on for
> all the architectures, and for most if not all -rcs.

At some point I dream that SLUB could become the default but I thought 
this would take at least 6 months or so. If we want to force this now then I 
will certainly have some busy weeks ahead.


* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 17:03           ` Christoph Lameter
@ 2007-05-02 19:11             ` Andrew Morton
  2007-05-02 19:42               ` Christoph Lameter
  0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 19:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007 10:03:50 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > But if Linus' tree is to be better than a warehouse to avoid
> > awkward merges, I still think we want it to default to on for
> > all the architectures, and for most if not all -rcs.
> 
> At some point I dream that SLUB could become the default but I thought 
> this would take at least 6 months or so. If we want to force this now then I 
> will certainly have some busy weeks ahead.

s/dream/promise/ ;)

Six months sounds reasonable - I was kind of hoping for less.  Make it
default-to-on in 2.6.23-rc1, see how it goes.

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:11             ` Andrew Morton
@ 2007-05-02 19:42               ` Christoph Lameter
  2007-05-02 19:54                 ` Sam Ravnborg
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007, Andrew Morton wrote:

> > At some point I dream that SLUB could become the default but I thought 
> > this would take at least 6 months or so. If we want to force this now then I 
> > will certainly have some busy weeks ahead.
> 
> s/dream/promise/ ;)
> 
> Six months sounds reasonable - I was kind of hoping for less.  Make it
> default-to-on in 2.6.23-rc1, see how it goes.

Here is how I think the future could develop

Cycle	SLAB		SLUB		SLOB		SLxB

2.6.22	API fixes	Stabilization	API fixes

Major event: SLUB availability as experimental

2.6.23	API upgrades	Perf. Valid.	EOL

Major events: SLUB performance validation. Switch off
	experimental (could even be the default) 
	Slab allocators support targeted reclaim for at
	least one slab cache (dentry?)
	(vacate/move all objects in a slab)

2.6.24	Earliest EOL	Stable		- 		Experiments

Major events: SLUB stable. Stable targeted reclaim
		for all major reclaimable slabs.
		Maybe experiments with another new allocator?

2.6.25	EOL		default		-		?

Death of SLAB. SLUB default. Hopefully new ideas on the horizon.


* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:42               ` Christoph Lameter
@ 2007-05-02 19:54                 ` Sam Ravnborg
  2007-05-02 20:14                   ` Christoph Lameter
  0 siblings, 1 reply; 233+ messages in thread
From: Sam Ravnborg @ 2007-05-02 19:54 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, Hugh Dickins, linux-kernel, linux-mm

On Wed, May 02, 2007 at 12:42:54PM -0700, Christoph Lameter wrote:
> On Wed, 2 May 2007, Andrew Morton wrote:
> 
> > > At some point I dream that SLUB could become the default but I thought 
> > > this would take at least 6 months or so. If we want to force this now then I 
> > > will certainly have some busy weeks ahead.
> > 
> > s/dream/promise/ ;)
> > 
> > Six months sounds reasonable - I was kind of hoping for less.  Make it
> > default-to-on in 2.6.23-rc1, see how it goes.
> 
> Here is how I think the future could develop
> 
> Cycle	SLAB		SLUB		SLOB		SLxB
> 
> 2.6.22	API fixes	Stabilization	API fixes
> 
> Major event: SLUB availability as experimental
> 
> 2.6.23	API upgrades	Perf. Valid.	EOL
> 
> Major events: SLUB performance validation. Switch off
> 	experimental (could even be the default) 
> 	Slab allocators support targeted reclaim for at
> 	least one slab cache (dentry?)
> 	(vacate/move all objects in a slab)

To facilitate this do NOT introduce CONFIG_SLAB until we decide
that SLUB is the default. In this way we can make CONFIG_SLUB the default
and people will not continue with CONFIG_SLAB because they had it in their
.config already.
Or just rename CONFIG_SLAB to CONFIG_SLAB_DEPRECATED or something.

The point is to make sure that SLUB becomes the default for people who do
a make oldconfig (explicit or implicit).

	Sam

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:54                 ` Sam Ravnborg
@ 2007-05-02 20:14                   ` Christoph Lameter
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 20:14 UTC (permalink / raw)
  To: Sam Ravnborg; +Cc: Andrew Morton, Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007, Sam Ravnborg wrote:

> To facilitate this do NOT introduce CONFIG_SLAB until we decide
> that SLUB is the default. In this way we can make CONFIG_SLUB the default
> and people will not continue with CONFIG_SLAB because they had it in their
> .config already.

We already have CONFIG_SLAB. If you use your existing .config then
you will stay with SLAB.

> The point is to make sure that SLUB becomes the default for people who do
> a make oldconfig (explicit or implicit).

Hmmmm... We can think about that when we actually want to make SLUB the 
default.


* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 12:54         ` Hugh Dickins
  2007-05-02 17:03           ` Christoph Lameter
@ 2007-05-02 18:52           ` Siddha, Suresh B
  2007-05-02 18:58             ` Christoph Lameter
  1 sibling, 1 reply; 233+ messages in thread
From: Siddha, Suresh B @ 2007-05-02 18:52 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-mm

On Wed, May 02, 2007 at 05:54:53AM -0700, Hugh Dickins wrote:
> On Tue, 1 May 2007, Andrew Morton wrote:
> > So on balance, given that we _do_ expect slub to have a future, I'm
> > inclined to crash ahead with it.  The worst that can happen will be a later
> > rm mm/slub.c which would be pretty simple to do.
> 
> Okay.  And there's been no chorus to echo my concern.

I have been looking into "slub" recently to avoid some of the NUMA alien
cache issues that we were encountering on the regular slab.

I am having some stability issues with slub on an ia64 NUMA platform and
didn't have time to dig further. I am hoping to look into it soon
and share the data/findings with Christoph.

We also did a quick perf collection on x86_64 (at least I didn't hear of
any stability issues from our team on regular x86_64 SMP), that we will be
sharing shortly.

> But if Linus' tree is to be better than a warehouse to avoid
> awkward merges, I still think we want it to default to on for
> all the architectures, and for most if not all -rcs.

I would not suggest default-on at this point.

thanks,
suresh

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:52           ` Siddha, Suresh B
@ 2007-05-02 18:58             ` Christoph Lameter
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:58 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: Hugh Dickins, Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Siddha, Suresh B wrote:

> I have been looking into "slub" recently to avoid some of the NUMA alien
> cache issues that we were encountering on the regular slab.

Yes that is also our main concern.

> I am having some stability issues with slub on an ia64 NUMA platform and
> didn't have time to dig further. I am hoping to look into it soon
> and share the data/findings with Christoph.

There is at least one patch on top of 2.6.21-rc7-mm2 already in mm that 
may be necessary for you.

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 20:19     ` Hugh Dickins
  2007-05-01 20:36       ` Andrew Morton
@ 2007-05-01 21:08       ` Christoph Lameter
  2007-05-02 12:45         ` Hugh Dickins
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-01 21:08 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Tue, 1 May 2007, Hugh Dickins wrote:

> Yes, to me it does.  If it could be defaulted to on throughout the
> -rcs, on every architecture, then I'd say that's "finishing work";
> and we'd be safe knowing we could go back to slab in a hurry if
> needed.  But it hasn't reached that stage yet, I think.

Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
is marked experimental and not the default.

The only problem that I am aware of is (or was) the issue with arches 
modifying page struct fields of slab pages that SLUB needs for its own 
operations. And I thought it was all fixed since the powerpc guys were 
quiet and the patch was in for i386.

* Re: 2.6.22 -mm merge plans: slub
  2007-05-01 21:08       ` Christoph Lameter
@ 2007-05-02 12:45         ` Hugh Dickins
  2007-05-02 17:01           ` Christoph Lameter
  2007-05-02 17:25           ` Christoph Lameter
  0 siblings, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 12:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm

On Tue, 1 May 2007, Christoph Lameter wrote:
> On Tue, 1 May 2007, Hugh Dickins wrote:
> 
> > Yes, to me it does.  If it could be defaulted to on throughout the
> > -rcs, on every architecture, then I'd say that's "finishing work";
> > and we'd be safe knowing we could go back to slab in a hurry if
> > needed.  But it hasn't reached that stage yet, I think.
> 
> Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
> is marked experimental and not the default.

I said above that I thought SLUB ought to be defaulted to on throughout
the -rcs: if we don't do that, we're not going to learn much from having
it in Linus' tree.

And perhaps that line which appends "PREEMPT " to an oops report ought
to append "SLUB " too, for so long as there's a choice.
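
A minimal userspace sketch of that idea follows. The struct and function
names are hypothetical, and the run-time flags are an assumption made so the
example can be compiled and tested on its own; the kernel's oops path would
select its tags at build time with #ifdef rather than like this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical build-option flags; the kernel itself would use #ifdef. */
struct build_opts {
	int preempt;	/* kernel built with CONFIG_PREEMPT */
	int slub;	/* kernel built with CONFIG_SLUB */
};

/* Append one tag to buf if enabled, never writing past len bytes. */
static void append_tag(char *buf, size_t len, int enabled, const char *tag)
{
	size_t used = strlen(buf);

	assert(used < len);
	if (enabled)
		snprintf(buf + used, len - used, "%s", tag);
}

/* Build the string of option tags that would follow the oops banner. */
static void oops_tags(char *buf, size_t len, const struct build_opts *opts)
{
	assert(len > 0);
	buf[0] = '\0';
	append_tag(buf, len, opts->preempt, "PREEMPT ");
	append_tag(buf, len, opts->slub, "SLUB ");
}
```

With both flags set the buffer holds "PREEMPT SLUB ", so a bug report shows
at a glance which allocator the reporter was running.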

> The only problem that I am aware of is (or was) the issue with arches 
> modifying page struct fields of slab pages that SLUB needs for its own 
> operations. And I thought it was all fixed since the powerpc guys were 
> quiet and the patch was in for i386.

You're forgetting your unions in struct page: in the SPLIT_PTLOCK
case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl,
which overlays SLUB's first_page and slab pointers.
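
The overlay can be modelled with a toy union. This is a heavily simplified
sketch, not the real struct page layout: page_sketch, ptl_sketch_t and the
single slab member are stand-ins invented for illustration, and the real
spinlock_t differs in size and contents depending on configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for spinlock_t (illustration only). */
typedef struct { unsigned int raw; } ptl_sketch_t;

/* Toy model of the part of struct page where the two users collide. */
struct page_sketch {
	unsigned long flags;
	union {
		ptl_sketch_t ptl;	/* split page-table lock */
		void *slab;		/* allocator's per-page pointer */
	} u;
};

/*
 * Returns 1 if taking the page-table lock disturbs the allocator's pointer.
 * The pointer starts out NULL; after the "lock" is written through the
 * other union member it no longer reads back as NULL.
 */
static int ptl_clobbers_slab(void)
{
	struct page_sketch pg = { 0 };

	pg.u.slab = NULL;	/* allocator's view of the page */
	pg.u.ptl.raw = 1;	/* page-table code "takes" the lock */
	return pg.u.slab != NULL;
}
```

Both members start at the same offset, so a page that is simultaneously a
slab page and a page-table page has one subsystem overwriting the other's
state.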

I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover
edited to 8 cpus instead, and then no crash.

I presume the answer is just to extend your quicklist work to
powerpc's lowest level of pagetables.  The only other architecture
which is using kmem_cache for them is arm26, which has
"#error SMP is not supported", so won't be giving this problem.

Hugh

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 12:45         ` Hugh Dickins
@ 2007-05-02 17:01           ` Christoph Lameter
  2007-05-02 18:08             ` Hugh Dickins
  2007-05-02 17:25           ` Christoph Lameter
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 17:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> > Why would we need to go back to SLAB if we have not switched to SLUB? SLUB 
> > is marked experimental and not the default.
> 
> I said above that I thought SLUB ought to be defaulted to on throughout
> the -rcs: if we don't do that, we're not going to learn much from having
> it in Linus' tree.

I'd rather be careful with that..... mm is enough for now. Why go to the 
extremes immediately? If it is an option then people can gradually start 
testing with it.
 
> > The only problem that I am aware of is (or was) the issue with arches 
> > modifying page struct fields of slab pages that SLUB needs for its own 
> > operations. And I thought it was all fixed since the powerpc guys were 
> > quiet and the patch was in for i386.
> 
> You're forgetting your unions in struct page: in the SPLIT_PTLOCK
> case (NR_CPUS >= 4) the pagetable code is using spinlock_t ptl,
> which overlays SLUB's first_page and slab pointers.

Uhhh.... Right. So SLUB won't work if the lowest page table block is 
managed via slabs.
 
> I just tried rebuilding powerpc with the SPLIT_PTLOCK cutover
> edited to 8 cpus instead, and then no crash.
> 
> I presume the answer is just to extend your quicklist work to
> powerpc's lowest level of pagetables.  The only other architecture

I am not sure how PowerPC's lower pagetable pages work. If they are of 
PAGE_SIZE then this is no problem.

> which is using kmem_cache for them is arm26, which has
> "#error SMP is not supported", so won't be giving this problem.

Ahh. Good.

But these are arch specific problems. We could use 
ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.

^ permalink raw reply	[flat|nested] 233+ messages in thread
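
The overlay Hugh describes above can be illustrated with a small userspace model. This is only a sketch: struct toy_page, struct toy_spinlock and ptl_clobbers_slab() are hypothetical names, not the kernel's definitions, and the real struct page layout is more involved. It shows why a page cannot serve both as a slab page and as a page-table page with a split lock.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for spinlock_t; the real type is arch-specific. */
struct toy_spinlock {
	uint32_t locked;
};

/*
 * Simplified model of the union in struct page: the split page-table
 * lock and the slab allocator's back-pointer share the same storage.
 */
struct toy_page {
	unsigned long flags;
	union {
		struct toy_spinlock ptl; /* written by page-table code */
		void *slab;              /* written by the slab allocator */
	} u;
};

/* Returns 1 if touching the page-table lock changed the slab pointer. */
int ptl_clobbers_slab(void)
{
	static int cache;            /* stands in for a kmem_cache */
	void *expected = &cache;
	struct toy_page page;

	memset(&page, 0, sizeof(page));
	page.u.slab = expected;      /* allocator records the owning cache */
	page.u.ptl.locked ^= 1;      /* page-table code changes the lock word */

	/* Compare raw bytes: the pointer no longer holds what was stored. */
	return memcmp(&page.u.slab, &expected, sizeof(expected)) != 0;
}
```

In this model the allocator would later follow a corrupted pointer when the page is freed, which matches the kind of crash discussed in the thread.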

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 17:01           ` Christoph Lameter
@ 2007-05-02 18:08             ` Hugh Dickins
  2007-05-02 18:28               ` Christoph Lameter
  0 siblings, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 18:08 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Christoph Lameter wrote:
> 
> But these are arch specific problems. We could use 
> ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.

As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
diminishes the testing SLUB will get.  If the idea is that we're
going to support both SLAB and SLUB, some arches with one, some
with another, some with either, for more than a single release,
then I'm back to saying SLUB is being pushed in too early.
I can understand people wanting pluggable schedulers,
but pluggable slab allocators?

Hugh

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:08             ` Hugh Dickins
@ 2007-05-02 18:28               ` Christoph Lameter
  2007-05-02 18:42                 ` Andrew Morton
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> On Wed, 2 May 2007, Christoph Lameter wrote:
> > 
> > But these are arch specific problems. We could use 
> > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.
> 
> As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
> diminishes the testing SLUB will get.  If the idea is that we're
> going to support both SLAB and SLUB, some arches with one, some
> with another, some with either, for more than a single release,
> then I'm back to saying SLUB is being pushed in too early.
> I can understand people wanting pluggable schedulers,
> but pluggable slab allocators?

This is a sensitive piece of the kernel as you say and we had better allow 
running two allocators for some time to make sure that it behaves in all 
load situations. The design is fundamentally different so its performance 
characteristics may diverge significantly and perhaps there will be corner 
cases for each where they do the best job.

I have already reworked the slab API to allow for an easy implementation 
of alternate slab allocators (released with 2.6.20) which only covered 
SLAB and SLOB. This is continuing the cleanup work and adding a third one.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:28               ` Christoph Lameter
@ 2007-05-02 18:42                 ` Andrew Morton
  2007-05-02 18:53                   ` Christoph Lameter
  0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 18:42 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007 11:28:26 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > On Wed, 2 May 2007, Christoph Lameter wrote:
> > > 
> > > But these are arch specific problems. We could use 
> > > ARCH_USES_SLAB_PAGE_STRUCT to disable SLUB on these platforms.
> > 
> > As a quick hack, sure.  But every ARCH_USES_SLAB_PAGE_STRUCT
> > diminishes the testing SLUB will get.  If the idea is that we're
> > going to support both SLAB and SLUB, some arches with one, some
> > with another, some with either, for more than a single release,
> > then I'm back to saying SLUB is being pushed in too early.
> > I can understand people wanting pluggable schedulers,
> > but pluggable slab allocators?
> 
> This is a sensitive piece of the kernel as you say and we better allow the 
> running of two allocator for some time to make sure that it behaves in all 
> load situations. The design is fundamentally different so its performance 
> characteristics may diverge significantly and perhaps there will be corner 
> cases for each where they do the best job.

eek.  We'd need to fix those corner cases then.  Our endgame
here really must be rm mm/slab.c.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:42                 ` Andrew Morton
@ 2007-05-02 18:53                   ` Christoph Lameter
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

On Wed, 2 May 2007, Andrew Morton wrote:

> > This is a sensitive piece of the kernel as you say and we better allow the 
> > running of two allocator for some time to make sure that it behaves in all 
> > load situations. The design is fundamentally different so its performance 
> > characteristics may diverge significantly and perhaps there will be corner 
> > cases for each where they do the best job.
> 
> eek.  We'd need to fix those corner cases then.  Our endgame
> here really must be rm mm/slab.c.

First we need to discover them and I doubt that mm covers much more than 
development loads. I hope we can get to a point where we have SLUB be 
the primary allocator soon but I would expect various performance issues 
to show up.

On the other hand: I am pretty sure that SLUB can replace SLOB completely 
given SLOB's limitations and SLUB's more efficient use of space. SLOB needs 
8 bytes of overhead. SLUB needs none. We may just have to #ifdef out the 
debugging support to make the code be of similar size to SLOB too. SLOB is 
a general problem because its features are not compatible with SLAB. E.g. it 
does not support DESTROY_BY_RCU and does not do reclaim the right way etc. 
SLUB may turn out to be the ideal embedded slab allocator.

^ permalink raw reply	[flat|nested] 233+ messages in thread
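
To put the "8 bytes of overhead" remark in numbers, here is a rough arithmetic sketch. It assumes the overhead is a per-object header, a 4096-byte page and 32-byte objects, none of which is stated in the message, and it ignores alignment and any per-slab metadata. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes consumed per object when each object carries a header. */
size_t object_footprint(size_t object_size, size_t header_bytes)
{
	return object_size + header_bytes;
}

/* How many objects fit into one page under this simple model. */
size_t objects_per_page(size_t page_size, size_t object_size,
			size_t header_bytes)
{
	return page_size / object_footprint(object_size, header_bytes);
}
```

Under these assumptions, objects_per_page(4096, 32, 8) gives 102 objects while objects_per_page(4096, 32, 0) gives 128, so the header would cost roughly a fifth of the capacity for small objects.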

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 12:45         ` Hugh Dickins
  2007-05-02 17:01           ` Christoph Lameter
@ 2007-05-02 17:25           ` Christoph Lameter
  2007-05-02 18:36             ` Hugh Dickins
  2007-05-03  8:15             ` Andrew Morton
  1 sibling, 2 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 17:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> I presume the answer is just to extend your quicklist work to
> powerpc's lowest level of pagetables.  The only other architecture
> which is using kmem_cache for them is arm26, which has
> "#error SMP is not supported", so won't be giving this problem.

In the meantime we would need something like this to disable SLUB in this 
particular configuration. Note that I have not tested this and the <= for
the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
construct in a Kconfig file but it is needed here).



PowerPC: Disable SLUB for configurations in which slab page structs are modified

PowerPC uses the slab allocator to manage the lowest level of the page table.
In high cpu configurations we also use the page struct to split the page
table lock. Disallow the selection of SLUB for that case.

[Not tested: I am not familiar with powerpc build procedures etc]

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
+++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
@@ -117,6 +117,19 @@ config GENERIC_BUG
 	default y
 	depends on BUG
 
+#
+# Powerpc uses the slab allocator to manage its ptes and the
+# page structs of ptes are used for splitting the page table
+# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
+#
+# In that special configuration the page structs of slabs are modified.
+# This setting disables the selection of SLUB as a slab allocator.
+#
+config ARCH_USES_SLAB_PAGE_STRUCT
+	bool
+	default y
+	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
+
 config DEFAULT_UIMAGE
 	bool
 	help

^ permalink raw reply	[flat|nested] 233+ messages in thread
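
The condition the Kconfig fragment above is trying to express can be written out as a plain predicate. This is a sketch of the intended logic only, not of how Kconfig evaluates it; the function names are hypothetical and the parameters mirror NR_CPUS and SPLIT_PTLOCK_CPUS from the patch.

```c
#include <assert.h>

/*
 * Returns 1 when the page-table lock would be split into struct page,
 * i.e. when the configured CPU count reaches the cutover value.
 */
int ptlock_is_split(int nr_cpus, int split_ptlock_cpus)
{
	return split_ptlock_cpus <= nr_cpus;
}

/*
 * On an arch that allocates pte pages from a slab cache, SLUB should
 * only be selectable while the lock is not split.
 */
int slub_selectable(int nr_cpus, int split_ptlock_cpus)
{
	return !ptlock_is_split(nr_cpus, split_ptlock_cpus);
}
```

With NR_CPUS=8 and SPLIT_PTLOCK_CPUS=4, slub_selectable(8, 4) is 0, so SLUB should be hidden. With the cutover raised to 8 on a 4-CPU configuration, slub_selectable(4, 8) is 1.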

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 17:25           ` Christoph Lameter
@ 2007-05-02 18:36             ` Hugh Dickins
  2007-05-02 18:39               ` Christoph Lameter
  2007-05-03  8:15             ` Andrew Morton
  1 sibling, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-02 18:36 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Christoph Lameter wrote:
> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables.  The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
> 
> In the meantime we would need something like this to disable SLUB in this 
> particular configuration. Note that I have not tested this and the <= for
> the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).

I'm astonished and impressed, both with Kconfig and your use of it:
that does seem to work.  Though I don't dare go so far as to give
the patch an ack, and don't like this way out at all.  It needs a
proper (quicklist) solution, and by the time that solution comes
along, all the powerpc people will have CONFIG_SLAB=y in their
.config, and "make oldconfig" will just perpetuate that status quo,
instead of the switching over to CONFIG_SLUB=y.  I think.  Unless
we keep changing the config option names, or go through a phase
with no option.

I'd much rather be testing a quicklist patch:
I'd better give that a try.

Hugh

> 
> 
> 
> PowerPC: Disable SLUB for configurations in which slab page structs are modified
> 
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
> 
> [Not tested: I am not familiar with powerpc build procedures etc]
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>  	default y
>  	depends on BUG
>  
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> +	bool
> +	default y
> +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +
>  config DEFAULT_UIMAGE
>  	bool
>  	help

^ permalink raw reply	[flat|nested] 233+ messages in thread
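
For readers unfamiliar with the quicklist idea mentioned above, the following userspace sketch shows the general shape: whole zeroed pages are kept on a simple free list, so the pages never belong to a slab cache. It is an illustration under assumptions, not the kernel's implementation. The toy_ names and TOY_PAGE_SIZE are hypothetical, calloc() stands in for the page allocator, and the real code keeps one list per CPU.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096

/* A list of free, already-zeroed pages linked through their first word. */
struct toy_quicklist {
	void *head;
	size_t nr_pages;
};

/* Take a page from the list, or fall back to a fresh zeroed page. */
void *toy_quicklist_alloc(struct toy_quicklist *q)
{
	void **page = q->head;

	if (page) {
		q->head = page[0];   /* pop the first page */
		page[0] = NULL;      /* restore the all-zero state */
		q->nr_pages--;
		return page;
	}
	return calloc(1, TOY_PAGE_SIZE);
}

/* Return a page to the list; the caller must hand it back zeroed. */
void toy_quicklist_free(struct toy_quicklist *q, void *p)
{
	void **page = p;

	page[0] = q->head;       /* link into the list */
	q->head = page;
	q->nr_pages++;
}
```

Because each entry is a full page obtained outside any slab cache, the corresponding page struct fields stay available for other uses such as the split page-table lock.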

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:36             ` Hugh Dickins
@ 2007-05-02 18:39               ` Christoph Lameter
  2007-05-02 18:57                 ` Andrew Morton
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 18:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Hugh Dickins wrote:

> I'm astonished and impressed, both with Kconfig and your use of it:

Thanks!

> I'd much rather be testing a quicklist patch:
> I'd better give that a try.

Great. But I certainly do not mind people using SLAB. I do not think that 
one approach should be there for all. Choice is the way to have multiple 
allocators compete. One reason that SLAB is so crusty is because it was 
the only solution for so long.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:39               ` Christoph Lameter
@ 2007-05-02 18:57                 ` Andrew Morton
  2007-05-02 19:01                   ` Christoph Lameter
  0 siblings, 1 reply; 233+ messages in thread
From: Andrew Morton @ 2007-05-02 18:57 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007 11:39:20 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I'm astonished and impressed, both with Kconfig and your use of it:
> 
> Thanks!
> 
> > I'd much rather be testing a quicklist patch:
> > I'd better give that a try.
> 
> Great. But I certainly do not mind people use SLAB. I do not think that 
> one approach should be there for all. Choice is the way to have multiple 
> allocators compete. One reason that SLAB is so crusty is because it was 
> the only solution for so long.
> 

noooo, we don't want competing slab allocators, please.  We should get slub
working well on all architectures then remove slab completely.  Having to
maintain both slab.c and slub.c would be awful.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 18:57                 ` Andrew Morton
@ 2007-05-02 19:01                   ` Christoph Lameter
  2007-05-02 19:18                     ` Pekka Enberg
  0 siblings, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Andrew Morton wrote:

> noooo, we don't want competing slab allocators, please.  We should get slub
> working well on all architectures then remove slab completely.  Having to
> maintain both slab.c and slub.c would be awful.

Owww... You throw my roadmap out of the window and may create too 
high expectations of SLUB.

I am the one who has to maintain SLAB and SLUB it seems and I have been 
dealing with the trio SLAB, SLOB and SLUB for a while now. It's okay and it 
will be much easier once the cleanups are in.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:01                   ` Christoph Lameter
@ 2007-05-02 19:18                     ` Pekka Enberg
  2007-05-02 19:34                       ` Christoph Lameter
  2007-05-02 19:43                       ` Christoph Lameter
  0 siblings, 2 replies; 233+ messages in thread
From: Pekka Enberg @ 2007-05-02 19:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> Owww... You throw my roadmap out of the window and may create too
> high expectations of SLUB.

Me too!

On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> I am the one who has to maintain SLAB and SLUB it seems and I have been
> dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
> will be much easier once the cleanups are in.

And then there's patches such as kmemleak which would need to target
all three. Plus it doesn't really make sense for users to select
between three competing implementations. Please don't take away our
high hopes of getting rid of mm/slab.c Christoph =)

                                      Pekka

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:18                     ` Pekka Enberg
@ 2007-05-02 19:34                       ` Christoph Lameter
  2007-05-02 19:43                       ` Christoph Lameter
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:34 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Pekka Enberg wrote:

> On 5/2/07, Christoph Lameter <clameter@sgi.com> wrote:
> > I am the one who has to maintain SLAB and SLUB it seems and I have been
> > dealing with the trio SLAB, SLOB and SLUB for awhile now. Its okay and it
> > will be much easier once the cleanups are in.
> 
> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competing implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

SLUB supports kmemleak (actually it's quite improved). Switch debugging on 
and try

cat /sys/slab/kmalloc-128/alloc_calls


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 19:18                     ` Pekka Enberg
  2007-05-02 19:34                       ` Christoph Lameter
@ 2007-05-02 19:43                       ` Christoph Lameter
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-02 19:43 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007, Pekka Enberg wrote:

> And then there's patches such as kmemleak which would need to target
> all three. Plus it doesn't really make sense for users to select
> between three competing implementations. Please don't take away our
> high hopes of getting rid of mm/slab.c Christoph =)

You too, Brutus ...

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-02 17:25           ` Christoph Lameter
  2007-05-02 18:36             ` Hugh Dickins
@ 2007-05-03  8:15             ` Andrew Morton
  2007-05-03  8:27               ` William Lee Irwin III
  2007-05-03  8:46               ` Hugh Dickins
  1 sibling, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-03  8:15 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Hugh Dickins, haveblue, linux-kernel, linux-mm

On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 2 May 2007, Hugh Dickins wrote:
> 
> > I presume the answer is just to extend your quicklist work to
> > powerpc's lowest level of pagetables.  The only other architecture
> > which is using kmem_cache for them is arm26, which has
> > "#error SMP is not supported", so won't be giving this problem.
> 
> In the meantime we would need something like this to disable SLUB in this 
> particular configuration. Note that I have not tested this and the <= for
> the comparison with SPLIT_PTLOCK_CPUS may not work (Never seen such a
> construct in a Kconfig file but it is needed here).
> 
> 
> 
> PowerPC: Disable SLUB for configurations in which slab page structs are modified
> 
> PowerPC uses the slab allocator to manage the lowest level of the page table.
> In high cpu configurations we also use the page struct to split the page
> table lock. Disallow the selection of SLUB for that case.
> 
> [Not tested: I am not familiar with powerpc build procedures etc]
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> Index: linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig
> ===================================================================
> --- linux-2.6.21-rc7-mm2.orig/arch/powerpc/Kconfig	2007-05-02 10:07:34.000000000 -0700
> +++ linux-2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-05-02 10:13:37.000000000 -0700
> @@ -117,6 +117,19 @@ config GENERIC_BUG
>  	default y
>  	depends on BUG
>  
> +#
> +# Powerpc uses the slab allocator to manage its ptes and the
> +# page structs of ptes are used for splitting the page table
> +# lock for configurations supporting more than SPLIT_PTLOCK_CPUS.
> +#
> +# In that special configuration the page structs of slabs are modified.
> +# This setting disables the selection of SLUB as a slab allocator.
> +#
> +config ARCH_USES_SLAB_PAGE_STRUCT
> +	bool
> +	default y
> +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> +

That all seems to work as intended.

However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
machine early in boot.  

Too early for netconsole, no serial console.  Wedges up uselessly with
CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(

However I was able to glimpse some stuff as it flew past.  Crash started in
flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
card, perhaps.


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:15             ` Andrew Morton
@ 2007-05-03  8:27               ` William Lee Irwin III
  2007-05-03 16:30                 ` Christoph Lameter
  2007-05-03  8:46               ` Hugh Dickins
  1 sibling, 1 reply; 233+ messages in thread
From: William Lee Irwin III @ 2007-05-03  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Thu, May 03, 2007 at 01:15:15AM -0700, Andrew Morton wrote:
> That all seems to work as intended.
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.  
> Too early for netconsole, no serial console.  Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> However I was able to glimpse some stuff as it flew past.  Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

I've seen this crash in flush_old_exec() before. ISTR it being due to
slub vs. pagetable alignment or something on that order.


-- wli

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:27               ` William Lee Irwin III
@ 2007-05-03 16:30                 ` Christoph Lameter
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-03 16:30 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, Hugh Dickins, haveblue, linux-kernel, linux-mm

On Thu, 3 May 2007, William Lee Irwin III wrote:

> I've seen this crash in flush_old_exec() before. ISTR it being due to
> slub vs. pagetable alignment or something on that order.

From the other discussion regarding SLAB: It may be necessary for 
powerpc to set the default alignment to 8 bytes on 32 bit powerpc 
because it requires that alignment for 64 bit values. SLUB will *not* 
disable debugging like SLAB if you do that.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:15             ` Andrew Morton
  2007-05-03  8:27               ` William Lee Irwin III
@ 2007-05-03  8:46               ` Hugh Dickins
  2007-05-03  8:57                 ` Andrew Morton
  1 sibling, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03  8:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Thu, 3 May 2007, Andrew Morton wrote:
> On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> 
> > +config ARCH_USES_SLAB_PAGE_STRUCT
> > +	bool
> > +	default y
> > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > +
> 
> That all seems to work as intended.
> 
> However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> machine early in boot.  

I thought that if that worked as intended, you wouldn't even
get the chance to choose SLUB=y?  That was how it was working
for me (but I realize I didn't try more than make oldconfig).

> 
> Too early for netconsole, no serial console.  Wedges up uselessly with
> CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> 
> However I was able to glimpse some stuff as it flew past.  Crash started in
> flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
> how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> card, perhaps.

That sounds like what happens when SLUB's pagestruct use meets
SPLIT_PTLOCK's pagestruct use.  Does your .config really show
CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?

Hugh

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:46               ` Hugh Dickins
@ 2007-05-03  8:57                 ` Andrew Morton
  2007-05-03  9:15                   ` Hugh Dickins
  2007-05-03 16:45                   ` 2.6.22 -mm merge plans: slub Christoph Lameter
  0 siblings, 2 replies; 233+ messages in thread
From: Andrew Morton @ 2007-05-03  8:57 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Thu, 3 May 2007, Andrew Morton wrote:
> > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > 
> > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > +	bool
> > > +	default y
> > > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > +
> > 
> > That all seems to work as intended.
> > 
> > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > machine early in boot.  
> 
> I thought that if that worked as intended, you wouldn't even
> get the chance to choose SLUB=y?  That was how it was working
> for me (but I realize I didn't try more than make oldconfig).

Right.  This can be tested on x86 without a cross-compiler:

ARCH=powerpc make mrproper
ARCH=powerpc make fooconfig

> > 
> > Too early for netconsole, no serial console.  Wedges up uselessly with
> > CONFIG_XMON=n, does mysterious repeated uncontrollable exceptions with
> > CONFIG_XMON=y.  This is all fairly typical for a powerpc/G5 crash :(
> > 
> > However I was able to glimpse some stuff as it flew past.  Crash started in
> > flush_old_exec and ended in pgtable_free_tlb -> kmem_cache_free.  I don't know
> > how to do better than that I'm afraid, unless I'm to hunt down a PCIE serial
> > card, perhaps.
> 
> That sounds like what happens when SLUB's pagestruct use meets
> SPLIT_PTLOCK's pagestruct use.  Does your .config really show
> CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
> 

Nope.

g5:/usr/src/25> grep SLUB .config
CONFIG_SLUB=y
g5:/usr/src/25> grep SLAB .config
# CONFIG_SLAB is not set
g5:/usr/src/25> grep CPUS .config
CONFIG_NR_CPUS=8
# CONFIG_CPUSETS is not set
# CONFIG_IRQ_ALL_CPUS is not set
CONFIG_SPLIT_PTLOCK_CPUS=4

It's in http://userweb.kernel.org/~akpm/config-g5.txt


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:57                 ` Andrew Morton
@ 2007-05-03  9:15                   ` Hugh Dickins
  2007-05-03 21:04                     ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins
  2007-05-03 16:45                   ` 2.6.22 -mm merge plans: slub Christoph Lameter
  1 sibling, 1 reply; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03  9:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Christoph Lameter, linux-kernel, linux-mm

On Thu, 3 May 2007, Andrew Morton wrote:
> On Thu, 3 May 2007 09:46:32 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > On Thu, 3 May 2007, Andrew Morton wrote:
> > > On Wed, 2 May 2007 10:25:47 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> > > 
> > > > +config ARCH_USES_SLAB_PAGE_STRUCT
> > > > +	bool
> > > > +	default y
> > > > +	depends on SPLIT_PTLOCK_CPUS <= NR_CPUS
> > > > +
> > > 
> > > That all seems to work as intended.
> > > 
> > > However with NR_CPUS=8 SPLIT_PTLOCK_CPUS=4, enabling SLUB=y crashes the
> > > machine early in boot.  
> > 
> > I thought that if that worked as intended, you wouldn't even
> > get the chance to choose SLUB=y?  That was how it was working
> > for me (but I realize I didn't try more than make oldconfig).
> > 
> > That sounds like what happens when SLUB's pagestruct use meets
> > SPLIT_PTLOCK's pagestruct use.  Does your .config really show
> > CONFIG_SLUB=y together with CONFIG_ARCH_USES_SLAB_PAGE_STRUCT=y?
> 
> Nope.
> 
> g5:/usr/src/25> grep SLUB .config
> CONFIG_SLUB=y
> g5:/usr/src/25> grep SLAB .config
> # CONFIG_SLAB is not set
> g5:/usr/src/25> grep CPUS .config
> CONFIG_NR_CPUS=8
> # CONFIG_CPUSETS is not set
> # CONFIG_IRQ_ALL_CPUS is not set
> CONFIG_SPLIT_PTLOCK_CPUS=4
> 
> It's in http://userweb.kernel.org/~akpm/config-g5.txt

Seems we're all wrong in thinking Christoph's Kconfiggery worked
as intended: maybe it just works some of the time.  I'm not going
to hazard a guess as to how to fix it up, will resume looking at
the powerpc's quicklist potential later.

Hugh

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: 2.6.22 -mm merge plans: slub on PowerPC
  2007-05-03  9:15                   ` Hugh Dickins
@ 2007-05-03 21:04                     ` Hugh Dickins
  2007-05-03 21:15                       ` Christoph Lameter
  2007-05-04  0:25                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 21:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Paul Mackerras, Benjamin Herrenschmidt,
	linux-kernel, linux-mm

On Thu, 3 May 2007, Hugh Dickins wrote:
> 
> Seems we're all wrong in thinking Christoph's Kconfiggery worked
> as intended: maybe it just works some of the time.  I'm not going
> to hazard a guess as to how to fix it up, will resume looking at
> the powerpc's quicklist potential later.

Here's the patch I've been testing on G5, with 4k and with 64k pages,
with SLAB and with SLUB.  But, though it doesn't crash, the pgd
kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
for using highorder allocations where SLAB would stick to order 0:
under load, exec's mm_init gets page allocation failure on order 4
- SLUB's calculate_order may need some retuning.  (I'd expect it to
be going for order 3 actually, I'm not sure how order 4 comes about.)

I don't know how offensive Ben and Paulus may find this patch:
the kmem_cache use was nicely done and this messes it up a little.


The SLUB allocator relies on struct page fields first_page and slab,
overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
be used for the lowest level of pagetable pages.  This was obstructing
SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
want partpages, so continue to use kmem_caches for pmd, pud and pgd).
But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---

 arch/powerpc/Kconfig          |    4 ++++
 arch/powerpc/mm/init_64.c     |   17 ++++++-----------
 include/asm-powerpc/pgalloc.h |   26 +++++++++++---------------
 3 files changed, 21 insertions(+), 26 deletions(-)

--- 2.6.21-rc7-mm2/arch/powerpc/Kconfig	2007-04-26 13:33:51.000000000 +0100
+++ linux/arch/powerpc/Kconfig	2007-05-03 20:45:12.000000000 +0100
@@ -31,6 +31,10 @@ config MMU
 	bool
 	default y
 
+config QUICKLIST
+	bool
+	default y
+
 config GENERIC_HARDIRQS
 	bool
 	default y
--- 2.6.21-rc7-mm2/arch/powerpc/mm/init_64.c	2007-04-26 13:33:51.000000000 +0100
+++ linux/arch/powerpc/mm/init_64.c	2007-05-03 20:45:12.000000000 +0100
@@ -146,21 +146,16 @@ static void zero_ctor(void *addr, struct
 	memset(addr, 0, kmem_cache_size(cache));
 }
 
-#ifdef CONFIG_PPC_64K_PAGES
-static const unsigned int pgtable_cache_size[3] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE, PGD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pte_pmd_cache", "pmd_cache", "pgd_cache",
-};
-#else
 static const unsigned int pgtable_cache_size[2] = {
-	PTE_TABLE_SIZE, PMD_TABLE_SIZE
+	PGD_TABLE_SIZE, PMD_TABLE_SIZE
 };
 static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-	"pgd_pte_cache", "pud_pmd_cache",
-};
+#ifdef CONFIG_PPC_64K_PAGES
+	"pgd_cache", "pmd_cache",
+#else
+	"pgd_cache", "pud_pmd_cache",
 #endif /* CONFIG_PPC_64K_PAGES */
+};
 
 #ifdef CONFIG_HUGETLB_PAGE
 /* Hugepages need one extra cache, initialized in hugetlbpage.c.  We
--- 2.6.21-rc7-mm2/include/asm-powerpc/pgalloc.h	2007-02-04 18:44:54.000000000 +0000
+++ linux/include/asm-powerpc/pgalloc.h	2007-05-03 20:45:12.000000000 +0100
@@ -10,21 +10,15 @@
 #include <linux/slab.h>
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
+#include <linux/quicklist.h>
 
 extern struct kmem_cache *pgtable_cache[];
 
-#ifdef CONFIG_PPC_64K_PAGES
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PGD_CACHE_NUM	2
-#define HUGEPTE_CACHE_NUM 3
-#else
-#define PTE_CACHE_NUM	0
-#define PMD_CACHE_NUM	1
-#define PUD_CACHE_NUM	1
 #define PGD_CACHE_NUM	0
+#define PUD_CACHE_NUM	1
+#define PMD_CACHE_NUM	1
 #define HUGEPTE_CACHE_NUM 2
-#endif
+#define PTE_CACHE_NUM	3	/* from quicklist rather than  kmem_cache */
 
 /*
  * This program is free software; you can redistribute it and/or
@@ -97,8 +91,7 @@ static inline void pmd_free(pmd_t *pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-	return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
-				GFP_KERNEL|__GFP_REPEAT);
+	return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm,
@@ -109,7 +102,7 @@ static inline struct page *pte_alloc_one
 		
 static inline void pte_free_kernel(pte_t *pte)
 {
-	kmem_cache_free(pgtable_cache[PTE_CACHE_NUM], pte);
+	quicklist_free(0, NULL, pte);
 }
 
 static inline void pte_free(struct page *ptepage)
@@ -136,7 +129,10 @@ static inline void pgtable_free(pgtable_
 	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
 	int cachenum = pgf.val & PGF_CACHENUM_MASK;
 
-	kmem_cache_free(pgtable_cache[cachenum], p);
+	if (cachenum == PTE_CACHE_NUM)
+		quicklist_free(0, NULL, p);
+	else
+		kmem_cache_free(pgtable_cache[cachenum], p);
 }
 
 extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
@@ -153,7 +149,7 @@ extern void pgtable_free_tlb(struct mmu_
 		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
 #endif /* CONFIG_PPC_64K_PAGES */
 
-#define check_pgt_cache()	do { } while (0)
+#define check_pgt_cache()	quicklist_trim(0, NULL, 25, 16)
 
 #endif /* CONFIG_PPC64 */
 #endif /* __KERNEL__ */
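
For readers following the pgtable_free() hunk: the dispatch works because
the cache number travels in the low bits of the (aligned) pagetable pointer.
A minimal userspace sketch of that packing follows. The sketch_* names and
the 0x3 mask are assumptions made for illustration; the real
PGF_CACHENUM_MASK and pgtable_free_t are defined in pgalloc.h and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only: the mask width is an assumption. */
#define SKETCH_CACHENUM_MASK 0x3UL
#define SKETCH_PTE_CACHE_NUM 3UL	/* tag meaning "came from the quicklist" */

typedef struct { uintptr_t val; } sketch_free_t;

/* Pack an aligned pointer and a small cache number into one word. */
static sketch_free_t sketch_pack(void *p, unsigned long cachenum)
{
	sketch_free_t f;

	/* The low bits must be free, and the tag must fit in them. */
	assert(((uintptr_t)p & SKETCH_CACHENUM_MASK) == 0);
	assert(cachenum <= SKETCH_CACHENUM_MASK);
	f.val = (uintptr_t)p | cachenum;
	return f;
}

/* Recover the original pointer by masking off the tag bits. */
static void *sketch_ptr(sketch_free_t f)
{
	return (void *)(f.val & ~SKETCH_CACHENUM_MASK);
}

/* Recover the cache number from the tag bits. */
static unsigned long sketch_cachenum(sketch_free_t f)
{
	return f.val & SKETCH_CACHENUM_MASK;
}

/* Nonzero when the free should go to the quicklist, not a kmem_cache. */
static int sketch_uses_quicklist(sketch_free_t f)
{
	return sketch_cachenum(f) == SKETCH_PTE_CACHE_NUM;
}
```

This is why PTE_CACHE_NUM has to stay defined even though no kmem_cache
backs it any more: it is only a tag that tells the free path which
allocator to hand the page back to.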

* Re: 2.6.22 -mm merge plans: slub on PowerPC
  2007-05-03 21:04                     ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins
@ 2007-05-03 21:15                       ` Christoph Lameter
  2007-05-03 22:41                         ` Hugh Dickins
  2007-05-04  0:25                       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 233+ messages in thread
From: Christoph Lameter @ 2007-05-03 21:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Paul Mackerras, Benjamin Herrenschmidt,
	linux-kernel, linux-mm

On Thu, 3 May 2007, Hugh Dickins wrote:

> On Thu, 3 May 2007, Hugh Dickins wrote:
> > 
> > Seems we're all wrong in thinking Christoph's Kconfiggery worked
> > as intended: maybe it just works some of the time.  I'm not going
> > to hazard a guess as to how to fix it up, will resume looking at
> > the powerpc's quicklist potential later.
> 
> Here's the patch I've been testing on G5, with 4k and with 64k pages,
> with SLAB and with SLUB.  But, though it doesn't crash, the pgd
> kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
> for using high-order allocations where SLAB would stick to order 0:
> under load, exec's mm_init gets page allocation failure on order 4
> - SLUB's calculate_order may need some retuning.  (I'd expect it to
> be going for order 3 actually, I'm not sure how order 4 comes about.)

There are SLUB patches pending (not in rc7-mm2 as far as I can recall) 
that reduce the default page order sizes to head off this issue. The 
defaults were initially too large (and they still default to large
for testing if Mel's Antifrag work is detected to be active).
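
To make the order arithmetic concrete, here is a rough sketch of how a
"minimum objects per slab" rule turns object size into a page order. It is
NOT SLUB's calculate_order(): the function name, the 4k page size and the
object counts used below are assumptions chosen for illustration, and
nothing here establishes what actually happened on Hugh's G5.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed 4k pages */

/*
 * Return the smallest order whose slab (PAGE_SIZE << order) holds at
 * least min_objects objects of the given size, capped at max_order.
 */
static int sketch_slab_order(size_t size, unsigned int min_objects,
			     int max_order)
{
	int order;

	for (order = 0; order < max_order; order++) {
		if ((SKETCH_PAGE_SIZE << order) / size >= min_objects)
			break;
	}
	return order;
}
```

With a hypothetical minimum of 8 objects, a 4096-byte object gives order 3,
while the same object grown by one 8-byte word gives order 4, since only 7
of the larger objects fit into 32k. Lowering the minimum object count is
one way such a rule could be brought back towards order 0.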

> -	return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
> -				GFP_KERNEL|__GFP_REPEAT);
> +	return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);

__GFP_REPEAT is unusual here but this was carried over from the 
kmem_cache_alloc it seems. Hmm... There is some variance on how we do this 
between arches. Should we uniformly set or not set this flag?

clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-ia64/*
include/asm-ia64/pgalloc.h:     return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:     return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:     return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:     void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-ia64/pgalloc.h:     return quicklist_alloc(0, GFP_KERNEL, NULL);
clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc arch/i386/mm/*    
arch/i386/mm/pgtable.c: pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-sparc64/*
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  return quicklist_alloc(0, GFP_KERNEL, NULL);
include/asm-sparc64/pgalloc.h:  void *pg = quicklist_alloc(0, GFP_KERNEL, NULL);
clameter@schroedinger:~/software/linux-2.6.21-rc7-mm2$ grep quicklist_alloc include/asm-x86_64/* 
include/asm-x86_64/pgalloc.h:   return (pmd_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   return (pud_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   pgd_t *pgd = (pgd_t *)quicklist_alloc(QUICK_PGD,
include/asm-x86_64/pgalloc.h:   return (pte_t *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);
include/asm-x86_64/pgalloc.h:   void *p = (void *)quicklist_alloc(QUICK_PT, GFP_KERNEL|__GFP_REPEAT, NULL);


* Re: 2.6.22 -mm merge plans: slub on PowerPC
  2007-05-03 21:15                       ` Christoph Lameter
@ 2007-05-03 22:41                         ` Hugh Dickins
  0 siblings, 0 replies; 233+ messages in thread
From: Hugh Dickins @ 2007-05-03 22:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Paul Mackerras, Benjamin Herrenschmidt,
	linux-kernel, linux-mm

On Thu, 3 May 2007, Christoph Lameter wrote:
> 
> There are SLUB patches pending (not in rc7-mm2 as far as I can recall) 
> that reduce the default page order sizes to head off this issue. The 
> defaults were initially too large (and they still default to large
> for testing if Mel's Antifrag work is detected to be active).

Sounds good.

> > -	return kmem_cache_alloc(pgtable_cache[PTE_CACHE_NUM],
> > -				GFP_KERNEL|__GFP_REPEAT);
> > +	return quicklist_alloc(0, GFP_KERNEL|__GFP_REPEAT, NULL);
> 
> __GFP_REPEAT is unusual here but this was carried over from the 
> kmem_cache_alloc it seems. Hmm... There is some variance on how we do this 
> between arches. Should we uniformly set or not set this flag?

Not something to get into in this patch, but it did surprise me too.
I believe __GFP_REPEAT should be avoided, and I don't see justification
for it here (but need to remember not to do a blind virt_to_page on the
result in some places if it might return NULL - which IIRC it actually
can do even if __GFP_REPEAT, when chosen for OOM kill).  But I've a
suspicion it got put in there for some good reason I don't know about.

Hugh

* Re: 2.6.22 -mm merge plans: slub on PowerPC
  2007-05-03 21:04                     ` 2.6.22 -mm merge plans: slub on PowerPC Hugh Dickins
  2007-05-03 21:15                       ` Christoph Lameter
@ 2007-05-04  0:25                       ` Benjamin Herrenschmidt
  2007-05-04  0:54                         ` Christoph Lameter
  1 sibling, 1 reply; 233+ messages in thread
From: Benjamin Herrenschmidt @ 2007-05-04  0:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Christoph Lameter, Paul Mackerras, linux-kernel,
	linux-mm

On Thu, 2007-05-03 at 22:04 +0100, Hugh Dickins wrote:
> On Thu, 3 May 2007, Hugh Dickins wrote:
> > 
> > Seems we're all wrong in thinking Christoph's Kconfiggery worked
> > as intended: maybe it just works some of the time.  I'm not going
> > to hazard a guess as to how to fix it up, will resume looking at
> > the powerpc's quicklist potential later.
> 
> Here's the patch I've been testing on G5, with 4k and with 64k pages,
> with SLAB and with SLUB.  But, though it doesn't crash, the pgd
> kmem_cache in the 4k-page SLUB case is revealing SLUB's propensity
> for using high-order allocations where SLAB would stick to order 0:
> under load, exec's mm_init gets page allocation failure on order 4
> - SLUB's calculate_order may need some retuning.  (I'd expect it to
> be going for order 3 actually, I'm not sure how order 4 comes about.)
> 
> I don't know how offensive Ben and Paulus may find this patch:
> the kmem_cache use was nicely done and this messes it up a little.
> 
> 
> The SLUB allocator relies on struct page fields first_page and slab,
> overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> be used for the lowest level of pagetable pages.  This was obstructing
> SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
> its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.

Interesting... I'll have a look asap.

Ben.



* Re: 2.6.22 -mm merge plans: slub on PowerPC
  2007-05-04  0:25                       ` Benjamin Herrenschmidt
@ 2007-05-04  0:54                         ` Christoph Lameter
  0 siblings, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-04  0:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Hugh Dickins, Andrew Morton, Paul Mackerras, linux-kernel,
	linux-mm

On Fri, 4 May 2007, Benjamin Herrenschmidt wrote:

> > The SLUB allocator relies on struct page fields first_page and slab,
> > overwritten by ptl when SPLIT_PTLOCK: so the SLUB allocator cannot then
> > be used for the lowest level of pagetable pages.  This was obstructing
> > SLUB on PowerPC, which uses kmem_caches for its pagetables.  So convert
> > its pte level to use quicklist pages (whereas pmd, pud and 64k-page pgd
> > want partpages, so continue to use kmem_caches for pmd, pud and pgd).
> > But to keep up appearances for pgtable_free, we still need PTE_CACHE_NUM.
> 
> Interesting... I'll have a look asap.

I would also recommend looking at removing the constructors for the 
remaining slabs. A constructor requires that SLUB never touch the object 
(same situation as is resulting from enabling debugging). So it must 
increase the object size in order to put the free pointer after the 
object. In case of a order of 2 cache this has a particularly bad effect 
of doubling object size. If the objects can be overwritten on free (no 
constructor) then we can use the first word of the object as a freepointer 
on kfree. Meaning we can use a hot cacheline so no cache miss. On 
alloc we have already touched the first cacheline which also avoids a 
cacheline fetch there. This is the optimal way of operation for SLUB.

Hmmm.... We could add an option to allow the use of a constructor while
keeping the free pointer at the beginning of the object? Then we would 
have to zap the first word on alloc. Would work like quicklists.

Add SLAB_FREEPOINTER_MAY_OVERLAP?
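
The "free pointer in the first word" scheme can be sketched in plain
userspace C roughly as below. This is only an illustration of the idea, not
SLUB or quicklist code: the sketch_* names are made up, there is no per-cpu
handling or locking, and objects are assumed to be at least one pointer in
size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A trivial LIFO freelist; head points at the most recently freed object. */
struct sketch_freelist {
	void *head;
};

/*
 * Free: store the link in the first word of the object itself.  This
 * overwrites object contents, which is why a constructor-initialized
 * object cannot use the scheme unmodified.
 */
static void sketch_free(struct sketch_freelist *fl, void *obj)
{
	memcpy(obj, &fl->head, sizeof(fl->head));
	fl->head = obj;
}

/*
 * Alloc: pop the head, then zap the first word so the caller does not
 * see a stale link (the "zap the first word on alloc" step).
 */
static void *sketch_alloc(struct sketch_freelist *fl)
{
	void *obj = fl->head;
	void *none = NULL;

	if (!obj)
		return NULL;
	memcpy(&fl->head, obj, sizeof(fl->head));
	memcpy(obj, &none, sizeof(none));
	return obj;
}
```

Both paths touch only the first word of the object, which is the cacheline
most likely to be hot on free and to be needed anyway after alloc. For an
object whose constructed state is all zeroes, clearing that one word on
alloc is enough to restore it.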

* Re: 2.6.22 -mm merge plans: slub
  2007-05-03  8:57                 ` Andrew Morton
  2007-05-03  9:15                   ` Hugh Dickins
@ 2007-05-03 16:45                   ` Christoph Lameter
  1 sibling, 0 replies; 233+ messages in thread
From: Christoph Lameter @ 2007-05-03 16:45 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, linux-kernel, linux-mm

Hmmmm...There are a gazillion configs to choose from. It works fine with
cell_defconfig. If I switch to 2 processors I can enable SLUB; if I switch 
to 4 I cannot.

I saw some other config weirdness like being unable to set SMP if SLOB is 
enabled with some configs. This should not work and does not work but 
the menus are then vanishing and one can still configure lots of 
processors while not having enabled SMP.

It works as far as I can tell... The rest is arch weirdness.

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (20 preceding siblings ...)
  2007-05-01 18:10 ` 2.6.22 -mm merge plans: slub Hugh Dickins
@ 2007-05-03 15:54 ` Ingo Molnar
  2007-05-03 16:15   ` Michal Piotrowski
                     ` (2 more replies)
  2007-05-07 17:47 ` Josef Sipek
  22 siblings, 3 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-03 15:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm, Con Kolivas


* Andrew Morton <akpm@linux-foundation.org> wrote:

> - If replying, please be sure to cc the appropriate individuals.  
>   Please also consider rewriting the Subject: to something 
>   appropriate.

i'm wondering about swap-prefetch:

  mm-implement-swap-prefetching.patch
  swap-prefetch-avoid-repeating-entry.patch
  add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch

The swap-prefetch feature is relatively compact:

   10 files changed, 745 insertions(+), 1 deletion(-)

it is contained mostly to itself:

   mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++

i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a 
clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback 
i've seen so far was positive. Time to have this upstream and time for a 
desktop-oriented distro to pick it up.

I think this has been held back way too long. It's .config selectable 
and it is as ready for integration as it ever is going to be. So it's a 
win/win scenario.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
@ 2007-05-03 16:15   ` Michal Piotrowski
  2007-05-03 16:23     ` Michal Piotrowski
  2007-05-03 22:14   ` Con Kolivas
  2007-05-04  7:34   ` Nick Piggin
  2 siblings, 1 reply; 233+ messages in thread
From: Michal Piotrowski @ 2007-05-03 16:15 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas

Hi,

On 03/05/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > - If replying, please be sure to cc the appropriate individuals.
> >   Please also consider rewriting the Subject: to something
> >   appropriate.
>
> i'm wondering about swap-prefetch:
>
>   mm-implement-swap-prefetching.patch
>   swap-prefetch-avoid-repeating-entry.patch
>   add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
>
> The swap-prefetch feature is relatively compact:
>
>    10 files changed, 745 insertions(+), 1 deletion(-)
>
> it is contained mostly to itself:
>
>    mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++
>
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.

I've been using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs)
http://www.stardust.webpages.pl/files/tbf/euridica/2.6.17-mm1/mm-config
and I don't recall _any_ problems. It's very stable for me.

>
> Acked-by: Ingo Molnar <mingo@elte.hu>
>
>         Ingo

Regards,
Michal

-- 
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-03 16:15   ` Michal Piotrowski
@ 2007-05-03 16:23     ` Michal Piotrowski
  0 siblings, 0 replies; 233+ messages in thread
From: Michal Piotrowski @ 2007-05-03 16:23 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas

On 03/05/07, Michal Piotrowski <michal.k.k.piotrowski@gmail.com> wrote:
> Hi,
>
> On 03/05/07, Ingo Molnar <mingo@elte.hu> wrote:
> >
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > > - If replying, please be sure to cc the appropriate individuals.
> > >   Please also consider rewriting the Subject: to something
> > >   appropriate.
> >
> > i'm wondering about swap-prefetch:
> >
> >   mm-implement-swap-prefetching.patch
> >   swap-prefetch-avoid-repeating-entry.patch
> >   add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
> >
> > The swap-prefetch feature is relatively compact:
> >
> >    10 files changed, 745 insertions(+), 1 deletion(-)
> >
> > it is contained mostly to itself:
> >
> >    mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++
> >
> > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> > i've seen so far was positive. Time to have this upstream and time for a
> > desktop-oriented distro to pick it up.
> >
> > I think this has been held back way too long. It's .config selectable
> > and it is as ready for integration as it ever is going to be. So it's a
> > win/win scenario.
>
> I've been using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs)

since 2.6.14-mm2 :)
http://lkml.org/lkml/2005/11/11/260

Regards,
Michal

-- 
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
  2007-05-03 16:15   ` Michal Piotrowski
@ 2007-05-03 22:14   ` Con Kolivas
  2007-05-04  7:34   ` Nick Piggin
  2 siblings, 0 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-03 22:14 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm

On Friday 04 May 2007 01:54, Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> > - If replying, please be sure to cc the appropriate individuals.
> >   Please also consider rewriting the Subject: to something
> >   appropriate.

> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.
>
> Acked-by: Ingo Molnar <mingo@elte.hu>

Thank you very much for code review, ack and support!

-- 
-ck

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
  2007-05-03 16:15   ` Michal Piotrowski
  2007-05-03 22:14   ` Con Kolivas
@ 2007-05-04  7:34   ` Nick Piggin
  2007-05-04  8:52     ` Ingo Molnar
  2 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-04  7:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas

Ingo Molnar wrote:
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> 
>>- If replying, please be sure to cc the appropriate individuals.  
>>  Please also consider rewriting the Subject: to something 
>>  appropriate.
> 
> 
> i'm wondering about swap-prefetch:

Well I had some issues with it that I don't think were fully discussed,
and Andrew prompted me to say something, but it went off list for a
couple of posts (my fault, sorry). Posting it below with Andrew's
permission...

>   mm-implement-swap-prefetching.patch
>   swap-prefetch-avoid-repeating-entry.patch
>   add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
> 
> The swap-prefetch feature is relatively compact:
> 
>    10 files changed, 745 insertions(+), 1 deletion(-)
> 
> it is contained mostly to itself:
> 
>    mm/swap_prefetch.c            |  581 ++++++++++++++++++++++++++++++++
> 
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a 
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback 
> i've seen so far was positive. Time to have this upstream and time for a 
> desktop-oriented distro to pick it up.
> 
> I think this has been held back way too long. It's .config selectable 
> and it is as ready for integration as it ever is going to be. So it's a 
> win/win scenario.

Being able to config all these core heuristics changes is really not that
much of a positive. The fact that we might _need_ to config something out,
and double the configuration range isn't too pleasing.

Here were some of my concerns, and where our discussion got up to.

Andrew Morton wrote:
 > On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
 >
 >
 >>Andrew Morton wrote:
 >>
 >>>istr you had issues with swap-prefetch?
 >>>
 >>>If so, now's a good time to reiterate them ;)
 >>
 >>1) It is said to help with the updatedb overnight problem, however it
 >>    actually _doesn't_ prefetch swap when there are low free pages, which
 >>    is how updatedb will leave the system. So this really puzzles me how
 >>    it would work. However if updatedb is causing excessive swapout, I
 >>    think we should try improving use-once algorithms first, for example.
 >
 >
 > Yes.  Perhaps it just doesn't help with the updatedb thing.  Or maybe with
 > normal system activity we get enough free pages to kick the thing off and
 > running.  Perhaps updatedb itself has a lot of rss, for example.

Could be, but I don't know. I'd think it unlikely to allow _much_ swapin,
if huge amounts of the desktop have been swapped out. But maybe... as I
said, nobody seems to have a recipe for these things.

 > Would be useful to see this claim substantiated with a real testcase,
 > description of results and an explanation of how and why it worked.

Yes... and then try to first improve regular page reclaim and use-once
handling.

 >>2) It is a _highly_ speculative operation, and in workloads that alternate
 >>    between periods of low and high page usage, with genuinely unused
 >>    anonymous / tmpfs pages, it could waste power, memory bandwidth, bus
 >>    bandwidth, disk bandwidth...
 >
 >
 > Yes.  I suspect that's a matter of waiting for the corner-case reporters to
 > complain, then add more heuristics.

Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is
happy to do a _lot_ of work for these things which we have already decided
are least likely to be used again.

 >>3) I haven't seen a single set of numbers out of it. Feedback seems to
 >>    have mostly come from people who
 >
 >
 > Yup.  But can we come up with a testcase?  It's hard.

I guess it is hard firstly because swapping is quite random to start with.
But I haven't even seen basic things like "make -jhuge swapstorm has no
regressions".

 >>4) If this is helpful, wouldn't it be equally important for things like
 >>    mapped file pages? Seems like half a solution.
 >
 >
 > True.
 >
 > Without thinking about it, I almost wonder if one could do a userspace
 > implementation with something which pokes around in /proc/pid/pagemap and
 > /proc/pid/kpagemap, perhaps with some additional interfaces added to
 > do a swapcache read.  (Give userspace the ability to get at swapcache
 > via a regular fd?)
 >
 > (otoh the akpm userspace implementation is swapoff -a;swapon -a)

Perhaps. You may need a few indicators to see whether the system is idle...
but OTOH, we've already got a lot of indicators for memory, disk usage,
etc. So, maybe :)

 >>5) New one: it is possibly going to interact badly with MADV_FREE lazy
 >>    freeing. The more complex we make page reclaim, the worse IMO.
 >
 >
 > That's a bit vague.  What sort of problems do you envisage?

Well MADV_FREE pages aren't technically free, are they? So it might be
possible for a significant number of them to build up and prevent
swap prefetch from working. Maybe.

 >>...) I had a few issues with implementation, like interaction with
 >>    cpusets. Don't know if these are all fixed or not. I sort of gave
 >>    up looking at it.
 >
 >
 > Ah yes, I remember some mention of cpusets.  I forget what it was though.

I could be wrong, but IIRC there is no good way to know which cpuset to
bring the page back into, (and I guess similarly it would be hard to know
what container to account it to, if doing account-on-allocate).

We could hope that users of these features would be mostly disjoint sets,
but that's an evil road to start heading down, where we have various core
bits of mm that don't play nice together.

-- 
SUSE Labs, Novell Inc.

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04  7:34   ` Nick Piggin
@ 2007-05-04  8:52     ` Ingo Molnar
  2007-05-04  9:09       ` Nick Piggin
                         ` (2 more replies)
  0 siblings, 3 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-04  8:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > i'm wondering about swap-prefetch:

> Being able to config all these core heuristics changes is really not 
> that much of a positive. The fact that we might _need_ to config 
> something out, and double the configuration range isn't too pleasing.

Well, to the desktop user this is a speculative performance feature for 
which he is willing to potentially waste CPU and IO capacity, in 
expectation of better performance.

On the conceptual level it is _precisely the same thing as regular file 
readahead_. (with the difference that to me swapahead seems to be quite 
a bit more intelligent than our current file readahead logic.)

This feature has no API or ABI impact at all, it's a pure performance 
feature. (besides the trivial sysctl to turn it on/off at runtime).

> Here were some of my concerns, and where our discussion got up to.

> > Yes.  Perhaps it just doesn't help with the updatedb thing.  Or 
> > maybe with normal system activity we get enough free pages to kick 
> > the thing off and running.  Perhaps updatedb itself has a lot of 
> > rss, for example.
> 
> Could be, but I don't know. I'd think it unlikely to allow _much_ 
> swapin, if huge amounts of the desktop have been swapped out. But 
> maybe... as I said, nobody seems to have a recipe for these things.

can i take this one as a "no fundamental objection"? There are really 
only 2 maintenance options left:

  1) either you can do it better or at least have a _very_ clearly
     described idea outlined about how to do it differently

  2) or you should let others try it

#1 you've not done for 2-3 years since swap-prefetch was waiting for
integration so it's not an option at this stage anymore. Then you are 
pretty much obliged to do #2. ;-)

And let me be really blunt about this, there is no option #3 to say: "I 
have no real better idea, I have no code, I have no time, but hey, let's 
not merge this because it 'could in theory' be possible to do it better" 
=B-)

really, we are likely to be better off by risking the merge of _bad_ code 
(which in the swap-prefetch case is the exact opposite of the truth), 
than to let code stagnate. People are clearly unhappy about certain 
desktop aspects of swapping, and the only way out of that is to let more 
people hack that code. Merging code involves more people. It will cause 
'noise' and could cause regressions, but at least in this case the only 
impact is 'performance' and the feature is trivial to disable.

The maintenance drag outside of swap_prefetch.c is essentially _zero_. 
If the feature doesnt work it ends up on Con's desk. If it turns out to 
not work at all (despite years of testing and happy users) it still only 
ends up on Con's desk. A clear win/win scenario for you i think :-)

> > Would be useful to see this claim substantiated with a real 
> > testcase, description of results and an explanation of how and why 
> > it worked.
> 
> Yes... and then try to first improve regular page reclaim and use-once 
> handling.

agreed. Con, IIRC you wrote a testcase for this, right? Could you please 
send us the results of that testing?

> >>2) It is a _highly_ speculative operation, and in workloads that alternate
> >>    between periods of low and high page usage, with genuinely unused
> >>    anonymous / tmpfs pages, it could waste power, memory bandwidth, bus
> >>    bandwidth, disk bandwidth...
> > 
> > Yes.  I suspect that's a matter of waiting for the corner-case 
> > reporters to complain, then add more heuristics.
> 
> Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch 
> is happy to do a _lot_ of work for these things which we have already 
> decided are least likely to be used again.

i see no real problem here. We've had heuristics for a _long_ time in 
various areas of the code. Sometimes they work, sometimes they suck.

the flow of this is really easy: distro looking for a feature edge turns 
it on and announces it, if the feature does not work out for users then 
user turns it off and complains to distro, if enough users complain then 
distro turns it off for next release, upstream forgets about this 
performance feature and eventually removes it once someone notices that 
it wouldnt even compile in the past 2 main releases. I see no problem 
here, we did that in the past too with performance features. The 
networking stack has literally dozens of such small tunable things which 
get experimented with, and whose defaults do get tuned carefully. Some 
of the knobs help bandwidth, some help latency.

I do not even see any risk of "splitup of mindshare" - swap-prefetch is 
so clearly speculative that it's not really a different view about how 
to do swapping (which would split the tester base, etc.), it's simply a 
"do you want your system to speculate about the future or not" add-on 
decision. Every system has a pretty clear idea about that: desktops 
generally want to do it, clusters generally dont want to do it.

> >>3) I haven't seen a single set of numbers out of it. Feedback seems to
> >>    have mostly come from people who
> >
> > Yup.  But can we come up with a testcase?  It's hard.

i think Con has a testcase.

> >>4) If this is helpful, wouldn't it be equally important for things like
> >>    mapped file pages? Seems like half a solution.
[...]
> > (otoh the akpm userspace implementation is swapoff -a;swapon -a)
> 
> Perhaps. You may need a few indicators to see whether the system is 
> idle... but OTOH, we've already got a lot of indicators for memory, 
> disk usage, etc. So, maybe :)

The time has passed for this. Let others play too. Please :-)

> I could be wrong, but IIRC there is no good way to know which cpuset 
> to bring the page back into, (and I guess similarly it would be hard 
> to know what container to account it to, if doing 
> account-on-allocate).

(i think cpusets are totally uninteresting in this context: nobody in 
their right mind is going to use swap-prefetch on a big NUMA box. Nor 
can i see any fundamental impediment to making this more cpuset-aware, 
just like other subsystems were made cpuset-aware, once the requests 
from actual users came in and people started getting interested in it.)

I think the "lack of testcase and numbers" is the only valid technical 
objection i've seen so far. Con might be able to help us with that?

	Ingo

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04  8:52     ` Ingo Molnar
@ 2007-05-04  9:09       ` Nick Piggin
  2007-05-04 12:10       ` Con Kolivas
  2007-05-07 14:18       ` Bill Davidsen
  2 siblings, 0 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-04  9:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel, linux-mm, Con Kolivas

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>Here were some of my concerns, and where our discussion got up to.
> 
> 
>>>Yes.  Perhaps it just doesn't help with the updatedb thing.  Or 
>>>maybe with normal system activity we get enough free pages to kick 
>>>the thing off and running.  Perhaps updatedb itself has a lot of 
>>>rss, for example.
>>
>>Could be, but I don't know. I'd think it unlikely to allow _much_ 
>>swapin, if huge amounts of the desktop have been swapped out. But 
>>maybe... as I said, nobody seems to have a recipe for these things.
> 
> 
> can i take this one as a "no fundamental objection"? There are really 
> only 2 maintenance options left:
> 
>   1) either you can do it better or at least have a _very_ clearly
>      described idea outlined about how to do it differently
> 
>   2) or you should let others try it
> 
> #1 you've not done for 2-3 years since swap-prefetch was waiting for
> integration so it's not an option at this stage anymore. Then you are 
> pretty much obliged to do #2. ;-)

The burden is not on me to get someone else's feature merged. If it
can be shown to work well and people's concerns addressed, then anything
will get merged. The reason Linux is so good is because of what we don't
merge, figuratively speaking.

I wanted to see some basic regression tests to show that it hasn't caused
obvious problems, and some basic scenarios where it helps, so that we can
analyse them. It is really simple, but I haven't got any since first
asking.

And note that I don't think I ever explicitly "nacked" anything, just
voiced my concerns. If my concerns had been addressed, then I couldn't
have stopped anybody from merging anything.


>>>>2) It is a _highly_ speculative operation, and in workloads that mix periods
>>>>   of low and high page usage with genuinely unused anonymous / tmpfs
>>>>   pages, it could waste power, memory bandwidth, bus bandwidth, disk
>>>>   bandwidth...
>>>
>>>Yes.  I suspect that's a matter of waiting for the corner-case 
>>>reporters to complain, then add more heuristics.
>>
>>Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch 
>>is happy to do a _lot_ of work for these things which we have already 
>>decided are least likely to be used again.
> 
> 
> i see no real problem here. We've had heuristics for a _long_ time in 
> various areas of the code. Sometimes they work, sometimes they suck.

So that's one of my issues with the code. If all you have to support a
merge is anecdotal evidence, then I find it interesting that you would
easily discount something like this.


>>>>4) If this is helpful, wouldn't it be equally important for things like
>>>>   mapped file pages? Seems like half a solution.
> 
> [...]
> 
>>>(otoh the akpm userspace implementation is swapoff -a;swapon -a)
>>
>>Perhaps. You may need a few indicators to see whether the system is 
>>idle... but OTOH, we've already got a lot of indicators for memory, 
>>disk usage, etc. So, maybe :)
> 
> 
> The time has passed for this. Let others play too. Please :-)

Play with what? Prefetching mmaped file pages as well? Sure.


>>I could be wrong, but IIRC there is no good way to know which cpuset 
>>to bring the page back into, (and I guess similarly it would be hard 
>>to know what container to account it to, if doing 
>>account-on-allocate).
> 
> 
> (i think cpusets are totally uninteresting in this context: nobody in 
> their right mind is going to use swap-prefetch on a big NUMA box. Nor 
> can i see any fundamental impediment to making this more cpuset-aware, 
> just like other subsystems were made cpuset-aware, once the requests 
> from actual users came in and people started getting interested in it.)

OK, so make it more cpuset aware. This isn't a new issue, I raised it
a long time ago. And trust me, it is a nightmare to just assume that
nobody will use cpusets on a small box for example (AFAIK the resource
control guys are looking at doing just that).

All core VM features should play nicely with each other unless there is a
*really* good reason not to.


> I think the "lack of testcase and numbers" is the only valid technical 
> objection i've seen so far.

Well you're entitled to your opinion too.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04  8:52     ` Ingo Molnar
  2007-05-04  9:09       ` Nick Piggin
@ 2007-05-04 12:10       ` Con Kolivas
  2007-05-05  8:42         ` Con Kolivas
  2007-05-07 14:28         ` Bill Davidsen
  2007-05-07 14:18       ` Bill Davidsen
  2 siblings, 2 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-04 12:10 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> agreed. Con, IIRC you wrote a testcase for this, right? Could you please
> send us the results of that testing?

Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch 
disabled and then enabled, swap prefetch saves ~5 seconds on average hardware 
on this one test case. I had many users try this and the results were between 
2 and 10 seconds, but always showed a saving on this testcase. This effect 
easily occurs on printing a big picture, editing a large file, compressing an 
iso image or whatever in real world workloads. Smaller, but much more 
frequent effects of this over the course of a day obviously also occur and do 
add up.

-- 
-ck

[-- Attachment #2: swap_prefetch_tester.c --]
[-- Type: text/x-csrc, Size: 2067 bytes --]

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}

	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

size_t get_ram(void)
{
	unsigned long ramsize;
	FILE *meminfo;
	char aux[256];

	if(!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");

	while( !feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize) )
		fgets(aux,sizeof(aux),meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	return ramsize * 1000;
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000 );
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size, full_size = get_ram();
	int sleep_seconds = 600;

	size = full_size * 7 / 10;
	printf("Starting first malloc of %zu bytes\n", size);
	buf1 = malloc(size);
	if (buf1 == NULL)	/* malloc reports failure with NULL, not -1 */
		fatal("Failed to malloc 1st buffer\n");
	memset(buf1, 0, size);
	printf("Completed first malloc and starting second malloc of %zu bytes\n", full_size);

	buf2 = malloc(full_size);
	if (buf2 == NULL)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, full_size);
	buf4 = malloc(1);
	if (buf4 == NULL)
		fatal("Failed to malloc scratch byte\n");
	/* Walk backwards from the last valid byte; buf2[full_size] is out of bounds. */
	for (buf3 = buf2 + full_size; buf3 > buf2; )
		*buf4 = *--buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	free(buf4);
	free(buf1);
	printf("Completed read and freeing of first malloc\n");
	time_diff = current_time - time_diff;
	printf("Timed portion %lu microseconds\n",time_diff);

	return 0;
}

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04 12:10       ` Con Kolivas
@ 2007-05-05  8:42         ` Con Kolivas
  2007-05-06 10:13           ` [ck] " Antonino Ingargiola
                             ` (2 more replies)
  2007-05-07 14:28         ` Bill Davidsen
  1 sibling, 3 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-05  8:42 UTC (permalink / raw)
  To: Ingo Molnar, ck list; +Cc: Nick Piggin, Andrew Morton, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1785 bytes --]

On Friday 04 May 2007 22:10, Con Kolivas wrote:
> On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > agreed. Con, IIRC you wrote a testcase for this, right? Could you please
> > send us the results of that testing?
>
> Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
> disabled and then enabled, swap prefetch saves ~5 seconds on average
> hardware on this one test case. I had many users try this and the results
> were between 2 and 10 seconds, but always showed a saving on this testcase.
> This effect easily occurs on printing a big picture, editing a large file,
> compressing an iso image or whatever in real world workloads. Smaller, but
> much more frequent effects of this over the course of a day obviously also
> occur and do add up.

Here's a better swap prefetch tester. Instructions in file.

Machine with 2GB ram and 2GB swapfile

Prefetch disabled:
./sp_tester
Ram 2060352000  Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 809 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 53397 milliseconds

Enabled:
./sp_tester
Ram 2060352000  Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 676 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 26351 milliseconds

Note huge time difference.

-- 
-ck

[-- Attachment #2: sp_tester.c --]
[-- Type: text/x-csrc, Size: 2890 bytes --]

/*
sp_tester.c

Build with:
gcc -o sp_tester sp_tester.c -lrt -W -Wall -O2

How to use:
echo 1 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

then repeat with changed conditions eg
echo 0 > /proc/sys/vm/swap_prefetch

Each Test takes 10 minutes
*/

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <time.h>

void fatal(const char *format, ...)
{
	va_list ap;

	if (format) {
		va_start(ap, format);
		vfprintf(stderr, format, ap);
		va_end(ap);
	}

	fprintf(stderr, "Fatal error - exiting\n");
	exit(1);
}

unsigned long ramsize, swapsize;

size_t get_ram(void)
{
	FILE *meminfo;
	char aux[256];

	if(!(meminfo = fopen("/proc/meminfo", "r")))
		fatal("fopen\n");

	while( !feof(meminfo) && !fscanf(meminfo, "MemTotal: %lu kB", &ramsize) )
		fgets(aux,sizeof(aux),meminfo);
	while( !feof(meminfo) && !fscanf(meminfo, "SwapTotal: %lu kB", &swapsize) )
		fgets(aux,sizeof(aux),meminfo);
	if (fclose(meminfo) == -1)
		fatal("fclose");
	ramsize *= 1000;
	swapsize *= 1000;
	printf("Ram %lu  Swap %lu\n", ramsize, swapsize);
	return ramsize + (swapsize / 2);
}

unsigned long get_usecs(struct timespec *myts)
{
	if (clock_gettime(CLOCK_REALTIME, myts))
		fatal("clock_gettime");
	return (myts->tv_sec * 1000000 + myts->tv_nsec / 1000 );
}

int main(void)
{
	unsigned long current_time, time_diff;
	struct timespec myts;
	char *buf1, *buf2, *buf3, *buf4;
	size_t size = get_ram();
	int sleep_seconds = 600;

	if (size > ramsize / 2 * 3)
		size = ramsize / 2 * 3;
	printf("Total ram to be malloced: %zu bytes\n", size);
	size /= 2;
	printf("Starting first malloc of %zu bytes\n", size);
	buf1 = malloc(size);
	buf4 = malloc(1);
	if (buf1 == NULL)	/* malloc reports failure with NULL, not -1 */
		fatal("Failed to malloc 1st buffer\n");
	if (buf4 == NULL)
		fatal("Failed to malloc scratch byte\n");
	memset(buf1, 0, size);
	printf("Starting 1st read of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Touching this much ram takes %lu milliseconds\n",time_diff / 1000);
	printf("Starting second malloc of %zu bytes\n", size);

	buf2 = malloc(size);
	if (buf2 == NULL)
		fatal("Failed to malloc 2nd buffer\n");
	memset(buf2, 0, size);
	/* Walk backwards from the last valid byte; buf2[size] is out of bounds. */
	for (buf3 = buf2 + size; buf3 > buf2; )
		*buf4 = *--buf3;
	free(buf2);
	printf("Completed second malloc and free\n");

	printf("Sleeping for %d seconds\n", sleep_seconds);
	sleep(sleep_seconds);

	printf("Important part - starting reread of first malloc\n");
	time_diff = current_time = get_usecs(&myts);
	for (buf3 = buf1; buf3 < buf1 + size; buf3++)
		*buf4 = *buf3;
	current_time = get_usecs(&myts);
	time_diff = current_time - time_diff;
	printf("Completed read of first malloc\n");
	printf("Timed portion %lu milliseconds\n",time_diff / 1000);

	free(buf1);
	free(buf4);

	return 0;
}

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-05  8:42         ` Con Kolivas
@ 2007-05-06 10:13           ` Antonino Ingargiola
  2007-05-06 18:22           ` Jory A. Pratt
  2007-05-09 23:28           ` Con Kolivas
  2 siblings, 0 replies; 233+ messages in thread
From: Antonino Ingargiola @ 2007-05-06 10:13 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Nick Piggin, Andrew Morton, linux-kernel,
	linux-mm

2007/5/5, Con Kolivas <kernel@kolivas.org>:
[cut]
> Here's a better swap prefetch tester. Instructions in file.

The system should be left totally inactive for the test duration (10m), right?


Regards,

  ~ Antonio

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-05  8:42         ` Con Kolivas
  2007-05-06 10:13           ` [ck] " Antonino Ingargiola
@ 2007-05-06 18:22           ` Jory A. Pratt
  2007-05-09 23:28           ` Con Kolivas
  2 siblings, 0 replies; 233+ messages in thread
From: Jory A. Pratt @ 2007-05-06 18:22 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Ingo Molnar, ck list, Nick Piggin, Andrew Morton, linux-kernel,
	linux-mm

Con Kolivas wrote:
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester
> Ram 2060352000  Swap 1973420000
> Total ram to be malloced: 3047062000 bytes
> Starting first malloc of 1523531000 bytes
> Starting 1st read of first malloc
> Touching this much ram takes 809 milliseconds
> Starting second malloc of 1523531000 bytes
> Completed second malloc and free
> Sleeping for 600 seconds
> Important part - starting reread of first malloc
> Completed read of first malloc
> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester
> Ram 2060352000  Swap 1973420000
> Total ram to be malloced: 3047062000 bytes
> Starting first malloc of 1523531000 bytes
> Starting 1st read of first malloc
> Touching this much ram takes 676 milliseconds
> Starting second malloc of 1523531000 bytes
> Completed second malloc and free
> Sleeping for 600 seconds
> Important part - starting reread of first malloc
> Completed read of first malloc
> Timed portion 26351 milliseconds
>   
echo 1 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

Ram 1153644000  Swap 1004052000
Total ram to be malloced: 1655670000 bytes
Starting first malloc of 827835000 bytes
Starting 1st read of first malloc
Touching this much ram takes 937 milliseconds
Starting second malloc of 827835000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 15011 milliseconds

echo 0 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

Ram 1153644000  Swap 1004052000
Total ram to be malloced: 1655670000 bytes
Starting first malloc of 827835000 bytes
Starting 1st read of first malloc
Touching this much ram takes 1125 milliseconds
Starting second malloc of 827835000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 14611 milliseconds


^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-05  8:42         ` Con Kolivas
  2007-05-06 10:13           ` [ck] " Antonino Ingargiola
  2007-05-06 18:22           ` Jory A. Pratt
@ 2007-05-09 23:28           ` Con Kolivas
  2007-05-10  0:05             ` Nick Piggin
  2 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-09 23:28 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: ck list, Nick Piggin, Andrew Morton, linux-kernel, linux-mm

On Saturday 05 May 2007 18:42, Con Kolivas wrote:
> On Friday 04 May 2007 22:10, Con Kolivas wrote:
> > On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > > agreed. Con, IIRC you wrote a testcase for this, right? Could you
> > > please send us the results of that testing?
> >
> > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
> > disabled and then enabled, swap prefetch saves ~5 seconds on average
> > hardware on this one test case. I had many users try this and the results
> > were between 2 and 10 seconds, but always showed a saving on this
> > testcase. This effect easily occurs on printing a big picture, editing a
> > large file, compressing an iso image or whatever in real world workloads.
> > Smaller, but much more frequent effects of this over the course of a day
> > obviously also occur and do add up.
>
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester

> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester

> Timed portion 26351 milliseconds
>
> Note huge time difference.

Well how about that? That was the difference with a swap _file_ as I said, but 
I went ahead and checked with a swap partition as I used to have. I didn't 
notice, but somewhere in the last few months, swap prefetch code itself being 
unchanged for a year, seems to have been broken by other changes in the vm 
and it doesn't even start up prefetching often and has stale swap entries in 
its list. Once it breaks like that it does nothing from then on. So that 
leaves me with a quandary now.

Do I:

1. Go ahead and find whatever breakage was introduced and fix it with 
hopefully a trivial change

2. Do option 1. and then implement support for yet another kernel feature 
(cpusets) that will be used perhaps never with swap prefetch [No Nick I don't 
believe you that cpusets have anything to do with normal users on a desktop 
ever; if it's used on a desktop it will only be by a kernel developer testing 
the cpusets code].

or

3. Dump swap prefetch forever and ignore that it ever worked and was helpful 
and was a lot of work to implement and so on.

Given that even if I do 1 and/or 2 it'll still be blocked from ever going to 
mainline I think the choice is clear.

Nick since you're personally the gatekeeper for this code, would you like to 
make a call? Just say 3 and put me out of my misery please.

-- 
-ck

P.S. Ingo, thanks (and sorry) for your involvement here.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-09 23:28           ` Con Kolivas
@ 2007-05-10  0:05             ` Nick Piggin
  2007-05-10  1:34               ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-10  0:05 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm

Con Kolivas wrote:

> Well how about that? That was the difference with a swap _file_ as I said, but 
> I went ahead and checked with a swap partition as I used to have. I didn't 
> notice, but somewhere in the last few months, swap prefetch code itself being 
> unchanged for a year, seems to have been broken by other changes in the vm 
> and it doesn't even start up prefetching often and has stale swap entries in 
> its list. Once it breaks like that it does nothing from then on. So that 
> leaves me with a quandary now.
> 
> 
> Do I:
> 
> 1. Go ahead and find whatever breakage was introduced and fix it with 
> hopefully a trivial change
> 
> 2. Do option 1. and then implement support for yet another kernel feature 
> (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't 
> believe you that cpusets have anything to do with normal users on a desktop 
> ever; if it's used on a desktop it will only be by a kernel developer testing 
> the cpusets code].
> 
> or
> 
> 3. Dump swap prefetch forever and ignore that it ever worked and was helpful 
> and was a lot of work to implement and so on.
> 
> 
> Given that even if I do 1 and/or 2 it'll still be blocked from ever going to 
> mainline I think the choice is clear.
> 
> Nick since you're personally the gatekeeper for this code, would you like to 
> make a call? Just say 3 and put me out of my misery please.

I'm not the gatekeeper and it is completely up to you whether you want
to work on something or not... but I'm sure you understand where I was
coming from when I suggested it doesn't get merged yet.

You may not believe this, but I agree that swap prefetching (and
prefetching in general) has some potential to help desktop workloads :).
But it still should go through the normal process of being tested and
questioned and having a look at options for first improving existing
code in those problematic cases.

Once that process happens and it is shown to work nicely, etc., then I
would not be able to (or want to) keep it from getting merged.

As far as cpusets goes... if your code goes in last, then you have to
make it work with what is there, as a rule. People are using cpusets
for memory resource control, which would have uses on a desktop system.
It is just a really bad precedent to set, having different parts of the
VM not work correctly together. Even if you made them mutually
exclusive CONFIG_ options, that is still not a very nice solution.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  0:05             ` Nick Piggin
@ 2007-05-10  1:34               ` Con Kolivas
  2007-05-10  1:56                 ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-10  1:34 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm

On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> Con Kolivas wrote:
> > Well how about that? That was the difference with a swap _file_ as I
> > said, but I went ahead and checked with a swap partition as I used to
> > have. I didn't notice, but somewhere in the last few months, swap
> > prefetch code itself being unchanged for a year, seems to have been
> > broken by other changes in the vm and it doesn't even start up
> > prefetching often and has stale swap entries in its list. Once it breaks
> > like that it does nothing from then on. So that leaves me with a quandary
> > now.
> >
> >
> > Do I:
> >
> > 1. Go ahead and find whatever breakage was introduced and fix it with
> > hopefully a trivial change
> >
> > 2. Do option 1. and then implement support for yet another kernel feature
> > (cpusets) that will be used perhaps never with swap prefetch [No Nick I
> > don't believe you that cpusets have anything to do with normal users on a
> > desktop ever; if it's used on a desktop it will only be by a kernel
> > developer testing the cpusets code].
> >
> > or
> >
> > 3. Dump swap prefetch forever and ignore that it ever worked and was
> > helpful and was a lot of work to implement and so on.
> >
> >
> > Given that even if I do 1 and/or 2 it'll still be blocked from ever going
> > to mainline I think the choice is clear.
> >
> > Nick since you're personally the gatekeeper for this code, would you like
> > to make a call? Just say 3 and put me out of my misery please.
>
> I'm not the gatekeeper and it is completely up to you whether you want
> to work on something or not... but I'm sure you understand where I was
> coming from when I suggested it doesn't get merged yet.

No matter how you spin it, you're the gatekeeper.

> You may not believe this, but I agree that swap prefetching (and
> prefetching in general) has some potential to help desktop workloads :).
> But it still should go through the normal process of being tested and
> questioned and having a look at options for first improving existing
> code in those problematic cases.

Not this again? Proof was there ages ago that it helped and no proof that it 
harmed could be found yet you cunningly pretend it never existed. It's been 
done to death and I'm sick of this.

> Once that process happens and it is shown to work nicely, etc., then I
> would not be able to (or want to) keep it from getting merged.
>
> As far as cpusets goes... if your code goes in last, then you have to
> make it work with what is there, as a rule. People are using cpusets
> for memory resource control, which would have uses on a desktop system.
> It is just a really bad precedent to set, having different parts of the
> VM not work correctly together. Even if you made them mutually
> exclusive CONFIG_ options, that is still not a very nice solution.

That's as close to a 3 as I'm likely to get out of you.

Andrew you'll be relieved to know I would like you to throw swap prefetch and 
related patches into the bin. Thanks.

-- 
-ck

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  1:34               ` Con Kolivas
@ 2007-05-10  1:56                 ` Nick Piggin
  2007-05-10  3:48                   ` Ray Lee
  0 siblings, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-10  1:56 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Ingo Molnar, ck list, Andrew Morton, linux-kernel, linux-mm

Con Kolivas wrote:
> On Thursday 10 May 2007 10:05, Nick Piggin wrote:

>>I'm not the gatekeeper and it is completely up to you whether you want
>>to work on something or not... but I'm sure you understand where I was
>>coming from when I suggested it doesn't get merged yet.
> 
> 
> No matter how you spin it, you're the gatekeeper.

If raising unaddressed issues means closing a gate, then OK. You can
equally open it by answering them.


>>You may not believe this, but I agree that swap prefetching (and
>>prefetching in general) has some potential to help desktop workloads :).
>>But it still should go through the normal process of being tested and
>>questioned and having a look at options for first improving existing
>>code in those problematic cases.
> 
> 
> Not this again? Proof was there ages ago that it helped and no proof that it 
> harmed could be found yet you cunningly pretend it never existed. It's been 
> done to death and I'm sick of this.

I said I know it can help. Do you know how many patches I have that help
some workloads but are not merged? That's just the way it works.

What I have seen is it helps the case where you force out a huge amount
of swap. OK, that's nice -- the base case obviously works.

You said it helped with the updatedb problem. That says we should look at
why it is going bad first, and for example improve use-once algorithms.
After we do that, then swap prefetching might still help, which is fine.


>>Once that process happens and it is shown to work nicely, etc., then I
>>would not be able to (or want to) keep it from getting merged.
>>
>>As far as cpusets goes... if your code goes in last, then you have to
>>make it work with what is there, as a rule. People are using cpusets
>>for memory resource control, which would have uses on a desktop system.
>>It is just a really bad precedent to set, having different parts of the
>>VM not work correctly together. Even if you made them mutually
>>exclusive CONFIG_ options, that is still not a very nice solution.
> 
> 
> That's as close to a 3 as I'm likely to get out of you.

If you're not willing to try making it work with existing code, among other
things, then yes it will be difficult to get it merged. That's not going to
change.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  1:56                 ` Nick Piggin
@ 2007-05-10  3:48                   ` Ray Lee
  2007-05-10  3:56                     ` Nick Piggin
  2007-05-10  3:58                     ` swap-prefetch: 2.6.22 -mm merge plans Con Kolivas
  0 siblings, 2 replies; 233+ messages in thread
From: Ray Lee @ 2007-05-10  3:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> You said it helped with the updatedb problem. That says we should look at
> why it is going bad first, and for example improve use-once algorithms.
> After we do that, then swap prefetching might still help, which is fine.

Nick, if you're volunteering to do that analysis, then great. If not,
then you're just providing an airy hope with nothing to back up when or
if that work would ever occur.

Further, if you or someone else *does* do that work, then guess what,
we still have the option to rip out the swap prefetching code after
the hypothetical use-once improvements have been proven and merged.
Which, by the way, I've watched people talk about since 2.4. That was,
y'know, a *while* ago.

So enough with the stop energy, okay? You're better than that.

Con? He is right that the last feature to go in needs to work
gracefully with what's there now. However, it's not unheard of for
authors of other sections of code to help out with incompatibilities
by answering politely phrased questions for guidance. Though the
intersection of users between cpusets and desktop systems seems small
indeed.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  3:48                   ` Ray Lee
@ 2007-05-10  3:56                     ` Nick Piggin
  2007-05-10  5:52                       ` Ray Lee
  2007-05-10  3:58                     ` swap-prefetch: 2.6.22 -mm merge plans Con Kolivas
  1 sibling, 1 reply; 233+ messages in thread
From: Nick Piggin @ 2007-05-10  3:56 UTC (permalink / raw)
  To: Ray Lee
  Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

Ray Lee wrote:
> On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>> You said it helped with the updatedb problem. That says we should look at
>> why it is going bad first, and for example improve use-once algorithms.
>> After we do that, then swap prefetching might still help, which is fine.
> 
> 
> Nick, if you're volunteering to do that analysis, then great. If not,
> then you're just providing an airy hope with nothing to back up when or
> if that work would ever occur.

I'd like to try helping. Tell me your problem.


> Further, if you or someone else *does* do that work, then guess what,
> we still have the option to rip out the swap prefetching code after
> the hypothetical use-once improvements have been proven and merged.
> Which, by the way, I've watched people talk about since 2.4. That was,
> y'know, a *while* ago.

What's wrong with the use-once we have? What improvements are you talking
about?


> So enough with the stop energy, okay? You're better than that.

I don't think it is about energy or being mean, I'm just stating the
issues I have with it.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  3:56                     ` Nick Piggin
@ 2007-05-10  5:52                       ` Ray Lee
  2007-05-10  7:04                         ` Nick Piggin
  0 siblings, 1 reply; 233+ messages in thread
From: Ray Lee @ 2007-05-10  5:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Ray Lee wrote:
> > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> >> You said it helped with the updatedb problem. That says we should look at
> >> why it is going bad first, and for example improve use-once algorithms.
> >> After we do that, then swap prefetching might still help, which is fine.
> >
> > Nick, if you're volunteering to do that analysis, then great. If not,
> > then you're just providing an airy hope with nothing to back up when or
> > if that work would ever occur.
>
> I'd like to try helping. Tell me your problem.

Huh? You already stated one version of it above, namely updatedb. But
let's put this another way, shall we? A gedankenexperiment, if you
will.

Say we have a perfect swap-out algorithm that can choose exactly what
needs to be evicted to disk. ('Perfect', of course, is dependent upon
one's metric, but let's go with "maximizes overall system utilization
and minimizes IO wait time." Arbitrary, but hey.)

So, great, the right things got swapped out. Anything else that could
have been chosen would have caused more overall IO Wait. Yay us.

So what happens when those processes that triggered the swap-outs go
away? (Firefox is closed, I stop hitting my local copy of a database,
whatever.) Well, currently, nothing. What happens when I switch
workspaces and try to use my email program? Swap-ins.

Okay, so why didn't the system swap that stuff in preemptively? Why am
I sitting there waiting for something that it could have already done
in the background?

A new swap-out algorithm, be it use-once, Clock-Pro, or perfect
foreknowledge isn't going to change that issue. Swap prefetch does.

> > Further, if you or someone else *does* do that work, then guess what,
> > we still have the option to rip out the swap prefetching code after
> > the hypothetical use-once improvements have been proven and merged.
> > Which, by the way, I've watched people talk about since 2.4. That was,
> > y'know, a *while* ago.
>
> What's wrong with the use-once we have? What improvements are you talking
> about?

You said, effectively: "Use-once could be improved to deal with
updatedb". I said I've been reading emails from Rik and others talking
about that for four years now, and we're still talking about it. Were
it merely updatedb, I'd say us userspace folk should step up and
rewrite the damn thing to amortize its work. However, I and others
feel it's only an example -- glaring, obviously -- of a more pervasive
issue. A small issue, to be sure!, but an issue nevertheless.

In general, I/others are talking about improving the desktop
experience of running too much on a RAM limited machine. (Which, in my
case, is with a gig and a 2.2GHz processor.)

Or restated: the desktop experience occasionally sucks for me, and I
don't think I'm alone. There may be a heuristic, completely isolated
from userspace (and so isn't an API the kernel has to support! -- if
it doesn't work, we can rip it out again), that may mitigate the
suckiness. Let's try it.

> > So enough with the stop energy, okay? You're better than that.
>
> I don't think it is about energy or being mean, I'm just stating the
> issues I have with it.

Nick, I in no way think you're being mean, and I'm sorry if I've given
you that impression. However, if you're just stating the issues you
have with it, then can I assume that you won't lobby against having
this experiment merged?

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  5:52                       ` Ray Lee
@ 2007-05-10  7:04                         ` Nick Piggin
  2007-05-10  7:20                           ` William Lee Irwin III
                                             ` (2 more replies)
  0 siblings, 3 replies; 233+ messages in thread
From: Nick Piggin @ 2007-05-10  7:04 UTC (permalink / raw)
  To: Ray Lee
  Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

Ray Lee wrote:
> On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>> Ray Lee wrote:
>> > On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>> >
>> >> You said it helped with the updatedb problem. That says we should look at
>> >> why it is going bad first, and for example improve use-once algorithms.
>> >> After we do that, then swap prefetching might still help, which is fine.
>> >
>> > Nick, if you're volunteering to do that analysis, then great. If not,
>> > then you're just providing an airy hope with nothing to back up when or
>> > if that work would ever occur.
>>
>> I'd like to try helping. Tell me your problem.
> 
> 
> Huh? You already stated one version of it above, namely updatedb. But

So a swapping problem with updatedb should be unusual and we'd like to see
if we can fix it without resorting to prefetching.

I know the theory behind swap prefetching, and I'm not saying it doesn't
work, so I'll snip the rest of that.


>> What's wrong with the use-once we have? What improvements are you talking
>> about?
> 
> 
> You said, effectively: "Use-once could be improved to deal with
> updatedb". I said I've been reading emails from Rik and others talking
> about that for four years now, and we're still talking about it. Were
> it merely updatedb, I'd say us userspace folk should step up and
> rewrite the damn thing to amortize its work. However, I and others
> feel it's only an example -- glaring, obviously -- of a more pervasive
> issue. A small issue, to be sure!, but an issue nevertheless.

It isn't going to get fixed unless people complain about it. If you
cover the use-once problem with swap prefetching, then it will never
get fixed.


>> I don't think it is about energy or being mean, I'm just stating the
>> issues I have with it.
> 
> 
> Nick, I in no way think you're being mean, and I'm sorry if I've given
> you that impression. However, if you're just stating the issues you
> have with it, then can I assume that you won't lobby against having
> this experiment merged?

Anybody is free to merge anything into their kernel. And if somebody
asks for my issues with the swap prefetching patch, then I'll give
them :)

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  7:04                         ` Nick Piggin
@ 2007-05-10  7:20                           ` William Lee Irwin III
  2007-05-10 12:34                           ` Ray Lee
  2007-05-12  4:46                           ` [PATCH] mm: swap prefetch improvements Con Kolivas
  2 siblings, 0 replies; 233+ messages in thread
From: William Lee Irwin III @ 2007-05-10  7:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ray Lee, Con Kolivas, Ingo Molnar, ck list, Andrew Morton,
	linux-kernel, linux-mm

Ray Lee wrote:
>> Huh? You already stated one version of it above, namely updatedb. But

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> So a swapping problem with updatedb should be unusual and we'd like to see
> if we can fix it without resorting to prefetching.
> I know the theory behind swap prefetching, and I'm not saying it doesn't
> work, so I'll snip the rest of that.

I've not run updatedb in years, so I have no idea what it does to a
modern kernel. It used to be an unholy terror of slab fragmentation
and displacing user memory. The case of streaming kernel metadata IO
is probably not quite as easy as streaming file IO.


Ray Lee wrote:
>> You said, effectively: "Use-once could be improved to deal with
>> updatedb". I said I've been reading emails from Rik and others talking
>> about that for four years now, and we're still talking about it. Were
>> it merely updatedb, I'd say us userspace folk should step up and
>> rewrite the damn thing to amortize its work. However, I and others
>> feel it's only an example -- glaring, obviously -- of a more pervasive
>> issue. A small issue, to be sure!, but an issue nevertheless.

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> It isn't going to get fixed unless people complain about it. If you
> cover the use-once problem with swap prefetching, then it will never
> get fixed.

The policy people need to clean this up once and for all at some point.
clameter's targeted reclaim bits for slub look like a plausible tactic,
but are by no means comprehensive. Things need to attempt to eat their
own tails before eating everyone else alive. Maybe we need to take hits
on things such as badari's dd's to resolve the pathologies.


-- wli

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  7:04                         ` Nick Piggin
  2007-05-10  7:20                           ` William Lee Irwin III
@ 2007-05-10 12:34                           ` Ray Lee
  2007-05-12  4:46                           ` [PATCH] mm: swap prefetch improvements Con Kolivas
  2 siblings, 0 replies; 233+ messages in thread
From: Ray Lee @ 2007-05-10 12:34 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Con Kolivas, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

On 5/10/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > Huh? You already stated one version of it above, namely updatedb. But
>
> So a swapping problem with updatedb should be unusual and we'd like to see
> if we can fix it without resorting to prefetching.
>
> I know the theory behind swap prefetching, and I'm not saying it doesn't
> work, so I'll snip the rest of that.

updatedb is only part of the problem. The other part is that the
kernel has an opportunity to preemptively return some of the evicted
working set to RAM before I ask for it. No fancy use-once algorithm is
going to address that, so your solution is provably incomplete for my
problem.

What's so hard to understand about that?

^ permalink raw reply	[flat|nested] 233+ messages in thread

* [PATCH] mm: swap prefetch improvements
  2007-05-10  7:04                         ` Nick Piggin
  2007-05-10  7:20                           ` William Lee Irwin III
  2007-05-10 12:34                           ` Ray Lee
@ 2007-05-12  4:46                           ` Con Kolivas
  2007-05-12  5:03                             ` Paul Jackson
  2007-05-21 10:03                             ` [PATCH] " Ingo Molnar
  2 siblings, 2 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-12  4:46 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ray Lee, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

It turns out that swap prefetch was not that hard to fix and improve upon,
and since Andrew hasn't dropped swap prefetch, here is a swag of fixes and
improvements instead, including making it depend on !CPUSETS as Nick
requested.

These changes lead to dramatic improvements.

Eg on a machine with 2GB ram and only 500MB swap:

Prefetch disabled:
./sp_tester
Ram 2060352000  Swap 522072000
Total ram to be malloced: 2321388000 bytes
Starting first malloc of 1160694000 bytes
Starting 1st read of first malloc
Touching this much ram takes 529 milliseconds
Starting second malloc of 1160694000 bytes
Completed second malloc and free
Sleeping for 300 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 6030 milliseconds


Prefetch enabled:
./sp_tester
Ram 2060352000  Swap 522072000
Total ram to be malloced: 2321388000 bytes
Starting first malloc of 1160694000 bytes
Starting 1st read of first malloc
Touching this much ram takes 528 milliseconds
Starting second malloc of 1160694000 bytes
Completed second malloc and free
Sleeping for 300 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 665 milliseconds

Note that simply touching the ram took 528 ms, so rereading the 230MB that
was converted from major faults to minor faults took only 137ms instead of 5.5s.

---
Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when laptop_mode
is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 Documentation/sysctl/vm.txt |    4 -
 init/Kconfig                |    2 
 mm/page_io.c                |    2 
 mm/swap_prefetch.c          |  158 +++++++++++++++++++-------------------------
 mm/swap_state.c             |    2 
 mm/vmscan.c                 |    1 
 6 files changed, 75 insertions(+), 94 deletions(-)

Index: linux-2.6.21-mm1/mm/page_io.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_io.c	2007-02-05 22:52:04.000000000 +1100
+++ linux-2.6.21-mm1/mm/page_io.c	2007-05-12 14:30:52.000000000 +1000
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
 		ret = -ENOMEM;
 		goto out;
 	}
+	add_to_swapped_list(page);
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
Index: linux-2.6.21-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_state.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_state.c	2007-05-12 14:30:52.000000000 +1000
@@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
-			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page,
 	int error;
 
 	if (!swap_duplicate(entry)) {
-		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
Index: linux-2.6.21-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/vmscan.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/vmscan.c	2007-05-12 14:30:52.000000000 +1000
@@ -410,7 +410,6 @@ int remove_mapping(struct address_space 
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
Index: linux-2.6.21-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_prefetch.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_prefetch.c	2007-05-12 14:30:52.000000000 +1000
@@ -27,7 +27,8 @@
  * needs to be at least this duration of idle time meaning in practice it can
  * be much longer
  */
-#define PREFETCH_DELAY	(HZ * 5)
+#define PREFETCH_DELAY		(HZ * 5)
+#define DISABLED_PREFETCH_DELAY	(HZ * 60)
 
 /* sysctl - enable/disable swap prefetching */
 int swap_prefetch __read_mostly = 1;
@@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void)
 }
 
 /*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+	if (swap_prefetch <= laptop_mode)
+		return 0;
+	return 1;
+}
+
+static int wakeup_kprefetchd;
+
+/*
  * Drop behind accounting which keeps a list of the most recently used swap
- * entries.
+ * entries. Entries are removed lazily by kprefetchd.
  */
 void add_to_swapped_list(struct page *page)
 {
 	struct swapped_entry *entry;
 	unsigned long index, flags;
-	int wakeup;
-
-	if (!swap_prefetch)
-		return;
 
-	wakeup = 0;
+	if (!prefetch_enabled())
+		goto out;
 
 	spin_lock_irqsave(&swapped.lock, flags);
 	if (swapped.count >= swapped.maxcount) {
@@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa
 	store_swap_entry_node(entry, page);
 
 	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
-		/*
-		 * If this is the first entry, kprefetchd needs to be
-		 * (re)started.
-		 */
-		if (!swapped.count)
-			wakeup = 1;
 		list_add(&entry->swapped_list, &swapped.list);
 		swapped.count++;
 	}
 
 out_locked:
 	spin_unlock_irqrestore(&swapped.lock, flags);
-
-	/* Do the wakeup outside the lock to shorten lock hold time. */
-	if (wakeup)
+out:
+	if (wakeup_kprefetchd)
 		wake_up_process(kprefetchd_task);
-
 	return;
 }
 
@@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi
 	spin_lock_irqsave(&swapped.lock, flags);
 	entry = radix_tree_delete(&swapped.swap_tree, index);
 	if (likely(entry)) {
-		list_del_init(&entry->swapped_list);
+		list_del(&entry->swapped_list);
 		swapped.count--;
 		kmem_cache_free(swapped.cache, entry);
 	}
@@ -153,18 +157,18 @@ enum trickle_return {
 };
 
 struct node_stats {
-	unsigned long	last_free;
 	/* Free ram after a cycle of prefetching */
-	unsigned long	current_free;
+	unsigned long	last_free;
 	/* Free ram on this cycle of checking prefetch_suitable */
-	unsigned long	prefetch_watermark;
+	unsigned long	current_free;
 	/* Maximum amount we will prefetch to */
-	unsigned long	highfree[MAX_NR_ZONES];
+	unsigned long	prefetch_watermark;
 	/* The amount of free ram before we start prefetching */
-	unsigned long	lowfree[MAX_NR_ZONES];
+	unsigned long	highfree[MAX_NR_ZONES];
 	/* The amount of free ram where we will stop prefetching */
-	unsigned long	*pointfree[MAX_NR_ZONES];
+	unsigned long	lowfree[MAX_NR_ZONES];
 	/* highfree or lowfree depending on whether we've hit a watermark */
+	unsigned long	*pointfree[MAX_NR_ZONES];
 };
 
 /*
@@ -172,10 +176,10 @@ struct node_stats {
  * determine if a node is suitable for prefetching into.
  */
 struct prefetch_stats {
-	nodemask_t	prefetch_nodes;
 	/* Which nodes are currently suited to prefetching */
-	unsigned long	prefetched_pages;
+	nodemask_t	prefetch_nodes;
 	/* Total pages we've prefetched on this wakeup of kprefetchd */
+	unsigned long	prefetched_pages;
 	struct node_stats node[MAX_NUMNODES];
 };
 
@@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_
 	const int node)
 {
 	enum trickle_return ret = TRICKLE_FAILED;
+	unsigned long flags;
 	struct page *page;
 
-	read_lock_irq(&swapper_space.tree_lock);
+	read_lock_irqsave(&swapper_space.tree_lock, flags);
 	/* Entry may already exist */
 	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
-	read_unlock_irq(&swapper_space.tree_lock);
-	if (page) {
-		remove_from_swapped_list(entry.val);
+	read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+	if (page)
 		goto out;
-	}
 
 	/*
 	 * Get a new page to read from swap. We have already checked the
@@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_
 
 	/* Add them to the tail of the inactive list to preserve LRU order */
 	lru_cache_add_tail(page);
-	if (unlikely(swap_readpage(NULL, page))) {
-		ret = TRICKLE_DELAY;
+	if (unlikely(swap_readpage(NULL, page)))
 		goto out_release;
-	}
 
 	sp_stat.prefetched_pages++;
 	sp_stat.node[node].last_free--;
@@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_
 out_release:
 	page_cache_release(page);
 out:
+	/*
+	 * All entries are removed here lazily. This avoids the cost of
+	 * remove_from_swapped_list during normal swapin. Thus there are
+	 * usually many stale entries.
+	 */
+	remove_from_swapped_list(entry.val);
 	return ret;
 }
 
@@ -414,17 +421,6 @@ out:
 }
 
 /*
- * Get previous swapped entry when iterating over all entries. swapped.lock
- * should be held and we should already ensure that entry exists.
- */
-static inline struct swapped_entry *prev_swapped_entry
-	(struct swapped_entry *entry)
-{
-	return list_entry(entry->swapped_list.prev->prev,
-		struct swapped_entry, swapped_list);
-}
-
-/*
  * trickle_swap is the main function that initiates the swap prefetching. It
  * first checks to see if the busy flag is set, and does not prefetch if it
  * is, as the flag implied we are low on memory or swapping in currently.
@@ -435,70 +431,49 @@ static inline struct swapped_entry *prev
 static enum trickle_return trickle_swap(void)
 {
 	enum trickle_return ret = TRICKLE_DELAY;
-	struct swapped_entry *entry;
+	struct list_head *p, *next;
 	unsigned long flags;
 
-	/*
-	 * If laptop_mode is enabled don't prefetch to avoid hard drives
-	 * doing unnecessary spin-ups
-	 */
-	if (!swap_prefetch || laptop_mode)
+	if (!prefetch_enabled())
 		return ret;
 
 	examine_free_limits();
-	entry = NULL;
+	if (!prefetch_suitable())
+		return ret;
+	if (list_empty(&swapped.list))
+		return TRICKLE_FAILED;
 
-	for ( ; ; ) {
+	spin_lock_irqsave(&swapped.lock, flags);
+	list_for_each_safe(p, next, &swapped.list) {
+		struct swapped_entry *entry;
 		swp_entry_t swp_entry;
 		int node;
 
+		spin_unlock_irqrestore(&swapped.lock, flags);
+		might_sleep();
 		if (!prefetch_suitable())
-			break;
+			goto out_unlocked;
 
 		spin_lock_irqsave(&swapped.lock, flags);
-		if (list_empty(&swapped.list)) {
-			ret = TRICKLE_FAILED;
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
-		if (!entry) {
-			/*
-			 * This sets the entry for the first iteration. It
-			 * also is a safeguard against the entry disappearing
-			 * while the lock is not held.
-			 */
-			entry = list_entry(swapped.list.prev,
-				struct swapped_entry, swapped_list);
-		} else if (entry->swapped_list.prev == swapped.list.next) {
-			/*
-			 * If we have iterated over all entries and there are
-			 * still entries that weren't swapped out there may
-			 * be a reason we could not swap them back in so
-			 * delay attempting further prefetching.
-			 */
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
+		entry = list_entry(p, struct swapped_entry, swapped_list);
 		node = get_swap_entry_node(entry);
 		if (!node_isset(node, sp_stat.prefetch_nodes)) {
 			/*
 			 * We found an entry that belongs to a node that is
 			 * not suitable for prefetching so skip it.
 			 */
-			entry = prev_swapped_entry(entry);
-			spin_unlock_irqrestore(&swapped.lock, flags);
 			continue;
 		}
 		swp_entry = entry->swp_entry;
-		entry = prev_swapped_entry(entry);
 		spin_unlock_irqrestore(&swapped.lock, flags);
 
 		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
-			break;
+			goto out_unlocked;
+		spin_lock_irqsave(&swapped.lock, flags);
 	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
 
+out_unlocked:
 	if (sp_stat.prefetched_pages) {
 		lru_add_drain();
 		sp_stat.prefetched_pages = 0;
@@ -513,13 +488,14 @@ static int kprefetchd(void *__unused)
 	sched_setscheduler(current, SCHED_BATCH, &param);
 	set_user_nice(current, 19);
 	/* Set ioprio to lowest if supported by i/o scheduler */
-	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);
 
 	/* kprefetchd has nothing to do until it is woken up the first time */
+	wakeup_kprefetchd = 1;
 	set_current_state(TASK_INTERRUPTIBLE);
 	schedule();
 
-	do {
+	while (!kthread_should_stop()) {
 		try_to_freeze();
 
 		/*
@@ -527,13 +503,17 @@ static int kprefetchd(void *__unused)
 		 * a wakeup, and further delay the next one.
 		 */
 		if (trickle_swap() == TRICKLE_FAILED) {
+			wakeup_kprefetchd = 1;
 			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
-		}
+		} else
+			wakeup_kprefetchd = 0;
 		clear_last_prefetch_free();
-		schedule_timeout_interruptible(PREFETCH_DELAY);
-	} while (!kthread_should_stop());
-
+		if (!prefetch_enabled())
+			schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY);
+		else
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+	}
 	return 0;
 }
 
Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt	2007-05-07 21:53:00.000000000 +1000
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt	2007-05-12 14:31:26.000000000 +1000
@@ -229,7 +229,9 @@ swap_prefetch
 This enables or disables the swap prefetching feature. When the virtual
 memory subsystem has been extremely idle for at least 5 seconds it will start
 copying back pages from swap into the swapcache and keep a copy in swap. In
-practice it can take many minutes before the vm is idle enough.
+practice it can take many minutes before the vm is idle enough. A value of 0
+disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2
+enables it even in the presence of laptop_mode.
 
 The default value is 1.
 
Index: linux-2.6.21-mm1/init/Kconfig
===================================================================
--- linux-2.6.21-mm1.orig/init/Kconfig	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/init/Kconfig	2007-05-12 14:30:52.000000000 +1000
@@ -107,7 +107,7 @@ config SWAP
 
 config SWAP_PREFETCH
 	bool "Support for prefetching swapped memory"
-	depends on SWAP
+	depends on SWAP && !CPUSETS
 	default y
 	---help---
 	  This option will allow the kernel to prefetch swapped memory pages

-- 
-ck

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  4:46                           ` [PATCH] mm: swap prefetch improvements Con Kolivas
@ 2007-05-12  5:03                             ` Paul Jackson
  2007-05-12  5:15                               ` Con Kolivas
  2007-05-21 10:03                             ` [PATCH] " Ingo Molnar
  1 sibling, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12  5:03 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

> Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS.

Ok.

Could you explain what it means to say "swap prefetch is not cpuset aware",
or could you give a rough idea of what it would take to make it cpuset aware?

I wouldn't go so far as to say that no one would ever want to prefetch and
use cpusets at the same time, but I will grant that it's not a sufficiently
important need that it should block a useful prefetch implementation on
non-cpuset systems.

One case that would be useful, however, is to handle prefetch in the case
that cpusets are configured into one's kernel, but one is not making any
real use of them ('number_of_cpusets' <= 1).  That will actually be the
most common case for the major distribution(s) that enable cpusets by
default in their builds, for most arch's including the arch's popular
on desktops.

So what would it take to allow CONFIG'ing both prefetch and cpusets on,
but having prefetch dynamically adapt to the presence of active cpuset
usage, perhaps by basically shutting down if it can't easily do any
better?  I could certainly entertain requests to callout to some
prefetch routine from the cpuset code, at the critical points that
cpusets transitioned in or out of active use.

Semi-separate issue -- is it just cpusets that aren't prefetch friendly,
or is it also mm/mempolicy (mbind, set_mempolicy) as well?

For that matter, even if neither mm/mempolicy nor cpusets are used, on
systems with multiple memory nodes (not all memory equally distant from
all CPUs, aka NUMA), could prefetch cause some sort of shuffling of
memory placement, which might harm the performance of an HPC (High
Performance Computing) application with carefully tuned memory
placement?  Granted, this -is- getting to be a corner case.  Most HPC
apps running on NUMA hardware are making at least some use of
mm/mempolicy or cpusets.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 233+ messages in thread

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  5:03                             ` Paul Jackson
@ 2007-05-12  5:15                               ` Con Kolivas
  2007-05-12  5:51                                 ` Paul Jackson
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-12  5:15 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 15:03, Paul Jackson wrote:
> > Swap prefetch is not cpuset aware so make the config option depend on
> > !CPUSETS.
>
> Ok.
>
> Could you explain what it means to say "swap prefetch is not cpuset aware",
> or could you give a rough idea of what it would take to make it cpuset
> aware?

Hmm I'm not really sure what it takes to make it cpuset aware; it was Nick
who pointed out that it was not. I'm still going off your original
recommendation that there was no need to make it cpuset aware, but that it
should at least honour node placement (see below).

> I wouldn't go so far as to say that no one would ever want to prefetch and
> use cpusets at the same time, but I will grant that it's not a sufficiently
> important need that it should block a useful prefetch implementation on
> non-cpuset systems.

Thank you for agreeing with me there :)

> One case that would be useful, however, is to handle prefetch in the case
> that cpusets are configured into one's kernel, but one is not making any
> real use of them ('number_of_cpusets' <= 1).  That will actually be the
> most common case for the major distribution(s) that enable cpusets by
> default in their builds, for most arch's including the arch's popular
> on desktops.
>
> So what would it take to allow CONFIG'ing both prefetch and cpusets on,
> but having prefetch dynamically adapt to the presence of active cpuset
> usage, perhaps by basically shutting down if it can't easily do any
> better?  I could certainly entertain requests to callout to some
> prefetch routine from the cpuset code, at the critical points that
> cpusets transitioned in or out of active use.

It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1 in 
the prefetch_enabled() function. Would you like that?
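
A minimal userspace sketch of such a check, for illustration only. The
laptop_mode comparison follows prefetch_enabled() as it appears in the respun
patch in this thread; the number_of_cpusets test is the hypothetical addition
being discussed, and struct prefetch_state is an invented stand-in for the
kernel globals, not anything in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for the kernel globals consulted by the check. */
struct prefetch_state {
	int swap_prefetch;	/* sysctl value: 0, 1 or 2 */
	int laptop_mode;	/* treated here as 0 (off) or 1 (on) */
	int number_of_cpusets;	/* more than 1 means cpusets are actively used */
};

/*
 * Hypothetical variant of prefetch_enabled(): returns 1 when prefetching
 * may run and 0 otherwise.
 */
static int prefetch_enabled(const struct prefetch_state *s)
{
	assert(s != NULL);

	/* Same comparison as the patch: laptop_mode wins unless sysctl is higher. */
	if (s->swap_prefetch <= s->laptop_mode)
		return 0;
	/* Assumed addition: back off once cpusets are in active use. */
	if (s->number_of_cpusets > 1)
		return 0;
	return 1;
}
```

With swap_prefetch = 1, laptop_mode = 0 and number_of_cpusets = 2 this
returns 0, so prefetch would simply stay idle on systems that actively use
cpusets.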

> Semi-separate issue -- is it just cpusets that aren't prefetch friendly,
> or is it also mm/mempolicy (mbind, set_mempolicy) as well?
>
> For that matter, even if neither mm/mempolicy nor cpusets are used, on
> systems with multiple memory nodes (not all memory equally distant from
> all CPUs, aka NUMA), could prefetch cause some sort of shuffling of
> memory placement, which might harm the performance of an HPC (High
> Performance Computing) application with carefully tuned memory
> placement.  Granted, this -is- getting to be a corner case.  Most HPC
> apps running on NUMA hardware are making at least some use of
> mm/mempolicy or cpusets.

It is numa aware to some degree. It stores the node id and when it starts 
prefetching it only prefetches to nodes that are suitable for prefetching to 
(based on a number of arbitrary freeness arguments I invented). It uses the 
original node id it came from by allocating a page via:
	alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
where "node" is the original node the swapped page came from.
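
As a rough illustration of the node filtering described above, here is a
userspace sketch. It assumes a plain bitmask in place of the kernel's
nodemask_t, and the names swapped_entry_model, MAX_NODES and
count_prefetchable are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* illustrative limit, not the kernel's MAX_NUMNODES */

/* One remembered swapped-out page: its swap location and its home node. */
struct swapped_entry_model {
	unsigned long swap_offset;
	int node;
};

/*
 * Count the entries that may be prefetched: those whose home node has its
 * bit set in suitable_nodes. Entries on other nodes are skipped, never
 * redirected to a different node.
 */
static size_t count_prefetchable(const struct swapped_entry_model *entries,
				 size_t n, unsigned int suitable_nodes)
{
	size_t i, eligible = 0;

	assert(entries != NULL || n == 0);
	for (i = 0; i < n; i++) {
		int node = entries[i].node;

		if (node < 0 || node >= MAX_NODES)
			continue;	/* ignore malformed entries */
		if (suitable_nodes & (1u << node))
			eligible++;
	}
	return eligible;
}
```

The point of the sketch is that a page is only ever brought back to the node
it was swapped out from, so an unsuitable node delays its pages rather than
moving them elsewhere.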

Thanks for comments.

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  5:15                               ` Con Kolivas
@ 2007-05-12  5:51                                 ` Paul Jackson
  2007-05-12  7:28                                   ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12  5:51 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

Con wrote:
> Hmm I'm not really sure what it takes to make it cpuset aware;
> ...
> It is numa aware to some degree. It stores the node id and when it starts 
> prefetching it only prefetches to nodes that are suitable for prefetching to 
> ...
> It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1
> in  the prefetch_enabled() function. Would you like that?

Hmmm ... it seems that we are shadow boxing here ... trying to pick a solution
to solve a problem when we aren't even sure we have a problem, much less
what the problem is.

That does not usually lead to the right path.

Could you put some more effort into characterizing what problems
can arise if one has prefetch and cpusets active at the same time?

My first wild guess is that the only incompatibility would have been that
prefetch might mess up NUMA placement (get pages on wrong nodes), which
it seems you have tried to address in your current patches.  So it would
not surprise me if there was no problem here.

We may just have to lean on Nick some more, if he is the only one who
understands what the problem is, to try again to explain it to us.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  5:51                                 ` Paul Jackson
@ 2007-05-12  7:28                                   ` Con Kolivas
  2007-05-12  8:14                                     ` Paul Jackson
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-12  7:28 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 15:51, Paul Jackson wrote:
> Con wrote:
> > Hmm I'm not really sure what it takes to make it cpuset aware;
> > ...
> > It is numa aware to some degree. It stores the node id and when it starts
> > prefetching it only prefetches to nodes that are suitable for prefetching
> > to ...
> > It would be absolutely trivial to add a check for 'number_of_cpusets' <=
> > 1 in  the prefetch_enabled() function. Would you like that?
>
> Hmmm ... it seems that we are shadow boxing here ... trying to pick a solution
> to solve a problem when we aren't even sure we have a problem, much less
> what the problem is.
>
> That does not usually lead to the right path.
>
> Could you put some more effort into characterizing what problems
> can arise if one has prefetch and cpusets active at the same time?
>
> My first wild guess is that the only incompatibility would have been that
> prefetch might mess up NUMA placement (get pages on wrong nodes), which
> it seems you have tried to address in your current patches.  So it would
> not surprise me if there was no problem here.

Ummm this is what I've been saying for over a year now but no one has been 
listening.

> We may just have to lean on Nick some more, if he is the only one who
> understands what the problem is, to try again to explain it to us.

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  7:28                                   ` Con Kolivas
@ 2007-05-12  8:14                                     ` Paul Jackson
  2007-05-12  8:21                                       ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12  8:14 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

> Ummm this is what I've been saying for over a year now but no one has been 
> listening.

Well ... if there is a problem using prefetch and cpusets together,
it doesn't look like the two of us are going to find it.

I should probably look at your patch to answer this next question,
but being a lazy retard, I'll just ask.  Is there a way, on a running
system that has your prefetch patch configured in, to disable prefetch
-- perhaps writing to some magic /proc file or something?

If so, then how about you just remove the lines in the patch that
disable prefetch on kernels configured with CPUSETS, and we charge
ahead allowing both at the same time?

If some day in the future I find something about prefetch that harms
the HPC NUMA loads I care about, then I can just dynamically disable
prefetch.

If someone ever uncovers a real problem with prefetch and cpusets,
then we will deal with it then.

As to whether your patch is otherwise (other than cpusets) worthy
of further acceptance, that I will have to leave up to those who are
competent to make such judgements.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  8:14                                     ` Paul Jackson
@ 2007-05-12  8:21                                       ` Con Kolivas
  2007-05-12  8:37                                         ` Paul Jackson
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-12  8:21 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 18:14, Paul Jackson wrote:
> > Ummm this is what I've been saying for over a year now but no one has been
> > listening.
>
> Well ... if there is a problem using prefetch and cpusets together,
> it doesn't look like the two of us are going to find it.
>
> I should probably look at your patch to answer this next question,
> but being a lazy retard, I'll just ask.  Is there a way, on a running
> system that has your prefetch patch configured in, to disable prefetch
> -- perhaps writing to some magic /proc file or something?

Indeed:

/proc/sys/vm/swap_prefetch
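
For illustration, a small C sketch of toggling that file from a program
rather than the shell. The helper name write_sysctl_int is made up for this
example; it assumes the caller has permission to write the file and that the
running kernel carries the swap prefetch patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write an integer value to a sysctl-style file. Returns 0 on success and
 * -1 on failure (for example, no permission or the file does not exist).
 */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	/* fclose() flushes, so a failed write can also surface here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling write_sysctl_int("/proc/sys/vm/swap_prefetch", 0) would then disable
prefetch at runtime, and writing 1 would turn it back on.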

> If so, then how about you just remove the lines in the patch that
> disable prefetch on kernels configured with CPUSETS, and we charge
> ahead allowing both at the same time?

Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled? 
Sure, I can do that.

> If some day in the future I find something about prefetch that harms
> the HPC NUMA loads I care about, then I can just dynamically disable
> prefetch.
>
> If someone ever uncovers a real problem with prefetch and cpusets,
> then we will deal with it then.
>
> As to whether your patch is otherwise (other than cpusets) worthy
> of further acceptance, that I will have to leave up to those who are
> competent to make such judgements.

Thank you very much for your comments!

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  8:21                                       ` Con Kolivas
@ 2007-05-12  8:37                                         ` Paul Jackson
  2007-05-12  8:57                                           ` [PATCH respin] " Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Paul Jackson @ 2007-05-12  8:37 UTC (permalink / raw)
  To: Con Kolivas; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

Con wrote:
> Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled? 

I don't see why that special case for cpusets is needed.

I'm suggesting making no special cases for CPUSETS at all, until and
unless we find reason to.

In other words, I'm suggesting simply removing the patch lines:

-	depends on SWAP
+	depends on SWAP && !CPUSETS

I see no other mention of cpusets in your patch.  That's fine by me.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

* [PATCH respin] mm: swap prefetch improvements
  2007-05-12  8:37                                         ` Paul Jackson
@ 2007-05-12  8:57                                           ` Con Kolivas
  0 siblings, 0 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-12  8:57 UTC (permalink / raw)
  To: Paul Jackson; +Cc: nickpiggin, ray-lk, mingo, ck, akpm, linux-kernel, linux-mm

On Saturday 12 May 2007 18:37, Paul Jackson wrote:
> Con wrote:
> > Ok so change the default value for swap_prefetch to 0 when CPUSETS is
> > enabled?
>
> I don't see why that special case for cpusets is needed.
>
> I'm suggesting making no special cases for CPUSETS at all, until and
> unless we find reason to.
>
> In other words, I'm suggesting simply removing the patch lines:
>
> -	depends on SWAP
> +	depends on SWAP && !CPUSETS
>
> I see no other mention of cpusets in your patch.  That's fine by me.

Excellent, I prefer that as well. Thanks very much for your comments!

Here's a respin without that hunk.

---
Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when laptop_mode
is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <kernel@kolivas.org>

---
 Documentation/sysctl/vm.txt |    4 -
 mm/page_io.c                |    2 
 mm/swap_prefetch.c          |  158 +++++++++++++++++++-------------------------
 mm/swap_state.c             |    2 
 mm/vmscan.c                 |    1 
 5 files changed, 74 insertions(+), 93 deletions(-)

Index: linux-2.6.21-mm1/mm/page_io.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_io.c	2007-02-05 22:52:04.000000000 +1100
+++ linux-2.6.21-mm1/mm/page_io.c	2007-05-12 14:30:52.000000000 +1000
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/swapops.h>
 #include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
 #include <asm/pgtable.h>
 
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
 		ret = -ENOMEM;
 		goto out;
 	}
+	add_to_swapped_list(page);
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		rw |= (1 << BIO_RW_SYNC);
 	count_vm_event(PSWPOUT);
Index: linux-2.6.21-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_state.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_state.c	2007-05-12 14:30:52.000000000 +1000
@@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
-			remove_from_swapped_list(entry.val);
 			page_cache_get(page);
 			SetPageLocked(page);
 			SetPageSwapCache(page);
@@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page,
 	int error;
 
 	if (!swap_duplicate(entry)) {
-		remove_from_swapped_list(entry.val);
 		INC_CACHE_INFO(noent_race);
 		return -ENOENT;
 	}
Index: linux-2.6.21-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/vmscan.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/vmscan.c	2007-05-12 14:30:52.000000000 +1000
@@ -410,7 +410,6 @@ int remove_mapping(struct address_space 
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
-		add_to_swapped_list(page);
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
Index: linux-2.6.21-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_prefetch.c	2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_prefetch.c	2007-05-12 14:30:52.000000000 +1000
@@ -27,7 +27,8 @@
  * needs to be at least this duration of idle time meaning in practice it can
  * be much longer
  */
-#define PREFETCH_DELAY	(HZ * 5)
+#define PREFETCH_DELAY		(HZ * 5)
+#define DISABLED_PREFETCH_DELAY	(HZ * 60)
 
 /* sysctl - enable/disable swap prefetching */
 int swap_prefetch __read_mostly = 1;
@@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void)
 }
 
 /*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+	if (swap_prefetch <= laptop_mode)
+		return 0;
+	return 1;
+}
+
+static int wakeup_kprefetchd;
+
+/*
  * Drop behind accounting which keeps a list of the most recently used swap
- * entries.
+ * entries. Entries are removed lazily by kprefetchd.
  */
 void add_to_swapped_list(struct page *page)
 {
 	struct swapped_entry *entry;
 	unsigned long index, flags;
-	int wakeup;
-
-	if (!swap_prefetch)
-		return;
 
-	wakeup = 0;
+	if (!prefetch_enabled())
+		goto out;
 
 	spin_lock_irqsave(&swapped.lock, flags);
 	if (swapped.count >= swapped.maxcount) {
@@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa
 	store_swap_entry_node(entry, page);
 
 	if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
-		/*
-		 * If this is the first entry, kprefetchd needs to be
-		 * (re)started.
-		 */
-		if (!swapped.count)
-			wakeup = 1;
 		list_add(&entry->swapped_list, &swapped.list);
 		swapped.count++;
 	}
 
 out_locked:
 	spin_unlock_irqrestore(&swapped.lock, flags);
-
-	/* Do the wakeup outside the lock to shorten lock hold time. */
-	if (wakeup)
+out:
+	if (wakeup_kprefetchd)
 		wake_up_process(kprefetchd_task);
-
 	return;
 }
 
@@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi
 	spin_lock_irqsave(&swapped.lock, flags);
 	entry = radix_tree_delete(&swapped.swap_tree, index);
 	if (likely(entry)) {
-		list_del_init(&entry->swapped_list);
+		list_del(&entry->swapped_list);
 		swapped.count--;
 		kmem_cache_free(swapped.cache, entry);
 	}
@@ -153,18 +157,18 @@ enum trickle_return {
 };
 
 struct node_stats {
-	unsigned long	last_free;
 	/* Free ram after a cycle of prefetching */
-	unsigned long	current_free;
+	unsigned long	last_free;
 	/* Free ram on this cycle of checking prefetch_suitable */
-	unsigned long	prefetch_watermark;
+	unsigned long	current_free;
 	/* Maximum amount we will prefetch to */
-	unsigned long	highfree[MAX_NR_ZONES];
+	unsigned long	prefetch_watermark;
 	/* The amount of free ram before we start prefetching */
-	unsigned long	lowfree[MAX_NR_ZONES];
+	unsigned long	highfree[MAX_NR_ZONES];
 	/* The amount of free ram where we will stop prefetching */
-	unsigned long	*pointfree[MAX_NR_ZONES];
+	unsigned long	lowfree[MAX_NR_ZONES];
 	/* highfree or lowfree depending on whether we've hit a watermark */
+	unsigned long	*pointfree[MAX_NR_ZONES];
 };
 
 /*
@@ -172,10 +176,10 @@ struct node_stats {
  * determine if a node is suitable for prefetching into.
  */
 struct prefetch_stats {
-	nodemask_t	prefetch_nodes;
 	/* Which nodes are currently suited to prefetching */
-	unsigned long	prefetched_pages;
+	nodemask_t	prefetch_nodes;
 	/* Total pages we've prefetched on this wakeup of kprefetchd */
+	unsigned long	prefetched_pages;
 	struct node_stats node[MAX_NUMNODES];
 };
 
@@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_
 	const int node)
 {
 	enum trickle_return ret = TRICKLE_FAILED;
+	unsigned long flags;
 	struct page *page;
 
-	read_lock_irq(&swapper_space.tree_lock);
+	read_lock_irqsave(&swapper_space.tree_lock, flags);
 	/* Entry may already exist */
 	page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
-	read_unlock_irq(&swapper_space.tree_lock);
-	if (page) {
-		remove_from_swapped_list(entry.val);
+	read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+	if (page)
 		goto out;
-	}
 
 	/*
 	 * Get a new page to read from swap. We have already checked the
@@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_
 
 	/* Add them to the tail of the inactive list to preserve LRU order */
 	lru_cache_add_tail(page);
-	if (unlikely(swap_readpage(NULL, page))) {
-		ret = TRICKLE_DELAY;
+	if (unlikely(swap_readpage(NULL, page)))
 		goto out_release;
-	}
 
 	sp_stat.prefetched_pages++;
 	sp_stat.node[node].last_free--;
@@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_
 out_release:
 	page_cache_release(page);
 out:
+	/*
+	 * All entries are removed here lazily. This avoids the cost of
+	 * remove_from_swapped_list during normal swapin. Thus there are
+	 * usually many stale entries.
+	 */
+	remove_from_swapped_list(entry.val);
 	return ret;
 }
 
@@ -414,17 +421,6 @@ out:
 }
 
 /*
- * Get previous swapped entry when iterating over all entries. swapped.lock
- * should be held and we should already ensure that entry exists.
- */
-static inline struct swapped_entry *prev_swapped_entry
-	(struct swapped_entry *entry)
-{
-	return list_entry(entry->swapped_list.prev->prev,
-		struct swapped_entry, swapped_list);
-}
-
-/*
  * trickle_swap is the main function that initiates the swap prefetching. It
  * first checks to see if the busy flag is set, and does not prefetch if it
  * is, as the flag implied we are low on memory or swapping in currently.
@@ -435,70 +431,49 @@ static inline struct swapped_entry *prev
 static enum trickle_return trickle_swap(void)
 {
 	enum trickle_return ret = TRICKLE_DELAY;
-	struct swapped_entry *entry;
+	struct list_head *p, *next;
 	unsigned long flags;
 
-	/*
-	 * If laptop_mode is enabled don't prefetch to avoid hard drives
-	 * doing unnecessary spin-ups
-	 */
-	if (!swap_prefetch || laptop_mode)
+	if (!prefetch_enabled())
 		return ret;
 
 	examine_free_limits();
-	entry = NULL;
+	if (!prefetch_suitable())
+		return ret;
+	if (list_empty(&swapped.list))
+		return TRICKLE_FAILED;
 
-	for ( ; ; ) {
+	spin_lock_irqsave(&swapped.lock, flags);
+	list_for_each_safe(p, next, &swapped.list) {
+		struct swapped_entry *entry;
 		swp_entry_t swp_entry;
 		int node;
 
+		spin_unlock_irqrestore(&swapped.lock, flags);
+		might_sleep();
 		if (!prefetch_suitable())
-			break;
+			goto out_unlocked;
 
 		spin_lock_irqsave(&swapped.lock, flags);
-		if (list_empty(&swapped.list)) {
-			ret = TRICKLE_FAILED;
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
-		if (!entry) {
-			/*
-			 * This sets the entry for the first iteration. It
-			 * also is a safeguard against the entry disappearing
-			 * while the lock is not held.
-			 */
-			entry = list_entry(swapped.list.prev,
-				struct swapped_entry, swapped_list);
-		} else if (entry->swapped_list.prev == swapped.list.next) {
-			/*
-			 * If we have iterated over all entries and there are
-			 * still entries that weren't swapped out there may
-			 * be a reason we could not swap them back in so
-			 * delay attempting further prefetching.
-			 */
-			spin_unlock_irqrestore(&swapped.lock, flags);
-			break;
-		}
-
+		entry = list_entry(p, struct swapped_entry, swapped_list);
 		node = get_swap_entry_node(entry);
 		if (!node_isset(node, sp_stat.prefetch_nodes)) {
 			/*
 			 * We found an entry that belongs to a node that is
 			 * not suitable for prefetching so skip it.
 			 */
-			entry = prev_swapped_entry(entry);
-			spin_unlock_irqrestore(&swapped.lock, flags);
 			continue;
 		}
 		swp_entry = entry->swp_entry;
-		entry = prev_swapped_entry(entry);
 		spin_unlock_irqrestore(&swapped.lock, flags);
 
 		if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
-			break;
+			goto out_unlocked;
+		spin_lock_irqsave(&swapped.lock, flags);
 	}
+	spin_unlock_irqrestore(&swapped.lock, flags);
 
+out_unlocked:
 	if (sp_stat.prefetched_pages) {
 		lru_add_drain();
 		sp_stat.prefetched_pages = 0;
@@ -513,13 +488,14 @@ static int kprefetchd(void *__unused)
 	sched_setscheduler(current, SCHED_BATCH, &param);
 	set_user_nice(current, 19);
 	/* Set ioprio to lowest if supported by i/o scheduler */
-	sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+	sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);
 
 	/* kprefetchd has nothing to do until it is woken up the first time */
+	wakeup_kprefetchd = 1;
 	set_current_state(TASK_INTERRUPTIBLE);
 	schedule();
 
-	do {
+	while (!kthread_should_stop()) {
 		try_to_freeze();
 
 		/*
@@ -527,13 +503,17 @@ static int kprefetchd(void *__unused)
 		 * a wakeup, and further delay the next one.
 		 */
 		if (trickle_swap() == TRICKLE_FAILED) {
+			wakeup_kprefetchd = 1;
 			set_current_state(TASK_INTERRUPTIBLE);
 			schedule();
-		}
+		} else
+			wakeup_kprefetchd = 0;
 		clear_last_prefetch_free();
-		schedule_timeout_interruptible(PREFETCH_DELAY);
-	} while (!kthread_should_stop());
-
+		if (!prefetch_enabled())
+			schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY);
+		else
+			schedule_timeout_interruptible(PREFETCH_DELAY);
+	}
 	return 0;
 }
 
Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt	2007-05-07 21:53:00.000000000 +1000
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt	2007-05-12 14:31:26.000000000 +1000
@@ -229,7 +229,9 @@ swap_prefetch
 This enables or disables the swap prefetching feature. When the virtual
 memory subsystem has been extremely idle for at least 5 seconds it will start
 copying back pages from swap into the swapcache and keep a copy in swap. In
-practice it can take many minutes before the vm is idle enough.
+practice it can take many minutes before the vm is idle enough. A value of 0
+disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2
+enables it even in the presence of laptop_mode.
 
 The default value is 1.
 

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-12  4:46                           ` [PATCH] mm: swap prefetch improvements Con Kolivas
  2007-05-12  5:03                             ` Paul Jackson
@ 2007-05-21 10:03                             ` Ingo Molnar
  2007-05-21 13:44                               ` Con Kolivas
  1 sibling, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-21 10:03 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel,
	linux-mm

* Con Kolivas <kernel@kolivas.org> wrote:

> It turns out that fixing swap prefetch was not that hard to fix and 
> improve upon, and since Andrew hasn't dropped swap prefetch, instead 
> here are a swag of fixes and improvements, [...]

it's a reliable win on my testbox too:

 # echo 1 > /proc/sys/vm/swap_prefetch
 # ./sp_tester
 Ram 1019540000  Swap 4096564000
 Total ram to be malloced: 1529310000 bytes
 Starting first malloc of 764655000 bytes
 Starting 1st read of first malloc
 Touching this much ram takes 4393 milliseconds
 Starting second malloc of 764655000 bytes
 Completed second malloc and free
 Sleeping for 600 seconds
 Important part - starting reread of first malloc
 Completed read of first malloc
 Timed portion 30279 milliseconds

versus:

 # echo 0 > /proc/sys/vm/swap_prefetch
 # ./sp_tester
 [...]

 Timed portion 36605 milliseconds

i've repeated these tests to make sure it's a stable win and indeed it 
is:

   # swap-prefetch-on:

   Timed portion 29704 milliseconds

   # swap-prefetch-off:

   Timed portion 34863 milliseconds

Nice work Con!

A suggestion for improvement: right now swap-prefetch does a small bit 
of swapin every 5 seconds and stays idle in between. Could this perhaps 
be made more aggressive (optionally perhaps), if the system is not 
swapping otherwise? If block-IO level instrumentation is needed to 
determine idleness of block IO then that is justified too i think.

Another suggestion: swap-prefetch seems to be making all the right 
decisions in the sp_test.c case - so would it be possible to add 
statistics so that it could be verified how much of the swapped-in pages 
were indeed a 'hit' - and how many were recycled without them being 
reused? That could give a reliable, objective metric about how efficient 
swap-prefetch is in any workload.

	Ingo

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-21 10:03                             ` [PATCH] " Ingo Molnar
@ 2007-05-21 13:44                               ` Con Kolivas
  2007-05-21 16:00                                 ` Ingo Molnar
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-21 13:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel,
	linux-mm

On Monday 21 May 2007 20:03, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > It turns out that fixing swap prefetch was not that hard to fix and
> > improve upon, and since Andrew hasn't dropped swap prefetch, instead
> > here are a swag of fixes and improvements, [...]
>
> it's a reliable win on my testbox too:
>
>  # echo 1 > /proc/sys/vm/swap_prefetch

>  Timed portion 30279 milliseconds
>
> versus:
>
>  # echo 0 > /proc/sys/vm/swap_prefetch
>  # ./sp_tester
>  [...]
>
>  Timed portion 36605 milliseconds
>
> i've repeated these tests to make sure it's a stable win and indeed it
> is:
>
>    # swap-prefetch-on:
>
>    Timed portion 29704 milliseconds
>
>    # swap-prefetch-off:
>
>    Timed portion 34863 milliseconds
>
> Nice work Con!

Thanks!

>
> A suggestion for improvement: right now swap-prefetch does a small bit
> of swapin every 5 seconds and stays idle in between. Could this perhaps
> be made more aggressive (optionally perhaps), if the system is not
> swapping otherwise? If block-IO level instrumentation is needed to
> determine idleness of block IO then that is justified too i think.

Hmm.. The timer waits 5 seconds before trying to prefetch, but then only stops 
if it detects any activity elsewhere. It doesn't actually try to go idle in 
between but it doesn't take much activity to put it back to sleep, hence 
detecting yet another "not quite idle" period and then it goes to sleep 
again. I guess the sleep interval can actually be changed as another tunable 
from 5 seconds to whatever the user wanted.
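
A minimal sketch of how such a tunable could feed the sleep interval,
assuming a hypothetical delay_secs setting. The 5 second and 60 second
figures come from PREFETCH_DELAY and DISABLED_PREFETCH_DELAY in the patch;
HZ_MODEL and prefetch_sleep_ticks are invented names for this userspace
example.

```c
#include <assert.h>
#include <limits.h>

#define HZ_MODEL 250	/* illustrative tick rate; the real HZ is a build option */

/*
 * Pick kprefetchd's next sleep, in ticks. While prefetch is disabled the
 * thread polls rarely (60 s, as in the patch); otherwise it sleeps for a
 * user-chosen number of seconds, with 0 meaning "use the default of 5".
 */
static unsigned long prefetch_sleep_ticks(int enabled, unsigned int delay_secs)
{
	/* Guard against overflow in the multiplication below. */
	assert(delay_secs <= ULONG_MAX / HZ_MODEL);

	if (!enabled)
		return (unsigned long)HZ_MODEL * 60;
	if (delay_secs == 0)
		delay_secs = 5;	/* fall back to the current fixed delay */
	return (unsigned long)HZ_MODEL * delay_secs;
}
```

A shorter delay only changes how soon the next attempt starts; whether a
burst continues still depends on the idleness checks.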

> Another suggestion: swap-prefetch seems to be making all the right
> decisions in the sp_test.c case - so would it be possible to add
> statistics so that it could be verified how much of the swapped-in pages
> were indeed a 'hit' - and how many were recycled without them being
> reused? That could give a reliable, objective metric about how efficient
> swap-prefetch is in any workload.

Well the advantage is twofold potentially; 1. the pages that have been 
prefetched and become minor faults when they would have been major faults, 
and 2. those that become minor faults (via 1) and then become major faults 
again (since a copy is kept on backing store with swap prefetch). The 
sp_tester only tests for 1, although it would be easy enough to simply do 
another big malloc at the end and see how fast it swapped out again as a 
marker of 2. As for an in-kernel option, it could get kind of expensive 
tracking pages that have done one or both of these. I'll think about an 
affordable way to do this, perhaps it could be just done as a 
debugging/testing patch, but it would be nice to make it cheap enough to have 
there permanently as well. The pages end up in swap cache (in the reverse 
direction pages normally get to swap cache) so the accounting could be done 
somewhere around there.
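
One possible shape for that accounting, sketched in userspace C. All of the
names below (prefetch_hit_stats, prefetch_hit_percent and the three counters)
are hypothetical; nothing like them exists in the posted patch, and where the
counters would be bumped in the swap cache paths is left open.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters; none of these exist in the posted patch. */
struct prefetch_hit_stats {
	unsigned long prefetched;	/* pages read back from swap by prefetch */
	unsigned long hits;		/* prefetched pages later touched (minor fault) */
	unsigned long recycled;		/* prefetched pages reclaimed without use */
};

/*
 * Percentage of resolved prefetched pages that were actually used.
 * Pages still sitting untouched in swap cache count as unresolved.
 * Returns 0 when nothing has been resolved yet.
 */
static unsigned int prefetch_hit_percent(const struct prefetch_hit_stats *s)
{
	unsigned long resolved;

	assert(s != NULL);
	resolved = s->hits + s->recycled;
	if (resolved == 0)
		return 0;
	return (unsigned int)((s->hits * 100UL) / resolved);
}
```

Three plain counters like these would be cheap to keep; the expensive part
is telling a prefetched page from any other swap cache page at fault or
reclaim time, which this sketch does not address.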

> 	Ingo

Thanks for comments!

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-21 13:44                               ` Con Kolivas
@ 2007-05-21 16:00                                 ` Ingo Molnar
  2007-05-22 10:15                                   ` Antonino Ingargiola
  0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-21 16:00 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Nick Piggin, Ray Lee, ck list, Andrew Morton, linux-kernel,
	linux-mm


* Con Kolivas <kernel@kolivas.org> wrote:

> > A suggestion for improvement: right now swap-prefetch does a small 
> > bit of swapin every 5 seconds and stays idle in between. Could this 
> > perhaps be made more aggressive (optionally perhaps), if the system 
> > is not swapping otherwise? If block-IO level instrumentation is 
> > needed to determine idleness of block IO then that is justified too 
> > i think.
> 
> Hmm.. The timer waits 5 seconds before trying to prefetch, but then 
> only stops if it detects any activity elsewhere. It doesn't actually 
> try to go idle in between but it doesn't take much activity to put it 
> back to sleep, hence detecting yet another "not quite idle" period and 
> then it goes to sleep again. I guess the sleep interval can actually 
> be changed as another tunable from 5 seconds to whatever the user 
> wanted.

there was nothing else running on the system - so i suspect the swapin 
activity flagged 'itself' as some 'other' activity and stopped? The 
swapins happened in 4 bursts, separated by 5 seconds total idleness.

	Ingo

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-21 16:00                                 ` Ingo Molnar
@ 2007-05-22 10:15                                   ` Antonino Ingargiola
  2007-05-22 10:20                                     ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Antonino Ingargiola @ 2007-05-22 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm

2007/5/21, Ingo Molnar <mingo@elte.hu>:
>
> * Con Kolivas <kernel@kolivas.org> wrote:
>
> > > A suggestion for improvement: right now swap-prefetch does a small
> > > bit of swapin every 5 seconds and stays idle in between. Could this
> > > perhaps be made more aggressive (optionally perhaps), if the system
> > > is not swapping otherwise? If block-IO level instrumentation is
> > > needed to determine idleness of block IO then that is justified too
> > > i think.
> >
> > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > only stops if it detects any activity elsewhere. It doesn't actually
> > try to go idle in between but it doesn't take much activity to put it
> > back to sleep, hence detecting yet another "not quite idle" period and
> > then it goes to sleep again. I guess the sleep interval can actually
> > be changed as another tunable from 5 seconds to whatever the user
> > wanted.
>
> there was nothing else running on the system - so i suspect the swapin
> activity flagged 'itself' as some 'other' activity and stopped? The
> swapins happened in 4 bursts, separated by 5 seconds total idleness.

I've noted burst swapins separated by some seconds of pause in my
desktop system too (with sp_tester and an idle gnome).


Regards,

    ~ Antonio

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:15                                   ` Antonino Ingargiola
@ 2007-05-22 10:20                                     ` Con Kolivas
  2007-05-22 10:25                                       ` Ingo Molnar
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 10:20 UTC (permalink / raw)
  To: Antonino Ingargiola
  Cc: Ingo Molnar, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm

On Tuesday 22 May 2007 20:15, Antonino Ingargiola wrote:
> 2007/5/21, Ingo Molnar <mingo@elte.hu>:
> > * Con Kolivas <kernel@kolivas.org> wrote:
> > > > A suggestion for improvement: right now swap-prefetch does a small
> > > > bit of swapin every 5 seconds and stays idle in between. Could this
> > > > perhaps be made more aggressive (optionally perhaps), if the system
> > > > is not swapping otherwise? If block-IO level instrumentation is
> > > > needed to determine idleness of block IO then that is justified too
> > > > i think.
> > >
> > > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > > only stops if it detects any activity elsewhere. It doesn't actually
> > > try to go idle in between but it doesn't take much activity to put it
> > > back to sleep, hence detecting yet another "not quite idle" period and
> > > then it goes to sleep again. I guess the sleep interval can actually
> > > be changed as another tunable from 5 seconds to whatever the user
> > > wanted.
> >
> > there was nothing else running on the system - so i suspect the swapin
> > activity flagged 'itself' as some 'other' activity and stopped? The
> > swapins happened in 4 bursts, separated by 5 seconds total idleness.
>
> I've noted burst swapins separated by some seconds of pause in my
> desktop system too (with sp_tester and an idle gnome).

That really is expected, as just about anything, including journal writeout, 
would be enough to put it back to sleep for 5 more seconds.
-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:20                                     ` Con Kolivas
@ 2007-05-22 10:25                                       ` Ingo Molnar
  2007-05-22 10:37                                         ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 10:25 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm


* Con Kolivas <kernel@kolivas.org> wrote:

> > > there was nothing else running on the system - so i suspect the 
> > > swapin activity flagged 'itself' as some 'other' activity and 
> > > stopped? The swapins happened in 4 bursts, separated by 5 seconds 
> > > total idleness.
> >
> > I've noted burst swapins separated by some seconds of pause in my 
> > desktop system too (with sp_tester and an idle gnome).
> 
> That really is expected, as just about anything, including journal 
> writeout, would be enough to put it back to sleep for 5 more seconds. 

note that nothing like that happened on my system - in the 
swap-prefetch-off case there was _zero_ IO activity during the sleep 
period.

	Ingo

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:25                                       ` Ingo Molnar
@ 2007-05-22 10:37                                         ` Con Kolivas
  2007-05-22 10:46                                           ` Ingo Molnar
  2007-05-22 20:42                                           ` Ash Milsted
  0 siblings, 2 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 10:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm

On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > > > there was nothing else running on the system - so i suspect the
> > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > total idleness.
> > >
> > > I've noted burst swapins separated by some seconds of pause in my
> > > desktop system too (with sp_tester and an idle gnome).
> >
> > That really is expected, as just about anything, including journal
> > writeout, would be enough to put it back to sleep for 5 more seconds.
>
> note that nothing like that happened on my system - in the
> swap-prefetch-off case there was _zero_ IO activity during the sleep
> period.

Ok, granted it's _very_ conservative. I really don't want to risk its presence 
being a burden on anything, and the iowait it induces probably makes it turn 
itself off for another PREFETCH_DELAY (5s). I really don't want to cross the 
line to where it is detrimental in any way. Not dropping out on a 
cond_resched and perhaps making the delay tunable should be enough to make it 
a little less "sleepy".

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:37                                         ` Con Kolivas
@ 2007-05-22 10:46                                           ` Ingo Molnar
  2007-05-22 10:54                                             ` Con Kolivas
  2007-05-22 20:18                                             ` [ck] " Michael Chang
  2007-05-22 20:42                                           ` Ash Milsted
  1 sibling, 2 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 10:46 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm


* Con Kolivas <kernel@kolivas.org> wrote:

> On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > * Con Kolivas <kernel@kolivas.org> wrote:
> > > > > there was nothing else running on the system - so i suspect the
> > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > total idleness.
> > > >
> > > > I've noted burst swapins separated by some seconds of pause in my
> > > > desktop system too (with sp_tester and an idle gnome).
> > >
> > > That really is expected, as just about anything, including journal
> > > writeout, would be enough to put it back to sleep for 5 more seconds.
> >
> > note that nothing like that happened on my system - in the
> > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > period.
> 
> Ok, granted it's _very_ conservative. [...]

but your first reaction was "it should not have slept for 5 seconds":

| Hmm.. The timer waits 5 seconds before trying to prefetch, but then 
| only stops if it detects any activity elsewhere. It doesn't actually 
| try to go idle in between
 
It clearly should not consider 'itself' as IO activity. This suggests 
some bug in the 'detect activity' mechanism, agreed? I'm wondering 
whether you are seeing the same problem, or is all swap-prefetch IO on 
your system continuous until it's done [or some other IO comes 
in between]?
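
One way to picture the suspected behaviour is a toy simulation. The sketch
below is only a model under stated assumptions, not the swap-prefetch code:
it assumes a one-second tick, a prefetcher that backs off for PREFETCH_DELAY
whenever it sees activity, and a hypothetical count_own_io flag that makes it
mistake its own reads from the previous tick for foreign IO. All function and
parameter names are invented for illustration.

```python
PREFETCH_DELAY = 5  # seconds; the thread gives 5 s as the back-off interval


def simulate(total_pages, pages_per_tick, other_io_ticks, count_own_io):
    """Return how many one-second ticks it takes to prefetch total_pages.

    other_io_ticks: set of tick numbers at which some other task does IO.
    count_own_io:   hypothetical flag; when True the prefetcher treats its
                    own reads from the previous tick as foreign activity.
    """
    remaining = total_pages
    tick = 0
    sleep_until = 0
    own_io_last_tick = False
    while remaining > 0:
        did_io = False
        if tick >= sleep_until:
            busy = tick in other_io_ticks or (count_own_io and own_io_last_tick)
            if busy:
                # Anything that looks like activity costs a full delay.
                sleep_until = tick + PREFETCH_DELAY
            else:
                remaining -= min(pages_per_tick, remaining)
                did_io = True
        own_io_last_tick = did_io
        tick += 1
    return tick
```

Under these assumptions, four one-page bursts with no other IO take 19 ticks
when the prefetcher counts its own reads (bursts at ticks 0, 6, 12 and 18,
with 5 idle ticks between them) and 4 ticks when it does not.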

	Ingo

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:46                                           ` Ingo Molnar
@ 2007-05-22 10:54                                             ` Con Kolivas
  2007-05-22 10:57                                               ` Ingo Molnar
  2007-05-22 20:18                                             ` [ck] " Michael Chang
  1 sibling, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 10:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm

On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> It clearly should not consider 'itself' as IO activity. This suggests
> some bug in the 'detect activity' mechanism, agreed? I'm wondering
> whether you are seeing the same problem, or is all swap-prefetch IO on
> your system continuous until it's done [or some other IO comes
> in between]?

When nothing else is happening anywhere on the system it reads in bursts and 
goes to sleep during journal writeout.
 
-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:54                                             ` Con Kolivas
@ 2007-05-22 10:57                                               ` Ingo Molnar
  2007-05-22 11:04                                                 ` Con Kolivas
  0 siblings, 1 reply; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 10:57 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm


* Con Kolivas <kernel@kolivas.org> wrote:

> On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > It clearly should not consider 'itself' as IO activity. This 
> > suggests some bug in the 'detect activity' mechanism, agreed? I'm 
> > wondering whether you are seeing the same problem, or is all 
> > swap-prefetch IO on your system continuous until it's done [or some 
> > other IO comes in between]?
> 
> When nothing else is happening anywhere on the system it reads in 
> bursts and goes to sleep during journal writeout.

hm, what do you call 'journal writeout' here that would be happening on 
my system?

	Ingo

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:57                                               ` Ingo Molnar
@ 2007-05-22 11:04                                                 ` Con Kolivas
       [not found]                                                   ` <20070522111104.GA14950@elte.hu>
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 11:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm

On Tuesday 22 May 2007 20:57, Ingo Molnar wrote:
> * Con Kolivas <kernel@kolivas.org> wrote:
> > On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > > It clearly should not consider 'itself' as IO activity. This
> > > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > > wondering whether you are seeing the same problem, or is all
> > > swap-prefetch IO on your system continuous until it's done [or some
> > > other IO comes in between]?
> >
> > When nothing else is happening anywhere on the system it reads in
> > bursts and goes to sleep during journal writeout.
>
> hm, what do you call 'journal writeout' here that would be happening on
> my system?

Not really sure what you have in terms of fs, but here even with nothing going 
on, ext3 writes to disk every 5 seconds with kjournald.

-- 
-ck

[parent not found: <20070522111104.GA14950@elte.hu>]

* Re: [PATCH] mm: swap prefetch improvements
       [not found]                                                   ` <20070522111104.GA14950@elte.hu>
@ 2007-05-22 11:12                                                     ` Ingo Molnar
  0 siblings, 0 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 11:12 UTC (permalink / raw)
  To: Con Kolivas
  Cc: Antonino Ingargiola, Nick Piggin, Ray Lee, ck list, Andrew Morton,
	linux-kernel, linux-mm


* Con Kolivas <kernel@kolivas.org> wrote:
 
> > hm, what do you call 'journal writeout' here that would be happening 
> > on my system?
> 
> Not really sure what you have in terms of fs, but here even with 
> nothing going on, ext3 writes to disk every 5 seconds with kjournald.

i have ext3, but it doesn't do that on my box. Also, i would have noticed
any IO activity in the 'swap prefetch off' case. When i said completely 
idle, i really meant it ;-)

so swap-prefetch stops for 5 seconds for no apparent reason.
 
	Ingo

* Re: [ck] Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:46                                           ` Ingo Molnar
  2007-05-22 10:54                                             ` Con Kolivas
@ 2007-05-22 20:18                                             ` Michael Chang
  2007-05-22 20:31                                               ` Ingo Molnar
  1 sibling, 1 reply; 233+ messages in thread
From: Michael Chang @ 2007-05-22 20:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Con Kolivas, Nick Piggin, Ray Lee, linux-kernel, ck list,
	linux-mm, Andrew Morton

On 5/22/07, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Con Kolivas <kernel@kolivas.org> wrote:
>
> > On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > > * Con Kolivas <kernel@kolivas.org> wrote:
> > > > > > there was nothing else running on the system - so i suspect the
> > > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > > total idleness.
> > > > >
> > > > > I've noted burst swapins separated by some seconds of pause in my
> > > > > desktop system too (with sp_tester and an idle gnome).
> > > >
> > > > That really is expected, as just about anything, including journal
> > > > writeout, would be enough to put it back to sleep for 5 more seconds.
> > >
> > > note that nothing like that happened on my system - in the
> > > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > > period.
> >
> > Ok, granted it's _very_ conservative. [...]
>
> but your first reaction was "it should not have slept for 5 seconds":
>
> | Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> | only stops if it detects any activity elsewhere. It doesn't actually
> | try to go idle in between
>
> It clearly should not consider 'itself' as IO activity. This suggests
> some bug in the 'detect activity' mechanism, agreed? I'm wondering
> whether you are seeing the same problem, or is all swap-prefetch IO on
> your system continuous until it's done [or some other IO comes
> in between]?

The only "problem" I can see with this idea is in the potential case
that it takes up all the IO activity, and so there is never enough IO
activity from other programs to trigger the wait mechanism because they
don't get a chance to run.

That could probably be "fixed" by capping the IO, though... (with one
of those oh-so-lovable "magic numbers" or a tunable)

That said, I don't think there are any issues with the code
compensating for its own activity in the "detect activity" mechanism
-- assuming there wasn't a major impact in e.g. maintainability or
something.

As for the burstiness... considering the "no negative impact" stance,
I can understand that. But it seems inefficient, at best...

-- 
Michael Chang

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
Thank you.

* Re: [ck] Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 20:18                                             ` [ck] " Michael Chang
@ 2007-05-22 20:31                                               ` Ingo Molnar
  0 siblings, 0 replies; 233+ messages in thread
From: Ingo Molnar @ 2007-05-22 20:31 UTC (permalink / raw)
  To: Michael Chang
  Cc: Con Kolivas, Nick Piggin, Ray Lee, linux-kernel, ck list,
	linux-mm, Andrew Morton


* Michael Chang <thenewme91@gmail.com> wrote:

> > It clearly should not consider 'itself' as IO activity. This 
> > suggests some bug in the 'detect activity' mechanism, agreed? I'm 
> > wondering whether you are seeing the same problem, or is all 
> > swap-prefetch IO on your system continuous until it's done [or some 
> > other IO comes in between]?
> 
> The only "problem" I can see with this idea is in the potential case 
> that it takes up all the IO activity, and so there is never enough IO 
> activity from other programs to trigger the wait mechanism because they
> don't get a chance to run.

i don't understand what you mean. Any 'use only idle IO capacity'
mechanism should immediately cease to be active the moment any other app 
tries to do IO - whether the IO subsystem is saturated or not.

> That said, I don't think there are any issues with the code 
> compensating for its own activity in the "detect activity" mechanism 
> -- assuming there wasn't a major impact in e.g. maintainability or 
> something.
> 
> As for the burstiness... considering the "no negative impact" stance,
> I can understand that. But it seems inefficient, at best...

well, it's a plain old bug (a not too serious one) in my book, i'm 
surprised that we are now at mail #7 about it :-) I reported it, and i 
guess Con will fix it eventually. There's really no need to deny that it 
exists or to try to talk it out of existence. Sheesh! :-)

	Ingo

* Re: [ck] Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 10:37                                         ` Con Kolivas
  2007-05-22 10:46                                           ` Ingo Molnar
@ 2007-05-22 20:42                                           ` Ash Milsted
  2007-05-22 22:50                                             ` Con Kolivas
  1 sibling, 1 reply; 233+ messages in thread
From: Ash Milsted @ 2007-05-22 20:42 UTC (permalink / raw)
  To: ck; +Cc: linux-kernel, ck list

On Tue, 22 May 2007 20:37:54 +1000
Con Kolivas <kernel@kolivas.org> wrote:

> On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > * Con Kolivas <kernel@kolivas.org> wrote:
> > > > > there was nothing else running on the system - so i suspect the
> > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > total idleness.
> > > >
> > > > I've noted burst swapins separated by some seconds of pause in my
> > > > desktop system too (with sp_tester and an idle gnome).
> > >
> > > That really is expected, as just about anything, including journal
> > > writeout, would be enough to put it back to sleep for 5 more seconds.
> >
> > note that nothing like that happened on my system - in the
> > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > period.
> 
> Ok, granted it's _very_ conservative. I really don't want to risk its presence 
> being a burden on anything, and the iowait it induces probably makes it turn 
> itself off for another PREFETCH_DELAY (5s). I really don't want to cross the 
> line to where it is detrimental in any way. Not dropping out on a 
> cond_resched and perhaps making the delay tunable should be enough to make it 
> a little less "sleepy".
> 
> -- 
> -ck

Hi. I just did some video encoding on my desktop and I was noticing
(for the first time in a while) that running apps had to hit swap quite
a lot when I switched to them (the encoding was going at full blast for
most of the day, and most of the time other running apps were
idle). Now, a good half of my RAM appeared to be free during all this,
so I was thinking at the time that it would be nice if swap prefetch
could be tunably more aggressive. I guess it would be ideal in this
case if it could kick in during tunably low disk-IO periods, even if
the CPU is rather busy. I'm sure you've considered this, so I only butt
in here to cast a vote for it. :)

Of course, I could be completely wrong about the possibility.. and I
seem to remember that the disk cache can take up about half the ram by
default without this showing up in 'gnome-system-monitor'... which I
guess might happen during heavy encoding.. but even if it did, I could
have set the limit lower, and would then have still appreciated
prefetching.

Ash

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 20:42                                           ` Ash Milsted
@ 2007-05-22 22:50                                             ` Con Kolivas
  2007-05-23  7:57                                               ` Ash Milsted
  0 siblings, 1 reply; 233+ messages in thread
From: Con Kolivas @ 2007-05-22 22:50 UTC (permalink / raw)
  To: ck; +Cc: Ash Milsted, linux-kernel

On Wednesday 23 May 2007 06:42, Ash Milsted wrote:
> Hi. I just did some video encoding on my desktop and I was noticing
> (for the first time in a while) that running apps had to hit swap quite
> a lot when I switched to them (the encoding was going at full blast for
> most of the day, and most of the time other running apps were
> idle). Now, a good half of my RAM appeared to be free during all this,
> so I was thinking at the time that it would be nice if swap prefetch
> could be tunably more aggressive. I guess it would be ideal in this
> case if it could kick in during tunably low disk-IO periods, even if
> the CPU is rather busy. I'm sure you've considered this, so I only butt
> in here to cast a vote for it. :)

In this case nicing the video encode should be enough to make it prefetch even 
during heavy cpu usage. It detects the total nice level rather than the cpu 
usage.

> Of course, I could be completely wrong about the possibility.. and I
> seem to remember that the disk cache can take up about half the ram by
> default without this showing up in 'gnome-system-monitor'... which I
> guess might happen during heavy encoding.. but even if it did, I could
> have set the limit lower, and would then have still appreciated
> prefetching.

I plan to make it prefetch more aggressively by default soon and make it more 
tunable too.

-- 
-ck

* Re: [PATCH] mm: swap prefetch improvements
  2007-05-22 22:50                                             ` Con Kolivas
@ 2007-05-23  7:57                                               ` Ash Milsted
  0 siblings, 0 replies; 233+ messages in thread
From: Ash Milsted @ 2007-05-23  7:57 UTC (permalink / raw)
  To: Con Kolivas; +Cc: ck, linux-kernel

On Wed, 23 May 2007 08:50:01 +1000
Con Kolivas <kernel@kolivas.org> wrote:

> On Wednesday 23 May 2007 06:42, Ash Milsted wrote:
> > Hi. I just did some video encoding on my desktop and I was noticing
> > (for the first time in a while) that running apps had to hit swap quite
> > a lot when I switched to them (the encoding was going at full blast for
> > most of the day, and most of the time other running apps were
> > idle). Now, a good half of my RAM appeared to be free during all this,
> > so I was thinking at the time that it would be nice if swap prefetch
> > could be tunably more aggressive. I guess it would be ideal in this
> > case if it could kick in during tunably low disk-IO periods, even if
> > the CPU is rather busy. I'm sure you've considered this, so I only butt
> > in here to cast a vote for it. :)
> 
> In this case nicing the video encode should be enough to make it prefetch even 
> during heavy cpu usage. It detects the total nice level rather than the cpu 
> usage.
> 

Cunning, but I guess the regular (less than 5 seconds apart)
reads/writes during the encoding process would cause prefetching to
hold off, no? I had used nice and ionice to reduce the encoder
priority, which made desktop apps pretty responsive, except when they
had to hit swap. If swap prefetch is using the idle io-priority I
suppose it would hardly affect performance if it kicked in during such
use, since it would operate in between the encoder reads anyway
(assuming the encoder is at higher ioprio), right ?
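
The idea of prefetch running "in between the encoder reads" can be sketched
as a toy dispatcher. This is only an illustration of an idle IO class under
stated assumptions, not the block layer's scheduler: one request is served
per tick, idle-class work is served only when no normal request is queued,
and whether swap prefetch actually issues its IO at idle priority is not
confirmed anywhere in this thread. The function and its parameters are
hypothetical.

```python
from collections import deque


def dispatch(encoder_arrivals, prefetch_requests, ticks):
    """Serve one request per tick and return a log of who was served.

    encoder_arrivals:  set of ticks at which the encoder queues one request.
    prefetch_requests: number of idle-class prefetch reads waiting.
    """
    normal = deque()
    idle = deque(range(prefetch_requests))
    log = []
    for t in range(ticks):
        if t in encoder_arrivals:
            normal.append(t)
        if normal:
            # Normal-priority IO always wins over idle-class IO.
            normal.popleft()
            log.append("encoder")
        elif idle:
            # Idle-class IO only fills ticks nobody else wants.
            idle.popleft()
            log.append("prefetch")
        else:
            log.append("-")
    return log
```

In this model the encoder is never delayed by prefetch; the prefetch reads
simply fill the ticks in which the encoder has nothing queued.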

> 
> I plan to make it prefetch more aggressively by default soon and make it more 
> tunable too.
> 

Sounds good!

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-10  3:48                   ` Ray Lee
  2007-05-10  3:56                     ` Nick Piggin
@ 2007-05-10  3:58                     ` Con Kolivas
  1 sibling, 0 replies; 233+ messages in thread
From: Con Kolivas @ 2007-05-10  3:58 UTC (permalink / raw)
  To: Ray Lee
  Cc: Nick Piggin, Ingo Molnar, ck list, Andrew Morton, linux-kernel,
	linux-mm

On Thursday 10 May 2007 13:48, Ray Lee wrote:
> On 5/9/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > You said it helped with the updatedb problem. That says we should look at
> > why it is going bad first, and for example improve use-once algorithms.
> > After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not,
> then you're just providing an airy hope with nothing to back up when or
> if that work would ever occur.
>
> Further, if you or someone else *does* do that work, then guess what,
> we still have the option to rip out the swap prefetching code after
> the hypothetical use-once improvements have been proven and merged.
> Which, by the way, I've watched people talk about since 2.4. That was,
> y'know, a *while* ago.
>
> So enough with the stop energy, okay? You're better than that.
>
> Con? He is right about the last feature to go in needs to work
> gracefully with what's there now. However, it's not unheard of for
> authors of other sections of code to help out with incompatibilities
> by answering politely phrased questions for guidance. Though the
> intersection of users between cpusets and desktop systems seems small
> indeed.

Let's just set the record straight. I actually discussed cpusets over a year 
ago in this nonsense and was told by sgi folk there was no need to get my 
head around cpusets and honouring node placement should be enough which, by 
the way, swap prefetch does. So I by no means ignored this; we just hit an 
impasse on just how much more featured it should be for the sake of a goddamn 
home desktop pc feature.

Anyway why the hell am I resurrecting this thread? The code is declared dead 
already. Leave it be.

-- 
-ck

* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04 12:10       ` Con Kolivas
  2007-05-05  8:42         ` Con Kolivas
@ 2007-05-07 14:28         ` Bill Davidsen
  1 sibling, 0 replies; 233+ messages in thread
From: Bill Davidsen @ 2007-05-07 14:28 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Con Kolivas wrote:
> On Friday 04 May 2007 18:52, Ingo Molnar wrote:
>> agreed. Con, IIRC you wrote a testcase for this, right? Could you please
>> send us the results of that testing?
> 
> Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch 
> disabled and then enabled swap prefetch saves ~5 seconds on average hardware 
> on this one test case. I had many users try this and the results were between 
> 2 and 10 seconds, but always showed a saving on this testcase. This effect 
> easily occurs on printing a big picture, editing a large file, compressing an 
> iso image or whatever in real world workloads. Smaller, but much more 
> frequent effects of this over the course of a day obviously also occur and do 
> add up.
> 
I'll try this when I get the scheduler stuff done, and also dig out the 
"resp1" stuff for "back when." I see the most recent datasets were 
comparing 2.5.43-mm2 responsiveness with 2.4.19-ck7, you know I always 
test your stuff ;-)

Guess it might need a bit of polish for current hardware, I was testing 
on *small* machines, deliberately.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: swap-prefetch: 2.6.22 -mm merge plans
  2007-05-04  8:52     ` Ingo Molnar
  2007-05-04  9:09       ` Nick Piggin
  2007-05-04 12:10       ` Con Kolivas
@ 2007-05-07 14:18       ` Bill Davidsen
  2 siblings, 0 replies; 233+ messages in thread
From: Bill Davidsen @ 2007-05-07 14:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>> i'm wondering about swap-prefetch:
> 
>> Being able to config all these core heuristics changes is really not 
>> that much of a positive. The fact that we might _need_ to config 
>> something out, and double the configuration range isn't too pleasing.
> 
> Well, to the desktop user this is a speculative performance feature that 
> he is willing to potentially waste CPU and IO capacity, in expectation 
> of better performance.
> 
> On the conceptual level it is _precisely the same thing as regular file 
> readahead_. (with the difference that to me swapahead seems to be quite 
> a bit more intelligent than our current file readahead logic.)
> 
> This feature has no API or ABI impact at all, it's a pure performance 
> feature. (besides the trivial sysctl to turn it on/off at runtime).
> 
>> Here were some of my concerns, and where our discussion got up to.

	[...snip...]

> i see no real problem here. We've had heuristics for a _long_ time in 
> various areas of the code. Sometimes they work, sometimes they suck.
> 
> the flow of this is really easy: distro looking for a feature edge turns 
> it on and announces it, if the feature does not work out for users then 
> user turns it off and complains to distro, if enough users complain then 
> distro turns it off for next release, upstream forgets about this 
> performance feature and eventually removes it once someone notices that 
> it wouldn't even compile in the past 2 main releases. I see no problem
> here, we did that in the past too with performance features. The 
> networking stack has literally dozens of such small tunable things which 
> get experimented with, and whose defaults do get tuned carefully. Some 
> of the knobs help bandwidth, some help latency.
> 
I haven't looked at this code since it first came out and didn't impress 
me, but I think it would be good to get the current version in. However, 
when you say "user turns it off" I hope you mean "in /proc/sys with a 
switch or knob" and not by expecting people to recompile and install a 
kernel. Then it might take a little memory but wouldn't do something 
undesirable.
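
A runtime switch of the kind asked for here could be driven from user space
roughly as follows. This is a minimal sketch, not taken from the patch: the
path /proc/sys/vm/swap_prefetch is an assumption (the thread only mentions
"the trivial sysctl" without naming it), and read_knob and write_knob are
hypothetical helper names.

```python
from pathlib import Path

# Assumed location of the runtime switch; the thread does not name the path.
SWAP_PREFETCH_SYSCTL = Path("/proc/sys/vm/swap_prefetch")


def read_knob(path=SWAP_PREFETCH_SYSCTL):
    """Return the integer value of a sysctl file, or None if it is absent."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        # A kernel built without the feature has no such entry.
        return None


def write_knob(value, path=SWAP_PREFETCH_SYSCTL):
    """Write an integer to a sysctl file (real /proc entries need root)."""
    path.write_text(f"{int(value)}\n")
```

Passing the path as a parameter keeps the helpers usable against a scratch
file, so the on/off logic can be tried without touching a live system.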

Note: I had no bad effect from the code, it just didn't feel faster. On 
a low memory machine it might help. Of course I have wanted to have a 
hard limit on memory used for i/o buffers, just to avoid swapping 
programs to make room for i/o, so to some extent I feel as if this is a 
fix for a problem we shouldn't have.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: 2.6.22 -mm merge plans
  2007-04-30 23:20 2.6.22 -mm merge plans Andrew Morton
                   ` (21 preceding siblings ...)
  2007-05-03 15:54 ` swap-prefetch: 2.6.22 -mm merge plans Ingo Molnar
@ 2007-05-07 17:47 ` Josef Sipek
  22 siblings, 0 replies; 233+ messages in thread
From: Josef Sipek @ 2007-05-07 17:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-mm

On Mon, Apr 30, 2007 at 04:20:07PM -0700, Andrew Morton wrote:
...
>  git-unionfs.patch
>
> Does this have a future?

Yes!  There are many active users who use our unioning functionality.

Namespace unification consists of several major parts:

1) Duplicate elimination: This can be handled in the VFS.  However, it would
   clutter up the VFS code with a lot of wrappers around key VFS functions
   to select the appropriate dentry/inode/etc. object from the underlying
   branch.  (You also need to provide efficient and sane readdir/seekdir
   semantics which we do with our "On Disk Format" support.)

2) Copyup: Having a unified namespace by itself isn't enough.  You also need
   copy-on-write functionality when the source file is on a read-only
   branch.  This makes unioning much more useful and is one of the main
   attractions to unionfs users.

3) Whiteouts: Whiteouts are a key unioning construct.  As was pointed out
   at OLS 2006, they are a property of the union and _NOT_ of a branch.
   Therefore, they should not be stored persistently on a branch but rather
   in some "external" storage.

4) You also need unique and *persistent* inode numbers for network f/s
   exports and other unix tools.

5) You need to provide dynamic branch management functionality: adding,
   removing, and changing the mode of branches in an existing union.
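
To make points 1-3 concrete, here is a toy user-space sketch in Python.  It
is NOT Unionfs code and says nothing about the kernel implementation; the
class name `Union`, the dict-based branches, and the method names are all
made up for illustration.  It only shows how topmost-branch lookup, readdir
duplicate elimination, copyup from a read-only branch, and union-level
whiteouts fit together.

```python
class Union:
    """Toy model of a union: each branch is a dict mapping name -> contents."""

    def __init__(self, branches, writable):
        # branches[0] has the highest priority; writable[i] marks branch i
        # as read-write (hypothetical layout, chosen for this sketch only).
        self.branches = branches
        self.writable = writable
        # Whiteouts are a property of the union, so they are kept outside
        # of every branch.
        self.whiteouts = set()

    def _topmost(self, name):
        # Index of the highest-priority branch holding `name`, ignoring whiteouts.
        for index, branch in enumerate(self.branches):
            if name in branch:
                return index
        return None

    def lookup(self, name):
        # A whited-out name is invisible even if a lower branch still has it.
        return None if name in self.whiteouts else self._topmost(name)

    def readdir(self):
        # Duplicate elimination: each visible name is reported once.
        names = []
        for branch in self.branches:
            for name in branch:
                if name not in self.whiteouts and name not in names:
                    names.append(name)
        return names

    def _copyup(self, name, src):
        # Copy the file to a writable branch of higher priority than `src`.
        for index in range(src):
            if self.writable[index]:
                self.branches[index][name] = self.branches[src][name]
                return index
        raise PermissionError("no writable branch above the source")

    def append(self, name, data):
        src = self.lookup(name)
        if src is None:
            raise FileNotFoundError(name)
        dst = src if self.writable[src] else self._copyup(name, src)
        self.branches[dst][name] += data
        return dst

    def unlink(self, name):
        if self.lookup(name) is None:
            raise FileNotFoundError(name)
        # Copies on writable branches can really be deleted ...
        for index, branch in enumerate(self.branches):
            if name in branch and self.writable[index]:
                del branch[name]
        # ... but a copy on a read-only branch can only be masked.
        if self._topmost(name) is not None:
            self.whiteouts.add(name)
```

With a writable branch stacked over a read-only one, appending to a file
that lives on the read-only branch leaves that branch untouched and puts the
modified copy on the writable branch; unlinking a read-only file records a
whiteout in the union rather than on any branch.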

We have had considerable experience with unioning file systems for years now; we
are currently working on the third generation of the code.  All of the above
features, and more, are USED by users, and are NEEDED by users.

We believe the right approach is the one we've taken, and is the least
intrusive: a standalone (stackable) file system that doesn't clutter the
VFS, with some small and gradual changes to the VFS to support stacking.  As
you may have noticed, we have been successfully submitting VFS patches to
make the VFS more stacking friendly (not just to Unionfs, but also to
eCryptfs which has been in since 2.6.19).

The older Union mounts, alas, try to put all that functionality into the
VFS.  We recognize that some people think that union mounts at the VFS level
are the "elegant" approach, but we hope people will listen to us and learn
from our experience: unioning may seem simple in principle, but it is
difficult in practice.  (See http://unionfs.fileystems.org/ for a lot more
info.)  So we don't think it is a viable long-term approach to have all of
the unioning functionality in the VFS, for two main reasons:

(1) If you want users to use a VFS-level unioning functionality a la
    union-mounts, then you're going to have to implement *all* of the
    features we have implemented; the VFS clutter and complexity that will
    result will be very considerable, and we just don't think that it'd
    happen.

(2) Some may suggest having a lightweight union mounts that only offers a
    subset of the functionality that's suitable for placing in the VFS.  In
    that case, most unionfs users simply won't use it.  You'd need union
    mounts to provide ALL of the functionality that we have TODAY, if you
    want users to use it.

As far as we can see the remaining stumbling block right now is cache
coherency between the layers.  Whether you provide unioning as a stackable
f/s or shoved into the VFS, coherency will have to be addressed.  In our
upcoming paper and talk at OLS'07, we plan to bring up and discuss several
ideas we've explored already on how to resolve this incoherency.  Our ideas
range from complex graph-based pointer management between objects of all
sorts, to simple timestamp-based VFS hooks.  (We've been experimenting with
several approaches and so far we're leaning toward the simple timestamp-based
one, again in the interest of keeping the VFS changes simple.  We hope
to have more results to report by OLS time.)
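
As a rough illustration of what a timestamp-based check could look like,
here is a toy Python sketch.  It is an assumption-laden model, not the hooks
we are proposing: `LowerFile`, `UpperFile`, and `revalidate` are hypothetical
names, and an integer counter stands in for a real modification time.  The
upper (stacked) object remembers the lower object's timestamp and refreshes
its cached copy whenever the two disagree.

```python
class LowerFile:
    """Stand-in for an object on the lower (underlying) file system."""

    def __init__(self, data):
        self.data = data
        self.mtime = 0

    def write(self, data):
        # Someone modifies the lower file directly, behind the upper layer's back.
        self.data = data
        self.mtime += 1  # counter used in place of a real timestamp


class UpperFile:
    """Stand-in for the stacked object that caches lower-level state."""

    def __init__(self, lower):
        self.lower = lower
        self.cached_data = lower.data
        self.cached_mtime = lower.mtime

    def revalidate(self):
        # Return True if the cache was still valid, False if it had to be refreshed.
        if self.lower.mtime == self.cached_mtime:
            return True
        self.cached_data = self.lower.data
        self.cached_mtime = self.lower.mtime
        return False

    def read(self):
        # Every access checks the lower timestamp before trusting the cache.
        self.revalidate()
        return self.cached_data
```

The appeal of this style is that the only thing the upper layer needs from
below is a cheap "has this changed?" comparison, with no pointers back from
lower objects to upper ones.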

Josef "Jeff" Sipek, on behalf of the Unionfs team.


