Linux 2.6.26-rc4

public inbox for linux-kernel@vger.kernel.org

* Linux 2.6.26-rc4
@ 2008-05-26 18:41 Linus Torvalds
  2008-05-26 21:24 ` Jesper Krogh
                   ` (5 more replies)
  0 siblings, 6 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-05-26 18:41 UTC
  To: Linux Kernel Mailing List


You know the drill by now: another week, another -rc.

There's a lot of small stuff in here that most people won't even notice. The 
most noticeable thing is for all you 32-bit x86 people who use PAE 
(enabled by the HIGHMEM64G config option) due to having too much memory in 
your machine - mprotect() was broken due to some of the PAT fix/cleanup 
patches, causing the NX bit not to be set correctly.

So if you had PAE enabled _and_ a recent enough CPU to have NX, but not 
recent enough to be 64-bit (or you were just perverse and wanted to run a 
32-bit kernel despite having a chip that could do 64-bit and enough memory 
that you _really_ should have used a 64-bit kernel), you'd get various 
random program failures with SIGSEGV. It ranged from X not starting up to 
apparently OpenOffice not working if it did.

But most of the changes, as usual, are in drivers, at 60%, with some DRI 
changes leading the way (fixing a number of other regressions, mainly by 
reverting the under-cooked vblank update). Network, MMC, USB, watchdog and 
IDE drivers also got updates.

We had CIFS and NFS updates, and some arch updates as usual. The dirstat 
gives the overview:

   2.0% arch/blackfin/
   3.3% arch/powerpc/configs/
   3.7% arch/powerpc/
  10.4% arch/
   4.5% drivers/ata/
  15.4% drivers/char/drm/
  15.8% drivers/char/
   9.6% drivers/hwmon/
   6.4% drivers/mmc/card/
   6.5% drivers/mmc/
   2.6% drivers/net/sfc/
   8.4% drivers/net/
   5.8% drivers/usb/class/
   6.1% drivers/usb/
   3.5% drivers/watchdog/
  60.9% drivers/
   7.7% fs/cifs/
   2.5% fs/nfs/
  12.7% fs/
   4.5% include/
   2.4% net/ipv4/
   2.0% net/sunrpc/xprtrdma/
   2.3% net/sunrpc/
   6.7% net/

and I append the shortlog for a sense of the details. Nothing really 
hugely exciting, I think we're doing pretty ok in the release cycle, and 
I'm getting the feeling that things are calming down.

Hopefully I'm not wrong about that "calming down" feeling, but on the 
whole this release cycle doesn't seem to have become _too_ painful.

			Linus

---
Abhijeet Kolekar (1):
      mac80211 : Association with 11n hidden ssid ap.

Adrian Bunk (10):
      nfs: make nfs4_drop_state_owner() static
      [POWERPC] powerpc/mm/hash_low_32.S: Remove CVS keyword
      sparc64: remove CVS keywords
      sparc: remove CVS keywords
      HID: remove CVS keywords
      drivers/atm/: remove CVS keywords
      make myri10ge_get_firmware_capabilities() static
      md: proper extern for mdp_major
      frv: export empty_zero_page
      sparc64: global_reg_snapshot is not for userspace

Al Viro (26):
      [Blackfin] arch: Blackfin checksum annotations
      take init_files to fs/file.c
      dup_fd() fixes, part 1
      dup_fd() part 2
      dup_fd() - part 3
      dup_fd() part 4 - race fix
      avoid multiplication overflows and signedness issues for max_fds
      get rid of leak in compat_execve()
      return to old errno choice in mkdir() et.al.
      missed kmalloc() in pcap_user.c
      fix include order in sys-i386/registers.c
      fix hppfs Makefile breakage
      uml: add missing exports for UML_RANDOM=m
      missing export of csum_partial() on uml/amd64
      thanks to net/mac80211 we need to pull drivers/leds/Kconfig on uml
      misc drivers/net endianness noise
      ecryptfs fixes
      sbus bpp: instances missed in s/dev_name/bpp_dev_name/
      irda-usb endianness annotations and fixes
      ocfs2 endianness fixes
      missing dependencies on HAS_DMA
      msnd_* is ISA-only
      MODULE_LICENSE expects "GPL v2", not "GPLv2"
      caiaq endianness fix
      provide out-of-line strcat() for m68k
      HTC_EGPIO is ARM-only

Alan Cox (4):
      iphase: Fix 64bit warning.
      MAINTAINERS needs further order fixing
      mm: fix atomic_t overflow in vm
      ip2: fix crashes on load/unload

Andi Kleen (2):
      kbuild: disable modpost warnings for linkonce sections
      x86: use explicit copy in vdso_gettimeofday()

Andrew Morton (7):
      hysdn: No longer broken on SMP.
      IB/mlx4: Fix uninitialized-var warning in mlx4_ib_post_send()
      [patch 1/1] audit_send_reply(): fix error-path memory leak
      [netdrvr] dm9000: use delayed work to update mii phy state fix
      pcnet32: fix warning
      drivers/net/tokenring/3c359.c: squish a warning
      drivers/net/tokenring/olympic.c: fix warning

Andrew Price (1):
      [GFS2] Fix cast from unsigned int to s64

Andy Fleming (1):
      ucc_geth: Fix arguments to dma map/unmap functions

Andy Whitcroft (1):
      zonelists: handle a node zonelist with no applicable entries

Anton Vorontsov (1):
      uli526x: add support for netpoll

Arjan van de Ven (2):
      Fix a deadlock in the bttv driver
      serial: fix enable_irq_wake/disable_irq_wake imbalance in serial_core.c

Arnaldo Carvalho de Melo (1):
      USB: OPTION: fix name of Onda MSA501HS HSDPA modem

Arnd Bergmann (1):
      [POWERPC] Update Cell MAINTAINERS entry, add spufs entry

Atsushi Nemoto (1):
      usb-serial: Use ftdi_sio driver for RATOC REX-USB60F

Aurelien Nephtali (1):
      net/usb: add support for Apple USB Ethernet Adapter

Avi Kivity (2):
      KVM: x86 emulator: fix writes to registers with modrm encodings
      KVM: Update MAINTAINERS for new mailing lists

Becky Bruce (1):
      e1000e: use resource_size_t, not unsigned long, for phys addrs

Ben Dooks (8):
      [ARM] 5040/1: BAST: Fix DM9000 IRQ flags initialisation
      [ARM] 5041/1: VR1000: Fix DM9000 IRQ flags initialisation
      [ARM] 5039/1: S3C244X: Rename SDI device if running on S3C244X.
      SM501: reverse FPEN/VBIASEN flags behaviour
      S3C2410: ensure that FB_BLANK_POWERDOWN shuts down the controller
      S3C2410: add error print if we cannot add attribute
      S3C2410: clean out changelog header and tidy
      S3C2410: fix driver MODULE_ALIAS()

Ben Hutchings (16):
      sfc: Use mod_timer() to set expiry and add_timer() together
      sfc: Removed casts to void
      sfc: Simplified efx_rx_calc_buffer_size() using get_order()
      sfc: Removed unncesssary UL suffixes on 0 literals
      sfc: Added and removed braces to comply with kernel style
      sfc: Replaced various macros with inline functions
      sfc: Merged efx_page_offset() into efx_rx_buf_offset()
      sfc: Use resource_size_t for PCI bus address
      sfc: Correct and expand some comments
      sfc: Use DMA_BIT_MASK() instead of our own DMA mask macros
      sfc: Do not define inline macro
      sfc: Use __packed macro
      sfc: Change type of efx_nic::nic_data to struct falcon_nic_data *
      sfc: Remove redundant casts to and from void *
      sfc: Added checks for heap allocation failure
      sfc: Remove sub-minor component from driver version

Bernd Schubert (2):
      md: md: raid5 rate limit error printk
      md: allow parallel resync of md-devices.

Bob Copeland (1):
      ath5k: Fix loop variable initializations

Bob Peterson (1):
      [GFS2] filesystem consistency error from do_strip

Brian Cavagnolo (1):
      libertas: fix command timeout after firmware failure

Brian King (1):
      ehea: Fix use after free on reboot

Brice Goglin (1):
      Add maintainers for myri10ge driver

Bryan Wu (1):
      Blackfin arch: Fix bug - USB fails to build for BF524/BF526

Cedric Le Goater (1):
      cgroups: remove node_ prefix_from ns subsystem

Chen Gong (1):
      [WATCHDOG] Fix booke_wdt.c on MPC85xx SMP system's

Christian Borntraeger (1):
      stop_machine: make stop_machine_run more virtualization friendly

Christoph Hellwig (2):
      [XFS] Fix memory corruption with small buffer reads
      md: kill file_path wrapper

Christophe Jaillet (3):
      kconfig: incorrect 'len' field initialisation ?
      avr32/pata: avoid unnecessary memset (updated after comments)
      iop-adma: fixup some kzalloc/memset confusions

Chuck Ebbert (1):
      x86: don't read maxlvt before checking if APIC is mapped

Cliff Cai (1):
      Blackfin arch: enable a choice to provide 4M DMA memory

Cyrill Gorcunov (2):
      module loading ELF handling: use SELFMAG instead of numeric constant
      ecryptfs: fix missed mutex_unlock

Dan Williams (1):
      md: notify userspace on 'stop' events

Darrick J. Wong (2):
      i5k_amb: support Intel 5400 chipset
      ibmaem: new driver for power/energy/temp meters in IBM System X hardware

Dave Airlie (1):
      Revert "drm/vbl rework: rework how the drm deals with vblank."

Dave Jones (2):
      [CPUFREQ] Remove documentation of removed ondemand tunable.
      [CIFS] Fix reversed memset arguments

Dave Olson (1):
      IB/mad: Fix kernel crash when .process_mad() returns SUCCESS|CONSUMED

David Brownell (3):
      gpio: pca953x driver handles pca9554 too
      gpio: build fixes
      spi: remove some spidev oops-on-rmmod paths

David Chinner (4):
      [XFS] Include linux/random.h in all builds, not just debug builds.
      [XFS] Fix fsync() b0rkage.
      [XFS] Don't allow memory reclaim to wait on the filesystem in inode
      [XFS] Fix inode list allocation size in writeback.

David Gibson (1):
      [POWERPC] Fix __set_fixmap() for STRICT_MM_TYPECHECKS

David S. Miller (7):
      sparc64: Add global register dumping facility.
      sunhv: Fix locking in non-paged I/O case.
      cassini: Only use chip checksum for ipv4 packets.
      xfrm_user: Remove zero length key checks.
      sparc64: Fix kernel thread stack termination.
      sparc64: Fix stack tracing through trap frames.
      sparc64: Prevent stack backtrace false positives on trap frames.

David Teigland (1):
      dlm: fix plock dev_write return value

David Woodhouse (2):
      net: Fix call to ->change_rx_flags(dev, IFF_MULTICAST) in dev_change_flags()
      libertas: Fix ethtool statistics

Denis Cheng (1):
      net/ipv4/arp.c: Use common hex_asc helpers

Denis V. Lunev (3):
      pktgen: make sure that pktgen_thread_worker has been executed
      modules: proper cleanup of kobject without CONFIG_SYSFS
      proc: proc_get_inode() should get module only once

Diego 'Flameeyes' Petteno (1):
      HID: split Numlock emulation quirk from HID_QUIRK_APPLE_HAS_FN.

Dominik Brodowski (1):
      [CPUFREQ] clarify license of freq_table.c

Dylan R Semler (1):
      HID: Add iMON LCDs to blacklist

Eric Paris (1):
      nfs/lsm: make NFSv4 set LSM mount options

Fernando Luis Vazquez Cao (1):
      for_each_online_pgdat(): kerneldoc fix

Francois Romieu (1):
      au1000_eth: remove useless check

Fred Isaman (1):
      nfs: fix race in nfs_dirty_request

Gabor Czigola (1):
      hdaps: invert the axes for HDAPS on Lenovo R61i ThinkPads

Gabriel C (2):
      [WATCHDOG] Add ICH9DO into the iTCO_wdt.c driver
      scripts/ver_linux use 'gcc -dumpversion'

Geoff Levand (1):
      [POWERPC] PS3: Fix memory hotplug

Gerrit Renker (1):
      [SC92031] Using padto turned driver into an IPv6-only interface

Greg Kroah-Hartman (14):
      Driver core: add device_create_vargs and device_create_drvdata
      mm: bdi: fix race in bdi_class device creation
      fbdev: fix race in device_create
      ide: fix race in device_create
      IB: fix race in device_create
      LEDS: fix race in device_create
      Power Supply: fix race in device_create
      UIO: fix race in device_create
      SOUND: fix race in device_create
      s390: fix race in device_create
      USB: Phidget: fix race in device_create
      USB: Core: fix race in device_create
      SCSI: fix race in device_create
      USB: add TELIT HDSPA UC864-E modem to option driver

Greg Ungerer (3):
      [ARM] 5051/1: define pgtable_t for the !CONFIG_MMU case too
      [ARM] 5052/1: export clock functions for the at91x40
      [ARM] 5053/1: define before use of processor_id

Harvey Harrison (8):
      sh: use the common ascii hex helpers
      nfs: replace remaining __FUNCTION__ occurrences
      ata: remove FIT() macro
      x86: fix integer as NULL pointer warning
      acpi: fix integer as NULL pointer warning
      isdn: fix integer as NULL pointer warning
      scsi: fix integer as NULL pointer warning
      fbdev: fix integer as NULL pointer warning

Heiko Carstens (2):
      s390: KVM guest: fix compile error
      memory hotplug: fix early allocation handling

Helmut Schaa (1):
      mac80211: fix NULL pointer dereference in ieee80211_compatible_rates

Herbert Xu (1):
      ipsec: Use the correct ip_local_out function

Hideo Saito (2):
      sh: Fix up optimized SH-4 memcpy on big endian.
      sh: Fix up thread info pointer in syscall_badsys resume path.

Huang Weiyi (1):
      Blackfin EMAC Driver: Removed duplicated include <linux/ethtool.h>

Hugh Dickins (1):
      x86: strengthen 64-bit p?d_bad()

Ignacio García Pérez (1):
      serial: support for InstaShield IS-400 four port RS-232 PCI card

Igor Mammedov (4):
      Fixed DFS code to work with new 'build_path_from_dentry', that returns full path if share in the dfs, now.
      CIFSGetDFSRefer cleanup + dfs_referral_level_3 fixed to conform REFERRAL_V3 the MS-DFSC spec.
      Fix possible access to undefined memory region.
      Adds username in the upcall key for unattended mounts with keytab

Ilpo Järvinen (3):
      tcp: Make prior_ssthresh a u32
      hamradio/scc: add missing block braces to multi-statement if
      s2io: add missing block braces to multistatement if statement

Ingo Molnar (3):
      USB: build fix
      namespacecheck: automated fixes
      x86: prevent PGE flush from interruption/preemption

Ivo van Doorn (1):
      mac80211: Add RTNL version of ieee80211_iterate_active_interfaces

J. Bruce Fields (2):
      nfsd: reorder printk in do_probe_callback to avoid use-after-free
      svcrpc: fix proc/net/rpc/auth.unix.ip/content display

Jack Morgenstein (1):
      IPoIB: Test for NULL broadcast object in ipiob_mcast_join_finish()

James Chapman (1):
      l2tp: avoid skb truesize bug if headroom is increased

Jan Beulich (1):
      x86/xen: fix arbitrary_virt_to_machine()

Jan Blunck (2):
      nfs: path_{get,put}() cleanups
      Don't clean bounds.h and asm-offsets.h

Javier Herrero (2):
      8250 Serial Driver: Added support for 8250-class UARTs in HV Sistemas H8606 board
      Blackfin serial driver: add extra IRQ flag for 8250 serial driver

Jay Fenlason (1):
      firewire: prevent userspace from accessing shut down devices

Jean Delvare (1):
      [GFS2] Prefer strlcpy() over snprintf()

Jeff Garzik (1):
      drivers/ata: trim trailing whitespace

Jeff Layton (4):
      [CIFS] CIFS currently allows for permissions to be changed on files, even
      fix memory leak in CIFSFindNext
      clarify return value of cifs_convert_flags()
      add function to convert access flags to legacy open mode

Jeremy Fitzhardinge (8):
      x86: define PTE_MASK in a universally useful way
      x86: fix warning on 32-bit non-PAE
      x86: rearrange __(VIRTUAL|PHYSICAL)_MASK
      x86: use PTE_MASK in 32-bit PAE
      x86: use PTE_MASK in pgtable_32.h
      x86: clarify use of _PAGE_CHG_MASK
      x86: use PTE_MASK rather than ad-hoc mask
      xen: use PTE_MASK in pte_mfn()

Jesse Barnes (3):
      drm/i915: fix off by one in VGA save/restore of AR & CR regs.
      PCI: correct mailing list address
      remove debug printk from DRM suspend path

Jike Song (1):
      .gitignore: match ncscope.out

Jiri Slaby (1):
      i2c: Align i2c_device_id

Joe Perches (1):
      drivers/net/ehea - remove unnecessary memset after kzalloc

Johannes Berg (1):
      mac80211: don't claim iwspy support

Johannes Weiner (1):
      mm: don't drop a partial page in a zone's memory map size

Jordan Crouse (1):
      [WATCHDOG] Add a watchdog driver based on the CS5535/CS5536 MFGPT timers

Josh Boyer (1):
      [POWERPC] 4xx: Workaround for CHIP_11 Errata

Julia Lawall (1):
      drivers/net/fs_enet: remove null pointer dereference

Karel Zak (1):
      MAINTAINERS: add util-linux-ng package

Kazunori MIYAZAWA (1):
      af_key: Fix selector family initialization.

Keith Packard (1):
      drm/i915: save and restore dsparb and d_state registers.

Komuro (2):
      xirc2ps_cs: re-initialize the multicast address in do_reset
      fmvj18x_cs: add NextCom NC5310 rev B support

Krzysztof Halasa (2):
      WAN: protect Cisco HDLC state changes with a spinlock.
      WAN: protect HDLC proto list while insmod/rmmod

Kumar Gala (5):
      lmb: Fix compile warning
      [POWERPC] Remove generated files on make clean
      [POWERPC] Update arch/powerpc/boot/.gitignore
      [POWERPC] Fix mpc8377_mds.dts DMA nodes to match spec
      edac: mpc85xx: fix building as a module

Lennert Buytenhek (1):
      USB: ehci-orion: the Orion EHCI root hub does have a Transaction Translator

Leonardo Potenza (1):
      dlm: section mismatch warning fix

Linus Torvalds (1):
      Linux 2.6.26-rc4

Maciej W. Rozycki (1):
      PHYLIB: Kconfig: Fix the dependency on S390

Magnus Damm (5):
      sh: fix USBF resource for sh7722
      sh: fix VPU interrupt vector for sh7723
      sh: add probe support for new sh7723 cut
      sh: use sm501 8250 mfd support on r2d boards
      sh: update Migo-R defconfig

Marc Pignat (1):
      at91_mci: minor cleanup

Marcelo Tosatti (3):
      KVM: PIT: take inject_pending into account when emulating hlt
      KVM: Fix kvm_vcpu_block() task state race
      KVM: LAPIC: ignore pending timers if LVTT is disabled

Marcin Krol (1):
      brd: don't show ramdisks in /proc/partitions

Marcin Slusarz (4):
      [CIFS] CIFSSMBPosixLock should return -EINVAL on error
      isdn/capi: Return proper errnos on module init.
      dlm: tcp_connect_to_sock should check for -EINVAL, not EINVAL
      ntfs: le*_add_cpu conversion

Mariusz Kozlowski (2):
      fix parenthesis in include/asm-mips/gic.h
      fix parenthesis in include/asm-mips/mach-au1x00/au1000.h

Mark Asselstine (1):
      hysdn: Remove cli()/sti() calls.

Mark Langsdorf (1):
      [CPUFREQ] powernow-k8: improve error messages

Mark Lord (10):
      sata_mv: always do softreset
      sata_mv: fis irq register fixes
      sata_mv: group genIIe flags
      sata_mv: async notify for genIIe only
      sata_mv: don't blindly enable IRQs
      sata_mv: consolidate main_irq_mask updates
      sata_mv: fix pmp drives not found
      sata_mv: disregard masked irqs
      sata_mv: cache main_irq_mask register in hpriv
      sata_mv: ensure empty request queue for FBS-NCQ EH

Masakazu Mokuno (1):
      wireless: Create 'device' symlink in sysfs

Masatake YAMATO (1):
      kbuild: escape meta characters in regular expression in make TAGS

Mathieu Chouquet-Stringer (1):
      hostap: fix "registers" registration in procfs

Matteo Croce (1):
      cpmac bugfixes and enhancements

Matthew Garrett (1):
      Fixups to ATA ACPI hotplug

Matthias Kaehlcke (1):
      dlm: convert connections_lock in a mutex

Matti Linnanvuori (1):
      doc: add a chapter about trylock functions [Bug 9011]

Mauro Carvalho Chehab (1):
      [ALSA] hda - Fix capture mute Widget for stac9250/9251

Michael Ellerman (1):
      [POWERPC] Add kernstart_addr to list of allowed symbols in prom_init

Michael F. Robbins (1):
      USB: serial: ch341: New VID/PID for CH341 USB-serial

Michael Hennerich (5):
      Blackfin arch: Check for Anomaly 05000182
      Blackfin arch: Sync channel defines with struct dma_register dma_io_base_addr.
      Blackfin arch: Add workaround to read edge triggered GPIOs
      Blackfin arch: IO Port functions to read/write unalligned memory
      Blackfin arch: update boards defconfig files

Michael Krufky (1):
      tuner: Do not alter i2c_client.name

Mikael Pettersson (3):
      sata_promise: fix irq clearing buglets
      sata_promise: mmio access cleanups
      sata_promise: other cleanups

Mike Frysinger (5):
      [Blackfin] arch: rename bf5xx-flash to bfin-async-flash
      atm: Cleanup atm_tcp.h and atm.h for userspace.
      Blackfin arch: cleanup the icplb/dcplb multiple hit checks
      Blackfin SPORTS UART Driver: converting BFIN->BLACKFIN
      [WATCHDOG] Blackfin Watchdog Driver: split platform device/driver

Miklos Szeredi (1):
      fuse: fix bdi naming conflict

MinChan Kim (1):
      slob: Fix to return wrong pointer

Mingarelli, Thomas (1):
      [WATCHDOG] hpwdt: Fix NMI handling.

NeilBrown (4):
      md: fix possible oops when removing a bitmap from an active array
      md: raid1: Fix restoration of bio between failed read and write.
      md: notify userspace on 'write-pending' changes to array_state
      md: restart recovery cleanly after device failure.

Nick Piggin (1):
      mm: allow pfnmap ->fault()s

Oleg Nesterov (3):
      signals: fix sigqueue_free() vs __exit_signal() race
      posix timers: sigqueue_free: don't free sigqueue if it is queued
      posix timers: discard SI_TIMER signals on exec

Oliver Neukum (2):
      USB: CDC WDM driver
      rtl8187: resource leak in error case

Patrick McHardy (5):
      net_sched: cls_api: fix return value for non-existant classifiers
      vlan: Correctly handle device notifications for layered VLAN devices
      [VLAN]: Propagate selected feature bits to VLAN devices
      netfilter: Move linux/types.h inclusions outside of #ifdef __KERNEL__
      vlan: Use bitmask of feature flags instead of seperate feature bits

Paul E. McKenney (1):
      list_for_each_rcu must die: audit

Paul Gortmaker (1):
      phylib: do EXPORT_SYMBOL on get_phy_id

Paul Mackerras (1):
      [POWERPC] Update defconfigs for desktop/server systems

Paul Mundt (5):
      sh: display boot params by default on entry.
      sh: disable initrd defaults in .empty_zero_page.
      sh: Make is_valid_bugaddr() more intelligent on nommu.
      sh: Fix up restorer in debug_trap exception return path.
      sh: Drop broken URAM support on SH7723.

Pavel Roskin (2):
      hostap_cs: add ID for Conceptronic CON11CPro
      orinoco_cs: add ID for SpeedStream wireless adapters

Pekka Enberg (1):
      slub: ksize() abuse checks

Philipp Zabel (1):
      [ARM] 5043/1: pxafb: remove unused mode variable in pxafb_init_fbinfo

Pierre Ossman (1):
      mmc: mmc host test driver

Pierre Ynard (1):
      rndis_host: increase delay in command response loop

Ralph Campbell (1):
      IB/ipath: Fix UC receive completion opcode for RDMA WRITE with immediate

Rami Rosen (1):
      net: The world is not perfect patch.

Randy Dunlap (1):
      kernel-doc: allow unnamed bit-fields

Robert P. J. Day (2):
      dlm: <linux/dlm_plock.h> should be "unifdef"ed.
      ipv6: Move <linux/in6.h> from header-y to unifdef-y.

Roel Kluin (2):
      wireless, airo: waitbusy() won't delay
      gpio: mcp23s08 debug fix

Roland Dreier (4):
      IB/ipath: Fix printk format for ipath_sdma_status
      RDMA/cxgb3: Fix uninitialized variable warning in iwch_post_send()
      IB/mthca: Fix max_sge value returned by query_device
      IB/mlx4: Fix creation of kernel QP with max number of send s/g entries

Russell King (4):
      [ARM] omap: fix omap clk support build errors
      Revert "[ARM] pxa: spitz wants PXA27x UDC definitions"
      [ARM] fix OMAP include loops
      [ARM] integrator: fix build warnings and errors

S.Çağlar Onur (1):
      Remove *.rej pattern from .gitignore

Sam Ravnborg (3):
      MAINTAINERS: document names of new kbuild trees
      kbuild: filter away debug symbols from kernel symbols
      Kconfig: introduce ARCH_DEFCONFIG to DEFCONFIG_LIST

Samuel Tardieu (3):
      [WATCHDOG] Make w83697h_wdt void-like functions void
      [WATCHDOG] Make w83697h_wdt timeout option string similar to others
      [WATCHDOG] Add w83697h_wdt early_disable option

Shi Weihua (1):
      sys_prctl(): fix return of uninitialized value

Sonic Zhang (1):
      pata-bf54x: Set ATAPI HSM to control IDE device terminate sequence.

Sreenivasa Honnur (3):
      S2io: Move all the transmit completions to a single msi-x (alarm) vector
      S2io: Added napi support when MSIX is enabled.
      S2io: Version update for napi and MSI-X patches

Sridhar Samudrala (1):
      tcp: TCP connection times out if ICMP frag needed is delayed

Stas Sergeev (6):
      snd-pcsp: adjust help texts to frighten users
      snd-pcsp: put back the compatibility code for the older alsa-libs
      snd-pcsp: depend on CONFIG_EXPERIMENTAL
      snd-pcsp: silent misleading warning
      snd-pcsp: use HRTIMER_CB_SOFTIRQ
      [ALSA] snd-pcsp - fix pcsp_treble_info() to honour an item number

Stefan Richter (1):
      ieee1394: sbp2: use correct size of command descriptor block

Stephen Hemminger (5):
      net: handle errors from device_rename
      sysfs: remove error messages for -EEXIST case
      bonding: handle case of device named bonding_master
      sky2: restore vlan acceleration on reset
      sb1250: use netdev_alloc_skb

Stephen Rothwell (2):
      [POWERPC] mpic: Fix use of uninitialized variable
      [POWERPC] iSeries: Remove unused mail address

Steve French (14):
      [CIFS] cleanup old checkpatch warnings
      [CIFS] don't explicitly do a FindClose on rewind when directory search has ended
      [CIFS] Fix paths when share is in DFS to include proper prefix
      [CIFS] suppress duplicate warning
      [CIFS] BKL-removal: convert CIFS over to unlocked_ioctl
      [CIFS] Finishup DFS code
      [CIFS] enable parsing for transport encryption mount parm
      [CIFS] Add missing defines for DFS
      [CIFS] add more complete mount options to cifs_show_options
      [CIFS] add missing seq_printf to cifs_show_options for hard mount option
      [CIFS] Enable DFS support for Unix query path info
      [CIFS] Enable DFS support for Windows query path info
      [CIFS] Remove debug statement
      [CIFS] Remove redundant NULL check

Steve Grubb (1):
      open sessionid permissions

Steve Wise (1):
      MAINTAINERS: Add cxgb3 and iw_cxgb3 NIC and iWARP driver entries

Takashi Iwai (3):
      [ALSA] hda - Fix ALC262 fujitsu model
      [ALSA] hda - Fix noise on VT1708 codec
      [ALSA] hda - Fix COEF and EAPD in ALC889 auto-configuration mode

Tejun Heo (10):
      libata: fix sata_link_hardreset() @online out parameter handling
      libata: reorganize ata_eh_reset() no reset method path
      libata: move reset freeze/thaw handling into ata_eh_reset()
      libata: kill hotplug related race condition
      libata: ignore recovered PHY errors
      libata: increase PMP register access timeout to 3s
      libata: make sure PMP notification is turned off during recovery
      libata: don't schedule LPM action seperately during probing
      sata_sil24: don't use NCQ if marvell 4140 PMP is attached
      libata: ignore SIMG4726 config pseudo device

Thomas Gleixner (3):
      x86: fix setup of cyc2ns in tsc_64.c
      x86: distangle user disabled TSC from unstable
      x86: disable TSC for sched_clock() when calibration failed

Thomas Graf (1):
      netlink: Fix nla_parse_nested_compat() to call nla_parse() directly

Thomas Hellstrom (1):
      drm: disable tasklets not IRQs when taking the drm lock spinlock

Thomas Kunze (1):
      [ARM] 5025/2: fix collie cpu initialisation

Tobias Diedrich (1):
      [netdrvr] forcedeth: Restore multicast settings on resume

Tom Tucker (23):
      svc: Remove extra check for XPT_DEAD bit in svc_xprt_enqueue
      svc: Remove unused header files from svc_xprt.c
      svcrdma: Simplify receive buffer posting
      svcrdma: Fix race with dto_tasklet in svc_rdma_send
      svcrdma: Fix return value in svc_rdma_send
      svcrdma: Add put of connection ESTABLISHED reference in rdma_cma_handler
      svcrdma: Free context on ib_post_recv error
      svcrdma: Free context on post_recv error in send_reply
      svcrdma: Fix error handling during listening endpoint creation
      svcrdma: Return error from rdma_read_xdr so caller knows to free context
      svcrdma: Remove unused READ_DONE context flags bit
      svcrdma: Simplify RDMA_READ deferral buffer management
      svcrdma: Use standard Linux lists for context cache
      svcrdma: Shrink scope of spinlock on RQ CQ
      svcrdma: Move destroy to kernel thread
      svcrdma: Add reference for each SQ/RQ WR
      svcrdma: Move the QP and cm_id destruction to svc_rdma_free
      svcrdma: Cleanup queued, but unprocessed I/O in svc_rdma_free
      svcrdma: Use ib verbs version of dma_unmap
      svcrdma: Set rqstp transport address in rdma_read_complete function
      svcrdma: Copy transport address and arm CQ before calling rdma_accept
      svcrdma: Change svc_rdma_send_error return type to void
      svcrdma: Verify read-list fits within RPCSVC_MAXPAGES

Tony Camuso (1):
      PCI: Correct last two HP entries in the bfsort whitelist

Tony Lindgren (2):
      mmc: Fix omap compile by replacing dev_name with dma_dev_name
      [ARM] 5038/1: ARM: OMAP: Remove tsc2102 references from board-palmte.c

Travis Place (3):
      [ALSA] hda - Fix ASUS P5GD1 model
      [ALSA] hda - Add model for ASUS P5K-E/WIFI-AP
      [ALSA] hda - Added support for Foxconn P35AX-S mainboard

Trent Piepho (1):
      gpiolib: fix off by one errors

Trond Myklebust (3):
      NFS: Ensure that 'noac' and/or 'actimeo=0' turn off attribute caching
      NFSv4: Check the return value of decode_compound_hdr_arg()
      SUNRPC: AUTH_SYS "machine creds" shouldn't use negative valued uid/gid

WANG Cong (2):
      [Patch] fs/binfmt_elf.c: fix a wrong free
      [Patch] fs/binfmt_elf.c: fix wrong return values

Wang Chen (3):
      VIRTIO: Use __skb_queue_purge()
      NETFRONT: Use __skb_queue_purge()
      3C509: rx_bytes should not be increased when alloc_skb failed

Xiantao Zhang (3):
      KVM: ia64: Define new kvm_fpreg struture to replace ia64_fpreg
      KVM: ia64: fix GVMM module including position-dependent objects
      KVM: ia64: Set KVM_IOAPIC_NUM_PINS to 48

Xiaofan Chen (1):
      HID: add Microchip PICKit 1 and PICkit 2 to blacklist

YOSHIFUJI Hideaki (4):
      ndisc: Add missing strategies for per-device retrans timer/reachable time settings.
      ipv6 addrconf: Fix route lifetime setting in corner case.
      ipv6 route: Fix lifetime in netlink.
      ipv6 addrconf: Allow infinite prefix lifetime.

Yoshihiro Shimoda (1):
      sh: fix sh7785 master clock value

Zhang Wei (1):
      fsldma: update the fsldma driver MAINTAINERS info

karl beldan (1):
      USB: pxa27x_udc - Fix Oops

maximilian attems (2):
      [CPUFREQ] Crusoe: longrun cpufreq module reports false min freq
      types.h: don't expose struct ustat to userspace


* Re: Linux 2.6.26-rc4
  2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
@ 2008-05-26 21:24 ` Jesper Krogh
  2008-05-26 21:42   ` Linus Torvalds
  2008-05-27  5:23 ` 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0 Alexey Dobriyan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-05-26 21:24 UTC
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List

Linus Torvalds wrote:
> You know the drill by now: another week, another -rc.

I did get this one (which I didn't on 2.6.25.2)

[42949399.810959] ck804xrom ck804xrom_init_one(): Unable to register resource 0x0000000000000000-0x00000000ffffffff - kernel bug?
[42949399.979924] ------------[ cut here ]------------
[42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159 __ioremap_caller+0x299/0x330()
[42949399.979924] Modules linked in: ck804xrom(+) mtd i2c_nforce2(+) niu(+) i2c_core serio_raw button(+) chipreg map_funcs pcspkr k8temp shpchp pci_hotplug evdev joydev ext3 jbd mbcache pata_amd sr_mod cdrom sg sd_mod usb_storage libusual pata_acpi usbhid hid mptsas ata_generic mptspi scsi_transport_sas mptscsih mptbase libata scsi_transport_spi ehci_hcd e1000 scsi_mod dock ohci_hcd usbcore dm_mirror dm_log dm_snapshot dm_mod thermal processor fan fuse
[42949399.979924] Pid: 5660, comm: modprobe Not tainted 2.6.26-rc4 #1
[42949399.979924]
[42949399.979924] Call Trace:
[42949399.979924]  [<ffffffff80236aa4>] warn_on_slowpath+0x64/0xa0
[42949399.979924]  [<ffffffff8034b697>] idr_get_empty_slot+0xf7/0x280
[42949399.979924]  [<ffffffff80237b4e>] printk+0x4e/0x60
[42949399.979924]  [<ffffffff80465e52>] klist_iter_exit+0x12/0x20
[42949399.979924]  [<ffffffff80223139>] __ioremap_caller+0x299/0x330
[42949399.979924]  [<ffffffffa00ee1f6>] 
:ck804xrom:init_ck804xrom+0x1f6/0x631
[42949399.979924]  [<ffffffffa00ee1f6>] 
:ck804xrom:init_ck804xrom+0x1f6/0x631
[42949399.979924]  [<ffffffff8025c23a>] sys_init_module+0x17a/0x1d00
[42949399.979924]  [<ffffffff8028c228>] vma_prio_tree_insert+0x28/0x60
[42949399.979924]  [<ffffffff8020c2bb>] system_call_after_swapgs+0x7b/0x80
[42949399.979924]
[42949399.979924] ---[ end trace 5bb785355abc57e6 ]---
[42949399.979924] ck804xrom: ioremap(00000000, 100000000) failed

-- 
Jesper

* Re: Linux 2.6.26-rc4
  2008-05-26 21:24 ` Jesper Krogh
@ 2008-05-26 21:42   ` Linus Torvalds
  2008-05-27  0:25     ` Arjan van de Ven
  2008-05-27  1:16     ` Carl-Daniel Hailfinger
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-05-26 21:42 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Linux Kernel Mailing List, Carl-Daniel Hailfinger,
	David Woodhouse

On Mon, 26 May 2008, Jesper Krogh wrote:
> 
> I did get this one (which I didn't on 2.6.25.2)
> 
> [42949399.810959] ck804xrom ck804xrom_init_one(): Unable to register resource
> 0x0000000000000000-0x00000000ffffffff - kernel bug?

Something is trying to register a 4GB resource. That sounds unlikely 
(possible on a 64-bit PCI setup, but I think it's more likely to be some 
overflow of 0 in "unsigned int").

In fact, this seems to be due to some driver bug. It looks like we have

	window->size = 0xffffffffUL - window->phys + 1UL;

and in order for window->size to be 0x100000000, that means that 
window->phys has to be 0. Which looks impossible, or at least like 
ent->driver_data is neither DEV_CK804 nor DEV_MCP55. Very odd.
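
A minimal user-space sketch of the arithmetic above, not driver code. The
helper names window_size_64() and window_size_32() are made up for
illustration, and fixed-width types stand in for the kernel's "unsigned
long" (64 bits wide on x86-64, 32 bits on 32-bit x86). Assuming
window->phys really is 0, the 64-bit result is 0x100000000, while a 32-bit
"unsigned long" would wrap around to 0:

```c
#include <stdint.h>

/*
 * Hypothetical helpers modelling
 *     window->size = 0xffffffffUL - window->phys + 1UL;
 * with explicit widths instead of "unsigned long".
 */

/* "unsigned long" as on a 64-bit kernel: no wrap-around at 4GB. */
static uint64_t window_size_64(uint64_t phys)
{
	return UINT64_C(0xffffffff) - phys + 1;
}

/* "unsigned long" as on a 32-bit kernel: the result is taken modulo 2^32. */
static uint32_t window_size_32(uint32_t phys)
{
	return (uint32_t)(UINT32_C(0xffffffff) - phys + 1);
}
```

With phys == 0, window_size_64() returns 0x100000000 (the full 4GB seen in
the log) and window_size_32() returns 0. With a non-zero base such as
0xffb00000, both return 0x500000.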

The warning:

> [42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159 __ioremap_caller+0x299/0x330()

is then just a result of the driver blindly continuing and trying to 
"ioremap()" the resource even though it's bogus and the resource 
allocation failed.

In other words, that driver init routine is really bad about error 
handling. Carl-Daniel? David?

		Linus

* Re: Linux 2.6.26-rc4
  2008-05-26 21:42   ` Linus Torvalds
@ 2008-05-27  0:25     ` Arjan van de Ven
  2008-05-27  0:31       ` Arjan van de Ven
  2008-05-27  5:43       ` David Woodhouse
  2008-05-27  1:16     ` Carl-Daniel Hailfinger
  1 sibling, 2 replies; 89+ messages in thread
From: Arjan van de Ven @ 2008-05-27  0:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesper Krogh, Linux Kernel Mailing List, Carl-Daniel Hailfinger,
	David Woodhouse

On Mon, 26 May 2008 14:42:37 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Mon, 26 May 2008, Jesper Krogh wrote:
> > 
> > I did get this one (which I didn't on 2.6.25.2)
> > 
> > [42949399.810959] ck804xrom ck804xrom_init_one(): Unable to
> > register resource 0x0000000000000000-0x00000000ffffffff - kernel
> > bug?
> 
> Something is trying to register a 4GB resource. That sounds unlikely 
> (possible on a 64-bit PCI setup, but I think it's more likely to be
> some overflow of 0 in "unsigned int").
> 
> In fact, this seems to be due to some driver bug. It looks like we
> have
> 
> 	window->size = 0xffffffffUL - window->phys + 1UL;
> 
> and in order for window->size to be 0x100000000, that means that 
> window->phys has to be 0. Which looks impossible, or at least like 
> ent->driver_data is neither DEV_CK804 nor DEV_MCP55. Very odd.
> 
> The warning:
> 
> > [42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159
> > __ioremap_caller+0x299/0x330()
> 
> is then just a result of the driver blindly continuing and trying to 
> "ioremap()" the resource even though it's bogus and the resource 
> allocation failed.
> 
> In other words, that driver init routine is really bad about error 
> handling. Carl-Daniel? David?
> 

btw this guy has shown up on kerneloops.org a lot: 
http://www.kerneloops.org/searchweek.php?search=__ioremap_caller
where it's trying to map memory as uncachable, which is.. well nasty
(it seems to map not just the piece it needs, but more, and then turns
that "more" uncachable, even if the kernel is using it for "normal"
things)

* Re: Linux 2.6.26-rc4
  2008-05-27  0:25     ` Arjan van de Ven
@ 2008-05-27  0:31       ` Arjan van de Ven
  2008-05-27  5:43       ` David Woodhouse
  1 sibling, 0 replies; 89+ messages in thread
From: Arjan van de Ven @ 2008-05-27  0:31 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	Carl-Daniel Hailfinger, David Woodhouse

On Mon, 26 May 2008 17:25:19 -0700
Arjan van de Ven <arjan@infradead.org> wrote:

> On Mon, 26 May 2008 14:42:37 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > 
> > 
> > On Mon, 26 May 2008, Jesper Krogh wrote:
> > > 
> > > I did get this one (which I didn't on 2.6.25.2)
> > > 
> > > [42949399.810959] ck804xrom ck804xrom_init_one(): Unable to
> > > register resource 0x0000000000000000-0x00000000ffffffff - kernel
> > > bug?
> > 
> > Something is trying to register a 4GB resource. That sounds
> > unlikely (possible on a 64-bit PCI setup, but I think it's more
> > likely to be some overflow of 0 in "unsigned int").
> > 
> > In fact, this seems to be due to some driver bug. It looks like we
> > have
> > 
> > 	window->size = 0xffffffffUL - window->phys + 1UL;
> > 
> > and in order for window->size to be 0x100000000, that means that 
> > window->phys has to be 0. Which looks impossible, or at least like 
> > ent->driver_data is neither DEV_CK804 nor DEV_MCP55. Very odd.
> > 
> > The warning:
> > 
> > > [42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159
> > > __ioremap_caller+0x299/0x330()
> > 
> > is then just a result of the driver blindly continuing and trying
> > to "ioremap()" the resource even though it's bogus and the resource 
> > allocation failed.
> > 
> > In other words, that driver init routine is really bad about error 
> > handling. Carl-Daniel? David?
> > 
> 
> btw this guy has shown up on kerneloops.org a lot: 
> http://www.kerneloops.org/searchweek.php?search=__ioremap_caller
> where it's trying to map memory as uncachable, which is.. well nasty
> (it seems to map not just the piece it needs, but more, and then turns
> that "more" uncachable, even if the kernel is using it for "normal"
> things)

one thing to note: it only shows up on 64 bit kernels somehow...
interesting.



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: Linux 2.6.26-rc4
  2008-05-26 21:42   ` Linus Torvalds
  2008-05-27  0:25     ` Arjan van de Ven
@ 2008-05-27  1:16     ` Carl-Daniel Hailfinger
  2008-05-27  1:23       ` Carl-Daniel Hailfinger
  2008-05-27 10:35       ` Jeff Garzik
  1 sibling, 2 replies; 89+ messages in thread
From: Carl-Daniel Hailfinger @ 2008-05-27  1:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jesper Krogh, Linux Kernel Mailing List, David Woodhouse

On 26.05.2008 23:42, Linus Torvalds wrote:
> On Mon, 26 May 2008, Jesper Krogh wrote:
>   
>> I did get this one (which I didn't on 2.6.25.2)
>>
>> [42949399.810959] ck804xrom ck804xrom_init_one(): Unable to register resource
>> 0x0000000000000000-0x00000000ffffffff - kernel bug?
>>     
>
> Something is trying to register a 4GB resource. That sounds unlikely 
> (possible on a 64-bit PCI setup, but I think it's more likely to be some 
> overflow of 0 in "unsigned int").
>
> In fact, this seems to be due to some driver bug. It looks like we have
>
> 	window->size = 0xffffffffUL - window->phys + 1UL;
>
> and in order for window->size to be 0x100000000, that means that 
> window->phys has to be 0. Which looks impossible, or at least like 
> ent->driver_data is neither DEV_CK804 nor DEV_MCP55. Very odd.
>
> The warning:
>
>   
>> [42949399.979924] WARNING: at arch/x86/mm/ioremap.c:159 __ioremap_caller+0x299/0x330()
>>     
>
> is then just a result of the driver blindly continuing and trying to 
> "ioremap()" the resource even though it's bogus and the resource 
> allocation failed.
>
> In other words, that driver init routine is really bad about error 
> handling. Carl-Daniel? David?
>   

It hurts to look at this:

static struct pci_device_id ck804xrom_pci_tbl[] = {
	{ PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
	{ 0, }
};

considering what struct pci_device_id looks like:

struct pci_device_id {
	__u32 vendor, device;		/* Vendor and device ID or PCI_ANY_ID*/
	__u32 subvendor, subdevice;	/* Subsystem ID's or PCI_ANY_ID */
	__u32 class, class_mask;	/* (class,subclass,prog-if) triplet */
	kernel_ulong_t driver_data;	/* Data private to the driver */
};



DEV_CK804 and DEV_MCP55 actually end up in class instead of driver_data.

I'd send a patch, but I'm traveling and my only code access is gitweb.

New code should look like
static struct pci_device_id ck804xrom_pci_tbl[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
	{ 0, }
};
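
The mix-up can be reproduced in user space. This is only a sketch under
stated assumptions: struct demo_device_id copies the field order of struct
pci_device_id quoted above (with "class" renamed to dev_class), DEMO_ANY_ID
and DEMO_DEVICE() are stand-ins for PCI_ANY_ID and PCI_DEVICE(), and the
value 1 for DEMO_CK804 is arbitrary. A positional initializer with five
values fills the first five fields, so the fifth value lands in the class
field and driver_data stays 0; the designated form puts it where it was
meant to go.

```c
#include <stdint.h>

#define DEMO_ANY_ID 0xffffffffu	/* stand-in for PCI_ANY_ID */
#define DEMO_CK804  1		/* arbitrary non-zero marker for this demo */

/* Same field order as struct pci_device_id; "class" renamed to dev_class. */
struct demo_device_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
	uint32_t dev_class, class_mask;
	unsigned long driver_data;
};

/* Stand-in for PCI_DEVICE(): names every field it sets. */
#define DEMO_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = DEMO_ANY_ID, .subdevice = DEMO_ANY_ID

/* Old style: the fifth positional value goes into dev_class. */
static const struct demo_device_id positional =
	{ 0x10de, 0x0051, DEMO_ANY_ID, DEMO_ANY_ID, DEMO_CK804 };

/* New style: driver_data is named, so field order no longer matters. */
static const struct demo_device_id designated =
	{ DEMO_DEVICE(0x10de, 0x0051), .driver_data = DEMO_CK804 };
```

Here positional.dev_class is DEMO_CK804 and positional.driver_data is 0,
while designated.driver_data is DEMO_CK804 and designated.dev_class is 0.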


Regards,
Carl-Daniel


* Re: Linux 2.6.26-rc4
  2008-05-27  1:16     ` Carl-Daniel Hailfinger
@ 2008-05-27  1:23       ` Carl-Daniel Hailfinger
  2008-05-27  1:52         ` Abhijit Menon-Sen
  2008-05-27 10:35       ` Jeff Garzik
  1 sibling, 1 reply; 89+ messages in thread
From: Carl-Daniel Hailfinger @ 2008-05-27  1:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jesper Krogh, Linux Kernel Mailing List, David Woodhouse

On 27.05.2008 03:16, Carl-Daniel Hailfinger wrote:
> DEV_CK804 and DEV_MCP55 actually end up in class instead of driver_data.
>
> I'd send a patch, but I'm traveling and my only code access is gitweb.
>
> New code should look like
> static struct pci_device_id ck804xrom_pci_tbl[] = {
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
> 	{ 0, }
> };
>   

The change is
Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
in case someone makes a patch from it.

Regards,
Carl-Daniel

* Re: Linux 2.6.26-rc4
  2008-05-27  1:23       ` Carl-Daniel Hailfinger
@ 2008-05-27  1:52         ` Abhijit Menon-Sen
  2008-05-27  5:19           ` Jesper Krogh
                             ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Abhijit Menon-Sen @ 2008-05-27  1:52 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Linus Torvalds, Jesper Krogh, David Woodhouse, linux-kernel

At 2008-05-27 03:23:19 +0200, c-d.hailfinger.devel.2006@gmx.net wrote:
>
> The change is
> Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
> in case someone makes a patch from it.

Here you go.

-- ams

diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c
index 59d8fb4..effaf7c 100644
--- a/drivers/mtd/maps/ck804xrom.c
+++ b/drivers/mtd/maps/ck804xrom.c
@@ -331,15 +331,15 @@ static void __devexit ck804xrom_remove_one (struct pci_dev *pdev)
 }
 
 static struct pci_device_id ck804xrom_pci_tbl[] = {
-	{ PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
 	{ 0, }
 };
 

* Re: Linux 2.6.26-rc4
  2008-05-27  1:52         ` Abhijit Menon-Sen
@ 2008-05-27  5:19           ` Jesper Krogh
  2008-05-27  5:31           ` [MTD] [MAPS] ck804rom: fix driver_data in probe table David Woodhouse
  2008-05-27  5:31           ` Linux 2.6.26-rc4 David Woodhouse
  2 siblings, 0 replies; 89+ messages in thread
From: Jesper Krogh @ 2008-05-27  5:19 UTC (permalink / raw)
  To: Abhijit Menon-Sen
  Cc: Carl-Daniel Hailfinger, Linus Torvalds, David Woodhouse,
	linux-kernel

Abhijit Menon-Sen wrote:
> At 2008-05-27 03:23:19 +0200, c-d.hailfinger.devel.2006@gmx.net wrote:
>> The change is
>> Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
>> in case someone makes a patch from it.
>
> Here you go.
>
> -- ams
>
> diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c
> index 59d8fb4..effaf7c 100644

Ok. Patch applied. The stacktrace goes away, but I still get these:

[42949399.211630] ck804xrom ck804xrom_init_one(): Unable to register resource 0x00000000ffb00000-0x00000000ffffffff - kernel bug?
[42949399.399754] CFI: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] CFI: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] CFI: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] JEDEC: Found no ck804xrom @ffc00000 device at location zero
[42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero
[42949399.409759] JEDEC: Found no ck804xrom @ffc10000 device at location zero
[42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero
[42949399.409759] JEDEC: Found no ck804xrom @ffc10000 device at location zero
[42949399.409759] CFI: Found no ck804xrom @ffc10000 device at location zero
... and many more.

The system works Ok anyway.

-- 
Jesper

* 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
  2008-05-26 21:24 ` Jesper Krogh
@ 2008-05-27  5:23 ` Alexey Dobriyan
  2008-05-27  9:06   ` Oleg Nesterov
  2008-05-27 10:01 ` Linux 2.6.26-rc4 J.A. Magallón
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: Alexey Dobriyan @ 2008-05-27  5:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, akpm

PREEMPT_RCU is in use, again. And die counter is 2 because of CFQ oops
I haven't noticed earlier.



0xffffffff802447cb is in find_pid_ns (kernel/pid.c:297).
292             struct hlist_node *elem;
293             struct upid *pnr;
294
295             hlist_for_each_entry_rcu(pnr, elem,
296                             &pid_hash[pid_hashfn(nr, ns)], pid_chain)
297                     if (pnr->nr == nr && pnr->ns == ns)
298                             return container_of(pnr, struct pid,
299                                             numbers[ns->level]);
300
301             return NULL;


general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC
CPU 0 
Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
Pid: 15599, comm: profil01 Tainted: G      D   2.6.26-rc4 #1
RIP: 0010:[<ffffffff802447cb>]  [<ffffffff802447cb>] find_pid_ns+0x6b/0xa0
RSP: 0018:ffff810129021ea8  EFLAGS: 00010202
RAX: ffff810130580948 RBX: 0000000000003cef RCX: ffff81017d865278
RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff80566760 RDI: 0000000000003cef
RBP: ffff810129021ea8 R08: 0000000000000000 R09: 00007f9a93987b70
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f9a9397c6f0(0000) GS:ffffffff805c6000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000257f2e8 CR3: 0000000102479000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process profil01 (pid: 15599, threadinfo ffff810129020000, task ffff81004bc24500)
Stack:  ffff810129021eb8 ffffffff8024487d ffff810129021f78 ffffffff8023f275
 0000000000000011 0000000000000000 0000000000003cef ffff810129020000
 ffffffff8061b140 00007f9a93989bc0 00007fff9b98a410 ffffffff8045fd63
Call Trace:
 [<ffffffff8024487d>] find_vpid+0x1d/0x20
 [<ffffffff8023f275>] sys_kill+0x85/0x1b0
 [<ffffffff8045fd63>] ? lockdep_sys_exit_thunk+0x35/0x67
 [<ffffffff8045fcf2>] ? trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8023d9e1>] ? lock_task_sighand+0x41/0x80
 [<ffffffff8020b68b>] system_call_after_swapgs+0x7b/0x80
Code: c2 48 c1 e0 02 48 01 c2 48 d3 ea 48 c1 e2 03 48 03 15 72 4a 41 00 48 8b 02 48 85 c0 48 89 c2 75 0a eb 30 48 8b 12 48 85 d2 74 28 <3b> 7a f0 48 8b 02 48 8d 4a f0 0f 18 08 75 e9 48 3b 72 f8 75 e3 
RIP  [<ffffffff802447cb>] find_pid_ns+0x6b/0xa0
 RSP <ffff810129021ea8>
---[ end trace 2cae3e148f7cd27c ]---


* [MTD] [MAPS] ck804rom: fix driver_data in probe table.
  2008-05-27  1:52         ` Abhijit Menon-Sen
  2008-05-27  5:19           ` Jesper Krogh
@ 2008-05-27  5:31           ` David Woodhouse
  2008-05-27  5:31           ` Linux 2.6.26-rc4 David Woodhouse
  2 siblings, 0 replies; 89+ messages in thread
From: David Woodhouse @ 2008-05-27  5:31 UTC (permalink / raw)
  To: Abhijit Menon-Sen
  Cc: Carl-Daniel Hailfinger, Linus Torvalds, Jesper Krogh,
	linux-kernel

There's a reason why using C99 initialisers even in the supposedly
trivial structs is a good idea.

Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>

diff --git a/drivers/mtd/maps/ck804xrom.c b/drivers/mtd/maps/ck804xrom.c
index 59d8fb4..effaf7c 100644
--- a/drivers/mtd/maps/ck804xrom.c
+++ b/drivers/mtd/maps/ck804xrom.c
@@ -331,15 +331,15 @@ static void __devexit ck804xrom_remove_one (struct pci_dev *pdev)
 }
 
 static struct pci_device_id ck804xrom_pci_tbl[] = {
-	{ PCI_VENDOR_ID_NVIDIA, 0x0051, PCI_ANY_ID, PCI_ANY_ID, DEV_CK804 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0360, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0361, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0362, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0363, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0364, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0365, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0366, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
-	{ PCI_VENDOR_ID_NVIDIA, 0x0367, PCI_ANY_ID, PCI_ANY_ID, DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
+	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
 	{ 0, }
 };
 

-- 
dwmw2


* Re: Linux 2.6.26-rc4
  2008-05-27  1:52         ` Abhijit Menon-Sen
  2008-05-27  5:19           ` Jesper Krogh
  2008-05-27  5:31           ` [MTD] [MAPS] ck804rom: fix driver_data in probe table David Woodhouse
@ 2008-05-27  5:31           ` David Woodhouse
  2 siblings, 0 replies; 89+ messages in thread
From: David Woodhouse @ 2008-05-27  5:31 UTC (permalink / raw)
  To: Abhijit Menon-Sen
  Cc: Carl-Daniel Hailfinger, Linus Torvalds, Jesper Krogh,
	linux-kernel

On Tue, 2008-05-27 at 07:22 +0530, Abhijit Menon-Sen wrote:
> At 2008-05-27 03:23:19 +0200, c-d.hailfinger.devel.2006@gmx.net wrote:
> >
> > The change is
> > Signed-off-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
> > in case someone makes a patch from it.
> 
> Here you go.

Thanks.

-- 
dwmw2


* Re: Linux 2.6.26-rc4
  2008-05-27  0:25     ` Arjan van de Ven
  2008-05-27  0:31       ` Arjan van de Ven
@ 2008-05-27  5:43       ` David Woodhouse
  2008-05-27  6:00         ` Arjan van de Ven
  1 sibling, 1 reply; 89+ messages in thread
From: David Woodhouse @ 2008-05-27  5:43 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	Carl-Daniel Hailfinger

On Mon, 2008-05-26 at 17:25 -0700, Arjan van de Ven wrote:
> btw this guy has shown up on kerneloops.org a lot: 
> http://www.kerneloops.org/searchweek.php?search=__ioremap_caller
> where it's trying to map memory as uncachable, which is.. well nasty
> (it seems to map not just the piece it needs, but more, and then turns
> that "more" uncachable, even if the kernel is using it for "normal"
> things)

The driver needs that 'more' to reach the lock registers for the flash
chip. If it's being used for other things, shouldn't the
request_region() fail?

On a vaguely related note, there's a lot to be said for _not_ using the
standard PCI driver setup on the BIOS flash drivers, and going back to
having them manually loaded rather than being automatically loaded
whenever the appropriate southbridge is present.

It would be nicer if the only people who have write access to their BIOS
flash are the ones who _really_ wanted it.

-- 
dwmw2

* Re: Linux 2.6.26-rc4
  2008-05-27  5:43       ` David Woodhouse
@ 2008-05-27  6:00         ` Arjan van de Ven
  2008-05-27  6:24           ` David Woodhouse
  0 siblings, 1 reply; 89+ messages in thread
From: Arjan van de Ven @ 2008-05-27  6:00 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	Carl-Daniel Hailfinger

On Tue, 27 May 2008 06:43:51 +0100
David Woodhouse <dwmw2@infradead.org> wrote:

> On Mon, 2008-05-26 at 17:25 -0700, Arjan van de Ven wrote:
> > btw this guy has shown up on kerneloops.org a lot: 
> > http://www.kerneloops.org/searchweek.php?search=__ioremap_caller
> > where it's trying to map memory as uncachable, which is.. well nasty
> > (it seems to map not just the piece it needs, but more, and then
> > turns that "more" uncachable, even if the kernel is using it for
> > "normal" things)
> 
> The driver needs that 'more' to reach the lock registers for the flash
> chip. If it's being used for other things, shouldn't the
> request_region() fail?

it does fail....
... and then the driver continues anyway!

        if (request_resource(&iomem_resource, &window->rsrc)) {
                window->rsrc.parent = NULL;
                printk(KERN_ERR MOD_NAME   
                        " %s(): Unable to register resource"
                        " 0x%.016llx-0x%.016llx - kernel bug?\n",
                        __func__,
                        (unsigned long long)window->rsrc.start,
                        (unsigned long long)window->rsrc.end); 
        }

notice the lack of "return" or "goto out" etc ;)
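
For illustration only, a user-space sketch of the failure path with the
missing "return" added. None of this is the real driver or the real kernel
API: struct demo_resource, demo_request_resource() (with a "conflict" flag
to force the failure) and the demo_ioremap_calls counter are made-up
stand-ins, and whether bailing out with -EBUSY is the right policy for
this driver is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Made-up stand-in for struct resource. */
struct demo_resource {
	unsigned long long start, end;
	struct demo_resource *parent;
};

static struct demo_resource demo_root;	/* stand-in for iomem_resource */
static int demo_ioremap_calls;		/* counts simulated ioremap() calls */

/* Stand-in for request_resource(): 0 on success, -EBUSY on conflict. */
static int demo_request_resource(struct demo_resource *res, int conflict)
{
	if (conflict)
		return -EBUSY;
	res->parent = &demo_root;
	return 0;
}

/* Init path that stops when the resource cannot be registered. */
static int demo_init_one(struct demo_resource *res, int conflict)
{
	if (demo_request_resource(res, conflict)) {
		res->parent = NULL;
		return -EBUSY;	/* the step the quoted code lacks */
	}

	demo_ioremap_calls++;	/* only reached when registration worked */
	return 0;
}
```

With conflict set, demo_init_one() returns -EBUSY and the simulated
ioremap() is never reached; without it, the mapping step runs once.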

> 
> On a vaguely related note, there's a lot to be said for _not_ using
> the standard PCI driver setup on the BIOS flash drivers, and going
> back to having them manually loaded rather than being automatically
> loaded whenever the appropriate southbridge is present.
> 
> It would be nicer if the only people who have write access to their
> BIOS flash are the ones who _really_ wanted it.

absolutely

do we need a MODULE_NO_AUTOLOAD() ?

> 


-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

* Re: Linux 2.6.26-rc4
  2008-05-27  6:00         ` Arjan van de Ven
@ 2008-05-27  6:24           ` David Woodhouse
  0 siblings, 0 replies; 89+ messages in thread
From: David Woodhouse @ 2008-05-27  6:24 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	Carl-Daniel Hailfinger

On Mon, 2008-05-26 at 23:00 -0700, Arjan van de Ven wrote:
> > The driver needs that 'more' to reach the lock registers for the flash
> > chip. If it's being used for other things, shouldn't the
> > request_region() fail?
> 
> it does fail....
> ... and then the driver continues anyway!

Heh. That's.... naughty. There are kind of valid reasons for that kind
of thing occasionally, but I suspect not this time.

> absolutely
> 
> do we need a MODULE_NO_AUTOLOAD() ?

Just dropping the MODULE_DEVICE_TABLE() should suffice. We should
probably also use the code which is currently in #if 0 which uses
pci_register_driver() instead of doing things for itself; I'm not
entirely sure why that is commented out.

-- 
dwmw2

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27  5:23 ` 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0 Alexey Dobriyan
@ 2008-05-27  9:06   ` Oleg Nesterov
  2008-05-27 15:03     ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: Oleg Nesterov @ 2008-05-27  9:06 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Linus Torvalds, linux-kernel, akpm

On 05/27, Alexey Dobriyan wrote:
>
> PREEMPT_RCU is in use, again. And die counter is 2 because of CFQ oops
> I haven't noticed earlier.
> 
> 0xffffffff802447cb is in find_pid_ns (kernel/pid.c:297).
> 292             struct hlist_node *elem;
> 293             struct upid *pnr;
> 294
> 295             hlist_for_each_entry_rcu(pnr, elem,
> 296                             &pid_hash[pid_hashfn(nr, ns)], pid_chain)
> 297                     if (pnr->nr == nr && pnr->ns == ns)
> 298                             return container_of(pnr, struct pid,
> 299                                             numbers[ns->level]);
> 300
> 301             return NULL;
> 
> 
> general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC
> CPU 0 
> Modules linked in: ext2 nf_conntrack_irc xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack ip_tables x_tables usblp ehci_hcd uhci_hcd usbcore sr_mod cdrom
> Pid: 15599, comm: profil01 Tainted: G      D   2.6.26-rc4 #1
> RIP: 0010:[<ffffffff802447cb>]  [<ffffffff802447cb>] find_pid_ns+0x6b/0xa0
> RSP: 0018:ffff810129021ea8  EFLAGS: 00010202
> RAX: ffff810130580948 RBX: 0000000000003cef RCX: ffff81017d865278
> RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff80566760 RDI: 0000000000003cef
> RBP: ffff810129021ea8 R08: 0000000000000000 R09: 00007f9a93987b70
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 0000000000000011 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007f9a9397c6f0(0000) GS:ffffffff805c6000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 000000000257f2e8 CR3: 0000000102479000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process profil01 (pid: 15599, threadinfo ffff810129020000, task ffff81004bc24500)
> Stack:  ffff810129021eb8 ffffffff8024487d ffff810129021f78 ffffffff8023f275
>  0000000000000011 0000000000000000 0000000000003cef ffff810129020000
>  ffffffff8061b140 00007f9a93989bc0 00007fff9b98a410 ffffffff8045fd63
> Call Trace:
>  [<ffffffff8024487d>] find_vpid+0x1d/0x20
>  [<ffffffff8023f275>] sys_kill+0x85/0x1b0

Is this reproducible?

In theory find_pid() is not safe without rcu_read_lock() when CONFIG_PREEMPT_RCU
is enabled. But we have a lot of "read_lock(tasklist_lock) + find_pid()" callers;
this was legal and documented. It was actually broken, but happened to work
because read_lock() implied rcu_read_lock().

Could you look at

	[PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
	http://marc.info/?t=120162615300012

?

I am not sure this is the actual reason though, the race is very unlikely.

Oleg.


* Re: Linux 2.6.26-rc4
  2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
  2008-05-26 21:24 ` Jesper Krogh
  2008-05-27  5:23 ` 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0 Alexey Dobriyan
@ 2008-05-27 10:01 ` J.A. Magallón
  2008-05-28 23:59   ` Bill Davidsen
       [not found] ` <20080527124315.131b1343@Varda>
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 89+ messages in thread
From: J.A. Magallón @ 2008-05-27 10:01 UTC (permalink / raw)
  To: Linux-Kernel, Linus-IDE

On Mon, 26 May 2008 11:41:35 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> You know the drill by now: another week, another -rc.
> 
> There's a lot of small stuff in here, most people won't even notice. The 
> most noticeable thing is for all you 32-bit x86 people who use PAE 
> (enabled by the HIGHMEM64G config option) due to having too much memory in 
> your machine - mprotect() was broken due to some of the PAT fix/cleanup 
> patches, causing the NX bit to be not set correctly.
> 
> So if you had PAE enabled _and_ a recent enough CPU to have NX, but not 
> recent enough to be 64-bit (or you were just perverse and wanted to run a 
> 32-bit kernel despite having a chip that could do 64-bit and enough memory 
> that you _really_ should have used a 64-bit kernel), you'd get various 
> random program failures with SIGSEGV. It ranged from X not starting up to 
> apparently OpenOffice not working if it did.
> 
> But most of the changes, as usual, are in drivers, at 60%, with some DRI 
> changes leading the way (fixing a number of other regressions, mainly by 
> reverting the under-cooked vblank update). Network, MMC, USB, watchdog and 
> IDE drivers also got updates.
> 
> We had CIFS and NFS updates, and some arch updates as usual. The dirstat 
> gives the overview:
> 

I have these patchsets collected from LKML that still apply on top of -rc4.
Are they not urgent, or are they no longer needed?

JBD[2] races
http://marc.info/?l=linux-ext4&m=121141319601650&w=2
http://marc.info/?l=linux-ext4&m=121141319701660&w=2

libata EH timeout handling
http://marc.info/?l=linux-ide&m=121121761530723&w=2

alignment in block DMA
http://marc.info/?l=linux-ide&m=121125981930670&w=2

-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam05 (gcc 4.2.2 20071128 (4.2.2-2mdv2008.1)) SMP PREEMPT

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-05-27  1:16     ` Carl-Daniel Hailfinger
  2008-05-27  1:23       ` Carl-Daniel Hailfinger
@ 2008-05-27 10:35       ` Jeff Garzik
  2008-05-27 10:53         ` Carl-Daniel Hailfinger
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff Garzik @ 2008-05-27 10:35 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	David Woodhouse

Carl-Daniel Hailfinger wrote:
> New code should look like
> static struct pci_device_id ck804xrom_pci_tbl[] = {
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data = DEV_CK804 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data = DEV_MCP55 },
> 	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
> 	{ 0, }
> };


Actually, more like

	{ PCI_VDEVICE(NVIDIA, 0x0367), DEV_MCP55 },

Regards,

	Jeff




^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-05-27 10:35       ` Jeff Garzik
@ 2008-05-27 10:53         ` Carl-Daniel Hailfinger
  2008-05-27 10:54           ` Jeff Garzik
  0 siblings, 1 reply; 89+ messages in thread
From: Carl-Daniel Hailfinger @ 2008-05-27 10:53 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	David Woodhouse

On 27.05.2008 12:35, Jeff Garzik wrote:
> Carl-Daniel Hailfinger wrote:
>> New code should look like
>> static struct pci_device_id ck804xrom_pci_tbl[] = {
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data =
>> DEV_CK804 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data =
>> DEV_MCP55 },
>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data =
>> DEV_MCP55 },
>>     { 0, }
>> };
>
>
> Actually, more like
>
>     { PCI_VDEVICE(NVIDIA, 0x0367), DEV_MCP55 },

AFAICS that would reintroduce the bug.

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-05-27 10:53         ` Carl-Daniel Hailfinger
@ 2008-05-27 10:54           ` Jeff Garzik
  2008-05-27 10:58             ` Carl-Daniel Hailfinger
  0 siblings, 1 reply; 89+ messages in thread
From: Jeff Garzik @ 2008-05-27 10:54 UTC (permalink / raw)
  To: Carl-Daniel Hailfinger
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	David Woodhouse

Carl-Daniel Hailfinger wrote:
> On 27.05.2008 12:35, Jeff Garzik wrote:
>> Carl-Daniel Hailfinger wrote:
>>> New code should look like
>>> static struct pci_device_id ck804xrom_pci_tbl[] = {
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data =
>>> DEV_CK804 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data =
>>> DEV_MCP55 },
>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data =
>>> DEV_MCP55 },
>>>     { 0, }
>>> };
>>
>> Actually, more like
>>
>>     { PCI_VDEVICE(NVIDIA, 0x0367), DEV_MCP55 },
> 
> AFAICS that would reintroduce the bug.

Look closely.

	Jeff





^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-05-27 10:54           ` Jeff Garzik
@ 2008-05-27 10:58             ` Carl-Daniel Hailfinger
  0 siblings, 0 replies; 89+ messages in thread
From: Carl-Daniel Hailfinger @ 2008-05-27 10:58 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Linus Torvalds, Jesper Krogh, Linux Kernel Mailing List,
	David Woodhouse

On 27.05.2008 12:54, Jeff Garzik wrote:
> Carl-Daniel Hailfinger wrote:
>> On 27.05.2008 12:35, Jeff Garzik wrote:
>>> Carl-Daniel Hailfinger wrote:
>>>> New code should look like
>>>> static struct pci_device_id ck804xrom_pci_tbl[] = {
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0051), .driver_data =
>>>> DEV_CK804 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0360), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0361), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0362), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0363), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0364), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0365), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0366), .driver_data =
>>>> DEV_MCP55 },
>>>>     { PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data =
>>>> DEV_MCP55 },
>>>>     { 0, }
>>>> };
>>>
>>> Actually, more like
>>>
>>>     { PCI_VDEVICE(NVIDIA, 0x0367), DEV_MCP55 },
>>
>> AFAICS that would reintroduce the bug.
>
> Look closely.

My apologies. I missed the V in PCI_VDEVICE.
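
For readers following the exchange, here is a minimal userspace sketch of
why the V matters. It assumes the macro definitions of this kernel era:
PCI_DEVICE() expands to designated initializers ending at .subdevice,
while PCI_VDEVICE() expands to six positional values that pad through
class_mask. The struct is a stand-in for the real struct pci_device_id,
and the DEV_* values and the demo_tbl/check_demo_tbl names are made up
for illustration. The first entry is presumably the shape of the bug
being discussed; that is an assumption, since the original code is not
quoted here.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for struct pci_device_id; field order as in the kernel. */
struct pci_device_id {
	uint32_t vendor, device;	/* vendor and device ID, or PCI_ANY_ID */
	uint32_t subvendor, subdevice;	/* subsystem IDs, or PCI_ANY_ID */
	uint32_t class, class_mask;	/* class code and mask */
	unsigned long driver_data;	/* private to the driver */
};

#define PCI_ANY_ID		(~0)
#define PCI_VENDOR_ID_NVIDIA	0x10de

/* Designated form: a positional value that follows lands in .class. */
#define PCI_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = PCI_ANY_ID, .subdevice = PCI_ANY_ID

/* Positional form, padded through class_mask: the next value is driver_data. */
#define PCI_VDEVICE(vendor, device) \
	PCI_VENDOR_ID_##vendor, (device), PCI_ANY_ID, PCI_ANY_ID, 0, 0

enum { DEV_CK804 = 1, DEV_MCP55 = 2 };	/* illustrative values only */

static const struct pci_device_id demo_tbl[] = {
	/* positional value after PCI_DEVICE() initializes .class */
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), DEV_MCP55 },
	/* explicit designator: driver_data is set as intended */
	{ PCI_DEVICE(PCI_VENDOR_ID_NVIDIA, 0x0367), .driver_data = DEV_MCP55 },
	/* seventh positional value: driver_data is set as intended */
	{ PCI_VDEVICE(NVIDIA, 0x0367), DEV_MCP55 },
	{ 0, }
};

/* Check where DEV_MCP55 ended up in each of the three spellings. */
void check_demo_tbl(void)
{
	assert(demo_tbl[0].class == DEV_MCP55 && demo_tbl[0].driver_data == 0);
	assert(demo_tbl[1].class == 0 && demo_tbl[1].driver_data == DEV_MCP55);
	assert(demo_tbl[2].class == 0 && demo_tbl[2].driver_data == DEV_MCP55);
}
```

Calling check_demo_tbl() from a test program passes silently: only the
first spelling misplaces the value.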

Regards,
Carl-Daniel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27  9:06   ` Oleg Nesterov
@ 2008-05-27 15:03     ` Linus Torvalds
  2008-05-27 15:40       ` Paul E. McKenney
  2008-05-27 16:45       ` Oleg Nesterov
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-05-27 15:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Alexey Dobriyan, Linux Kernel Mailing List, Andrew Morton,
	Paul E. McKenney



On Tue, 27 May 2008, Oleg Nesterov wrote:

> On 05/27, Alexey Dobriyan wrote:
> >
> > PREEMPT_RCU is in use, again.

I do wonder if PREEMPT_RCU is broken.

> > 0xffffffff802447cb is in find_pid_ns (kernel/pid.c:297).
> > 292             struct hlist_node *elem;
> > 293             struct upid *pnr;
> > 294
> > 295             hlist_for_each_entry_rcu(pnr, elem,
> > 296                             &pid_hash[pid_hashfn(nr, ns)], pid_chain)
> > 297                     if (pnr->nr == nr && pnr->ns == ns)

> > general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC
> > RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff80566760 RDI: 0000000000003cef

That repeated 0x6b is POISON_FREE, and the code is

	cmp    -0x10(%rdx),%edi

which is the load of "pnr->nr". So 'pnr' has been free'd.
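
The two observations above can be checked with a small userspace sketch.
It assumes the struct upid layout of this kernel era (nr, ns, then the
embedded pid_chain node) and that POISON_FREE is 0x6b; the helper names
is_poison_free() and upid_chain_offset() are invented for illustration.
RDX holds the list-node pointer, and the entry it belongs to starts
offsetof(struct upid, pid_chain) bytes earlier, which is 0x10 on an LP64
target such as x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b	/* byte written over freed slab objects */

/* Simplified copies of the structures walked by find_pid_ns(). */
struct hlist_node {
	struct hlist_node *next, **pprev;
};

struct pid_namespace;		/* opaque in this sketch */

struct upid {
	int nr;
	struct pid_namespace *ns;
	struct hlist_node pid_chain;
};

/* Return 1 if every byte of a register-sized value is POISON_FREE. */
int is_poison_free(uint64_t val)
{
	return val == UINT64_C(0x0101010101010101) * POISON_FREE;
}

/* Distance from the embedded list node back to the start of struct upid. */
size_t upid_chain_offset(void)
{
	return offsetof(struct upid, pid_chain);
}
```

With the register dump quoted above, is_poison_free() is true for RDX
and false for RSI, so the node pointer itself was read out of freed
memory.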

On Tue, 27 May 2008, Oleg Nesterov wrote:
> 
> Is this reproducible?
> 
> In theory find_pid() is not safe without rcu_read_lock() if CONFIG_PREEMPT_RCU.
> But we have a lot of "read_lock(tasklist_lock) + find_pid()", this was legal
> and documented. It was actually broken, but happened to work because read_lock()
> implied rcu_read_lock().
> 
> Could you look at
> 
> 	[PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
> 	http://marc.info/?t=120162615300012
> 
> ?
> 
> I am not sure this is the actual reason though, the race is very unlikely.

That is a *very* unlikely race, especially as that bad_fork_free_pid case 
would only happen if pid_ns_prepare_proc() fails. And if it fails, it's 
still very unlikely to hit, I think.

That said, it does smell like a bug. But I *really* would be much much 
happier if even SRCU at least waited for a grace period, so that it would 
always be safe to just disable preemption for a "rcu_read_lock()". That 
way, things that take spinlocks are safe even with SRCU.

Paul? How hard would it be to make preemptable RCU just honor that classic 
RCU behavior?

			Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 15:03     ` Linus Torvalds
@ 2008-05-27 15:40       ` Paul E. McKenney
  2008-05-27 16:11         ` Linus Torvalds
  2008-05-27 16:45       ` Oleg Nesterov
  1 sibling, 1 reply; 89+ messages in thread
From: Paul E. McKenney @ 2008-05-27 15:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Alexey Dobriyan, Linux Kernel Mailing List,
	Andrew Morton

On Tue, May 27, 2008 at 08:03:03AM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 27 May 2008, Oleg Nesterov wrote:
> 
> > On 05/27, Alexey Dobriyan wrote:
> > >
> > > PREEMPT_RCU is in use, again.
> 
> I do wonder if PREEMPT_RCU is broken.

I never stop wondering that...

> > > 0xffffffff802447cb is in find_pid_ns (kernel/pid.c:297).
> > > 292             struct hlist_node *elem;
> > > 293             struct upid *pnr;
> > > 294
> > > 295             hlist_for_each_entry_rcu(pnr, elem,
> > > 296                             &pid_hash[pid_hashfn(nr, ns)], pid_chain)
> > > 297                     if (pnr->nr == nr && pnr->ns == ns)
> 
> > > general protection fault: 0000 [2] PREEMPT SMP DEBUG_PAGEALLOC
> > > RDX: 6b6b6b6b6b6b6b6b RSI: ffffffff80566760 RDI: 0000000000003cef
> 
> That repeated 0x6b is POISON_FREE, and the code is
> 
> 	cmp    -0x10(%rdx),%edi
> 
> which is the load of "pnr->nr". So 'pnr' has been free'd.
> 
> On Tue, 27 May 2008, Oleg Nesterov wrote:
> > 
> > Is this reproducible?
> > 
> > In theory find_pid() is not safe without rcu_read_lock() if CONFIG_PREEMPT_RCU.
> > But we have a lot of "read_lock(tasklist_lock) + find_pid()", this was legal
> > and documented. It was actually broken, but happened to work because read_lock()
> > implied rcu_read_lock().
> > 
> > Could you look at
> > 
> > 	[PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
> > 	http://marc.info/?t=120162615300012
> > 
> > ?
> > 
> > I am not sure this is the actual reason though, the race is very unlikely.
> 
> That is a *very* unlikely race, especially as that bad_fork_free_pid case 
> would only happen if pid_ns_prepare_proc() fails. And if it fails, it's 
> still very unlikely to hit, I think.
> 
> That said, it does smell like a bug. But I *really* would be much much 
> happier if even SRCU at least waited for a grace period, so that it would 
> always be safe to just disable preemption for a "rcu_read_lock()". That 
> way, things that take spinlocks are safe even with SRCU.

SRCU does wait for all CPUs to schedule, and thus already waits for all
pre-existing non-preemptable code sequences to finish on all CPUs.

> Paul? How hard would it be to make preemptable RCU just honor that classic 
> RCU behavior?

Hmmm...  Might not be too hard, I will look into this.  Should just be
another stage in the rcu_try_flip state machine, along with a few of
the changes already in the queue for call_rcu_sched().

But this will only help until preemptible spinlocks arrive, right?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 15:40       ` Paul E. McKenney
@ 2008-05-27 16:11         ` Linus Torvalds
  2008-05-27 17:06           ` Paul E. McKenney
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2008-05-27 16:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Oleg Nesterov, Alexey Dobriyan, Linux Kernel Mailing List,
	Andrew Morton

On Tue, 27 May 2008, Paul E. McKenney wrote:
> 
> But this will only help until preemptible spinlocks arrive, right?

I don't think we will ever have preemptible spinlocks.

If you preempt spinlocks, you have serious issues with contention and 
priority inversion etc, and you basically need to turn them into sleeping 
mutexes. So now you also need to do interrupts as sleepable threads etc 
etc.

And it would break the existing non-preempt RCU usage anyway.

Yeah, maybe the RT people try to do that, but quite frankly, it is insane. 
Spinlocks are *different* from sleeping locks, for a damn good reason.

		Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 15:03     ` Linus Torvalds
  2008-05-27 15:40       ` Paul E. McKenney
@ 2008-05-27 16:45       ` Oleg Nesterov
  2008-05-27 17:37         ` Oleg Nesterov
  1 sibling, 1 reply; 89+ messages in thread
From: Oleg Nesterov @ 2008-05-27 16:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Dobriyan, Linux Kernel Mailing List, Andrew Morton,
	Paul E. McKenney

On 05/27, Linus Torvalds wrote:
> 
> On Tue, 27 May 2008, Oleg Nesterov wrote:
> 
> > In theory find_pid() is not safe without rcu_read_lock() if CONFIG_PREEMPT_RCU.
> > But we have a lot of "read_lock(tasklist_lock) + find_pid()", this was legal
> > and documented. It was actually broken, but happened to work because read_lock()
> > implied rcu_read_lock().
> > 
> > Could you look at
> > 
> > 	[PATCH] fix tasklist + find_pid() with CONFIG_PREEMPT_RCU
> > 	http://marc.info/?t=120162615300012
> > 
> > ?
> > 
> > I am not sure this is the actual reason though, the race is very unlikely.
> 
> That is a *very* unlikely race, especially as that bad_fork_free_pid case 
> would only happen if pid_ns_prepare_proc() fails.

To be precise, this case also happens when fork() fails due to signal_pending().

But I agree, this race is pretty much theoretical.

Oleg.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 16:11         ` Linus Torvalds
@ 2008-05-27 17:06           ` Paul E. McKenney
  2008-05-28  5:01             ` Paul E. McKenney
  0 siblings, 1 reply; 89+ messages in thread
From: Paul E. McKenney @ 2008-05-27 17:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Alexey Dobriyan, Linux Kernel Mailing List,
	Andrew Morton

On Tue, May 27, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> On Tue, 27 May 2008, Paul E. McKenney wrote:
> > 
> > But this will only help until preemptible spinlocks arrive, right?
> 
> I don't think we will ever have preemptible spinlocks.
> 
> If you preempt spinlocks, you have serious issues with contention and 
> priority inversion etc, and you basically need to turn them into sleeping 
> mutexes. So now you also need to do interrupts as sleepable threads etc 
> etc.

Indeed, all of these are required in that case.

> And it would break the existing non-preempt RCU usage anyway.

Yes, preemptable spinlocks cannot work without preemptable RCU.

> Yeah, maybe the RT people try to do that, but quite frankly, it is insane. 
> Spinlocks are *different* from sleeping locks, for a damn good reason.

Well, I guess I never claimed to be sane...

Anyway, will look at a preemptable RCU that waits for preempt-disable
sections of code.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 16:45       ` Oleg Nesterov
@ 2008-05-27 17:37         ` Oleg Nesterov
  2008-05-27 21:26           ` Alexey Dobriyan
  0 siblings, 1 reply; 89+ messages in thread
From: Oleg Nesterov @ 2008-05-27 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Dobriyan, Linux Kernel Mailing List, Andrew Morton,
	Paul E. McKenney

On 05/27, Oleg Nesterov wrote:
>
> But I agree, this race is pretty much theoretical.

Perhaps we have an unbalanced put_pid(); in that case "struct pid" will
be freed without waiting for a grace period.

Alexey, could you re-test with the patch below?

Oleg.

Add temporary debugging code to catch unbalanced put_pid() calls, or at
least those which can free a "live" pid.

--- MM/kernel/pid.c~	2008-02-20 18:29:40.000000000 +0300
+++ MM/kernel/pid.c	2008-02-20 18:35:15.000000000 +0300
@@ -208,6 +208,10 @@ void put_pid(struct pid *pid)
 	ns = pid->numbers[pid->level].ns;
 	if ((atomic_read(&pid->count) == 1) ||
 	     atomic_dec_and_test(&pid->count)) {
+		int type = PIDTYPE_MAX;
+		while (--type >= 0)
+			if (WARN_ON(!hlist_empty(&pid->tasks[type])))
+				return;
 		kmem_cache_free(ns->pid_cachep, pid);
 		put_pid_ns(ns);
 	}
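
The idea of the check can be modelled in userspace as follows. This is
only a sketch: struct pid_model, alloc_pid_model() and put_pid_model()
are hypothetical names, a plain int stands in for the atomic refcount,
and a per-type counter stands in for the pid->tasks[] hlists. As in the
patch, a pid that is still attached to tasks is deliberately not freed
when its last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX };

struct pid_model {
	int count;			/* reference count (atomic_t in the kernel) */
	int attached[PIDTYPE_MAX];	/* tasks still linked, per type */
};

/* Allocate a model pid holding one reference and no attached tasks. */
struct pid_model *alloc_pid_model(void)
{
	struct pid_model *pid = calloc(1, sizeof(*pid));

	if (pid)
		pid->count = 1;
	return pid;
}

/*
 * Drop one reference.
 * Returns 0 if references remain, 1 if the pid was freed, and -1 if the
 * last reference went away while tasks were still attached (the case
 * the WARN_ON() in the patch is meant to catch); the pid is then leaked
 * on purpose rather than freed.
 */
int put_pid_model(struct pid_model *pid)
{
	int type = PIDTYPE_MAX;

	if (--pid->count > 0)
		return 0;
	while (--type >= 0)
		if (pid->attached[type])
			return -1;
	free(pid);
	return 1;
}
```

A balanced caller sees 0 and then 1; an unbalanced put on a pid that a
task still uses sees -1 instead of a silent free.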



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 17:37         ` Oleg Nesterov
@ 2008-05-27 21:26           ` Alexey Dobriyan
  0 siblings, 0 replies; 89+ messages in thread
From: Alexey Dobriyan @ 2008-05-27 21:26 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
	Paul E. McKenney

On Tue, May 27, 2008 at 09:37:11PM +0400, Oleg Nesterov wrote:
> On 05/27, Oleg Nesterov wrote:
> > But I agree, this race is pretty much theoretical.
> 
> Perhaps we have the unbalanced put_pid(), in that case "struct pid" will
> be freed without waiting for a grace period.
> 
> Alexey, could you re-test with the patch below?

OK, and this is the first time I have seen this oops.

> --- MM/kernel/pid.c~	2008-02-20 18:29:40.000000000 +0300
> +++ MM/kernel/pid.c	2008-02-20 18:35:15.000000000 +0300
> @@ -208,6 +208,10 @@ void put_pid(struct pid *pid)
>  	ns = pid->numbers[pid->level].ns;
>  	if ((atomic_read(&pid->count) == 1) ||
>  	     atomic_dec_and_test(&pid->count)) {
> +		int type = PIDTYPE_MAX;
> +		while (--type >= 0)
> +			if (WARN_ON(!hlist_empty(&pid->tasks[type])))
> +				return;
>  		kmem_cache_free(ns->pid_cachep, pid);
>  		put_pid_ns(ns);
>  	}


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-27 17:06           ` Paul E. McKenney
@ 2008-05-28  5:01             ` Paul E. McKenney
  2008-05-28  7:26               ` Paul E. McKenney
  0 siblings, 1 reply; 89+ messages in thread
From: Paul E. McKenney @ 2008-05-28  5:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Alexey Dobriyan, Linux Kernel Mailing List,
	Andrew Morton

On Tue, May 27, 2008 at 10:06:11AM -0700, Paul E. McKenney wrote:
> On Tue, May 27, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> > On Tue, 27 May 2008, Paul E. McKenney wrote:
> > > 
> > > But this will only help until preemptible spinlocks arrive, right?
> > 
> > I don't think we will ever have preemptible spinlocks.
> > 
> > If you preempt spinlocks, you have serious issues with contention and 
> > priority inversion etc, and you basically need to turn them into sleeping 
> > mutexes. So now you also need to do interrupts as sleepable threads etc 
> > etc.
> 
> Indeed, all of these are required in that case.
> 
> > And it would break the existing non-preempt RCU usage anyway.
> 
> Yes, preemptable spinlocks cannot work without preemptable RCU.
> 
> > Yeah, maybe the RT people try to do that, but quite frankly, it is insane. 
> > Spinlocks are *different* from sleeping locks, for a damn good reason.
> 
> Well, I guess I never claimed to be sane...
> 
> Anyway, will look at a preemptable RCU that waits for preempt-disable
> sections of code.

And here is a just-now hacked up patch.  Untested, probably fails to compile.
Just kicked off a light test run, will let you know how it goes.

							Thanx, Paul

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcupreempt.h |   15 ++++++++-
 kernel/Kconfig.preempt     |   15 +++++++++
 kernel/rcupreempt.c        |   71 ++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 99 insertions(+), 2 deletions(-)

diff -urpNa -X dontdiff linux-2.6.26-rc3/include/linux/rcupreempt.h linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h
--- linux-2.6.26-rc3/include/linux/rcupreempt.h	2008-05-23 02:26:06.000000000 -0700
+++ linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h	2008-05-27 21:27:35.000000000 -0700
@@ -40,7 +40,20 @@
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 
-#define rcu_qsctr_inc(cpu)
+struct rcu_dyntick_sched {
+	int qs;
+	int rcu_qs_snap;
+};
+
+DECLARE_PER_CPU(struct rcu_dyntick_sched, rcu_dyntick_sched);
+
+static inline void rcu_qsctr_inc(int cpu)
+{
+	struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
+
+	rdssp->qs++;
+}
+
 #define rcu_bh_qsctr_inc(cpu)
 #define call_rcu_bh(head, rcu) call_rcu(head, rcu)
 
diff -urpNa -X dontdiff linux-2.6.26-rc3/kernel/Kconfig.preempt linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt
--- linux-2.6.26-rc3/kernel/Kconfig.preempt	2008-04-16 19:49:44.000000000 -0700
+++ linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt	2008-05-27 21:27:39.000000000 -0700
@@ -77,3 +77,18 @@ config RCU_TRACE
 
 	  Say Y here if you want to enable RCU tracing
 	  Say N if you are unsure.
+
+config PREEMPT_RCU_WAIT_PREEMPT_DISABLE
+	bool "Cause preemptible RCU to wait for preempt_disable code"
+	depends on PREEMPT_RCU
+	default y
+	help
+	  This option causes preemptible RCU's grace periods to wait
+	  on preempt_disable() code sections (such as spinlock critical
+	  sections in CONFIG_PREEMPT kernels) as well as for RCU
+	  read-side critical sections.  This preserves this semantic
+	  from Classic RCU.  Longer term, explicit RCU read-side critical
+	  sections need to be added.
+
+	  Say N here if you want strict RCU semantics.
+	  Say Y if you are unsure.
diff -urpNa -X dontdiff linux-2.6.26-rc3/kernel/rcupreempt.c linux-2.6.26-rc3-rcu-gcwnp/kernel/rcupreempt.c
--- linux-2.6.26-rc3/kernel/rcupreempt.c	2008-05-23 02:26:07.000000000 -0700
+++ linux-2.6.26-rc3-rcu-gcwnp/kernel/rcupreempt.c	2008-05-27 21:46:51.000000000 -0700
@@ -123,6 +123,12 @@ enum rcu_try_flip_states {
 	rcu_try_flip_waitzero_state,
 
 	/*
+	 * Wait here for all CPUs to pass through a quiescent state, but
+	 * only if CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE.
+	 */
+	rcu_try_flip_waitqs_state,
+
+	/*
 	 * Wait here for each of the other CPUs to execute a memory barrier.
 	 * This is necessary to ensure that these other CPUs really have
 	 * completed executing their RCU read-side critical sections, despite
@@ -131,6 +137,14 @@ enum rcu_try_flip_states {
 	rcu_try_flip_waitmb_state,
 };
 
+/* Plumb the grace-period state machine based on Kconfig parameters. */
+
+#ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE
+#define rcu_try_flip_waitzero_next_state rcu_try_flip_waitqs_state
+#else  /* #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
+#define rcu_try_flip_waitzero_next_state rcu_try_flip_waitmb_state
+#endif /* #else  #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
+
 struct rcu_ctrlblk {
 	spinlock_t	fliplock;	/* Protect state-machine transitions. */
 	long		completed;	/* Number of last completed batch. */
@@ -413,6 +427,8 @@ static void __rcu_advance_callbacks(stru
 	}
 }
 
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_dyntick_sched, rcu_dyntick_sched);
+
 #ifdef CONFIG_NO_HZ
 
 DEFINE_PER_CPU(long, dynticks_progress_counter) = 1;
@@ -619,6 +635,25 @@ rcu_try_flip_waitmb_needed(int cpu)
 
 #endif /* CONFIG_NO_HZ */
 
+#ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE
+
+void rcu_try_flip_take_qs_snapshot(void)
+{
+	struct rcu_dyntick_sched *rdssp;
+	int cpu;
+
+	for_each_cpu_mask(cpu, rcu_cpu_online_map) {
+		rdssp = &per_cpu(rcu_dyntick_sched, cpu);
+		rdssp->rcu_qs_snap = rdssp->qs;
+	}
+}
+
+#else /* #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
+
+#define rcu_try_flip_take_qs_snapshot()
+
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
+
 /*
  * Get here when RCU is idle.  Decide whether we need to
  * move out of idle state, and return non-zero if so.
@@ -662,6 +697,13 @@ rcu_try_flip_idle(void)
 		dyntick_save_progress_counter(cpu);
 	}
 
+	/*
+	 * And take quiescent-state snapshot if we are also to wait
+	 * on preempt_disable() code sequences.
+	 */
+
+	rcu_try_flip_take_qs_snapshot();
+
 	return 1;
 }
 
@@ -731,6 +773,26 @@ rcu_try_flip_waitzero(void)
 	return 1;
 }
 
+static int
+rcu_try_flip_waitqs(void)
+{
+	int cpu;
+	struct rcu_dyntick_sched *rdssp;
+
+	/* RCU_TRACE_ME(rcupreempt_trace_try_flip_q1); */
+	for_each_cpu_mask(cpu, rcu_cpu_online_map) {
+		rdssp = &per_cpu(rcu_dyntick_sched, cpu);
+		if (rcu_try_flip_waitack_needed(cpu) &&
+		    (rdssp->qs == rdssp->rcu_qs_snap)) {
+			/* RCU_TRACE_ME(rcupreempt_trace_try_flip_qe1); */
+			return 0;
+		}
+	}
+
+	/* RCU_TRACE_ME(rcupreempt_trace_try_flip_q2); */
+	return 1;
+}
+
 /*
  * Wait for all CPUs to do their end-of-grace-period memory barrier.
  * Return 0 once all CPUs have done so.
@@ -775,7 +837,9 @@ static void rcu_try_flip(void)
 
 	/*
 	 * Take the next transition(s) through the RCU grace-period
-	 * flip-counter state machine.
+	 * flip-counter state machine.  The _next_state transition
+	 * is defined by the "plumbing" definitions following the
+	 * rcu_try_flip_states enum.
 	 */
 
 	switch (rcu_ctrlblk.rcu_try_flip_state) {
@@ -792,6 +856,11 @@ static void rcu_try_flip(void)
 	case rcu_try_flip_waitzero_state:
 		if (rcu_try_flip_waitzero())
 			rcu_ctrlblk.rcu_try_flip_state =
+				rcu_try_flip_waitzero_next_state;
+		break;
+	case rcu_try_flip_waitqs_state:
+		if (rcu_try_flip_waitqs())
+			rcu_ctrlblk.rcu_try_flip_state =
 				rcu_try_flip_waitmb_state;
 		break;
 	case rcu_try_flip_waitmb_state:
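
For readers who want the gist without the state machine, the
snapshot-and-compare step of the patch above can be modelled in
userspace roughly like this. It is a simplified sketch, not the kernel
code: the model_* names and NR_CPUS_MODEL are invented, a plain array
replaces the per-CPU variables, and the dynticks check done through
rcu_try_flip_waitack_needed() is left out.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4		/* hypothetical CPU count for the model */

struct qs_model {
	int qs;			/* bumped when the CPU passes a quiescent state */
	int rcu_qs_snap;	/* value of qs when the grace period started */
};

static struct qs_model cpus[NR_CPUS_MODEL];

/* The CPU went through a quiescent state, so it is in no preempt-disabled region. */
void model_qsctr_inc(int cpu)
{
	cpus[cpu].qs++;
}

/* Start of a grace period: remember every CPU's counter. */
void model_take_qs_snapshot(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		cpus[cpu].rcu_qs_snap = cpus[cpu].qs;
}

/* Return 1 once every CPU has moved past its snapshot, 0 otherwise. */
int model_waitqs(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpus[cpu].qs == cpus[cpu].rcu_qs_snap)
			return 0;
	return 1;
}
```

The grace period may only advance once model_waitqs() returns 1, that
is, after every CPU has reported at least one quiescent state since the
snapshot was taken.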

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0
  2008-05-28  5:01             ` Paul E. McKenney
@ 2008-05-28  7:26               ` Paul E. McKenney
  0 siblings, 0 replies; 89+ messages in thread
From: Paul E. McKenney @ 2008-05-28  7:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Oleg Nesterov, Alexey Dobriyan, Linux Kernel Mailing List,
	Andrew Morton

On Tue, May 27, 2008 at 10:01:50PM -0700, Paul E. McKenney wrote:
> On Tue, May 27, 2008 at 10:06:11AM -0700, Paul E. McKenney wrote:
> > On Tue, May 27, 2008 at 09:11:33AM -0700, Linus Torvalds wrote:
> > > On Tue, 27 May 2008, Paul E. McKenney wrote:
> > > > 
> > > > But this will only help until preemptible spinlocks arrive, right?
> > > 
> > > I don't think we will ever have preemptible spinlocks.
> > > 
> > > If you preempt spinlocks, you have serious issues with contention and 
> > > priority inversion etc, and you basically need to turn them into sleeping 
> > > mutexes. So now you also need to do interrupts as sleepable threads etc 
> > > etc.
> > 
> > Indeed, all of these are required in that case.
> > 
> > > And it would break the existing non-preempt RCU usage anyway.
> > 
> > Yes, preemptable spinlocks cannot work without preemptable RCU.
> > 
> > > Yeah, maybe the RT people try to do that, but quite frankly, it is insane. 
> > > Spinlocks are *different* from sleeping locks, for a damn good reason.
> > 
> > Well, I guess I never claimed to be sane...
> > 
> > Anyway, will look at a preemptable RCU that waits for preempt-disable
> > sections of code.
> 
> And here is a just-now hacked up patch.  Untested, probably fails to compile.
> Just kicked off a light test run, will let you know how it goes.

And it passes light testing on a 4-CPU x86 box.

							Thanx, Paul

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  include/linux/rcupreempt.h |   15 ++++++++-
>  kernel/Kconfig.preempt     |   15 +++++++++
>  kernel/rcupreempt.c        |   71 ++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 99 insertions(+), 2 deletions(-)
> 
> diff -urpNa -X dontdiff linux-2.6.26-rc3/include/linux/rcupreempt.h linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h
> --- linux-2.6.26-rc3/include/linux/rcupreempt.h	2008-05-23 02:26:06.000000000 -0700
> +++ linux-2.6.26-rc3-rcu-gcwnp/include/linux/rcupreempt.h	2008-05-27 21:27:35.000000000 -0700
> @@ -40,7 +40,20 @@
>  #include <linux/cpumask.h>
>  #include <linux/seqlock.h>
>  
> -#define rcu_qsctr_inc(cpu)
> +struct rcu_dyntick_sched {
> +	int qs;
> +	int rcu_qs_snap;
> +};
> +
> +DECLARE_PER_CPU(struct rcu_dyntick_sched, rcu_dyntick_sched);
> +
> +static inline void rcu_qsctr_inc(int cpu)
> +{
> +	struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
> +
> +	rdssp->qs++;
> +}
> +
>  #define rcu_bh_qsctr_inc(cpu)
>  #define call_rcu_bh(head, rcu) call_rcu(head, rcu)
>  
> diff -urpNa -X dontdiff linux-2.6.26-rc3/kernel/Kconfig.preempt linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt
> --- linux-2.6.26-rc3/kernel/Kconfig.preempt	2008-04-16 19:49:44.000000000 -0700
> +++ linux-2.6.26-rc3-rcu-gcwnp/kernel/Kconfig.preempt	2008-05-27 21:27:39.000000000 -0700
> @@ -77,3 +77,18 @@ config RCU_TRACE
>  
>  	  Say Y here if you want to enable RCU tracing
>  	  Say N if you are unsure.
> +
> +config PREEMPT_RCU_WAIT_PREEMPT_DISABLE
> +	bool "Cause preemptible RCU to wait for preempt_disable code"
> +	depends on PREEMPT_RCU
> +	default y
> +	help
> +	  This option causes preemptible RCU's grace periods to wait
> +	  on preempt_disable() code sections (such as spinlock critical
> +	  sections in CONFIG_PREEMPT kernels) as well as for RCU
> +	  read-side critical sections.  This preserves this semantic
> +	  from Classic RCU.  Longer term, explicit RCU read-side critical
> +	  sections need to be added.
> +
> +	  Say N here if you want strict RCU semantics.
> +	  Say Y if you are unsure.
> diff -urpNa -X dontdiff linux-2.6.26-rc3/kernel/rcupreempt.c linux-2.6.26-rc3-rcu-gcwnp/kernel/rcupreempt.c
> --- linux-2.6.26-rc3/kernel/rcupreempt.c	2008-05-23 02:26:07.000000000 -0700
> +++ linux-2.6.26-rc3-rcu-gcwnp/kernel/rcupreempt.c	2008-05-27 21:46:51.000000000 -0700
> @@ -123,6 +123,12 @@ enum rcu_try_flip_states {
>  	rcu_try_flip_waitzero_state,
>  
>  	/*
> +	 * Wait here for all CPUs to pass through a quiescent state, but
> +	 * only if CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE.
> +	 */
> +	rcu_try_flip_waitqs_state,
> +
> +	/*
>  	 * Wait here for each of the other CPUs to execute a memory barrier.
>  	 * This is necessary to ensure that these other CPUs really have
>  	 * completed executing their RCU read-side critical sections, despite
> @@ -131,6 +137,14 @@ enum rcu_try_flip_states {
>  	rcu_try_flip_waitmb_state,
>  };
>  
> +/* Plumb the grace-period state machine based on Kconfig parameters. */
> +
> +#ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE
> +#define rcu_try_flip_waitzero_next_state rcu_try_flip_waitqs_state
> +#else  /* #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
> +#define rcu_try_flip_waitzero_next_state rcu_try_flip_waitmb_state
> +#endif /* #else  #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
> +
>  struct rcu_ctrlblk {
>  	spinlock_t	fliplock;	/* Protect state-machine transitions. */
>  	long		completed;	/* Number of last completed batch. */
> @@ -413,6 +427,8 @@ static void __rcu_advance_callbacks(stru
>  	}
>  }
>  
> +DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_dyntick_sched, rcu_dyntick_sched);
> +
>  #ifdef CONFIG_NO_HZ
>  
>  DEFINE_PER_CPU(long, dynticks_progress_counter) = 1;
> @@ -619,6 +635,25 @@ rcu_try_flip_waitmb_needed(int cpu)
>  
>  #endif /* CONFIG_NO_HZ */
>  
> +#ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE
> +
> +void rcu_try_flip_take_qs_snapshot(void)
> +{
> +	struct rcu_dyntick_sched *rdssp;
> +	int cpu;
> +
> +	for_each_cpu_mask(cpu, rcu_cpu_online_map) {
> +		rdssp = &per_cpu(rcu_dyntick_sched, cpu);
> +		rdssp->rcu_qs_snap = rdssp->qs;
> +	}
> +}
> +
> +#else /* #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
> +
> +#define rcu_try_flip_take_qs_snapshot()
> +
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_WAIT_PREEMPT_DISABLE */
> +
>  /*
>   * Get here when RCU is idle.  Decide whether we need to
>   * move out of idle state, and return non-zero if so.
> @@ -662,6 +697,13 @@ rcu_try_flip_idle(void)
>  		dyntick_save_progress_counter(cpu);
>  	}
>  
> +	/*
> +	 * And take quiescent-state snapshot if we are also to wait
> +	 * on preempt_disable() code sequences.
> +	 */
> +
> +	rcu_try_flip_take_qs_snapshot();
> +
>  	return 1;
>  }
>  
> @@ -731,6 +773,26 @@ rcu_try_flip_waitzero(void)
>  	return 1;
>  }
>  
> +static int
> +rcu_try_flip_waitqs(void)
> +{
> +	int cpu;
> +	struct rcu_dyntick_sched *rdssp;
> +
> +	/* RCU_TRACE_ME(rcupreempt_trace_try_flip_q1); */
> +	for_each_cpu_mask(cpu, rcu_cpu_online_map) {
> +		rdssp = &per_cpu(rcu_dyntick_sched, cpu);
> +		if (rcu_try_flip_waitack_needed(cpu) &&
> +		    (rdssp->qs == rdssp->rcu_qs_snap)) {
> +			/* RCU_TRACE_ME(rcupreempt_trace_try_flip_qe1); */
> +			return 0;
> +		}
> +	}
> +
> +	/* RCU_TRACE_ME(rcupreempt_trace_try_flip_q2); */
> +	return 1;
> +}
> +
>  /*
>   * Wait for all CPUs to do their end-of-grace-period memory barrier.
>   * Return 0 once all CPUs have done so.
> @@ -775,7 +837,9 @@ static void rcu_try_flip(void)
>  
>  	/*
>  	 * Take the next transition(s) through the RCU grace-period
> -	 * flip-counter state machine.
> +	 * flip-counter state machine.  The _next_state transition
> +	 * is defined by the "plumbing" definitions following the
> +	 * rcu_try_flip_states enum.
>  	 */
>  
>  	switch (rcu_ctrlblk.rcu_try_flip_state) {
> @@ -792,6 +856,11 @@ static void rcu_try_flip(void)
>  	case rcu_try_flip_waitzero_state:
>  		if (rcu_try_flip_waitzero())
>  			rcu_ctrlblk.rcu_try_flip_state =
> +				rcu_try_flip_waitzero_next_state;
> +		break;
> +	case rcu_try_flip_waitqs_state:
> +		if (rcu_try_flip_waitqs())
> +			rcu_ctrlblk.rcu_try_flip_state =
>  				rcu_try_flip_waitmb_state;
>  		break;
>  	case rcu_try_flip_waitmb_state:

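For readers following the waitqs logic in the quoted patch, here is a rough user-space sketch of the snapshot-and-compare idea. It is not the kernel code: qs_sketch, NR_CPUS_SKETCH, take_qs_snapshot(), note_quiescent_state() and all_cpus_passed_qs() are names made up for illustration, and the sketch leaves out the rcu_try_flip_waitack_needed() test that the patch applies per CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count, illustration only */

/* Hypothetical stand-in for the two per-CPU fields the patch uses. */
struct qs_sketch {
	int qs;			/* bumped whenever the CPU passes a quiescent state */
	int rcu_qs_snap;	/* copy of qs taken when the grace period starts */
};

struct qs_sketch cpus_sketch[NR_CPUS_SKETCH];

/* Record the current counter of every CPU (cf. rcu_try_flip_take_qs_snapshot()). */
void take_qs_snapshot(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		cpus_sketch[cpu].rcu_qs_snap = cpus_sketch[cpu].qs;
}

/* Called when a CPU passes through a quiescent state. */
void note_quiescent_state(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	cpus_sketch[cpu].qs++;
}

/*
 * Return true once every CPU's counter differs from its snapshot,
 * i.e. every CPU has passed a quiescent state since the snapshot
 * (cf. rcu_try_flip_waitqs() returning 1).
 */
bool all_cpus_passed_qs(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (cpus_sketch[cpu].qs == cpus_sketch[cpu].rcu_qs_snap)
			return false;
	return true;
}
```

In this model the state machine would stay in the waitqs state until all_cpus_passed_qs() returns true, and only then move on to the memory-barrier state.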

* Re: Linux 2.6.26-rc4
       [not found] ` <20080527124315.131b1343@Varda>
@ 2008-05-28 20:10   ` Linus Torvalds
  2008-05-28 20:17     ` Johannes Berg
  0 siblings, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2008-05-28 20:10 UTC (permalink / raw)
  To: Alejandro Riveira Fernández
  Cc: linux-kernel@vger.kernel.org, linux-wireless

On Tue, 27 May 2008, Alejandro Riveira Fernández wrote:
>  
>   Compiled and booting into 2.6.26-rc4
> 
>  1) With splash quiet on grub line it doesn't boot (or i didin't wait long
>     enough)
>  2) without quiet and splash I get into VT X filas becouse i use the evil
>     nvidia driver that's expected. But ubuntu failsafe mode (xserver with vesa
>     in low res) doesn't show up either *regression*

Can you give more details about where it stops? 

>  3) i get an oops with network manager

.. again, please give the full oops, without that we just know that "an 
oops happened".

>  4) if i try to run some sudo command it gets stuck (Crtl +C doesnt' help)
>     "ip route" gets stuck too. User programs i tried wrok fine (only ls and
>     htop). Network realted problem?

I assume it's related to the oops above - the oops probably happened while 
holding some mutex or other lock, which is why network-related stuff then 
blocks on that lock (which will never be released, since the oops killed 
the process that held it).

I suspect the oops is also why the bootup breaks, so it's likely all the 
same issue. Please save the dmesg into a file, reboot into a working 
setup, and send that. Along with hw information (it's likely related to 
your network device driver, since I've not seen an uproar of these kinds 
of problems from everybody else..).

>  5) printk times on dmesg go crazy

More details, please, again.

			Linus


* Re: Linux 2.6.26-rc4
  2008-05-28 20:10   ` Linus Torvalds
@ 2008-05-28 20:17     ` Johannes Berg
  2008-05-28 21:48       ` John W. Linville
  0 siblings, 1 reply; 89+ messages in thread
From: Johannes Berg @ 2008-05-28 20:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alejandro Riveira Fernández, linux-kernel@vger.kernel.org,
	linux-wireless, John Linville

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]


> >  4) if i try to run some sudo command it gets stuck (Crtl +C doesnt' help)
> >     "ip route" gets stuck too. User programs i tried wrok fine (only ls and
> >     htop). Network realted problem?
> 
> I assume it's related to the oops above - the oops probably happened while 
> holding some mutex or other lock, which is why network-related stuff then 
> blocks on that lock (which will never be released, since the oops killed 
> the process that held it).

rtnl, in fact, I suspect.

> I suspect the oops is also why the bootup breaks, so it's likely all the 
> same issue. Please save the dmesg into a file, reboot into a working 
> setup, and send that. Along with hw information (it's likely related to 
> your network device driver, since I've not seen an uproar of these kinds 
> of problems from everybody else..).

Actually, please apply the patch "rt2x00: Use atomic interface
iteration in irq context" before trying anything more, that'll fix it.
I'm not sure where Linville is hiding ;)

(http://article.gmane.org/gmane.linux.kernel.wireless.general/15268)

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: Linux 2.6.26-rc4
  2008-05-28 20:17     ` Johannes Berg
@ 2008-05-28 21:48       ` John W. Linville
  0 siblings, 0 replies; 89+ messages in thread
From: John W. Linville @ 2008-05-28 21:48 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Linus Torvalds, Alejandro Riveira Fernández,
	linux-kernel@vger.kernel.org, linux-wireless

On Wed, May 28, 2008 at 10:17:57PM +0200, Johannes Berg wrote:

> > I suspect the oops is also why the bootup breaks, so it's likely all the 
> > same issue. Please save the dmesg into a file, reboot into a working 
> > setup, and send that. Along with hw information (it's likely related to 
> > your network device driver, since I've not seen an uproar of these kinds 
> > of problems from everybody else..).
> 
> Actually, please apply the patch "rt2x00: Use atomic interface
> iteration in irq context" before trying anything more, that'll fix it.
> I'm not sure where Linville is hiding ;)
> 
> (http://article.gmane.org/gmane.linux.kernel.wireless.general/15268)

Zzzzzzz....huh, what?? :-)

I just sent a pull request to Dave M. with that patch in it.
Last weekend was a long one in the USA, so I apologize for the delay...

John
-- 
John W. Linville
linville@tuxdriver.com


* Re: Linux 2.6.26-rc4
  2008-05-27 10:01 ` Linux 2.6.26-rc4 J.A. Magallón
@ 2008-05-28 23:59   ` Bill Davidsen
  0 siblings, 0 replies; 89+ messages in thread
From: Bill Davidsen @ 2008-05-28 23:59 UTC (permalink / raw)
  To: "J.A. Magallón"; +Cc: Linux-Kernel, Linus-IDE

J.A. Magallón wrote:
> On Mon, 26 May 2008 11:41:35 -0700 (PDT), Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> You know the drill by now: another week, another -rc.
>>
>> There's a lot of small stuff in here, most people won't even notice. The 
>> most noticeable thing is for all you 32-bit x86 people who use PAE 
>> (enabled by the HIGHMEM64G config option) due to having too much memory in 
>> your machine - mprotect() was broken due to some of the PAT fix/cleanup 
>> patches, causing the NX bit to be not set correctly.
>>
>> So if you had PAE enabled _and_ a recent enough CPU to have NX, but not 
>> recent enough to be 64-bit (or you were just perverse and wanted to run a 
>> 32-bit kernel despite having a chip that could do 64-bit and enough memory 
>> that you _really_ should have used a 64-bit kernel), you'd get various 
>> random program failures with SIGSEGV. It ranged from X not starting up to 
>> apparently OpenOffice not working if it did.
>>
>> But most of the changes, as usual, are in drivers, at 60%, with some DRI 
>> changes leading the way (fixing a number of other regressions, mainly by 
>> reverting the under-cooked vblank update). Network, MMC, USB, watchdog and 
>> IDE drivers also got updates.
>>
>> We had CIFS and NFS updates, and some arch updates as usual. The dirstat 
>> gives the overview:
>>
> 
> I have this patchsets collected from LKML, that still apply ontop of -rc4.
> Are they not so urgent or are they not needed any more ?
> 
> JBD[2] races
> http://marc.info/?l=linux-ext4&m=121141319601650&w=2
> http://marc.info/?l=linux-ext4&m=121141319701660&w=2
> 
Unless I misread the thread, akpm had a raft of issues with the first
one, all of which were spelling except one initialization which he
thought was not needed. All of those could be fixed with less effort than
it takes to mention them, perhaps, but even so I think the patch could
be ready in a short time.

The problem looks worth fixing to me, but probably not a must-have at
the rc4 level. Maybe in 2.6.27 we could get rid of the potential race?


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



* Re: Linux 2.6.26-rc4
  2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
                   ` (3 preceding siblings ...)
       [not found] ` <20080527124315.131b1343@Varda>
@ 2008-06-03  9:49 ` Jesper Krogh
  2008-06-03  9:57   ` Al Viro
  2008-06-04 17:51 ` Jesper Krogh
  5 siblings, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-06-03  9:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, linux-fsdevel

Hi.

I'm getting this one. mount.nfs is being called by the
autofs daemon.

It is not directly reproducible, but happens around once a day on a single
node in a 48-node cluster.

I reported it to the NFS guys on a .22 kernel, and was directed to the
fsdevel list (CC'ed again) back then. Now the problem is reproduced on
.26-rc4.

Jesper


Jun  3 09:46:58 node40 kernel: [17191699.952564] PGD 9f6f5067 PUD baf4f067 PMD 0
Jun  3 09:46:58 node40 kernel: [17191699.952564] CPU 1
Jun  3 09:46:58 node40 kernel: [17191699.952564] Modules linked in: nfs lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd parport_pc lp parport amd_rng i2c_amd756 container psmouse serio_raw button pcspkr evdev i2c_core k8temp shpchp pci_hotplug ext3 jbd mbcache sg sd_mod ide_cd_mod cdrom amd74xx ide_core floppy mptspi mptscsih mptbase scsi_transport_spi tg3 ata_generic libata scsi_mod dock ohci_hcd usbcore thermal processor fan thermal_sys fuse
Jun  3 09:46:58 node40 kernel: [17191699.952564] Pid: 15602, comm: mount.nfs Not tainted 2.6.26-rc4 #1
Jun  3 09:46:58 node40 kernel: [17191700.727822] RIP: 0010:[graft_tree+77/288]  [graft_tree+77/288] graft_tree+0x4d/0x120
Jun  3 09:46:58 node40 kernel: [17191700.727822] RSP: 0000:ffff810068683e08  EFLAGS: 00010246
Jun  3 09:46:58 node40 kernel: [17191700.727822] RAX: ffff8100bf836a90 RBX: 00000000ffffffec RCX: 0000000000000000
Jun  3 09:46:58 node40 kernel: [17191700.983576] RDX: ffff8100fa378500 RSI: ffff810068683e68 RDI: ffff810029252b00
Jun  3 09:46:58 node40 kernel: [17191700.983576] RBP: ffff810029252b00 R08: 0000000000000000 R09: 0000000000000001
Jun  3 09:46:58 node40 kernel: [17191700.983576] R10: 0000000000000001 R11: ffffffff803011c0 R12: ffff810068683e68
Jun  3 09:46:58 node40 kernel: [17191700.983576] R13: 0000000000000000 R14: 000000000000000b R15: 000000000000000b
Jun  3 09:46:58 node40 kernel: [17191700.983576] FS: 00007f525595b6e0(0000) GS:ffff8100fbb12280(0000) knlGS:00000000cb307b90
Jun  3 09:46:58 node40 kernel: [17191701.435324] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun  3 09:46:58 node40 kernel: [17191701.435324] CR2: 00000000000000b2 CR3: 000000005ccbc000 CR4: 00000000000006e0
Jun  3 09:46:58 node40 kernel: [17191701.435324] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun  3 09:46:58 node40 kernel: [17191701.435324] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun  3 09:46:58 node40 kernel: [17191701.435324] Process mount.nfs (pid: 15602, threadinfo ffff810068682000, task ffff8100bad61700)
Jun  3 09:46:58 node40 kernel: [17191701.879311] Stack:  ffff810068683e68 ffff810068683e70 ffff810029252b00 ffffffff802b2622
Jun  3 09:46:58 node40 kernel: [17191701.879311]  0000000000000006 0000000000000000 ffff8100c1b72000 ffff8100f68aa000
Jun  3 09:46:58 node40 kernel: [17191701.879311]  ffff8100f6540000 ffffffff802b49e9 000000004844f763 0000000000000000
Jun  3 09:46:58 node40 kernel: [17191701.879311] Call Trace:
Jun  3 09:46:58 node40 kernel: [17191701.879311]  [do_add_mount+162/320] ? do_add_mount+0xa2/0x140
Jun  3 09:46:58 node40 kernel: [17191701.879311]  [do_mount+505/592] ? do_mount+0x1f9/0x250
Jun  3 09:46:58 node40 kernel: [17191701.879311] [copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
Jun  3 09:46:58 node40 kernel: [17191701.879311]  [sys_mount+155/256] ? sys_mount+0x9b/0x100
Jun  3 09:46:58 node40 kernel: [17191701.879311] [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
Jun  3 09:46:58 node40 kernel: [17191701.879311]
Jun  3 09:46:58 node40 kernel: [17191701.879311]
Jun  3 09:46:58 node40 kernel: [17191701.879311]  RSP <ffff810068683e08>
Jun  3 09:46:58 node40 kernel: [17191704.861207] ---[ end trace bc4c286fe026e348 ]---



-- 
Jesper Krogh



* Re: Linux 2.6.26-rc4
  2008-06-03  9:49 ` Jesper Krogh
@ 2008-06-03  9:57   ` Al Viro
  2008-06-03 10:04     ` Jesper Krogh
  2008-06-03 10:35     ` Al Viro
  0 siblings, 2 replies; 89+ messages in thread
From: Al Viro @ 2008-06-03  9:57 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:

> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> fsdevel list(CC'ed again) back then. Now the problem is reproduces on
> .26-rc4

Lovely...  Do you have the full oops trace, with the actual code
dump?


* Re: Linux 2.6.26-rc4
  2008-06-03  9:57   ` Al Viro
@ 2008-06-03 10:04     ` Jesper Krogh
  2008-06-03 10:13       ` Miklos Szeredi
  2008-06-03 10:35     ` Al Viro
  1 sibling, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-06-03 10:04 UTC (permalink / raw)
  To: Al Viro
  Cc: Jesper Krogh, Linus Torvalds, Linux Kernel Mailing List,
	linux-fsdevel


> On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
>
>
>> I reported it to the NFS-guys on a .22 kernel, and was directed to the
>> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
>> .26-rc4
>>
>
> Lovely...  Do you have the full oops trace, with the actual code
> dump? --

This is all I got from the logs. I'll try to get more from the serial
console.
But since it is not directly reproducible, that is a bit hard. Suggestions
are welcome.

Jesper
-- 
Jesper Krogh



* Re: Linux 2.6.26-rc4
  2008-06-03 10:04     ` Jesper Krogh
@ 2008-06-03 10:13       ` Miklos Szeredi
  2008-06-03 10:37         ` Miklos Szeredi
  2008-06-03 10:40         ` Al Viro
  0 siblings, 2 replies; 89+ messages in thread
From: Miklos Szeredi @ 2008-06-03 10:13 UTC (permalink / raw)
  To: jesper; +Cc: viro, jesper, torvalds, linux-kernel, linux-fsdevel

> > On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
> >
> >
> >> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> >> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
> >> .26-rc4
> >>
> >
> > Lovely...  Do you have the full oops trace, with the actual code
> > dump? --
> 
> This is all I got from the logs. I'll try go get more from the serial

Probably the same as this one:

http://www.kerneloops.org/raw.php?rawid=12419&msgid=

Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
which would be due to NFS not properly filling in its root dentry?

Miklos


* Re: Linux 2.6.26-rc4
  2008-06-03  9:57   ` Al Viro
  2008-06-03 10:04     ` Jesper Krogh
@ 2008-06-03 10:35     ` Al Viro
  1 sibling, 0 replies; 89+ messages in thread
From: Al Viro @ 2008-06-03 10:35 UTC (permalink / raw)
  To: Jesper Krogh; +Cc: Linus Torvalds, Linux Kernel Mailing List, linux-fsdevel

On Tue, Jun 03, 2008 at 10:57:13AM +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
> 
> > I reported it to the NFS-guys on a .22 kernel, and was directed to the
> > fsdevel list(CC'ed again) back then. Now the problem is reproduces on
> > .26-rc4
> 
> Lovely...  Do you have the full oops trace, with the actual code
> dump?

FWIW, searching for graft_tree on arjan's site:
#12419:
	negative dentry path->dentry
#17367:
	ditto
#17463:
	ditto
#13042:
	probably the same (is that earlier oops you've mentioned?)
#18932:
	WTF is that one doing there?  (graft_tree is never mentioned in it)

Nuts...  I really don't see how that could happen, unless it's NFS
revalidation playing silly buggers with dentries and ->d_revalidate()
there ends up turning dentry passed to it into negative one...

Where does the mountpoint in question live and what gets passed to
sys_mount()?


* Re: Linux 2.6.26-rc4
  2008-06-03 10:13       ` Miklos Szeredi
@ 2008-06-03 10:37         ` Miklos Szeredi
  2008-06-03 10:48           ` Al Viro
  2008-06-03 10:40         ` Al Viro
  1 sibling, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2008-06-03 10:37 UTC (permalink / raw)
  To: jesper; +Cc: raven, viro, jesper, torvalds, linux-kernel, linux-fsdevel

> > > On Tue, Jun 03, 2008 at 11:49:32AM +0200, Jesper Krogh wrote:
> > >
> > >
> > >> I reported it to the NFS-guys on a .22 kernel, and was directed to the
> > >> fsdevel list(CC'ed again) back then. Now the problem is reproduced on
> > >> .26-rc4
> > >>
> > >
> > > Lovely...  Do you have the full oops trace, with the actual code
> > > dump? --
> > 
> > This is all I got from the logs. I'll try go get more from the serial
> 
> Probably the same as this one:
> 
> http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> 
> Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> which would be due to NFS not properly filling in its root dentry?

On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
means it's an autofs thing.

CC-ing Ian.

Miklos


* Re: Linux 2.6.26-rc4
  2008-06-03 10:13       ` Miklos Szeredi
  2008-06-03 10:37         ` Miklos Szeredi
@ 2008-06-03 10:40         ` Al Viro
  2008-06-03 10:45           ` Miklos Szeredi
  1 sibling, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 10:40 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jesper, torvalds, linux-kernel, linux-fsdevel

On Tue, Jun 03, 2008 at 12:13:54PM +0200, Miklos Szeredi wrote:
> 
> http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> 
> Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> which would be due to NFS not properly filling in its root dentry?

Look more carefully.  It's path->dentry; aside of the fact that dentry
pointer is fetched at offset 8 from one of the arguments (fits path->dentry,
too low for mnt->mnt_root), do_add_mount() itself has just done S_ISLNK
on the very same thing, so it'd die before getting to graft_tree().

No, it's either path_lookup() somehow returning a negative dentry in
do_mount() (which shouldn't be possible, unless it's some crap around
return_reval in __link_path_walk()) or it's follow_down() giving us
a negative dentry.  Which almost certainly would've exploded prior to
that...
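
To make the offset argument concrete, here is a minimal sketch. It assumes struct path is laid out as two pointers, mnt first and dentry second, which is my reading of include/linux/path.h of that era; the _sketch types and both functions are stand-ins made up for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins; only pointers to them are needed here. */
struct vfsmount_sketch;
struct dentry_sketch;

/* Assumed layout: two pointers, mnt first, dentry second. */
struct path_sketch {
	struct vfsmount_sketch *mnt;
	struct dentry_sketch *dentry;
};

/* Byte offset of the dentry pointer within the path structure. */
size_t path_dentry_offset(void)
{
	return offsetof(struct path_sketch, dentry);
}

/* With the assumed layout, dentry sits exactly one pointer into the struct. */
void check_path_layout(void)
{
	assert(path_dentry_offset() == sizeof(struct vfsmount_sketch *));
}
```

On a 64-bit build pointers are 8 bytes, so under this layout the dentry pointer lives at offset 8, which fits the offset-8 fetch described above.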


* Re: Linux 2.6.26-rc4
  2008-06-03 10:40         ` Al Viro
@ 2008-06-03 10:45           ` Miklos Szeredi
  2008-06-03 10:52             ` Al Viro
  0 siblings, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2008-06-03 10:45 UTC (permalink / raw)
  To: viro; +Cc: miklos, jesper, torvalds, linux-kernel, linux-fsdevel

> > 
> > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > 
> > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > which would be due to NFS not properly filling in its root dentry?
> 
> Look more carefully.  It's path->dentry; aside of the fact that dentry
> pointer is fetched at offset 8 from one of the arguments (fits path->dentry,
> too low for mnt->mnt_root),

Yup, realized this after posting.

> do_add_mount() itself has just done S_ISLNK
> on the very same thing, so it'd die before getting to graft_tree().
> 
> No, it's either path_lookup() somehow returning a negative dentry in
> do_mount() (which shouldn't be possible, unless it's some crap around
> return_reval in __link_path_walk()) or it's follow_down() giving us
> a negative dentry.  Which almost certainly would've exploded prior to
> that...

I think it must be autofs4 doing something weird.  Like this in
autofs4_lookup_unhashed():

			/*
			 * Make the rehashed dentry negative so the VFS
			 * behaves as it should.
			 */
			if (inode) {
				dentry->d_inode = NULL;


Miklos


* Re: Linux 2.6.26-rc4
  2008-06-03 10:37         ` Miklos Szeredi
@ 2008-06-03 10:48           ` Al Viro
  2008-06-03 13:31             ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 10:48 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jesper, raven, torvalds, linux-kernel, linux-fsdevel

On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > 
> > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > which would be due to NFS not properly filling in its root dentry?
> 
> On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> means it's an autofs thing.

It is path->dentry, all right, but the question is how'd it get that way.
Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
as flags.  Then we'd passed it through do_new_mount() to do_add_mount()
without changes.  And went through
        /* Something was mounted here while we slept */
        while (d_mountpoint(nd->path.dentry) &&
               follow_down(&nd->path.mnt, &nd->path.dentry))
                ;


* Re: Linux 2.6.26-rc4
  2008-06-03 10:45           ` Miklos Szeredi
@ 2008-06-03 10:52             ` Al Viro
  2008-06-03 13:27               ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 10:52 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: jesper, torvalds, linux-kernel, linux-fsdevel

On Tue, Jun 03, 2008 at 12:45:33PM +0200, Miklos Szeredi wrote:

> I think it must be autofs4 doing something weird.  Like this in
> autofs4_lookup_unhashed():
> 
> 			/*
> 			 * Make the rehashed dentry negative so the VFS
> 			 * behaves as it should.
> 			 */
> 			if (inode) {
> 				dentry->d_inode = NULL;

Lovely.  If we ever step into that with somebody else (no matter who)
holding a reference to that dentry, we are certainly well and truly
buggered.  It's not just mount(2) - everything in the tree assumes that
holding a reference to positive dentry guarantees that it remains
positive.

Ian?


* Re: Linux 2.6.26-rc4
  2008-06-03 10:52             ` Al Viro
@ 2008-06-03 13:27               ` Ian Kent
  2008-06-03 15:01                 ` Linus Torvalds
  0 siblings, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 13:27 UTC (permalink / raw)
  To: Al Viro; +Cc: Miklos Szeredi, jesper, torvalds, linux-kernel, linux-fsdevel


On Tue, 2008-06-03 at 11:52 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 12:45:33PM +0200, Miklos Szeredi wrote:
> 
> > I think it must be autofs4 doing something weird.  Like this in
> > autofs4_lookup_unhashed():
> > 
> > 			/*
> > 			 * Make the rehashed dentry negative so the VFS
> > 			 * behaves as it should.
> > 			 */
> > 			if (inode) {
> > 				dentry->d_inode = NULL;
> 
> Lovely.  If we ever step into that with somebody else (no matter who)
> holding a reference to that dentry, we are certainly well and truly
> buggered.  It's not just mount(2) - everything in the tree assumes that
> holding a reference to positive dentry guarantees that it remains
> positive.

The intent here is that the dentry above is unhashed at this point, and
if it hasn't been reclaimed by the VFS, it is made negative and replaces
the unhashed negative dentry passed to ->lookup(). The reference count
is incremented to account for the reference held by the path walk.

What am I doing wrong here?
 
Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 10:48           ` Al Viro
@ 2008-06-03 13:31             ` Ian Kent
  2008-06-03 13:32               ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 13:31 UTC (permalink / raw)
  To: Al Viro; +Cc: Miklos Szeredi, jesper, torvalds, linux-kernel, linux-fsdevel


On Tue, 2008-06-03 at 11:48 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > > 
> > > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > > which would be due to NFS not properly filling in its root dentry?
> > 
> > On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> > means it's an autofs thing.
> 
> It is path->dentry, all right, but the question is how'd it get that way.
> Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
> as flags.  Then we'd passed it through do_new_mount() to do_add_mount()
> without changes.  And went through
>         /* Something was mounted here while we slept */
>         while (d_mountpoint(nd->path.dentry) &&
>                follow_down(&nd->path.mnt, &nd->path.dentry))
>                 ;

And this relates to the previous point in that a mount isn't done by autofs
until after the directory is created, at which time the (->mkdir())
dentry is hashed.

Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 13:31             ` Ian Kent
@ 2008-06-03 13:32               ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-03 13:32 UTC (permalink / raw)
  To: Al Viro; +Cc: Miklos Szeredi, jesper, torvalds, linux-kernel, linux-fsdevel


On Tue, 2008-06-03 at 21:31 +0800, Ian Kent wrote:
> On Tue, 2008-06-03 at 11:48 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 12:37:59PM +0200, Miklos Szeredi wrote:
> > > > http://www.kerneloops.org/raw.php?rawid=12419&msgid=
> > > > 
> > > > Looks like a negative inode in S_ISDIR(mnt->mnt_root->d_inode->i_mode),
> > > > which would be due to NFS not properly filling in its root dentry?
> > > 
> > > On second thought it's S_ISDIR(path->dentry->d_inode->i_mode), which
> > > means it's an autofs thing.
> > 
> > It is path->dentry, all right, but the question is how'd it get that way.
> > Look: we got that nd.path.dentry out of path_lookup() with LOOKUP_FOLLOW
> > as flags.  Then we'd passed it through do_new_mount() to do_add_mount()
> > without changes.  And went through
> >         /* Something was mounted here while we slept */
> >         while (d_mountpoint(nd->path.dentry) &&
> >                follow_down(&nd->path.mnt, &nd->path.dentry))
> >                 ;
> 
> And this relates to previous in that a mount isn't done by autofs until
> until after the directory is created, at which time the (->mkdir())
> dentry is hashed.

Oh .. and made positive at the same time.

> 
> Ian
> 



* Re: Linux 2.6.26-rc4
  2008-06-03 13:27               ` Ian Kent
@ 2008-06-03 15:01                 ` Linus Torvalds
  2008-06-03 16:07                   ` Ian Kent
  2008-06-05  7:31                   ` Ian Kent
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-06-03 15:01 UTC (permalink / raw)
  To: Ian Kent; +Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel

On Tue, 3 Jun 2008, Ian Kent wrote:
> > 
> > > I think it must be autofs4 doing something weird.  Like this in
> > > autofs4_lookup_unhashed():
> > > 
> > > 			/*
> > > 			 * Make the rehashed dentry negative so the VFS
> > > 			 * behaves as it should.
> > > 			 */
> > > 			if (inode) {
> > > 				dentry->d_inode = NULL;

Uhhuh. Yeah, that's not allowed.

A dentry inode can start _out_ as NULL, but it can never later become NULL 
again until it is totally unused.
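
A toy model of that rule, assuming a plain reference count stands in for the dentry's use count. The names dentry_sketch, inode_sketch and dentry_try_make_negative() are hypothetical, not kernel API; the only point is that clearing d_inode is refused while anybody still holds a reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct inode_sketch {
	unsigned int i_mode;
};

struct dentry_sketch {
	int d_count;			/* number of holders of this dentry */
	struct inode_sketch *d_inode;	/* NULL means "negative" dentry */
};

/*
 * Turn a dentry negative only when nobody holds it any more.
 * Returns 0 on success, -1 if the dentry is still in use and must
 * be left positive.
 */
int dentry_try_make_negative(struct dentry_sketch *dentry)
{
	assert(dentry != NULL);

	if (dentry->d_count > 0)
		return -1;	/* a holder may dereference d_inode unchecked */

	dentry->d_inode = NULL;
	return 0;
}
```

A holder that got the dentry while it was positive can then keep using dentry->d_inode without testing it, which is what the callers discussed in this thread rely on.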

> > Lovely.  If we ever step into that with somebody else (no matter who)
> > holding a reference to that dentry, we are certainly well and truly
> > buggered.  It's not just mount(2) - everything in the tree assumes that
> > holding a reference to positive dentry guarantees that it remains
> > positive.

Indeed. Things like regular file ops won't even test the inode, since they
know that "open()" will only open a positive dentry, so they know that
dentry->d_inode is non-NULL.

[ Although some code-paths do test - but that is just because people are
  so used to testing that pointers are non-NULL. ]

> The intent here is that, the dentry above is unhashed at this point, and
> if hasn't been reclaimed by the VFS, it is made negative and replaces
> the unhashed negative dentry passed to ->lookup(). The reference count
> is incremented to account for the reference held by the path walk.
> 
> What am I doing wrong here?

What's wrong is that you can't do that "dentry->d_inode = NULL". EVER.

Why would you want to? If the dentry is already unhashed, then no _new_ 
lookups will ever find it anyway, so it's effectively unfindable anyway. 
Except by people who *have* to find it, ie the people who already hold it 
open (because, for example, they opened it earlier, or because they 
chdir()'ed into a subdirectory).

So why don't you just return a NULL dentry instead, for an unhashed dentry?
Or do the "goto next" thing?

			Linus


* Re: Linux 2.6.26-rc4
  2008-06-03 15:01                 ` Linus Torvalds
@ 2008-06-03 16:07                   ` Ian Kent
  2008-06-03 16:35                     ` Linus Torvalds
  2008-06-05  7:31                   ` Ian Kent
  1 sibling, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 16:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel


On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> 
> On Tue, 3 Jun 2008, Ian Kent wrote:
> > > 
> > > > I think it must be autofs4 doing something weird.  Like this in
> > > > autofs4_lookup_unhashed():
> > > > 
> > > > 			/*
> > > > 			 * Make the rehashed dentry negative so the VFS
> > > > 			 * behaves as it should.
> > > > 			 */
> > > > 			if (inode) {
> > > > 				dentry->d_inode = NULL;
> 
> Uhhuh. Yeah, that's not allowed.
> 
> A dentry inode can start _out_ as NULL, but it can never later become NULL 
> again until it is totally unused.
> 
> > > Lovely.  If we ever step into that with somebody else (no matter who)
> > > holding a reference to that dentry, we are certainly well and truly
> > > buggered.  It's not just mount(2) - everything in the tree assumes that
> > > holding a reference to positive dentry guarantees that it remains
> > > positive.
> 
> Indeed. Things like regular file ops won't even test the inode, since they 
> know that "open()" will only open a positive dentry, so they know that 
> dentry->d_inode is non-NULL.
> 
> [ Although some code-paths do test - but that is just because people are 
>   so used to testing that pointers are non-NULL. ]
> 
> > The intent here is that, the dentry above is unhashed at this point, and
> > if it hasn't been reclaimed by the VFS, it is made negative and replaces
> > the unhashed negative dentry passed to ->lookup(). The reference count
> > is incremented to account for the reference held by the path walk.
> > 
> > What am I doing wrong here?
> 
> What's wrong is that you can't do that "dentry->d_inode = NULL". EVER.

OK.

> 
> Why would you want to? If the dentry is already unhashed, then no _new_ 
> lookups will ever find it anyway, so it's effectively unfindable anyway. 
> Except by people who *have* to find it, ie the people who already hold it 
> open (because, for example, they opened it earlier, or because they 
> chdir()'ed into a subdirectory).

The code we're talking about deals with a race between expiring and
mounting an autofs mount point at the same time. 

I'll have a closer look and see if I can make it work without turning
the dentry negative.

> 
> So why don't you just return a NULL dentry instead, for an unhashed dentry? 
> Or do the "goto next" thing?

That just won't work for the case this is meant to deal with.

Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 16:07                   ` Ian Kent
@ 2008-06-03 16:35                     ` Linus Torvalds
  2008-06-03 16:41                       ` Al Viro
  2008-06-03 17:13                       ` Ian Kent
  0 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-06-03 16:35 UTC (permalink / raw)
  To: Ian Kent; +Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel

On Wed, 4 Jun 2008, Ian Kent wrote:
> 
> The code we're talking about deals with a race between expiring and
> mounting an autofs mount point at the same time. 
> 
> I'll have a closer look and see if I can make it work without turning
> the dentry negative.

Hmm.

Can you walk me through this?

If the dentry is unhashed, it means that it _either_

 - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?

   I don't see why you should ever return the dentry in this case..

 - or it has not yet been hashed at all

   But then d_inode should be NULL too, no?

Anyway, as far as I can tell, you should handle the race between expiring 
and re-mounting not by unhashing at expire time (which causes these kinds 
of problems), but by setting a bit in the dentry and using the dentry 
"revalidate()" callback to wait for the revalidate.

But I don't know autofs4, so you probably have some reason. Could you 
explain it?
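
A rough user-space sketch of the flag-plus-revalidate idea follows. It is
only a model: TOY_EXPIRING and the toy_* functions are hypothetical names,
the "wait" is reduced to a return value, and locking is ignored entirely.

```c
#include <assert.h>

#define TOY_EXPIRING 0x1u	/* hypothetical flag bit, not a real DCACHE_* flag */

struct toy_dentry {
	unsigned int d_flags;
	int hashed;		/* stays 1: expire never unhashes in this scheme */
};

/* Expire starts: mark the dentry instead of unhashing it. */
static void toy_expire_begin(struct toy_dentry *d)
{
	d->d_flags |= TOY_EXPIRING;
}

/* Expire has reported completion: clear the mark. */
static void toy_expire_done(struct toy_dentry *d)
{
	d->d_flags &= ~TOY_EXPIRING;
}

/*
 * Models the revalidate callback. Returns 1 if the dentry can be used now,
 * 0 if the caller has to wait for the expire to complete first.
 */
static int toy_revalidate(const struct toy_dentry *d)
{
	assert(d->hashed);
	return (d->d_flags & TOY_EXPIRING) == 0;
}
```

The point of the model is that the dentry stays hashed and positive the
whole time, so a racing lookup finds the same dentry and can see the flag.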

		Linus


* Re: Linux 2.6.26-rc4
  2008-06-03 16:35                     ` Linus Torvalds
@ 2008-06-03 16:41                       ` Al Viro
  2008-06-03 16:50                         ` Al Viro
  2008-06-03 16:59                         ` Linus Torvalds
  2008-06-03 17:13                       ` Ian Kent
  1 sibling, 2 replies; 89+ messages in thread
From: Al Viro @ 2008-06-03 16:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Kent, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel

On Tue, Jun 03, 2008 at 09:35:47AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 4 Jun 2008, Ian Kent wrote:
> > 
> > The code we're talking about deals with a race between expiring and
> > mounting an autofs mount point at the same time. 
> > 
> > I'll have a closer look and see if I can make it work without turning
> > the dentry negative.
> 
> Hmm.
> 
> Can you walk me through this?
> 
> If the dentry is unhashed, it means that it _either_
> 
>  - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
> 
>    I don't see why you should ever return the dentry in this case..

From my reading of that code, it looks like it's been rmdir'ed.  And no, I
don't understand what the hell that code is trying to do.

Ian, could you describe the race you are talking about?


* Re: Linux 2.6.26-rc4
  2008-06-03 16:41                       ` Al Viro
@ 2008-06-03 16:50                         ` Al Viro
  2008-06-03 17:28                           ` Ian Kent
  2008-06-03 16:59                         ` Linus Torvalds
  1 sibling, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Kent, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel

On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:

> From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> don't understand what the hell that code is trying to do.
> 
> Ian, could you describe the race you are talking about?

BTW, this stuff is definitely broken regardless of mount - if something
had the directory in question opened before that rmdir and we'd hit
your lookup_unhashed while another CPU had been in the middle of
getdents(2) on that opened descriptor, we'll get

vfs_readdir() grabs i_mutex
vfs_readdir() checks that it's dead
autofs4_lookup_unhashed() calls iput()
inode is freed
vfs_readdir() releases i_mutex - in already freed struct inode.

Hell, just getdents() right *after* dentry->d_inode = NULL will oops,
plain and simple.
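
The sequence above can be walked through with a small user-space model.
This is a sketch, not kernel code: the toy_* names are hypothetical, the
mutex is a plain flag, and "freed" is recorded in a field rather than by
actually releasing memory, so the model itself stays well defined.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	int i_count;		/* references to the inode */
	int mutex_held;		/* 1 while a reader holds "i_mutex" */
	int freed;		/* set when the last reference is dropped */
};

struct toy_dentry {
	struct toy_inode *d_inode;
};

/* Step 1: the readdir side takes the mutex of the directory it reads. */
static void toy_readdir_lock(struct toy_inode *inode)
{
	inode->mutex_held = 1;
}

/* Drops one inode reference; the last one "frees" the inode. */
static void toy_iput(struct toy_inode *inode)
{
	assert(inode->i_count > 0);
	if (--inode->i_count == 0)
		inode->freed = 1;
}

/* The step under discussion: detach the inode and drop the dentry's ref. */
static void toy_make_negative(struct toy_dentry *d)
{
	struct toy_inode *inode = d->d_inode;

	d->d_inode = NULL;
	toy_iput(inode);
}

/* Returns 1 if releasing the mutex now would touch a freed inode. */
static int toy_unlock_is_use_after_free(const struct toy_inode *inode)
{
	return inode->mutex_held && inode->freed;
}
```

The model assumes the dentry holds the only inode reference, which is why
dropping it while the reader is still inside the locked region is fatal.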


* Re: Linux 2.6.26-rc4
  2008-06-03 16:41                       ` Al Viro
  2008-06-03 16:50                         ` Al Viro
@ 2008-06-03 16:59                         ` Linus Torvalds
  2008-06-03 17:30                           ` Ian Kent
  1 sibling, 1 reply; 89+ messages in thread
From: Linus Torvalds @ 2008-06-03 16:59 UTC (permalink / raw)
  To: Al Viro; +Cc: Ian Kent, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel



On Tue, 3 Jun 2008, Al Viro wrote:
> > 
> > If the dentry is unhashed, it means that it _either_
> > 
> >  - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
> > 
> >    I don't see why you should ever return the dentry in this case..
> 
> From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> don't understand what the hell that code is trying to do.

Hmm. Looking closer, I think that code is meant to handle the 
d_invalidate() that it did in autofs4_tree_busy().

However, that should never trigger for a directory entry that can be 
reached some other way, because that code has done a "dget()" on the 
dentry, and d_invalidate() does

	if (atomic_read(&dentry->d_count) > 1) {
		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
			..unlock..
			return -EBUSY;
		}
	}

so I dunno. I still think the expire code shouldn't even use 
d_invalidate() at all, and just revalidate() at lookup. 
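
For reference, the busy check quoted above can be modelled in user space
roughly like this. It is a sketch under assumptions: the toy_* names are
hypothetical, is_dir stands in for the d_inode/S_ISDIR test, and the real
function does more (locking, pruning children) than is shown here.

```c
#include <assert.h>
#include <errno.h>

struct toy_dentry {
	int d_count;	/* reference count */
	int is_dir;	/* stands in for d_inode && S_ISDIR(d_inode->i_mode) */
	int hashed;	/* 1 while the dentry is findable by lookups */
};

/* Takes an extra reference, as the expire code is said to do. */
static void toy_dget(struct toy_dentry *d)
{
	d->d_count++;
}

/* Returns 0 if the dentry is (now) unhashed, -EBUSY if it must stay. */
static int toy_d_invalidate(struct toy_dentry *d)
{
	assert(d->d_count > 0);		/* the caller holds a reference */
	if (!d->hashed)
		return 0;		/* already unhashed, nothing to do */
	if (d->d_count > 1 && d->is_dir)
		return -EBUSY;		/* the check quoted above */
	d->hashed = 0;			/* otherwise drop it from the hash */
	return 0;
}
```

With one reference held by the caller, any second holder of a directory
dentry makes the model return -EBUSY and leave the dentry hashed.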

		Linus


* Re: Linux 2.6.26-rc4
  2008-06-03 16:35                     ` Linus Torvalds
  2008-06-03 16:41                       ` Al Viro
@ 2008-06-03 17:13                       ` Ian Kent
  2008-06-03 17:30                         ` Al Viro
  1 sibling, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel

On Tue, 2008-06-03 at 09:35 -0700, Linus Torvalds wrote:
> 
> On Wed, 4 Jun 2008, Ian Kent wrote:
> > 
> > The code we're talking about deals with a race between expiring and
> > mounting an autofs mount point at the same time. 
> > 
> > I'll have a closer look and see if I can make it work without turning
> > the dentry negative.
> 
> Hmm.
> 
> Can you walk me through this?
> 
> If the dentry is unhashed, it means that it _either_
> 
>  - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?

In the current code the only way a dentry gets onto this list is by one
of the two operations ->unlink() or ->rmdir(); it is d_drop'ed and left
positive by both of these operations (a carry-over from 2.4, when
d_lookup() returned unhashed dentries; I missed that detail for quite a
while).

> 
>    I don't see why you should ever return the dentry in this case..

It's been a while now but the original patch description should help.

"What happens is that during an expire the situation can arise
that a directory is removed and another lookup is done before
the expire issues a completion status to the kernel module.
In this case, since the lookup gets a new dentry, it doesn't
know that there is an expire in progress and when it posts its
mount request, matches the existing expire request and waits
for its completion. ENOENT is then returned to user space
from lookup (as the dentry passed in is now unhashed) without
having performed the mount request.

The solution used here is to keep track of dentries in this
unhashed state and reuse them, if possible, in order to
preserve the flags. Additionally, this infrastructure will
provide the framework for the reintroduction of caching
of mount fails removed earlier in development."

I wasn't able to do an acceptable re-implementation of the negative
caching we had in 2.4 with this framework, so just ignore the last
sentence in the above description. 

> 
>  - or it has not yet been hashed at all

It has been previously hashed, yes.

> 
>    But then d_inode should be NULL too, no?

Unfortunately no, but I thought that once the dentry became unhashed
(aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
course there may be descriptors open on the dentry, which I think is the
problem that's being pointed out.

> 
> Anyway, as far as I can tell, you should handle the race between expiring 
> and re-mounting not by unhashing at expire time (which causes these kinds 
> of problems), but by setting a bit in the dentry and using the dentry 
> "revalidate()" callback to wait for the revalidate.

Yes, that would be ideal but the reason we arrived here is that, because
we must release the directory mutex before calling back to the daemon
(the heart of the problem, actually having to drop the mutex) to perform
the mount, we can get a deadlock. The cause of the problem was that for
"create"-like operations the mutex is held for ->lookup() and
->revalidate() but for "path walks" the mutex is only held for
->lookup(), so if the mutex is held when we're in ->revalidate(), we
could never be sure that we were the code path that acquired it.

Sorry, this last bit is unclear.
I'll need to work a bit harder on the explanation if you're interested
in checking further.

Anyway, I'm sure I made the dentry negative upon getting a rehash hit
for a reason so I'll need to revisit it. Perhaps I was misguided in the
first place or perhaps just plain lazy.

Ian


* Re: Linux 2.6.26-rc4
  2008-06-03 16:50                         ` Al Viro
@ 2008-06-03 17:28                           ` Ian Kent
  2008-06-03 17:41                             ` Al Viro
  0 siblings, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:28 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
> 
> > > From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> > > don't understand what the hell that code is trying to do.
> > 
> > Ian, could you describe the race you are talking about?
> 
> BTW, this stuff is definitely broken regardless of mount - if something
> had the directory in question opened before that rmdir and we'd hit
> your lookup_unhashed while another CPU had been in the middle of
> getdents(2) on that opened descriptor, we'll get
> 
> vfs_readdir() grabs i_mutex
> vfs_readdir() checks that it's dead
> autofs4_lookup_unhashed() calls iput()

Can this really happen, since autofs4_lookup_unhashed() is only called
with the i_mutex held?

> inode is freed
> vfs_readdir() releases i_mutex - in already freed struct inode.

But it could happen later. So it's academic I guess.

> 
> Hell, just getdents() right *after* dentry->d_inode = NULL will oops,
> plain and simple.

Yeah, I'll look into why I believed I needed to turn the dentry
negative. I'll need to keep the dentry positive throughout this
process.

Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 17:13                       ` Ian Kent
@ 2008-06-03 17:30                         ` Al Viro
  2008-06-03 17:38                           ` Ian Kent
  2008-06-03 17:46                           ` Jeff Moyer
  0 siblings, 2 replies; 89+ messages in thread
From: Al Viro @ 2008-06-03 17:30 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:

> "What happens is that during an expire the situation can arise
> that a directory is removed and another lookup is done before
> the expire issues a completion status to the kernel module.
> In this case, since the lookup gets a new dentry, it doesn't
> know that there is an expire in progress and when it posts its
> mount request, matches the existing expire request and waits
> for its completion. ENOENT is then returned to user space
> from lookup (as the dentry passed in is now unhashed) without
> having performed the mount request.
> 
> The solution used here is to keep track of dentries in this
> unhashed state and reuse them, if possible, in order to
> preserve the flags. Additionally, this infrastructure will
> provide the framework for the reintroduction of caching
> of mount fails removed earlier in development."
> 
> I wasn't able to do an acceptable re-implementation of the negative
> caching we had in 2.4 with this framework, so just ignore the last
> sentence in the above description. 

> Unfortunately no, but I thought that once the dentry became unhashed
> (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
> course there may be descriptors open on the dentry, which I think is the
> problem that's being pointed out.
 
... or we could have had a pending mount(2) sitting there with a reference
to mountpoint-to-be...

> Yes, that would be ideal but the reason we arrived here is that, because
> we must release the directory mutex before calling back to the daemon
> (the heart of the problem, actually having to drop the mutex) to perform
> the mount, we can get a deadlock. The cause of the problem was that for
> "create"-like operations the mutex is held for ->lookup() and
> ->revalidate() but for "path walks" the mutex is only held for
> ->lookup(), so if the mutex is held when we're in ->revalidate(), we
> could never be sure that we were the code path that acquired it.
> 
> Sorry, this last bit is unclear.
> I'll need to work a bit harder on the explanation if you're interested
> in checking further.

I am.

Oh, well...  Looks like RTFS time for me for now...  Additional parts of
braindump would be appreciated - the last time I seriously looked at
autofs4 internals was ~2005 or so ;-/


* Re: Linux 2.6.26-rc4
  2008-06-03 16:59                         ` Linus Torvalds
@ 2008-06-03 17:30                           ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel


On Tue, 2008-06-03 at 09:59 -0700, Linus Torvalds wrote:
> 
> On Tue, 3 Jun 2008, Al Viro wrote:
> > > 
> > > If the dentry is unhashed, it means that it _either_
> > > 
> > >  - has already been deleted (rmdir'ed) or d_invalidate()'d. Right?
> > > 
> > >    I don't see why you should ever return the dentry in this case..
> > 
> > From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> > don't understand what the hell that code is trying to do.
> 
> Hmm. Looking closer, I think that code is meant to handle the 
> d_invalidate() that it did in autofs4_tree_busy().
> 
> However, that should never trigger for a directory entry that can be 
> reached some other way, because that code has done a "dget()" on the 
> dentry, and d_invalidate() does
> 
> 	if (atomic_read(&dentry->d_count) > 1) {
> 		if (dentry->d_inode && S_ISDIR(dentry->d_inode->i_mode)) {
> 			..unlock..
> 			return -EBUSY;
> 		}
> 	}
> 
> so I dunno. I still think the expire code shouldn't even use 
> d_invalidate() at all, and just revalidate() at lookup. 

Yes, perhaps not.
A job for another day.

Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 17:30                         ` Al Viro
@ 2008-06-03 17:38                           ` Ian Kent
  2008-06-03 17:46                           ` Jeff Moyer
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:38 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-03 at 18:30 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:
> 
> > "What happens is that during an expire the situation can arise
> > that a directory is removed and another lookup is done before
> > the expire issues a completion status to the kernel module.
> > In this case, since the lookup gets a new dentry, it doesn't
> > know that there is an expire in progress and when it posts its
> > mount request, matches the existing expire request and waits
> > for its completion. ENOENT is then returned to user space
> > from lookup (as the dentry passed in is now unhashed) without
> > having performed the mount request.
> > 
> > The solution used here is to keep track of dentries in this
> > unhashed state and reuse them, if possible, in order to
> > preserve the flags. Additionally, this infrastructure will
> > provide the framework for the reintroduction of caching
> > of mount fails removed earlier in development."
> > 
> > I wasn't able to do an acceptable re-implementation of the negative
> > caching we had in 2.4 with this framework, so just ignore the last
> > sentence in the above description. 
> 
> > Unfortunately no, but I thought that once the dentry became unhashed
> > (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
> > course there may be descriptors open on the dentry, which I think is the
> > problem that's being pointed out.
>  
> ... or we could have had a pending mount(2) sitting there with a reference
> to mountpoint-to-be...
> 
> > Yes, that would be ideal but the reason we arrived here is that, because
> > we must release the directory mutex before calling back to the daemon
> > (the heart of the problem, actually having to drop the mutex) to perform
> > the mount, we can get a deadlock. The cause of the problem was that for
> > "create"-like operations the mutex is held for ->lookup() and
> > ->revalidate() but for "path walks" the mutex is only held for
> > ->lookup(), so if the mutex is held when we're in ->revalidate(), we
> > could never be sure that we were the code path that acquired it.
> > 
> > Sorry, this last bit is unclear.
> > I'll need to work a bit harder on the explanation if you're interested
> > in checking further.
> 
> I am.
> 
> Oh, well...  Looks like RTFS time for me for now...  Additional parts of
> braindump would be appreciated - the last time I seriously looked at
> autofs4 internals was ~2005 or so ;-/

You will find other problems.

The other bit to this is the patch to resolve the deadlock issue I spoke
about just above. This is likely where most of the current problems
started, along with the fact that we have always had to drop the mutex
to call back the daemon.

I can post the patches as well if that helps.

The description accompanying that patch was (the inconsistent locking
referred to here is what was described above):

"Due to inconsistent locking in the VFS between calls to lookup and
revalidate deadlock can occur in the automounter.

The inconsistency is that the directory inode mutex is held for both
lookup and revalidate calls when called via lookup_hash whereas it is
held only for lookup during a path walk. Consequently, if the mutex
is held during a call to revalidate autofs4 can't release the mutex
to callback the daemon as it can't know whether it owns the mutex.

This situation happens when a process tries to create a directory
within an automount and a second process also tries to create the
same directory between the lookup and the mkdir. Since the first
process has dropped the mutex for the daemon callback, the second
process takes it during revalidate leading to deadlock between the
autofs daemon and the second process when the daemon tries to create
the mount point directory.

After spending quite a bit of time trying to resolve this on more than
one occasion, using rather complex and ugly approaches, it turns out
that just delaying the hashing of the dentry until the create operation
works fine."
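
The ownership ambiguity in that description can be shown with a tiny
user-space model. It is only a sketch: the toy_* names and task ids are
hypothetical, and it assumes, as the description implies, that code inside
revalidate can see whether the mutex is held but not who holds it.

```c
#include <assert.h>

struct toy_mutex {
	int locked;
	int owner;	/* task id; kept only so scenarios can be checked */
};

/* Takes the mutex on behalf of a task. */
static void toy_lock(struct toy_mutex *m, int task)
{
	assert(!m->locked);
	m->locked = 1;
	m->owner = task;
}

/* All that revalidate can observe in this model: held or not held. */
static int toy_revalidate_sees_locked(const struct toy_mutex *m)
{
	return m->locked;
}

/* Ground truth hidden from revalidate: may this task safely unlock? */
static int toy_task_owns(const struct toy_mutex *m, int task)
{
	return m->locked && m->owner == task;
}
```

Two scenarios give the same observation but different ground truth: a
create path where the caller took the mutex itself, and a path walk that
reaches revalidate while some other task holds it.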

Ian




* Re: Linux 2.6.26-rc4
  2008-06-03 17:28                           ` Ian Kent
@ 2008-06-03 17:41                             ` Al Viro
  2008-06-03 17:41                               ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 17:41 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

On Wed, Jun 04, 2008 at 01:28:23AM +0800, Ian Kent wrote:
> 
> On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
> > 
> > > > From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> > > > don't understand what the hell that code is trying to do.
> > > 
> > > Ian, could you describe the race you are talking about?
> > 
> > BTW, this stuff is definitely broken regardless of mount - if something
> > had the directory in question opened before that rmdir and we'd hit
> > your lookup_unhashed while another CPU had been in the middle of
> > getdents(2) on that opened descriptor, we'll get
> > 
> > vfs_readdir() grabs i_mutex
> > vfs_readdir() checks that it's dead
> > autofs4_lookup_unhashed() calls iput()
> 
> Can this really happen, since autofs4_lookup_unhashed() is only called
> with the i_mutex held?

i_mutex on a different inode (obviously - it frees the inode in question,
so if caller held i_mutex on it, you would be in trouble every time you
hit that codepath).


* Re: Linux 2.6.26-rc4
  2008-06-03 17:41                             ` Al Viro
@ 2008-06-03 17:41                               ` Ian Kent
  2008-06-03 17:50                                 ` Al Viro
  0 siblings, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:41 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-03 at 18:41 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:28:23AM +0800, Ian Kent wrote:
> > 
> > On Tue, 2008-06-03 at 17:50 +0100, Al Viro wrote:
> > > On Tue, Jun 03, 2008 at 05:41:03PM +0100, Al Viro wrote:
> > > 
> > > > > From my reading of that code, it looks like it's been rmdir'ed.  And no, I
> > > > > don't understand what the hell that code is trying to do.
> > > > 
> > > > Ian, could you describe the race you are talking about?
> > > 
> > > BTW, this stuff is definitely broken regardless of mount - if something
> > > had the directory in question opened before that rmdir and we'd hit
> > > your lookup_unhashed while another CPU had been in the middle of
> > > getdents(2) on that opened descriptor, we'll get
> > > 
> > > vfs_readdir() grabs i_mutex
> > > vfs_readdir() checks that it's dead
> > > autofs4_lookup_unhashed() calls iput()
> > 
> > Can this really happen, since autofs4_lookup_unhashed() is only called
> > with the i_mutex held?
> 
> i_mutex on a different inode (obviously - it frees the inode in question,
> so if caller held i_mutex on it, you would be in trouble every time you
> hit that codepath).

OK, I'll need to look at vfs_readdir().
I thought vfs_readdir() would take the containing directory mutex as
does ->lookup().




* Re: Linux 2.6.26-rc4
  2008-06-03 17:30                         ` Al Viro
  2008-06-03 17:38                           ` Ian Kent
@ 2008-06-03 17:46                           ` Jeff Moyer
  2008-06-03 19:18                             ` Al Viro
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff Moyer @ 2008-06-03 17:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Wed, Jun 04, 2008 at 01:13:08AM +0800, Ian Kent wrote:
>
>> "What happens is that during an expire the situation can arise
>> that a directory is removed and another lookup is done before
>> the expire issues a completion status to the kernel module.
>> In this case, since the lookup gets a new dentry, it doesn't
>> know that there is an expire in progress and when it posts its
>> mount request, matches the existing expire request and waits
>> for its completion. ENOENT is then returned to user space
>> from lookup (as the dentry passed in is now unhashed) without
>> having performed the mount request.
>> 
>> The solution used here is to keep track of dentries in this
>> unhashed state and reuse them, if possible, in order to
>> preserve the flags. Additionally, this infrastructure will
>> provide the framework for the reintroduction of caching
>> of mount fails removed earlier in development."
>> 
>> I wasn't able to do an acceptable re-implementation of the negative
>> caching we had in 2.4 with this framework, so just ignore the last
>> sentence in the above description. 
>
>> Unfortunately no, but I thought that once the dentry became unhashed
>> (aka ->rmdir() or ->unlink()) it was invisible to the dcache. But, of
>> course there may be descriptors open on the dentry, which I think is the
>> problem that's being pointed out.
>  
> ... or we could have had a pending mount(2) sitting there with a reference
> to mountpoint-to-be...
>
>> Yes, that would be ideal but the reason we arrived here is that, because
>> we must release the directory mutex before calling back to the daemon
>> (the heart of the problem, actually having to drop the mutex) to perform
>> the mount, we can get a deadlock. The cause of the problem was that for
>> "create"-like operations the mutex is held for ->lookup() and
>> ->revalidate() but for "path walks" the mutex is only held for
>> ->lookup(), so if the mutex is held when we're in ->revalidate(), we
>> could never be sure that we were the code path that acquired it.
>> 
>> Sorry, this last bit is unclear.
>> I'll need to work a bit harder on the explanation if you're interested
>> in checking further.
>
> I am.

commit 1864f7bd58351732593def024e73eca1f75bc352
Author: Ian Kent <raven@themaw.net>
Date:   Wed Aug 22 14:01:54 2007 -0700

    autofs4: deadlock during create
    
    Due to inconsistent locking in the VFS between calls to lookup and
    revalidate deadlock can occur in the automounter.
    
    The inconsistency is that the directory inode mutex is held for both lookup
    and revalidate calls when called via lookup_hash whereas it is held only
    for lookup during a path walk.  Consequently, if the mutex is held during a
    call to revalidate autofs4 can't release the mutex to callback the daemon
    as it can't know whether it owns the mutex.
    
    This situation happens when a process tries to create a directory within an
    automount and a second process also tries to create the same directory
    between the lookup and the mkdir.  Since the first process has dropped the
    mutex for the daemon callback, the second process takes it during
    revalidate leading to deadlock between the autofs daemon and the second
    process when the daemon tries to create the mount point directory.
    
    After spending quite a bit of time trying to resolve this on more than one
    occasion, using rather complex and ugly approaches, it turns out that just
    delaying the hashing of the dentry until the create operation works fine.

> Oh, well...  Looks like RTFS time for me for now...  Additional parts of
> braindump would be appreciated - the last time I seriously looked at
> autofs4 internals was ~2005 or so ;-/

Well, let me know what level of dump you'd like.  I can give the 50,000
foot view, or I can give you the history of things that happened to get
us to where we are today, or anything in between.  The more specific
your request, the quicker I can respond.  A full brain-dump would take
some time!

Cheers,

Jeff


* Re: Linux 2.6.26-rc4
  2008-06-03 17:50                                 ` Al Viro
@ 2008-06-03 17:49                                   ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-03 17:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-03 at 18:50 +0100, Al Viro wrote:
> On Wed, Jun 04, 2008 at 01:41:32AM +0800, Ian Kent wrote:
> 
> > OK, I'll need to look at vfs_readdir().
> > I thought vfs_readdir() would take the containing directory mutex as
> > does ->lookup().
> 
> vfs_readdir() takes i_mutex on directory it reads.  I.e. on the victim
> in this case.  lookup has i_mutex on directory it does lookup in, i.e.
> root in this case...

Hahaha, yes, I'll need to go back and re-read that bit of code, sorry.



* Re: Linux 2.6.26-rc4
  2008-06-03 17:41                               ` Ian Kent
@ 2008-06-03 17:50                                 ` Al Viro
  2008-06-03 17:49                                   ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 17:50 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

On Wed, Jun 04, 2008 at 01:41:32AM +0800, Ian Kent wrote:

> OK, I'll need to look at vfs_readdir().
> I thought vfs_readdir() would take the containing directory mutex as
> does ->lookup().

vfs_readdir() takes i_mutex on directory it reads.  I.e. on the victim
in this case.  lookup has i_mutex on directory it does lookup in, i.e.
root in this case...
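
Put as a user-space sketch (hypothetical toy_* names, with mutexes reduced
to "which directory's i_mutex is this"), the two paths simply do not lock
the same object:

```c
#include <assert.h>
#include <stddef.h>

struct toy_dir {
	const char *name;
	struct toy_dir *parent;	/* NULL for the root */
};

/* ->lookup() of "child" runs with i_mutex of the directory searched. */
static const struct toy_dir *toy_mutex_held_by_lookup(const struct toy_dir *child)
{
	assert(child->parent != NULL);
	return child->parent;
}

/* vfs_readdir() takes i_mutex of the directory being read. */
static const struct toy_dir *toy_mutex_taken_by_readdir(const struct toy_dir *dir)
{
	return dir;
}

/* Returns 1 if both paths serialize on the same mutex, 0 otherwise. */
static int toy_paths_exclude(const struct toy_dir *victim)
{
	return toy_mutex_held_by_lookup(victim) ==
	       toy_mutex_taken_by_readdir(victim);
}
```

For a victim directory sitting directly under the root, the lookup side
holds the root's mutex and the readdir side holds the victim's, so neither
keeps the other out.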


* Re: Linux 2.6.26-rc4
  2008-06-03 17:46                           ` Jeff Moyer
@ 2008-06-03 19:18                             ` Al Viro
  2008-06-03 19:53                               ` Jeff Moyer
  2008-06-04  1:36                               ` Ian Kent
  0 siblings, 2 replies; 89+ messages in thread
From: Al Viro @ 2008-06-03 19:18 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ian Kent, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:

> Well, let me know what level of dump you'd like.  I can give the 50,000
> foot view, or I can give you the history of things that happened to get
> us to where we are today, or anything inbetween.  The more specific
> your request, the quicker I can respond.  A full brain-dump would take
> some time!

a) what the hell is going on in autofs4_free_ino()?  It checks for
ino->dentry, when the only caller has just set it to NULL.

b) while we are at it, what's ino->inode doing there?  AFAICS, it's
a write-only field...

c) what are possible states of autofs4 dentry and what's the supposed
life cycle of these beasts?

d)
/* For dentries of directories in the root dir */
static struct dentry_operations autofs4_root_dentry_operations = {
        .d_revalidate   = autofs4_revalidate,
        .d_release      = autofs4_dentry_release,
};

/* For other dentries */
static struct dentry_operations autofs4_dentry_operations = {
        .d_revalidate   = autofs4_revalidate,
        .d_release      = autofs4_dentry_release,
};

Just what is the difference?

e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
What's going to keep these suckers consistent with each other in any useful
way?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-03 19:18                             ` Al Viro
@ 2008-06-03 19:53                               ` Jeff Moyer
  2008-06-03 23:00                                 ` Al Viro
  2008-06-04  1:36                               ` Ian Kent
  1 sibling, 1 reply; 89+ messages in thread
From: Jeff Moyer @ 2008-06-03 19:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Ian Kent, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

Al Viro <viro@ZenIV.linux.org.uk> writes:

> On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:
>
>> Well, let me know what level of dump you'd like.  I can give the 50,000
>> foot view, or I can give you the history of things that happened to get
>> us to where we are today, or anything inbetween.  The more specific
>> your request, the quicker I can respond.  A full brain-dump would take
>> some time!
>
> a) what the hell is going on in autofs4_free_ino()?  It checks for
> ino->dentry, when the only caller has just set it to NULL.

Probably historic.  Ian?

> b) while we are at it, what's ino->inode doing there?  AFAICS, it's
> a write-only field...

Another good point.

> c) what are possible states of autofs4 dentry and what's the supposed
> life cycle of these beasts?

The life cycle of a dentry for an indirect, non-browsable mount goes
something like this:

autofs4_lookup is called on behalf of a process trying to walk into an
automounted directory.  DCACHE_AUTOFS_PENDING is set in that dentry's
d_flags, but the dentry is not hashed.  A waitqueue entry is created,
indexed off of the name of the dentry.  A callout is made to the
automount daemon (via autofs4_wait).

The daemon looks up the directory name in its configuration.  If it
finds a valid map entry, it will then create the directory using
sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
1) will return NULL, and then the mkdir call will be made.  The
autofs4_mkdir function then instantiates the dentry which, by the way,
is different from the original dentry passed to autofs4_lookup.  (This
dentry also does not get the PENDING flag set, which is a bug addressed
by a patch set that Ian and I have been working on;  specifically, the
idea is to reuse the dentry from the original lookup, but I digress).

The daemon then mounts the share on the given directory and issues an
ioctl to wake up the waiter.  When awakened, the waiter clears the
DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
dcache and returns that dentry if found.

Later, the dentry gets expired via another ioctl.  That path sets
the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
It then calls out to the daemon to perform the unmount and rmdir.  The
rmdir unhashes the dentry (and places it on the rehash list).

The dentry is removed from the rehash list if there was a racing expire
and mount or if the dentry is released.

This description is valid for the tree as it stands today.  Ian and I
have been working on fixing some other race conditions which will change
the dentry life cycle (for the better, I hope).
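
The life cycle described above can be sketched as a small state
machine.  This is a userspace illustration only: the stage names are
hypothetical (they are not identifiers from fs/autofs4), failure paths
such as the daemon finding no map entry are left out, and treating a
racing mount as a return to the pending stage is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stage names for an indirect, non-browsable mount. */
enum stage {
	ST_LOOKUP_PENDING, /* lookup: PENDING set, dentry not hashed */
	ST_DAEMON_MKDIR,   /* daemon's mkdir instantiated a dentry */
	ST_MOUNTED,        /* share mounted, waiter woken by ioctl */
	ST_EXPIRING,       /* expire ioctl set AUTOFS_INF_EXPIRING */
	ST_UNHASHED,       /* rmdir unhashed it; on the rehash list */
	ST_RELEASED        /* dentry released, off the rehash list */
};

/* Return true if the description allows moving from 'from' to 'to'. */
bool stage_transition_ok(enum stage from, enum stage to)
{
	switch (from) {
	case ST_LOOKUP_PENDING:
		return to == ST_DAEMON_MKDIR;
	case ST_DAEMON_MKDIR:
		return to == ST_MOUNTED;
	case ST_MOUNTED:
		return to == ST_EXPIRING;
	case ST_EXPIRING:
		return to == ST_UNHASHED;
	case ST_UNHASHED:
		/* assumed: a racing mount reuses it, or it is released */
		return to == ST_LOOKUP_PENDING || to == ST_RELEASED;
	case ST_RELEASED:
		return false;
	}
	return false;
}
```

For example, stage_transition_ok(ST_MOUNTED, ST_UNHASHED) is false in
this model because the expire step has to come first.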

> d)
> /* For dentries of directories in the root dir */
> static struct dentry_operations autofs4_root_dentry_operations = {
>         .d_revalidate   = autofs4_revalidate,
>         .d_release      = autofs4_dentry_release,
> };
>
> /* For other dentries */
> static struct dentry_operations autofs4_dentry_operations = {
>         .d_revalidate   = autofs4_revalidate,
>         .d_release      = autofs4_dentry_release,
> };
>
> Just what is the difference?

Nothing.  There used to be, and I'm guessing Ian kept this around for,
umm, clarity?

> e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
> What's going to keep these suckers consistent with each other in any useful
> way?

I'm afraid I'm not familiar enough with that part of the code to give
you a good answer.  Ian?

Cheers,

Jeff

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-03 19:53                               ` Jeff Moyer
@ 2008-06-03 23:00                                 ` Al Viro
  2008-06-04  2:42                                   ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Al Viro @ 2008-06-03 23:00 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ian Kent, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel

On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:

> autofs4_lookup is called on behalf a process trying to walk into an
> automounted directory.  That dentry's d_flags is set to
> DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
> indexed off of the name of the dentry.  A callout is made to the
> automount daemon (via autofs4_wait).
> 
> The daemon looks up the directory name in its configuration.  If it
> finds a valid map entry, it will then create the directory using
> sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
> 1) will return NULL, and then the mkdir call will be made.  The
> autofs4_mkdir function then instantiates the dentry which, by the way,
> is different from the original dentry passed to autofs4_lookup.  (This
> dentry also does not get the PENDING flag set, which is a bug addressed
> by a patch set that Ian and I have been working on;  specifically, the
> idea is to reuse the dentry from the original lookup, but I digress).
> 
> The daemon then mounts the share on the given directory and issues an
> ioctl to wakeup the waiter.  When awakened, the waiter clears the
> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> dcache and returns that dentry if found.
> Later, the dentry gets expired via another ioctl.  That path sets
> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> It then calls out to the daemon to perform the unmount and rmdir.  The
> rmdir unhashes the dentry (and places it on the rehash list).
> 
> The dentry is removed from the rehash list if there was a racing expire
> and mount or if the dentry is released.
> 
> This description is valid for the tree as it stands today.  Ian and I
> have been working on fixing some other race conditions which will change
> the dentry life cycle (for the better, I hope).

So what happens if new lookup hits between umount and rmdir?

Another thing: would be nice to write down the expected state of dentry
(positive/negative, flags, has/hasn't ->d_fsdata, flags on ->d_fsdata)
for all stages.  I'll go through the code and do that once I get some sleep,
but if you'll have time to do it before that...

FWIW, I wonder if it would be better to leave the directory alone and just
have the daemon mount the sucker elsewhere and let the kernel side move
the damn thing in place itself, along with making dentry positive and
waking the sleepers up.  Then we might get away with not unlocking anything
at all...  That obviously doesn't help the current systems with existing
daemon, but it might be interesting for the next autofs version...
Note that we don't even have to mount it anywhere - mount2() is close to
the top of the pile for the next couple of cycles and it'd separate
"activate fs" from "attach fully set up fs to given place", with the
former resulting in a descriptor and the latter being
	mount2(Attach, dir_fd, fs_fd);
Kernel side of autofs might receive the file descriptor in question and
do the rest itself...

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-03 19:18                             ` Al Viro
  2008-06-03 19:53                               ` Jeff Moyer
@ 2008-06-04  1:36                               ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-04  1:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Moyer, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-03 at 20:18 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 01:46:41PM -0400, Jeff Moyer wrote:
> 
> > Well, let me know what level of dump you'd like.  I can give the 50,000
> > foot view, or I can give you the history of things that happened to get
> > us to where we are today, or anything inbetween.  The more specific
> > your request, the quicker I can respond.  A full brain-dump would take
> > some time!
> 
> a) what the hell is going on in autofs4_free_ino()?  It checks for
> ino->dentry, when the only caller has just set it to NULL.

I know.
I need to clean that up.

> 
> b) while we are at it, what's ino->inode doing there?  AFAICS, it's
> a write-only field...

I know.
And I think it has never been used anywhere either but I haven't removed
it from the info structure.

> 
> c) what are possible states of autofs4 dentry and what's the supposed
> life cycle of these beasts?
> 
> d)
> /* For dentries of directories in the root dir */
> static struct dentry_operations autofs4_root_dentry_operations = {
>         .d_revalidate   = autofs4_revalidate,
>         .d_release      = autofs4_dentry_release,
> };
> 
> /* For other dentries */
> static struct dentry_operations autofs4_dentry_operations = {
>         .d_revalidate   = autofs4_revalidate,
>         .d_release      = autofs4_dentry_release,
> };
> 
> Just what is the difference?

There isn't any difference.
There's no real reason to keep them different except that there are two
distinct sets of operations. I don't see any harm in retaining this.
 
> 
> e) in autofs4_tree_busy() we do atomic_read() on ino->count and dentry->d_count
> What's going to keep these suckers consistent with each other in any useful
> way?

The only time ino->count is changed is in ->mkdir(), ->rmdir(),
->symlink() and ->unlink(). So it is supposed to represent the minimal
reference count. The code in autofs4_free_ino() should go but that may
be a bug, I need to check.
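
A rough userspace sketch of the kind of comparison in question, using
C11 atomics.  The toy_node structure and toy_node_busy() are
hypothetical and are not the autofs4 code; the sketch only shows why
two independent atomic reads are not a consistent snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins for the two counters; not the kernel structures. */
struct toy_node {
	atomic_int d_count;   /* plays the role of dentry->d_count */
	atomic_int ino_count; /* plays the role of ino->count */
};

/*
 * Treat the node as busy when it holds more references than the
 * minimal count.  The two loads are separate operations, so nothing
 * here stops either counter from changing between them.
 */
bool toy_node_busy(struct toy_node *n)
{
	int minimal = atomic_load(&n->ino_count);
	int current = atomic_load(&n->d_count);

	return current > minimal;
}
```

Each load is atomic on its own, but the pair is not; some outer lock or
other exclusion would be needed for the answer to stay meaningful.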



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-03 23:00                                 ` Al Viro
@ 2008-06-04  2:42                                   ` Ian Kent
  2008-06-04  5:34                                     ` Miklos Szeredi
  2008-06-10  4:57                                     ` Ian Kent
  0 siblings, 2 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-04  2:42 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Moyer, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> 
> > autofs4_lookup is called on behalf a process trying to walk into an
> > automounted directory.  That dentry's d_flags is set to
> > DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
> > indexed off of the name of the dentry.  A callout is made to the
> > automount daemon (via autofs4_wait).
> > 
> > The daemon looks up the directory name in its configuration.  If it
> > finds a valid map entry, it will then create the directory using
> > sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > 1) will return NULL, and then the mkdir call will be made.  The
> > autofs4_mkdir function then instantiates the dentry which, by the way,
> > is different from the original dentry passed to autofs4_lookup.  (This
> > dentry also does not get the PENDING flag set, which is a bug addressed
> > by a patch set that Ian and I have been working on;  specifically, the
> > idea is to reuse the dentry from the original lookup, but I digress).
> > 
> > The daemon then mounts the share on the given directory and issues an
> > ioctl to wakeup the waiter.  When awakened, the waiter clears the
> > DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > dcache and returns that dentry if found.
> > Later, the dentry gets expired via another ioctl.  That path sets
> > the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > It then calls out to the daemon to perform the unmount and rmdir.  The
> > rmdir unhashes the dentry (and places it on the rehash list).
> > 
> > The dentry is removed from the rehash list if there was a racing expire
> > and mount or if the dentry is released.
> > 
> > This description is valid for the tree as it stands today.  Ian and I
> > have been working on fixing some other race conditions which will change
> > the dentry life cycle (for the better, I hope).
> 
> So what happens if new lookup hits between umount and rmdir?

It will wait for the expire to complete and then wait for a mount
request to the daemon.

This is an example of how I've broken the lookup by delaying the hashing
of the dentry without providing a way for ->lookup() to pick up the same
unhashed dentry prior to the directory dentry being hashed. Currently only
the first lookup after the d_drop will get this dentry.

Keeping track of the dentry between the first lookup and its subsequent
hashing (or release) is what I want to do. But, as you point out, I also
need to keep the dentry positive.

> 
> Another thing: would be nice to write down the expected state of dentry
> (positive/negative, flags, has/hasn't ->d_fsdata, flags on ->d_fsdata)
> for all stages.  I'll go through the code and do that once I get some sleep,
> but if you'll have time to do it before that...

A dentry gets an info struct when it gets an inode and it should retain
it until the dentry is released.

When a dentry is selected for umount the AUTOFS_INF_EXPIRING
(ino->flags) is set and cleared upon return (synchronous expire).

The DCACHE_AUTOFS_PENDING (dentry->d_flags) flag should be set when a
mount request is to be issued to the daemon and cleared when the request
completes. I've introduced some inconsistency in setting and clearing
this flag which has compounded the delayed hashing issue.
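
The intended pairing of the two flags can be sketched as follows.  This
is a userspace toy, not the autofs4 code: the TOY_* values, the flag_*
types and the four helpers are all hypothetical, and only the
set-on-entry/clear-on-completion pattern is taken from the text above.

```c
#include <assert.h>

#define TOY_PENDING  0x1u /* stands in for DCACHE_AUTOFS_PENDING */
#define TOY_EXPIRING 0x1u /* stands in for AUTOFS_INF_EXPIRING */

struct flag_info {
	unsigned int flags;      /* plays the role of ino->flags */
};

struct flag_dentry {
	unsigned int d_flags;    /* plays the role of dentry->d_flags */
	struct flag_info *info;  /* plays the role of d_fsdata */
};

/* Set when a mount request is about to be issued to the daemon. */
void mount_request_begin(struct flag_dentry *d)
{
	d->d_flags |= TOY_PENDING;
}

/* Cleared when the mount request completes. */
void mount_request_end(struct flag_dentry *d)
{
	d->d_flags &= ~TOY_PENDING;
}

/* Set when the dentry is selected for umount. */
void expire_begin(struct flag_dentry *d)
{
	d->info->flags |= TOY_EXPIRING;
}

/* Cleared when the synchronous expire returns. */
void expire_end(struct flag_dentry *d)
{
	d->info->flags &= ~TOY_EXPIRING;
}
```

The two flags live in different words (d_flags versus the info struct),
which is why they can share a bit value in this sketch.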

> 
> FWIW, I wonder if it would be better to leave the directory alone and just
> have the daemon mount the sucker elsewhere and let the kernel side move
> the damn thing in place itself, along with making dentry positive and
> waking the sleepers up.  Then we might get away with not unlocking anything
> at all...  That obviously doesn't help the current systems with existing
> daemon, but it might be interesting for the next autofs version...
> Note that we don't even have to mount it anywhere - mount2() is close to
> the top of the pile for the next couple of cycles and it'd separate
> "activate fs" from "attach fully set up fs to given place", with the
> former resulting in a descriptor and the latter being
> 	mount2(Attach, dir_fd, fs_fd);
> Kernel side of autofs might receive the file descriptor in question and
> do the rest itself...

Perhaps, if we didn't use /etc/mtab anywhere.
It would make a difference if we could "mount" /proc/mounts onto a file
such as /etc/mtab and everyone always did that.



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-04  2:42                                   ` Ian Kent
@ 2008-06-04  5:34                                     ` Miklos Szeredi
  2008-06-04  5:41                                       ` Ian Kent
  2008-06-10  4:57                                     ` Ian Kent
  1 sibling, 1 reply; 89+ messages in thread
From: Miklos Szeredi @ 2008-06-04  5:34 UTC (permalink / raw)
  To: raven; +Cc: viro, jmoyer, torvalds, miklos, jesper, linux-kernel,
	linux-fsdevel

> Perhaps, if we didn't use /etc/mtab anywhere.
> It would make a difference if we could "mount" /proc/mounts onto a file
> such as /etc/mtab and everyone always did that.

That's actually remarkably close to being possible.  Loopback mounts
have already been fixed not to rely on /etc/mtab.  The only major
piece missing is "user" mounts, and there's already a patchset for
that waiting for review by the VFS maintainers.  After that, it's just
a "simple" issue of fixing up all the userspace pieces.  Could be
finished in the next decade, possibly ;)

Miklos

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-04  5:34                                     ` Miklos Szeredi
@ 2008-06-04  5:41                                       ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-04  5:41 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: viro, jmoyer, torvalds, jesper, linux-kernel, linux-fsdevel


On Wed, 2008-06-04 at 07:34 +0200, Miklos Szeredi wrote:
> > Perhaps, if we didn't use /etc/mtab anywhere.
> > It would make a difference if we could "mount" /proc/mounts onto a file
> > such as /etc/mtab and everyone always did that.
> 
> That's actually remarkably close to being possible.  Loopback mounts
> have already been fixed not to rely on /etc/mtab.  The only major
> piece missing is "user" mounts, and there's already a patchset for
> that waiting for review by the VFS maintainers.  After that, it's just
> a "simple" issue of fixing up all the userspace pieces.  Could be
> finished in the next decade, possibly ;)

;)



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
                   ` (4 preceding siblings ...)
  2008-06-03  9:49 ` Jesper Krogh
@ 2008-06-04 17:51 ` Jesper Krogh
  5 siblings, 0 replies; 89+ messages in thread
From: Jesper Krogh @ 2008-06-04 17:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, autofs

Hi.

Unrelated to the 2 other reports against the 2.6.26-rc4 kernel, I got
this message. The system still operates fine, so it is not critical in
any way, but I thought I'd report it anyway. CC'ing the autofs mailing
list since it seems autofs related.

[43071687.620727] INFO: task automount:23131 blocked for more than 120 seconds.
[43071687.740715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[43071687.862459] automount     D ffff810103a2b380     0 23131   7655
[43071687.862466]  ffff8101fb5d9b28 0000000000000082 0000000000000000 0000000000000000
[43071687.862472]  ffffffff8069e380 ffffffff8069e380 ffffffff8069a4c0 ffffffff8069e380
[43071687.862475]  ffff8101e8d1af00 ffff8101fb5d9ae8 ffff8101fb5d9ad8 ffff8101fb5d9af4
[43071687.862480] Call Trace:
[43071687.862556]  [<ffffffffa03c47c0>] :sunrpc:rpc_wait_bit_killable+0x0/0x40
[43071687.862570]  [<ffffffffa03c47dc>] :sunrpc:rpc_wait_bit_killable+0x1c/0x40
[43071687.862581]  [<ffffffff804764bf>] __wait_on_bit+0x4f/0x80
[43071687.862594]  [<ffffffffa03c47c0>] :sunrpc:rpc_wait_bit_killable+0x0/0x40
[43071687.862600]  [<ffffffff8047656a>] out_of_line_wait_on_bit+0x7a/0xa0
[43071687.862606]  [<ffffffff8024c060>] wake_bit_function+0x0/0x30
[43071687.862618]  [<ffffffffa03c0e13>] :sunrpc:xprt_connect+0x83/0x170
[43071687.862633]  [<ffffffffa03c4e69>] :sunrpc:__rpc_execute+0xc9/0x270
[43071687.862645]  [<ffffffffa03bded4>] :sunrpc:rpc_run_task+0x34/0x70
[43071687.862657]  [<ffffffffa03bdfac>] :sunrpc:rpc_call_sync+0x3c/0x60
[43071687.862667]  [<ffffffff802b9379>] __follow_mount+0x29/0xa0
[43071687.862696]  [<ffffffffa042220a>] :nfs:nfs3_rpc_wrapper+0x3a/0x60
[43071687.862711]  [<ffffffffa04225e1>] :nfs:nfs3_proc_getattr+0x51/0x90
[43071687.862724]  [<ffffffffa0415273>] :nfs:__nfs_revalidate_inode+0x183/0x2b0
[43071687.862733]  [<ffffffff802ca0b1>] mntput_no_expire+0x21/0x140
[43071687.862738]  [<ffffffff802ca0b1>] mntput_no_expire+0x21/0x140
[43071687.862744]  [<ffffffff802bc2e1>] path_walk+0xb1/0xd0
[43071687.862759]  [<ffffffffa0415c9a>] :nfs:nfs_getattr+0x7a/0x120
[43071687.862765]  [<ffffffff802b4d53>] vfs_lstat_fd+0x43/0x70
[43071687.862774]  [<ffffffff802c3e88>] d_kill+0x48/0x70
[43071687.862780]  [<ffffffff802c417f>] dput+0x1f/0xf0
[43071687.862784]  [<ffffffff802d1440>] dcache_dir_close+0x10/0x20
[43071687.862788]  [<ffffffff802b4da7>] sys_newlstat+0x27/0x50
[43071687.862792]  [<ffffffff802ca0b1>] mntput_no_expire+0x21/0x140
[43071687.862798]  [<ffffffff802aef04>] filp_close+0x54/0x90
[43071687.862802]  [<ffffffff802b0694>] sys_close+0x94/0xe0
[43071687.862808]  [<ffffffff8020c2bb>] system_call_after_swapgs+0x7b/0x80
[43071687.862816]

The system is a quite heavily loaded 8x dual-core Sun X4600 server
(amd64).

Jesper
-- 
Jesper Krogh

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-03 15:01                 ` Linus Torvalds
  2008-06-03 16:07                   ` Ian Kent
@ 2008-06-05  7:31                   ` Ian Kent
  2008-06-05 21:29                     ` Linus Torvalds
                                       ` (2 more replies)
  1 sibling, 3 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-05  7:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel,
	Jeff Moyer, Andrew Morton


On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> 
> On Tue, 3 Jun 2008, Ian Kent wrote:
> > > 
> > > > I think it must be autofs4 doing something weird.  Like this in
> > > > autofs4_lookup_unhashed():
> > > > 
> > > > 			/*
> > > > 			 * Make the rehashed dentry negative so the VFS
> > > > 			 * behaves as it should.
> > > > 			 */
> > > > 			if (inode) {
> > > > 				dentry->d_inode = NULL;
> 
> Uhhuh. Yeah, that's not allowed.
> 
> A dentry inode can start _out_ as NULL, but it can never later become NULL 
> again until it is totally unused.

Here is a patch for autofs4 to, hopefully, resolve this.

Keep in mind this doesn't address any other autofs4 issues, but it
should allow us to identify whether this was in fact the root cause of the
problem Jesper reported.
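
The rule quoted above (a dentry's inode may start out NULL but must not
become NULL again while the dentry is still in use) can be stated as a
small check.  This is a userspace sketch: inv_dentry and
inode_change_ok() are hypothetical, and modelling "totally unused" as a
reference count of zero is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry with only the fields needed to state the rule. */
struct inv_dentry {
	void *inode;  /* NULL means the dentry is negative */
	int refcount; /* users still holding the dentry */
};

/*
 * Return true if changing the inode pointer to 'new_inode' respects
 * the rule: NULL to non-NULL is always fine, non-NULL to NULL only
 * once nobody is using the dentry any more.
 */
bool inode_change_ok(const struct inv_dentry *d, void *new_inode)
{
	if (d->inode != NULL && new_inode == NULL)
		return d->refcount == 0;
	return true;
}
```

In these terms, the code removed by the patch below cleared d_inode on
a dentry that was still referenced.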

autofs4 - leave rehashed dentry positive

From: Ian Kent <raven@themaw.net>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that
makes the dentry negative and updating autofs4_dir_symlink() and
autofs4_dir_mkdir() to recognise they have been given a positive
dentry (previously the dentry was always negative) and deal with
it. In addition, the dentry info struct initialization, autofs4_init_ino(),
and the symlink free function, ino_lnkfree(), have been made aware
of this possible re-use. This is needed because the current case
re-uses a dentry in order to preserve its flags as described in
commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.

Signed-off-by: Ian Kent <raven@themaw.net>
---

 fs/autofs4/inode.c |   23 +++++++------
 fs/autofs4/root.c  |   95 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 67 insertions(+), 51 deletions(-)


diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..ec9a641 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -24,8 +24,10 @@
 
 static void ino_lnkfree(struct autofs_info *ino)
 {
-	kfree(ino->u.symlink);
-	ino->u.symlink = NULL;
+	if (ino->u.symlink) {
+		kfree(ino->u.symlink);
+		ino->u.symlink = NULL;
+	}
 }
 
 struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
@@ -41,16 +43,17 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
 	if (ino == NULL)
 		return NULL;
 
-	ino->flags = 0;
-	ino->mode = mode;
-	ino->inode = NULL;
-	ino->dentry = NULL;
-	ino->size = 0;
-
-	INIT_LIST_HEAD(&ino->rehash);
+	if (!reinit) {
+		ino->flags = 0;
+		ino->mode = mode;
+		ino->inode = NULL;
+		ino->dentry = NULL;
+		ino->size = 0;
+		INIT_LIST_HEAD(&ino->rehash);
+		atomic_set(&ino->count, 0);
+	}
 
 	ino->last_used = jiffies;
-	atomic_set(&ino->count, 0);
 
 	ino->sbi = sbi;
 
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..6ce603b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
 			goto next;
 
 		if (d_unhashed(dentry)) {
-			struct inode *inode = dentry->d_inode;
-
-			ino = autofs4_dentry_ino(dentry);
 			list_del_init(&ino->rehash);
 			dget(dentry);
-			/*
-			 * Make the rehashed dentry negative so the VFS
-			 * behaves as it should.
-			 */
-			if (inode) {
-				dentry->d_inode = NULL;
-				list_del_init(&dentry->d_alias);
-				spin_unlock(&dentry->d_lock);
-				spin_unlock(&sbi->rehash_lock);
-				spin_unlock(&dcache_lock);
-				iput(inode);
-				return dentry;
-			}
 			spin_unlock(&dentry->d_lock);
 			spin_unlock(&sbi->rehash_lock);
 			spin_unlock(&dcache_lock);
@@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
 		return -EACCES;
 
 	ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
-	if (ino == NULL)
+	if (!ino)
 		return -ENOSPC;
 
-	ino->size = strlen(symname);
-	ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
-
-	if (cp == NULL) {
-		kfree(ino);
+	cp = kmalloc(strlen(symname) + 1, GFP_KERNEL);
+	if (!cp) {
+		if (!dentry->d_fsdata)
+			kfree(ino);
 		return -ENOSPC;
 	}
 
 	strcpy(cp, symname);
 
-	inode = autofs4_get_inode(dir->i_sb, ino);
-	d_add(dentry, inode);
+	inode = dentry->d_inode;
+	if (inode)
+		d_rehash(dentry);
+	else {
+		inode = autofs4_get_inode(dir->i_sb, ino);
+		if (!inode) {
+			kfree(cp);
+			if (!dentry->d_fsdata)
+				kfree(ino);
+			return -ENOSPC;
+		}
+
+		d_add(dentry, inode);
 
-	if (dir == dir->i_sb->s_root->d_inode)
-		dentry->d_op = &autofs4_root_dentry_operations;
-	else
-		dentry->d_op = &autofs4_dentry_operations;
+		if (dir == dir->i_sb->s_root->d_inode)
+			dentry->d_op = &autofs4_root_dentry_operations;
+		else
+			dentry->d_op = &autofs4_dentry_operations;
+
+		dentry->d_fsdata = ino;
+		ino->dentry = dentry;
+		ino->inode = inode;
+	}
+	dget(dentry);
 
-	dentry->d_fsdata = ino;
-	ino->dentry = dget(dentry);
 	atomic_inc(&ino->count);
 	p_ino = autofs4_dentry_ino(dentry->d_parent);
 	if (p_ino && dentry->d_parent != dentry)
 		atomic_inc(&p_ino->count);
-	ino->inode = inode;
 
+	ino->u.symlink = cp;
+	ino->size = strlen(symname);
 	dir->i_mtime = CURRENT_TIME;
 
 	return 0;
@@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 		dentry, dentry->d_name.len, dentry->d_name.name);
 
 	ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
-	if (ino == NULL)
+	if (!ino)
 		return -ENOSPC;
 
-	inode = autofs4_get_inode(dir->i_sb, ino);
-	d_add(dentry, inode);
+	inode = dentry->d_inode;
+	if (inode)
+		d_rehash(dentry);
+	else {
+		inode = autofs4_get_inode(dir->i_sb, ino);
+		if (!inode) {
+			if (!dentry->d_fsdata)
+				kfree(ino);
+			return -ENOSPC;
+		}
 
-	if (dir == dir->i_sb->s_root->d_inode)
-		dentry->d_op = &autofs4_root_dentry_operations;
-	else
-		dentry->d_op = &autofs4_dentry_operations;
+		d_add(dentry, inode);
+
+		if (dir == dir->i_sb->s_root->d_inode)
+			dentry->d_op = &autofs4_root_dentry_operations;
+		else
+			dentry->d_op = &autofs4_dentry_operations;
+
+		dentry->d_fsdata = ino;
+		ino->dentry = dentry;
+		ino->inode = inode;
+	}
+	dget(dentry);
 
-	dentry->d_fsdata = ino;
-	ino->dentry = dget(dentry);
 	atomic_inc(&ino->count);
 	p_ino = autofs4_dentry_ino(dentry->d_parent);
 	if (p_ino && dentry->d_parent != dentry)
 		atomic_inc(&p_ino->count);
-	ino->inode = inode;
+
 	inc_nlink(dir);
 	dir->i_mtime = CURRENT_TIME;
 



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05  7:31                   ` Ian Kent
@ 2008-06-05 21:29                     ` Linus Torvalds
  2008-06-05 21:34                       ` Jesper Krogh
  2008-06-06  2:39                       ` Ian Kent
  2008-06-05 22:30                     ` Andrew Morton
  2008-06-06  6:23                     ` Jesper Krogh
  2 siblings, 2 replies; 89+ messages in thread
From: Linus Torvalds @ 2008-06-05 21:29 UTC (permalink / raw)
  To: Ian Kent
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel,
	Jeff Moyer, Andrew Morton



On Thu, 5 Jun 2008, Ian Kent wrote:
> 
> Keep in mind this doesn't address any other autofs4 issues but I it
> should allow us to identify if this was in fact the root cause of the
> problem Jesper reported.

Jesper, can you test this? I think you said you could trigger this in ~24 
hours or so? I'd love to have some testing of some heavy autofs user. Even 
if it's not a guarantee of a fix (due to reproducing the bug not being 
entirely trivial), at least I'd like to know that it doesn't introduce any 
obvious new problems either..

			Linus

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05 21:29                     ` Linus Torvalds
@ 2008-06-05 21:34                       ` Jesper Krogh
  2008-06-06  2:39                       ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Jesper Krogh @ 2008-06-05 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Kent, Al Viro, Miklos Szeredi, linux-kernel, linux-fsdevel,
	Jeff Moyer, Andrew Morton

Linus Torvalds wrote:
> 
> On Thu, 5 Jun 2008, Ian Kent wrote:
>> Keep in mind this doesn't address any other autofs4 issues but I it
>> should allow us to identify if this was in fact the root cause of the
>> problem Jesper reported.
> 
> Jesper, can you test this? I think you said you could trigger this in ~24 
> hours or so? I'd love to have some testing of some heavy autofs user. Even 
> if it's not a guarantee of a fix (due to reproducing the bug not being 
> entirely trivial), at least I'd like to know that it doesn't introduce any 
> obvious new problems either..

Yes, I'll apply it and report back. Don't expect anything before early 
next week.

Jesper
-- 
Jesper

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05  7:31                   ` Ian Kent
  2008-06-05 21:29                     ` Linus Torvalds
@ 2008-06-05 22:30                     ` Andrew Morton
  2008-06-06  2:47                       ` Ian Kent
  2008-06-27  4:18                       ` Ian Kent
  2008-06-06  6:23                     ` Jesper Krogh
  2 siblings, 2 replies; 89+ messages in thread
From: Andrew Morton @ 2008-06-05 22:30 UTC (permalink / raw)
  To: Ian Kent
  Cc: torvalds, viro, miklos, jesper, linux-kernel, linux-fsdevel,
	jmoyer

On Thu, 05 Jun 2008 15:31:37 +0800
Ian Kent <raven@themaw.net> wrote:

> 
> On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> > 
> > On Tue, 3 Jun 2008, Ian Kent wrote:
> > > > 
> > > > > I think it must be autofs4 doing something weird.  Like this in
> > > > > autofs4_lookup_unhashed():
> > > > > 
> > > > > 			/*
> > > > > 			 * Make the rehashed dentry negative so the VFS
> > > > > 			 * behaves as it should.
> > > > > 			 */
> > > > > 			if (inode) {
> > > > > 				dentry->d_inode = NULL;
> > 
> > Uhhuh. Yeah, that's not allowed.
> > 
> > A dentry inode can start _out_ as NULL, but it can never later become NULL 
> > again until it is totally unused.
> 
> Here is a patch for autofs4 to, hopefully, resolve this.
> 
> Keep in mind this doesn't address any other autofs4 issues but it
> should allow us to identify if this was in fact the root cause of the
> problem Jesper reported.
> 
> autofs4 - leave rehashed dentry positive
> 
> From: Ian Kent <raven@themaw.net>
> 
> Correct the error of making a positive dentry negative after it has been
> instantiated.
> 
> This involves removing the code in autofs4_lookup_unhashed() that
> makes the dentry negative and updating autofs4_dir_symlink() and
> autofs4_dir_mkdir() to recognise they have been given a positive
> dentry (previously the dentry was always negative) and deal with
> it. In addition the dentry info struct initialization, autofs4_init_ino(),
> and the symlink free function, ino_lnkfree(), have been made aware
> of this possible re-use. This is needed because the current case
> re-uses a dentry in order to preserve its flags as described in
> commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.
> 
> ...
>
> ...
>
> diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
> index edf5b6b..6ce603b 100644
> --- a/fs/autofs4/root.c
> +++ b/fs/autofs4/root.c
> @@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
>  			goto next;
>  
>  		if (d_unhashed(dentry)) {
> -			struct inode *inode = dentry->d_inode;
> -
> -			ino = autofs4_dentry_ino(dentry);
>  			list_del_init(&ino->rehash);
>  			dget(dentry);
> -			/*
> -			 * Make the rehashed dentry negative so the VFS
> -			 * behaves as it should.
> -			 */
> -			if (inode) {
> -				dentry->d_inode = NULL;
> -				list_del_init(&dentry->d_alias);
> -				spin_unlock(&dentry->d_lock);
> -				spin_unlock(&sbi->rehash_lock);
> -				spin_unlock(&dcache_lock);
> -				iput(inode);
> -				return dentry;
> -			}
>  			spin_unlock(&dentry->d_lock);
>  			spin_unlock(&sbi->rehash_lock);
>  			spin_unlock(&dcache_lock);
> @@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
>  		return -EACCES;
>  
>  	ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
> -	if (ino == NULL)
> +	if (!ino)
>  		return -ENOSPC;

Should have been ENOMEM, I guess.

> -	ino->size = strlen(symname);
> -	ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
> -
> -	if (cp == NULL) {
> -		kfree(ino);
> +	cp = kmalloc(ino->size + 1, GFP_KERNEL);
> +	if (!cp) {
> +		if (!dentry->d_fsdata)
> +			kfree(ino);

OK, so here we work out that autofs4_init_ino() had to allocate a new
autofs_info and if so, free it here.  It took me a moment..

>  		return -ENOSPC;

ENOMEM?

>  	}
>  
>  	strcpy(cp, symname);
>  
> -	inode = autofs4_get_inode(dir->i_sb, ino);
> -	d_add(dentry, inode);
> +	inode = dentry->d_inode;
> +	if (inode)
> +		d_rehash(dentry);
> +	else {
> +		inode = autofs4_get_inode(dir->i_sb, ino);
> +		if (!inode) {
> +			kfree(cp);
> +			if (!dentry->d_fsdata)
> +				kfree(ino);
> +			return -ENOSPC;
> +		}
> +
> +		d_add(dentry, inode);
>  
> -	if (dir == dir->i_sb->s_root->d_inode)
> -		dentry->d_op = &autofs4_root_dentry_operations;
> -	else
> -		dentry->d_op = &autofs4_dentry_operations;
> +		if (dir == dir->i_sb->s_root->d_inode)
> +			dentry->d_op = &autofs4_root_dentry_operations;
> +		else
> +			dentry->d_op = &autofs4_dentry_operations;
> +
> +		dentry->d_fsdata = ino;
> +		ino->dentry = dentry;
> +		ino->inode = inode;
> +	}
> +	dget(dentry);
>  
> -	dentry->d_fsdata = ino;
> -	ino->dentry = dget(dentry);
>  	atomic_inc(&ino->count);
>  	p_ino = autofs4_dentry_ino(dentry->d_parent);
>  	if (p_ino && dentry->d_parent != dentry)
>  		atomic_inc(&p_ino->count);
> -	ino->inode = inode;
>  
> +	ino->u.symlink = cp;
> +	ino->size = strlen(symname);
>  	dir->i_mtime = CURRENT_TIME;

This all seems a bit ungainly.  I assume that on entry to
autofs4_dir_symlink(), ino->size is equal to strlen(symname)?  If it's
not, that strcpy() will overrun.

But if ino->size _is_ equal to strlen(symname) then why did we just
recalculate the same thing?

I'm suspecting we can zap a lump of code and just do

	cp = kstrdup(symname, GFP_KERNEL);

Anyway, please check that.
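
For illustration only, here is a user-space sketch of the two patterns being
compared. It assumes plain libc rather than the kernel allocators: malloc()
stands in for kmalloc(), and dup_string() is a hypothetical helper modelled on
what kstrdup() does. None of these function names exist in autofs4.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Fragile pattern: the buffer is sized from a length kept elsewhere.
 * If stored_len is not strlen(src), a plain strcpy() into the buffer
 * would overrun it, so this sketch refuses to copy in that case.
 */
static char *copy_with_stored_len(const char *src, size_t stored_len)
{
	char *cp;

	if (stored_len != strlen(src))
		return NULL;	/* length mismatch: copying would be unsafe */

	cp = malloc(stored_len + 1);
	if (!cp)
		return NULL;
	strcpy(cp, src);
	return cp;
}

/*
 * Simpler pattern: measure, allocate and copy in one place, so the
 * buffer size can never disagree with the string being copied.
 */
static char *dup_string(const char *src)
{
	size_t len = strlen(src);
	char *cp = malloc(len + 1);

	if (cp)
		memcpy(cp, src, len + 1);	/* includes the trailing NUL */
	return cp;
}
```

With the second form the length can still be recorded afterwards from the
same string, so there is no window in which the two can differ.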

>  	return 0;
> @@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
>  		dentry, dentry->d_name.len, dentry->d_name.name);
>  
>  	ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
> -	if (ino == NULL)
> +	if (!ino)
>  		return -ENOSPC;

ENOMEM?

> -	inode = autofs4_get_inode(dir->i_sb, ino);
> -	d_add(dentry, inode);
> +	inode = dentry->d_inode;
> +	if (inode)
> +		d_rehash(dentry);
> +	else {
> +		inode = autofs4_get_inode(dir->i_sb, ino);
> +		if (!inode) {
> +			if (!dentry->d_fsdata)
> +				kfree(ino);
> +			return -ENOSPC;
> +		}
>  
> -	if (dir == dir->i_sb->s_root->d_inode)
> -		dentry->d_op = &autofs4_root_dentry_operations;
> -	else
> -		dentry->d_op = &autofs4_dentry_operations;
> +		d_add(dentry, inode);
> +
> +		if (dir == dir->i_sb->s_root->d_inode)
> +			dentry->d_op = &autofs4_root_dentry_operations;
> +		else
> +			dentry->d_op = &autofs4_dentry_operations;
> +
> +		dentry->d_fsdata = ino;
> +		ino->dentry = dentry;
> +		ino->inode = inode;
> +	}
> +	dget(dentry);

This all looks very similar to the code in autofs4_dir_symlink().  Some
refactoring might be needed at some stage?

> -	dentry->d_fsdata = ino;
> -	ino->dentry = dget(dentry);
>  	atomic_inc(&ino->count);
>  	p_ino = autofs4_dentry_ino(dentry->d_parent);
>  	if (p_ino && dentry->d_parent != dentry)
>  		atomic_inc(&p_ino->count);
> -	ino->inode = inode;
> +
>  	inc_nlink(dir);
>  	dir->i_mtime = CURRENT_TIME;
>  
> 

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05 21:29                     ` Linus Torvalds
  2008-06-05 21:34                       ` Jesper Krogh
@ 2008-06-06  2:39                       ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-06  2:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Miklos Szeredi, jesper, linux-kernel, linux-fsdevel,
	Jeff Moyer, Andrew Morton


On Thu, 2008-06-05 at 14:29 -0700, Linus Torvalds wrote:
> 
> On Thu, 5 Jun 2008, Ian Kent wrote:
> > 
> > Keep in mind this doesn't address any other autofs4 issues but it
> > should allow us to identify if this was in fact the root cause of the
> > problem Jesper reported.
> 
> Jesper, can you test this? I think you said you could trigger this in ~24 
> hours or so? I'd love to have some testing of some heavy autofs user. Even 
> if it's not a guarantee of a fix (due to reproducing the bug not being 
> entirely trivial), at least I'd like to know that it doesn't introduce any 
> obvious new problems either..

I'm continuing to test this also.
My initial testing was fine but late last night, with some of my other
patches applied as well, I started seeing some strange problems.

Ian



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05 22:30                     ` Andrew Morton
@ 2008-06-06  2:47                       ` Ian Kent
  2008-06-27  4:18                       ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-06  2:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, viro, miklos, jesper, linux-kernel, linux-fsdevel,
	jmoyer


On Thu, 2008-06-05 at 15:30 -0700, Andrew Morton wrote:
> On Thu, 05 Jun 2008 15:31:37 +0800
> Ian Kent <raven@themaw.net> wrote:
> 
> > 
> > On Tue, 2008-06-03 at 08:01 -0700, Linus Torvalds wrote:
> > > 
> > > On Tue, 3 Jun 2008, Ian Kent wrote:
> > > > > 
> > > > > > I think it must be autofs4 doing something weird.  Like this in
> > > > > > autofs4_lookup_unhashed():
> > > > > > 
> > > > > > 			/*
> > > > > > 			 * Make the rehashed dentry negative so the VFS
> > > > > > 			 * behaves as it should.
> > > > > > 			 */
> > > > > > 			if (inode) {
> > > > > > 				dentry->d_inode = NULL;
> > > 
> > > Uhhuh. Yeah, that's not allowed.
> > > 
> > > A dentry inode can start _out_ as NULL, but it can never later become NULL 
> > > again until it is totally unused.
> > 
> > Here is a patch for autofs4 to, hopefully, resolve this.
> > 
> > > Keep in mind this doesn't address any other autofs4 issues but it
> > should allow us to identify if this was in fact the root cause of the
> > problem Jesper reported.
> > 
> > autofs4 - leave rehashed dentry positive
> > 
> > From: Ian Kent <raven@themaw.net>
> > 
> > Correct the error of making a positive dentry negative after it has been
> > instantiated.
> > 
> > This involves removing the code in autofs4_lookup_unhashed() that
> > makes the dentry negative and updating autofs4_dir_symlink() and
> > autofs4_dir_mkdir() to recognise they have been given a positive
> > dentry (previously the dentry was always negative) and deal with
> > it. In addition the dentry info struct initialization, autofs4_init_ino(),
> > and the symlink free function, ino_lnkfree(), have been made aware
> > of this possible re-use. This is needed because the current case
> > re-uses a dentry in order to preserve its flags as described in
> > commit f50b6f8691cae2e0064c499dd3ef3f31142987f0.
> > 
> > ...
> >
> > ...
> >
> > diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
> > index edf5b6b..6ce603b 100644
> > --- a/fs/autofs4/root.c
> > +++ b/fs/autofs4/root.c
> > @@ -555,24 +555,8 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
> >  			goto next;
> >  
> >  		if (d_unhashed(dentry)) {
> > -			struct inode *inode = dentry->d_inode;
> > -
> > -			ino = autofs4_dentry_ino(dentry);
> >  			list_del_init(&ino->rehash);
> >  			dget(dentry);
> > -			/*
> > -			 * Make the rehashed dentry negative so the VFS
> > -			 * behaves as it should.
> > -			 */
> > -			if (inode) {
> > -				dentry->d_inode = NULL;
> > -				list_del_init(&dentry->d_alias);
> > -				spin_unlock(&dentry->d_lock);
> > -				spin_unlock(&sbi->rehash_lock);
> > -				spin_unlock(&dcache_lock);
> > -				iput(inode);
> > -				return dentry;
> > -			}
> >  			spin_unlock(&dentry->d_lock);
> >  			spin_unlock(&sbi->rehash_lock);
> >  			spin_unlock(&dcache_lock);
> > @@ -728,35 +712,50 @@ static int autofs4_dir_symlink(struct inode *dir,
> >  		return -EACCES;
> >  
> >  	ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
> > -	if (ino == NULL)
> > +	if (!ino)
> >  		return -ENOSPC;
> 
> Should have been ENOMEM, I guess.

Ha, yeah, I think it should be.
I just used the errno that has been there all along without thinking.

> 
> > -	ino->size = strlen(symname);
> > -	ino->u.symlink = cp = kmalloc(ino->size + 1, GFP_KERNEL);
> > -
> > -	if (cp == NULL) {
> > -		kfree(ino);
> > +	cp = kmalloc(ino->size + 1, GFP_KERNEL);
> > +	if (!cp) {
> > +		if (!dentry->d_fsdata)
> > +			kfree(ino);
> 
> OK, so here we work out that autofs4_init_ino() had to allocate a new
> autofs_info and if so, free it here.  It took me a moment..

That is right, but as it is now, this will always be a new allocation.
If all goes well (yeah right) I will be allocating the info struct in
->lookup() in a subsequent patch.
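
The "free it only if we allocated it" rule discussed here can be sketched in
user space as follows. This is a simplified model, not the autofs4 code:
struct info, info_init() and info_cleanup_on_error() are hypothetical names,
calloc()/free() stand in for the kernel allocators, and the 'attached'
argument plays the role of dentry->d_fsdata.

```c
#include <stdlib.h>

struct info {
	int mode;
	int count;
};

/*
 * Reuse 'existing' when the caller already has one, otherwise allocate
 * a new struct. Loosely modelled on how autofs4_init_ino() is used in
 * the patch; the mode is set in both cases.
 */
static struct info *info_init(struct info *existing, int mode)
{
	struct info *ino = existing;

	if (!ino) {
		ino = calloc(1, sizeof(*ino));
		if (!ino)
			return NULL;
	}
	ino->mode = mode;
	return ino;
}

/*
 * Error-path cleanup: free only what this call chain allocated.
 * When 'attached' is non-NULL the struct is owned elsewhere and must
 * be left alone. Returns 1 if the struct was freed, 0 otherwise.
 */
static int info_cleanup_on_error(struct info *ino, struct info *attached)
{
	if (!attached) {
		free(ino);
		return 1;
	}
	return 0;
}
```

In the model, a freshly allocated struct is released on failure, while a
struct that was already attached survives the failed call untouched.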

> 
> >  		return -ENOSPC;
> 
> ENOMEM?
> 
> >  	}
> >  
> >  	strcpy(cp, symname);
> >  
> > -	inode = autofs4_get_inode(dir->i_sb, ino);
> > -	d_add(dentry, inode);
> > +	inode = dentry->d_inode;
> > +	if (inode)
> > +		d_rehash(dentry);
> > +	else {
> > +		inode = autofs4_get_inode(dir->i_sb, ino);
> > +		if (!inode) {
> > +			kfree(cp);
> > +			if (!dentry->d_fsdata)
> > +				kfree(ino);
> > +			return -ENOSPC;
> > +		}
> > +
> > +		d_add(dentry, inode);
> >  
> > -	if (dir == dir->i_sb->s_root->d_inode)
> > -		dentry->d_op = &autofs4_root_dentry_operations;
> > -	else
> > -		dentry->d_op = &autofs4_dentry_operations;
> > +		if (dir == dir->i_sb->s_root->d_inode)
> > +			dentry->d_op = &autofs4_root_dentry_operations;
> > +		else
> > +			dentry->d_op = &autofs4_dentry_operations;
> > +
> > +		dentry->d_fsdata = ino;
> > +		ino->dentry = dentry;
> > +		ino->inode = inode;
> > +	}
> > +	dget(dentry);
> >  
> > -	dentry->d_fsdata = ino;
> > -	ino->dentry = dget(dentry);
> >  	atomic_inc(&ino->count);
> >  	p_ino = autofs4_dentry_ino(dentry->d_parent);
> >  	if (p_ino && dentry->d_parent != dentry)
> >  		atomic_inc(&p_ino->count);
> > -	ino->inode = inode;
> >  
> > +	ino->u.symlink = cp;
> > +	ino->size = strlen(symname);
> >  	dir->i_mtime = CURRENT_TIME;
> 
> This all seems a bit ungainly.  I assume that on entry to
> autofs4_dir_symlink(), ino->size is equal to strlen(symname)?  If it's
> not, that strcpy() will overrun.
> 
> But if ino->size _is_ equal to strlen(symname) then why did we just
> recalculate the same thing?
> 
> I'm suspecting we can zap a lump of code and just do
> 
> 	cp = kstrdup(symname, GFP_KERNEL);
> 
> Anyway, please check that.
> 
> >  	return 0;
> > @@ -866,24 +865,38 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> >  		dentry, dentry->d_name.len, dentry->d_name.name);
> >  
> >  	ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
> > -	if (ino == NULL)
> > +	if (!ino)
> >  		return -ENOSPC;
> 
> ENOMEM?
> 
> > -	inode = autofs4_get_inode(dir->i_sb, ino);
> > -	d_add(dentry, inode);
> > +	inode = dentry->d_inode;
> > +	if (inode)
> > +		d_rehash(dentry);
> > +	else {
> > +		inode = autofs4_get_inode(dir->i_sb, ino);
> > +		if (!inode) {
> > +			if (!dentry->d_fsdata)
> > +				kfree(ino);
> > +			return -ENOSPC;
> > +		}
> >  
> > -	if (dir == dir->i_sb->s_root->d_inode)
> > -		dentry->d_op = &autofs4_root_dentry_operations;
> > -	else
> > -		dentry->d_op = &autofs4_dentry_operations;
> > +		d_add(dentry, inode);
> > +
> > +		if (dir == dir->i_sb->s_root->d_inode)
> > +			dentry->d_op = &autofs4_root_dentry_operations;
> > +		else
> > +			dentry->d_op = &autofs4_dentry_operations;
> > +
> > +		dentry->d_fsdata = ino;
> > +		ino->dentry = dentry;
> > +		ino->inode = inode;
> > +	}
> > +	dget(dentry);
> 
> This all looks very similar to the code in autofs4_dir_symlink().  Some
> refactoring might be needed at some stage?
> 
> > -	dentry->d_fsdata = ino;
> > -	ino->dentry = dget(dentry);
> >  	atomic_inc(&ino->count);
> >  	p_ino = autofs4_dentry_ino(dentry->d_parent);
> >  	if (p_ino && dentry->d_parent != dentry)
> >  		atomic_inc(&p_ino->count);
> > -	ino->inode = inode;
> > +
> >  	inc_nlink(dir);
> >  	dir->i_mtime = CURRENT_TIME;
> >  
> > 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-05  7:31                   ` Ian Kent
  2008-06-05 21:29                     ` Linus Torvalds
  2008-06-05 22:30                     ` Andrew Morton
@ 2008-06-06  6:23                     ` Jesper Krogh
  2008-06-06  8:21                       ` Ian Kent
  2 siblings, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-06-06  6:23 UTC (permalink / raw)
  To: Ian Kent
  Cc: Linus Torvalds, Al Viro, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Jeff Moyer, Andrew Morton

Hi.

This isn't a test of the proposed patch. I just got another variation 
of the problem in the log. The only change I can see is that I've tried 
running the automount daemon both with and without the --ghost option. 
Still 2.6.26-rc4.

Jun  5 16:13:15 node37 kernel: [17388710.169561] BUG: unable to handle 
kernel NULL pointer dereference at 00000000000000b2
Jun  5 16:13:15 node37 automount[28691]: mount(nfs): nfs: mount failure 
hest.nzcorp.net:/z/fx1200 on /nfs/fx1200
Jun  5 16:13:15 node37 automount[28691]: failed to mount /nfs/fx1200
Jun  5 16:13:15 node37 kernel: [17388710.217273] IP: [graft_tree+77/288] 
graft_tree+0x4d/0x120
Jun  5 16:13:15 node37 kernel: [17388710.217273] PGD f9e75067 PUD 
f681e067 PMD 0
Jun  5 16:13:15 node37 kernel: [17388710.217273] Oops: 0000 [1] SMP
Jun  5 16:13:15 node37 kernel: [17388710.217273] CPU 1
Jun  5 16:13:15 node37 kernel: [17388710.217273] Modules linked in: nfs 
lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd 
fuse parport_pc lp parport i2c_amd756 serio_raw psmouse pcspkr container 
i2c_core shpchp k8temp button amd_rng evdev pci_hotplug ext3 jbd mbcache 
sg sd_mod ide_cd_mod cdrom floppy mptspi mptscsih mptbase 
scsi_transport_spi ohci_hcd tg3 usbcore amd74xx ide_core ata_generic 
libata scsi_mod dock thermal processor fan thermal_sys
Jun  5 16:13:15 node37 kernel: [17388710.217273] Pid: 28693, comm: 
mount.nfs Not tainted 2.6.26-rc4 #1
Jun  5 16:13:15 node37 kernel: [17388710.993688] RIP: 
0010:[graft_tree+77/288]  [graft_tree+77/288] graft_tree+0x4d/0x120
Jun  5 16:13:15 node37 kernel: [17388710.993688] RSP: 
0000:ffff8100f9c85e08  EFLAGS: 00010246
Jun  5 16:13:15 node37 kernel: [17388710.993688] RAX: ffff8100bfbc0270 
RBX: 00000000ffffffec RCX: 0000000000000000
Jun  5 16:13:15 node37 kernel: [17388711.245666] RDX: ffff8100f9ec5900 
RSI: ffff8100f9c85e68 RDI: ffff8100bae1f800
Jun  5 16:13:15 node37 kernel: [17388711.245666] RBP: ffff8100bae1f800 
R08: 0000000000000000 R09: 0000000000000001
Jun  5 16:13:15 node37 kernel: [17388711.245666] R10: 0000000000000001 
R11: ffffffff803011c0 R12: ffff8100f9c85e68
Jun  5 16:13:15 node37 kernel: [17388711.513641] R13: 0000000000000000 
R14: 000000000000000b R15: 000000000000000b
Jun  5 16:13:15 node37 kernel: [17388711.513641] FS: 
00007fd02f2cf6e0(0000) GS:ffff8100fbb0e280(0000) knlGS:00000000557fc6b0
Jun  5 16:13:15 node37 kernel: [17388711.701623] CS:  0010 DS: 0000 ES: 
0000 CR0: 000000008005003b
Jun  5 16:13:15 node37 kernel: [17388711.701623] CR2: 00000000000000b2 
CR3: 00000000f9f49000 CR4: 00000000000006e0
Jun  5 16:13:15 node37 kernel: [17388711.701623] DR0: 0000000000000000 
DR1: 0000000000000000 DR2: 0000000000000000
Jun  5 16:13:15 node37 kernel: [17388711.701623] DR3: 0000000000000000 
DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun  5 16:13:15 node37 kernel: [17388711.701623] Process mount.nfs (pid: 
28693, threadinfo ffff8100f9c84000, task ffff8100f6815640)
Jun  5 16:13:15 node37 kernel: [17388712.145575] Stack: 
ffff8100f9c85e68 ffff8100f9c85e70 ffff8100bae1f800 ffffffff802b2622
Jun  5 16:13:15 node37 kernel: [17388712.145575]  0000000000000006 
0000000000000000 ffff8100f695f000 ffff8100f695e000
Jun  5 16:13:15 node37 kernel: [17388712.145575]  ffff8100f695d000 
ffffffff802b49e9 000000004847f4ef 0000000000000000
Jun  5 16:13:15 node37 kernel: [17388712.145575] Call Trace:
Jun  5 16:13:15 node37 kernel: [17388712.145575]  [do_add_mount+162/320] 
? do_add_mount+0xa2/0x140
Jun  5 16:13:15 node37 kernel: [17388712.145575]  [do_mount+505/592] ? 
do_mount+0x1f9/0x250
Jun  5 16:13:15 node37 kernel: [17388712.145575] 
[copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
Jun  5 16:13:15 node37 kernel: [17388712.145575]  [sys_mount+155/256] ? 
sys_mount+0x9b/0x100
Jun  5 16:13:15 node37 kernel: [17388712.145575] 
[system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
Jun  5 16:13:15 node37 kernel: [17388712.145575]
Jun  5 16:13:15 node37 kernel: [17388712.145575]
Jun  5 16:13:15 node37 kernel: [17388712.145575] Code: f7 40 58 00 00 00 
80 74 15 89 d8 48 8b 6c 24 08 48 8b 1c 24 4c 8b 64 24 10 48 83 c4 18 c3 
48 8b 46 08 bb ec ff ff ff 48 8b 48 10 <0f> b7 81 b2 00 00 00 25 00 f0 
00 00 3d 00 40 00 00 48 8b 47 20
Jun  5 16:13:15 node37 kernel: [17388712.145575] RIP 
[graft_tree+77/288] graft_tree+0x4d/0x120
Jun  5 16:13:15 node37 kernel: [17388712.145575]  RSP <ffff8100f9c85e08>
Jun  5 16:13:15 node37 kernel: [17388712.145575] CR2: 00000000000000b2
Jun  5 16:13:15 node37 kernel: [17388715.129847] ---[ end trace 
f3c4579f529c23bf ]---

I'll apply the patch today and get some nodes booted up on it.

-- 
Jesper

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-06  6:23                     ` Jesper Krogh
@ 2008-06-06  8:21                       ` Ian Kent
  2008-06-06  8:25                         ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-06  8:21 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Linus Torvalds, Al Viro, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Jeff Moyer, Andrew Morton, Jeff Moyer


On Fri, 2008-06-06 at 08:23 +0200, Jesper Krogh wrote:
> Hi.
> 
> This isn't a test of the proposed patch. I just got another variation 
> of the problem in the log. (I've tried running the automount daemon both 
> with and without the --ghost option) that is the only change I can see. 
> Still 2.6.26-rc4..

Right.

Whether that would make a difference depends largely on your map
configuration. If you have simple indirect or direct maps then
using the --ghost option (or just adding the "browse" option if you're
using version 5) should prevent the code that turns the dentry negative
from being executed at all. If you're using submounts in your map, or
the "hosts" map or you have multi-mount entries in the maps then that
code could still be executed.

> 
> Jun  5 16:13:15 node37 kernel: [17388710.169561] BUG: unable to handle 
> kernel NULL pointer dereference at 00000000000000b2
> Jun  5 16:13:15 node37 automount[28691]: mount(nfs): nfs: mount failure 
> hest.nzcorp.net:/z/fx1200 on /nfs/fx1200
> Jun  5 16:13:15 node37 automount[28691]: failed to mount /nfs/fx1200
> Jun  5 16:13:15 node37 kernel: [17388710.217273] IP: [graft_tree+77/288] 
> graft_tree+0x4d/0x120
> Jun  5 16:13:15 node37 kernel: [17388710.217273] PGD f9e75067 PUD 
> f681e067 PMD 0
> Jun  5 16:13:15 node37 kernel: [17388710.217273] Oops: 0000 [1] SMP
> Jun  5 16:13:15 node37 kernel: [17388710.217273] CPU 1
> Jun  5 16:13:15 node37 kernel: [17388710.217273] Modules linked in: nfs 
> lockd sunrpc autofs4 ipv6 af_packet usbhid hid uhci_hcd ehci_hcd usbkbd 
> fuse parport_pc lp parport i2c_amd756 serio_raw psmouse pcspkr container 
> i2c_core shpchp k8temp button amd_rng evdev pci_hotplug ext3 jbd mbcache 
> sg sd_mod ide_cd_mod cdrom floppy mptspi mptscsih mptbase 
> scsi_transport_spi ohci_hcd tg3 usbcore amd74xx ide_core ata_generic 
> libata scsi_mod dock thermal processor fan thermal_sys
> Jun  5 16:13:15 node37 kernel: [17388710.217273] Pid: 28693, comm: 
> mount.nfs Not tainted 2.6.26-rc4 #1
> Jun  5 16:13:15 node37 kernel: [17388710.993688] RIP: 
> 0010:[graft_tree+77/288]  [graft_tree+77/288] graft_tree+0x4d/0x120
> Jun  5 16:13:15 node37 kernel: [17388710.993688] RSP: 
> 0000:ffff8100f9c85e08  EFLAGS: 00010246
> Jun  5 16:13:15 node37 kernel: [17388710.993688] RAX: ffff8100bfbc0270 
> RBX: 00000000ffffffec RCX: 0000000000000000
> Jun  5 16:13:15 node37 kernel: [17388711.245666] RDX: ffff8100f9ec5900 
> RSI: ffff8100f9c85e68 RDI: ffff8100bae1f800
> Jun  5 16:13:15 node37 kernel: [17388711.245666] RBP: ffff8100bae1f800 
> R08: 0000000000000000 R09: 0000000000000001
> Jun  5 16:13:15 node37 kernel: [17388711.245666] R10: 0000000000000001 
> R11: ffffffff803011c0 R12: ffff8100f9c85e68
> Jun  5 16:13:15 node37 kernel: [17388711.513641] R13: 0000000000000000 
> R14: 000000000000000b R15: 000000000000000b
> Jun  5 16:13:15 node37 kernel: [17388711.513641] FS: 
> 00007fd02f2cf6e0(0000) GS:ffff8100fbb0e280(0000) knlGS:00000000557fc6b0
> Jun  5 16:13:15 node37 kernel: [17388711.701623] CS:  0010 DS: 0000 ES: 
> 0000 CR0: 000000008005003b
> Jun  5 16:13:15 node37 kernel: [17388711.701623] CR2: 00000000000000b2 
> CR3: 00000000f9f49000 CR4: 00000000000006e0
> Jun  5 16:13:15 node37 kernel: [17388711.701623] DR0: 0000000000000000 
> DR1: 0000000000000000 DR2: 0000000000000000
> Jun  5 16:13:15 node37 kernel: [17388711.701623] DR3: 0000000000000000 
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Jun  5 16:13:15 node37 kernel: [17388711.701623] Process mount.nfs (pid: 
> 28693, threadinfo ffff8100f9c84000, task ffff8100f6815640)
> Jun  5 16:13:15 node37 kernel: [17388712.145575] Stack: 
> ffff8100f9c85e68 ffff8100f9c85e70 ffff8100bae1f800 ffffffff802b2622
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  0000000000000006 
> 0000000000000000 ffff8100f695f000 ffff8100f695e000
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  ffff8100f695d000 
> ffffffff802b49e9 000000004847f4ef 0000000000000000
> Jun  5 16:13:15 node37 kernel: [17388712.145575] Call Trace:
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  [do_add_mount+162/320] 
> ? do_add_mount+0xa2/0x140
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  [do_mount+505/592] ? 
> do_mount+0x1f9/0x250
> Jun  5 16:13:15 node37 kernel: [17388712.145575] 
> [copy_mount_options+269/384] ? copy_mount_options+0x10d/0x180
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  [sys_mount+155/256] ? 
> sys_mount+0x9b/0x100
> Jun  5 16:13:15 node37 kernel: [17388712.145575] 
> [system_call_after_swapgs+123/128] ? system_call_after_swapgs+0x7b/0x80
> Jun  5 16:13:15 node37 kernel: [17388712.145575]
> Jun  5 16:13:15 node37 kernel: [17388712.145575]
> Jun  5 16:13:15 node37 kernel: [17388712.145575] Code: f7 40 58 00 00 00 
> 80 74 15 89 d8 48 8b 6c 24 08 48 8b 1c 24 4c 8b 64 24 10 48 83 c4 18 c3 
> 48 8b 46 08 bb ec ff ff ff 48 8b 48 10 <0f> b7 81 b2 00 00 00 25 00 f0 
> 00 00 3d 00 40 00 00 48 8b 47 20
> Jun  5 16:13:15 node37 kernel: [17388712.145575] RIP 
> [graft_tree+77/288] graft_tree+0x4d/0x120
> Jun  5 16:13:15 node37 kernel: [17388712.145575]  RSP <ffff8100f9c85e08>
> Jun  5 16:13:15 node37 kernel: [17388712.145575] CR2: 00000000000000b2
> Jun  5 16:13:15 node37 kernel: [17388715.129847] ---[ end trace 
> f3c4579f529c23bf ]---
> 
> I'll apply the patch today and get some nodes booted up on it.
> 


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-06  8:21                       ` Ian Kent
@ 2008-06-06  8:25                         ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-06  8:25 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Linus Torvalds, Al Viro, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Jeff Moyer, Andrew Morton


On Fri, 2008-06-06 at 16:21 +0800, Ian Kent wrote:
> On Fri, 2008-06-06 at 08:23 +0200, Jesper Krogh wrote:
> > Hi.
> > 
> > This isn't a test of the proposed patch. I just got another variation 
> > of the problem in the log. (I've tried running the automount daemon both 
> > with and without the --ghost option) that is the only change I can see. 
> > Still 2.6.26-rc4..
> 
> Right.
> 
> Whether that would make a difference depends largely on your map
> > configuration. If you have simple indirect or direct maps then
> using the --ghost option (or just adding the "browse" option if you're
> using version 5) should prevent the code that turns the dentry negative
> from being executed at all. If you're using submounts in your map, or
> the "hosts" map or you have multi-mount entries in the maps then that
> code could still be executed.
> 

btw, I'm currently testing with these additional changes.
I don't think they will result in functional differences but it's best
we keep in sync.

autofs4 - leave rehash dentry positive fix

From: Ian Kent <raven@themaw.net>

Change ENOSPC to ENOMEM.
Make autofs4_init_ino() always set mode field.

Signed-off-by: Ian Kent <raven@themaw.net>
---

 fs/autofs4/inode.c |    2 +-
 fs/autofs4/root.c  |   10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index ec9a641..3221506 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -45,7 +45,6 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
 
 	if (!reinit) {
 		ino->flags = 0;
-		ino->mode = mode;
 		ino->inode = NULL;
 		ino->dentry = NULL;
 		ino->size = 0;
@@ -53,6 +52,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
 		atomic_set(&ino->count, 0);
 	}
 
+	ino->mode = mode;
 	ino->last_used = jiffies;
 
 	ino->sbi = sbi;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index 6ce603b..f438e6b 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -713,13 +713,13 @@ static int autofs4_dir_symlink(struct inode *dir,
 
 	ino = autofs4_init_ino(ino, sbi, S_IFLNK | 0555);
 	if (!ino)
-		return -ENOSPC;
+		return -ENOMEM;
 
 	cp = kmalloc(ino->size + 1, GFP_KERNEL);
 	if (!cp) {
 		if (!dentry->d_fsdata)
 			kfree(ino);
-		return -ENOSPC;
+		return -ENOMEM;
 	}
 
 	strcpy(cp, symname);
@@ -733,7 +733,7 @@ static int autofs4_dir_symlink(struct inode *dir,
 			kfree(cp);
 			if (!dentry->d_fsdata)
 				kfree(ino);
-			return -ENOSPC;
+			return -ENOMEM;
 		}
 
 		d_add(dentry, inode);
@@ -866,7 +866,7 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 
 	ino = autofs4_init_ino(ino, sbi, S_IFDIR | 0555);
 	if (!ino)
-		return -ENOSPC;
+		return -ENOMEM;
 
 	inode = dentry->d_inode;
 	if (inode)
@@ -876,7 +876,7 @@ static int autofs4_dir_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 		if (!inode) {
 			if (!dentry->d_fsdata)
 				kfree(ino);
-			return -ENOSPC;
+			return -ENOMEM;
 		}
 
 		d_add(dentry, inode);



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-04  2:42                                   ` Ian Kent
  2008-06-04  5:34                                     ` Miklos Szeredi
@ 2008-06-10  4:57                                     ` Ian Kent
  2008-06-10  6:28                                       ` Jesper Krogh
  1 sibling, 1 reply; 89+ messages in thread
From: Ian Kent @ 2008-06-10  4:57 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Moyer, Linus Torvalds, Miklos Szeredi, jesper, linux-kernel,
	linux-fsdevel


On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> > On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> > 
> > > autofs4_lookup is called on behalf of a process trying to walk into an
> > > automounted directory.  That dentry's d_flags is set to
> > > DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
> > > indexed off of the name of the dentry.  A callout is made to the
> > > automount daemon (via autofs4_wait).
> > > 
> > > The daemon looks up the directory name in its configuration.  If it
> > > finds a valid map entry, it will then create the directory using
> > > sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > > 1) will return NULL, and then the mkdir call will be made.  The
> > > autofs4_mkdir function then instantiates the dentry which, by the way,
> > > is different from the original dentry passed to autofs4_lookup.  (This
> > > dentry also does not get the PENDING flag set, which is a bug addressed
> > > by a patch set that Ian and I have been working on;  specifically, the
> > > idea is to reuse the dentry from the original lookup, but I digress).
> > > 
> > > The daemon then mounts the share on the given directory and issues an
> > > ioctl to wakeup the waiter.  When awakened, the waiter clears the
> > > DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > > dcache and returns that dentry if found.
> > > Later, the dentry gets expired via another ioctl.  That path sets
> > > the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > > It then calls out to the daemon to perform the unmount and rmdir.  The
> > > rmdir unhashes the dentry (and places it on the rehash list).
> > > 
> > > The dentry is removed from the rehash list if there was a racing expire
> > > and mount or if the dentry is released.
> > > 
> > > This description is valid for the tree as it stands today.  Ian and I
> > > have been working on fixing some other race conditions which will change
> > > the dentry life cycle (for the better, I hope).
> > 
> > So what happens if new lookup hits between umount and rmdir?
> 
> It will wait for the expire to complete and then wait for a mount
> request to the daemon.

Actually, that explanation is a bit simple-minded.

It should wait for the expire in ->revalidate().
Following the expire completion d_invalidate() should return 0, since
the dentry is now unhashed, which causes ->revalidate() to return 0.
do_lookup() should see this and call a ->lookup().
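
To make the sequence above concrete, here is a toy user-space model of it.
This is only a sketch, not kernel code: every toy_* name is made up for
illustration, the wait is collapsed into a single function call, and
toy_d_invalidate() models only the "returns 0 for an unhashed dentry" case
mentioned above (anything else is reported as -EBUSY as a simplification).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dentry; only the two states discussed above. */
struct toy_dentry {
	bool hashed;    /* still reachable through the dcache hash */
	bool expiring;  /* stands in for AUTOFS_INF_EXPIRING */
};

int toy_lookup_calls;   /* how often the ->lookup() stand-in ran */

/* Stand-in for waiting on the expire: the daemon finishes umount + rmdir,
 * and the rmdir unhashes the dentry. */
void toy_wait_for_expire(struct toy_dentry *d)
{
	assert(d->expiring);
	d->expiring = false;
	d->hashed = false;
}

/* Models only the case described above: 0 once the dentry is unhashed. */
int toy_d_invalidate(const struct toy_dentry *d)
{
	return d->hashed ? -EBUSY : 0;
}

/* Stand-in for ->revalidate(): wait for the expire, then report the
 * dentry as invalid (0) if it could be invalidated. */
int toy_revalidate(struct toy_dentry *d)
{
	if (!d->expiring)
		return 1;       /* nothing in flight, dentry is usable */
	toy_wait_for_expire(d);
	return toy_d_invalidate(d) == 0 ? 0 : 1;
}

/* Stand-in for ->lookup(): just counts the call. */
void toy_lookup(void)
{
	toy_lookup_calls++;
}

/* Stand-in for do_lookup(): returns true if the cached dentry was used,
 * false if it fell through to ->lookup(). */
bool toy_do_lookup(struct toy_dentry *cached)
{
	if (toy_revalidate(cached))
		return true;
	toy_lookup();
	return false;
}
```

In this model, a dentry that is hashed and expiring when toy_do_lookup()
is called ends up unhashed and the call falls through to toy_lookup(); a
dentry with no expire in flight is used as it is.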

But maybe I've missed something as I'm seeing a problem now.

Ian



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-10  4:57                                     ` Ian Kent
@ 2008-06-10  6:28                                       ` Jesper Krogh
  2008-06-10  6:40                                         ` Ian Kent
  0 siblings, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-06-10  6:28 UTC (permalink / raw)
  To: Ian Kent
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel

Ian Kent wrote:
> On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
>> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
>>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
>>>
>>>> autofs4_lookup is called on behalf a process trying to walk into an
>>>> automounted directory.  That dentry's d_flags is set to
>>>> DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
>>>> indexed off of the name of the dentry.  A callout is made to the
>>>> automount daemon (via autofs4_wait).
>>>>
>>>> The daemon looks up the directory name in its configuration.  If it
>>>> finds a valid map entry, it will then create the directory using
>>>> sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
>>>> 1) will return NULL, and then the mkdir call will be made.  The
>>>> autofs4_mkdir function then instantiates the dentry which, by the way,
>>>> is different from the original dentry passed to autofs4_lookup.  (This
>>>> dentry also does not get the PENDING flag set, which is a bug addressed
>>>> by a patch set that Ian and I have been working on;  specifically, the
>>>> idea is to reuse the dentry from the original lookup, but I digress).
>>>>
>>>> The daemon then mounts the share on the given directory and issues an
>>>> ioctl to wakeup the waiter.  When awakened, the waiter clears the
>>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
>>>> dcache and returns that dentry if found.
>>>> Later, the dentry gets expired via another ioctl.  That path sets
>>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
>>>> It then calls out to the daemon to perform the unmount and rmdir.  The
>>>> rmdir unhashes the dentry (and places it on the rehash list).
>>>>
>>>> The dentry is removed from the rehash list if there was a racing expire
>>>> and mount or if the dentry is released.
>>>>
>>>> This description is valid for the tree as it stands today.  Ian and I
>>>> have been working on fixing some other race conditions which will change
>>>> the dentry life cycle (for the better, I hope).
>>> So what happens if new lookup hits between umount and rmdir?
>> It will wait for the expire to complete and then wait for a mount
>> request to the daemon.
> 
> Actually, that explanation is a bit simple minded.
> 
> It should wait for the expire in ->revalidate().
> Following the expire completion d_invalidate() should return 0, since
> the dentry is now unhashed, which causes ->revalidate() to return 0.
> do_lookup() should see this and call a ->lookup().
> 
> But maybe I've missed something as I'm seeing a problem now.

Ok. I've been running with the patch for a few days now and haven't seen
any problems. That said, I also turned off the --ghost option to autofs,
so I don't know whether it is the patch or the different code paths being
used that made the difference. Since this is a production system, I'm a
bit reluctant to change a working setup just to test it out.

Jesper
-- 
Jesper

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-10  6:28                                       ` Jesper Krogh
@ 2008-06-10  6:40                                         ` Ian Kent
  2008-06-10  9:09                                           ` Ian Kent
  2008-06-12  3:03                                           ` Ian Kent
  0 siblings, 2 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-10  6:40 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-10 at 08:28 +0200, Jesper Krogh wrote:
> Ian Kent wrote:
> > On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> >> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> >>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> >>>
> >>>> autofs4_lookup is called on behalf a process trying to walk into an
> >>>> automounted directory.  That dentry's d_flags is set to
> >>>> DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
> >>>> indexed off of the name of the dentry.  A callout is made to the
> >>>> automount daemon (via autofs4_wait).
> >>>>
> >>>> The daemon looks up the directory name in its configuration.  If it
> >>>> finds a valid map entry, it will then create the directory using
> >>>> sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
> >>>> 1) will return NULL, and then the mkdir call will be made.  The
> >>>> autofs4_mkdir function then instantiates the dentry which, by the way,
> >>>> is different from the original dentry passed to autofs4_lookup.  (This
> >>>> dentry also does not get the PENDING flag set, which is a bug addressed
> >>>> by a patch set that Ian and I have been working on;  specifically, the
> >>>> idea is to reuse the dentry from the original lookup, but I digress).
> >>>>
> >>>> The daemon then mounts the share on the given directory and issues an
> >>>> ioctl to wakeup the waiter.  When awakened, the waiter clears the
> >>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> >>>> dcache and returns that dentry if found.
> >>>> Later, the dentry gets expired via another ioctl.  That path sets
> >>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> >>>> It then calls out to the daemon to perform the unmount and rmdir.  The
> >>>> rmdir unhashes the dentry (and places it on the rehash list).
> >>>>
> >>>> The dentry is removed from the rehash list if there was a racing expire
> >>>> and mount or if the dentry is released.
> >>>>
> >>>> This description is valid for the tree as it stands today.  Ian and I
> >>>> have been working on fixing some other race conditions which will change
> >>>> the dentry life cycle (for the better, I hope).
> >>> So what happens if new lookup hits between umount and rmdir?
> >> It will wait for the expire to complete and then wait for a mount
> >> request to the daemon.
> > 
> > Actually, that explanation is a bit simple minded.
> > 
> > It should wait for the expire in ->revalidate().
> > Following the expire completion d_invalidate() should return 0, since
> > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > do_lookup() should see this and call a ->lookup().
> > 
> > But maybe I've missed something as I'm seeing a problem now.
> 
> Ok. Ive been running on the patch for a few days now .. and didn't see
> any problems. But that being said, I also turned off the --ghost option
> to autofs so if it actually is the patch or the different codepaths
> being used, I dont know. Since this is a production system, I'm a bit
> reluctant to just change a working setup to test it out.

No need to change anything.

My comment above relates to difficulties I'm having with patches that
I'm working on that follow this one, and to the specific question that Al
Viro asked: "what happens if new lookup hits between umount and rmdir".

But, clearly we need to know if I (autofs4) caused the specific problem
you reported and if the patch resolves it. And that sounds promising
from what you've seen so far.

Ian



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-10  6:40                                         ` Ian Kent
@ 2008-06-10  9:09                                           ` Ian Kent
  2008-06-12  3:03                                           ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-10  9:09 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel


On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> On Tue, 2008-06-10 at 08:28 +0200, Jesper Krogh wrote:
> > Ian Kent wrote:
> > > On Wed, 2008-06-04 at 10:42 +0800, Ian Kent wrote:
> > >> On Wed, 2008-06-04 at 00:00 +0100, Al Viro wrote:
> > >>> On Tue, Jun 03, 2008 at 03:53:36PM -0400, Jeff Moyer wrote:
> > >>>
> > >>>> autofs4_lookup is called on behalf a process trying to walk into an
> > >>>> automounted directory.  That dentry's d_flags is set to
> > >>>> DCACHE_AUTOFS_PENDING but not hashed.  A waitqueue entry is created,
> > >>>> indexed off of the name of the dentry.  A callout is made to the
> > >>>> automount daemon (via autofs4_wait).
> > >>>>
> > >>>> The daemon looks up the directory name in its configuration.  If it
> > >>>> finds a valid map entry, it will then create the directory using
> > >>>> sys_mkdir.  The autofs4_lookup call on behalf of the daemon (oz_mode ==
> > >>>> 1) will return NULL, and then the mkdir call will be made.  The
> > >>>> autofs4_mkdir function then instantiates the dentry which, by the way,
> > >>>> is different from the original dentry passed to autofs4_lookup.  (This
> > >>>> dentry also does not get the PENDING flag set, which is a bug addressed
> > >>>> by a patch set that Ian and I have been working on;  specifically, the
> > >>>> idea is to reuse the dentry from the original lookup, but I digress).
> > >>>>
> > >>>> The daemon then mounts the share on the given directory and issues an
> > >>>> ioctl to wakeup the waiter.  When awakened, the waiter clears the
> > >>>> DCACHE_AUTOFS_PENDING flag, does another lookup of the name in the
> > >>>> dcache and returns that dentry if found.
> > >>>> Later, the dentry gets expired via another ioctl.  That path sets
> > >>>> the AUTOFS_INF_EXPIRING flag in the d_fsdata associated with the dentry.
> > >>>> It then calls out to the daemon to perform the unmount and rmdir.  The
> > >>>> rmdir unhashes the dentry (and places it on the rehash list).
> > >>>>
> > >>>> The dentry is removed from the rehash list if there was a racing expire
> > >>>> and mount or if the dentry is released.
> > >>>>
> > >>>> This description is valid for the tree as it stands today.  Ian and I
> > >>>> have been working on fixing some other race conditions which will change
> > >>>> the dentry life cycle (for the better, I hope).
> > >>> So what happens if new lookup hits between umount and rmdir?
> > >> It will wait for the expire to complete and then wait for a mount
> > >> request to the daemon.
> > > 
> > > Actually, that explanation is a bit simple minded.
> > > 
> > > It should wait for the expire in ->revalidate().
> > > Following the expire completion d_invalidate() should return 0, since
> > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > do_lookup() should see this and call a ->lookup().
> > > 
> > > But maybe I've missed something as I'm seeing a problem now.
> > 
> > Ok. Ive been running on the patch for a few days now .. and didn't see
> > any problems. But that being said, I also turned off the --ghost option
> > to autofs so if it actually is the patch or the different codepaths
> > being used, I dont know. Since this is a production system, I'm a bit
> > reluctant to just change a working setup to test it out.
> 
> No need to change anything.

Mmmm ... that comment might not be accurate either.

It's beginning to look like my original approach to fixing a deadlock
bug, posted back in Feb 2007, wasn't right at all. But we don't really
have time to determine that for sure now, as it can take several days
for the bug to trigger.

Ian



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-10  6:40                                         ` Ian Kent
  2008-06-10  9:09                                           ` Ian Kent
@ 2008-06-12  3:03                                           ` Ian Kent
  2008-06-12  7:02                                             ` Jesper Krogh
  2008-06-12 11:19                                             ` Ian Kent
  1 sibling, 2 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-12  3:03 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Andrew Morton


On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > >>> So what happens if new lookup hits between umount and rmdir?
> > >> It will wait for the expire to complete and then wait for a mount
> > >> request to the daemon.
> > > 
> > > Actually, that explanation is a bit simple minded.
> > > 
> > > It should wait for the expire in ->revalidate().
> > > Following the expire completion d_invalidate() should return 0, since
> > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > do_lookup() should see this and call a ->lookup().
> > > 
> > > But maybe I've missed something as I'm seeing a problem now.
> > 
> > Ok. Ive been running on the patch for a few days now .. and didn't see
> > any problems. But that being said, I also turned off the --ghost option
> > to autofs so if it actually is the patch or the different codepaths
> > being used, I dont know. Since this is a production system, I'm a bit
> > reluctant to just change a working setup to test it out.
> 
> No need to change anything.

There is a problem with the patch I posted.
It will allow an incorrect ENOENT return in some cases.

The patch below is sufficiently different to the original patch I posted
to warrant a replacement rather than a correction.

If you can find a way to test this out that would be great.

autofs4 - don't make expiring dentry negative

From: Ian Kent <raven@themaw.net>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).

Signed-off-by: Ian Kent <raven@themaw.net>
---

 fs/autofs4/autofs_i.h |    6 +--
 fs/autofs4/inode.c    |    6 +--
 fs/autofs4/root.c     |  115 ++++++++++++++++++-------------------------------
 3 files changed, 49 insertions(+), 78 deletions(-)


diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {
 
 	int		flags;
 
-	struct list_head rehash;
+	struct list_head expiring;
 
 	struct autofs_sb_info *sbi;
 	unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
 	struct mutex wq_mutex;
 	spinlock_t fs_lock;
 	struct autofs_wait_queue *queues; /* Wait queue pointer */
-	spinlock_t rehash_lock;
-	struct list_head rehash_list;
+	spinlock_t lookup_lock;
+	struct list_head expiring_list;
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
 	ino->dentry = NULL;
 	ino->size = 0;
 
-	INIT_LIST_HEAD(&ino->rehash);
+	INIT_LIST_HEAD(&ino->expiring);
 
 	ino->last_used = jiffies;
 	atomic_set(&ino->count, 0);
@@ -338,8 +338,8 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 	mutex_init(&sbi->wq_mutex);
 	spin_lock_init(&sbi->fs_lock);
 	sbi->queues = NULL;
-	spin_lock_init(&sbi->rehash_lock);
-	INIT_LIST_HEAD(&sbi->rehash_list);
+	spin_lock_init(&sbi->lookup_lock);
+	INIT_LIST_HEAD(&sbi->expiring_list);
 	s->s_blocksize = 1024;
 	s->s_blocksize_bits = 10;
 	s->s_magic = AUTOFS_SUPER_MAGIC;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..2e8959c 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -493,10 +493,10 @@ void autofs4_dentry_release(struct dentry *de)
 		struct autofs_sb_info *sbi = autofs4_sbi(de->d_sb);
 
 		if (sbi) {
-			spin_lock(&sbi->rehash_lock);
-			if (!list_empty(&inf->rehash))
-				list_del(&inf->rehash);
-			spin_unlock(&sbi->rehash_lock);
+			spin_lock(&sbi->lookup_lock);
+			if (!list_empty(&inf->expiring))
+				list_del(&inf->expiring);
+			spin_unlock(&sbi->lookup_lock);
 		}
 
 		inf->dentry = NULL;
@@ -518,7 +518,7 @@ static struct dentry_operations autofs4_dentry_operations = {
 	.d_release	= autofs4_dentry_release,
 };
 
-static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
+static struct dentry *autofs4_lookup_expiring(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
 {
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
@@ -526,14 +526,14 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
 	struct list_head *p, *head;
 
 	spin_lock(&dcache_lock);
-	spin_lock(&sbi->rehash_lock);
-	head = &sbi->rehash_list;
+	spin_lock(&sbi->lookup_lock);
+	head = &sbi->expiring_list;
 	list_for_each(p, head) {
 		struct autofs_info *ino;
 		struct dentry *dentry;
 		struct qstr *qstr;
 
-		ino = list_entry(p, struct autofs_info, rehash);
+		ino = list_entry(p, struct autofs_info, expiring);
 		dentry = ino->dentry;
 
 		spin_lock(&dentry->d_lock);
@@ -555,33 +555,17 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
 			goto next;
 
 		if (d_unhashed(dentry)) {
-			struct inode *inode = dentry->d_inode;
-
-			ino = autofs4_dentry_ino(dentry);
-			list_del_init(&ino->rehash);
+			list_del_init(&ino->expiring);
 			dget(dentry);
-			/*
-			 * Make the rehashed dentry negative so the VFS
-			 * behaves as it should.
-			 */
-			if (inode) {
-				dentry->d_inode = NULL;
-				list_del_init(&dentry->d_alias);
-				spin_unlock(&dentry->d_lock);
-				spin_unlock(&sbi->rehash_lock);
-				spin_unlock(&dcache_lock);
-				iput(inode);
-				return dentry;
-			}
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&sbi->rehash_lock);
+			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
 			return dentry;
 		}
 next:
 		spin_unlock(&dentry->d_lock);
 	}
-	spin_unlock(&sbi->rehash_lock);
+	spin_unlock(&sbi->lookup_lock);
 	spin_unlock(&dcache_lock);
 
 	return NULL;
@@ -591,7 +575,7 @@ next:
 static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
 {
 	struct autofs_sb_info *sbi;
-	struct dentry *unhashed;
+	struct dentry *expiring;
 	int oz_mode;
 
 	DPRINTK("name = %.*s",
@@ -607,44 +591,40 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 	DPRINTK("pid = %u, pgrp = %u, catatonic = %d, oz_mode = %d",
 		 current->pid, task_pgrp_nr(current), sbi->catatonic, oz_mode);
 
-	unhashed = autofs4_lookup_unhashed(sbi, dentry->d_parent, &dentry->d_name);
-	if (!unhashed) {
-		/*
-		 * Mark the dentry incomplete but don't hash it. We do this
-		 * to serialize our inode creation operations (symlink and
-		 * mkdir) which prevents deadlock during the callback to
-		 * the daemon. Subsequent user space lookups for the same
-		 * dentry are placed on the wait queue while the daemon
-		 * itself is allowed passage unresticted so the create
-		 * operation itself can then hash the dentry. Finally,
-		 * we check for the hashed dentry and return the newly
-		 * hashed dentry.
-		 */
-		dentry->d_op = &autofs4_root_dentry_operations;
-
-		dentry->d_fsdata = NULL;
-		d_instantiate(dentry, NULL);
-	} else {
-		struct autofs_info *ino = autofs4_dentry_ino(unhashed);
-		DPRINTK("rehash %p with %p", dentry, unhashed);
+	expiring = autofs4_lookup_expiring(sbi, dentry->d_parent, &dentry->d_name);
+	if (expiring) {
+		struct autofs_info *ino = autofs4_dentry_ino(expiring);
 		/*
 		 * If we are racing with expire the request might not
 		 * be quite complete but the directory has been removed
 		 * so it must have been successful, so just wait for it.
-		 * We need to ensure the AUTOFS_INF_EXPIRING flag is clear
-		 * before continuing as revalidate may fail when calling
-		 * try_to_fill_dentry (returning EAGAIN) if we don't.
 		 */
 		while (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
 			DPRINTK("wait for incomplete expire %p name=%.*s",
-				unhashed, unhashed->d_name.len,
-				unhashed->d_name.name);
-			autofs4_wait(sbi, unhashed, NFY_NONE);
+				expiring, expiring->d_name.len,
+				expiring->d_name.name);
+			autofs4_wait(sbi, expiring, NFY_NONE);
 			DPRINTK("request completed");
 		}
-		dentry = unhashed;
+		dput(expiring);
 	}
 
+	/*
+	 * Mark the dentry incomplete but don't hash it. We do this
+	 * to serialize our inode creation operations (symlink and
+	 * mkdir) which prevents deadlock during the callback to
+	 * the daemon. Subsequent user space lookups for the same
+	 * dentry are placed on the wait queue while the daemon
+	 * itself is allowed passage unresticted so the create
+	 * operation itself can then hash the dentry. Finally,
+	 * we check for the hashed dentry and return the newly
+	 * hashed dentry.
+	 */
+	dentry->d_op = &autofs4_root_dentry_operations;
+
+	dentry->d_fsdata = NULL;
+	d_instantiate(dentry, NULL);
+
 	if (!oz_mode) {
 		spin_lock(&dentry->d_lock);
 		dentry->d_flags |= DCACHE_AUTOFS_PENDING;
@@ -668,8 +648,6 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 			if (sigismember (sigset, SIGKILL) ||
 			    sigismember (sigset, SIGQUIT) ||
 			    sigismember (sigset, SIGINT)) {
-			    if (unhashed)
-				dput(unhashed);
 			    return ERR_PTR(-ERESTARTNOINTR);
 			}
 		}
@@ -699,15 +677,9 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 		else
 			dentry = ERR_PTR(-ENOENT);
 
-		if (unhashed)
-			dput(unhashed);
-
 		return dentry;
 	}
 
-	if (unhashed)
-		return dentry;
-
 	return NULL;
 }
 
@@ -769,9 +741,8 @@ static int autofs4_dir_symlink(struct inode *dir,
  * that the file no longer exists. However, doing that means that the
  * VFS layer can turn the dentry into a negative dentry.  We don't want
  * this, because the unlink is probably the result of an expire.
- * We simply d_drop it and add it to a rehash candidates list in the
- * super block, which allows the dentry lookup to reuse it retaining
- * the flags, such as expire in progress, in case we're racing with expire.
+ * We simply d_drop it and add it to a expiring list in the super block,
+ * which allows the dentry lookup to check for an incomplete expire.
  *
  * If a process is blocked on the dentry waiting for the expire to finish,
  * it will invalidate the dentry and try to mount with a new one.
@@ -801,9 +772,9 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
 	dir->i_mtime = CURRENT_TIME;
 
 	spin_lock(&dcache_lock);
-	spin_lock(&sbi->rehash_lock);
-	list_add(&ino->rehash, &sbi->rehash_list);
-	spin_unlock(&sbi->rehash_lock);
+	spin_lock(&sbi->lookup_lock);
+	list_add(&ino->expiring, &sbi->expiring_list);
+	spin_unlock(&sbi->lookup_lock);
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
@@ -829,9 +800,9 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
 		spin_unlock(&dcache_lock);
 		return -ENOTEMPTY;
 	}
-	spin_lock(&sbi->rehash_lock);
-	list_add(&ino->rehash, &sbi->rehash_list);
-	spin_unlock(&sbi->rehash_lock);
+	spin_lock(&sbi->lookup_lock);
+	list_add(&ino->expiring, &sbi->expiring_list);
+	spin_unlock(&sbi->lookup_lock);
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
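
For readers who want the shape of the change without reading the whole
diff, here is a toy user-space model of what the patch description says:
rmdir/unlink put the dropped dentry on an expiring list, and lookup checks
that list, waits for an unfinished expire, drops its reference and then
always carries on with the fresh negative dentry. This is a sketch under
assumptions, not the kernel code: all toy_* names are hypothetical, the
list is a fixed array, matching is by name only (the real function also
compares hash, parent and length), and the dcache_lock, lookup_lock and
d_lock locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

/* Hypothetical stand-in for a dentry plus its autofs_info. */
struct toy_entry {
	const char *name;
	bool expiring;   /* models AUTOFS_INF_EXPIRING */
	bool on_list;    /* currently on the expiring list */
	int refcount;    /* models dget()/dput() */
};

/* Hypothetical stand-in for the expiring_list in the super block info. */
struct toy_sb {
	struct toy_entry *expiring_list[TOY_MAX];
	size_t n;
};

/* rmdir/unlink path: remember the dentry that was just dropped. */
bool toy_rmdir(struct toy_sb *sb, struct toy_entry *e)
{
	if (sb->n == TOY_MAX)
		return false;
	sb->expiring_list[sb->n++] = e;
	e->on_list = true;
	return true;
}

/* Find a matching entry, take it off the list and take a reference. */
struct toy_entry *toy_lookup_expiring(struct toy_sb *sb, const char *name)
{
	for (size_t i = 0; i < sb->n; i++) {
		struct toy_entry *e = sb->expiring_list[i];

		if (strcmp(e->name, name) != 0)
			continue;
		sb->expiring_list[i] = sb->expiring_list[--sb->n];
		e->on_list = false;
		e->refcount++;
		return e;
	}
	return NULL;
}

/* Stand-in for autofs4_wait(): the expire completes while we wait. */
void toy_wait(struct toy_entry *e)
{
	assert(e->expiring);
	e->expiring = false;
}

/* Lookup path: wait out an unfinished expire, drop the reference, then
 * continue with the new negative dentry (not modelled here).
 * Returns how many times it had to wait. */
int toy_lookup(struct toy_sb *sb, const char *name)
{
	int waits = 0;
	struct toy_entry *e = toy_lookup_expiring(sb, name);

	if (e) {
		while (e->expiring) {
			toy_wait(e);
			waits++;
		}
		e->refcount--;   /* the old dentry is never re-used */
	}
	return waits;
}
```

The point the model tries to show is that the old dentry is only consulted
for its expire state; it is never handed back to the caller, which is the
difference from the rehash approach that the patch removes.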



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-12  3:03                                           ` Ian Kent
@ 2008-06-12  7:02                                             ` Jesper Krogh
  2008-06-12 11:21                                               ` Ian Kent
  2008-06-12 11:19                                             ` Ian Kent
  1 sibling, 1 reply; 89+ messages in thread
From: Jesper Krogh @ 2008-06-12  7:02 UTC (permalink / raw)
  To: Ian Kent
  Cc: Jesper Krogh, Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi,
	linux-kernel, linux-fsdevel, Andrew Morton

> On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> The patch below is sufficiently different to the original patch I posted
> to warrant a replacement rather than a correction.

I'll apply that now and see how it goes. Is it preferable to use the
--ghost option or not? What would you suggest?

Jesper
-- 
Jesper Krogh


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: Linux 2.6.26-rc4
  2008-06-12  3:03                                           ` Ian Kent
  2008-06-12  7:02                                             ` Jesper Krogh
@ 2008-06-12 11:19                                             ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-12 11:19 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Andrew Morton


On Thu, 2008-06-12 at 11:03 +0800, Ian Kent wrote:
> On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > > >>> So what happens if new lookup hits between umount and rmdir?
> > > >> It will wait for the expire to complete and then wait for a mount
> > > >> request to the daemon.
> > > > 
> > > > Actually, that explanation is a bit simple minded.
> > > > 
> > > > It should wait for the expire in ->revalidate().
> > > > Following the expire completion d_invalidate() should return 0, since
> > > > the dentry is now unhashed, which causes ->revalidate() to return 0.
> > > > do_lookup() should see this and call a ->lookup().
> > > > 
> > > > But maybe I've missed something as I'm seeing a problem now.
> > > 
> > > Ok. Ive been running on the patch for a few days now .. and didn't see
> > > any problems. But that being said, I also turned off the --ghost option
> > > to autofs so if it actually is the patch or the different codepaths
> > > being used, I dont know. Since this is a production system, I'm a bit
> > > reluctant to just change a working setup to test it out.
> > 
> > No need to change anything.
> 
> There is a problem with the patch I posted.
> It will allow an incorrect ENOENT return in some cases.
> 
> The patch below is sufficiently different to the original patch I posted
> to warrant a replacement rather than a correction.
> 
> If you can find a way to test this out that would be great.

Oops, I must not have set the Preformat option in Evolution; let me try
again.

autofs4 - don't make expiring dentry negative

From: Ian Kent <raven@themaw.net>

Correct the error of making a positive dentry negative after it has been
instantiated.

This involves removing the code in autofs4_lookup_unhashed() that makes
the dentry negative and updating autofs4_lookup() to check for an
unfinished expire and wait if needed. The dentry used for the lookup
must be negative for mounts to trigger in the required cases so the
dentry can't be re-used (which is probably for the better anyway).

Signed-off-by: Ian Kent <raven@themaw.net>
---

 fs/autofs4/autofs_i.h |    6 +--
 fs/autofs4/inode.c    |    6 +--
 fs/autofs4/root.c     |  115 ++++++++++++++++++-------------------------------
 3 files changed, 49 insertions(+), 78 deletions(-)


diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index c3d352d..69b1497 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -52,7 +52,7 @@ struct autofs_info {
 
 	int		flags;
 
-	struct list_head rehash;
+	struct list_head expiring;
 
 	struct autofs_sb_info *sbi;
 	unsigned long last_used;
@@ -112,8 +112,8 @@ struct autofs_sb_info {
 	struct mutex wq_mutex;
 	spinlock_t fs_lock;
 	struct autofs_wait_queue *queues; /* Wait queue pointer */
-	spinlock_t rehash_lock;
-	struct list_head rehash_list;
+	spinlock_t lookup_lock;
+	struct list_head expiring_list;
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 2fdcf5e..94bfc15 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -47,7 +47,7 @@ struct autofs_info *autofs4_init_ino(struct autofs_info *ino,
 	ino->dentry = NULL;
 	ino->size = 0;
 
-	INIT_LIST_HEAD(&ino->rehash);
+	INIT_LIST_HEAD(&ino->expiring);
 
 	ino->last_used = jiffies;
 	atomic_set(&ino->count, 0);
@@ -338,8 +338,8 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
 	mutex_init(&sbi->wq_mutex);
 	spin_lock_init(&sbi->fs_lock);
 	sbi->queues = NULL;
-	spin_lock_init(&sbi->rehash_lock);
-	INIT_LIST_HEAD(&sbi->rehash_list);
+	spin_lock_init(&sbi->lookup_lock);
+	INIT_LIST_HEAD(&sbi->expiring_list);
 	s->s_blocksize = 1024;
 	s->s_blocksize_bits = 10;
 	s->s_magic = AUTOFS_SUPER_MAGIC;
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index edf5b6b..2e8959c 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -493,10 +493,10 @@ void autofs4_dentry_release(struct dentry *de)
 		struct autofs_sb_info *sbi = autofs4_sbi(de->d_sb);
 
 		if (sbi) {
-			spin_lock(&sbi->rehash_lock);
-			if (!list_empty(&inf->rehash))
-				list_del(&inf->rehash);
-			spin_unlock(&sbi->rehash_lock);
+			spin_lock(&sbi->lookup_lock);
+			if (!list_empty(&inf->expiring))
+				list_del(&inf->expiring);
+			spin_unlock(&sbi->lookup_lock);
 		}
 
 		inf->dentry = NULL;
@@ -518,7 +518,7 @@ static struct dentry_operations autofs4_dentry_operations = {
 	.d_release	= autofs4_dentry_release,
 };
 
-static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
+static struct dentry *autofs4_lookup_expiring(struct autofs_sb_info *sbi, struct dentry *parent, struct qstr *name)
 {
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
@@ -526,14 +526,14 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
 	struct list_head *p, *head;
 
 	spin_lock(&dcache_lock);
-	spin_lock(&sbi->rehash_lock);
-	head = &sbi->rehash_list;
+	spin_lock(&sbi->lookup_lock);
+	head = &sbi->expiring_list;
 	list_for_each(p, head) {
 		struct autofs_info *ino;
 		struct dentry *dentry;
 		struct qstr *qstr;
 
-		ino = list_entry(p, struct autofs_info, rehash);
+		ino = list_entry(p, struct autofs_info, expiring);
 		dentry = ino->dentry;
 
 		spin_lock(&dentry->d_lock);
@@ -555,33 +555,17 @@ static struct dentry *autofs4_lookup_unhashed(struct autofs_sb_info *sbi, struct
 			goto next;
 
 		if (d_unhashed(dentry)) {
-			struct inode *inode = dentry->d_inode;
-
-			ino = autofs4_dentry_ino(dentry);
-			list_del_init(&ino->rehash);
+			list_del_init(&ino->expiring);
 			dget(dentry);
-			/*
-			 * Make the rehashed dentry negative so the VFS
-			 * behaves as it should.
-			 */
-			if (inode) {
-				dentry->d_inode = NULL;
-				list_del_init(&dentry->d_alias);
-				spin_unlock(&dentry->d_lock);
-				spin_unlock(&sbi->rehash_lock);
-				spin_unlock(&dcache_lock);
-				iput(inode);
-				return dentry;
-			}
 			spin_unlock(&dentry->d_lock);
-			spin_unlock(&sbi->rehash_lock);
+			spin_unlock(&sbi->lookup_lock);
 			spin_unlock(&dcache_lock);
 			return dentry;
 		}
 next:
 		spin_unlock(&dentry->d_lock);
 	}
-	spin_unlock(&sbi->rehash_lock);
+	spin_unlock(&sbi->lookup_lock);
 	spin_unlock(&dcache_lock);
 
 	return NULL;
@@ -591,7 +575,7 @@ next:
 static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd)
 {
 	struct autofs_sb_info *sbi;
-	struct dentry *unhashed;
+	struct dentry *expiring;
 	int oz_mode;
 
 	DPRINTK("name = %.*s",
@@ -607,44 +591,40 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 	DPRINTK("pid = %u, pgrp = %u, catatonic = %d, oz_mode = %d",
 		 current->pid, task_pgrp_nr(current), sbi->catatonic, oz_mode);
 
-	unhashed = autofs4_lookup_unhashed(sbi, dentry->d_parent, &dentry->d_name);
-	if (!unhashed) {
-		/*
-		 * Mark the dentry incomplete but don't hash it. We do this
-		 * to serialize our inode creation operations (symlink and
-		 * mkdir) which prevents deadlock during the callback to
-		 * the daemon. Subsequent user space lookups for the same
-		 * dentry are placed on the wait queue while the daemon
-		 * itself is allowed passage unresticted so the create
-		 * operation itself can then hash the dentry. Finally,
-		 * we check for the hashed dentry and return the newly
-		 * hashed dentry.
-		 */
-		dentry->d_op = &autofs4_root_dentry_operations;
-
-		dentry->d_fsdata = NULL;
-		d_instantiate(dentry, NULL);
-	} else {
-		struct autofs_info *ino = autofs4_dentry_ino(unhashed);
-		DPRINTK("rehash %p with %p", dentry, unhashed);
+	expiring = autofs4_lookup_expiring(sbi, dentry->d_parent, &dentry->d_name);
+	if (expiring) {
+		struct autofs_info *ino = autofs4_dentry_ino(expiring);
 		/*
 		 * If we are racing with expire the request might not
 		 * be quite complete but the directory has been removed
 		 * so it must have been successful, so just wait for it.
-		 * We need to ensure the AUTOFS_INF_EXPIRING flag is clear
-		 * before continuing as revalidate may fail when calling
-		 * try_to_fill_dentry (returning EAGAIN) if we don't.
 		 */
 		while (ino && (ino->flags & AUTOFS_INF_EXPIRING)) {
 			DPRINTK("wait for incomplete expire %p name=%.*s",
-				unhashed, unhashed->d_name.len,
-				unhashed->d_name.name);
-			autofs4_wait(sbi, unhashed, NFY_NONE);
+				expiring, expiring->d_name.len,
+				expiring->d_name.name);
+			autofs4_wait(sbi, expiring, NFY_NONE);
 			DPRINTK("request completed");
 		}
-		dentry = unhashed;
+		dput(expiring);
 	}
 
+	/*
+	 * Mark the dentry incomplete but don't hash it. We do this
+	 * to serialize our inode creation operations (symlink and
+	 * mkdir) which prevents deadlock during the callback to
+	 * the daemon. Subsequent user space lookups for the same
+	 * dentry are placed on the wait queue while the daemon
+	 * itself is allowed passage unrestricted so the create
+	 * operation itself can then hash the dentry. Finally,
+	 * we check for the hashed dentry and return the newly
+	 * hashed dentry.
+	 */
+	dentry->d_op = &autofs4_root_dentry_operations;
+
+	dentry->d_fsdata = NULL;
+	d_instantiate(dentry, NULL);
+
 	if (!oz_mode) {
 		spin_lock(&dentry->d_lock);
 		dentry->d_flags |= DCACHE_AUTOFS_PENDING;
@@ -668,8 +648,6 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 			if (sigismember (sigset, SIGKILL) ||
 			    sigismember (sigset, SIGQUIT) ||
 			    sigismember (sigset, SIGINT)) {
-			    if (unhashed)
-				dput(unhashed);
 			    return ERR_PTR(-ERESTARTNOINTR);
 			}
 		}
@@ -699,15 +677,9 @@ static struct dentry *autofs4_lookup(struct inode *dir, struct dentry *dentry, s
 		else
 			dentry = ERR_PTR(-ENOENT);
 
-		if (unhashed)
-			dput(unhashed);
-
 		return dentry;
 	}
 
-	if (unhashed)
-		return dentry;
-
 	return NULL;
 }
 
@@ -769,9 +741,8 @@ static int autofs4_dir_symlink(struct inode *dir,
  * that the file no longer exists. However, doing that means that the
  * VFS layer can turn the dentry into a negative dentry.  We don't want
  * this, because the unlink is probably the result of an expire.
- * We simply d_drop it and add it to a rehash candidates list in the
- * super block, which allows the dentry lookup to reuse it retaining
- * the flags, such as expire in progress, in case we're racing with expire.
+ * We simply d_drop it and add it to an expiring list in the super block,
+ * which allows the dentry lookup to check for an incomplete expire.
  *
  * If a process is blocked on the dentry waiting for the expire to finish,
  * it will invalidate the dentry and try to mount with a new one.
@@ -801,9 +772,9 @@ static int autofs4_dir_unlink(struct inode *dir, struct dentry *dentry)
 	dir->i_mtime = CURRENT_TIME;
 
 	spin_lock(&dcache_lock);
-	spin_lock(&sbi->rehash_lock);
-	list_add(&ino->rehash, &sbi->rehash_list);
-	spin_unlock(&sbi->rehash_lock);
+	spin_lock(&sbi->lookup_lock);
+	list_add(&ino->expiring, &sbi->expiring_list);
+	spin_unlock(&sbi->lookup_lock);
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
@@ -829,9 +800,9 @@ static int autofs4_dir_rmdir(struct inode *dir, struct dentry *dentry)
 		spin_unlock(&dcache_lock);
 		return -ENOTEMPTY;
 	}
-	spin_lock(&sbi->rehash_lock);
-	list_add(&ino->rehash, &sbi->rehash_list);
-	spin_unlock(&sbi->rehash_lock);
+	spin_lock(&sbi->lookup_lock);
+	list_add(&ino->expiring, &sbi->expiring_list);
+	spin_unlock(&sbi->lookup_lock);
 	spin_lock(&dentry->d_lock);
 	__d_drop(dentry);
 	spin_unlock(&dentry->d_lock);
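
The expiring-list idea in the patch above can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct entry`, `struct expiring_list`, `expiring_add()` and `expiring_lookup()` are hypothetical names, a pthread mutex stands in for `sbi->lookup_lock`, and the match criteria (parent, hash, length, name) are an assumption, since the comparison lines are elided from the hunks shown. The sketch only shows the pattern: an entry goes on the list when it is dropped, and a later lookup that finds it takes it off the list and returns it with an extra reference.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dentry: just enough state to match a lookup. */
struct entry {
	struct entry *next;	/* link on the expiring list */
	const void *parent;	/* identifies the parent directory */
	const char *name;
	size_t len;
	unsigned int hash;
	int refcount;
};

struct expiring_list {
	pthread_mutex_t lock;	/* plays the role of sbi->lookup_lock */
	struct entry *head;
};

/* Called when an entry is unlinked or removed, possibly by an expire. */
static void expiring_add(struct expiring_list *l, struct entry *e)
{
	assert(l != NULL && e != NULL);
	pthread_mutex_lock(&l->lock);
	e->next = l->head;
	l->head = e;
	pthread_mutex_unlock(&l->lock);
}

/*
 * Look for a matching entry. On a match, remove it from the list,
 * take a reference and return it; the caller drops the reference
 * when it has finished waiting. Returns NULL if nothing matches.
 */
static struct entry *expiring_lookup(struct expiring_list *l,
				     const void *parent, const char *name,
				     size_t len, unsigned int hash)
{
	struct entry **pp, *e;

	assert(l != NULL && name != NULL);
	pthread_mutex_lock(&l->lock);
	for (pp = &l->head; (e = *pp) != NULL; pp = &e->next) {
		/* Cheap comparisons first, the name comparison last. */
		if (e->hash != hash || e->parent != parent || e->len != len)
			continue;
		if (memcmp(e->name, name, len) != 0)
			continue;
		*pp = e->next;		/* unlink from the list */
		e->next = NULL;
		e->refcount++;		/* analogous to dget() */
		pthread_mutex_unlock(&l->lock);
		return e;
	}
	pthread_mutex_unlock(&l->lock);
	return NULL;
}
```

In the patched autofs4_lookup() the returned dentry is used only to wait for the expire to complete and is then released with dput(); the lookup then carries on with the new dentry rather than reusing the old one.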




* Re: Linux 2.6.26-rc4
  2008-06-12  7:02                                             ` Jesper Krogh
@ 2008-06-12 11:21                                               ` Ian Kent
  0 siblings, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-12 11:21 UTC (permalink / raw)
  To: Jesper Krogh
  Cc: Al Viro, Jeff Moyer, Linus Torvalds, Miklos Szeredi, linux-kernel,
	linux-fsdevel, Andrew Morton


On Thu, 2008-06-12 at 09:02 +0200, Jesper Krogh wrote:
> > On Tue, 2008-06-10 at 14:40 +0800, Ian Kent wrote:
> > The patch below is sufficiently different to the original patch I posted
> > to warrant a replacement rather than a correction.
> 
> I'll patch that in now and see what it gives. (Is it preferred
> to use this --ghost option or not?) What would you suggest?

It is mostly a matter of choice.
If you have a large number of entries in your map then don't use it: it
will result in a performance hit, especially during expires.

Ian




* Re: Linux 2.6.26-rc4
  2008-06-05 22:30                     ` Andrew Morton
  2008-06-06  2:47                       ` Ian Kent
@ 2008-06-27  4:18                       ` Ian Kent
  1 sibling, 0 replies; 89+ messages in thread
From: Ian Kent @ 2008-06-27  4:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: torvalds, viro, miklos, jesper, linux-kernel, linux-fsdevel,
	jmoyer


On Thu, 2008-06-05 at 15:30 -0700, Andrew Morton wrote:
> >  
> > +	ino->u.symlink = cp;
> > +	ino->size = strlen(symname);
> >  	dir->i_mtime = CURRENT_TIME;
> 
> This all seems a bit ungainly.  I assume that on entry to
> autofs4_dir_symlink(), ino->size is equal to strlen(symname)?  If it's
> not, that strcpy() will overrun.
> 
> But if ino->size _is_ equal to strlen(symname) then why did we just
> recalculate the same thing?

Oops.
I've fixed that in my git tree just now.

> 
> I'm suspecting we can zap a lump of code and just do
> 
> 	cp = kstrdup(symname, GFP_KERNEL);
> 
> Anyway, please check that.

Yep, but fix now, refactor later.

Ian
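
For readers following the kstrdup() suggestion quoted above, here is a minimal userspace sketch of the point being made. It is not the autofs4 code: `dup_string()`, `struct link_info` and `set_symlink()` are hypothetical names, and malloc() stands in for the kernel allocator. The idea is that a duplicating helper sizes the buffer from the string it is about to copy, so the copy cannot overrun, and the stored length is computed once from the same string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of kstrdup(): the buffer size is
 * derived from the source string itself, never from a separately
 * tracked length that might disagree with it.
 */
static char *dup_string(const char *s)
{
	size_t len;
	char *copy;

	assert(s != NULL);
	len = strlen(s);
	copy = malloc(len + 1);
	if (copy == NULL)
		return NULL;
	memcpy(copy, s, len + 1);	/* includes the terminating NUL */
	return copy;
}

/* Hypothetical stand-in for the symlink fields kept per inode. */
struct link_info {
	char *target;
	size_t size;
};

/* Replace the stored target; returns 0 on success, -1 on allocation failure. */
static int set_symlink(struct link_info *info, const char *symname)
{
	char *cp = dup_string(symname);

	if (cp == NULL)
		return -1;
	free(info->target);		/* free(NULL) is a no-op */
	info->target = cp;
	info->size = strlen(cp);	/* one length, from the copied string */
	return 0;
}
```

With this shape there is no window in which the recorded size and the copied string can disagree, which is the overrun concern raised in the quoted review.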




end of thread, other threads:[~2008-06-27  4:21 UTC | newest]

Thread overview: 89+ messages
2008-05-26 18:41 Linux 2.6.26-rc4 Linus Torvalds
2008-05-26 21:24 ` Jesper Krogh
2008-05-26 21:42   ` Linus Torvalds
2008-05-27  0:25     ` Arjan van de Ven
2008-05-27  0:31       ` Arjan van de Ven
2008-05-27  5:43       ` David Woodhouse
2008-05-27  6:00         ` Arjan van de Ven
2008-05-27  6:24           ` David Woodhouse
2008-05-27  1:16     ` Carl-Daniel Hailfinger
2008-05-27  1:23       ` Carl-Daniel Hailfinger
2008-05-27  1:52         ` Abhijit Menon-Sen
2008-05-27  5:19           ` Jesper Krogh
2008-05-27  5:31           ` [MTD] [MAPS] ck804rom: fix driver_data in probe table David Woodhouse
2008-05-27  5:31           ` Linux 2.6.26-rc4 David Woodhouse
2008-05-27 10:35       ` Jeff Garzik
2008-05-27 10:53         ` Carl-Daniel Hailfinger
2008-05-27 10:54           ` Jeff Garzik
2008-05-27 10:58             ` Carl-Daniel Hailfinger
2008-05-27  5:23 ` 2.6.26-rc4: RIP find_pid_ns+0x6b/0xa0 Alexey Dobriyan
2008-05-27  9:06   ` Oleg Nesterov
2008-05-27 15:03     ` Linus Torvalds
2008-05-27 15:40       ` Paul E. McKenney
2008-05-27 16:11         ` Linus Torvalds
2008-05-27 17:06           ` Paul E. McKenney
2008-05-28  5:01             ` Paul E. McKenney
2008-05-28  7:26               ` Paul E. McKenney
2008-05-27 16:45       ` Oleg Nesterov
2008-05-27 17:37         ` Oleg Nesterov
2008-05-27 21:26           ` Alexey Dobriyan
2008-05-27 10:01 ` Linux 2.6.26-rc4 J.A. Magallón
2008-05-28 23:59   ` Bill Davidsen
     [not found] ` <20080527124315.131b1343@Varda>
2008-05-28 20:10   ` Linus Torvalds
2008-05-28 20:17     ` Johannes Berg
2008-05-28 21:48       ` John W. Linville
2008-06-03  9:49 ` Jesper Krogh
2008-06-03  9:57   ` Al Viro
2008-06-03 10:04     ` Jesper Krogh
2008-06-03 10:13       ` Miklos Szeredi
2008-06-03 10:37         ` Miklos Szeredi
2008-06-03 10:48           ` Al Viro
2008-06-03 13:31             ` Ian Kent
2008-06-03 13:32               ` Ian Kent
2008-06-03 10:40         ` Al Viro
2008-06-03 10:45           ` Miklos Szeredi
2008-06-03 10:52             ` Al Viro
2008-06-03 13:27               ` Ian Kent
2008-06-03 15:01                 ` Linus Torvalds
2008-06-03 16:07                   ` Ian Kent
2008-06-03 16:35                     ` Linus Torvalds
2008-06-03 16:41                       ` Al Viro
2008-06-03 16:50                         ` Al Viro
2008-06-03 17:28                           ` Ian Kent
2008-06-03 17:41                             ` Al Viro
2008-06-03 17:41                               ` Ian Kent
2008-06-03 17:50                                 ` Al Viro
2008-06-03 17:49                                   ` Ian Kent
2008-06-03 16:59                         ` Linus Torvalds
2008-06-03 17:30                           ` Ian Kent
2008-06-03 17:13                       ` Ian Kent
2008-06-03 17:30                         ` Al Viro
2008-06-03 17:38                           ` Ian Kent
2008-06-03 17:46                           ` Jeff Moyer
2008-06-03 19:18                             ` Al Viro
2008-06-03 19:53                               ` Jeff Moyer
2008-06-03 23:00                                 ` Al Viro
2008-06-04  2:42                                   ` Ian Kent
2008-06-04  5:34                                     ` Miklos Szeredi
2008-06-04  5:41                                       ` Ian Kent
2008-06-10  4:57                                     ` Ian Kent
2008-06-10  6:28                                       ` Jesper Krogh
2008-06-10  6:40                                         ` Ian Kent
2008-06-10  9:09                                           ` Ian Kent
2008-06-12  3:03                                           ` Ian Kent
2008-06-12  7:02                                             ` Jesper Krogh
2008-06-12 11:21                                               ` Ian Kent
2008-06-12 11:19                                             ` Ian Kent
2008-06-04  1:36                               ` Ian Kent
2008-06-05  7:31                   ` Ian Kent
2008-06-05 21:29                     ` Linus Torvalds
2008-06-05 21:34                       ` Jesper Krogh
2008-06-06  2:39                       ` Ian Kent
2008-06-05 22:30                     ` Andrew Morton
2008-06-06  2:47                       ` Ian Kent
2008-06-27  4:18                       ` Ian Kent
2008-06-06  6:23                     ` Jesper Krogh
2008-06-06  8:21                       ` Ian Kent
2008-06-06  8:25                         ` Ian Kent
2008-06-03 10:35     ` Al Viro
2008-06-04 17:51 ` Jesper Krogh
