* Re: [PATCH] Assign CPUs to nodes in round-robin manner on Fake NUMA.
From: Nikanth Karthikesan @ 2010-10-08  5:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, x86, Yinghai Lu, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner
In-Reply-To: <20101007004252.599b6888.akpm@linux-foundation.org>

On Thursday 07 October 2010 13:12:52 Andrew Morton wrote:
> On Thu, 30 Sep 2010 17:34:10 +0530 Nikanth Karthikesan <knikanth@suse.de> wrote:
> > Assign CPUs to nodes in round-robin manner on Fake NUMA.
> >
> > commit d9c2d5ac6af87b4491bff107113aaf16f6c2b2d9
> > "x86, numa: Use near(er) online node instead of roundrobin for NUMA"
> > changed NUMA initialization on Intel to choose the nearest online node or
> > first node. Fake NUMA would be better off with round-robin initialization,
> > instead of all the CPUs on the first node. Change the choice of first node
> > back to round-robin.
> 
> Why would fake NUMA "be better off with round-robin initialization"?
> 

For testing NUMA kernel behaviour without cpusets and NUMA-aware applications,
it would be better to have CPUs in different nodes, rather than all in a
single node. With cpusets, task-migration scenarios cannot be tested.

I guess having it round-robin shouldn't affect the use cases for all cpus on 
the first node.

From the code comments in arch/x86/mm/numa_64.c:759, it looks like this used to
be the case until commit d9c2d5ac6 changed it from round-robin to nearer or
first node. I couldn't find any reason for this change in its changelog.
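
As a rough illustration of what round-robin placement means here, the
userspace sketch below spreads CPU indices across a list of online nodes.
It is not the kernel's code; assign_round_robin() and its parameters are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace sketch (not kernel code): give each CPU the next
 * online node in turn, wrapping around when the node list is exhausted.
 */
void assign_round_robin(int *cpu_node, size_t nr_cpus,
			const int *online_nodes, size_t nr_nodes)
{
	size_t cpu;

	assert(nr_nodes > 0);
	for (cpu = 0; cpu < nr_cpus; cpu++)
		cpu_node[cpu] = online_nodes[cpu % nr_nodes];
}
```

With four CPUs and two fake nodes, this would place CPUs 0 and 2 on the
first node and CPUs 1 and 3 on the second, instead of all four on one node.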

Thanks
Nikanth

> > ---
> >
> > diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> > index 85f69cd..47dd171 100644
> > --- a/arch/x86/kernel/cpu/intel.c
> > +++ b/arch/x86/kernel/cpu/intel.c
> > @@ -283,9 +283,7 @@ static void __cpuinit srat_detect_node(struct cpuinfo_x86 *c)
> >  	/* Don't do the funky fallback heuristics the AMD version
> >  	   employs for now. */
> >  	node = apicid_to_node[apicid];
> > -	if (node == NUMA_NO_NODE)
> > -		node = first_node(node_online_map);
> > -	else if (!node_online(node)) {
> > +	if (node == NUMA_NO_NODE || !node_online(node)) {
> >  		/* reuse the value from init_cpu_to_node() */
> >  		node = cpu_to_node(cpu);
> >  	}
> 


* Re: Raid 6 - TLER/CCTL/ERC
From: Stefan /*St0fF*/ Hübner @ 2010-10-08  5:47 UTC (permalink / raw)
  To: Lemur Kryptering; +Cc: Linux RAID, philip
In-Reply-To: <3870309.421286406668656.JavaMail.SYSTEM@ninja>

On 07.10.2010 01:11, Lemur Kryptering wrote:
> 
> [...]
> 
> That sounds exactly like what I'm seeing in the logs -- the sector initially reported as bad is indeed unreadable via dd. All of the subsequent problems reported in other sectors aren't actually problems when I check on them at a later point. Couldn't this be worked around by exposing whatever timeouts there are in mdraid to something that could be adjusted in /sys?
> 
>>
>> There's nothing you can do about this vicious circle except either
>> enabling ERC or using Raid-Edition disk (which have ERC enabled by
>> default).

I must say yesterday we had our first Hitachi UltraStar Drives - which
are supposed to be Raid-Edition.  They didn't have ERC enabled.  I'll
ask Hitachi about that today.
>>
> 
> I tried connecting the drives directly to my motherboard (my controller didn't seem to want to let me pass the SMART ERC commands to the drives). The ERC commands took, insofar as I was able to read back the values I had set. This didn't seem to help much with the issues I was having, however.

Which wouldn't work, as the SCT ERC settings are volatile.  I.e.:
they're gone after a power cycle.

> 
> Lesson-learned on the non-raid edition disks. I would have spent the extra to avoid all this headache, but am now stuck with these things. I realize that not fixing the problem at the core (the drives themselves), essentially puts the burden on mdraid (which would be forced to block for a ridiculous amount of time waiting for the drive instead of just kicking it), however, in my particular case, this sort of delay would not be a cause for concern.
> 
> Would someone be able to nudge me in the right direction as far as where the logic that handles this is located?
> 
>> [...]
>>> #!/bin/bash
>>> #
>>> for x in /sys/block/md*/md/sync_action ; do
>>>         echo repair >$x
>>> done
>>>[...]

That is probably the only thing you can try, as it does indeed try to
reconstruct the sector from the redundancy.  But I'd try it with ERC
enabled.  Maybe you'll find a way to make this work (e.g. by moving the
whole RAID to the other computer).

Stefan


* Re: Problem with Infiniband adapter on IBM p550
From: Benjamin Herrenschmidt @ 2010-10-08  5:45 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: paulus, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <1286516470.2463.403.camel@pasglop>


> Ok, so from what I can tell, the driver is unhappy because either BAR 0
> hasn't been assigned a memory resource or the size doesn't match what
> the driver expects.
> 
Ooops, accidentally sent too quickly...

From your OF log I see:

reg                     00c10000 00000000 00000000  00000000 00000000 
                        03c10010 00000000 00000000  00000000 00100000 
                        43c10018 00000000 00000000  00000000 00800000 
                        43c10020 00000000 00000000  00000000 08000000 
assigned-addresses      83c10020 00000000 e8000000  00000000 08000000 

Now, I think this is the problem.

The "assigned-addresses" property seems to indicate that the firmware only
assigned BAR 4 and didn't assign anything to the other ones.

I don't know why, but it definitely looks like a firmware bug to me. On those
machines, PCI resource assignment is under hypervisor control and so Linux
cannot re-assign missing resources itself.
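
To make the reading of those cells easier to follow, here is a small
userspace sketch that decodes the first cell of each entry. It assumes the
standard Open Firmware PCI bus binding layout (npt000ss bbbbbbbb dddddfff
rrrrrrrr); struct of_pci_addr, of_pci_decode() and of_pci_bar_index() are
made-up names for this example, not existing kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the first ("phys.hi") cell of an Open Firmware PCI address. */
struct of_pci_addr {
	unsigned int space;        /* ss: 0=config, 1=I/O, 2=32-bit mem, 3=64-bit mem */
	unsigned int prefetchable; /* p bit */
	unsigned int bus;
	unsigned int dev;
	unsigned int fn;
	unsigned int reg;          /* config-space register offset */
};

struct of_pci_addr of_pci_decode(uint32_t phys_hi)
{
	struct of_pci_addr a;

	a.space        = (phys_hi >> 24) & 0x3;
	a.prefetchable = (phys_hi >> 30) & 0x1;
	a.bus          = (phys_hi >> 16) & 0xff;
	a.dev          = (phys_hi >> 11) & 0x1f;
	a.fn           = (phys_hi >> 8) & 0x7;
	a.reg          = phys_hi & 0xff;
	return a;
}

/* Map a BAR register offset (0x10..0x24) to a BAR index, or -1 otherwise. */
int of_pci_bar_index(uint32_t phys_hi)
{
	unsigned int reg = phys_hi & 0xff;

	if (reg < 0x10 || reg > 0x24 || (reg & 0x3))
		return -1;
	return (int)((reg - 0x10) / 4);
}
```

Under that layout, 83c10020 decodes to register offset 0x20 on bus c1,
which is BAR 4, while the "reg" entries 03c10010 and 43c10018 refer to
BAR 0 and BAR 2.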

I'll see if I can find a FW person to shed some light on this.

Can you provide me (privately maybe) with the FW version on the machine?

Cheers,
Ben.


* Re: testing 2010-10-04
From: Stefan Schmidt @ 2010-10-08  5:44 UTC (permalink / raw)
  To: openembedded-devel
In-Reply-To: <4CAE19C1.4060406@telus.net>

Hello.

On Thu, 2010-10-07 at 12:04, Dallas Foley wrote:
> On 10-10-07 11:26 AM, Khem Raj wrote:
> >
> >what would be your sign off? your email does not reveal your full name :)
> 
> No need to include me in the sign off.

Too late :)

> I believe I've corrected my full name now.

Thanks. Good for the next time.

regards
Stefan Schmidt




* Re: Problem with Infiniband adapter on IBM p550
From: Benjamin Herrenschmidt @ 2010-10-08  5:41 UTC (permalink / raw)
  To: Patrick Finnegan; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <201010072324.33062.pat@computer-refuge.org>

On Thu, 2010-10-07 at 23:24 -0400, Patrick Finnegan wrote:
> I seem to be running into a problem getting a Mellanox Infinihost  
> Infiniband adapter working on my IBM p550 (a 9113-550).  I'm using 
> Debian squeeze, and tried upgrading to the 2.6.35.7 kernel without any 
> help.
> 
> I get the following messages in dmesg:
> [    4.972548] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> [    4.972564] ib_mthca: Initializing 0000:c1:00.0
> [    4.972674] ib_mthca 0000:c1:00.0: Missing DCS, aborting.

Ok, so from what I can tell, the driver is unhappy because either BAR 0
hasn't been assigned a memory resource or the size doesn't match what
the driver expects.

Let's see...

> The problem looks the same as a problem I ran into with OpenFirmware on 
> a Sun V880, which was fixed with this patch by Dave Miller:
> http://ns3.spinics.net/lists/linux-rdma/msg01779.html
> 
> I spent some time looking at the equivalent function on powerpc, but 
> didn't find a block of code that looked similar.

I don't think we are hitting the same problem. I believe our code in
that area differs enough.

In your lspci, however, I see:

	Memory at <unassigned> (64-bit, non-prefetchable)
	Memory at <unassigned> (64-bit, prefetchable)

Which doesn't look good...

From your OF log

> Any suggestions?
> 
> I have dmesg, the dev .properties from openfirmware, and lspci -v from 
> the machine:
> 
> http://ned.rcac.purdue.edu/p550-ib/dmesg
> http://ned.rcac.purdue.edu/p550-ib/ib-of-device
> http://ned.rcac.purdue.edu/p550-ib/lspci-v
> 
> Pat


* [tip:x86/setup] x86, setup: Use string copy operation to optimize copy in kernel compression
From: tip-bot for Zhao Yakui @ 2010-10-08  5:40 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, yakui.zhao, tglx
In-Reply-To: <1286502453-7043-1-git-send-email-yakui.zhao@intel.com>

Commit-ID:  68f4d5a00adaab33b136fce2c72d5c377b39b0b0
Gitweb:     http://git.kernel.org/tip/68f4d5a00adaab33b136fce2c72d5c377b39b0b0
Author:     Zhao Yakui <yakui.zhao@intel.com>
AuthorDate: Fri, 8 Oct 2010 09:47:33 +0800
Committer:  H. Peter Anvin <hpa@zytor.com>
CommitDate: Thu, 7 Oct 2010 21:23:09 -0700

x86, setup: Use string copy operation to optimize copy in kernel compression

The kernel decompression code parses the ELF header and then copies
the segment to the corresponding destination.  Currently it uses slow
byte-copy code.  This patch makes it use the string copy operations
instead.

In testing, the string copy operations significantly improve copy
performance:
	1. The copy time is reduced from 150ms to 20ms on one Atom machine.
	2. The copy time is reduced by about 80% on another machine:
		from 7ms to 1.5ms with a 32-bit kernel,
		from 10ms to 2ms with a 64-bit kernel.
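
For readers who do not read inline assembly, the split into whole words
plus a byte tail can be sketched in portable C as below. This is only an
illustration of the technique, not the code in the patch (which uses
rep movsl/movsq and rep movsb); copy_words_then_bytes() is a hypothetical
name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Portable sketch of the split used by the patch: copy n / sizeof(long)
 * whole words first, then the n % sizeof(long) leftover bytes.
 */
void *copy_words_then_bytes(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;
	size_t words = n / sizeof(unsigned long);
	size_t tail = n % sizeof(unsigned long);
	size_t i;

	assert(n == 0 || (dest != NULL && src != NULL));

	for (i = 0; i < words; i++) {
		unsigned long w;

		/* fixed-size memcpy avoids unaligned or aliasing accesses */
		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
	}
	/* leftover bytes, fewer than one word */
	for (i = 0; i < tail; i++)
		d[i] = s[i];

	return dest;
}
```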

Signed-off-by: Zhao Yakui <yakui.zhao@intel.com>
LKML-Reference: <1286502453-7043-1-git-send-email-yakui.zhao@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
---
 arch/x86/boot/compressed/misc.c |   29 +++++++++++++++++++++++------
 1 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c
index 8f7bef8..23f315c 100644
--- a/arch/x86/boot/compressed/misc.c
+++ b/arch/x86/boot/compressed/misc.c
@@ -229,18 +229,35 @@ void *memset(void *s, int c, size_t n)
 		ss[i] = c;
 	return s;
 }
-
+#ifdef CONFIG_X86_32
 void *memcpy(void *dest, const void *src, size_t n)
 {
-	int i;
-	const char *s = src;
-	char *d = dest;
+	int d0, d1, d2;
+	asm volatile(
+		"rep ; movsl\n\t"
+		"movl %4,%%ecx\n\t"
+		"rep ; movsb\n\t"
+		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
+		: "0" (n >> 2), "g" (n & 3), "1" (dest), "2" (src)
+		: "memory");
 
-	for (i = 0; i < n; i++)
-		d[i] = s[i];
 	return dest;
 }
+#else
+void *memcpy(void *dest, const void *src, size_t n)
+{
+	long d0, d1, d2;
+	asm volatile(
+		"rep ; movsq\n\t"
+		"movq %4,%%rcx\n\t"
+		"rep ; movsb\n\t"
+		: "=&c" (d0), "=&D" (d1), "=&S" (d2)
+		: "0" (n >> 3), "g" (n & 7), "1" (dest), "2" (src)
+		: "memory");
 
+	return dest;
+}
+#endif
 
 static void error(char *x)
 {


* [PATCH] sd: Export effective protection mode in sysfs
From: Martin K. Petersen @ 2010-10-08  5:36 UTC (permalink / raw)
  To: James.Bottomley; +Cc: linux-scsi


Create a sysfs entry that reports the negotiated DIX/DIF protection mode
for a SCSI disk. This depends on the protection type the disk is
formatted with as well as the protection capabilities advertised by the
controller.
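
As a reading aid, the string selection done by the new attribute can be
sketched in userspace as follows. The dif, dix and host_dix_type0
arguments are plain stand-ins for the results of the host capability
checks in the patch, and format_protection_mode() is a hypothetical name;
the exact values those checks return are an assumption here.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace sketch of the string chosen by sd_show_protection_mode().
 * dif and dix stand in for the host capability results for the disk's
 * protection type, host_dix_type0 for the DIX Type 0 capability check.
 */
int format_protection_mode(char *buf, size_t len, unsigned int dif,
			   unsigned int dix, int host_dix_type0)
{
	assert(buf != NULL && len > 0);

	/* DIX between host and controller only, no DIF on the wire */
	if (!dix && host_dix_type0) {
		dif = 0;
		dix = 1;
	}

	if (!dif && !dix)
		return snprintf(buf, len, "none\n");

	return snprintf(buf, len, "%s%u\n", dix ? "dix" : "dif", dif);
}
```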

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

---

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c158e61..ca0ffc3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -259,6 +259,28 @@ sd_show_protection_type(struct device *dev, struct device_attribute *attr,
 }
 
 static ssize_t
+sd_show_protection_mode(struct device *dev, struct device_attribute *attr,
+			char *buf)
+{
+	struct scsi_disk *sdkp = to_scsi_disk(dev);
+	struct scsi_device *sdp = sdkp->device;
+	unsigned int dif, dix;
+
+	dif = scsi_host_dif_capable(sdp->host, sdkp->protection_type);
+	dix = scsi_host_dix_capable(sdp->host, sdkp->protection_type);
+
+	if (!dix && scsi_host_dix_capable(sdp->host, SD_DIF_TYPE0_PROTECTION)) {
+		dif = 0;
+		dix = 1;
+	}
+
+	if (!dif && !dix)
+		return snprintf(buf, 20, "none\n");
+
+	return snprintf(buf, 20, "%s%u\n", dix ? "dix" : "dif", dif);
+}
+
+static ssize_t
 sd_show_app_tag_own(struct device *dev, struct device_attribute *attr,
 		    char *buf)
 {
@@ -285,6 +307,7 @@ static struct device_attribute sd_disk_attrs[] = {
 	__ATTR(manage_start_stop, S_IRUGO|S_IWUSR, sd_show_manage_start_stop,
 	       sd_store_manage_start_stop),
 	__ATTR(protection_type, S_IRUGO, sd_show_protection_type, NULL),
+	__ATTR(protection_mode, S_IRUGO, sd_show_protection_mode, NULL),
 	__ATTR(app_tag_own, S_IRUGO, sd_show_app_tag_own, NULL),
 	__ATTR(thin_provisioning, S_IRUGO, sd_show_thin_provisioning, NULL),
 	__ATTR_NULL,


* RE: [PATCH v5 3/3] OMAP: DSS2: OMAPFB: Allow usage of def_vrfb only for omap2,3
From: Guruswamy, Senthilvadivu @ 2010-10-08  5:36 UTC (permalink / raw)
  To: Tomi Valkeinen; +Cc: Hiremath, Vaibhav, linux-omap@vger.kernel.org
In-Reply-To: <1286456423.28499.1057.camel@tubuntu.research.nokia.com>



> -----Original Message-----
> From: Tomi Valkeinen [mailto:tomi.valkeinen@nokia.com]
> Sent: Thursday, October 07, 2010 6:30 PM
> To: Guruswamy, Senthilvadivu
> Cc: Hiremath, Vaibhav; linux-omap@vger.kernel.org
> Subject: Re: [PATCH v5 3/3] OMAP: DSS2: OMAPFB: Allow usage of def_vrfb
> only for omap2,3
> 
> On Thu, 2010-10-07 at 13:15 +0200, ext Guruswamy Senthilvadivu wrote:
> > From: Senthilvadivu Guruswamy <svadivu@ti.com>
> >
> > For Non-VRFB devices/platforms (omap2, omap3 family) force it to the
> > DMA based rotation.
> 
> This sounds a bit confusing, as if omap2 and omap3 are Non-VRFB devices.
> 
> And I'm not sure it's exactly correct to say "forcing to DMA rotation".
> We're just disallowing the use of VRFB, not forcing to use DMA rotation.
> Of course DMA rotation is currently the only other option, but still.
> 
> I'd put it:
> 
> VRFB is supported on OMAP2 and OMAP3 platforms. If VRFB rotation is not
> supported by the hardware and the user requests VRFB rotation, print a
> warning and ignore the request from the user.
> 
> > Signed-off-by: Senthilvadivu Guruswamy <svadivu@ti.com>
> > ---
> >  drivers/video/omap2/omapfb/omapfb-main.c |   10 ++++++++++
> >  1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/video/omap2/omapfb/omapfb-main.c b/drivers/video/omap2/omapfb/omapfb-main.c
> > index bddfca6..fcd9038 100644
> > --- a/drivers/video/omap2/omapfb/omapfb-main.c
> > +++ b/drivers/video/omap2/omapfb/omapfb-main.c
> > @@ -2198,6 +2198,16 @@ static int omapfb_probe(struct platform_device *pdev)
> >  		goto err0;
> >  	}
> >
> > +	/* TODO : Replace cpu check with omap_has_vrfb once HAS_FEATURE
> > +	*	 available for OMAP2 and OMAP3
> > +	*/
> > +	if (def_vrfb && (!cpu_is_omap24xx()) && (!cpu_is_omap34xx())) {
> 
> The parentheses around the !cpu_is_xxxx calls are unnecessary.
> 
> > +		def_vrfb = 0;
> > +		dev_warn(&pdev->dev, "VRFB is not in this device,"
> > +				"using DMA for rotation\n");
> 
> How about: "VRFB is not supported by the hardware, ignoring vrfb=y
> module parameter".
> 
> Otherwise I think the patch set is ok. If you're fine with these
> changes, I can make them while applying these to my tree. Or send a new
> patch set, both are fine for me.
> 
[Senthil]  Oops! I missed the words "non omap2/3" in the commit description.
I will send v6 with this detailed description, so that you can pull it straight away.
>  Tomi
> 



* Re: [PATCH 3/5] mips: sanitize restart logics
From: Al Viro @ 2010-10-08  5:36 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: David Daney, Al Viro, ralf, linux-kernel, linux-arch,
	Maciej W. Rozycki
In-Reply-To: <alpine.DEB.1.10.1009300218380.25860@tp.orcam.me.uk>

On Thu, Sep 30, 2010 at 02:50:17AM +0100, Maciej W. Rozycki wrote:
> On Wed, 29 Sep 2010, David Daney wrote:
>  Not exactly.  These GNU C library functions rely on the magic value of 
> "1" there to recognise contexts they created themselves and which must 
> therefore be handled by themselves internally (these contexts are not 
> complete and only preserve the call-saved registers as specified by the 
> respective MIPS ABIs, and are therefore unsafe to be passed to the 
> rt_sigreturn(2) syscall).  All the other values, including of course "0", 
> are not treated specially and the context is passed to rt_sigreturn(2) as 
> usual.  This only matters in cases where e.g. setcontext(3) is used to 
> exit from or return to a signal handler.

Nothing has changed in that respect; setup_sigcontext() (and its counterparts)
do _not_ use regs->regs[0].  Note
        err |= __put_user(0, &sc->sc_regs[0]);
        for (i = 1; i < 32; i++)
                err |= __put_user(regs->regs[i], &sc->sc_regs[i]);
in there.  The whole point of ->regs[0] uses (both original and modified)
is that $0 is constant 0 and thus the kernel is free to use that member
of pt_regs to indicate that syscall restart might be needed.  So's libc,
for that matter (to distinguish between sigreturn and setcontext ones).
When sigframe is created we still discard the value - the fragment above
is not modified at all.

BTW, with original code regs->regs[0] *can* be 1, if you are leaving syscall
with -EINVAL.  It won't reach the userland, though.


* [LTP] POSIX "aio_return/2-1" failed.
From: Mitani @ 2010-10-08  5:29 UTC (permalink / raw)
  To: ltp-list

Hi,


"conformance/interfaces/aio_return/2-1" failed with following message:
------------
conformance/interfaces/aio_return/2-1: execution: FAILED: Output: 
aio_return/2-1.c Second call to aio_return() should return -1 : 111
------------

Same problem occurs in "aio_return/3-2.c" and "aio_return/4-1".

Environment is as follows:
  - RHEL5.5 --- (x86, x86_64, ia64)
  - RHEL4.8 --- (x86, x86_64, ia64)


This test set appears to be the error-path test for "aio_return()".

The first "aio_return()" returned 111, which is the length of the
buffer passed to "aio_write()".
But the second "aio_return()" also returned 111, which this test set
does not expect.
Therefore, this test failed.

Man page says that 
"This function should be called only once for any given request,  after
aio_error(2) returns something other than EINPROGRESS.":
------------
# LANG=C man 3 aio_return

AIO_RETURN(3)                  Linux Programmer's Manual                  AIO_RETURN(3)

NAME
       aio_return - get return status of asynchronous I/O operation

SYNOPSIS
       #include <aio.h>

       ssize_t aio_return(struct aiocb *aiocbp);

DESCRIPTION
       The aio_return function returns the final return status for the asynchronous I/O
       request with control block pointed to by aiocbp.

       This  function  should  be  called  only  once  for  any  given request,  after
       aio_error(2) returns something other than EINPROGRESS.

RETURN VALUE
       If the asynchronous I/O operation has completed, this function returns the value
       that would have been returned in case of a synchronous  read,  write, or  fsync
       request.  Otherwise the return value is undefined.  On error, the error value is
       returned.

ERRORS
       EINVAL aiocbp does not point at a control block for an asynchronous I/O  request
              of which the return status has not been retrieved yet.

CONFORMING TO
       POSIX 1003.1-2003

SEE ALSO
       aio_cancel(3),    aio_error(3),   aio_fsync(3),   aio_read(3), aio_suspend(3),
       aio_write(3)

                                       2003-11-14                         AIO_RETURN(3)
#
------------

And, it says that 
"If the asynchronous I/O operation has completed, this function returns
the value that would have been returned in case of a synchronous read,
write, or fsync request.  Otherwise the return value is undefined.  On
error, the error value is returned.".


So the return value of the second "aio_return()" can be considered
undefined.
Therefore, I think it is not a mistake for the second "aio_return()" to
return the same value as the successful first call.
I also think that "UNTESTED" or "UNRESOLVED" would be more appropriate
for this test.

How does that look?


Regards--

-Tomonori Mitani





* [Bug 19702] i5-450M CPU gets stuck in low/lowest state
From: bugzilla-daemon @ 2010-10-08  5:26 UTC (permalink / raw)
  To: cpufreq
In-Reply-To: <bug-19702-12968@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=19702


Zhang Rui <rui.zhang@intel.com> changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
                 CC|                                          |acpi-bugzilla@lists.sourceforge.net,
                   |                                          |lenb@kernel.org, rui.zhang@intel.com
         AssignedTo|acpi_power-processor@kernel-bugs.osdl.org |trenn@suse.de






* [PATCH 04/18] fs: Implement lazy LRU updates for inodes.
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Nick Piggin <npiggin@suse.de>

Convert the inode LRU to use lazy updates to reduce lock and
cacheline traffic.  We avoid moving inodes around in the LRU list
during iget/iput operations so these frequent operations don't need
to access the LRUs. Instead, we defer the refcount checks to
reclaim-time and use a per-inode state flag, I_REFERENCED, to tell
reclaim that iget has touched the inode in the past. This means that
only reclaim should be touching the LRU with any frequency, hence
significantly reducing lock acquisitions and the amount of contention
on LRU updates.

This also removes the inode_in_use list, which means we now only
have one list for tracking the inode LRU status. This makes it much
simpler to split out the LRU list operations under its own lock.
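
To illustrate the idea outside the kernel, the sketch below models one
reclaim pass with a per-entry "referenced" flag. It is a simplified
userspace model, not the patch itself: a flat array stands in for the LRU
list, and sk_inode, SK_REFERENCED and sk_prune() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define SK_REFERENCED 0x1	/* stand-in for I_REFERENCED */

struct sk_inode {
	int count;		/* stand-in for i_count */
	unsigned int state;
	int on_lru;
};

/*
 * One reclaim pass over a flat array standing in for the LRU list.
 * Returns how many entries were reclaimed.
 */
size_t sk_prune(struct sk_inode *lru, size_t n)
{
	size_t i, freed = 0;

	assert(lru != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct sk_inode *inode = &lru[i];

		if (!inode->on_lru)
			continue;
		if (inode->count > 0) {
			/* in use: lazily drop it from the LRU */
			inode->on_lru = 0;
			continue;
		}
		if (inode->state & SK_REFERENCED) {
			/* touched since the last pass: one more round */
			inode->state &= ~SK_REFERENCED;
			continue;
		}
		inode->on_lru = 0;
		freed++;
	}
	return freed;
}
```

In this model the get/put paths never touch the list; only the reclaim
pass does, which is where the reduction in lock traffic comes from.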

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c         |    9 +------
 fs/inode.c                |   56 +++++++++++++++++++++++++++++----------------
 include/linux/fs.h        |   13 +++++-----
 include/linux/writeback.h |    1 -
 4 files changed, 44 insertions(+), 35 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3209aff..2a61300 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -408,15 +408,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * completion.
 			 */
 			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
-			/*
-			 * The inode is clean, inuse
-			 */
-			list_move(&inode->i_list, &inode_in_use);
 		} else {
-			/*
-			 * The inode is clean, unused
-			 */
+			/* The inode is clean */
 			list_move(&inode->i_list, &inode_unused);
 		}
 	}
diff --git a/fs/inode.c b/fs/inode.c
index 22ef3f1..e76d398 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -72,7 +72,6 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_in_use);
 LIST_HEAD(inode_unused);
 static struct hlist_head *inode_hashtable __read_mostly;
 
@@ -291,6 +290,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
+	INIT_LIST_HEAD(&inode->i_list);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -317,12 +317,7 @@ static void init_once(void *foo)
  */
 void __iget(struct inode *inode)
 {
-	if (atomic_inc_return(&inode->i_count) != 1)
-		return;
-
-	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-		list_move(&inode->i_list, &inode_in_use);
-	percpu_counter_dec(&nr_inodes_unused);
+	atomic_inc(&inode->i_count);
 }
 
 void end_writeback(struct inode *inode)
@@ -367,7 +362,7 @@ static void dispose_list(struct list_head *head)
 		struct inode *inode;
 
 		inode = list_first_entry(head, struct inode, i_list);
-		list_del(&inode->i_list);
+		list_del_init(&inode->i_list);
 
 		evict(inode);
 
@@ -489,8 +484,15 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (atomic_read(&inode->i_count) ||
+		    (inode->i_state & ~I_REFERENCED)) {
+			list_del_init(&inode->i_list);
+			percpu_counter_dec(&nr_inodes_unused);
+			continue;
+		}
+		if (inode->i_state & I_REFERENCED) {
 			list_move(&inode->i_list, &inode_unused);
+			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -502,11 +504,15 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 
-			if (inode != list_entry(inode_unused.next,
-						struct inode, i_list))
-				continue;	/* wrong inode or list_empty */
-			if (!can_unuse(inode))
+			/*
+			 * if we can't reclaim this inode immediately, give it
+			 * another pass through the free list so we don't spin
+			 * on it.
+			 */
+			if (!can_unuse(inode)) {
+				list_move(&inode->i_list, &inode_unused);
 				continue;
+			}
 		}
 		list_move(&inode->i_list, &freeable);
 		WARN_ON(inode->i_state & I_NEW);
@@ -621,7 +627,6 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct hlist_head *head,
 			struct inode *inode)
 {
-	list_add(&inode->i_list, &inode_in_use);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
 	if (head)
 		hlist_add_head(&inode->i_hash, head);
@@ -1238,10 +1243,12 @@ static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-			list_move(&inode->i_list, &inode_unused);
-		percpu_counter_inc(&nr_inodes_unused);
 		if (sb->s_flags & MS_ACTIVE) {
+			inode->i_state |= I_REFERENCED;
+			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
+				list_move(&inode->i_list, &inode_unused);
+				percpu_counter_inc(&nr_inodes_unused);
+			}
 			spin_unlock(&inode_lock);
 			return;
 		}
@@ -1252,13 +1259,22 @@ static void iput_final(struct inode *inode)
 		spin_lock(&inode_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
-		percpu_counter_dec(&nr_inodes_unused);
 		hlist_del_init(&inode->i_hash);
 	}
-	list_del_init(&inode->i_list);
-	list_del_init(&inode->i_sb_list);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+
+	/*
+	 * We avoid moving dirty inodes back onto the LRU now because I_FREEING
+	 * is set and hence writeback_single_inode() won't move the inode
+	 * around.
+	 */
+	if (!list_empty(&inode->i_list)) {
+		list_del_init(&inode->i_list);
+		percpu_counter_dec(&nr_inodes_unused);
+	}
+
+	list_del_init(&inode->i_sb_list);
 	spin_unlock(&inode_lock);
 	evict(inode);
 	spin_lock(&inode_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f0b07f..8ff7b6b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1632,16 +1632,17 @@ struct super_operations {
  *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
-#define I_DIRTY_SYNC		1
-#define I_DIRTY_DATASYNC	2
-#define I_DIRTY_PAGES		4
+#define I_DIRTY_SYNC		0x01
+#define I_DIRTY_DATASYNC	0x02
+#define I_DIRTY_PAGES		0x04
 #define __I_NEW			3
 #define I_NEW			(1 << __I_NEW)
-#define I_WILL_FREE		16
-#define I_FREEING		32
-#define I_CLEAR			64
+#define I_WILL_FREE		0x10
+#define I_FREEING		0x20
+#define I_CLEAR			0x40
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
+#define I_REFERENCED		0x100
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..f956b66 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,7 +10,6 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
-- 
1.7.1



* [RFD] Device Renaming Mechanism
From: Nao Nishijima @ 2010-10-08  5:23 UTC (permalink / raw)
  To: gregkh, James.Bottomley, rwheeler
  Cc: linux-kernel, linux-hotplug-devel, linux-hotplug,
	masami.hiramatsu.pt

Hi,

I'm trying to solve a device name (or device node) mismatch problem caused by
device configuration changes. I have an idea for device renaming to solve it,
and would like to request comments from kernel developers.


Device Name Mismatch
====================

Device names (e.g. sda) are assigned in the order of driver loading and device
recognition (usually starting from the smallest bus number). This may cause a
device name mismatch between the previous and current boot whenever the device
configuration changes. Suppose an application opens a disk via /dev/sdb. When
a device configuration change (hot-plug, device breakdown) or a system
configuration change (driver loading order, modprobe.conf changes) alters the
order in which device names are assigned, this device name does not always
point to the same disk.

This mismatch causes unexpected disk access and redundancy misconfiguration
(e.g. multipath, software RAID) if you use device file names in a
configuration file.


Udev Solution
=============

Typically, we avoid this problem by using the persistent device names provided
by udev.

Udev makes persistent symbolic links (by-{id, uuid, path, label}) pointing to
each device based on device information. Applications access the device via
these symbolic links. Udev solves the mismatch between device name and
physical disk. However, the persistent name does not match the kernel's device
name. This mismatch causes the following 4 issues.

Issue 1: /proc/partitions and /proc/diskstats give you device names
We have to run "ls -l /dev/disk/by-*" or "udevadm" to find the corresponding
persistent symbolic links.

Issue 2: dmesg outputs device names instead of persistent symbolic links
Users might not know which disk is sdX, because they identify the disk by a
persistent symbolic link.

Issue 3: Some system commands don't accept symbolic links (e.g. df, iostat, ...)
These commands just expect an sdX device name or check input against /proc
information. This also occurs in several GNOME/KDE/etc. GUI sysadmin tools. :(

Issue 4: Ambiguous symbolic link
Even if we introduced a tool that maps device names to persistent symbolic
links, we could not determine a single symbolic link from a device, because
several symbolic links point to one device file.

Therefore, I think symbolic links are not enough to solve the problem. We need
a better solution.


Proposal
========
I'd like to propose introducing a device renaming interface to solve these
issues.

I think renaming devices in the kernel is the simplest way to resolve the
mismatch between dmesg and /proc information. This can be done while the
kernel is booting (like ifcfg). Of course, udev still needs to assign a new
name to each device via that interface.

This proposal just requests adding a simple interface to the kernel, as below.
We can then continue to use user programs without any modification.

int rename_device(const char *newname, const char *oldname)
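
To make the intended behaviour easier to discuss, here is a minimal user-space
model of that signature. Everything except the signature itself is an
assumption for illustration: the fixed-size name table, the find_dev() helper,
and the choice of error codes (-ENODEV, -EEXIST, -EINVAL) are hypothetical and
not part of the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_DEVS 8
#define NAME_LEN 32

/* Hypothetical stand-in for the kernel's table of block device names. */
char dev_names[MAX_DEVS][NAME_LEN] = { "sda", "sdb", "sdc" };

/* Return the table index of `name`, or -1 if no device has that name. */
int find_dev(const char *name)
{
	for (int i = 0; i < MAX_DEVS; i++)
		if (dev_names[i][0] && !strcmp(dev_names[i], name))
			return i;
	return -1;
}

/* Same signature as the proposed interface; the semantics are assumed. */
int rename_device(const char *newname, const char *oldname)
{
	int idx;

	assert(newname && oldname);
	if (!newname[0] || strlen(newname) >= NAME_LEN)
		return -EINVAL;		/* empty or too long */
	idx = find_dev(oldname);
	if (idx < 0)
		return -ENODEV;		/* no such device */
	if (find_dev(newname) >= 0)
		return -EEXIST;		/* new name already in use */
	strcpy(dev_names[idx], newname);
	return 0;
}
```

In this model, rename_device("disk-root", "sda") succeeds once, and a second
attempt to give "sdb" the same name fails, which is the kind of uniqueness
rule a real interface would have to define.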

Any comments or suggestions are very welcome!
Best Regards,

-- 
Nao NISHIJIMA
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
Email: nao.nishijima.xt@hitachi.com

^ permalink raw reply

* [PATCH 05/18] fs: inode split IO and LRU lists
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Nick Piggin <npiggin@suse.de>

The use of the same inode list structure (inode->i_list) for two
different list constructs with different lifecycles and purposes
makes it impossible to separate the locking of the different
operations. Therefore, to enable the separation of the locking of
the writeback and reclaim lists, split the inode->i_list into two
separate lists dedicated to their specific tracking functions.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c         |   30 +++++++++++++++++-------------
 fs/inode.c                |   36 +++++++++++++++++++++---------------
 fs/nilfs2/mdt.c           |    3 ++-
 include/linux/fs.h        |    3 ++-
 include/linux/writeback.h |    3 +++
 mm/backing-dev.c          |   44 ++++++++++++++++++++++----------------------
 6 files changed, 67 insertions(+), 52 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2a61300..78aaaa8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -172,11 +172,11 @@ static void redirty_tail(struct inode *inode)
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_io);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &wb->b_dirty);
+	list_move(&inode->i_io, &wb->b_dirty);
 }
 
 /*
@@ -186,7 +186,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
-	list_move(&inode->i_list, &wb->b_more_io);
+	list_move(&inode->i_io, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -227,14 +227,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 	int do_sb_sort = 0;
 
 	while (!list_empty(delaying_queue)) {
-		inode = list_entry(delaying_queue->prev, struct inode, i_list);
+		inode = list_entry(delaying_queue->prev, struct inode, i_io);
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
-		list_move(&inode->i_list, &tmp);
+		list_move(&inode->i_io, &tmp);
 	}
 
 	/* just one sb in list, splice to dispatch_queue and we're done */
@@ -245,12 +245,12 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		inode = list_entry(tmp.prev, struct inode, i_list);
+		inode = list_entry(tmp.prev, struct inode, i_io);
 		sb = inode->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = list_entry(pos, struct inode, i_list);
+			inode = list_entry(pos, struct inode, i_io);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_list, dispatch_queue);
+				list_move(&inode->i_io, dispatch_queue);
 		}
 	}
 }
@@ -410,7 +410,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			redirty_tail(inode);
 		} else {
 			/* The inode is clean */
-			list_move(&inode->i_list, &inode_unused);
+			list_del_init(&inode->i_io);
+			if (list_empty(&inode->i_lru)) {
+				list_add(&inode->i_lru, &inode_unused);
+				percpu_counter_inc(&nr_inodes_unused);
+			}
 		}
 	}
 	inode_sync_complete(inode);
@@ -459,7 +463,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_io);
 
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
@@ -530,7 +534,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
-						 struct inode, i_list);
+						 struct inode, i_io);
 		struct super_block *sb = inode->i_sb;
 
 		if (!pin_sb_for_writeback(sb)) {
@@ -669,7 +673,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			inode = list_entry(wb->b_more_io.prev,
-						struct inode, i_list);
+						struct inode, i_io);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -983,7 +987,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &bdi->wb.b_dirty);
+			list_move(&inode->i_io, &bdi->wb.b_dirty);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index e76d398..98f8963 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -102,8 +102,8 @@ static DECLARE_RWSEM(iprune_sem);
  */
 struct inodes_stat_t inodes_stat;
 
-static struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
-static struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
+struct percpu_counter nr_inodes __cacheline_aligned_in_smp;
+struct percpu_counter nr_inodes_unused __cacheline_aligned_in_smp;
 
 static struct kmem_cache *inode_cachep __read_mostly;
 
@@ -272,6 +272,7 @@ EXPORT_SYMBOL(__destroy_inode);
 
 void destroy_inode(struct inode *inode)
 {
+	BUG_ON(!list_empty(&inode->i_lru));
 	__destroy_inode(inode);
 	if (inode->i_sb->s_op->destroy_inode)
 		inode->i_sb->s_op->destroy_inode(inode);
@@ -290,7 +291,8 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_list);
+	INIT_LIST_HEAD(&inode->i_io);
+	INIT_LIST_HEAD(&inode->i_lru);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
@@ -361,8 +363,8 @@ static void dispose_list(struct list_head *head)
 	while (!list_empty(head)) {
 		struct inode *inode;
 
-		inode = list_first_entry(head, struct inode, i_list);
-		list_del_init(&inode->i_list);
+		inode = list_first_entry(head, struct inode, i_lru);
+		list_del_init(&inode->i_lru);
 
 		evict(inode);
 
@@ -405,7 +407,8 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			continue;
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
-			list_move(&inode->i_list, dispose);
+			list_move(&inode->i_lru, dispose);
+			list_del_init(&inode->i_io);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			percpu_counter_dec(&nr_inodes_unused);
@@ -482,16 +485,16 @@ static void prune_icache(int nr_to_scan)
 		if (list_empty(&inode_unused))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(inode_unused.prev, struct inode, i_lru);
 
 		if (atomic_read(&inode->i_count) ||
 		    (inode->i_state & ~I_REFERENCED)) {
-			list_del_init(&inode->i_list);
+			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_lru, &inode_unused);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
@@ -510,11 +513,12 @@ static void prune_icache(int nr_to_scan)
 			 * on it.
 			 */
 			if (!can_unuse(inode)) {
-				list_move(&inode->i_list, &inode_unused);
+				list_move(&inode->i_lru, &inode_unused);
 				continue;
 			}
 		}
-		list_move(&inode->i_list, &freeable);
+		list_move(&inode->i_lru, &freeable);
+		list_del_init(&inode->i_io);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
 		percpu_counter_dec(&nr_inodes_unused);
@@ -1245,8 +1249,9 @@ static void iput_final(struct inode *inode)
 	if (!drop) {
 		if (sb->s_flags & MS_ACTIVE) {
 			inode->i_state |= I_REFERENCED;
-			if (!(inode->i_state & (I_DIRTY|I_SYNC))) {
-				list_move(inode->i_list, &inode_unused);
+			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
+			    list_empty(&inode->i_lru)) {
+				list_add(&inode->i_lru, &inode_unused);
 				percpu_counter_inc(&nr_inodes_unused);
 			}
 			spin_unlock(&inode_lock);
@@ -1261,6 +1266,7 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		hlist_del_init(&inode->i_hash);
 	}
+	list_del_init(&inode->i_io);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
@@ -1269,8 +1275,8 @@ static void iput_final(struct inode *inode)
 	 * is set and hence writeback_single_inode() won't move the inode
 	 * around.
 	 */
-	if (!list_empty(&inode->i_list)) {
-		list_del_init(&inode->i_list);
+	if (!list_empty(&inode->i_lru)) {
+		list_del_init(&inode->i_lru);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 7713861..2ee524f 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -504,7 +504,8 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 #endif
 		inode->dirtied_when = 0;
 
-		INIT_LIST_HEAD(&inode->i_list);
+		INIT_LIST_HEAD(&inode->i_io);
+		INIT_LIST_HEAD(&inode->i_lru);
 		INIT_LIST_HEAD(&inode->i_sb_list);
 		inode->i_state = 0;
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8ff7b6b..11c7ad4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -725,7 +725,8 @@ struct posix_acl;
 
 struct inode {
 	struct hlist_node	i_hash;
-	struct list_head	i_list;		/* backing dev IO list */
+	struct list_head	i_io;		/* backing dev IO list */
+	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f956b66..f7ed2a0 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -12,6 +12,9 @@ struct backing_dev_info;
 extern spinlock_t inode_lock;
 extern struct list_head inode_unused;
 
+extern struct percpu_counter nr_inodes;
+extern struct percpu_counter nr_inodes_unused;
+
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0188d99..a124991 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,11 +74,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_io)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_list)
+	list_for_each_entry(inode, &wb->b_io, i_io)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_io)
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
@@ -681,27 +681,27 @@ void mapping_set_bdi(struct address_space *mapping,
 		return;
 
 	spin_lock(&inode_lock);
-	if (!list_empty(&inode->i_list)) {
+	if (!list_empty(&inode->i_io)) {
 		struct inode *i;
 
-		list_for_each_entry(i, &old->wb.b_dirty, i_list) {
+		list_for_each_entry(i, &old->wb.b_dirty, i_io) {
 			if (inode == i) {
-				list_del(&inode->i_list);
-				list_add(&inode->i_list, &bdi->wb.b_dirty);
+				list_del(&inode->i_io);
+				list_add(&inode->i_io, &bdi->wb.b_dirty);
 				goto found;
 			}
 		}
-		list_for_each_entry(i, &old->wb.b_io, i_list) {
+		list_for_each_entry(i, &old->wb.b_io, i_io) {
 			if (inode == i) {
-				list_del(&inode->i_list);
-				list_add(&inode->i_list, &bdi->wb.b_io);
+				list_del(&inode->i_io);
+				list_add(&inode->i_io, &bdi->wb.b_io);
 				goto found;
 			}
 		}
-		list_for_each_entry(i, &old->wb.b_more_io, i_list) {
+		list_for_each_entry(i, &old->wb.b_more_io, i_io) {
 			if (inode == i) {
-				list_del(&inode->i_list);
-				list_add(&inode->i_list, &bdi->wb.b_more_io);
+				list_del(&inode->i_io);
+				list_add(&inode->i_io, &bdi->wb.b_more_io);
 				goto found;
 			}
 		}
@@ -726,19 +726,19 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct inode *i, *tmp;
 
 		spin_lock(&inode_lock);
-		list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_list) {
-			list_del(&i->i_list);
-			list_add_tail(&i->i_list, &dst->b_dirty);
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
+			list_del(&i->i_io);
+			list_add_tail(&i->i_io, &dst->b_dirty);
 			i->i_mapping->a_bdi = bdi;
 		}
-		list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_list) {
-			list_del(&i->i_list);
-			list_add_tail(&i->i_list, &dst->b_io);
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_io) {
+			list_del(&i->i_io);
+			list_add_tail(&i->i_io, &dst->b_io);
 			i->i_mapping->a_bdi = bdi;
 		}
-		list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_list) {
-			list_del(&i->i_list);
-			list_add_tail(&i->i_list, &dst->b_more_io);
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_io) {
+			list_del(&i->i_io);
+			list_add_tail(&i->i_io, &dst->b_more_io);
 			i->i_mapping->a_bdi = bdi;
 		}
 		spin_unlock(&inode_lock);
-- 
1.7.1


^ permalink raw reply related
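
The effect of splitting i_list into i_io and i_lru in the patch above can be
sketched in user space. This is only a toy model under stated assumptions:
struct toy_inode, toy_inode_init() and toy_mark_clean() are hypothetical
names, the list helpers are simplified re-implementations of the kernel's
list.h primitives, and a plain counter stands in for the percpu counter.
toy_mark_clean() mirrors the "inode is clean" branch of
writeback_single_inode() as changed by the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly linked list, modelled on the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

void init_list_head(struct list_head *h) { h->next = h->prev = h; }
int list_empty(const struct list_head *h) { return h->next == h; }

void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	init_list_head(n);
}

/* Toy inode with one link per list, as in the patch. */
struct toy_inode {
	struct list_head i_io;	/* writeback list membership */
	struct list_head i_lru;	/* unused-inode LRU membership */
};

struct list_head b_dirty = { &b_dirty, &b_dirty };
struct list_head inode_unused = { &inode_unused, &inode_unused };
long nr_inodes_unused;		/* stand-in for the percpu counter */

void toy_inode_init(struct toy_inode *inode)
{
	init_list_head(&inode->i_io);
	init_list_head(&inode->i_lru);
}

/* Leave the writeback list; join the LRU only if not already on it. */
void toy_mark_clean(struct toy_inode *inode)
{
	assert(inode);
	list_del_init(&inode->i_io);
	if (list_empty(&inode->i_lru)) {
		list_add(&inode->i_lru, &inode_unused);
		nr_inodes_unused++;
	}
}
```

Because each list has its own link, an inode can sit on a writeback list and
on the LRU at the same time, and the list_empty() check keeps the unused
counter from being incremented twice for the same inode.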

* [PATCH 06/18] fs: Clean up inode reference counting
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

Lots of filesystem code open codes the act of getting a reference to
an inode.  Factor the open coded inode lock, increment, unlock into
a function iref().  Then rename __iget to iref_locked so that nothing
is directly incrementing the inode reference count for trivial
operations.

Originally based on a patch from Nick Piggin.
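
As a rough user-space illustration of the lock, increment, unlock pattern
being factored out, the sketch below uses a pthread mutex in place of the
inode_lock spinlock and C11 atomics in place of the kernel's atomic_t.
struct toy_inode is a hypothetical stand-in for struct inode; only the
function names iref() and iref_locked() come from this patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct inode: just the reference count. */
struct toy_inode {
	atomic_int i_count;
};

/* Stand-in for the global inode_lock spinlock. */
pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* Take a reference; the caller must already hold inode_lock. */
void iref_locked(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_count, 1);
}

/* Take a reference when the caller does not hold inode_lock. */
void iref(struct toy_inode *inode)
{
	assert(inode);
	pthread_mutex_lock(&inode_lock);
	iref_locked(inode);
	pthread_mutex_unlock(&inode_lock);
}
```

Callers that already hold the lock use iref_locked(); everyone else calls
iref(), so no caller touches i_count directly.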

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/9p/vfs_inode.c           |    5 +++--
 fs/affs/inode.c             |    2 +-
 fs/afs/dir.c                |    2 +-
 fs/anon_inodes.c            |    5 ++---
 fs/bfs/dir.c                |    2 +-
 fs/block_dev.c              |   13 ++++++-------
 fs/btrfs/inode.c            |    2 +-
 fs/coda/dir.c               |    2 +-
 fs/drop_caches.c            |    2 +-
 fs/exofs/inode.c            |    2 +-
 fs/exofs/namei.c            |    2 +-
 fs/ext2/namei.c             |    2 +-
 fs/ext3/namei.c             |    2 +-
 fs/ext4/namei.c             |    2 +-
 fs/fs-writeback.c           |    6 +++---
 fs/gfs2/ops_inode.c         |    2 +-
 fs/hfsplus/dir.c            |    2 +-
 fs/inode.c                  |   29 +++++++++++++++++++----------
 fs/jffs2/dir.c              |    4 ++--
 fs/jfs/jfs_txnmgr.c         |    2 +-
 fs/jfs/namei.c              |    2 +-
 fs/libfs.c                  |    2 +-
 fs/logfs/dir.c              |    2 +-
 fs/minix/namei.c            |    2 +-
 fs/namei.c                  |    2 +-
 fs/nfs/dir.c                |    2 +-
 fs/nfs/getroot.c            |    2 +-
 fs/nfs/write.c              |    2 +-
 fs/nilfs2/namei.c           |    2 +-
 fs/notify/inode_mark.c      |    8 ++++----
 fs/ntfs/super.c             |    4 ++--
 fs/ocfs2/namei.c            |    2 +-
 fs/quota/dquot.c            |    2 +-
 fs/reiserfs/namei.c         |    2 +-
 fs/sysv/namei.c             |    2 +-
 fs/ubifs/dir.c              |    2 +-
 fs/udf/namei.c              |    2 +-
 fs/ufs/namei.c              |    2 +-
 fs/xfs/linux-2.6/xfs_iops.c |    2 +-
 fs/xfs/xfs_inode.h          |    2 +-
 include/linux/fs.h          |    3 ++-
 ipc/mqueue.c                |    2 +-
 kernel/futex.c              |    2 +-
 mm/shmem.c                  |    2 +-
 net/socket.c                |    2 +-
 45 files changed, 79 insertions(+), 70 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 9e670d5..1f76624 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1789,9 +1789,10 @@ v9fs_vfs_link_dotl(struct dentry *old_dentry, struct inode *dir,
 		kfree(st);
 	} else {
 		/* Caching disabled. No need to get upto date stat info.
-		 * This dentry will be released immediately. So, just i_count++
+		 * This dentry will be released immediately. So, just take
+		 * a reference.
 		 */
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 
 	dentry->d_op = old_dentry->d_op;
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 3a0fdec..2100852 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -388,7 +388,7 @@ affs_add_entry(struct inode *dir, struct inode *inode, struct dentry *dentry, s3
 		affs_adjust_checksum(inode_bh, block - be32_to_cpu(chain));
 		mark_buffer_dirty_inode(inode_bh, inode);
 		inode->i_nlink = 2;
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	}
 	affs_fix_checksum(sb, bh);
 	mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 0d38c09..87d8c03 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -1045,7 +1045,7 @@ static int afs_link(struct dentry *from, struct inode *dir,
 	if (ret < 0)
 		goto link_error;
 
-	atomic_inc(&vnode->vfs_inode.i_count);
+	iref(&vnode->vfs_inode);
 	d_instantiate(dentry, &vnode->vfs_inode);
 	key_put(key);
 	_leave(" = 0");
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index e4b75d6..55a825f 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -111,10 +111,9 @@ struct file *anon_inode_getfile(const char *name,
 	path.mnt = mntget(anon_inode_mnt);
 	/*
 	 * We know the anon_inode inode count is always greater than zero,
-	 * so we can avoid doing an igrab() and we can use an open-coded
-	 * atomic_inc().
+	 * so we can avoid doing an igrab() by using iref().
 	 */
-	atomic_inc(&anon_inode_inode->i_count);
+	iref(anon_inode_inode);
 
 	path.dentry->d_op = &anon_inodefs_dentry_operations;
 	d_instantiate(path.dentry, anon_inode_inode);
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index d967e05..6e93a37 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -176,7 +176,7 @@ static int bfs_link(struct dentry *old, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(new, inode);
 	mutex_unlock(&info->bfs_lock);
 	return 0;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index ac070d7..b7d1534 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -550,7 +550,7 @@ EXPORT_SYMBOL(bdget);
  */
 struct block_device *bdgrab(struct block_device *bdev)
 {
-	atomic_inc(&bdev->bd_inode->i_count);
+	iref(bdev->bd_inode);
 	return bdev;
 }
 
@@ -580,7 +580,7 @@ static struct block_device *bd_acquire(struct inode *inode)
 	spin_lock(&bdev_lock);
 	bdev = inode->i_bdev;
 	if (bdev) {
-		atomic_inc(&bdev->bd_inode->i_count);
+		bdgrab(bdev);
 		spin_unlock(&bdev_lock);
 		return bdev;
 	}
@@ -591,12 +591,11 @@ static struct block_device *bd_acquire(struct inode *inode)
 		spin_lock(&bdev_lock);
 		if (!inode->i_bdev) {
 			/*
-			 * We take an additional bd_inode->i_count for inode,
-			 * and it's released in clear_inode() of inode.
-			 * So, we can access it via ->i_mapping always
-			 * without igrab().
+			 * We take an additional bdev reference here so
+			 * we can access it via ->i_mapping always
+			 * without first needing to grab a reference.
 			 */
-			atomic_inc(&bdev->bd_inode->i_count);
+			bdgrab(bdev);
 			inode->i_bdev = bdev;
 			inode->i_mapping = bdev->bd_inode->i_mapping;
 			list_add(&inode->i_devices, &bdev->bd_inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c646c0c..0c3a35b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4758,7 +4758,7 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	}
 
 	btrfs_set_trans_block_group(trans, dir);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = btrfs_add_nondir(trans, dentry, inode, 1, index);
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ccd98b0..ac8b913 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -303,7 +303,7 @@ static int coda_link(struct dentry *source_de, struct inode *dir_inode,
 	}
 
 	coda_dir_update_mtime(dir_inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(de, inode);
 	inc_nlink(inode);
 
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 2195c21..c4f3e06 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -22,7 +22,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		iref_locked(inode);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index eb7368e..b631ff3 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1154,7 +1154,7 @@ struct inode *exofs_new_inode(struct inode *dir, int mode)
 	/* increment the refcount so that the inode will still be around when we
 	 * reach the callback
 	 */
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	ios->done = create_done;
 	ios->private = inode;
diff --git a/fs/exofs/namei.c b/fs/exofs/namei.c
index b7dd0c2..f2a30a0 100644
--- a/fs/exofs/namei.c
+++ b/fs/exofs/namei.c
@@ -153,7 +153,7 @@ static int exofs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return exofs_add_nondir(dentry, inode);
 }
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 71efb0e..b15435f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -206,7 +206,7 @@ static int ext2_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext2_add_link(dentry, inode);
 	if (!err) {
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index 2b35ddb..6c7a5d6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2260,7 +2260,7 @@ retry:
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext3_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 314c0d3..a406a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2312,7 +2312,7 @@ retry:
 
 	inode->i_ctime = ext4_current_time(inode);
 	ext4_inc_count(handle, inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 78aaaa8..1bf8a28 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -297,7 +297,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 
 /*
  * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via __iget or via syscall against an fd)
+ * caller has ref on the inode (either via iref_locked or via syscall against an fd)
  * or the inode has I_WILL_FREE set (via generic_forget_inode)
  *
  * If `wait' is set, wait on the writeout.
@@ -496,7 +496,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
-		__iget(inode);
+		iref_locked(inode);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -1042,7 +1042,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		mapping = inode->i_mapping;
 		if (mapping->nrpages == 0)
 			continue;
-		__iget(inode);
+		iref_locked(inode);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
diff --git a/fs/gfs2/ops_inode.c b/fs/gfs2/ops_inode.c
index 1009be2..508407d 100644
--- a/fs/gfs2/ops_inode.c
+++ b/fs/gfs2/ops_inode.c
@@ -253,7 +253,7 @@ out_parent:
 	gfs2_holder_uninit(ghs);
 	gfs2_holder_uninit(ghs + 1);
 	if (!error) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_instantiate(dentry, inode);
 		mark_inode_dirty(inode);
 	}
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 764fd1b..e2ce54d 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -301,7 +301,7 @@ static int hfsplus_link(struct dentry *src_dentry, struct inode *dst_dir,
 
 	inc_nlink(inode);
 	hfsplus_instantiate(dst_dentry, inode, cnid);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = CURRENT_TIME_SEC;
 	mark_inode_dirty(inode);
 	HFSPLUS_SB(sb).file_count++;
diff --git a/fs/inode.c b/fs/inode.c
index 98f8963..aa66e07 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -313,11 +313,20 @@ static void init_once(void *foo)
 
 	inode_init_once(inode);
 }
+EXPORT_SYMBOL_GPL(iref_locked);
+
+void iref(struct inode *inode)
+{
+	spin_lock(&inode_lock);
+	iref_locked(inode);
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL_GPL(iref);
 
 /*
  * inode_lock must be held
  */
-void __iget(struct inode *inode)
+void iref_locked(struct inode *inode)
 {
 	atomic_inc(&inode->i_count);
 }
@@ -499,7 +508,7 @@ static void prune_icache(int nr_to_scan)
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
-			__iget(inode);
+			iref_locked(inode);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -565,7 +574,7 @@ static struct shrinker icache_shrinker = {
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
- * NOTE: we are not increasing the inode-refcount, you must call __iget()
+ * NOTE: we are not increasing the inode-refcount, you must call iref_locked()
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
  */
@@ -769,7 +778,7 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		iref_locked(old);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -816,7 +825,7 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		__iget(old);
+		iref_locked(old);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -869,7 +878,7 @@ struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
-		__iget(inode);
+		iref_locked(inode);
 	else
 		/*
 		 * Handle the case where s_op->clear_inode is not been
@@ -910,7 +919,7 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
-		__iget(inode);
+		iref_locked(inode);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -943,7 +952,7 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
-		__iget(inode);
+		iref_locked(inode);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1126,7 +1135,7 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		iref_locked(old);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1165,7 +1174,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		__iget(old);
+		iref_locked(old);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index ed78a3c..797a034 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -289,7 +289,7 @@ static int jffs2_link (struct dentry *old_dentry, struct inode *dir_i, struct de
 		mutex_unlock(&f->sem);
 		d_instantiate(dentry, old_dentry->d_inode);
 		dir_i->i_mtime = dir_i->i_ctime = ITIME(now);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 	}
 	return ret;
 }
@@ -864,7 +864,7 @@ static int jffs2_rename (struct inode *old_dir_i, struct dentry *old_dentry,
 		printk(KERN_NOTICE "jffs2_rename(): Link succeeded, unlink failed (err %d). You now have a hard link\n", ret);
 		/* Might as well let the VFS know */
 		d_instantiate(new_dentry, old_dentry->d_inode);
-		atomic_inc(&old_dentry->d_inode->i_count);
+		iref(old_dentry->d_inode);
 		new_dir_i->i_mtime = new_dir_i->i_ctime = ITIME(now);
 		return ret;
 	}
diff --git a/fs/jfs/jfs_txnmgr.c b/fs/jfs/jfs_txnmgr.c
index d945ea7..3e6dd08 100644
--- a/fs/jfs/jfs_txnmgr.c
+++ b/fs/jfs/jfs_txnmgr.c
@@ -1279,7 +1279,7 @@ int txCommit(tid_t tid,		/* transaction identifier */
 	 * lazy commit thread finishes processing
 	 */
 	if (tblk->xflag & COMMIT_DELETE) {
-		atomic_inc(&tblk->u.ip->i_count);
+		iref(tblk->u.ip);
 		/*
 		 * Avoid a rare deadlock
 		 *
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index a9cf8e8..3d3566e 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -839,7 +839,7 @@ static int jfs_link(struct dentry *old_dentry,
 	ip->i_ctime = CURRENT_TIME;
 	dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	mark_inode_dirty(dir);
-	atomic_inc(&ip->i_count);
+	iref(ip);
 
 	iplist[0] = ip;
 	iplist[1] = dir;
diff --git a/fs/libfs.c b/fs/libfs.c
index 0a9da95..f190d73 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -255,7 +255,7 @@ int simple_link(struct dentry *old_dentry, struct inode *dir, struct dentry *den
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dget(dentry);
 	d_instantiate(dentry, inode);
 	return 0;
diff --git a/fs/logfs/dir.c b/fs/logfs/dir.c
index 9777eb5..8522edc 100644
--- a/fs/logfs/dir.c
+++ b/fs/logfs/dir.c
@@ -569,7 +569,7 @@ static int logfs_link(struct dentry *old_dentry, struct inode *dir,
 		return -EMLINK;
 
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_nlink++;
 	mark_inode_dirty_sync(inode);
 
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index f3f3578..7563a82 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -101,7 +101,7 @@ static int minix_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	return add_nondir(dentry, inode);
 }
 
diff --git a/fs/namei.c b/fs/namei.c
index 24896e8..5fb93f3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2291,7 +2291,7 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 			goto slashes;
 		inode = dentry->d_inode;
 		if (inode)
-			atomic_inc(&inode->i_count);
+			iref(inode);
 		error = mnt_want_write(nd.path.mnt);
 		if (error)
 			goto exit2;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e257172..5482ede 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1580,7 +1580,7 @@ nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 	d_drop(dentry);
 	error = NFS_PROTO(dir)->link(inode, dir, &dentry->d_name);
 	if (error == 0) {
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		d_add(dentry, inode);
 	}
 	return error;
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index a70e446..5aaa2be 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -55,7 +55,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
 			return -ENOMEM;
 		}
 		/* Circumvent igrab(): we know the inode is not being freed */
-		atomic_inc(&inode->i_count);
+		iref(inode);
 		/*
 		 * Ensure that this dentry is invisible to d_find_alias().
 		 * Otherwise, it may be spliced into the tree by
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a8baf4b..75bc1a3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -390,7 +390,7 @@ static int nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	error = radix_tree_insert(&nfsi->nfs_page_tree, req->wb_index, req);
 	BUG_ON(error);
 	if (!nfsi->npages) {
-		igrab(inode);
+		iref(inode);
 		if (nfs_have_delegation(inode, FMODE_WRITE))
 			nfsi->change_attr++;
 	}
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index ad6ed2c..fbd3348 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -219,7 +219,7 @@ static int nilfs_link(struct dentry *old_dentry, struct inode *dir,
 
 	inode->i_ctime = CURRENT_TIME;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	err = nilfs_add_nondir(dentry, inode);
 	if (!err)
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 33297c0..8096a9e 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -244,7 +244,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		struct inode *need_iput_tmp;
 
 		/*
-		 * We cannot __iget() an inode in state I_FREEING,
+		 * We cannot iref() an inode in state I_FREEING,
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
@@ -253,7 +253,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/*
 		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an __iget/iput with MS_ACTIVE clear would actually
+		 * doing an iref/iput with MS_ACTIVE clear would actually
 		 * evict all inodes with zero i_count from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
@@ -265,7 +265,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		/* In case fsnotify_inode_delete() drops a reference. */
 		if (inode != need_iput_tmp)
-			__iget(inode);
+			iref_locked(inode);
 		else
 			need_iput_tmp = NULL;
 
@@ -273,7 +273,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		if ((&next_i->i_sb_list != list) &&
 		    atomic_read(&next_i->i_count) &&
 		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			__iget(next_i);
+			iref_locked(next_i);
 			need_iput = next_i;
 		}
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index 5128061..52b48e3 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2929,8 +2929,8 @@ static int ntfs_fill_super(struct super_block *sb, void *opt, const int silent)
 		goto unl_upcase_iput_tmp_ino_err_out_now;
 	}
 	if ((sb->s_root = d_alloc_root(vol->root_ino))) {
-		/* We increment i_count simulating an ntfs_iget(). */
-		atomic_inc(&vol->root_ino->i_count);
+		/* Simulate an ntfs_iget() call */
+		iref(vol->root_ino);
 		ntfs_debug("Exiting, status successful.");
 		/* Release the default upcase if it has no users. */
 		mutex_lock(&ntfs_lock);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index a00dda2..0e002f6 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -741,7 +741,7 @@ static int ocfs2_link(struct dentry *old_dentry,
 		goto out_commit;
 	}
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	dentry->d_op = &ocfs2_dentry_ops;
 	d_instantiate(dentry, inode);
 
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index aad1316..5199418 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -909,7 +909,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		if (!dqinit_needed(inode, type))
 			continue;
 
-		__iget(inode);
+		iref_locked(inode);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index ee78d4a..f19bb3d 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1156,7 +1156,7 @@ static int reiserfs_link(struct dentry *old_dentry, struct inode *dir,
 	inode->i_ctime = CURRENT_TIME_SEC;
 	reiserfs_update_sd(&th, inode);
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	retval = journal_end(&th, dir->i_sb, jbegin_count);
 	reiserfs_write_unlock(dir->i_sb);
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index 33e047b..765974f 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -126,7 +126,7 @@ static int sysv_link(struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	return add_nondir(dentry, inode);
 }
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index d669260..6a6393b 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -550,7 +550,7 @@ static int ubifs_link(struct dentry *old_dentry, struct inode *dir,
 
 	lock_2_inodes(dir, inode);
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	inode->i_ctime = ubifs_current_time(inode);
 	dir->i_size += sz_change;
 	dir_ui->ui_size = dir->i_size;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..f6e232a 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1101,7 +1101,7 @@ static int udf_link(struct dentry *old_dentry, struct inode *dir,
 	inc_nlink(inode);
 	inode->i_ctime = current_fs_time(inode->i_sb);
 	mark_inode_dirty(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	unlock_kernel();
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index b056f02..2a598eb 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -180,7 +180,7 @@ static int ufs_link (struct dentry * old_dentry, struct inode * dir,
 
 	inode->i_ctime = CURRENT_TIME_SEC;
 	inode_inc_link_count(inode);
-	atomic_inc(&inode->i_count);
+	iref(inode);
 
 	error = ufs_add_nondir(dentry, inode);
 	unlock_kernel();
diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c
index b1fc2a6..b7ec465 100644
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -352,7 +352,7 @@ xfs_vn_link(
 	if (unlikely(error))
 		return -error;
 
-	atomic_inc(&inode->i_count);
+	iref(inode);
 	d_instantiate(dentry, inode);
 	return 0;
 }
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0898c54..cbb4791 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -482,7 +482,7 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 #define IHOLD(ip) \
 do { \
 	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
-	atomic_inc(&(VFS_I(ip)->i_count)); \
+	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 11c7ad4..2e971f2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2184,7 +2184,8 @@ extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struc
 extern int insert_inode_locked(struct inode *);
 extern void unlock_new_inode(struct inode *);
 
-extern void __iget(struct inode * inode);
+extern void iref(struct inode *inode);
+extern void iref_locked(struct inode *inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c60e519..d53a2c1 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -769,7 +769,7 @@ SYSCALL_DEFINE1(mq_unlink, const char __user *, u_name)
 
 	inode = dentry->d_inode;
 	if (inode)
-		atomic_inc(&inode->i_count);
+		iref(inode);
 	err = mnt_want_write(ipc_ns->mq_mnt);
 	if (err)
 		goto out_err;
diff --git a/kernel/futex.c b/kernel/futex.c
index 6a3a5fa..3bb418c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -168,7 +168,7 @@ static void get_futex_key_refs(union futex_key *key)
 
 	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
 	case FUT_OFF_INODE:
-		atomic_inc(&key->shared.inode->i_count);
+		iref(key->shared.inode);
 		break;
 	case FUT_OFF_MMSHARED:
 		atomic_inc(&key->private.mm->mm_count);
diff --git a/mm/shmem.c b/mm/shmem.c
index fbee46d..4daaa24 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1903,7 +1903,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
 	inc_nlink(inode);
-	atomic_inc(&inode->i_count);	/* New dentry reference */
+	iref(inode);
 	dget(dentry);		/* Extra pinning count for the created dentry */
 	d_instantiate(dentry, inode);
 out:
diff --git a/net/socket.c b/net/socket.c
index 2270b94..715ca57 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -377,7 +377,7 @@ static int sock_alloc_file(struct socket *sock, struct file **f, int flags)
 		  &socket_file_ops);
 	if (unlikely(!file)) {
 		/* drop dentry, keep inode */
-		atomic_inc(&path.dentry->d_inode->i_count);
+		iref(path.dentry->d_inode);
 		path_put(&path);
 		put_unused_fd(fd);
 		return -ENFILE;
-- 
1.7.1


^ permalink raw reply related

* Re: [lm-sensors] [1/2] hwmon: uniform the init style of pkgtemp
From: Chen Gong @ 2010-10-08  5:25 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: lm-sensors@lm-sensors.org, JBeulich@novell.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <20101008035248.GA23368@ericsson.com>

On 10/8/2010 11:52 AM, Guenter Roeck wrote:
> On Thu, Oct 07, 2010 at 11:42:05PM -0400, Chen Gong wrote:
>> On 10/2/2010 11:26 AM, Guenter Roeck wrote:
>>> On Sun, Sep 26, 2010 at 05:59:59AM -0000, Chen Gong wrote:
>>>> pkgtemp is derived from coretemp, so the same reasonable
>>>> logic should be applied to pkgtemp as well, such as
>>>> the init logic here.
>>>>
>>>> Signed-off-by: Chen Gong<gong.chen@linux.intel.com>
>>>> Acked-by: Jan Beulich<jbeulich@novell.com>
>>>>
>>> Hi,
>>>
>>> this patch, when applied with CONFIG_HOTPLUG_CPU undefined, causes a compile failure
>>> because it tries to access pkgtemp_cpu_notifier which is not defined in this case.
>>>
>>> For that reason, I have removed the patch from the list of applied patches for -next.
>>> Please re-submit a version which compiles for all combinations of HOTPLUG_CPU and SMP
>>> defined/undefined.
>>>
>>> Thanks,
>>> Guenter
>>>
>> Sorry for the late reply; I just came back from holiday. If only this one
>> patch is applied, the build is broken, but it is OK once both patches are
>> applied. I tested the patches with CONFIG_SMP undefined, and the build is
>
> Each patch by itself must be compilable, otherwise we break the ability
> to bisect. Not a good idea.

Yes, but in fact this is an existing issue, not a newly introduced one. The
first patch is not meant to address it.

>
>> broken again. My suggestion is to add a macro definition to pkgtemp.c
>> like "#include<asm/smp.h>". If that is doable, I will re-post a new patch
>> series
>
> I assume you mean to add the include directive. Yes, that should do it,
> but please make sure that it compiles.

Sure, of course

>
> Thanks,
> Guenter
>


^ permalink raw reply

* [PATCH 12/18] fs: add a per-superblock lock for the inode list
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

To allow removal of the inode_lock, we first need to protect the
superblock inode list with its own lock instead of using the
inode_lock. Add a lock to the superblock to protect this list and
nest the new lock inside the inode_lock around the list operations
it needs to protect.

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
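As a rough illustration of the nesting described in the changelog, here is a
user-space sketch using pthread mutexes. It is not the kernel code: struct sb,
struct node, global_lock, sb_init(), sb_add() and sb_count() are hypothetical
names, with global_lock standing in for inode_lock and inodes_lock for the new
s_inodes_lock. The only point it tries to show is that the list lock is always
taken inside the outer lock and released before it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these names exist in the kernel. */
struct node {
	struct node *next;
	int id;
};

struct sb {
	pthread_mutex_t inodes_lock;	/* plays the role of s_inodes_lock */
	struct node *inodes;		/* plays the role of s_inodes */
};

/* Plays the role of the global inode_lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

static void sb_init(struct sb *sb)
{
	pthread_mutex_init(&sb->inodes_lock, NULL);
	sb->inodes = NULL;
}

/* Lock order: global_lock first, then the per-sb list lock nested inside. */
static void sb_add(struct sb *sb, struct node *n)
{
	assert(n != NULL);
	pthread_mutex_lock(&global_lock);
	pthread_mutex_lock(&sb->inodes_lock);
	n->next = sb->inodes;
	sb->inodes = n;
	pthread_mutex_unlock(&sb->inodes_lock);
	pthread_mutex_unlock(&global_lock);
}

/* Walk the list under both locks, released in reverse order of acquisition. */
static int sb_count(struct sb *sb)
{
	int count = 0;

	pthread_mutex_lock(&global_lock);
	pthread_mutex_lock(&sb->inodes_lock);
	for (struct node *p = sb->inodes; p != NULL; p = p->next)
		count++;
	pthread_mutex_unlock(&sb->inodes_lock);
	pthread_mutex_unlock(&global_lock);
	return count;
}
```

A caller would run sb_init() once and then sb_add() for each node; sb_count()
walks the list with both locks held. Keeping one fixed order (outer lock, then
list lock) is what avoids lock-order inversions while the two locks coexist.
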
 fs/drop_caches.c       |    4 ++++
 fs/fs-writeback.c      |    4 ++++
 fs/inode.c             |   22 +++++++++++++++++++---
 fs/notify/inode_mark.c |    3 +++
 fs/quota/dquot.c       |    6 ++++++
 fs/super.c             |    1 +
 include/linux/fs.h     |    1 +
 7 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c4f3e06..c808ca8 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,18 +17,22 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	struct inode *inode, *toput_inode = NULL;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
 		if (inode->i_mapping->nrpages == 0)
 			continue;
 		iref_locked(inode);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d63ab47..29f8032 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1026,6 +1026,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1043,6 +1044,7 @@ static void wait_sb_inodes(struct super_block *sb)
 		if (mapping->nrpages == 0)
 			continue;
 		iref_locked(inode);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have
@@ -1060,7 +1062,9 @@ static void wait_sb_inodes(struct super_block *sb)
 		cond_resched();
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 3c07719..e6bb36d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -33,13 +33,18 @@
  *   i_ref
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
+ * sb inode lock protects:
+ *   s_inodes, i_sb_list
  *
  * Lock orders
  * inode_lock
  *   inode hash bucket lock
  *     inode->i_lock
+ *
+ * inode_lock
+ *   sb inode lock
+ *     inode->i_lock
  */
-
 /*
  * This is needed for the following functions:
  *  - inode_has_buffers
@@ -488,7 +493,9 @@ static void dispose_list(struct list_head *head)
 
 		spin_lock(&inode_lock);
 		__remove_inode_hash(inode);
+		spin_lock(&inode->i_sb->s_inodes_lock);
 		list_del_init(&inode->i_sb_list);
+		spin_unlock(&inode->i_sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
@@ -499,7 +506,8 @@ static void dispose_list(struct list_head *head)
 /*
  * Invalidate all inodes for a device.
  */
-static int invalidate_list(struct list_head *head, struct list_head *dispose)
+static int invalidate_list(struct super_block *sb, struct list_head *head,
+			struct list_head *dispose)
 {
 	struct list_head *next;
 	int busy = 0;
@@ -516,6 +524,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		 * shrink_icache_memory() away.
 		 */
 		cond_resched_lock(&inode_lock);
+		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
 		if (tmp == head)
@@ -555,8 +564,10 @@ int invalidate_inodes(struct super_block *sb)
 
 	down_write(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
-	busy = invalidate_list(&sb->s_inodes, &throw_away);
+	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
@@ -753,7 +764,9 @@ static inline void
 __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
 			struct inode *inode)
 {
+	spin_lock(&sb->s_inodes_lock);
 	list_add(&inode->i_sb_list, &sb->s_inodes);
+	spin_unlock(&sb->s_inodes_lock);
 	if (b) {
 		spin_lock_bucket(b);
 		hlist_bl_add_head(&inode->i_hash, &b->head);
@@ -1397,7 +1410,10 @@ static void iput_final(struct inode *inode)
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 
+	spin_lock(&sb->s_inodes_lock);
 	list_del_init(&inode->i_sb_list);
+	spin_unlock(&sb->s_inodes_lock);
+
 	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 2fe319b..3389ff0 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -242,6 +242,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
 		struct inode *need_iput_tmp;
+		struct super_block *sb = inode->i_sb;
 
 		/*
 		 * We cannot iref() an inode in state I_FREEING,
@@ -288,6 +289,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
@@ -301,5 +303,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		iput(inode);
 
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 5199418..b7cbc41 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -897,6 +897,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 #endif
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
 			continue;
@@ -910,6 +911,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 			continue;
 
 		iref_locked(inode);
+		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
 		iput(old_inode);
@@ -921,7 +923,9 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		 * keep the reference and iput it later. */
 		old_inode = inode;
 		spin_lock(&inode_lock);
+		spin_lock(&sb->s_inodes_lock);
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 	iput(old_inode);
 
@@ -1004,6 +1008,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	int reserved = 0;
 
 	spin_lock(&inode_lock);
+	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1017,6 +1022,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 	}
+	spin_unlock(&sb->s_inodes_lock);
 	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
diff --git a/fs/super.c b/fs/super.c
index 8819e3a..d826214 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,6 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		spin_lock_init(&s->s_inodes_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 34f983f..54c4e86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1342,6 +1342,7 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
+	spinlock_t		s_inodes_lock;	/* lock for s_inodes */
 	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 #ifdef CONFIG_SMP
-- 
1.7.1


^ permalink raw reply related

* [PATCH 09/18] fs: rework icount to be a locked variable
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

The inode reference count is currently an atomic variable so that it can be
sampled/modified outside the inode_lock. However, the inode_lock is still
needed to synchronise the final reference count and checks against the inode
state.

To avoid needing the protection of the inode lock, protect the inode reference
count with the per-inode i_lock and convert it to a normal variable. To avoid
existing out-of-tree code accidentally compiling against the new method, rename
the i_count field to i_ref. This is relatively straightforward as there
are limited external references to the i_count field remaining.

Based on work originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
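To make the conversion concrete, here is a small user-space sketch of a
reference count that is a plain integer protected by a per-object lock rather
than an atomic. It is only an analogy under stated assumptions: struct obj and
the obj_*() helpers are hypothetical names, with the pthread mutex standing in
for inode->i_lock and ref for i_ref. obj_put_unless_last() mirrors the kind of
"drop unless this is the last reference" check that the old
atomic_add_unless(&count, -1, 1) call expressed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode; these names are not kernel API. */
struct obj {
	pthread_mutex_t lock;	/* stands in for inode->i_lock */
	unsigned int ref;	/* stands in for inode->i_ref */
};

static void obj_init(struct obj *o)
{
	pthread_mutex_init(&o->lock, NULL);
	o->ref = 1;		/* the creator holds the first reference */
}

/* Take a reference; the count only changes with the lock held. */
static void obj_get(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	o->ref++;
	pthread_mutex_unlock(&o->lock);
}

/* Sample the count under the lock. */
static unsigned int obj_ref_read(struct obj *o)
{
	unsigned int ref;

	pthread_mutex_lock(&o->lock);
	ref = o->ref;
	pthread_mutex_unlock(&o->lock);
	return ref;
}

/* Drop a reference unless it is the last one; returns true if dropped. */
static bool obj_put_unless_last(struct obj *o)
{
	bool dropped = false;

	pthread_mutex_lock(&o->lock);
	if (o->ref > 1) {
		o->ref--;
		dropped = true;
	}
	pthread_mutex_unlock(&o->lock);
	return dropped;
}

/* Drop a reference; returns true if the caller dropped the final one. */
static bool obj_put(struct obj *o)
{
	bool last;

	pthread_mutex_lock(&o->lock);
	assert(o->ref > 0);
	last = (--o->ref == 0);
	pthread_mutex_unlock(&o->lock);
	return last;
}
```

Because the check and the decrement happen under the same lock, a caller can
combine the count test with other state checks in one critical section, which
is the property the changelog is after. Every unlock path has to be covered,
including early exits.
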
 fs/btrfs/inode.c       |    8 ++++-
 fs/inode.c             |   83 ++++++++++++++++++++++++++++++++++++-----------
 fs/nfs/nfs4state.c     |    2 +-
 fs/nilfs2/mdt.c        |    2 +-
 fs/notify/inode_mark.c |   16 ++++++---
 include/linux/fs.h     |    2 +-
 6 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2953e9f..9f04478 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1964,8 +1964,14 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 	struct delayed_iput *delayed;
 
-	if (atomic_add_unless(&inode->i_count, -1, 1))
+	/* XXX: filesystems should not play refcount games like this */
+	spin_lock(&inode->i_lock);
+	if (inode->i_ref > 1) {
+		inode->i_ref--;
+		spin_unlock(&inode->i_lock);
 		return;
+	}
+	spin_unlock(&inode->i_lock);
 
 	delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
 	delayed->inode = inode;
diff --git a/fs/inode.c b/fs/inode.c
index b1dc6dc..5c8a3ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -26,6 +26,13 @@
 #include <linux/posix_acl.h>
 
 /*
+ * Locking rules.
+ *
+ * inode->i_lock protects:
+ *   i_ref
+ */
+
+/*
  * This is needed for the following functions:
  *  - inode_has_buffers
  *  - invalidate_inode_buffers
@@ -64,9 +71,9 @@ static unsigned int i_hash_shift __read_mostly;
  * Each inode can be on two separate lists. One is
  * the hash list of the inode, used for lookups. The
  * other linked list is the "type" list:
- *  "in_use" - valid inode, i_count > 0, i_nlink > 0
+ *  "in_use" - valid inode, i_ref > 0, i_nlink > 0
  *  "dirty"  - as "in_use" but also dirty
- *  "unused" - valid inode, i_count = 0
+ *  "unused" - valid inode, i_ref = 0
  *
  * A "dirty" list is maintained for each super block,
  * allowing for low-overhead inode sync() operations.
@@ -164,7 +171,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_sb = sb;
 	inode->i_blkbits = sb->s_blocksize_bits;
 	inode->i_flags = 0;
-	atomic_set(&inode->i_count, 1);
+	inode->i_ref = 1;
 	inode->i_op = &empty_iops;
 	inode->i_fop = &empty_fops;
 	inode->i_nlink = 1;
@@ -313,31 +320,38 @@ static void init_once(void *foo)
 
 	inode_init_once(inode);
 }
+
+/*
+ * inode_lock must be held
+ */
+void iref_locked(struct inode *inode)
+{
+	inode->i_ref++;
+}
 EXPORT_SYMBOL_GPL(iref_locked);
 
 void iref(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	iref_locked(inode);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
 
 /*
- * inode_lock must be held
- */
-void iref_locked(struct inode *inode)
-{
-	atomic_inc(&inode->i_count);
-}
-
-/*
  * Nobody outside of core code should really be looking at the inode reference
  * count. Please don't add new users of this function.
  */
 int iref_read(struct inode *inode)
 {
-	return atomic_read(&inode->i_count);
+	int ref;
+
+	spin_lock(&inode->i_lock);
+	ref = inode->i_ref;
+	spin_unlock(&inode->i_lock);
+	return ref;
 }
 EXPORT_SYMBOL_GPL(iref_read);
 
@@ -425,7 +439,9 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		if (inode->i_state & I_NEW)
 			continue;
 		invalidate_inode_buffers(inode);
-		if (!atomic_read(&inode->i_count)) {
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, dispose);
 			list_del_init(&inode->i_io);
 			WARN_ON(inode->i_state & I_NEW);
@@ -433,6 +449,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
+		spin_unlock(&inode->i_lock);
 		busy = 1;
 	}
 	return busy;
@@ -470,7 +487,7 @@ static int can_unuse(struct inode *inode)
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (atomic_read(&inode->i_count))
+	if (iref_read(inode))
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -506,19 +523,22 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_lru);
 
-		if (atomic_read(&inode->i_count) ||
-		    (inode->i_state & ~I_REFERENCED)) {
+		spin_lock(&inode->i_lock);
+		if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
+			spin_unlock(&inode->i_lock);
 			list_del_init(&inode->i_lru);
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
+			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_unused);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			iref_locked(inode);
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
@@ -535,7 +555,8 @@ static void prune_icache(int nr_to_scan)
 				list_move(&inode->i_lru, &inode_unused);
 				continue;
 			}
-		}
+		} else
+			spin_unlock(&inode->i_lock);
 		list_move(&inode->i_lru, &freeable);
 		list_del_init(&inode->i_io);
 		WARN_ON(inode->i_state & I_NEW);
@@ -788,7 +809,9 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
+		spin_lock(&old->i_lock);
 		iref_locked(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -835,7 +858,9 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
+		spin_lock(&old->i_lock);
 		iref_locked(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
@@ -887,9 +912,11 @@ EXPORT_SYMBOL(iunique);
 struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
-	if (!(inode->i_state & (I_FREEING|I_WILL_FREE)))
+	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
+		spin_lock(&inode->i_lock);
 		iref_locked(inode);
-	else
+		spin_unlock(&inode->i_lock);
+	} else
 		/*
 		 * Handle the case where s_op->clear_inode is not been
 		 * called yet, and somebody is calling igrab
@@ -929,7 +956,9 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, head, test, data);
 	if (inode) {
+		spin_lock(&inode->i_lock);
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
@@ -962,7 +991,9 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, head, ino);
 	if (inode) {
+		spin_lock(&inode->i_lock);
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
@@ -1145,7 +1176,9 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
+		spin_lock(&old->i_lock);
 		iref_locked(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1184,7 +1217,9 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
+		spin_lock(&old->i_lock);
 		iref_locked(old);
+		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_unhashed(&old->i_hash))) {
@@ -1324,8 +1359,16 @@ void iput(struct inode *inode)
 	if (inode) {
 		BUG_ON(inode->i_state & I_CLEAR);
 
-		if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
+		inode->i_ref--;
+		if (inode->i_ref == 0) {
+			spin_unlock(&inode->i_lock);
 			iput_final(inode);
+			return;
+		}
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 3e2f19b..d7fc5d0 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -506,8 +506,8 @@ nfs4_get_open_state(struct inode *inode, struct nfs4_state_owner *owner)
 		state->owner = owner;
 		atomic_inc(&owner->so_count);
 		list_add(&state->inode_states, &nfsi->open_states);
-		state->inode = igrab(inode);
 		spin_unlock(&inode->i_lock);
+		state->inode = igrab(inode);
 		/* Note: The reclaim code dictates that we add stateless
 		 * and read-only stateids to the end of the list */
 		list_add_tail(&state->open_states, &owner->so_states);
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 2ee524f..435ba11 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -480,7 +480,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 		inode->i_sb = sb; /* sb may be NULL for some meta data files */
 		inode->i_blkbits = nilfs->ns_blocksize_bits;
 		inode->i_flags = 0;
-		atomic_set(&inode->i_count, 1);
+		inode->i_ref = 1;
 		inode->i_nlink = 1;
 		inode->i_ino = ino;
 		inode->i_mode = S_IFREG;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 6c54e02..2fe319b 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -257,7 +257,8 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * actually evict all unreferenced inodes from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!iref_read(inode))
+		spin_lock(&inode->i_lock);
+		if (!inode->i_ref)
 			continue;
 
 		need_iput_tmp = need_iput;
@@ -268,12 +269,17 @@ void fsnotify_unmount_inodes(struct list_head *list)
 			iref_locked(inode);
 		else
 			need_iput_tmp = NULL;
+		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) && iref_read(inode) &&
-		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
-			iref_locked(next_i);
-			need_iput = next_i;
+		if (&next_i->i_sb_list != list) {
+			spin_lock(&next_i->i_lock);
+			if (next_i->i_ref &&
+			    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+				iref_locked(next_i);
+				need_iput = next_i;
+			}
+			spin_unlock(&next_i->i_lock);
 		}
 
 		/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f0df2a..1162c10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -730,7 +730,7 @@ struct inode {
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
-	atomic_t		i_count;
+	unsigned int		i_ref;
 	unsigned int		i_nlink;
 	uid_t			i_uid;
 	gid_t			i_gid;
-- 
1.7.1



* [PATCH 08/18] fs: add inode reference count read accessor
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

To remove most of the remaining direct references to the inode
reference count, add an iref_read() accessor function to read the
current reference count.  New users of this function should be
frowned upon, as there is rarely a good reason for looking at the
current reference count.
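
As a rough userspace illustration of the accessor pattern described above (not the kernel code itself), the sketch below hides a counter behind small functions so that callers never dereference the field directly. The names demo_inode, demo_iref, demo_iref_read and demo_is_sole_user are hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct inode; only the reference count is modelled. */
struct demo_inode {
	atomic_int i_count;
};

/* Read-only accessor: the one place that knows how the count is stored. */
int demo_iref_read(struct demo_inode *inode)
{
	return atomic_load(&inode->i_count);
}

/* Take an extra reference; the caller must already hold one. */
void demo_iref(struct demo_inode *inode)
{
	assert(demo_iref_read(inode) > 0);
	atomic_fetch_add(&inode->i_count, 1);
}

/* Example caller, modelled on the "is anyone else using this inode?" checks. */
int demo_is_sole_user(struct demo_inode *inode)
{
	return demo_iref_read(inode) == 1;
}
```

Because every reader goes through the accessor, the storage of the count can later change (for example from an atomic to a lock-protected integer) without touching the callers again.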

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/powerpc/platforms/cell/spufs/file.c |    2 +-
 drivers/staging/pohmelfs/inode.c         |   10 +++++-----
 fs/btrfs/inode.c                         |    6 +++---
 fs/ceph/mds_client.c                     |    2 +-
 fs/cifs/inode.c                          |    2 +-
 fs/ext3/ialloc.c                         |    4 ++--
 fs/ext4/ialloc.c                         |    4 ++--
 fs/fs-writeback.c                        |    2 +-
 fs/hpfs/inode.c                          |    2 +-
 fs/inode.c                               |   10 ++++++++++
 fs/locks.c                               |    2 +-
 fs/logfs/readwrite.c                     |    2 +-
 fs/nfs/inode.c                           |    4 ++--
 fs/notify/inode_mark.c                   |   11 +++++------
 fs/reiserfs/stree.c                      |    2 +-
 fs/smbfs/inode.c                         |    2 +-
 fs/ubifs/super.c                         |    2 +-
 fs/xfs/linux-2.6/xfs_trace.h             |    2 +-
 fs/xfs/xfs_inode.h                       |    2 +-
 include/linux/fs.h                       |    1 +
 20 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c
index 1a40da9..2e4263c 100644
--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -1549,7 +1549,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file)
 	if (ctx->owner != current->mm)
 		return -EINVAL;
 
-	if (atomic_read(&inode->i_count) != 1)
+	if (iref_read(inode) != 1)
 		return -EBUSY;
 
 	mutex_lock(&ctx->mapping_lock);
diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 97dae29..d8a308d 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1289,11 +1289,11 @@ static void pohmelfs_put_super(struct super_block *sb)
 		dprintk("%s: ino: %llu, pi: %p, inode: %p, count: %u.\n",
 				__func__, pi->ino, pi, inode, count);
 
-		if (atomic_read(&inode->i_count) != count) {
+		if (iref_read(inode) != count) {
 			printk("%s: ino: %llu, pi: %p, inode: %p, count: %u, i_count: %d.\n",
 					__func__, pi->ino, pi, inode, count,
-					atomic_read(&inode->i_count));
-			count = atomic_read(&inode->i_count);
+					iref_read(inode));
+			count = iref_read(inode);
 			in_drop_list++;
 		}
 
@@ -1305,7 +1305,7 @@ static void pohmelfs_put_super(struct super_block *sb)
 		pi = POHMELFS_I(inode);
 
 		dprintk("%s: ino: %llu, pi: %p, inode: %p, i_count: %u.\n",
-				__func__, pi->ino, pi, inode, atomic_read(&inode->i_count));
+				__func__, pi->ino, pi, inode, iref_read(inode));
 
 		/*
 		 * These are special inodes, they were created during
@@ -1313,7 +1313,7 @@ static void pohmelfs_put_super(struct super_block *sb)
 		 * so they live here with reference counter being 1 and prevent
 		 * umount from succeed since it believes that they are busy.
 		 */
-		count = atomic_read(&inode->i_count);
+		count = iref_read(inode);
 		if (count) {
 			list_del_init(&inode->i_sb_list);
 			while (count--)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c3a35b..2953e9f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2718,10 +2718,10 @@ static struct btrfs_trans_handle *__unlink_start_trans(struct inode *dir,
 		return ERR_PTR(-ENOSPC);
 
 	/* check if there is someone else holds reference */
-	if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
+	if (S_ISDIR(inode->i_mode) && iref_read(inode) > 1)
 		return ERR_PTR(-ENOSPC);
 
-	if (atomic_read(&inode->i_count) > 2)
+	if (iref_read(inode) > 2)
 		return ERR_PTR(-ENOSPC);
 
 	if (xchg(&root->fs_info->enospc_unlink, 1))
@@ -3939,7 +3939,7 @@ again:
 		inode = igrab(&entry->vfs_inode);
 		if (inode) {
 			spin_unlock(&root->inode_lock);
-			if (atomic_read(&inode->i_count) > 1)
+			if (iref_read(inode) > 1)
 				d_prune_aliases(inode);
 			/*
 			 * btrfs_drop_inode will have it removed from
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index fad95f8..b6d0ef1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1102,7 +1102,7 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
 		spin_unlock(&inode->i_lock);
 		d_prune_aliases(inode);
 		dout("trim_caps_cb %p cap %p  pruned, count now %d\n",
-		     inode, cap, atomic_read(&inode->i_count));
+		     inode, cap, iref_read(inode));
 		return 0;
 	}
 
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 63a0bdb..74cb762 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1641,7 +1641,7 @@ int cifs_revalidate_dentry(struct dentry *dentry)
 	}
 
 	cFYI(1, "Revalidate: %s inode 0x%p count %d dentry: 0x%p d_time %ld "
-		 "jiffies %ld", full_path, inode, inode->i_count.counter,
+		 "jiffies %ld", full_path, inode, iref_read(inode),
 		 dentry, dentry->d_time, jiffies);
 
 	if (CIFS_SB(sb)->tcon->unix_ext)
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..64669aa 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -100,9 +100,9 @@ void ext3_free_inode (handle_t *handle, struct inode * inode)
 	struct ext3_sb_info *sbi;
 	int fatal = 0, err;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (iref_read(inode) > 1) {
 		printk ("ext3_free_inode: inode has count=%d\n",
-					atomic_read(&inode->i_count));
+					iref_read(inode));
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 45853e0..38ac6e5 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -189,9 +189,9 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	struct ext4_sb_info *sbi;
 	int fatal = 0, err, count, cleared;
 
-	if (atomic_read(&inode->i_count) > 1) {
+	if (iref_read(inode) > 1) {
 		printk(KERN_ERR "ext4_free_inode: inode has count=%d\n",
-		       atomic_read(&inode->i_count));
+		       iref_read(inode));
 		return;
 	}
 	if (inode->i_nlink) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1bf8a28..ec7a689 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -315,7 +315,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	if (!atomic_read(&inode->i_count))
+	if (!iref_read(inode))
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 56f0da1..05b5d79 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -183,7 +183,7 @@ void hpfs_write_inode(struct inode *i)
 	struct hpfs_inode_info *hpfs_inode = hpfs_i(i);
 	struct inode *parent;
 	if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return;
-	if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) {
+	if (hpfs_inode->i_rddir_off && !iref_read(i)) {
 		if (*hpfs_inode->i_rddir_off) printk("HPFS: write_inode: some position still there\n");
 		kfree(hpfs_inode->i_rddir_off);
 		hpfs_inode->i_rddir_off = NULL;
diff --git a/fs/inode.c b/fs/inode.c
index aa66e07..b1dc6dc 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -331,6 +331,16 @@ void iref_locked(struct inode *inode)
 	atomic_inc(&inode->i_count);
 }
 
+/*
+ * Nobody outside of core code should really be looking at the inode reference
+ * count. Please don't add new users of this function.
+ */
+int iref_read(struct inode *inode)
+{
+	return atomic_read(&inode->i_count);
+}
+EXPORT_SYMBOL_GPL(iref_read);
+
 void end_writeback(struct inode *inode)
 {
 	might_sleep();
diff --git a/fs/locks.c b/fs/locks.c
index ab24d49..cbf3114 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1376,7 +1376,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
 			goto out;
 		if ((arg == F_WRLCK)
 		    && ((atomic_read(&dentry->d_count) > 1)
-			|| (atomic_read(&inode->i_count) > 1)))
+			|| (iref_read(inode) > 1)))
 			goto out;
 	}
 
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 6127baf..8beb842 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -1002,7 +1002,7 @@ static int __logfs_is_valid_block(struct inode *inode, u64 bix, u64 ofs)
 {
 	struct logfs_inode *li = logfs_inode(inode);
 
-	if ((inode->i_nlink == 0) && atomic_read(&inode->i_count) == 1)
+	if ((inode->i_nlink == 0) && iref_read(inode) == 1)
 		return 0;
 
 	if (bix < I0_BLOCKS)
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 886be68..387f4dc 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -385,7 +385,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 	dprintk("NFS: nfs_fhget(%s/%Ld ct=%d)\n",
 		inode->i_sb->s_id,
 		(long long)NFS_FILEID(inode),
-		atomic_read(&inode->i_count));
+		iref_read(inode));
 
 out:
 	return inode;
@@ -1191,7 +1191,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 
 	dfprintk(VFS, "NFS: %s(%s/%ld ct=%d info=0x%x)\n",
 			__func__, inode->i_sb->s_id, inode->i_ino,
-			atomic_read(&inode->i_count), fattr->valid);
+			iref_read(inode), fattr->valid);
 
 	if ((fattr->valid & NFS_ATTR_FATTR_FILEID) && nfsi->fileid != fattr->fileid)
 		goto out_fileid;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 8096a9e..6c54e02 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -252,12 +252,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 			continue;
 
 		/*
-		 * If i_count is zero, the inode cannot have any watches and
-		 * doing an iref/iput with MS_ACTIVE clear would actually
-		 * evict all inodes with zero i_count from icache which is
+		 * If the inode is not referenced, the inode cannot have any
+		 * watches and doing an iref/iput with MS_ACTIVE clear would
+		 * actually evict all unreferenced inodes from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		if (!atomic_read(&inode->i_count))
+		if (!iref_read(inode))
 			continue;
 
 		need_iput_tmp = need_iput;
@@ -270,8 +270,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 			need_iput_tmp = NULL;
 
 		/* In case the dropping of a reference would nuke next_i. */
-		if ((&next_i->i_sb_list != list) &&
-		    atomic_read(&next_i->i_count) &&
+		if ((&next_i->i_sb_list != list) && iref_read(inode) &&
 		    !(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
 			iref_locked(next_i);
 			need_iput = next_i;
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 313d39d..55c3ad3 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1477,7 +1477,7 @@ static int maybe_indirect_to_direct(struct reiserfs_transaction_handle *th,
 	 ** reading in the last block.  The user will hit problems trying to
 	 ** read the file, but for now we just skip the indirect2direct
 	 */
-	if (atomic_read(&inode->i_count) > 1 ||
+	if (iref_read(inode) > 1 ||
 	    !tail_has_to_be_packed(inode) ||
 	    !page || (REISERFS_I(inode)->i_flags & i_nopack_mask)) {
 		/* leave tail in an unformatted node */
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index 450c919..792593b 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -320,7 +320,7 @@ out:
 }
 
 /*
- * This routine is called when i_nlink == 0 and i_count goes to 0.
+ * This routine is called when i_nlink == 0 and the reference count goes to 0.
  * All blocking cleanup operations need to go here to avoid races.
  */
 static void
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 45888fb..a1b109c 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -342,7 +342,7 @@ static void ubifs_evict_inode(struct inode *inode)
 		goto out;
 
 	dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
-	ubifs_assert(!atomic_read(&inode->i_count));
+	ubifs_assert(!iref_read(inode));
 
 	truncate_inode_pages(&inode->i_data, 0);
 
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index be5dffd..c3940ab 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -599,7 +599,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
 	TP_fast_assign(
 		__entry->dev = VFS_I(ip)->i_sb->s_dev;
 		__entry->ino = ip->i_ino;
-		__entry->count = atomic_read(&VFS_I(ip)->i_count);
+		__entry->count = iref_read(VFS_I(ip));
 		__entry->pincount = atomic_read(&ip->i_pincount);
 		__entry->caller_ip = caller_ip;
 	),
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index cbb4791..5000660 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -481,7 +481,7 @@ void		xfs_mark_inode_dirty_sync(xfs_inode_t *);
 
 #define IHOLD(ip) \
 do { \
-	ASSERT(atomic_read(&VFS_I(ip)->i_count) > 0) ; \
+	ASSERT(iref_read(VFS_I(ip)) > 0) ; \
 	iref(VFS_I(ip)); \
 	trace_xfs_ihold(ip, _THIS_IP_); \
 } while (0)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2e971f2..6f0df2a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2186,6 +2186,7 @@ extern void unlock_new_inode(struct inode *);
 
 extern void iref(struct inode *inode);
 extern void iref_locked(struct inode *inode);
+extern int iref_read(struct inode *inode);
 extern void iget_failed(struct inode *);
 extern void end_writeback(struct inode *);
 extern void destroy_inode(struct inode *);
-- 
1.7.1



* [PATCH 13/18] fs: split locking of inode writeback and LRU lists
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

Now that the inode LRU and IO lists are split apart, we can separate
the locking for them. The IO lists are only ever accessed in the
context of writeback, so a per-BDI lock for those lists separates
them out nicely.
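
A minimal userspace sketch of the per-device locking idea, assuming pthread mutexes in place of kernel spinlocks and singly linked lists in place of list_head. All demo_* names are hypothetical. Each writeback context carries its own lock, so work on one device's lists does not contend with another's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical inode stand-in; singly linked for brevity. */
struct demo_wb_inode {
	struct demo_wb_inode *next;
};

/* One writeback context per device, with one lock covering its lists. */
struct demo_wb {
	pthread_mutex_t b_lock;          /* protects b_dirty and b_more_io */
	struct demo_wb_inode *b_dirty;
	struct demo_wb_inode *b_more_io;
};

void demo_wb_init(struct demo_wb *wb)
{
	pthread_mutex_init(&wb->b_lock, NULL);
	wb->b_dirty = NULL;
	wb->b_more_io = NULL;
}

/* Queue an inode as dirty; only this device's lock is taken. */
void demo_mark_dirty(struct demo_wb *wb, struct demo_wb_inode *inode)
{
	assert(inode != NULL);
	pthread_mutex_lock(&wb->b_lock);
	inode->next = wb->b_dirty;
	wb->b_dirty = inode;
	pthread_mutex_unlock(&wb->b_lock);
}

/* Move the first dirty inode to b_more_io; returns 1 if one was moved. */
int demo_requeue_io(struct demo_wb *wb)
{
	struct demo_wb_inode *inode;
	int moved = 0;

	pthread_mutex_lock(&wb->b_lock);
	inode = wb->b_dirty;
	if (inode) {
		wb->b_dirty = inode->next;
		inode->next = wb->b_more_io;
		wb->b_more_io = inode;
		moved = 1;
	}
	pthread_mutex_unlock(&wb->b_lock);
	return moved;
}

/* Count the entries on one list; the caller must hold b_lock. */
static int demo_len_locked(const struct demo_wb_inode *head)
{
	int n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

int demo_nr_dirty(struct demo_wb *wb)
{
	int n;

	pthread_mutex_lock(&wb->b_lock);
	n = demo_len_locked(wb->b_dirty);
	pthread_mutex_unlock(&wb->b_lock);
	return n;
}

int demo_nr_more_io(struct demo_wb *wb)
{
	int n;

	pthread_mutex_lock(&wb->b_lock);
	n = demo_len_locked(wb->b_more_io);
	pthread_mutex_unlock(&wb->b_lock);
	return n;
}
```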

For the inode LRU, introduce a simple global lock to protect it.
While this could be made per-sb, it is not yet clear what the next
steps for optimising/parallelising inode reclaim are. Rather than
optimise now, leave it as a global list and lock until further
analysis can be done.
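
The global LRU lock can be pictured with the userspace sketch below. It mirrors the shape of the add/del helpers this patch introduces, but uses a pthread mutex and a hand-rolled circular list; every demo_* name and the DEMO_I_FREEING flag value are assumptions made for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEMO_I_FREEING 0x1u   /* hypothetical flag value */

struct demo_lru_inode {
	struct demo_lru_inode *prev, *next;   /* self-linked when off the LRU */
	unsigned int i_state;
};

/* One global list head, one global lock and one counter for all inodes. */
static struct demo_lru_inode demo_lru = { &demo_lru, &demo_lru, 0 };
static pthread_mutex_t demo_lru_lock = PTHREAD_MUTEX_INITIALIZER;
static long demo_nr_unused;

void demo_lru_inode_init(struct demo_lru_inode *inode)
{
	inode->prev = inode->next = inode;
	inode->i_state = 0;
}

static bool demo_on_lru(const struct demo_lru_inode *inode)
{
	return inode->next != inode;
}

/* Add to the front of the LRU unless already there or being freed. */
void demo_lru_add(struct demo_lru_inode *inode)
{
	pthread_mutex_lock(&demo_lru_lock);
	if (!demo_on_lru(inode) && !(inode->i_state & DEMO_I_FREEING)) {
		inode->next = demo_lru.next;
		inode->prev = &demo_lru;
		demo_lru.next->prev = inode;
		demo_lru.next = inode;
		demo_nr_unused++;
	}
	pthread_mutex_unlock(&demo_lru_lock);
}

/* Remove from the LRU if present; safe to call more than once. */
void demo_lru_del(struct demo_lru_inode *inode)
{
	pthread_mutex_lock(&demo_lru_lock);
	if (demo_on_lru(inode)) {
		assert(demo_nr_unused > 0);
		inode->prev->next = inode->next;
		inode->next->prev = inode->prev;
		inode->prev = inode->next = inode;
		demo_nr_unused--;
	}
	pthread_mutex_unlock(&demo_lru_lock);
}

long demo_lru_count(void)
{
	long n;

	pthread_mutex_lock(&demo_lru_lock);
	n = demo_nr_unused;
	pthread_mutex_unlock(&demo_lru_lock);
	return n;
}
```

Keeping the list, the counter and the membership test behind two helpers means a later switch to per-sb lists would only need to change these functions.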

Based on a patch originally from Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c           |   48 +++++++++++++-------
 fs/inode.c                  |  101 ++++++++++++++++++++++++++++++++++--------
 fs/internal.h               |    6 +++
 fs/super.c                  |    2 +-
 include/linux/backing-dev.h |    1 +
 include/linux/writeback.h   |   12 ++++-
 mm/backing-dev.c            |   21 +++++++++
 7 files changed, 150 insertions(+), 41 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 29f8032..49d44cc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -69,16 +69,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
 	return test_bit(BDI_writeback_running, &bdi->state);
 }
 
-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	if (strcmp(sb->s_type->name, "bdev") == 0)
-		return inode->i_mapping->a_bdi;
-
-	return sb->s_bdi;
-}
-
 static void bdi_queue_work(struct backing_dev_info *bdi,
 		struct wb_writeback_work *work)
 {
@@ -169,6 +159,7 @@ static void redirty_tail(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -186,6 +177,7 @@ static void requeue_io(struct inode *inode)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
+	assert_spin_locked(&wb->b_lock);
 	list_move(&inode->i_io, &wb->b_more_io);
 }
 
@@ -268,6 +260,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
+	assert_spin_locked(&wb->b_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
@@ -311,6 +304,7 @@ static void inode_wait_for_writeback(struct inode *inode)
 static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct address_space *mapping = inode->i_mapping;
 	unsigned dirty;
 	int ret;
@@ -330,7 +324,9 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
+			spin_unlock(&bdi->wb.b_lock);
 			return 0;
 		}
 
@@ -385,6 +381,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
@@ -400,6 +397,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 				 */
 				redirty_tail(inode);
 			}
+			spin_unlock(&bdi->wb.b_lock);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
@@ -407,14 +405,15 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
+			spin_unlock(&bdi->wb.b_lock);
 		} else {
 			/* The inode is clean */
+			spin_lock(&bdi->wb.b_lock);
 			list_del_init(&inode->i_io);
-			if (list_empty(&inode->i_lru)) {
-				list_add(&inode->i_lru, &inode_unused);
-				percpu_counter_inc(&nr_inodes_unused);
-			}
+			spin_unlock(&bdi->wb.b_lock);
+			inode_lru_list_add(inode);
 		}
 	}
 	inode_sync_complete(inode);
@@ -460,6 +459,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
 static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		struct writeback_control *wbc, bool only_this_sb)
 {
+	assert_spin_locked(&wb->b_lock);
 	while (!list_empty(&wb->b_io)) {
 		long pages_skipped;
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -475,7 +475,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 				redirty_tail(inode);
 				continue;
 			}
-
 			/*
 			 * The inode belongs to a different superblock.
 			 * Bounce back to the caller to unpin this and
@@ -484,7 +483,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 0;
 		}
 
-		if (inode->i_state & (I_NEW | I_WILL_FREE)) {
+		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
 			requeue_io(inode);
 			continue;
 		}
@@ -495,8 +494,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		if (inode_dirtied_after(inode, wbc->wb_start))
 			return 1;
 
-		BUG_ON(inode->i_state & I_FREEING);
+		spin_lock(&inode->i_lock);
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&wb->b_lock);
+
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
 		if (wbc->pages_skipped != pages_skipped) {
@@ -504,12 +506,15 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
 			 */
+			spin_lock(&wb->b_lock);
 			redirty_tail(inode);
+			spin_unlock(&wb->b_lock);
 		}
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
+		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
 			return 1;
@@ -529,6 +534,8 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
+
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -547,6 +554,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 		if (ret)
 			break;
 	}
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
@@ -557,9 +565,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 }
 
@@ -672,8 +682,10 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
+			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
 						struct inode, i_io);
+			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			inode_wait_for_writeback(inode);
 		}
@@ -986,8 +998,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
+			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_io, &bdi->wb.b_dirty);
+			spin_unlock(&bdi->wb.b_lock);
 		}
 	}
 out:
diff --git a/fs/inode.c b/fs/inode.c
index e6bb36d..4ad7900 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -35,6 +35,10 @@
  *   inode hash table, i_hash
  * sb inode lock protects:
  *   s_inodes, i_sb_list
+ * bdi writeback lock protects:
+ *   b_io, b_more_io, b_dirty, i_io
+ * inode_lru_lock protects:
+ *   inode_lru, i_lru
  *
  * Lock orders
  * inode_lock
@@ -43,7 +47,9 @@
  *
  * inode_lock
  *   sb inode lock
- *     inode->i_lock
+ *     inode_lru_lock
+ *       wb->b_lock
+ *         inode->i_lock
  */
 /*
  * This is needed for the following functions:
@@ -92,7 +98,8 @@ static unsigned int i_hash_shift __read_mostly;
  * allowing for low-overhead inode sync() operations.
  */
 
-LIST_HEAD(inode_unused);
+static LIST_HEAD(inode_lru);
+static DEFINE_SPINLOCK(inode_lru_lock);
 
 struct inode_hash_bucket {
 	struct hlist_bl_head head;
@@ -383,6 +390,30 @@ int iref_read(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(iref_read);
 
+/*
+ * check against I_FREEING as inode writeback completion could race with
+ * setting the I_FREEING and removing the inode from the LRU.
+ */
+void inode_lru_list_add(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	if (list_empty(&inode->i_lru) && !(inode->i_state & I_FREEING)) {
+		list_add(&inode->i_lru, &inode_lru);
+		percpu_counter_inc(&nr_inodes_unused);
+	}
+	spin_unlock(&inode_lru_lock);
+}
+
+void inode_lru_list_del(struct inode *inode)
+{
+	spin_lock(&inode_lru_lock);
+	if (!list_empty(&inode->i_lru)) {
+		list_del_init(&inode->i_lru);
+		percpu_counter_dec(&nr_inodes_unused);
+	}
+	spin_unlock(&inode_lru_lock);
+}
+
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
 	unsigned long tmp;
@@ -535,11 +566,26 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		invalidate_inode_buffers(inode);
 		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
+			struct backing_dev_info *bdi = inode_to_bdi(inode);
+
 			spin_unlock(&inode->i_lock);
-			list_move(&inode->i_lru, dispose);
-			list_del_init(&inode->i_io);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
+
+
+			/*
+			 * move the inode off the IO lists and LRU once
+			 * I_FREEING is set so that it won't get moved back on
+			 * there if it is dirty.
+			 */
+			spin_lock(&bdi->wb.b_lock);
+			list_del_init(&inode->i_io);
+			spin_unlock(&bdi->wb.b_lock);
+
+			spin_lock(&inode_lru_lock);
+			list_move(&inode->i_lru, dispose);
+			spin_unlock(&inode_lru_lock);
+
 			percpu_counter_dec(&nr_inodes_unused);
 			continue;
 		}
@@ -596,7 +642,7 @@ static int can_unuse(struct inode *inode)
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
+ * the front of the inode_lru list.  So look for it there and if the
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
@@ -611,13 +657,15 @@ static void prune_icache(int nr_to_scan)
 
 	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
+	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
+		struct backing_dev_info *bdi;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_lru);
+		inode = list_entry(inode_lru.prev, struct inode, i_lru);
 
 		spin_lock(&inode->i_lock);
 		if (inode->i_ref || (inode->i_state & ~I_REFERENCED)) {
@@ -628,19 +676,21 @@ static void prune_icache(int nr_to_scan)
 		}
 		if (inode->i_state & I_REFERENCED) {
 			spin_unlock(&inode->i_lock);
-			list_move(&inode->i_lru, &inode_unused);
+			list_move(&inode->i_lru, &inode_lru);
 			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
 			iref_locked(inode);
 			spin_unlock(&inode->i_lock);
+			spin_unlock(&inode_lru_lock);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
 			spin_lock(&inode_lock);
+			spin_lock(&inode_lru_lock);
 
 			/*
 			 * if we can't reclaim this inod immediately, give it
@@ -648,21 +698,32 @@ static void prune_icache(int nr_to_scan)
 			 * on it.
 			 */
 			if (!can_unuse(inode)) {
-				list_move(&inode->i_lru, &inode_unused);
+				list_move(&inode->i_lru, &inode_lru);
 				continue;
 			}
 		} else
 			spin_unlock(&inode->i_lock);
-		list_move(&inode->i_lru, &freeable);
-		list_del_init(&inode->i_io);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+
+		/*
+		 * move the inode off the IO lists and LRU once
+		 * I_FREEING is set so that it won't get moved back on
+		 * there if it is dirty.
+		 */
+		bdi = inode_to_bdi(inode);
+		spin_lock(&bdi->wb.b_lock);
+		list_del_init(&inode->i_io);
+		spin_unlock(&bdi->wb.b_lock);
+
+		list_move(&inode->i_lru, &freeable);
 		percpu_counter_dec(&nr_inodes_unused);
 	}
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
+	spin_unlock(&inode_lru_lock);
 	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
@@ -1369,6 +1430,7 @@ static void iput_final(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	const struct super_operations *op = inode->i_sb->s_op;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int drop;
 
 	if (op && op->drop_inode)
@@ -1381,8 +1443,7 @@ static void iput_final(struct inode *inode)
 			inode->i_state |= I_REFERENCED;
 			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
 			    list_empty(&inode->i_lru)) {
-				list_add(&inode->i_lru, &inode_unused);
-				percpu_counter_inc(&nr_inodes_unused);
+				inode_lru_list_add(inode);
 			}
 			spin_unlock(&inode_lock);
 			return;
@@ -1396,19 +1457,19 @@ static void iput_final(struct inode *inode)
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
-	list_del_init(&inode->i_io);
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
 
 	/*
-	 * We avoid moving dirty inodes back onto the LRU now because I_FREEING
-	 * is set and hence writeback_single_inode() won't move the inode
+	 * move the inode off the IO lists and LRU once I_FREEING is set so
+	 * that it won't get moved back on there if it is dirty.
 	 * around.
 	 */
-	if (!list_empty(&inode->i_lru)) {
-		list_del_init(&inode->i_lru);
-		percpu_counter_dec(&nr_inodes_unused);
-	}
+	spin_lock(&bdi->wb.b_lock);
+	list_del_init(&inode->i_io);
+	spin_unlock(&bdi->wb.b_lock);
+
+	inode_lru_list_del(inode);
 
 	spin_lock(&sb->s_inodes_lock);
 	list_del_init(&inode->i_sb_list);
diff --git a/fs/internal.h b/fs/internal.h
index a6910e9..ece3565 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -101,3 +101,9 @@ extern void put_super(struct super_block *sb);
 struct nameidata;
 extern struct file *nameidata_to_filp(struct nameidata *);
 extern void release_open_intent(struct nameidata *);
+
+/*
+ * inode.c
+ */
+extern void inode_lru_list_add(struct inode *inode);
+extern void inode_lru_list_del(struct inode *inode);
diff --git a/fs/super.c b/fs/super.c
index d826214..c5332e5 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,7 +76,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
-		spin_lock_init(&(s->s_inodes_lock);
+		spin_lock_init(&s->s_inodes_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 31e1346..5106fc4 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -57,6 +57,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	spinlock_t b_lock;		/* writeback lists lock */
 };
 
 struct backing_dev_info {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index f7ed2a0..b182ccc 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -10,10 +10,7 @@
 struct backing_dev_info;
 
 extern spinlock_t inode_lock;
-extern struct list_head inode_unused;
 
-extern struct percpu_counter nr_inodes;
-extern struct percpu_counter nr_inodes_unused;
 
 /*
  * fs/fs-writeback.c
@@ -82,6 +79,15 @@ static inline void inode_sync_wait(struct inode *inode)
 							TASK_UNINTERRUPTIBLE);
 }
 
+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (strcmp(sb->s_type->name, "bdev") == 0)
+		return inode->i_mapping->a_bdi;
+
+	return sb->s_bdi;
+}
 
 /*
  * mm/page-writeback.c
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a124991..74e8269 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -74,12 +74,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&inode_lock);
+	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_io)
 		nr_dirty++;
 	list_for_each_entry(inode, &wb->b_io, i_io)
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_io)
 		nr_more_io++;
+	spin_unlock(&wb->b_lock);
 	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -634,6 +636,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	spin_lock_init(&wb->b_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
@@ -671,6 +674,18 @@ err:
 }
 EXPORT_SYMBOL(bdi_init);
 
+static void bdi_lock_two(struct backing_dev_info *bdi1,
+				struct backing_dev_info *bdi2)
+{
+	if (bdi1 < bdi2) {
+		spin_lock(&bdi1->wb.b_lock);
+		spin_lock_nested(&bdi2->wb.b_lock, 1);
+	} else {
+		spin_lock(&bdi2->wb.b_lock);
+		spin_lock_nested(&bdi1->wb.b_lock, 1);
+	}
+}
+
 void mapping_set_bdi(struct address_space *mapping,
 				struct backing_dev_info *bdi)
 {
@@ -681,6 +696,7 @@ void mapping_set_bdi(struct address_space *mapping,
 		return;
 
 	spin_lock(&inode_lock);
+	bdi_lock_two(bdi, old);
 	if (!list_empty(&inode->i_io)) {
 		struct inode *i;
 
@@ -709,6 +725,8 @@ void mapping_set_bdi(struct address_space *mapping,
 	}
 found:
 	mapping->a_bdi = bdi;
+	spin_unlock(&bdi->wb.b_lock);
+	spin_unlock(&old->wb.b_lock);
 	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(mapping_set_bdi);
@@ -726,6 +744,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct inode *i, *tmp;
 
 		spin_lock(&inode_lock);
+		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
 			list_del(&i->i_io);
 			list_add_tail(&i->i_io, &dst->b_dirty);
@@ -741,6 +760,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
 			list_add_tail(&i->i_io, &dst->b_more_io);
 			i->i_mapping->a_bdi = bdi;
 		}
+		spin_unlock(&bdi->wb.b_lock);
+		spin_unlock(&dst->b_lock);
 		spin_unlock(&inode_lock);
 	}
 
-- 
1.7.1



* [PATCH 03/18] fs: keep inode with backing-dev
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Nick Piggin <npiggin@suse.de>

Having an inode on the writeback lists of a different bdi than
inode->i_mapping->backing_dev_info makes it very difficult to do
per-bdi locking of the writeback lists. Add mapping_set_bdi() to move
such an inode over to the new bdi's lists when the mapping's backing
dev is changed, and mapping_new_set_bdi() to set the pointer on a
newly set up mapping.

Also, rename i_mapping.backing_dev_info to i_mapping.a_bdi while we're
here. Succinct is nice, and it catches conversion errors.
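
The list move described above can be modelled in user space. What follows
is only a rough sketch, not the kernel code: every toy_* name is
hypothetical, the lists are simplified to singly linked lists, and the
locking (inode_lock in the patch) is left out entirely. It shows the idea
of finding which writeback list of the old bdi holds the inode, putting
the inode on the same list of the new bdi, and only then updating a_bdi.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for b_dirty, b_io and b_more_io. */
enum { WB_DIRTY, WB_IO, WB_MORE_IO, WB_NR };

struct toy_inode;

struct toy_bdi {
	struct toy_inode *lists[WB_NR];	/* heads of the writeback lists */
};

struct toy_inode {
	struct toy_bdi *a_bdi;		/* models mapping->a_bdi */
	struct toy_inode *next;		/* models the list linkage */
	int queued;			/* non-zero once on a writeback list */
};

/* Unlink inode from the list at *head; return 1 if it was there. */
int toy_unlink(struct toy_inode **head, struct toy_inode *inode)
{
	struct toy_inode **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == inode) {
			*pp = inode->next;
			inode->next = NULL;
			return 1;
		}
	}
	return 0;
}

void toy_push(struct toy_inode **head, struct toy_inode *inode)
{
	inode->next = *head;
	*head = inode;
}

/* Return 1 if inode is on the given list of bdi. */
int toy_on_list(struct toy_bdi *bdi, int list, struct toy_inode *inode)
{
	struct toy_inode *i;

	for (i = bdi->lists[list]; i; i = i->next)
		if (i == inode)
			return 1;
	return 0;
}

/* Models mapping_new_set_bdi(): not on any list yet, just assign. */
void toy_new_set_bdi(struct toy_inode *inode, struct toy_bdi *bdi)
{
	inode->a_bdi = bdi;
}

/* Queue inode on one writeback list of its current bdi. */
void toy_queue(struct toy_inode *inode, int list)
{
	toy_push(&inode->a_bdi->lists[list], inode);
	inode->queued = 1;
}

/* Models mapping_set_bdi(): keep list membership and a_bdi in sync. */
void toy_set_bdi(struct toy_inode *inode, struct toy_bdi *bdi)
{
	struct toy_bdi *old = inode->a_bdi;
	int l;

	if (old == bdi)
		return;

	if (inode->queued) {
		for (l = 0; l < WB_NR; l++) {
			if (toy_unlink(&old->lists[l], inode)) {
				toy_push(&bdi->lists[l], inode);
				break;
			}
		}
		/* Plays the role of BUG(): queued but on no list of old. */
		assert(l < WB_NR);
	}
	inode->a_bdi = bdi;
}
```

A caller would use toy_new_set_bdi() while setting up an inode, and
toy_set_bdi() for any later change, so that an inode is never left on the
lists of a bdi other than the one a_bdi points at.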

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/char/mem.c          |    2 +-
 drivers/char/raw.c          |    2 +-
 drivers/mtd/mtdchar.c       |    2 +-
 fs/afs/write.c              |    6 ++--
 fs/block_dev.c              |   13 +++++----
 fs/btrfs/disk-io.c          |    2 +-
 fs/btrfs/file.c             |    2 +-
 fs/btrfs/inode.c            |   10 +++---
 fs/buffer.c                 |    2 +-
 fs/ceph/addr.c              |    2 +-
 fs/ceph/inode.c             |    4 +-
 fs/cifs/file.c              |    2 +-
 fs/cifs/inode.c             |    2 +-
 fs/configfs/inode.c         |    3 +-
 fs/ext2/ialloc.c            |    2 +-
 fs/fs-writeback.c           |    2 +-
 fs/fuse/file.c              |    6 ++--
 fs/fuse/inode.c             |    2 +-
 fs/gfs2/glock.c             |    3 +-
 fs/hugetlbfs/inode.c        |    3 +-
 fs/inode.c                  |    6 ++--
 fs/nfs/inode.c              |    3 +-
 fs/nfs/write.c              |    7 ++---
 fs/nilfs2/btnode.c          |    2 +-
 fs/nilfs2/mdt.c             |    2 +-
 fs/nilfs2/the_nilfs.c       |    2 +-
 fs/ntfs/file.c              |    2 +-
 fs/ocfs2/dlmfs/dlmfs.c      |    4 +-
 fs/ocfs2/file.c             |    2 +-
 fs/ramfs/inode.c            |    2 +-
 fs/romfs/super.c            |    4 +-
 fs/sysfs/inode.c            |    2 +-
 fs/ubifs/dir.c              |    2 +-
 fs/ubifs/super.c            |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c  |    4 +-
 fs/xfs/linux-2.6/xfs_file.c |    2 +-
 include/linux/backing-dev.h |   16 ++++++++---
 include/linux/fs.h          |    2 +-
 kernel/cgroup.c             |    2 +-
 mm/backing-dev.c            |   61 ++++++++++++++++++++++++++++++++++++++++--
 mm/fadvise.c                |    4 +-
 mm/filemap.c                |    4 +-
 mm/filemap_xip.c            |    2 +-
 mm/page-writeback.c         |   15 +++++-----
 mm/readahead.c              |    6 ++--
 mm/shmem.c                  |    2 +-
 mm/swap.c                   |    2 +-
 mm/swap_state.c             |    2 +-
 mm/swapfile.c               |    2 +-
 mm/truncate.c               |    3 +-
 mm/vmscan.c                 |    2 +-
 51 files changed, 155 insertions(+), 90 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 1f528fa..2285c1e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -872,7 +872,7 @@ static int memory_open(struct inode *inode, struct file *filp)
 
 	filp->f_op = dev->fops;
 	if (dev->dev_info)
-		filp->f_mapping->backing_dev_info = dev->dev_info;
+		mapping_set_bdi(filp->f_mapping, dev->dev_info);
 
 	if (dev->fops->open)
 		return dev->fops->open(inode, filp);
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index b38942f..5baa83f 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -109,7 +109,7 @@ static int raw_release(struct inode *inode, struct file *filp)
 	if (--raw_devices[minor].inuse == 0) {
 		/* Here  inode->i_mapping == bdev->bd_inode->i_mapping  */
 		inode->i_mapping = &inode->i_data;
-		inode->i_mapping->backing_dev_info = &default_backing_dev_info;
+		mapping_set_bdi(inode->i_mapping, &default_backing_dev_info);
 	}
 	mutex_unlock(&raw_mutex);
 
diff --git a/drivers/mtd/mtdchar.c b/drivers/mtd/mtdchar.c
index a825002..26af8b1 100644
--- a/drivers/mtd/mtdchar.c
+++ b/drivers/mtd/mtdchar.c
@@ -113,7 +113,7 @@ static int mtd_open(struct inode *inode, struct file *file)
 	if (mtd_ino->i_state & I_NEW) {
 		mtd_ino->i_private = mtd;
 		mtd_ino->i_mode = S_IFCHR;
-		mtd_ino->i_data.backing_dev_info = mtd->backing_dev_info;
+		mapping_new_set_bdi(&mtd_ino->i_data, mtd->backing_dev_info);
 		unlock_new_inode(mtd_ino);
 	}
 	file->f_mapping = mtd_ino->i_mapping;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 722743b..b321bfc 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -438,7 +438,7 @@ no_more:
  */
 int afs_writepage(struct page *page, struct writeback_control *wbc)
 {
-	struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+	struct backing_dev_info *bdi = page->mapping->a_bdi;
 	struct afs_writeback *wb;
 	int ret;
 
@@ -469,7 +469,7 @@ static int afs_writepages_region(struct address_space *mapping,
 				 struct writeback_control *wbc,
 				 pgoff_t index, pgoff_t end, pgoff_t *_next)
 {
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct backing_dev_info *bdi = mapping->a_bdi;
 	struct afs_writeback *wb;
 	struct page *page;
 	int ret, n;
@@ -548,7 +548,7 @@ static int afs_writepages_region(struct address_space *mapping,
 int afs_writepages(struct address_space *mapping,
 		   struct writeback_control *wbc)
 {
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct backing_dev_info *bdi = mapping->a_bdi;
 	pgoff_t start, end, next;
 	int ret;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 50e8c85..ac070d7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -533,7 +533,7 @@ struct block_device *bdget(dev_t dev)
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
 		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
-		inode->i_data.backing_dev_info = &default_backing_dev_info;
+		mapping_new_set_bdi(&inode->i_data, &default_backing_dev_info);
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
 		spin_unlock(&bdev_lock);
@@ -1390,7 +1390,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 				bdi = blk_get_backing_dev_info(bdev);
 				if (bdi == NULL)
 					bdi = &default_backing_dev_info;
-				bdev->bd_inode->i_data.backing_dev_info = bdi;
+				mapping_set_bdi(&bdev->bd_inode->i_data, bdi);
 			}
 			if (bdev->bd_invalidated)
 				rescan_partitions(disk, bdev);
@@ -1405,8 +1405,8 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 			if (ret)
 				goto out_clear;
 			bdev->bd_contains = whole;
-			bdev->bd_inode->i_data.backing_dev_info =
-			   whole->bd_inode->i_data.backing_dev_info;
+			mapping_set_bdi(&bdev->bd_inode->i_data,
+			   whole->bd_inode->i_data.a_bdi);
 			bdev->bd_part = disk_get_part(disk, partno);
 			if (!(disk->flags & GENHD_FL_UP) ||
 			    !bdev->bd_part || !bdev->bd_part->nr_sects) {
@@ -1439,7 +1439,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 	disk_put_part(bdev->bd_part);
 	bdev->bd_disk = NULL;
 	bdev->bd_part = NULL;
-	bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+	mapping_set_bdi(&bdev->bd_inode->i_data, &default_backing_dev_info);
 	if (bdev != bdev->bd_contains)
 		__blkdev_put(bdev->bd_contains, mode, 1);
 	bdev->bd_contains = NULL;
@@ -1533,7 +1533,8 @@ static int __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		disk_put_part(bdev->bd_part);
 		bdev->bd_part = NULL;
 		bdev->bd_disk = NULL;
-		bdev->bd_inode->i_data.backing_dev_info = &default_backing_dev_info;
+		mapping_set_bdi(&bdev->bd_inode->i_data,
+				&default_backing_dev_info);
 		if (bdev != bdev->bd_contains)
 			victim = bdev->bd_contains;
 		bdev->bd_contains = NULL;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..05c3fc7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1636,7 +1636,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	 */
 	fs_info->btree_inode->i_size = OFFSET_MAX;
 	fs_info->btree_inode->i_mapping->a_ops = &btree_aops;
-	fs_info->btree_inode->i_mapping->backing_dev_info = &fs_info->bdi;
+	mapping_new_set_bdi(fs_info->btree_inode->i_mapping, &fs_info->bdi);
 
 	RB_CLEAR_NODE(&BTRFS_I(fs_info->btree_inode)->rb_node);
 	extent_io_tree_init(&BTRFS_I(fs_info->btree_inode)->io_tree,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e354c33..96e3883 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -872,7 +872,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 		goto out;
 	count = ocount;
 
-	current->backing_dev_info = inode->i_mapping->backing_dev_info;
+	current->backing_dev_info = inode->i_mapping->a_bdi;
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
 		goto out;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c038644..c646c0c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2475,7 +2475,7 @@ static void btrfs_read_locked_inode(struct inode *inode)
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
 		inode->i_mapping->a_ops = &btrfs_aops;
-		inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+		mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
 		BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
 		inode->i_fop = &btrfs_file_operations;
 		inode->i_op = &btrfs_file_inode_operations;
@@ -2490,7 +2490,7 @@ static void btrfs_read_locked_inode(struct inode *inode)
 	case S_IFLNK:
 		inode->i_op = &btrfs_symlink_inode_operations;
 		inode->i_mapping->a_ops = &btrfs_symlink_aops;
-		inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+		mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
 		break;
 	default:
 		inode->i_op = &btrfs_special_inode_operations;
@@ -4705,7 +4705,7 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry,
 		drop_inode = 1;
 	else {
 		inode->i_mapping->a_ops = &btrfs_aops;
-		inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+		mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
 		inode->i_fop = &btrfs_file_operations;
 		inode->i_op = &btrfs_file_inode_operations;
 		BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6699,7 +6699,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
 		drop_inode = 1;
 	else {
 		inode->i_mapping->a_ops = &btrfs_aops;
-		inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+		mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
 		inode->i_fop = &btrfs_file_operations;
 		inode->i_op = &btrfs_file_inode_operations;
 		BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
@@ -6739,7 +6739,7 @@ static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
 
 	inode->i_op = &btrfs_symlink_inode_operations;
 	inode->i_mapping->a_ops = &btrfs_symlink_aops;
-	inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+	mapping_new_set_bdi(inode->i_mapping, &root->fs_info->bdi);
 	inode_set_bytes(inode, name_len);
 	btrfs_i_size_write(inode, name_len - 1);
 	err = btrfs_update_inode(trans, root, inode);
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..b5c4153 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3161,7 +3161,7 @@ void block_sync_page(struct page *page)
 	smp_mb();
 	mapping = page_mapping(page);
 	if (mapping)
-		blk_run_backing_dev(mapping->backing_dev_info, page);
+		blk_run_backing_dev(mapping->a_bdi, page);
 }
 EXPORT_SYMBOL(block_sync_page);
 
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index efbc604..448400a 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -588,7 +588,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 				 struct writeback_control *wbc)
 {
 	struct inode *inode = mapping->host;
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct backing_dev_info *bdi = mapping->a_bdi;
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_client *client;
 	pgoff_t index, start, end;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 62377ec..e427082 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -624,8 +624,8 @@ static int fill_inode(struct inode *inode,
 	}
 
 	inode->i_mapping->a_ops = &ceph_aops;
-	inode->i_mapping->backing_dev_info =
-		&ceph_sb_to_client(inode->i_sb)->backing_dev_info;
+	mapping_new_set_bdi(inode->i_mapping,
+			&ceph_sb_to_client(inode->i_sb)->backing_dev_info);
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFIFO:
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index de748c6..3673e66 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1337,7 +1337,7 @@ static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
 static int cifs_writepages(struct address_space *mapping,
 			   struct writeback_control *wbc)
 {
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct backing_dev_info *bdi = mapping->a_bdi;
 	unsigned int bytes_to_write;
 	unsigned int bytes_written;
 	struct cifs_sb_info *cifs_sb;
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 53cce8c..63a0bdb 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -802,7 +802,7 @@ retry_iget5_locked:
 		if (inode->i_state & I_NEW) {
 			inode->i_ino = hash;
 			if (S_ISREG(inode->i_mode))
-				inode->i_data.backing_dev_info = sb->s_bdi;
+				inode->i_data.a_bdi = sb->s_bdi;
 #ifdef CONFIG_CIFS_FSCACHE
 			/* initialize per-inode cache cookie pointer */
 			CIFS_I(inode)->fscache = NULL;
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index cf78d44..40b2bec 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -136,7 +136,8 @@ struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent * sd)
 	struct inode * inode = new_inode(configfs_sb);
 	if (inode) {
 		inode->i_mapping->a_ops = &configfs_aops;
-		inode->i_mapping->backing_dev_info = &configfs_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping,
+				&configfs_backing_dev_info);
 		inode->i_op = &configfs_inode_operations;
 
 		if (sd->s_iattr) {
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index ad70479..29942f0 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -172,7 +172,7 @@ static void ext2_preread_inode(struct inode *inode)
 	struct ext2_group_desc * gdp;
 	struct backing_dev_info *bdi;
 
-	bdi = inode->i_mapping->backing_dev_info;
+	bdi = inode->i_mapping->a_bdi;
 	if (bdi_read_congested(bdi))
 		return;
 	if (bdi_write_congested(bdi))
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 58a95b7..3209aff 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -74,7 +74,7 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 	struct super_block *sb = inode->i_sb;
 
 	if (strcmp(sb->s_type->name, "bdev") == 0)
-		return inode->i_mapping->backing_dev_info;
+		return inode->i_mapping->a_bdi;
 
 	return sb->s_bdi;
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c822458..193a0d1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -945,7 +945,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = mapping->a_bdi;
 
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
@@ -1133,7 +1133,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 {
 	struct inode *inode = req->inode;
 	struct fuse_inode *fi = get_fuse_inode(inode);
-	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+	struct backing_dev_info *bdi = inode->i_mapping->a_bdi;
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
@@ -1247,7 +1247,7 @@ static int fuse_writepage_locked(struct page *page)
 	req->end = fuse_writepage_end;
 	req->inode = inode;
 
-	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	inc_bdi_stat(mapping->a_bdi, BDI_WRITEBACK);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index da9e6e1..5cf105c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -256,7 +256,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
 	if ((inode->i_state & I_NEW)) {
 		inode->i_flags |= S_NOATIME|S_NOCMTIME;
 		inode->i_generation = generation;
-		inode->i_data.backing_dev_info = &fc->bdi;
+		mapping_new_set_bdi(&inode->i_data, &fc->bdi);
 		fuse_init_inode(inode, attr);
 		unlock_new_inode(inode);
 	} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 9adf8f9..c8f4c50 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -8,6 +8,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/backing-dev.h>
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/buffer_head.h>
@@ -797,7 +798,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
 		mapping->flags = 0;
 		mapping_set_gfp_mask(mapping, GFP_NOFS);
 		mapping->assoc_mapping = NULL;
-		mapping->backing_dev_info = s->s_bdi;
+		mapping_new_set_bdi(mapping, s->s_bdi);
 		mapping->writeback_index = 0;
 	}
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 6e5bd42..a37920a 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -459,7 +459,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 		inode->i_uid = uid;
 		inode->i_gid = gid;
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
-		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping,
+				&hugetlbfs_backing_dev_info);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
 		info = HUGETLBFS_I(inode);
diff --git a/fs/inode.c b/fs/inode.c
index f04d501..22ef3f1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -201,7 +201,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->flags = 0;
 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->assoc_mapping = NULL;
-	mapping->backing_dev_info = &default_backing_dev_info;
+	mapping_new_set_bdi(mapping, &default_backing_dev_info);
 	mapping->writeback_index = 0;
 
 	/*
@@ -212,8 +212,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	if (sb->s_bdev) {
 		struct backing_dev_info *bdi;
 
-		bdi = sb->s_bdev->bd_inode->i_mapping->backing_dev_info;
-		mapping->backing_dev_info = bdi;
+		bdi = sb->s_bdev->bd_inode->i_mapping->a_bdi;
+		mapping_new_set_bdi(mapping, bdi);
 	}
 	inode->i_private = NULL;
 	inode->i_mapping = mapping;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..886be68 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -287,7 +287,8 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 		if (S_ISREG(inode->i_mode)) {
 			inode->i_fop = &nfs_file_operations;
 			inode->i_data.a_ops = &nfs_file_aops;
-			inode->i_data.backing_dev_info = &NFS_SB(sb)->backing_dev_info;
+			mapping_new_set_bdi(&inode->i_data,
+						&NFS_SB(sb)->backing_dev_info);
 		} else if (S_ISDIR(inode->i_mode)) {
 			inode->i_op = NFS_SB(sb)->nfs_client->rpc_ops->dir_inode_ops;
 			inode->i_fop = &nfs_dir_operations;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 874972d..a8baf4b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -455,7 +455,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+	inc_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -466,7 +466,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
+		dec_bdi_stat(page->mapping->a_bdi, BDI_RECLAIMABLE);
 		return 1;
 	}
 	return 0;
@@ -1321,8 +1321,7 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
-		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
-				BDI_RECLAIMABLE);
+		dec_bdi_stat(req->wb_page->mapping->a_bdi, BDI_RECLAIMABLE);
 		nfs_clear_page_tag_locked(req);
 	}
 	nfs_commit_clear_lock(NFS_I(inode));
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index f78ab10..d74ed8f 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -59,7 +59,7 @@ void nilfs_btnode_cache_init(struct address_space *btnc,
 	btnc->flags = 0;
 	mapping_set_gfp_mask(btnc, GFP_NOFS);
 	btnc->assoc_mapping = NULL;
-	btnc->backing_dev_info = bdi;
+	mapping_new_set_bdi(btnc, bdi);
 	btnc->a_ops = &def_btnode_aops;
 }
 
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index d01aff4..7713861 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -517,7 +517,7 @@ nilfs_mdt_new_common(struct the_nilfs *nilfs, struct super_block *sb,
 		mapping->flags = 0;
 		mapping_set_gfp_mask(mapping, gfp_mask);
 		mapping->assoc_mapping = NULL;
-		mapping->backing_dev_info = nilfs->ns_bdi;
+		mapping_new_set_bdi(mapping, nilfs->ns_bdi);
 
 		inode->i_mapping = mapping;
 	}
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index ba7c10c..cb81695 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -729,7 +729,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct nilfs_sb_info *sbi, char *data)
 
 	nilfs->ns_mount_state = le16_to_cpu(sbp->s_state);
 
-	bdi = nilfs->ns_bdev->bd_inode->i_mapping->backing_dev_info;
+	bdi = nilfs->ns_bdev->bd_inode->i_mapping->a_bdi;
 	nilfs->ns_bdi = bdi ? : &default_backing_dev_info;
 
 	err = nilfs_store_log_cursor(nilfs, sbp);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 113ebd9..19f9447 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2088,7 +2088,7 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
 	pos = *ppos;
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 	/* We can write back this queue in page reclaim. */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = mapping->a_bdi;
 	written = 0;
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
 	if (err)
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index c2903b8..6b931db 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -403,7 +403,7 @@ static struct inode *dlmfs_get_root_inode(struct super_block *sb)
 		inode->i_mode = mode;
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
-		inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		inc_nlink(inode);
 
@@ -428,7 +428,7 @@ static struct inode *dlmfs_get_inode(struct inode *parent,
 	inode->i_mode = mode;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
-	inode->i_mapping->backing_dev_info = &dlmfs_backing_dev_info;
+	mapping_new_set_bdi(inode->i_mapping, &dlmfs_backing_dev_info);
 	inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 
 	ip = DLMFS_I(inode);
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 9a03c15..863e016 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2327,7 +2327,7 @@ relock:
 			goto out_dio;
 		}
 	} else {
-		current->backing_dev_info = file->f_mapping->backing_dev_info;
+		current->backing_dev_info = file->f_mapping->a_bdi;
 		written = generic_file_buffered_write(iocb, iov, nr_segs, *ppos,
 						      ppos, count, 0);
 		current->backing_dev_info = NULL;
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a5ebae7..02d8ffb 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -60,7 +60,7 @@ struct inode *ramfs_get_inode(struct super_block *sb,
 	if (inode) {
 		inode_init_owner(inode, dir, mode);
 		inode->i_mapping->a_ops = &ramfs_aops;
-		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping, &ramfs_backing_dev_info);
 		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
 		mapping_set_unevictable(inode->i_mapping);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 42d2135..bb4b195 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -356,8 +356,8 @@ static struct inode *romfs_iget(struct super_block *sb, unsigned long pos)
 		i->i_fop = &romfs_ro_fops;
 		i->i_data.a_ops = &romfs_aops;
 		if (i->i_sb->s_mtd)
-			i->i_data.backing_dev_info =
-				i->i_sb->s_mtd->backing_dev_info;
+			mapping_new_set_bdi(&i->i_data,
+				i->i_sb->s_mtd->backing_dev_info);
 		if (nextfh & ROMFH_EXEC)
 			mode |= S_IXUGO;
 		break;
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index cffb1fd..3d049e5 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -251,7 +251,7 @@ static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
 
 	inode->i_private = sysfs_get(sd);
 	inode->i_mapping->a_ops = &sysfs_aops;
-	inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
+	mapping_new_set_bdi(inode->i_mapping, &sysfs_backing_dev_info);
 	inode->i_op = &sysfs_inode_operations;
 
 	set_default_inode_attr(inode, sd->s_mode);
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 87ebcce..d669260 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -109,7 +109,7 @@ struct inode *ubifs_new_inode(struct ubifs_info *c, const struct inode *dir,
 			 ubifs_current_time(inode);
 	inode->i_mapping->nrpages = 0;
 	/* Disable readahead */
-	inode->i_mapping->backing_dev_info = &c->bdi;
+	mapping_new_set_bdi(inode->i_mapping, &c->bdi);
 
 	switch (mode & S_IFMT) {
 	case S_IFREG:
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cd5900b..45888fb 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -157,7 +157,7 @@ struct inode *ubifs_iget(struct super_block *sb, unsigned long inum)
 		goto out_invalid;
 
 	/* Disable read-ahead */
-	inode->i_mapping->backing_dev_info = &c->bdi;
+	mapping_new_set_bdi(inode->i_mapping, &c->bdi);
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 286e36e..7038d77 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -630,7 +630,7 @@ xfs_buf_readahead(
 {
 	struct backing_dev_info *bdi;
 
-	bdi = target->bt_mapping->backing_dev_info;
+	bdi = target->bt_mapping->a_bdi;
 	if (bdi_read_congested(bdi))
 		return;
 
@@ -1580,7 +1580,7 @@ xfs_mapping_buftarg(
 		bdi = &default_backing_dev_info;
 	mapping = &inode->i_data;
 	mapping->a_ops = &mapping_aops;
-	mapping->backing_dev_info = bdi;
+	mapping_new_set_bdi(mapping, bdi);
 	mapping_set_gfp_mask(mapping, GFP_NOFS);
 	btp->bt_mapping = mapping;
 	return 0;
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index ba8ad42..94cf85b 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -679,7 +679,7 @@ start:
 		goto out_unlock_internal;
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = mapping->a_bdi;
 
 	if ((ioflags & IO_ISDIRECT)) {
 		if (mapping->nrpages) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..31e1346 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -314,19 +314,27 @@ static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
 	return bdi == &default_backing_dev_info;
 }
 
+void mapping_set_bdi(struct address_space *mapping,
+					struct backing_dev_info *bdi);
+static inline void mapping_new_set_bdi(struct address_space *mapping,
+					struct backing_dev_info *bdi)
+{
+	mapping->a_bdi = bdi;
+}
+
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
-	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
+	return bdi_cap_writeback_dirty(mapping->a_bdi);
 }
 
 static inline bool mapping_cap_account_dirty(struct address_space *mapping)
 {
-	return bdi_cap_account_dirty(mapping->backing_dev_info);
+	return bdi_cap_account_dirty(mapping->a_bdi);
 }
 
 static inline bool mapping_cap_swap_backed(struct address_space *mapping)
 {
-	return bdi_cap_swap_backed(mapping->backing_dev_info);
+	return bdi_cap_swap_backed(mapping->a_bdi);
 }
 
 static inline int bdi_sched_wait(void *word)
@@ -345,7 +353,7 @@ static inline void blk_run_backing_dev(struct backing_dev_info *bdi,
 static inline void blk_run_address_space(struct address_space *mapping)
 {
 	if (mapping)
-		blk_run_backing_dev(mapping->backing_dev_info, NULL);
+		blk_run_backing_dev(mapping->a_bdi, NULL);
 }
 
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fb92f9..6f0b07f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -633,7 +633,7 @@ struct address_space {
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
-	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
+	struct backing_dev_info *a_bdi;		/* device readahead, etc */
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c9483d8..8f1952b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -782,7 +782,7 @@ static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 		inode->i_uid = current_fsuid();
 		inode->i_gid = current_fsgid();
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
-		inode->i_mapping->backing_dev_info = &cgroup_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping, &cgroup_backing_dev_info);
 	}
 	return inode;
 }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 65d4204..0188d99 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -671,6 +671,48 @@ err:
 }
 EXPORT_SYMBOL(bdi_init);
 
+void mapping_set_bdi(struct address_space *mapping,
+				struct backing_dev_info *bdi)
+{
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *old = mapping->a_bdi;
+
+	if (unlikely(old == bdi))
+		return;
+
+	spin_lock(&inode_lock);
+	if (!list_empty(&inode->i_list)) {
+		struct inode *i;
+
+		list_for_each_entry(i, &old->wb.b_dirty, i_list) {
+			if (inode == i) {
+				list_del(&inode->i_list);
+				list_add(&inode->i_list, &bdi->wb.b_dirty);
+				goto found;
+			}
+		}
+		list_for_each_entry(i, &old->wb.b_io, i_list) {
+			if (inode == i) {
+				list_del(&inode->i_list);
+				list_add(&inode->i_list, &bdi->wb.b_io);
+				goto found;
+			}
+		}
+		list_for_each_entry(i, &old->wb.b_more_io, i_list) {
+			if (inode == i) {
+				list_del(&inode->i_list);
+				list_add(&inode->i_list, &bdi->wb.b_more_io);
+				goto found;
+			}
+		}
+		BUG();
+	}
+found:
+	mapping->a_bdi = bdi;
+	spin_unlock(&inode_lock);
+}
+EXPORT_SYMBOL(mapping_set_bdi);
+
 void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
@@ -681,11 +723,24 @@ void bdi_destroy(struct backing_dev_info *bdi)
 	 */
 	if (bdi_has_dirty_io(bdi)) {
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
+		struct inode *i, *tmp;
 
 		spin_lock(&inode_lock);
-		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
-		list_splice(&bdi->wb.b_io, &dst->b_io);
-		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_list) {
+			list_del(&i->i_list);
+			list_add_tail(&i->i_list, &dst->b_dirty);
+			i->i_mapping->a_bdi = &default_backing_dev_info;
+		}
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_io, i_list) {
+			list_del(&i->i_list);
+			list_add_tail(&i->i_list, &dst->b_io);
+			i->i_mapping->a_bdi = &default_backing_dev_info;
+		}
+		list_for_each_entry_safe(i, tmp, &bdi->wb.b_more_io, i_list) {
+			list_del(&i->i_list);
+			list_add_tail(&i->i_list, &dst->b_more_io);
+			i->i_mapping->a_bdi = &default_backing_dev_info;
+		}
 		spin_unlock(&inode_lock);
 	}
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 8d723c9..72e3ac5 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -72,7 +72,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	else
 		endbyte--;		/* inclusive */
 
-	bdi = mapping->backing_dev_info;
+	bdi = mapping->a_bdi;
 
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
@@ -116,7 +116,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
+		if (!bdi_write_congested(mapping->a_bdi))
 			filemap_flush(mapping);
 
 		/* First and last FULL page! */
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..454d5ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -136,7 +136,7 @@ void __remove_from_page_cache(struct page *page)
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
 	}
 }
 
@@ -2373,7 +2373,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = mapping->a_bdi;
 	written = 0;
 
 	err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 83364df..cdca914 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -409,7 +409,7 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = mapping->backing_dev_info;
+	current->backing_dev_info = mapping->a_bdi;
 
 	ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
 	if (ret)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e3bccac..e2d50b1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -489,7 +489,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
 	bool dirty_exceeded = false;
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct backing_dev_info *bdi = mapping->a_bdi;
 
 	for (;;) {
 		struct writeback_control wbc = {
@@ -633,7 +633,7 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 	unsigned long *p;
 
 	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
+	if (mapping->a_bdi->dirty_exceeded)
 		ratelimit = 8;
 
 	/*
@@ -964,7 +964,7 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
-			trace_wbc_writepage(wbc, mapping->backing_dev_info);
+			trace_wbc_writepage(wbc, mapping->a_bdi);
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
@@ -1121,7 +1121,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
-		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
@@ -1297,8 +1297,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_bdi_stat(mapping->backing_dev_info,
-					BDI_RECLAIMABLE);
+			dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
@@ -1313,7 +1312,7 @@ int test_clear_page_writeback(struct page *page)
 	int ret;
 
 	if (mapping) {
-		struct backing_dev_info *bdi = mapping->backing_dev_info;
+		struct backing_dev_info *bdi = mapping->a_bdi;
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -1342,7 +1341,7 @@ int test_set_page_writeback(struct page *page)
 	int ret;
 
 	if (mapping) {
-		struct backing_dev_info *bdi = mapping->backing_dev_info;
+		struct backing_dev_info *bdi = mapping->a_bdi;
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
diff --git a/mm/readahead.c b/mm/readahead.c
index 77506a2..831b927 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -25,7 +25,7 @@
 void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
-	ra->ra_pages = mapping->backing_dev_info->ra_pages;
+	ra->ra_pages = mapping->a_bdi->ra_pages;
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -549,7 +549,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (bdi_read_congested(mapping->a_bdi))
 		return;
 
 	/* do read-ahead */
@@ -564,7 +564,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	 * explicitly kick off the IO.
 	 */
 	if (PageUptodate(page))
-		blk_run_backing_dev(mapping->backing_dev_info, NULL);
+		blk_run_backing_dev(mapping->a_bdi, NULL);
 #endif
 }
 EXPORT_SYMBOL_GPL(page_cache_async_readahead);
diff --git a/mm/shmem.c b/mm/shmem.c
index 080b09a..fbee46d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1588,7 +1588,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	if (inode) {
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
-		inode->i_mapping->backing_dev_info = &shmem_backing_dev_info;
+		mapping_new_set_bdi(inode->i_mapping, &shmem_backing_dev_info);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		inode->i_generation = get_seconds();
 		info = SHMEM_I(inode);
diff --git a/mm/swap.c b/mm/swap.c
index 3ce7bc3..9352a37 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -501,7 +501,7 @@ void __init swap_setup(void)
 	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
 
 #ifdef CONFIG_SWAP
-	bdi_init(swapper_space.backing_dev_info);
+	bdi_init(swapper_space.a_bdi);
 #endif
 
 	/* Use a smaller cluster for small-memory machines */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e10f583..6496074 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -45,7 +45,7 @@ struct address_space swapper_space = {
 	.tree_lock	= __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
-	.backing_dev_info = &swap_backing_dev_info,
+	.a_bdi		= &swap_backing_dev_info,
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7c703ff..c14b755 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -116,7 +116,7 @@ void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page)
 		 */
 		WARN_ON(page_count(page) <= 1);
 
-		bdi = bdev->bd_inode->i_mapping->backing_dev_info;
+		bdi = bdev->bd_inode->i_mapping->a_bdi;
 		blk_run_backing_dev(bdi, page);
 	}
 	up_read(&swap_unplug_sem);
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..bb79cef 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -75,8 +75,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_bdi_stat(mapping->backing_dev_info,
-					BDI_RECLAIMABLE);
+			dec_bdi_stat(mapping->a_bdi, BDI_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5dfabf..8f58773 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -366,7 +366,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->a_bdi))
 		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
-- 
1.7.1


* [RFD] Device Renaming Mechanism
From: Nao Nishijima @ 2010-10-08  5:23 UTC (permalink / raw)
  To: gregkh, James.Bottomley, rwheeler
  Cc: linux-kernel, linux-hotplug-devel, linux-hotplug,
	masami.hiramatsu.pt

Hi,

I'm trying to solve a device name (or device node) mismatch problem caused by
device configuration changes. I have an idea for solving it by renaming devices,
and would like to request comments from kernel developers.


Device Name Mismatch
==========

Device names (e.g. sda) are assigned in the order of driver loading and device
recognition (usually starting from the lowest bus number). This may cause a
device name mismatch between the previous and the current boot whenever the
device configuration changes. Suppose an application opens a disk via /dev/sdb.
A device configuration change (hot-plug, device breakdown) or a system
configuration change (driver loading order, changes to modprobe.conf) can change
the order in which device names are assigned, so a given device name does not
always point to the same disk.

This mismatch causes unexpected disk access and redundancy misconfiguration
(e.g. multipath, software RAID) if you use device file names in a configuration
file.


Udev Solution
======
To avoid this problem, we typically use the persistent device names provided
by udev.

Udev makes persistent symbolic links (by-{id, uuid, path, label}) pointing to
each device based on device information. Applications access the device via
these symbolic links. Udev thus solves the mismatch between device name and
physical disk. However, the persistent name does not match the kernel's device
name. This mismatch causes the following 4 issues.

Issue 1: /proc/partitions and /proc/diskstats give you kernel device names
We have to run "ls -l /dev/disk/by-*" or "udevadm" to find the corresponding
persistent symbolic links.

Issue 2: dmesg outputs device names instead of persistent symbolic links
Users might not know which disk is sdX, because they identify the disk by a
persistent symbolic link.

Issue 3: Some system commands don't accept symbolic links (e.g. df, iostat, ...)
These commands expect an sdX device name or check their input against /proc
information. This also occurs in several GNOME/KDE/etc. GUI sysadmin tools. :(

Issue 4: Ambiguous symbolic links
Even if we introduced a tool that maps device names to persistent symbolic
links, we could not determine a single symbolic link from a device, because
several symbolic links point to the same device file.

Therefore, I think symbolic links are not enough to solve the problem. We need
a better solution.
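
To make Issue 4 concrete, here is a minimal userspace sketch of the two lookup
directions. It is only an illustration: the table type, the function names, and
every path in the sample data are made up for this example and are not taken
from udev or from any real machine. The forward lookup (persistent link to
kernel name) always has a single answer, while the reverse lookup can return
several candidates.

```c
#include <stddef.h>
#include <string.h>

/* One persistent link and the kernel name it currently resolves to. */
struct link_entry {
	const char *persistent;	/* a /dev/disk/by-* style path */
	const char *kernel;	/* kernel device name, e.g. "sdb" */
};

/* Hypothetical sample data; real links depend on the hardware and udev rules. */
const struct link_entry sample_links[] = {
	{ "/dev/disk/by-id/scsi-EXAMPLE0001",	"sda" },
	{ "/dev/disk/by-path/pci-EXAMPLE-0",	"sda" },
	{ "/dev/disk/by-id/scsi-EXAMPLE0002",	"sdb" },
	{ "/dev/disk/by-path/pci-EXAMPLE-1",	"sdb" },
	{ "/dev/disk/by-uuid/EXAMPLE-UUID",	"sdb" },
};
const size_t sample_links_count =
	sizeof(sample_links) / sizeof(sample_links[0]);

/* Forward lookup: a persistent link names at most one kernel device. */
const char *kernel_name_for(const struct link_entry *tbl, size_t n,
			    const char *persistent)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].persistent, persistent) == 0)
			return tbl[i].kernel;
	return NULL;	/* unknown link */
}

/*
 * Reverse lookup: count the persistent links that name one kernel device.
 * A result greater than 1 means there is no unique persistent name to
 * report for that device.
 */
size_t persistent_links_for(const struct link_entry *tbl, size_t n,
			    const char *kernel)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].kernel, kernel) == 0)
			count++;
	return count;
}
```

With the sample table, "sdb" has three candidate persistent names, so a tool
that wants to print one persistent name per device would need some extra policy
to pick among them.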


Proposal
====
I'd like to propose introducing a device renaming interface to solve these
issues.

I think renaming the device in the kernel is the simplest way to resolve the
mismatch with dmesg and /proc information. This can be done while the kernel is
booting (like ifcfg). Of course, udev still needs to assign a new name to each
device via that interface.

This proposal just requests adding a simple interface to the kernel, as below.
We can then continue to use user programs without any modification.

int rename_device(const char *newname, const char *oldname)
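
The semantics of such a call are not spelled out above, so the following is
only a userspace model of one possible behaviour, not kernel code and not the
proposed implementation. Everything here is an assumption made for
illustration: the model_* names, the fixed-size table, and the choice of error
codes (-ENOENT when the old name is unknown, -EEXIST when the new name is
already taken, -EINVAL for an empty or too-long name).

```c
#include <errno.h>
#include <string.h>

#define MODEL_NAME_LEN	32
#define MODEL_MAX_DEVS	8

/* Toy name table standing in for the kernel's set of device names. */
struct model_table {
	char names[MODEL_MAX_DEVS][MODEL_NAME_LEN];
	int count;
};

/* Return the index of @name in the table, or -1 if it is not present. */
static int model_find(const struct model_table *t, const char *name)
{
	for (int i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return i;
	return -1;
}

/* Register a device name, as probing would do in this model. */
int model_add_device(struct model_table *t, const char *name)
{
	if (t->count >= MODEL_MAX_DEVS)
		return -ENOSPC;
	if (name[0] == '\0' || strlen(name) >= MODEL_NAME_LEN)
		return -EINVAL;
	if (model_find(t, name) >= 0)
		return -EEXIST;
	strcpy(t->names[t->count++], name);
	return 0;
}

/* Same argument order as the prototype above: new name first, then old. */
int model_rename_device(struct model_table *t, const char *newname,
			const char *oldname)
{
	int idx;

	if (newname[0] == '\0' || strlen(newname) >= MODEL_NAME_LEN)
		return -EINVAL;
	idx = model_find(t, oldname);
	if (idx < 0)
		return -ENOENT;		/* nothing to rename */
	if (model_find(t, newname) >= 0)
		return -EEXIST;		/* would create a duplicate name */
	strcpy(t->names[idx], newname);
	return 0;
}
```

In this model the old name stops existing as soon as the rename succeeds, and
renaming a device to its own current name is rejected as a duplicate. Whether a
real interface should behave that way is one of the points open for discussion.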

Any comments or suggestions are very welcome!
Best Regards,

-- 
Nao NISHIJIMA
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
Email: nao.nishijima.xt@hitachi.com

* [PATCH 17/18] fs: icache remove inode_lock
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

All the functionality that the inode_lock protected has now been
wrapped up in new independent locks and/or functionality. Hence the
inode_lock does not serve a purpose any longer and hence can now be
removed.

Based on work originally done by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 Documentation/filesystems/Locking |    2 +-
 Documentation/filesystems/porting |   10 ++++-
 Documentation/filesystems/vfs.txt |    2 +-
 fs/buffer.c                       |    2 +-
 fs/drop_caches.c                  |    4 --
 fs/fs-writeback.c                 |   47 ++++-----------------
 fs/inode.c                        |   82 ++++---------------------------------
 fs/logfs/inode.c                  |    2 +-
 fs/notify/inode_mark.c            |   11 ++---
 fs/notify/mark.c                  |    1 -
 fs/notify/vfsmount_mark.c         |    1 -
 fs/ntfs/inode.c                   |    4 +-
 fs/ocfs2/inode.c                  |    2 +-
 fs/quota/dquot.c                  |   12 +----
 include/linux/fs.h                |    2 +-
 include/linux/writeback.h         |    3 -
 mm/backing-dev.c                  |    6 ---
 mm/filemap.c                      |    6 +-
 mm/rmap.c                         |    6 +-
 19 files changed, 48 insertions(+), 157 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 2db4283..e92dad2 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -114,7 +114,7 @@ alloc_inode:
 destroy_inode:
 dirty_inode:				(must not sleep)
 write_inode:
-drop_inode:				!!!inode_lock!!!
+drop_inode:				!!!i_lock, sb_inode_list_lock!!!
 evict_inode:
 put_super:		write
 write_super:		read
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index b12c895..ab07213 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -299,7 +299,7 @@ be used instead.  It gets called whenever the inode is evicted, whether it has
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
 metadata buffers; getting rid of those is responsibility of method, as it had
 been for ->delete_inode().
-	->drop_inode() returns int now; it's called on final iput() with inode_lock
+	->drop_inode() returns int now; it's called on final iput() with i_lock
 held and it returns true if filesystems wants the inode to be dropped.  As before,
 generic_drop_inode() is still the default and it's been updated appropriately.
 generic_delete_inode() is also alive and it consists simply of return 1.  Note that
@@ -318,3 +318,11 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+
+[mandatory]
+	inode_lock is gone, replaced by fine grained locks. See fs/inode.c
+for details of what locks to replace inode_lock with in order to protect
+particular things. Most of the time, a filesystem only needs ->i_lock, which
+protects *all* the inode state and its membership on lists that was
+previously protected with inode_lock.
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed7e5ef..405beb2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -246,7 +246,7 @@ or bottom half).
 	should be synchronous or not, not all filesystems check this flag.
 
   drop_inode: called when the last access to the inode is dropped,
-	with the inode_lock spinlock held.
+	with the i_lock and sb_inode_list_lock spinlock held.
 
 	This method should be either NULL (normal UNIX filesystem
 	semantics) or "generic_delete_inode" (for filesystems that do not
diff --git a/fs/buffer.c b/fs/buffer.c
index b5c4153..99a9f8d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1145,7 +1145,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
  * inode list.
  *
  * mark_buffer_dirty() is atomic.  It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and the global inode_lock.
+ * and mapping->tree_lock.
  */
 void mark_buffer_dirty(struct buffer_head *bh)
 {
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 00180dc..2105713 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -16,7 +16,6 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -28,15 +27,12 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 404d449..f8eb27c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -184,7 +184,7 @@ static void requeue_io(struct inode *inode)
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
@@ -283,25 +283,21 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_lock.  Either the
- * caller has ref on the inode (either via iref_locked or via syscall against an fd)
- * or the inode has I_WILL_FREE set (via generic_forget_inode)
+ * Write out an inode's dirty pages.  Either the caller has ref on the inode
+ * (either via iref_locked or via syscall against an fd) or the inode has
+ * I_WILL_FREE set (via generic_forget_inode)
  *
  * If `wait' is set, wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
  * starvation of particular inodes when others are being redirtied, prevent
  * livelocks, etc.
- *
- * Called under inode_lock.
  */
 static int
 writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
@@ -346,7 +342,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -366,12 +361,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * due to delalloc, clear dirty metadata flags right before
 	 * write_inode()
 	 */
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
@@ -379,7 +372,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			ret = err;
 	}
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
@@ -527,10 +519,8 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			redirty_tail(inode);
 			spin_unlock(&wb->b_lock);
 		}
-		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
-		spin_lock(&inode_lock);
 		spin_lock(&wb->b_lock);
 		if (wbc->nr_to_write <= 0) {
 			wbc->more_io = 1;
@@ -550,9 +540,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 
 	if (!wbc->wb_start)
 		wbc->wb_start = jiffies; /* livelock avoidance */
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
-
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 
@@ -572,7 +560,6 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 			break;
 	}
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 	/* Leave any unwritten inodes on b_io */
 }
 
@@ -581,13 +568,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -697,7 +682,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.
 		 */
-		spin_lock(&inode_lock);
 		if (!list_empty(&wb->b_more_io))  {
 			spin_lock(&wb->b_lock);
 			inode = list_entry(wb->b_more_io.prev,
@@ -708,7 +692,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 			inode_wait_for_writeback(inode);
 			spin_unlock(&inode->i_lock);
 		}
-		spin_unlock(&inode_lock);
 	}
 
 	return wrote;
@@ -971,7 +954,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
@@ -1029,8 +1011,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 out_unlock:
 	spin_unlock(&inode->i_lock);
 out:
-	spin_unlock(&inode_lock);
-
 	if (wakeup_bdi)
 		bdi_wakeup_thread_delayed(bdi);
 }
@@ -1063,7 +1043,6 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 
 	/*
@@ -1086,14 +1065,12 @@ static void wait_sb_inodes(struct super_block *sb)
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 		/*
-		 * We hold a reference to 'inode' so it couldn't have
-		 * been removed from s_inodes list while we dropped the
-		 * inode_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it
-		 * under inode_lock. So we keep the reference and iput
-		 * it later.
+		 * We hold a reference to 'inode' so it couldn't have been
+		 * removed from s_inodes list while we dropped the
+		 * s_inodes_lock.  We cannot iput the inode now as we can be
+		 * holding the last reference and we cannot iput it under
+		 * s_inodes_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -1102,11 +1079,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 }
 
@@ -1209,9 +1184,7 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
 	ret = writeback_single_inode(inode, &wbc);
-	spin_unlock(&inode_lock);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -1233,9 +1206,7 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	int ret;
 
-	spin_lock(&inode_lock);
 	ret = writeback_single_inode(inode, wbc);
-	spin_unlock(&inode_lock);
 	return ret;
 }
 EXPORT_SYMBOL(sync_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 4ec360e..c778ec4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -41,11 +41,9 @@
  *   inode_lru, i_lru
  *
  * Lock orders
- * inode_lock
  *   inode hash bucket lock
  *     inode->i_lock
  *
- * inode_lock
  *   sb inode lock
  *     inode_lru_lock
  *       wb->b_lock
@@ -118,14 +116,6 @@ static inline void spin_unlock_bucket(struct inode_hash_bucket *b)
 static struct inode_hash_bucket *inode_hashtable __read_mostly;
 
 /*
- * A simple spinlock to protect the list manipulations.
- *
- * NOTE! You also have to own the lock if you change
- * the i_state of an inode while it is in use..
- */
-DEFINE_SPINLOCK(inode_lock);
-
-/*
  * iprune_sem provides exclusion between the kswapd or try_to_free_pages
  * icache shrinking path, and the umount path.  Without this exclusion,
  * by the time prune_icache calls iput for the inode whose pages it has
@@ -357,7 +347,7 @@ static void init_once(void *foo)
 }
 
 /*
- * inode_lock must be held
+ * i_lock must be held
  */
 void iref_locked(struct inode *inode)
 {
@@ -369,11 +359,9 @@ EXPORT_SYMBOL_GPL(iref_locked);
 
 void iref(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	iref_locked(inode);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(iref);
 
@@ -439,11 +427,9 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 	struct inode_hash_bucket *b;
 
 	b = inode_hashtable + hash(inode->i_sb, hashval);
-	spin_lock(&inode_lock);
 	spin_lock_bucket(b);
 	hlist_bl_add_head(&inode->i_hash, &b->head);
 	spin_unlock_bucket(b);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
@@ -472,9 +458,7 @@ static void __remove_inode_hash(struct inode *inode)
  */
 void remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	__remove_inode_hash(inode);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(remove_inode_hash);
 
@@ -526,12 +510,10 @@ static void dispose_list(struct list_head *head)
 
 		evict(inode);
 
-		spin_lock(&inode_lock);
 		__remove_inode_hash(inode);
 		spin_lock(&inode->i_sb->s_inodes_lock);
 		list_del_init(&inode->i_sb_list);
 		spin_unlock(&inode->i_sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		wake_up_inode(inode);
 		destroy_inode(inode);
@@ -558,7 +540,6 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		 * change during umount anymore, and because iprune_sem keeps
 		 * shrink_icache_memory() away.
 		 */
-		cond_resched_lock(&inode_lock);
 		cond_resched_lock(&sb->s_inodes_lock);
 
 		next = next->next;
@@ -614,12 +595,10 @@ int invalidate_inodes(struct super_block *sb)
 	LIST_HEAD(throw_away);
 
 	down_write(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	fsnotify_unmount_inodes(&sb->s_inodes);
 	busy = invalidate_list(sb, &sb->s_inodes, &throw_away);
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&throw_away);
 	up_write(&iprune_sem);
@@ -644,7 +623,7 @@ static int can_unuse(struct inode *inode)
 
 /*
  * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * a temporary list and then are freed outside LRU lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -662,7 +641,6 @@ static void prune_icache(int nr_to_scan)
 	unsigned long reap = 0;
 
 	down_read(&iprune_sem);
-	spin_lock(&inode_lock);
 	spin_lock(&inode_lru_lock);
 	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
 		struct inode *inode;
@@ -690,12 +668,10 @@ static void prune_icache(int nr_to_scan)
 			iref_locked(inode);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lru_lock);
-			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
 				reap += invalidate_mapping_pages(&inode->i_data,
 								0, -1);
 			iput(inode);
-			spin_lock(&inode_lock);
 			spin_lock(&inode_lru_lock);
 			spin_lock(&inode->i_lock);
 
@@ -733,7 +709,6 @@ static void prune_icache(int nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lru_lock);
-	spin_unlock(&inode_lock);
 
 	dispose_list(&freeable);
 	up_read(&iprune_sem);
@@ -854,9 +829,9 @@ __inode_add_to_lists(struct super_block *sb, struct inode_hash_bucket *b,
  * @inode: inode to mark in use
  *
  * When an inode is allocated it needs to be accounted for, added to the in use
- * list, the owning superblock and the inode hash. This needs to be done under
- * the inode_lock, so export a function to do this rather than the inode lock
- * itself. We calculate the hash list to add to here so it is all internal
+ * list, the owning superblock and the inode hash.
+ *
+ * We calculate the hash list to add to here so it is all internal
  * which requires the caller to have already set up the inode number in the
  * inode to add.
  */
@@ -864,9 +839,7 @@ void inode_add_to_lists(struct super_block *sb, struct inode *inode)
 {
 	struct inode_hash_bucket *b = inode_hashtable + hash(sb, inode->i_ino);
 
-	spin_lock(&inode_lock);
 	__inode_add_to_lists(sb, b, inode);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL_GPL(inode_add_to_lists);
 
@@ -923,15 +896,11 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_lock);
-
 	inode = alloc_inode(sb);
 	if (inode) {
-		spin_lock(&inode_lock);
 		inode->i_ino = last_ino_get();
 		inode->i_state = 0;
 		__inode_add_to_lists(sb, NULL, inode);
-		spin_unlock(&inode_lock);
 	}
 	return inode;
 }
@@ -990,7 +959,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode(sb, b, test, data);
 		if (!old) {
@@ -999,7 +967,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 
 			inode->i_state = I_NEW;
 			__inode_add_to_lists(sb, b, inode);
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1014,7 +981,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 */
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1022,7 +988,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 	return inode;
 
 set_failed:
-	spin_unlock(&inode_lock);
 	destroy_inode(inode);
 	return NULL;
 }
@@ -1040,14 +1005,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_lock);
 		/* We released the lock, so.. */
 		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			inode->i_ino = ino;
 			__inode_add_to_lists(sb, b, inode);
 			inode->i_state = I_NEW;
-			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1062,7 +1025,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 */
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_lock);
 		destroy_inode(inode);
 		inode = old;
 		wait_on_inode(inode);
@@ -1119,7 +1081,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 	static unsigned int counter;
 	ino_t res;
 
-	spin_lock(&inode_lock);
 	spin_lock(&unique_lock);
 	do {
 		if (counter <= max_reserved)
@@ -1127,7 +1088,6 @@ ino_t iunique(struct super_block *sb, ino_t max_reserved)
 		res = counter++;
 	} while (!test_inode_iunique(sb, res));
 	spin_unlock(&unique_lock);
-	spin_unlock(&inode_lock);
 
 	return res;
 }
@@ -1135,7 +1095,6 @@ EXPORT_SYMBOL(iunique);
 
 struct inode *igrab(struct inode *inode)
 {
-	spin_lock(&inode_lock);
 	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
 		iref_locked(inode);
@@ -1149,7 +1108,6 @@ struct inode *igrab(struct inode *inode)
 		 */
 		inode = NULL;
 	}
-	spin_unlock(&inode_lock);
 	return inode;
 }
 EXPORT_SYMBOL(igrab);
@@ -1171,7 +1129,7 @@ EXPORT_SYMBOL(igrab);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
 		struct inode_hash_bucket *b,
@@ -1180,17 +1138,14 @@ static struct inode *ifind(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode(sb, b, test, data);
 	if (inode) {
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1215,16 +1170,13 @@ static struct inode *ifind_fast(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_lock);
 	return NULL;
 }
 
@@ -1247,7 +1199,7 @@ static struct inode *ifind_fast(struct super_block *sb,
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1275,7 +1227,7 @@ EXPORT_SYMBOL(ilookup5_nowait);
  *
  * Otherwise NULL is returned.
  *
- * Note, @test is called with the inode_lock held, so can't sleep.
+ * Note, @test is called with the i_lock held, so can't sleep.
  */
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
@@ -1326,7 +1278,7 @@ EXPORT_SYMBOL(ilookup);
  * inode and this is returned locked, hashed, and with the I_NEW flag set. The
  * file system gets to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_lock held, so can't sleep.
+ * Note both @test and @set are called with the i_lock held, so can't sleep.
  */
 struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
@@ -1391,7 +1343,6 @@ int insert_inode_locked(struct inode *inode)
 	while (1) {
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
-		spin_lock(&inode_lock);
 		spin_lock_bucket(b);
 		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
 			if (old->i_ino != ino)
@@ -1408,13 +1359,11 @@ int insert_inode_locked(struct inode *inode)
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, &b->head);
 			spin_unlock_bucket(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock_bucket(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1437,7 +1386,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		struct hlist_bl_node *node;
 		struct inode *old = NULL;
 
-		spin_lock(&inode_lock);
 		spin_lock_bucket(b);
 		hlist_bl_for_each_entry(old, node, &b->head, i_hash) {
 			if (old->i_sb != sb)
@@ -1454,13 +1402,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		if (likely(!node)) {
 			hlist_bl_add_head(&inode->i_hash, &b->head);
 			spin_unlock_bucket(b);
-			spin_unlock(&inode_lock);
 			return 0;
 		}
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock_bucket(b);
-		spin_unlock(&inode_lock);
 		wait_on_inode(old);
 		if (unlikely(!hlist_bl_unhashed(&old->i_hash))) {
 			iput(old);
@@ -1523,15 +1469,12 @@ static void iput_final(struct inode *inode)
 				return;
 			}
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
@@ -1556,7 +1499,6 @@ static void iput_final(struct inode *inode)
 	list_del_init(&inode->i_sb_list);
 	spin_unlock(&sb->s_inodes_lock);
 
-	spin_unlock(&inode_lock);
 	evict(inode);
 	remove_inode_hash(inode);
 	wake_up_inode(inode);
@@ -1576,7 +1518,6 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
 		BUG_ON(inode->i_state & I_CLEAR);
 
@@ -1586,7 +1527,6 @@ void iput(struct inode *inode)
 			return;
 		}
 		spin_unlock(&inode->i_lock);
-		spin_lock(&inode_lock);
 	}
 }
 EXPORT_SYMBOL(iput);
@@ -1766,8 +1706,6 @@ EXPORT_SYMBOL(inode_wait);
  * It doesn't matter if I_NEW is not set initially, a call to
  * wake_up_inode() after removing from the hash list will DTRT.
  *
- * This is called with inode_lock held.
- *
  * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
@@ -1777,10 +1715,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_lock);
 }
 
 static __initdata unsigned long ihash_entries;
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index d8c71ec..a67b607 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -286,7 +286,7 @@ static int logfs_write_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 
-/* called with inode_lock held */
+/* called with i_lock held */
 static int logfs_drop_inode(struct inode *inode)
 {
 	struct logfs_super *super = logfs_super(inode->i_sb);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 8a05213..57c28ae 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -22,7 +22,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
+#include <linux/writeback.h>
 
 #include <asm/atomic.h>
 
@@ -232,9 +232,8 @@ out:
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
  * @list: list of inodes being unmounted (sb->s_inodes)
  *
- * Called with inode_lock held, protecting the unmounting super block's list
- * of inodes, and with iprune_mutex held, keeping shrink_icache_memory() at bay.
- * We temporarily drop inode_lock, however, and CAN block.
+ * Called with iprune_mutex held, keeping shrink_icache_memory() at bay.
+ * sb_inode_list_lock to protect the super block's list of inodes.
  */
 void fsnotify_unmount_inodes(struct list_head *list)
 {
@@ -288,13 +287,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_lock here because we hold
+		 * We can safely drop sb->s_inodes_lock here because we hold
 		 * references on both inode and next_i.  Also no new inodes
 		 * will be added since the umount has begun.  Finally,
 		 * iprune_mutex keeps shrink_icache_memory() away.
 		 */
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -306,7 +304,6 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 }
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 325185e..50c0085 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -91,7 +91,6 @@
 #include <linux/slab.h>
 #include <linux/spinlock.h>
 #include <linux/srcu.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/notify/vfsmount_mark.c b/fs/notify/vfsmount_mark.c
index 56772b5..6f8eefe 100644
--- a/fs/notify/vfsmount_mark.c
+++ b/fs/notify/vfsmount_mark.c
@@ -23,7 +23,6 @@
 #include <linux/mount.h>
 #include <linux/mutex.h>
 #include <linux/spinlock.h>
-#include <linux/writeback.h> /* for inode_lock */
 
 #include <asm/atomic.h>
 
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 93622b1..7c530f3 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -54,7 +54,7 @@
  *
  * Return 1 if the attributes match and 0 if not.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep.
  */
 int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
@@ -98,7 +98,7 @@ int ntfs_test_inode(struct inode *vi, ntfs_attr *na)
  *
  * Return 0 on success and -errno on error.
  *
- * NOTE: This function runs with the inode_lock spin lock held so it is not
+ * NOTE: This function runs with the i_lock spin lock held so it is not
  * allowed to sleep. (Hence the GFP_ATOMIC allocation.)
  */
 static int ntfs_init_locked_inode(struct inode *vi, ntfs_attr *na)
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index eece3e0..65c61e2 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1195,7 +1195,7 @@ void ocfs2_evict_inode(struct inode *inode)
 	ocfs2_clear_inode(inode);
 }
 
-/* Called under inode_lock, with no more references on the
+/* Called under i_lock, with no more references on the
  * struct inode, so it's safe here to check the flags field
  * and to manipulate i_nlink without any other locks. */
 int ocfs2_drop_inode(struct inode *inode)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index c7b5fc6..533cd95 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -76,7 +76,7 @@
 #include <linux/buffer_head.h>
 #include <linux/capability.h>
 #include <linux/quotaops.h>
-#include <linux/writeback.h> /* for inode_lock, oddly enough.. */
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 
@@ -896,7 +896,6 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
@@ -914,21 +913,18 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
-		spin_unlock(&inode_lock);
 
 		iput(old_inode);
 		__dquot_initialize(inode, type);
 		/* We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the inode_lock.
+		 * removed from s_inodes list while we dropped the lock.
 		 * We cannot iput the inode now as we can be holding the last
-		 * reference and we cannot iput it under inode_lock. So we
+		 * reference and we cannot iput it under the lock. So we
 		 * keep the reference and iput it later. */
 		old_inode = inode;
-		spin_lock(&inode_lock);
 		spin_lock(&sb->s_inodes_lock);
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1009,7 +1005,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
@@ -1025,7 +1020,6 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 	}
 	spin_unlock(&sb->s_inodes_lock);
-	spin_unlock(&inode_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 54c4e86..453e0b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1588,7 +1588,7 @@ struct super_operations {
 };
 
 /*
- * Inode state bits.  Protected by inode_lock.
+ * Inode state bits.  Protected by i_lock.
  *
  * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
  * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b182ccc..67be7a2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -9,9 +9,6 @@
 
 struct backing_dev_info;
 
-extern spinlock_t inode_lock;
-
-
 /*
  * fs/fs-writeback.c
  */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 74e8269..0c0586b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,7 +73,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	struct inode *inode;
 
 	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
-	spin_lock(&inode_lock);
 	spin_lock(&wb->b_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_io)
 		nr_dirty++;
@@ -82,7 +81,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 	list_for_each_entry(inode, &wb->b_more_io, i_io)
 		nr_more_io++;
 	spin_unlock(&wb->b_lock);
-	spin_unlock(&inode_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
 	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -695,7 +693,6 @@ void mapping_set_bdi(struct address_space *mapping,
 	if (unlikely(old == bdi))
 		return;
 
-	spin_lock(&inode_lock);
 	bdi_lock_two(bdi, old);
 	if (!list_empty(&inode->i_io)) {
 		struct inode *i;
@@ -727,7 +724,6 @@ found:
 	mapping->a_bdi = bdi;
 	spin_unlock(&bdi->wb.b_lock);
 	spin_unlock(&old->wb.b_lock);
-	spin_unlock(&inode_lock);
 }
 EXPORT_SYMBOL(mapping_set_bdi);
 
@@ -743,7 +739,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		struct bdi_writeback *dst = &default_backing_dev_info.wb;
 		struct inode *i, *tmp;
 
-		spin_lock(&inode_lock);
 		bdi_lock_two(bdi, &default_backing_dev_info);
 		list_for_each_entry_safe(i, tmp, &bdi->wb.b_dirty, i_io) {
 			list_del(&i->i_io);
@@ -762,7 +757,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		}
 		spin_unlock(&bdi->wb.b_lock);
 		spin_unlock(&dst->b_lock);
-		spin_unlock(&inode_lock);
 	}
 
 	bdi_unregister(bdi);
diff --git a/mm/filemap.c b/mm/filemap.c
index 454d5ec..857fb34 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -80,7 +80,7 @@
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
  *
- *  ->inode_lock
+ *  ->i_lock
  *    ->sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
  *
@@ -98,8 +98,8 @@
  *    ->zone.lru_lock		(check_pte_range->isolate_lru_page)
  *    ->private_lock		(page_remove_rmap->set_page_dirty)
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(page_remove_rmap->set_page_dirty)
- *    ->inode_lock		(zap_pte_range->set_page_dirty)
+ *    ->i_lock			(page_remove_rmap->set_page_dirty)
+ *    ->i_lock			(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
  *
  *  ->task->proc_lock
diff --git a/mm/rmap.c b/mm/rmap.c
index 92e6757..dbfccae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,11 +31,11 @@
  *             swap_lock (in swap_duplicate, swap_info_get)
  *               mmlist_lock (in mmput, drain_mmlist and others)
  *               mapping->private_lock (in __set_page_dirty_buffers)
- *               inode_lock (in set_page_dirty's __mark_inode_dirty)
- *                 sb_lock (within inode_lock in fs/fs-writeback.c)
+ *               i_lock (in set_page_dirty's __mark_inode_dirty)
+ *                 sb_lock (within i_lock in fs/fs-writeback.c)
  *                 mapping->tree_lock (widely used, in set_page_dirty,
  *                           in arch-dependent flush_dcache_mmap_lock,
- *                           within inode_lock in __sync_single_inode)
+ *                           within i_lock in __sync_single_inode)
  *
  * (code doesn't rely on that order so it could be switched around)
  * ->tasklist_lock
-- 
1.7.1



* [PATCH 14/18] fs: Protect inode->i_state with the inode->i_lock
From: Dave Chinner @ 2010-10-08  5:21 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel
In-Reply-To: <1286515292-15882-1-git-send-email-david@fromorbit.com>

From: Dave Chinner <dchinner@redhat.com>

We currently protect the per-inode state flags with the inode_lock.
Using a global lock to protect per-object state is overkill when we
could use a per-inode lock to protect the state.  Use the
inode->i_lock for this, and wrap all the state changes and checks
with the inode->i_lock.
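
As a rough illustration of the pattern described above (not code from this
patch), the following userspace sketch shows state flags and a reference
count guarded by a lock embedded in the object itself rather than by one
global lock. Everything named toy_* or TOY_* is hypothetical, and a pthread
mutex stands in for the kernel spinlock; the shape loosely follows what
igrab() does after this change.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's I_* flags; values are arbitrary. */
#define TOY_I_NEW     (1u << 0)
#define TOY_I_FREEING (1u << 1)

/* Hypothetical miniature inode: the lock lives inside the object. */
struct toy_inode {
	pthread_mutex_t i_lock;	/* protects i_state and i_ref */
	unsigned int i_state;
	unsigned int i_ref;
};

static void toy_inode_init(struct toy_inode *inode)
{
	assert(inode != NULL);
	pthread_mutex_init(&inode->i_lock, NULL);
	inode->i_state = 0;
	inode->i_ref = 0;
}

/*
 * Take a reference unless the object is being freed.  The state check and
 * the reference bump happen under the same per-object lock, so callers
 * working on different objects never contend with each other.
 */
static struct toy_inode *toy_igrab(struct toy_inode *inode)
{
	struct toy_inode *ret = NULL;

	pthread_mutex_lock(&inode->i_lock);
	if (!(inode->i_state & TOY_I_FREEING)) {
		inode->i_ref++;
		ret = inode;
	}
	pthread_mutex_unlock(&inode->i_lock);
	return ret;
}

/* State changes are wrapped in the same per-object lock as the checks. */
static void toy_mark_freeing(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->i_lock);
	inode->i_state |= TOY_I_FREEING;
	pthread_mutex_unlock(&inode->i_lock);
}
```

In this sketch, calling toy_mark_freeing() on one object has no effect on a
concurrent toy_igrab() of a different object, which is the property a single
global lock cannot provide.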

Based on work originally written by Nick Piggin.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/drop_caches.c       |    9 +++--
 fs/fs-writeback.c      |   49 ++++++++++++++++++++++------
 fs/inode.c             |   83 ++++++++++++++++++++++++++++++++---------------
 fs/nilfs2/gcdat.c      |    1 +
 fs/notify/inode_mark.c |   10 ++++--
 fs/quota/dquot.c       |   12 ++++---
 6 files changed, 115 insertions(+), 49 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c808ca8..00180dc 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,11 +19,14 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
-		if (inode->i_mapping->nrpages == 0)
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (inode->i_mapping->nrpages == 0)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 49d44cc..404d449 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -281,10 +281,12 @@ static void inode_wait_for_writeback(struct inode *inode)
 	wait_queue_head_t *wqh;
 
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	 while (inode->i_state & I_SYNC) {
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 	}
 }
 
@@ -309,7 +311,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	if (!iref_read(inode))
+	spin_lock(&inode->i_lock);
+	if (!inode->i_ref)
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
@@ -324,6 +327,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			requeue_io(inode);
 			spin_unlock(&bdi->wb.b_lock);
@@ -341,6 +345,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY_PAGES;
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 
 	ret = do_writepages(mapping, wbc);
@@ -362,8 +367,10 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * write_inode()
 	 */
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -373,6 +380,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	}
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -381,6 +389,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			if (wbc->nr_to_write <= 0) {
 				/*
@@ -405,16 +414,21 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			 * submission or metadata updates after data IO
 			 * completion.
 			 */
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			redirty_tail(inode);
 			spin_unlock(&bdi->wb.b_lock);
 		} else {
 			/* The inode is clean */
+			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.b_lock);
 			list_del_init(&inode->i_io);
 			spin_unlock(&bdi->wb.b_lock);
 			inode_lru_list_add(inode);
 		}
+	} else {
+		/* freer will clean up */
+		spin_unlock(&inode->i_lock);
 	}
 	inode_sync_complete(inode);
 	return ret;
@@ -483,7 +497,9 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			return 0;
 		}
 
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_WILL_FREE | I_FREEING)) {
+			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			continue;
 		}
@@ -491,10 +507,11 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
 		 */
-		if (inode_dirtied_after(inode, wbc->wb_start))
+		if (inode_dirtied_after(inode, wbc->wb_start)) {
+			spin_unlock(&inode->i_lock);
 			return 1;
+		}
 
-		spin_lock(&inode->i_lock);
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->b_lock);
@@ -687,7 +704,9 @@ static long wb_writeback(struct bdi_writeback *wb,
 						struct inode, i_io);
 			spin_unlock(&wb->b_lock);
 			trace_wbc_writeback_wait(&wbc, wb->bdi);
+			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode);
+			spin_unlock(&inode->i_lock);
 		}
 		spin_unlock(&inode_lock);
 	}
@@ -953,6 +972,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
@@ -964,7 +984,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * superblock list, based upon its state.
 		 */
 		if (inode->i_state & I_SYNC)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * Only add valid (hashed) inodes to the superblock's
@@ -972,10 +992,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
 			if (hlist_bl_unhashed(&inode->i_hash))
-				goto out;
+				goto out_unlock;
 		}
 		if (inode->i_state & I_FREEING)
-			goto out;
+			goto out_unlock;
 
 		/*
 		 * If the inode was already on b_dirty/b_io/b_more_io, don't
@@ -998,12 +1018,16 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 					wakeup_bdi = true;
 			}
 
-			spin_lock(&bdi->wb.b_lock);
 			inode->dirtied_when = jiffies;
+			spin_unlock(&inode->i_lock);
+			spin_lock(&bdi->wb.b_lock);
 			list_move(&inode->i_io, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.b_lock);
+			goto out;
 		}
 	}
+out_unlock:
+	spin_unlock(&inode->i_lock);
 out:
 	spin_unlock(&inode_lock);
 
@@ -1052,12 +1076,15 @@ static void wait_sb_inodes(struct super_block *sb)
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		struct address_space *mapping;
 
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
-			continue;
+		spin_lock(&inode->i_lock);
 		mapping = inode->i_mapping;
-		if (mapping->nrpages == 0)
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    (mapping->nrpages == 0)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 		/*
diff --git a/fs/inode.c b/fs/inode.c
index 4ad7900..d3bd08a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
  * Locking rules.
  *
  * inode->i_lock protects:
- *   i_ref
+ *   i_ref i_state
  * inode_hash_bucket lock protects:
  *   inode hash table, i_hash
  * sb inode lock protects:
@@ -182,7 +182,7 @@ int proc_nr_inodes(ctl_table *table, int write,
 static void wake_up_inode(struct inode *inode)
 {
 	/*
-	 * Prevent speculative execution through spin_unlock(&inode_lock);
+	 * Prevent speculative execution through spin_unlock(&inode->i_lock);
 	 */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_NEW);
@@ -361,6 +361,8 @@ static void init_once(void *foo)
  */
 void iref_locked(struct inode *inode)
 {
+	assert_spin_locked(&inode->i_lock);
+
 	inode->i_ref++;
 }
 EXPORT_SYMBOL_GPL(iref_locked);
@@ -484,7 +486,9 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	inode_sync_wait(inode);
+	spin_lock(&inode->i_lock);
 	inode->i_state = I_FREEING | I_CLEAR;
+	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(end_writeback);
 
@@ -561,17 +565,18 @@ static int invalidate_list(struct super_block *sb, struct list_head *head,
 		if (tmp == head)
 			break;
 		inode = list_entry(tmp, struct inode, i_sb_list);
-		if (inode->i_state & I_NEW)
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & I_NEW) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		invalidate_inode_buffers(inode);
-		spin_lock(&inode->i_lock);
 		if (!inode->i_ref) {
 			struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-			spin_unlock(&inode->i_lock);
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
-
+			spin_unlock(&inode->i_lock);
 
 			/*
 			 * move the inode off the IO lists and LRU once
@@ -625,11 +630,12 @@ EXPORT_SYMBOL(invalidate_inodes);
 
 static int can_unuse(struct inode *inode)
 {
+	assert_spin_locked(&inode->i_lock);
 	if (inode->i_state)
 		return 0;
 	if (inode_has_buffers(inode))
 		return 0;
-	if (iref_read(inode))
+	if (inode->i_ref)
 		return 0;
 	if (inode->i_data.nrpages)
 		return 0;
@@ -675,9 +681,9 @@ static void prune_icache(int nr_to_scan)
 			continue;
 		}
 		if (inode->i_state & I_REFERENCED) {
+			inode->i_state &= ~I_REFERENCED;
 			spin_unlock(&inode->i_lock);
 			list_move(&inode->i_lru, &inode_lru);
-			inode->i_state &= ~I_REFERENCED;
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -691,6 +697,7 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 			spin_lock(&inode_lru_lock);
+			spin_lock(&inode->i_lock);
 
 			/*
 			 * if we can't reclaim this inod immediately, give it
@@ -699,12 +706,14 @@ static void prune_icache(int nr_to_scan)
 			 */
 			if (!can_unuse(inode)) {
 				list_move(&inode->i_lru, &inode_lru);
+				spin_unlock(&inode->i_lock);
 				continue;
 			}
-		} else
-			spin_unlock(&inode->i_lock);
+		}
+
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
 
 		/*
 		 * move the inode off the IO lists and LRU once
@@ -761,7 +770,7 @@ static struct shrinker icache_shrinker = {
 
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
- * Called with the inode lock held.
+ * Returns with inode->i_lock held.
  * NOTE: we are not increasing the inode-refcount, you must call iref_locked()
  * by hand after calling find_inode now! This simplifies iunique and won't
  * add any additional branch in the common code.
@@ -779,8 +788,11 @@ repeat:
 	hlist_bl_for_each_entry(inode, node, &b->head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
-		if (!test(inode, data))
+		spin_lock(&inode->i_lock);
+		if (!test(inode, data)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			spin_unlock_bucket(b);
 			__wait_on_freeing_inode(inode);
@@ -810,6 +822,7 @@ repeat:
 			continue;
 		if (inode->i_sb != sb)
 			continue;
+		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
 			spin_unlock_bucket(b);
 			__wait_on_freeing_inode(inode);
@@ -884,9 +897,9 @@ struct inode *new_inode(struct super_block *sb)
 	inode = alloc_inode(sb);
 	if (inode) {
 		spin_lock(&inode_lock);
-		__inode_add_to_lists(sb, NULL, inode);
 		inode->i_ino = ++last_ino;
 		inode->i_state = 0;
+		__inode_add_to_lists(sb, NULL, inode);
 		spin_unlock(&inode_lock);
 	}
 	return inode;
@@ -953,8 +966,8 @@ static struct inode *get_new_inode(struct super_block *sb,
 			if (set(inode, data))
 				goto set_failed;
 
-			__inode_add_to_lists(sb, b, inode);
 			inode->i_state = I_NEW;
+			__inode_add_to_lists(sb, b, inode);
 			spin_unlock(&inode_lock);
 
 			/* Return the locked inode with I_NEW set, the
@@ -968,7 +981,6 @@ static struct inode *get_new_inode(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
@@ -1017,7 +1029,6 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_lock(&old->i_lock);
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock(&inode_lock);
@@ -1071,17 +1082,19 @@ EXPORT_SYMBOL(iunique);
 struct inode *igrab(struct inode *inode)
 {
 	spin_lock(&inode_lock);
+	spin_lock(&inode->i_lock);
 	if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) {
-		spin_lock(&inode->i_lock);
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
-	} else
+	} else {
+		spin_unlock(&inode->i_lock);
 		/*
 		 * Handle the case where s_op->clear_inode is not been
 		 * called yet, and somebody is calling igrab
 		 * while the inode is getting freed.
 		 */
 		inode = NULL;
+	}
 	spin_unlock(&inode_lock);
 	return inode;
 }
@@ -1116,7 +1129,6 @@ static struct inode *ifind(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode(sb, b, test, data);
 	if (inode) {
-		spin_lock(&inode->i_lock);
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
@@ -1152,7 +1164,6 @@ static struct inode *ifind_fast(struct super_block *sb,
 	spin_lock(&inode_lock);
 	inode = find_inode_fast(sb, b, ino);
 	if (inode) {
-		spin_lock(&inode->i_lock);
 		iref_locked(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
@@ -1318,6 +1329,10 @@ int insert_inode_locked(struct inode *inode)
 	ino_t ino = inode->i_ino;
 	struct inode_hash_bucket *b = inode_hashtable + hash(sb, ino);
 
+	/*
+	 * Nobody else can see the new inode yet, so it is safe to set flags
+	 * without locking here.
+	 */
 	inode->i_state |= I_NEW;
 	while (1) {
 		struct hlist_bl_node *node;
@@ -1329,8 +1344,11 @@ int insert_inode_locked(struct inode *inode)
 				continue;
 			if (old->i_sb != sb)
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1339,7 +1357,6 @@ int insert_inode_locked(struct inode *inode)
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock_bucket(b);
@@ -1373,8 +1390,11 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 				continue;
 			if (!test(old, data))
 				continue;
-			if (old->i_state & (I_FREEING|I_WILL_FREE))
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
 				continue;
+			}
 			break;
 		}
 		if (likely(!node)) {
@@ -1383,7 +1403,6 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			spin_unlock(&inode_lock);
 			return 0;
 		}
-		spin_lock(&old->i_lock);
 		iref_locked(old);
 		spin_unlock(&old->i_lock);
 		spin_unlock_bucket(b);
@@ -1433,6 +1452,8 @@ static void iput_final(struct inode *inode)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	int drop;
 
+	assert_spin_locked(&inode->i_lock);
+
 	if (op && op->drop_inode)
 		drop = op->drop_inode(inode);
 	else
@@ -1443,22 +1464,28 @@ static void iput_final(struct inode *inode)
 			inode->i_state |= I_REFERENCED;
 			if (!(inode->i_state & (I_DIRTY|I_SYNC)) &&
 			    list_empty(&inode->i_lru)) {
+				spin_unlock(&inode->i_lock);
 				inode_lru_list_add(inode);
+				return;
 			}
+			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_lock);
 			return;
 		}
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state |= I_WILL_FREE;
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_lock);
 		write_inode_now(inode, 1);
 		spin_lock(&inode_lock);
+		spin_lock(&inode->i_lock);
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		__remove_inode_hash(inode);
 	}
 	WARN_ON(inode->i_state & I_NEW);
 	inode->i_state |= I_FREEING;
+	spin_unlock(&inode->i_lock);
 
 	/*
 	 * move the inode off the IO lists and LRU once I_FREEING is set so
@@ -1495,13 +1522,12 @@ static void iput_final(struct inode *inode)
 void iput(struct inode *inode)
 {
 	if (inode) {
-		BUG_ON(inode->i_state & I_CLEAR);
-
 		spin_lock(&inode_lock);
 		spin_lock(&inode->i_lock);
+		BUG_ON(inode->i_state & I_CLEAR);
+
 		inode->i_ref--;
 		if (inode->i_ref == 0) {
-			spin_unlock(&inode->i_lock);
 			iput_final(inode);
 			return;
 		}
@@ -1687,6 +1713,8 @@ EXPORT_SYMBOL(inode_wait);
  * wake_up_inode() after removing from the hash list will DTRT.
  *
  * This is called with inode_lock held.
+ *
+ * Called with i_lock held and returns with it dropped.
  */
 static void __wait_on_freeing_inode(struct inode *inode)
 {
@@ -1694,6 +1722,7 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
diff --git a/fs/nilfs2/gcdat.c b/fs/nilfs2/gcdat.c
index 84a45d1..c51f0e8 100644
--- a/fs/nilfs2/gcdat.c
+++ b/fs/nilfs2/gcdat.c
@@ -27,6 +27,7 @@
 #include "page.h"
 #include "mdt.h"
 
+/* XXX: what protects i_state? */
 int nilfs_init_gcdat_inode(struct the_nilfs *nilfs)
 {
 	struct inode *dat = nilfs->ns_dat, *gcdat = nilfs->ns_gc_dat;
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 3389ff0..8a05213 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -249,8 +249,11 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * I_WILL_FREE, or I_NEW which is fine because by that point
 		 * the inode cannot have any associated watches.
 		 */
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		/*
 		 * If the inode is not referenced, the inode cannot have any
@@ -258,9 +261,10 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		 * actually evict all unreferenced inodes from icache which is
 		 * unnecessarily violent and may in fact be illegal to do.
 		 */
-		spin_lock(&inode->i_lock);
-		if (!inode->i_ref)
+		if (!inode->i_ref) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 
 		need_iput_tmp = need_iput;
 		need_iput = NULL;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index b7cbc41..c7b5fc6 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -899,18 +899,20 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	spin_lock(&inode_lock);
 	spin_lock(&sb->s_inodes_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		    !atomic_read(&inode->i_writecount) ||
+		    !dqinit_needed(inode, type)) {
+			spin_unlock(&inode->i_lock);
 			continue;
+		}
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
 			reserved = 1;
 #endif
-		if (!atomic_read(&inode->i_writecount))
-			continue;
-		if (!dqinit_needed(inode, type))
-			continue;
 
 		iref_locked(inode);
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&sb->s_inodes_lock);
 		spin_unlock(&inode_lock);
 
-- 
1.7.1
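
The recurring change in the hunks above (insert_inode_locked(),
insert_inode_locked4(), fsnotify_unmount_inodes(), add_dquot_ref()) is to
take i_lock before testing i_state, drop it again when skipping an inode,
and keep it held until the reference has been taken. What follows is only a
minimal userspace sketch of that pattern, not kernel code: struct obj,
OBJ_FREEING, OBJ_WILL_FREE, find_live_locked() and grab_live() are
hypothetical names invented for illustration, and a pthread mutex stands in
for the spinlock.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for I_FREEING / I_WILL_FREE; not the kernel values. */
#define OBJ_FREEING   0x1u
#define OBJ_WILL_FREE 0x2u

struct obj {
	pthread_mutex_t lock;   /* plays the role of inode->i_lock */
	unsigned int state;     /* plays the role of inode->i_state */
	unsigned int ref;       /* plays the role of inode->i_ref */
	unsigned long id;
	struct obj *next;
};

/*
 * Find a live object with the given id. On success the object is returned
 * with its lock HELD, so the caller can take a reference before anyone can
 * mark it as being freed. Objects that are being torn down are skipped,
 * with their lock dropped again before moving on.
 */
static struct obj *find_live_locked(struct obj *head, unsigned long id)
{
	struct obj *o;

	for (o = head; o != NULL; o = o->next) {
		if (o->id != id)
			continue;
		pthread_mutex_lock(&o->lock);
		if (o->state & (OBJ_FREEING | OBJ_WILL_FREE)) {
			pthread_mutex_unlock(&o->lock);
			continue;
		}
		return o;
	}
	return NULL;
}

/* Take a reference under the lock, then drop it (cf. iref_locked() + unlock). */
static struct obj *grab_live(struct obj *head, unsigned long id)
{
	struct obj *o = find_live_locked(head, id);

	if (o != NULL) {
		o->ref++;
		pthread_mutex_unlock(&o->lock);
	}
	return o;
}
```

The point of the ordering is that the state test and the reference
increment happen in one critical section, so an object cannot be marked as
freeing between the check and the increment.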

