* [RFC 0/2] locking order of mm->mmap_sem and various FS
@ 2011-11-03 4:53 J. R. Okajima
2011-11-03 4:53 ` [RFC 1/2] introduce f_op->{pre,post}_mmap() J. R. Okajima
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: J. R. Okajima @ 2011-11-03 4:53 UTC (permalink / raw)
To: linux-kernel; +Cc: hooanon05, viro, hch, jwboyer, wli
There had ever been several posts which report a circular locking
problem around mm->mmap_sem and FS. For instance
"INFO: possible circular locking dependency detected 3.1.0-rc2-00190-g3210d19"
<http://marc.info/?l=linux-kernel&m=131402669412658&w=2>
While the problem in ext4 evict_inode seems to be already solved, here
I'll try fixing hugetlbfs as first step. The problem in hugetlbfs is
- read(2) -- hugetlbfs_read() -- ... -- __copy_to_user()
hugetlbfs_read() holds i_mutex. So this is i_mutex before mmap_sem
correctly.
- mmap(2) -- hugetlbfs_file_mmap()
hugetlbfs_file_mmap() holds i_mutex too. But mmap_sem is already held
before hugetlbfs_file_mmap(). This is an AB-BA problem.
While I am not sure whether hugetlbfs_read() really needs to acquire
i_mutex, if it really does, then I'd suggest f_op->{pre,post}_mmap().
These two patches are just to show the approach and not intends to be
merged into mainline now. I don't think it is the best solution, but I
simply have no idea other than this.
I'd like to hear comments from LKML people.
Taking a glance at ->mmap() functions in several FSs. I also found
gfs2_mmap()/gfs2_readdir() which acquires gl->gl_spin and may cause a
similar problem. And ocfs2_mmap()/ocfs2_readdir() too, but I don't
understand it enough.
If it is OK and {pre,post}_mmap() is accepted, then I will step forward
and try fixing below too. All of them acquires mmap_sem and calls
->mmap() (indirectly).
- callers of do_mmap()
arch/x86/ia32/ia32_aout.c:load_aout_binary() and its siblings
arch/x86/kvm/x86.c:kvm_arch_prepare_memory_region()
arch/tile/kernel/single_step.c:single_step_once()
drivers/gpu/drm/drm_bufs.c:drm_mapbufs() and others
drivers/gpu/drm/i810/i810_dma.c:i810_map_buffer()
drivers/gpu/drm/i915/i915_gem.c:i915_gem_mmap_ioctl()
fs/aio.c:aio_setup_ring()
fs/binfmt_aout.c:load_aout_binary() and its siblings
fs/binfmt_elf.c:elf_map() and its siblings
fs/binfmt_elf_fdpic.c:load_elf_fdpic_binary() and its siblings
fs/binfmt_flat.c:load_flat_file() and its siblings
fs/binfmt_som.c:map_som_binary() and its siblings
ipc/shm.c:do_shmat()
- callers of do_mmap_pgoff()
mm/nommu.c:SYSCALL mmap_pgoff
mm/mmap.c:SYSCALL mmap_pgoff
- callers of mmap_region()
arch/tile/mm/elf.c:arch_setup_additional_pages()
Additionally they will need some work too.
- callers of ->mmap()
fs/coda/file.c:coda_file_mmap()
fs/proc/inode.c:proc_reg_mmap()
Oh, the base version is v3.0, not latest mainline.
J. R. Okajima (2):
introduce f_op->{pre,post}_mmap()
hugetlbfs: implement f_op->{pre,post}_mmap()
Documentation/filesystems/Locking | 8 ++++++++
Documentation/filesystems/vfs.txt | 7 +++++++
fs/hugetlbfs/inode.c | 20 +++++++++++++++++---
include/linux/fs.h | 2 ++
include/linux/mm.h | 4 ++++
mm/mmap.c | 27 ++++++++++++++++++++++++---
6 files changed, 62 insertions(+), 6 deletions(-)
--
1.7.2.5
^ permalink raw reply [flat|nested] 5+ messages in thread
* [RFC 1/2] introduce f_op->{pre,post}_mmap()
2011-11-03 4:53 [RFC 0/2] locking order of mm->mmap_sem and various FS J. R. Okajima
@ 2011-11-03 4:53 ` J. R. Okajima
2011-11-03 4:53 ` [RFC 2/2] hugetlbfs: implement f_op->{pre,post}_mmap() J. R. Okajima
2011-11-03 7:48 ` [RFC 0/2] locking order of mm->mmap_sem and various FS Christoph Hellwig
2 siblings, 0 replies; 5+ messages in thread
From: J. R. Okajima @ 2011-11-03 4:53 UTC (permalink / raw)
To: linux-kernel; +Cc: hooanon05, viro, hch, jwboyer, wli
The locking order between mm->mmap_sem and inode->i_mutex (or other FS
internal lock) has a problem. The right order is i_mutex first and then
mmap_sem. But sometimes this is hard for FS which has complicated
->mmap() since it prohibits acquire i_mutex (or other FS internal lock)
otherwise it will case an AB-BA deadlock problem.
In order to allow FS to implemente complicated ->mmpa(), introduce
f_op->{pre,post}_mmap(). ->pre_mmap() is called just before acquiring
mmap_sem for ->mmap(), and ->post_mmap() is called just after releasing
mmap_sem.
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
Documentation/filesystems/Locking | 8 ++++++++
Documentation/filesystems/vfs.txt | 7 +++++++
include/linux/fs.h | 2 ++
include/linux/mm.h | 4 ++++
mm/mmap.c | 27 ++++++++++++++++++++++++---
5 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 57d827d..1815e20 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -408,7 +408,9 @@ prototypes:
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*post_mmap) (struct file *, unsigned long, unsigned long);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
@@ -466,6 +468,12 @@ components. And there are other reasons why the current interface is a mess...
->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.
+->mmap has mm->mmap_sem for write. If your FS needs i_mutex for mmap(2),
+ then never acquire it in ->mmap. Instead acquire it in ->pre_mmap(),
+ and release it in ->post_mmap() since they don't have mm->mmap_sem.
+ When ->pre_mmap() returns other than zero, both of ->mmap() and
+ ->post_mmap() will not be called.
+
--------------------------- dquot_operations -------------------------------
prototypes:
int (*write_dquot) (struct dquot *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 88b9f55..e2a579e 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -751,7 +751,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*post_mmap) (struct file *, unsigned long, unsigned long);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
@@ -794,8 +796,13 @@ otherwise noted.
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
are used on 64 bit kernels.
+ pre_mmap: called by the mmap(2) system call
+
mmap: called by the mmap(2) system call
+ post_mmap: called by the mmap(2) system call
+ For pre_mmap, mmap, post_mmap, read Locking.txt too.
+
open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file". It then calls the
open method for the newly allocated file structure. You might
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..bce9d44 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1554,7 +1554,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*post_mmap) (struct file *, unsigned long, unsigned long);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..e22230c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1428,6 +1428,10 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff);
+extern int pre_mmap(struct file *file, unsigned long prot, unsigned long flag);
+extern void post_mmap(struct file *file, unsigned long prot,
+ unsigned long flag);
+
static inline unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
diff --git a/mm/mmap.c b/mm/mmap.c
index 0290c8e..0dd2acb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1089,6 +1089,25 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
}
EXPORT_SYMBOL(do_mmap_pgoff);
+int pre_mmap(struct file *file, unsigned long prot, unsigned long flag)
+{
+ int err;
+
+ err = 0;
+ if (file && file->f_op && file->f_op->pre_mmap)
+ err = file->f_op->pre_mmap(file, prot, flag);
+ if (!err)
+ down_write(¤t->mm->mmap_sem);
+ return err;
+}
+
+void post_mmap(struct file *file, unsigned long prot, unsigned long flag)
+{
+ up_write(¤t->mm->mmap_sem);
+ if (file && file->f_op && file->f_op->post_mmap)
+ file->f_op->post_mmap(file, prot, flag);
+}
+
SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, pgoff)
@@ -1120,9 +1139,11 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
- down_write(¤t->mm->mmap_sem);
- retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
- up_write(¤t->mm->mmap_sem);
+ retval = pre_mmap(file, prot, flags);
+ if (!retval) {
+ retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+ post_mmap(file, prot, flags);
+ }
if (file)
fput(file);
--
1.7.2.5
^ permalink raw reply related [flat|nested] 5+ messages in thread
* [RFC 2/2] hugetlbfs: implement f_op->{pre,post}_mmap()
2011-11-03 4:53 [RFC 0/2] locking order of mm->mmap_sem and various FS J. R. Okajima
2011-11-03 4:53 ` [RFC 1/2] introduce f_op->{pre,post}_mmap() J. R. Okajima
@ 2011-11-03 4:53 ` J. R. Okajima
2011-11-03 7:48 ` [RFC 0/2] locking order of mm->mmap_sem and various FS Christoph Hellwig
2 siblings, 0 replies; 5+ messages in thread
From: J. R. Okajima @ 2011-11-03 4:53 UTC (permalink / raw)
To: linux-kernel; +Cc: hooanon05, viro, hch, jwboyer, wli
To fix the AB-BA deadlock problem between mm->mmap_sem and
inode->i_mutex.
Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
fs/hugetlbfs/inode.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7aafeb8..f044d54 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -76,6 +76,21 @@ static void huge_pagevec_release(struct pagevec *pvec)
pagevec_reinit(pvec);
}
+static int hugetlbfs_file_pre_mmap(struct file *file, unsigned long prot,
+ unsigned long flag)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ mutex_lock(&inode->i_mutex);
+ return 0;
+}
+
+static void hugetlbfs_file_post_mmap(struct file *file, unsigned long prot,
+ unsigned long flag)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ mutex_unlock(&inode->i_mutex);
+}
+
static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_path.dentry->d_inode;
@@ -99,7 +114,6 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
- mutex_lock(&inode->i_mutex);
file_accessed(file);
ret = -ENOMEM;
@@ -116,8 +130,6 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_flags & VM_WRITE && inode->i_size < len)
inode->i_size = len;
out:
- mutex_unlock(&inode->i_mutex);
-
return ret;
}
@@ -693,7 +705,9 @@ static void init_once(void *foo)
const struct file_operations hugetlbfs_file_operations = {
.read = hugetlbfs_read,
+ .pre_mmap = hugetlbfs_file_pre_mmap,
.mmap = hugetlbfs_file_mmap,
+ .post_mmap = hugetlbfs_file_post_mmap,
.fsync = noop_fsync,
.get_unmapped_area = hugetlb_get_unmapped_area,
.llseek = default_llseek,
--
1.7.2.5
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [RFC 0/2] locking order of mm->mmap_sem and various FS
2011-11-03 4:53 [RFC 0/2] locking order of mm->mmap_sem and various FS J. R. Okajima
2011-11-03 4:53 ` [RFC 1/2] introduce f_op->{pre,post}_mmap() J. R. Okajima
2011-11-03 4:53 ` [RFC 2/2] hugetlbfs: implement f_op->{pre,post}_mmap() J. R. Okajima
@ 2011-11-03 7:48 ` Christoph Hellwig
2011-11-14 5:18 ` J. R. Okajima
2 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2011-11-03 7:48 UTC (permalink / raw)
To: J. R. Okajima; +Cc: linux-kernel, viro, hch, jwboyer, wli
On Thu, Nov 03, 2011 at 01:53:53PM +0900, J. R. Okajima wrote:
> There had ever been several posts which report a circular locking
> problem around mm->mmap_sem and FS. For instance
> "INFO: possible circular locking dependency detected 3.1.0-rc2-00190-g3210d19"
> <http://marc.info/?l=linux-kernel&m=131402669412658&w=2>
>
> While the problem in ext4 evict_inode seems to be already solved, here
> I'll try fixing hugetlbfs as first step. The problem in hugetlbfs is
>
> - read(2) -- hugetlbfs_read() -- ... -- __copy_to_user()
> hugetlbfs_read() holds i_mutex. So this is i_mutex before mmap_sem
> correctly.
>
> - mmap(2) -- hugetlbfs_file_mmap()
> hugetlbfs_file_mmap() holds i_mutex too. But mmap_sem is already held
> before hugetlbfs_file_mmap(). This is an AB-BA problem.
And that is where the problem is. ->mmap should not take i_mutex.
While a few filesystems abuse i_mutex it only has two clear defined
meanings:
- protect the namespace topology (only on directories)
- protect writes against each other and truncate.
The hugetlb use falls under neither category and should be switched
to an internal lock. It seems like that look should in fact be taken
inside hugetlb_reserve_pages, as that is the only thing that appears
to need any protection.
> While I am not sure whether hugetlbfs_read() really needs to acquire
> i_mutex, if it really does, then I'd suggest f_op->{pre,post}_mmap().
> These two patches are just to show the approach and not intends to be
> merged into mainline now. I don't think it is the best solution, but I
> simply have no idea other than this.
> I'd like to hear comments from LKML people.
>
> Taking a glance at ->mmap() functions in several FSs. I also found
> gfs2_mmap()/gfs2_readdir() which acquires gl->gl_spin and may cause a
> similar problem. And ocfs2_mmap()/ocfs2_readdir() too, but I don't
> understand it enough.
You can't mmap directories, so that is not a problem.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [RFC 0/2] locking order of mm->mmap_sem and various FS
2011-11-03 7:48 ` [RFC 0/2] locking order of mm->mmap_sem and various FS Christoph Hellwig
@ 2011-11-14 5:18 ` J. R. Okajima
0 siblings, 0 replies; 5+ messages in thread
From: J. R. Okajima @ 2011-11-14 5:18 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-kernel, viro, jwboyer, wli
Christoph Hellwig:
> While a few filesystems abuse i_mutex it only has two clear defined
> meanings:
>
> - protect the namespace topology (only on directories)
> - protect writes against each other and truncate.
>
> The hugetlb use falls under neither category and should be switched
> to an internal lock. It seems like that look should in fact be taken
> inside hugetlb_reserve_pages, as that is the only thing that appears
> to need any protection.
Sorry my late reply. I am just busy.
I have no objection about switching another lock.
But we may need to consider about truncate(2) too.
- hugetlbfs_file_mmap() updates i_size (single-direction. eg. grow
only).
- hugetlbfs_read() refers i_size (only once).
- hugetlbfs_setattr() updates i_size too (both directions are possible).
If we do these (below), then the race betwee mmap(2) and read(2) will be
fixed.
- remove i_mutex from hugetlbfs_file_mmap().
- use i_size_read() and i_size_write() in hugetlbfs_file_mmap().
Since hugetlbfs_read() acquires i_mutex, read/truncate race will not
happen, but the mmap/truncate race may happen by removing i_mutex from
hugetlbfs_file_mmap() (above).
To address this race, I'd also suggest these (below).
- introduce a new mutex in struct hugetlbfs_inode_info.
- acquire it in hugetlbfs_file_mmap().
- protect i_size and truncate_hugepages() in hugetlb_vmtruncate() by the
new mutex too.
Honestly speaking I am not sure yet whether it is necessary to protect
truncate_hugepages() in hugetlb_vmtruncate().
Would you review this patch?
By the way, who is maintaining hugetlbfs currently?
My previous mails to "William Irwin <wli@holomorphy.com>" were bounced.
J. R. Okajima
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0be5a78..cc7db2d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -99,12 +99,12 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
vma_len = (loff_t)(vma->vm_end - vma->vm_start);
- mutex_lock(&inode->i_mutex);
file_accessed(file);
ret = -ENOMEM;
len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ mutex_lock(&HUGETLBFS_I(inode)->mtx);
if (hugetlb_reserve_pages(inode,
vma->vm_pgoff >> huge_page_order(h),
len >> huge_page_shift(h), vma,
@@ -113,10 +113,10 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
ret = 0;
hugetlb_prefault_arch_hook(vma->vm_mm);
- if (vma->vm_flags & VM_WRITE && inode->i_size < len)
- inode->i_size = len;
+ if (vma->vm_flags & VM_WRITE && i_size_read(inode) < len)
+ i_size_write(inode, len);
out:
- mutex_unlock(&inode->i_mutex);
+ mutex_unlock(&HUGETLBFS_I(inode)->mtx);
return ret;
}
@@ -411,12 +411,12 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
+ mutex_lock(&HUGETLBFS_I(inode)->mtx);
i_size_write(inode, offset);
- mutex_lock(&mapping->i_mmap_mutex);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- mutex_unlock(&mapping->i_mmap_mutex);
truncate_hugepages(inode, offset);
+ mutex_unlock(&HUGETLBFS_I(inode)->mtx);
return 0;
}
@@ -673,6 +673,7 @@ static void hugetlbfs_i_callback(struct rcu_head *head)
static void hugetlbfs_destroy_inode(struct inode *inode)
{
hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
+ mutex_destroy(&HUGETLBFS_I(inode)->mtx);
mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
call_rcu(&inode->i_rcu, hugetlbfs_i_callback);
}
@@ -689,6 +690,7 @@ static void init_once(void *foo)
{
struct hugetlbfs_inode_info *ei = (struct hugetlbfs_inode_info *)foo;
+ mutex_init(&ei->mtx);
inode_init_once(&ei->vfs_inode);
}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 19644e0..32a336b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -153,6 +153,7 @@ struct hugetlbfs_sb_info {
struct hugetlbfs_inode_info {
+ struct mutex mtx;
struct shared_policy policy;
struct inode vfs_inode;
};
^ permalink raw reply related [flat|nested] 5+ messages in thread
end of thread, other threads:[~2011-11-14 5:26 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-03 4:53 [RFC 0/2] locking order of mm->mmap_sem and various FS J. R. Okajima
2011-11-03 4:53 ` [RFC 1/2] introduce f_op->{pre,post}_mmap() J. R. Okajima
2011-11-03 4:53 ` [RFC 2/2] hugetlbfs: implement f_op->{pre,post}_mmap() J. R. Okajima
2011-11-03 7:48 ` [RFC 0/2] locking order of mm->mmap_sem and various FS Christoph Hellwig
2011-11-14 5:18 ` J. R. Okajima
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox