* Mapping range locking and related stuff
@ 2013-09-27 20:42 Jan Kara
2013-09-27 23:32 ` Dave Chinner
0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2013-09-27 20:42 UTC (permalink / raw)
To: dchinner; +Cc: linux-fsdevel, linux-mm
Hello,
so recently I've spent some time rummaging in get_user_pages(), fault
code etc. The use of mmap_sem is really messy in some places (like V4L
drivers, infiniband,...). It is held over a deep & wide call chains and
it's not clear what's protected by it, just in the middle of that is a call
to get_user_pages(). Anyway that's mostly a side note.
The main issue I found is with the range locking itself. Consider someone
doing:
fd = open("foo", O_RDWR);
base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
write(fd, base, 4096);
The write() is an interesting way to do nothing but if the mapping range
lock will be acquired early (like in generic_file_aio_write()), then this
would deadlock because generic_perform_write() will try to fault in
destination buffer and that will try to get the range lock for the same
range again.
Prefaulting buffer before we get the range lock isn't really an option
since the write(2) can be rather large. So we really either have to lock
page faults differently or use per page locking as I originally wanted.
I'm still thinking what would be the best solution for this. Ideas are
welcome.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Mapping range locking and related stuff
2013-09-27 20:42 Mapping range locking and related stuff Jan Kara
@ 2013-09-27 23:32 ` Dave Chinner
2013-09-30 16:25 ` Jan Kara
0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2013-09-27 23:32 UTC (permalink / raw)
To: Jan Kara; +Cc: dchinner, linux-fsdevel, linux-mm
On Fri, Sep 27, 2013 at 10:42:14PM +0200, Jan Kara wrote:
> Hello,
>
> so recently I've spent some time rummaging in get_user_pages(), fault
> code etc. The use of mmap_sem is really messy in some places (like V4L
> drivers, infiniband,...). It is held over a deep & wide call chains and
> it's not clear what's protected by it, just in the middle of that is a call
> to get_user_pages(). Anyway that's mostly a side note.
>
> The main issue I found is with the range locking itself. Consider someone
> doing:
> fd = open("foo", O_RDWR);
> base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> write(fd, base, 4096);
>
> The write() is an interesting way to do nothing but if the mapping range
> lock will be acquired early (like in generic_file_aio_write()), then this
> would deadlock because generic_perform_write() will try to fault in
> destination buffer and that will try to get the range lock for the same
> range again.
Quite frankly, I'd like to see EFAULT or EDEADLOCK returned to the
caller doing something like this. It's a stupid thing to do, and
while I beleive in giving people enough rope to hang themselves,
the contortions we are going through here to provide that rope
doesn't seem worthwhile at all.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: Mapping range locking and related stuff
2013-09-27 23:32 ` Dave Chinner
@ 2013-09-30 16:25 ` Jan Kara
0 siblings, 0 replies; 3+ messages in thread
From: Jan Kara @ 2013-09-30 16:25 UTC (permalink / raw)
To: Dave Chinner; +Cc: Jan Kara, dchinner, linux-fsdevel, linux-mm
On Sat 28-09-13 09:32:02, Dave Chinner wrote:
> On Fri, Sep 27, 2013 at 10:42:14PM +0200, Jan Kara wrote:
> > Hello,
> >
> > so recently I've spent some time rummaging in get_user_pages(), fault
> > code etc. The use of mmap_sem is really messy in some places (like V4L
> > drivers, infiniband,...). It is held over a deep & wide call chains and
> > it's not clear what's protected by it, just in the middle of that is a call
> > to get_user_pages(). Anyway that's mostly a side note.
> >
> > The main issue I found is with the range locking itself. Consider someone
> > doing:
> > fd = open("foo", O_RDWR);
> > base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > write(fd, base, 4096);
> >
> > The write() is an interesting way to do nothing but if the mapping range
> > lock will be acquired early (like in generic_file_aio_write()), then this
> > would deadlock because generic_perform_write() will try to fault in
> > destination buffer and that will try to get the range lock for the same
> > range again.
>
> Quite frankly, I'd like to see EFAULT or EDEADLOCK returned to the
> caller doing something like this. It's a stupid thing to do, and
> while I beleive in giving people enough rope to hang themselves,
> the contortions we are going through here to provide that rope
> doesn't seem worthwhile at all.
Well, what I wrote above was kind of minimal example. But if you would
like to solve all deadlocks of this kind (when more processes are
involved), you would have to forbid giving any file mapping (private or
shared) as a buffer for IO. And I'm afraid we cannot really afford that
because there may be reasonable uses for that.
Standards conformant solution would be to prefault a chunk of pages and
return short IO as soon as we hit unmapped page in the buffer. But I was
already fending off application developers complaining Linux returns short
write when you try to write more than 2 GB. So I'm *sure* we would break a
lot of userspace if we suddently start returning short writes for much
smaller requests. Crappy userspace ;)
I have a way to make things work but it doesn't make the locking simpler
from the current situation which is sad. The logic is as follows:
We will have range lock (basically a range-equivalent of current i_mutex).
Each range can also have a flag set saying that pages cannot be inserted in
a given range.
Punch hole / truncate lock the range with the flag set.
Write & direct IO lock the range without the flag set.
Read & page fault work on per-page basis and create page for given index
only after waiting for held range locks with flag set covering that index.
This achieves:
1) Exclusion among punch hole, truncate, writes, direct IO.
2) Exclusion between punch hole, truncate and page creation in given range.
3) Exclusion among reads, page faults, writes is achieved by PageLock as
currently.
Direct IO write will have the same consistency problems with mmap as it
currently has.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2013-09-30 16:25 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-09-27 20:42 Mapping range locking and related stuff Jan Kara
2013-09-27 23:32 ` Dave Chinner
2013-09-30 16:25 ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).