Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jerome Glisse <jglisse@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
Date: Tue, 13 Dec 2016 17:55:24 -0500	[thread overview]
Message-ID: <20161213225523.GG2305@redhat.com> (raw)
In-Reply-To: <20161213221322.GD4326@dastard>

On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context of
> > > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
> > Well not only target, it can be source too. But the device can read
> > and write any system memory and dma to/from that memory to its on
> > board memory.
> 
> So you want the device to be able to dirty mmapped pages that the
> CPU can't access?

Yes, correct.


> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> 
> Thought so. Lots of problems with this.
> 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file.
> 
> From the application point of view. Filesystem, page cache, etc
> there's substantial problems here...
> 
> > It is a regular file on your filesystem.
> 
> ... because of this.
> 
> > > > From kernel point of view such memory is almost like any other, it
> > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > accessible page.
> > > 
> > > That sounds ... complex. Page migration on page cache access inside
> > > the filesytem IO path locking during read()/write() sounds like
> > > a great way to cause deadlocks....
> > 
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are non exclusive of each other). First is to use
workqueue and have read/write wait on the workqueue to be done migrating the
page back.

Second solution is to use a bounce page during I/O so that there is no need
for migration.


> > > > But for thing like writeback i want to be able to do writeback with-
> > > > out having to migrate page back first. So that data can stay on the
> > > > device while writeback is happening.
> > > 
> > > Why can't you do writeback before migration, so only clean pages get
> > > moved?
> > 
> > Because device can write to the page while the page is inside the device
> > memory and we might want to writeback to disk while page stays in device
> > memory and computation continues.
> 
> Ok. So how does the device trigger ->page_mkwrite on a clean page to
> tell the filesystem that the page has been dirtied? So that, for
> example, if the page covers a hole because the file is sparse the
> filesytem can do the required block allocation and data
> initialisation (i.e. zero the cached page) before it gets marked
> dirty and any data gets written to it?
> 
> And if zeroing the page during such a fault requires CPU access to
> the data, how do you propose we handle page migration in the middle
> of the page fault to allow the CPU to zero the page? Seems like more
> lock order/inversion problems there, too...


File back page are never allocated on device, at least we have no incentive
for usecase we care about today to do so. So a regular page is first use
and initialize (to zero for hole) before being migrated to device. So i do
not believe there should be any major concern on ->page_mkwrite. At least
this was my impression when i look at generic filemap one, but for some
filesystem this might need be problematic. I intend to enable this kind of
migration on fs basis and allowing control by userspace to block such
migration for given fs.

Jï¿½rï¿½me

WARNING: multiple messages have this Message-ID (diff)

From: Jerome Glisse <jglisse@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
Date: Tue, 13 Dec 2016 17:55:24 -0500	[thread overview]
Message-ID: <20161213225523.GG2305@redhat.com> (raw)
In-Reply-To: <20161213221322.GD4326@dastard>

On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context of
> > > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
> > Well not only target, it can be source too. But the device can read
> > and write any system memory and dma to/from that memory to its on
> > board memory.
> 
> So you want the device to be able to dirty mmapped pages that the
> CPU can't access?

Yes, correct.


> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> 
> Thought so. Lots of problems with this.
> 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file.
> 
> From the application point of view. Filesystem, page cache, etc
> there's substantial problems here...
> 
> > It is a regular file on your filesystem.
> 
> ... because of this.
> 
> > > > From kernel point of view such memory is almost like any other, it
> > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > accessible page.
> > > 
> > > That sounds ... complex. Page migration on page cache access inside
> > > the filesytem IO path locking during read()/write() sounds like
> > > a great way to cause deadlocks....
> > 
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are non exclusive of each other). First is to use
workqueue and have read/write wait on the workqueue to be done migrating the
page back.

Second solution is to use a bounce page during I/O so that there is no need
for migration.


> > > > But for thing like writeback i want to be able to do writeback with-
> > > > out having to migrate page back first. So that data can stay on the
> > > > device while writeback is happening.
> > > 
> > > Why can't you do writeback before migration, so only clean pages get
> > > moved?
> > 
> > Because device can write to the page while the page is inside the device
> > memory and we might want to writeback to disk while page stays in device
> > memory and computation continues.
> 
> Ok. So how does the device trigger ->page_mkwrite on a clean page to
> tell the filesystem that the page has been dirtied? So that, for
> example, if the page covers a hole because the file is sparse the
> filesytem can do the required block allocation and data
> initialisation (i.e. zero the cached page) before it gets marked
> dirty and any data gets written to it?
> 
> And if zeroing the page during such a fault requires CPU access to
> the data, how do you propose we handle page migration in the middle
> of the page fault to allow the CPU to zero the page? Seems like more
> lock order/inversion problems there, too...


File back page are never allocated on device, at least we have no incentive
for usecase we care about today to do so. So a regular page is first use
and initialize (to zero for hole) before being migrated to device. So i do
not believe there should be any major concern on ->page_mkwrite. At least
this was my impression when i look at generic filemap one, but for some
filesystem this might need be problematic. I intend to enable this kind of
migration on fs basis and allowing control by userspace to block such
migration for given fs.

Jï¿½rï¿½me

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Jerome Glisse <jglisse@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] Un-addressable device memory and block/fs implications
Date: Tue, 13 Dec 2016 17:55:24 -0500	[thread overview]
Message-ID: <20161213225523.GG2305@redhat.com> (raw)
In-Reply-To: <20161213221322.GD4326@dastard>

On Wed, Dec 14, 2016 at 09:13:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 13, 2016 at 04:24:33PM -0500, Jerome Glisse wrote:
> > On Wed, Dec 14, 2016 at 08:10:41AM +1100, Dave Chinner wrote:
> > > On Tue, Dec 13, 2016 at 03:31:13PM -0500, Jerome Glisse wrote:
> > > > On Wed, Dec 14, 2016 at 07:15:15AM +1100, Dave Chinner wrote:
> > > > > On Tue, Dec 13, 2016 at 01:15:11PM -0500, Jerome Glisse wrote:
> > > > > > I would like to discuss un-addressable device memory in the context of
> > > > > > filesystem and block device. Specificaly how to handle write-back, read,
> > > > > > ... when a filesystem page is migrated to device memory that CPU can not
> > > > > > access.
> > > > > 
> > > > > You mean pmem that is DAX-capable that suddenly, without warning,
> > > > > becomes non-DAX capable?
> > > > > 
> > > > > If you are not talking about pmem and DAX, then exactly what does
> > > > > "when a filesystem page is migrated to device memory that CPU can
> > > > > not access" mean? What "filesystem page" are we talking about that
> > > > > can get migrated from main RAM to something the CPU can't access?
> > > > 
> > > > I am talking about GPU, FPGA, ... any PCIE device that have fast on
> > > > board memory that can not be expose transparently to the CPU. I am
> > > > reusing ZONE_DEVICE for this, you can see HMM patchset on linux-mm
> > > > https://lwn.net/Articles/706856/
> > > 
> > > So ZONE_DEVICE memory that is a DMA target but not CPU addressable?
> > 
> > Well not only target, it can be source too. But the device can read
> > and write any system memory and dma to/from that memory to its on
> > board memory.
> 
> So you want the device to be able to dirty mmapped pages that the
> CPU can't access?

Yes, correct.


> > > > So in my case i am only considering non DAX/PMEM filesystem ie any
> > > > "regular" filesystem back by a "regular" block device. I want to be
> > > > able to migrate mmaped area of such filesystem to device memory while
> > > > the device is actively using that memory.
> > > 
> > > "migrate mmapped area of such filesystem" means what, exactly?
> > 
> > fd = open("/path/to/some/file")
> > ptr = mmap(fd, ...);
> > gpu_compute_something(ptr);
> 
> Thought so. Lots of problems with this.
> 
> > > Are you talking about file data contents that have been copied into
> > > the page cache and mmapped into a user process address space?
> > > IOWs, migrating ZONE_NORMAL page cache page content and state
> > > to a new ZONE_DEVICE page, and then migrating back again somehow?
> > 
> > Take any existing application that mmap a file and allow to migrate
> > chunk of that mmaped file to device memory without the application
> > even knowing about it. So nothing special in respect to that mmaped
> > file.
> 
> From the application point of view. Filesystem, page cache, etc
> there's substantial problems here...
> 
> > It is a regular file on your filesystem.
> 
> ... because of this.
> 
> > > > From kernel point of view such memory is almost like any other, it
> > > > has a struct page and most of the mm code is non the wiser, nor need
> > > > to be about it. CPU access trigger a migration back to regular CPU
> > > > accessible page.
> > > 
> > > That sounds ... complex. Page migration on page cache access inside
> > > the filesytem IO path locking during read()/write() sounds like
> > > a great way to cause deadlocks....
> > 
> > There are few restriction on device page, no one can do GUP on them and
> > thus no one can pin them. Hence they can always be migrated back. Yes
> > each fs need modification, most of it (if not all) is isolated in common
> > filemap helpers.
> 
> Sure, but you haven't answered my question: how do you propose we
> address the issue of placing all the mm locks required for migration
> under the filesystem IO path locks?

Two different plans (which are non exclusive of each other). First is to use
workqueue and have read/write wait on the workqueue to be done migrating the
page back.

Second solution is to use a bounce page during I/O so that there is no need
for migration.


> > > > But for thing like writeback i want to be able to do writeback with-
> > > > out having to migrate page back first. So that data can stay on the
> > > > device while writeback is happening.
> > > 
> > > Why can't you do writeback before migration, so only clean pages get
> > > moved?
> > 
> > Because device can write to the page while the page is inside the device
> > memory and we might want to writeback to disk while page stays in device
> > memory and computation continues.
> 
> Ok. So how does the device trigger ->page_mkwrite on a clean page to
> tell the filesystem that the page has been dirtied? So that, for
> example, if the page covers a hole because the file is sparse the
> filesytem can do the required block allocation and data
> initialisation (i.e. zero the cached page) before it gets marked
> dirty and any data gets written to it?
> 
> And if zeroing the page during such a fault requires CPU access to
> the data, how do you propose we handle page migration in the middle
> of the page fault to allow the CPU to zero the page? Seems like more
> lock order/inversion problems there, too...


File back page are never allocated on device, at least we have no incentive
for usecase we care about today to do so. So a regular page is first use
and initialize (to zero for hole) before being migrated to device. So i do
not believe there should be any major concern on ->page_mkwrite. At least
this was my impression when i look at generic filemap one, but for some
filesystem this might need be problematic. I intend to enable this kind of
migration on fs basis and allowing control by userspace to block such
migration for given fs.

Jerome

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-12-13 22:55 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-13 18:15 [LSF/MM TOPIC] Un-addressable device memory and block/fs implications Jerome Glisse
2016-12-13 18:15 ` Jerome Glisse
2016-12-13 18:15 ` Jerome Glisse
2016-12-13 18:20 ` James Bottomley
2016-12-13 18:20   ` James Bottomley
2016-12-13 18:20   ` James Bottomley
2016-12-13 18:55   ` Jerome Glisse
2016-12-13 18:55     ` Jerome Glisse
2016-12-13 18:55     ` Jerome Glisse
2016-12-13 20:01     ` James Bottomley
2016-12-13 20:01       ` James Bottomley
2016-12-13 20:22       ` Jerome Glisse
2016-12-13 20:22         ` Jerome Glisse
2016-12-13 20:22         ` Jerome Glisse
2016-12-13 20:27       ` Dave Hansen
2016-12-13 20:27         ` Dave Hansen
2016-12-13 20:15 ` Dave Chinner
2016-12-13 20:15   ` Dave Chinner
2016-12-13 20:31   ` Jerome Glisse
2016-12-13 20:31     ` Jerome Glisse
2016-12-13 20:31     ` Jerome Glisse
2016-12-13 21:10     ` Dave Chinner
2016-12-13 21:10       ` Dave Chinner
2016-12-13 21:24       ` Jerome Glisse
2016-12-13 21:24         ` Jerome Glisse
2016-12-13 21:24         ` Jerome Glisse
2016-12-13 22:08         ` Dave Hansen
2016-12-13 22:08           ` Dave Hansen
2016-12-13 23:02           ` Jerome Glisse
2016-12-13 23:02             ` Jerome Glisse
2016-12-13 23:02             ` Jerome Glisse
2016-12-13 22:13         ` Dave Chinner
2016-12-13 22:13           ` Dave Chinner
2016-12-13 22:55           ` Jerome Glisse [this message]
2016-12-13 22:55             ` Jerome Glisse
2016-12-13 22:55             ` Jerome Glisse
2016-12-14  0:14             ` Dave Chinner
2016-12-14  0:14               ` Dave Chinner
2016-12-14  1:07               ` Jerome Glisse
2016-12-14  1:07                 ` Jerome Glisse
2016-12-14  1:07                 ` Jerome Glisse
2016-12-14  4:23                 ` Dave Chinner
2016-12-14  4:23                   ` Dave Chinner
2016-12-14 16:35                   ` Jerome Glisse
2016-12-14 16:35                     ` Jerome Glisse
2016-12-14 16:35                     ` Jerome Glisse
2016-12-14 11:13         ` [Lsf-pc] " Jan Kara
2016-12-14 11:13           ` Jan Kara
2016-12-14 17:15           ` Jerome Glisse
2016-12-14 17:15             ` Jerome Glisse
2016-12-14 17:15             ` Jerome Glisse
2016-12-15 16:19             ` Jan Kara
2016-12-15 16:19               ` Jan Kara
2016-12-15 19:14               ` Jerome Glisse
2016-12-15 19:14                 ` Jerome Glisse
2016-12-15 19:14                 ` Jerome Glisse
2016-12-16  8:14                 ` Jan Kara
2016-12-16  8:14                   ` Jan Kara
2016-12-16  3:10               ` Aneesh Kumar K.V
2016-12-16  3:10                 ` Aneesh Kumar K.V
2016-12-16  3:10                 ` Aneesh Kumar K.V
2016-12-19  8:46                 ` Jan Kara
2016-12-19  8:46                   ` Jan Kara
2016-12-19 17:00           ` Aneesh Kumar K.V
2016-12-19 17:00             ` Aneesh Kumar K.V
2016-12-14  3:55 ` Balbir Singh
2016-12-14  3:55   ` Balbir Singh
2016-12-16  3:14 ` [LSF/MM ATTEND] " Aneesh Kumar K.V
2016-12-16  3:14   ` Aneesh Kumar K.V
2017-01-16 12:04   ` Anshuman Khandual
2017-01-16 12:04     ` Anshuman Khandual
2017-01-16 23:15     ` John Hubbard
2017-01-16 23:15       ` John Hubbard
2017-01-18 11:00   ` [Lsf-pc] " Jan Kara
2017-01-18 11:00     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161213225523.GG2305@redhat.com \
    --to=jglisse@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.