Re: [PATCH v3 3/7] padata: dispatch works on different nodes

From: Tim Chen <tim.c.chen@linux.intel.com>
To: Gang Li <gang.li@linux.dev>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	David Rientjes <rientjes@google.com>,
	 linux-kernel@vger.kernel.org, ligang.bdlg@bytedance.com,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH v3 3/7] padata: dispatch works on different nodes
Date: Wed, 17 Jan 2024 14:14:56 -0800
Message-ID: <cc135e4ba87bf64b384a529ccbd4c644bb135266.camel@linux.intel.com>
In-Reply-To: <ea4a5417-1fce-4b36-be4d-215086fd7e96@linux.dev>

On Mon, 2024-01-15 at 16:57 +0800, Gang Li wrote:
> 
> On 2024/1/13 02:27, Tim Chen wrote:
> > On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
> > > On 2024/1/12 01:50, Tim Chen wrote:
> > > > On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
> > > > > When a group of tasks that access different nodes are scheduled on the
> > > > > same node, they may encounter bandwidth bottlenecks and access latency.
> > > > > 
> > > > > Thus, numa_aware flag is introduced here, allowing tasks to be
> > > > > distributed across different nodes to fully utilize the advantage of
> > > > > multi-node systems.
> > > > > 
> > > > > Signed-off-by: Gang Li <gang.li@linux.dev>
> > > > > ---
> > > > >    include/linux/padata.h | 3 +++
> > > > >    kernel/padata.c        | 8 ++++++--
> > > > >    mm/mm_init.c           | 1 +
> > > > >    3 files changed, 10 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/include/linux/padata.h b/include/linux/padata.h
> > > > > index 495b16b6b4d72..f79ccd50e7f40 100644
> > > > > --- a/include/linux/padata.h
> > > > > +++ b/include/linux/padata.h
> > > > > @@ -137,6 +137,8 @@ struct padata_shell {
> > > > >     *             appropriate for one worker thread to do at once.
> > > > >     * @max_threads: Max threads to use for the job, actual number may be less
> > > > >     *               depending on task size and minimum chunk size.
> > > > > + * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
> > > > > + *              no CPU, dispatch its jobs to a random CPU.
> > > > >     */
> > > > >    struct padata_mt_job {
> > > > >    	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
> > > > > @@ -146,6 +148,7 @@ struct padata_mt_job {
> > > > >    	unsigned long		align;
> > > > >    	unsigned long		min_chunk;
> > > > >    	int			max_threads;
> > > > > +	bool			numa_aware;
> > > > >    };
> > > > >    
> > > > >    /**
> > > > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > > > index 179fb1518070c..1c2b3a337479e 100644
> > > > > --- a/kernel/padata.c
> > > > > +++ b/kernel/padata.c
> > > > > @@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
> > > > >    	struct padata_work my_work, *pw;
> > > > >    	struct padata_mt_job_state ps;
> > > > >    	LIST_HEAD(works);
> > > > > -	int nworks;
> > > > > +	int nworks, nid = 0;
> > > > 
> > > > If we always start from 0, we may be biased towards the low numbered node,
> > > > and not use high numbered nodes at all.  Suggest you do
> > > > static int nid = 0;
> > > > 
> > > 
> > > When we use `static`, if there are multiple parallel calls to
> > > `padata_do_multithreaded`, it may result in an uneven distribution of
> > > tasks for each padata_do_multithreaded.
> > > 
> > > We can make the following modifications to address this issue.
> > > 
> > > ```
> > > diff --git a/kernel/padata.c b/kernel/padata.c
> > > index 1c2b3a337479e..925e48df6dd8d 100644
> > > --- a/kernel/padata.c
> > > +++ b/kernel/padata.c
> > > @@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > >           struct padata_work my_work, *pw;
> > >           struct padata_mt_job_state ps;
> > >           LIST_HEAD(works);
> > > -       int nworks, nid = 0;
> > > +       int nworks, nid;
> > > +       static volatile int global_nid = 0;
> > > 
> > >           if (job->size == 0)
> > >                   return;
> > > @@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
> > > padata_mt_job *job)
> > >           ps.chunk_size = max(ps.chunk_size, job->min_chunk);
> > >           ps.chunk_size = roundup(ps.chunk_size, job->align);
> > > 
> > > +       nid = global_nid;
> > >           list_for_each_entry(pw, &works, pw_list)
> > > -               if (job->numa_aware)
> > > -                       queue_work_node((++nid % num_node_state(N_MEMORY)),
> > > -                                       system_unbound_wq, &pw->pw_work);
> > > -               else
> > > +               if (job->numa_aware) {
> > > > +                       queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> > > +                       nid = next_node(nid, node_states[N_CPU]);
> > > +               } else
> > >                           queue_work(system_unbound_wq, &pw->pw_work);
> > > +       if (job->numa_aware)
> > > +               global_nid = nid;
> > 
> > Thinking more about it, there could still be multiple threads working
> > at the same time with stale global_nid.  We should probably do a compare
> > exchange of global_nid with new nid only if the global nid was unchanged.
> > Otherwise we should go to the next node with the changed global nid before
> > we queue the job.
> > 
> > Tim
> > 
> How about:
> ```
> nid = global_nid;
> list_for_each_entry(pw, &works, pw_list)
> 	if (job->numa_aware) {
> 		int old_node = nid;
> 		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
> 		nid = next_node(nid, node_states[N_CPU]);
> 		cmpxchg(&global_nid, old_node, nid);
> 	} else
> 		queue_work(system_unbound_wq, &pw->pw_work);
> 
> ```
> 

I am thinking of something like:

static volatile atomic_t last_used_nid;

list_for_each_entry(pw, &works, pw_list)
	if (job->numa_aware) {
		int old_node = atomic_read(&last_used_nid);

		do {
			nid = next_node_in(old_node, node_states[N_CPU]);
		} while (!atomic_try_cmpxchg(&last_used_nid, &old_node, nid));
		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
	} else {
		queue_work(system_unbound_wq, &pw->pw_work);
	}

Note that we need to use next_node_in() rather than next_node(), so that
we wrap around to the first node in the mask after reaching the last one.

Tim
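
For readers following along, here is a minimal userspace sketch of the two
points made above: the wrap-around behaviour of next_node_in() versus
next_node(), and the try-cmpxchg retry loop that hands out the next node.
This is not kernel code. The model_* helpers, MAX_NODES, and the use of a
single 64-bit word as the node mask are all assumptions made for
illustration, and C11 atomic_compare_exchange_weak() stands in for
atomic_try_cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: the whole node mask fits in one 64-bit word. */
#define MAX_NODES 64

/*
 * Hypothetical stand-in for next_node(): next set bit strictly after nid,
 * or MAX_NODES when there is none (no wrap-around).
 */
static int model_next_node(int nid, unsigned long long mask)
{
	for (int cand = nid + 1; cand < MAX_NODES; cand++)
		if (mask & (1ULL << cand))
			return cand;
	return MAX_NODES;
}

/*
 * Hypothetical stand-in for next_node_in(): next set bit after nid,
 * wrapping around to the start of the mask. Expects 0 <= nid < MAX_NODES.
 */
static int model_next_node_in(int nid, unsigned long long mask)
{
	assert(mask != 0);
	for (int i = 1; i <= MAX_NODES; i++) {
		int cand = (nid + i) % MAX_NODES;

		if (mask & (1ULL << cand))
			return cand;
	}
	return nid; /* not reached when mask is non-zero */
}

/* Shared cursor, zero-initialized; models last_used_nid. */
static atomic_int last_used_nid;

/*
 * Advance the shared cursor to the next node in the mask and return it.
 * On a failed exchange, old_node is refreshed with the current value and
 * the next node is recomputed from it, so a stale cursor is never used.
 */
static int model_pick_node(unsigned long long mask)
{
	int old_node = atomic_load(&last_used_nid);
	int nid;

	do {
		nid = model_next_node_in(old_node, mask);
	} while (!atomic_compare_exchange_weak(&last_used_nid, &old_node, nid));

	return nid;
}
```

With a mask that has nodes 0, 2 and 3 set, model_next_node(3, mask) runs off
the end of the mask, while model_next_node_in(3, mask) wraps back to node 0.
Repeated calls to model_pick_node() then cycle through the set nodes in
order, and concurrent callers each advance the cursor from whatever value
they actually observed.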
