sched/numa: Avoid migrating task to CPU-less node

[tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by tip-bot2 for Huang Ying 4 years, 4 months ago

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
Author:        Huang Ying <ying.huang@intel.com>
AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00

sched/numa: Avoid migrating task to CPU-less node

In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
nodes.  But if the number of hint page faults on a PMEM node is the
max for a task, the current NUMA balancing policy may try to place
the task on the PMEM node instead of a DRAM node.  This is unreasonable,
because there's no CPU in PMEM NUMA nodes.  To fix this, this patch
ignores CPU-less nodes when searching for the migration target node
for a task.

To test the patch, we ran a workload that accesses more memory in the
PMEM node than in the DRAM node.  Without the patch, the PMEM node is
chosen as the preferred node in task_numa_placement().  With the patch,
the DRAM node is chosen instead.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3230b..11a72e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1989,7 +1989,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	ng = deref_curr_numa_group(p);
 	if (env.best_cpu == -1 || (ng && ng->active_nodes > 1)) {
-		for_each_online_node(nid) {
+		for_each_node_state(nid, N_CPU) {
 			if (nid == env.src_nid || nid == p->numa_preferred_nid)
 				continue;
 
@@ -2087,13 +2087,13 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
 	unsigned long faults, max_faults = 0;
 	int nid, active_nodes = 0;
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults > max_faults)
 			max_faults = faults;
 	}
 
-	for_each_online_node(nid) {
+	for_each_node_state(nid, N_CPU) {
 		faults = group_faults_cpu(numa_group, nid);
 		if (faults * ACTIVE_NODE_FRACTION > max_faults)
 			active_nodes++;
@@ -2247,7 +2247,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 		dist = sched_max_numa_distance;
 
-		for_each_online_node(node) {
+		for_each_node_state(node, N_CPU) {
 			score = group_weight(p, node, dist);
 			if (score > max_score) {
 				max_score = score;
@@ -2266,7 +2266,7 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 	 * inside the highest scoring group of nodes. The nodemask tricks
 	 * keep the complexity of the search down.
 	 */
-	nodes = node_online_map;
+	nodes = node_states[N_CPU];
 	for (dist = sched_max_numa_distance; dist > LOCAL_DISTANCE; dist--) {
 		unsigned long max_faults = 0;
 		nodemask_t max_group = NODE_MASK_NONE;
@@ -2405,6 +2405,21 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
+	/* Cannot migrate task to CPU-less node */
+	if (!node_state(max_nid, N_CPU)) {
+		int near_nid = max_nid;
+		int distance, near_distance = INT_MAX;
+
+		for_each_node_state(nid, N_CPU) {
+			distance = node_distance(max_nid, nid);
+			if (distance < near_distance) {
+				near_nid = nid;
+				near_distance = distance;
+			}
+		}
+		max_nid = near_nid;
+	}
+
 	if (ng) {
 		numa_group_count_active_nodes(ng);
 		spin_unlock_irq(group_lock);
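
The fallback added to task_numa_placement() can be illustrated outside the
kernel.  The following is a minimal userspace sketch, not the kernel code:
the four-node topology, the distance table, and the names node_has_cpu[],
node_dist[][] and nearest_cpu_node() are all hypothetical stand-ins for
node_state(nid, N_CPU) and node_distance().  It only shows the selection
rule from the last hunk: if the node with the most faults has no CPU, pick
the CPU-bearing node with the smallest distance to it.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_NODES 4

/* Hypothetical topology: nodes 0-1 are DRAM with CPUs, nodes 2-3 are CPU-less PMEM. */
static const bool node_has_cpu[NR_NODES] = { true, true, false, false };

/* Hypothetical distance table (10 means local, larger means farther). */
static const int node_dist[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/*
 * Return max_nid itself if it has CPUs, otherwise the nearest node that
 * does.  On a distance tie the lowest-numbered node wins, because the
 * comparison below is strict.
 */
static int nearest_cpu_node(int max_nid)
{
	int near_nid = max_nid;
	int near_distance = INT_MAX;

	/* Sketch-only sanity check: the caller must pass a valid node id. */
	assert(max_nid >= 0 && max_nid < NR_NODES);

	if (node_has_cpu[max_nid])
		return max_nid;

	for (int nid = 0; nid < NR_NODES; nid++) {
		int distance;

		if (!node_has_cpu[nid])
			continue;	/* only consider nodes with CPUs */

		distance = node_dist[max_nid][nid];
		if (distance < near_distance) {
			near_nid = nid;
			near_distance = distance;
		}
	}
	return near_nid;
}
```

With this assumed table, PMEM node 2 maps to DRAM node 0 and PMEM node 3
maps to DRAM node 1, while nodes that already have CPUs are returned
unchanged.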

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Qian Cai 4 years, 3 months ago

On Thu, Feb 17, 2022 at 06:56:52PM -0000, tip-bot2 for Huang Ying wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Gitweb:        https://git.kernel.org/tip/5c7b1aaf139dab5072311853bacc40fc3457d1f9
> Author:        Huang Ying <ying.huang@intel.com>
> AuthorDate:    Mon, 14 Feb 2022 20:15:53 +08:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Wed, 16 Feb 2022 15:57:53 +01:00
> 
> sched/numa: Avoid migrating task to CPU-less node
> 
> In a typical memory tiering system, there's no CPU in slow (PMEM) NUMA
> nodes.  But if the number of the hint page faults on a PMEM node is
> the max for a task, The current NUMA balancing policy may try to place
> the task on the PMEM node instead of DRAM node.  This is unreasonable,
> because there's no CPU in PMEM NUMA nodes.  To fix this, CPU-less
> nodes are ignored when searching the migration target node for a task
> in this patch.
> 
> To test the patch, we run a workload that accesses more memory in PMEM
> node than memory in DRAM node.  Without the patch, the PMEM node will
> be chosen as preferred node in task_numa_placement().  While the DRAM
> node will be chosen instead with the patch.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/20220214121553.582248-2-ying.huang@intel.com

Reverting this commit on the top of today's linux-next fixed a boot crash
on arm64 NUMA systems.

 Unable to handle kernel paging request at virtual address ffff7a6601694aec
 KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
 Mem abort info:
   ESR = 0x96000005
   EC = 0x25: DABT (current EL), IL = 32 bits
 mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005
   CM = 0, WnR = 0
 swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
 [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
 Internal error: Oops: 96000005 [#1] PREEMPT SMP
 Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
 CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
 pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : task_numa_placement
 lr : task_numa_placement
 sp : ffff800031047760
 x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
 x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000

 x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
 x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
 x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0

 x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
 x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
 x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
 x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
 Call trace:
  task_numa_placement
  arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
  (inlined by) node_state at include/linux/nodemask.h:416
  (inlined by) task_numa_placement at kernel/sched/fair.c:2439
  task_numa_fault
  do_numa_page
  handle_pte_fault
  __handle_mm_fault
  handle_mm_fault
  do_page_fault
  do_translation_fault
  do_mem_abort
  el0_da
  el0t_64_sync_handler
  el0t_64_sync
 Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x00,00042c0c,19801c82
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Huang, Ying 4 years, 3 months ago

Qian Cai <quic_qiancai@quicinc.com> writes:

> Reverting this commit on the top of today's linux-next fixed a boot crash
> on arm64 NUMA systems.
>
>  Unable to handle kernel paging request at virtual address ffff7a6601694aec
>  KASAN: maybe wild-memory-access in range [0xffffd3300b4a5760-0xffffd3300b4a5767]
>  Mem abort info:
>    ESR = 0x96000005
>    EC = 0x25: DABT (current EL), IL = 32 bits
>  mlx5_core 0007:02:00.0: enabling device (0100 -> 0102)
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>    FSC = 0x05: level 1 translation fault
>  Data abort info:
>    ISV = 0, ISS = 0x00000005
>    CM = 0, WnR = 0
>  swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000400b3d6c6000
>  [ffff7a6601694aec] pgd=0000403fc007f003, p4d=0000403fc007f003, pud=0000000000000000
>  Internal error: Oops: 96000005 [#1] PREEMPT SMP
>  Modules linked in: nouveau(+) drm_ttm_helper ttm nvme(+) drm_dp_helper drm_kms_helper mlx5_core(+) mpt3sas(+) xhci_pci(+) nvme_core raid_class xhci_pci_renesas drm
>  CPU: 85 PID: 1308 Comm: udevadm Not tainted 5.17.0-rc6-next-20220301 #1
>  pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : task_numa_placement
>  lr : task_numa_placement
>  sp : ffff800031047760
>  x29: ffff800031047760 x28: ffff3fffab916c00 x27: 0000000000000020
>  x26: 0000000000000001 x25: 0000000000000000 x24: 0000000000000000
>
>  x23: ffff07ffe5289a80 x22: ffffd3300b4a5760 x21: 000000000000003f
>  x20: ffffd32feb4a5768 x19: 0000000000000000 x18: ffff07ffe528ad88
>  x17: ffffd32fe5693a1c x16: 0000000000000000 x15: ffff8000310478e0
>
>  x14: ffff07ffe528ad90 x13: 0000000000000002 x12: dfff80000000000d
>  x11: 0000000000000001 x10: 000000000000b6be x9 : 0000000000000000
>  x8 : 00000000ffffffff x7 : ffffd32feb4a5780 x6 : 0000000000000000
>  x5 : 0000000000000000 x4 : 0000000000000000 x3 : 1ffffa6601694aec
>  x2 : 0000000000000000 x1 : dfff800000000000 x0 : 000000001ffffff8
>  Call trace:
>   task_numa_placement
>   arch_test_bit at include/asm-generic/bitops/non-atomic.h:118
>   (inlined by) node_state at include/linux/nodemask.h:416
>   (inlined by) task_numa_placement at kernel/sched/fair.c:2439
>   task_numa_fault
>   do_numa_page
>   handle_pte_fault
>   __handle_mm_fault
>   handle_mm_fault
>   do_page_fault
>   do_translation_fault
>   do_mem_abort
>   el0_da
>   el0t_64_sync_handler
>   el0t_64_sync
>  Code: 8b000296 d2d00001 f2fbffe1 d343fec3 (38e16861)
>  ---[ end trace 0000000000000000 ]---
>  Kernel panic - not syncing: Oops: Fatal exception
>  SMP: stopping secondary CPUs
>  Kernel Offset: 0x532fdcf70000 from 0xffff800008000000
>  PHYS_OFFSET: 0x80000000
>  CPU features: 0x00,00042c0c,19801c82
>  Memory Limit: none
>  ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Thanks for reporting!  Can you try whether the following debug patch can fix the issue?

Best Regards,
Huang, Ying

----------------------------8<-------------------------------------------
From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Wed, 2 Mar 2022 08:54:01 +0800
Subject: [PATCH] dbg KASAN error

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3f0ea216ccb..1fe7a4510cca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
 	}
 
 	/* Cannot migrate task to CPU-less node */
-	if (!node_state(max_nid, N_CPU)) {
+	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
 		int near_nid = max_nid;
 		int distance, near_distance = INT_MAX;
 
-- 
2.30.2
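
Why the extra check helps can be sketched in userspace.  This is one
plausible reading of the trace above and is not stated explicitly in the
thread: if max_nid still holds NUMA_NO_NODE (-1) when the new block runs,
node_state() ends up testing bit -1 of the node mask, which is an
out-of-bounds read.  In the sketch below, cpu_node_mask, node_has_cpu() and
needs_cpu_fallback() are hypothetical names; only the shape of the guarded
condition is taken from the debug patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)	/* "no node selected" marker */
#define MAX_NODES	64

/* Hypothetical stand-in for node_states[N_CPU]: nodes 0 and 1 have CPUs. */
static const unsigned long long cpu_node_mask = 0x3ULL;

/*
 * Plain bit test.  Like the kernel helper it stands in for, it relies on
 * the caller passing a valid node id; the assert exists only in this
 * sketch to make a bad id visible instead of reading out of bounds.
 */
static bool node_has_cpu(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return (cpu_node_mask >> nid) & 1ULL;
}

/*
 * Guarded condition in the shape used by the debug patch: the
 * NUMA_NO_NODE comparison comes first, so the short-circuit keeps the
 * bit test from ever seeing -1.
 */
static bool needs_cpu_fallback(int max_nid)
{
	return max_nid != NUMA_NO_NODE && !node_has_cpu(max_nid);
}
```

Under these assumptions, needs_cpu_fallback(NUMA_NO_NODE) is false without
touching the mask, needs_cpu_fallback(0) is false because node 0 has CPUs,
and needs_cpu_fallback(2) is true.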

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Qian Cai 4 years, 3 months ago

On Wed, Mar 02, 2022 at 08:59:55AM +0800, Huang, Ying wrote:
> Thanks for reporting!  Can you try whether the following debug patch can fix the issue?

Yes, it prevents the crash.

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Huang, Ying 4 years, 3 months ago

Hi, Qian,

"Huang, Ying" <ying.huang@intel.com> writes:

> Thanks for reporting!  Can you try whether the following debug patch can fix the issue?
>
> Best Regards,
> Huang, Ying
>
> ----------------------------8<-------------------------------------------
> From 176d185426730111e763eb386d0210561f021dbc Mon Sep 17 00:00:00 2001
> From: Huang Ying <ying.huang@intel.com>
> Date: Wed, 2 Mar 2022 08:54:01 +0800
> Subject: [PATCH] dbg KASAN error
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3f0ea216ccb..1fe7a4510cca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
>  	}
>  
>  	/* Cannot migrate task to CPU-less node */
> -	if (!node_state(max_nid, N_CPU)) {
> +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
>  		int near_nid = max_nid;
>  		int distance, near_distance = INT_MAX;

Do you have time to give this patch a try?

Best Regards,
Huang, Ying

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Qian Cai 4 years, 3 months ago

On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a3f0ea216ccb..1fe7a4510cca 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
> >  	}
> >  
> >  	/* Cannot migrate task to CPU-less node */
> > -	if (!node_state(max_nid, N_CPU)) {
> > +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
> >  		int near_nid = max_nid;
> >  		int distance, near_distance = INT_MAX;
> 
> Do you have time to give this patch a try?

Ah, I thought I had already replied to this a while ago. Anyway, it works fine.

Re: [tip: sched/core] sched/numa: Avoid migrating task to CPU-less node

Posted by Huang, Ying 4 years, 3 months ago

Qian Cai <quic_qiancai@quicinc.com> writes:

> On Mon, Mar 07, 2022 at 01:51:55PM +0800, Huang, Ying wrote:
>> > ---
>> >  kernel/sched/fair.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index a3f0ea216ccb..1fe7a4510cca 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2405,7 +2405,7 @@ static void task_numa_placement(struct task_struct *p)
>> >  	}
>> >  
>> >  	/* Cannot migrate task to CPU-less node */
>> > -	if (!node_state(max_nid, N_CPU)) {
>> > +	if (max_nid != NUMA_NO_NODE && !node_state(max_nid, N_CPU)) {
>> >  		int near_nid = max_nid;
>> >  		int distance, near_distance = INT_MAX;
>> 
>> Do you have time to give this patch a try?
>
> Ah, I thought I had already replied to this a while ago. Anyway, it works fine.

Thanks!

Best Regards,
Huang, Ying