From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Add a command line option that enables control of how many
threads per NUMA node should be used to allocate huge pages.
Allocating huge pages can take a very long time on servers
with terabytes of memory, even when they are allocated at
boot time, when the allocation happens in parallel.
The kernel currently uses a hard coded value of 2 threads per
NUMA node for these allocations.
This patch allows overriding this value.
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
Documentation/admin-guide/kernel-parameters.txt | 7 ++++
Documentation/admin-guide/mm/hugetlbpage.rst | 9 ++++-
mm/hugetlb.c | 50 +++++++++++++++++--------
3 files changed, 49 insertions(+), 17 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8582b8750d7e014c4d76166fa2fc1..812064542fdb0a5c0ff7587aaaba8da81dc234a9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1882,6 +1882,13 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugepage_alloc_threads=
+ [HW] The number of threads per NUMA node that should
+ be used to allocate hugepages during boot.
+ This option can be used to improve system bootup time
+ when allocating a large number of huge pages.
+ The default value is 2 threads per NUMA node.
+
hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index f34a0d798d5b533f30add99a34f66ba4e1c496a3..c88461be0f66887d532ac4ef20e3a61dfd396be7 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -145,7 +145,14 @@ hugepages
It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
If the node number is invalid, the parameter will be ignored.
-
+hugepage_alloc_threads
+ Specify the number of threads per NUMA node that should be used to
+ allocate hugepages during boot. This parameter can be used to improve
+ system bootup time when allocating a large number of huge pages.
+ The default value is 2 threads per NUMA node. Example to use 8 threads
+ per NUMA node::
+
+ hugepage_alloc_threads=8
default_hugepagesz
Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 163190e89ea16450026496c020b544877db147d1..b7d24c41e0f9d22f5b86c253e29a2eca28460026 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -68,6 +68,7 @@ static unsigned long __initdata default_hstate_max_huge_pages;
static bool __initdata parsed_valid_hugepagesz = true;
static bool __initdata parsed_default_hugepagesz;
static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
+static unsigned long allocation_threads_per_node __initdata = 2;
/*
* Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -3432,26 +3433,23 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
job.size = h->max_huge_pages;
/*
- * job.max_threads is twice the num_node_state(N_MEMORY),
+ * By default, job.max_threads is twice num_node_state(N_MEMORY).
*
- * Tests below indicate that a multiplier of 2 significantly improves
- * performance, and although larger values also provide improvements,
- * the gains are marginal.
+ * On large servers with terabytes of memory, huge page allocation
+ * can consume a considerable amount of time.
*
- * Therefore, choosing 2 as the multiplier strikes a good balance between
- * enhancing parallel processing capabilities and maintaining efficient
- * resource management.
+ * Tests below show how long it takes to allocate 1 TiB of memory with
+ * 2MiB huge pages. Using more threads can significantly improve allocation time.
*
- * +------------+-------+-------+-------+-------+-------+
- * | multiplier | 1 | 2 | 3 | 4 | 5 |
- * +------------+-------+-------+-------+-------+-------+
- * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
- * | 2T 4node | 979ms | 679ms | 543ms | 489ms | 481ms |
- * | 50G 2node | 71ms | 44ms | 37ms | 30ms | 31ms |
- * +------------+-------+-------+-------+-------+-------+
+ * +--------------------+-------+-------+-------+-------+-------+
+ * | threads per node | 2 | 4 | 8 | 16 | 32 |
+ * +--------------------+-------+-------+-------+-------+-------+
+ * | skylake 4node | 44s | 22s | 16s | 19s | 20s |
+ * | cascade lake 4node | 39s | 20s | 11s | 10s | 9s |
+ * +--------------------+-------+-------+-------+-------+-------+
*/
- job.max_threads = num_node_state(N_MEMORY) * 2;
- job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+ job.max_threads = num_node_state(N_MEMORY) * allocation_threads_per_node;
+ job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / allocation_threads_per_node;
padata_do_multithreaded(&job);
return h->nr_huge_pages;
@@ -4764,6 +4762,26 @@ static int __init default_hugepagesz_setup(char *s)
}
__setup("default_hugepagesz=", default_hugepagesz_setup);
+/*
+ * hugepage_alloc_threads command line parsing. When set, use this number
+ * of threads per NUMA node for the boot allocation of hugepages.
+ */
+static int __init hugepage_alloc_threads_setup(char *s)
+{
+ unsigned long threads_per_node;
+
+ if (kstrtoul(s, 0, &threads_per_node) != 0)
+ return 1;
+
+ if (threads_per_node == 0)
+ return 1;
+
+ allocation_threads_per_node = threads_per_node;
+
+ return 1;
+}
+__setup("hugepage_alloc_threads=", hugepage_alloc_threads_setup);
+
static unsigned int allowed_mems_nr(struct hstate *h)
{
int node;
--
2.48.1
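
For illustration, the arithmetic the patch performs in
hugetlb_pages_alloc_boot() can be sketched as a standalone userspace C
program. All values below (node count, thread count, page count) are
assumed examples for a hypothetical machine, not kernel defaults:

    #include <stdio.h>

    int main(void)
    {
            /* Assumed example values: 4 nodes with memory, boot
             * parameter hugepage_alloc_threads=8, and 1 TiB worth
             * of 2MiB huge pages (2^40 / 2^21 = 524288 pages). */
            unsigned long nodes = 4;            /* num_node_state(N_MEMORY) */
            unsigned long threads_per_node = 8; /* allocation_threads_per_node */
            unsigned long max_huge_pages = 524288;

            unsigned long max_threads = nodes * threads_per_node;
            unsigned long min_chunk = max_huge_pages / nodes / threads_per_node;

            printf("job.max_threads = %lu\n", max_threads);      /* 32 */
            printf("job.min_chunk   = %lu pages\n", min_chunk);  /* 16384 */
            return 0;
    }

With these example numbers, padata may spread the 524288-page allocation
across up to 32 worker threads, each handed chunks of at least 16384 pages.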
On Fri, Feb 21, 2025 at 5:49 AM Thomas Prescher via B4 Relay
<devnull+thomas.prescher.cyberus-technology.de@kernel.org> wrote:
>
> From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
>
> Add a command line option that enables control of how many
> threads per NUMA node should be used to allocate huge pages.

[...]

Maybe mention that this does not apply to 'gigantic' hugepages (i.e.
hugetlb pages of an order > MAX_PAGE_ORDER). Those are allocated
earlier in boot by memblock, in a single-threaded environment.
Not your fault that this distinction between these types of hugetlb
pages isn't clear in the Docs, of course. Only hugetlb_cma mentions
that it is for gigantic pages. But it's probably best to mention that
the threads parameter is for non-gigantic hugetlb pages only.
- Frank
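
For reference, the distinction Frank describes is encoded in the kernel's
hstate_is_gigantic() helper in include/linux/hugetlb.h (shown here as it
appears in recent kernels):

    static inline bool hstate_is_gigantic(struct hstate *h)
    {
            return huge_page_order(h) > MAX_PAGE_ORDER;
    }

Pages for which this returns true cannot come from the buddy allocator
and, as Frank notes, are reserved via memblock earlier in boot, outside
the multithreaded padata path this patch tunes.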
On Fri, Feb 21, 2025 at 02:49:03PM +0100, Thomas Prescher via B4 Relay wrote:
> Add a command line option that enables control of how many
> threads per NUMA node should be used to allocate huge pages.

I don't think we should add a command line option (ie blame the sysadmin
for getting it wrong). Instead, we should figure out the right number.
Is it half the number of threads per socket? A quarter? 90%? It's
bootup, the threads aren't really doing anything else. But we should
figure it out, not the sysadmin.
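
A hypothetical sketch of the kind of heuristic being asked for here, deriving
the per-node thread count from the CPU topology instead of a boot parameter.
The policy chosen below (half the CPUs of an average memory node, floored at
the current default of 2) is purely an illustrative assumption; the thread
did not settle on a value. num_online_cpus(), num_node_state(), and max()
are existing kernel APIs:

    /*
     * Hypothetical only: derive the boot-time allocation thread count
     * per node from the CPU topology. The fraction used here is an
     * assumed placeholder, not a proposal from this thread.
     */
    static unsigned long __init hugetlb_alloc_threads_per_node(void)
    {
            unsigned long cpus_per_node =
                    num_online_cpus() / num_node_state(N_MEMORY);

            /* Never go below the current hard coded default of 2. */
            return max(2UL, cpus_per_node / 2);
    }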