This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
functions by adding early-termination logic. Once the dirty and young
bits of orig_pte are both set, the remaining bit-setting operations in
the loop are redundant, so the scan over the contpte block stops early.
This reduces unnecessary iterations and improves performance.
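
In condensed form, the patched loop in contpte_ptep_get() becomes the
following (paraphrased from the diff below; surrounding function context
omitted):

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte)) {
			orig_pte = pte_mkdirty(orig_pte);
			/* dirty is folded in; only young can still change the result */
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_young(pte)) {
					orig_pte = pte_mkyoung(orig_pte);
					break;
				}
			}
			break;
		}

		if (pte_young(pte)) {
			orig_pte = pte_mkyoung(orig_pte);
			/* young is folded in; only dirty can still change the result */
			i++;
			ptep++;
			for (; i < CONT_PTES; i++, ptep++) {
				pte = __ptep_get(ptep);
				if (pte_dirty(pte)) {
					orig_pte = pte_mkdirty(orig_pte);
					break;
				}
			}
			break;
		}
	}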
To verify the performance of the optimization, a test function was
written; its execution time and instruction counts were traced with
perf. The results below were collected on a Qualcomm mobile phone SoC:
Test Code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096 * CONT_PTES * PAGE_SIZE)
#define YOUNG_BIT 8

/* Scenario selection (see the descriptions below); 0/0 is Scenario 1. */
#define TEST_CASE2 0
#define TEST_CASE3 0

void rwdata(char *buf)
{
	/* Touch one byte per page so every PTE becomes young and dirty. */
	for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
		buf[i] = 'a';
		volatile char c = buf[i];
	}
}

void clear_young_dirty(char *buf)
{
	/* MADV_FREE drops the dirty state; MADV_COLD drops the young state. */
	if (madvise(buf, TEST_SIZE, MADV_FREE) == -1) {
		perror("madvise free failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
	if (madvise(buf, TEST_SIZE, MADV_COLD) == -1) {
		perror("madvise cold failed");
		free(buf);
		exit(EXIT_FAILURE);
	}
}

void set_one_young(char *buf)
{
	/* Read one page per contpte block so a single PTE becomes young. */
	for (size_t i = 0; i < TEST_SIZE; i += CONT_PTES * PAGE_SIZE) {
		volatile char c = buf[i + YOUNG_BIT * PAGE_SIZE];
	}
}

void test_contpte_perf(void)
{
	char *buf;
	int ret = posix_memalign((void **)&buf, CONT_PTES * PAGE_SIZE,
				 TEST_SIZE);
	/* posix_memalign returns the error instead of setting errno. */
	if (ret != 0 || ((unsigned long)buf % (CONT_PTES * PAGE_SIZE))) {
		fprintf(stderr, "posix_memalign failed: %d\n", ret);
		exit(EXIT_FAILURE);
	}

	rwdata(buf);
#if TEST_CASE2 || TEST_CASE3
	clear_young_dirty(buf);
#endif
#if TEST_CASE2
	set_one_young(buf);
#endif

	/* mlock()/munlock() walk the PTEs and exercise contpte_ptep_get(). */
	for (int j = 0; j < 500; j++) {
		mlock(buf, TEST_SIZE);
		munlock(buf, TEST_SIZE);
	}
	free(buf);
}
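
The listing has no driver or measurement harness; a minimal driver
(hypothetical, not part of the original post) could be:

int main(void)
{
	test_contpte_perf();
	return 0;
}

The instruction totals and per-function overhead below are consistent
with output from perf stat -e instructions and perf record/perf report
respectively, though the exact perf invocations are not given.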
Descriptions of the three test scenarios:

Scenario 1
All 16 PTEs are both dirty and young.
#define TEST_CASE2 0
#define TEST_CASE3 0
Scenario 2
Among the 16 PTEs, only the 8th one is young, and there are no dirty ones.
#define TEST_CASE2 1
#define TEST_CASE3 0
Scenario 3
Among the 16 PTEs, there are neither young nor dirty ones.
#define TEST_CASE2 0
#define TEST_CASE3 1
Test results

|Scenario 1         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    37912436160|     18731580031|
|test time (s)      |         4.2797|          2.2949|
|overhead of        |               |                |
|contpte_ptep_get() |         21.31%|           4.80%|

|Scenario 2         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36701270862|     36115790086|
|test time (s)      |         3.2335|          3.0874|
|overhead of        |               |                |
|contpte_ptep_get() |         32.26%|          33.57%|

|Scenario 3         |       Original|       Optimized|
|-------------------|---------------|----------------|
|instructions       |    36706279735|     36750881878|
|test time (s)      |         3.2008|          3.1249|
|overhead of        |               |                |
|contpte_ptep_get() |         31.94%|          34.59%|
For Scenario 1, the optimized code achieves an instruction-count benefit
of 50.59% and a time benefit of 46.38%.
For Scenario 2, the optimized code achieves an instruction-count benefit
of 1.6% and a time benefit of 4.5%.
For Scenario 3, since none of the PTEs have the young or dirty flag, the
optimized code takes the same branches as the original code, and indeed
the measured results are close to those of the original code.
The test function shows that the optimization of contpte_ptep_get is
effective. Since the logic of contpte_ptep_get_lockless is similar to
that of contpte_ptep_get, the same optimization scheme is adopted for
it as well.
Signed-off-by: Xavier Xia <xavier_qy@163.com>
---
arch/arm64/mm/contpte.c | 71 +++++++++++++++++++++++++++++++++++------
1 file changed, 62 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index bcac4f55f9c1..e9882ec782fc 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -169,17 +169,41 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
for (i = 0; i < CONT_PTES; i++, ptep++) {
pte = __ptep_get(ptep);
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
-
- if (pte_young(pte))
+ for (; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+ if (pte_young(pte)) {
+ orig_pte = pte_mkyoung(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
+
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ i++;
+ ptep++;
+ for (; i < CONT_PTES; i++, ptep++) {
+ pte = __ptep_get(ptep);
+ if (pte_dirty(pte)) {
+ orig_pte = pte_mkdirty(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
}
return orig_pte;
}
EXPORT_SYMBOL_GPL(contpte_ptep_get);
+#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
+ (!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
+ pgprot_val(prot) != pgprot_val(orig_prot))
+
pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
{
/*
@@ -221,16 +245,45 @@ pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
pte = __ptep_get(ptep);
prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
- if (!pte_valid_cont(pte) ||
- pte_pfn(pte) != pfn ||
- pgprot_val(prot) != pgprot_val(orig_prot))
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
goto retry;
- if (pte_dirty(pte))
+ if (pte_dirty(pte)) {
orig_pte = pte_mkdirty(orig_pte);
-
- if (pte_young(pte))
+ for (; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+ goto retry;
+
+ if (pte_young(pte)) {
+ orig_pte = pte_mkyoung(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
+
+ if (pte_young(pte)) {
orig_pte = pte_mkyoung(orig_pte);
+ i++;
+ ptep++;
+ pfn++;
+ for (; i < CONT_PTES; i++, ptep++, pfn++) {
+ pte = __ptep_get(ptep);
+ prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
+
+ if (CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot))
+ goto retry;
+
+ if (pte_dirty(pte)) {
+ orig_pte = pte_mkdirty(orig_pte);
+ break;
+ }
+ }
+ break;
+ }
}
return orig_pte;
--
2.34.1
On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. [...]
>
> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
> +	(!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
> +	 pgprot_val(prot) != pgprot_val(orig_prot))
>
Maybe make it a static inline function to improve readability. Also, the
name doesn't seem right: CHECK_CONTPTE_CONSISTENCY is actually checking
for inconsistency, not consistency.

It might be:

static inline bool contpte_is_consistent(...)
{
	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
	       pgprot_val(prot) == pgprot_val(orig_prot);
}

or another better name.

Thanks
barry
At 2025-05-09 10:09:21, "Barry Song" <21cnbao@gmail.com> wrote:
>On Thu, May 8, 2025 at 7:04 PM Xavier Xia <xavier_qy@163.com> wrote:
>> [...]
>>
>> +#define CHECK_CONTPTE_CONSISTENCY(pte, pfn, prot, orig_prot) \
>> +	(!pte_valid_cont(pte) || pte_pfn(pte) != pfn || \
>> +	 pgprot_val(prot) != pgprot_val(orig_prot))
>
>Maybe make it a static inline function to improve readability. Also, the
>name doesn't seem right: CHECK_CONTPTE_CONSISTENCY is actually checking
>for inconsistency, not consistency.
>
>It might be:
>
>static inline bool contpte_is_consistent(...)
>{
>	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
>	       pgprot_val(prot) == pgprot_val(orig_prot);
>}
>
>or another better name.
>
You're right. What's being checked here is the inconsistency. I will make the modification
in the next version. Thank you for your suggestion.
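
A possible shape for the reworked helper (a sketch only; the exact
parameter list is assumed for illustration, and folding the prot
computation into the helper is one option):

static inline bool contpte_is_consistent(pte_t pte, unsigned long pfn,
					 pgprot_t orig_prot)
{
	pgprot_t prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

	return pte_valid_cont(pte) && pte_pfn(pte) == pfn &&
	       pgprot_val(prot) == pgprot_val(orig_prot);
}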
--
Thanks,
Xavier
On 08.05.25 09:03, Xavier Xia wrote:
> This commit optimizes the contpte_ptep_get and contpte_ptep_get_lockless
> function by adding early termination logic. [...]

For the future, please don't post vN+1 as reply to vN.

--
Cheers,

David / dhildenb
At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>On 08.05.25 09:03, Xavier Xia wrote:
>> [...]
>
>For the future, please don't post vN+1 as reply to vN.

I will pay attention to it when I submit it later. Thank you for the
reminder.

--
Thanks,
Xavier
On 09.05.25 11:17, Xavier wrote:
> At 2025-05-08 16:30:14, "David Hildenbrand" <david@redhat.com> wrote:
>> For the future, please don't post vN+1 as reply to vN.
>
> I will pay attention to it when I submit it later. Thank you for the
> reminder.

The rationale is that many people will just treat it as some discussion
noise as part of vN and not really have a closer look, waiting for vN+1.

--
Cheers,

David / dhildenb