CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init

David Wang posted 1 patch 3 months, 2 weeks ago
CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by David Wang 3 months, 2 weeks ago
On Wed, Jun 18, 2025 at 02:25:37PM +0800, kernel test robot wrote:
> 
> Hello,
> 
> for this change, we reported
> "[linux-next:master] [lib/test_vmalloc.c]  7fc85b92db: Mem-Info"
> in
> https://lore.kernel.org/all/202505071555.e757f1e0-lkp@intel.com/
> 
> at that time, we made some tests with x86_64 config which runs well.
> 
> now we noticed the commit is in mainline now.

> the config still has expected diff with parent:
> 
> --- /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/7a73348e5d4715b5565a53f21c01ea7b54e46cbd/.config   2025-06-17 14:40:29.481052101 +0800
> +++ /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/2d76e79315e403aab595d4c8830b7a46c19f0f3b/.config   2025-06-17 14:41:18.448543738 +0800
> @@ -7551,7 +7551,7 @@ CONFIG_TEST_IDA=m
>  CONFIG_TEST_MISC_MINOR=m
>  # CONFIG_TEST_LKM is not set
>  CONFIG_TEST_BITOPS=m
> -CONFIG_TEST_VMALLOC=m
> +CONFIG_TEST_VMALLOC=y
>  # CONFIG_TEST_BPF is not set
>  CONFIG_FIND_BIT_BENCHMARK=m
>  # CONFIG_TEST_FIRMWARE is not set
> 
> 
> then we noticed similar random issue with x86_64 randconfig this time.
> 
> 7a73348e5d4715b5 2d76e79315e403aab595d4c8830
> ---------------- ---------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>            :199         34%          67:200   dmesg.KASAN:null-ptr-deref_in_range[#-#]
>            :199         34%          67:200   dmesg.Kernel_panic-not_syncing:Fatal_exception
>            :199         34%          67:200   dmesg.Mem-Info
>            :199         34%          67:200   dmesg.Oops:general_protection_fault,probably_for_non-canonical_address#:#[##]SMP_KASAN
>            :199         34%          67:200   dmesg.RIP:down_read_trylock
> 
> we don't have enough knowledge to understand the relationship between code
> change and the random issues. just report what we observed in our tests FYI.
> 

I think this is caused by a race between vmalloc_test_init and alloc_tag_init.

vmalloc_test actually depends on alloc_tag via alloc_tag_top_users(), because when a
memory allocation fails, show_mem() invokes alloc_tag_top_users().

With the following configuration:

CONFIG_TEST_VMALLOC=y
CONFIG_MEM_ALLOC_PROFILING=y
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
CONFIG_MEM_ALLOC_PROFILING_DEBUG=y

If vmalloc_test_init starts before alloc_tag_init, show_mem() would cause
a NULL dereference because alloc_tag_cttype has not been initialized yet.

I added some debug output to confirm this theory:
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index d48b80f3f007..9b8e7501010f 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -133,6 +133,8 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl
        struct codetag *ct;
        struct codetag_bytes n;
        unsigned int i, nr = 0;
+       pr_info("memory profiling alloc top %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
+       return 0;
 
        if (can_sleep)
                codetag_lock_module_list(alloc_tag_cttype, true);
@@ -831,6 +833,7 @@ static int __init alloc_tag_init(void)
                shutdown_mem_profiling(true);
                return PTR_ERR(alloc_tag_cttype);
        }
+       pr_info("memory profiling ready %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
 
        return 0;
 }

When booting the kernel, the log shows:

$ sudo dmesg -T | grep profiling
[Fri Jun 20 17:29:35 2025] memory profiling alloc top 1: 0  <--- alloc_tag_cttype == NULL
[Fri Jun 20 17:30:24 2025] memory profiling ready 1: ffff9b1641aa06c0


vmalloc_test_init should happen after alloc_tag_init if CONFIG_TEST_VMALLOC=y,
or show_mem() should check whether alloc_tag has finished initializing before calling
alloc_tag_top_users.
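
For illustration only, the second option (check for initialization before use) can be modeled in user space. This is a sketch under stated assumptions, not kernel code: cttype_model, init_model() and top_users_model() are hypothetical names standing in for alloc_tag_cttype, alloc_tag_init() and alloc_tag_top_users().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's codetag type descriptor. */
struct cttype_model {
	size_t nr_tags;
};

struct cttype_model cttype_storage = { .nr_tags = 3 };

/* NULL until init_model() runs, like alloc_tag_cttype before alloc_tag_init(). */
struct cttype_model *cttype_model_ptr;

/* Models alloc_tag_init(): publishes the descriptor exactly once. */
void init_model(void)
{
	assert(cttype_model_ptr == NULL);
	cttype_model_ptr = &cttype_storage;
}

/*
 * Models a guarded alloc_tag_top_users(): report zero entries instead of
 * dereferencing the descriptor when it has not been published yet.
 */
size_t top_users_model(size_t count)
{
	if (cttype_model_ptr == NULL)
		return 0;
	return count < cttype_model_ptr->nr_tags ? count
						 : cttype_model_ptr->nr_tags;
}
```

Calling top_users_model() before init_model() returns 0 rather than faulting; after init_model() it returns at most nr_tags entries. The real fix may use a different readiness signal than a NULL check.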



David
Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by Suren Baghdasaryan 3 months, 2 weeks ago
On Fri, Jun 20, 2025 at 3:03 AM David Wang <00107082@163.com> wrote:
> vmalloc_test_init should happen after alloc_tag_init if CONFIG_TEST_VMALLOC=y,
> or show_mem() should check whether alloc_tag has finished initializing before calling
> alloc_tag_top_users.

Thanks for reporting!
So, IIUC https://lore.kernel.org/all/20250620195305.1115151-1-harry.yoo@oracle.com/
will address this issue as well. Is that correct?

Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by David Wang 3 months, 2 weeks ago
At 2025-06-23 06:50:44, "Suren Baghdasaryan" <surenb@google.com> wrote:
>Thanks for reporting!
>So, IIUC https://lore.kernel.org/all/20250620195305.1115151-1-harry.yoo@oracle.com/
>will address this issue as well. Is that correct?

Yes, the panic can be fixed by that patch.

I still feel it would be better to delay vmalloc_test_init so that it happens after alloc_tag_init.
Or maybe we can promote alloc_tag_init to some earlier init stage? I remember reporting that some
allocations were not registered by memory profiling during boot:
https://lore.kernel.org/all/213ff7d2.7c6c.1945eb0c2ff.Coremail.00107082@163.com/

I will run some tests and report back later.


David


Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by Uladzislau Rezki 3 months, 2 weeks ago
On Mon, Jun 23, 2025 at 10:45:31AM +0800, David Wang wrote:
> Yes, the panic can be fixed by that patch.
> 
> I still feel it would be better to delay vmalloc_test_init so that it happens after alloc_tag_init.
>
We can, but then we would not notice the bug that is in question :)

At least, I think, we should exclude the tests that trigger warnings
when the test suite is run with the default configuration, i.e. run only the tests
that are not supposed to fail.

--
Uladzislau Rezki
Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by David Wang 3 months, 2 weeks ago
At 2025-06-23 19:36:03, "Uladzislau Rezki" <urezki@gmail.com> wrote:
>On Mon, Jun 23, 2025 at 10:45:31AM +0800, David Wang wrote:
>> I still feel it would be better to delay vmalloc_test_init so that it happens after alloc_tag_init.
>>
>We can, but then we would not notice the bug that is in question :)

Yes, we got strangely lucky here :)
I was wondering: if some vmalloc tests fail, is alloc_tag_top_users helpful for debugging?
Since this bug has already been caught, if alloc_tag_top_users is helpful for analyzing vmalloc test failures,
maybe it is still reasonable to delay vmalloc_test_init? ☺︎
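
A rough user-space model of what "delaying" would buy, for illustration only. It assumes alloc_tag_init stays at the device level (where a built-in module_init() lands) and that vmalloc_test_init is moved to the late level. register_call() and run_order() are hypothetical helpers, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLS 8

/* Only the two levels relevant here; the numbers follow the kernel's level ids. */
enum level_model { LEVEL_DEVICE = 6, LEVEL_LATE = 7 };

struct initcall_model {
	const char *name;
	enum level_model level;
};

struct initcall_model calls[MAX_CALLS];
size_t nr_calls;

/* Record an init function in registration (think: link) order. */
void register_call(const char *name, enum level_model level)
{
	assert(nr_calls < MAX_CALLS);
	calls[nr_calls].name = name;
	calls[nr_calls].level = level;
	nr_calls++;
}

/*
 * Fill order[] with the names in run order: lower levels first, and
 * registration order within one level. Returns the number of entries.
 */
size_t run_order(const char *order[], size_t cap)
{
	size_t n = 0;

	for (int lvl = LEVEL_DEVICE; lvl <= LEVEL_LATE; lvl++)
		for (size_t i = 0; i < nr_calls; i++)
			if ((int)calls[i].level == lvl && n < cap)
				order[n++] = calls[i].name;
	return n;
}
```

In this model, a test registered at LEVEL_LATE always runs after an init function at LEVEL_DEVICE, regardless of registration order. With both at the same level, the order depends only on registration order, which is the fragile situation discussed above.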

>
>At least, I think, we should exclude the tests that trigger warnings
>when the test suite is run with the default configuration, i.e. run only the tests
>that are not supposed to fail.



Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by David Wang 3 months, 2 weeks ago
At 2025-06-23 10:45:31, "David Wang" <00107082@163.com> wrote:
>Yes, the panic can be fixed by that patch.
>
>I still feel it would be better to delay vmalloc_test_init so that it happens after alloc_tag_init.
>Or maybe we can promote alloc_tag_init to some earlier init stage? I remember reporting that some
>allocations were not registered by memory profiling during boot:
>https://lore.kernel.org/all/213ff7d2.7c6c.1945eb0c2ff.Coremail.00107082@163.com/
>
>I will run some tests and report back later.

The memory allocations in sched_init_domains happen quite early, maybe at core_initcall time, while
alloc_tag_init needs rootfs, so it has to run after rootfs_initcall; there is no reasonable place to promote it to.
But I think this explains why some allocation counters are missing during boot: those allocations happen before alloc_tag_init.
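
To make the ordering argument concrete, here is a small user-space sketch of the initcall levels in the order the kernel runs them (simplified: the early and _sync variants are left out). initcall_rank() and runs_before() are hypothetical helpers for illustration. Whether sched_init_domains really runs at the core level is only my guess above, not something this sketch verifies.

```c
#include <assert.h>
#include <string.h>

/* Initcall levels in boot order; "rootfs" sits between "fs" and "device". */
const char *const initcall_levels[] = {
	"pure", "core", "postcore", "arch", "subsys",
	"fs", "rootfs", "device", "late",
};

/* Returns the position of a level in boot order, or -1 if it is unknown. */
int initcall_rank(const char *level)
{
	size_t n = sizeof(initcall_levels) / sizeof(initcall_levels[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(initcall_levels[i], level) == 0)
			return (int)i;
	return -1;
}

/* Nonzero when everything at level a is guaranteed to run before level b. */
int runs_before(const char *a, const char *b)
{
	int ra = initcall_rank(a);
	int rb = initcall_rank(b);

	assert(ra >= 0 && rb >= 0);
	return ra < rb;
}
```

Here runs_before("core", "rootfs") is nonzero, so anything allocated at the core level or earlier cannot be seen by an init function that has to wait for rootfs.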


Thanks
David

Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by David Wang 3 months, 2 weeks ago
At 2025-06-23 11:16:15, "David Wang" <00107082@163.com> wrote:
>
>At 2025-06-23 10:45:31, "David Wang" <00107082@163.com> wrote:
>>
>>At 2025-06-23 06:50:44, "Suren Baghdasaryan" <surenb@google.com> wrote:
>>>On Fri, Jun 20, 2025 at 3:03 AM David Wang <00107082@163.com> wrote:
>>>>
>>>> On Wed, Jun 18, 2025 at 02:25:37PM +0800, kernel test robot wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > for this change, we reported
>>>> > "[linux-next:master] [lib/test_vmalloc.c]  7fc85b92db: Mem-Info"
>>>> > in
>>>> > https://lore.kernel.org/all/202505071555.e757f1e0-lkp@intel.com/
>>>> >
>>>> > at that time, we made some tests with x86_64 config which runs well.
>>>> >
>>>> > now we noticed the commit is in mainline now.
>>>>
>>>> > the config still has expected diff with parent:
>>>> >
>>>> > --- /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/7a73348e5d4715b5565a53f21c01ea7b54e46cbd/.config   2025-06-17 14:40:29.481052101 +0800
>>>> > +++ /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/2d76e79315e403aab595d4c8830b7a46c19f0f3b/.config   2025-06-17 14:41:18.448543738 +0800
>>>> > @@ -7551,7 +7551,7 @@ CONFIG_TEST_IDA=m
>>>> >  CONFIG_TEST_MISC_MINOR=m
>>>> >  # CONFIG_TEST_LKM is not set
>>>> >  CONFIG_TEST_BITOPS=m
>>>> > -CONFIG_TEST_VMALLOC=m
>>>> > +CONFIG_TEST_VMALLOC=y
>>>> >  # CONFIG_TEST_BPF is not set
>>>> >  CONFIG_FIND_BIT_BENCHMARK=m
>>>> >  # CONFIG_TEST_FIRMWARE is not set
>>>> >
>>>> >
>>>> > then we noticed similar random issue with x86_64 randconfig this time.
>>>> >
>>>> > 7a73348e5d4715b5 2d76e79315e403aab595d4c8830
>>>> > ---------------- ---------------------------
>>>> >        fail:runs  %reproduction    fail:runs
>>>> >            |             |             |
>>>> >            :199         34%          67:200   dmesg.KASAN:null-ptr-deref_in_range[#-#]
>>>> >            :199         34%          67:200   dmesg.Kernel_panic-not_syncing:Fatal_exception
>>>> >            :199         34%          67:200   dmesg.Mem-Info
>>>> >            :199         34%          67:200   dmesg.Oops:general_protection_fault,probably_for_non-canonical_address#:#[##]SMP_KASAN
>>>> >            :199         34%          67:200   dmesg.RIP:down_read_trylock
>>>> >
>>>> > we don't have enough knowledge to understand the relationship between code
>>>> > change and the random issues. just report what we obsverved in our tests FYI.
>>>> >
>>>>
>>>> I think this is caused by a race between vmalloc_test_init and alloc_tag_init.
>>>>
>>>> vmalloc_test actually depends on alloc_tag via alloc_tag_top_users, because when
>>>> memory allocation fails show_mem() would invoke alloc_tag_top_users.
>>>>
>>>> With following configuration:
>>>>
>>>> CONFIG_TEST_VMALLOC=y
>>>> CONFIG_MEM_ALLOC_PROFILING=y
>>>> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>>>> CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
>>>>
>>>> If vmalloc_test_init starts before alloc_tag_init, show_mem() would cause
>>>> a NULL deference because alloc_tag_cttype was not init yet.
>>>>
>>>> I add some debug to confirm this theory
>>>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>>>> index d48b80f3f007..9b8e7501010f 100644
>>>> --- a/lib/alloc_tag.c
>>>> +++ b/lib/alloc_tag.c
>>>> @@ -133,6 +133,8 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl
>>>>         struct codetag *ct;
>>>>         struct codetag_bytes n;
>>>>         unsigned int i, nr = 0;
>>>> +       pr_info("memory profiling alloc top %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
>>>> +       return 0;
>>>>
>>>>         if (can_sleep)
>>>>                 codetag_lock_module_list(alloc_tag_cttype, true);
>>>> @@ -831,6 +833,7 @@ static int __init alloc_tag_init(void)
>>>>                 shutdown_mem_profiling(true);
>>>>                 return PTR_ERR(alloc_tag_cttype);
>>>>         }
>>>> +       pr_info("memory profiling ready %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
>>>>
>>>>         return 0;
>>>>  }
>>>>
>>>> When booting the kernel, the log shows:
>>>>
>>>> $ sudo dmesg -T | grep profiling
>>>> [Fri Jun 20 17:29:35 2025] memory profiling alloc top 1: 0  <--- alloc_tag_cttype == NULL
>>>> [Fri Jun 20 17:30:24 2025] memory profiling ready 1: ffff9b1641aa06c0
>>>>
>>>>
>>>> vmalloc_test_init should happen after alloc_tag_init if CONFIG_TEST_VMALLOC=y,
>>>> or show_mem() should check whether alloc_tag has finished initializing before
>>>> calling alloc_tag_top_users.
>>>
>>>Thanks for reporting!
>>>So, IIUC https://lore.kernel.org/all/20250620195305.1115151-1-harry.yoo@oracle.com/
>>>will address this issue as well. Is that correct?
>>
>>Yes, the panic can be fixed by that patch.
>>
>>I still feel it would be better to delay vmalloc_test_init so that it happens after alloc_tag_init.
>>Or maybe we can promote alloc_tag_init to some early init? I remember reporting some allocations
>>not registered by memory profiling during boot:
>>https://lore.kernel.org/all/213ff7d2.7c6c.1945eb0c2ff.Coremail.00107082@163.com/
>>
>>I will run some tests and update later.
>
>The memory allocations in sched_init_domains happen quite early, maybe at core_initcall, while
>alloc_tag_init needs rootfs, so it has to run after rootfs_initcall; there is no reasonable place to promote it to.
>But I think this explains why some allocation counters were missed during boot: the allocations happened before alloc_tag_init.

Sorry, I think I was wrong. The counters do not need alloc_tag_init.

Sorry for bothering, please ignore my mumbo jumbo.

David

>
>
>Thanks
>David
>
>>
>>
>>David
>>
>>
>>>
>>>>
>>>>
>>>>
>>>> David
>>>>
Re: CONFIG_TEST_VMALLOC=y conflict/race with alloc_tag_init
Posted by Harry Yoo 3 months, 2 weeks ago
On Sun, Jun 22, 2025 at 03:50:44PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 20, 2025 at 3:03 AM David Wang <00107082@163.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 02:25:37PM +0800, kernel test robot wrote:
> > >
> > > Hello,
> > >
> > > for this change, we reported
> > > "[linux-next:master] [lib/test_vmalloc.c]  7fc85b92db: Mem-Info"
> > > in
> > > > https://lore.kernel.org/all/202505071555.e757f1e0-lkp@intel.com/
> > >
> > > at that time, we made some tests with x86_64 config which runs well.
> > >
> > > now we noticed the commit is in mainline now.
> >
> > > the config still has expected diff with parent:
> > >
> > > --- /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/7a73348e5d4715b5565a53f21c01ea7b54e46cbd/.config   2025-06-17 14:40:29.481052101 +0800
> > > +++ /pkg/linux/x86_64-randconfig-161-20250614/gcc-12/2d76e79315e403aab595d4c8830b7a46c19f0f3b/.config   2025-06-17 14:41:18.448543738 +0800
> > > @@ -7551,7 +7551,7 @@ CONFIG_TEST_IDA=m
> > >  CONFIG_TEST_MISC_MINOR=m
> > >  # CONFIG_TEST_LKM is not set
> > >  CONFIG_TEST_BITOPS=m
> > > -CONFIG_TEST_VMALLOC=m
> > > +CONFIG_TEST_VMALLOC=y
> > >  # CONFIG_TEST_BPF is not set
> > >  CONFIG_FIND_BIT_BENCHMARK=m
> > >  # CONFIG_TEST_FIRMWARE is not set
> > >
> > >
> > > then we noticed similar random issue with x86_64 randconfig this time.
> > >
> > > 7a73348e5d4715b5 2d76e79315e403aab595d4c8830
> > > ---------------- ---------------------------
> > >        fail:runs  %reproduction    fail:runs
> > >            |             |             |
> > >            :199         34%          67:200   dmesg.KASAN:null-ptr-deref_in_range[#-#]
> > >            :199         34%          67:200   dmesg.Kernel_panic-not_syncing:Fatal_exception
> > >            :199         34%          67:200   dmesg.Mem-Info
> > >            :199         34%          67:200   dmesg.Oops:general_protection_fault,probably_for_non-canonical_address#:#[##]SMP_KASAN
> > >            :199         34%          67:200   dmesg.RIP:down_read_trylock
> > >
> > > we don't have enough knowledge to understand the relationship between code
> > > > change and the random issues. just report what we observed in our tests FYI.
> > >
> >
> > I think this is caused by a race between vmalloc_test_init and alloc_tag_init.
> >
> > vmalloc_test actually depends on alloc_tag via alloc_tag_top_users, because when
> > memory allocation fails show_mem() would invoke alloc_tag_top_users.
> >
> > With the following configuration:
> >
> > CONFIG_TEST_VMALLOC=y
> > CONFIG_MEM_ALLOC_PROFILING=y
> > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> > CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
> >
> > If vmalloc_test_init starts before alloc_tag_init, show_mem() would cause
> > a NULL dereference because alloc_tag_cttype is not initialized yet.
> >
> > I added some debug output to confirm this theory:
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index d48b80f3f007..9b8e7501010f 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -133,6 +133,8 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl
> >         struct codetag *ct;
> >         struct codetag_bytes n;
> >         unsigned int i, nr = 0;
> > +       pr_info("memory profiling alloc top %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
> > +       return 0;
> >
> >         if (can_sleep)
> >                 codetag_lock_module_list(alloc_tag_cttype, true);
> > @@ -831,6 +833,7 @@ static int __init alloc_tag_init(void)
> >                 shutdown_mem_profiling(true);
> >                 return PTR_ERR(alloc_tag_cttype);
> >         }
> > +       pr_info("memory profiling ready %d: %llx\n", mem_profiling_support, (long long)alloc_tag_cttype);
> >
> >         return 0;
> >  }
> >
> > When booting the kernel, the log shows:
> >
> > $ sudo dmesg -T | grep profiling
> > [Fri Jun 20 17:29:35 2025] memory profiling alloc top 1: 0  <--- alloc_tag_cttype == NULL
> > [Fri Jun 20 17:30:24 2025] memory profiling ready 1: ffff9b1641aa06c0
> >
> >
> > vmalloc_test_init should happen after alloc_tag_init if CONFIG_TEST_VMALLOC=y,
> > or show_mem() should check whether alloc_tag has finished initializing before
> > calling alloc_tag_top_users.
> 
> Thanks for reporting!
> So, IIUC https://lore.kernel.org/all/20250620195305.1115151-1-harry.yoo@oracle.com/
> will address this issue as well. Is that correct?

Yes, I verified that it addresses this issue.
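
For readers following the thread, the kind of guard discussed above can be
modeled in user space. The sketch below is not the linked patch and is not
kernel code: model_cttype, model_alloc_tag_init() and model_top_users() are
hypothetical stand-ins for alloc_tag_cttype, alloc_tag_init() and
alloc_tag_top_users(). It only illustrates that returning early while the
pointer is still NULL avoids touching state that has not been set up yet.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct codetag_type. */
struct codetag_type {
	const char *name;
};

static struct codetag_type model_cttype_storage = { "alloc_tags" };

/*
 * Stays NULL until model_alloc_tag_init() runs, mirroring how
 * alloc_tag_cttype is unset before alloc_tag_init() has completed.
 */
static struct codetag_type *model_cttype;

/* Models alloc_tag_init(): publishes the code tag type. */
static void model_alloc_tag_init(void)
{
	model_cttype = &model_cttype_storage;
}

/*
 * Models alloc_tag_top_users() with an early-return guard.
 * Returns the number of entries it would report: 0 when called
 * before initialization, otherwise (as a simplification made for
 * this sketch) the requested count.
 */
static size_t model_top_users(size_t count)
{
	if (model_cttype == NULL)
		return 0;	/* too early: nothing to lock or walk */

	/* The real function would lock the module list and walk tags here. */
	return count;
}
```

Calling model_top_users() before model_alloc_tag_init() returns 0 instead of
dereferencing a NULL pointer; after initialization it reports entries as
usual. In the kernel itself a check such as IS_ERR_OR_NULL() would probably
be the better fit, since the debug diff quoted above shows that
alloc_tag_cttype can also hold an error pointer (alloc_tag_init() returns
PTR_ERR(alloc_tag_cttype) on failure).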

> >
> > David
> >

-- 
Cheers,
Harry / Hyeonggon