v6 --> v7:
1. Improve the performance of kallsyms_lookup_name() when CONFIG_LTO_CLANG=y.
To achieve this, restrict '.' to the beginning of a substring, not the middle
or the end.
2. kallsyms_selftest.c adds support for CONFIG_LTO_CLANG=y.
3. Patches 4-6 are rearranged: implementations of the same functionality are
centralized in one patch, rather than split based on whether they belong to
the tool or the kernel.
4. Some adaptations are made due to the impact of the following patches:
aa221f2ea58655f kallsyms: take the input file instead of reading stdin
73bbb94466fd3f8 kallsyms: support "big" kernel symbols
dfb352ab1162f73 kallsyms: Drop CONFIG_CFI_CLANG workarounds
v5 --> v6:
1. Add patch 6/11, kallsyms: Add helper kallsyms_lookup_clang_name()
2. Update commit message of patch 9/11.
v4 --> v5:
1. In scripts/kallsyms.c, we use an extra field to hold type and eventually
put it together with name in write_src().
2. Generate a new table kallsyms_best_token_table[], so that we compress a
symbol in the kernel using a process similar to compress_symbol().
3. Remove helper sym_name(), and rename field 'sym[]' to 'name[]' in
scripts/kallsyms.c
4. Add helper __kallsyms_lookup_compressed_name() to avoid duplicate code in
functions kallsyms_lookup_name() and kallsyms_on_each_match_symbol().
5. Add a new parameter "const char *modname" to module_kallsyms_on_each_symbol(),
this makes the code logic clearer.
6. Delete the parameter 'struct module *' in the hook function associated with
kallsyms_on_each_symbol(); it is unused now.
v3 --> v4:
1. Move the declaration of function kallsyms_sym_address() to linux/kallsyms.h,
fix a build warning.
v2 --> v3:
1. Improve test cases, perform complete functional tests on functions
kallsyms_lookup_name(), kallsyms_on_each_symbol() and
kallsyms_on_each_match_symbol().
2. Add patch [PATCH v3 2/8] scripts/kallsyms: ensure that all possible
combinations are compressed.
3. The symbol type is not compressed regardless of whether
CONFIG_KALLSYMS_ALL is set or not. The memory overhead is increased
by less than 20KiB if CONFIG_KALLSYMS_ALL=n.
4. Discard [PATCH v2 3/8] kallsyms: Adjust the types of some local variables
v1 --> v2:
Add self-test facility
v1:
Currently, to search for a symbol, we need to expand the symbols in
'kallsyms_names' one by one, and then use the expanded string for
comparison. This is very slow.
In fact, we can first compress the name being looked up and then use
it for comparison when traversing 'kallsyms_names'.
This patch series optimizes the performance of function kallsyms_lookup_name(),
and function klp_find_object_symbol() in the livepatch module. Based on the
test results, the performance overhead is reduced to 5%. That is, the
performance of these functions is improved by 20 times.
To avoid increasing the kernel size in non-debug mode, the optimization is only
for the case CONFIG_KALLSYMS_ALL=y.
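As a rough illustration of the compress-then-compare idea described above (a toy C
sketch: the token table, all 'toy_' names, and the length-prefixed layout are invented
for illustration and are not the real kallsyms format), the needle is compressed once
and its compressed bytes are compared directly against the table entries:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical token table: one-byte codes 0x80+ expand to substrings. */
static const char *toy_tokens[256] = {
	[0x80] = "kallsyms_",
	[0x81] = "lookup_",
};

/* Compress 'name' with a greedy longest-token match at each position. */
static size_t toy_compress(const char *name, uint8_t *out, size_t outlen)
{
	size_t n = 0;

	while (*name && n < outlen) {
		int best = -1;
		size_t best_len = 0;

		for (int c = 0x80; c <= 0xff; c++) {
			size_t l = toy_tokens[c] ? strlen(toy_tokens[c]) : 0;

			if (l > best_len && !strncmp(name, toy_tokens[c], l)) {
				best = c;
				best_len = l;
			}
		}
		if (best >= 0) {
			out[n++] = (uint8_t)best;
			name += best_len;
		} else {
			out[n++] = (uint8_t)*name++;
		}
	}
	return n;
}

/* Entries stored like kallsyms_names[]: [length][compressed bytes]... */
static const uint8_t toy_table[] = {
	6, 0x80, 0x81, 'n', 'a', 'm', 'e',	/* kallsyms_lookup_name */
	4, 0x80, 'f', 'o', 'o',			/* kallsyms_foo */
};

/* Compress the needle once, then compare compressed bytes directly. */
static int toy_lookup(const char *name)
{
	uint8_t key[64];
	size_t keylen = toy_compress(name, key, sizeof(key));
	size_t off = 0;

	for (int idx = 0; off < sizeof(toy_table); idx++) {
		uint8_t len = toy_table[off++];

		if (len == keylen && !memcmp(&toy_table[off], key, len))
			return idx;
		off += len;
	}
	return -1;
}
```

With the toy table above, "kallsyms_lookup_name" is compressed to six bytes once, and
every table entry is then accepted or rejected with a cheap memcmp() instead of being
expanded first.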
Zhen Lei (11):
scripts/kallsyms: rename build_initial_tok_table()
scripts/kallsyms: don't compress symbol types
scripts/kallsyms: remove helper sym_name() and cleanup
kallsyms: Add helper kallsyms_compress_symbol_name()
kallsyms: Improve the performance of kallsyms_lookup_name()
kallsyms: Improve the performance of kallsyms_lookup_name() when
CONFIG_LTO_CLANG=y
kallsyms: Add helper kallsyms_on_each_match_symbol()
livepatch: Use kallsyms_on_each_match_symbol() to improve performance
livepatch: Improve the search performance of
module_kallsyms_on_each_symbol()
kallsyms: Delete an unused parameter related to
kallsyms_on_each_symbol()
kallsyms: Add self-test facility
include/linux/kallsyms.h | 12 +-
include/linux/module.h | 4 +-
init/Kconfig | 13 ++
kernel/Makefile | 1 +
kernel/kallsyms.c | 197 ++++++++++++++--
kernel/kallsyms_internal.h | 1 +
kernel/kallsyms_selftest.c | 464 +++++++++++++++++++++++++++++++++++++
kernel/livepatch/core.c | 31 ++-
kernel/module/kallsyms.c | 15 +-
kernel/trace/ftrace.c | 3 +-
scripts/kallsyms.c | 161 ++++++++++---
scripts/link-vmlinux.sh | 4 +
12 files changed, 837 insertions(+), 69 deletions(-)
create mode 100644 kernel/kallsyms_selftest.c
--
2.25.1
On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
> Currently, to search for a symbol, we need to expand the symbols in
> 'kallsyms_names' one by one, and then use the expanded string for
> comparison. This is very slow.
>
> In fact, we can first compress the name being looked up and then use
> it for comparison when traversing 'kallsyms_names'.
>
> This patch series optimizes the performance of function kallsyms_lookup_name(),
> and function klp_find_object_symbol() in the livepatch module. Based on the
> test results, the performance overhead is reduced to 5%. That is, the
> performance of these functions is improved by 20 times.

Stupid question, is a hash table in order?

  Luis
On 2022/10/19 20:01, Luis Chamberlain wrote:
> On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
>> Currently, to search for a symbol, we need to expand the symbols in
>> 'kallsyms_names' one by one, and then use the expanded string for
>> comparison. This is very slow.
>>
>> In fact, we can first compress the name being looked up and then use
>> it for comparison when traversing 'kallsyms_names'.
>>
>> This patch series optimizes the performance of function kallsyms_lookup_name(),
>> and function klp_find_object_symbol() in the livepatch module. Based on the
>> test results, the performance overhead is reduced to 5%. That is, the
>> performance of these functions is improved by 20 times.
>
> Stupid question, is a hash table in order?

No hash table.

All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms

The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
relationship. For any symbol, it has the same index in both arrays.

Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
However, when we look up an address based on a symbol name, we can only traverse array
kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.

>
> Luis
> .
>

--
Regards,
Zhen Lei
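The two-array layout described in this reply can be modeled with a toy C sketch (all
data and 'toy_' names here are invented for illustration; the real tables are
kallsyms_addresses[] and kallsyms_names[] inside the kernel image):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the layout described above: addresses sorted ascending,
 * names in a parallel array sharing the same indices.
 */
static const unsigned long toy_addresses[] = { 0x1000, 0x1400, 0x2000, 0x2c00 };
static const char *toy_names[] = { "start", "foo", "bar_init", "end" };
#define TOY_NUM (sizeof(toy_addresses) / sizeof(toy_addresses[0]))

/* address -> index: the array is sorted, so binary search works */
static size_t toy_index_by_addr(unsigned long addr)
{
	size_t low = 0, high = TOY_NUM;

	while (high - low > 1) {
		size_t mid = low + (high - low) / 2;

		if (toy_addresses[mid] <= addr)
			low = mid;
		else
			high = mid;
	}
	return low;	/* index of the symbol containing addr */
}

/* name -> address: nothing is sorted by name, so only a linear scan */
static unsigned long toy_addr_by_name(const char *name)
{
	for (size_t i = 0; i < TOY_NUM; i++)
		if (!strcmp(toy_names[i], name))
			return toy_addresses[i];
	return 0;
}
```

This is why the address-to-name direction is O(log n) while the name-to-address
direction degenerates to a full scan, the case the series optimizes.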
On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote:
>
>
> On 2022/10/19 20:01, Luis Chamberlain wrote:
> > On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
> >> Currently, to search for a symbol, we need to expand the symbols in
> >> 'kallsyms_names' one by one, and then use the expanded string for
> >> comparison. This is very slow.
> >>
> >> In fact, we can first compress the name being looked up and then use
> >> it for comparison when traversing 'kallsyms_names'.
> >>
> >> This patch series optimizes the performance of function kallsyms_lookup_name(),
> >> and function klp_find_object_symbol() in the livepatch module. Based on the
> >> test results, the performance overhead is reduced to 5%. That is, the
> >> performance of these functions is improved by 20 times.
> >
> > Stupid question, is a hash table in order?
>
> No hash table.
>
> All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms
>
> The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
> are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
> relationship. For any symbol, it has the same index in both arrays.
>
> Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
> However, when we look up an address based on a symbol name, we can only traverse array
> kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.

This answers how we don't use a hash table, the question was *should* we
use one?

  Luis
On 2022/10/26 1:53, Luis Chamberlain wrote:
> On Wed, Oct 19, 2022 at 10:11:58PM +0800, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/19 20:01, Luis Chamberlain wrote:
>>> On Mon, Oct 17, 2022 at 02:49:39PM +0800, Zhen Lei wrote:
>>>> Currently, to search for a symbol, we need to expand the symbols in
>>>> 'kallsyms_names' one by one, and then use the expanded string for
>>>> comparison. This is very slow.
>>>>
>>>> In fact, we can first compress the name being looked up and then use
>>>> it for comparison when traversing 'kallsyms_names'.
>>>>
>>>> This patch series optimizes the performance of function kallsyms_lookup_name(),
>>>> and function klp_find_object_symbol() in the livepatch module. Based on the
>>>> test results, the performance overhead is reduced to 5%. That is, the
>>>> performance of these functions is improved by 20 times.
>>>
>>> Stupid question, is a hash table in order?
>>
>> No hash table.
>>
>> All symbols are arranged in ascending order of address. For example: cat /proc/kallsyms
>>
>> The addresses of all symbols are stored in kallsyms_addresses[], and names of all symbols
>> are stored in kallsyms_names[]. The elements in these two arrays are in a one-to-one
>> relationship. For any symbol, it has the same index in both arrays.
>>
>> Therefore, when we look up a symbolic name based on an address, we use a binary lookup.
>> However, when we look up an address based on a symbol name, we can only traverse array
>> kallsyms_names[] in sequence. I think the reason why hash is not used is to save memory.
>
> This answers how we don't use a hash table, the question was *should* we
> use one?

I'm not the original author, and I can only answer now based on my understanding. Maybe
the original author didn't think of the hash method, or he has weighed it out.

Hash is a good solution if only performance is required and memory overhead is not
considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.

Because I don't know what hash algorithm will be used, the cost of generating the
hash value corresponding to the symbol name is unknown now. But I think it's gonna
be small. But it definitely needs a simpler algorithm, the tool needs to implement
the same hash algorithm.

If the hash is not very uniform or ARRAY_SIZE(hashtable) is small, then my current
approach still makes sense. So maybe hash can be deferred to the next phase of
improvement.

>
> Luis
> .
>

--
Regards,
Zhen Lei
On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
> On 2022/10/26 1:53, Luis Chamberlain wrote:
> > This answers how we don't use a hash table, the question was *should* we
> > use one?
>
> I'm not the original author, and I can only answer now based on my understanding. Maybe
> the original author didn't think of the hash method, or he has weighed it out.
>
> Hash is a good solution if only performance is required and memory overhead is not
> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>
> Because I don't know what hash algorithm will be used, the cost of generating the
> hash value corresponding to the symbol name is unknown now. But I think it's gonna
> be small. But it definitely needs a simpler algorithm, the tool needs to implement
> the same hash algorithm.

For instance, you can look at evaluating if alloc_large_system_hash() would help.

  Luis
On 2022/10/27 3:03, Luis Chamberlain wrote:
> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>> This answers how we don't use a hash table, the question was *should* we
>>> use one?
>>
>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>> the original author didn't think of the hash method, or he has weighed it out.
>>
>> Hash is a good solution if only performance is required and memory overhead is not
>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>
>> Because I don't know what hash algorithm will be used, the cost of generating the
>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>> the same hash algorithm.
>
> For instance, you can look at evaluating if alloc_large_system_hash() would help.

OK, I found the right hash function. In this way, the tool does not need to consider
the byte order.

include/linux/stringhash.h

/*
 * Version 1: one byte at a time. Example of use:
 *
 * unsigned long hash = init_name_hash;
 * while (*p)
 *	hash = partial_name_hash(tolower(*p++), hash);
 * hash = end_name_hash(hash);

>
> Luis
> .
>

--
Regards,
Zhen Lei
On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>
>
> On 2022/10/27 3:03, Luis Chamberlain wrote:
>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>> This answers how we don't use a hash table, the question was *should* we
>>>> use one?
>>>
>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>> the original author didn't think of the hash method, or he has weighed it out.
>>>
>>> Hash is a good solution if only performance is required and memory overhead is not
>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
Sorry, 1-2 million ==> 0.1~0.2 million
>>>
>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>> the same hash algorithm.
>>
>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>
> OK, I found the right hash function. In this way, the tool does not need to consider
> the byte order.
https://en.wikipedia.org/wiki/Jenkins_hash_function
Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
have to think about sizeof(long). It seems to be closest to our current needs.
uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
	size_t i = 0;
	uint32_t hash = 0;

	while (i != length) {
		hash += key[i++];
		hash += hash << 10;
		hash ^= hash >> 6;
	}
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;

	return hash;
}
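For reference, this is how the hash above would typically be used to pick a bucket in
a power-of-two table (a sketch; the function is repeated so the snippet is
self-contained, and the 4096-entry table size is only an example, not a decision):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Repeated from above so this snippet is self-contained. */
static uint32_t jenkins_one_at_a_time_hash(const uint8_t *key, size_t length)
{
	size_t i = 0;
	uint32_t hash = 0;

	while (i != length) {
		hash += key[i++];
		hash += hash << 10;
		hash ^= hash >> 6;
	}
	hash += hash << 3;
	hash ^= hash >> 11;
	hash += hash << 15;

	return hash;
}

/* With a power-of-two table, picking a bucket is just masking low bits. */
static uint32_t toy_bucket_of(const char *name, uint32_t table_size)
{
	uint32_t hash = jenkins_one_at_a_time_hash((const uint8_t *)name,
						   strlen(name));

	return hash & (table_size - 1);
}
```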
>
> include/linux/stringhash.h
>
> /*
> * Version 1: one byte at a time. Example of use:
> *
> * unsigned long hash = init_name_hash;
> * while (*p)
> * hash = partial_name_hash(tolower(*p++), hash);
> * hash = end_name_hash(hash);
>
>
>>
>> Luis
>> .
>>
>
--
Regards,
Zhen Lei
On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
>
>
> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>> use one?
>>>>
>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>
>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>
> Sorry, 1-2 million ==> 0.1~0.2 million
>
>>>>
>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>> the same hash algorithm.
>>>
>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>
The following three hash algorithms are compared. The kernel is compiled with
defconfig on arm64.
|---------------------------------------------------------------------------------------|
| | hash &= HASH_TABLE_SIZE - 1 |
| | number of conflicts >= 1000 |
|---------------------------------------------------------------------------------------|
| ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 |
|---------------------------------------------------------------------------------------|
| | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 |
| | 35bb: 1016 | 35ce: 6549 | 883a: 1015 |
| 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 |
| | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 |
|---------------------------------------------------------------------------------------|
| | 0ba: 19168 | 440: 19165 | 05f: 19170 |
| | 45b: 3955 | 5ce: 6577 | 83a: 1066 |
| 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 |
| | 85b: 6582 | ebe: 3938 | cda: 3924 |
|---------------------------------------------------------------------------------------|
Based on the above test results, I conclude that:
1. For the worst-case scenario, the three algorithms are not much different. But the kernel
only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
it is coupled with byte order and sizeof(long). So crc16 is the best choice.
2. For the worst-case scenario, almost 19K strings are mapped to the same hash
value, just over 1/10 of the total. And with my current compression-then-comparison
approach, it's 25-30 times faster. So there's still a need for my current approach, and
they can be combined.
   if (nr_conflicts(key) >= CONST_N) {
           newname = compress(name);
           for_each_name_in_slot(key): compare(new_name)
   } else {
           for_each_name_in_slot(key): compare(name)
   }
Above CONST_N can be roughly calculated:
time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
Statistic information:
|------------------------------------------------------|
| nr_conflicts(key) | ARRAY_SIZE(hash_table) |
|------------------------------------------------------|
| <= ? | 0x1000 | 0x10000 |
|------------------------------------------------------|
| 0 | 0 | 7821 |
| 20 | 19 | 57375 |
| 40 | 2419 | 124 |
| 60 | 1343 | 70 |
| 80 | 149 | 73 |
| 100 | 38 | 49 |
| 200 | 108 | 16 |
| 400 | 14 | 2 |
| 600 | 2 | 2 |
| 800 | 0 | 0 |
| 1000 | 0 | 0 |
| 100000 | 4 | 4 |
|------------------------------------------------------|
Also, I re-calculated:
Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
|---- What I said earlier was 4
The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.
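The combined strategy from conclusion 2 could be sketched like this (toy C: the bucket
structure, all 'toy_' names, and the threshold value are hypothetical, and the
compressed comparison is stubbed out with a plain string compare):

```c
#include <assert.h>
#include <string.h>

/*
 * Toy sketch of the combined strategy: buckets with many collisions take
 * the compress-then-compare path, small buckets just strcmp() directly.
 * TOY_CONST_N and all toy_* names are hypothetical.
 */
#define TOY_CONST_N 8

struct toy_bucket {
	const char **names;	/* names hashed into this bucket */
	int nr;			/* collision count for this bucket */
};

/* Stand-in for the compress-then-compare path; real code would compare
 * compressed byte streams instead of plain strings. */
static int toy_compressed_match(const char *needle, const char *candidate)
{
	return strcmp(needle, candidate) == 0;
}

/* Return the index of 'name' within the bucket, or -1 if absent. */
static int toy_bucket_lookup(const struct toy_bucket *b, const char *name)
{
	if (b->nr >= TOY_CONST_N) {
		/* many conflicts: pay the one-off compression cost */
		for (int i = 0; i < b->nr; i++)
			if (toy_compressed_match(name, b->names[i]))
				return i;
	} else {
		/* few conflicts: plain comparison is cheaper */
		for (int i = 0; i < b->nr; i++)
			if (!strcmp(b->names[i], name))
				return i;
	}
	return -1;
}
```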
Hi, Luis:
  For the reasons given in the second conclusion above, even if only the hash method
is used, the other patches (except patches 4-6) and the option "--lto-clang" in patch
6/11 are still needed. If you don't mind, I'd like to use hash in the next stage. The
current patch set is already huge.
>> OK, I found the right hash function. In this way, the tool does not need to consider
>> the byte order.
>
> https://en.wikipedia.org/wiki/Jenkins_hash_function
>
> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
> have to think about sizeof(long). It seems to be closest to our current needs.
>
> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
> size_t i = 0;
> uint32_t hash = 0;
>
> while (i != length) {
> hash += key[i++];
> hash += hash << 10;
> hash ^= hash >> 6;
> }
> hash += hash << 3;
> hash ^= hash >> 11;
> hash += hash << 15;
>
> return hash;
> }
>
>>
>> include/linux/stringhash.h
>>
>> /*
>> * Version 1: one byte at a time. Example of use:
>> *
>> * unsigned long hash = init_name_hash;
>> * while (*p)
>> * hash = partial_name_hash(tolower(*p++), hash);
>> * hash = end_name_hash(hash);
>>
>>
>>>
>>> Luis
>>> .
>>>
>>
>
--
Regards,
Zhen Lei
On 2022/10/29 16:10, Leizhen (ThunderTown) wrote:
>
>
> On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>> use one?
>>>>>
>>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>>
>>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>
>> Sorry, 1-2 million ==> 0.1~0.2 million
>>
>>>>>
>>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>>> the same hash algorithm.
>>>>
>>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>>
>
> The following three hash algorithms are compared. The kernel is compiled by defconfig
> on arm64.
>
> |---------------------------------------------------------------------------------------|
> | | hash &= HASH_TABLE_SIZE - 1 |
> | | number of conflicts >= 1000 |
> |---------------------------------------------------------------------------------------|
> | ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 |
> |---------------------------------------------------------------------------------------|
> | | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 |
> | | 35bb: 1016 | 35ce: 6549 | 883a: 1015 |
> | 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 |
> | | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 |
> |---------------------------------------------------------------------------------------|
> | | 0ba: 19168 | 440: 19165 | 05f: 19170 |
> | | 45b: 3955 | 5ce: 6577 | 83a: 1066 |
> | 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 |
> | | 85b: 6582 | ebe: 3938 | cda: 3924 |
> |---------------------------------------------------------------------------------------|
>
> Based on the above test results, I conclude that:
> 1. For the worst-case scenario, the three algorithms are not much different. But the kernel
> only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
> it is coupled with byte order and sizeof(long). So crc16 is the best choice.
> 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash
> value,just over 1/10 of the total. And with my current compression-then-comparison
> approach, it's 25-30 times faster. So there's still a need for my current approach, and
> they can be combined.
> if (nr_conflicts(key) >= CONST_N) {
> newname = compress(name);
> for_each_name_in_slot(key): compare(new_name)
> } else {
> for_each_name_in_slot(key): compare(name)
> }
>
> Above CONST_N can be roughly calculated:
> time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
> 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
> 0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
> Statistic information:
> |------------------------------------------------------|
> | nr_conflicts(key) | ARRAY_SIZE(hash_table) |
> |------------------------------------------------------|
> | <= ? | 0x1000 | 0x10000 |
> |------------------------------------------------------|
> | 0 | 0 | 7821 |
> | 20 | 19 | 57375 |
> | 40 | 2419 | 124 |
> | 60 | 1343 | 70 |
> | 80 | 149 | 73 |
> | 100 | 38 | 49 |
> | 200 | 108 | 16 |
> | 400 | 14 | 2 |
> | 600 | 2 | 2 |
> | 800 | 0 | 0 |
> | 1000 | 0 | 0 |
> | 100000 | 4 | 4 |
> |------------------------------------------------------|
>
>
> Also, I re-calculated:
> Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
> |---- What I said earlier was 4
> The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.
>
> Hi, Luis:
> For the reasons of the above-mentioned second conclusion. And except for patches 4-6,
> even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11
> are also needed. If you don't mind, I'd like to use hash at the next stage. The current
> patch set is already huge.
I just posted an update in response to David Laight's email. The hash solution is
a middle-of-the-road compromise. It doesn't seem very feasible.
Now, we need to make a decision. Choose one of the two:
1. Continue with my current approach. Improve the average performance of
kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
arm64 (defconfig):
73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
x86 (defconfig):
49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
2. Sort names, binary search (static functions cause duplicate names; additional work is required)
   2^18=262144, so at most 18 symbol expansions and comparisons are required.
   The performance is definitely excellent, although I haven't tested it yet.
The memory overhead is increased by: 6 * kallsyms_num_syms
arm64 (defconfig):
1MiB if CONFIG_KALLSYMS_ALL=y.
362KiB if CONFIG_KALLSYMS_ALL=n.
x86 (defconfig):
770KiB if CONFIG_KALLSYMS_ALL=y.
356KiB if CONFIG_KALLSYMS_ALL=n.
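Option 2 amounts to a standard binary search over a name-sorted copy of the table
(toy C sketch: the data and 'toy_' names are invented, and the extra work needed for
duplicate static-function names is omitted):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy sketch of option 2: a copy of the names sorted by name, plus a map
 * back to the original (address-ordered) index. All toy_* data is invented.
 */
static const char *toy_sorted_names[] = { "bar", "foo_init", "main", "zeta" };
static const int toy_orig_index[] = { 2, 0, 3, 1 };
#define TOY_NUM (sizeof(toy_sorted_names) / sizeof(toy_sorted_names[0]))

/* bsearch() comparator: key is the needle, elem points into the array */
static int toy_cmp(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/* name -> original index in O(log n) comparisons, or -1 if absent */
static int toy_lookup_sorted(const char *name)
{
	const char *const *hit = bsearch(name, toy_sorted_names, TOY_NUM,
					 sizeof(toy_sorted_names[0]), toy_cmp);

	if (!hit)
		return -1;
	return toy_orig_index[hit - toy_sorted_names];
}
```

With ~2^18 symbols, each lookup costs at most 18 such comparisons; the extra memory is
the sorted-order index map, which matches the 6 * kallsyms_num_syms estimate above.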
>
>
>>> OK, I found the right hash function. In this way, the tool does not need to consider
>>> the byte order.
>>
>> https://en.wikipedia.org/wiki/Jenkins_hash_function
>>
>> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
>> have to think about sizeof(long). It seems to be closest to our current needs.
>>
>> uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
>> size_t i = 0;
>> uint32_t hash = 0;
>>
>> while (i != length) {
>> hash += key[i++];
>> hash += hash << 10;
>> hash ^= hash >> 6;
>> }
>> hash += hash << 3;
>> hash ^= hash >> 11;
>> hash += hash << 15;
>>
>> return hash;
>> }
>>
>>>
>>> include/linux/stringhash.h
>>>
>>> /*
>>> * Version 1: one byte at a time. Example of use:
>>> *
>>> * unsigned long hash = init_name_hash;
>>> * while (*p)
>>> * hash = partial_name_hash(tolower(*p++), hash);
>>> * hash = end_name_hash(hash);
>>>
>>>
>>>>
>>>> Luis
>>>> .
>>>>
>>>
>>
>
--
Regards,
Zhen Lei
On 2022/10/31 12:55, Leizhen (ThunderTown) wrote:
>
>
> On 2022/10/29 16:10, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/10/27 14:27, Leizhen (ThunderTown) wrote:
>>>
>>>
>>> On 2022/10/27 11:26, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>>> use one?
>>>>>>
>>>>>> I'm not the original author, and I can only answer now based on my understanding. Maybe
>>>>>> the original author didn't think of the hash method, or he has weighed it out.
>>>>>>
>>>>>> Hash is a good solution if only performance is required and memory overhead is not
>>>>>> considered. Using hash will increase the memory size by up to "4 * kallsyms_num_syms +
>>>>>> 4 * ARRAY_SIZE(hashtable)" bytes, kallsyms_num_syms is about 1-2 million.
>>>
>>> Sorry, 1-2 million ==> 0.1~0.2 million
>>>
>>>>>>
>>>>>> Because I don't know what hash algorithm will be used, the cost of generating the
>>>>>> hash value corresponding to the symbol name is unknown now. But I think it's gonna
>>>>>> be small. But it definitely needs a simpler algorithm, the tool needs to implement
>>>>>> the same hash algorithm.
>>>>>
>>>>> For instance, you can look at evaluating if alloc_large_system_hash() would help.
>>>>
>>
>> The following three hash algorithms are compared. The kernel is compiled by defconfig
>> on arm64.
>>
>> |---------------------------------------------------------------------------------------|
>> | | hash &= HASH_TABLE_SIZE - 1 |
>> | | number of conflicts >= 1000 |
>> |---------------------------------------------------------------------------------------|
>> | ARRAY_SIZE(hash_table) | crc16 | jhash_one_at_a_time | string hash_32 |
>> |---------------------------------------------------------------------------------------|
>> | | 345b: 3905 | 0d40: 1013 | 4a4c: 6548 |
>> | | 35bb: 1016 | 35ce: 6549 | 883a: 1015 |
>> | 0x10000 | 385b: 6548 | 4440: 19126 | d05f: 19129 |
>> | | f0ba: 19127 | 7ebe: 3916 | ecda: 3903 |
>> |---------------------------------------------------------------------------------------|
>> | | 0ba: 19168 | 440: 19165 | 05f: 19170 |
>> | | 45b: 3955 | 5ce: 6577 | 83a: 1066 |
>> | 0x1000 | 5bb: 1069 | d40: 1052 | a4c: 6609 |
>> | | 85b: 6582 | ebe: 3938 | cda: 3924 |
>> |---------------------------------------------------------------------------------------|
>>
>> Based on the above test results, I conclude that:
>> 1. For the worst-case scenario, the three algorithms are not much different. But the kernel
>> only implements crc16 and string hash_32. The latter is not processed byte-by-byte, so
>> it is coupled with byte order and sizeof(long). So crc16 is the best choice.
>> 2. For the worst-case scenario, there are almost 19K strings are mapped to the same hash
>> value,just over 1/10 of the total. And with my current compression-then-comparison
>> approach, it's 25-30 times faster. So there's still a need for my current approach, and
>> they can be combined.
>> if (nr_conflicts(key) >= CONST_N) {
>> newname = compress(name);
>> for_each_name_in_slot(key): compare(new_name)
>> } else {
>> for_each_name_in_slot(key): compare(name)
>> }
>>
>> Above CONST_N can be roughly calculated:
>> time_of_compress(name) + N * time_of_compare(new_name) <= N * time_of_compare(name)
>> 3. For the worst-case scenario, there is no obvious difference between ARRAY_SIZE(hash_table)
>> 0x10000 and 0x1000. So ARRAY_SIZE(hash_table)=0x1000 is enough.
>> Statistic information:
>> |------------------------------------------------------|
>> | nr_conflicts(key) | ARRAY_SIZE(hash_table) |
>> |------------------------------------------------------|
>> | <= ? | 0x1000 | 0x10000 |
>> |------------------------------------------------------|
>> | 0 | 0 | 7821 |
>> | 20 | 19 | 57375 |
>> | 40 | 2419 | 124 |
>> | 60 | 1343 | 70 |
>> | 80 | 149 | 73 |
>> | 100 | 38 | 49 |
>> | 200 | 108 | 16 |
>> | 400 | 14 | 2 |
>> | 600 | 2 | 2 |
>> | 800 | 0 | 0 |
>> | 1000 | 0 | 0 |
>> | 100000 | 4 | 4 |
>> |------------------------------------------------------|
>>
>>
>> Also, I re-calculated:
>> Using hash will increase the memory size by up to "6 * kallsyms_num_syms + 4 * ARRAY_SIZE(hashtable)"
>> |---- What I said earlier was 4
>> The increased size is close to 1 MB if CONFIG_KALLSYMS_ALL=y.
>>
>> Hi, Luis:
>> For the reasons of the above-mentioned second conclusion. And except for patches 4-6,
>> even if only the hash method is used, other patches and option "--lto-clang" in patch 6/11
>> are also needed. If you don't mind, I'd like to use hash at the next stage. The current
>> patch set is already huge.
>
> I just sent an update in response to David Laight's email. The hash solution
> is a middle-ground compromise; it no longer seems very feasible.
>
> Now, we need to make a decision. Choose one of the two:
> 1. Continue with my current approach. Improve the average performance of
> kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
> arm64 (defconfig):
> 73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
> 19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
> x86 (defconfig):
> 49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
> 16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
> 2. Sort names, binary search (static functions cause duplicate names, so additional work is required)
> 2^18=262144, only up to 18 symbol expansions and comparisons are required.
> The performance is definitely excellent, although I haven't tested it yet.
> The memory overhead is increased by: 6 * kallsyms_num_syms
> arm64 (defconfig):
> 1MiB if CONFIG_KALLSYMS_ALL=y.
> 362KiB if CONFIG_KALLSYMS_ALL=n.
> x86 (defconfig):
> 770KiB if CONFIG_KALLSYMS_ALL=y.
> 356KiB if CONFIG_KALLSYMS_ALL=n.
>
Preliminary test results (on QEMU arm64):
[ 73.049249] kallsyms_selftest: kallsyms_lookup_name() looked up 151880 symbols
[ 73.049331] kallsyms_selftest: The time spent on each symbol is (ns): min=1088, max=46848, avg=6629
>
>
>
>>
>>
>>>> OK, I found the right hash function. In this way, the tool does not need to consider
>>>> the byte order.
>>>
>>> https://en.wikipedia.org/wiki/Jenkins_hash_function
>>>
>>> Let's go with jenkins_one_at_a_time_hash(), which looks simpler and doesn't even
>>> have to think about sizeof(long). It seems to be closest to our current needs.
>>>
>>> uint32_t jenkins_one_at_a_time_hash(const uint8_t *key, size_t length)
>>> {
>>>         size_t i = 0;
>>>         uint32_t hash = 0;
>>>
>>>         while (i != length) {
>>>                 hash += key[i++];
>>>                 hash += hash << 10;
>>>                 hash ^= hash >> 6;
>>>         }
>>>         hash += hash << 3;
>>>         hash ^= hash >> 11;
>>>         hash += hash << 15;
>>>
>>>         return hash;
>>> }
>>>
>>>>
>>>> include/linux/stringhash.h
>>>>
>>>> /*
>>>> * Version 1: one byte at a time. Example of use:
>>>> *
>>>> * unsigned long hash = init_name_hash;
>>>> * while (*p)
>>>> * hash = partial_name_hash(tolower(*p++), hash);
>>>> * hash = end_name_hash(hash);
>>>>
>>>>
>>>>>
>>>>> Luis
>>>>> .
>>>>>
>>>>
>>>
>>
>
--
Regards,
Zhen Lei
On 2022/10/31 23:04, Leizhen (ThunderTown) wrote:
> Now, we need to make a decision. Choose one of the two:
> 1. Continue with my current approach. Improve the average performance of
> kallsyms_lookup_name() by 20 to 30 times. The memory overhead is increased by:
>    arm64 (defconfig):
>      73.5KiB and 4.0% if CONFIG_KALLSYMS_ALL=y.
>      19.8KiB and 2.8% if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      49.0KiB and 3.0% if CONFIG_KALLSYMS_ALL=y.
>      16.8KiB and 2.3% if CONFIG_KALLSYMS_ALL=n.
> 2. Sort names, binary search (static functions cause duplicate names, so
> additional work is required).
>    2^18=262144, only up to 18 symbol expansions and comparisons are required.
>    The performance is definitely excellent, although I haven't tested it yet.
>    The memory overhead is increased by: 6 * kallsyms_num_syms
>    arm64 (defconfig):
>      1MiB if CONFIG_KALLSYMS_ALL=y.
>      362KiB if CONFIG_KALLSYMS_ALL=n.
>    x86 (defconfig):
>      770KiB if CONFIG_KALLSYMS_ALL=y.
>      356KiB if CONFIG_KALLSYMS_ALL=n.

Hi, Luis:
I've implemented v8 based on method 2 (sort names, binary search).
The memory overhead is increased by: 3 * kallsyms_num_syms.
kallsyms_offsets_of_names[] is not added in v8 because it does not help much
with performance, saving (3 * kallsyms_num_syms) bytes. For details about the
performance data, please see the commit message.

--
Regards,
Zhen Lei
> >> On 2022/10/27 3:03, Luis Chamberlain wrote:
> >>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
> >>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
> >>>>> This answers how we don't use a hash table, the question was *should* we
> >>>>> use one?

(Probably brainfart) thought...

Is the current table (effectively) a sorted list of strings?
So the lookup is a binary chop - so O(log(n)).

But your hashes are having 'trouble' stopping one chain being very long?
So a linear search of that hash chain is slow.
In fact that sort of hashed lookup is O(n).

What if the symbols were sorted by hash then name?
(Without putting the hash into each entry.)
Then the code could do a binary chop search over the symbols with the
same hash value. The additional data is then an array of symbol numbers
indexed by the hash - 32 bits for each bucket.
If the hash table has 0x1000 entries it saves 12 compares.
(All of which are likely to be data cache misses.)

If you add the hash to each table entry then you can do a binary chop
search for the hash itself. While this is the same search as is done
for the strings, the comparison (just a number) will be faster than a
string compare.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 2022/10/29 20:49, David Laight wrote:
>>>> On 2022/10/27 3:03, Luis Chamberlain wrote:
>>>>> On Wed, Oct 26, 2022 at 02:44:36PM +0800, Leizhen (ThunderTown) wrote:
>>>>>> On 2022/10/26 1:53, Luis Chamberlain wrote:
>>>>>>> This answers how we don't use a hash table, the question was *should* we
>>>>>>> use one?
>
> (Probably brainfart) thought...
>
> Is the current table (effectively) a sorted list of strings?
> So the lookup is a binary chop - so O(log(n)).

Currently not sorted.

>
> But your hashes are having 'trouble' stopping one chain
> being very long?
> So a linear search of that hash chain is slow.
> In fact that sort of hashed lookup is O(n).

You've analyzed it very well. The hash method is not as good as sorting the
names and then doing a binary search. I figured that out when I was working
on the design these days.

Current implementation:
---------------------------------------
| idx | addresses | markers | names   |
---------------------------------------
|   0 | addr0     |         | name0   |
|   1 | addr1     |         | name1   |
| ... | addrx     |   [0]   | namex   |
| 255 | addrx     |         | name255 |
---------------------------------------
| 256 | addr256   |         | name256 |
| ... | addrx     |   [1]   | namex   |
| 511 | addr511   |         | name511 |
---------------------------------------
markers[0] = offset_of(name0)
markers[1] = offset_of(name256)

1. Find name by address
   Binary search addresses[], get idx; traverse names[] from
   markers[idx >> 8] to markers[(idx >> 8) + 1]; return name.
2. Find address by name
   Traverse names[], get idx; return addresses[idx].

Hash implementation:
Add two new arrays: hash_table[] and names_offsets[]
-----------------------------------------------------------------
| key | hash_table           | names_offsets                    |
|---------------------------------------------------------------|
|  0  | names_offsets[key=0] | offsets of all names with key=0  |
|  1  | names_offsets[key=1] | offsets of all names with key=1  |
| ... | ...                  | offsets of all names with key=k  |
|---------------------------------------------------------------|
hash_table[0] = 0
hash_table[1] = hash_table[0] + sizeof(names_offsets[0]) * number_of_names(key=0)
hash_table[2] = hash_table[1] + sizeof(names_offsets[0]) * number_of_names(key=1)

1. Find address by name
   Hash the name, get key; traverse names_offsets[] from
   index=hash_table[key] to index=hash_table[key+1] to get the offset of
   the name in names[]. Binary search markers[], get index, then traverse
   names[] from markers[index] to markers[index + 1] until the offset of
   the name matches; return addresses[idx].
2. Find name by address
   No change.

Sorted-names implementation:
Add two new arrays: offsets_of_addr_to_name[] and offsets_of_name[]
offsets_of_addr_to_name[i] = offset of name i in names[]
offsets_of_name[i] = offset of sorted name i in names[]

1. Find name by address
   Binary search addresses[], get idx; return names[offsets_of_addr_to_name[idx]].
2. Find address by name
   Binary search offsets_of_name[], get idx; return addresses[idx].

>
> What if the symbols were sorted by hash then name?
> (Without putting the hash into each entry.)
> Then the code could do a binary chop search over
> the symbols with the same hash value.
> The additional data is then an array of symbol numbers
> indexed by the hash - 32 bits for each bucket.
>
> If the hash table has 0x1000 entries it saves 12 compares.
> (All of which are likely to be data cache misses.)
>
> If you add the hash to each table entry then you can do
> a binary chop search for the hash itself.
> While this is the same search as is done for the strings
> the comparison (just a number) will be faster than a
> string compare.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

--
Regards,
Zhen Lei