[v1] iommu/iova: Make the rcache depot properly flexible

[PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by Robin Murphy 2 years, 5 months ago

Hi all,

Prompted by [1], which reminded me I started this a while ago, I've now
finished off my own attempt at sorting out the horrid lack of rcache
scalability. It's become quite clear that given the vast range of system
sizes and workloads there is no right size for a fixed depot array, so I
reckon we're better off not having one at all.

Note that the reclaim threshold and rate are chosen fairly arbitrarily -
it's enough of a challenge to get my 4-core dev board with spinning disk
and gigabit ethernet to push anything into a depot at all :)

Thanks,
Robin.

[1] https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com


Robin Murphy (2):
  iommu/iova: Make the rcache depot scale better
  iommu/iova: Manage the depot list size

 drivers/iommu/iova.c | 94 ++++++++++++++++++++++++++++++--------------
 1 file changed, 65 insertions(+), 29 deletions(-)

-- 
2.39.2.101.g768bb238c484.dirty

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by Jerry Snitselaar 2 years, 5 months ago

On Mon, Aug 14, 2023 at 06:53:32PM +0100, Robin Murphy wrote:
> Hi all,
> 
> Prompted by [1], which reminded me I started this a while ago, I've now
> finished off my own attempt at sorting out the horrid lack of rcache
> scalability. It's become quite clear that given the vast range of system
> sizes and workloads there is no right size for a fixed depot array, so I
> reckon we're better off not having one at all.
> 
> Note that the reclaim threshold and rate are chosen fairly arbitrarily -
> it's enough of a challenge to get my 4-core dev board with spinning disk
> and gigabit ethernet to push anything into a depot at all :)
> 
> Thanks,
> Robin.
> 
> [1] https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com
> 
> 
> Robin Murphy (2):
>   iommu/iova: Make the rcache depot scale better
>   iommu/iova: Manage the depot list size
> 
>  drivers/iommu/iova.c | 94 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 65 insertions(+), 29 deletions(-)
> 
> -- 
> 2.39.2.101.g768bb238c484.dirty
> 

I'm trying to hunt down a system where we've seen some issues before,
but most of them have involved systems with nvme drives. Commit
3710e2b056cb ("nvme-pci: clamp max_hw_sectors based on DMA optimized
limitation") has helped those cases. I ran the patches overnight with
IOVA_DEPOT_DELAY fixed up on a couple of Genoa based systems (384
cores) without issue.

Regards,
Jerry

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by John Garry 2 years, 5 months ago

On 14/08/2023 18:53, Robin Murphy wrote:
> Hi all,
> 

Hi Robin,

> Prompted by [1], which reminded me I started this a while ago, I've now
> finished off my own attempt at sorting out the horrid lack of rcache
> scalability. It's become quite clear that given the vast range of system
> sizes and workloads there is no right size for a fixed depot array, so I
> reckon we're better off not having one at all.
> 
> Note that the reclaim threshold and rate are chosen fairly arbitrarily -

This threshold is the number of online CPUs, right?

> it's enough of a challenge to get my 4-core dev board with spinning disk
> and gigabit ethernet to push anything into a depot at all :)
> 

I have to admit that I was hoping to also see a more aggressive reclaim 
strategy, where we also trim the per-CPU rcaches when not in use. 
Leizhen proposed something like this a long time ago.

Thanks,
John

> Thanks,
> Robin.
> 
> [1] https://urldefense.com/v3/__https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com__;!!ACWV5N9M2RV99hQ!Oj-N3yDamuhrlNTcuL5MA2LQRVf1EwFxQU21BMXSFBR1Fb3z4H_on1uiFG0EOoYpNc-FKGeoKvw9wzEV_1TRcr4$
> 
> 
> Robin Murphy (2):
>    iommu/iova: Make the rcache depot scale better
>    iommu/iova: Manage the depot list size
> 
>   drivers/iommu/iova.c | 94 ++++++++++++++++++++++++++++++--------------
>   1 file changed, 65 insertions(+), 29 deletions(-)
>

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by Robin Murphy 2 years, 5 months ago

On 15/08/2023 11:24 am, John Garry wrote:
> On 14/08/2023 18:53, Robin Murphy wrote:
>> Hi all,
>>
> 
> Hi Robin,
> 
>> Prompted by [1], which reminded me I started this a while ago, I've now
>> finished off my own attempt at sorting out the horrid lack of rcache
>> scalability. It's become quite clear that given the vast range of system
>> sizes and workloads there is no right size for a fixed depot array, so I
>> reckon we're better off not having one at all.
>>
>> Note that the reclaim threshold and rate are chosen fairly arbitrarily -
> 
> This threshold is the number of online CPUs, right?

Yes, that's nominally half of the current fixed size (based on all the 
performance figures from the original series seemingly coming from a 
16-thread machine, but seemed like a fair compromise. I am of course 
keen to see how real-world testing actually pans out.

>> it's enough of a challenge to get my 4-core dev board with spinning disk
>> and gigabit ethernet to push anything into a depot at all :)
>>
> 
> I have to admit that I was hoping to also see a more aggressive reclaim 
> strategy, where we also trim the per-CPU rcaches when not in use. 
> Leizhen proposed something like this a long time ago.

Don't think I haven't been having various elaborate ideas for making it 
cleverer with multiple thresholds and self-tuning, however I have 
managed to restrain myself ;)

At this point I'm just looking to confirm whether the fundamental 
concepts are sound, and at least no worse than the current behaviour 
(hence keeping it split into 2 distinct patches for the sake of review 
and debugging). If it proves solid then we can absolutely come back and 
go to town on enhancements later.

Cheers,
Robin.

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by John Garry 2 years, 5 months ago

On 15/08/2023 12:11, Robin Murphy wrote:
>>
>> This threshold is the number of online CPUs, right?
> 
> Yes, that's nominally half of the current fixed size (based on all the 
> performance figures from the original series seemingly coming from a 
> 16-thread machine, 

If you are talking about 
https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com/, 
then I think it's a 256-CPU system and the DMA controller has 16 HW 
queues. The 16 HW queues are relevant as the per-completion queue 
interrupt handler runs on a fixed CPU from the set of 16 CPUs in the HW 
queue interrupt handler affinity mask. And what this means is while any 
CPU may alloc an IOVA, only those 16 CPUs handling each HW queue 
interrupt will be free'ing IOVAs.

> but seemed like a fair compromise. I am of course 
> keen to see how real-world testing actually pans out.
> 
>>> it's enough of a challenge to get my 4-core dev board with spinning disk
>>> and gigabit ethernet to push anything into a depot at all 😄
>>>
>>
>> I have to admit that I was hoping to also see a more aggressive 
>> reclaim strategy, where we also trim the per-CPU rcaches when not in 
>> use. Leizhen proposed something like this a long time ago.
> 
> Don't think I haven't been having various elaborate ideas for making it 
> cleverer with multiple thresholds and self-tuning, however I have 
> managed to restrain myself 😉
> 

OK, understood. My main issue WRT scalability is that the total 
cacheable IOVAs (CPU and depot rcache) scales up with the number of 
CPUs, but many DMA controllers have a fixed number of max in-flight 
requests.

Consider a SCSI storage controller on a 256-CPU system. The in-flight 
limit for this example controller is 4096, which would typically never 
be even used up or may not be even usable.

For this device, we need 4096 * 6 [IOVA rcache range] = ~24K cached 
IOVAs if we were to pre-allocate them all - obviously I am ignoring that 
we have the per-CPU rcache for speed and it would not make sense to 
share one set. However, according to current IOVA driver, we can in 
theory cache upto ((256 [CPUs] * 2 [loaded + prev]) + 32 [depot size]) * 
6 [rcache range] * 128 (IOVA per mag) = ~420K IOVAs. That's ~17x what we 
would ever need.

Something like NVMe is different, as its total requests can scale up 
with the CPU count, but only to a limit. I am not sure about network 
controllers.

Anyway, this is just something which I think should be considered - 
which I guess already has been.

> At this point I'm just looking to confirm whether the fundamental 
> concepts are sound, and at least no worse than the current behaviour 
> (hence keeping it split into 2 distinct patches for the sake of review 
> and debugging). If it proves solid then we can absolutely come back and 
> go to town on enhancements later.

Thanks,
John

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by Robin Murphy 2 years, 5 months ago

On 15/08/2023 2:35 pm, John Garry wrote:
> On 15/08/2023 12:11, Robin Murphy wrote:
>>>
>>> This threshold is the number of online CPUs, right?
>>
>> Yes, that's nominally half of the current fixed size (based on all the 
>> performance figures from the original series seemingly coming from a 
>> 16-thread machine, 
> 
> If you are talking about 
> https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com/,

No, I mean the *original* rcache patch submission, and its associated paper:

https://lore.kernel.org/linux-iommu/cover.1461135861.git.mad@cs.technion.ac.il/

> then I think it's a 256-CPU system and the DMA controller has 16 HW queues. The 16 HW queues are relevant as the per-completion queue interrupt handler runs on a fixed CPU from the set of 16 CPUs in the HW queue interrupt handler affinity mask. And what this means is while any CPU may alloc an IOVA, only those 16 CPUs handling each HW queue interrupt will be free'ing IOVAs.
> 
>> but seemed like a fair compromise. I am of course keen to see how 
>> real-world testing actually pans out.
>>
>>>> it's enough of a challenge to get my 4-core dev board with spinning 
>>>> disk
>>>> and gigabit ethernet to push anything into a depot at all 😄
>>>>
>>>
>>> I have to admit that I was hoping to also see a more aggressive 
>>> reclaim strategy, where we also trim the per-CPU rcaches when not in 
>>> use. Leizhen proposed something like this a long time ago.
>>
>> Don't think I haven't been having various elaborate ideas for making 
>> it cleverer with multiple thresholds and self-tuning, however I have 
>> managed to restrain myself 😉
>>
> 
> OK, understood. My main issue WRT scalability is that the total 
> cacheable IOVAs (CPU and depot rcache) scales up with the number of 
> CPUs, but many DMA controllers have a fixed number of max in-flight 
> requests.
> 
> Consider a SCSI storage controller on a 256-CPU system. The in-flight 
> limit for this example controller is 4096, which would typically never 
> be even used up or may not be even usable.
> 
> For this device, we need 4096 * 6 [IOVA rcache range] = ~24K cached 
> IOVAs if we were to pre-allocate them all - obviously I am ignoring that 
> we have the per-CPU rcache for speed and it would not make sense to 
> share one set. However, according to current IOVA driver, we can in 
> theory cache upto ((256 [CPUs] * 2 [loaded + prev]) + 32 [depot size]) * 
> 6 [rcache range] * 128 (IOVA per mag) = ~420K IOVAs. That's ~17x what we 
> would ever need.
> 
> Something like NVMe is different, as its total requests can scale up 
> with the CPU count, but only to a limit. I am not sure about network 
> controllers.

Remember that this threshold only represents a point at which we 
consider the cache to have grown "big enough" to start background 
reclaim - over the short term it is neither an upper nor a lower limit 
on the cache capacity itself. Indeed it will be larger than the working 
set of some workloads, but then it still wants to be enough of a buffer 
to be useful for others which do make big bursts of allocations only 
periodically.

> Anyway, this is just something which I think should be considered - 
> which I guess already has been.

Indeed, I would tend to assume that machines with hundreds of CPUs are 
less likely to be constrained on overall memory and/or IOVA space, so 
tuning for a more responsive cache should be more beneficial than any 
potential wastage is detrimental.

Cheers,
Robin.

Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Posted by John Garry 2 years, 5 months ago

On 16/08/2023 16:10, Robin Murphy wrote:
> On 15/08/2023 2:35 pm, John Garry wrote:
>> On 15/08/2023 12:11, Robin Murphy wrote:
>>>>
>>>> This threshold is the number of online CPUs, right?
>>>
>>> Yes, that's nominally half of the current fixed size (based on all 
>>> the performance figures from the original series seemingly coming 
>>> from a 16-thread machine, 
>>
>> If you are talking about 
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com/__;!!ACWV5N9M2RV99hQ!Op6GUnd7phh1sFyJwVOngmoeyKKbHWbSsNkhPB_7BpG45JFOHmN0HQ0Y7NOZZQ7VduKXaRYCXTta8LjrS99neyg$ ,
> 
> No, I mean the *original* rcache patch submission, and its associated 
> paper:
> 
> https://urldefense.com/v3/__https://lore.kernel.org/linux-iommu/cover.1461135861.git.mad@cs.technion.ac.il/__;!!ACWV5N9M2RV99hQ!Op6GUnd7phh1sFyJwVOngmoeyKKbHWbSsNkhPB_7BpG45JFOHmN0HQ0Y7NOZZQ7VduKXaRYCXTta8LjrOGggfnA$

oh, that one :)

>> then I think it's a 256-CPU system and the DMA controller has 16 HW 
>> queues. The 16 HW queues are relevant as the per-completion queue 
>> interrupt handler runs on a fixed CPU from the set of 16 CPUs in the 
>> HW queue interrupt handler affinity mask. And what this means is while 
>> any CPU may alloc an IOVA, only those 16 CPUs handling each HW queue 
>> interrupt will be free'ing IOVAs.
>>
>>> but seemed like a fair compromise. I am of course keen to see how 
>>> real-world testing actually pans out.
>>>
>>>>> it's enough of a challenge to get my 4-core dev board with spinning 
>>>>> disk
>>>>> and gigabit ethernet to push anything into a depot at all 😄
>>>>>
>>>>
>>>> I have to admit that I was hoping to also see a more aggressive 
>>>> reclaim strategy, where we also trim the per-CPU rcaches when not in 
>>>> use. Leizhen proposed something like this a long time ago.
>>>
>>> Don't think I haven't been having various elaborate ideas for making 
>>> it cleverer with multiple thresholds and self-tuning, however I have 
>>> managed to restrain myself 😉
>>>
>>
>> OK, understood. My main issue WRT scalability is that the total 
>> cacheable IOVAs (CPU and depot rcache) scales up with the number of 
>> CPUs, but many DMA controllers have a fixed number of max in-flight 
>> requests.
>>
>> Consider a SCSI storage controller on a 256-CPU system. The in-flight 
>> limit for this example controller is 4096, which would typically never 
>> be even used up or may not be even usable.
>>
>> For this device, we need 4096 * 6 [IOVA rcache range] = ~24K cached 
>> IOVAs if we were to pre-allocate them all - obviously I am ignoring 
>> that we have the per-CPU rcache for speed and it would not make sense 
>> to share one set. However, according to current IOVA driver, we can in 
>> theory cache upto ((256 [CPUs] * 2 [loaded + prev]) + 32 [depot size]) 
>> * 6 [rcache range] * 128 (IOVA per mag) = ~420K IOVAs. That's ~17x 
>> what we would ever need.
>>
>> Something like NVMe is different, as its total requests can scale up 
>> with the CPU count, but only to a limit. I am not sure about network 
>> controllers.
> 
> Remember that this threshold only represents a point at which we 
> consider the cache to have grown "big enough" to start background 
> reclaim - over the short term it is neither an upper nor a lower limit 
> on the cache capacity itself. Indeed it will be larger than the working 
> set of some workloads, but then it still wants to be enough of a buffer 
> to be useful for others which do make big bursts of allocations only 
> periodically.
> 

It would be interesting to see what zhangzekun finds for this series. He 
was testing on a 5.10-based kernel - things have changed a lot since 
then and I am not really sure what the problem could have been there.

>> Anyway, this is just something which I think should be considered - 
>> which I guess already has been.
> 
> Indeed, I would tend to assume that machines with hundreds of CPUs are 
> less likely to be constrained on overall memory and/or IOVA space, 

Cheers,
John