[v2] venus driver fixes to avoid possible OOB read access

[PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Vedang Nagar 11 months, 4 weeks ago

This series primarily adds check at relevant places in venus driver
where there are possible OOB accesses due to unexpected payload
from venus firmware. The patches describes the specific OOB possibility.

Signed-off-by: Vedang Nagar <quic_vnagar@quicinc.com>
---
Changes in v2:
- Decompose sequence change event function. 
- Fix repopulating the packet .with the first read during read_queue.
- Link to v1: https://lore.kernel.org/r/20250104-venus-security-fixes-v1-0-9d0dd4594cb4@quicinc.com

---
Vedang Nagar (2):
      media: venus: fix OOB read issue due to double read
      media: venus: fix OOB access issue while reading sequence changed events

 drivers/media/platform/qcom/venus/hfi_msgs.c  | 72 +++++++++++++++++++++++----
 drivers/media/platform/qcom/venus/hfi_venus.c |  1 +
 2 files changed, 63 insertions(+), 10 deletions(-)
---
base-commit: 91e71d606356e50f238d7a87aacdee4abc427f07
change-id: 20241211-venus-security-fixes-50c22e2564d5

Best regards,
-- 
Vedang Nagar <quic_vnagar@quicinc.com>

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Bryan O'Donoghue 11 months, 3 weeks ago

On 15/02/2025 17:19, Vedang Nagar wrote:
> This series primarily adds check at relevant places in venus driver
> where there are possible OOB accesses due to unexpected payload
> from venus firmware. The patches describes the specific OOB possibility.
> 
> Signed-off-by: Vedang Nagar <quic_vnagar@quicinc.com>
> ---
> Changes in v2:
> - Decompose sequence change event function.
> - Fix repopulating the packet .with the first read during read_queue.
> - Link to v1: https://lore.kernel.org/r/20250104-venus-security-fixes-v1-0-9d0dd4594cb4@quicinc.com
> 
> ---
> Vedang Nagar (2):
>        media: venus: fix OOB read issue due to double read
>        media: venus: fix OOB access issue while reading sequence changed events
> 
>   drivers/media/platform/qcom/venus/hfi_msgs.c  | 72 +++++++++++++++++++++++----
>   drivers/media/platform/qcom/venus/hfi_venus.c |  1 +
>   2 files changed, 63 insertions(+), 10 deletions(-)
> ---
> base-commit: 91e71d606356e50f238d7a87aacdee4abc427f07
> change-id: 20241211-venus-security-fixes-50c22e2564d5
> 
> Best regards,

Could you please address the feedback I gave you / questions posited in 
these two messages ?

4cfc1fe1-2fab-4256-9ce2-b4a0aad1069e@linaro.org

0eab7323-ce86-40c7-9737-06eedcdf492d@linaro.org

The basic question : what is the lifetime of the data from RX interrupt 
to consumption by another system agent, DSP, userspace, whatever ?

Why is it in this small specific window that the data can change but not 
later ? What is the mechanism the data can change and how do the changes 
you propose here address the data lifetime problem ?

Without that context, I don't believe it is really possible to validate 
an additional memcpy() here and there in the code as fixing anything.

---
bod

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Vedang Nagar 11 months, 1 week ago

Hi Bryan,

On 2/16/2025 9:33 PM, Bryan O'Donoghue wrote:
> On 15/02/2025 17:19, Vedang Nagar wrote:
>> This series primarily adds check at relevant places in venus driver
>> where there are possible OOB accesses due to unexpected payload
>> from venus firmware. The patches describes the specific OOB possibility.
>>
>> Signed-off-by: Vedang Nagar <quic_vnagar@quicinc.com>
>> ---
>> Changes in v2:
>> - Decompose sequence change event function.
>> - Fix repopulating the packet .with the first read during read_queue.
>> - Link to v1: https://lore.kernel.org/r/20250104-venus-security-fixes- 
>> v1-0-9d0dd4594cb4@quicinc.com
>>
>> ---
>> Vedang Nagar (2):
>>        media: venus: fix OOB read issue due to double read
>>        media: venus: fix OOB access issue while reading sequence 
>> changed events
>>
>>   drivers/media/platform/qcom/venus/hfi_msgs.c  | 72 +++++++++++++++++ 
>> ++++++----
>>   drivers/media/platform/qcom/venus/hfi_venus.c |  1 +
>>   2 files changed, 63 insertions(+), 10 deletions(-)
>> ---
>> base-commit: 91e71d606356e50f238d7a87aacdee4abc427f07
>> change-id: 20241211-venus-security-fixes-50c22e2564d5
>>
>> Best regards,
> 
> Could you please address the feedback I gave you / questions posited in 
> these two messages ?
> 
> 4cfc1fe1-2fab-4256-9ce2-b4a0aad1069e@linaro.org
> 
> 0eab7323-ce86-40c7-9737-06eedcdf492d@linaro.org
> 
> The basic question : what is the lifetime of the data from RX interrupt 
> to consumption by another system agent, DSP, userspace, whatever ?
As mentioned in [1], With the regular firmware, after RX interrupt the 
data can be considered as valid until next interrupt is raised, but with 
the rouge firmware, data can get invalid during the second read and our 
intention is to avoid out of bound access read because of such issues.

[1]: 
https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
> 
> Why is it in this small specific window that the data can change but not 
> later ? What is the mechanism the data can change and how do the changes 
> you propose here address the data lifetime problem ?
Currently this issue has been discovered by external researchers at this 
point, but if any such OOB issue is discovered at later point as well 
then we shall fix them as well.

Also, with rougue firmware we cannot fix the data lifetime problem in my 
opinion, but atleast we can fix the out of bound issues.
> 
> Without that context, I don't believe it is really possible to validate 
> an additional memcpy() here and there in the code as fixing anything.
There is no additional memcpy() now in the v2 patch, but as part of the 
fix, we are just trying to retain the length of the packet which was 
being read in the first memcpy() to avoid the OOB read access.

Please let me know if you have any other suggestions.

Regards,
Vedang Nagar
> 
> ---
> bod

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Bryan O'Donoghue 11 months, 1 week ago

On 02/03/2025 11:58, Vedang Nagar wrote:
>>
>> The basic question : what is the lifetime of the data from RX 
>> interrupt to consumption by another system agent, DSP, userspace, 
>> whatever ?
> As mentioned in [1], With the regular firmware, after RX interrupt the 
> data can be considered as valid until next interrupt is raised, but with 
> the rouge firmware, data can get invalid during the second read and our 
> intention is to avoid out of bound access read because of such issues.

This is definitely the part I don't compute.

1. RX interrupt
2. Frame#0 Some amount of time data is always valid
3. RX interrupt - new data
4. Frame#1 new data delivered into a buffer

Are you describing a case between RX interrupts 1-3 or a case after 1-4?

Why do we need to write code for rouge firmware anyway ?

And the real question - if the data can be invalidated in the 1-3 window 
above when is the safe time to snapshot that data ?

We seem to have alot of submissions to deal with 'rouge' firmware 
without I think properly describing the problem of the _expected_ data 
lifetime.

So

a) What is the expected data lifetime of an RX buffer between one
    RX IRQ and the next ?
    I hope the answer to this is - APSS owns the buffer.
    This is BTW usually the case in these types of asymmetric setups
    with a flag or some other kind of semaphore that indicates which
    side of the data-exchange owns the buffer.

b) In this rouge - buggy - firmware case what is the scope of the
    potential race condition ?

    What I'd really like to know here is why we have to seemingly
    memcpy() again and again in seemingly incongrous and not
    immediately obvious places in the code.

    Would we not be better advised to do a memcpy() of the entire
    RX frame in the RX IRQ handler path if as you appear to me
    suggesting - the firmware can "race" with the APSS
    i.e. the data-buffer ownership flag either doesn't work
    or isn't respected by one side in the data-exchange.

Can we please have a detailed description of the race condition here ?

I don't doubt the new memcpy() makes sense to you but without this 
detailed understanding of the underlying problem its virtually 
impossible to debate the appropriate remediation - perhaps this patch 
you've submitted - or some other solution.

Sorry to dig into my trench here but, way more detail is needed.

> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2- 
> b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>
>> Why is it in this small specific window that the data can change but 
>> not later ? What is the mechanism the data can change and how do the 
>> changes you propose here address the data lifetime problem ?
> Currently this issue has been discovered by external researchers at this 
> point, but if any such OOB issue is discovered at later point as well 
> then we shall fix them as well.

Right but, I'm looking for a detailed description of the problem.

Can you describe from RX interrupt again what the expected data lifetime 
of the RX frame is, which I hope we agree is until the next RX interrupt 
associated with a given buffer with an ownership flag shared between 
firmware and APSS - and then under what circumstances that "software 
contract" is being violated.

> Also, with rougue firmware we cannot fix the data lifetime problem in my 
> opinion, but atleast we can fix the out of bound issues.
>>
>> Without that context, I don't believe it is really possible to 
>> validate an additional memcpy() here and there in the code as fixing 
>> anything.
> There is no additional memcpy() now in the v2 patch, but as part of the 
> fix, we are just trying to retain the length of the packet which was 
> being read in the first memcpy() to avoid the OOB read access.

I can't make a suggestion because - personally speaking I still don't 
quite understand the data-race you are describing.

I get that you say the firmware is breaking the contract but, without 
more detail on _how_ it breaks that contract I don't think it's really 
possible to validate your fix here, fixes anything.

---
bod

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Vikash Garodia 11 months, 1 week ago

On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>
>>> The basic question : what is the lifetime of the data from RX interrupt to
>>> consumption by another system agent, DSP, userspace, whatever ?
>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>> can be considered as valid until next interrupt is raised, but with the rouge
>> firmware, data can get invalid during the second read and our intention is to
>> avoid out of bound access read because of such issues.
> 
> This is definitely the part I don't compute.
> 
> 1. RX interrupt
> 2. Frame#0 Some amount of time data is always valid
This is not correct. Its not the amount of time which determines the validity of
the data, its the possibility of rogue firmware which, if incase, puts up the
date in shared queue, would always be invalid, irrespective of time.

> 3. RX interrupt - new data
> 4. Frame#1 new data delivered into a buffer
> 
> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
> 
> Why do we need to write code for rouge firmware anyway ?
It is a way to prevent any possibility of OOB, similar to how any API does check
for validity of any arguments passed to it, prior to processing.
> 
> And the real question - if the data can be invalidated in the 1-3 window above
> when is the safe time to snapshot that data ?
> 
> We seem to have alot of submissions to deal with 'rouge' firmware without I
> think properly describing the problem of the _expected_ data lifetime.
> 
> So
> 
> a) What is the expected data lifetime of an RX buffer between one
>    RX IRQ and the next ?
>    I hope the answer to this is - APSS owns the buffer.
>    This is BTW usually the case in these types of asymmetric setups
>    with a flag or some other kind of semaphore that indicates which
>    side of the data-exchange owns the buffer.
> 
> b) In this rouge - buggy - firmware case what is the scope of the
>    potential race condition ?
> 
>    What I'd really like to know here is why we have to seemingly
>    memcpy() again and again in seemingly incongrous and not
>    immediately obvious places in the code.
> 
>    Would we not be better advised to do a memcpy() of the entire
>    RX frame in the RX IRQ handler path if as you appear to me
>    suggesting - the firmware can "race" with the APSS
>    i.e. the data-buffer ownership flag either doesn't work
>    or isn't respected by one side in the data-exchange.
> 
> Can we please have a detailed description of the race condition here ?
Below is the report which the reporter reported leading to OOB, let me know if
you are unable to deduce the trail leading to OOB here.

OOB read issue is in function event_seq_changed, please reference below code
snippet:

Buggy code snippet:

static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
        struct hfi_msg_event_notify_pkt *pkt)
...
num_properties_changed = pkt->event_data2; //num_properties_changed is from
message and is not validated.
...
data_ptr = (u8 *)&pkt->ext_event_data[0];
do {
 ptype = *((u32 *)data_ptr);
 switch (ptype) {
 case HFI_PROPERTY_PARAM_FRAME_SIZE:
  data_ptr += sizeof(u32);
  frame_sz = (struct hfi_framesize *)data_ptr;
  event.width = frame_sz->width;
...
 }
 num_properties_changed--;
} while (num_properties_changed > 0);
```
There is no validation against `num_properties_changed = pkt->event_data2`, so
OOB read occurs.
> 
> I don't doubt the new memcpy() makes sense to you but without this detailed
> understanding of the underlying problem its virtually impossible to debate the
> appropriate remediation - perhaps this patch you've submitted - or some other
> solution.
> 
> Sorry to dig into my trench here but, way more detail is needed.
> 
>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>> b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>
>>> Why is it in this small specific window that the data can change but not
>>> later ? What is the mechanism the data can change and how do the changes you
>>> propose here address the data lifetime problem ?
>> Currently this issue has been discovered by external researchers at this
>> point, but if any such OOB issue is discovered at later point as well then we
>> shall fix them as well.
> 
> Right but, I'm looking for a detailed description of the problem.
> 
> Can you describe from RX interrupt again what the expected data lifetime of the
> RX frame is, which I hope we agree is until the next RX interrupt associated
> with a given buffer with an ownership flag shared between firmware and APSS -
> and then under what circumstances that "software contract" is being violated.
> 
>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>> opinion, but atleast we can fix the out of bound issues.
>>>
>>> Without that context, I don't believe it is really possible to validate an
>>> additional memcpy() here and there in the code as fixing anything.
>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>> we are just trying to retain the length of the packet which was being read in
>> the first memcpy() to avoid the OOB read access.
> 
> I can't make a suggestion because - personally speaking I still don't quite
> understand the data-race you are describing.
Go through the reports from the reporter, it was quite evident in leading upto
OOB case.
Putting up the sequence for you to go over the interrupt handling and message
queue parsing of the packets from firmware
1.
https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
2.
https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
3. event handling (this particular case)
https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
4.
https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22

the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
shared queue.

> 
> I get that you say the firmware is breaking the contract but, without more
> detail on _how_ it breaks that contract I don't think it's really possible to
> validate your fix here, fixes anything.
> 
> ---
> bod

Regards,
Vikash

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Bryan O'Donoghue 11 months, 1 week ago

On 03/03/2025 13:12, Vikash Garodia wrote:
> 
> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
>> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>>
>>>> The basic question : what is the lifetime of the data from RX interrupt to
>>>> consumption by another system agent, DSP, userspace, whatever ?
>>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>>> can be considered as valid until next interrupt is raised, but with the rouge
>>> firmware, data can get invalid during the second read and our intention is to
>>> avoid out of bound access read because of such issues.
>>
>> This is definitely the part I don't compute.
>>
>> 1. RX interrupt
>> 2. Frame#0 Some amount of time data is always valid
> This is not correct. Its not the amount of time which determines the validity of
> the data, its the possibility of rogue firmware which, if incase, puts up the
> date in shared queue, would always be invalid, irrespective of time.
> 
>> 3. RX interrupt - new data
>> 4. Frame#1 new data delivered into a buffer
>>
>> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
>>
>> Why do we need to write code for rouge firmware anyway ?
> It is a way to prevent any possibility of OOB, similar to how any API does check
> for validity of any arguments passed to it, prior to processing.
>>
>> And the real question - if the data can be invalidated in the 1-3 window above
>> when is the safe time to snapshot that data ?
>>
>> We seem to have alot of submissions to deal with 'rouge' firmware without I
>> think properly describing the problem of the _expected_ data lifetime.
>>
>> So
>>
>> a) What is the expected data lifetime of an RX buffer between one
>>     RX IRQ and the next ?
>>     I hope the answer to this is - APSS owns the buffer.
>>     This is BTW usually the case in these types of asymmetric setups
>>     with a flag or some other kind of semaphore that indicates which
>>     side of the data-exchange owns the buffer.
>>
>> b) In this rouge - buggy - firmware case what is the scope of the
>>     potential race condition ?
>>
>>     What I'd really like to know here is why we have to seemingly
>>     memcpy() again and again in seemingly incongrous and not
>>     immediately obvious places in the code.
>>
>>     Would we not be better advised to do a memcpy() of the entire
>>     RX frame in the RX IRQ handler path if as you appear to me
>>     suggesting - the firmware can "race" with the APSS
>>     i.e. the data-buffer ownership flag either doesn't work
>>     or isn't respected by one side in the data-exchange.
>>
>> Can we please have a detailed description of the race condition here ?
> Below is the report which the reporter reported leading to OOB, let me know if
> you are unable to deduce the trail leading to OOB here.
> 
> OOB read issue is in function event_seq_changed, please reference below code
> snippet:
> 
> Buggy code snippet:
> 
> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
>          struct hfi_msg_event_notify_pkt *pkt)
> ...
> num_properties_changed = pkt->event_data2; //num_properties_changed is from
> message and is not validated.
> ...
> data_ptr = (u8 *)&pkt->ext_event_data[0];
> do {
>   ptype = *((u32 *)data_ptr);
>   switch (ptype) {
>   case HFI_PROPERTY_PARAM_FRAME_SIZE:
>    data_ptr += sizeof(u32);
>    frame_sz = (struct hfi_framesize *)data_ptr;
>    event.width = frame_sz->width;
> ...
>   }
>   num_properties_changed--;
> } while (num_properties_changed > 0);
> ```
> There is no validation against `num_properties_changed = pkt->event_data2`, so
> OOB read occurs.
>>
>> I don't doubt the new memcpy() makes sense to you but without this detailed
>> understanding of the underlying problem its virtually impossible to debate the
>> appropriate remediation - perhaps this patch you've submitted - or some other
>> solution.
>>
>> Sorry to dig into my trench here but, way more detail is needed.
>>
>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>>> b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>>
>>>> Why is it in this small specific window that the data can change but not
>>>> later ? What is the mechanism the data can change and how do the changes you
>>>> propose here address the data lifetime problem ?
>>> Currently this issue has been discovered by external researchers at this
>>> point, but if any such OOB issue is discovered at later point as well then we
>>> shall fix them as well.
>>
>> Right but, I'm looking for a detailed description of the problem.
>>
>> Can you describe from RX interrupt again what the expected data lifetime of the
>> RX frame is, which I hope we agree is until the next RX interrupt associated
>> with a given buffer with an ownership flag shared between firmware and APSS -
>> and then under what circumstances that "software contract" is being violated.
>>
>>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>>> opinion, but atleast we can fix the out of bound issues.
>>>>
>>>> Without that context, I don't believe it is really possible to validate an
>>>> additional memcpy() here and there in the code as fixing anything.
>>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>>> we are just trying to retain the length of the packet which was being read in
>>> the first memcpy() to avoid the OOB read access.
>>
>> I can't make a suggestion because - personally speaking I still don't quite
>> understand the data-race you are describing.
> Go through the reports from the reporter, it was quite evident in leading upto
> OOB case.
> Putting up the sequence for you to go over the interrupt handling and message
> queue parsing of the packets from firmware
> 1.
> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
> 2.
> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
> 3. event handling (this particular case)
> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
> 4.
> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22
> 
> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
> shared queue.
> 
>>
>> I get that you say the firmware is breaking the contract but, without more
>> detail on _how_ it breaks that contract I don't think it's really possible to
>> validate your fix here, fixes anything.
>>
>> ---
>> bod
> 
> Regards,
> Vikash

I'll go through all of these links given here, thanks.

Whatever the result of the review, this detail needs to go into the 
commit log so that a reviewer can reasonably read the problem 
description and evaluate against submitted code as a fix.

---
bod

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Vikash Garodia 11 months, 1 week ago

On 3/3/2025 8:26 PM, Bryan O'Donoghue wrote:
> On 03/03/2025 13:12, Vikash Garodia wrote:
>>
>> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
>>> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>>>
>>>>> The basic question : what is the lifetime of the data from RX interrupt to
>>>>> consumption by another system agent, DSP, userspace, whatever ?
>>>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>>>> can be considered as valid until next interrupt is raised, but with the rouge
>>>> firmware, data can get invalid during the second read and our intention is to
>>>> avoid out of bound access read because of such issues.
>>>
>>> This is definitely the part I don't compute.
>>>
>>> 1. RX interrupt
>>> 2. Frame#0 Some amount of time data is always valid
>> This is not correct. Its not the amount of time which determines the validity of
>> the data, its the possibility of rogue firmware which, if incase, puts up the
>> date in shared queue, would always be invalid, irrespective of time.
>>
>>> 3. RX interrupt - new data
>>> 4. Frame#1 new data delivered into a buffer
>>>
>>> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
>>>
>>> Why do we need to write code for rouge firmware anyway ?
>> It is a way to prevent any possibility of OOB, similar to how any API does check
>> for validity of any arguments passed to it, prior to processing.
>>>
>>> And the real question - if the data can be invalidated in the 1-3 window above
>>> when is the safe time to snapshot that data ?
>>>
>>> We seem to have alot of submissions to deal with 'rouge' firmware without I
>>> think properly describing the problem of the _expected_ data lifetime.
>>>
>>> So
>>>
>>> a) What is the expected data lifetime of an RX buffer between one
>>>     RX IRQ and the next ?
>>>     I hope the answer to this is - APSS owns the buffer.
>>>     This is BTW usually the case in these types of asymmetric setups
>>>     with a flag or some other kind of semaphore that indicates which
>>>     side of the data-exchange owns the buffer.
>>>
>>> b) In this rouge - buggy - firmware case what is the scope of the
>>>     potential race condition ?
>>>
>>>     What I'd really like to know here is why we have to seemingly
>>>     memcpy() again and again in seemingly incongrous and not
>>>     immediately obvious places in the code.
>>>
>>>     Would we not be better advised to do a memcpy() of the entire
>>>     RX frame in the RX IRQ handler path if as you appear to me
>>>     suggesting - the firmware can "race" with the APSS
>>>     i.e. the data-buffer ownership flag either doesn't work
>>>     or isn't respected by one side in the data-exchange.
>>>
>>> Can we please have a detailed description of the race condition here ?
>> Below is the report which the reporter reported leading to OOB, let me know if
>> you are unable to deduce the trail leading to OOB here.
>>
>> OOB read issue is in function event_seq_changed, please reference below code
>> snippet:
>>
>> Buggy code snippet:
>>
>> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
>>          struct hfi_msg_event_notify_pkt *pkt)
>> ...
>> num_properties_changed = pkt->event_data2; //num_properties_changed is from
>> message and is not validated.
>> ...
>> data_ptr = (u8 *)&pkt->ext_event_data[0];
>> do {
>>   ptype = *((u32 *)data_ptr);
>>   switch (ptype) {
>>   case HFI_PROPERTY_PARAM_FRAME_SIZE:
>>    data_ptr += sizeof(u32);
>>    frame_sz = (struct hfi_framesize *)data_ptr;
>>    event.width = frame_sz->width;
>> ...
>>   }
>>   num_properties_changed--;
>> } while (num_properties_changed > 0);
>> ```
>> There is no validation against `num_properties_changed = pkt->event_data2`, so
>> OOB read occurs.
>>>
>>> I don't doubt the new memcpy() makes sense to you but without this detailed
>>> understanding of the underlying problem its virtually impossible to debate the
>>> appropriate remediation - perhaps this patch you've submitted - or some other
>>> solution.
>>>
>>> Sorry to dig into my trench here but, way more detail is needed.
>>>
>>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>>>> b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>>>
>>>>> Why is it in this small specific window that the data can change but not
>>>>> later ? What is the mechanism the data can change and how do the changes you
>>>>> propose here address the data lifetime problem ?
>>>> Currently this issue has been discovered by external researchers at this
>>>> point, but if any such OOB issue is discovered at later point as well then we
>>>> shall fix them as well.
>>>
>>> Right but, I'm looking for a detailed description of the problem.
>>>
>>> Can you describe from RX interrupt again what the expected data lifetime of the
>>> RX frame is, which I hope we agree is until the next RX interrupt associated
>>> with a given buffer with an ownership flag shared between firmware and APSS -
>>> and then under what circumstances that "software contract" is being violated.
>>>
>>>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>>>> opinion, but atleast we can fix the out of bound issues.
>>>>>
>>>>> Without that context, I don't believe it is really possible to validate an
>>>>> additional memcpy() here and there in the code as fixing anything.
>>>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>>>> we are just trying to retain the length of the packet which was being read in
>>>> the first memcpy() to avoid the OOB read access.
>>>
>>> I can't make a suggestion because - personally speaking I still don't quite
>>> understand the data-race you are describing.
>> Go through the reports from the reporter, it was quite evident in leading upto
>> OOB case.
>> Putting up the sequence for you to go over the interrupt handling and message
>> queue parsing of the packets from firmware
>> 1.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
>> 2.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
>> 3. event handling (this particular case)
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
>> 4.
>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22
>>
>> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
>> shared queue.
>>
>>>
>>> I get that you say the firmware is breaking the contract but, without more
>>> detail on _how_ it breaks that contract I don't think it's really possible to
>>> validate your fix here, fixes anything.
>>>
>>> ---
>>> bod
>>
>> Regards,
>> Vikash
> 
> I'll go through all of these links given here, thanks.
I would request you to go through the description putup by the reporter of this
OOB as well, i added in my earlier response. It provided a good background of
how the firmware response can led to this particular OOB, atleast that was the
source of OOB info for us.
Regards,
Vikash
> 
> Whatever the result of the review, this detail needs to go into the commit log
> so that a reviewer can reasonably read the problem description and evaluate
> against submitted code as a fix.
> 
> ---
> bod

Re: [PATCH v2 0/2] venus driver fixes to avoid possible OOB read access

Posted by Bryan O'Donoghue 11 months, 1 week ago

On 03/03/2025 15:01, Vikash Garodia wrote:
> 
> On 3/3/2025 8:26 PM, Bryan O'Donoghue wrote:
>> On 03/03/2025 13:12, Vikash Garodia wrote:
>>>
>>> On 3/2/2025 9:26 PM, Bryan O'Donoghue wrote:
>>>> On 02/03/2025 11:58, Vedang Nagar wrote:
>>>>>>
>>>>>> The basic question : what is the lifetime of the data from RX interrupt to
>>>>>> consumption by another system agent, DSP, userspace, whatever ?
>>>>> As mentioned in [1], With the regular firmware, after RX interrupt the data
>>>>> can be considered as valid until next interrupt is raised, but with the rouge
>>>>> firmware, data can get invalid during the second read and our intention is to
>>>>> avoid out of bound access read because of such issues.
>>>>
>>>> This is definitely the part I don't compute.
>>>>
>>>> 1. RX interrupt
>>>> 2. Frame#0 Some amount of time data is always valid
>>> This is not correct. Its not the amount of time which determines the validity of
>>> the data, its the possibility of rogue firmware which, if incase, puts up the
>>> date in shared queue, would always be invalid, irrespective of time.
>>>
>>>> 3. RX interrupt - new data
>>>> 4. Frame#1 new data delivered into a buffer
>>>>
>>>> Are you describing a case between RX interrupts 1-3 or a case after 1-4?
>>>>
>>>> Why do we need to write code for rouge firmware anyway ?
>>> It is a way to prevent any possibility of OOB, similar to how any API does check
>>> for validity of any arguments passed to it, prior to processing.
>>>>
>>>> And the real question - if the data can be invalidated in the 1-3 window above
>>>> when is the safe time to snapshot that data ?
>>>>
>>>> We seem to have alot of submissions to deal with 'rouge' firmware without I
>>>> think properly describing the problem of the _expected_ data lifetime.
>>>>
>>>> So
>>>>
>>>> a) What is the expected data lifetime of an RX buffer between one
>>>>      RX IRQ and the next ?
>>>>      I hope the answer to this is - APSS owns the buffer.
>>>>      This is BTW usually the case in these types of asymmetric setups
>>>>      with a flag or some other kind of semaphore that indicates which
>>>>      side of the data-exchange owns the buffer.
>>>>
>>>> b) In this rouge - buggy - firmware case what is the scope of the
>>>>      potential race condition ?
>>>>
>>>>      What I'd really like to know here is why we have to seemingly
>>>>      memcpy() again and again in seemingly incongrous and not
>>>>      immediately obvious places in the code.
>>>>
>>>>      Would we not be better advised to do a memcpy() of the entire
>>>>      RX frame in the RX IRQ handler path if as you appear to me
>>>>      suggesting - the firmware can "race" with the APSS
>>>>      i.e. the data-buffer ownership flag either doesn't work
>>>>      or isn't respected by one side in the data-exchange.
>>>>
>>>> Can we please have a detailed description of the race condition here ?
>>> Below is the report which the reporter reported leading to OOB, let me know if
>>> you are unable to deduce the trail leading to OOB here.
>>>
>>> OOB read issue is in function event_seq_changed, please reference below code
>>> snippet:
>>>
>>> Buggy code snippet:
>>>
>>> static void event_seq_changed(struct venus_core *core, struct venus_inst *inst,
>>>           struct hfi_msg_event_notify_pkt *pkt)
>>> ...
>>> num_properties_changed = pkt->event_data2; //num_properties_changed is from
>>> message and is not validated.
>>> ...
>>> data_ptr = (u8 *)&pkt->ext_event_data[0];
>>> do {
>>>    ptype = *((u32 *)data_ptr);
>>>    switch (ptype) {
>>>    case HFI_PROPERTY_PARAM_FRAME_SIZE:
>>>     data_ptr += sizeof(u32);
>>>     frame_sz = (struct hfi_framesize *)data_ptr;
>>>     event.width = frame_sz->width;
>>> ...
>>>    }
>>>    num_properties_changed--;
>>> } while (num_properties_changed > 0);
>>> ```
>>> There is no validation against `num_properties_changed = pkt->event_data2`, so
>>> OOB read occurs.
>>>>
>>>> I don't doubt the new memcpy() makes sense to you but without this detailed
>>>> understanding of the underlying problem its virtually impossible to debate the
>>>> appropriate remediation - perhaps this patch you've submitted - or some other
>>>> solution.
>>>>
>>>> Sorry to dig into my trench here but, way more detail is needed.
>>>>
>>>>> [1]: https://lore.kernel.org/lkml/4cfc1fe1-2fab-4256-9ce2-
>>>>> b4a0aad1069e@linaro.org/T/#m5f1737b16e68f8b8fc1d75517356b6566d0ec619
>>>>>>
>>>>>> Why is it in this small specific window that the data can change but not
>>>>>> later ? What is the mechanism the data can change and how do the changes you
>>>>>> propose here address the data lifetime problem ?
>>>>> Currently this issue has been discovered by external researchers at this
>>>>> point, but if any such OOB issue is discovered at later point as well then we
>>>>> shall fix them as well.
>>>>
>>>> Right but, I'm looking for a detailed description of the problem.
>>>>
>>>> Can you describe from RX interrupt again what the expected data lifetime of the
>>>> RX frame is, which I hope we agree is until the next RX interrupt associated
>>>> with a given buffer with an ownership flag shared between firmware and APSS -
>>>> and then under what circumstances that "software contract" is being violated.
>>>>
>>>>> Also, with rougue firmware we cannot fix the data lifetime problem in my
>>>>> opinion, but atleast we can fix the out of bound issues.
>>>>>>
>>>>>> Without that context, I don't believe it is really possible to validate an
>>>>>> additional memcpy() here and there in the code as fixing anything.
>>>>> There is no additional memcpy() now in the v2 patch, but as part of the fix,
>>>>> we are just trying to retain the length of the packet which was being read in
>>>>> the first memcpy() to avoid the OOB read access.
>>>>
>>>> I can't make a suggestion because - personally speaking I still don't quite
>>>> understand the data-race you are describing.
>>> Go through the reports from the reporter, it was quite evident in leading upto
>>> OOB case.
>>> Putting up the sequence for you to go over the interrupt handling and message
>>> queue parsing of the packets from firmware
>>> 1.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_venus.c#L1082
>>> 2.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L816
>>> 3. event handling (this particular case)
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L658
>>> 4.
>>> https://elixir.bootlin.com/linux/v6.14-rc4/source/drivers/media/platform/qcom/venus/hfi_msgs.c#L22
>>>
>>> the "struct hfi_msg_event_notify_pkt *pkt" pkt here is having the data read from
>>> shared queue.
>>>
>>>>
>>>> I get that you say the firmware is breaking the contract but, without more
>>>> detail on _how_ it breaks that contract I don't think it's really possible to
>>>> validate your fix here, fixes anything.
>>>>
>>>> ---
>>>> bod
>>>
>>> Regards,
>>> Vikash
>>
>> I'll go through all of these links given here, thanks.
> I would request you to go through the description putup by the reporter of this
> OOB as well, i added in my earlier response. It provided a good background of
> how the firmware response can led to this particular OOB, atleast that was the
> source of OOB info for us.
> Regards,
> Vikash
>>
>> Whatever the result of the review, this detail needs to go into the commit log
>> so that a reviewer can reasonably read the problem description and evaluate
>> against submitted code as a fix.
>>
>> ---
>> bod

Replied to the wrong patch.

There is no memcpy() in this patch - there was in patch #1 which has 
subsequently been dropped.

Anyway I'll address my comment there.

---
bod