[PATCH v3 0/2] RDMA/rxe: Fix no completion event issue

Li Zhijian posted 2 patches 3 years, 11 months ago
drivers/infiniband/sw/rxe/rxe_req.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
[PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by Li Zhijian 3 years, 11 months ago
Since RXE always posts RDMA_WRITE successfully, it's observed that
no more completions occur after a few incorrect posts. In fact, polling
blocks. We can easily reproduce it with the pattern below.

a. post correct RDMA_WRITE
b. poll completion event
while true {
  c. post incorrect RDMA_WRITE(wrong rkey for example)
  d. poll completion event <<<< block after 2 incorrect RDMA_WRITE posts
}
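
The pattern above can be modelled with a small, self-contained simulation. This is only a toy sketch, not rxe code: `ToyQP`, `post_write`, `poll_cq` and the `flush_in_error_state` switch are hypothetical names invented for illustration, and the status strings are merely borrowed from the verbs completion status names. The model assumes that the hang comes from requests posted to a QP already in the error state never producing a completion. A non-blocking poll returning `None` stands in for the place where a real blocking poll would hang.

```python
from collections import deque


class ToyQP:
    """Toy requester model (assumption-based; not the rxe implementation)."""

    def __init__(self, flush_in_error_state):
        # Hypothetical switch: True models "error completions are generated
        # for requests posted while the QP is in the error state".
        self.flush_in_error_state = flush_in_error_state
        self.in_error = False
        self.cq = deque()

    def post_write(self, rkey_ok):
        """Posting always succeeds, as in steps (a) and (c) of the pattern."""
        if self.in_error:
            # Assumed behaviour: without a flush the request yields no completion.
            if self.flush_in_error_state:
                self.cq.append("WR_FLUSH_ERR")
        elif rkey_ok:
            self.cq.append("SUCCESS")
        else:
            # The first bad request completes in error and moves the QP to error.
            self.cq.append("REM_ACCESS_ERR")
            self.in_error = True
        return 0

    def poll_cq(self):
        """Non-blocking poll: a status string, or None if the CQ is empty."""
        return self.cq.popleft() if self.cq else None


def reproduce(qp, bad_posts=3):
    """Run the a/b/c/d pattern and return every polled status in order."""
    qp.post_write(rkey_ok=True)           # a. post correct RDMA_WRITE
    seen = [qp.poll_cq()]                 # b. poll completion event
    for _ in range(bad_posts):
        qp.post_write(rkey_ok=False)      # c. post incorrect RDMA_WRITE
        seen.append(qp.poll_cq())         # d. poll completion event
    return seen


if __name__ == "__main__":
    print(reproduce(ToyQP(flush_in_error_state=False)))
    print(reproduce(ToyQP(flush_in_error_state=True)))
```

In this model, with the switch off, the poll after the second incorrect post finds nothing, which corresponds to the blocked poll in step d; with the switch on, every post is answered by a completion.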


Li Zhijian (2):
  RDMA/rxe: Update wqe_index for each wqe error completion
  RDMA/rxe: Generate error completion for error requester QP state

 drivers/infiniband/sw/rxe/rxe_req.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

-- 
2.31.1
Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by lizhijian@fujitsu.com 3 years, 11 months ago
Hi Jason & Yanjun


I know there are still a few regressions in RXE, but I do wish you could take some time to review these *simple bugfix* patches.
They are not related to the regressions.


Thanks
Zhijian


Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by Yanjun Zhu 3 years, 10 months ago
On 2022/6/7 16:32, lizhijian@fujitsu.com wrote:
> Hi Jason & Yanjun
>
>
> I know there are still a few regressions in RXE, but I do wish you could take some time to review these *simple bugfix* patches.
> They are not related to the regressions.

Right now there are some problems reported by Red Hat and other Linux vendors.

We had better focus on these problems.

Zhu Yanjun

Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by Li, Zhijian 3 years, 10 months ago
On 6/25/2022 8:59 PM, Yanjun Zhu wrote:
>
> On 2022/6/7 16:32, lizhijian@fujitsu.com wrote:
>> Hi Jason & Yanjun
>>
>>
>> I know there are still a few regressions in RXE, but I do wish you
>> could take some time to review these *simple bugfix* patches.
>> They are not related to the regressions.
>
> Right now there are some problems reported by Red Hat and other Linux vendors.
>
> We had better focus on these problems.

+ Xiao
I do believe regressions are high priority, and I'm very willing to contribute our efforts to improve the stability of RXE :)
Xiao Yang and I tried to reproduce the issues reported on the mailing list, and we also tried to review the corresponding patches.
However, we didn't find a unified place, something like a bugzilla, that tracks the issues and their status, and most of
them cannot be reproduced in our local environment. So it's a bit hard for us to review/verify the patches, especially
large/complicated ones, if we don't have the use cases.

BTW, IMO we shouldn't stop reviewing fixes other than those for the recent regressions.

Zhijian



Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by yangx.jy@fujitsu.com 3 years, 10 months ago
On 2022/6/26 11:29, Li, Zhijian wrote:
> + Xiao
> I do believe regressions are high priority, and I'm very willing to
> contribute our efforts to improve the stability of RXE :)
> Xiao Yang and I tried to reproduce the issues reported on the mailing
> list, and we also tried to review the corresponding patches.
> However, we didn't find a unified place, something like a bugzilla, that
> tracks the issues and their status, and most of
> them cannot be reproduced in our local environment. So it's a bit hard for
> us to review/verify the patches, especially
> large/complicated ones, if we don't have the use cases.
>
> BTW, IMO we shouldn't stop reviewing fixes other than those for the recent regressions.

Agreed.

Besides, this patch set looks good to me.
Reviewed-by: Xiao Yang <yangx.jy@fujitsu.com>

Best Regards,
Xiao Yang
Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by Jason Gunthorpe 3 years, 10 months ago
On Tue, Jun 07, 2022 at 08:32:40AM +0000, lizhijian@fujitsu.com wrote:
> Hi Jason & Yanjun
>
>
> I know there are still a few regressions in RXE, but I do wish you
> could take some time to review these *simple bugfix* patches.
> They are not related to the regressions.

I would like someone familiar with rxe to ack the datapath changes - I
have a very limited knowledge about rxe.

If that is not forthcoming from others in the rxe community then I
will accept confirmation directly from you that the pyverbs tests and
the blktests scenarios have been run and pass for your changes.

Jason
Re: [PATCH v3 0/2] RDMA/rxe: Fix no completion event issue
Posted by Li, Zhijian 3 years, 10 months ago
On 6/25/2022 7:39 AM, Jason Gunthorpe wrote:
> On Tue, Jun 07, 2022 at 08:32:40AM +0000, lizhijian@fujitsu.com wrote:
>> Hi Jason & Yanjun
>>
>>
>> I know there are still a few regressions in RXE, but I do wish you
>> could take some time to review these *simple bugfix* patches.
>> They are not related to the regressions.
> I would like someone familiar with rxe to ack the datapath changes

Thanks for your feedback.

Haakon Bugge privately reviewed the datapath changes (but not the commit
log) of the V1 patches some weeks ago.

Hey Haakon, could you help review these patches?


>   - I have a very limited knowledge about rxe.
>
> If that is not forthcoming from others in the rxe community then I
> will accept confirmation directly from you that the pyverbs tests and
> the blktests scenarios have been run and pass for your changes.

It's confirmed that the pyverbs tests and the nvme group of blktests
(run with RXE) show no regressions after these changes.

Thanks

Zhijian
