> > > I have met with the team from IONOS about their testing on actual IB hardware here at KVM Forum today, and the requirements are starting to make more sense to me. I didn't say much in our previous thread because I misunderstood the requirements, so let me try to explain and see if we're all on the same page. There appears to be a fundamental limitation here with rsocket, which I don't see any way to overcome.
> > >
> > > The basic problem is that rsocket is trying to present a stream abstraction, a concept that is fundamentally incompatible with RDMA. The whole point of using RDMA in the first place is to avoid using the CPU, and to do that, all of the memory (potentially hundreds of gigabytes) needs to be registered with the hardware *in advance* (this is how the original implementation works).
> > >
> > > The need to fake a socket/bytestream abstraction eventually breaks down => there is a limit (a few GB) in rsocket (which the IONOS team previously reported in testing; see that email). It appears that means rsocket is only going to be able to map a certain limited amount of memory with the hardware until its internal "buffer" runs out, before it can then unmap and remap the next batch of memory with the hardware to continue along with the fake bytestream. This is very much sticking a square peg in a round hole. If you were to "relax" the rsocket implementation to register the entire VM memory space (as my original implementation does), then there wouldn't be any need for rsocket in the first place.
>
> Yes, some test like this can be helpful.
>
> And thanks for the summary. That's definitely helpful.
>
> One question from my side (as someone who knows nothing about RDMA/rsocket): is that "a few GBs" limitation a software guard? Would it be possible for rsocket to provide an option that lets users opt in to setting that value, so that it might work for the VM use case? Would that consume similar resources vs. the current QEMU impl while allowing it to use rsockets with no perf regressions?

Rsockets emulates the streaming socket API. The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting. It is also configurable via rsetsockopt() SO_SNDBUF. Both of those are similar to TCP settings. The SW field used to store this value is 32 bits.

This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into asynchronous RDMA transfers. Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.

Does your kernel allocate > 4 GBs of buffer space to an individual socket?
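To make the knob Sean describes concrete, the buffer size can be driven through librdmacm's rsocket API roughly as follows. This is a minimal sketch, not production code: the wrapper name is made up, the 1 GiB value is only illustrative, and because the field holding the value is 32 bits, sizes of 4 GB or more cannot be expressed.

/*
 * Minimal sketch: raising the rsocket bounce-buffer size via SO_SNDBUF,
 * per Sean's description. Assumes librdmacm's rsocket API (rdma/rsocket.h)
 * and an already-created rsocket descriptor `rs`. Link with -lrdmacm.
 */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <stdio.h>

int tune_rsocket_sndbuf(int rs)
{
    int sndbuf = 1 << 30;   /* 1 GiB; the 32-bit field caps this below 4 GB */

    if (rsetsockopt(rs, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0) {
        perror("rsetsockopt(SO_SNDBUF)");
        return -1;
    }
    return 0;
}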
On 9/27/24 16:45, Sean Hefty wrote:
>>>> I have met with the team from IONOS about their testing on actual IB hardware here at KVM Forum today, and the requirements are starting to make more sense to me. I didn't say much in our previous thread because I misunderstood the requirements, so let me try to explain and see if we're all on the same page. There appears to be a fundamental limitation here with rsocket, which I don't see any way to overcome.
>>>>
>>>> The basic problem is that rsocket is trying to present a stream abstraction, a concept that is fundamentally incompatible with RDMA. The whole point of using RDMA in the first place is to avoid using the CPU, and to do that, all of the memory (potentially hundreds of gigabytes) needs to be registered with the hardware *in advance* (this is how the original implementation works).
>>>>
>>>> The need to fake a socket/bytestream abstraction eventually breaks down => there is a limit (a few GB) in rsocket (which the IONOS team previously reported in testing; see that email). It appears that means rsocket is only going to be able to map a certain limited amount of memory with the hardware until its internal "buffer" runs out, before it can then unmap and remap the next batch of memory with the hardware to continue along with the fake bytestream. This is very much sticking a square peg in a round hole. If you were to "relax" the rsocket implementation to register the entire VM memory space (as my original implementation does), then there wouldn't be any need for rsocket in the first place.
>>
>> Yes, some test like this can be helpful.
>>
>> And thanks for the summary. That's definitely helpful.
>>
>> One question from my side (as someone who knows nothing about RDMA/rsocket): is that "a few GBs" limitation a software guard? Would it be possible for rsocket to provide an option that lets users opt in to setting that value, so that it might work for the VM use case? Would that consume similar resources vs. the current QEMU impl while allowing it to use rsockets with no perf regressions?
>
> Rsockets emulates the streaming socket API. The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting. It is also configurable via rsetsockopt() SO_SNDBUF. Both of those are similar to TCP settings. The SW field used to store this value is 32 bits.
>
> This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into asynchronous RDMA transfers. Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.

Understood.

> Does your kernel allocate > 4 GBs of buffer space to an individual socket?

Yes, it absolutely does. We're dealing with virtual machines here, right? It is possible (and likely) to have a virtual machine that is hundreds of GBs of RAM in size.

A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
I'm sure rsocket has its place with much smaller transfer sizes, but this is very different. - Michael
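As a rough illustration of the "register everything in advance" model Michael describes, a single libibverbs registration can cover the whole guest RAM block. This is only a sketch under stated assumptions: the helper name and the protection-domain/guest-RAM parameters are illustrative, not QEMU's actual code.

/*
 * Sketch only: pin the entire guest RAM region with the HCA once, so the
 * NIC can later RDMA-write any page without per-chunk map/unmap or CPU
 * copies. Assumes a protection domain and guest RAM pointer/size obtained
 * elsewhere. Link with -libverbs.
 */
#include <infiniband/verbs.h>
#include <stddef.h>

struct ibv_mr *register_guest_ram(struct ibv_pd *pd, void *ram, size_t len)
{
    /* One registration for the whole region, possibly hundreds of GB. */
    return ibv_reg_mr(pd, ram, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}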
On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> On 9/27/24 16:45, Sean Hefty wrote:
> > > > > I have met with the team from IONOS about their testing on actual IB hardware here at KVM Forum today, and the requirements are starting to make more sense to me. I didn't say much in our previous thread because I misunderstood the requirements, so let me try to explain and see if we're all on the same page. There appears to be a fundamental limitation here with rsocket, which I don't see any way to overcome.
> > > > >
> > > > > The basic problem is that rsocket is trying to present a stream abstraction, a concept that is fundamentally incompatible with RDMA. The whole point of using RDMA in the first place is to avoid using the CPU, and to do that, all of the memory (potentially hundreds of gigabytes) needs to be registered with the hardware *in advance* (this is how the original implementation works).
> > > > >
> > > > > The need to fake a socket/bytestream abstraction eventually breaks down => there is a limit (a few GB) in rsocket (which the IONOS team previously reported in testing; see that email). It appears that means rsocket is only going to be able to map a certain limited amount of memory with the hardware until its internal "buffer" runs out, before it can then unmap and remap the next batch of memory with the hardware to continue along with the fake bytestream. This is very much sticking a square peg in a round hole. If you were to "relax" the rsocket implementation to register the entire VM memory space (as my original implementation does), then there wouldn't be any need for rsocket in the first place.
> > >
> > > Yes, some test like this can be helpful.
> > >
> > > And thanks for the summary. That's definitely helpful.
> > >
> > > One question from my side (as someone who knows nothing about RDMA/rsocket): is that "a few GBs" limitation a software guard? Would it be possible for rsocket to provide an option that lets users opt in to setting that value, so that it might work for the VM use case? Would that consume similar resources vs. the current QEMU impl while allowing it to use rsockets with no perf regressions?
> >
> > Rsockets emulates the streaming socket API. The amount of memory dedicated to a single rsocket is controlled through a wmem_default configuration setting. It is also configurable via rsetsockopt() SO_SNDBUF. Both of those are similar to TCP settings. The SW field used to store this value is 32 bits.
> >
> > This internal buffer acts as a bounce buffer to convert the synchronous socket API calls into asynchronous RDMA transfers. Rsockets uses the CPU for data copies, but the transport is offloaded to the NIC, including kernel bypass.
>
> Understood.
>
> > Does your kernel allocate > 4 GBs of buffer space to an individual socket?
>
> Yes, it absolutely does. We're dealing with virtual machines here, right? It is possible (and likely) to have a virtual machine that is hundreds of GBs of RAM in size.
>
> A bounce buffer defeats the entire purpose of using RDMA in these cases.
> When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
>
> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.

Is it possible to make rsocket friendly to large buffers (>4GB) like the VM use case?

I also wonder whether there are other applications that may benefit from this outside of QEMU.

Thanks,

--
Peter Xu
> > I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
>
> Is it possible to make rsocket friendly to large buffers (>4GB) like the VM use case?

If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies. The problem is the socket API semantics.

There are rsocket API extensions (riowrite, riomap) to support RDMA write operations. This avoids the data copy at the target, but not at the sender. (riowrite follows the socket send semantics on buffer ownership.)

It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at. True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

- Sean
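For reference, the riomap/riowrite extensions Sean mentions look roughly like this with librdmacm's rsocket API. This is a sketch only: the wrapper names are invented, the offset would be negotiated out of band in a real protocol, and error handling is omitted.

/*
 * Sketch of the riomap/riowrite rsocket extensions: the receiver exposes a
 * buffer for remote RDMA writes, and the sender pushes data straight into
 * it, avoiding the copy on the target (but not on the sender).
 * Link with -lrdmacm.
 */
#include <rdma/rsocket.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Receiver: register `dst` for remote writes over rsocket `rs`; returns the
 * rio offset the peer must target (communicated out of band). */
off_t expose_target_buffer(int rs, void *dst, size_t len)
{
    return riomap(rs, dst, len, PROT_WRITE, 0, -1 /* let rsockets pick */);
}

/* Sender: RDMA-write `src` into the peer's exposed buffer at `remote_off`.
 * riowrite keeps socket-send semantics for buffer ownership. */
ssize_t push_chunk(int rs, const void *src, size_t len, off_t remote_off)
{
    return riowrite(rs, src, len, remote_off, 0);
}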
On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
> > > I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
> >
> > Is it possible to make rsocket friendly to large buffers (>4GB) like the VM use case?
>
> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies. The problem is the socket API semantics.
>
> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations. This avoids the data copy at the target, but not at the sender. (riowrite follows the socket send semantics on buffer ownership.)
>
> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at. True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.

Thanks, Sean.

One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, which already supports MSG_ZEROCOPY, but only on the sender side and only when multifd is enabled, because it requires page pinning and alignment, and it's more challenging to pin a random buffer than a guest page.

Nobody has moved forward yet with zerocopy recv for TCP; there might be similar challenges, in that the normal socket APIs may not work easily on top of the current iochannel design, but I don't know it well enough to say.

Not sure whether that means there can be a shared goal of QEMU ultimately supporting better zerocopy via either TCP or RDMA. If that's true, maybe there's a chance we can move towards rsocket with all the above facilities; meanwhile RDMA could, ideally, run similarly to TCP with the same (to-be-enhanced) iochannel API, so that it can do zerocopy on both sides with either transport.

--
Peter Xu
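For background, the kernel mechanism underneath QEMU's QIO_CHANNEL_WRITE_FLAG_ZERO_COPY on the send side is MSG_ZEROCOPY. A bare-bones sketch with plain sockets follows; it is not QEMU code, it assumes a kernel and libc new enough to define SO_ZEROCOPY and MSG_ZEROCOPY, and completion handling via the socket error queue is omitted.

/*
 * Sketch of the Linux MSG_ZEROCOPY send path that QEMU's zero-copy flag
 * builds on (not QEMU code). The pages backing `buf` stay pinned until the
 * kernel reports completion on the socket error queue (recvmsg with
 * MSG_ERRQUEUE), which is not shown here.
 */
#include <sys/socket.h>
#include <stddef.h>

static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    /* Opt the socket in to zero-copy transmission once. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    /* Kernel pins the pages and transmits without copying into skb data. */
    if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    return 0;
}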
On 9/30/24 14:47, Peter Xu wrote:
> On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
>>>
>>> Is it possible to make rsocket friendly to large buffers (>4GB) like the VM use case?
>>
>> If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies. The problem is the socket API semantics.
>>
>> There are rsocket API extensions (riowrite, riomap) to support RDMA write operations. This avoids the data copy at the target, but not at the sender. (riowrite follows the socket send semantics on buffer ownership.)
>>
>> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at. True zero copy may require combining MSG_ZEROCOPY with riowrite, but then that moves further away from using traditional socket calls.
>
> Thanks, Sean.
>
> One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, which already supports MSG_ZEROCOPY, but only on the sender side and only when multifd is enabled, because it requires page pinning and alignment, and it's more challenging to pin a random buffer than a guest page.
>
> Nobody has moved forward yet with zerocopy recv for TCP; there might be similar challenges, in that the normal socket APIs may not work easily on top of the current iochannel design, but I don't know it well enough to say.
>
> Not sure whether that means there can be a shared goal of QEMU ultimately supporting better zerocopy via either TCP or RDMA. If that's true, maybe there's a chance we can move towards rsocket with all the above facilities; meanwhile RDMA could, ideally, run similarly to TCP with the same (to-be-enhanced) iochannel API, so that it can do zerocopy on both sides with either transport.

What about the testing solution that I mentioned?

Does that satisfy your concerns? Or is there still a gap here that needs to be met?

- Michael
On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> What about the testing solution that I mentioned?
>
> Does that satisfy your concerns? Or is there still a gap here that needs to be met?

I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.

Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.

Thanks,

--
Peter Xu
On 10/3/24 16:43, Peter Xu wrote:
> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>> What about the testing solution that I mentioned?
>>
>> Does that satisfy your concerns? Or is there still a gap here that needs to be met?
>
> I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
>
> Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
>
> Thanks,

OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?

What do you think?

- Michael
Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test between the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.

On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
> On 10/3/24 16:43, Peter Xu wrote:
> > On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> >> What about the testing solution that I mentioned?
> >>
> >> Does that satisfy your concerns? Or is there still a gap here that needs to be met?
> > I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
> >
> > Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
> >
> > Thanks,
>
> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?
>
> What do you think?
>
> - Michael
>
Hi,

On 10/7/24 03:47, Yu Zhang wrote:
> Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test between the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.
>
Acknowledged. Do share if you run into any problems with it, for example if it has compatibility issues or if we need a different solution. We're open to change.

I'm not familiar with the "current state" of this or how well it would even work.

- Michael

> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>>
>> On 10/3/24 16:43, Peter Xu wrote:
>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>> What about the testing solution that I mentioned?
>>>>
>>>> Does that satisfy your concerns? Or is there still a gap here that needs to be met?
>>> I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
>>>
>>> Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
>>>
>>> Thanks,
>>>
>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?
>>
>> What do you think?
>>
>> - Michael
>>
Hi All,

This is just a heads-up: I will be changing employment soon, so my Akamai email address will cease to operate this week. My personal email is michael@flatgalaxy.com. I'll re-subscribe later, once I have come back online for work.

Thanks!

- Michael

On 10/7/24 08:45, Michael Galaxy wrote:
> Hi,
>
> On 10/7/24 03:47, Yu Zhang wrote:
>> Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test between the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.
>>
> Acknowledged. Do share if you run into any problems with it, for example if it has compatibility issues or if we need a different solution. We're open to change.
>
> I'm not familiar with the "current state" of this or how well it would even work.
>
> - Michael
>
>> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>>>
>>> On 10/3/24 16:43, Peter Xu wrote:
>>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>>> What about the testing solution that I mentioned?
>>>>>
>>>>> Does that satisfy your concerns? Or is there still a gap here that needs to be met?
>>>> I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
>>>>
>>>> Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
>>>>
>>>> Thanks,
>>>>
>>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?
>>>
>>> What do you think?
>>>
>>> - Michael
>>>
On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
> Hi,
>
> On 10/7/24 03:47, Yu Zhang wrote:
> > Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test between the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.
> >
> Acknowledged. Do share if you run into any problems with it, for example if it has compatibility issues or if we need a different solution. We're open to change.
>
> I'm not familiar with the "current state" of this or how well it would even work.

Any compatibility issue between versions of RXE (SoftRoCE), or between RXE and real devices, is a bug in RXE, which should be fixed.

RXE is expected to be compatible with all other RoCE devices, both virtual and physical.

Thanks

>
> - Michael
>
> > On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> > >
> > > On 10/3/24 16:43, Peter Xu wrote:
> > > > On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> > > > > What about the testing solution that I mentioned?
> > > > >
> > > > > Does that satisfy your concerns? Or is there still a gap here that needs to be met?
> > > > I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
> > > >
> > > > Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
> > > >
> > > > Thanks,
> > > >
> > > OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?
> > >
> > > What do you think?
> > >
> > > - Michael
> > >
On 2024/10/8 2:15, Leon Romanovsky wrote:
> On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
>> Hi,
>>
>> On 10/7/24 03:47, Yu Zhang wrote:
>>> Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test between the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.
>>>
>> Acknowledged. Do share if you run into any problems with it, for example if it has compatibility issues or if we need a different solution. We're open to change.
>>
>> I'm not familiar with the "current state" of this or how well it would even work.
>
> Any compatibility issue between versions of RXE (SoftRoCE), or between RXE and real devices, is a bug in RXE, which should be fixed.
>
> RXE is expected to be compatible with all other RoCE devices, both virtual and physical.

From my tests with physical RoCE devices (for example, NVIDIA MLX5 and Intel E810 with iRDMA), if the RDMA feature is disabled on those devices, RXE can work well with them.

As for virtual devices, most of them can work well with RXE, for example bonding and veth. I have done a lot of tests with them. If some virtual devices do not work well with RXE, please share the error messages on the RDMA mailing list.

Zhu Yanjun

>
> Thanks
>
>>
>> - Michael
>>
>>> On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>>>>
>>>> On 10/3/24 16:43, Peter Xu wrote:
>>>>> On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
>>>>>> What about the testing solution that I mentioned?
>>>>>>
>>>>>> Does that satisfy your concerns? Or is there still a gap here that needs to be met?
>>>>> I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.
>>>>>
>>>>> Meanwhile, we still need people who know the rdma code well, committed to this and actively maintaining it.
>>>>>
>>>>> Thanks,
>>>>>
>>>> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?
>>>>
>>>> What do you think?
>>>>
>>>> - Michael
>>>>
>>
On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
>
> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.

To clarify, are you actively using rdma based migration in production? Stepping up to help maintain it?

--
MST
On 9/29/24 13:14, Michael S. Tsirkin wrote:
> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>> A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
>>
>> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
> To clarify, are you actively using rdma based migration in production? Stepping up to help maintain it?
>
Yes, Huawei and IONOS have both been contributing here in this email thread.

They are both using it in production.

- Michael
On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> > On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> > > A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
> > >
> > > I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
> > To clarify, are you actively using rdma based migration in production? Stepping up to help maintain it?
> >
> Yes, Huawei and IONOS have both been contributing here in this email thread.
>
> They are both using it in production.
>
> - Michael

Well, any plans to work on it? For example, postcopy does not really do zero copy, last time I checked; there's also a long TODO list.

--
MST
On 9/29/24 17:26, Michael S. Tsirkin wrote:
> On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
>> On 9/29/24 13:14, Michael S. Tsirkin wrote:
>>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
>>>> A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
>>>>
>>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
>>> To clarify, are you actively using rdma based migration in production? Stepping up to help maintain it?
>>>
>> Yes, Huawei and IONOS have both been contributing here in this email thread.
>>
>> They are both using it in production.
>>
>> - Michael
> Well, any plans to work on it? For example, postcopy does not really do zero copy, last time I checked; there's also a long TODO list.
>
I apologize, I'm not following the question here. Isn't that what this thread is about?

So, some background is missing here, perhaps: a few months ago, there was a proposal to remove native RDMA support from live migration due to concerns about a lack of testability. Both IONOS and Huawei have stepped up to say that they are using it and are engaging with the community here. I also proposed transferring maintainership over to them. (I no longer have any of this hardware, so I cannot provide testing support anymore.)

During that time, rsocket was proposed as an alternative, but as I have laid out above, I believe it cannot work for technical reasons.

I also asked earlier in the thread whether we can cover the community's testing concerns using SoftRoCE, so that an integration test can be made to work (presumably through avocado or something similar).

Does that history make sense?

- Michael
Hello Michael,

That's true. To my understanding, to ease the maintenance, Gonglei's team made an effort to refactor the RDMA migration code to use rsocket. However, due to a certain limitation in rsocket, it turned out that only small VMs (in terms of core count and memory) can be migrated successfully. As long as this limitation persists, no progress can be achieved in this direction.

On the other hand, a proper test environment and integration/regression test cases are expected to catch any possible regression caused by new changes. It seems that, currently, we can go in this direction.

Best regards,
Yu Zhang @ IONOS cloud

On Mon, Sep 30, 2024 at 5:00 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
>
> On 9/29/24 17:26, Michael S. Tsirkin wrote:
> > On Sun, Sep 29, 2024 at 03:26:58PM -0500, Michael Galaxy wrote:
> >> On 9/29/24 13:14, Michael S. Tsirkin wrote:
> >>> On Sat, Sep 28, 2024 at 12:52:08PM -0500, Michael Galaxy wrote:
> >>>> A bounce buffer defeats the entire purpose of using RDMA in these cases. When using RDMA for very large transfers like this, the goal here is to map the entire memory region at once and avoid all CPU interactions (except for message management within libibverbs) so that the NIC is doing all of the work.
> >>>>
> >>>> I'm sure rsocket has its place with much smaller transfer sizes, but this is very different.
> >>> To clarify, are you actively using rdma based migration in production? Stepping up to help maintain it?
> >>>
> >> Yes, Huawei and IONOS have both been contributing here in this email thread.
> >>
> >> They are both using it in production.
> >>
> >> - Michael
> > Well, any plans to work on it? For example, postcopy does not really do zero copy, last time I checked; there's also a long TODO list.
> >
> I apologize, I'm not following the question here. Isn't that what this thread is about?
>
> So, some background is missing here, perhaps: a few months ago, there was a proposal to remove native RDMA support from live migration due to concerns about a lack of testability. Both IONOS and Huawei have stepped up to say that they are using it and are engaging with the community here. I also proposed transferring maintainership over to them. (I no longer have any of this hardware, so I cannot provide testing support anymore.)
>
> During that time, rsocket was proposed as an alternative, but as I have laid out above, I believe it cannot work for technical reasons.
>
> I also asked earlier in the thread whether we can cover the community's testing concerns using SoftRoCE, so that an integration test can be made to work (presumably through avocado or something similar).
>
> Does that history make sense?
>
> - Michael
>