> > I'm sure rsocket has its place with much smaller transfer sizes, but
> > this is very different.
>
> Is it possible to make rsocket be friendly with large buffers (>4GB)
> like the VM use case?

If you can perform large VM migrations using streaming sockets, rsockets is likely usable, but it will involve data copies. The problem is the socket API semantics.

There are rsocket API extensions (riowrite, riomap) to support RDMA write operations. This avoids the data copy at the target, but not at the sender. (riowrite follows the socket send semantics on buffer ownership.)

It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring extensions to enable zero-copy for large transfers, but that's not something I've looked at. True zero copy may require combining MSG_ZEROCOPY with riowrite, but that moves further away from using traditional socket calls.

- Sean
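[For reference, a minimal sketch of the riomap/riowrite flow Sean describes, assuming an already-connected rsocket fd on each side. The helper names (expose_buffer, push_buffer) and the in-band offset exchange are illustrative assumptions, and error handling is omitted:]

    #include <rdma/rsocket.h>
    #include <sys/mman.h>

    /* Receiver: register `buf` for remote RDMA writes and tell the peer
     * which offset to target. Passing -1 lets rsockets pick the offset. */
    static off_t expose_buffer(int fd, void *buf, size_t len)
    {
        off_t off = riomap(fd, buf, len, PROT_WRITE, 0, -1);

        rsend(fd, &off, sizeof(off), 0);
        return off;
    }

    /* Sender: RDMA-write straight into the receiver's mapped buffer.
     * This skips the copy at the target, but riowrite keeps socket send
     * semantics on buffer ownership, so the local side may still copy. */
    static ssize_t push_buffer(int fd, const void *buf, size_t len)
    {
        off_t off;

        rrecv(fd, &off, sizeof(off), 0);
        return riowrite(fd, buf, len, off, 0);
    }

[As Sean notes, only the target side avoids the copy here; the sender's buffer still follows send() ownership rules.]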
On Mon, Sep 30, 2024 at 07:20:56PM +0000, Sean Hefty wrote:
> It may be possible to enhance rsockets with MSG_ZEROCOPY or io_uring
> extensions to enable zero-copy for large transfers, but that's not
> something I've looked at. True zero copy may require combining
> MSG_ZEROCOPY with riowrite, but that moves further away from using
> traditional socket calls.

Thanks, Sean.

One thing to mention is that QEMU has QIO_CHANNEL_WRITE_FLAG_ZERO_COPY, which already supports MSG_ZEROCOPY, but only on the sender side, and only when multifd is enabled, because it requires page pinning and alignment; it is more challenging to pin a random buffer than a guest page.

Nobody has moved forward yet with zero-copy recv for TCP; there might be similar challenges where the normal socket APIs may not work easily on top of the current iochannel design, but I don't know it well enough to say.

Not sure whether this means there can be a shared goal of QEMU ultimately supporting better zero-copy via either TCP or RDMA. If that's true, maybe there's a chance we can move towards rsocket with all the above facilities; meanwhile RDMA could, ideally, run similarly to TCP with the same (to-be-enhanced) iochannel API, so that it can do zero-copy on both sides with either transport.

--
Peter Xu
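[For readers unfamiliar with the kernel mechanism behind that flag, a minimal sender-side MSG_ZEROCOPY sketch on a plain TCP socket could look like the following — this is not QEMU's actual iochannel code, and error handling is trimmed:]

    #include <sys/socket.h>
    #include <linux/errqueue.h>

    /* Enable zero-copy once on the socket, then send with MSG_ZEROCOPY. */
    static void send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;

        setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

        /* The pages backing `buf` stay pinned until the kernel reports
         * completion, so the buffer must not be reused before then. */
        send(fd, buf, len, MSG_ZEROCOPY);
    }

    /* Reap one completion notification from the socket error queue. */
    static void reap_completion(int fd)
    {
        char control[128];
        struct msghdr msg = {
            .msg_control = control,
            .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;
        struct sock_extended_err *ee;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
            return;

        cm = CMSG_FIRSTHDR(&msg);
        if (!cm)
            return;

        ee = (struct sock_extended_err *)CMSG_DATA(cm);
        /* ee_info..ee_data is the completed send range; the kernel may
         * also report that it silently fell back to copying. */
        if (ee->ee_code == SO_EE_CODE_ZEROCOPY_COPIED) {
            /* Fallback path: data was copied rather than pinned. */
        }
    }

[The completion notification is what makes buffer ownership awkward for a generic iochannel: the caller cannot recycle the buffer until the error-queue message arrives, which is manageable for pinned guest pages but harder for arbitrary buffers.]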
On 9/30/24 14:47, Peter Xu wrote:
> Not sure whether this means there can be a shared goal of QEMU
> ultimately supporting better zero-copy via either TCP or RDMA. [...]

What about the testing solution that I mentioned?

Does that satisfy your concerns? Or is there still a gap here that needs to be met?

- Michael
On Thu, Oct 03, 2024 at 04:26:27PM -0500, Michael Galaxy wrote:
> What about the testing solution that I mentioned?
>
> Does that satisfy your concerns? Or is there still a gap here that
> needs to be met?

I think such a testing framework would be helpful, especially if we can kick it off in CI when preparing pull requests; then we can make sure nothing will break RDMA easily.

Meanwhile, we still need people who know the RDMA code well committed to this and actively maintaining it.

Thanks,

--
Peter Xu
On 10/3/24 16:43, Peter Xu wrote:
> I think such a testing framework would be helpful, especially if we
> can kick it off in CI when preparing pull requests; then we can make
> sure nothing will break RDMA easily.
>
> Meanwhile, we still need people who know the RDMA code well committed
> to this and actively maintaining it.

OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test along these lines that would ensure that future RDMA breakages are detected more easily?

What do you think?

- Michael
Sure, as we discussed at the KVM Forum, a possible approach is to set up two VMs on a physical host, configure SoftRoCE, and run the migration test in the two nested VMs to ensure that the migration data traffic goes through the emulated RDMA hardware. I will continue with this and let you know.

On Fri, Oct 4, 2024 at 4:06 PM Michael Galaxy <mgalaxy@akamai.com> wrote:
> OK, so comments from Yu Zhang and Gonglei? Can we work up a CI test
> along these lines that would ensure that future RDMA breakages are
> detected more easily?
>
> What do you think?
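[For reference, the per-VM SoftRoCE step typically amounts to a couple of iproute2 commands; this is a hedged sketch, with the netdev name "eth0" and the device name "rxe0" as placeholder assumptions:]

    modprobe rdma_rxe                        # load the RXE (SoftRoCE) driver
    rdma link add rxe0 type rxe netdev eth0  # attach an RXE device to the NIC
    rdma link show                           # confirm rxe0 comes up ACTIVE
    # then start the destination QEMU with "-incoming rdma:<dst-ip>:<port>"
    # and issue "migrate -d rdma:<dst-ip>:<port>" on the source monitor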
Hi,

On 10/7/24 03:47, Yu Zhang wrote:
> Sure, as we discussed at the KVM Forum, a possible approach is to set
> up two VMs on a physical host, configure SoftRoCE, and run the
> migration test in the two nested VMs to ensure that the migration data
> traffic goes through the emulated RDMA hardware. I will continue with
> this and let you know.

Acknowledged. Do share if you have any problems with it, like if it has compatibility issues or if we need a different solution. We're open to change.

I'm not familiar with the "current state" of this or how well it would even work.

- Michael
On Mon, Oct 07, 2024 at 08:45:07AM -0500, Michael Galaxy wrote:
> Acknowledged. Do share if you have any problems with it, like if it
> has compatibility issues or if we need a different solution. We're
> open to change.
>
> I'm not familiar with the "current state" of this or how well it would
> even work.

Any compatibility issue between versions of RXE (SoftRoCE), or between RXE and real devices, is a bug in RXE which should be fixed.

RXE is expected to be compatible with all other RoCE devices, both virtual and physical.

Thanks
On 2024/10/8 2:15, Leon Romanovsky wrote:
> Any compatibility issue between versions of RXE (SoftRoCE), or between
> RXE and real devices, is a bug in RXE which should be fixed.
>
> RXE is expected to be compatible with all other RoCE devices, both
> virtual and physical.

From my tests with physical RoCE devices, for example NVIDIA MLX5 and Intel E810 (iRDMA): if the RDMA feature is disabled on those devices, RXE can work well on top of them.

As for virtual devices, most of them can work well with RXE, for example bonding and veth; I have done a lot of tests with them. If some virtual device does not work well with RXE, please share the error messages on the RDMA mailing list.

Zhu Yanjun