Stefan (Reiter), after looking a bit closer at this, I think there is no
bug in QEMU, but the bug is in your coroutine code that calls block
layer functions without moving into the right AioContext first. I've
written this series anyway as it potentially makes the life of callers
easier and would probably make your buggy code correct.

However, it doesn't feel right to commit something like patch 2 without
having a user for it. Is there a reason why you can't upstream your
async snapshot code?

The series would also happen to fix a bug in my recent patch to convert
qmp_block_resize() to coroutines, but I feel it's not how I would
naturally fix it. Switching the thread already in the QMP handler before
calling bdrv_truncate() would feel more appropriate. I wonder if it
wouldn't actually be the same for your snapshot code.

Kevin Wolf (3):
  block: Factor out bdrv_run_co()
  block: Allow bdrv_run_co() from different AioContext
  block: Assert we're running in the right thread

 block/io.c | 122 ++++++++++++++++++++--------------------- 
 1 file changed, 45 insertions(+), 77 deletions(-)

-- 
2.25.3
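(To make patch 1 concrete for readers who haven't opened the series: it
factors out the "run this coroutine entry function and poll until it is
done" pattern that the bdrv_*() wrappers in block/io.c currently
duplicate. The sketch below paraphrases that pattern; the exact
signature and the NOT_DONE sentinel are simplifications, not the
literal committed code.)

/* Run entry(opaque) to completion: directly if we are already in
 * coroutine context, otherwise by spawning a coroutine and polling
 * bs's AioContext until the entry function signals completion by
 * updating *ret. */
static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
                       void *opaque, int *ret)
{
    if (qemu_in_coroutine()) {
        /* Fast path: call the implementation directly */
        entry(opaque);
    } else {
        Coroutine *co = qemu_coroutine_create(entry, opaque);
        *ret = NOT_DONE; /* sentinel: coroutine still running */
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);
    }
    return *ret;
}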
On 5/12/20 4:43 PM, Kevin Wolf wrote:
> Stefan (Reiter), after looking a bit closer at this, I think there is no
> bug in QEMU, but the bug is in your coroutine code that calls block
> layer functions without moving into the right AioContext first. I've
> written this series anyway as it potentially makes the life of callers
> easier and would probably make your buggy code correct.
>
> However, it doesn't feel right to commit something like patch 2 without
> having a user for it. Is there a reason why you can't upstream your
> async snapshot code?

I mean I understand what you mean, but it would make the interface IMO so
much easier to use; if one wants to explicitly schedule it beforehand,
they can still do so. But that would open the way for two styles of doing
things, and I'm not sure if that would be seen as bad. The assert from
patch 3/3 alone would already help a lot, though.

Regarding upstreaming, there was a historical attempt to upstream it from
Dietmar, but in a time frame of roughly 8 to 10 years ago. I'm not quite
sure why it didn't go through then; I'll see if I can get some time to
search the mailing list archive.

We'd naturally be open and glad to upstream it. What it effectively
allows us to do is to not block the VM too much while snapshotting it
live.

I pushed a tree[0] with mostly just that specific code squashed together
(hope I did not break anything); most of the actual code is in commit
[1]. It'd need to be cleaned up a bit and checked for coding style
issues, but it works well here.

Anyway, thanks for your help and pointers!

[0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
[1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121
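(For reference, "explicitly schedule it beforehand" means the idiom that
already works today: a coroutine reschedules itself into the node's
AioContext before calling into the block layer, and back afterwards. A
minimal sketch of that idiom follows; do_block_layer_work() is a
hypothetical stand-in for whatever the caller actually does, and error
handling is omitted.)

static void coroutine_fn call_in_bs_context(BlockDriverState *bs)
{
    AioContext *main_ctx = qemu_get_aio_context();
    AioContext *bs_ctx = bdrv_get_aio_context(bs);

    if (bs_ctx != main_ctx) {
        /* Re-enter this coroutine from the node's AioContext */
        aio_co_schedule(bs_ctx, qemu_coroutine_self());
        qemu_coroutine_yield();
    }

    do_block_layer_work(bs); /* hypothetical: now runs in the right thread */

    if (bs_ctx != main_ctx) {
        /* Move back to the main context before returning */
        aio_co_schedule(main_ctx, qemu_coroutine_self());
        qemu_coroutine_yield();
    }
}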
Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
> On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > [...]
> > However, it doesn't feel right to commit something like patch 2 without
> > having a user for it. Is there a reason why you can't upstream your
> > async snapshot code?
>
> I mean I understand what you mean, but it would make the interface IMO so
> much easier to use; if one wants to explicitly schedule it beforehand,
> they can still do so. But that would open the way for two styles of doing
> things, and I'm not sure if that would be seen as bad. The assert from
> patch 3/3 alone would already help a lot, though.

I think patches 1 and 3 are good to be committed either way if people
think they are useful. They make sense without the async snapshot code.

My concern with the interface in patch 2 is both that it could give
people a false sense of security and that it would be tempting to write
inefficient code.

Usually, you won't have just a single call into the block layer for a
given block node, but you'll perform multiple operations. Switching to
the target context once rather than switching back and forth in every
operation is obviously more efficient.

But chances are that even if one of these functions is bdrv_flush(),
which now works correctly from a different thread, you might need
another function that doesn't implement the same magic. So you always
need to be aware which functions support cross-context calls and which
ones don't.

I feel we'd see a few bugs related to this.

> Regarding upstreaming, there was a historical attempt to upstream it from
> Dietmar, but in a time frame of roughly 8 to 10 years ago. I'm not quite
> sure why it didn't go through then; I'll see if I can get some time to
> search the mailing list archive.
>
> We'd naturally be open and glad to upstream it. What it effectively
> allows us to do is to not block the VM too much while snapshotting it
> live.

Yes, there is no doubt that this is useful functionality. There has been
talk about this every now and then, but I don't think we ever got to a
point where it actually could be implemented.

Vladimir, I seem to remember you (or someone else from your team?) were
interested in async snapshots as well a while ago?

> I pushed a tree[0] with mostly just that specific code squashed together
> (hope I did not break anything); most of the actual code is in commit
> [1]. It'd need to be cleaned up a bit and checked for coding style
> issues, but it works well here.
>
> Anyway, thanks for your help and pointers!
>
> [0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
> [1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121

It doesn't even look that bad in terms of patch size. I had imagined it
a bit larger.

But it seems this is not really just an async 'savevm' (which would save
the VM state in a qcow2 file); instead, you store the state in a separate
raw file. What is the difference between this and regular migration into
a file?

I remember people talking about how snapshotting can store things in a
way that a normal migration stream can't, like overwriting outdated RAM
state instead of just appending the new state, but you don't seem to
implement something like this.

Kevin
14.05.2020 17:26, Kevin Wolf wrote:
> Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
> [...]
> Yes, there is no doubt that this is useful functionality. There has been
> talk about this every now and then, but I don't think we ever got to a
> point where it actually could be implemented.
>
> Vladimir, I seem to remember you (or someone else from your team?) were
> interested in async snapshots as well a while ago?

Den is working on this (add him to CC).

-- 
Best regards,
Vladimir
On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 14.05.2020 17:26, Kevin Wolf wrote:
> > [...]
> > Vladimir, I seem to remember you (or someone else from your team?) were
> > interested in async snapshots as well a while ago?
>
> Den is working on this (add him to CC).

Yes, I was working on that.

What I've done can be found here:
https://github.com/denis-plotnikov/qemu/commits/bgs_uffd

The idea was to save a snapshot (state+ram) asynchronously to a separate
(raw) file using the existing infrastructure. The goal of that was to
reduce the VM downtime on snapshot.

We decided to postpone this work until "userfaultfd write protected mode"
landed in mainline Linux. Now that userfaultfd-wp is merged into Linux,
we have plans to continue this work.

Regarding saving the "internal" snapshot to qcow2, I still have a
question. Maybe this is the right place and time to ask.

If I remember correctly, in qcow2 the snapshot is stored at the end of
the address space of the current block-placement-table, and we switch to
the new block-placement-table after the snapshot storing is complete. In
the case of a sync snapshot, we should switch the table before the
snapshot is written; in other words, we should be able to preallocate
the space for the snapshot and keep a link to that space until snapshot
writing is completed.

The question is whether this could be done without qcow2 modification,
and if not, could you please give some ideas of how to implement that?

Denis
Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
> > [...]
> > Den is working on this (add him to CC).
>
> Yes, I was working on that.
>
> What I've done can be found here:
> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
>
> [...]
>
> If I remember correctly, in qcow2 the snapshot is stored at the end of
> the address space of the current block-placement-table, and we switch to
> the new block-placement-table after the snapshot storing is complete.

Yes. Basically, the way snapshots with VM state work is that you write
the VM state to some offset after the end of the virtual disk; when the
VM state is completely written, you snapshot the current state (including
both the content of the virtual disk and the VM state) and finally
discard the VM state again in the active L1 table.

> In the case of a sync snapshot, we should switch the table before the
> snapshot is written; in other words, we should be able to preallocate
> the space for the snapshot and keep a link to that space until snapshot
> writing is completed.

I don't see a fundamental difference between sync and async in this
respect. Why can't you write the VM state to the current L1 table first
like we usually do?

We always have only one active L1 table at a time, which simplifies
cluster allocation a bit, so it would be preferable to keep it this way.

Kevin
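(As a reference for the flow Kevin describes, here is a rough sketch
loosely following qemu's save_snapshot() in migration/savevm.c. The
function names paraphrase that area of the code and may not match the
current tree exactly; error handling is omitted.)

static int internal_snapshot_sketch(BlockDriverState *bs,
                                    QEMUSnapshotInfo *sn)
{
    Error *err = NULL;
    QEMUFile *f;

    /* 1. Stop the VM so device and RAM state are consistent. */
    vm_stop(RUN_STATE_SAVE_VM);

    /* 2. Write the VM state through the *active* L1 table; the QEMUFile
     *    backend stores it via the vmstate API at offsets beyond the
     *    end of the virtual disk. */
    f = qemu_fopen_bdrv(bs, 1 /* writable */);
    qemu_savevm_state(f, &err);
    qemu_fclose(f);

    /* 3. Take the snapshot: the active L1 table is copied, so the
     *    snapshot references both disk content and VM state clusters. */
    bdrv_snapshot_create(bs, sn);

    /* 4. The VM state clusters are then discarded from the active L1
     *    table; only the snapshot's copy still references them. */

    /* 5. Resume the guest. */
    vm_start();
    return 0;
}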
On 19.05.2020 17:18, Kevin Wolf wrote:
> Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
> > [...]
> > In the case of a sync snapshot, we should switch the table before the
> > snapshot is written; in other words, we should be able to preallocate
> > the space for the snapshot and keep a link to that space until
> > snapshot writing is completed.
>
> I don't see a fundamental difference between sync and async in this
> respect. Why can't you write the VM state to the current L1 table first
> like we usually do?

I'm not quite sure I understand the point. Let's look at the whole
picture of the async snapshot: our goal is to minimize VM downtime
during the snapshot.

When we do an async snapshot, we save the vmstate except RAM while the
VM is stopped, using the current L1 table (henceforth the initial L1
table). Then we want the VM to start running while we write the RAM
content. At this time all RAM is write-protected, and we unprotect each
RAM page once it has been written.

All those RAM pages should go to the snapshot using the initial L1
table. Since the VM is running, it may want to write new disk blocks,
so we need to use a NEW L1 table to provide this ability. (Am I correct
so far?) Thus, if I understand correctly, we need to use two L1 tables:
the initial one to store RAM pages into the vmstate and the new one to
allow disk writes.

Maybe I just can't see a better way to achieve that. Please correct me
if I'm wrong.

Denis

> We always have only one active L1 table at a time, which simplifies
> cluster allocation a bit, so it would be preferable to keep it this way.
>
> Kevin
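(For readers unfamiliar with the mechanism Denis refers to: the write
protection comes from Linux userfaultfd-wp, merged in kernel 5.7. Below
is a minimal standalone sketch of the syscall sequence, independent of
QEMU; addr/len/page/page_size are placeholders for a RAM region, and
error handling is omitted.)

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static void wp_sketch(void *addr, size_t len, void *page, size_t page_size)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    ioctl(uffd, UFFDIO_API, &api);

    /* Register the RAM region for write-protect tracking */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* Write-protect everything: guest writes now raise uffd events */
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

    /* ... save one page to the snapshot, then unprotect just that page,
     * waking any vCPU thread blocked on the write fault ... */
    struct uffdio_writeprotect unwp = {
        .range = { .start = (unsigned long)page, .len = page_size },
        .mode = 0, /* clear the WP bit */
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &unwp);

    close(uffd);
}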
Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
> On 19.05.2020 17:18, Kevin Wolf wrote:
> > [...]
> > I don't see a fundamental difference between sync and async in this
> > respect. Why can't you write the VM state to the current L1 table
> > first like we usually do?
>
> I'm not quite sure I understand the point. Let's look at the whole
> picture of the async snapshot: our goal is to minimize VM downtime
> during the snapshot.
>
> When we do an async snapshot, we save the vmstate except RAM while the
> VM is stopped, using the current L1 table (henceforth the initial L1
> table). Then we want the VM to start running while we write the RAM
> content. At this time all RAM is write-protected, and we unprotect each
> RAM page once it has been written.

Oh, I see, you're basically doing something like postcopy migration. I
was assuming it was more like regular live migration, except that you
would overwrite updated RAM blocks instead of appending them.

I can see your requirement then.

> All those RAM pages should go to the snapshot using the initial L1
> table. Since the VM is running, it may want to write new disk blocks,
> so we need to use a NEW L1 table to provide this ability. (Am I correct
> so far?) Thus, if I understand correctly, we need to use two L1 tables:
> the initial one to store RAM pages into the vmstate and the new one to
> allow disk writes.
>
> Maybe I just can't see a better way to achieve that. Please correct me
> if I'm wrong.

I guess I could imagine a different, though probably not better, way: we
could internally have a separate low-level operation that moves the VM
state from the active layer to an already existing disk snapshot. Then
you would snapshot the disk and start writing the VM state to the active
layer, and when the VM state write has completed you move it into the
snapshot.

The other option is doing what you suggested. There is nothing in the
qcow2 on-disk format that would prevent this, but we would have to
extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
like a non-trivial amount of code changes, though it would potentially
enable more use cases we never implemented ((read-only) access to
internal snapshots as block nodes, so you could e.g. use block jobs to
export a snapshot).

Kevin
19.05.2020 18:29, Kevin Wolf wrote:
> Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
> [...]
> The other option is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).

Or export a snapshot through NBD.

Still, I have one more idea; probably we already discussed it? Honestly,
I don't like the fact that we store the vmstate in guest-data space.
After EOF, invisible, but still..

Maybe it would be good to make a qcow2 extension for storing the vmstate
separately? The snapshot metadata would then include two more fields,
vmstate_offset and vmstate_length (hmm, actually we already have the
second one), and the vmstate would be allocated as normal qcow2
metadata. Or we could add one or two levels of layered allocation if
needed, but keep it separate from the L1/L2 tables for guest clusters.

-- 
Best regards,
Vladimir
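(Purely to make the proposal concrete: a hypothetical layout for such an
extension. None of these fields exist in the qcow2 specification today;
the struct and its names are invented for illustration only.)

/* Hypothetical snapshot-table extension: the vmstate would live in
 * separately allocated metadata clusters instead of being addressed
 * through the guest-cluster L1/L2 tables beyond EOF. */
struct Qcow2SnapshotVmstateExt {
    uint64_t vmstate_offset; /* host offset of the vmstate data (or of a
                                small allocation table, if layered) */
    uint64_t vmstate_length; /* vmstate size in bytes; duplicates the
                                existing 64-bit VM state size field */
};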
On 5/19/20 10:48 AM, Vladimir Sementsov-Ogievskiy wrote:
>> The other option is doing what you suggested. There is nothing in the
>> qcow2 on-disk format that would prevent this, but we would have to
>> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
>> like a non-trivial amount of code changes, though it would potentially
>> enable more use cases we never implemented ((read-only) access to
>> internal snapshots as block nodes, so you could e.g. use block jobs to
>> export a snapshot).
>
> Or export a snapshot through NBD.
>
> Still, I have one more idea; probably we already discussed it? Honestly,
> I don't like the fact that we store the vmstate in guest-data space.
> After EOF, invisible, but still..
>
> Maybe it would be good to make a qcow2 extension for storing the vmstate
> separately?

The existing location of internal snapshots IS already stored separately
from guest-data space, precisely because it is beyond EOF.

> The snapshot metadata would then include two more fields,
> vmstate_offset and vmstate_length (hmm, actually we already have the
> second one), and the vmstate would be allocated as normal qcow2
> metadata.

How will adding redundant fields help? Both fields are already present
in the snapshot table of v3 images (even if indirectly) by virtue of:

  Byte 32 - 35: Size of the VM state in bytes. 0 if no VM state is
                saved. If there is VM state, it starts at the first
                cluster described by first L1 table entry that doesn't
                describe a regular guest cluster (i.e. VM state is
                stored like guest disk content, except that it is stored
                at offsets that are larger than the virtual disk
                presented to the guest)

  Byte 40 - 47: Size of the VM state in bytes. 0 if no VM state is
                saved. If this field is present, the 32-bit value in
                bytes 32-35 is ignored.

  Byte 48 - 55: Virtual disk size of the snapshot in bytes

which gives you both the 64-bit virtual disk size (in order to compute
the last cluster accessible to the guest, and thus the next cluster
available to the vmstate beyond EOF) and the 64-bit length of that
vmstate.

> Or we could add one or two levels of layered allocation if needed, but
> keep it separate from the L1/L2 tables for guest clusters.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
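(The v3 extra data Eric quotes, roughly as qemu declares it on disk.
This paraphrases QCowSnapshotExtraData in block/qcow2-snapshot.c and may
not match the current tree field for field; both values are stored
big-endian in the image file.)

typedef struct QCowSnapshotExtraData {
    uint64_t vm_state_size_large; /* bytes 40-47: 64-bit VM state size */
    uint64_t disk_size;           /* bytes 48-55: virtual disk size of
                                     the snapshot */
} QCowSnapshotExtraData;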
> [...]
> The other option is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).
>
> Kevin

Ok, thanks for validating the possibilities and for the additional
implementation ideas.

I think I should start by trying to post my background snapshot version
that stores the vmstate to an external file, because write-protected
userfaultfd is now available on Linux. And if it's accepted, I'll try to
come up with an internal version for qcow2 (it seems this is the only
format supporting this).

Denis