exec.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++ hw/display/vga.c | 50 +++++++++++++++++-------------- include/exec/memory.h | 13 ++++++++ include/exec/ram_addr.h | 8 +++++ include/qemu/typedefs.h | 1 + memory.c | 15 ++++++++++ ui/console.c | 25 +--------------- 7 files changed, 145 insertions(+), 46 deletions(-)
Hi, First attempt on making display updates thread-save for real. Most interesting patches at this point are #2 (adds helper functions to create and use a dirty bitmap copy) and #3 (updates vga code to use them). Patch #1 fixes a bug I've noticed while wading through the vga code, and #4 removes the temporary workaround for testing purposes. Which will of course break some or all non-vga display adapters as those are not ported over yet. git branch is available here: git://git.kraxel.org/qemu work/vga-fixes Gerd Hoffmann (4): vga: add vga_scanline_invalidated helper memory: add support getting and using a dirty bitmap copy. vga: make display updates thread safe. [testing] console: remove do_safe_dpy_refresh exec.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++ hw/display/vga.c | 50 +++++++++++++++++-------------- include/exec/memory.h | 13 ++++++++ include/exec/ram_addr.h | 8 +++++ include/qemu/typedefs.h | 1 + memory.c | 15 ++++++++++ ui/console.c | 25 +--------------- 7 files changed, 145 insertions(+), 46 deletions(-) -- 1.8.3.1
On 30/03/17 07:55, Gerd Hoffmann wrote: > Hi, > > First attempt on making display updates thread-save for real. Most > interesting patches at this point are #2 (adds helper functions to > create and use a dirty bitmap copy) and #3 (updates vga code to use > them). > > Patch #1 fixes a bug I've noticed while wading through the vga code, > and #4 removes the temporary workaround for testing purposes. Which > will of course break some or all non-vga display adapters as those > are not ported over yet. > > git branch is available here: > git://git.kraxel.org/qemu work/vga-fixes > > Gerd Hoffmann (4): > vga: add vga_scanline_invalidated helper > memory: add support getting and using a dirty bitmap copy. > vga: make display updates thread safe. > [testing] console: remove do_safe_dpy_refresh > > exec.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++ > hw/display/vga.c | 50 +++++++++++++++++-------------- > include/exec/memory.h | 13 ++++++++ > include/exec/ram_addr.h | 8 +++++ > include/qemu/typedefs.h | 1 + > memory.c | 15 ++++++++++ > ui/console.c | 25 +--------------- > 7 files changed, 145 insertions(+), 46 deletions(-) Excellent! I can help out with converting and/or testing the SPARC devices (cg3/tcx) if required. Out of interest, from your work do you have a rough estimate as to how this affects guest performance? For example benchmarks with Jan's original commit to reduce the locking vs. Alex's current workaround vs. with this patchset applied? ATB, Mark.
Hi, > Excellent! I can help out with converting and/or testing the SPARC > devices (cg3/tcx) if required. Sure, test results and patches are very welcome. I've touched dirty bitmap code for the first time, so I've sent out this rfc for early feedback at the approach taken before going out converting more display drivers. Seems nobody found patch #2 horrible so far, I take that as good sign that we can go on with this draft API. > Out of interest, from your work do you have a rough estimate as to how > this affects guest performance? For example benchmarks with Jan's > original commit to reduce the locking vs. Alex's current workaround vs. > with this patchset applied? I expect this improves performance with mttcg because the locking goes away, which allows more concurrency, probably more visible with more guest cores. Didn't benchmark it though. cheers, Gerd
On 30/03/17 14:41, Gerd Hoffmann wrote: > Hi, > >> Excellent! I can help out with converting and/or testing the SPARC >> devices (cg3/tcx) if required. > > Sure, test results and patches are very welcome. I've spent a bit of time over the weekend attempting to convert the SPARC CG3/TCX framebuffers over to use your new code and pushed the result to my github branch here: https://github.com/mcayland/qemu/commits/vga-fixes-sparc. Feel free to steal/test/borrow as required. The two main things I've discovered to date are: 1) Problems with asserts() in cpu_physical_memory_snapshot_get_dirty() Using your current git branch, I get an assert() using VGA under qemu-system-ppc on startup unless I change the first assert to "assert(start >= snap->start)". I'm also not convinced by the second assert "assert(start + length < snap->end)" either. In the example CG3/TCX conversions I ended up having to subtract 1 from the size of the region passed to memory_region_snapshot_get_dirty() in order to avoid it firing when accessing the last scanline which seems wrong. I didn't explicitly add a patch to change this to "<=" in my git patchset just in case it broke further assumptions in the bitmap code relating to the end address. 2) Redraw issues with CG3/TCX after conversion Despite the conversion from your patches looking reasonably straightforward, I still see problems with areas of the screen not updating after conversion. I can easily reproduce these locally as follows: ./qemu-system-sparc -vga cg3 -prom-env 'auto-boot?=false' ./qemu-system-sparc -vga tcx -g 1024x768x8 -prom-env 'auto-boot?=false' ./qemu-system-sparc -vga tcx -g 1024x768x24 -prom-env 'auto-boot?=false' You can trigger extra screen updates by typing "show-devs" into OpenBIOS which will cause a lot of screen activity as it writes the device tree out to the framebuffer. I'm wondering if maybe this is related to the difference in sizes between DIRTY_MEMORY_BLOCK_SIZE and TARGET_PAGE_BITS for qemu-system-sparc? Finally on a slightly unrelated note, the TCX driver still has old code for 8 and 16 bpp surfaces in QEMU. Am I right in understanding that all surfaces in QEMU are now 32-bit and the 8/16 bpp code can be removed? ATB, Mark.
Hi, > I've spent a bit of time over the weekend attempting to convert the > SPARC CG3/TCX framebuffers over to use your new code and pushed the > result to my github branch here: > https://github.com/mcayland/qemu/commits/vga-fixes-sparc. work/vga-fixes branch updated & rebased to latest master, with your patches added. > Using your current git branch, I get an assert() using VGA under > qemu-system-ppc on startup unless I change the first assert to > "assert(start >= snap->start)". > > I'm also not convinced by the second assert "assert(start + length < > snap->end)" either. In the example CG3/TCX conversions I ended up having > to subtract 1 from the size of the region passed to > memory_region_snapshot_get_dirty() in order to avoid it firing when > accessing the last scanline which seems wrong. I didn't explicitly add a > patch to change this to "<=" in my git patchset just in case it broke > further assumptions in the bitmap code relating to the end address. Both asserts should use inclusive compare. Added a patch fixing the other assert too. > 2) Redraw issues with CG3/TCX after conversion > > Despite the conversion from your patches looking reasonably > straightforward, I still see problems with areas of the screen not > updating after conversion. I can easily reproduce these locally as follows: > > ./qemu-system-sparc -vga cg3 -prom-env 'auto-boot?=false' > ./qemu-system-sparc -vga tcx -g 1024x768x8 -prom-env 'auto-boot?=false' > ./qemu-system-sparc -vga tcx -g 1024x768x24 -prom-env 'auto-boot?=false' Hmm, I have a blank screen only (yellow with cg3, black with tcx). No openfirmware prompt showing up ... > Finally on a slightly unrelated note, the TCX driver still has old code > for 8 and 16 bpp surfaces in QEMU. Am I right in understanding that all > surfaces in QEMU are now 32-bit and the 8/16 bpp code can be removed? Yes, that is correct. tcx uses qemu_resize_console() so surfaces will always be 32bpp. cheers, Gerd
On 03/04/17 09:44, Gerd Hoffmann wrote: > Hi, > >> I've spent a bit of time over the weekend attempting to convert the >> SPARC CG3/TCX framebuffers over to use your new code and pushed the >> result to my github branch here: >> https://github.com/mcayland/qemu/commits/vga-fixes-sparc. > > work/vga-fixes branch updated & rebased to latest master, with your > patches added. I've just rebased on your branch with some extra fixes which should be squashed into the earlier commits and pushed to github again - see the commit messages for more information. >> 2) Redraw issues with CG3/TCX after conversion >> >> Despite the conversion from your patches looking reasonably >> straightforward, I still see problems with areas of the screen not >> updating after conversion. I can easily reproduce these locally as follows: >> >> ./qemu-system-sparc -vga cg3 -prom-env 'auto-boot?=false' >> ./qemu-system-sparc -vga tcx -g 1024x768x8 -prom-env 'auto-boot?=false' >> ./qemu-system-sparc -vga tcx -g 1024x768x24 -prom-env 'auto-boot?=false' > > Hmm, I have a blank screen only (yellow with cg3, black with tcx). > No openfirmware prompt showing up ... Yeah it's slightly baffling. The easiest one to look at is cg3 since it's a simple 8-bit palette dumb framebuffer with 256 colors, i.e. the differences between 8d3af4a~1 and 8d3af4a ("cg3: make display updates thread safe") in your updated branch. If nothing seems wrong with the conversion (once my latest fixes have been squashed in) then I presume something odd is happening in the dirty bitmap code... ATB, Mark.
On 30/03/2017 15:41, Gerd Hoffmann wrote: > Hi, > >> Excellent! I can help out with converting and/or testing the SPARC >> devices (cg3/tcx) if required. > > Sure, test results and patches are very welcome. > > I've touched dirty bitmap code for the first time, so I've sent out this > rfc for early feedback at the approach taken before going out converting > more display drivers. Seems nobody found patch #2 horrible so far, I > take that as good sign that we can go on with this draft API. > >> Out of interest, from your work do you have a rough estimate as to how >> this affects guest performance? For example benchmarks with Jan's >> original commit to reduce the locking vs. Alex's current workaround vs. >> with this patchset applied? > > I expect this improves performance with mttcg because the locking goes > away, which allows more concurrency, probably more visible with more > guest cores. Didn't benchmark it though. I checked the branch, is bitmap_copy_and_clear_atomic correct when you have partial updates? Maybe you need to handle partial updates of the first and last word? Also, the first and last bit should optionally be left untouched, for example: first y: 20 last y: 99 stride: 320 bytes read: 6400..31999 snapshotted pages: 4096..32767 (copying all of word 0 would do) cleared dirty bits: 8192..28671 (only bits 1..14 must be zeroed) Thanks, Paolo
Hi, > I checked the branch, is bitmap_copy_and_clear_atomic correct when you > have partial updates? Maybe you need to handle partial updates of the > first and last word? Should not be a problem. We might clear some more bits, but these are outsize the visible area so they should cause visible corruption (and if the visible area changes the display code needs to do a full refresh anyway). cheers, Gerd
On 03/04/17 13:03, Gerd Hoffmann wrote: > Hi, > >> I checked the branch, is bitmap_copy_and_clear_atomic correct when you >> have partial updates? Maybe you need to handle partial updates of the >> first and last word? > > Should not be a problem. We might clear some more bits, but these are > outsize the visible area so they should cause visible corruption (and if > the visible area changes the display code needs to do a full refresh > anyway). Right. Also is there a reason that cpu_physical_memory_snapshot_and_clear_dirty() uses BITS_PER_LEVEL when calculating the alignment used for the first/last addresses? I would have thought you can remove the align variable and use: ram_addr_t first = QEMU_ALIGN_DOWN(start, TARGET_PAGE_SIZE); ram_addr_t last = QEMU_ALIGN_UP(start + length, TARGET_PAGE_SIZE); Otherwise you end up expanding the range of your last address considerably beyond the next page alignment. And also in the main page loop why is BITS_PER_LEVEL used? I see that several of the internal bitmap routines appear to use BITS_PER_LONG for compressing into the bitmap which might be more appropriate? ATB, Mark.
On Mo, 2017-04-03 at 13:24 +0100, Mark Cave-Ayland wrote: > On 03/04/17 13:03, Gerd Hoffmann wrote: > > > Hi, > > > >> I checked the branch, is bitmap_copy_and_clear_atomic correct when you > >> have partial updates? Maybe you need to handle partial updates of the > >> first and last word? > > > > Should not be a problem. We might clear some more bits, but these are > > outsize the visible area so they should cause visible corruption (and if > > the visible area changes the display code needs to do a full refresh > > anyway). > > Right. Also is there a reason that > cpu_physical_memory_snapshot_and_clear_dirty() uses BITS_PER_LEVEL when > calculating the alignment used for the first/last addresses? The code copy doesn't operate on bits but on bitmap longs, so I can just copy the whole long instead of fiddeling with bits. So I round down to a bitmap long start. On 64bit hosts that are units of 64 pages. Actually querying the copied bitmap operates on page granularity of course. > Otherwise you end up expanding the range of your last address > considerably beyond the next page alignment. Yes, it's larger, but as it expands to the start of the long word I expect the penalty is almost zero (same cache line) and being able to just copy the whole long instead of dealing with individual bits should be a win. > And also in the main page > loop why is BITS_PER_LEVEL used? I see that several of the internal > bitmap routines appear to use BITS_PER_LONG for compressing into the > bitmap which might be more appropriate? shift by BITS_PER_LEVEL is the same as division by BITS_PER_LONG cheers, Gerd
On 03/04/17 13:42, Gerd Hoffmann wrote: > On Mo, 2017-04-03 at 13:24 +0100, Mark Cave-Ayland wrote: >> On 03/04/17 13:03, Gerd Hoffmann wrote: >> >>> Hi, >>> >>>> I checked the branch, is bitmap_copy_and_clear_atomic correct when you >>>> have partial updates? Maybe you need to handle partial updates of the >>>> first and last word? >>> >>> Should not be a problem. We might clear some more bits, but these are >>> outsize the visible area so they should cause visible corruption (and if >>> the visible area changes the display code needs to do a full refresh >>> anyway). >> >> Right. Also is there a reason that >> cpu_physical_memory_snapshot_and_clear_dirty() uses BITS_PER_LEVEL when >> calculating the alignment used for the first/last addresses? > > The code copy doesn't operate on bits but on bitmap longs, so I can just > copy the whole long instead of fiddeling with bits. So I round down to > a bitmap long start. On 64bit hosts that are units of 64 pages. > > Actually querying the copied bitmap operates on page granularity of > course. > >> Otherwise you end up expanding the range of your last address >> considerably beyond the next page alignment. > > Yes, it's larger, but as it expands to the start of the long word I > expect the penalty is almost zero (same cache line) and being able to > just copy the whole long instead of dealing with individual bits should > be a win. > >> And also in the main page >> loop why is BITS_PER_LEVEL used? I see that several of the internal >> bitmap routines appear to use BITS_PER_LONG for compressing into the >> bitmap which might be more appropriate? > > shift by BITS_PER_LEVEL is the same as division by BITS_PER_LONG Right hopefully I understand this now. The following diff appears to fix CG3/TCX for me: diff --git a/exec.c b/exec.c index 000af18..66bdcf4 100644 --- a/exec.c +++ b/exec.c @@ -1071,7 +1071,7 @@ DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty (ram_addr_t start, ram_addr_t length, unsigned client) { DirtyMemoryBlocks *blocks; - unsigned long align = 1 << (TARGET_PAGE_BITS + BITS_PER_LEVEL); + unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL); ram_addr_t first = QEMU_ALIGN_DOWN(start, align); ram_addr_t last = QEMU_ALIGN_UP(start + length, align); DirtyBitmapSnapshot *snap; @@ -1097,7 +1097,6 @@ DirtyBitmapSnapshot *cpu_physical_memory_snapshot_and_clear_dirty assert(QEMU_IS_ALIGNED(offset, (1 << BITS_PER_LEVEL))); assert(QEMU_IS_ALIGNED(num, (1 << BITS_PER_LEVEL))); - offset >>= BITS_PER_LEVEL; bitmap_copy_and_clear_atomic(snap->dirty + dest, blocks->blocks[idx] + (offset >> BITS_PER_LEVEL), There were 2 issues here: without the UL suffix on align I was getting incorrect first/last addresses since the high bits of align weren't being cleared, and then offset appeared to be shifted twice. Does this look reasonable to you? ATB, Mark.
Hi, > - unsigned long align = 1 << (TARGET_PAGE_BITS + BITS_PER_LEVEL); > + unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL); > There were 2 issues here: without the UL suffix on align I was getting > incorrect first/last addresses since the high bits of align weren't > being cleared, Ah, thanks, I'll add that. > and then offset appeared to be shifted twice. Yep, noticed that too meanwhile, fixed in the branch pushed half an hour ago. I've dropped the other shift though ;) cheers, Gerd
On 04/04/17 07:12, Gerd Hoffmann wrote: > Hi, > >> - unsigned long align = 1 << (TARGET_PAGE_BITS + BITS_PER_LEVEL); >> + unsigned long align = 1UL << (TARGET_PAGE_BITS + BITS_PER_LEVEL); > >> There were 2 issues here: without the UL suffix on align I was getting >> incorrect first/last addresses since the high bits of align weren't >> being cleared, > > Ah, thanks, I'll add that. > >> and then offset appeared to be shifted twice. > > Yep, noticed that too meanwhile, fixed in the branch pushed half an hour > ago. I've dropped the other shift though ;) Confirmed! The only minor nit I've noticed is that commit 322aef7 "cg3: fix up size parameter for memory_region_get_dirty()" isn't quite right now that the asserts() in cpu_physical_memory_snapshot_get_dirty() have now been fixed - the size should now be "width" rather than "width - 1". Other than that, I've just given it a quick spin across my SPARC CG3/TCX and PPC VGA images and it looks good here :) ATB, Mark.
On 03/04/2017 14:03, Gerd Hoffmann wrote: > We might clear some more bits, but these are > outsize the visible area so they should cause visible corruption (and if > the visible area changes the display code needs to do a full refresh > anyway). True, though this makes the snapshot abstraction a little more specialized. Please add a FIXME comment with the explanation, at least. Paolo
On Mo, 2017-04-03 at 19:02 +0200, Paolo Bonzini wrote: > > On 03/04/2017 14:03, Gerd Hoffmann wrote: > > We might clear some more bits, but these are > > outsize the visible area so they should cause visible corruption (and if > > the visible area changes the display code needs to do a full refresh > > anyway). > > True, though this makes the snapshot abstraction a little more > specialized. Please add a FIXME comment with the explanation, at least. Sure, I'll add a note to the (to be written) docs. cheers, Gerd
© 2016 - 2024 Red Hat, Inc.