Hi all!
I know, I have several series waiting for a resend, but I had to switch
to another task spawned from our customer's bug.
Original problem: we use O_DIRECT for all VM images in our product; that is
the policy. The only exception is the backup target qcow2 image for
compressed backup, because compressed backup is extremely slow with
O_DIRECT (due to unaligned writes). The customer complains that backup
produces a lot of page cache.
So we can either implement some internal cache or use fadvise somehow.
Backup has several async workers, which write simultaneously, so either
way we have to track host cluster filling (before dropping the cache
corresponding to the cluster). So, if we have to track it anyway, let's
try to implement the cache.
The idea is simple: cache small unaligned writes and flush the cluster
when it is filled.
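
A minimal sketch of the idea follows, with hypothetical names and types;
the real code in block/qcow2-compressed-write-cache.c additionally deals
with concurrent writers, partially filled clusters (data_end), reads and
flushing:

#include <stdint.h>
#include <string.h>
#include <unistd.h>

typedef struct CachedCluster {
    int fd;                  /* target image */
    int64_t cluster_offset;  /* host offset of the cached cluster */
    int64_t cluster_size;
    int64_t bytes_filled;    /* how many bytes are cached so far */
    uint8_t *buf;            /* cluster_size bytes of cached data */
} CachedCluster;

/* Cache one small unaligned compressed write; once the whole cluster is
 * collected, write it out with a single aligned request. */
static int cached_cluster_write(CachedCluster *c, int64_t offset,
                                const void *data, size_t bytes)
{
    memcpy(c->buf + (offset - c->cluster_offset), data, bytes);
    c->bytes_filled += bytes;

    if (c->bytes_filled == c->cluster_size) {
        /* aligned write of the full cluster, O_DIRECT-friendly */
        ssize_t ret = pwrite(c->fd, c->buf, c->cluster_size,
                             c->cluster_offset);
        return ret == c->cluster_size ? 0 : -1;
    }
    return 0;   /* keep collecting */
}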
The performance results are very good (the numbers in the table are the
time, in seconds, of a compressed backup of a 1000M disk filled with ones):
---------------  -----------  -----------
                 backup(old)  backup(new)
ssd:hdd(direct)  3e+02        4.4
                              -99%
ssd:hdd(cached)  5.7          5.4
                              -5%
---------------  -----------  -----------
So we get a benefit even for cached mode! And the fastest variant is
O_DIRECT with the newly implemented cache. Therefore I suggest enabling
the new cache by default (which is done by the series).
Command to generate performance comparison table:
./scripts/simplebench/bench-backup.py --compressed --target-cache both --dir ssd:/ssd --dir hdd:/work --test ssd:hdd --env old:/work/src/qemu/master/build/qemu-system-x86_64 new:/work/src/qemu/up/qcow2-compressed-write-cache/build/qemu-system-x86_64
Vladimir Sementsov-Ogievskiy (7):
qemu/queue: add some useful QLIST_ and QTAILQ_ macros
block/qcow2: introduce cache for compressed writes
block/qcow2: use compressed write cache
simplebench: bench_one(): add slow_limit argument
simplebench: bench_one(): support count=1
simplebench/bench-backup: add --compressed option
simplebench/bench-backup: add target-cache argument
qapi/block-core.json | 8 +-
block/qcow2.h | 33 ++
include/qemu/queue.h | 14 +
block/qcow2-compressed-write-cache.c | 770 +++++++++++++++++++++++++
block/qcow2-refcount.c | 13 +
block/qcow2.c | 87 ++-
block/meson.build | 1 +
scripts/simplebench/bench-backup.py | 74 ++-
scripts/simplebench/bench_block_job.py | 33 +-
scripts/simplebench/simplebench.py | 34 +-
10 files changed, 1039 insertions(+), 28 deletions(-)
create mode 100644 block/qcow2-compressed-write-cache.c
--
2.29.2
Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Hi all!
>
> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>
> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>
> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.
>
> Idea is simple: cache small unaligned write and flush the cluster when filled.

I haven't had the time to properly look at the patches, but is there anything in it that is actually specific to compressed writes?

I'm asking because you may remember that a few years ago I talked at KVM Forum about how a data cache could be used for small unaligned (to cluster sizes) writes to reduce COW cost (mostly for sequential access where the other part of the cluster would be filled soon enough).

So if we're introducing some kind of data cache, wouldn't it be nice to use it even in the more general case instead of just restricting it to compression?

Kevin
10.02.2021 15:35, Kevin Wolf wrote:
> Am 29.01.2021 um 17:50 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Hi all!
>>
>> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>>
>> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>>
>> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.
>>
>> Idea is simple: cache small unaligned write and flush the cluster when filled.
>
> I haven't had the time to properly look at the patches, but is there anything in it that is actually specific to compressed writes?
>
> I'm asking because you may remember that a few years ago I talked at KVM Forum about how a data cache could be used for small unaligned (to cluster sizes) writes to reduce COW cost (mostly for sequential access where the other part of the cluster would be filled soon enough).
>
> So if we're introducing some kind of data cache, wouldn't it be nice to use it even in the more general case instead of just restricting it to compression?

Specific things are:

- setting data_end per cluster at some moment (so we flush the cluster when it is not full)
  In this case we align up the data_end, as we know that the remaining part of cluster is unused. But, that may be refactored as an option.

- wait for the whole cluster filled

So it can be reused for some sequential (more or less) copying process with unaligned chunks.. But different copying jobs in qemu always have aligned chunks, the only exclusion is copying to compressed target..

Still I intentionally implemented it in a separate file, and there no use of BDRVQcow2State, so it's simple enough to refactor and reuse if needed. I can rename it to "unaligned_copy_cache" or something like this.

--
Best regards,
Vladimir
Patchew URL: https://patchew.org/QEMU/20210129165030.640169-1-vsementsov@virtuozzo.com/
Hi,
This series seems to have some coding style problems. See output below for
more information:
Type: series
Message-id: 20210129165030.640169-1-vsementsov@virtuozzo.com
Subject: [PATCH 0/7] qcow2: compressed write cache
=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===
Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
5101d00..3701c07 master -> master
- [tag update] patchew/20210129110012.8660-1-peter.maydell@linaro.org -> patchew/20210129110012.8660-1-peter.maydell@linaro.org
* [new tag] patchew/20210129165030.640169-1-vsementsov@virtuozzo.com -> patchew/20210129165030.640169-1-vsementsov@virtuozzo.com
Switched to a new branch 'test'
d7783a4 simplebench/bench-backup: add target-cache argument
ddf4442 simplebench/bench-backup: add --compressed option
47ee627 simplebench: bench_one(): support count=1
2b80e33 simplebench: bench_one(): add slow_limit argument
acf2fb6 block/qcow2: use compressed write cache
d96e35f block/qcow2: introduce cache for compressed writes
0d009e1 qemu/queue: add some useful QLIST_ and QTAILQ_ macros
=== OUTPUT BEGIN ===
1/7 Checking commit 0d009e16280d (qemu/queue: add some useful QLIST_ and QTAILQ_ macros)
ERROR: spaces required around that '*' (ctx:WxV)
#25: FILE: include/qemu/queue.h:177:
+ typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var; \
^
WARNING: Block comments use a leading /* on a separate line
#29: FILE: include/qemu/queue.h:181:
+} while (/*CONSTCOND*/0)
ERROR: spaces required around that '*' (ctx:WxV)
#39: FILE: include/qemu/queue.h:501:
+ typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var; \
^
WARNING: Block comments use a leading /* on a separate line
#43: FILE: include/qemu/queue.h:505:
+} while (/*CONSTCOND*/0)
total: 2 errors, 2 warnings, 26 lines checked
Patch 1/7 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
2/7 Checking commit d96e35f1544f (block/qcow2: introduce cache for compressed writes)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#34:
new file mode 100644
total: 0 errors, 1 warnings, 816 lines checked
Patch 2/7 has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
3/7 Checking commit acf2fb60d5c2 (block/qcow2: use compressed write cache)
4/7 Checking commit 2b80e334b300 (simplebench: bench_one(): add slow_limit argument)
5/7 Checking commit 47ee62768317 (simplebench: bench_one(): support count=1)
6/7 Checking commit ddf44420a9e8 (simplebench/bench-backup: add --compressed option)
7/7 Checking commit d7783a45ee7b (simplebench/bench-backup: add target-cache argument)
=== OUTPUT END ===
Test command exited with code: 1
The full log is available at
http://patchew.org/logs/20210129165030.640169-1-vsementsov@virtuozzo.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
29.01.2021 20:30, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/20210129165030.640169-1-vsementsov@virtuozzo.com/
>
> Hi,
>
> This series seems to have some coding style problems. See output below for more information:
>
> Type: series
> Message-id: 20210129165030.640169-1-vsementsov@virtuozzo.com
> Subject: [PATCH 0/7] qcow2: compressed write cache
>
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> git rev-parse base > /dev/null || exit 0
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> ./scripts/checkpatch.pl --mailback base..
> === TEST SCRIPT END ===
>
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> From https://github.com/patchew-project/qemu
> 5101d00..3701c07 master -> master
> - [tag update] patchew/20210129110012.8660-1-peter.maydell@linaro.org -> patchew/20210129110012.8660-1-peter.maydell@linaro.org
> * [new tag] patchew/20210129165030.640169-1-vsementsov@virtuozzo.com -> patchew/20210129165030.640169-1-vsementsov@virtuozzo.com
> Switched to a new branch 'test'
> d7783a4 simplebench/bench-backup: add target-cache argument
> ddf4442 simplebench/bench-backup: add --compressed option
> 47ee627 simplebench: bench_one(): support count=1
> 2b80e33 simplebench: bench_one(): add slow_limit argument
> acf2fb6 block/qcow2: use compressed write cache
> d96e35f block/qcow2: introduce cache for compressed writes
> 0d009e1 qemu/queue: add some useful QLIST_ and QTAILQ_ macros
>
> === OUTPUT BEGIN ===
> 1/7 Checking commit 0d009e16280d (qemu/queue: add some useful QLIST_ and QTAILQ_ macros)
> ERROR: spaces required around that '*' (ctx:WxV)
> #25: FILE: include/qemu/queue.h:177:
> + typeof(*QLIST_FIRST(head)) *qffs_var, *qffs_next_var; \
> ^
>
> WARNING: Block comments use a leading /* on a separate line
> #29: FILE: include/qemu/queue.h:181:
> +} while (/*CONSTCOND*/0)
>
> ERROR: spaces required around that '*' (ctx:WxV)
> #39: FILE: include/qemu/queue.h:501:
> + typeof(*QTAILQ_FIRST(head)) *qffs_var, *qffs_next_var; \
> ^
>
> WARNING: Block comments use a leading /* on a separate line
> #43: FILE: include/qemu/queue.h:505:
> +} while (/*CONSTCOND*/0)

all false positive

--
Best regards,
Vladimir
On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>
> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>
> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.

I wanted to be excited here, because that sounds like it would be very easy to implement caching. Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.

But then I see like 900 new lines of code, and I’m much less excited...

> Idea is simple: cache small unaligned write and flush the cluster when filled.
>
> Performance result is very good (results in a table is time of compressed backup of 1000M disk filled with ones in seconds):

“Filled with ones” really is an edge case, though.

> ---------------  -----------  -----------
>                  backup(old)  backup(new)
> ssd:hdd(direct)  3e+02        4.4
>                               -99%
> ssd:hdd(cached)  5.7          5.4
>                               -5%
> ---------------  -----------  -----------
>
> So, we have benefit even for cached mode! And the fastest thing is O_DIRECT with new implemented cache. So, I suggest to enable the new cache by default (which is done by the series).

First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.

Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache. So the no-cache case just became faster because there is a cache now.

Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes. That conclusion makes both of my complaints kind of moot.

*shrug*

Third, what is the real-world impact on the page cache? You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much. How much is the difference really? (I don’t know how good the compression ratio is for real-world images.)

Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)

(I don’t remember why that qemu-img discussion died back then.)

Fourth, regarding the code, would it be simpler if it were a pure write cache? I.e., on read, everything is flushed, so we don’t have to deal with that. I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler. I can imagine that reading from the cache is the least bit of complexity, but perhaps...)

Max
09.02.2021 16:25, Max Reitz wrote:
> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>> Hi all!
>>
>> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>>
>> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>>
>> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.
>
> I wanted to be excited here, because that sounds like it would be very easy to implement caching. Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.

The problem is that chunks are written asynchronously.. That's why this all is not so easy.

> But then I see like 900 new lines of code, and I’m much less excited...
>
>> Idea is simple: cache small unaligned write and flush the cluster when filled.
>>
>> Performance result is very good (results in a table is time of compressed backup of 1000M disk filled with ones in seconds):
>
> “Filled with ones” really is an edge case, though.

Yes, I think, all clusters are compressed to rather small chunks :)

>> ---------------  -----------  -----------
>>                  backup(old)  backup(new)
>> ssd:hdd(direct)  3e+02        4.4
>>                               -99%
>> ssd:hdd(cached)  5.7          5.4
>>                               -5%
>> ---------------  -----------  -----------
>>
>> So, we have benefit even for cached mode! And the fastest thing is O_DIRECT with new implemented cache. So, I suggest to enable the new cache by default (which is done by the series).
>
> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.

compressed backup is a point

> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache. So the no-cache case just became faster because there is a cache now.

Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.

> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.

Yes I thought in this way, enabling the cache by default.

> That conclusion makes both of my complaints kind of moot.
>
> *shrug*
>
> Third, what is the real-world impact on the page cache? You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much. How much is the difference really? (I don’t know how good the compression ratio is for real-world images.)

Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.

> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)

The problem is that writes are async. And therefore, not sequential. So I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.

> (I don’t remember why that qemu-img discussion died back then.)
>
> Fourth, regarding the code, would it be simpler if it were a pure write cache? I.e., on read, everything is flushed, so we don’t have to deal with that. I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler. I can imagine that reading from the cache is the least bit of complexity, but perhaps...)

Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.

--
Best regards,
Vladimir
On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 16:25, Max Reitz wrote:
>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>> Hi all!
>>>
>>> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>>>
>>> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>>>
>>> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.
>>
>> I wanted to be excited here, because that sounds like it would be very easy to implement caching. Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.
>
> The problem is that chunks are written asynchronously.. That's why this all is not so easy.
>
>> But then I see like 900 new lines of code, and I’m much less excited...
>>
>>> Idea is simple: cache small unaligned write and flush the cluster when filled.
>>>
>>> Performance result is very good (results in a table is time of compressed backup of 1000M disk filled with ones in seconds):
>>
>> “Filled with ones” really is an edge case, though.
>
> Yes, I think, all clusters are compressed to rather small chunks :)
>
>>> ---------------  -----------  -----------
>>>                  backup(old)  backup(new)
>>> ssd:hdd(direct)  3e+02        4.4
>>>                               -99%
>>> ssd:hdd(cached)  5.7          5.4
>>>                               -5%
>>> ---------------  -----------  -----------
>>>
>>> So, we have benefit even for cached mode! And the fastest thing is O_DIRECT with new implemented cache. So, I suggest to enable the new cache by default (which is done by the series).
>>
>> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.
>
> compressed backup is a point

(Perhaps irrelevant, but just to be clear:) I meant the point of using O_DIRECT, which one can decide to not use for backup targets (as you have done already).

>> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache. So the no-cache case just became faster because there is a cache now.
>
> Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.

(Again, perhaps irrelevant, but:) Yes, but my first point was exactly whether O_DIRECT is even relevant for writing compressed images.

>> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.
>
> Yes I thought in this way, enabling the cache by default.
>
>> That conclusion makes both of my complaints kind of moot.
>>
>> *shrug*
>>
>> Third, what is the real-world impact on the page cache? You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much. How much is the difference really? (I don’t know how good the compression ratio is for real-world images.)
>
> Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.

Oh, sorry, I just realized I had a brain fart there. I was referring to whether this series improves the page cache pollution. But obviously it will if it allows you to re-enable O_DIRECT.

>> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)
>
> The problem is that writes are async. And therefore, not sequential.

In theory, yes, but all compressed writes still goes through qcow2_alloc_bytes() right before submitting the write, so I wonder whether in practice the writes aren’t usually sufficiently sequential to make POSIX_FADV_SEQUENTIAL work fine.

> So I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.
>
>> (I don’t remember why that qemu-img discussion died back then.)
>>
>> Fourth, regarding the code, would it be simpler if it were a pure write cache? I.e., on read, everything is flushed, so we don’t have to deal with that. I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler. I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
>
> Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.

That’s too bad.

So the only question I have left is what POSIX_FADV_SEQUENTIAL actually would do in practice.

(But even then, the premise just doesn’t motivate me sufficiently yet...)

Max
09.02.2021 17:47, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know, I have several series waiting for a resend, but I had to switch to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all vm images in our product, it's the policy. The only exclusion is backup target qcow2 image for compressed backup, because compressed backup is extremely slow with O_DIRECT (due to unaligned writes). Customer complains that backup produces a lot of pagecache.
>>>>
>>>> So we can either implement some internal cache or use fadvise somehow. Backup has several async workes, which writes simultaneously, so in both ways we have to track host cluster filling (before dropping the cache corresponding to the cluster). So, if we have to track anyway, let's try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be very easy to implement caching. Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.
>>
>> The problem is that chunks are written asynchronously.. That's why this all is not so easy.
>>
>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>
>>>> Idea is simple: cache small unaligned write and flush the cluster when filled.
>>>>
>>>> Performance result is very good (results in a table is time of compressed backup of 1000M disk filled with ones in seconds):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think, all clusters are compressed to rather small chunks :)
>>
>>>> ---------------  -----------  -----------
>>>>                  backup(old)  backup(new)
>>>> ssd:hdd(direct)  3e+02        4.4
>>>>                               -99%
>>>> ssd:hdd(cached)  5.7          5.4
>>>>                               -5%
>>>> ---------------  -----------  -----------
>>>>
>>>> So, we have benefit even for cached mode! And the fastest thing is O_DIRECT with new implemented cache. So, I suggest to enable the new cache by default (which is done by the series).
>>>
>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.
>>
>> compressed backup is a point
>
> (Perhaps irrelevant, but just to be clear:) I meant the point of using O_DIRECT, which one can decide to not use for backup targets (as you have done already).
>
>>> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache. So the no-cache case just became faster because there is a cache now.
>>
>> Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.
>
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly whether O_DIRECT is even relevant for writing compressed images.
>
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.
>>
>> Yes I thought in this way, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache? You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much. How much is the difference really? (I don’t know how good the compression ratio is for real-world images.)
>>
>> Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.
>
> Oh, sorry, I just realized I had a brain fart there. I was referring to whether this series improves the page cache pollution. But obviously it will if it allows you to re-enable O_DIRECT.
>
>>> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)
>>
>> The problem is that writes are async. And therefore, not sequential.
>
> In theory, yes, but all compressed writes still goes through qcow2_alloc_bytes() right before submitting the write, so I wonder whether in practice the writes aren’t usually sufficiently sequential to make POSIX_FADV_SEQUENTIAL work fine.

Yes, allocation is sequential. But writes are not.. Reasonable, I should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for the whole backup target before the backup? Will try. Still, I expect that my cache will show better performance anyway. Actually, comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster, which is significant (still, yes, would be good to check it on more real case than all-ones).

>> So I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.
>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure write cache? I.e., on read, everything is flushed, so we don’t have to deal with that. I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler. I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
>>
>> Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.
>
> That’s too bad.
>
> So the only question I have left is what POSIX_FADV_SEQUENTIAL actually would do in practice.

will check.

> (But even then, the premise just doesn’t motivate me sufficiently yet...)

--
Best regards,
Vladimir
09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 17:47, Max Reitz wrote:
>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 16:25, Max Reitz wrote:
>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I know, I have several series waiting for a resend, but I had to switch
>>>>> to another task spawned from our customer's bug.
>>>>>
>>>>> Original problem: we use O_DIRECT for all vm images in our product, it's
>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>> compressed backup, because compressed backup is extremely slow with
>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>> produces a lot of pagecache.
>>>>>
>>>>> So we can either implement some internal cache or use fadvise somehow.
>>>>> Backup has several async workes, which writes simultaneously, so in both
>>>>> ways we have to track host cluster filling (before dropping the cache
>>>>> corresponding to the cluster). So, if we have to track anyway, let's
>>>>> try to implement the cache.
>>>>
>>>> I wanted to be excited here, because that sounds like it would be very easy to implement caching. Like, just keep the cluster at free_byte_offset cached until the cluster it points to changes, then flush the cluster.
>>>
>>> The problem is that chunks are written asynchronously.. That's why this all is not so easy.
>>>
>>>>
>>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>>
>>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>>> filled.
>>>>>
>>>>> Performance result is very good (results in a table is time of
>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>
>>>> “Filled with ones” really is an edge case, though.
>>>
>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>
>>>>
>>>>> --------------- ----------- -----------
>>>>> backup(old) backup(new)
>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>> -99%
>>>>> ssd:hdd(cached) 5.7 5.4
>>>>> -5%
>>>>> --------------- ----------- -----------
>>>>>
>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>> cache by default (which is done by the series).
>>>>
>>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t really see the point for writing compressed images.
>>>
>>> compressed backup is a point
>>
>> (Perhaps irrelevant, but just to be clear:) I meant the point of using O_DIRECT, which one can decide to not use for backup targets (as you have done already).
>>
>>>> Second, I find it a bit cheating if you say there is a huge improvement for the no-cache case, when actually, well, you just added a cache. So the no-cache case just became faster because there is a cache now.
>>>
>>> Still, performance comparison is relevant to show that O_DIRECT as is unusable for compressed backup.
>>
>> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly whether O_DIRECT is even relevant for writing compressed images.
>>
>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much sense for compressed images, qemu’s format drivers are free to introduce some caching (because technically the cache.direct option only applies to the protocol driver) for collecting compressed writes.
>>>
>>> Yes I thought in this way, enabling the cache by default.
>>>
>>>> That conclusion makes both of my complaints kind of moot.
>>>>
>>>> *shrug*
>>>>
>>>> Third, what is the real-world impact on the page cache? You described that that’s the reason why you need the cache in qemu, because otherwise the page cache is polluted too much. How much is the difference really? (I don’t know how good the compression ratio is for real-world images.)
>>>
>>> Hm. I don't know the ratio.. Customer reported that most of RAM is polluted by Qemu's cache, and we use O_DIRECT for everything except for target of compressed backup.. Still the pollution may relate to several backups and of course it is simple enough to drop the cache after each backup. But I think that even one backup of 16T disk may pollute RAM enough.
>>
>> Oh, sorry, I just realized I had a brain fart there. I was referring to whether this series improves the page cache pollution. But obviously it will if it allows you to re-enable O_DIRECT.
>>
>>>> Related to that, I remember a long time ago we had some discussion about letting qemu-img convert set a special cache mode for the target image that would make Linux drop everything before the last offset written (i.e., I suppose fadvise() with POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact that implementing a cache in qemu would be simple, but it isn’t, really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One advantage of using that would be that we could reuse it for non-compressed images that are written by backup or qemu-img convert.)
>>>
>>> The problem is that writes are async. And therefore, not sequential.
>>
>> In theory, yes, but all compressed writes still goes through qcow2_alloc_bytes() right before submitting the write, so I wonder whether in practice the writes aren’t usually sufficiently sequential to make POSIX_FADV_SEQUENTIAL work fine.
>
> Yes, allocation is sequential. But writes are not.. Reasonable, I should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for the whole backup target before the backup? Will try. Still, I expect that my cache will show better performance anyway. Actually, comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster, which is significant (still, yes, would be good to check it on more real case than all-ones).
>
>>
>>> So
>>> I have to track the writes and wait until the whole cluster is filled. It's simple use fadvise as an option to my cache: instead of caching data and write when cluster is filled we can instead mark cluster POSIX_FADV_DONTNEED.
>>>
>>>>
>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>
>>>>
>>>> Fourth, regarding the code, would it be simpler if it were a pure write cache? I.e., on read, everything is flushed, so we don’t have to deal with that. I don’t think there are many valid cases where a compressed image is both written to and read from at the same time. (Just asking, because I’d really want this code to be simpler. I can imagine that reading from the cache is the least bit of complexity, but perhaps...)
>>>>
>>>
>>> Hm. I really didn't want to support reads, and do it only to make it possible to enable the cache by default.. Still read function is really simple, and I don't think that dropping it will simplify the code significantly.
>>
>> That’s too bad.
>>
>> So the only question I have left is what POSIX_FADV_SEQUENTIAL actually would do in practice.
>
> will check.
>
Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not removed.
Test:
[root@kvm fadvise]# cat a.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <getopt.h>
#include <string.h>
#include <stdbool.h>
int main(int argc, char *argv[])
{
    int fd;
    int i;
    char mb[1024 * 1024];
    int open_flags = O_RDWR | O_CREAT | O_EXCL;
    int fadv_flags = 0;
    int fadv_final_flags = 0;
    int len = 1024 * 1024;
    bool do_fsync = false;

    for (i = 1; i < argc - 1; i++) {
        const char *arg = argv[i];

        if (!strcmp(arg, "direct")) {
            open_flags |= O_DIRECT;
        } else if (!strcmp(arg, "seq")) {
            fadv_flags = POSIX_FADV_SEQUENTIAL;
        } else if (!strcmp(arg, "dontneed")) {
            fadv_flags = POSIX_FADV_DONTNEED;
        } else if (!strcmp(arg, "final-dontneed")) {
            fadv_final_flags = POSIX_FADV_DONTNEED;
        } else if (!strcmp(arg, "fsync")) {
            do_fsync = true;
        } else {
            fprintf(stderr, "unknown: %s\n", arg);
            return 1;
        }
    }

    fd = open(argv[argc - 1], open_flags, 0644);

    if (fd < 0) {
        fprintf(stderr, "failed to open\n");
        return 1;
    }

    if (fadv_flags) {
        posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
    }
    for (i = 0; i < 100; i++) {
        write(fd, mb, len);
    }
    if (fadv_final_flags) {
        posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
    }
    if (do_fsync) {
        fsync(fd);
    }

    close(fd);
}
[root@kvm fadvise]# gcc a.c
[root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
RES PAGES SIZE FILE
100M 25600 100M x
[root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
RES PAGES SIZE FILE
100M 25600 100M x
[root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
RES PAGES SIZE FILE
36M 9216 100M x
[root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
RES PAGES SIZE FILE
100M 25600 100M x
[root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
RES PAGES SIZE FILE
100M 25600 100M x
[root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
RES PAGES SIZE FILE
36M 9216 100M x
[root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
RES PAGES SIZE FILE
0B 0 0B x
Backup-generated page cache is effectively trash: it will never be used. And it's bad that it can displace other, useful page cache. So the best thing for backup is direct mode + the proposed cache.
--
Best regards,
Vladimir
On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch
>>>>>> to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>> product, it's
>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>> produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow.
>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>> in both
>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>> cache
>>>>>> corresponding to the cluster). So, if we have to track anyway,
>>>>>> let's
>>>>>> try to implement the cache.
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously.. That's why
>>>> this all is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>> excited...
>>>>>
>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>> when
>>>>>> filled.
>>>>>>
>>>>>> Performance result is very good (results in a table is time of
>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>
>>>>> “Filled with ones” really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> --------------- ----------- -----------
>>>>>> backup(old) backup(new)
>>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>>> -99%
>>>>>> ssd:hdd(cached) 5.7 5.4
>>>>>> -5%
>>>>>> --------------- ----------- -----------
>>>>>>
>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>> cache by default (which is done by the series).
>>>>>
>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>> don’t really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide to not use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache. So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>> is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes I thought in this way, enabling the cache by default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache? You
>>>>> described that that’s the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much. How much
>>>>> is the difference really? (I don’t know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>> for target of compressed backup.. Still the pollution may relate to
>>>> several backups and of course it is simple enough to drop the cache
>>>> after each backup. But I think that even one backup of 16T disk may
>>>> pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there. I was
>>> referring to whether this series improves the page cache pollution.
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still goes through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren’t usually sufficiently
>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>
>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually,
>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>> 20% faster, which is significant (still, yes, would be good to check
>> it on more real case than all-ones).
>>
>>>
>>>> So
>>>> I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>> of caching data and write when cluster is filled we can instead
>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>
>>>>>
>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache? I.e., on read, everything is flushed, so we don’t
>>>>> have to deal with that. I don’t think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I’d really want this code to be
>>>>> simpler. I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and do it only to make
>>>> it possible to enable the cache by default.. Still read function is
>>>> really simple, and I don't think that dropping it will simplify the
>>>> code significantly.
>>>
>>> That’s too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
> removed.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
> int fd;
> int i;
> char mb[1024 * 1024];
> int open_flags = O_RDWR | O_CREAT | O_EXCL;
> int fadv_flags = 0;
> int fadv_final_flags = 0;
> int len = 1024 * 1024;
> bool do_fsync = false;
>
> for (i = 1; i < argc - 1; i++) {
> const char *arg = argv[i];
>
> if (!strcmp(arg, "direct")) {
> open_flags |= O_DIRECT;
> } else if (!strcmp(arg, "seq")) {
> fadv_flags = POSIX_FADV_SEQUENTIAL;
> } else if (!strcmp(arg, "dontneed")) {
> fadv_flags = POSIX_FADV_DONTNEED;
> } else if (!strcmp(arg, "final-dontneed")) {
> fadv_final_flags = POSIX_FADV_DONTNEED;
> } else if (!strcmp(arg, "fsync")) {
> do_fsync = true;
> } else {
> fprintf(stderr, "unknown: %s\n", arg);
> return 1;
> }
> }
>
> fd = open(argv[argc - 1], open_flags);
>
> if (fd < 0) {
> fprintf(stderr, "failed to open\n");
> return 1;
> }
>
> if (fadv_flags) {
> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
> }
> for (i = 0; i < 100; i++) {
> write(fd, mb, len);
> }
> if (fadv_final_flags) {
> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
> }
> if (do_fsync) {
> fsync(fd);
> }
>
> close(fd);
> }
>
>
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
> RES PAGES SIZE FILE
> 36M 9216 100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
> RES PAGES SIZE FILE
> 36M 9216 100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
> RES PAGES SIZE FILE
> 0B 0 0B x
>
>
>
> Backup-generated pagecache is a formal trash, it will be never used.
> And it's bad that it can displace another good pagecache. So the best
> thing for backup is direct mode + proposed cache.
>
>
I think that Max's original intention was about POSIX_FADV_DONTNEED.
One should call this fadvise just for fully written clusters. Though this
is a bit fragile and, from the performance point of view, will be slower.
The call could be slow, and thus it should be delegated to a coroutine.
We have thought about this, but the amount of work to implement it, even
if it seems lower, is not really lower, and the result would not be as
great.
Den
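
For illustration, the per-cluster fadvise approach described above would amount to something like the sketch below (the helper name is hypothetical, plain POSIX, not code from the series). As the fincore experiment earlier in the thread shows, POSIX_FADV_DONTNEED only drops pages that have already been written back, and the call itself may block, which is why it would have to be taken off the fast path (e.g. delegated to a coroutine, as suggested):

#include <fcntl.h>
#include <stdint.h>

/* Hypothetical helper: let the kernel cache the unaligned writes and drop
 * the page cache of one host cluster once that cluster has been fully
 * written (and its data has reached the disk; dirty pages are not dropped). */
static void cluster_drop_pagecache(int fd, int64_t cluster_offset,
                                   int64_t cluster_size)
{
    posix_fadvise(fd, cluster_offset, cluster_size, POSIX_FADV_DONTNEED);
}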
09.02.2021 21:41, Denis V. Lunev wrote:
> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 17:47, Max Reitz wrote:
>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> Hi all!
>>>>>>>
>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>> switch
>>>>>>> to another task spawned from our customer's bug.
>>>>>>>
>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>> product, it's
>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>> produces a lot of pagecache.
>>>>>>>
>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>> somehow.
>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>> in both
>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>> cache
>>>>>>> corresponding to the cluster). So, if we have to track anyway,
>>>>>>> let's
>>>>>>> try to implement the cache.
>>>>>>
>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>> then flush the cluster.
>>>>>
>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>> this all is not so easy.
>>>>>
>>>>>>
>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>> excited...
>>>>>>
>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>> when
>>>>>>> filled.
>>>>>>>
>>>>>>> Performance result is very good (results in a table is time of
>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>
>>>>>> “Filled with ones” really is an edge case, though.
>>>>>
>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>
>>>>>>
>>>>>>> --------------- ----------- -----------
>>>>>>> backup(old) backup(new)
>>>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>>>> -99%
>>>>>>> ssd:hdd(cached) 5.7 5.4
>>>>>>> -5%
>>>>>>> --------------- ----------- -----------
>>>>>>>
>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>>> cache by default (which is done by the series).
>>>>>>
>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>> don’t really see the point for writing compressed images.
>>>>>
>>>>> compressed backup is a point
>>>>
>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>> (as you have done already).
>>>>
>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>> added a cache. So the no-cache case just became faster because
>>>>>> there is a cache now.
>>>>>
>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>> is unusable for compressed backup.
>>>>
>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>> images.
>>>>
>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>> introduce some caching (because technically the cache.direct
>>>>>> option only applies to the protocol driver) for collecting
>>>>>> compressed writes.
>>>>>
>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>
>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>
>>>>>> *shrug*
>>>>>>
>>>>>> Third, what is the real-world impact on the page cache? You
>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>> because otherwise the page cache is polluted too much. How much
>>>>>> is the difference really? (I don’t know how good the compression
>>>>>> ratio is for real-world images.)
>>>>>
>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>> several backups and of course it is simple enough to drop the cache
>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>> pollute RAM enough.
>>>>
>>>> Oh, sorry, I just realized I had a brain fart there. I was
>>>> referring to whether this series improves the page cache pollution.
>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>
>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>> target image that would make Linux drop everything before the last
>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>>> advantage of using that would be that we could reuse it for
>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>> convert.)
>>>>>
>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>
>>>> In theory, yes, but all compressed writes still goes through
>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>> whether in practice the writes aren’t usually sufficiently
>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>
>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>> the whole backup target before the backup? Will try. Still, I expect
>>> that my cache will show better performance anyway. Actually,
>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>> 20% faster, which is significant (still, yes, would be good to check
>>> it on more real case than all-ones).
>>>
>>>>
>>>>> So
>>>>> I have to track the writes and wait until the whole cluster is
>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>> of caching data and write when cluster is filled we can instead
>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>
>>>>>>
>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>
>>>>>>
>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>> write cache? I.e., on read, everything is flushed, so we don’t
>>>>>> have to deal with that. I don’t think there are many valid cases
>>>>>> where a compressed image is both written to and read from at the
>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>> simpler. I can imagine that reading from the cache is the least
>>>>>> bit of complexity, but perhaps...)
>>>>>>
>>>>>
>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>> it possible to enable the cache by default.. Still read function is
>>>>> really simple, and I don't think that dropping it will simplify the
>>>>> code significantly.
>>>>
>>>> That’s too bad.
>>>>
>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>> actually would do in practice.
>>>
>>> will check.
>>>
>>
>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>> removed.
>>
>> Test:
>> [root@kvm fadvise]# cat a.c
>> #define _GNU_SOURCE
>> #include <fcntl.h>
>> #include <unistd.h>
>> #include <stdio.h>
>> #include <getopt.h>
>> #include <string.h>
>> #include <stdbool.h>
>>
>> int main(int argc, char *argv[])
>> {
>> int fd;
>> int i;
>> char mb[1024 * 1024];
>> int open_flags = O_RDWR | O_CREAT | O_EXCL;
>> int fadv_flags = 0;
>> int fadv_final_flags = 0;
>> int len = 1024 * 1024;
>> bool do_fsync = false;
>>
>> for (i = 1; i < argc - 1; i++) {
>> const char *arg = argv[i];
>>
>> if (!strcmp(arg, "direct")) {
>> open_flags |= O_DIRECT;
>> } else if (!strcmp(arg, "seq")) {
>> fadv_flags = POSIX_FADV_SEQUENTIAL;
>> } else if (!strcmp(arg, "dontneed")) {
>> fadv_flags = POSIX_FADV_DONTNEED;
>> } else if (!strcmp(arg, "final-dontneed")) {
>> fadv_final_flags = POSIX_FADV_DONTNEED;
>> } else if (!strcmp(arg, "fsync")) {
>> do_fsync = true;
>> } else {
>> fprintf(stderr, "unknown: %s\n", arg);
>> return 1;
>> }
>> }
>>
>> fd = open(argv[argc - 1], open_flags);
>>
>> if (fd < 0) {
>> fprintf(stderr, "failed to open\n");
>> return 1;
>> }
>>
>> if (fadv_flags) {
>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>> }
>> for (i = 0; i < 100; i++) {
>> write(fd, mb, len);
>> }
>> if (fadv_final_flags) {
>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>> }
>> if (do_fsync) {
>> fsync(fd);
>> }
>>
>> close(fd);
>> }
>>
>>
>>
>> [root@kvm fadvise]# gcc a.c
>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>> RES PAGES SIZE FILE
>> 100M 25600 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>> RES PAGES SIZE FILE
>> 100M 25600 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>> RES PAGES SIZE FILE
>> 36M 9216 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>> RES PAGES SIZE FILE
>> 100M 25600 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>> RES PAGES SIZE FILE
>> 100M 25600 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>> RES PAGES SIZE FILE
>> 36M 9216 100M x
>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>> RES PAGES SIZE FILE
>> 0B 0 0B x
>>
>>
>>
>> Backup-generated pagecache is a formal trash, it will be never used.
>> And it's bad that it can displace another good pagecache. So the best
>> thing for backup is direct mode + proposed cache.
>>
>>
> I think that the original intention of Max is about POSIX_FADV_DONTNEED
> One should call this fadvise for just fully written cluster.
This should work, but:
- as we see from the test above, POSIX_FADV_DONTNEED doesn't remove the whole cache (see final-dontneed)
- as I said, we'll have to track writes anyway, so the cache will be the same; just instead of the postponed-write operation we'll do fadvise.
Hmm. Still, in this way, we will not need some of the difficult things in my proposed cache.
So, it may be reasonable to at least split the big patch so that
- the first part brings write / full-cluster tracking and fadvise
- the second part adds a caching-writes option, the corresponding flush code and the additional performance gain
Does it make sense?
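To make that split concrete, here is a rough standalone sketch of the
first part, i.e. per-cluster fill tracking with POSIX_FADV_DONTNEED once
a host cluster is completely written (ClusterFill and
cluster_fill_account are invented names, a 64k cluster size is assumed,
and as the test above shows, still-dirty pages may survive the advise
until they are written back):

/*
 * Hypothetical sketch, not code from the series: account compressed-write
 * bytes per host cluster and advise the kernel once a cluster is full.
 * The caller is assumed to size cf->filled to cf->nb_clusters entries.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

#define CLUSTER_SIZE (64 * 1024)        /* assumed cluster size */

typedef struct ClusterFill {
    int fd;                             /* backup target image */
    int64_t nb_clusters;
    uint32_t *filled;                   /* bytes written per host cluster */
} ClusterFill;

static void cluster_fill_account(ClusterFill *cf, int64_t offset,
                                 int64_t bytes)
{
    while (bytes > 0) {
        int64_t idx = offset / CLUSTER_SIZE;
        int64_t in_cluster = CLUSTER_SIZE - offset % CLUSTER_SIZE;
        int64_t chunk = bytes < in_cluster ? bytes : in_cluster;

        cf->filled[idx] += chunk;
        if (cf->filled[idx] == CLUSTER_SIZE) {
            /*
             * The cluster is completely written and backup never reads it
             * back, so ask the kernel to drop its pages.  (Per the
             * final-dontneed test above, pages not yet written back stay.)
             */
            posix_fadvise(cf->fd, idx * CLUSTER_SIZE, CLUSTER_SIZE,
                          POSIX_FADV_DONTNEED);
        }

        offset += chunk;
        bytes -= chunk;
    }
}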
> Though this is
> a bit buggy and from performance point of view will be slower.
>
> This call could be slow and thus it should be created as delegate to
> co-routine. We have though on this, but the amount of work to
> implement even if seems lower, is not really lower and the result
> would not be as great.
>
> Den
>
--
Best regards,
Vladimir
On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 21:41, Denis V. Lunev wrote:
>> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 17:47, Max Reitz wrote:
>>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> Hi all!
>>>>>>>>
>>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>>> switch
>>>>>>>> to another task spawned from our customer's bug.
>>>>>>>>
>>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>>> product, it's
>>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>>> produces a lot of pagecache.
>>>>>>>>
>>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>>> somehow.
>>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>>> in both
>>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>>> cache
>>>>>>>> corresponding to the cluster). So, if we have to track anyway,
>>>>>>>> let's
>>>>>>>> try to implement the cache.
>>>>>>>
>>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>>> then flush the cluster.
>>>>>>
>>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>>> this all is not so easy.
>>>>>>
>>>>>>>
>>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>>> excited...
>>>>>>>
>>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>>> when
>>>>>>>> filled.
>>>>>>>>
>>>>>>>> Performance result is very good (results in a table is time of
>>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>>
>>>>>>> “Filled with ones” really is an edge case, though.
>>>>>>
>>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>>
>>>>>>>
>>>>>>>> --------------- ----------- -----------
>>>>>>>> backup(old) backup(new)
>>>>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>>>>> -99%
>>>>>>>> ssd:hdd(cached) 5.7 5.4
>>>>>>>> -5%
>>>>>>>> --------------- ----------- -----------
>>>>>>>>
>>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the
>>>>>>>> new
>>>>>>>> cache by default (which is done by the series).
>>>>>>>
>>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>>> don’t really see the point for writing compressed images.
>>>>>>
>>>>>> compressed backup is a point
>>>>>
>>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>>> (as you have done already).
>>>>>
>>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>>> added a cache. So the no-cache case just became faster because
>>>>>>> there is a cache now.
>>>>>>
>>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>>> is unusable for compressed backup.
>>>>>
>>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>>> images.
>>>>>
>>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>>> introduce some caching (because technically the cache.direct
>>>>>>> option only applies to the protocol driver) for collecting
>>>>>>> compressed writes.
>>>>>>
>>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>>
>>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>>
>>>>>>> *shrug*
>>>>>>>
>>>>>>> Third, what is the real-world impact on the page cache? You
>>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>>> because otherwise the page cache is polluted too much. How much
>>>>>>> is the difference really? (I don’t know how good the compression
>>>>>>> ratio is for real-world images.)
>>>>>>
>>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>>> several backups and of course it is simple enough to drop the cache
>>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>>> pollute RAM enough.
>>>>>
>>>>> Oh, sorry, I just realized I had a brain fart there. I was
>>>>> referring to whether this series improves the page cache pollution.
>>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>>
>>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>>> target image that would make Linux drop everything before the last
>>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>>>> advantage of using that would be that we could reuse it for
>>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>>> convert.)
>>>>>>
>>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>>
>>>>> In theory, yes, but all compressed writes still goes through
>>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>>> whether in practice the writes aren’t usually sufficiently
>>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>>
>>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>>> the whole backup target before the backup? Will try. Still, I expect
>>>> that my cache will show better performance anyway. Actually,
>>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>>> 20% faster, which is significant (still, yes, would be good to check
>>>> it on more real case than all-ones).
>>>>
>>>>>
>>>>>> So
>>>>>> I have to track the writes and wait until the whole cluster is
>>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>>> of caching data and write when cluster is filled we can instead
>>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>>
>>>>>>>
>>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>>
>>>>>>>
>>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>>> write cache? I.e., on read, everything is flushed, so we don’t
>>>>>>> have to deal with that. I don’t think there are many valid cases
>>>>>>> where a compressed image is both written to and read from at the
>>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>>> simpler. I can imagine that reading from the cache is the least
>>>>>>> bit of complexity, but perhaps...)
>>>>>>>
>>>>>>
>>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>>> it possible to enable the cache by default.. Still read function is
>>>>>> really simple, and I don't think that dropping it will simplify the
>>>>>> code significantly.
>>>>>
>>>>> That’s too bad.
>>>>>
>>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>>> actually would do in practice.
>>>>
>>>> will check.
>>>>
>>>
>>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>>> removed.
>>>
>>> Test:
>>> [root@kvm fadvise]# cat a.c
>>> #define _GNU_SOURCE
>>> #include <fcntl.h>
>>> #include <unistd.h>
>>> #include <stdio.h>
>>> #include <getopt.h>
>>> #include <string.h>
>>> #include <stdbool.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> int fd;
>>> int i;
>>> char mb[1024 * 1024];
>>> int open_flags = O_RDWR | O_CREAT | O_EXCL;
>>> int fadv_flags = 0;
>>> int fadv_final_flags = 0;
>>> int len = 1024 * 1024;
>>> bool do_fsync = false;
>>>
>>> for (i = 1; i < argc - 1; i++) {
>>> const char *arg = argv[i];
>>>
>>> if (!strcmp(arg, "direct")) {
>>> open_flags |= O_DIRECT;
>>> } else if (!strcmp(arg, "seq")) {
>>> fadv_flags = POSIX_FADV_SEQUENTIAL;
>>> } else if (!strcmp(arg, "dontneed")) {
>>> fadv_flags = POSIX_FADV_DONTNEED;
>>> } else if (!strcmp(arg, "final-dontneed")) {
>>> fadv_final_flags = POSIX_FADV_DONTNEED;
>>> } else if (!strcmp(arg, "fsync")) {
>>> do_fsync = true;
>>> } else {
>>> fprintf(stderr, "unknown: %s\n", arg);
>>> return 1;
>>> }
>>> }
>>>
>>> fd = open(argv[argc - 1], open_flags);
>>>
>>> if (fd < 0) {
>>> fprintf(stderr, "failed to open\n");
>>> return 1;
>>> }
>>>
>>> if (fadv_flags) {
>>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>>> }
>>> for (i = 0; i < 100; i++) {
>>> write(fd, mb, len);
>>> }
>>> if (fadv_final_flags) {
>>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>>> }
>>> if (do_fsync) {
>>> fsync(fd);
>>> }
>>>
>>> close(fd);
>>> }
>>>
>>>
>>>
>>> [root@kvm fadvise]# gcc a.c
>>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>>> RES PAGES SIZE FILE
>>> 100M 25600 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>>> RES PAGES SIZE FILE
>>> 100M 25600 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>>> RES PAGES SIZE FILE
>>> 36M 9216 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>>> RES PAGES SIZE FILE
>>> 100M 25600 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>>> RES PAGES SIZE FILE
>>> 100M 25600 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>>> RES PAGES SIZE FILE
>>> 36M 9216 100M x
>>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>>> RES PAGES SIZE FILE
>>> 0B 0 0B x
>>>
>>>
>>>
>>> Backup-generated pagecache is a formal trash, it will be never used.
>>> And it's bad that it can displace another good pagecache. So the best
>>> thing for backup is direct mode + proposed cache.
What a shame.
Thanks a lot for testing.
>> I think that the original intention of Max is about POSIX_FADV_DONTNEED
>> One should call this fadvise for just fully written cluster.
I had hoped that SEQUENTIAL would just work, magically.
> This should work, but:
>
> - as we see from test above, POSIX_FADV_DONTNEED don't remove the
> whole cache (see final-dontneed)
> - as I said we'll have to track writes, so the cache will be the same,
> just instead of postponed-write operation we'll do fadvise.
>
> Hmm. Still, in this way, we will not need some difficult things in my
> proposed cache.
>
> So, it may be reasonable to at least split the big patch so that
>
> - first part brings writes / full-cluster tracking and fadvice
>
> - second part adds caching-wrties option, corresponding flush code and
> additional performance
>
> Does it make sense?
I think the fadvise solution would have been nice if it gave us
something magical that we could also use for normal qemu-img convert (or
backup) operations, but as that doesn’t seem to be the case, I don’t
think it makes too much sense to implement something like that. (I
imagine doing fadvise also creates the need to implement a new
block-layer interface to call into file-posix and so on.)
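(Purely for illustration, a very rough shape of that plumbing; none of
these names exist in QEMU, BlockAdviseOps and raw_advise are invented
for this sketch, and real code would of course go through the block
driver interface rather than a bare fd:)

/*
 * Hypothetical sketch only: these types and functions do not exist in
 * QEMU.  It merely illustrates the kind of new interface meant above,
 * i.e. a generic "advise" hook that a protocol driver like file-posix
 * could implement with posix_fadvise().
 */
#include <fcntl.h>
#include <stdint.h>

typedef enum {
    BDRV_ADVISE_DONTNEED                /* invented */
} BdrvAdviseType;

typedef struct BlockAdviseOps {         /* invented */
    int (*advise)(void *opaque, int64_t offset, int64_t bytes,
                  BdrvAdviseType type);
} BlockAdviseOps;

/* What a file-posix implementation of such a hook could roughly look like. */
static int raw_advise(void *opaque, int64_t offset, int64_t bytes,
                      BdrvAdviseType type)
{
    int fd = *(int *)opaque;

    if (type == BDRV_ADVISE_DONTNEED) {
        return posix_fadvise(fd, offset, bytes, POSIX_FADV_DONTNEED);
    }
    return 0;
}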
I’d propose I take some time to look at your patch as-is and then I
report back.
Max
10.02.2021 13:00, Max Reitz wrote:
> On 09.02.21 19:51, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 21:41, Denis V. Lunev wrote:
>>> On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 09.02.2021 17:47, Max Reitz wrote:
>>>>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> Hi all!
>>>>>>>>>
>>>>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>>>>> switch
>>>>>>>>> to another task spawned from our customer's bug.
>>>>>>>>>
>>>>>>>>> Original problem: we use O_DIRECT for all vm images in our
>>>>>>>>> product, it's
>>>>>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>>>>>> compressed backup, because compressed backup is extremely slow with
>>>>>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>>>>>> produces a lot of pagecache.
>>>>>>>>>
>>>>>>>>> So we can either implement some internal cache or use fadvise
>>>>>>>>> somehow.
>>>>>>>>> Backup has several async workes, which writes simultaneously, so
>>>>>>>>> in both
>>>>>>>>> ways we have to track host cluster filling (before dropping the
>>>>>>>>> cache
>>>>>>>>> corresponding to the cluster). So, if we have to track anyway,
>>>>>>>>> let's
>>>>>>>>> try to implement the cache.
>>>>>>>>
>>>>>>>> I wanted to be excited here, because that sounds like it would be
>>>>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>>>>> then flush the cluster.
>>>>>>>
>>>>>>> The problem is that chunks are written asynchronously.. That's why
>>>>>>> this all is not so easy.
>>>>>>>
>>>>>>>>
>>>>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>>>>> excited...
>>>>>>>>
>>>>>>>>> Idea is simple: cache small unaligned write and flush the cluster
>>>>>>>>> when
>>>>>>>>> filled.
>>>>>>>>>
>>>>>>>>> Performance result is very good (results in a table is time of
>>>>>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>>>>>
>>>>>>>> “Filled with ones” really is an edge case, though.
>>>>>>>
>>>>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>>>>
>>>>>>>>
>>>>>>>>> --------------- ----------- -----------
>>>>>>>>> backup(old) backup(new)
>>>>>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>>>>>> -99%
>>>>>>>>> ssd:hdd(cached) 5.7 5.4
>>>>>>>>> -5%
>>>>>>>>> --------------- ----------- -----------
>>>>>>>>>
>>>>>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>>>>>> cache by default (which is done by the series).
>>>>>>>>
>>>>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>>>>> don’t really see the point for writing compressed images.
>>>>>>>
>>>>>>> compressed backup is a point
>>>>>>
>>>>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>>>>> using O_DIRECT, which one can decide to not use for backup targets
>>>>>> (as you have done already).
>>>>>>
>>>>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>>>>> improvement for the no-cache case, when actually, well, you just
>>>>>>>> added a cache. So the no-cache case just became faster because
>>>>>>>> there is a cache now.
>>>>>>>
>>>>>>> Still, performance comparison is relevant to show that O_DIRECT as
>>>>>>> is unusable for compressed backup.
>>>>>>
>>>>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>>>>> exactly whether O_DIRECT is even relevant for writing compressed
>>>>>> images.
>>>>>>
>>>>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>>>>> introduce some caching (because technically the cache.direct
>>>>>>>> option only applies to the protocol driver) for collecting
>>>>>>>> compressed writes.
>>>>>>>
>>>>>>> Yes I thought in this way, enabling the cache by default.
>>>>>>>
>>>>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>>>>
>>>>>>>> *shrug*
>>>>>>>>
>>>>>>>> Third, what is the real-world impact on the page cache? You
>>>>>>>> described that that’s the reason why you need the cache in qemu,
>>>>>>>> because otherwise the page cache is polluted too much. How much
>>>>>>>> is the difference really? (I don’t know how good the compression
>>>>>>>> ratio is for real-world images.)
>>>>>>>
>>>>>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>>>>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>>>>>> for target of compressed backup.. Still the pollution may relate to
>>>>>>> several backups and of course it is simple enough to drop the cache
>>>>>>> after each backup. But I think that even one backup of 16T disk may
>>>>>>> pollute RAM enough.
>>>>>>
>>>>>> Oh, sorry, I just realized I had a brain fart there. I was
>>>>>> referring to whether this series improves the page cache pollution.
>>>>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>>>>
>>>>>>>> Related to that, I remember a long time ago we had some discussion
>>>>>>>> about letting qemu-img convert set a special cache mode for the
>>>>>>>> target image that would make Linux drop everything before the last
>>>>>>>> offset written (i.e., I suppose fadvise() with
>>>>>>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>>>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>>>>> advantage of using that would be that we could reuse it for
>>>>>>>> non-compressed images that are written by backup or qemu-img
>>>>>>>> convert.)
>>>>>>>
>>>>>>> The problem is that writes are async. And therefore, not sequential.
>>>>>>
>>>>>> In theory, yes, but all compressed writes still goes through
>>>>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>>>>> whether in practice the writes aren’t usually sufficiently
>>>>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>>>>
>>>>> Yes, allocation is sequential. But writes are not.. Reasonable, I
>>>>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>>>>> the whole backup target before the backup? Will try. Still, I expect
>>>>> that my cache will show better performance anyway. Actually,
>>>>> comparing cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e.
>>>>> 20% faster, which is significant (still, yes, would be good to check
>>>>> it on more real case than all-ones).
>>>>>
>>>>>>
>>>>>>> So
>>>>>>> I have to track the writes and wait until the whole cluster is
>>>>>>> filled. It's simple use fadvise as an option to my cache: instead
>>>>>>> of caching data and write when cluster is filled we can instead
>>>>>>> mark cluster POSIX_FADV_DONTNEED.
>>>>>>>
>>>>>>>>
>>>>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>>>>
>>>>>>>>
>>>>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>>>>> write cache? I.e., on read, everything is flushed, so we don’t
>>>>>>>> have to deal with that. I don’t think there are many valid cases
>>>>>>>> where a compressed image is both written to and read from at the
>>>>>>>> same time. (Just asking, because I’d really want this code to be
>>>>>>>> simpler. I can imagine that reading from the cache is the least
>>>>>>>> bit of complexity, but perhaps...)
>>>>>>>>
>>>>>>>
>>>>>>> Hm. I really didn't want to support reads, and do it only to make
>>>>>>> it possible to enable the cache by default.. Still read function is
>>>>>>> really simple, and I don't think that dropping it will simplify the
>>>>>>> code significantly.
>>>>>>
>>>>>> That’s too bad.
>>>>>>
>>>>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>>>>> actually would do in practice.
>>>>>
>>>>> will check.
>>>>>
>>>>
>>>> Checked that if I mark the whole file by FADV_SEQUENTIAL, cache is not
>>>> removed.
>>>>
>>>> Test:
>>>> [root@kvm fadvise]# cat a.c
>>>> #define _GNU_SOURCE
>>>> #include <fcntl.h>
>>>> #include <unistd.h>
>>>> #include <stdio.h>
>>>> #include <getopt.h>
>>>> #include <string.h>
>>>> #include <stdbool.h>
>>>>
>>>> int main(int argc, char *argv[])
>>>> {
>>>> int fd;
>>>> int i;
>>>> char mb[1024 * 1024];
>>>> int open_flags = O_RDWR | O_CREAT | O_EXCL;
>>>> int fadv_flags = 0;
>>>> int fadv_final_flags = 0;
>>>> int len = 1024 * 1024;
>>>> bool do_fsync = false;
>>>>
>>>> for (i = 1; i < argc - 1; i++) {
>>>> const char *arg = argv[i];
>>>>
>>>> if (!strcmp(arg, "direct")) {
>>>> open_flags |= O_DIRECT;
>>>> } else if (!strcmp(arg, "seq")) {
>>>> fadv_flags = POSIX_FADV_SEQUENTIAL;
>>>> } else if (!strcmp(arg, "dontneed")) {
>>>> fadv_flags = POSIX_FADV_DONTNEED;
>>>> } else if (!strcmp(arg, "final-dontneed")) {
>>>> fadv_final_flags = POSIX_FADV_DONTNEED;
>>>> } else if (!strcmp(arg, "fsync")) {
>>>> do_fsync = true;
>>>> } else {
>>>> fprintf(stderr, "unknown: %s\n", arg);
>>>> return 1;
>>>> }
>>>> }
>>>>
>>>> fd = open(argv[argc - 1], open_flags);
>>>>
>>>> if (fd < 0) {
>>>> fprintf(stderr, "failed to open\n");
>>>> return 1;
>>>> }
>>>>
>>>> if (fadv_flags) {
>>>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>>>> }
>>>> for (i = 0; i < 100; i++) {
>>>> write(fd, mb, len);
>>>> }
>>>> if (fadv_final_flags) {
>>>> posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>>>> }
>>>> if (do_fsync) {
>>>> fsync(fd);
>>>> }
>>>>
>>>> close(fd);
>>>> }
>>>>
>>>>
>>>>
>>>> [root@kvm fadvise]# gcc a.c
>>>> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 100M 25600 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 100M 25600 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 36M 9216 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 100M 25600 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 100M 25600 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 36M 9216 100M x
>>>> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>>>> RES PAGES SIZE FILE
>>>> 0B 0 0B x
>>>>
>>>>
>>>>
>>>> Backup-generated pagecache is a formal trash, it will be never used.
>>>> And it's bad that it can displace another good pagecache. So the best
>>>> thing for backup is direct mode + proposed cache.
>
> What a shame.
>
> Thanks a lot for testing.
>
>>> I think that the original intention of Max is about POSIX_FADV_DONTNEED
>>> One should call this fadvise for just fully written cluster.
>
> I had hoped that SEQUENTIAL would just work, magically.
>
>> This should work, but:
>>
>> - as we see from test above, POSIX_FADV_DONTNEED don't remove the whole cache (see final-dontneed)
>> - as I said we'll have to track writes, so the cache will be the same, just instead of postponed-write operation we'll do fadvise.
>>
>> Hmm. Still, in this way, we will not need some difficult things in my proposed cache.
>>
>> So, it may be reasonable to at least split the big patch so that
>>
>> - first part brings writes / full-cluster tracking and fadvice
>>
>> - second part adds caching-wrties option, corresponding flush code and additional performance
>>
>> Does it make sense?
>
> I think the fadvise solution would have been nice if it gave us something magical that we could also use for normal qemu-img convert (or backup) operations, but as that doesn’t seem to be the case, I don’t think it makes too much sense to implement something like that. (I imagine doing fadvise also creates the need to implement new block to call into file-posix and so on.)
Actually, with any kind of qemu-img convert, we shouldn't benefit from the page cache, as we write in large enough, well-aligned chunks. So probably all such tasks should prefer O_DIRECT mode so as not to produce extra page cache. And in this way my patch helps for compressed qemu-img convert with the target opened in O_DIRECT mode as well. So instead of using fadvise for all such tasks, we can use O_DIRECT for all of them, with compression powered by the compressed-write cache to work well with O_DIRECT.
>
> I’d propose I take some time to look at your patch as-is and then I report back.
>
Thanks!
--
Best regards,
Vladimir
On 2/9/21 5:47 PM, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know, I have several series waiting for a resend, but I had to
>>>> switch
>>>> to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all vm images in our product,
>>>> it's
>>>> the policy. The only exclusion is backup target qcow2 image for
>>>> compressed backup, because compressed backup is extremely slow with
>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>> produces a lot of pagecache.
>>>>
>>>> So we can either implement some internal cache or use fadvise somehow.
>>>> Backup has several async workes, which writes simultaneously, so in
>>>> both
>>>> ways we have to track host cluster filling (before dropping the cache
>>>> corresponding to the cluster). So, if we have to track anyway, let's
>>>> try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be
>>> very easy to implement caching. Like, just keep the cluster at
>>> free_byte_offset cached until the cluster it points to changes, then
>>> flush the cluster.
>>
>> The problem is that chunks are written asynchronously.. That's why
>> this all is not so easy.
>>
>>>
>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>
>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>> filled.
>>>>
>>>> Performance result is very good (results in a table is time of
>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think, all clusters are compressed to rather small chunks :)
>>
>>>
>>>> --------------- ----------- -----------
>>>> backup(old) backup(new)
>>>> ssd:hdd(direct) 3e+02 4.4
>>>> -99%
>>>> ssd:hdd(cached) 5.7 5.4
>>>> -5%
>>>> --------------- ----------- -----------
>>>>
>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>> cache by default (which is done by the series).
>>>
>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t
>>> really see the point for writing compressed images.
>>
>> compressed backup is a point
>
> (Perhaps irrelevant, but just to be clear:) I meant the point of using
> O_DIRECT, which one can decide to not use for backup targets (as you
> have done already).
>
>>> Second, I find it a bit cheating if you say there is a huge
>>> improvement for the no-cache case, when actually, well, you just
>>> added a cache. So the no-cache case just became faster because
>>> there is a cache now.
>>
>> Still, performance comparison is relevant to show that O_DIRECT as is
>> unusable for compressed backup.
>
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
> whether O_DIRECT is even relevant for writing compressed images.
>
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>> sense for compressed images, qemu’s format drivers are free to
>>> introduce some caching (because technically the cache.direct option
>>> only applies to the protocol driver) for collecting compressed writes.
>>
>> Yes I thought in this way, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache? You
>>> described that that’s the reason why you need the cache in qemu,
>>> because otherwise the page cache is polluted too much. How much is
>>> the difference really? (I don’t know how good the compression ratio
>>> is for real-world images.)
>>
>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>> for target of compressed backup.. Still the pollution may relate to
>> several backups and of course it is simple enough to drop the cache
>> after each backup. But I think that even one backup of 16T disk may
>> pollute RAM enough.
>
> Oh, sorry, I just realized I had a brain fart there. I was referring
> to whether this series improves the page cache pollution. But
> obviously it will if it allows you to re-enable O_DIRECT.
>
>>> Related to that, I remember a long time ago we had some discussion
>>> about letting qemu-img convert set a special cache mode for the
>>> target image that would make Linux drop everything before the last
>>> offset written (i.e., I suppose fadvise() with
>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>> that implementing a cache in qemu would be simple, but it isn’t,
>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>> advantage of using that would be that we could reuse it for
>>> non-compressed images that are written by backup or qemu-img convert.)
>>
>> The problem is that writes are async. And therefore, not sequential.
>
> In theory, yes, but all compressed writes still goes through
> qcow2_alloc_bytes() right before submitting the write, so I wonder
> whether in practice the writes aren’t usually sufficiently sequential
> to make POSIX_FADV_SEQUENTIAL work fine.
>
>> So
>> I have to track the writes and wait until the whole cluster is
>> filled. It's simple use fadvise as an option to my cache: instead of
>> caching data and write when cluster is filled we can instead mark
>> cluster POSIX_FADV_DONTNEED.
>>
>>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure
>>> write cache? I.e., on read, everything is flushed, so we don’t have
>>> to deal with that. I don’t think there are many valid cases where a
>>> compressed image is both written to and read from at the same time.
>>> (Just asking, because I’d really want this code to be simpler. I
>>> can imagine that reading from the cache is the least bit of
>>> complexity, but perhaps...)
>>>
>>
>> Hm. I really didn't want to support reads, and do it only to make it
>> possible to enable the cache by default.. Still read function is
>> really simple, and I don't think that dropping it will simplify the
>> code significantly.
>
> That’s too bad.
>
> So the only question I have left is what POSIX_FADV_SEQUENTIAL
> actually would do in practice.
>
> (But even then, the premise just doesn’t motivate me sufficiently yet...)
>
POSIX_FADV_SEQUENTIAL only influences the amount of read-ahead.
int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
.....
    case POSIX_FADV_NORMAL:
        file->f_ra.ra_pages = bdi->ra_pages;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;
    case POSIX_FADV_SEQUENTIAL:
        file->f_ra.ra_pages = bdi->ra_pages * 2;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;
thus the only difference from the usual NORMAL mode would be a
doubled amount of read-ahead pages.
Den
On 09.02.21 17:52, Denis V. Lunev wrote:
> On 2/9/21 5:47 PM, Max Reitz wrote:
>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>> 09.02.2021 16:25, Max Reitz wrote:
>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>> Hi all!
>>>>>
>>>>> I know, I have several series waiting for a resend, but I had to
>>>>> switch
>>>>> to another task spawned from our customer's bug.
>>>>>
>>>>> Original problem: we use O_DIRECT for all vm images in our product,
>>>>> it's
>>>>> the policy. The only exclusion is backup target qcow2 image for
>>>>> compressed backup, because compressed backup is extremely slow with
>>>>> O_DIRECT (due to unaligned writes). Customer complains that backup
>>>>> produces a lot of pagecache.
>>>>>
>>>>> So we can either implement some internal cache or use fadvise somehow.
>>>>> Backup has several async workes, which writes simultaneously, so in
>>>>> both
>>>>> ways we have to track host cluster filling (before dropping the cache
>>>>> corresponding to the cluster). So, if we have to track anyway, let's
>>>>> try to implement the cache.
>>>>
>>>> I wanted to be excited here, because that sounds like it would be
>>>> very easy to implement caching. Like, just keep the cluster at
>>>> free_byte_offset cached until the cluster it points to changes, then
>>>> flush the cluster.
>>>
>>> The problem is that chunks are written asynchronously.. That's why
>>> this all is not so easy.
>>>
>>>>
>>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>>
>>>>> Idea is simple: cache small unaligned write and flush the cluster when
>>>>> filled.
>>>>>
>>>>> Performance result is very good (results in a table is time of
>>>>> compressed backup of 1000M disk filled with ones in seconds):
>>>>
>>>> “Filled with ones” really is an edge case, though.
>>>
>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>
>>>>
>>>>> --------------- ----------- -----------
>>>>> backup(old) backup(new)
>>>>> ssd:hdd(direct) 3e+02 4.4
>>>>> -99%
>>>>> ssd:hdd(cached) 5.7 5.4
>>>>> -5%
>>>>> --------------- ----------- -----------
>>>>>
>>>>> So, we have benefit even for cached mode! And the fastest thing is
>>>>> O_DIRECT with new implemented cache. So, I suggest to enable the new
>>>>> cache by default (which is done by the series).
>>>>
>>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t
>>>> really see the point for writing compressed images.
>>>
>>> compressed backup is a point
>>
>> (Perhaps irrelevant, but just to be clear:) I meant the point of using
>> O_DIRECT, which one can decide to not use for backup targets (as you
>> have done already).
>>
>>>> Second, I find it a bit cheating if you say there is a huge
>>>> improvement for the no-cache case, when actually, well, you just
>>>> added a cache. So the no-cache case just became faster because
>>>> there is a cache now.
>>>
>>> Still, performance comparison is relevant to show that O_DIRECT as is
>>> unusable for compressed backup.
>>
>> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
>> whether O_DIRECT is even relevant for writing compressed images.
>>
>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>> sense for compressed images, qemu’s format drivers are free to
>>>> introduce some caching (because technically the cache.direct option
>>>> only applies to the protocol driver) for collecting compressed writes.
>>>
>>> Yes I thought in this way, enabling the cache by default.
>>>
>>>> That conclusion makes both of my complaints kind of moot.
>>>>
>>>> *shrug*
>>>>
>>>> Third, what is the real-world impact on the page cache? You
>>>> described that that’s the reason why you need the cache in qemu,
>>>> because otherwise the page cache is polluted too much. How much is
>>>> the difference really? (I don’t know how good the compression ratio
>>>> is for real-world images.)
>>>
>>> Hm. I don't know the ratio.. Customer reported that most of RAM is
>>> polluted by Qemu's cache, and we use O_DIRECT for everything except
>>> for target of compressed backup.. Still the pollution may relate to
>>> several backups and of course it is simple enough to drop the cache
>>> after each backup. But I think that even one backup of 16T disk may
>>> pollute RAM enough.
>>
>> Oh, sorry, I just realized I had a brain fart there. I was referring
>> to whether this series improves the page cache pollution. But
>> obviously it will if it allows you to re-enable O_DIRECT.
>>
>>>> Related to that, I remember a long time ago we had some discussion
>>>> about letting qemu-img convert set a special cache mode for the
>>>> target image that would make Linux drop everything before the last
>>>> offset written (i.e., I suppose fadvise() with
>>>> POSIX_FADV_SEQUENTIAL). You discard that idea based on the fact
>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>> advantage of using that would be that we could reuse it for
>>>> non-compressed images that are written by backup or qemu-img convert.)
>>>
>>> The problem is that writes are async. And therefore, not sequential.
>>
>> In theory, yes, but all compressed writes still goes through
>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>> whether in practice the writes aren’t usually sufficiently sequential
>> to make POSIX_FADV_SEQUENTIAL work fine.
>>
>>> So
>>> I have to track the writes and wait until the whole cluster is
>>> filled. It's simple use fadvise as an option to my cache: instead of
>>> caching data and write when cluster is filled we can instead mark
>>> cluster POSIX_FADV_DONTNEED.
>>>
>>>>
>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>
>>>>
>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>> write cache? I.e., on read, everything is flushed, so we don’t have
>>>> to deal with that. I don’t think there are many valid cases where a
>>>> compressed image is both written to and read from at the same time.
>>>> (Just asking, because I’d really want this code to be simpler. I
>>>> can imagine that reading from the cache is the least bit of
>>>> complexity, but perhaps...)
>>>>
>>>
>>> Hm. I really didn't want to support reads, and do it only to make it
>>> possible to enable the cache by default.. Still read function is
>>> really simple, and I don't think that dropping it will simplify the
>>> code significantly.
>>
>> That’s too bad.
>>
>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>> actually would do in practice.
>>
>> (But even then, the premise just doesn’t motivate me sufficiently yet...)
>>
> POSIX_FADV_SEQUENTIAL influences the amount of the read-ahead
> only.
>
> int generic_fadvise(struct file *file, loff_t offset, loff_t len, int
> advice)
> {
> .....
> case POSIX_FADV_NORMAL:
> file->f_ra.ra_pages = bdi->ra_pages;
> spin_lock(&file->f_lock);
> file->f_mode &= ~FMODE_RANDOM;
> spin_unlock(&file->f_lock);
> break;
> case POSIX_FADV_SEQUENTIAL:
> file->f_ra.ra_pages = bdi->ra_pages * 2;
> spin_lock(&file->f_lock);
> file->f_mode &= ~FMODE_RANDOM;
> spin_unlock(&file->f_lock);
> break;
>
> thus the only difference from the usual NORMAL mode would be
> doubled amount of read-ahead pages.
:/
Thanks for looking it up.
Max