Xen Security Advisory 326 v4 (CVE-2022-42311,CVE-2022-42312,CVE-2022-42313,CVE-2022-42314,CVE-2022-42315,CVE-2022-42316,CVE-2022-42317,CVE-2022-42318) - Xenstore: guests can let run xenstored out of memory

Xen.org security team posted 1 patch 1 year, 6 months ago
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

 Xen Security Advisory CVE-2022-42311,CVE-2022-42312,CVE-2022-42313,CVE-2022-42314,CVE-2022-42315,CVE-2022-42316,CVE-2022-42317,CVE-2022-42318 / XSA-326
                                                                        version 4

         Xenstore: guests can let run xenstored out of memory

UPDATES IN VERSION 4
====================

Public release.

ISSUE DESCRIPTION
=================

Malicious guests can cause xenstored to allocate vast amounts of memory,
eventually resulting in a Denial of Service (DoS) of xenstored.

There are multiple ways in which guests can cause large memory allocations
in xenstored:

- - by issuing new requests to xenstored without reading the responses,
  causing the responses to be buffered in memory

- - by causing a large number of watch events to be generated by setting up
  multiple xenstore watches and then e.g. deleting many xenstore nodes
  below the watched path

- - by creating as many nodes as allowed with the maximum allowed size and
  path length in as many transactions as possible

- - by accessing many nodes inside a transaction

IMPACT
======

Unprivileged guests can cause a DoS of xenstored, resulting in the
inability to create new guests or modify the configuration of running
guests.

VULNERABLE SYSTEMS
==================

All Xen versions are vulnerable.

Both Xenstore implementations (C and Ocaml) are vulnerable.

MITIGATION
==========

There is no mitigation available.

CREDITS
=======

This issue was discovered by Julien Grall of Amazon.

RESOLUTION
==========

Applying the appropriate attached patches resolves this issue.

Note that the final oxenstored patch (7 or 8, as applicable) limits
security support for oxenstored to trusted driver domains only.

C xenstored Patches 15 and 16 are not part of the XSA, but are useful
for administrators to change current xenstored quota settings and to
audit per-guest resource usage in xenstored.

Note that the patches are based on top of the patches for XSA-414 and
XSA-415. There is a subtle dependency on XSA-419, which can't be resolved
easily, so the patches of XSA-326 should always be applied together with
those of XSA-419.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa326/xsa326-xenstored-??.patch           xen-unstable
xsa326/xsa326-oxenstored-??.patch          xen-unstable
xsa326/xsa326-4.16-xenstored-??.patch      Xen 4.16.x
xsa326/xsa326-4.16-oxenstored-??.patch     Xen 4.16.x
xsa326/xsa326-4.15-xenstored-??.patch      Xen 4.15.x
xsa326/xsa326-4.15-oxenstored-??.patch     Xen 4.15.x
xsa326/xsa326-4.14-xenstored-??.patch      Xen 4.14.x
xsa326/xsa326-4.14-oxenstored-??.patch     Xen 4.14.x
xsa326/xsa326-4.13-xenstored-??.patch      Xen 4.13.x
xsa326/xsa326-4.13-oxenstored-??.patch     Xen 4.13.x

$ sha256sum xsa326* xsa326*/*
fbeb48f2137ead7e933d487b95d819b4adec29e33141655dfb40e66861f8d005  xsa326.meta
5da5e9d053a51faba9a553970d53736b333ce713793ed3cf3fefc19943a3ba3d  xsa326/xsa326-4.13-oxenstored-01.patch
6c65b043f5a9a8963c74b22df2187be7936c1228b1dee7b3cd32ea2f207520d0  xsa326/xsa326-4.13-oxenstored-02.patch
f04f4c29f8a63ff7f08af4d9a99b5da9c44eface3523e2dd9da7119d85445d42  xsa326/xsa326-4.13-oxenstored-03.patch
438ddd4a5fb1b4c9bb5bc911052cbb84b3fbe2ce4c2559ec112b7e9cd6c3c436  xsa326/xsa326-4.13-oxenstored-04.patch
e57d98b53c5b03e34a2e554097b634bbf568d9e336ee0ef7ec703d3ff153dd8a  xsa326/xsa326-4.13-oxenstored-05.patch
0b13429993ab1bb5a2a58edeeebfc8bc50987e5d86dddfd6f7108259c31aed97  xsa326/xsa326-4.13-oxenstored-06.patch
e5c995a8eeea776e57c9878b612f17f2d8cad2538897d8cf385a9f9570ecd076  xsa326/xsa326-4.13-oxenstored-07.patch
247d2461b80884a1bbc063074b89beb769243f82f0de61fe0a45fb438b4a6d38  xsa326/xsa326-4.13-oxenstored-08.patch
928c1b4d624b73fab33af936ba520402d0010956939ed4f17f42c8a476e7dd02  xsa326/xsa326-4.13-xenstored-01.patch
4918eab37b70914a01b3277d83d56a20a877982fac8c5c9533afcdc8c16c4123  xsa326/xsa326-4.13-xenstored-02.patch
1b2df2030bbb91729b16174026127f1a056e011814e2c0b14e6b9430c00f6c41  xsa326/xsa326-4.13-xenstored-03.patch
e05aec57d8cdc1f3151cf6a2cfd8fdf10b9776e3ba564ff934d1dd51692c2f12  xsa326/xsa326-4.13-xenstored-04.patch
197e76c74166fc686fd5b1faf6e025abd9a3e1019ebc7954f63d3561b50aa13c  xsa326/xsa326-4.13-xenstored-05.patch
75dd40b36c3c8f43c8387402221caf05c7dd3b842caf88f59a5420039f63279f  xsa326/xsa326-4.13-xenstored-06.patch
979224585e94d6ba01c8faf2ce4378993aace0057b2377a3ef65aea522912787  xsa326/xsa326-4.13-xenstored-07.patch
ca15279f2d11ca693c1bf4f716835e029f200dab7ad07a12c5d4e9a9199d35ea  xsa326/xsa326-4.13-xenstored-08.patch
7a041894a74bed53ed9951b62725535915398a1dd90d825514d338264b80f3cc  xsa326/xsa326-4.13-xenstored-09.patch
19273b8a79da99ebfbe166e7eb2ec2ea4e68352d90535cc9e1ca154b6cdcab42  xsa326/xsa326-4.13-xenstored-10.patch
4fa07eb6d5fe1d0d49c1e7ad28e106a57f5785cae3a1ff8fd81a0192f0e1ed70  xsa326/xsa326-4.13-xenstored-11.patch
750984eee04854a09ea053213a7b3d411dc487a45056295e943ff4c5e7c8fa10  xsa326/xsa326-4.13-xenstored-12.patch
1aa1458b82fac3b1dbf71f0ad2d8f29203e95ffc8bbe61e3f8aa0895613cb5f8  xsa326/xsa326-4.13-xenstored-13.patch
791f86db3611e226801bf562cf93a4bcd5dd25070e65b6490d1a520e5570cda4  xsa326/xsa326-4.13-xenstored-14.patch
e78ea12c7446a773fb670d674d40cef195bb98f2776c4b43e3737f9cb2742182  xsa326/xsa326-4.13-xenstored-15.patch
3dc9ceed291b414931984952c9bc506e4686cf780a33cd338e1cec254831dc35  xsa326/xsa326-4.13-xenstored-16.patch
19952c1d5a9979cea871323a14ab390e239865e1323193eb46891b365ec4ed9c  xsa326/xsa326-4.14-oxenstored-01.patch
d29ad0d60c3fb07b0f6004bca7cb2457d88c4dd589ccf60261954905f27da982  xsa326/xsa326-4.14-oxenstored-02.patch
124ebbbd5e240113ee0b17fd45d0b8b8ab2fa185197bee9293be109ff209cedb  xsa326/xsa326-4.14-oxenstored-03.patch
8dc1e435dbe7b8ba439117c37e5115784942f0c9724b2976eb9b71eaaf4dacc4  xsa326/xsa326-4.14-oxenstored-04.patch
601dd879e100eb73d13018ba7f36a9e7b1e3d1fa82e0b09ab2e9e5eb9f1d901e  xsa326/xsa326-4.14-oxenstored-05.patch
1744a454249f2e93ca3b01442f9efe3ed699764780a58a99b23358f752d46b1c  xsa326/xsa326-4.14-oxenstored-06.patch
54cd2c156db841c66a1081c8c66b87442bf47d7e0375a311f786527a17feada5  xsa326/xsa326-4.14-oxenstored-07.patch
d6560f5aef9e8e28a4f9773bcc8dd89fd81be1d0a7267b6eba9e9b200c65d4df  xsa326/xsa326-4.14-oxenstored-08.patch
981c67cad44b33660e9e0e7fb6877659da05266a31affb54916cdbf2670ae435  xsa326/xsa326-4.14-xenstored-01.patch
0defc4dc7007d67d217de657305c9f3dce84dc8f9905fe82db5460cfdab48e8a  xsa326/xsa326-4.14-xenstored-02.patch
3b885e855debf116585f27e5c8a9e6e77575c25b4c729b8b50a9457ea815204e  xsa326/xsa326-4.14-xenstored-03.patch
167f178880e606f914bbd6a12cb0e6f56b4551d441d4ca4afa341978973e0fcc  xsa326/xsa326-4.14-xenstored-04.patch
101dda8679ca2c22a0cc7c38d8701dfb6a082e7bfc67846cf48d4eb9e35bfdc9  xsa326/xsa326-4.14-xenstored-05.patch
ad28cc050cdc76c8db6bacefe5d2084ec5ca2f0023ed6a463b9843f8a835173e  xsa326/xsa326-4.14-xenstored-06.patch
29c234ea29713c997e4686a13c8c6ef1eaa12cc0ba6ed49e729922435e3902f3  xsa326/xsa326-4.14-xenstored-07.patch
b20de5fd7d00218eb8f1e5014c06bc8397c6f93876a7328c61e99b010ad0814d  xsa326/xsa326-4.14-xenstored-08.patch
c1568765f386a9d70b9fb59d532c239c7ef9af5fda544518de13f6b16806e099  xsa326/xsa326-4.14-xenstored-09.patch
a99500c0d25f61c3bf4a29dc4c3a3d9457476c014c279267e2acea7714f5b92e  xsa326/xsa326-4.14-xenstored-10.patch
efa8ec1b0e8ff5f3bcb951e1838641480bb67af68fa6dddeed9a6ea6af45ac7b  xsa326/xsa326-4.14-xenstored-11.patch
fd40770a8cf1365034c76c99c26170ae23055000fbcad389ddad1b2d16426768  xsa326/xsa326-4.14-xenstored-12.patch
905525ab516cdc5104558667810ec0de8626e495ba70d571fc4afc8159768cee  xsa326/xsa326-4.14-xenstored-13.patch
4d1037a90a345ae71719abcacee274cbed35d05838659a0a4ab33951ee2418b5  xsa326/xsa326-4.14-xenstored-14.patch
bea121de03b5c2e4736020264b949c66bb5c18edfc3f17c5591cb9a42499f469  xsa326/xsa326-4.14-xenstored-15.patch
86376255e4b514ec77ce759321131271b8aa0075ac14116a7d49a36ac5debcc0  xsa326/xsa326-4.14-xenstored-16.patch
30d14a68dcd80fb3f9d4df12aed6897c0ddce12e5155ac844a42b776611769cc  xsa326/xsa326-4.15-oxenstored-01.patch
958e12676110ce2ad79103ac69c1b468dc792c40ebeb4a7898878d05661b865f  xsa326/xsa326-4.15-oxenstored-02.patch
5f9bd4a0bc12db5c9bf89259f1d2ea76b28308ac6f1a74292284c45d88dadd30  xsa326/xsa326-4.15-oxenstored-03.patch
b02baaad64ea00e3e05ab8de2b5c0bb1047792870f57c1974ae9cef43fc3201e  xsa326/xsa326-4.15-oxenstored-04.patch
644d84f59dca4d55894ec4851c11d4fc0a15203319a9016fd5476fb4a4c43ca7  xsa326/xsa326-4.15-oxenstored-05.patch
9a93874c9c63bd5a418160d2973517302c926cfaeaa22afab5dbe9da54399697  xsa326/xsa326-4.15-oxenstored-06.patch
7dbf0a1d70aa943ea7b0be69d16027239d7f965e3994a95b47d8822d7b0c3d84  xsa326/xsa326-4.15-oxenstored-07.patch
3809e21e09ff741448b3126bb2fb7979a67e430ca6d5b2a70fd22bd210ca276d  xsa326/xsa326-4.15-oxenstored-08.patch
b05a06e5f29c97192710376ce89e80962a893827a30911087a6b883ff644cef6  xsa326/xsa326-4.15-xenstored-01.patch
e0b3249792c03b9dd0e8820e5db9f6e08b38ea5182a60baff1d9264dcf6f1b16  xsa326/xsa326-4.15-xenstored-02.patch
d94f34802f4ed302f44823b1a47c25792b5e1d040d3e04878a53b006339b4654  xsa326/xsa326-4.15-xenstored-03.patch
ec414451bbec7229282e4db650b0b298d89c1881720886569b2a1210576398bd  xsa326/xsa326-4.15-xenstored-04.patch
ab25a8817732f5e9f4dd3cb3cf2130de50dbe39d284c0ac80ce210b738a6a3fa  xsa326/xsa326-4.15-xenstored-05.patch
a7c0151d34d7b340ccb02780dfc3267e654b4423cdfff32650577a4da519677e  xsa326/xsa326-4.15-xenstored-06.patch
a4933e62317428fc8d8a5ba12a653613ee3e54ad89f26831736f0b12bb18d68e  xsa326/xsa326-4.15-xenstored-07.patch
0b365ea9d0dfd2b2773b42a19826e369bb6e79c88f118ec41a80570be93d2c26  xsa326/xsa326-4.15-xenstored-08.patch
dd04f56f28a6943a141f425ce3b45ebc370c559e33dab2db48f89d077cde24bf  xsa326/xsa326-4.15-xenstored-09.patch
d2260693e4d94b4707459bf277c6a23f322fcd3fa58091cdac896b39a61a890f  xsa326/xsa326-4.15-xenstored-10.patch
97dfa89180a20cc3e3d03edaf2cc48a343d4f07e7982b5ee1e4c61afa3103a6f  xsa326/xsa326-4.15-xenstored-11.patch
acd6041a412fc584ccd9376f1e17f51cf40708ec3fa1c0ce64a9c9cdb393e727  xsa326/xsa326-4.15-xenstored-12.patch
ef00a409abfeb078a1e29abf3bd12c017440cb4db09b00a7cab875bb7a920788  xsa326/xsa326-4.15-xenstored-13.patch
e33042c8f63426a3ef75a884b00aaddd7f143324efbb216dae92155b3a6d23c8  xsa326/xsa326-4.15-xenstored-14.patch
e2ab4d46a6d836f485a062eddae2ea3e554da55c68551db22c40b19edc366a56  xsa326/xsa326-4.15-xenstored-15.patch
fb5eac62c4dd11e1a7e998a1b293e1b36998ec7540137790c66ee3e756ee7d7b  xsa326/xsa326-4.15-xenstored-16.patch
22188213c6caf1a9f84e0babdb3c35e9e828424e3bfced237036856291ec86c5  xsa326/xsa326-4.16-oxenstored-01.patch
631891588ca285eb44ebc393a13bfb7fd3da473db031aca612770ccb6e502447  xsa326/xsa326-4.16-oxenstored-02.patch
32f43582d2f25c46a837f36cca54d85a14afe0c04489597fe564bc688ead1dba  xsa326/xsa326-4.16-oxenstored-03.patch
9ea1efcf2260b2170318467a1ae99e898024a3ee139b61570838115a1de8b956  xsa326/xsa326-4.16-oxenstored-04.patch
03eb654ebacfef7e3a91234deb7bc4687f80762ca68b00b7fe23eb273ef8b9f1  xsa326/xsa326-4.16-oxenstored-05.patch
5b771df5d23ecd6a66de93b6d5a5ab3821a3f57770d6a8d9473eb18f4bf1ee9c  xsa326/xsa326-4.16-oxenstored-06.patch
eddf43db08e7c46a15f589f7be3ac64c3967c345b520dd5b4813117332da4b1a  xsa326/xsa326-4.16-oxenstored-07.patch
8c5b11c0a0af8f5f9dff4d64482377f0706c455e65a106f309c9ad56eea1adc6  xsa326/xsa326-4.16-oxenstored-08.patch
a4542bd9278ac83c0e633bbff7d3f446a03b4dac70269c0f079c980d58d9a5ae  xsa326/xsa326-4.16-xenstored-01.patch
6f7b7d523b0b085d2b7f371ec4477859212a265ae9a52f1f8c8f54e62f02a05e  xsa326/xsa326-4.16-xenstored-02.patch
2b9a3f2e1764fedc08aa335603fe7c253e67496534a29ffae8fe6e9c1ba0ce19  xsa326/xsa326-4.16-xenstored-03.patch
0fc9759eb7e6504b9f54090b5d249d602968df8db6de6dff32a84a9134317e72  xsa326/xsa326-4.16-xenstored-04.patch
6962f7381bc11df4fdccb89013968c583c708677d14f5ef57c07e945eaa7bcc6  xsa326/xsa326-4.16-xenstored-05.patch
d30bdd689b0a32b09ec8916917fe5297a1b3dd2f6c93e39fad2864fcd862b4bf  xsa326/xsa326-4.16-xenstored-06.patch
ecc07fc6f1ae78ea8455344e785d1c359fe0c5b3c4be97346812b5aa5dd3a19f  xsa326/xsa326-4.16-xenstored-07.patch
a0f0316c955a7a8a8e74509d9db052ab1560dd132b2e931121368338cd65e5b5  xsa326/xsa326-4.16-xenstored-08.patch
ccced498d856519df82836acb7dccd155b858c62cdab84d95e6aac12ca7e9963  xsa326/xsa326-4.16-xenstored-09.patch
5bc89ffba64be315264cf695a62e27ebb55879eff9d97e8bf0d71ee01eff78af  xsa326/xsa326-4.16-xenstored-10.patch
c25bd21bc05f93622dd9025e787ba60955dc6df0c74db915acd821ab7ecea733  xsa326/xsa326-4.16-xenstored-11.patch
5eec3bb81c5d3a3588bf30a754f630b3d08628c66c35a8d00823d1726591bae0  xsa326/xsa326-4.16-xenstored-12.patch
6f484f7c237c7e92d3ff225e4732b0496a5e899de02812fedfbbcdc5712fff03  xsa326/xsa326-4.16-xenstored-13.patch
e8382b1f37177d3dca5e66adce13e1cec4a320b0865f09535bf51a1d4662bb1c  xsa326/xsa326-4.16-xenstored-14.patch
274708be8a5951eaaa2adb61974c3a1529c35dc1f293cc2e9d4759a2d8e20693  xsa326/xsa326-4.16-xenstored-15.patch
cebadbd9b303551e0208eaefd831608c47056d27f05dcea97cee3cd761eb3f70  xsa326/xsa326-4.16-xenstored-16.patch
16248584282597dd5b405c8ced0d7d8ad644b68b9dbe13dbaa65ad9080fbbbc4  xsa326/xsa326-oxenstored-01.patch
8f1346250c54accdd4da3cbfb29c98bdf8511974e75e6433374e772c4a7f3b88  xsa326/xsa326-oxenstored-02.patch
bc59dbfbd41a95d73c81ecd011c3a3d2cc62f373e1ea0f79792a78572ca06af1  xsa326/xsa326-oxenstored-03.patch
b3e383389d3743809422a4e5a364bad10249531bd64d0af2873294cb9abbcb10  xsa326/xsa326-oxenstored-04.patch
ad9160630efefece9eb59e144e01911dc69d625acca2a5562a1640bc8823bcf0  xsa326/xsa326-oxenstored-05.patch
4279925ed16d89d3f26ecb4a71d2215547088c8f733c4bce596e29b1916e01cf  xsa326/xsa326-oxenstored-06.patch
de8faa4b114faef576024da5f99b7a961efd9f7de5fa6ba60160fe932af36494  xsa326/xsa326-oxenstored-07.patch
b4582a663bf5cc8ef7ab5dccaab1e5b686da6584a5cab3339319c66726535e8e  xsa326/xsa326-xenstored-01.patch
8a5699af6c6d0497f6b16030db31c59cf8b172c21a78d1d2d36f0c590a5f2319  xsa326/xsa326-xenstored-02.patch
b8a9286af5d14e35a9ec541afc20b2ca40550ac0a6e83fc012be396ba42a939b  xsa326/xsa326-xenstored-03.patch
10d4c34475550c7dcf808747a4a44ce74ed42d8c0b0c209c6dc318c397a4ba8f  xsa326/xsa326-xenstored-04.patch
3fba2fc49d5af5466452d4ddfa730194686ff8dbb5a96b29e4d89032e0135a78  xsa326/xsa326-xenstored-05.patch
57e008a2a8921186b797abe068f0ef9d39ea23dcd0f4cb8a4c20a022d17aff77  xsa326/xsa326-xenstored-06.patch
da69f7577dd38fc109e6271d583b3cd19197b6777e70191e079e2e120631d6cf  xsa326/xsa326-xenstored-07.patch
a2ab8f1307609dcfb66abf12c82e8f273f12e1c92f05b350933a73794b02ad73  xsa326/xsa326-xenstored-08.patch
417baecd2b6e10456ef6501619ba617e2c24a32bcad025df3f683f17334e42f9  xsa326/xsa326-xenstored-09.patch
2ccd4bd9524971d140568d9d0cee49931bcf85596744a13ac3520e1e67c71fd8  xsa326/xsa326-xenstored-10.patch
bf119e0c13e4f77d1029410be71987b51c48eb5bfa72c445394e2e2eea004e9c  xsa326/xsa326-xenstored-11.patch
70dadf62eca8bd119ff84d4efdb0c863f8ddaf58e25e29ef6d3b7bc92fc2f0fa  xsa326/xsa326-xenstored-12.patch
6fdd871d77b699fbb4df8efc18fd772131a216e9ac9387832ae66a3af6d58e07  xsa326/xsa326-xenstored-13.patch
49a22d518921be7688cbe5dced9c842b3f0c67f678f3d113bbe5fce36a59d775  xsa326/xsa326-xenstored-14.patch
a8ef297722bb4c5778d3e0f80ab16cdb6024cdb3a349789182d2167409cf1aa2  xsa326/xsa326-xenstored-15.patch
bf20cd4808cba1506ed7404af050d9b05619b48d2d8eda7e166050540b8f25e2  xsa326/xsa326-xenstored-16.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.


(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmNg+5QMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZrb0IAKWuWJpPThwmSEFjzNMwdQ+L/xip0AEnl3aVC5UD
DEGtB7mETVnwsUYZYee9+OEWOjHJJ//4eENeaziGvzfPG5scGUjcdMeNrIhPtdqB
jgjrjfE/z+pTQvbQhu5vvjR/m0K+PHgBejiSfKC7K87+yhcuTaMFoUejBoQ2ZzZ0
h5UfEiTktdWRTwQ4HrofgJKKIfhXGBRRXJbzNysNZ2k8eSpq6ALjgEPpmhalBS/t
n1UPKGyToXhVnAwDkV8Bo54EOjhkppIwYuOiGEi4O+weHIq0Oqi9pqpkzCC5QO3q
muUGHYRjJ7yDWzo+gpr27O8949gPXPfDMTKLiWYCXGaw4CA=
=Eyn8
-----END PGP SIGNATURE-----
{
  "XSA": 326,
  "SupportedVersions": [
    "master",
    "4.16",
    "4.15",
    "4.14",
    "4.13"
  ],
  "Trees": [
    "xen"
  ],
  "Recipes": {
    "4.13": {
      "Recipes": {
        "xen": {
          "StableRef": "0be63c2615b268001f7cc9b72ce25eed952737dc",
          "Prereqs": [
            414,
            415
          ],
          "Patches": [
            "xsa326/xsa326-4.13-xenstored-??.patch",
            "xsa326/xsa326-4.13-oxenstored-??.patch"
          ]
        }
      }
    },
    "4.14": {
      "Recipes": {
        "xen": {
          "StableRef": "016de62747b26ead5a5c763b640fe8e205cd182b",
          "Prereqs": [
            414,
            415
          ],
          "Patches": [
            "xsa326/xsa326-4.14-xenstored-??.patch",
            "xsa326/xsa326-4.14-oxenstored-??.patch"
          ]
        }
      }
    },
    "4.15": {
      "Recipes": {
        "xen": {
          "StableRef": "816580afdd1730d4f85f64477a242a439af1cdf8",
          "Prereqs": [
            414,
            415
          ],
          "Patches": [
            "xsa326/xsa326-4.15-xenstored-??.patch",
            "xsa326/xsa326-4.15-oxenstored-??.patch"
          ]
        }
      }
    },
    "4.16": {
      "Recipes": {
        "xen": {
          "StableRef": "1bce7fb1f702da4f7a749c6f1457ecb20bf74fca",
          "Prereqs": [
            412,
            414,
            415
          ],
          "Patches": [
            "xsa326/xsa326-4.16-xenstored-??.patch",
            "xsa326/xsa326-4.16-oxenstored-??.patch"
          ]
        }
      }
    },
    "master": {
      "Recipes": {
        "xen": {
          "StableRef": "cc4747be8ba157a3b310921e9ee07fb8545aa206",
          "Prereqs": [
            412,
            414,
            415
          ],
          "Patches": [
            "xsa326/xsa326-xenstored-??.patch",
            "xsa326/xsa326-oxenstored-??.patch"
          ]
        }
      }
    }
  }
}

From 24d6e912621db242a8fdff29b8352b516e1e9d1e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:01 +0100
Subject: tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently have 2 different sets of defaults in the upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in oxenstored.conf.in in upstream Xen

An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.

Resync the defaults from oxenstored.conf.in into the source code.
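
As a rough illustration of why the built-in defaults matter, the sketch below
keeps each limit in a ref that a config line may override; when no config file
is read, the compiled-in value is what applies. The names maxrequests and
maxsize and the keys quota-maxrequests and quota-maxsize mirror the patch, but
apply_config_line is a hypothetical helper, not oxenstored's parser.

```ocaml
(* Simplified stand-ins for Define.maxrequests and Quota.maxsize. *)
let maxrequests = ref 1024
let maxsize = ref 2048

(* Hypothetical helper: apply one "key = value" line.
   Unknown keys and non-integer values are silently ignored in this sketch. *)
let apply_config_line line =
  match String.index_opt line '=' with
  | None -> ()
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1)) in
    match key, int_of_string_opt value with
    | "quota-maxrequests", Some v -> maxrequests := v
    | "quota-maxsize", Some v -> maxsize := v
    | _ -> ()
```

If apply_config_line is never called, the refs keep their compiled-in values,
which is the situation the resynchronised defaults are meant to make safe.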

This is part of XSA-326 / CVE-2022-42316.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index f574397a4c0b..96c125a969da 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -22,9 +22,9 @@ let xs_daemon_socket_ro = Paths.xen_run_stored ^ "/socket_ro"
 
 let default_config_dir = Paths.xen_config_dir
 
-let maxwatch = ref (50)
-let maxtransaction = ref (20)
-let maxrequests = ref (-1)   (* maximum requests per transaction *)
+let maxwatch = ref (100)
+let maxtransaction = ref (10)
+let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index abcac912805a..6e3d6401ae89 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -20,8 +20,8 @@ exception Transaction_opened
 
 let warn fmt = Logging.warn "quota" fmt
 let activate = ref true
-let maxent = ref (10000)
-let maxsize = ref (4096)
+let maxent = ref (1000)
+let maxsize = ref (2048)
 
 type t = {
 	maxent: int;               (* max entities per domU *)
From e33572f285d3e4b4aac849300044830943201d94 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Thu, 28 Jul 2022 17:08:15 +0100
Subject: tools/ocaml/xenstored: Check for maxrequests before performing
 operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.

If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded

It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there.  Then the transaction
record won't grow.
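
The check-before-perform ordering can be sketched as follows. This is a
minimal model, not the oxenstored code: the txn record, the string-typed
requests, max_requests and handle are all hypothetical simplifications, and
the dom0 exemption is left out.

```ocaml
exception Limit_reached

type txn = {
  mutable operations : string list;  (* recorded requests, newest first *)
  mutable quota_reached : bool;      (* sticky once the limit has been hit *)
}

(* Illustrative limit, far below the real default. *)
let max_requests = 2

let check_quota_exn t =
  if t.quota_reached || List.length t.operations >= max_requests then begin
    t.quota_reached <- true;
    raise Limit_reached
  end

(* Check first, then perform and record. When the quota is exceeded the
   operation is never run, so the transaction record does not grow. *)
let handle t req perform =
  try
    check_quota_exn t;
    let response = perform req in
    t.operations <- req :: t.operations;
    response
  with Limit_reached -> "EQUOTA"
```

Because the check runs before perform, a rejected request leaves both the
operation list and any state touched by perform unchanged.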

This is part of XSA-326 / CVE-2022-42317.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 3ab09c6ce926..3279b19b1bff 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -253,6 +253,7 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
 	let reply_error e =
 		Packet.Error e in
 	try
+		Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 		fct con t doms cons req.Packet.data
 	with
 	| Define.Invalid_path          -> reply_error "EINVAL"
@@ -545,9 +546,10 @@ let process_packet ~store ~cons ~doms ~con ~req =
 		in
 
 		let response = try
+			Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 			if tid <> Transaction.none then
 				(* Remember the request and response for this operation in case we need to replay the transaction *)
-				Transaction.add_operation ~perm:(Connection.get_perm con) t req response;
+				Transaction.add_operation t req response;
 			response
 		with Quota.Limit_reached ->
 			Packet.Error "EQUOTA"
diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml
index 17b1bdf2eaf9..294143e2335b 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -85,6 +85,7 @@ type t = {
 	oldroot: Store.Node.t;
 	mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
 	mutable operations: (Packet.request * Packet.response) list;
+	mutable quota_reached: bool;
 	mutable read_lowpath: Store.Path.t option;
 	mutable write_lowpath: Store.Path.t option;
 }
@@ -127,6 +128,7 @@ let make ?(internal=false) id store =
 		oldroot = Store.get_root store;
 		paths = [];
 		operations = [];
+		quota_reached = false;
 		read_lowpath = None;
 		write_lowpath = None;
 	} in
@@ -143,13 +145,19 @@ let get_root t = Store.get_root t.store
 
 let is_read_only t = t.paths = []
 let add_wop t ty path = t.paths <- (ty, path) :: t.paths
-let add_operation ~perm t request response =
+let get_operations t = List.rev t.operations
+
+let check_quota_exn ~perm t =
 	if !Define.maxrequests >= 0
 		&& not (Perms.Connection.is_dom0 perm)
-		&& List.length t.operations >= !Define.maxrequests
-		then raise Quota.Limit_reached;
+		&& (t.quota_reached || List.length t.operations >= !Define.maxrequests)
+		then begin
+			t.quota_reached <- true;
+			raise Quota.Limit_reached;
+		end
+
+let add_operation t request response =
 	t.operations <- (request, response) :: t.operations
-let get_operations t = List.rev t.operations
 let set_read_lowpath t path = t.read_lowpath <- get_lowest path t.read_lowpath
 let set_write_lowpath t path = t.write_lowpath <- get_lowest path t.write_lowpath
 
From 9b1d5795563ad23155abc95b053081ff6e850d3a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:07 +0100
Subject: tools/ocaml: GC parameter tuning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory.  Tweak this to 120% instead, which
would match the major GC speed.
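
To put numbers on the percentages, the sketch below computes how much unused
memory may accumulate before compaction is triggered, assuming the threshold is
simply live * max_overhead / 100. compaction_threshold is a hypothetical helper
for illustration; the Gc.set call has the same shape as the one in the patch.

```ocaml
(* Unused memory (same unit as [live]) that may build up before the GC
   considers compacting, under the simple percentage model assumed here. *)
let compaction_threshold ~live ~max_overhead = live * max_overhead / 100

let () =
  let live_mib = 100 in
  Printf.printf "max_overhead=500: up to %d MiB held back\n"
    (compaction_threshold ~live:live_mib ~max_overhead:500);
  Printf.printf "max_overhead=120: up to %d MiB held back\n"
    (compaction_threshold ~live:live_mib ~max_overhead:120);
  (* Apply the lower setting to the running process. *)
  Gc.set { (Gc.get ()) with Gc.max_overhead = 120 }
```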

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 96c125a969da..1a5d2f34a678 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -26,6 +26,7 @@ let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
+let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
 let conflict_rate_limit_is_aggregate = ref true
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 369b5036f43d..0b6343dfc789 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -103,6 +103,7 @@ let parse_config filename =
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
 		("quota-path-max", Config.Set_int Define.path_max);
+		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
 		("persistent", Config.Set_bool Disk.enable);
 		("xenstored-log-file", Config.String Logging.set_xenstored_log_destination);
@@ -229,6 +230,67 @@ let to_file store cons file =
 	        (fun () -> close_out channel)
 end
 
+(*
+	By default OCaml's GC only returns memory to the OS when it exceeds a
+	configurable 'max overhead' setting.
+	The default is 500%, that is 5/6th of the OCaml heap needs to be free
+	and only 1/6th live for a compaction to be triggered that would
+	release memory back to the OS.
+	If the limit is not hit then the OCaml process can reuse that memory
+	for its own purposes, but other processes won't be able to use it.
+
+	There is also a 'space overhead' setting that controls how much work
+	each major GC slice does, and by default aims at having no more than
+	80% or 120% (depending on version) garbage values compared to live
+	values.
+	This doesn't have as much relevance to memory returned to the OS as
+	long as space_overhead <= max_overhead, because compaction is only
+	triggered at the end of major GC cycles.
+
+	The defaults are too large once the program starts using ~100MiB of
+	memory, at which point ~500MiB would be unavailable to other processes
+	(which would be fine if this was the main process in this VM, but it is
+	not).
+
+	Max overhead can also be set to 0, however this is for testing purposes
+	only (setting it lower than 'space overhead' wouldn't help because the
+	major GC wouldn't run fast enough, and compaction does have a
+	performance cost: we can only compact contiguous regions, so memory has
+	to be moved around).
+
+	Max overhead controls how often the heap is compacted, which is useful
+	if there are bursts of activity followed by long periods of idle state,
+	or if a domain quits, etc. Compaction returns memory to the OS.
+
+	wasted = live * space_overhead / 100
+
+	For globally overriding the GC settings one can use OCAMLRUNPARAM,
+	however we provide a config file override to be consistent with other
+	oxenstored settings.
+
+	One might want to dynamically adjust the overhead setting based on used
+	memory, i.e. to use a fixed upper bound in bytes, not percentage. However
+	measurements show that such adjustments increase GC overhead massively,
+	while still not guaranteeing that memory is returned any more quickly
+	than with a percentage based setting.
+
+	The allocation policy could also be tweaked, e.g. first fit would reduce
+	fragmentation and thus memory usage, but the documentation warns that it
+	can be sensibly slower, and indeed one of our own testcases can trigger
+	such a corner case where it is multiple times slower, so it is best to keep
+	the default allocation policy (next-fit/best-fit depending on version).
+
+	There are other tweaks that can be attempted in the future, e.g. setting
+	'ulimit -v' to 75% of RAM, however getting the kernel to actually return
+	NULL from allocations is difficult even with that setting, and without a
+	NULL the emergency GC won't be triggered.
+	Perhaps cgroup limits could help, but for now tweak the safest only.
+*)
+
+let tweak_gc () =
+	Gc.set { (Gc.get ()) with Gc.max_overhead = !Define.gc_max_overhead }
+
+
 let _ =
 	let cf = do_argv in
 	let pidfile =
@@ -238,6 +300,8 @@ let _ =
 			default_pidfile
 		in
 
+	tweak_gc ();
+
 	(try
 		Unixext.mkdir_rec (Filename.dirname pidfile) 0o755
 	with _ ->
From 76fd294c934a324a661250f57e2602eca963c49d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Fri, 29 Jul 2022 18:53:29 +0100
Subject: tools/ocaml/libs/xb: hide type of Xb.t
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.
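
A minimal sketch of the idea, assuming nothing about the real Xb module beyond
what the diff shows: the signature exposes only an abstract type t plus
accessors, so callers cannot pattern-match on fields. Xb_sketch, make and the
field layout are hypothetical.

```ocaml
module Xb_sketch : sig
  type t                        (* abstract: fields are hidden from callers *)
  val make : mmap:bool -> t
  val is_mmap : t -> bool
  val input_len : t -> int
end = struct
  type backend = Fd | Mmap
  type t = { backend : backend; pkt_in : string Queue.t }

  let make ~mmap =
    { backend = (if mmap then Mmap else Fd); pkt_in = Queue.create () }

  (* Callers ask a question instead of inspecting the record directly. *)
  let is_mmap t = match t.backend with Mmap -> true | Fd -> false

  let input_len t = Queue.length t.pkt_in
end
```

With this shape, replacing pkt_in by another data structure only requires
changes inside the struct, as long as the accessors keep their types.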

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)

Reported-by: Julien Grall <jgrall@amazon.com>
diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 104d319d7747..8404ddd8a682 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -196,6 +196,9 @@ let peek_output con = Queue.peek con.pkt_out
 let input_len con = Queue.length con.pkt_in
 let has_in_packet con = Queue.length con.pkt_in > 0
 let get_in_packet con = Queue.pop con.pkt_in
+let has_partial_input con = match con.partial_in with
+	| HaveHdr _ -> true
+	| NoHdr (n, _) -> n < Partial.header_size ()
 let has_more_input con =
 	match con.backend with
 	| Fd _         -> false
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 3a00da6cddc1..794e35bb343e 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,13 +66,7 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
-type t = {
-  backend : backend;
-  pkt_in : Packet.t Queue.t;
-  pkt_out : Packet.t Queue.t;
-  mutable partial_in : partial_buf;
-  mutable partial_out : string;
-}
+type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
 val queue : t -> Packet.t -> unit
@@ -97,6 +91,7 @@ val has_output : t -> bool
 val peek_output : t -> Packet.t
 val input_len : t -> int
 val has_in_packet : t -> bool
+val has_partial_input : t -> bool
 val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index daf8d804f7ef..70c43485528c 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -125,9 +125,7 @@ let get_perm con =
 let set_target con target_domid =
 	con.perm <- Perms.Connection.set_target (get_perm con) ~perms:[Perms.READ; Perms.WRITE] target_domid
 
-let is_backend_mmap con = match con.xb.Xenbus.Xb.backend with
-	| Xenbus.Xb.Xenmmap _ -> true
-	| _ -> false
+let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
 let send_reply con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
From 76929380d9101a707a4a5327d2409bc1ff5900f5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:02 +0100
Subject: tools/ocaml: Change Xb.input to return Packet.t option
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The queue here would only ever hold at most one element.  This will simplify
follow-up patches.
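
As a rough illustration of the idea (not the code from this patch), a reader
that hands back a completed packet as an option needs neither a boolean flag
nor a one-element queue.  The `reader` type, its fixed `size` and the `recv`
helper below are hypothetical and exist only for this sketch:

```ocaml
(* Illustrative only: a reader that returns a completed packet directly
   as an option, instead of a bool plus a one-element queue. *)
type reader = { mutable partial : string; size : int }

(* Feed one chunk; yields [Some packet] once [size] bytes have arrived. *)
let input r chunk =
  r.partial <- r.partial ^ chunk;
  if String.length r.partial >= r.size then begin
    let pkt = String.sub r.partial 0 r.size in
    r.partial <- String.sub r.partial r.size (String.length r.partial - r.size);
    Some pkt
  end else None

(* Keep feeding chunks until a whole packet is available. *)
let rec recv r chunks =
  match chunks with
  | [] -> None
  | c :: rest ->
    (match input r c with
     | Some pkt -> Some pkt
     | None -> recv r rest)
```

The caller pattern-matches on the result, which is the shape `pkt_recv` and
`do_input` take in the diff that follows.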

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 8404ddd8a682..165fd4a1edf4 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -45,7 +45,6 @@ type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 type t =
 {
 	backend: backend;
-	pkt_in: Packet.t Queue.t;
 	pkt_out: Packet.t Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
@@ -62,7 +61,6 @@ let reconnect t = match t.backend with
 		Xs_ring.close backend.mmap;
 		backend.eventchn_notify ();
 		(* Clear our old connection state *)
-		Queue.clear t.pkt_in;
 		Queue.clear t.pkt_out;
 		t.partial_in <- init_partial_in ();
 		t.partial_out <- ""
@@ -124,7 +122,6 @@ let output con =
 
 (* NB: can throw Reconnect *)
 let input con =
-	let newpacket = ref false in
 	let to_read =
 		match con.partial_in with
 		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
@@ -143,21 +140,19 @@ let input con =
 		if Partial.to_complete partial_pkt = 0 then (
 			let pkt = Packet.of_partialpkt partial_pkt in
 			con.partial_in <- init_partial_in ();
-			Queue.push pkt con.pkt_in;
-			newpacket := true
-		)
+			Some pkt
+		) else None
 	| NoHdr (i, buf)      ->
 		(* we complete the partial header *)
 		if sz > 0 then
 			Bytes.blit b 0 buf (Partial.header_size () - i) sz;
 		con.partial_in <- if sz = i then
-			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf)
-	);
-	!newpacket
+			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf);
+		None
+	)
 
 let newcon backend = {
 	backend = backend;
-	pkt_in = Queue.create ();
 	pkt_out = Queue.create ();
 	partial_in = init_partial_in ();
 	partial_out = "";
@@ -193,9 +188,6 @@ let has_output con = has_new_output con || has_old_output con
 
 let peek_output con = Queue.peek con.pkt_out
 
-let input_len con = Queue.length con.pkt_in
-let has_in_packet con = Queue.length con.pkt_in > 0
-let get_in_packet con = Queue.pop con.pkt_in
 let has_partial_input con = match con.partial_in with
 	| HaveHdr _ -> true
 	| NoHdr (n, _) -> n < Partial.header_size ()
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 794e35bb343e..91c682162cea 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -77,7 +77,7 @@ val write_fd : backend_fd -> 'a -> string -> int -> int
 val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
-val input : t -> bool
+val input : t -> Packet.t option
 val newcon : backend -> t
 val open_fd : Unix.file_descr -> t
 val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
@@ -89,10 +89,7 @@ val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
 val peek_output : t -> Packet.t
-val input_len : t -> int
-val has_in_packet : t -> bool
 val has_partial_input : t -> bool
-val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index d982fb24dbb1..451f8b38dbcc 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -94,26 +94,18 @@ let pkt_send con =
 	done
 
 (* receive one packet - can sleep *)
-let pkt_recv con =
-	let workdone = ref false in
-	while not !workdone
-	do
-		workdone := Xb.input con.xb
-	done;
-	Xb.get_in_packet con.xb
+let rec pkt_recv con =
+	match Xb.input con.xb with
+	| Some packet -> packet
+	| None -> pkt_recv con
 
 let pkt_recv_timeout con timeout =
 	let fd = Xb.get_fd con.xb in
 	let r, _, _ = Unix.select [ fd ] [] [] timeout in
 	if r = [] then
 		true, None
-	else (
-		let workdone = Xb.input con.xb in
-		if workdone then
-			false, (Some (Xb.get_in_packet con.xb))
-		else
-			false, None
-	)
+	else
+		false, Xb.input con.xb
 
 let queue_watchevent con data =
 	let ls = split_string ~limit:2 '\000' data in
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 70c43485528c..ace2aa5b4f53 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -277,8 +277,6 @@ let get_transaction con tid =
 	Hashtbl.find con.transactions tid
 
 let do_input con = Xenbus.Xb.input con.xb
-let has_input con = Xenbus.Xb.has_in_packet con.xb
-let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
 let has_output con = Xenbus.Xb.has_output con.xb
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 0df3df401db6..a72810d06f43 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -569,16 +569,17 @@ let do_input store cons doms con =
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
 			info "%s reconnection complete" (Connection.get_domstr con);
-			false
+			None
 		| Failure exp ->
 			error "caught exception %s" exp;
 			error "got a bad client %s" (sprintf "%-8s" (Connection.get_domstr con));
 			Connection.mark_as_bad con;
-			false
+			None
 	in
 
-	if newpacket then (
-		let packet = Connection.pop_in con in
+	match newpacket with
+	| None -> ()
+	| Some packet ->
 		let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
 		let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; Packet.data=data} in
 
@@ -588,8 +589,7 @@ let do_input store cons doms con =
 		         (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
 		process_packet ~store ~cons ~doms ~con ~req;
 		write_access_log ~ty ~tid ~con:(Connection.get_domstr con) ~data;
-		Connection.incr_ops con;
-	)
+		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
 	if Connection.has_output con then (
From fb56109a7004de929c6596e124ebb0f162fb5856 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:03 +0100
Subject: tools/ocaml/xb: Add BoundedQueue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ensures we cannot store more than [capacity] elements in a [Queue].  Replacing
every Queue with this module will then ensure at compile time that all queues
are correctly bounds-checked.

Each element in the queue has a class with its own limits.  This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
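
The per-class accounting can be sketched in a few lines of standalone OCaml.
This is a simplified illustration under assumptions, not the module added by
this patch: the names `cls`, `bq`, `count` and the limits used in the example
are invented here, and the class is passed explicitly instead of being derived
by a `classify` function.

```ocaml
(* Standalone sketch, not the patch's code: a queue with an overall
   capacity plus a per-class limit. All names here are illustrative. *)
type cls = Reply | Event

type 'a bq = {
  q : ('a * cls) Queue.t;
  capacity : int;
  limit : cls -> int;
  counts : (cls, int) Hashtbl.t;
}

let create ~capacity ~limit =
  { q = Queue.create (); capacity; limit; counts = Hashtbl.create 2 }

let count t c = try Hashtbl.find t.counts c with Not_found -> 0

(* Returns [Some ()] when the element was accepted, [None] when either
   the overall capacity or the class limit would be exceeded. *)
let push t c x =
  let n = count t c in
  if Queue.length t.q < t.capacity && n < t.limit c then begin
    Queue.push (x, c) t.q;
    Hashtbl.replace t.counts c (n + 1);
    Some ()
  end else None

(* Removes the oldest element and releases its slot in its class. *)
let pop t =
  let (x, c) = Queue.pop t.q in
  Hashtbl.replace t.counts c (count t c - 1);
  x

let () =
  let limit = function Reply -> 2 | Event -> 1 in
  let t = create ~capacity:3 ~limit in
  assert (push t Event "ev1" = Some ());
  (* a second event is refused, but replies still have room *)
  assert (push t Event "ev2" = None);
  assert (push t Reply "r1" = Some ());
  assert (push t Reply "r2" = Some ());
  assert (push t Reply "r3" = None);
  assert (pop t = "ev1")
```

Keeping a separate count per class is what lets one class fill up without
starving the other, as long as the overall capacity is the sum of the limits.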

No functional change.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 165fd4a1edf4..4197a3888a68 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -17,6 +17,98 @@
 module Op = struct include Op end
 module Packet = struct include Packet end
 
+module BoundedQueue : sig
+	type ('a, 'b) t
+
+	(** [create ~capacity ~classify ~limit] creates a queue with maximum [capacity] elements.
+	    This is burst capacity, each element is further classified according to [classify],
+	    and each class can have its own [limit].
+	    [capacity] is enforced as an overall limit.
+	    The [limit] can be dynamic, and can be smaller than the number of elements already queued of that class,
+	    in which case those elements are considered to use "burst capacity".
+	  *)
+	val create: capacity:int -> classify:('a -> 'b) -> limit:('b -> int) -> ('a, 'b) t
+
+	(** [clear q] discards all elements from [q] *)
+	val clear: ('a, 'b) t -> unit
+
+	(** [can_push q c] is true when [length q < capacity] and class [c] is below its [limit]. *)
+	val can_push: ('a, 'b) t -> 'b -> bool
+
+	(** [push e q] adds [e] at the end of queue [q] if [can_push q], or returns [None]. *)
+	val push: 'a -> ('a, 'b) t -> unit option
+
+	(** [pop q] removes and returns first element in [q], or raises [Queue.Empty]. *)
+	val pop: ('a, 'b) t -> 'a
+
+	(** [peek q] returns the first element in [q], or raises [Queue.Empty].  *)
+	val peek : ('a, 'b) t -> 'a
+
+	(** [length q] returns the current number of elements in [q] *)
+	val length: ('a, 'b) t -> int
+
+	(** [debug string_of_class q] prints queue usage statistics in an unspecified internal format. *)
+	val debug: ('b -> string) -> (_, 'b) t -> string
+end = struct
+	type ('a, 'b) t =
+		{ q: 'a Queue.t
+		; capacity: int
+		; classify: 'a -> 'b
+		; limit: 'b -> int
+		; class_count: ('b, int) Hashtbl.t
+		}
+
+	let create ~capacity ~classify ~limit =
+		{ capacity; q = Queue.create (); classify; limit; class_count = Hashtbl.create 3 }
+
+	let get_count t classification = try Hashtbl.find t.class_count classification with Not_found -> 0
+
+	let can_push_internal t classification class_count =
+		Queue.length t.q < t.capacity && class_count < t.limit classification
+
+	let ok = Some ()
+
+	let push e t =
+		let classification = t.classify e in
+		let class_count = get_count t classification in
+		if can_push_internal t classification class_count then begin
+			Queue.push e t.q;
+			Hashtbl.replace t.class_count classification (class_count + 1);
+			ok
+		end
+		else
+			None
+
+	let can_push t classification =
+		can_push_internal t classification @@ get_count t classification
+
+	let clear t =
+		Queue.clear t.q;
+		Hashtbl.reset t.class_count
+
+	let pop t =
+		let e = Queue.pop t.q in
+		let classification = t.classify e in
+		let () = match get_count t classification - 1 with
+		| 0 -> Hashtbl.remove t.class_count classification (* reduces memusage *)
+		| n -> Hashtbl.replace t.class_count classification n
+		in
+		e
+
+	let peek t = Queue.peek t.q
+	let length t = Queue.length t.q
+
+	let debug string_of_class t =
+		let b = Buffer.create 128 in
+		Printf.bprintf b "BoundedQueue capacity: %d, used: {" t.capacity;
+		Hashtbl.iter (fun packet_class count ->
+			Printf.bprintf b "	%s: %d" (string_of_class packet_class) count
+		) t.class_count;
+		Printf.bprintf b "}";
+		Buffer.contents b
+end
+
+
 exception End_of_file
 exception Eagain
 exception Noent
From 9d55aebdafc0dc4860da8f00ad484407850d3647 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:04 +0100
Subject: tools/ocaml: Limit maximum in-flight requests / outstanding replies
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a limit on the number of outstanding reply packets in the xenbus
queue.  This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
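
The gating rule can be pictured with a toy model.  The sketch below is only an
illustration under assumptions: the `conn` record, its two plain queues and
the `step` function are hypothetical stand-ins for the ring and the bounded
output queue, not the xenstored data structures.

```ocaml
(* Illustrative only: a toy connection where a request is read only when
   there is guaranteed room for its reply. *)
type conn = {
  requests : string Queue.t;   (* stands in for the input ring *)
  replies : string Queue.t;    (* stands in for the bounded output queue *)
  max_outstanding : int;
}

let can_input c = Queue.length c.replies < c.max_outstanding

(* Process at most one request; returns false when blocked or idle. *)
let step c =
  if can_input c && not (Queue.is_empty c.requests) then begin
    let req = Queue.pop c.requests in
    Queue.push ("reply to " ^ req) c.replies;
    true
  end else false

let () =
  let c = { requests = Queue.create (); replies = Queue.create ();
            max_outstanding = 2 } in
  List.iter (fun r -> Queue.push r c.requests) ["a"; "b"; "c"];
  assert (step c && step c);
  (* the third request stays unread until the client consumes a reply *)
  assert (not (step c));
  assert (Queue.length c.requests = 1);
  ignore (Queue.pop c.replies);
  assert (step c)
```

Because a request is only consumed when its reply is guaranteed a slot, the
number of buffered replies never exceeds the configured maximum.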

To avoid a busy loop on the Unix socket, we only add it to the watched input
file descriptor set if we'd be able to call `input` on it.  Even though Dom0
is trusted and exempt from quotas, a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.

Therefore the xenbus queue limit must apply to all connections, including
Dom0, although if everything works correctly Dom0 will eventually catch up.

This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring.  However, if it
can cause watch events to be generated by other means (e.g. by Dom0, or
another cooperating guest) and stops reading its own ring, then watch events
would still queue up without limit unless they are bounded as well.

The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events.  In fact, dropping watch events is known to break
some pieces of normal functionality.  This leaves little choice to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.

Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
  deliver

Items are queued up on one end and are sent as far along the pipe as possible:

  source domain -> watch -> xenbus of target -> xenstore ring/socket of target

If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up.  For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.

Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
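
The chaining and flush-before-enqueue behaviour described above can be
sketched as follows.  This is a simplified standalone model, assuming a plain
queue of size two as the final "ring"; the names `sender`, `pipe`, `flush`
and `ring_send` are chosen for the illustration and the capacities are
arbitrary.

```ocaml
(* Standalone sketch of chained bounded pipes; names are illustrative
   and simplified from the idea described above. *)
type 'a sender = 'a -> unit option

type 'a pipe = { q : 'a Queue.t; capacity : int; destination : 'a sender }

let create ~capacity ~destination =
  { q = Queue.create (); capacity; destination }

(* Move items downstream until the destination refuses one. *)
let rec flush p =
  if not (Queue.is_empty p.q) then
    match p.destination (Queue.peek p.q) with
    | None -> ()
    | Some () ->
      let _ = Queue.pop p.q in
      flush p

(* Flush first so older items go ahead of the new one, then enqueue. *)
let push p item =
  flush p;
  if Queue.length p.q < p.capacity then begin
    Queue.push item p.q;
    flush p;
    Some ()
  end else None

let () =
  (* final stage: a "ring" with room for two items *)
  let ring = Queue.create () in
  let ring_send x =
    if Queue.length ring < 2 then (Queue.push x ring; Some ()) else None in
  let watch = create ~capacity:1 ~destination:ring_send in
  let source = create ~capacity:10 ~destination:(push watch) in
  List.iter (fun i -> assert (push source i = Some ())) [1; 2; 3; 4; 5];
  (* 1,2 reached the ring, 3 waits in the watch pipe, 4,5 in the source *)
  assert (Queue.length ring = 2);
  assert (Queue.length watch.q = 1);
  assert (Queue.length source.q = 2);
  (* the reader drains one item; flushing moves the backlog along in order *)
  assert (Queue.pop ring = 1);
  flush source;
  assert (List.of_seq (Queue.to_seq ring) = [2; 3])
```

In this model a non-empty `source` pipe is the signal that would block the
source from issuing further commands.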

We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue.  This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.

When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).

There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.

Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
  source will not be allowed to queue up more watch events.
* quota-maxoutstanding which is the number of responses not read from the ring:
  once exceeded, no more inputs are processed until all outstanding replies
  are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
  on destination are queued up here, a single command can trigger multiple
  watches (e.g. due to recursion).

The overflow queue currently doesn't have an upper bound: it is difficult to
accurately calculate one, as it depends on whether you are Dom0, how many
watches each path has registered, and how many watch events you can trigger
with a single command (e.g. a commit).  However, these events were already
using memory; this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.

Note that Dom0 is not excluded from these checks.  It is especially important
that Dom0 is not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.

This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.

The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.

Similarly, a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well, as long as it follows the PV
protocol.  (Note, however, that protection against malicious backends is only
best effort at the moment.)

This is part of XSA-326 / CVE-2022-42318.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 4197a3888a68..b292ed7a874d 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -134,14 +134,44 @@ type backend = Fd of backend_fd | Xenmmap of backend_mmap
 
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 
+(*
+	separate capacity reservation for replies and watch events:
+	this allows a domain to keep working even when under a constant flood of
+	watch events
+*)
+type capacity = { maxoutstanding: int; maxwatchevents: int }
+
+module Queue = BoundedQueue
+
+type packet_class =
+	| CommandReply
+	| Watchevent
+
+let string_of_packet_class = function
+	| CommandReply -> "command_reply"
+	| Watchevent -> "watch_event"
+
 type t =
 {
 	backend: backend;
-	pkt_out: Packet.t Queue.t;
+	pkt_out: (Packet.t, packet_class) Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
+	capacity: capacity
 }
 
+let to_read con =
+	match con.partial_in with
+		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
+		| NoHdr   (i, _)    -> i
+
+let debug t =
+	Printf.sprintf "XenBus state: partial_in: %d needed, partial_out: %d bytes, pkt_out: %d packets, %s"
+		(to_read t)
+		(String.length t.partial_out)
+		(Queue.length t.pkt_out)
+		(BoundedQueue.debug string_of_packet_class t.pkt_out)
+
 let init_partial_in () = NoHdr
 	(Partial.header_size (), Bytes.make (Partial.header_size()) '\000')
 
@@ -199,7 +229,8 @@ let output con =
 	let s = if String.length con.partial_out > 0 then
 			con.partial_out
 		else if Queue.length con.pkt_out > 0 then
-			Packet.to_string (Queue.pop con.pkt_out)
+			let pkt = Queue.pop con.pkt_out in
+			Packet.to_string pkt
 		else
 			"" in
 	(* send data from s, and save the unsent data to partial_out *)
@@ -212,12 +243,15 @@ let output con =
 	(* after sending one packet, partial is empty *)
 	con.partial_out = ""
 
+(* we can only process an input packet if we're guaranteed to have room
+   to store the response packet *)
+let can_input con = Queue.can_push con.pkt_out CommandReply
+
 (* NB: can throw Reconnect *)
 let input con =
-	let to_read =
-		match con.partial_in with
-		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
-		| NoHdr   (i, _)    -> i in
+	if not (can_input con) then None
+	else
+	let to_read = to_read con in
 
 	(* try to get more data from input stream *)
 	let b = Bytes.make to_read '\000' in
@@ -243,11 +277,22 @@ let input con =
 		None
 	)
 
-let newcon backend = {
+let classify t =
+	match t.Packet.ty with
+	| Op.Watchevent -> Watchevent
+	| _ -> CommandReply
+
+let newcon ~capacity backend =
+	let limit = function
+		| CommandReply -> capacity.maxoutstanding
+		| Watchevent -> capacity.maxwatchevents
+	in
+	{
 	backend = backend;
-	pkt_out = Queue.create ();
+	pkt_out = Queue.create ~capacity:(capacity.maxoutstanding + capacity.maxwatchevents) ~classify ~limit;
 	partial_in = init_partial_in ();
 	partial_out = "";
+	capacity = capacity;
 	}
 
 let open_fd fd = newcon (Fd { fd = fd; })
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 91c682162cea..71b2754ca788 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,10 +66,11 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
+type capacity = { maxoutstanding: int; maxwatchevents: int }
 type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
-val queue : t -> Packet.t -> unit
+val queue : t -> Packet.t -> unit option
 val read_fd : backend_fd -> 'a -> bytes -> int -> int
 val read_mmap : backend_mmap -> 'a -> bytes -> int -> int
 val read : t -> bytes -> int -> int
@@ -78,13 +79,14 @@ val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
 val input : t -> Packet.t option
-val newcon : backend -> t
-val open_fd : Unix.file_descr -> t
-val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
+val newcon : capacity:capacity -> backend -> t
+val open_fd : Unix.file_descr -> capacity:capacity -> t
+val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> capacity:capacity -> t
 val close : t -> unit
 val is_fd : t -> bool
 val is_mmap : t -> bool
 val output_len : t -> int
+val can_input: t -> bool
 val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
@@ -93,3 +95,4 @@ val has_partial_input : t -> bool
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
+val debug: t -> string
diff --git a/tools/ocaml/libs/xs/queueop.ml b/tools/ocaml/libs/xs/queueop.ml
index 9ff5bbd529ce..4e532cdaeacb 100644
--- a/tools/ocaml/libs/xs/queueop.ml
+++ b/tools/ocaml/libs/xs/queueop.ml
@@ -16,9 +16,10 @@
 open Xenbus
 
 let data_concat ls = (String.concat "\000" ls) ^ "\000"
+let queue con pkt = let r = Xb.queue con pkt in assert (r <> None)
 let queue_path ty (tid: int) (path: string) con =
 	let data = data_concat [ path; ] in
-	Xb.queue con (Xb.Packet.create tid 0 ty data)
+	queue con (Xb.Packet.create tid 0 ty data)
 
 (* operations *)
 let directory tid path con = queue_path Xb.Op.Directory tid path con
@@ -27,48 +28,48 @@ let read tid path con = queue_path Xb.Op.Read tid path con
 let getperms tid path con = queue_path Xb.Op.Getperms tid path con
 
 let debug commands con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
 
 let watch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
 
 let unwatch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
 
 let transaction_start con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
 
 let transaction_end tid commit con =
 	let data = data_concat [ (if commit then "T" else "F"); ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
 
 let introduce domid mfn port con =
 	let data = data_concat [ Printf.sprintf "%u" domid;
 	                         Printf.sprintf "%nu" mfn;
 	                         string_of_int port; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
 
 let release domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
 
 let resume domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
 
 let getdomainpath domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
 
 let write tid path value con =
 	let data = path ^ "\000" ^ value (* no NULL at the end *) in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
 
 let mkdir tid path con = queue_path Xb.Op.Mkdir tid path con
 let rm tid path con = queue_path Xb.Op.Rm tid path con
 
 let setperms tid path perms con =
 	let data = data_concat [ path; perms ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index 451f8b38dbcc..cbd17280600c 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -36,8 +36,10 @@ type con = {
 let close con =
 	Xb.close con.xb
 
+let capacity = { Xb.maxoutstanding = 1; maxwatchevents = 0; }
+
 let open_fd fd = {
-	xb = Xb.open_fd fd;
+	xb = Xb.open_fd ~capacity fd;
 	watchevents = Queue.create ();
 }
 
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index ace2aa5b4f53..9aad451a2dbd 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -20,12 +20,84 @@ open Stdext
 
 let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *)
 
+type 'a bounded_sender = 'a -> unit option
+(** a bounded sender accepts an ['a] item and returns:
+    None - if there is no room to accept the item
+    Some () -  if it has successfully accepted/sent the item
+ *)
+
+module BoundedPipe : sig
+	type 'a t
+
+	(** [create ~capacity ~destination] creates a bounded pipe with a
+	    local buffer holding at most [capacity] items.  Once the buffer is
+	    full it will not accept further items.  items from the pipe are
+	    flushed into [destination] as long as it accepts items.  The
+	    destination could be another pipe.
+	 *)
+	val create: capacity:int -> destination:'a bounded_sender -> 'a t
+
+	(** [is_empty t] returns whether the local buffer of [t] is empty. *)
+	val is_empty : _ t -> bool
+
+	(** [length t] the number of items in the internal buffer *)
+	val length: _ t -> int
+
+	(** [flush_pipe t] sends as many items from the local buffer as possible,
+			which could be none. *)
+	val flush_pipe: _ t -> unit
+
+	(** [push t item] tries to [flush_pipe] and then push [item]
+	    into the pipe if its [capacity] allows.
+	    Returns [None] if there is no more room
+	 *)
+	val push : 'a t -> 'a bounded_sender
+end = struct
+	(* items are enqueued in [q], and then flushed to [destination] *)
+	type 'a t =
+		{ q: 'a Queue.t
+		; destination: 'a bounded_sender
+		; capacity: int
+		}
+
+	let create ~capacity ~destination =
+		{ q = Queue.create (); capacity; destination }
+
+	let rec flush_pipe t =
+		if not Queue.(is_empty t.q) then
+			let item = Queue.peek t.q in
+			match t.destination item with
+			| None -> () (* no room *)
+			| Some () ->
+				(* successfully sent item to next stage *)
+				let _ = Queue.pop t.q in
+				(* continue trying to send more items *)
+				flush_pipe t
+
+	let push t item =
+		(* first try to flush as many items from this pipe as possible to make room,
+		   it is important to do this first to preserve the order of the items
+		 *)
+		flush_pipe t;
+		if Queue.length t.q < t.capacity then begin
+			(* enqueue, instead of sending directly.
+			   this ensures that [destination] sees the items in the same order as we receive them
+			 *)
+			Queue.push item t.q;
+			Some (flush_pipe t)
+		end else None
+
+	let is_empty t = Queue.is_empty t.q
+	let length t = Queue.length t.q
+end
+
 type watch = {
 	con: t;
 	token: string;
 	path: string;
 	base: string;
 	is_relative: bool;
+	pending_watchevents: Xenbus.Xb.Packet.t BoundedPipe.t;
 }
 
 and t = {
@@ -38,8 +110,36 @@ and t = {
 	anonid: int;
 	mutable stat_nb_ops: int;
 	mutable perm: Perms.Connection.t;
+	pending_source_watchevents: (watch * Xenbus.Xb.Packet.t) BoundedPipe.t
 }
 
+module Watch = struct
+	module T = struct
+		type t = watch
+
+		let compare w1 w2 =
+			(* cannot compare watches from different connections *)
+			assert (w1.con == w2.con);
+			match String.compare w1.token w2.token with
+			| 0 -> String.compare w1.path w2.path
+			| n -> n
+	end
+	module Set = Set.Make(T)
+
+	let flush_events t =
+		BoundedPipe.flush_pipe t.pending_watchevents;
+		not (BoundedPipe.is_empty t.pending_watchevents)
+
+	let pending_watchevents t =
+		BoundedPipe.length t.pending_watchevents
+end
+
+let source_flush_watchevents t =
+	BoundedPipe.flush_pipe t.pending_source_watchevents
+
+let source_pending_watchevents t =
+	BoundedPipe.length t.pending_source_watchevents
+
 let mark_as_bad con =
 	match con.dom with
 	|None -> ()
@@ -67,7 +167,8 @@ let watch_create ~con ~path ~token = {
 	token = token;
 	path = path;
 	base = get_path con;
-	is_relative = path.[0] <> '/' && path.[0] <> '@'
+	is_relative = path.[0] <> '/' && path.[0] <> '@';
+	pending_watchevents = BoundedPipe.create ~capacity:!Define.maxwatchevents ~destination:(Xenbus.Xb.queue con.xb)
 }
 
 let get_con w = w.con
@@ -93,6 +194,9 @@ let make_perm dom =
 	Perms.Connection.create ~perms:[Perms.READ; Perms.WRITE] domid
 
 let create xbcon dom =
+	let destination (watch, pkt) =
+		BoundedPipe.push watch.pending_watchevents pkt
+	in
 	let id =
 		match dom with
 		| None -> let old = !anon_id_next in incr anon_id_next; old
@@ -109,6 +213,16 @@ let create xbcon dom =
 	anonid = id;
 	stat_nb_ops = 0;
 	perm = make_perm dom;
+
+	(* the actual capacity will be lower, this is used as an overflow
+	   buffer: anything that doesn't fit elsewhere gets put here, only
+	   limited by the amount of watches that you can generate with a
+	   single xenstore command (which is finite, although possibly very
+	   large in theory for Dom0).  Once the pipe here has any contents the
+	   domain is blocked from sending more commands until it is empty
+	   again though.
+	 *)
+	pending_source_watchevents = BoundedPipe.create ~capacity:Sys.max_array_length ~destination
 	}
 	in
 	Logging.new_connection ~tid:Transaction.none ~con:(get_domstr con);
@@ -127,11 +241,17 @@ let set_target con target_domid =
 
 let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
-let send_reply con tid rid ty data =
+let packet_of con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000")
+		Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000"
 	else
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid ty data)
+		Xenbus.Xb.Packet.create tid rid ty data
+
+let send_reply con tid rid ty data =
+	let result = Xenbus.Xb.queue con.xb (packet_of con tid rid ty data) in
+	(* should never happen: we only process an input packet when there is room for an output packet *)
+	(* and the limit for replies is different from the limit for watch events *)
+	assert (result <> None)
 
 let send_error con tid rid err = send_reply con tid rid Xenbus.Xb.Op.Error (err ^ "\000")
 let send_ack con tid rid ty = send_reply con tid rid ty "OK\000"
@@ -181,11 +301,11 @@ let del_watch con path token =
 	apath, w
 
 let del_watches con =
-  Hashtbl.clear con.watches;
+  Hashtbl.reset con.watches;
   con.nb_watches <- 0
 
 let del_transactions con =
-  Hashtbl.clear con.transactions
+  Hashtbl.reset con.transactions
 
 let list_watches con =
 	let ll = Hashtbl.fold
@@ -208,21 +328,29 @@ let lookup_watch_perm path = function
 let lookup_watch_perms oldroot root path =
 	lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root)
 
-let fire_single_watch_unchecked watch =
+let fire_single_watch_unchecked source watch =
 	let data = Utils.join_by_null [watch.path; watch.token; ""] in
-	send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data
+	let pkt = packet_of watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data in
+
+	match BoundedPipe.push source.pending_source_watchevents (watch, pkt) with
+	| Some () -> () (* packet queued *)
+	| None ->
+			(* a well behaved Dom0 shouldn't be able to trigger this,
+			   if it happens it is likely a Dom0 bug causing runaway memory usage
+			 *)
+			failwith "watch event overflow, cannot happen"
 
-let fire_single_watch (oldroot, root) watch =
+let fire_single_watch source (oldroot, root) watch =
 	let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in
 	let perms = lookup_watch_perms oldroot root abspath in
 	if Perms.can_fire_watch watch.con.perm perms then
-		fire_single_watch_unchecked watch
+		fire_single_watch_unchecked source watch
 	else
 		let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in
 		let con = get_domstr watch.con in
 		Logging.watch_not_fired ~con perms (Store.Path.to_string abspath)
 
-let fire_watch roots watch path =
+let fire_watch source roots watch path =
 	let new_path =
 		if watch.is_relative && path.[0] = '/'
 		then begin
@@ -232,7 +360,7 @@ let fire_watch roots watch path =
 		end else
 			path
 	in
-	fire_single_watch roots { watch with path = new_path }
+	fire_single_watch source roots { watch with path = new_path }
 
 (* Search for a valid unused transaction id. *)
 let rec valid_transaction_id con proposed_id =
@@ -279,6 +407,7 @@ let get_transaction con tid =
 let do_input con = Xenbus.Xb.input con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
+let can_input con = Xenbus.Xb.can_input con.xb && BoundedPipe.is_empty con.pending_source_watchevents
 let has_output con = Xenbus.Xb.has_output con.xb
 let has_old_output con = Xenbus.Xb.has_old_output con.xb
 let has_new_output con = Xenbus.Xb.has_new_output con.xb
@@ -286,7 +415,7 @@ let peek_output con = Xenbus.Xb.peek_output con.xb
 let do_output con = Xenbus.Xb.output con.xb
 
 let has_more_work con =
-	has_more_input con || not (has_old_output con) && has_new_output con
+	(has_more_input con && can_input con) || not (has_old_output con) && has_new_output con
 
 let incr_ops con = con.stat_nb_ops <- con.stat_nb_ops + 1
 
diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml
index 7efdf3e5e05e..39190c19ec58 100644
--- a/tools/ocaml/xenstored/connections.ml
+++ b/tools/ocaml/xenstored/connections.ml
@@ -22,22 +22,30 @@ type t = {
 	domains: (int, Connection.t) Hashtbl.t;
 	ports: (Xeneventchn.t, Connection.t) Hashtbl.t;
 	mutable watches: (string, Connection.watch list) Trie.t;
+	mutable has_pending_watchevents: Connection.Watch.Set.t
 }
 
 let create () = {
 	anonymous = Hashtbl.create 37;
 	domains = Hashtbl.create 37;
 	ports = Hashtbl.create 37;
-	watches = Trie.create ()
+	watches = Trie.create ();
+	has_pending_watchevents = Connection.Watch.Set.empty;
 }
 
+let get_capacity () =
+	(* not multiplied by maxwatch on purpose: 2nd queue in watch itself! *)
+	{ Xenbus.Xb.maxoutstanding = !Define.maxoutstanding; maxwatchevents = !Define.maxwatchevents }
+
 let add_anonymous cons fd _can_write =
-	let xbcon = Xenbus.Xb.open_fd fd in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_fd fd ~capacity in
 	let con = Connection.create xbcon None in
 	Hashtbl.add cons.anonymous (Xenbus.Xb.get_fd xbcon) con
 
 let add_domain cons dom =
-	let xbcon = Xenbus.Xb.open_mmap (Domain.get_interface dom) (fun () -> Domain.notify dom) in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_mmap ~capacity (Domain.get_interface dom) (fun () -> Domain.notify dom) in
 	let con = Connection.create xbcon (Some dom) in
 	Hashtbl.add cons.domains (Domain.get_id dom) con;
 	match Domain.get_port dom with
@@ -48,7 +56,9 @@ let select ?(only_if = (fun _ -> true)) cons =
 	Hashtbl.fold (fun _ con (ins, outs) ->
 		if (only_if con) then (
 			let fd = Connection.get_fd con in
-			(fd :: ins,  if Connection.has_output con then fd :: outs else outs)
+			let in_fds = if Connection.can_input con then fd :: ins else ins in
+			let out_fds = if Connection.has_output con then fd :: outs else outs in
+			in_fds, out_fds
 		) else (ins, outs)
 	)
 	cons.anonymous ([], [])
@@ -67,10 +77,17 @@ let del_watches_of_con con watches =
 	| [] -> None
 	| ws -> Some ws
 
+let del_watches cons con =
+	Connection.del_watches con;
+	cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter @@ fun w ->
+		Connection.get_con w != con
+
 let del_anonymous cons con =
 	try
 		Hashtbl.remove cons.anonymous (Connection.get_fd con);
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del anonymous %s" (Printexc.to_string exn)
@@ -85,7 +102,7 @@ let del_domain cons id =
 		    | Some p -> Hashtbl.remove cons.ports p
 		    | None -> ())
 		 | None -> ());
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del domain %u: %s" id (Printexc.to_string exn)
@@ -136,31 +153,33 @@ let del_watch cons con path token =
 		cons.watches <- Trie.set cons.watches key watches;
  	watch
 
-let del_watches cons con =
-	Connection.del_watches con;
-	cons.watches <- Trie.map (del_watches_of_con con) cons.watches
-
 (* path is absolute *)
-let fire_watches ?oldroot root cons path recurse =
+let fire_watches ?oldroot source root cons path recurse =
 	let key = key_of_path path in
 	let path = Store.Path.to_string path in
 	let roots = oldroot, root in
 	let fire_watch _ = function
 		| None         -> ()
-		| Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches
+		| Some watches -> List.iter (fun w -> Connection.fire_watch source roots w path) watches
 	in
 	let fire_rec _x = function
 		| None         -> ()
 		| Some watches ->
-			List.iter (Connection.fire_single_watch roots) watches
+			List.iter (Connection.fire_single_watch source roots) watches
 	in
 	Trie.iter_path fire_watch cons.watches key;
 	if recurse then
 		Trie.iter fire_rec (Trie.sub cons.watches key)
 
+let send_watchevents cons con =
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter Connection.Watch.flush_events;
+	Connection.source_flush_watchevents con
+
 let fire_spec_watches root cons specpath =
+	let source = find_domain cons 0 in
 	iter cons (fun con ->
-		List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath))
+		List.iter (Connection.fire_single_watch source (None, root)) (Connection.get_watches con specpath))
 
 let set_target cons domain target_domain =
 	let con = find_domain cons domain in
@@ -196,3 +215,13 @@ let debug cons =
 	let anonymous = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.anonymous [] in
 	let domains = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.domains [] in
 	String.concat "" (domains @ anonymous)
+
+let debug_watchevents cons con =
+	(* == (physical equality)
+	   has to be used here because w.con.xb.backend might contain a [unit->unit] value causing regular
+	   comparison to fail due to having a 'functional value' which cannot be compared.
+	 *)
+	let s = cons.has_pending_watchevents |> Connection.Watch.Set.filter (fun w -> w.con == con) in
+	let pending = s |> Connection.Watch.Set.elements
+		|> List.map (fun w -> Connection.Watch.pending_watchevents w) |> List.fold_left (+) 0 in
+	Printf.sprintf "Watches with pending events: %d, pending events total: %d" (Connection.Watch.Set.cardinal s) pending
diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 1a5d2f34a678..9e5236709474 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -25,6 +25,13 @@ let default_config_dir = Paths.xen_config_dir
 let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
+let maxoutstanding = ref (1024) (* maximum outstanding requests, i.e. in-flight requests / domain *)
+let maxwatchevents = ref (1024)
+(*
+	maximum outstanding watch events per watch,
+	recommended >= maxoutstanding to avoid blocking backend transactions due to
+	malicious frontends
+ *)
 
 let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in
index 4ae48e42d47d..9d034e744b4b 100644
--- a/tools/ocaml/xenstored/oxenstored.conf.in
+++ b/tools/ocaml/xenstored/oxenstored.conf.in
@@ -62,6 +62,8 @@ quota-maxwatch = 100
 quota-transaction = 10
 quota-maxrequests = 1024
 quota-path-max = 1024
+quota-maxoutstanding = 1024
+quota-maxwatchevents = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index a72810d06f43..082c93fa9d3f 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -56,7 +56,7 @@ let split_one_path data con =
 	| path :: "" :: [] -> Store.Path.create path (Connection.get_path con)
 	| _                -> raise Invalid_Cmd_Args
 
-let process_watch t cons =
+let process_watch source t cons =
 	let oldroot = t.Transaction.oldroot in
 	let newroot = Store.get_root t.store in
 	let ops = Transaction.get_paths t |> List.rev in
@@ -66,8 +66,9 @@ let process_watch t cons =
 		| Xenbus.Xb.Op.Rm       -> true, None, oldroot
 		| Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot
 		| _              -> raise (Failure "huh ?") in
-		Connections.fire_watches ?oldroot root cons (snd op) recurse in
-	List.iter (fun op -> do_op_watch op cons) ops
+		Connections.fire_watches ?oldroot source root cons (snd op) recurse in
+	List.iter (fun op -> do_op_watch op cons) ops;
+	Connections.send_watchevents cons source
 
 let create_implicit_path t perm path =
 	let dirname = Store.Path.get_parent path in
@@ -99,6 +100,20 @@ let do_debug con t _domains cons data =
 	| "watches" :: _ ->
 		let watches = Connections.debug cons in
 		Some (watches ^ "\000")
+	| "xenbus" :: domid :: _ ->
+		let domid = int_of_string domid in
+		let con = Connections.find_domain cons domid in
+		let s = Printf.sprintf "xenbus: %s; overflow queue length: %d, can_input: %b, has_more_input: %b, has_old_output: %b, has_new_output: %b, has_more_work: %b. pending: %s"
+			(Xenbus.Xb.debug con.xb)
+			(Connection.source_pending_watchevents con)
+			(Connection.can_input con)
+			(Connection.has_more_input con)
+			(Connection.has_old_output con)
+			(Connection.has_new_output con)
+			(Connection.has_more_work con)
+			(Connections.debug_watchevents cons con)
+		in
+		Some s
 	| "mfn" :: domid :: _ ->
 		let domid = int_of_string domid in
 		let con = Connections.find_domain cons domid in
@@ -207,7 +222,7 @@ let reply_ack fct con t doms cons data =
 	fct con t doms cons data;
 	Packet.Ack (fun () ->
 		if Transaction.get_id t = Transaction.none then
-			process_watch t cons
+			process_watch con t cons
 	)
 
 let reply_data fct con t doms cons data =
@@ -366,7 +381,7 @@ let do_watch con t _domains cons data =
 	Packet.Ack (fun () ->
 		(* xenstore.txt says this watch is fired immediately,
 		   implying even if path doesn't exist or is unreadable *)
-		Connection.fire_single_watch_unchecked watch)
+		Connection.fire_single_watch_unchecked con watch)
 
 let do_unwatch con _t _domains cons data =
 	let (node, token) =
@@ -397,7 +412,7 @@ let do_transaction_end con t domains cons data =
 	if not success then
 		raise Transaction_again;
 	if commit then begin
-		process_watch t cons;
+		process_watch con t cons;
 		match t.Transaction.ty with
 		| Transaction.No ->
 			() (* no need to record anything *)
@@ -564,7 +579,8 @@ let process_packet ~store ~cons ~doms ~con ~req =
 let do_input store cons doms con =
 	let newpacket =
 		try
-			Connection.do_input con
+			if Connection.can_input con then Connection.do_input con
+			else None
 		with Xenbus.Xb.Reconnect ->
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
@@ -592,6 +608,7 @@ let do_input store cons doms con =
 		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
+	Connection.source_flush_watchevents con;
 	if Connection.has_output con then (
 		if Connection.has_new_output con then (
 			let packet = Connection.peek_output con in
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 0b6343dfc789..4f8fab2dd13a 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -102,6 +102,8 @@ let parse_config filename =
 		("quota-maxentity", Config.Set_int Quota.maxent);
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
+		("quota-maxoutstanding", Config.Set_int Define.maxoutstanding);
+		("quota-maxwatchevents", Config.Set_int Define.maxwatchevents);
 		("quota-path-max", Config.Set_int Define.path_max);
 		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
From b5bb80a0ad19bd15f6115861ebb31f41ad79ba00 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 29 Sep 2022 13:07:35 +0200
Subject: SUPPORT.md: clarify support of untrusted driver domains with
 oxenstored

Add a support statement covering the scope of support for the different
Xenstore variants. In particular, oxenstored does not (yet) have
security support for untrusted driver domains, as those might drive
oxenstored out of memory by creating lots of watch events for the
guests they are servicing.

Add a statement regarding Live Update support of oxenstored.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index 3f4a01101e53..2db341c1d853 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -149,6 +149,17 @@ Output of information in machine-parseable JSON format
 
     Status: Supported
 
+## Xenstore
+
+### C xenstored daemon
+
+    Status: Supported
+
+### OCaml xenstored daemon
+
+    Status: Supported
+    Status, untrusted driver domains: Supported, not security supported
+
 ## Toolstack/3rd party
 
 ### libvirt driver for xl
From 41763ca4c788638751fb5552b60cc932c375bbb9 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: split up send_reply()

Today send_reply() is used for both normal request replies and watch
events.

Split it up into send_reply() and send_event(). This will be used to
add some event-specific handling.

add_event() can be merged into send_event(), removing the need for an
intermediate memory allocation.
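
As an illustration only, the payload of a watch event (path, NUL, token,
NUL) can be assembled directly into its final buffer as sketched below.
build_event_payload() and PAYLOAD_MAX are hypothetical names used for
this sketch; the value 4096 is assumed to match XENSTORE_PAYLOAD_MAX,
and malloc() stands in for the talloc allocator used by xenstored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed limit for this sketch, standing in for XENSTORE_PAYLOAD_MAX. */
#define PAYLOAD_MAX 4096

/*
 * Build "path\0token\0" in a freshly allocated buffer (hypothetical helper).
 * Returns NULL for over-long events or on allocation failure; on success
 * *len_out receives the payload length and the caller owns the buffer.
 */
char *build_event_payload(const char *path, const char *token, size_t *len_out)
{
	size_t path_len, len;
	char *buf;

	assert(path && token && len_out);

	path_len = strlen(path) + 1;		/* include the NUL */
	len = path_len + strlen(token) + 1;	/* include the NUL */

	/* Don't try to build over-long events. */
	if (len > PAYLOAD_MAX)
		return NULL;

	buf = malloc(len);
	if (!buf)
		return NULL;

	memcpy(buf, path, path_len);
	memcpy(buf + path_len, token, len - path_len);
	*len_out = len;

	return buf;
}
```

A caller would queue the returned buffer for transmission and silently
drop the event when NULL is returned, as there is nobody to report the
error to.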

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index e0d6d23f3b76..97ff35cd2b11 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -672,49 +672,32 @@ static void send_error(struct connection *conn, int error)
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata = conn->in;
+
+	assert(type != XS_WATCH_EVENT);
 
 	if ( len > XENSTORE_PAYLOAD_MAX ) {
 		send_error(conn, E2BIG);
 		return;
 	}
 
-	/* Replies reuse the request buffer, events need a new one. */
-	if (type != XS_WATCH_EVENT) {
-		bdata = conn->in;
-		/* Drop asynchronous responses, e.g. errors for watch events. */
-		if (!bdata)
-			return;
-		bdata->inhdr = true;
-		bdata->used = 0;
-		conn->in = NULL;
-	} else {
-		/* Message is a child of the connection for auto-cleanup. */
-		bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+	bdata->inhdr = true;
+	bdata->used = 0;
 
-		/*
-		 * Allocation failure here is unfortunate: we have no way to
-		 * tell anybody about it.
-		 */
-		if (!bdata)
-			return;
-	}
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
-	else
+	else {
 		bdata->buffer = talloc_array(bdata, char, len);
-	if (!bdata->buffer) {
-		if (type == XS_WATCH_EVENT) {
-			/* Same as above: no way to tell someone. */
-			talloc_free(bdata);
+		if (!bdata->buffer) {
+			send_error(conn, ENOMEM);
 			return;
 		}
-		/* re-establish request buffer for sending ENOMEM. */
-		conn->in = bdata;
-		send_error(conn, ENOMEM);
-		return;
 	}
 
+	conn->in = NULL;
+
 	/* Update relevant header fields and fill in the message body. */
 	bdata->hdr.msg.type = type;
 	bdata->hdr.msg.len = len;
@@ -722,8 +705,39 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+}
 
-	return;
+/*
+ * Send a watch event.
+ * As this is not directly related to the current command, errors can't be
+ * reported.
+ */
+void send_event(struct connection *conn, const char *path, const char *token)
+{
+	struct buffered_data *bdata;
+	unsigned int len;
+
+	len = strlen(path) + 1 + strlen(token) + 1;
+	/* Don't try to send over-long events. */
+	if (len > XENSTORE_PAYLOAD_MAX)
+		return;
+
+	bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+
+	bdata->buffer = talloc_array(bdata, char, len);
+	if (!bdata->buffer) {
+		talloc_free(bdata);
+		return;
+	}
+	strcpy(bdata->buffer, path);
+	strcpy(bdata->buffer + strlen(path) + 1, token);
+	bdata->hdr.msg.type = XS_WATCH_EVENT;
+	bdata->hdr.msg.len = len;
+
+	/* Queue for later transmission. */
+	list_add_tail(&bdata->list, &conn->out_list);
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 9369c4cbfd26..2b0f796d9bb1 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -150,6 +150,7 @@ unsigned int get_strings(struct buffered_data *data,
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
+void send_event(struct connection *conn, const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 9ff20690c000..6d8097376e47 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -72,37 +72,17 @@ static bool is_child(const char *child, const char *parent)
 	return child[len] == '/' || child[len] == '\0';
 }
 
-/*
- * Send a watch event.
- * Temporary memory allocations are done with ctx.
- */
-static void add_event(struct connection *conn,
-		      const void *ctx,
-		      struct watch *watch,
-		      const char *name)
+static const char *get_watch_path(const struct watch *watch, const char *name)
 {
-	/* Data to send (node\0token\0). */
-	unsigned int len;
-	char *data;
+	const char *path = name;
 
 	if (watch->relative_path) {
-		name += strlen(watch->relative_path);
-		if (*name == '/') /* Could be "" */
-			name++;
+		path += strlen(watch->relative_path);
+		if (*path == '/') /* Could be "" */
+			path++;
 	}
 
-	len = strlen(name) + 1 + strlen(watch->token) + 1;
-	/* Don't try to send over-long events. */
-	if (len > XENSTORE_PAYLOAD_MAX)
-		return;
-
-	data = talloc_array(ctx, char, len);
-	if (!data)
-		return;
-	strcpy(data, name);
-	strcpy(data + strlen(name) + 1, watch->token);
-	send_reply(conn, XS_WATCH_EVENT, data, len);
-	talloc_free(data);
+	return path;
 }
 
 /*
@@ -181,10 +161,14 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			}
 		}
 	}
@@ -252,7 +236,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	send_ack(conn, XS_WATCH);
 
 	/* We fire once up front: simplifies clients and restart. */
-	add_event(conn, in, watch, watch->node);
+	send_event(conn, get_watch_path(watch, watch->node), watch->token);
 
 	return 0;
 }
From eca3108834bf1d115b24028f943eaa7e924030cd Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: add helpers to free struct buffered_data

Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.

This avoids duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 97ff35cd2b11..11b8d986340f 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -205,6 +205,21 @@ void reopen_log(void)
 	}
 }
 
+static void free_buffered_data(struct buffered_data *out,
+			       struct connection *conn)
+{
+	list_del(&out->list);
+	talloc_free(out);
+}
+
+void conn_free_buffered_data(struct connection *conn)
+{
+	struct buffered_data *out;
+
+	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
+		free_buffered_data(out, conn);
+}
+
 static bool write_messages(struct connection *conn)
 {
 	int ret;
@@ -248,8 +263,7 @@ static bool write_messages(struct connection *conn)
 
 	trace_io(conn, out, 1);
 
-	list_del(&out->list);
-	talloc_free(out);
+	free_buffered_data(out, conn);
 
 	return true;
 }
@@ -1389,18 +1403,12 @@ static struct {
  */
 static void ignore_connection(struct connection *conn)
 {
-	struct buffered_data *out, *tmp;
-
 	trace("CONN %p ignored\n", conn);
 
 	conn->is_ignored = true;
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 	conn->in = NULL;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 2b0f796d9bb1..83d49693fc19 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -226,6 +226,8 @@ extern xengnttab_handle **xgt_handle;
 
 int remember_string(struct hashtable *hash, const char *str);
 
+void conn_free_buffered_data(struct connection *conn);
+
 #endif /* _XENSTORED_CORE_H */
 
 /*
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index cbd8e6b747bd..416b92cad4b2 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -406,15 +406,10 @@ static struct domain *find_domain_by_domid(unsigned int domid)
 static void domain_conn_reset(struct domain *domain)
 {
 	struct connection *conn = domain->conn;
-	struct buffered_data *out;
 
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	while ((out = list_top(&conn->out_list, struct buffered_data, list))) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 
From f0081da9dd7218f486841dc6cacbe71d2fc761e8 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: reduce number of watch events

When removing a watched node outside of a transaction, two watch events
are produced instead of just a single one.

When finalizing a transaction, watch events can be generated for each
node that is modified, even if such modifications might not have
resulted in a watch event outside a transaction.

This happens e.g.:

- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
  is implicitly creating a/b, resulting in watch events for a, a/b and
  a/b/c instead of a/b/c only)

Avoid these additional watch events, in order to reduce the needed
memory inside Xenstore for queueing them.

This is achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can be set from
fire_watches() instead of implying them only.
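
A minimal sketch of how the two flags combine when a node is modified
several times inside one transaction: a non-exact request replaces an
exact one, but not the other way round. struct event_flags and
merge_event() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags added to struct accessed_node. */
struct event_flags {
	bool fire_watch;	/* an event should be sent at commit time */
	bool watch_exact;	/* only watches on exactly this path match */
};

/* Record one modification of the node inside a transaction. */
void merge_event(struct event_flags *f, bool watch_exact)
{
	assert(f);

	if (!f->fire_watch) {
		/* First modification: take the request as is. */
		f->fire_watch = true;
		f->watch_exact = watch_exact;
	} else if (!watch_exact) {
		/* A non-exact request widens an earlier exact one. */
		f->watch_exact = false;
	}
	/* An exact request never narrows an earlier non-exact one. */
}
```

With this rule, at most one event per modified node has to be generated
when the transaction is committed, regardless of how often the node was
touched.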

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 11b8d986340f..8f8d10cee95e 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1180,7 +1180,7 @@ static void delete_child(struct connection *conn,
 }
 
 static int delete_node(struct connection *conn, const void *ctx,
-		       struct node *parent, struct node *node)
+		       struct node *parent, struct node *node, bool watch_exact)
 {
 	char *name;
 
@@ -1192,7 +1192,7 @@ static int delete_node(struct connection *conn, const void *ctx,
 				       node->children);
 		child = name ? read_node(conn, node, name) : NULL;
 		if (child) {
-			if (delete_node(conn, ctx, node, child))
+			if (delete_node(conn, ctx, node, child, true))
 				return errno;
 		} else {
 			trace("delete_node: Error deleting child '%s/%s'!\n",
@@ -1204,7 +1204,12 @@ static int delete_node(struct connection *conn, const void *ctx,
 		talloc_free(name);
 	}
 
-	fire_watches(conn, ctx, node->name, node, true, NULL);
+	/*
+	 * Fire the watches now, when we can still see the node permissions.
+	 * This fine as we are single threaded and the next possible read will
+	 * be handled only after the node has been really removed.
+	 */
+	fire_watches(conn, ctx, node->name, node, watch_exact, NULL);
 	delete_node_single(conn, node);
 	delete_child(conn, parent, basename(node->name));
 	talloc_free(node);
@@ -1230,13 +1235,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 		return (errno == ENOMEM) ? ENOMEM : EINVAL;
 	node->parent = parent;
 
-	/*
-	 * Fire the watches now, when we can still see the node permissions.
-	 * This fine as we are single threaded and the next possible read will
-	 * be handled only after the node has been really removed.
-	 */
-	fire_watches(conn, ctx, name, node, false, NULL);
-	return delete_node(conn, ctx, parent, node);
+	return delete_node(conn, ctx, parent, node, false);
 }
 
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 4ffa18311120..6fbdb29dcdd7 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -130,6 +130,10 @@ struct accessed_node
 
 	/* Transaction node in data base? */
 	bool ta_node;
+
+	/* Watch event flags. */
+	bool fire_watch;
+	bool watch_exact;
 };
 
 struct changed_domain
@@ -330,6 +334,29 @@ int access_node(struct connection *conn, struct node *node,
 }
 
 /*
+ * A watch event should be fired for a node modified inside a transaction.
+ * Set the corresponding information. A non-exact event is replacing an exact
+ * one, but not the other way round.
+ */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact)
+{
+	struct accessed_node *i;
+
+	i = find_accessed_node(conn->transaction, name);
+	if (!i) {
+		conn->transaction->fail = true;
+		return;
+	}
+
+	if (!i->fire_watch) {
+		i->fire_watch = true;
+		i->watch_exact = watch_exact;
+	} else if (!watch_exact) {
+		i->watch_exact = false;
+	}
+}
+
+/*
  * Finalize transaction:
  * Walk through accessed nodes and check generation against global data.
  * If all entries match, read the transaction entries and write them without
@@ -383,15 +410,15 @@ static int finalize_transaction(struct connection *conn,
 				ret = tdb_store(tdb_ctx, key, data,
 						TDB_REPLACE);
 				talloc_free(data.dptr);
-				if (ret)
-					goto err;
-				fire_watches(conn, trans, i->node, NULL, false,
-					     i->perms.p ? &i->perms : NULL);
 			} else {
-				fire_watches(conn, trans, i->node, NULL, false,
+				ret = tdb_delete(tdb_ctx, key);
+			}
+			if (ret)
+				goto err;
+			if (i->fire_watch) {
+				fire_watches(conn, trans, i->node, NULL,
+					     i->watch_exact,
 					     i->perms.p ? &i->perms : NULL);
-				if (tdb_delete(tdb_ctx, key))
-					goto err;
 			}
 		}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 14062730e3c9..0093cac807e3 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -42,6 +42,9 @@ void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 int access_node(struct connection *conn, struct node *node,
                 enum node_access_type type, TDB_DATA *key);
 
+/* Queue watches for a modified node. */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact);
+
 /* Prepend the transaction to name if appropriate. */
 int transaction_prepend(struct connection *conn, const char *name,
                         TDB_DATA *key);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 6d8097376e47..2f9367767e44 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -29,6 +29,7 @@
 #include "xenstore_lib.h"
 #include "utils.h"
 #include "xenstored_domain.h"
+#include "xenstored_transaction.h"
 
 extern int quota_nb_watch_per_domain;
 
@@ -143,9 +144,11 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 	struct connection *i;
 	struct watch *watch;
 
-	/* During transactions, don't fire watches. */
-	if (conn && conn->transaction)
+	/* During transactions, don't fire watches, but queue them. */
+	if (conn && conn->transaction) {
+		queue_watches(conn, name, exact);
 		return;
+	}
 
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
From dc89e2717237dad5488b5094d6def0db5438afba Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: let unread watch events time out

A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.

In order to prevent a malicious guest from blocking other guests
by not reading watch events, add a timeout for watch events. If a
watch event hasn't been consumed when this timeout expires, it is
deleted. Set the default timeout to 20 seconds (an arbitrary value
that is not too high).

To allow other timeout values to be specified in the future, use a
generic command line option for that purpose:

--timeout|-w watch-event=<seconds>
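
The timeout arithmetic can be sketched as follows, assuming millisecond
timestamps from a monotonic clock and a deadline of 0 meaning "no
timeout pending". The helper names are hypothetical and the current
time is passed in explicitly to keep the sketch self-contained.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Default from the description above: 20 seconds. */
#define WATCH_EVENT_TIMEOUT_MSEC 20000u

/* Deadline of an event queued at now_msec; 0 is reserved for "none". */
uint64_t event_deadline(uint64_t now_msec, uint64_t timeout_msec)
{
	assert(timeout_msec > 0);

	return now_msec + timeout_msec;
}

/* True if the event has not been consumed in time and may be dropped. */
bool event_expired(uint64_t deadline_msec, uint64_t now_msec)
{
	return deadline_msec != 0 && deadline_msec <= now_msec;
}

/*
 * Upper bound in milliseconds for how long the main loop may sleep before
 * the deadline has to be checked again; -1 means no limit.
 */
int poll_timeout(uint64_t deadline_msec, uint64_t now_msec)
{
	uint64_t delta;

	if (!deadline_msec)
		return -1;
	if (deadline_msec <= now_msec)
		return 0;

	delta = deadline_msec - now_msec;

	return delta > INT_MAX ? INT_MAX : (int)delta;
}
```

For example, an event queued at 1000 ms with the default timeout gets a
deadline of 21000 ms and is eligible for deletion from that point on.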

This is part of XSA-326 / CVE-2022-42311.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 8f8d10cee95e..5fb4714b356f 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -103,6 +103,8 @@ int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 
+unsigned int timeout_watch_event_msec = 20000;
+
 void trace(const char *fmt, ...)
 {
 	va_list arglist;
@@ -205,19 +207,92 @@ void reopen_log(void)
 	}
 }
 
+static uint64_t get_now_msec(void)
+{
+	struct timespec now_ts;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now_ts))
+		barf_perror("Could not find time (clock_gettime failed)");
+
+	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
+}
+
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
+	struct buffered_data *req;
+
 	list_del(&out->list);
+
+	/*
+	 * Update conn->timeout_msec with the next found timeout value in the
+	 * queued pending requests.
+	 */
+	if (out->timeout_msec) {
+		conn->timeout_msec = 0;
+		list_for_each_entry(req, &conn->out_list, list) {
+			if (req->timeout_msec) {
+				conn->timeout_msec = req->timeout_msec;
+				break;
+			}
+		}
+	}
+
 	talloc_free(out);
 }
 
+static void check_event_timeout(struct connection *conn, uint64_t msecs,
+				int *ptimeout)
+{
+	uint64_t delta;
+	struct buffered_data *out, *tmp;
+
+	if (!conn->timeout_msec)
+		return;
+
+	delta = conn->timeout_msec - msecs;
+	if (conn->timeout_msec <= msecs) {
+		delta = 0;
+		list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
+			/*
+			 * Only look at buffers with timeout and no data
+			 * already written to the ring.
+			 */
+			if (out->timeout_msec && out->inhdr && !out->used) {
+				if (out->timeout_msec > msecs) {
+					conn->timeout_msec = out->timeout_msec;
+					delta = conn->timeout_msec - msecs;
+					break;
+				}
+
+				/*
+				 * Free out without updating conn->timeout_msec,
+				 * as the update is done in this loop already.
+				 */
+				out->timeout_msec = 0;
+				trace("watch event path %s for domain %u timed out\n",
+				      out->buffer, conn->id);
+				free_buffered_data(out, conn);
+			}
+		}
+		if (!delta) {
+			conn->timeout_msec = 0;
+			return;
+		}
+	}
+
+	if (*ptimeout == -1 || *ptimeout > delta)
+		*ptimeout = delta;
+}
+
 void conn_free_buffered_data(struct connection *conn)
 {
 	struct buffered_data *out;
 
 	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
 		free_buffered_data(out, conn);
+
+	conn->timeout_msec = 0;
 }
 
 static bool write_messages(struct connection *conn)
@@ -331,6 +406,7 @@ static void initialize_fds(int sock, int *p_sock_pollfd_idx,
 {
 	struct connection *conn;
 	struct wrl_timestampt now;
+	uint64_t msecs;
 
 	if (fds)
 		memset(fds, 0, sizeof(struct pollfd) * current_array_size);
@@ -352,10 +428,12 @@ static void initialize_fds(int sock, int *p_sock_pollfd_idx,
 
 	wrl_gettime_now(&now);
 	wrl_log_periodic(now);
+	msecs = get_now_msec();
 
 	list_for_each_entry(conn, &connections, list) {
 		if (conn->domain) {
 			wrl_check_timeout(conn->domain, now, ptimeout);
+			check_event_timeout(conn, msecs, ptimeout);
 			if (domain_can_read(conn) ||
 			    (domain_can_write(conn) &&
 			     !list_empty(&conn->out_list)))
@@ -699,6 +777,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		return;
 	bdata->inhdr = true;
 	bdata->used = 0;
+	bdata->timeout_msec = 0;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -750,6 +829,12 @@ void send_event(struct connection *conn, const char *path, const char *token)
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
 }
@@ -2009,6 +2094,9 @@ static void usage(void)
 "  -W, --watch-nb <nb>     limit the number of watches per domain,\n"
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
+"  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
+"                          allowed timeout candidates are:\n"
+"                          watch-event: time a watch-event is kept pending\n"
 "  -R, --no-recovery       to request that no recovery should be attempted when\n"
 "                          the store is corrupted (debug only),\n"
 "  -I, --internal-db       store database in memory, not on disk\n"
@@ -2030,6 +2118,7 @@ static struct option options[] = {
 	{ "trace-file", 1, NULL, 'T' },
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
+	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
 	{ "verbose", 0, NULL, 'V' },
@@ -2041,6 +2130,39 @@ int dom0_domid = 0;
 int dom0_event = 0;
 int priv_domid = 0;
 
+static int get_optval_int(const char *arg)
+{
+	char *end;
+	long val;
+
+	val = strtol(arg, &end, 10);
+	if (!*arg || *end || val < 0 || val > INT_MAX)
+		barf("invalid parameter value \"%s\"\n", arg);
+
+	return val;
+}
+
+static bool what_matches(const char *arg, const char *what)
+{
+	unsigned int what_len = strlen(what);
+
+	return !strncmp(arg, what, what_len) && arg[what_len] == '=';
+}
+
+static void set_timeout(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("timeouts must be specified via <what>=<seconds>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "watch-event"))
+		timeout_watch_event_msec = val * 1000;
+	else
+		barf("unknown timeout \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt, *sock = NULL, *ro_sock = NULL;
@@ -2052,7 +2174,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2097,6 +2219,9 @@ int main(int argc, char *argv[])
 		case 'A':
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
+		case 'w':
+			set_timeout(optarg);
+			break;
 		case 'e':
 			dom0_event = strtol(optarg, NULL, 10);
 			break;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 83d49693fc19..3112c11811e5 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -27,6 +27,7 @@
 #include <dirent.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <time.h>
 #include <errno.h>
 
 #include "xenstore_lib.h"
@@ -56,6 +57,8 @@ struct buffered_data
 		char raw[sizeof(struct xsd_sockmsg)];
 	} hdr;
 
+	uint64_t timeout_msec;
+
 	/* The actual data. */
 	char *buffer;
 	char default_buffer[DEFAULT_BUFFER_SIZE];
@@ -88,6 +91,7 @@ struct connection
 
 	/* Buffered output data */
 	struct list_head out_list;
+	uint64_t timeout_msec;
 
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
@@ -199,6 +203,8 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 
+extern unsigned int timeout_watch_event_msec;
+
 /* Map the kernel's xenstore page. */
 void *xenbus_map(void);
 void unmap_xenbus(void *interface);
From c3e397f2d77645584cc955c223b2894a4239ea4c Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: limit outstanding requests

Add another quota for limiting the number of outstanding requests of a
guest. As the way to specify quotas on the command line is becoming
rather nasty, switch to a new scheme using [--quota|-Q] <what>=<val>,
which makes it easy to add more quotas in the future.

Set the default value to 20 (basically an arbitrary value that seems
neither too high nor too low).

A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
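
The accounting rule above can be modelled as follows. This is a
simplified sketch under stated assumptions, not the code of this
patch: struct request, struct dom and all function names are
hypothetical, and the model ignores timeouts and connection teardown.

```c
#include <stdbool.h>

/* Hypothetical model of one request and the messages it caused. */
struct request {
	unsigned int event_cnt;	/* watch events not yet written out */
	bool reply_pending;	/* direct response not yet written out */
};

/* Hypothetical per-domain accounting data. */
struct dom {
	int nboutstanding;
	int quota;
};

static void request_start(struct dom *d, struct request *r)
{
	r->event_cnt = 0;
	r->reply_pending = true;
	d->nboutstanding++;
}

static void request_add_event(struct request *r)
{
	r->event_cnt++;
}

/* The reply was written; the request stays outstanding if events remain. */
static void reply_written(struct dom *d, struct request *r)
{
	r->reply_pending = false;
	if (!r->event_cnt)
		d->nboutstanding--;
}

/* One event was written; the last one settles the request. */
static void event_written(struct dom *d, struct request *r)
{
	r->event_cnt--;
	if (!r->event_cnt && !r->reply_pending)
		d->nboutstanding--;
}

/* New requests are only read while the domain is below its quota. */
static bool can_read(const struct dom *d)
{
	return d->nboutstanding < d->quota;
}
```

A request that caused two watch events is thus counted once, and only
stops being counted after its reply and both events have been written.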

Note that across a live update the relation to buffered watch events
for other domains is lost.

Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota zeroed initially.

This is part of XSA-326 / CVE-2022-42312.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 5fb4714b356f..5f1733112a4f 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -102,6 +102,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -217,12 +218,24 @@ static uint64_t get_now_msec(void)
 	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
 }
 
+/*
+ * Remove a struct buffered_data from the list of outgoing data.
+ * A struct buffered_data related to a request having caused watch events to be
+ * sent is kept until all those events have been written out.
+ * Each watch event is referencing the related request via pend.req, while the
+ * number of watch events caused by a request is kept in pend.ref.event_cnt
+ * (those two cases are mutually exclusive, so the two fields can share memory
+ * via a union).
+ * The struct buffered_data is freed only if no related watch event is
+ * referencing it. The related return data can be freed right away.
+ */
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
 	struct buffered_data *req;
 
 	list_del(&out->list);
+	out->on_out_list = false;
 
 	/*
 	 * Update conn->timeout_msec with the next found timeout value in the
@@ -238,6 +251,30 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	if (out->hdr.msg.type == XS_WATCH_EVENT) {
+		req = out->pend.req;
+		if (req) {
+			req->pend.ref.event_cnt--;
+			if (!req->pend.ref.event_cnt && !req->on_out_list) {
+				if (req->on_ref_list) {
+					domain_outstanding_domid_dec(
+						req->pend.ref.domid);
+					list_del(&req->list);
+				}
+				talloc_free(req);
+			}
+		}
+	} else if (out->pend.ref.event_cnt) {
+		/* Detach out from conn. */
+		talloc_steal(NULL, out);
+		if (out->buffer != out->default_buffer)
+			talloc_free(out->buffer);
+		list_add(&out->list, &conn->ref_list);
+		out->on_ref_list = true;
+		return;
+	} else
+		domain_outstanding_dec(conn);
+
 	talloc_free(out);
 }
 
@@ -346,6 +383,7 @@ static bool write_messages(struct connection *conn)
 static int destroy_conn(void *_conn)
 {
 	struct connection *conn = _conn;
+	struct buffered_data *req;
 
 	/* Flush outgoing if possible, but don't block. */
 	if (!conn->domain) {
@@ -359,6 +397,11 @@ static int destroy_conn(void *_conn)
 				break;
 		close(conn->fd);
 	}
+
+	conn_free_buffered_data(conn);
+	list_for_each_entry(req, &conn->ref_list, list)
+		req->on_ref_list = false;
+
         if (conn->target)
                 talloc_unlink(conn, conn->target);
 	list_del(&conn->list);
@@ -798,6 +841,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	domain_outstanding_inc(conn);
 }
 
 /*
@@ -805,7 +850,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
  * As this is not directly related to the current command, errors can't be
  * reported.
  */
-void send_event(struct connection *conn, const char *path, const char *token)
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token)
 {
 	struct buffered_data *bdata;
 	unsigned int len;
@@ -835,8 +881,13 @@ void send_event(struct connection *conn, const char *path, const char *token)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->pend.req = req;
+	if (req)
+		req->pend.ref.event_cnt++;
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
@@ -1572,6 +1623,7 @@ static void handle_input(struct connection *conn)
 			return;
 	}
 	in = conn->in;
+	in->pend.ref.domid = conn->id;
 
 	/* Not finished header yet? */
 	if (in->inhdr) {
@@ -1642,6 +1694,7 @@ struct connection *new_connection(connwritefn_t *write, connreadfn_t *read)
 	new->is_ignored = false;
 	new->transaction_started = 0;
 	INIT_LIST_HEAD(&new->out_list);
+	INIT_LIST_HEAD(&new->ref_list);
 	INIT_LIST_HEAD(&new->watches);
 	INIT_LIST_HEAD(&new->transaction_list);
 
@@ -2094,6 +2147,9 @@ static void usage(void)
 "  -W, --watch-nb <nb>     limit the number of watches per domain,\n"
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
+"  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
+"                          quotas are:\n"
+"                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2118,6 +2174,7 @@ static struct option options[] = {
 	{ "trace-file", 1, NULL, 'T' },
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
+	{ "quota", 1, NULL, 'Q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2163,6 +2220,20 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
+static void set_quota(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("quotas must be specified via <what>=<nb>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "outstanding"))
+		quota_req_outstanding = val;
+	else
+		barf("unknown quota \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt, *sock = NULL, *ro_sock = NULL;
@@ -2174,7 +2245,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:w:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2219,6 +2290,9 @@ int main(int argc, char *argv[])
 		case 'A':
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
+		case 'Q':
+			set_quota(optarg);
+			break;
 		case 'w':
 			set_timeout(optarg);
 			break;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 3112c11811e5..edeaa96dd10b 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -45,6 +45,8 @@ typedef int32_t wrl_creditt;
 struct buffered_data
 {
 	struct list_head list;
+	bool on_out_list;
+	bool on_ref_list;
 
 	/* Are we still doing the header? */
 	bool inhdr;
@@ -52,6 +54,17 @@ struct buffered_data
 	/* How far are we? */
 	unsigned int used;
 
+	/* Outstanding request accounting. */
+	union {
+		/* ref is being used for requests. */
+		struct {
+			unsigned int event_cnt; /* # of outstanding events. */
+			unsigned int domid;     /* domid of request. */
+		} ref;
+		/* req is being used for watch events. */
+		struct buffered_data *req;      /* request causing event. */
+	} pend;
+
 	union {
 		struct xsd_sockmsg msg;
 		char raw[sizeof(struct xsd_sockmsg)];
@@ -93,6 +106,9 @@ struct connection
 	struct list_head out_list;
 	uint64_t timeout_msec;
 
+	/* Referenced requests no longer pending. */
+	struct list_head ref_list;
+
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
 
@@ -154,7 +170,8 @@ unsigned int get_strings(struct buffered_data *data,
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
-void send_event(struct connection *conn, const char *path, const char *token);
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
@@ -202,6 +219,7 @@ extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
+extern int quota_req_outstanding;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 416b92cad4b2..58b7e0fe2fa7 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -86,6 +86,9 @@ struct domain
 	/* number of watch for this domain */
 	int nbwatch;
 
+	/* Number of outstanding requests. */
+	int nboutstanding;
+
 	/* write rate limit */
 	wrl_creditt wrl_credit; /* [ -wrl_config_writecost, +_dburst ] */
 	struct wrl_timestampt wrl_timestamp;
@@ -288,8 +291,12 @@ bool domain_can_read(struct connection *conn)
 {
 	struct xenstore_domain_interface *intf = conn->domain->interface;
 
-	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
-		return false;
+	if (domain_is_unprivileged(conn)) {
+		if (conn->domain->wrl_credit < 0)
+			return false;
+		if (conn->domain->nboutstanding >= quota_req_outstanding)
+			return false;
+	}
 
 	if (conn->is_ignored)
 		return false;
@@ -338,7 +345,7 @@ static struct domain *alloc_domain(void *context, unsigned int domid)
 {
 	struct domain *domain;
 
-	domain = talloc(context, struct domain);
+	domain = talloc_zero(context, struct domain);
 	if (!domain) {
 		errno = ENOMEM;
 		return NULL;
@@ -387,8 +394,6 @@ static int new_domain(struct domain *domain, int port)
 	domain->conn->id = domain->domid;
 
 	domain->remote_port = port;
-	domain->nbentry = 0;
-	domain->nbwatch = 0;
 
 	return 0;
 }
@@ -929,6 +934,28 @@ int domain_watch(struct connection *conn)
 		: 0;
 }
 
+void domain_outstanding_inc(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding++;
+}
+
+void domain_outstanding_dec(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding--;
+}
+
+void domain_outstanding_domid_dec(unsigned int domid)
+{
+	struct domain *d = find_domain_by_domid(domid);
+
+	if (d)
+		d->nboutstanding--;
+}
+
 static wrl_creditt wrl_config_writecost      = WRL_FACTOR;
 static wrl_creditt wrl_config_rate           = WRL_RATE   * WRL_FACTOR;
 static wrl_creditt wrl_config_dburst         = WRL_DBURST * WRL_FACTOR;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 5e00087206c7..4bff2e655b9b 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -67,6 +67,9 @@ int domain_entry(struct connection *conn);
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
+void domain_outstanding_inc(struct connection *conn);
+void domain_outstanding_dec(struct connection *conn);
+void domain_outstanding_domid_dec(unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 2f9367767e44..c50c0575f0f1 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -142,6 +142,7 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		  struct node *node, bool exact, struct node_perms *perms)
 {
 	struct connection *i;
+	struct buffered_data *req;
 	struct watch *watch;
 
 	/* During transactions, don't fire watches, but queue them. */
@@ -150,6 +151,8 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		return;
 	}
 
+	req = domain_is_unprivileged(conn) ? conn->in : NULL;
+
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
 		/* introduce/release domain watches */
@@ -164,12 +167,12 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			}
@@ -238,8 +241,12 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	talloc_set_destructor(watch, destroy_watch);
 	send_ack(conn, XS_WATCH);
 
-	/* We fire once up front: simplifies clients and restart. */
-	send_event(conn, get_watch_path(watch, watch->node), watch->token);
+	/*
+	 * We fire once up front: simplifies clients and restart.
+	 * This event will not be linked to the XS_WATCH request.
+	 */
+	send_event(NULL, conn, get_watch_path(watch, watch->node),
+		   watch->token);
 
 	return 0;
 }
From 5c18ae5ed96fd62f462b4be9a95022f143f6dee4 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: don't buffer multiple identical watch events

A guest not reading its Xenstore response buffer fast enough might
cause lots of buffered Xenstore watch events to pile up. Reduce the
generated load by dropping new events which already have an identical
copy pending.

The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
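
The duplicate check can be sketched as follows. This is an
illustration only: struct queued_event and should_drop() are
hypothetical names, an array stands in for the connection's output
list, and path and token are compared as separate strings, whereas
the patch compares the combined message buffer with memcmp().

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a watch event that is already queued. */
struct queued_event {
	const char *path;
	const char *token;
};

/*
 * Return true if a new event for (path, token) may be dropped because
 * an identical one is already pending. Special "@..." events are never
 * dropped, as handlers may rely on seeing each of them.
 */
static bool should_drop(const struct queued_event *q, size_t n,
			const char *path, const char *token)
{
	size_t i;

	if (path[0] == '@')
		return false;

	for (i = 0; i < n; i++) {
		if (!strcmp(q[i].path, path) && !strcmp(q[i].token, token))
			return true;
	}

	return false;
}
```

Note that the scan is linear in the number of pending events; this
stays bounded once the number of buffered events per connection is
limited.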

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 5f1733112a4f..0621023bca16 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -821,6 +821,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->inhdr = true;
 	bdata->used = 0;
 	bdata->timeout_msec = 0;
+	bdata->watch_event = false;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -853,7 +854,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 void send_event(struct buffered_data *req, struct connection *conn,
 		const char *path, const char *token)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata, *bd;
 	unsigned int len;
 
 	len = strlen(path) + 1 + strlen(token) + 1;
@@ -875,12 +876,29 @@ void send_event(struct buffered_data *req, struct connection *conn,
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	/*
+	 * Check whether an identical event is pending already.
+	 * Special events are excluded from that check.
+	 */
+	if (path[0] != '@') {
+		list_for_each_entry(bd, &conn->out_list, list) {
+			if (bd->watch_event && bd->hdr.msg.len == len &&
+			    !memcmp(bdata->buffer, bd->buffer, len)) {
+				trace("dropping duplicate watch %s %s for domain %u\n",
+				      path, token, conn->id);
+				talloc_free(bdata);
+				return;
+			}
+		}
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->watch_event = true;
 	bdata->pend.req = req;
 	if (req)
 		req->pend.ref.event_cnt++;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index edeaa96dd10b..1eb6131fc88d 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -51,6 +51,9 @@ struct buffered_data
 	/* Are we still doing the header? */
 	bool inhdr;
 
+	/* Is this a watch event? */
+	bool watch_event;
+
 	/* How far are we? */
 	unsigned int used;
 
From 665a6ae7a4eb3977564a6f00c91758f988e35be8 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: fix connection->id usage

Don't use conn->id for privilege checks, but domain_is_unprivileged().

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index e4b8aa95abfd..d3272e2ef9b5 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -180,7 +180,7 @@ int do_control(struct connection *conn, struct buffered_data *in)
 	int cmd;
 	char **vec;
 
-	if (conn->id != 0)
+	if (domain_is_unprivileged(conn))
 		return EACCES;
 
 	num = xs_count_strings(in->buffer, in->used);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 1eb6131fc88d..98db4afcaabf 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -93,7 +93,7 @@ struct connection
 	/* The index of pollfd in global pollfd array */
 	int pollfd_idx;
 
-	/* Who am I? 0 for socket connections. */
+	/* Who am I? Domid of connection. */
 	unsigned int id;
 
 	/* Is this a read-only connection? */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 6fbdb29dcdd7..9bef6e72a566 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -483,7 +483,8 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 	if (conn->transaction)
 		return EBUSY;
 
-	if (conn->id && conn->transaction_started > quota_max_transaction)
+	if (domain_is_unprivileged(conn) &&
+	    conn->transaction_started > quota_max_transaction)
 		return ENOSPC;
 
 	/* Attach transaction to input for autofree until it's complete */
From bfdab395b08993c6bda7c546d6d1932d7cab8834 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: simplify and fix per domain node accounting

The accounting of nodes can be simplified now that each connection
holds the associated domid.

Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.

The node accounting also has to be fixed in some cases, especially in
error paths.
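
The error handling pattern used when a node changes its owner can be
sketched as follows. This is a simplified model and not the code of
this patch: the fixed-size table, MAX_DOMS and all function names are
hypothetical, and a full table stands in for an allocation failure.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_DOMS 2	/* hypothetical table size */

struct dom_acct {
	unsigned int domid;
	int nbentry;
	int used;
};

static struct dom_acct table[MAX_DOMS];

static struct dom_acct *find_dom(unsigned int domid)
{
	size_t i;

	for (i = 0; i < MAX_DOMS; i++)
		if (table[i].used && table[i].domid == domid)
			return &table[i];
	return NULL;
}

/* Find the accounting entry, allocating one if needed; NULL if full. */
static struct dom_acct *find_or_alloc(unsigned int domid)
{
	struct dom_acct *d = find_dom(domid);
	size_t i;

	for (i = 0; !d && i < MAX_DOMS; i++) {
		if (!table[i].used) {
			d = &table[i];
			d->used = 1;
			d->domid = domid;
			d->nbentry = 0;
		}
	}
	return d;
}

static int entry_inc(unsigned int domid)
{
	struct dom_acct *d = find_or_alloc(domid);

	if (!d)
		return ENOMEM;
	d->nbentry++;
	return 0;
}

static void entry_dec(unsigned int domid)
{
	struct dom_acct *d = find_dom(domid);

	if (d && d->nbentry)
		d->nbentry--;
}

static int entry_count(unsigned int domid)
{
	struct dom_acct *d = find_dom(domid);

	return d ? d->nbentry : 0;
}

/* Move one node from *owner to new_owner, rolling back on failure. */
static int change_owner(unsigned int *owner, unsigned int new_owner)
{
	unsigned int old = *owner;
	int rc;

	entry_dec(old);
	rc = entry_inc(new_owner);
	if (rc) {
		/* Cannot fail: the old owner's entry still exists. */
		entry_inc(old);
		return rc;
	}
	*owner = new_owner;
	return 0;
}
```

The rollback relies on the old owner's entry still being present,
which mirrors the reasoning given in the do_set_perms() comment.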

This is part of XSA-326 / CVE-2022-42313.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index d3272e2ef9b5..715e0d2a9e03 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -25,6 +25,7 @@
 #include "talloc.h"
 #include "xenstored_core.h"
 #include "xenstored_control.h"
+#include "xenstored_domain.h"
 
 struct cmd_s {
 	char *cmd;
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 0621023bca16..98d242e06241 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -543,7 +543,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(node)) {
+	if (domain_adjust_node_perms(conn, node)) {
 		talloc_free(node);
 		return NULL;
 	}
@@ -565,7 +565,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	void *p;
 	struct xs_tdb_record_hdr *hdr;
 
-	if (domain_adjust_node_perms(node))
+	if (domain_adjust_node_perms(conn, node))
 		return errno;
 
 	data.dsize = sizeof(*hdr)
@@ -1159,13 +1159,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static int destroy_node(struct connection *conn, struct node *node)
+static void destroy_node_rm(struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
 	tdb_delete(tdb_ctx, node->key);
+}
 
+static int destroy_node(struct connection *conn, struct node *node)
+{
+	destroy_node_rm(node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1215,8 +1219,12 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 			goto err;
 
 		/* Account for new node */
-		if (i->parent)
-			domain_entry_inc(conn, i);
+		if (i->parent) {
+			if (domain_entry_inc(conn, i)) {
+				destroy_node_rm(i);
+				return NULL;
+			}
+		}
 	}
 
 	return node;
@@ -1497,10 +1505,27 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in)
 	old_perms = node->perms;
 	domain_entry_dec(conn, node);
 	node->perms = perms;
-	domain_entry_inc(conn, node);
+	if (domain_entry_inc(conn, node)) {
+		node->perms = old_perms;
+		/*
+		 * This should never fail because we had a reference on the
+		 * domain before and Xenstored is single-threaded.
+		 */
+		domain_entry_inc(conn, node);
+		return ENOMEM;
+	}
 
-	if (write_node(conn, node, false))
+	if (write_node(conn, node, false)) {
+		int saved_errno = errno;
+
+		domain_entry_dec(conn, node);
+		node->perms = old_perms;
+		/* No failure possible as above. */
+		domain_entry_inc(conn, node);
+
+		errno = saved_errno;
 		return errno;
+	}
 
 	fire_watches(conn, in, name, node, false, &old_perms);
 	send_ack(conn, XS_SET_PERMS);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 58b7e0fe2fa7..f4134db3e73a 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -16,6 +16,7 @@
     along with this program; If not, see <http://www.gnu.org/licenses/>.
 */
 
+#include <assert.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
@@ -362,6 +363,18 @@ static struct domain *alloc_domain(void *context, unsigned int domid)
 	return domain;
 }
 
+static struct domain *find_or_alloc_existing_domain(unsigned int domid)
+{
+	struct domain *domain;
+	xc_dominfo_t dominfo;
+
+	domain = find_domain_struct(domid);
+	if (!domain && get_domain_info(domid, &dominfo))
+		domain = alloc_domain(NULL, domid);
+
+	return domain;
+}
+
 static int new_domain(struct domain *domain, int port)
 {
 	int rc;
@@ -774,30 +787,28 @@ void domain_init(void)
 	virq_port = rc;
 }
 
-void domain_entry_inc(struct connection *conn, struct node *node)
+int domain_entry_inc(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
-		return;
+		return 0;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d)
-				d->nbentry++;
-		}
-	} else if (conn->domain) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				conn->domain->domid);
- 		} else {
- 			conn->domain->nbentry++;
-		}
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_inc(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_or_alloc_existing_domain(domid);
+		if (d)
+			d->nbentry++;
+		else
+			return ENOMEM;
 	}
+
+	return 0;
 }
 
 /*
@@ -833,7 +844,7 @@ static int chk_domain_generation(unsigned int domid, uint64_t gen)
  * Remove permissions for no longer existing domains in order to avoid a new
  * domain with the same domid inheriting the permissions.
  */
-int domain_adjust_node_perms(struct node *node)
+int domain_adjust_node_perms(struct connection *conn, struct node *node)
 {
 	unsigned int i;
 	int ret;
@@ -843,8 +854,14 @@ int domain_adjust_node_perms(struct node *node)
 		return errno;
 
 	/* If the owner doesn't exist any longer give it to priv domain. */
-	if (!ret)
+	if (!ret) {
+		/*
+		 * In theory we'd need to update the number of dom0 nodes here,
+		 * but we could be called for a read of the node. So better
+		 * avoid the risk to overflow the node count of dom0.
+		 */
 		node->perms.p[0].id = priv_domid;
+	}
 
 	for (i = 1; i < node->perms.num; i++) {
 		if (node->perms.p[i].perms & XS_PERM_IGNORE)
@@ -863,25 +880,25 @@ int domain_adjust_node_perms(struct node *node)
 void domain_entry_dec(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
 		return;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d && d->nbentry)
-				d->nbentry--;
-		}
-	} else if (conn->domain && conn->domain->nbentry) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				conn->domain->domid);
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_dec(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_domain_struct(domid);
+		if (d) {
+			d->nbentry--;
 		} else {
-			conn->domain->nbentry--;
+			errno = ENOENT;
+			corrupt(conn,
+				"Node \"%s\" owned by non-existing domain %u\n",
+				node->name, domid);
 		}
 	}
 }
@@ -891,13 +908,23 @@ int domain_entry_fix(unsigned int domid, int num, bool update)
 	struct domain *d;
 	int cnt;
 
-	d = find_domain_by_domid(domid);
-	if (!d)
-		return 0;
+	if (update) {
+		d = find_domain_struct(domid);
+		assert(d);
+	} else {
+		/*
+		 * We are called first with update == false in order to catch
+		 * any error. So do a possible allocation and check for error
+		 * only in this case, as in the case of update == true nothing
+		 * can go wrong anymore as the allocation already happened.
+		 */
+		d = find_or_alloc_existing_domain(domid);
+		if (!d)
+			return -1;
+	}
 
 	cnt = d->nbentry + num;
-	if (cnt < 0)
-		cnt = 0;
+	assert(cnt >= 0);
 
 	if (update)
 		d->nbentry = cnt;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4bff2e655b9b..4edf1dba9425 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -57,10 +57,10 @@ bool domain_can_write(struct connection *conn);
 bool domain_is_unprivileged(struct connection *conn);
 
 /* Remove node permissions for no longer existing domains. */
-int domain_adjust_node_perms(struct node *node);
+int domain_adjust_node_perms(struct connection *conn, struct node *node);
 
 /* Quota manipulation */
-void domain_entry_inc(struct connection *conn, struct node *);
+int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 9bef6e72a566..bf2fda8234b3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -523,8 +523,12 @@ static int transaction_fix_domains(struct transaction *trans, bool update)
 
 	list_for_each_entry(d, &trans->changed_domains, list) {
 		cnt = domain_entry_fix(d->domid, d->nbentry, update);
-		if (!update && cnt >= quota_nb_entry_per_domain)
-			return ENOSPC;
+		if (!update) {
+			if (cnt >= quota_nb_entry_per_domain)
+				return ENOSPC;
+			if (cnt < 0)
+				return ENOMEM;
+		}
 	}
 
 	return 0;
From 7d313f4322e44425882b21764bc9c42a790d030a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: limit max number of nodes accessed in a transaction

Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore,
because Xenstore has to keep track of every node accessed during a
transaction.

In oxenstored the number of requests in a transaction is limited via
the quota maxrequests (default is 1024). As multiple accesses of the
same node are not problematic in C Xenstore, limit the number of
accessed nodes instead.

In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
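
The counting rule described above can be illustrated with a small
standalone sketch. It is not the code of this patch: all sketch_* names
and the fixed-size array are assumptions made for illustration only.
Only the first access of a node is counted, and the quota applies only
to unprivileged callers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_NODES 8	/* storage bound of this sketch only */

struct sketch_trans {
	const char *accessed[SKETCH_MAX_NODES];
	unsigned int nodes;	/* number of distinct nodes accessed */
};

/*
 * Returns 0 on success, ENOSPC if an unprivileged caller would exceed
 * the quota, ENOMEM if the sketch's own storage is exhausted.
 */
int sketch_access_node(struct sketch_trans *t, const char *name,
		       unsigned int quota, bool unprivileged)
{
	unsigned int i;

	/* A repeated access of a known node is not counted again. */
	for (i = 0; i < t->nodes; i++)
		if (!strcmp(t->accessed[i], name))
			return 0;

	if (t->nodes >= quota && unprivileged)
		return ENOSPC;
	if (t->nodes >= SKETCH_MAX_NODES)
		return ENOMEM;

	t->accessed[t->nodes++] = name;
	return 0;
}
```

In the sketch, as in the description above, a quota error (ENOSPC) is
kept distinct from "node does not exist" so callers can pass it on
unchanged.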

This is part of XSA-326 / CVE-2022-42314.

Reported-by: Julien Grall <jgrall@amazon.com>
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/include/xen-tools/libs.h b/tools/include/xen-tools/libs.h
index cc7dfc8c6453..34db3b784732 100644
--- a/tools/include/xen-tools/libs.h
+++ b/tools/include/xen-tools/libs.h
@@ -59,4 +59,8 @@
     })
 #endif
 
+#ifndef __must_check
+#define __must_check __attribute__((__warn_unused_result__))
+#endif
+
 #endif	/* __XEN_TOOLS_LIBS__ */
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 98d242e06241..57c999129215 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -102,6 +102,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_trans_nodes = 1024;
 int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
@@ -500,6 +501,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	TDB_DATA key, data;
 	struct xs_tdb_record_hdr *hdr;
 	struct node *node;
+	int err;
 
 	node = talloc(ctx, struct node);
 	if (!node) {
@@ -521,14 +523,13 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	if (data.dptr == NULL) {
 		if (tdb_error(tdb_ctx) == TDB_ERR_NOEXIST) {
 			node->generation = NO_GENERATION;
-			access_node(conn, node, NODE_ACCESS_READ, NULL);
-			errno = ENOENT;
+			err = access_node(conn, node, NODE_ACCESS_READ, NULL);
+			errno = err ? : ENOENT;
 		} else {
 			log("TDB error on read: %s", tdb_errorstr(tdb_ctx));
 			errno = EIO;
 		}
-		talloc_free(node);
-		return NULL;
+		goto error;
 	}
 
 	node->parent = NULL;
@@ -543,19 +544,36 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(conn, node)) {
-		talloc_free(node);
-		return NULL;
-	}
+	if (domain_adjust_node_perms(conn, node))
+		goto error;
 
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
 	node->children = node->data + node->datalen;
 
-	access_node(conn, node, NODE_ACCESS_READ, NULL);
+	if (access_node(conn, node, NODE_ACCESS_READ, NULL))
+		goto error;
 
 	return node;
+
+ error:
+	err = errno;
+	talloc_free(node);
+	errno = err;
+	return NULL;
+}
+
+static bool read_node_can_propagate_errno(void)
+{
+	/*
+	 * 2 error cases for read_node() can always be propagated up:
+	 * ENOMEM, because this has nothing to do with the node being in the
+	 * data base or not, but is caused by a general lack of memory.
+	 * ENOSPC, because this is related to hitting quota limits which need
+	 * to be respected.
+	 */
+	return errno == ENOMEM || errno == ENOSPC;
 }
 
 int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
@@ -670,7 +688,7 @@ static int ask_parents(struct connection *conn, const void *ctx,
 		node = read_node(conn, ctx, name);
 		if (node)
 			break;
-		if (errno == ENOMEM)
+		if (read_node_can_propagate_errno())
 			return errno;
 	} while (!streq(name, "/"));
 
@@ -733,7 +751,7 @@ static struct node *get_node(struct connection *conn,
 		}
 	}
 	/* Clean up errno if they weren't supposed to know. */
-	if (!node && errno != ENOMEM)
+	if (!node && !read_node_can_propagate_errno())
 		errno = errno_from_parents(conn, ctx, name, errno, perm);
 	return node;
 }
@@ -1115,7 +1133,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 
 	/* If parent doesn't exist, create it. */
 	parent = read_node(conn, parentname, parentname);
-	if (!parent)
+	if (!parent && errno == ENOENT)
 		parent = construct_node(conn, ctx, parentname);
 	if (!parent)
 		return NULL;
@@ -1394,7 +1412,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 
 	parent = read_node(conn, ctx, parentname);
 	if (!parent)
-		return (errno == ENOMEM) ? ENOMEM : EINVAL;
+		return read_node_can_propagate_errno() ? errno : EINVAL;
 	node->parent = parent;
 
 	return delete_node(conn, ctx, parent, node, false);
@@ -1422,7 +1440,7 @@ static int do_rm(struct connection *conn, struct buffered_data *in)
 				return 0;
 			}
 			/* Restore errno, just in case. */
-			if (errno != ENOMEM)
+			if (!read_node_can_propagate_errno())
 				errno = ENOENT;
 		}
 		return errno;
@@ -2192,6 +2210,8 @@ static void usage(void)
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
 "                          quotas are:\n"
+"                          transaction-nodes: number of accessed node per\n"
+"                                             transaction\n"
 "                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
@@ -2273,6 +2293,8 @@ static void set_quota(const char *arg)
 	val = get_optval_int(eq + 1);
 	if (what_matches(arg, "outstanding"))
 		quota_req_outstanding = val;
+	else if (what_matches(arg, "transaction-nodes"))
+		quota_trans_nodes = val;
 	else
 		barf("unknown quota \"%s\"\n", arg);
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 98db4afcaabf..7e371253d2d1 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -34,6 +34,7 @@
 #include "list.h"
 #include "tdb.h"
 #include "hashtable.h"
+#include "utils.h"
 
 /* DEFAULT_BUFFER_SIZE should be large enough for each errno string. */
 #define DEFAULT_BUFFER_SIZE 16
@@ -223,6 +224,7 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
+extern int quota_trans_nodes;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index bf2fda8234b3..778b7e439cb3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -156,6 +156,9 @@ struct transaction
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
+	/* Node counter. */
+	unsigned int nodes;
+
 	/* Generation when transaction started. */
 	uint64_t generation;
 
@@ -266,6 +269,11 @@ int access_node(struct connection *conn, struct node *node,
 
 	i = find_accessed_node(trans, node->name);
 	if (!i) {
+		if (trans->nodes >= quota_trans_nodes &&
+		    domain_is_unprivileged(conn)) {
+			ret = ENOSPC;
+			goto err;
+		}
 		i = talloc_zero(trans, struct accessed_node);
 		if (!i)
 			goto nomem;
@@ -303,6 +311,7 @@ int access_node(struct connection *conn, struct node *node,
 				i->ta_node = true;
 			}
 		}
+		trans->nodes++;
 		list_add_tail(&i->list, &trans->accessed);
 	}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 0093cac807e3..e3cbd6b23095 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -39,8 +39,8 @@ void transaction_entry_inc(struct transaction *trans, unsigned int domid);
 void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 
 /* This node was accessed. */
-int access_node(struct connection *conn, struct node *node,
-                enum node_access_type type, TDB_DATA *key);
+int __must_check access_node(struct connection *conn, struct node *node,
+                             enum node_access_type type, TDB_DATA *key);
 
 /* Queue watches for a modified node. */
 void queue_watches(struct connection *conn, const char *name, bool watch_exact);
From a0d3bce827ca656fb5d8f462bb6786f9567dddfa Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: move the call of setup_structure() to dom0
 introduction

Setting up the basic structure when introducing dom0 makes it possible
to add proper node memory accounting for the added nodes later.

It also makes it possible to account the created nodes in the node
count of dom0.

For this to work, the owner of the created nodes additionally needs to
be corrected to dom0_domid instead of domid 0.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 57c9991292..1335051a53 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1832,7 +1832,8 @@ static int tdb_flags;
 static void manual_node(const char *name, const char *child)
 {
 	struct node *node;
-	struct xs_permissions perms = { .id = 0, .perms = XS_PERM_NONE };
+	struct xs_permissions perms = { .id = dom0_domid,
+					.perms = XS_PERM_NONE };
 
 	node = talloc_zero(NULL, struct node);
 	if (!node)
@@ -1871,7 +1872,7 @@ static void tdb_logger(TDB_CONTEXT *tdb, int level, const char * fmt, ...)
 	}
 }
 
-static void setup_structure(void)
+void setup_structure(void)
 {
 	char *tdbname;
 	tdbname = talloc_strdup(talloc_autofree_context(), xs_daemon_tdb());
@@ -1889,6 +1890,7 @@ static void setup_structure(void)
 	manual_node("/", "tool");
 	manual_node("/tool", "xenstored");
 	manual_node("/tool/xenstored", NULL);
+	domain_entry_fix(dom0_domid, 3, true);
 
 	check_store();
 }
@@ -2402,9 +2404,6 @@ int main(int argc, char *argv[])
 
 	init_pipe(reopen_log_pipe);
 
-	/* Setup the database */
-	setup_structure();
-
 	/* Listen to hypervisor. */
 	if (!no_domain_init)
 		domain_init();
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 7e371253d2..d95e4262a9 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -195,6 +195,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 struct node *read_node(struct connection *conn, const void *ctx,
 		       const char *name);
 
+void setup_structure(void);
 struct connection *new_connection(connwritefn_t *write, connreadfn_t *read);
 void check_store(void);
 void corrupt(struct connection *conn, const char *fmt, ...);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index f4134db3e7..8bf9db2d96 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -739,6 +739,8 @@ static int dom0_init(void)
 	if (dom0->interface == NULL)
 		return -1;
 
+	setup_structure();
+
 	talloc_steal(dom0->conn, dom0); 
 
 	xenevtchn_notify(xce_handle, dom0->port);
-- 
2.35.3

From ac36354c8fabf58b12781358dec39aecc3a6376b Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add infrastructure to keep track of per domain memory
 usage

The amount of memory a domain can consume in Xenstore is limited by
various quotas today, but even with sane quotas a domain can still
consume rather large amounts of memory.

Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.

There are two quotas defined: a soft quota, which results in a warning
being issued via syslog() when it is exceeded, and a hard quota, which
makes Xenstore stop accepting further requests or watch events for as
long as accepting them would violate the hard quota.

Setting either of those quotas to 0 disables it.

As default values use 2MB per domain for the soft limit (this basically
covers the allowed case to create 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
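
The relation between the two limits can be sketched as below. This is
an illustration under stated assumptions, not the code of this patch:
the sketch_* names are hypothetical, and the real check additionally
rate-limits its syslog() messages and skips privileged domains.

```c
#include <assert.h>

/* Default limits as described above. */
#define SKETCH_SOFT_DEFAULT (2 * 1024 * 1024)			/* 2 MB */
#define SKETCH_HARD_DEFAULT (2 * 1024 * 1024 + 512 * 1024)	/* 2.5 MB */

enum sketch_quota_state {
	SKETCH_QUOTA_OK,	/* below both limits */
	SKETCH_QUOTA_SOFT,	/* warn only */
	SKETCH_QUOTA_HARD,	/* stop serving the domain */
};

/* A limit of 0 means that the corresponding quota is disabled. */
enum sketch_quota_state sketch_mem_check(int mem, int soft, int hard)
{
	if (hard && mem >= hard)
		return SKETCH_QUOTA_HARD;
	if (soft && mem >= soft)
		return SKETCH_QUOTA_SOFT;
	return SKETCH_QUOTA_OK;
}
```

With the defaults, 1000 nodes of 2048 bytes each (2048000 bytes) stay
just below the soft limit of 2097152 bytes.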

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 1335051a53f3..217096d91a9d 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -104,6 +104,8 @@ int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_trans_nodes = 1024;
 int quota_req_outstanding = 20;
+int quota_memory_per_domain_soft = 2 * 1024 * 1024; /* 2 MB */
+int quota_memory_per_domain_hard = 2 * 1024 * 1024 + 512 * 1024; /* 2.5 MB */
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -2214,7 +2216,14 @@ static void usage(void)
 "                          quotas are:\n"
 "                          transaction-nodes: number of accessed node per\n"
 "                                             transaction\n"
+"                          memory: total used memory per domain for nodes,\n"
+"                                  transactions, watches and requests, above\n"
+"                                  which Xenstore will stop talking to domain\n"
 "                          outstanding: number of outstanding requests\n"
+"  -q, --quota-soft <what>=<nb> set a soft quota <what> to the value <nb>,\n"
+"                          causing a warning to be issued via syslog() if the\n"
+"                          limit is violated, allowed quotas are:\n"
+"                          memory: see above\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2240,6 +2249,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "quota", 1, NULL, 'Q' },
+	{ "quota-soft", 1, NULL, 'q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2285,7 +2295,7 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
-static void set_quota(const char *arg)
+static void set_quota(const char *arg, bool soft)
 {
 	const char *eq = strchr(arg, '=');
 	int val;
@@ -2293,11 +2303,16 @@ static void set_quota(const char *arg)
 	if (!eq)
 		barf("quotas must be specified via <what>=<nb>\n");
 	val = get_optval_int(eq + 1);
-	if (what_matches(arg, "outstanding"))
+	if (what_matches(arg, "outstanding") && !soft)
 		quota_req_outstanding = val;
-	else if (what_matches(arg, "transaction-nodes"))
+	else if (what_matches(arg, "transaction-nodes") && !soft)
 		quota_trans_nodes = val;
-	else
+	else if (what_matches(arg, "memory")) {
+		if (soft)
+			quota_memory_per_domain_soft = val;
+		else
+			quota_memory_per_domain_hard = val;
+	} else
 		barf("unknown quota \"%s\"\n", arg);
 }
 
@@ -2312,7 +2327,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:T:RVW:w:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:q:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2358,7 +2373,10 @@ int main(int argc, char *argv[])
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
 		case 'Q':
-			set_quota(optarg);
+			set_quota(optarg, false);
+			break;
+		case 'q':
+			set_quota(optarg, true);
 			break;
 		case 'w':
 			set_timeout(optarg);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index d95e4262a91e..4e53072e637c 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -226,6 +226,8 @@ extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
+extern int quota_memory_per_domain_soft;
+extern int quota_memory_per_domain_hard;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index e411c79d58a7..112fb457581e 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -84,6 +84,13 @@ struct domain
 	/* number of entry from this domain in the store */
 	int nbentry;
 
+	/* Amount of memory allocated for this domain. */
+	int memory;
+	bool soft_quota_reported;
+	bool hard_quota_reported;
+	time_t mem_last_msg;
+#define MEM_WARN_MINTIME_SEC 10
+
 	/* number of watch for this domain */
 	int nbwatch;
 
@@ -297,6 +304,9 @@ bool domain_can_read(struct connection *conn)
 			return false;
 		if (conn->domain->nboutstanding >= quota_req_outstanding)
 			return false;
+		if (conn->domain->memory >= quota_memory_per_domain_hard &&
+		    quota_memory_per_domain_hard)
+			return false;
 	}
 
 	if (conn->is_ignored)
@@ -944,6 +954,89 @@ int domain_entry(struct connection *conn)
 		: 0;
 }
 
+static bool domain_chk_quota(struct domain *domain, int mem)
+{
+	time_t now;
+
+	if (!domain || !domid_is_unprivileged(domain->domid) ||
+	    (domain->conn && domain->conn->is_ignored))
+		return false;
+
+	now = time(NULL);
+
+	if (mem >= quota_memory_per_domain_hard &&
+	    quota_memory_per_domain_hard) {
+		if (domain->hard_quota_reported)
+			return true;
+		syslog(LOG_ERR, "Domain %u exceeds hard memory quota, Xenstore interface to domain stalled\n",
+		       domain->domid);
+		domain->mem_last_msg = now;
+		domain->hard_quota_reported = true;
+		return true;
+	}
+
+	if (now - domain->mem_last_msg >= MEM_WARN_MINTIME_SEC) {
+		if (domain->hard_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->hard_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below hard memory quota again\n",
+			       domain->domid);
+		}
+		if (mem >= quota_memory_per_domain_soft &&
+		    quota_memory_per_domain_soft &&
+		    !domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = true;
+			syslog(LOG_WARNING, "Domain %u exceeds soft memory quota\n",
+			       domain->domid);
+		}
+		if (mem < quota_memory_per_domain_soft &&
+		    domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below soft memory quota again\n",
+			       domain->domid);
+		}
+
+	}
+
+	return false;
+}
+
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check)
+{
+	struct domain *domain;
+
+	domain = find_domain_struct(domid);
+	if (domain) {
+		/*
+		 * domain_chk_quota() will print warning and also store whether
+		 * the soft/hard quota has been hit. So check no_quota_check
+		 * *after*.
+		 */
+		if (domain_chk_quota(domain, domain->memory + mem) &&
+		    !no_quota_check)
+			return ENOMEM;
+		domain->memory += mem;
+	} else {
+		/*
+		 * The domain the memory is to be accounted for should always
+		 * exist, as accounting is done either for a domain related to
+		 * the current connection, or for the domain owning a node
+		 * (which is always existing, as the owner of the node is
+		 * tested to exist and replaced by domid 0 if not).
+		 * So not finding the related domain MUST be an error in the
+		 * data base.
+		 */
+		errno = ENOENT;
+		corrupt(NULL, "Accounting called for non-existing domain %u\n",
+			domid);
+		return ENOENT;
+	}
+
+	return 0;
+}
+
 void domain_watch_inc(struct connection *conn)
 {
 	if (!conn || !conn->domain)
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4edf1dba9425..3a8c6bab48ba 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -64,6 +64,26 @@ int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check);
+
+/*
+ * domain_memory_add_chk(): to be used when memory quota should be checked.
+ * Not to be used when specifying a negative mem value, as lowering the used
+ * memory should always be allowed.
+ */
+static inline int domain_memory_add_chk(unsigned int domid, int mem)
+{
+	return domain_memory_add(domid, mem, false);
+}
+/*
+ * domain_memory_add_nochk(): to be used when memory quota should not be
+ * checked, e.g. when lowering memory usage, or in an error case for undoing
+ * a previous memory adjustment.
+ */
+static inline void domain_memory_add_nochk(unsigned int domid, int mem)
+{
+	domain_memory_add(domid, mem, true);
+}
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
From fbd9cae032b452f04d93823d48974d957d863beb Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add memory accounting for responses

Add the memory accounting for queued responses.

If adding a watch event for a guest would violate the hard memory
quota of that guest, the event is dropped. This ensures that it is
impossible to drive another guest past its memory quota by generating
excessive numbers of events for that guest. This is especially
important for protecting driver domains from that attack vector.
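
The drop-on-quota behaviour can be sketched as follows. This is a
simplified model, not the code of this patch: the sketch_* names are
hypothetical and the payload and header sizes are passed in as plain
integers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct sketch_domain {
	int memory;	/* bytes currently accounted to the domain */
	int hard_quota;	/* 0 disables the check */
};

/* Charge mem bytes unless that would reach the hard quota. */
int sketch_mem_add_chk(struct sketch_domain *d, int mem)
{
	if (d->hard_quota && d->memory + mem >= d->hard_quota)
		return ENOMEM;
	d->memory += mem;
	return 0;
}

/* Returns true if the event was queued, false if it was dropped. */
bool sketch_send_event(struct sketch_domain *d, int payload_len,
		       int hdr_len)
{
	if (sketch_mem_add_chk(d, payload_len + hdr_len))
		return false;	/* receiver over quota: drop the event */
	return true;
}

/* Freeing a queued response always lowers the accounted memory. */
void sketch_free_response(struct sketch_domain *d, int payload_len,
			  int hdr_len)
{
	d->memory -= payload_len + hdr_len;
}
```

The charge is made against the receiving domain, which is why a sender
cannot push the receiver past its quota.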

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 217096d91a9d..4f29439ad825 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -254,6 +254,8 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	domain_memory_add_nochk(conn->id, -out->hdr.msg.len - sizeof(out->hdr));
+
 	if (out->hdr.msg.type == XS_WATCH_EVENT) {
 		req = out->pend.req;
 		if (req) {
@@ -843,11 +845,14 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->timeout_msec = 0;
 	bdata->watch_event = false;
 
-	if (len <= DEFAULT_BUFFER_SIZE)
+	if (len <= DEFAULT_BUFFER_SIZE) {
 		bdata->buffer = bdata->default_buffer;
-	else {
+		/* Don't check quota, path might be used for returning error. */
+		domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
+	} else {
 		bdata->buffer = talloc_array(bdata, char, len);
-		if (!bdata->buffer) {
+		if (!bdata->buffer ||
+		    domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
 			send_error(conn, ENOMEM);
 			return;
 		}
@@ -912,6 +917,11 @@ void send_event(struct buffered_data *req, struct connection *conn,
 		}
 	}
 
+	if (domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
+		talloc_free(bdata);
+		return;
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
From fa2019ff2c0725c0d93f4be79a729515f70a1a40 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for watches

Add the memory accounting for registered watches.

When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct, the watches
must be removed explicitly by calling conn_delete_all_watches() from
destroy_conn().
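
The symmetry this relies on can be sketched as below: whatever is
charged when a watch is registered has to be subtracted again on every
removal path. The sketch_* names are hypothetical and the code is a
simplified model, not the code of this patch.

```c
#include <assert.h>
#include <string.h>

struct sketch_watch {
	const char *node;	/* watched path */
	const char *token;	/* token supplied by the client */
};

/* Bytes accounted for one watch: path length plus token length. */
int sketch_watch_cost(const struct sketch_watch *w)
{
	return (int)(strlen(w->node) + strlen(w->token));
}

/* Charge a newly registered watch to the accounted total. */
int sketch_watch_add(int accounted, const struct sketch_watch *w)
{
	return accounted + sketch_watch_cost(w);
}

/* Remove all watches of a connection, undoing every charge. */
int sketch_delete_all_watches(int accounted,
			      const struct sketch_watch *w,
			      unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		accounted -= sketch_watch_cost(&w[i]);

	return accounted;
}
```

If connection teardown freed the watches without going through such a
removal path, the accounted total would never return to its previous
value.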

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 4f29439ad825..eca04e734a83 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -404,6 +404,7 @@ static int destroy_conn(void *_conn)
 	}
 
 	conn_free_buffered_data(conn);
+	conn_delete_all_watches(conn);
 	list_for_each_entry(req, &conn->ref_list, list)
 		req->on_ref_list = false;
 
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index c50c0575f0f1..7118c30e8c32 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -224,7 +224,8 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 		return ENOMEM;
 	watch->node = talloc_strdup(watch, vec[0]);
 	watch->token = talloc_strdup(watch, vec[1]);
-	if (!watch->node || !watch->token) {
+	if (!watch->node || !watch->token ||
+	    domain_memory_add_chk(conn->id, strlen(vec[0]) + strlen(vec[1]))) {
 		talloc_free(watch);
 		return ENOMEM;
 	}
@@ -265,6 +266,8 @@ int do_unwatch(struct connection *conn, struct buffered_data *in)
 	list_for_each_entry(watch, &conn->watches, list) {
 		if (streq(watch->node, node) && streq(watch->token, vec[1])) {
 			list_del(&watch->list);
+			domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+							  strlen(watch->token));
 			talloc_free(watch);
 			domain_watch_dec(conn);
 			send_ack(conn, XS_UNWATCH);
@@ -280,6 +283,8 @@ void conn_delete_all_watches(struct connection *conn)
 
 	while ((watch = list_top(&conn->watches, struct watch, list))) {
 		list_del(&watch->list);
+		domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+						  strlen(watch->token));
 		talloc_free(watch);
 		domain_watch_dec(conn);
 	}
From 145c8375f382fe8535d633891cfdb47ec29a490a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for nodes

Add the memory accounting for Xenstore nodes. To keep this from
becoming too complicated, allow for some sloppiness when writing nodes.
Any hard quota violation will result in no further requests being
accepted.
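
The accounting step for a node write can be sketched as follows: the
old record is uncharged from its owner, the new record is charged to
the new owner, and the first step is undone if the second one fails.
The sketch_* names are hypothetical, and the model leaves out the
data base store itself and the special handling of transaction nodes.

```c
#include <assert.h>
#include <errno.h>

struct sketch_dom {
	int memory;	/* bytes currently accounted to the domain */
	int hard_quota;	/* 0 disables the check */
};

/*
 * Account a node write. old_size is 0 if the node did not exist yet.
 * Returns 0 on success, or ENOMEM with all counters left unchanged.
 */
int sketch_node_write(struct sketch_dom *old_owner, int old_size,
		      struct sketch_dom *new_owner, int new_size,
		      int key_size)
{
	/* Lowering the accounted memory is always allowed. */
	if (old_size)
		old_owner->memory -= old_size + key_size;

	if (new_owner->hard_quota &&
	    new_owner->memory + new_size + key_size >=
	    new_owner->hard_quota) {
		/* Error path: restore the old charge unchecked. */
		if (old_size)
			old_owner->memory += old_size + key_size;
		return ENOMEM;
	}

	new_owner->memory += new_size + key_size;
	return 0;
}
```

Old and new owner may be the same domain, in which case only the size
difference ends up being charged.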

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index eca04e734a83..2c0f8fd99bbd 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -496,6 +496,117 @@ static void initialize_fds(int sock, int *p_sock_pollfd_idx,
 	}
 }
 
+static void get_acc_data(TDB_DATA *key, struct node_account_data *acc)
+{
+	TDB_DATA old_data;
+	struct xs_tdb_record_hdr *hdr;
+
+	if (acc->memory < 0) {
+		old_data = tdb_fetch(tdb_ctx, *key);
+		/* No check for error, as the node might not exist. */
+		if (old_data.dptr == NULL) {
+			acc->memory = 0;
+		} else {
+			hdr = (void *)old_data.dptr;
+			acc->memory = old_data.dsize;
+			acc->domid = hdr->perms[0].id;
+		}
+		talloc_free(old_data.dptr);
+	}
+}
+
+/*
+ * Per-transaction nodes need to be accounted for the transaction owner.
+ * Those nodes are stored in the data base with the transaction generation
+ * count prepended (e.g. 123/local/domain/...). So testing for the node's
+ * key not to start with "/" is sufficient.
+ */
+static unsigned int get_acc_domid(struct connection *conn, TDB_DATA *key,
+				  unsigned int domid)
+{
+	return (!conn || key->dptr[0] == '/') ? domid : conn->id;
+}
+
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check)
+{
+	struct xs_tdb_record_hdr *hdr = (void *)data->dptr;
+	struct node_account_data old_acc = {};
+	unsigned int old_domid, new_domid;
+	int ret;
+
+	if (!acc)
+		old_acc.memory = -1;
+	else
+		old_acc = *acc;
+
+	get_acc_data(key, &old_acc);
+	old_domid = get_acc_domid(conn, key, old_acc.domid);
+	new_domid = get_acc_domid(conn, key, hdr->perms[0].id);
+
+	/*
+	 * Don't check for ENOENT, as we want to be able to switch orphaned
+	 * nodes to new owners.
+	 */
+	if (old_acc.memory)
+		domain_memory_add_nochk(old_domid,
+					-old_acc.memory - key->dsize);
+	ret = domain_memory_add(new_domid, data->dsize + key->dsize,
+				no_quota_check);
+	if (ret) {
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		return ret;
+	}
+
+	/* TDB should set errno, but doesn't even set ecode AFAICT. */
+	if (tdb_store(tdb_ctx, *key, *data, TDB_REPLACE) != 0) {
+		domain_memory_add_nochk(new_domid, -data->dsize - key->dsize);
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc) {
+		/* Don't use new_domid, as it might be a transaction node. */
+		acc->domid = hdr->perms[0].id;
+		acc->memory = data->dsize;
+	}
+
+	return 0;
+}
+
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc)
+{
+	struct node_account_data tmp_acc;
+	unsigned int domid;
+
+	if (!acc) {
+		acc = &tmp_acc;
+		acc->memory = -1;
+	}
+
+	get_acc_data(key, acc);
+
+	if (tdb_delete(tdb_ctx, *key)) {
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc->memory) {
+		domid = get_acc_domid(conn, key, acc->domid);
+		domain_memory_add_nochk(domid, -acc->memory - key->dsize);
+	}
+
+	return 0;
+}
+
 /*
  * If it fails, returns NULL and sets errno.
  * Temporary memory allocations will be done with ctx.
@@ -549,9 +660,15 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
+	node->acc.domid = node->perms.p[0].id;
+	node->acc.memory = data.dsize;
 	if (domain_adjust_node_perms(conn, node))
 		goto error;
 
+	/* If owner is gone reset currently accounted memory size. */
+	if (node->acc.domid != node->perms.p[0].id)
+		node->acc.memory = 0;
+
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
@@ -615,12 +732,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	p += node->datalen;
 	memcpy(p, node->children, node->childlen);
 
-	/* TDB should set errno, but doesn't even set ecode AFAICT. */
-	if (tdb_store(tdb_ctx, *key, data, TDB_REPLACE) != 0) {
-		corrupt(conn, "Write of %s failed", key->dptr);
-		errno = EIO;
-		return errno;
-	}
+	if (do_tdb_write(conn, key, &data, &node->acc, no_quota_check))
+		return EIO;
+
 	return 0;
 }
 
@@ -1119,7 +1233,7 @@ static void delete_node_single(struct connection *conn, struct node *node)
 	if (access_node(conn, node, NODE_ACCESS_DELETE, &key))
 		return;
 
-	if (tdb_delete(tdb_ctx, key) != 0) {
+	if (do_tdb_delete(conn, &key, &node->acc) != 0) {
 		corrupt(conn, "Could not delete '%s'", node->name);
 		return;
 	}
@@ -1182,6 +1296,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	/* No children, no data */
 	node->children = node->data = NULL;
 	node->childlen = node->datalen = 0;
+	node->acc.memory = 0;
 	node->parent = parent;
 	return node;
 
@@ -1190,17 +1305,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static void destroy_node_rm(struct node *node)
+static void destroy_node_rm(struct connection *conn, struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
-	tdb_delete(tdb_ctx, node->key);
+	do_tdb_delete(conn, &node->key, &node->acc);
 }
 
 static int destroy_node(struct connection *conn, struct node *node)
 {
-	destroy_node_rm(node);
+	destroy_node_rm(conn, node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1252,7 +1367,7 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 		/* Account for new node */
 		if (i->parent) {
 			if (domain_entry_inc(conn, i)) {
-				destroy_node_rm(i);
+				destroy_node_rm(conn, i);
 				return NULL;
 			}
 		}
@@ -2075,7 +2190,7 @@ static int clean_store_(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA val,
 	if (!hashtable_search(reachable, name)) {
 		log("clean_store: '%s' is orphaned!", name);
 		if (recovery) {
-			tdb_delete(tdb, key);
+			do_tdb_delete(NULL, &key, NULL);
 		}
 	}
 
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 4e53072e637c..521bc80384e5 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -141,6 +141,11 @@ struct node_perms {
 	struct xs_permissions *p;
 };
 
+struct node_account_data {
+	unsigned int domid;
+	int memory;		/* -1 if unknown */
+};
+
 struct node {
 	const char *name;
 	/* Key used to update TDB */
@@ -163,6 +168,9 @@ struct node {
 	/* Children, each nul-terminated. */
 	unsigned int childlen;
 	char *children;
+
+	/* Allocation information for node currently in store. */
+	struct node_account_data acc;
 };
 
 /* Return the only argument in the input. */
@@ -258,6 +266,11 @@ extern xengnttab_handle **xgt_handle;
 
 int remember_string(struct hashtable *hash, const char *str);
 
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check);
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc);
+
 void conn_free_buffered_data(struct connection *conn);
 
 #endif /* _XENSTORED_CORE_H */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 778b7e439cb3..c1beb40a3d51 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -153,6 +153,9 @@ struct transaction
 	/* List of all transactions active on this connection. */
 	struct list_head list;
 
+	/* Connection this transaction is associated with. */
+	struct connection *conn;
+
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
@@ -292,6 +295,8 @@ int access_node(struct connection *conn, struct node *node,
 
 		introduce = true;
 		i->ta_node = false;
+		/* acc.memory < 0 means "unknown, get size from TDB". */
+		node->acc.memory = -1;
 
 		/*
 		 * Additional transaction-specific node for read type. We only
@@ -416,11 +421,11 @@ static int finalize_transaction(struct connection *conn,
 					goto err;
 				hdr = (void *)data.dptr;
 				hdr->generation = ++generation;
-				ret = tdb_store(tdb_ctx, key, data,
-						TDB_REPLACE);
+				ret = do_tdb_write(conn, &key, &data, NULL,
+						   true);
 				talloc_free(data.dptr);
 			} else {
-				ret = tdb_delete(tdb_ctx, key);
+				ret = do_tdb_delete(conn, &key, NULL);
 			}
 			if (ret)
 				goto err;
@@ -431,7 +436,7 @@ static int finalize_transaction(struct connection *conn,
 			}
 		}
 
-		if (i->ta_node && tdb_delete(tdb_ctx, ta_key))
+		if (i->ta_node && do_tdb_delete(conn, &ta_key, NULL))
 			goto err;
 		list_del(&i->list);
 		talloc_free(i);
@@ -459,7 +464,7 @@ static int destroy_transaction(void *_transaction)
 							       i->node);
 			if (trans_name) {
 				set_tdb_key(trans_name, &key);
-				tdb_delete(tdb_ctx, key);
+				do_tdb_delete(trans->conn, &key, NULL);
 			}
 		}
 		list_del(&i->list);
@@ -503,6 +508,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 
 	INIT_LIST_HEAD(&trans->accessed);
 	INIT_LIST_HEAD(&trans->changed_domains);
+	trans->conn = conn;
 	trans->fail = false;
 	trans->generation = ++generation;
 
From 7c8688f12b8a64c77cb50848897fc9ac66cf4260 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add exports for quota variables

Some quota variables are not exported via header files.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 521bc80384e5..5abf06c21c98 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -231,6 +231,11 @@ extern TDB_CONTEXT *tdb_ctx;
 extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
+extern int quota_nb_watch_per_domain;
+extern int quota_max_transaction;
+extern int quota_max_entry_size;
+extern int quota_nb_perms_per_node;
+extern int quota_max_path_len;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index c1beb40a3d51..6e29118c800d 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -175,7 +175,6 @@ struct transaction
 	bool fail;
 };
 
-extern int quota_max_transaction;
 uint64_t generation;
 
 static void set_tdb_key(const char *name, TDB_DATA *key)
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 7118c30e8c32..19d0fb01b1c4 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -31,8 +31,6 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 
-extern int quota_nb_watch_per_domain;
-
 struct watch
 {
 	/* Watches on this connection */
From 3cc236d3f62c015a95d3d6dc578ad7fcfa19607a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add control command for setting and showing quota

Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain

Note that if the new quota is lower than the existing one, Xenstored may
continue to handle requests from a domain exceeding the new limit
(depending on which quota has been exceeded), and the amount of
resources in use will not change. However, the domain will not be able
to create more resources (associated with the quota) until its usage is
back below the limit.
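
The effect of lowering a quota below current usage can be sketched as
follows. This is an illustration under assumptions: the table, the default
values, quota_set_by_name() and try_add_node() are hypothetical names for
this sketch, and ENOSPC is an assumed error code.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical quota variables and defaults for this sketch. */
static int quota_nodes = 1000;
static int quota_watches = 128;

struct quota_entry {
	const char *name;
	int *value;
};

/* Name-to-variable table, terminated by a NULL entry. */
static const struct quota_entry quota_table[] = {
	{ "nodes", &quota_nodes },
	{ "watches", &quota_watches },
	{ NULL, NULL }
};

/* Set a quota by name; values below 1 and unknown names are rejected. */
static int quota_set_by_name(const char *name, int val)
{
	unsigned int i;

	if (val < 1)
		return EINVAL;

	for (i = 0; quota_table[i].value; i++) {
		if (!strcmp(name, quota_table[i].name)) {
			*quota_table[i].value = val;
			return 0;
		}
	}

	return EINVAL;
}

/*
 * Existing usage is never reclaimed when a quota shrinks. Only attempts
 * to add more are refused while the domain is at or above the limit.
 */
static int try_add_node(int *in_use)
{
	if (*in_use >= quota_nodes)
		return ENOSPC;	/* assumed error code for this sketch */
	(*in_use)++;
	return 0;
}
```

Setting "nodes" to 3 while a domain holds 5 nodes leaves those 5 in place;
the next creation attempt fails until the domain has removed enough nodes
to drop below 3.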

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 32969eb3fecd..0dbac442d79d 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -346,6 +346,17 @@ CONTROL			<command>|[<parameters>|]
 	print|<string>
 		print <string> to syslog (xenstore runs as daemon) or
 		to console (xenstore runs as stubdom)
+	quota|[set <name> <val>|<domid>]
+		without parameters: print the current quota settings
+		with "set <name> <val>": set the quota <name> to new value
+		<val> (The admin should make sure all the domain usage is
+		below the quota. If it is not, then Xenstored may continue to
+		handle requests from the domain as long as the resource
+		violating the new quota setting isn't increased further)
+		with "<domid>": print quota related accounting data for
+		the domain <domid>
+	quota-soft|[set <name> <val>]
+		like the "quota" command, but for soft-quota.
 	help			<supported-commands>
 		return list of supported commands for CONTROL
 
diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 715e0d2a9e03..454fe9d5ab18 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -19,6 +19,7 @@
 #include <errno.h>
 #include <stdarg.h>
 #include <stdio.h>
+#include <stdlib.h>
 #include <string.h>
 
 #include "utils.h"
@@ -77,6 +78,114 @@ static int do_control_logfile(void *ctx, struct connection *conn,
 	return 0;
 }
 
+struct quota {
+	const char *name;
+	int *quota;
+	const char *descr;
+};
+
+static const struct quota hard_quotas[] = {
+	{ "nodes", &quota_nb_entry_per_domain, "Nodes per domain" },
+	{ "watches", &quota_nb_watch_per_domain, "Watches per domain" },
+	{ "transactions", &quota_max_transaction, "Transactions per domain" },
+	{ "outstanding", &quota_req_outstanding,
+		"Outstanding requests per domain" },
+	{ "transaction-nodes", &quota_trans_nodes,
+		"Max. number of accessed nodes per transaction" },
+	{ "memory", &quota_memory_per_domain_hard,
+		"Total Xenstore memory per domain (error level)" },
+	{ "node-size", &quota_max_entry_size, "Max. size of a node" },
+	{ "permissions", &quota_nb_perms_per_node,
+		"Max. number of permissions per node" },
+	{ NULL, NULL, NULL }
+};
+
+static const struct quota soft_quotas[] = {
+	{ "memory", &quota_memory_per_domain_soft,
+		"Total Xenstore memory per domain (warning level)" },
+	{ NULL, NULL, NULL }
+};
+
+static int quota_show_current(const void *ctx, struct connection *conn,
+			      const struct quota *quotas)
+{
+	char *resp;
+	unsigned int i;
+
+	resp = talloc_strdup(ctx, "Quota settings:\n");
+	if (!resp)
+		return ENOMEM;
+
+	for (i = 0; quotas[i].quota; i++) {
+		resp = talloc_asprintf_append(resp, "%-17s: %8d %s\n",
+					      quotas[i].name, *quotas[i].quota,
+					      quotas[i].descr);
+		if (!resp)
+			return ENOMEM;
+	}
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
+static int quota_set(const void *ctx, struct connection *conn,
+		     char **vec, int num, const struct quota *quotas)
+{
+	unsigned int i;
+	int val;
+
+	if (num != 2)
+		return EINVAL;
+
+	val = atoi(vec[1]);
+	if (val < 1)
+		return EINVAL;
+
+	for (i = 0; quotas[i].quota; i++) {
+		if (!strcmp(vec[0], quotas[i].name)) {
+			*quotas[i].quota = val;
+			send_ack(conn, XS_CONTROL);
+			return 0;
+		}
+	}
+
+	return EINVAL;
+}
+
+static int quota_get(const void *ctx, struct connection *conn,
+		     char **vec, int num)
+{
+	if (num != 1)
+		return EINVAL;
+
+	return domain_get_quota(ctx, conn, atoi(vec[0]));
+}
+
+static int do_control_quota(void *ctx, struct connection *conn,
+			    char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, hard_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, hard_quotas);
+
+	return quota_get(ctx, conn, vec, num);
+}
+
+static int do_control_quota_s(void *ctx, struct connection *conn,
+			      char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, soft_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, soft_quotas);
+
+	return EINVAL;
+}
+
 static int do_control_memreport(void *ctx, struct connection *conn,
 				char **vec, int num)
 {
@@ -136,6 +245,8 @@ static struct cmd_s cmds[] = {
 	{ "logfile", do_control_logfile, "<file>" },
 	{ "memreport", do_control_memreport, "[<file>]" },
 	{ "print", do_control_print, "<string>" },
+	{ "quota", do_control_quota, "[set <name> <val>|<domid>]" },
+	{ "quota-soft", do_control_quota_s, "[set <name> <val>]" },
 	{ "help", do_control_help, "" },
 };
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 112fb457581e..f458314c8e01 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -31,6 +31,7 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 #include "xenstored_watch.h"
+#include "xenstored_control.h"
 
 #include <xenevtchn.h>
 #include <xenctrl.h>
@@ -352,6 +353,38 @@ static struct domain *find_domain_struct(unsigned int domid)
 	return NULL;
 }
 
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid)
+{
+	struct domain *d = find_domain_struct(domid);
+	char *resp;
+	int ta;
+
+	if (!d)
+		return ENOENT;
+
+	ta = d->conn ? d->conn->transaction_started : 0;
+	resp = talloc_asprintf(ctx, "Domain %u:\n", domid);
+	if (!resp)
+		return ENOMEM;
+
+#define ent(t, e) \
+	resp = talloc_asprintf_append(resp, "%-16s: %8d\n", #t, e); \
+	if (!resp) return ENOMEM
+
+	ent(nodes, d->nbentry);
+	ent(watches, d->nbwatch);
+	ent(transactions, ta);
+	ent(outstanding, d->nboutstanding);
+	ent(memory, d->memory);
+
+#undef ent
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
 static struct domain *alloc_domain(void *context, unsigned int domid)
 {
 	struct domain *domain;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 3a8c6bab48ba..e013a9991ca8 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -90,6 +90,8 @@ int domain_watch(struct connection *conn);
 void domain_outstanding_inc(struct connection *conn);
 void domain_outstanding_dec(struct connection *conn);
 void domain_outstanding_domid_dec(unsigned int domid);
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
From aa65e572b55ee515f01d7e08044c2b0f0a15a293 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:01 +0100
Subject: tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently have 2 different sets of defaults in the upstream Xen git
tree:
* defined in the source code, only used if there is no config file
* defined in oxenstored.conf.in in upstream Xen

An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.

Resync the defaults from oxenstored.conf.in into the source code.

This is part of XSA-326 / CVE-2022-42316.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index f574397a4c0b..96c125a969da 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -22,9 +22,9 @@ let xs_daemon_socket_ro = Paths.xen_run_stored ^ "/socket_ro"
 
 let default_config_dir = Paths.xen_config_dir
 
-let maxwatch = ref (50)
-let maxtransaction = ref (20)
-let maxrequests = ref (-1)   (* maximum requests per transaction *)
+let maxwatch = ref (100)
+let maxtransaction = ref (10)
+let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index abcac912805a..6e3d6401ae89 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -20,8 +20,8 @@ exception Transaction_opened
 
 let warn fmt = Logging.warn "quota" fmt
 let activate = ref true
-let maxent = ref (10000)
-let maxsize = ref (4096)
+let maxent = ref (1000)
+let maxsize = ref (2048)
 
 type t = {
 	maxent: int;               (* max entities per domU *)
From 0594e132f16f780d7a7c2a2068e7c278f079bb69 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Thu, 28 Jul 2022 17:08:15 +0100
Subject: tools/ocaml/xenstored: Check for maxrequests before performing
 operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.

If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded

It is better to check whether there is room to store the operation before
performing it, and raise EQUOTA there.  Then the transaction record won't
grow.
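
The check-before-perform ordering can be sketched in C as follows. This
illustrates the idea only and is not a translation of the OCaml code:
struct txn, max_requests and the OP_* codes are hypothetical names.

```c
#include <stdbool.h>

enum { OP_OK = 0, OP_EQUOTA = 1 };

/* Stand-in for the maxrequests setting; negative means "no limit". */
static int max_requests = 4;

struct txn {
	int nr_ops;		/* operations recorded for a possible replay */
	int tree_changes;	/* stand-in for the transaction's private tree */
	bool quota_reached;	/* sticky once the limit has been hit */
};

/* Return false if this transaction may not take another operation. */
static bool txn_check_quota(struct txn *t, bool is_dom0)
{
	if (max_requests < 0 || is_dom0)
		return true;

	if (t->quota_reached || t->nr_ops >= max_requests) {
		t->quota_reached = true;
		return false;
	}

	return true;
}

/* The quota is checked first, so a refused operation changes nothing. */
static int txn_do_op(struct txn *t, bool is_dom0)
{
	if (!txn_check_quota(t, is_dom0))
		return OP_EQUOTA;

	t->tree_changes++;	/* perform the operation */
	t->nr_ops++;		/* record it for replay */

	return OP_OK;
}
```

Because the flag is sticky, every later request in the same transaction is
refused without the list of recorded operations being consulted again.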

This is part of XSA-326 / CVE-2022-42317.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 3ab09c6ce926..3279b19b1bff 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -253,6 +253,7 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
 	let reply_error e =
 		Packet.Error e in
 	try
+		Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 		fct con t doms cons req.Packet.data
 	with
 	| Define.Invalid_path          -> reply_error "EINVAL"
@@ -545,9 +546,10 @@ let process_packet ~store ~cons ~doms ~con ~req =
 		in
 
 		let response = try
+			Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 			if tid <> Transaction.none then
 				(* Remember the request and response for this operation in case we need to replay the transaction *)
-				Transaction.add_operation ~perm:(Connection.get_perm con) t req response;
+				Transaction.add_operation t req response;
 			response
 		with Quota.Limit_reached ->
 			Packet.Error "EQUOTA"
diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml
index 17b1bdf2eaf9..294143e2335b 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -85,6 +85,7 @@ type t = {
 	oldroot: Store.Node.t;
 	mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
 	mutable operations: (Packet.request * Packet.response) list;
+	mutable quota_reached: bool;
 	mutable read_lowpath: Store.Path.t option;
 	mutable write_lowpath: Store.Path.t option;
 }
@@ -127,6 +128,7 @@ let make ?(internal=false) id store =
 		oldroot = Store.get_root store;
 		paths = [];
 		operations = [];
+		quota_reached = false;
 		read_lowpath = None;
 		write_lowpath = None;
 	} in
@@ -143,13 +145,19 @@ let get_root t = Store.get_root t.store
 
 let is_read_only t = t.paths = []
 let add_wop t ty path = t.paths <- (ty, path) :: t.paths
-let add_operation ~perm t request response =
+let get_operations t = List.rev t.operations
+
+let check_quota_exn ~perm t =
 	if !Define.maxrequests >= 0
 		&& not (Perms.Connection.is_dom0 perm)
-		&& List.length t.operations >= !Define.maxrequests
-		then raise Quota.Limit_reached;
+		&& (t.quota_reached || List.length t.operations >= !Define.maxrequests)
+		then begin
+			t.quota_reached <- true;
+			raise Quota.Limit_reached;
+		end
+
+let add_operation t request response =
 	t.operations <- (request, response) :: t.operations
-let get_operations t = List.rev t.operations
 let set_read_lowpath t path = t.read_lowpath <- get_lowest path t.read_lowpath
 let set_write_lowpath t path = t.write_lowpath <- get_lowest path t.write_lowpath
 
From 0051d59ff2d75f9295a72c78373a1ed16a25e7b1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:07 +0100
Subject: tools/ocaml: GC parameter tuning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

By default the OCaml garbage collector returns memory to the OS only
after unused memory reaches 5x live memory.  Tweak this to 120% instead,
which matches the major GC speed.
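
As a rough worked example of what the percentage means, the helper below
computes how much free heap may accumulate before the overhead limit is
exceeded. It is plain arithmetic written in C for illustration;
free_before_compaction() is a hypothetical name and not part of any OCaml
or Xen API.

```c
#include <stdint.h>

/*
 * Free heap (in bytes) that can build up before the 'max overhead'
 * percentage is exceeded, for a given amount of live data.
 */
static uint64_t free_before_compaction(uint64_t live_bytes,
				       unsigned int max_overhead_pct)
{
	return live_bytes * max_overhead_pct / 100;
}
```

With 100 MiB of live data, a setting of 500 allows about 500 MiB of free
heap to be held back from other processes, while 120 caps that at about
120 MiB.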

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 96c125a969da..1a5d2f34a678 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -26,6 +26,7 @@ let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
+let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
 let conflict_rate_limit_is_aggregate = ref true
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 369b5036f43d..0b6343dfc789 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -103,6 +103,7 @@ let parse_config filename =
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
 		("quota-path-max", Config.Set_int Define.path_max);
+		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
 		("persistent", Config.Set_bool Disk.enable);
 		("xenstored-log-file", Config.String Logging.set_xenstored_log_destination);
@@ -229,6 +230,67 @@ let to_file store cons file =
 	        (fun () -> close_out channel)
 end
 
+(*
+	By default OCaml's GC only returns memory to the OS when it exceeds a
+	configurable 'max overhead' setting.
+	The default is 500%, that is 5/6th of the OCaml heap needs to be free
+	and only 1/6th live for a compaction to be triggerred that would
+	release memory back to the OS.
+	If the limit is not hit then the OCaml process can reuse that memory
+	for its own purposes, but other processes won't be able to use it.
+
+	There is also a 'space overhead' setting that controls how much work
+	each major GC slice does, and by default aims at having no more than
+	80% or 120% (depending on version) garbage values compared to live
+	values.
+	This doesn't have as much relevance to memory returned to the OS as
+	long as space_overhead <= max_overhead, because compaction is only
+	triggerred at the end of major GC cycles.
+
+	The defaults are too large once the program starts using ~100MiB of
+	memory, at which point ~500MiB would be unavailable to other processes
+	(which would be fine if this was the main process in this VM, but it is
+	not).
+
+	Max overhead can also be set to 0, however this is for testing purposes
+	only (setting it lower than 'space overhead' wouldn't help because the
+	major GC wouldn't run fast enough, and compaction does have a
+	performance cost: we can only compact contiguous regions, so memory has
+	to be moved around).
+
+	Max overhead controls how often the heap is compacted, which is useful
+	if there are burst of activity followed by long periods of idle state,
+	or if a domain quits, etc. Compaction returns memory to the OS.
+
+	wasted = live * space_overhead / 100
+
+	For globally overriding the GC settings one can use OCAMLRUNPARAM,
+	however we provide a config file override to be consistent with other
+	oxenstored settings.
+
+	One might want to dynamically adjust the overhead setting based on used
+	memory, i.e. to use a fixed upper bound in bytes, not percentage. However
+	measurements show that such adjustments increase GC overhead massively,
+	while still not guaranteeing that memory is returned any more quickly
+	than with a percentage based setting.
+
+	The allocation policy could also be tweaked, e.g. first fit would reduce
+	fragmentation and thus memory usage, but the documentation warns that it
+	can be sensibly slower, and indeed one of our own testcases can trigger
+	such a corner case where it is multiple times slower, so it is best to keep
+	the default allocation policy (next-fit/best-fit depending on version).
+
+	There are other tweaks that can be attempted in the future, e.g. setting
+	'ulimit -v' to 75% of RAM, however getting the kernel to actually return
+	NULL from allocations is difficult even with that setting, and without a
+	NULL the emergency GC won't be triggerred.
+	Perhaps cgroup limits could help, but for now tweak the safest only.
+*)
+
+let tweak_gc () =
+	Gc.set { (Gc.get ()) with Gc.max_overhead = !Define.gc_max_overhead }
+
+
 let _ =
 	let cf = do_argv in
 	let pidfile =
@@ -238,6 +300,8 @@ let _ =
 			default_pidfile
 		in
 
+	tweak_gc ();
+
 	(try
 		Unixext.mkdir_rec (Filename.dirname pidfile) 0o755
 	with _ ->
From e444ff5f14723c774e9993d4ea4edc4671cdcc34 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Fri, 29 Jul 2022 18:53:29 +0100
Subject: tools/ocaml/libs/xb: hide type of Xb.t
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)

Reported-by: Julien Grall <jgrall@amazon.com>
diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 104d319d7747..8404ddd8a682 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -196,6 +196,9 @@ let peek_output con = Queue.peek con.pkt_out
 let input_len con = Queue.length con.pkt_in
 let has_in_packet con = Queue.length con.pkt_in > 0
 let get_in_packet con = Queue.pop con.pkt_in
+let has_partial_input con = match con.partial_in with
+	| HaveHdr _ -> true
+	| NoHdr (n, _) -> n < Partial.header_size ()
 let has_more_input con =
 	match con.backend with
 	| Fd _         -> false
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 3a00da6cddc1..794e35bb343e 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,13 +66,7 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
-type t = {
-  backend : backend;
-  pkt_in : Packet.t Queue.t;
-  pkt_out : Packet.t Queue.t;
-  mutable partial_in : partial_buf;
-  mutable partial_out : string;
-}
+type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
 val queue : t -> Packet.t -> unit
@@ -97,6 +91,7 @@ val has_output : t -> bool
 val peek_output : t -> Packet.t
 val input_len : t -> int
 val has_in_packet : t -> bool
+val has_partial_input : t -> bool
 val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index daf8d804f7ef..70c43485528c 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -125,9 +125,7 @@ let get_perm con =
 let set_target con target_domid =
 	con.perm <- Perms.Connection.set_target (get_perm con) ~perms:[Perms.READ; Perms.WRITE] target_domid
 
-let is_backend_mmap con = match con.xb.Xenbus.Xb.backend with
-	| Xenbus.Xb.Xenmmap _ -> true
-	| _ -> false
+let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
 let send_reply con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
From 4a0629f747237cd6fbb230f947ef273db77cc81e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:02 +0100
Subject: tools/ocaml: Change Xb.input to return Packet.t option
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The queue here would only ever hold at most one element.  This will simplify
follow-up patches.
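
The interface change can be pictured in C as an input function that hands
back the completed packet directly, or NULL while the packet is still
partial, instead of pushing it onto a queue and returning a flag. This is
a sketch under assumptions: struct conn, PKT_LEN and conn_input() are
hypothetical, and a fixed packet size replaces the real header/payload
parsing.

```c
#include <stddef.h>
#include <string.h>

#define PKT_LEN 8	/* hypothetical fixed packet size */

struct conn {
	unsigned char partial[PKT_LEN];	/* bytes of the packet in progress */
	size_t have;			/* number of valid bytes in partial */
	unsigned char done[PKT_LEN];	/* most recently completed packet */
};

/*
 * Consume up to n bytes of input. Returns the completed packet, or NULL
 * if more input is needed. At most one packet can complete per call, so
 * no input queue is required.
 */
static const unsigned char *conn_input(struct conn *c,
				       const unsigned char *data, size_t n)
{
	size_t room = PKT_LEN - c->have;

	if (n > room)
		n = room;
	memcpy(c->partial + c->have, data, n);
	c->have += n;

	if (c->have < PKT_LEN)
		return NULL;

	memcpy(c->done, c->partial, PKT_LEN);
	c->have = 0;

	return c->done;
}
```

The caller handles the returned packet immediately, which removes the need
for separate "has packet" and "get packet" calls.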

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 8404ddd8a682..165fd4a1edf4 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -45,7 +45,6 @@ type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 type t =
 {
 	backend: backend;
-	pkt_in: Packet.t Queue.t;
 	pkt_out: Packet.t Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
@@ -62,7 +61,6 @@ let reconnect t = match t.backend with
 		Xs_ring.close backend.mmap;
 		backend.eventchn_notify ();
 		(* Clear our old connection state *)
-		Queue.clear t.pkt_in;
 		Queue.clear t.pkt_out;
 		t.partial_in <- init_partial_in ();
 		t.partial_out <- ""
@@ -124,7 +122,6 @@ let output con =
 
 (* NB: can throw Reconnect *)
 let input con =
-	let newpacket = ref false in
 	let to_read =
 		match con.partial_in with
 		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
@@ -143,21 +140,19 @@ let input con =
 		if Partial.to_complete partial_pkt = 0 then (
 			let pkt = Packet.of_partialpkt partial_pkt in
 			con.partial_in <- init_partial_in ();
-			Queue.push pkt con.pkt_in;
-			newpacket := true
-		)
+			Some pkt
+		) else None
 	| NoHdr (i, buf)      ->
 		(* we complete the partial header *)
 		if sz > 0 then
 			Bytes.blit b 0 buf (Partial.header_size () - i) sz;
 		con.partial_in <- if sz = i then
-			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf)
-	);
-	!newpacket
+			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf);
+		None
+	)
 
 let newcon backend = {
 	backend = backend;
-	pkt_in = Queue.create ();
 	pkt_out = Queue.create ();
 	partial_in = init_partial_in ();
 	partial_out = "";
@@ -193,9 +188,6 @@ let has_output con = has_new_output con || has_old_output con
 
 let peek_output con = Queue.peek con.pkt_out
 
-let input_len con = Queue.length con.pkt_in
-let has_in_packet con = Queue.length con.pkt_in > 0
-let get_in_packet con = Queue.pop con.pkt_in
 let has_partial_input con = match con.partial_in with
 	| HaveHdr _ -> true
 	| NoHdr (n, _) -> n < Partial.header_size ()
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 794e35bb343e..91c682162cea 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -77,7 +77,7 @@ val write_fd : backend_fd -> 'a -> string -> int -> int
 val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
-val input : t -> bool
+val input : t -> Packet.t option
 val newcon : backend -> t
 val open_fd : Unix.file_descr -> t
 val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
@@ -89,10 +89,7 @@ val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
 val peek_output : t -> Packet.t
-val input_len : t -> int
-val has_in_packet : t -> bool
 val has_partial_input : t -> bool
-val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index d982fb24dbb1..451f8b38dbcc 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -94,26 +94,18 @@ let pkt_send con =
 	done
 
 (* receive one packet - can sleep *)
-let pkt_recv con =
-	let workdone = ref false in
-	while not !workdone
-	do
-		workdone := Xb.input con.xb
-	done;
-	Xb.get_in_packet con.xb
+let rec pkt_recv con =
+	match Xb.input con.xb with
+	| Some packet -> packet
+	| None -> pkt_recv con
 
 let pkt_recv_timeout con timeout =
 	let fd = Xb.get_fd con.xb in
 	let r, _, _ = Unix.select [ fd ] [] [] timeout in
 	if r = [] then
 		true, None
-	else (
-		let workdone = Xb.input con.xb in
-		if workdone then
-			false, (Some (Xb.get_in_packet con.xb))
-		else
-			false, None
-	)
+	else
+		false, Xb.input con.xb
 
 let queue_watchevent con data =
 	let ls = split_string ~limit:2 '\000' data in
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 70c43485528c..ace2aa5b4f53 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -277,8 +277,6 @@ let get_transaction con tid =
 	Hashtbl.find con.transactions tid
 
 let do_input con = Xenbus.Xb.input con.xb
-let has_input con = Xenbus.Xb.has_in_packet con.xb
-let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
 let has_output con = Xenbus.Xb.has_output con.xb
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 0df3df401db6..a72810d06f43 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -569,16 +569,17 @@ let do_input store cons doms con =
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
 			info "%s reconnection complete" (Connection.get_domstr con);
-			false
+			None
 		| Failure exp ->
 			error "caught exception %s" exp;
 			error "got a bad client %s" (sprintf "%-8s" (Connection.get_domstr con));
 			Connection.mark_as_bad con;
-			false
+			None
 	in
 
-	if newpacket then (
-		let packet = Connection.pop_in con in
+	match newpacket with
+	| None -> ()
+	| Some packet ->
 		let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
 		let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; Packet.data=data} in
 
@@ -588,8 +589,7 @@ let do_input store cons doms con =
 		         (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
 		process_packet ~store ~cons ~doms ~con ~req;
 		write_access_log ~ty ~tid ~con:(Connection.get_domstr con) ~data;
-		Connection.incr_ops con;
-	)
+		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
 	if Connection.has_output con then (
From faed0ee3ccc2940b01b0c50d3adc84eaf20618ed Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:03 +0100
Subject: tools/ocaml/xb: Add BoundedQueue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ensure we cannot store more than [capacity] elements in a [Queue].  Replacing
every Queue with this module will then ensure at compile time that all queues
are correctly bounds-checked.

Each element in the queue has a class with its own limits.  This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
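
As a rough illustration of the per-class limits described above, here is a
minimal, self-contained sketch.  It is not the module added by this patch:
the type and function names (packet_class, bounded, demo) and the tagged
string elements are assumptions made for the example only.

```ocaml
(* Simplified sketch: a queue with an overall capacity and a per-class
   limit.  All names here are illustrative, not the patch's API. *)
type packet_class = Reply | Event

type 'a bounded = {
  q : 'a Queue.t;
  capacity : int;                  (* overall limit on queued elements *)
  classify : 'a -> packet_class;   (* maps an element to its class *)
  limit : packet_class -> int;     (* per-class limit *)
  counts : (packet_class, int) Hashtbl.t;
}

let create ~capacity ~classify ~limit =
  { q = Queue.create (); capacity; classify; limit; counts = Hashtbl.create 3 }

let count t c = try Hashtbl.find t.counts c with Not_found -> 0

(* Returns [Some ()] when the element was accepted, [None] otherwise. *)
let push e t =
  let c = t.classify e in
  let n = count t c in
  if Queue.length t.q < t.capacity && n < t.limit c then begin
    Queue.push e t.q;
    Hashtbl.replace t.counts c (n + 1);
    Some ()
  end else None

(* Hypothetical element type: a class tag paired with a payload. *)
let classify (c, _payload) = c

let demo () =
  let limit = function Reply -> 1 | Event -> 2 in
  let t = create ~capacity:3 ~classify ~limit in
  let r1 = push (Event, "e1") t in
  let r2 = push (Event, "e2") t in
  let r3 = push (Event, "e3") t in  (* refused: the Event class is full *)
  let r4 = push (Reply, "ok") t in  (* accepted: Reply has its own limit *)
  (r1, r2, r3, r4)
```

In this sketch a flood of Event elements fills only the Event allowance, so
a Reply can still be queued afterwards.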

No functional change.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 165fd4a1edf4..4197a3888a68 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -17,6 +17,98 @@
 module Op = struct include Op end
 module Packet = struct include Packet end
 
+module BoundedQueue : sig
+	type ('a, 'b) t
+
+	(** [create ~capacity ~classify ~limit] creates a queue with maximum [capacity] elements.
+	    This is burst capacity, each element is further classified according to [classify],
+	    and each class can have its own [limit].
+	    [capacity] is enforced as an overall limit.
+	    The [limit] can be dynamic, and can be smaller than the number of elements already queued of that class,
+	    in which case those elements are considered to use "burst capacity".
+	  *)
+	val create: capacity:int -> classify:('a -> 'b) -> limit:('b -> int) -> ('a, 'b) t
+
+	(** [clear q] discards all elements from [q] *)
+	val clear: ('a, 'b) t -> unit
+
+	(** [can_push q cls] when [length q < capacity] and class [cls] is below its [limit]. *)
+	val can_push: ('a, 'b) t -> 'b -> bool
+
+	(** [push e q] adds [e] at the end of queue [q] if [can_push q], or returns [None]. *)
+	val push: 'a -> ('a, 'b) t -> unit option
+
+	(** [pop q] removes and returns first element in [q], or raises [Queue.Empty]. *)
+	val pop: ('a, 'b) t -> 'a
+
+	(** [peek q] returns the first element in [q], or raises [Queue.Empty].  *)
+	val peek : ('a, 'b) t -> 'a
+
+	(** [length q] returns the current number of elements in [q] *)
+	val length: ('a, 'b) t -> int
+
+	(** [debug string_of_class q] prints queue usage statistics in an unspecified internal format. *)
+	val debug: ('b -> string) -> (_, 'b) t -> string
+end = struct
+	type ('a, 'b) t =
+		{ q: 'a Queue.t
+		; capacity: int
+		; classify: 'a -> 'b
+		; limit: 'b -> int
+		; class_count: ('b, int) Hashtbl.t
+		}
+
+	let create ~capacity ~classify ~limit =
+		{ capacity; q = Queue.create (); classify; limit; class_count = Hashtbl.create 3 }
+
+	let get_count t classification = try Hashtbl.find t.class_count classification with Not_found -> 0
+
+	let can_push_internal t classification class_count =
+		Queue.length t.q < t.capacity && class_count < t.limit classification
+
+	let ok = Some ()
+
+	let push e t =
+		let classification = t.classify e in
+		let class_count = get_count t classification in
+		if can_push_internal t classification class_count then begin
+			Queue.push e t.q;
+			Hashtbl.replace t.class_count classification (class_count + 1);
+			ok
+		end
+		else
+			None
+
+	let can_push t classification =
+		can_push_internal t classification @@ get_count t classification
+
+	let clear t =
+		Queue.clear t.q;
+		Hashtbl.reset t.class_count
+
+	let pop t =
+		let e = Queue.pop t.q in
+		let classification = t.classify e in
+		let () = match get_count t classification - 1 with
+		| 0 -> Hashtbl.remove t.class_count classification (* reduces memusage *)
+		| n -> Hashtbl.replace t.class_count classification n
+		in
+		e
+
+	let peek t = Queue.peek t.q
+	let length t = Queue.length t.q
+
+	let debug string_of_class t =
+		let b = Buffer.create 128 in
+		Printf.bprintf b "BoundedQueue capacity: %d, used: {" t.capacity;
+		Hashtbl.iter (fun packet_class count ->
+			Printf.bprintf b "	%s: %d" (string_of_class packet_class) count
+		) t.class_count;
+		Printf.bprintf b "}";
+		Buffer.contents b
+end
+
+
 exception End_of_file
 exception Eagain
 exception Noent
From 374554c92a829472ec426fd1562d028fd2e33268 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:04 +0100
Subject: tools/ocaml: Limit maximum in-flight requests / outstanding replies
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a limit on the number of outstanding reply packets in the xenbus
queue.  This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.

To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it.  Even though Dom0
is trusted and exempt from quotas a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.

Therefore the xenbus queue limit must apply to all connections: Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.
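
The descriptor selection described above can be sketched as follows.  This
is a simplified model, not the code in connections.ml: the conn record, its
integer fd field and the select_sets function are assumptions made for the
example.

```ocaml
(* Hypothetical, simplified connection state; the real type in
   connection.ml is richer and uses Unix.file_descr. *)
type conn = { fd : int; can_input : bool; has_output : bool }

(* Build the (read set, write set) to hand to select.  A connection that
   cannot accept input (e.g. its reply queue is full) is left out of the
   read set, so select does not keep waking up for input that cannot be
   processed yet. *)
let select_sets conns =
  List.fold_left
    (fun (ins, outs) c ->
      let ins = if c.can_input then c.fd :: ins else ins in
      let outs = if c.has_output then c.fd :: outs else outs in
      (ins, outs))
    ([], []) conns
```

A connection with pending output but no room for replies then appears only
in the write set until its queue drains.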

This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring.  However if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stop reading its own ring then watch events
would've queued up without limit.

The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events.  In fact, dropping watch events is known to break
some pieces of normal functionality.  This leaves little choice to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.

Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
  deliver

Items are queued up on one end and are sent as far along the pipe as possible:

  source domain -> watch -> xenbus of target -> xenstore ring/socket of target

If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up.  For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.

Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
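
The chained pipes and the back-pressure they apply can be sketched as below.
This is a minimal model under stated assumptions, not the BoundedPipe module
from this patch: the ring function stands in for the target's ring, and the
capacities and integer items are chosen only to make the behaviour visible.

```ocaml
(* A bounded sender either accepts an item ([Some ()]) or refuses it
   ([None]). *)
type 'a sender = 'a -> unit option

type 'a pipe = { q : 'a Queue.t; capacity : int; destination : 'a sender }

let create ~capacity ~destination =
  { q = Queue.create (); capacity; destination }

(* Forward buffered items, oldest first, until the destination refuses. *)
let rec flush p =
  if not (Queue.is_empty p.q) then
    match p.destination (Queue.peek p.q) with
    | None -> ()
    | Some () -> ignore (Queue.pop p.q); flush p

(* Flush first to preserve ordering, then buffer the item if there is
   room; a full buffer is reported to the caller as [None]. *)
let push p item =
  flush p;
  if Queue.length p.q < p.capacity then begin
    Queue.push item p.q;
    flush p;
    Some ()
  end else None

let demo () =
  let delivered = ref [] in
  let room = ref 1 in
  (* Hypothetical final sink: accepts items only while [room] is positive. *)
  let ring x =
    if !room > 0 then begin
      decr room; delivered := x :: !delivered; Some ()
    end else None in
  let watch = create ~capacity:1 ~destination:ring in
  let source = create ~capacity:2 ~destination:(push watch) in
  let results =
    List.rev
      (List.fold_left (fun acc x -> push source x :: acc) [] [1; 2; 3; 4; 5]) in
  room := 10;  (* the target starts reading its ring again *)
  flush watch;
  flush source;
  (results, List.rev !delivered, Queue.length source.q)
```

In this model the fifth push is refused because every stage is full, and
once the sink has room again the buffered items arrive in their original
order.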

We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue.  This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.

When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).

There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.

Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
  source will not be allowed to queue up more watch events.
* quota-maxoutstanding which is the number of responses not read from the ring:
  once exceeded, no more inputs are processed until all outstanding replies
  are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
  on destination are queued up here, a single command can trigger multiple
  watches (e.g. due to recursion).

The overflow queue currently doesn't have an upper bound: it is difficult to
accurately calculate one as it depends on whether you are Dom0 and how many
watches each path has registered and how many watch events you can trigger
with a single command (e.g. a commit).  However these events were already
using memory, this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.

Note that Dom0 is not excluded from these checks.  It is especially important
that Dom0 is not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.

This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.

The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.

Similarly a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well as long as it follows the PV
protocol.  (Although note that protection against malicious backends is only a
best effort at the moment.)

This is part of XSA-326 / CVE-2022-42318.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 4197a3888a68..b292ed7a874d 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -134,14 +134,44 @@ type backend = Fd of backend_fd | Xenmmap of backend_mmap
 
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 
+(*
+	separate capacity reservation for replies and watch events:
+	this allows a domain to keep working even when under a constant flood of
+	watch events
+*)
+type capacity = { maxoutstanding: int; maxwatchevents: int }
+
+module Queue = BoundedQueue
+
+type packet_class =
+	| CommandReply
+	| Watchevent
+
+let string_of_packet_class = function
+	| CommandReply -> "command_reply"
+	| Watchevent -> "watch_event"
+
 type t =
 {
 	backend: backend;
-	pkt_out: Packet.t Queue.t;
+	pkt_out: (Packet.t, packet_class) Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
+	capacity: capacity
 }
 
+let to_read con =
+	match con.partial_in with
+		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
+		| NoHdr   (i, _)    -> i
+
+let debug t =
+	Printf.sprintf "XenBus state: partial_in: %d needed, partial_out: %d bytes, pkt_out: %d packets, %s"
+		(to_read t)
+		(String.length t.partial_out)
+		(Queue.length t.pkt_out)
+		(BoundedQueue.debug string_of_packet_class t.pkt_out)
+
 let init_partial_in () = NoHdr
 	(Partial.header_size (), Bytes.make (Partial.header_size()) '\000')
 
@@ -199,7 +229,8 @@ let output con =
 	let s = if String.length con.partial_out > 0 then
 			con.partial_out
 		else if Queue.length con.pkt_out > 0 then
-			Packet.to_string (Queue.pop con.pkt_out)
+			let pkt = Queue.pop con.pkt_out in
+			Packet.to_string pkt
 		else
 			"" in
 	(* send data from s, and save the unsent data to partial_out *)
@@ -212,12 +243,15 @@ let output con =
 	(* after sending one packet, partial is empty *)
 	con.partial_out = ""
 
+(* we can only process an input packet if we're guaranteed to have room
+   to store the response packet *)
+let can_input con = Queue.can_push con.pkt_out CommandReply
+
 (* NB: can throw Reconnect *)
 let input con =
-	let to_read =
-		match con.partial_in with
-		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
-		| NoHdr   (i, _)    -> i in
+	if not (can_input con) then None
+	else
+	let to_read = to_read con in
 
 	(* try to get more data from input stream *)
 	let b = Bytes.make to_read '\000' in
@@ -243,11 +277,22 @@ let input con =
 		None
 	)
 
-let newcon backend = {
+let classify t =
+	match t.Packet.ty with
+	| Op.Watchevent -> Watchevent
+	| _ -> CommandReply
+
+let newcon ~capacity backend =
+	let limit = function
+		| CommandReply -> capacity.maxoutstanding
+		| Watchevent -> capacity.maxwatchevents
+	in
+	{
 	backend = backend;
-	pkt_out = Queue.create ();
+	pkt_out = Queue.create ~capacity:(capacity.maxoutstanding + capacity.maxwatchevents) ~classify ~limit;
 	partial_in = init_partial_in ();
 	partial_out = "";
+	capacity = capacity;
 	}
 
 let open_fd fd = newcon (Fd { fd = fd; })
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 91c682162cea..71b2754ca788 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,10 +66,11 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
+type capacity = { maxoutstanding: int; maxwatchevents: int }
 type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
-val queue : t -> Packet.t -> unit
+val queue : t -> Packet.t -> unit option
 val read_fd : backend_fd -> 'a -> bytes -> int -> int
 val read_mmap : backend_mmap -> 'a -> bytes -> int -> int
 val read : t -> bytes -> int -> int
@@ -78,13 +79,14 @@ val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
 val input : t -> Packet.t option
-val newcon : backend -> t
-val open_fd : Unix.file_descr -> t
-val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
+val newcon : capacity:capacity -> backend -> t
+val open_fd : Unix.file_descr -> capacity:capacity -> t
+val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> capacity:capacity -> t
 val close : t -> unit
 val is_fd : t -> bool
 val is_mmap : t -> bool
 val output_len : t -> int
+val can_input: t -> bool
 val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
@@ -93,3 +95,4 @@ val has_partial_input : t -> bool
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
+val debug: t -> string
diff --git a/tools/ocaml/libs/xs/queueop.ml b/tools/ocaml/libs/xs/queueop.ml
index 9ff5bbd529ce..4e532cdaeacb 100644
--- a/tools/ocaml/libs/xs/queueop.ml
+++ b/tools/ocaml/libs/xs/queueop.ml
@@ -16,9 +16,10 @@
 open Xenbus
 
 let data_concat ls = (String.concat "\000" ls) ^ "\000"
+let queue con pkt = let r = Xb.queue con pkt in assert (r <> None)
 let queue_path ty (tid: int) (path: string) con =
 	let data = data_concat [ path; ] in
-	Xb.queue con (Xb.Packet.create tid 0 ty data)
+	queue con (Xb.Packet.create tid 0 ty data)
 
 (* operations *)
 let directory tid path con = queue_path Xb.Op.Directory tid path con
@@ -27,48 +28,48 @@ let read tid path con = queue_path Xb.Op.Read tid path con
 let getperms tid path con = queue_path Xb.Op.Getperms tid path con
 
 let debug commands con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
 
 let watch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
 
 let unwatch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
 
 let transaction_start con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
 
 let transaction_end tid commit con =
 	let data = data_concat [ (if commit then "T" else "F"); ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
 
 let introduce domid mfn port con =
 	let data = data_concat [ Printf.sprintf "%u" domid;
 	                         Printf.sprintf "%nu" mfn;
 	                         string_of_int port; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
 
 let release domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
 
 let resume domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
 
 let getdomainpath domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
 
 let write tid path value con =
 	let data = path ^ "\000" ^ value (* no NULL at the end *) in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
 
 let mkdir tid path con = queue_path Xb.Op.Mkdir tid path con
 let rm tid path con = queue_path Xb.Op.Rm tid path con
 
 let setperms tid path perms con =
 	let data = data_concat [ path; perms ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index 451f8b38dbcc..cbd17280600c 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -36,8 +36,10 @@ type con = {
 let close con =
 	Xb.close con.xb
 
+let capacity = { Xb.maxoutstanding = 1; maxwatchevents = 0; }
+
 let open_fd fd = {
-	xb = Xb.open_fd fd;
+	xb = Xb.open_fd ~capacity fd;
 	watchevents = Queue.create ();
 }
 
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index ace2aa5b4f53..9aad451a2dbd 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -20,12 +20,84 @@ open Stdext
 
 let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *)
 
+type 'a bounded_sender = 'a -> unit option
+(** a bounded sender accepts an ['a] item and returns:
+    None - if there is no room to accept the item
+    Some () -  if it has successfully accepted/sent the item
+ *)
+
+module BoundedPipe : sig
+	type 'a t
+
+	(** [create ~capacity ~destination] creates a bounded pipe with a
+	    local buffer holding at most [capacity] items.  Once the buffer is
+	    full it will not accept further items.  items from the pipe are
+	    flushed into [destination] as long as it accepts items.  The
+	    destination could be another pipe.
+	 *)
+	val create: capacity:int -> destination:'a bounded_sender -> 'a t
+
+	(** [is_empty t] returns whether the local buffer of [t] is empty. *)
+	val is_empty : _ t -> bool
+
+	(** [length t] the number of items in the internal buffer *)
+	val length: _ t -> int
+
+	(** [flush_pipe t] sends as many items from the local buffer as possible,
+			which could be none. *)
+	val flush_pipe: _ t -> unit
+
+	(** [push t item] tries to [flush_pipe] and then push [item]
+	    into the pipe if its [capacity] allows.
+	    Returns [None] if there is no more room
+	 *)
+	val push : 'a t -> 'a bounded_sender
+end = struct
+	(* items are enqueued in [q], and then flushed to [destination] *)
+	type 'a t =
+		{ q: 'a Queue.t
+		; destination: 'a bounded_sender
+		; capacity: int
+		}
+
+	let create ~capacity ~destination =
+		{ q = Queue.create (); capacity; destination }
+
+	let rec flush_pipe t =
+		if not Queue.(is_empty t.q) then
+			let item = Queue.peek t.q in
+			match t.destination item with
+			| None -> () (* no room *)
+			| Some () ->
+				(* successfully sent item to next stage *)
+				let _ = Queue.pop t.q in
+				(* continue trying to send more items *)
+				flush_pipe t
+
+	let push t item =
+		(* first try to flush as many items from this pipe as possible to make room,
+		   it is important to do this first to preserve the order of the items
+		 *)
+		flush_pipe t;
+		if Queue.length t.q < t.capacity then begin
+			(* enqueue, instead of sending directly.
+			   this ensures that [destination] sees the items in the same order as we receive them
+			 *)
+			Queue.push item t.q;
+			Some (flush_pipe t)
+		end else None
+
+	let is_empty t = Queue.is_empty t.q
+	let length t = Queue.length t.q
+end
+
 type watch = {
 	con: t;
 	token: string;
 	path: string;
 	base: string;
 	is_relative: bool;
+	pending_watchevents: Xenbus.Xb.Packet.t BoundedPipe.t;
 }
 
 and t = {
@@ -38,8 +110,36 @@ and t = {
 	anonid: int;
 	mutable stat_nb_ops: int;
 	mutable perm: Perms.Connection.t;
+	pending_source_watchevents: (watch * Xenbus.Xb.Packet.t) BoundedPipe.t
 }
 
+module Watch = struct
+	module T = struct
+		type t = watch
+
+		let compare w1 w2 =
+			(* cannot compare watches from different connections *)
+			assert (w1.con == w2.con);
+			match String.compare w1.token w2.token with
+			| 0 -> String.compare w1.path w2.path
+			| n -> n
+	end
+	module Set = Set.Make(T)
+
+	let flush_events t =
+		BoundedPipe.flush_pipe t.pending_watchevents;
+		not (BoundedPipe.is_empty t.pending_watchevents)
+
+	let pending_watchevents t =
+		BoundedPipe.length t.pending_watchevents
+end
+
+let source_flush_watchevents t =
+	BoundedPipe.flush_pipe t.pending_source_watchevents
+
+let source_pending_watchevents t =
+	BoundedPipe.length t.pending_source_watchevents
+
 let mark_as_bad con =
 	match con.dom with
 	|None -> ()
@@ -67,7 +167,8 @@ let watch_create ~con ~path ~token = {
 	token = token;
 	path = path;
 	base = get_path con;
-	is_relative = path.[0] <> '/' && path.[0] <> '@'
+	is_relative = path.[0] <> '/' && path.[0] <> '@';
+	pending_watchevents = BoundedPipe.create ~capacity:!Define.maxwatchevents ~destination:(Xenbus.Xb.queue con.xb)
 }
 
 let get_con w = w.con
@@ -93,6 +194,9 @@ let make_perm dom =
 	Perms.Connection.create ~perms:[Perms.READ; Perms.WRITE] domid
 
 let create xbcon dom =
+	let destination (watch, pkt) =
+		BoundedPipe.push watch.pending_watchevents pkt
+	in
 	let id =
 		match dom with
 		| None -> let old = !anon_id_next in incr anon_id_next; old
@@ -109,6 +213,16 @@ let create xbcon dom =
 	anonid = id;
 	stat_nb_ops = 0;
 	perm = make_perm dom;
+
+	(* the actual capacity will be lower, this is used as an overflow
+	   buffer: anything that doesn't fit elsewhere gets put here, only
+	   limited by the amount of watches that you can generate with a
+	   single xenstore command (which is finite, although possibly very
+	   large in theory for Dom0).  Once the pipe here has any contents the
+	   domain is blocked from sending more commands until it is empty
+	   again though.
+	 *)
+	pending_source_watchevents = BoundedPipe.create ~capacity:Sys.max_array_length ~destination
 	}
 	in
 	Logging.new_connection ~tid:Transaction.none ~con:(get_domstr con);
@@ -127,11 +241,17 @@ let set_target con target_domid =
 
 let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
-let send_reply con tid rid ty data =
+let packet_of con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000")
+		Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000"
 	else
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid ty data)
+		Xenbus.Xb.Packet.create tid rid ty data
+
+let send_reply con tid rid ty data =
+	let result = Xenbus.Xb.queue con.xb (packet_of con tid rid ty data) in
+	(* should never happen: we only process an input packet when there is room for an output packet *)
+	(* and the limit for replies is different from the limit for watch events *)
+	assert (result <> None)
 
 let send_error con tid rid err = send_reply con tid rid Xenbus.Xb.Op.Error (err ^ "\000")
 let send_ack con tid rid ty = send_reply con tid rid ty "OK\000"
@@ -181,11 +301,11 @@ let del_watch con path token =
 	apath, w
 
 let del_watches con =
-  Hashtbl.clear con.watches;
+  Hashtbl.reset con.watches;
   con.nb_watches <- 0
 
 let del_transactions con =
-  Hashtbl.clear con.transactions
+  Hashtbl.reset con.transactions
 
 let list_watches con =
 	let ll = Hashtbl.fold
@@ -208,21 +328,29 @@ let lookup_watch_perm path = function
 let lookup_watch_perms oldroot root path =
 	lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root)
 
-let fire_single_watch_unchecked watch =
+let fire_single_watch_unchecked source watch =
 	let data = Utils.join_by_null [watch.path; watch.token; ""] in
-	send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data
+	let pkt = packet_of watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data in
+
+	match BoundedPipe.push source.pending_source_watchevents (watch, pkt) with
+	| Some () -> () (* packet queued *)
+	| None ->
+			(* a well behaved Dom0 shouldn't be able to trigger this,
+			   if it happens it is likely a Dom0 bug causing runaway memory usage
+			 *)
+			failwith "watch event overflow, cannot happen"
 
-let fire_single_watch (oldroot, root) watch =
+let fire_single_watch source (oldroot, root) watch =
 	let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in
 	let perms = lookup_watch_perms oldroot root abspath in
 	if Perms.can_fire_watch watch.con.perm perms then
-		fire_single_watch_unchecked watch
+		fire_single_watch_unchecked source watch
 	else
 		let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in
 		let con = get_domstr watch.con in
 		Logging.watch_not_fired ~con perms (Store.Path.to_string abspath)
 
-let fire_watch roots watch path =
+let fire_watch source roots watch path =
 	let new_path =
 		if watch.is_relative && path.[0] = '/'
 		then begin
@@ -232,7 +360,7 @@ let fire_watch roots watch path =
 		end else
 			path
 	in
-	fire_single_watch roots { watch with path = new_path }
+	fire_single_watch source roots { watch with path = new_path }
 
 (* Search for a valid unused transaction id. *)
 let rec valid_transaction_id con proposed_id =
@@ -279,6 +407,7 @@ let get_transaction con tid =
 let do_input con = Xenbus.Xb.input con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
+let can_input con = Xenbus.Xb.can_input con.xb && BoundedPipe.is_empty con.pending_source_watchevents
 let has_output con = Xenbus.Xb.has_output con.xb
 let has_old_output con = Xenbus.Xb.has_old_output con.xb
 let has_new_output con = Xenbus.Xb.has_new_output con.xb
@@ -286,7 +415,7 @@ let peek_output con = Xenbus.Xb.peek_output con.xb
 let do_output con = Xenbus.Xb.output con.xb
 
 let has_more_work con =
-	has_more_input con || not (has_old_output con) && has_new_output con
+	(has_more_input con && can_input con) || not (has_old_output con) && has_new_output con
 
 let incr_ops con = con.stat_nb_ops <- con.stat_nb_ops + 1
 
diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml
index 7efdf3e5e05e..39190c19ec58 100644
--- a/tools/ocaml/xenstored/connections.ml
+++ b/tools/ocaml/xenstored/connections.ml
@@ -22,22 +22,30 @@ type t = {
 	domains: (int, Connection.t) Hashtbl.t;
 	ports: (Xeneventchn.t, Connection.t) Hashtbl.t;
 	mutable watches: (string, Connection.watch list) Trie.t;
+	mutable has_pending_watchevents: Connection.Watch.Set.t
 }
 
 let create () = {
 	anonymous = Hashtbl.create 37;
 	domains = Hashtbl.create 37;
 	ports = Hashtbl.create 37;
-	watches = Trie.create ()
+	watches = Trie.create ();
+	has_pending_watchevents = Connection.Watch.Set.empty;
 }
 
+let get_capacity () =
+	(* not multiplied by maxwatch on purpose: 2nd queue in watch itself! *)
+	{ Xenbus.Xb.maxoutstanding = !Define.maxoutstanding; maxwatchevents = !Define.maxwatchevents }
+
 let add_anonymous cons fd _can_write =
-	let xbcon = Xenbus.Xb.open_fd fd in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_fd fd ~capacity in
 	let con = Connection.create xbcon None in
 	Hashtbl.add cons.anonymous (Xenbus.Xb.get_fd xbcon) con
 
 let add_domain cons dom =
-	let xbcon = Xenbus.Xb.open_mmap (Domain.get_interface dom) (fun () -> Domain.notify dom) in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_mmap ~capacity (Domain.get_interface dom) (fun () -> Domain.notify dom) in
 	let con = Connection.create xbcon (Some dom) in
 	Hashtbl.add cons.domains (Domain.get_id dom) con;
 	match Domain.get_port dom with
@@ -48,7 +56,9 @@ let select ?(only_if = (fun _ -> true)) cons =
 	Hashtbl.fold (fun _ con (ins, outs) ->
 		if (only_if con) then (
 			let fd = Connection.get_fd con in
-			(fd :: ins,  if Connection.has_output con then fd :: outs else outs)
+			let in_fds = if Connection.can_input con then fd :: ins else ins in
+			let out_fds = if Connection.has_output con then fd :: outs else outs in
+			in_fds, out_fds
 		) else (ins, outs)
 	)
 	cons.anonymous ([], [])
@@ -67,10 +77,17 @@ let del_watches_of_con con watches =
 	| [] -> None
 	| ws -> Some ws
 
+let del_watches cons con =
+	Connection.del_watches con;
+	cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter @@ fun w ->
+		Connection.get_con w != con
+
 let del_anonymous cons con =
 	try
 		Hashtbl.remove cons.anonymous (Connection.get_fd con);
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del anonymous %s" (Printexc.to_string exn)
@@ -85,7 +102,7 @@ let del_domain cons id =
 		    | Some p -> Hashtbl.remove cons.ports p
 		    | None -> ())
 		 | None -> ());
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del domain %u: %s" id (Printexc.to_string exn)
@@ -136,31 +153,33 @@ let del_watch cons con path token =
 		cons.watches <- Trie.set cons.watches key watches;
  	watch
 
-let del_watches cons con =
-	Connection.del_watches con;
-	cons.watches <- Trie.map (del_watches_of_con con) cons.watches
-
 (* path is absolute *)
-let fire_watches ?oldroot root cons path recurse =
+let fire_watches ?oldroot source root cons path recurse =
 	let key = key_of_path path in
 	let path = Store.Path.to_string path in
 	let roots = oldroot, root in
 	let fire_watch _ = function
 		| None         -> ()
-		| Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches
+		| Some watches -> List.iter (fun w -> Connection.fire_watch source roots w path) watches
 	in
 	let fire_rec _x = function
 		| None         -> ()
 		| Some watches ->
-			List.iter (Connection.fire_single_watch roots) watches
+			List.iter (Connection.fire_single_watch source roots) watches
 	in
 	Trie.iter_path fire_watch cons.watches key;
 	if recurse then
 		Trie.iter fire_rec (Trie.sub cons.watches key)
 
+let send_watchevents cons con =
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter Connection.Watch.flush_events;
+	Connection.source_flush_watchevents con
+
 let fire_spec_watches root cons specpath =
+	let source = find_domain cons 0 in
 	iter cons (fun con ->
-		List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath))
+		List.iter (Connection.fire_single_watch source (None, root)) (Connection.get_watches con specpath))
 
 let set_target cons domain target_domain =
 	let con = find_domain cons domain in
@@ -196,3 +215,13 @@ let debug cons =
 	let anonymous = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.anonymous [] in
 	let domains = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.domains [] in
 	String.concat "" (domains @ anonymous)
+
+let debug_watchevents cons con =
+	(* == (physical equality)
+	   has to be used here because w.con.xb.backend might contain a [unit->unit] value causing regular
+	   comparison to fail due to having a 'functional value' which cannot be compared.
+	 *)
+	let s = cons.has_pending_watchevents |> Connection.Watch.Set.filter (fun w -> w.con == con) in
+	let pending = s |> Connection.Watch.Set.elements
+		|> List.map (fun w -> Connection.Watch.pending_watchevents w) |> List.fold_left (+) 0 in
+	Printf.sprintf "Watches with pending events: %d, pending events total: %d" (Connection.Watch.Set.cardinal s) pending
diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 1a5d2f34a678..9e5236709474 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -25,6 +25,13 @@ let default_config_dir = Paths.xen_config_dir
 let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
+let maxoutstanding = ref (1024) (* maximum outstanding requests, i.e. in-flight requests / domain *)
+let maxwatchevents = ref (1024)
+(*
+	maximum outstanding watch events per watch,
+	recommended >= maxoutstanding to avoid blocking backend transactions due to
+	malicious frontends
+ *)
 
 let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in
index 4ae48e42d47d..9d034e744b4b 100644
--- a/tools/ocaml/xenstored/oxenstored.conf.in
+++ b/tools/ocaml/xenstored/oxenstored.conf.in
@@ -62,6 +62,8 @@ quota-maxwatch = 100
 quota-transaction = 10
 quota-maxrequests = 1024
 quota-path-max = 1024
+quota-maxoutstanding = 1024
+quota-maxwatchevents = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index a72810d06f43..082c93fa9d3f 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -56,7 +56,7 @@ let split_one_path data con =
 	| path :: "" :: [] -> Store.Path.create path (Connection.get_path con)
 	| _                -> raise Invalid_Cmd_Args
 
-let process_watch t cons =
+let process_watch source t cons =
 	let oldroot = t.Transaction.oldroot in
 	let newroot = Store.get_root t.store in
 	let ops = Transaction.get_paths t |> List.rev in
@@ -66,8 +66,9 @@ let process_watch t cons =
 		| Xenbus.Xb.Op.Rm       -> true, None, oldroot
 		| Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot
 		| _              -> raise (Failure "huh ?") in
-		Connections.fire_watches ?oldroot root cons (snd op) recurse in
-	List.iter (fun op -> do_op_watch op cons) ops
+		Connections.fire_watches ?oldroot source root cons (snd op) recurse in
+	List.iter (fun op -> do_op_watch op cons) ops;
+	Connections.send_watchevents cons source
 
 let create_implicit_path t perm path =
 	let dirname = Store.Path.get_parent path in
@@ -99,6 +100,20 @@ let do_debug con t _domains cons data =
 	| "watches" :: _ ->
 		let watches = Connections.debug cons in
 		Some (watches ^ "\000")
+	| "xenbus" :: domid :: _ ->
+		let domid = int_of_string domid in
+		let con = Connections.find_domain cons domid in
+		let s = Printf.sprintf "xenbus: %s; overflow queue length: %d, can_input: %b, has_more_input: %b, has_old_output: %b, has_new_output: %b, has_more_work: %b. pending: %s"
+			(Xenbus.Xb.debug con.xb)
+			(Connection.source_pending_watchevents con)
+			(Connection.can_input con)
+			(Connection.has_more_input con)
+			(Connection.has_old_output con)
+			(Connection.has_new_output con)
+			(Connection.has_more_work con)
+			(Connections.debug_watchevents cons con)
+		in
+		Some s
 	| "mfn" :: domid :: _ ->
 		let domid = int_of_string domid in
 		let con = Connections.find_domain cons domid in
@@ -207,7 +222,7 @@ let reply_ack fct con t doms cons data =
 	fct con t doms cons data;
 	Packet.Ack (fun () ->
 		if Transaction.get_id t = Transaction.none then
-			process_watch t cons
+			process_watch con t cons
 	)
 
 let reply_data fct con t doms cons data =
@@ -366,7 +381,7 @@ let do_watch con t _domains cons data =
 	Packet.Ack (fun () ->
 		(* xenstore.txt says this watch is fired immediately,
 		   implying even if path doesn't exist or is unreadable *)
-		Connection.fire_single_watch_unchecked watch)
+		Connection.fire_single_watch_unchecked con watch)
 
 let do_unwatch con _t _domains cons data =
 	let (node, token) =
@@ -397,7 +412,7 @@ let do_transaction_end con t domains cons data =
 	if not success then
 		raise Transaction_again;
 	if commit then begin
-		process_watch t cons;
+		process_watch con t cons;
 		match t.Transaction.ty with
 		| Transaction.No ->
 			() (* no need to record anything *)
@@ -564,7 +579,8 @@ let process_packet ~store ~cons ~doms ~con ~req =
 let do_input store cons doms con =
 	let newpacket =
 		try
-			Connection.do_input con
+			if Connection.can_input con then Connection.do_input con
+			else None
 		with Xenbus.Xb.Reconnect ->
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
@@ -592,6 +608,7 @@ let do_input store cons doms con =
 		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
+	Connection.source_flush_watchevents con;
 	if Connection.has_output con then (
 		if Connection.has_new_output con then (
 			let packet = Connection.peek_output con in
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 0b6343dfc789..4f8fab2dd13a 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -102,6 +102,8 @@ let parse_config filename =
 		("quota-maxentity", Config.Set_int Quota.maxent);
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
+		("quota-maxoutstanding", Config.Set_int Define.maxoutstanding);
+		("quota-maxwatchevents", Config.Set_int Define.maxwatchevents);
 		("quota-path-max", Config.Set_int Define.path_max);
 		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
From 66ec303e65b16e28a53af097f5b7295458ea8b49 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 29 Sep 2022 13:07:35 +0200
Subject: SUPPORT.md: clarify support of untrusted driver domains with
 oxenstored

Add a support statement covering the scope of support for the different
Xenstore variants. In particular, oxenstored does not (yet) have security
support for untrusted driver domains, as those might drive oxenstored
out of memory by creating lots of watch events for the guests they are
servicing.

Add a statement regarding Live Update support of oxenstored.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index c45390a245c0..dd9702bfe42f 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -175,6 +175,17 @@ Support for running qemu-xen device model in a linux stubdomain.
 
     Status: Tech Preview
 
+## Xenstore
+
+### C xenstored daemon
+
+    Status: Supported
+
+### OCaml xenstored daemon
+
+    Status: Supported
+    Status, untrusted driver domains: Supported, not security supported
+
 ## Toolstack/3rd party
 
 ### libvirt driver for xl
From be29b3934fd5a4d2bce55731f24e74415ebc4b13 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: split up send_reply()

Today send_reply() is used for both normal request replies and watch
events.

Split it up into send_reply() and send_event(). This will be used to
add some event-specific handling.

add_event() can be merged into send_event(), removing the need for an
intermediate memory allocation.
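
As an illustration only (not part of the patch), the event payload that
send_event() assembles is the path and the token, each followed by a NUL
byte. A minimal standalone sketch, assuming plain malloc() instead of
talloc and a hypothetical SKETCH_PAYLOAD_MAX standing in for
XENSTORE_PAYLOAD_MAX; sketch_build_event() is a made-up name:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for XENSTORE_PAYLOAD_MAX from the Xen headers. */
#define SKETCH_PAYLOAD_MAX 4096

/*
 * Build "path\0token\0" in a single allocation.
 * Returns NULL if the event would be too long or allocation fails;
 * on success *len_out holds the payload length and the caller owns
 * the returned buffer.
 */
static char *sketch_build_event(const char *path, const char *token,
				size_t *len_out)
{
	size_t path_len = strlen(path) + 1;   /* includes the NUL */
	size_t token_len = strlen(token) + 1; /* includes the NUL */
	size_t len = path_len + token_len;
	char *buf;

	/* Over-long events are silently dropped, as in the patch. */
	if (len > SKETCH_PAYLOAD_MAX)
		return NULL;

	buf = malloc(len);
	if (!buf)
		return NULL;

	memcpy(buf, path, path_len);
	memcpy(buf + path_len, token, token_len);
	*len_out = len;
	return buf;
}
```

Writing both strings directly into the final buffer is what makes the
intermediate copy in the old add_event() unnecessary.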

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 8e91b554984d..e6776bae8f99 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -674,49 +674,32 @@ static void send_error(struct connection *conn, int error)
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata = conn->in;
+
+	assert(type != XS_WATCH_EVENT);
 
 	if ( len > XENSTORE_PAYLOAD_MAX ) {
 		send_error(conn, E2BIG);
 		return;
 	}
 
-	/* Replies reuse the request buffer, events need a new one. */
-	if (type != XS_WATCH_EVENT) {
-		bdata = conn->in;
-		/* Drop asynchronous responses, e.g. errors for watch events. */
-		if (!bdata)
-			return;
-		bdata->inhdr = true;
-		bdata->used = 0;
-		conn->in = NULL;
-	} else {
-		/* Message is a child of the connection for auto-cleanup. */
-		bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+	bdata->inhdr = true;
+	bdata->used = 0;
 
-		/*
-		 * Allocation failure here is unfortunate: we have no way to
-		 * tell anybody about it.
-		 */
-		if (!bdata)
-			return;
-	}
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
-	else
+	else {
 		bdata->buffer = talloc_array(bdata, char, len);
-	if (!bdata->buffer) {
-		if (type == XS_WATCH_EVENT) {
-			/* Same as above: no way to tell someone. */
-			talloc_free(bdata);
+		if (!bdata->buffer) {
+			send_error(conn, ENOMEM);
 			return;
 		}
-		/* re-establish request buffer for sending ENOMEM. */
-		conn->in = bdata;
-		send_error(conn, ENOMEM);
-		return;
 	}
 
+	conn->in = NULL;
+
 	/* Update relevant header fields and fill in the message body. */
 	bdata->hdr.msg.type = type;
 	bdata->hdr.msg.len = len;
@@ -724,8 +707,39 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+}
 
-	return;
+/*
+ * Send a watch event.
+ * As this is not directly related to the current command, errors can't be
+ * reported.
+ */
+void send_event(struct connection *conn, const char *path, const char *token)
+{
+	struct buffered_data *bdata;
+	unsigned int len;
+
+	len = strlen(path) + 1 + strlen(token) + 1;
+	/* Don't try to send over-long events. */
+	if (len > XENSTORE_PAYLOAD_MAX)
+		return;
+
+	bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+
+	bdata->buffer = talloc_array(bdata, char, len);
+	if (!bdata->buffer) {
+		talloc_free(bdata);
+		return;
+	}
+	strcpy(bdata->buffer, path);
+	strcpy(bdata->buffer + strlen(path) + 1, token);
+	bdata->hdr.msg.type = XS_WATCH_EVENT;
+	bdata->hdr.msg.len = len;
+
+	/* Queue for later transmission. */
+	list_add_tail(&bdata->list, &conn->out_list);
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 9369c4cbfd26..2b0f796d9bb1 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -150,6 +150,7 @@ unsigned int get_strings(struct buffered_data *data,
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
+void send_event(struct connection *conn, const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 9ff20690c000..6d8097376e47 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -72,37 +72,17 @@ static bool is_child(const char *child, const char *parent)
 	return child[len] == '/' || child[len] == '\0';
 }
 
-/*
- * Send a watch event.
- * Temporary memory allocations are done with ctx.
- */
-static void add_event(struct connection *conn,
-		      const void *ctx,
-		      struct watch *watch,
-		      const char *name)
+static const char *get_watch_path(const struct watch *watch, const char *name)
 {
-	/* Data to send (node\0token\0). */
-	unsigned int len;
-	char *data;
+	const char *path = name;
 
 	if (watch->relative_path) {
-		name += strlen(watch->relative_path);
-		if (*name == '/') /* Could be "" */
-			name++;
+		path += strlen(watch->relative_path);
+		if (*path == '/') /* Could be "" */
+			path++;
 	}
 
-	len = strlen(name) + 1 + strlen(watch->token) + 1;
-	/* Don't try to send over-long events. */
-	if (len > XENSTORE_PAYLOAD_MAX)
-		return;
-
-	data = talloc_array(ctx, char, len);
-	if (!data)
-		return;
-	strcpy(data, name);
-	strcpy(data + strlen(name) + 1, watch->token);
-	send_reply(conn, XS_WATCH_EVENT, data, len);
-	talloc_free(data);
+	return path;
 }
 
 /*
@@ -181,10 +161,14 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			}
 		}
 	}
@@ -252,7 +236,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	send_ack(conn, XS_WATCH);
 
 	/* We fire once up front: simplifies clients and restart. */
-	add_event(conn, in, watch, watch->node);
+	send_event(conn, get_watch_path(watch, watch->node), watch->token);
 
 	return 0;
 }
From b3df897f6bab1574d22af10e76c5318c2e835dae Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: add helpers to free struct buffered_data

Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.

This avoids duplicated code and will help later, when freeing a struct
buffered_data requires additional actions.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index e6776bae8f99..5d54779d409b 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -208,6 +208,21 @@ void reopen_log(void)
 	}
 }
 
+static void free_buffered_data(struct buffered_data *out,
+			       struct connection *conn)
+{
+	list_del(&out->list);
+	talloc_free(out);
+}
+
+void conn_free_buffered_data(struct connection *conn)
+{
+	struct buffered_data *out;
+
+	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
+		free_buffered_data(out, conn);
+}
+
 static bool write_messages(struct connection *conn)
 {
 	int ret;
@@ -251,8 +266,7 @@ static bool write_messages(struct connection *conn)
 
 	trace_io(conn, out, 1);
 
-	list_del(&out->list);
-	talloc_free(out);
+	free_buffered_data(out, conn);
 
 	return true;
 }
@@ -1391,18 +1405,12 @@ static struct {
  */
 static void ignore_connection(struct connection *conn)
 {
-	struct buffered_data *out, *tmp;
-
 	trace("CONN %p ignored\n", conn);
 
 	conn->is_ignored = true;
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 	conn->in = NULL;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 2b0f796d9bb1..83d49693fc19 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -226,6 +226,8 @@ extern xengnttab_handle **xgt_handle;
 
 int remember_string(struct hashtable *hash, const char *str);
 
+void conn_free_buffered_data(struct connection *conn);
+
 #endif /* _XENSTORED_CORE_H */
 
 /*
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index d5e1e3e9d42d..3bff322d024d 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -402,15 +402,10 @@ static struct domain *find_domain_by_domid(unsigned int domid)
 static void domain_conn_reset(struct domain *domain)
 {
 	struct connection *conn = domain->conn;
-	struct buffered_data *out;
 
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	while ((out = list_top(&conn->out_list, struct buffered_data, list))) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 
From 89c02664745cdd62aa556a7870cdf49576a901a1 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: reduce number of watch events

When removing a watched node outside of a transaction, two watch events
are produced instead of just a single one.

When finalizing a transaction, watch events can be generated for each
node being modified, even if such modifications would not have resulted
in a watch event outside a transaction.

This happens e.g.:

- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
  is implicitly creating a/b, resulting in watch events for a, a/b and
  a/b/c instead of a/b/c only)

Avoid these additional watch events in order to reduce the memory
needed inside Xenstore for queueing them.

This is achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can be set from
fire_watches() instead of only being implied.
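
The merge rule for the two flags can be sketched in isolation as follows.
This is an illustration modelled on queue_watches() in this patch; the
struct and function names here are hypothetical and reduced to the two
flags only:

```c
#include <stdbool.h>

/* Hypothetical, reduced version of the per-node event state. */
struct sketch_event_flags {
	bool fire_watch;  /* an event is due when the transaction commits */
	bool watch_exact; /* only watches on exactly this path match */
};

/*
 * Record that a modification wants a watch event.
 * A non-exact request overrides an earlier exact one, never the reverse.
 */
static void sketch_queue_event(struct sketch_event_flags *f, bool exact)
{
	if (!f->fire_watch) {
		/* First request for this node: take it as given. */
		f->fire_watch = true;
		f->watch_exact = exact;
	} else if (!exact) {
		/* Widen an existing exact request to a non-exact one. */
		f->watch_exact = false;
	}
}
```

However many times a node is touched inside a transaction, it carries at
most one pending event, and that event uses the widest match requested.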

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 5d54779d409b..53d003aebffb 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1182,7 +1182,7 @@ static void delete_child(struct connection *conn,
 }
 
 static int delete_node(struct connection *conn, const void *ctx,
-		       struct node *parent, struct node *node)
+		       struct node *parent, struct node *node, bool watch_exact)
 {
 	char *name;
 
@@ -1194,7 +1194,7 @@ static int delete_node(struct connection *conn, const void *ctx,
 				       node->children);
 		child = name ? read_node(conn, node, name) : NULL;
 		if (child) {
-			if (delete_node(conn, ctx, node, child))
+			if (delete_node(conn, ctx, node, child, true))
 				return errno;
 		} else {
 			trace("delete_node: Error deleting child '%s/%s'!\n",
@@ -1206,7 +1206,12 @@ static int delete_node(struct connection *conn, const void *ctx,
 		talloc_free(name);
 	}
 
-	fire_watches(conn, ctx, node->name, node, true, NULL);
+	/*
+	 * Fire the watches now, when we can still see the node permissions.
+	 * This fine as we are single threaded and the next possible read will
+	 * be handled only after the node has been really removed.
+	 */
+	fire_watches(conn, ctx, node->name, node, watch_exact, NULL);
 	delete_node_single(conn, node);
 	delete_child(conn, parent, basename(node->name));
 	talloc_free(node);
@@ -1232,13 +1237,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 		return (errno == ENOMEM) ? ENOMEM : EINVAL;
 	node->parent = parent;
 
-	/*
-	 * Fire the watches now, when we can still see the node permissions.
-	 * This fine as we are single threaded and the next possible read will
-	 * be handled only after the node has been really removed.
-	 */
-	fire_watches(conn, ctx, name, node, false, NULL);
-	return delete_node(conn, ctx, parent, node);
+	return delete_node(conn, ctx, parent, node, false);
 }
 
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 4ffa18311120..6fbdb29dcdd7 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -130,6 +130,10 @@ struct accessed_node
 
 	/* Transaction node in data base? */
 	bool ta_node;
+
+	/* Watch event flags. */
+	bool fire_watch;
+	bool watch_exact;
 };
 
 struct changed_domain
@@ -330,6 +334,29 @@ int access_node(struct connection *conn, struct node *node,
 }
 
 /*
+ * A watch event should be fired for a node modified inside a transaction.
+ * Set the corresponding information. A non-exact event is replacing an exact
+ * one, but not the other way round.
+ */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact)
+{
+	struct accessed_node *i;
+
+	i = find_accessed_node(conn->transaction, name);
+	if (!i) {
+		conn->transaction->fail = true;
+		return;
+	}
+
+	if (!i->fire_watch) {
+		i->fire_watch = true;
+		i->watch_exact = watch_exact;
+	} else if (!watch_exact) {
+		i->watch_exact = false;
+	}
+}
+
+/*
  * Finalize transaction:
  * Walk through accessed nodes and check generation against global data.
  * If all entries match, read the transaction entries and write them without
@@ -383,15 +410,15 @@ static int finalize_transaction(struct connection *conn,
 				ret = tdb_store(tdb_ctx, key, data,
 						TDB_REPLACE);
 				talloc_free(data.dptr);
-				if (ret)
-					goto err;
-				fire_watches(conn, trans, i->node, NULL, false,
-					     i->perms.p ? &i->perms : NULL);
 			} else {
-				fire_watches(conn, trans, i->node, NULL, false,
+				ret = tdb_delete(tdb_ctx, key);
+			}
+			if (ret)
+				goto err;
+			if (i->fire_watch) {
+				fire_watches(conn, trans, i->node, NULL,
+					     i->watch_exact,
 					     i->perms.p ? &i->perms : NULL);
-				if (tdb_delete(tdb_ctx, key))
-					goto err;
 			}
 		}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 14062730e3c9..0093cac807e3 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -42,6 +42,9 @@ void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 int access_node(struct connection *conn, struct node *node,
                 enum node_access_type type, TDB_DATA *key);
 
+/* Queue watches for a modified node. */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact);
+
 /* Prepend the transaction to name if appropriate. */
 int transaction_prepend(struct connection *conn, const char *name,
                         TDB_DATA *key);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 6d8097376e47..2f9367767e44 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -29,6 +29,7 @@
 #include "xenstore_lib.h"
 #include "utils.h"
 #include "xenstored_domain.h"
+#include "xenstored_transaction.h"
 
 extern int quota_nb_watch_per_domain;
 
@@ -143,9 +144,11 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 	struct connection *i;
 	struct watch *watch;
 
-	/* During transactions, don't fire watches. */
-	if (conn && conn->transaction)
+	/* During transactions, don't fire watches, but queue them. */
+	if (conn && conn->transaction) {
+		queue_watches(conn, name, exact);
 		return;
+	}
 
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
From 79a88b38cb8b8f8708935bc43b1e93826626abaf Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: let unread watch events time out

A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response to the
request or any resulting watch event hasn't been consumed yet.

To prevent a malicious guest from blocking other guests by not reading
watch events, add a timeout for watch events. If a watch event hasn't
been consumed by the time this timeout expires, it is deleted. Set the
default timeout to 20 seconds (an arbitrary value that is not too
high).

To allow other timeout values to be specified in the future, use a
generic command line option for that purpose:

--timeout|-w watch-event=<seconds>
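
For illustration, the way a per-event deadline feeds into the poll
timeout can be sketched as a pure function. This is a simplified model of
the idea, not the patch's check_event_timeout(); the name
sketch_clamp_timeout() is hypothetical, and all times are assumed to be
milliseconds from a monotonic clock:

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Shrink a poll()-style timeout (-1 means "wait forever") so that the
 * caller wakes up no later than the given deadline.
 * Returns true if the deadline has already passed, in which case the
 * timeout is left untouched and the caller should drop the event.
 */
static bool sketch_clamp_timeout(uint64_t deadline_msec, uint64_t now_msec,
				 int *ptimeout)
{
	uint64_t delta;

	if (deadline_msec <= now_msec)
		return true;

	delta = deadline_msec - now_msec;
	if (delta > INT_MAX)
		delta = INT_MAX; /* keep the value representable as int */

	if (*ptimeout == -1 || (uint64_t)*ptimeout > delta)
		*ptimeout = (int)delta;

	return false;
}
```

With the default of 20 seconds, an event queued at time 5000 gets the
deadline 25000; a poll loop running at time 10000 would then sleep for at
most 15000 milliseconds before checking the event again.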

This is part of XSA-326 / CVE-2022-42311.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 53d003aebffb..98837ef2e9cf 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -106,6 +106,8 @@ int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 
+unsigned int timeout_watch_event_msec = 20000;
+
 void trace(const char *fmt, ...)
 {
 	va_list arglist;
@@ -208,19 +210,92 @@ void reopen_log(void)
 	}
 }
 
+static uint64_t get_now_msec(void)
+{
+	struct timespec now_ts;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now_ts))
+		barf_perror("Could not find time (clock_gettime failed)");
+
+	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
+}
+
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
+	struct buffered_data *req;
+
 	list_del(&out->list);
+
+	/*
+	 * Update conn->timeout_msec with the next found timeout value in the
+	 * queued pending requests.
+	 */
+	if (out->timeout_msec) {
+		conn->timeout_msec = 0;
+		list_for_each_entry(req, &conn->out_list, list) {
+			if (req->timeout_msec) {
+				conn->timeout_msec = req->timeout_msec;
+				break;
+			}
+		}
+	}
+
 	talloc_free(out);
 }
 
+static void check_event_timeout(struct connection *conn, uint64_t msecs,
+				int *ptimeout)
+{
+	uint64_t delta;
+	struct buffered_data *out, *tmp;
+
+	if (!conn->timeout_msec)
+		return;
+
+	delta = conn->timeout_msec - msecs;
+	if (conn->timeout_msec <= msecs) {
+		delta = 0;
+		list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
+			/*
+			 * Only look at buffers with timeout and no data
+			 * already written to the ring.
+			 */
+			if (out->timeout_msec && out->inhdr && !out->used) {
+				if (out->timeout_msec > msecs) {
+					conn->timeout_msec = out->timeout_msec;
+					delta = conn->timeout_msec - msecs;
+					break;
+				}
+
+				/*
+				 * Free out without updating conn->timeout_msec,
+				 * as the update is done in this loop already.
+				 */
+				out->timeout_msec = 0;
+				trace("watch event path %s for domain %u timed out\n",
+				      out->buffer, conn->id);
+				free_buffered_data(out, conn);
+			}
+		}
+		if (!delta) {
+			conn->timeout_msec = 0;
+			return;
+		}
+	}
+
+	if (*ptimeout == -1 || *ptimeout > delta)
+		*ptimeout = delta;
+}
+
 void conn_free_buffered_data(struct connection *conn)
 {
 	struct buffered_data *out;
 
 	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
 		free_buffered_data(out, conn);
+
+	conn->timeout_msec = 0;
 }
 
 static bool write_messages(struct connection *conn)
@@ -333,6 +408,7 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *p_ro_sock_pollfd_idx,
 {
 	struct connection *conn;
 	struct wrl_timestampt now;
+	uint64_t msecs;
 
 	if (fds)
 		memset(fds, 0, sizeof(struct pollfd) * current_array_size);
@@ -354,10 +430,12 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *p_ro_sock_pollfd_idx,
 
 	wrl_gettime_now(&now);
 	wrl_log_periodic(now);
+	msecs = get_now_msec();
 
 	list_for_each_entry(conn, &connections, list) {
 		if (conn->domain) {
 			wrl_check_timeout(conn->domain, now, ptimeout);
+			check_event_timeout(conn, msecs, ptimeout);
 			if (domain_can_read(conn) ||
 			    (domain_can_write(conn) &&
 			     !list_empty(&conn->out_list)))
@@ -701,6 +779,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		return;
 	bdata->inhdr = true;
 	bdata->used = 0;
+	bdata->timeout_msec = 0;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -752,6 +831,12 @@ void send_event(struct connection *conn, const char *path, const char *token)
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
 }
@@ -1994,6 +2079,9 @@ static void usage(void)
 "  -W, --watch-nb <nb>     limit the number of watches per domain,\n"
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
+"  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
+"                          allowed timeout candidates are:\n"
+"                          watch-event: time a watch-event is kept pending\n"
 "  -R, --no-recovery       to request that no recovery should be attempted when\n"
 "                          the store is corrupted (debug only),\n"
 "  -I, --internal-db       store database in memory, not on disk\n"
@@ -2015,6 +2103,7 @@ static struct option options[] = {
 	{ "trace-file", 1, NULL, 'T' },
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
+	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
 	{ "verbose", 0, NULL, 'V' },
@@ -2026,6 +2115,39 @@ int dom0_domid = 0;
 int dom0_event = 0;
 int priv_domid = 0;
 
+static int get_optval_int(const char *arg)
+{
+	char *end;
+	long val;
+
+	val = strtol(arg, &end, 10);
+	if (!*arg || *end || val < 0 || val > INT_MAX)
+		barf("invalid parameter value \"%s\"\n", arg);
+
+	return val;
+}
+
+static bool what_matches(const char *arg, const char *what)
+{
+	unsigned int what_len = strlen(what);
+
+	return !strncmp(arg, what, what_len) && arg[what_len] == '=';
+}
+
+static void set_timeout(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("timeouts must be specified via <what>=<seconds>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "watch-event"))
+		timeout_watch_event_msec = val * 1000;
+	else
+		barf("unknown timeout \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2037,7 +2159,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2082,6 +2204,9 @@ int main(int argc, char *argv[])
 		case 'A':
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
+		case 'w':
+			set_timeout(optarg);
+			break;
 		case 'e':
 			dom0_event = strtol(optarg, NULL, 10);
 			break;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 83d49693fc19..3112c11811e5 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -27,6 +27,7 @@
 #include <dirent.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <time.h>
 #include <errno.h>
 
 #include "xenstore_lib.h"
@@ -56,6 +57,8 @@ struct buffered_data
 		char raw[sizeof(struct xsd_sockmsg)];
 	} hdr;
 
+	uint64_t timeout_msec;
+
 	/* The actual data. */
 	char *buffer;
 	char default_buffer[DEFAULT_BUFFER_SIZE];
@@ -88,6 +91,7 @@ struct connection
 
 	/* Buffered output data */
 	struct list_head out_list;
+	uint64_t timeout_msec;
 
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
@@ -199,6 +203,8 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 
+extern unsigned int timeout_watch_event_msec;
+
 /* Map the kernel's xenstore page. */
 void *xenbus_map(void);
 void unmap_xenbus(void *interface);
From 6102ec4d215359dbc9433bc2bfbbf63e24f91761 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: limit outstanding requests

Add another quota limiting the number of outstanding requests of a
guest. As specifying quotas on the command line is becoming rather
unwieldy, switch to a new scheme using [--quota|-Q] <what>=<val>,
which makes it easy to add more quotas in the future.
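
The <what>=<val> scheme can be illustrated with a minimal standalone
sketch. The names parse_quota_arg() and struct quota_settings are
hypothetical and not part of xenstored. The patch itself calls barf() on
errors, while this sketch returns an error code so it can be tried in
isolation.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the quotas known to this sketch. */
struct quota_settings {
	int outstanding;
};

/*
 * Parse "<what>=<val>" and store the value in the matching quota.
 * Returns 0 on success, -1 on malformed input or an unknown quota name.
 */
static int parse_quota_arg(const char *arg, struct quota_settings *q)
{
	const char *eq = strchr(arg, '=');
	char *end;
	long val;

	/* Require a non-empty name and a non-empty value. */
	if (!eq || eq == arg || !eq[1])
		return -1;

	errno = 0;
	val = strtol(eq + 1, &end, 10);
	if (errno || *end || val < 0 || val > INT_MAX)
		return -1;

	/* Only one quota name is known here; more can be added below. */
	if ((size_t)(eq - arg) == strlen("outstanding") &&
	    !strncmp(arg, "outstanding", eq - arg)) {
		q->outstanding = (int)val;
		return 0;
	}

	return -1;
}
```

With this sketch, "outstanding=50" sets the quota to 50, while
"outstanding", "outstanding=" and "bogus=1" are all rejected.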

Set the default value to 20 (a somewhat arbitrary value that seems
neither too high nor too low).

A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
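
The definition of "outstanding" can be sketched with simplified
stand-in structures. All names below (struct sketch_domain, struct
sketch_request and the helper functions) are hypothetical. The patch
itself tracks this state in struct buffered_data and struct domain, and
this sketch assumes each "written" callback is invoked exactly once per
queued message.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for xenstored's structures. */
struct sketch_domain {
	int nboutstanding;	/* requests not yet fully delivered */
	int quota;		/* plays the role of quota_req_outstanding */
};

struct sketch_request {
	struct sketch_domain *dom;
	unsigned int event_cnt;	/* watch events still queued */
	bool reply_pending;	/* direct response still queued */
};

/* A request was processed and its direct response queued. */
static void request_queued(struct sketch_request *req,
			   struct sketch_domain *dom)
{
	req->dom = dom;
	req->event_cnt = 0;
	req->reply_pending = true;
	dom->nboutstanding++;
}

/* The request caused a watch event to be queued for some connection. */
static void event_queued(struct sketch_request *req)
{
	req->event_cnt++;
}

/* Drop the accounting once neither reply nor events are pending. */
static void maybe_complete(struct sketch_request *req)
{
	if (!req->reply_pending && !req->event_cnt)
		req->dom->nboutstanding--;
}

/* The direct response has been written out completely. */
static void reply_written(struct sketch_request *req)
{
	req->reply_pending = false;
	maybe_complete(req);
}

/* One watch event caused by the request has been written out. */
static void event_written(struct sketch_request *req)
{
	req->event_cnt--;
	maybe_complete(req);
}

/* Further requests are only read while the quota is not reached. */
static bool can_read_more(const struct sketch_domain *dom)
{
	return dom->nboutstanding < dom->quota;
}
```

A request that triggered one watch event thus stays outstanding after
its reply has been written, and only stops counting against the quota
once the event has been written as well.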

Note that across a live update the relation to buffered watch events
for other domains is lost.

Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota zeroed initially.

This is part of XSA-326 / CVE-2022-42312.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 98837ef2e9cf..2ed91d13297b 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -105,6 +105,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -220,12 +221,24 @@ static uint64_t get_now_msec(void)
 	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
 }
 
+/*
+ * Remove a struct buffered_data from the list of outgoing data.
+ * A struct buffered_data related to a request having caused watch events to be
+ * sent is kept until all those events have been written out.
+ * Each watch event is referencing the related request via pend.req, while the
+ * number of watch events caused by a request is kept in pend.ref.event_cnt
+ * (those two cases are mutually exclusive, so the two fields can share memory
+ * via a union).
+ * The struct buffered_data is freed only if no related watch event is
+ * referencing it. The related return data can be freed right away.
+ */
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
 	struct buffered_data *req;
 
 	list_del(&out->list);
+	out->on_out_list = false;
 
 	/*
 	 * Update conn->timeout_msec with the next found timeout value in the
@@ -241,6 +254,30 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	if (out->hdr.msg.type == XS_WATCH_EVENT) {
+		req = out->pend.req;
+		if (req) {
+			req->pend.ref.event_cnt--;
+			if (!req->pend.ref.event_cnt && !req->on_out_list) {
+				if (req->on_ref_list) {
+					domain_outstanding_domid_dec(
+						req->pend.ref.domid);
+					list_del(&req->list);
+				}
+				talloc_free(req);
+			}
+		}
+	} else if (out->pend.ref.event_cnt) {
+		/* Hang out off from conn. */
+		talloc_steal(NULL, out);
+		if (out->buffer != out->default_buffer)
+			talloc_free(out->buffer);
+		list_add(&out->list, &conn->ref_list);
+		out->on_ref_list = true;
+		return;
+	} else
+		domain_outstanding_dec(conn);
+
 	talloc_free(out);
 }
 
@@ -349,6 +386,7 @@ static bool write_messages(struct connection *conn)
 static int destroy_conn(void *_conn)
 {
 	struct connection *conn = _conn;
+	struct buffered_data *req;
 
 	/* Flush outgoing if possible, but don't block. */
 	if (!conn->domain) {
@@ -362,6 +400,11 @@ static int destroy_conn(void *_conn)
 				break;
 		close(conn->fd);
 	}
+
+	conn_free_buffered_data(conn);
+	list_for_each_entry(req, &conn->ref_list, list)
+		req->on_ref_list = false;
+
         if (conn->target)
                 talloc_unlink(conn, conn->target);
 	list_del(&conn->list);
@@ -800,6 +843,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	domain_outstanding_inc(conn);
 }
 
 /*
@@ -807,7 +852,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
  * As this is not directly related to the current command, errors can't be
  * reported.
  */
-void send_event(struct connection *conn, const char *path, const char *token)
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token)
 {
 	struct buffered_data *bdata;
 	unsigned int len;
@@ -837,8 +883,13 @@ void send_event(struct connection *conn, const char *path, const char *token)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->pend.req = req;
+	if (req)
+		req->pend.ref.event_cnt++;
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
@@ -1574,6 +1625,7 @@ static void handle_input(struct connection *conn)
 			return;
 	}
 	in = conn->in;
+	in->pend.ref.domid = conn->id;
 
 	/* Not finished header yet? */
 	if (in->inhdr) {
@@ -1644,6 +1696,7 @@ struct connection *new_connection(connwritefn_t *write, connreadfn_t *read)
 	new->is_ignored = false;
 	new->transaction_started = 0;
 	INIT_LIST_HEAD(&new->out_list);
+	INIT_LIST_HEAD(&new->ref_list);
 	INIT_LIST_HEAD(&new->watches);
 	INIT_LIST_HEAD(&new->transaction_list);
 
@@ -2079,6 +2132,9 @@ static void usage(void)
 "  -W, --watch-nb <nb>     limit the number of watches per domain,\n"
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
+"  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
+"                          quotas are:\n"
+"                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2103,6 +2159,7 @@ static struct option options[] = {
 	{ "trace-file", 1, NULL, 'T' },
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
+	{ "quota", 1, NULL, 'Q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2148,6 +2205,20 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
+static void set_quota(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("quotas must be specified via <what>=<nb>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "outstanding"))
+		quota_req_outstanding = val;
+	else
+		barf("unknown quota \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2159,7 +2230,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:w:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2204,6 +2275,9 @@ int main(int argc, char *argv[])
 		case 'A':
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
+		case 'Q':
+			set_quota(optarg);
+			break;
 		case 'w':
 			set_timeout(optarg);
 			break;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 3112c11811e5..edeaa96dd10b 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -45,6 +45,8 @@ typedef int32_t wrl_creditt;
 struct buffered_data
 {
 	struct list_head list;
+	bool on_out_list;
+	bool on_ref_list;
 
 	/* Are we still doing the header? */
 	bool inhdr;
@@ -52,6 +54,17 @@ struct buffered_data
 	/* How far are we? */
 	unsigned int used;
 
+	/* Outstanding request accounting. */
+	union {
+		/* ref is being used for requests. */
+		struct {
+			unsigned int event_cnt; /* # of outstanding events. */
+			unsigned int domid;     /* domid of request. */
+		} ref;
+		/* req is being used for watch events. */
+		struct buffered_data *req;      /* request causing event. */
+	} pend;
+
 	union {
 		struct xsd_sockmsg msg;
 		char raw[sizeof(struct xsd_sockmsg)];
@@ -93,6 +106,9 @@ struct connection
 	struct list_head out_list;
 	uint64_t timeout_msec;
 
+	/* Referenced requests no longer pending. */
+	struct list_head ref_list;
+
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
 
@@ -154,7 +170,8 @@ unsigned int get_strings(struct buffered_data *data,
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
-void send_event(struct connection *conn, const char *path, const char *token);
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
@@ -202,6 +219,7 @@ extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
+extern int quota_req_outstanding;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 3bff322d024d..2dd80eb1a7bb 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -82,6 +82,9 @@ struct domain
 	/* number of watch for this domain */
 	int nbwatch;
 
+	/* Number of outstanding requests. */
+	int nboutstanding;
+
 	/* write rate limit */
 	wrl_creditt wrl_credit; /* [ -wrl_config_writecost, +_dburst ] */
 	struct wrl_timestampt wrl_timestamp;
@@ -284,8 +287,12 @@ bool domain_can_read(struct connection *conn)
 {
 	struct xenstore_domain_interface *intf = conn->domain->interface;
 
-	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
-		return false;
+	if (domain_is_unprivileged(conn)) {
+		if (conn->domain->wrl_credit < 0)
+			return false;
+		if (conn->domain->nboutstanding >= quota_req_outstanding)
+			return false;
+	}
 
 	if (conn->is_ignored)
 		return false;
@@ -334,7 +341,7 @@ static struct domain *alloc_domain(void *context, unsigned int domid)
 {
 	struct domain *domain;
 
-	domain = talloc(context, struct domain);
+	domain = talloc_zero(context, struct domain);
 	if (!domain) {
 		errno = ENOMEM;
 		return NULL;
@@ -383,8 +390,6 @@ static int new_domain(struct domain *domain, int port)
 	domain->conn->id = domain->domid;
 
 	domain->remote_port = port;
-	domain->nbentry = 0;
-	domain->nbwatch = 0;
 
 	return 0;
 }
@@ -922,6 +927,28 @@ int domain_watch(struct connection *conn)
 		: 0;
 }
 
+void domain_outstanding_inc(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding++;
+}
+
+void domain_outstanding_dec(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding--;
+}
+
+void domain_outstanding_domid_dec(unsigned int domid)
+{
+	struct domain *d = find_domain_by_domid(domid);
+
+	if (d)
+		d->nboutstanding--;
+}
+
 static wrl_creditt wrl_config_writecost      = WRL_FACTOR;
 static wrl_creditt wrl_config_rate           = WRL_RATE   * WRL_FACTOR;
 static wrl_creditt wrl_config_dburst         = WRL_DBURST * WRL_FACTOR;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 5e00087206c7..4bff2e655b9b 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -67,6 +67,9 @@ int domain_entry(struct connection *conn);
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
+void domain_outstanding_inc(struct connection *conn);
+void domain_outstanding_dec(struct connection *conn);
+void domain_outstanding_domid_dec(unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 2f9367767e44..c50c0575f0f1 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -142,6 +142,7 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		  struct node *node, bool exact, struct node_perms *perms)
 {
 	struct connection *i;
+	struct buffered_data *req;
 	struct watch *watch;
 
 	/* During transactions, don't fire watches, but queue them. */
@@ -150,6 +151,8 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		return;
 	}
 
+	req = domain_is_unprivileged(conn) ? conn->in : NULL;
+
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
 		/* introduce/release domain watches */
@@ -164,12 +167,12 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			}
@@ -238,8 +241,12 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	talloc_set_destructor(watch, destroy_watch);
 	send_ack(conn, XS_WATCH);
 
-	/* We fire once up front: simplifies clients and restart. */
-	send_event(conn, get_watch_path(watch, watch->node), watch->token);
+	/*
+	 * We fire once up front: simplifies clients and restart.
+	 * This event will not be linked to the XS_WATCH request.
+	 */
+	send_event(NULL, conn, get_watch_path(watch, watch->node),
+		   watch->token);
 
 	return 0;
 }
From 3900bb43503f98c4b52cb194e0f4ba3aede889e4 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: don't buffer multiple identical watch events

A guest not reading its Xenstore response buffer fast enough might
cause lots of buffered Xenstore watch events to pile up. Reduce the
generated load by dropping new events which already have an identical
copy pending.

The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
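
The duplicate check can be sketched in isolation as follows. The
fixed-size array queue and the names struct sketch_queue and
queue_event() are assumptions made for this example only. xenstored
keeps pending events on the linked list conn->out_list instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_PENDING 16
#define SKETCH_MAX_LEN     64

/* Hypothetical pending-event queue used only by this sketch. */
struct sketch_queue {
	char data[SKETCH_MAX_PENDING][SKETCH_MAX_LEN];
	size_t len[SKETCH_MAX_PENDING];
	size_t count;
};

/*
 * Queue a watch event whose payload is "path\0token\0".
 * Returns true if the event was queued, false if it was dropped as a
 * duplicate or did not fit into the queue.
 */
static bool queue_event(struct sketch_queue *q, const char *path,
			const char *token)
{
	char buf[SKETCH_MAX_LEN];
	size_t plen = strlen(path) + 1, tlen = strlen(token) + 1;
	size_t len = plen + tlen, i;

	if (len > SKETCH_MAX_LEN || q->count == SKETCH_MAX_PENDING)
		return false;

	memcpy(buf, path, plen);
	memcpy(buf + plen, token, tlen);

	/* Special "@..." events are always queued individually. */
	if (path[0] != '@') {
		for (i = 0; i < q->count; i++) {
			if (q->len[i] == len &&
			    !memcmp(q->data[i], buf, len))
				return false;
		}
	}

	memcpy(q->data[q->count], buf, len);
	q->len[q->count] = len;
	q->count++;
	return true;
}
```

Queueing the same path and token twice keeps only one copy, while a
different token or a special event such as "@releaseDomain" is queued
every time.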

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 2ed91d13297b..c6f1d4189cfe 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -823,6 +823,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->inhdr = true;
 	bdata->used = 0;
 	bdata->timeout_msec = 0;
+	bdata->watch_event = false;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -855,7 +856,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 void send_event(struct buffered_data *req, struct connection *conn,
 		const char *path, const char *token)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata, *bd;
 	unsigned int len;
 
 	len = strlen(path) + 1 + strlen(token) + 1;
@@ -877,12 +878,29 @@ void send_event(struct buffered_data *req, struct connection *conn,
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	/*
+	 * Check whether an identical event is pending already.
+	 * Special events are excluded from that check.
+	 */
+	if (path[0] != '@') {
+		list_for_each_entry(bd, &conn->out_list, list) {
+			if (bd->watch_event && bd->hdr.msg.len == len &&
+			    !memcmp(bdata->buffer, bd->buffer, len)) {
+				trace("dropping duplicate watch %s %s for domain %u\n",
+				      path, token, conn->id);
+				talloc_free(bdata);
+				return;
+			}
+		}
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->watch_event = true;
 	bdata->pend.req = req;
 	if (req)
 		req->pend.ref.event_cnt++;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index edeaa96dd10b..1eb6131fc88d 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -51,6 +51,9 @@ struct buffered_data
 	/* Are we still doing the header? */
 	bool inhdr;
 
+	/* Is this a watch event? */
+	bool watch_event;
+
 	/* How far are we? */
 	unsigned int used;
 
From 9edfd7fab111833fba1d6ea5d26fd1471c090e83 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: fix connection->id usage

Don't use conn->id for privilege checks, but domain_is_unprivileged().

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 8d48ab48201b..bce6662f6e45 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -198,7 +198,7 @@ int do_control(struct connection *conn, struct buffered_data *in)
 	int cmd;
 	char **vec;
 
-	if (conn->id != 0)
+	if (domain_is_unprivileged(conn))
 		return EACCES;
 
 	num = xs_count_strings(in->buffer, in->used);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 1eb6131fc88d..98db4afcaabf 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -93,7 +93,7 @@ struct connection
 	/* The index of pollfd in global pollfd array */
 	int pollfd_idx;
 
-	/* Who am I? 0 for socket connections. */
+	/* Who am I? Domid of connection. */
 	unsigned int id;
 
 	/* Is this a read-only connection? */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 6fbdb29dcdd7..9bef6e72a566 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -483,7 +483,8 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 	if (conn->transaction)
 		return EBUSY;
 
-	if (conn->id && conn->transaction_started > quota_max_transaction)
+	if (domain_is_unprivileged(conn) &&
+	    conn->transaction_started > quota_max_transaction)
 		return ENOSPC;
 
 	/* Attach transaction to input for autofree until it's complete */
From 5ab89308f46452567425cb32a843f9f37ed4588b Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: simplify and fix per domain node accounting

The accounting of nodes can be simplified now that each connection
holds the associated domid.

Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.

The node accounting also has to be fixed in some cases, especially in
error paths.

This is part of XSA-326 / CVE-2022-42313.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index bce6662f6e45..ab0794deedc8 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -25,6 +25,7 @@
 #include "talloc.h"
 #include "xenstored_core.h"
 #include "xenstored_control.h"
+#include "xenstored_domain.h"
 
 struct cmd_s {
 	char *cmd;
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index c6f1d4189cfe..12d013d24949 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -545,7 +545,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(node)) {
+	if (domain_adjust_node_perms(conn, node)) {
 		talloc_free(node);
 		return NULL;
 	}
@@ -567,7 +567,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	void *p;
 	struct xs_tdb_record_hdr *hdr;
 
-	if (domain_adjust_node_perms(node))
+	if (domain_adjust_node_perms(conn, node))
 		return errno;
 
 	data.dsize = sizeof(*hdr)
@@ -1161,13 +1161,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static int destroy_node(struct connection *conn, struct node *node)
+static void destroy_node_rm(struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
 	tdb_delete(tdb_ctx, node->key);
+}
 
+static int destroy_node(struct connection *conn, struct node *node)
+{
+	destroy_node_rm(node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1217,8 +1221,12 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 			goto err;
 
 		/* Account for new node */
-		if (i->parent)
-			domain_entry_inc(conn, i);
+		if (i->parent) {
+			if (domain_entry_inc(conn, i)) {
+				destroy_node_rm(i);
+				return NULL;
+			}
+		}
 	}
 
 	return node;
@@ -1499,10 +1507,27 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in)
 	old_perms = node->perms;
 	domain_entry_dec(conn, node);
 	node->perms = perms;
-	domain_entry_inc(conn, node);
+	if (domain_entry_inc(conn, node)) {
+		node->perms = old_perms;
+		/*
+		 * This should never fail because we had a reference on the
+		 * domain before and Xenstored is single-threaded.
+		 */
+		domain_entry_inc(conn, node);
+		return ENOMEM;
+	}
 
-	if (write_node(conn, node, false))
+	if (write_node(conn, node, false)) {
+		int saved_errno = errno;
+
+		domain_entry_dec(conn, node);
+		node->perms = old_perms;
+		/* No failure possible as above. */
+		domain_entry_inc(conn, node);
+
+		errno = saved_errno;
 		return errno;
+	}
 
 	fire_watches(conn, in, name, node, false, &old_perms);
 	send_ack(conn, XS_SET_PERMS);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 2dd80eb1a7bb..306e12358bf9 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -16,6 +16,7 @@
     along with this program; If not, see <http://www.gnu.org/licenses/>.
 */
 
+#include <assert.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
@@ -358,6 +359,18 @@ static struct domain *alloc_domain(void *context, unsigned int domid)
 	return domain;
 }
 
+static struct domain *find_or_alloc_existing_domain(unsigned int domid)
+{
+	struct domain *domain;
+	xc_dominfo_t dominfo;
+
+	domain = find_domain_struct(domid);
+	if (!domain && get_domain_info(domid, &dominfo))
+		domain = alloc_domain(NULL, domid);
+
+	return domain;
+}
+
 static int new_domain(struct domain *domain, int port)
 {
 	int rc;
@@ -767,30 +780,28 @@ void domain_init(void)
 	virq_port = rc;
 }
 
-void domain_entry_inc(struct connection *conn, struct node *node)
+int domain_entry_inc(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
-		return;
+		return 0;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d)
-				d->nbentry++;
-		}
-	} else if (conn->domain) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				conn->domain->domid);
- 		} else {
- 			conn->domain->nbentry++;
-		}
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_inc(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_or_alloc_existing_domain(domid);
+		if (d)
+			d->nbentry++;
+		else
+			return ENOMEM;
 	}
+
+	return 0;
 }
 
 /*
@@ -826,7 +837,7 @@ static int chk_domain_generation(unsigned int domid, uint64_t gen)
  * Remove permissions for no longer existing domains in order to avoid a new
  * domain with the same domid inheriting the permissions.
  */
-int domain_adjust_node_perms(struct node *node)
+int domain_adjust_node_perms(struct connection *conn, struct node *node)
 {
 	unsigned int i;
 	int ret;
@@ -836,8 +847,14 @@ int domain_adjust_node_perms(struct node *node)
 		return errno;
 
 	/* If the owner doesn't exist any longer give it to priv domain. */
-	if (!ret)
+	if (!ret) {
+		/*
+		 * In theory we'd need to update the number of dom0 nodes here,
+		 * but we could be called for a read of the node. So better
+		 * avoid the risk to overflow the node count of dom0.
+		 */
 		node->perms.p[0].id = priv_domid;
+	}
 
 	for (i = 1; i < node->perms.num; i++) {
 		if (node->perms.p[i].perms & XS_PERM_IGNORE)
@@ -856,25 +873,25 @@ int domain_adjust_node_perms(struct node *node)
 void domain_entry_dec(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
 		return;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d && d->nbentry)
-				d->nbentry--;
-		}
-	} else if (conn->domain && conn->domain->nbentry) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				conn->domain->domid);
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_dec(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_domain_struct(domid);
+		if (d) {
+			d->nbentry--;
 		} else {
-			conn->domain->nbentry--;
+			errno = ENOENT;
+			corrupt(conn,
+				"Node \"%s\" owned by non-existing domain %u\n",
+				node->name, domid);
 		}
 	}
 }
@@ -884,13 +901,23 @@ int domain_entry_fix(unsigned int domid, int num, bool update)
 	struct domain *d;
 	int cnt;
 
-	d = find_domain_by_domid(domid);
-	if (!d)
-		return 0;
+	if (update) {
+		d = find_domain_struct(domid);
+		assert(d);
+	} else {
+		/*
+		 * We are called first with update == false in order to catch
+		 * any error. So do a possible allocation and check for error
+		 * only in this case, as in the case of update == true nothing
+		 * can go wrong anymore as the allocation already happened.
+		 */
+		d = find_or_alloc_existing_domain(domid);
+		if (!d)
+			return -1;
+	}
 
 	cnt = d->nbentry + num;
-	if (cnt < 0)
-		cnt = 0;
+	assert(cnt >= 0);
 
 	if (update)
 		d->nbentry = cnt;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4bff2e655b9b..4edf1dba9425 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -57,10 +57,10 @@ bool domain_can_write(struct connection *conn);
 bool domain_is_unprivileged(struct connection *conn);
 
 /* Remove node permissions for no longer existing domains. */
-int domain_adjust_node_perms(struct node *node);
+int domain_adjust_node_perms(struct connection *conn, struct node *node);
 
 /* Quota manipulation */
-void domain_entry_inc(struct connection *conn, struct node *);
+int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 9bef6e72a566..bf2fda8234b3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -523,8 +523,12 @@ static int transaction_fix_domains(struct transaction *trans, bool update)
 
 	list_for_each_entry(d, &trans->changed_domains, list) {
 		cnt = domain_entry_fix(d->domid, d->nbentry, update);
-		if (!update && cnt >= quota_nb_entry_per_domain)
-			return ENOSPC;
+		if (!update) {
+			if (cnt >= quota_nb_entry_per_domain)
+				return ENOSPC;
+			if (cnt < 0)
+				return ENOMEM;
+		}
 	}
 
 	return 0;
From f0b856f6d6403d32563ac7f6b807a58ad5682e93 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: limit max number of nodes accessed in a transaction

Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore,
as all nodes accessed during a transaction need to be tracked.

In oxenstored the number of requests in a transaction is limited via
the quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes instead.

In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
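
The counting rule described above can be sketched in isolation as
follows. This is only an illustration, not the xenstored code: all
sketch_* names, the fixed-size array and the quota value of 4 are
assumptions made for the example. Only the idea of counting each
distinct node once, refusing further nodes with ENOSPC for unprivileged
callers, and marking the function with the GCC/Clang
warn_unused_result attribute follows the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a transaction. */
#define SKETCH_MAX_NODES 8

struct sketch_trans {
	unsigned int nodes;                     /* distinct nodes accessed */
	const char *accessed[SKETCH_MAX_NODES];
};

/* Assumed quota for the example (the patch defaults to 1024). */
unsigned int sketch_quota_trans_nodes = 4;

/* Returns 0 on success, ENOSPC if the per-transaction quota is hit. */
__attribute__((__warn_unused_result__))
int sketch_access_node(struct sketch_trans *t, const char *name,
		       bool unprivileged)
{
	unsigned int i;

	/* A node which was accessed before is not counted again. */
	for (i = 0; i < t->nodes; i++)
		if (!strcmp(t->accessed[i], name))
			return 0;

	/* Only unprivileged callers are subject to the quota. */
	if (t->nodes >= sketch_quota_trans_nodes && unprivileged)
		return ENOSPC;

	/* The sketch has a fixed capacity, the real code allocates. */
	assert(t->nodes < SKETCH_MAX_NODES);
	t->accessed[t->nodes++] = name;

	return 0;
}
```

A caller ignoring the return value of sketch_access_node() would get a
compiler warning, which is the point of adding such an attribute.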

This is part of XSA-326 / CVE-2022-42314.

Reported-by: Julien Grall <jgrall@amazon.com>
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/include/xen-tools/libs.h b/tools/include/xen-tools/libs.h
index cc7dfc8c6453..34db3b784732 100644
--- a/tools/include/xen-tools/libs.h
+++ b/tools/include/xen-tools/libs.h
@@ -59,4 +59,8 @@
     })
 #endif
 
+#ifndef __must_check
+#define __must_check __attribute__((__warn_unused_result__))
+#endif
+
 #endif	/* __XEN_TOOLS_LIBS__ */
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 12d013d24949..ff649b7544db 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -105,6 +105,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_trans_nodes = 1024;
 int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
@@ -502,6 +503,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	TDB_DATA key, data;
 	struct xs_tdb_record_hdr *hdr;
 	struct node *node;
+	int err;
 
 	node = talloc(ctx, struct node);
 	if (!node) {
@@ -523,14 +525,13 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	if (data.dptr == NULL) {
 		if (tdb_error(tdb_ctx) == TDB_ERR_NOEXIST) {
 			node->generation = NO_GENERATION;
-			access_node(conn, node, NODE_ACCESS_READ, NULL);
-			errno = ENOENT;
+			err = access_node(conn, node, NODE_ACCESS_READ, NULL);
+			errno = err ? : ENOENT;
 		} else {
 			log("TDB error on read: %s", tdb_errorstr(tdb_ctx));
 			errno = EIO;
 		}
-		talloc_free(node);
-		return NULL;
+		goto error;
 	}
 
 	node->parent = NULL;
@@ -545,19 +546,36 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(conn, node)) {
-		talloc_free(node);
-		return NULL;
-	}
+	if (domain_adjust_node_perms(conn, node))
+		goto error;
 
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
 	node->children = node->data + node->datalen;
 
-	access_node(conn, node, NODE_ACCESS_READ, NULL);
+	if (access_node(conn, node, NODE_ACCESS_READ, NULL))
+		goto error;
 
 	return node;
+
+ error:
+	err = errno;
+	talloc_free(node);
+	errno = err;
+	return NULL;
+}
+
+static bool read_node_can_propagate_errno(void)
+{
+	/*
+	 * 2 error cases for read_node() can always be propagated up:
+	 * ENOMEM, because this has nothing to do with the node being in the
+	 * data base or not, but is caused by a general lack of memory.
+	 * ENOSPC, because this is related to hitting quota limits which need
+	 * to be respected.
+	 */
+	return errno == ENOMEM || errno == ENOSPC;
 }
 
 int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
@@ -672,7 +690,7 @@ static int ask_parents(struct connection *conn, const void *ctx,
 		node = read_node(conn, ctx, name);
 		if (node)
 			break;
-		if (errno == ENOMEM)
+		if (read_node_can_propagate_errno())
 			return errno;
 	} while (!streq(name, "/"));
 
@@ -735,7 +753,7 @@ static struct node *get_node(struct connection *conn,
 		}
 	}
 	/* Clean up errno if they weren't supposed to know. */
-	if (!node && errno != ENOMEM)
+	if (!node && !read_node_can_propagate_errno())
 		errno = errno_from_parents(conn, ctx, name, errno, perm);
 	return node;
 }
@@ -1117,7 +1135,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 
 	/* If parent doesn't exist, create it. */
 	parent = read_node(conn, parentname, parentname);
-	if (!parent)
+	if (!parent && errno == ENOENT)
 		parent = construct_node(conn, ctx, parentname);
 	if (!parent)
 		return NULL;
@@ -1396,7 +1414,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 
 	parent = read_node(conn, ctx, parentname);
 	if (!parent)
-		return (errno == ENOMEM) ? ENOMEM : EINVAL;
+		return read_node_can_propagate_errno() ? errno : EINVAL;
 	node->parent = parent;
 
 	return delete_node(conn, ctx, parent, node, false);
@@ -1424,7 +1442,7 @@ static int do_rm(struct connection *conn, struct buffered_data *in)
 				return 0;
 			}
 			/* Restore errno, just in case. */
-			if (errno != ENOMEM)
+			if (!read_node_can_propagate_errno())
 				errno = ENOENT;
 		}
 		return errno;
@@ -2177,6 +2195,8 @@ static void usage(void)
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
 "                          quotas are:\n"
+"                          transaction-nodes: number of accessed node per\n"
+"                                             transaction\n"
 "                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
@@ -2258,6 +2278,8 @@ static void set_quota(const char *arg)
 	val = get_optval_int(eq + 1);
 	if (what_matches(arg, "outstanding"))
 		quota_req_outstanding = val;
+	else if (what_matches(arg, "transaction-nodes"))
+		quota_trans_nodes = val;
 	else
 		barf("unknown quota \"%s\"\n", arg);
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 98db4afcaabf..7e371253d2d1 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -34,6 +34,7 @@
 #include "list.h"
 #include "tdb.h"
 #include "hashtable.h"
+#include "utils.h"
 
 /* DEFAULT_BUFFER_SIZE should be large enough for each errno string. */
 #define DEFAULT_BUFFER_SIZE 16
@@ -223,6 +224,7 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
+extern int quota_trans_nodes;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index bf2fda8234b3..778b7e439cb3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -156,6 +156,9 @@ struct transaction
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
+	/* Node counter. */
+	unsigned int nodes;
+
 	/* Generation when transaction started. */
 	uint64_t generation;
 
@@ -266,6 +269,11 @@ int access_node(struct connection *conn, struct node *node,
 
 	i = find_accessed_node(trans, node->name);
 	if (!i) {
+		if (trans->nodes >= quota_trans_nodes &&
+		    domain_is_unprivileged(conn)) {
+			ret = ENOSPC;
+			goto err;
+		}
 		i = talloc_zero(trans, struct accessed_node);
 		if (!i)
 			goto nomem;
@@ -303,6 +311,7 @@ int access_node(struct connection *conn, struct node *node,
 				i->ta_node = true;
 			}
 		}
+		trans->nodes++;
 		list_add_tail(&i->list, &trans->accessed);
 	}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 0093cac807e3..e3cbd6b23095 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -39,8 +39,8 @@ void transaction_entry_inc(struct transaction *trans, unsigned int domid);
 void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 
 /* This node was accessed. */
-int access_node(struct connection *conn, struct node *node,
-                enum node_access_type type, TDB_DATA *key);
+int __must_check access_node(struct connection *conn, struct node *node,
+                             enum node_access_type type, TDB_DATA *key);
 
 /* Queue watches for a modified node. */
 void queue_watches(struct connection *conn, const char *name, bool watch_exact);
From 83da938e7d66a86fd2e06a18aba5a90c52ac4bcb Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: move the call of setup_structure() to dom0
 introduction

Setting up the basic structure when introducing dom0 makes it possible
to add proper node memory accounting for the added nodes later.

This makes it possible to do proper node accounting, too.

For that to work, the owner of the created nodes additionally needs to
be corrected to dom0_domid instead of domid 0.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index ff649b7544..8123a65a58 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1834,7 +1834,8 @@ static int tdb_flags;
 static void manual_node(const char *name, const char *child)
 {
 	struct node *node;
-	struct xs_permissions perms = { .id = 0, .perms = XS_PERM_NONE };
+	struct xs_permissions perms = { .id = dom0_domid,
+					.perms = XS_PERM_NONE };
 
 	node = talloc_zero(NULL, struct node);
 	if (!node)
@@ -1873,7 +1874,7 @@ static void tdb_logger(TDB_CONTEXT *tdb, int level, const char * fmt, ...)
 	}
 }
 
-static void setup_structure(void)
+void setup_structure(void)
 {
 	char *tdbname;
 	tdbname = talloc_strdup(talloc_autofree_context(), xs_daemon_tdb());
@@ -1891,6 +1892,7 @@ static void setup_structure(void)
 	manual_node("/", "tool");
 	manual_node("/tool", "xenstored");
 	manual_node("/tool/xenstored", NULL);
+	domain_entry_fix(dom0_domid, 3, true);
 
 	check_store();
 }
@@ -2389,9 +2391,6 @@ int main(int argc, char *argv[])
 
 	init_pipe(reopen_log_pipe);
 
-	/* Setup the database */
-	setup_structure();
-
 	/* Listen to hypervisor. */
 	if (!no_domain_init)
 		domain_init();
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 7e371253d2..d95e4262a9 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -195,6 +195,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 struct node *read_node(struct connection *conn, const void *ctx,
 		       const char *name);
 
+void setup_structure(void);
 struct connection *new_connection(connwritefn_t *write, connreadfn_t *read);
 void check_store(void);
 void corrupt(struct connection *conn, const char *fmt, ...);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 306e12358b..bed6c4e05a 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -732,6 +732,8 @@ static int dom0_init(void)
 	if (dom0->interface == NULL)
 		return -1;
 
+	setup_structure();
+
 	talloc_steal(dom0->conn, dom0); 
 
 	xenevtchn_notify(xce_handle, dom0->port);
-- 
2.35.3

From 56bb03067843b80ffd85d89610e7283d80d42335 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add infrastructure to keep track of per domain memory
 usage

The amount of memory a domain can consume in Xenstore is limited by
various quotas today, but even with sane quotas a domain can still
consume rather large amounts of memory.

Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.

There are two quotas defined: a soft quota, which results in a warning
being issued via syslog() when it is exceeded, and a hard quota, which
stops further requests or watch events from being accepted as long as
accepting them would violate the hard quota.

Setting any of those quotas to 0 will disable it.

As default values use 2MB per domain for the soft limit (this basically
covers the allowed case of creating 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
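
The interaction of the two quotas can be illustrated with a small
stand-alone sketch. It is deliberately simplified and is not the code
of this patch: the sketch_* names are made up, the syslog() calls and
the rate limiting of messages are left out, and only the "0 disables
the quota", "soft quota only flags" and "hard quota refuses unless the
check is skipped" rules are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain accounting data. */
struct sketch_domain {
	int memory;          /* bytes currently accounted */
	bool soft_reported;  /* soft quota currently exceeded */
};

/* Same default values as in the patch; 0 would disable a quota. */
int sketch_quota_soft = 2 * 1024 * 1024;
int sketch_quota_hard = 2 * 1024 * 1024 + 512 * 1024;

/* Returns 0 on success, ENOMEM if the hard quota would be violated. */
int sketch_memory_add(struct sketch_domain *d, int mem, bool no_quota_check)
{
	int total = d->memory + mem;

	/* Hard quota: refuse the adjustment unless checking is skipped. */
	if (sketch_quota_hard && total >= sketch_quota_hard &&
	    !no_quota_check)
		return ENOMEM;

	/* Soft quota: only remember the state (real code would log). */
	d->soft_reported = sketch_quota_soft && total >= sketch_quota_soft;

	d->memory = total;
	assert(d->memory >= 0);

	return 0;
}
```

Lowering the accounted memory is always done with no_quota_check set to
true in this sketch, as freeing memory must never fail.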

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 8123a65a58bf..9fd83ea0259a 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -107,6 +107,8 @@ int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_trans_nodes = 1024;
 int quota_req_outstanding = 20;
+int quota_memory_per_domain_soft = 2 * 1024 * 1024; /* 2 MB */
+int quota_memory_per_domain_hard = 2 * 1024 * 1024 + 512 * 1024; /* 2.5 MB */
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -2199,7 +2201,14 @@ static void usage(void)
 "                          quotas are:\n"
 "                          transaction-nodes: number of accessed node per\n"
 "                                             transaction\n"
+"                          memory: total used memory per domain for nodes,\n"
+"                                  transactions, watches and requests, above\n"
+"                                  which Xenstore will stop talking to domain\n"
 "                          outstanding: number of outstanding requests\n"
+"  -q, --quota-soft <what>=<nb> set a soft quota <what> to the value <nb>,\n"
+"                          causing a warning to be issued via syslog() if the\n"
+"                          limit is violated, allowed quotas are:\n"
+"                          memory: see above\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2225,6 +2234,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "quota", 1, NULL, 'Q' },
+	{ "quota-soft", 1, NULL, 'q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2270,7 +2280,7 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
-static void set_quota(const char *arg)
+static void set_quota(const char *arg, bool soft)
 {
 	const char *eq = strchr(arg, '=');
 	int val;
@@ -2278,11 +2288,16 @@ static void set_quota(const char *arg)
 	if (!eq)
 		barf("quotas must be specified via <what>=<nb>\n");
 	val = get_optval_int(eq + 1);
-	if (what_matches(arg, "outstanding"))
+	if (what_matches(arg, "outstanding") && !soft)
 		quota_req_outstanding = val;
-	else if (what_matches(arg, "transaction-nodes"))
+	else if (what_matches(arg, "transaction-nodes") && !soft)
 		quota_trans_nodes = val;
-	else
+	else if (what_matches(arg, "memory")) {
+		if (soft)
+			quota_memory_per_domain_soft = val;
+		else
+			quota_memory_per_domain_hard = val;
+	} else
 		barf("unknown quota \"%s\"\n", arg);
 }
 
@@ -2297,7 +2312,7 @@ int main(int argc, char *argv[])
 	int timeout;
 
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:T:RVW:w:", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:Q:q:T:RVW:w:", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2343,7 +2358,10 @@ int main(int argc, char *argv[])
 			quota_nb_perms_per_node = strtol(optarg, NULL, 10);
 			break;
 		case 'Q':
-			set_quota(optarg);
+			set_quota(optarg, false);
+			break;
+		case 'q':
+			set_quota(optarg, true);
 			break;
 		case 'w':
 			set_timeout(optarg);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index d95e4262a91e..4e53072e637c 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -226,6 +226,8 @@ extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
+extern int quota_memory_per_domain_soft;
+extern int quota_memory_per_domain_hard;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 0e116e5c3d63..7863fa55487d 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -80,6 +80,13 @@ struct domain
 	/* number of entry from this domain in the store */
 	int nbentry;
 
+	/* Amount of memory allocated for this domain. */
+	int memory;
+	bool soft_quota_reported;
+	bool hard_quota_reported;
+	time_t mem_last_msg;
+#define MEM_WARN_MINTIME_SEC 10
+
 	/* number of watch for this domain */
 	int nbwatch;
 
@@ -293,6 +300,9 @@ bool domain_can_read(struct connection *conn)
 			return false;
 		if (conn->domain->nboutstanding >= quota_req_outstanding)
 			return false;
+		if (conn->domain->memory >= quota_memory_per_domain_hard &&
+		    quota_memory_per_domain_hard)
+			return false;
 	}
 
 	if (conn->is_ignored)
@@ -937,6 +947,89 @@ int domain_entry(struct connection *conn)
 		: 0;
 }
 
+static bool domain_chk_quota(struct domain *domain, int mem)
+{
+	time_t now;
+
+	if (!domain || !domid_is_unprivileged(domain->domid) ||
+	    (domain->conn && domain->conn->is_ignored))
+		return false;
+
+	now = time(NULL);
+
+	if (mem >= quota_memory_per_domain_hard &&
+	    quota_memory_per_domain_hard) {
+		if (domain->hard_quota_reported)
+			return true;
+		syslog(LOG_ERR, "Domain %u exceeds hard memory quota, Xenstore interface to domain stalled\n",
+		       domain->domid);
+		domain->mem_last_msg = now;
+		domain->hard_quota_reported = true;
+		return true;
+	}
+
+	if (now - domain->mem_last_msg >= MEM_WARN_MINTIME_SEC) {
+		if (domain->hard_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->hard_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below hard memory quota again\n",
+			       domain->domid);
+		}
+		if (mem >= quota_memory_per_domain_soft &&
+		    quota_memory_per_domain_soft &&
+		    !domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = true;
+			syslog(LOG_WARNING, "Domain %u exceeds soft memory quota\n",
+			       domain->domid);
+		}
+		if (mem < quota_memory_per_domain_soft &&
+		    domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below soft memory quota again\n",
+			       domain->domid);
+		}
+
+	}
+
+	return false;
+}
+
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check)
+{
+	struct domain *domain;
+
+	domain = find_domain_struct(domid);
+	if (domain) {
+		/*
+		 * domain_chk_quota() will print warning and also store whether
+		 * the soft/hard quota has been hit. So check no_quota_check
+		 * *after*.
+		 */
+		if (domain_chk_quota(domain, domain->memory + mem) &&
+		    !no_quota_check)
+			return ENOMEM;
+		domain->memory += mem;
+	} else {
+		/*
+		 * The domain the memory is to be accounted for should always
+		 * exist, as accounting is done either for a domain related to
+		 * the current connection, or for the domain owning a node
+		 * (which is always existing, as the owner of the node is
+		 * tested to exist and replaced by domid 0 if not).
+		 * So not finding the related domain MUST be an error in the
+		 * data base.
+		 */
+		errno = ENOENT;
+		corrupt(NULL, "Accounting called for non-existing domain %u\n",
+			domid);
+		return ENOENT;
+	}
+
+	return 0;
+}
+
 void domain_watch_inc(struct connection *conn)
 {
 	if (!conn || !conn->domain)
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4edf1dba9425..3a8c6bab48ba 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -64,6 +64,26 @@ int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check);
+
+/*
+ * domain_memory_add_chk(): to be used when memory quota should be checked.
+ * Not to be used when specifying a negative mem value, as lowering the used
+ * memory should always be allowed.
+ */
+static inline int domain_memory_add_chk(unsigned int domid, int mem)
+{
+	return domain_memory_add(domid, mem, false);
+}
+/*
+ * domain_memory_add_nochk(): to be used when memory quota should not be
+ * checked, e.g. when lowering memory usage, or in an error case for undoing
+ * a previous memory adjustment.
+ */
+static inline void domain_memory_add_nochk(unsigned int domid, int mem)
+{
+	domain_memory_add(domid, mem, true);
+}
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
From 3ecf15728d7516e7564f29d2dd76724a3ed96cc4 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add memory accounting for responses

Add the memory accounting for queued responses.

In case adding a watch event for a guest is causing the hard memory
quota of that guest to be violated, the event is dropped. This will
ensure that it is impossible to drive another guest past its memory
quota by generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
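
The following stand-alone sketch shows the intended behaviour for
queued watch events. It is an illustration under assumptions, not the
patch itself: the sketch_* names, the header size of 16 bytes and the
hard limit of 64 bytes are invented for the example. An event is
charged to the receiving connection when it is queued, dropped if that
would reach the hard limit, and uncharged again when the buffer is
freed.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_HDR_SIZE 16   /* assumed size of a message header */

/* Hypothetical receiver state. */
struct sketch_conn {
	int memory;   /* bytes accounted for queued data */
	int queued;   /* number of queued events */
};

int sketch_hard_limit = 64;  /* assumed hard quota, 0 disables it */

/* Queue an event of len payload bytes; returns false if it was dropped. */
bool sketch_send_event(struct sketch_conn *c, int len)
{
	int cost = len + SKETCH_HDR_SIZE;

	/* Drop the event instead of exceeding the receiver's quota. */
	if (sketch_hard_limit && c->memory + cost >= sketch_hard_limit)
		return false;

	c->memory += cost;
	c->queued++;

	return true;
}

/* Undo the accounting when a queued event is freed. */
void sketch_free_event(struct sketch_conn *c, int len)
{
	c->memory -= len + SKETCH_HDR_SIZE;
	c->queued--;
	assert(c->memory >= 0 && c->queued >= 0);
}
```

As the check is done against the receiver of the event, the guest
triggering the events cannot push another guest over its limit.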

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 9fd83ea0259a..4322d3cf63a1 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -257,6 +257,8 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	domain_memory_add_nochk(conn->id, -out->hdr.msg.len - sizeof(out->hdr));
+
 	if (out->hdr.msg.type == XS_WATCH_EVENT) {
 		req = out->pend.req;
 		if (req) {
@@ -845,11 +847,14 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->timeout_msec = 0;
 	bdata->watch_event = false;
 
-	if (len <= DEFAULT_BUFFER_SIZE)
+	if (len <= DEFAULT_BUFFER_SIZE) {
 		bdata->buffer = bdata->default_buffer;
-	else {
+		/* Don't check quota, path might be used for returning error. */
+		domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
+	} else {
 		bdata->buffer = talloc_array(bdata, char, len);
-		if (!bdata->buffer) {
+		if (!bdata->buffer ||
+		    domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
 			send_error(conn, ENOMEM);
 			return;
 		}
@@ -914,6 +919,11 @@ void send_event(struct buffered_data *req, struct connection *conn,
 		}
 	}
 
+	if (domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
+		talloc_free(bdata);
+		return;
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
From 6ab9c1de8be18105895b545e61d6e15501875951 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for watches

Add the memory accounting for registered watches.

When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 4322d3cf63a1..0f589a1f63a0 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -407,6 +407,7 @@ static int destroy_conn(void *_conn)
 	}
 
 	conn_free_buffered_data(conn);
+	conn_delete_all_watches(conn);
 	list_for_each_entry(req, &conn->ref_list, list)
 		req->on_ref_list = false;
 
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index c50c0575f0f1..7118c30e8c32 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -224,7 +224,8 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 		return ENOMEM;
 	watch->node = talloc_strdup(watch, vec[0]);
 	watch->token = talloc_strdup(watch, vec[1]);
-	if (!watch->node || !watch->token) {
+	if (!watch->node || !watch->token ||
+	    domain_memory_add_chk(conn->id, strlen(vec[0]) + strlen(vec[1]))) {
 		talloc_free(watch);
 		return ENOMEM;
 	}
@@ -265,6 +266,8 @@ int do_unwatch(struct connection *conn, struct buffered_data *in)
 	list_for_each_entry(watch, &conn->watches, list) {
 		if (streq(watch->node, node) && streq(watch->token, vec[1])) {
 			list_del(&watch->list);
+			domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+							  strlen(watch->token));
 			talloc_free(watch);
 			domain_watch_dec(conn);
 			send_ack(conn, XS_UNWATCH);
@@ -280,6 +283,8 @@ void conn_delete_all_watches(struct connection *conn)
 
 	while ((watch = list_top(&conn->watches, struct watch, list))) {
 		list_del(&watch->list);
+		domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+						  strlen(watch->token));
 		talloc_free(watch);
 		domain_watch_dec(conn);
 	}
From 23a68b338d36c21fae509761ff9ed117ad96e46b Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for nodes

Add the memory accounting for Xenstore nodes. To keep this reasonably
simple, allow for some sloppiness when writing nodes. Any hard quota
violation will result in no further requests being accepted.
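
The accounting done when a node is written can be sketched as below.
This is a simplified model under assumptions and not the code of this
patch: the sketch_* names and the limit of 100 bytes are invented, and
key sizes as well as transaction nodes are ignored. The old owner is
uncharged first, the new owner is charged with a quota check, and the
first step is rolled back if the second one fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain accounting data. */
struct sketch_dom {
	int memory;
};

int sketch_node_hard = 100;  /* assumed hard quota, 0 disables it */

/* Adjust the accounted memory, optionally checking the hard quota. */
static int sketch_charge(struct sketch_dom *d, int mem, bool check)
{
	if (check && sketch_node_hard &&
	    d->memory + mem >= sketch_node_hard)
		return ENOMEM;

	d->memory += mem;

	return 0;
}

/* Account for replacing a node of old_size by one of new_size. */
int sketch_write_node(struct sketch_dom *old_owner, int old_size,
		      struct sketch_dom *new_owner, int new_size)
{
	int ret;

	/* Lowering the usage of the old owner is never checked. */
	if (old_size)
		(void)sketch_charge(old_owner, -old_size, false);

	ret = sketch_charge(new_owner, new_size, true);
	if (ret) {
		/* Error path: restore the old owner's accounting. */
		if (old_size)
			(void)sketch_charge(old_owner, old_size, false);
		return ret;
	}

	assert(old_owner->memory >= 0 && new_owner->memory >= 0);

	return 0;
}
```

Passing the same domain as old and new owner models the common case of
a domain rewriting one of its own nodes.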

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 0f589a1f63a0..6ed1ae261470 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -498,6 +498,117 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *p_ro_sock_pollfd_idx,
 	}
 }
 
+static void get_acc_data(TDB_DATA *key, struct node_account_data *acc)
+{
+	TDB_DATA old_data;
+	struct xs_tdb_record_hdr *hdr;
+
+	if (acc->memory < 0) {
+		old_data = tdb_fetch(tdb_ctx, *key);
+		/* No check for error, as the node might not exist. */
+		if (old_data.dptr == NULL) {
+			acc->memory = 0;
+		} else {
+			hdr = (void *)old_data.dptr;
+			acc->memory = old_data.dsize;
+			acc->domid = hdr->perms[0].id;
+		}
+		talloc_free(old_data.dptr);
+	}
+}
+
+/*
+ * Per-transaction nodes need to be accounted for the transaction owner.
+ * Those nodes are stored in the data base with the transaction generation
+ * count prepended (e.g. 123/local/domain/...). So testing for the node's
+ * key not to start with "/" is sufficient.
+ */
+static unsigned int get_acc_domid(struct connection *conn, TDB_DATA *key,
+				  unsigned int domid)
+{
+	return (!conn || key->dptr[0] == '/') ? domid : conn->id;
+}
+
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check)
+{
+	struct xs_tdb_record_hdr *hdr = (void *)data->dptr;
+	struct node_account_data old_acc = {};
+	unsigned int old_domid, new_domid;
+	int ret;
+
+	if (!acc)
+		old_acc.memory = -1;
+	else
+		old_acc = *acc;
+
+	get_acc_data(key, &old_acc);
+	old_domid = get_acc_domid(conn, key, old_acc.domid);
+	new_domid = get_acc_domid(conn, key, hdr->perms[0].id);
+
+	/*
+	 * Don't check for ENOENT, as we want to be able to switch orphaned
+	 * nodes to new owners.
+	 */
+	if (old_acc.memory)
+		domain_memory_add_nochk(old_domid,
+					-old_acc.memory - key->dsize);
+	ret = domain_memory_add(new_domid, data->dsize + key->dsize,
+				no_quota_check);
+	if (ret) {
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		return ret;
+	}
+
+	/* TDB should set errno, but doesn't even set ecode AFAICT. */
+	if (tdb_store(tdb_ctx, *key, *data, TDB_REPLACE) != 0) {
+		domain_memory_add_nochk(new_domid, -data->dsize - key->dsize);
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc) {
+		/* Don't use new_domid, as it might be a transaction node. */
+		acc->domid = hdr->perms[0].id;
+		acc->memory = data->dsize;
+	}
+
+	return 0;
+}
+
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc)
+{
+	struct node_account_data tmp_acc;
+	unsigned int domid;
+
+	if (!acc) {
+		acc = &tmp_acc;
+		acc->memory = -1;
+	}
+
+	get_acc_data(key, acc);
+
+	if (tdb_delete(tdb_ctx, *key)) {
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc->memory) {
+		domid = get_acc_domid(conn, key, acc->domid);
+		domain_memory_add_nochk(domid, -acc->memory - key->dsize);
+	}
+
+	return 0;
+}
+
 /*
  * If it fails, returns NULL and sets errno.
  * Temporary memory allocations will be done with ctx.
@@ -551,9 +662,15 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
+	node->acc.domid = node->perms.p[0].id;
+	node->acc.memory = data.dsize;
 	if (domain_adjust_node_perms(conn, node))
 		goto error;
 
+	/* If owner is gone reset currently accounted memory size. */
+	if (node->acc.domid != node->perms.p[0].id)
+		node->acc.memory = 0;
+
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
@@ -617,12 +734,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	p += node->datalen;
 	memcpy(p, node->children, node->childlen);
 
-	/* TDB should set errno, but doesn't even set ecode AFAICT. */
-	if (tdb_store(tdb_ctx, *key, data, TDB_REPLACE) != 0) {
-		corrupt(conn, "Write of %s failed", key->dptr);
-		errno = EIO;
-		return errno;
-	}
+	if (do_tdb_write(conn, key, &data, &node->acc, no_quota_check))
+		return EIO;
+
 	return 0;
 }
 
@@ -1121,7 +1235,7 @@ static void delete_node_single(struct connection *conn, struct node *node)
 	if (access_node(conn, node, NODE_ACCESS_DELETE, &key))
 		return;
 
-	if (tdb_delete(tdb_ctx, key) != 0) {
+	if (do_tdb_delete(conn, &key, &node->acc) != 0) {
 		corrupt(conn, "Could not delete '%s'", node->name);
 		return;
 	}
@@ -1184,6 +1298,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	/* No children, no data */
 	node->children = node->data = NULL;
 	node->childlen = node->datalen = 0;
+	node->acc.memory = 0;
 	node->parent = parent;
 	return node;
 
@@ -1192,17 +1307,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static void destroy_node_rm(struct node *node)
+static void destroy_node_rm(struct connection *conn, struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
-	tdb_delete(tdb_ctx, node->key);
+	do_tdb_delete(conn, &node->key, &node->acc);
 }
 
 static int destroy_node(struct connection *conn, struct node *node)
 {
-	destroy_node_rm(node);
+	destroy_node_rm(conn, node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1254,7 +1369,7 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 		/* Account for new node */
 		if (i->parent) {
 			if (domain_entry_inc(conn, i)) {
-				destroy_node_rm(i);
+				destroy_node_rm(conn, i);
 				return NULL;
 			}
 		}
@@ -2077,7 +2192,7 @@ static int clean_store_(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA val,
 	if (!hashtable_search(reachable, name)) {
 		log("clean_store: '%s' is orphaned!", name);
 		if (recovery) {
-			tdb_delete(tdb, key);
+			do_tdb_delete(NULL, &key, NULL);
 		}
 	}
 
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 4e53072e637c..521bc80384e5 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -141,6 +141,11 @@ struct node_perms {
 	struct xs_permissions *p;
 };
 
+struct node_account_data {
+	unsigned int domid;
+	int memory;		/* -1 if unknown */
+};
+
 struct node {
 	const char *name;
 	/* Key used to update TDB */
@@ -163,6 +168,9 @@ struct node {
 	/* Children, each nul-terminated. */
 	unsigned int childlen;
 	char *children;
+
+	/* Allocation information for node currently in store. */
+	struct node_account_data acc;
 };
 
 /* Return the only argument in the input. */
@@ -258,6 +266,11 @@ extern xengnttab_handle **xgt_handle;
 
 int remember_string(struct hashtable *hash, const char *str);
 
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check);
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc);
+
 void conn_free_buffered_data(struct connection *conn);
 
 #endif /* _XENSTORED_CORE_H */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 778b7e439cb3..c1beb40a3d51 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -153,6 +153,9 @@ struct transaction
 	/* List of all transactions active on this connection. */
 	struct list_head list;
 
+	/* Connection this transaction is associated with. */
+	struct connection *conn;
+
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
@@ -292,6 +295,8 @@ int access_node(struct connection *conn, struct node *node,
 
 		introduce = true;
 		i->ta_node = false;
+		/* acc.memory < 0 means "unknown, get size from TDB". */
+		node->acc.memory = -1;
 
 		/*
 		 * Additional transaction-specific node for read type. We only
@@ -416,11 +421,11 @@ static int finalize_transaction(struct connection *conn,
 					goto err;
 				hdr = (void *)data.dptr;
 				hdr->generation = ++generation;
-				ret = tdb_store(tdb_ctx, key, data,
-						TDB_REPLACE);
+				ret = do_tdb_write(conn, &key, &data, NULL,
+						   true);
 				talloc_free(data.dptr);
 			} else {
-				ret = tdb_delete(tdb_ctx, key);
+				ret = do_tdb_delete(conn, &key, NULL);
 			}
 			if (ret)
 				goto err;
@@ -431,7 +436,7 @@ static int finalize_transaction(struct connection *conn,
 			}
 		}
 
-		if (i->ta_node && tdb_delete(tdb_ctx, ta_key))
+		if (i->ta_node && do_tdb_delete(conn, &ta_key, NULL))
 			goto err;
 		list_del(&i->list);
 		talloc_free(i);
@@ -459,7 +464,7 @@ static int destroy_transaction(void *_transaction)
 							       i->node);
 			if (trans_name) {
 				set_tdb_key(trans_name, &key);
-				tdb_delete(tdb_ctx, key);
+				do_tdb_delete(trans->conn, &key, NULL);
 			}
 		}
 		list_del(&i->list);
@@ -503,6 +508,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 
 	INIT_LIST_HEAD(&trans->accessed);
 	INIT_LIST_HEAD(&trans->changed_domains);
+	trans->conn = conn;
 	trans->fail = false;
 	trans->generation = ++generation;
 
From eacc032e9e668dcdcbacb87bbb61a7ba2e398cdd Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add exports for quota variables

Some quota variables are not exported via header files.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 521bc80384e5..5abf06c21c98 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -231,6 +231,11 @@ extern TDB_CONTEXT *tdb_ctx;
 extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
+extern int quota_nb_watch_per_domain;
+extern int quota_max_transaction;
+extern int quota_max_entry_size;
+extern int quota_nb_perms_per_node;
+extern int quota_max_path_len;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index c1beb40a3d51..6e29118c800d 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -175,7 +175,6 @@ struct transaction
 	bool fail;
 };
 
-extern int quota_max_transaction;
 uint64_t generation;
 
 static void set_tdb_key(const char *name, TDB_DATA *key)
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 7118c30e8c32..19d0fb01b1c4 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -31,8 +31,6 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 
-extern int quota_nb_watch_per_domain;
-
 struct watch
 {
 	/* Watches on this connection */
From 5734fb655e87ec38c4e0af9023b54e13d827c7e6 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add control command for setting and showing quota

Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain

Note that if a new quota is lower than the existing one, Xenstored may
continue to handle requests from a domain that exceeds the new limit
(depending on which quota is exceeded), and the amount of resources in
use will not change. However, the domain will not be able to create more
resources associated with that quota until its usage is back below the
limit.
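
The behaviour described above can be illustrated with a minimal C sketch.
This is not the xenstored code: struct dom_usage, quota_set_limit(),
quota_try_alloc() and quota_release() are hypothetical names used only for
illustration, and the ">=" comparison is an assumption. Lowering a limit
leaves existing usage untouched, but further allocations are refused until
usage has dropped below the limit again.

```c
#include <stdbool.h>

/* Hypothetical per-domain accounting, for illustration only. */
struct dom_usage {
	int used;	/* resources currently held by the domain */
	int limit;	/* current quota for this resource */
};

/* Change the quota; existing usage is deliberately left untouched. */
void quota_set_limit(struct dom_usage *d, int new_limit)
{
	d->limit = new_limit;
}

/* Try to take one more resource; refuse if usage is at or above quota. */
bool quota_try_alloc(struct dom_usage *d)
{
	if (d->used >= d->limit)
		return false;
	d->used++;
	return true;
}

/* Give one resource back, e.g. when a node or watch is removed. */
void quota_release(struct dom_usage *d)
{
	if (d->used > 0)
		d->used--;
}
```

In this sketch a domain holding 5 resources keeps all 5 after the limit is
lowered to 3, but cannot allocate again until it has released at least 3.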

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 2081f20f55e4..1f42a377c10f 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -329,6 +329,17 @@ CONTROL			<command>|[<parameters>|]
 	print|<string>
 		print <string> to syslog (xenstore runs as daemon) or
 		to console (xenstore runs as stubdom)
+	quota|[set <name> <val>|<domid>]
+		without parameters: print the current quota settings
+		with "set <name> <val>": set the quota <name> to new value
+		<val> (The admin should make sure all the domain usage is
+		below the quota. If it is not, then Xenstored may continue to
+		handle requests from the domain as long as the resource
+		violating the new quota setting isn't increased further)
+		with "<domid>": print quota related accounting data for
+		the domain <domid>
+	quota-soft|[set <name> <val>]
+		like the "quota" command, but for soft-quota.
 	help			<supported-commands>
 		return list of supported commands for CONTROL
 
diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index ab0794deedc8..0227a5565657 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -19,6 +19,7 @@
 #include <errno.h>
 #include <stdarg.h>
 #include <stdio.h>
+#include <stdlib.h>
 #include <string.h>
 
 #include "utils.h"
@@ -62,6 +63,114 @@ static int do_control_log(void *ctx, struct connection *conn,
 	return 0;
 }
 
+struct quota {
+	const char *name;
+	int *quota;
+	const char *descr;
+};
+
+static const struct quota hard_quotas[] = {
+	{ "nodes", &quota_nb_entry_per_domain, "Nodes per domain" },
+	{ "watches", &quota_nb_watch_per_domain, "Watches per domain" },
+	{ "transactions", &quota_max_transaction, "Transactions per domain" },
+	{ "outstanding", &quota_req_outstanding,
+		"Outstanding requests per domain" },
+	{ "transaction-nodes", &quota_trans_nodes,
+		"Max. number of accessed nodes per transaction" },
+	{ "memory", &quota_memory_per_domain_hard,
+		"Total Xenstore memory per domain (error level)" },
+	{ "node-size", &quota_max_entry_size, "Max. size of a node" },
+	{ "permissions", &quota_nb_perms_per_node,
+		"Max. number of permissions per node" },
+	{ NULL, NULL, NULL }
+};
+
+static const struct quota soft_quotas[] = {
+	{ "memory", &quota_memory_per_domain_soft,
+		"Total Xenstore memory per domain (warning level)" },
+	{ NULL, NULL, NULL }
+};
+
+static int quota_show_current(const void *ctx, struct connection *conn,
+			      const struct quota *quotas)
+{
+	char *resp;
+	unsigned int i;
+
+	resp = talloc_strdup(ctx, "Quota settings:\n");
+	if (!resp)
+		return ENOMEM;
+
+	for (i = 0; quotas[i].quota; i++) {
+		resp = talloc_asprintf_append(resp, "%-17s: %8d %s\n",
+					      quotas[i].name, *quotas[i].quota,
+					      quotas[i].descr);
+		if (!resp)
+			return ENOMEM;
+	}
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
+static int quota_set(const void *ctx, struct connection *conn,
+		     char **vec, int num, const struct quota *quotas)
+{
+	unsigned int i;
+	int val;
+
+	if (num != 2)
+		return EINVAL;
+
+	val = atoi(vec[1]);
+	if (val < 1)
+		return EINVAL;
+
+	for (i = 0; quotas[i].quota; i++) {
+		if (!strcmp(vec[0], quotas[i].name)) {
+			*quotas[i].quota = val;
+			send_ack(conn, XS_CONTROL);
+			return 0;
+		}
+	}
+
+	return EINVAL;
+}
+
+static int quota_get(const void *ctx, struct connection *conn,
+		     char **vec, int num)
+{
+	if (num != 1)
+		return EINVAL;
+
+	return domain_get_quota(ctx, conn, atoi(vec[0]));
+}
+
+static int do_control_quota(void *ctx, struct connection *conn,
+			    char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, hard_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, hard_quotas);
+
+	return quota_get(ctx, conn, vec, num);
+}
+
+static int do_control_quota_s(void *ctx, struct connection *conn,
+			      char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, soft_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, soft_quotas);
+
+	return EINVAL;
+}
+
 #ifdef __MINIOS__
 static int do_control_memreport(void *ctx, struct connection *conn,
 				char **vec, int num)
@@ -154,6 +263,8 @@ static struct cmd_s cmds[] = {
 	{ "memreport", do_control_memreport, "[<file>]" },
 #endif
 	{ "print", do_control_print, "<string>" },
+	{ "quota", do_control_quota, "[set <name> <val>|<domid>]" },
+	{ "quota-soft", do_control_quota_s, "[set <name> <val>]" },
 	{ "help", do_control_help, "" },
 };
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 7863fa55487d..dd3ae15ea4fd 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -31,6 +31,7 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 #include "xenstored_watch.h"
+#include "xenstored_control.h"
 
 #include <xenevtchn.h>
 #include <xenctrl.h>
@@ -348,6 +349,38 @@ static struct domain *find_domain_struct(unsigned int domid)
 	return NULL;
 }
 
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid)
+{
+	struct domain *d = find_domain_struct(domid);
+	char *resp;
+	int ta;
+
+	if (!d)
+		return ENOENT;
+
+	ta = d->conn ? d->conn->transaction_started : 0;
+	resp = talloc_asprintf(ctx, "Domain %u:\n", domid);
+	if (!resp)
+		return ENOMEM;
+
+#define ent(t, e) \
+	resp = talloc_asprintf_append(resp, "%-16s: %8d\n", #t, e); \
+	if (!resp) return ENOMEM
+
+	ent(nodes, d->nbentry);
+	ent(watches, d->nbwatch);
+	ent(transactions, ta);
+	ent(outstanding, d->nboutstanding);
+	ent(memory, d->memory);
+
+#undef ent
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
 static struct domain *alloc_domain(void *context, unsigned int domid)
 {
 	struct domain *domain;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 3a8c6bab48ba..e013a9991ca8 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -90,6 +90,8 @@ int domain_watch(struct connection *conn);
 void domain_outstanding_inc(struct connection *conn);
 void domain_outstanding_dec(struct connection *conn);
 void domain_outstanding_domid_dec(unsigned int domid);
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
From e3d0aacaf5321b9204d2ec628f98ba6949623b22 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:01 +0100
Subject: tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently have two different sets of defaults in the upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in oxenstored.conf.in in upstream Xen

An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.

Resync the defaults from oxenstored.conf.in into the source code.

This is part of XSA-326 / CVE-2022-42316.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ebe18b8e312c..6b06f808595b 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -21,9 +21,9 @@ let xs_daemon_socket = Paths.xen_run_stored ^ "/socket"
 
 let default_config_dir = Paths.xen_config_dir
 
-let maxwatch = ref (50)
-let maxtransaction = ref (20)
-let maxrequests = ref (-1)   (* maximum requests per transaction *)
+let maxwatch = ref (100)
+let maxtransaction = ref (10)
+let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index abcac912805a..6e3d6401ae89 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -20,8 +20,8 @@ exception Transaction_opened
 
 let warn fmt = Logging.warn "quota" fmt
 let activate = ref true
-let maxent = ref (10000)
-let maxsize = ref (4096)
+let maxent = ref (1000)
+let maxsize = ref (2048)
 
 type t = {
 	maxent: int;               (* max entities per domU *)
From b3b27e0cb66b69e0a7d3562d7846a3eafdd02a80 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Thu, 28 Jul 2022 17:08:15 +0100
Subject: tools/ocaml/xenstored: Check for maxrequests before performing
 operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.

If we exceeded max requests we would've returned EQUOTA, but we would still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded

It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there.  Then the transaction
record won't grow.
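
The ordering change can be sketched in C (the patch itself is OCaml). The
names struct txn, txn_check_quota(), txn_do_request() and EQUOTA_ERR are
hypothetical, and the dom0 exemption of the real check is left out. The
sketch only shows the idea: check the limit before doing any work, and
remember that it was hit, loosely mirroring the quota_reached flag in the
diff that follows.

```c
#include <stdbool.h>

#define EQUOTA_ERR (-1)	/* hypothetical error code standing in for EQUOTA */

struct txn {
	int nr_ops;		/* operations recorded so far */
	int max_requests;	/* limit; negative means "no limit" */
	bool quota_reached;	/* sticky: set once the limit was hit */
};

/* Return false if the transaction may not record another operation. */
bool txn_check_quota(struct txn *t)
{
	if (t->max_requests >= 0 &&
	    (t->quota_reached || t->nr_ops >= t->max_requests)) {
		t->quota_reached = true;
		return false;
	}
	return true;
}

/* Check first, then perform and record: a refused request changes nothing. */
int txn_do_request(struct txn *t)
{
	if (!txn_check_quota(t))
		return EQUOTA_ERR;
	/* ... perform the operation on the transaction's tree here ... */
	t->nr_ops++;
	return 0;
}
```

Because the check runs before the operation, a refused request neither
modifies the transaction's tree nor grows the list of recorded operations.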

This is part of XSA-326 / CVE-2022-42317.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 27790d4a5c41..dd58e6979cf9 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -389,6 +389,7 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
 	let reply_error e =
 		Packet.Error e in
 	try
+		Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 		fct con t doms cons req.Packet.data
 	with
 	| Define.Invalid_path          -> reply_error "EINVAL"
@@ -681,9 +682,10 @@ let process_packet ~store ~cons ~doms ~con ~req =
 		in
 
 		let response = try
+			Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 			if tid <> Transaction.none then
 				(* Remember the request and response for this operation in case we need to replay the transaction *)
-				Transaction.add_operation ~perm:(Connection.get_perm con) t req response;
+				Transaction.add_operation t req response;
 			response
 		with Quota.Limit_reached ->
 			Packet.Error "EQUOTA"
diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml
index 17b1bdf2eaf9..294143e2335b 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -85,6 +85,7 @@ type t = {
 	oldroot: Store.Node.t;
 	mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
 	mutable operations: (Packet.request * Packet.response) list;
+	mutable quota_reached: bool;
 	mutable read_lowpath: Store.Path.t option;
 	mutable write_lowpath: Store.Path.t option;
 }
@@ -127,6 +128,7 @@ let make ?(internal=false) id store =
 		oldroot = Store.get_root store;
 		paths = [];
 		operations = [];
+		quota_reached = false;
 		read_lowpath = None;
 		write_lowpath = None;
 	} in
@@ -143,13 +145,19 @@ let get_root t = Store.get_root t.store
 
 let is_read_only t = t.paths = []
 let add_wop t ty path = t.paths <- (ty, path) :: t.paths
-let add_operation ~perm t request response =
+let get_operations t = List.rev t.operations
+
+let check_quota_exn ~perm t =
 	if !Define.maxrequests >= 0
 		&& not (Perms.Connection.is_dom0 perm)
-		&& List.length t.operations >= !Define.maxrequests
-		then raise Quota.Limit_reached;
+		&& (t.quota_reached || List.length t.operations >= !Define.maxrequests)
+		then begin
+			t.quota_reached <- true;
+			raise Quota.Limit_reached;
+		end
+
+let add_operation t request response =
 	t.operations <- (request, response) :: t.operations
-let get_operations t = List.rev t.operations
 let set_read_lowpath t path = t.read_lowpath <- get_lowest path t.read_lowpath
 let set_write_lowpath t path = t.write_lowpath <- get_lowest path t.write_lowpath
 
From 49ce6658aee7981a4e1925e449bbf99f4e8af39b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:07 +0100
Subject: tools/ocaml: GC parameter tuning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory.  Tweak this to 120% instead, which
would match the major GC speed.
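
As rough arithmetic for the effect of this setting: compaction is triggered
once the estimated free memory exceeds max_overhead percent of live memory,
so the free heap space that can sit unused before a compaction is roughly
live * max_overhead / 100. The helper name below is hypothetical and the
result is an estimate, not a measurement.

```c
#include <stdint.h>

/*
 * Approximate upper bound (in MiB) on free heap memory that is not yet
 * returned to the OS, given the live data size and max_overhead in percent.
 * Illustration only; the real GC works on its own estimates.
 */
uint64_t max_unreturned_mib(uint64_t live_mib, unsigned int max_overhead_pct)
{
	return live_mib * max_overhead_pct / 100;
}
```

With 100 MiB of live data, the default of 500 gives about 500 MiB of memory
held back, while the new value of 120 gives about 120 MiB.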

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 6b06f808595b..ba63a8147e09 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -25,6 +25,7 @@ let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
+let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
 let conflict_rate_limit_is_aggregate = ref true
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index d44ae673c42a..3b57ad016dfb 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -104,6 +104,7 @@ let parse_config filename =
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
 		("quota-path-max", Config.Set_int Define.path_max);
+		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
 		("persistent", Config.Set_bool Disk.enable);
 		("xenstored-log-file", Config.String Logging.set_xenstored_log_destination);
@@ -265,6 +266,67 @@ let to_file store cons fds file =
 	        (fun () -> close_out channel)
 end
 
+(*
+	By default OCaml's GC only returns memory to the OS when it exceeds a
+	configurable 'max overhead' setting.
+	The default is 500%, that is 5/6th of the OCaml heap needs to be free
+	and only 1/6th live for a compaction to be triggered that would
+	release memory back to the OS.
+	If the limit is not hit then the OCaml process can reuse that memory
+	for its own purposes, but other processes won't be able to use it.
+
+	There is also a 'space overhead' setting that controls how much work
+	each major GC slice does, and by default aims at having no more than
+	80% or 120% (depending on version) garbage values compared to live
+	values.
+	This doesn't have as much relevance to memory returned to the OS as
+	long as space_overhead <= max_overhead, because compaction is only
+	triggered at the end of major GC cycles.
+
+	The defaults are too large once the program starts using ~100MiB of
+	memory, at which point ~500MiB would be unavailable to other processes
+	(which would be fine if this was the main process in this VM, but it is
+	not).
+
+	Max overhead can also be set to 0, however this is for testing purposes
+	only (setting it lower than 'space overhead' wouldn't help because the
+	major GC wouldn't run fast enough, and compaction does have a
+	performance cost: we can only compact contiguous regions, so memory has
+	to be moved around).
+
+	Max overhead controls how often the heap is compacted, which is useful
+	if there are bursts of activity followed by long periods of idle state,
+	or if a domain quits, etc. Compaction returns memory to the OS.
+
+	wasted = live * space_overhead / 100
+
+	For globally overriding the GC settings one can use OCAMLRUNPARAM,
+	however we provide a config file override to be consistent with other
+	oxenstored settings.
+
+	One might want to dynamically adjust the overhead setting based on used
+	memory, i.e. to use a fixed upper bound in bytes, not percentage. However
+	measurements show that such adjustments increase GC overhead massively,
+	while still not guaranteeing that memory is returned any more quickly
+	than with a percentage based setting.
+
+	The allocation policy could also be tweaked, e.g. first fit would reduce
+	fragmentation and thus memory usage, but the documentation warns that it
+	can be noticeably slower, and indeed one of our own testcases can trigger
+	such a corner case where it is multiple times slower, so it is best to keep
+	the default allocation policy (next-fit/best-fit depending on version).
+
+	There are other tweaks that can be attempted in the future, e.g. setting
+	'ulimit -v' to 75% of RAM, however getting the kernel to actually return
+	NULL from allocations is difficult even with that setting, and without a
+	NULL the emergency GC won't be triggered.
+	Perhaps cgroup limits could help, but for now tweak the safest only.
+*)
+
+let tweak_gc () =
+	Gc.set { (Gc.get ()) with Gc.max_overhead = !Define.gc_max_overhead }
+
+
 let _ =
 	let cf = do_argv in
 	let pidfile =
@@ -274,6 +336,8 @@ let _ =
 			default_pidfile
 		in
 
+	tweak_gc ();
+
 	(try
 		Unixext.mkdir_rec (Filename.dirname pidfile) 0o755
 	with _ ->
From 62d05b9ed538c3c1064215fb1430bb9b1c49df4d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Fri, 29 Jul 2022 18:53:29 +0100
Subject: tools/ocaml/libs/xb: hide type of Xb.t
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)

Reported-by: Julien Grall <jgrall@amazon.com>
diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 104d319d7747..8404ddd8a682 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -196,6 +196,9 @@ let peek_output con = Queue.peek con.pkt_out
 let input_len con = Queue.length con.pkt_in
 let has_in_packet con = Queue.length con.pkt_in > 0
 let get_in_packet con = Queue.pop con.pkt_in
+let has_partial_input con = match con.partial_in with
+	| HaveHdr _ -> true
+	| NoHdr (n, _) -> n < Partial.header_size ()
 let has_more_input con =
 	match con.backend with
 	| Fd _         -> false
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 3a00da6cddc1..794e35bb343e 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,13 +66,7 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
-type t = {
-  backend : backend;
-  pkt_in : Packet.t Queue.t;
-  pkt_out : Packet.t Queue.t;
-  mutable partial_in : partial_buf;
-  mutable partial_out : string;
-}
+type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
 val queue : t -> Packet.t -> unit
@@ -97,6 +91,7 @@ val has_output : t -> bool
 val peek_output : t -> Packet.t
 val input_len : t -> int
 val has_in_packet : t -> bool
+val has_partial_input : t -> bool
 val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 65f99ea6f28a..38b47363a173 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -125,9 +125,7 @@ let get_perm con =
 let set_target con target_domid =
 	con.perm <- Perms.Connection.set_target (get_perm con) ~perms:[Perms.READ; Perms.WRITE] target_domid
 
-let is_backend_mmap con = match con.xb.Xenbus.Xb.backend with
-	| Xenbus.Xb.Xenmmap _ -> true
-	| _ -> false
+let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
 let send_reply con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
@@ -280,9 +278,7 @@ let get_transaction con tid =
 
 let do_input con = Xenbus.Xb.input con.xb
 let has_input con = Xenbus.Xb.has_in_packet con.xb
-let has_partial_input con = match con.xb.Xenbus.Xb.partial_in with
-	| HaveHdr _ -> true
-	| NoHdr (n, _) -> n < Xenbus.Partial.header_size ()
+let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
 let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
From 11ce5196932445ccf6679d04ef2e1963951967c1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:02 +0100
Subject: tools/ocaml: Change Xb.input to return Packet.t option
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The queue here would only ever hold at most one element.  This will simplify
follow-up patches.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 8404ddd8a682..165fd4a1edf4 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -45,7 +45,6 @@ type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 type t =
 {
 	backend: backend;
-	pkt_in: Packet.t Queue.t;
 	pkt_out: Packet.t Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
@@ -62,7 +61,6 @@ let reconnect t = match t.backend with
 		Xs_ring.close backend.mmap;
 		backend.eventchn_notify ();
 		(* Clear our old connection state *)
-		Queue.clear t.pkt_in;
 		Queue.clear t.pkt_out;
 		t.partial_in <- init_partial_in ();
 		t.partial_out <- ""
@@ -124,7 +122,6 @@ let output con =
 
 (* NB: can throw Reconnect *)
 let input con =
-	let newpacket = ref false in
 	let to_read =
 		match con.partial_in with
 		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
@@ -143,21 +140,19 @@ let input con =
 		if Partial.to_complete partial_pkt = 0 then (
 			let pkt = Packet.of_partialpkt partial_pkt in
 			con.partial_in <- init_partial_in ();
-			Queue.push pkt con.pkt_in;
-			newpacket := true
-		)
+			Some pkt
+		) else None
 	| NoHdr (i, buf)      ->
 		(* we complete the partial header *)
 		if sz > 0 then
 			Bytes.blit b 0 buf (Partial.header_size () - i) sz;
 		con.partial_in <- if sz = i then
-			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf)
-	);
-	!newpacket
+			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf);
+		None
+	)
 
 let newcon backend = {
 	backend = backend;
-	pkt_in = Queue.create ();
 	pkt_out = Queue.create ();
 	partial_in = init_partial_in ();
 	partial_out = "";
@@ -193,9 +188,6 @@ let has_output con = has_new_output con || has_old_output con
 
 let peek_output con = Queue.peek con.pkt_out
 
-let input_len con = Queue.length con.pkt_in
-let has_in_packet con = Queue.length con.pkt_in > 0
-let get_in_packet con = Queue.pop con.pkt_in
 let has_partial_input con = match con.partial_in with
 	| HaveHdr _ -> true
 	| NoHdr (n, _) -> n < Partial.header_size ()
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 794e35bb343e..91c682162cea 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -77,7 +77,7 @@ val write_fd : backend_fd -> 'a -> string -> int -> int
 val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
-val input : t -> bool
+val input : t -> Packet.t option
 val newcon : backend -> t
 val open_fd : Unix.file_descr -> t
 val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
@@ -89,10 +89,7 @@ val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
 val peek_output : t -> Packet.t
-val input_len : t -> int
-val has_in_packet : t -> bool
 val has_partial_input : t -> bool
-val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index d982fb24dbb1..451f8b38dbcc 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -94,26 +94,18 @@ let pkt_send con =
 	done
 
 (* receive one packet - can sleep *)
-let pkt_recv con =
-	let workdone = ref false in
-	while not !workdone
-	do
-		workdone := Xb.input con.xb
-	done;
-	Xb.get_in_packet con.xb
+let rec pkt_recv con =
+	match Xb.input con.xb with
+	| Some packet -> packet
+	| None -> pkt_recv con
 
 let pkt_recv_timeout con timeout =
 	let fd = Xb.get_fd con.xb in
 	let r, _, _ = Unix.select [ fd ] [] [] timeout in
 	if r = [] then
 		true, None
-	else (
-		let workdone = Xb.input con.xb in
-		if workdone then
-			false, (Some (Xb.get_in_packet con.xb))
-		else
-			false, None
-	)
+	else
+		false, Xb.input con.xb
 
 let queue_watchevent con data =
 	let ls = split_string ~limit:2 '\000' data in
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 38b47363a173..cc20e047d2b9 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -277,9 +277,7 @@ let get_transaction con tid =
 	Hashtbl.find con.transactions tid
 
 let do_input con = Xenbus.Xb.input con.xb
-let has_input con = Xenbus.Xb.has_in_packet con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
-let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
 let has_output con = Xenbus.Xb.has_output con.xb
@@ -307,7 +305,7 @@ let is_bad con = match con.dom with None -> false | Some dom -> Domain.is_bad_do
    Restrictions below can be relaxed once xenstored learns to dump more
    of its live state in a safe way *)
 let has_extra_connection_data con =
-	let has_in = has_input con || has_partial_input con in
+	let has_in = has_partial_input con in
 	let has_out = has_output con in
 	let has_socket = con.dom = None in
 	let has_nondefault_perms = make_perm con.dom <> con.perm in
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 6a3435c265d3..2d67456a2aa0 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -195,10 +195,9 @@ let parse_live_update args =
 			| _ when Unix.gettimeofday () < t.deadline -> false
 			| l ->
 				warn "timeout reached: have to wait, migrate or shutdown %d domains:" (List.length l);
-				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, in: %b, out: %b, perm: %s"
+				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, out: %b, perm: %s"
 					(Connection.get_domstr con)
 					(Connection.number_of_transactions con)
-					(Connection.has_input con)
 					(Connection.has_output con)
 					(Connection.get_perm con |> Perms.Connection.to_string)
 					) l in
@@ -705,16 +704,17 @@ let do_input store cons doms con =
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
 			info "%s reconnection complete" (Connection.get_domstr con);
-			false
+			None
 		| Failure exp ->
 			error "caught exception %s" exp;
 			error "got a bad client %s" (sprintf "%-8s" (Connection.get_domstr con));
 			Connection.mark_as_bad con;
-			false
+			None
 	in
 
-	if newpacket then (
-		let packet = Connection.pop_in con in
+	match newpacket with
+	| None -> ()
+	| Some packet ->
 		let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
 		let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; Packet.data=data} in
 
@@ -724,8 +724,7 @@ let do_input store cons doms con =
 		         (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
 		process_packet ~store ~cons ~doms ~con ~req;
 		write_access_log ~ty ~tid ~con:(Connection.get_domstr con) ~data;
-		Connection.incr_ops con;
-	)
+		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
 	if Connection.has_output con then (
From 6824bd28b59eef858257dde8076d4f47024bd3eb Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:03 +0100
Subject: tools/ocaml/xb: Add BoundedQueue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ensures we cannot store more than [capacity] elements in a [Queue].  Replacing
all uses of Queue with this module will then ensure at compile time that all
Queues are correctly bounds-checked.

Each element in the queue has a class with its own limits.  This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
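
As an illustration only, the per-class limits can be sketched with a simplified,
self-contained stand-in for the module added below.  The names [cls], [bq] and
the string-based [classify] are assumptions made for this example and are not
part of the patch:

```ocaml
(* Hypothetical, simplified stand-in for the BoundedQueue in this patch. *)
type cls = Reply | Watch_event

type 'a bq = {
  q : 'a Queue.t;
  capacity : int;                 (* overall limit on queued elements *)
  classify : 'a -> cls;           (* maps an element to its class *)
  limit : cls -> int;             (* per-class limit *)
  counts : (cls, int) Hashtbl.t;  (* elements currently queued per class *)
}

let create ~capacity ~classify ~limit =
  { q = Queue.create (); capacity; classify; limit; counts = Hashtbl.create 3 }

let count t c = try Hashtbl.find t.counts c with Not_found -> 0

(* Returns [Some ()] if the element was queued, [None] if a limit was hit. *)
let push e t =
  let c = t.classify e in
  let n = count t c in
  if Queue.length t.q < t.capacity && n < t.limit c then begin
    Queue.push e t.q;
    Hashtbl.replace t.counts c (n + 1);
    Some ()
  end else None

(* Removes the oldest element and gives its slot back to its class. *)
let pop t =
  let e = Queue.pop t.q in
  let c = t.classify e in
  Hashtbl.replace t.counts c (count t c - 1);
  e

(* Example: strings starting with 'w' stand in for watch events. *)
let classify s =
  if String.length s > 0 && s.[0] = 'w' then Watch_event else Reply
let limit = function Reply -> 2 | Watch_event -> 1
let q = create ~capacity:3 ~classify ~limit

let r1 = push "w1" q      (* first watch event *)
let r2 = push "w2" q      (* second watch event, over the class limit *)
let r3 = push "reply1" q  (* a reply, which has its own limit *)
```

In this sketch the second watch event is refused while the reply is still
accepted, which is the property the subsequent change relies on.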

No functional change.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 165fd4a1edf4..4197a3888a68 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -17,6 +17,98 @@
 module Op = struct include Op end
 module Packet = struct include Packet end
 
+module BoundedQueue : sig
+	type ('a, 'b) t
+
+	(** [create ~capacity ~classify ~limit] creates a queue with maximum [capacity] elements.
+	    This is burst capacity, each element is further classified according to [classify],
+	    and each class can have its own [limit].
+	    [capacity] is enforced as an overall limit.
+	    The [limit] can be dynamic, and can be smaller than the number of elements already queued of that class,
+	    in which case those elements are considered to use "burst capacity".
+	  *)
+	val create: capacity:int -> classify:('a -> 'b) -> limit:('b -> int) -> ('a, 'b) t
+
+	(** [clear q] discards all elements from [q] *)
+	val clear: ('a, 'b) t -> unit
+
+	(** [can_push q] when [length q < capacity].	*)
+	val can_push: ('a, 'b) t -> 'b -> bool
+
+	(** [push e q] adds [e] at the end of queue [q] if [can_push q], or returns [None]. *)
+	val push: 'a -> ('a, 'b) t -> unit option
+
+	(** [pop q] removes and returns first element in [q], or raises [Queue.Empty]. *)
+	val pop: ('a, 'b) t -> 'a
+
+	(** [peek q] returns the first element in [q], or raises [Queue.Empty].  *)
+	val peek : ('a, 'b) t -> 'a
+
+	(** [length q] returns the current number of elements in [q] *)
+	val length: ('a, 'b) t -> int
+
+	(** [debug string_of_class q] prints queue usage statistics in an unspecified internal format. *)
+	val debug: ('b -> string) -> (_, 'b) t -> string
+end = struct
+	type ('a, 'b) t =
+		{ q: 'a Queue.t
+		; capacity: int
+		; classify: 'a -> 'b
+		; limit: 'b -> int
+		; class_count: ('b, int) Hashtbl.t
+		}
+
+	let create ~capacity ~classify ~limit =
+		{ capacity; q = Queue.create (); classify; limit; class_count = Hashtbl.create 3 }
+
+	let get_count t classification = try Hashtbl.find t.class_count classification with Not_found -> 0
+
+	let can_push_internal t classification class_count =
+		Queue.length t.q < t.capacity && class_count < t.limit classification
+
+	let ok = Some ()
+
+	let push e t =
+		let classification = t.classify e in
+		let class_count = get_count t classification in
+		if can_push_internal t classification class_count then begin
+			Queue.push e t.q;
+			Hashtbl.replace t.class_count classification (class_count + 1);
+			ok
+		end
+		else
+			None
+
+	let can_push t classification =
+		can_push_internal t classification @@ get_count t classification
+
+	let clear t =
+		Queue.clear t.q;
+		Hashtbl.reset t.class_count
+
+	let pop t =
+		let e = Queue.pop t.q in
+		let classification = t.classify e in
+		let () = match get_count t classification - 1 with
+		| 0 -> Hashtbl.remove t.class_count classification (* reduces memusage *)
+		| n -> Hashtbl.replace t.class_count classification n
+		in
+		e
+
+	let peek t = Queue.peek t.q
+	let length t = Queue.length t.q
+
+	let debug string_of_class t =
+		let b = Buffer.create 128 in
+		Printf.bprintf b "BoundedQueue capacity: %d, used: {" t.capacity;
+		Hashtbl.iter (fun packet_class count ->
+			Printf.bprintf b "	%s: %d" (string_of_class packet_class) count
+		) t.class_count;
+		Printf.bprintf b "}";
+		Buffer.contents b
+end
+
+
 exception End_of_file
 exception Eagain
 exception Noent
From 0e0d85385f773949005ea2efa18956de23081364 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:04 +0100
Subject: tools/ocaml: Limit maximum in-flight requests / outstanding replies
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a limit on the number of outstanding reply packets in the xenbus
queue.  This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.

To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it.  Even though Dom0
is trusted and exempt from quotas a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.
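
A minimal sketch of that selection rule, assuming a hypothetical [conn] record
with integer descriptors in place of the real Connection.t (the actual change
is in Connections.select further down):

```ocaml
(* Hypothetical connection summary; the real code queries Connection.t. *)
type conn = { fd : int; can_input : bool; has_output : bool }

(* Only watch a descriptor for input if we could actually process a request,
   and only watch it for output if something is queued to be written. *)
let select_sets conns =
  List.fold_left (fun (ins, outs) c ->
    let ins = if c.can_input then c.fd :: ins else ins in
    let outs = if c.has_output then c.fd :: outs else outs in
    (ins, outs))
    ([], []) conns

let ins, outs = select_sets [
  { fd = 3; can_input = true; has_output = false };
  { fd = 4; can_input = false; has_output = true };
]
```

Here descriptor 4 is left out of the input set because its output queue is
assumed to be full, so select would not wake up repeatedly for input that
cannot be processed yet.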

Therefore the xenbus queue limit must apply to all connections: Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.

This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring.  However, if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stop reading its own ring, then watch events
would still queue up without limit.

The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events.  In fact, dropping watch events is known to break
some pieces of normal functionality.  This leaves few options for implementing
the xenstore protocol safely without exposing the xenstore daemon to
out-of-memory attacks.

Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
  deliver

Items are queued up on one end and are sent as far along the pipe as possible:

  source domain -> watch -> xenbus of target -> xenstore ring/socket of target

If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up.  For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.

Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
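
The flush-then-enqueue behaviour can be sketched as follows.  This is a
simplified, self-contained illustration: the [pipe] type, the two-slot [ring]
and the capacities are assumptions chosen for the example, not the values or
types used by the patch:

```ocaml
(* A sender returns [Some ()] when it accepts an item, [None] when full. *)
type 'a sender = 'a -> unit option

type 'a pipe = { buf : 'a Queue.t; capacity : int; destination : 'a sender }

let create ~capacity ~destination =
  { buf = Queue.create (); capacity; destination }

(* Move items downstream, oldest first, until the destination refuses one. *)
let rec flush p =
  if not (Queue.is_empty p.buf) then
    match p.destination (Queue.peek p.buf) with
    | None -> ()
    | Some () -> ignore (Queue.pop p.buf); flush p

(* Flush first so older items leave before the new one, then enqueue. *)
let push p item =
  flush p;
  if Queue.length p.buf < p.capacity then begin
    Queue.push item p.buf;
    flush p;
    Some ()
  end else None

(* Hypothetical final stage: a "ring" that holds at most two items. *)
let ring : int Queue.t = Queue.create ()
let ring_send x =
  if Queue.length ring < 2 then (Queue.push x ring; Some ()) else None

(* source -> watch -> ring, each pipe buffering at most one item *)
let watch = create ~capacity:1 ~destination:ring_send
let source = create ~capacity:1 ~destination:(push watch)

let results = List.map (push source) [1; 2; 3; 4; 5]
```

In this sketch the first two items reach the ring, the third waits in the
watch's buffer, the fourth stays in the source's buffer, and the fifth is
refused.  A non-empty source buffer is the condition under which the real code
stops accepting new commands from that domain.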

We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue.  This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.

When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).

There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.

Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
  source will not be allowed to queue up more watch events.
* quota-maxoutstanding which is the number of responses not read from the ring:
  once exceeded, no more inputs are processed until all outstanding replies
  are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
  on destination are queued up here, a single command can trigger multiple
  watches (e.g. due to recursion).

The overflow queue currently doesn't have an upper bound: it is difficult to
accurately calculate one, as it depends on whether you are Dom0, how many
watches each path has registered, and how many watch events you can trigger
with a single command (e.g. a commit).  However, these events were already
using memory; this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.

Note that Dom0 is not excluded from these checks.  It is especially important
that Dom0 is not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.

This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.

The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.

Similarly, a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well, as long as it follows the PV
protocol.  (Although note that protection against malicious backends is only a
best effort at the moment.)

This is part of XSA-326 / CVE-2022-42318.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 4197a3888a68..b292ed7a874d 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -134,14 +134,44 @@ type backend = Fd of backend_fd | Xenmmap of backend_mmap
 
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 
+(*
+	separate capacity reservation for replies and watch events:
+	this allows a domain to keep working even when under a constant flood of
+	watch events
+*)
+type capacity = { maxoutstanding: int; maxwatchevents: int }
+
+module Queue = BoundedQueue
+
+type packet_class =
+	| CommandReply
+	| Watchevent
+
+let string_of_packet_class = function
+	| CommandReply -> "command_reply"
+	| Watchevent -> "watch_event"
+
 type t =
 {
 	backend: backend;
-	pkt_out: Packet.t Queue.t;
+	pkt_out: (Packet.t, packet_class) Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
+	capacity: capacity
 }
 
+let to_read con =
+	match con.partial_in with
+		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
+		| NoHdr   (i, _)    -> i
+
+let debug t =
+	Printf.sprintf "XenBus state: partial_in: %d needed, partial_out: %d bytes, pkt_out: %d packets, %s"
+		(to_read t)
+		(String.length t.partial_out)
+		(Queue.length t.pkt_out)
+		(BoundedQueue.debug string_of_packet_class t.pkt_out)
+
 let init_partial_in () = NoHdr
 	(Partial.header_size (), Bytes.make (Partial.header_size()) '\000')
 
@@ -199,7 +229,8 @@ let output con =
 	let s = if String.length con.partial_out > 0 then
 			con.partial_out
 		else if Queue.length con.pkt_out > 0 then
-			Packet.to_string (Queue.pop con.pkt_out)
+			let pkt = Queue.pop con.pkt_out in
+			Packet.to_string pkt
 		else
 			"" in
 	(* send data from s, and save the unsent data to partial_out *)
@@ -212,12 +243,15 @@ let output con =
 	(* after sending one packet, partial is empty *)
 	con.partial_out = ""
 
+(* we can only process an input packet if we're guaranteed to have room
+   to store the response packet *)
+let can_input con = Queue.can_push con.pkt_out CommandReply
+
 (* NB: can throw Reconnect *)
 let input con =
-	let to_read =
-		match con.partial_in with
-		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
-		| NoHdr   (i, _)    -> i in
+	if not (can_input con) then None
+	else
+	let to_read = to_read con in
 
 	(* try to get more data from input stream *)
 	let b = Bytes.make to_read '\000' in
@@ -243,11 +277,22 @@ let input con =
 		None
 	)
 
-let newcon backend = {
+let classify t =
+	match t.Packet.ty with
+	| Op.Watchevent -> Watchevent
+	| _ -> CommandReply
+
+let newcon ~capacity backend =
+	let limit = function
+		| CommandReply -> capacity.maxoutstanding
+		| Watchevent -> capacity.maxwatchevents
+	in
+	{
 	backend = backend;
-	pkt_out = Queue.create ();
+	pkt_out = Queue.create ~capacity:(capacity.maxoutstanding + capacity.maxwatchevents) ~classify ~limit;
 	partial_in = init_partial_in ();
 	partial_out = "";
+	capacity = capacity;
 	}
 
 let open_fd fd = newcon (Fd { fd = fd; })
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 91c682162cea..71b2754ca788 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,10 +66,11 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
+type capacity = { maxoutstanding: int; maxwatchevents: int }
 type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
-val queue : t -> Packet.t -> unit
+val queue : t -> Packet.t -> unit option
 val read_fd : backend_fd -> 'a -> bytes -> int -> int
 val read_mmap : backend_mmap -> 'a -> bytes -> int -> int
 val read : t -> bytes -> int -> int
@@ -78,13 +79,14 @@ val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
 val input : t -> Packet.t option
-val newcon : backend -> t
-val open_fd : Unix.file_descr -> t
-val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
+val newcon : capacity:capacity -> backend -> t
+val open_fd : Unix.file_descr -> capacity:capacity -> t
+val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> capacity:capacity -> t
 val close : t -> unit
 val is_fd : t -> bool
 val is_mmap : t -> bool
 val output_len : t -> int
+val can_input: t -> bool
 val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
@@ -93,3 +95,4 @@ val has_partial_input : t -> bool
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
+val debug: t -> string
diff --git a/tools/ocaml/libs/xs/queueop.ml b/tools/ocaml/libs/xs/queueop.ml
index 9ff5bbd529ce..4e532cdaeacb 100644
--- a/tools/ocaml/libs/xs/queueop.ml
+++ b/tools/ocaml/libs/xs/queueop.ml
@@ -16,9 +16,10 @@
 open Xenbus
 
 let data_concat ls = (String.concat "\000" ls) ^ "\000"
+let queue con pkt = let r = Xb.queue con pkt in assert (r <> None)
 let queue_path ty (tid: int) (path: string) con =
 	let data = data_concat [ path; ] in
-	Xb.queue con (Xb.Packet.create tid 0 ty data)
+	queue con (Xb.Packet.create tid 0 ty data)
 
 (* operations *)
 let directory tid path con = queue_path Xb.Op.Directory tid path con
@@ -27,48 +28,48 @@ let read tid path con = queue_path Xb.Op.Read tid path con
 let getperms tid path con = queue_path Xb.Op.Getperms tid path con
 
 let debug commands con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
 
 let watch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
 
 let unwatch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
 
 let transaction_start con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
 
 let transaction_end tid commit con =
 	let data = data_concat [ (if commit then "T" else "F"); ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
 
 let introduce domid mfn port con =
 	let data = data_concat [ Printf.sprintf "%u" domid;
 	                         Printf.sprintf "%nu" mfn;
 	                         string_of_int port; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
 
 let release domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
 
 let resume domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
 
 let getdomainpath domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
 
 let write tid path value con =
 	let data = path ^ "\000" ^ value (* no NULL at the end *) in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
 
 let mkdir tid path con = queue_path Xb.Op.Mkdir tid path con
 let rm tid path con = queue_path Xb.Op.Rm tid path con
 
 let setperms tid path perms con =
 	let data = data_concat [ path; perms ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index 451f8b38dbcc..cbd17280600c 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -36,8 +36,10 @@ type con = {
 let close con =
 	Xb.close con.xb
 
+let capacity = { Xb.maxoutstanding = 1; maxwatchevents = 0; }
+
 let open_fd fd = {
-	xb = Xb.open_fd fd;
+	xb = Xb.open_fd ~capacity fd;
 	watchevents = Queue.create ();
 }
 
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index cc20e047d2b9..9624a5f9da2c 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -20,12 +20,84 @@ open Stdext
 
 let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *)
 
+type 'a bounded_sender = 'a -> unit option
+(** a bounded sender accepts an ['a] item and returns:
+    None - if there is no room to accept the item
+    Some () -  if it has successfully accepted/sent the item
+ *)
+
+module BoundedPipe : sig
+	type 'a t
+
+	(** [create ~capacity ~destination] creates a bounded pipe with a
+	    local buffer holding at most [capacity] items.  Once the buffer is
+	    full it will not accept further items.  items from the pipe are
+	    flushed into [destination] as long as it accepts items.  The
+	    destination could be another pipe.
+	 *)
+	val create: capacity:int -> destination:'a bounded_sender -> 'a t
+
+	(** [is_empty t] returns whether the local buffer of [t] is empty. *)
+	val is_empty : _ t -> bool
+
+	(** [length t] the number of items in the internal buffer *)
+	val length: _ t -> int
+
+	(** [flush_pipe t] sends as many items from the local buffer as possible,
+			which could be none. *)
+	val flush_pipe: _ t -> unit
+
+	(** [push t item] tries to [flush_pipe] and then push [item]
+	    into the pipe if its [capacity] allows.
+	    Returns [None] if there is no more room
+	 *)
+	val push : 'a t -> 'a bounded_sender
+end = struct
+	(* items are enqueued in [q], and then flushed to [connect_to] *)
+	type 'a t =
+		{ q: 'a Queue.t
+		; destination: 'a bounded_sender
+		; capacity: int
+		}
+
+	let create ~capacity ~destination =
+		{ q = Queue.create (); capacity; destination }
+
+	let rec flush_pipe t =
+		if not Queue.(is_empty t.q) then
+			let item = Queue.peek t.q in
+			match t.destination item with
+			| None -> () (* no room *)
+			| Some () ->
+				(* successfully sent item to next stage *)
+				let _ = Queue.pop t.q in
+				(* continue trying to send more items *)
+				flush_pipe t
+
+	let push t item =
+		(* first try to flush as many items from this pipe as possible to make room,
+		   it is important to do this first to preserve the order of the items
+		 *)
+		flush_pipe t;
+		if Queue.length t.q < t.capacity then begin
+			(* enqueue, instead of sending directly.
+			   this ensures that [out] sees the items in the same order as we receive them
+			 *)
+			Queue.push item t.q;
+			Some (flush_pipe t)
+		end else None
+
+	let is_empty t = Queue.is_empty t.q
+	let length t = Queue.length t.q
+end
+
 type watch = {
 	con: t;
 	token: string;
 	path: string;
 	base: string;
 	is_relative: bool;
+	pending_watchevents: Xenbus.Xb.Packet.t BoundedPipe.t;
 }
 
 and t = {
@@ -38,8 +110,36 @@ and t = {
 	anonid: int;
 	mutable stat_nb_ops: int;
 	mutable perm: Perms.Connection.t;
+	pending_source_watchevents: (watch * Xenbus.Xb.Packet.t) BoundedPipe.t
 }
 
+module Watch = struct
+	module T = struct
+		type t = watch
+
+		let compare w1 w2 =
+			(* cannot compare watches from different connections *)
+			assert (w1.con == w2.con);
+			match String.compare w1.token w2.token with
+			| 0 -> String.compare w1.path w2.path
+			| n -> n
+	end
+	module Set = Set.Make(T)
+
+	let flush_events t =
+		BoundedPipe.flush_pipe t.pending_watchevents;
+		not (BoundedPipe.is_empty t.pending_watchevents)
+
+	let pending_watchevents t =
+		BoundedPipe.length t.pending_watchevents
+end
+
+let source_flush_watchevents t =
+	BoundedPipe.flush_pipe t.pending_source_watchevents
+
+let source_pending_watchevents t =
+	BoundedPipe.length t.pending_source_watchevents
+
 let mark_as_bad con =
 	match con.dom with
 	|None -> ()
@@ -67,7 +167,8 @@ let watch_create ~con ~path ~token = {
 	token = token;
 	path = path;
 	base = get_path con;
-	is_relative = path.[0] <> '/' && path.[0] <> '@'
+	is_relative = path.[0] <> '/' && path.[0] <> '@';
+	pending_watchevents = BoundedPipe.create ~capacity:!Define.maxwatchevents ~destination:(Xenbus.Xb.queue con.xb)
 }
 
 let get_con w = w.con
@@ -93,6 +194,9 @@ let make_perm dom =
 	Perms.Connection.create ~perms:[Perms.READ; Perms.WRITE] domid
 
 let create xbcon dom =
+	let destination (watch, pkt) =
+		BoundedPipe.push watch.pending_watchevents pkt
+	in
 	let id =
 		match dom with
 		| None -> let old = !anon_id_next in incr anon_id_next; old
@@ -109,6 +213,16 @@ let create xbcon dom =
 	anonid = id;
 	stat_nb_ops = 0;
 	perm = make_perm dom;
+
+	(* the actual capacity will be lower, this is used as an overflow
+	   buffer: anything that doesn't fit elsewhere gets put here, only
+	   limited by the amount of watches that you can generate with a
+	   single xenstore command (which is finite, although possibly very
+	   large in theory for Dom0).  Once the pipe here has any contents the
+	   domain is blocked from sending more commands until it is empty
+	   again though.
+	 *)
+	pending_source_watchevents = BoundedPipe.create ~capacity:Sys.max_array_length ~destination
 	}
 	in
 	Logging.new_connection ~tid:Transaction.none ~con:(get_domstr con);
@@ -127,11 +241,17 @@ let set_target con target_domid =
 
 let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
-let send_reply con tid rid ty data =
+let packet_of con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000")
+		Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000"
 	else
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid ty data)
+		Xenbus.Xb.Packet.create tid rid ty data
+
+let send_reply con tid rid ty data =
+	let result = Xenbus.Xb.queue con.xb (packet_of con tid rid ty data) in
+	(* should never happen: we only process an input packet when there is room for an output packet *)
+	(* and the limit for replies is different from the limit for watch events *)
+	assert (result <> None)
 
 let send_error con tid rid err = send_reply con tid rid Xenbus.Xb.Op.Error (err ^ "\000")
 let send_ack con tid rid ty = send_reply con tid rid ty "OK\000"
@@ -181,11 +301,11 @@ let del_watch con path token =
 	apath, w
 
 let del_watches con =
-  Hashtbl.clear con.watches;
+  Hashtbl.reset con.watches;
   con.nb_watches <- 0
 
 let del_transactions con =
-  Hashtbl.clear con.transactions
+  Hashtbl.reset con.transactions
 
 let list_watches con =
 	let ll = Hashtbl.fold
@@ -208,21 +328,29 @@ let lookup_watch_perm path = function
 let lookup_watch_perms oldroot root path =
 	lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root)
 
-let fire_single_watch_unchecked watch =
+let fire_single_watch_unchecked source watch =
 	let data = Utils.join_by_null [watch.path; watch.token; ""] in
-	send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data
+	let pkt = packet_of watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data in
+
+	match BoundedPipe.push source.pending_source_watchevents (watch, pkt) with
+	| Some () -> () (* packet queued *)
+	| None ->
+			(* a well behaved Dom0 shouldn't be able to trigger this,
+			   if it happens it is likely a Dom0 bug causing runaway memory usage
+			 *)
+			failwith "watch event overflow, cannot happen"
 
-let fire_single_watch (oldroot, root) watch =
+let fire_single_watch source (oldroot, root) watch =
 	let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in
 	let perms = lookup_watch_perms oldroot root abspath in
 	if Perms.can_fire_watch watch.con.perm perms then
-		fire_single_watch_unchecked watch
+		fire_single_watch_unchecked source watch
 	else
 		let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in
 		let con = get_domstr watch.con in
 		Logging.watch_not_fired ~con perms (Store.Path.to_string abspath)
 
-let fire_watch roots watch path =
+let fire_watch source roots watch path =
 	let new_path =
 		if watch.is_relative && path.[0] = '/'
 		then begin
@@ -232,7 +360,7 @@ let fire_watch roots watch path =
 		end else
 			path
 	in
-	fire_single_watch roots { watch with path = new_path }
+	fire_single_watch source roots { watch with path = new_path }
 
 (* Search for a valid unused transaction id. *)
 let rec valid_transaction_id con proposed_id =
@@ -280,6 +408,7 @@ let do_input con = Xenbus.Xb.input con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
+let can_input con = Xenbus.Xb.can_input con.xb && BoundedPipe.is_empty con.pending_source_watchevents
 let has_output con = Xenbus.Xb.has_output con.xb
 let has_old_output con = Xenbus.Xb.has_old_output con.xb
 let has_new_output con = Xenbus.Xb.has_new_output con.xb
@@ -323,7 +452,7 @@ let prevents_live_update con = not (is_bad con)
 	&& (has_extra_connection_data con || has_transaction_data con)
 
 let has_more_work con =
-	has_more_input con || not (has_old_output con) && has_new_output con
+	(has_more_input con && can_input con) || not (has_old_output con) && has_new_output con
 
 let incr_ops con = con.stat_nb_ops <- con.stat_nb_ops + 1
 
diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml
index 3c7429fe7f61..7d68c583b43a 100644
--- a/tools/ocaml/xenstored/connections.ml
+++ b/tools/ocaml/xenstored/connections.ml
@@ -22,22 +22,30 @@ type t = {
 	domains: (int, Connection.t) Hashtbl.t;
 	ports: (Xeneventchn.t, Connection.t) Hashtbl.t;
 	mutable watches: Connection.watch list Trie.t;
+	mutable has_pending_watchevents: Connection.Watch.Set.t
 }
 
 let create () = {
 	anonymous = Hashtbl.create 37;
 	domains = Hashtbl.create 37;
 	ports = Hashtbl.create 37;
-	watches = Trie.create ()
+	watches = Trie.create ();
+	has_pending_watchevents = Connection.Watch.Set.empty;
 }
 
+let get_capacity () =
+	(* not multiplied by maxwatch on purpose: 2nd queue in watch itself! *)
+	{ Xenbus.Xb.maxoutstanding = !Define.maxoutstanding; maxwatchevents = !Define.maxwatchevents }
+
 let add_anonymous cons fd =
-	let xbcon = Xenbus.Xb.open_fd fd in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_fd fd ~capacity in
 	let con = Connection.create xbcon None in
 	Hashtbl.add cons.anonymous (Xenbus.Xb.get_fd xbcon) con
 
 let add_domain cons dom =
-	let xbcon = Xenbus.Xb.open_mmap (Domain.get_interface dom) (fun () -> Domain.notify dom) in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_mmap ~capacity (Domain.get_interface dom) (fun () -> Domain.notify dom) in
 	let con = Connection.create xbcon (Some dom) in
 	Hashtbl.add cons.domains (Domain.get_id dom) con;
 	match Domain.get_port dom with
@@ -48,7 +56,9 @@ let select ?(only_if = (fun _ -> true)) cons =
 	Hashtbl.fold (fun _ con (ins, outs) ->
 		if (only_if con) then (
 			let fd = Connection.get_fd con in
-			(fd :: ins,  if Connection.has_output con then fd :: outs else outs)
+			let in_fds = if Connection.can_input con then fd :: ins else ins in
+			let out_fds = if Connection.has_output con then fd :: outs else outs in
+			in_fds, out_fds
 		) else (ins, outs)
 	)
 	cons.anonymous ([], [])
@@ -67,10 +77,17 @@ let del_watches_of_con con watches =
 	| [] -> None
 	| ws -> Some ws
 
+let del_watches cons con =
+	Connection.del_watches con;
+	cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter @@ fun w ->
+		Connection.get_con w != con
+
 let del_anonymous cons con =
 	try
 		Hashtbl.remove cons.anonymous (Connection.get_fd con);
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del anonymous %s" (Printexc.to_string exn)
@@ -85,7 +102,7 @@ let del_domain cons id =
 		    | Some p -> Hashtbl.remove cons.ports p
 		    | None -> ())
 		 | None -> ());
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del domain %u: %s" id (Printexc.to_string exn)
@@ -136,31 +153,33 @@ let del_watch cons con path token =
 		cons.watches <- Trie.set cons.watches key watches;
  	watch
 
-let del_watches cons con =
-	Connection.del_watches con;
-	cons.watches <- Trie.map (del_watches_of_con con) cons.watches
-
 (* path is absolute *)
-let fire_watches ?oldroot root cons path recurse =
+let fire_watches ?oldroot source root cons path recurse =
 	let key = key_of_path path in
 	let path = Store.Path.to_string path in
 	let roots = oldroot, root in
 	let fire_watch _ = function
 		| None         -> ()
-		| Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches
+		| Some watches -> List.iter (fun w -> Connection.fire_watch source roots w path) watches
 	in
 	let fire_rec _x = function
 		| None         -> ()
 		| Some watches ->
-			List.iter (Connection.fire_single_watch roots) watches
+			List.iter (Connection.fire_single_watch source roots) watches
 	in
 	Trie.iter_path fire_watch cons.watches key;
 	if recurse then
 		Trie.iter fire_rec (Trie.sub cons.watches key)
 
+let send_watchevents cons con =
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter Connection.Watch.flush_events;
+	Connection.source_flush_watchevents con
+
 let fire_spec_watches root cons specpath =
+	let source = find_domain cons 0 in
 	iter cons (fun con ->
-		List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath))
+		List.iter (Connection.fire_single_watch source (None, root)) (Connection.get_watches con specpath))
 
 let set_target cons domain target_domain =
 	let con = find_domain cons domain in
@@ -197,6 +216,16 @@ let debug cons =
 	let domains = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.domains [] in
 	String.concat "" (domains @ anonymous)
 
+let debug_watchevents cons con =
+	(* == (physical equality)
+	   has to be used here because w.con.xb.backend might contain a [unit->unit] value causing regular
+	   comparison to fail due to having a 'functional value' which cannot be compared.
+	 *)
+	let s = cons.has_pending_watchevents |> Connection.Watch.Set.filter (fun w -> w.con == con) in
+	let pending = s |> Connection.Watch.Set.elements
+		|> List.map (fun w -> Connection.Watch.pending_watchevents w) |> List.fold_left (+) 0 in
+	Printf.sprintf "Watches with pending events: %d, pending events total: %d" (Connection.Watch.Set.cardinal s) pending
+
 let filter ~f cons =
 	let fold _ v acc = if f v then v :: acc else acc in
 	[]
diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ba63a8147e09..327b6d795ec7 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -24,6 +24,13 @@ let default_config_dir = Paths.xen_config_dir
 let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
+let maxoutstanding = ref (1024) (* maximum outstanding requests, i.e. in-flight requests / domain *)
+let maxwatchevents = ref (1024)
+(*
+	maximum outstanding watch events per watch,
+	recommended >= maxoutstanding to avoid blocking backend transactions due to
+	malicious frontends
+ *)
 
 let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in
index 4ae48e42d47d..9d034e744b4b 100644
--- a/tools/ocaml/xenstored/oxenstored.conf.in
+++ b/tools/ocaml/xenstored/oxenstored.conf.in
@@ -62,6 +62,8 @@ quota-maxwatch = 100
 quota-transaction = 10
 quota-maxrequests = 1024
 quota-path-max = 1024
+quota-maxoutstanding = 1024
+quota-maxwatchevents = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 2d67456a2aa0..6dcedfda86e4 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -57,7 +57,7 @@ let split_one_path data con =
 	| path :: "" :: [] -> Store.Path.create path (Connection.get_path con)
 	| _                -> raise Invalid_Cmd_Args
 
-let process_watch t cons =
+let process_watch source t cons =
 	let oldroot = t.Transaction.oldroot in
 	let newroot = Store.get_root t.store in
 	let ops = Transaction.get_paths t |> List.rev in
@@ -67,8 +67,9 @@ let process_watch t cons =
 		| Xenbus.Xb.Op.Rm       -> true, None, oldroot
 		| Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot
 		| _              -> raise (Failure "huh ?") in
-		Connections.fire_watches ?oldroot root cons (snd op) recurse in
-	List.iter (fun op -> do_op_watch op cons) ops
+		Connections.fire_watches ?oldroot source root cons (snd op) recurse in
+	List.iter (fun op -> do_op_watch op cons) ops;
+	Connections.send_watchevents cons source
 
 let create_implicit_path t perm path =
 	let dirname = Store.Path.get_parent path in
@@ -234,6 +235,20 @@ let do_debug con t _domains cons data =
 	| "watches" :: _ ->
 		let watches = Connections.debug cons in
 		Some (watches ^ "\000")
+	| "xenbus" :: domid :: _ ->
+		let domid = int_of_string domid in
+		let con = Connections.find_domain cons domid in
+		let s = Printf.sprintf "xenbus: %s; overflow queue length: %d, can_input: %b, has_more_input: %b, has_old_output: %b, has_new_output: %b, has_more_work: %b. pending: %s"
+			(Xenbus.Xb.debug con.xb)
+			(Connection.source_pending_watchevents con)
+			(Connection.can_input con)
+			(Connection.has_more_input con)
+			(Connection.has_old_output con)
+			(Connection.has_new_output con)
+			(Connection.has_more_work con)
+			(Connections.debug_watchevents cons con)
+		in
+		Some s
 	| "mfn" :: domid :: _ ->
 		let domid = int_of_string domid in
 		let con = Connections.find_domain cons domid in
@@ -342,7 +357,7 @@ let reply_ack fct con t doms cons data =
 	fct con t doms cons data;
 	Packet.Ack (fun () ->
 		if Transaction.get_id t = Transaction.none then
-			process_watch t cons
+			process_watch con t cons
 	)
 
 let reply_data fct con t doms cons data =
@@ -501,7 +516,7 @@ let do_watch con t _domains cons data =
 	Packet.Ack (fun () ->
 		(* xenstore.txt says this watch is fired immediately,
 		   implying even if path doesn't exist or is unreadable *)
-		Connection.fire_single_watch_unchecked watch)
+		Connection.fire_single_watch_unchecked con watch)
 
 let do_unwatch con _t _domains cons data =
 	let (node, token) =
@@ -532,7 +547,7 @@ let do_transaction_end con t domains cons data =
 	if not success then
 		raise Transaction_again;
 	if commit then begin
-		process_watch t cons;
+		process_watch con t cons;
 		match t.Transaction.ty with
 		| Transaction.No ->
 			() (* no need to record anything *)
@@ -699,7 +714,8 @@ let process_packet ~store ~cons ~doms ~con ~req =
 let do_input store cons doms con =
 	let newpacket =
 		try
-			Connection.do_input con
+			if Connection.can_input con then Connection.do_input con
+			else None
 		with Xenbus.Xb.Reconnect ->
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
@@ -727,6 +743,7 @@ let do_input store cons doms con =
 		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
+	Connection.source_flush_watchevents con;
 	if Connection.has_output con then (
 		if Connection.has_new_output con then (
 			let packet = Connection.peek_output con in
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 3b57ad016dfb..c799e20f1145 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -103,6 +103,8 @@ let parse_config filename =
 		("quota-maxentity", Config.Set_int Quota.maxent);
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
+		("quota-maxoutstanding", Config.Set_int Define.maxoutstanding);
+		("quota-maxwatchevents", Config.Set_int Define.maxwatchevents);
 		("quota-path-max", Config.Set_int Define.path_max);
 		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
From 8eba0bab9c36c04d924f9fe97b1fa264fe23f19e Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 29 Sep 2022 13:07:35 +0200
Subject: SUPPORT.md: clarify support of untrusted driver domains with
 oxenstored

Add a support statement for the scope of support regarding different
Xenstore variants. In particular, oxenstored does not (yet) have
security support for untrusted driver domains, as those might drive
oxenstored out of memory by creating lots of watch events for the
guests they are servicing.

Add a statement regarding Live Update support of oxenstored.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index 0fb262f81f40..48fb462221cf 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -179,13 +179,18 @@ Support for running qemu-xen device model in a linux stubdomain.
 
     Status: Tech Preview
 
-## Liveupdate of C xenstored daemon
+## Xenstore
 
-    Status: Tech Preview
+### C xenstored daemon
 
-## Liveupdate of OCaml xenstored daemon
+    Status: Supported
+    Status, Liveupdate: Tech Preview
 
-    Status: Tech Preview
+### OCaml xenstored daemon
+
+    Status: Supported
+    Status, untrusted driver domains: Supported, not security supported
+    Status, Liveupdate: Not functional
 
 ## Toolstack/3rd party
 
From bb22709d94fa98f5a2abba4eeeba41ef09753f8e Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: split up send_reply()

Today send_reply() is used for both normal request replies and watch
events.

Split it up into send_reply() and send_event(). This will be used to
add some event-specific handling.

add_event() can be merged into send_event(), removing the need for an
intermediate memory allocation.
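
As an illustration of the single-allocation idea, the following is a
minimal standalone sketch, not the xenstored code: it builds the
"path\0token\0" payload of a watch event directly in its final buffer.
The names sketch_pack_event() and SKETCH_PAYLOAD_MAX are hypothetical,
malloc() stands in for the talloc allocator used by xenstored, and the
limit of 4096 is an assumed stand-in for XENSTORE_PAYLOAD_MAX.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed stand-in for XENSTORE_PAYLOAD_MAX (hypothetical name). */
#define SKETCH_PAYLOAD_MAX 4096

/*
 * Build the "path\0token\0" payload of a watch event in one allocation.
 * Returns NULL if the event would be over-long or the allocation fails.
 * On success *len holds the payload length including both NUL bytes and
 * the caller owns the returned buffer.
 */
char *sketch_pack_event(const char *path, const char *token, size_t *len)
{
	size_t path_len = strlen(path) + 1;	/* include terminating NUL */
	size_t token_len = strlen(token) + 1;	/* include terminating NUL */
	char *buf;

	*len = path_len + token_len;
	/* Don't try to build over-long events. */
	if (*len > SKETCH_PAYLOAD_MAX)
		return NULL;

	buf = malloc(*len);
	if (!buf)
		return NULL;

	memcpy(buf, path, path_len);
	memcpy(buf + path_len, token, token_len);

	return buf;
}
```

Because the payload is written straight into the buffer that gets
queued, no temporary copy has to be allocated and freed per event.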

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index b28c2c66b53b..01d4a2e440ec 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -733,49 +733,32 @@ static void send_error(struct connection *conn, int error)
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata = conn->in;
+
+	assert(type != XS_WATCH_EVENT);
 
 	if ( len > XENSTORE_PAYLOAD_MAX ) {
 		send_error(conn, E2BIG);
 		return;
 	}
 
-	/* Replies reuse the request buffer, events need a new one. */
-	if (type != XS_WATCH_EVENT) {
-		bdata = conn->in;
-		/* Drop asynchronous responses, e.g. errors for watch events. */
-		if (!bdata)
-			return;
-		bdata->inhdr = true;
-		bdata->used = 0;
-		conn->in = NULL;
-	} else {
-		/* Message is a child of the connection for auto-cleanup. */
-		bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+	bdata->inhdr = true;
+	bdata->used = 0;
 
-		/*
-		 * Allocation failure here is unfortunate: we have no way to
-		 * tell anybody about it.
-		 */
-		if (!bdata)
-			return;
-	}
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
-	else
+	else {
 		bdata->buffer = talloc_array(bdata, char, len);
-	if (!bdata->buffer) {
-		if (type == XS_WATCH_EVENT) {
-			/* Same as above: no way to tell someone. */
-			talloc_free(bdata);
+		if (!bdata->buffer) {
+			send_error(conn, ENOMEM);
 			return;
 		}
-		/* re-establish request buffer for sending ENOMEM. */
-		conn->in = bdata;
-		send_error(conn, ENOMEM);
-		return;
 	}
 
+	conn->in = NULL;
+
 	/* Update relevant header fields and fill in the message body. */
 	bdata->hdr.msg.type = type;
 	bdata->hdr.msg.len = len;
@@ -783,8 +766,39 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+}
 
-	return;
+/*
+ * Send a watch event.
+ * As this is not directly related to the current command, errors can't be
+ * reported.
+ */
+void send_event(struct connection *conn, const char *path, const char *token)
+{
+	struct buffered_data *bdata;
+	unsigned int len;
+
+	len = strlen(path) + 1 + strlen(token) + 1;
+	/* Don't try to send over-long events. */
+	if (len > XENSTORE_PAYLOAD_MAX)
+		return;
+
+	bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+
+	bdata->buffer = talloc_array(bdata, char, len);
+	if (!bdata->buffer) {
+		talloc_free(bdata);
+		return;
+	}
+	strcpy(bdata->buffer, path);
+	strcpy(bdata->buffer + strlen(path) + 1, token);
+	bdata->hdr.msg.type = XS_WATCH_EVENT;
+	bdata->hdr.msg.len = len;
+
+	/* Queue for later transmission. */
+	list_add_tail(&bdata->list, &conn->out_list);
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 900336afa426..38d97fa081a6 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -180,6 +180,7 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
+void send_event(struct connection *conn, const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index db89e0141fce..a116f967dc66 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -86,35 +86,6 @@ static const char *get_watch_path(const struct watch *watch, const char *name)
 }
 
 /*
- * Send a watch event.
- * Temporary memory allocations are done with ctx.
- */
-static void add_event(struct connection *conn,
-		      const void *ctx,
-		      struct watch *watch,
-		      const char *name)
-{
-	/* Data to send (node\0token\0). */
-	unsigned int len;
-	char *data;
-
-	name = get_watch_path(watch, name);
-
-	len = strlen(name) + 1 + strlen(watch->token) + 1;
-	/* Don't try to send over-long events. */
-	if (len > XENSTORE_PAYLOAD_MAX)
-		return;
-
-	data = talloc_array(ctx, char, len);
-	if (!data)
-		return;
-	strcpy(data, name);
-	strcpy(data + strlen(name) + 1, watch->token);
-	send_reply(conn, XS_WATCH_EVENT, data, len);
-	talloc_free(data);
-}
-
-/*
  * Check permissions of a specific watch to fire:
  * Either the node itself or its parent have to be readable by the connection
  * the watch has been setup for. In case a watch event is created due to
@@ -190,10 +161,14 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			}
 		}
 	}
@@ -292,7 +267,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	send_ack(conn, XS_WATCH);
 
 	/* We fire once up front: simplifies clients and restart. */
-	add_event(conn, in, watch, watch->node);
+	send_event(conn, get_watch_path(watch, watch->node), watch->token);
 
 	return 0;
 }
From 6af15525260ddd8f78f75338b2ca97b4f6815dfb Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: add helpers to free struct buffered_data

Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.

This avoids duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.
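
The shape of the two helpers can be sketched in isolation as follows.
This is a simplified illustration under stated assumptions, not the
xenstored implementation: all sketch_* names are hypothetical, a plain
singly linked list replaces Xen's list helpers, and malloc()/free()
replace talloc. The "freed" counter only stands in for whatever extra
per-buffer action might be added later.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct buffered_data. */
struct sketch_buf {
	struct sketch_buf *next;
};

/* Hypothetical, simplified stand-in for struct connection. */
struct sketch_conn {
	struct sketch_buf *out_head;	/* first queued buffer */
	unsigned int freed;		/* per-buffer bookkeeping, one place */
};

/* Queue a new buffer at the head; returns 0 on success, -1 on ENOMEM. */
int sketch_queue(struct sketch_conn *conn)
{
	struct sketch_buf *buf = malloc(sizeof(*buf));

	if (!buf)
		return -1;
	buf->next = conn->out_head;
	conn->out_head = buf;

	return 0;
}

/* Unlink and free the first queued buffer, if any. */
void sketch_free_one(struct sketch_conn *conn)
{
	struct sketch_buf *out = conn->out_head;

	if (!out)
		return;
	conn->out_head = out->next;
	conn->freed++;
	free(out);
}

/* Free all queued buffers by reusing the single-buffer helper. */
void sketch_free_all(struct sketch_conn *conn)
{
	while (conn->out_head)
		sketch_free_one(conn);
}
```

Any action that must accompany freeing a buffer then only needs to be
added to the single-buffer helper.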

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 01d4a2e440ec..6498bf603666 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -211,6 +211,21 @@ void reopen_log(void)
 	}
 }
 
+static void free_buffered_data(struct buffered_data *out,
+			       struct connection *conn)
+{
+	list_del(&out->list);
+	talloc_free(out);
+}
+
+void conn_free_buffered_data(struct connection *conn)
+{
+	struct buffered_data *out;
+
+	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
+		free_buffered_data(out, conn);
+}
+
 static bool write_messages(struct connection *conn)
 {
 	int ret;
@@ -254,8 +269,7 @@ static bool write_messages(struct connection *conn)
 
 	trace_io(conn, out, 1);
 
-	list_del(&out->list);
-	talloc_free(out);
+	free_buffered_data(out, conn);
 
 	return true;
 }
@@ -1472,18 +1486,12 @@ static struct {
  */
 static void ignore_connection(struct connection *conn)
 {
-	struct buffered_data *out, *tmp;
-
 	trace("CONN %p ignored\n", conn);
 
 	conn->is_ignored = true;
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 	conn->in = NULL;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 38d97fa081a6..0ba5b783d4d1 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -270,6 +270,8 @@ int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
 
+void conn_free_buffered_data(struct connection *conn);
+
 const char *dump_state_global(FILE *fp);
 const char *dump_state_buffered_data(FILE *fp, const struct connection *c,
 				     const struct connection *conn,
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 3d4d0649a243..72a5cd3b9aaf 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -417,15 +417,10 @@ static struct domain *find_domain_by_domid(unsigned int domid)
 static void domain_conn_reset(struct domain *domain)
 {
 	struct connection *conn = domain->conn;
-	struct buffered_data *out;
 
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	while ((out = list_top(&conn->out_list, struct buffered_data, list))) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 
From cdc3747676b1e5ea726729f8865031bf0f764778 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: reduce number of watch events

When removing a watched node outside of a transaction, two watch events
are being produced instead of just a single one.

When finalizing a transaction, watch events can be generated for each
node that is modified, even if the same modification outside a
transaction might not have resulted in a watch event.

This happens e.g.:

- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
  is implicitly creating a/b, resulting in watch events for a, a/b and
  a/b/c instead of a/b/c only)

Avoid these additional watch events, in order to reduce the needed
memory inside Xenstore for queueing them.

This is achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can be set from
fire_watches() instead of only being implied.
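
The way the two flags combine when the same node is touched several
times inside one transaction can be sketched as below. The struct and
function names are hypothetical; only the merge rule (a non-exact event
replaces an exact one, but not the other way round) is taken from
queue_watches() in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two flags in struct accessed_node. */
struct sketch_flags {
	bool fire_watch;	/* an event should be sent at commit */
	bool watch_exact;	/* only watches on exactly this path fire */
};

/*
 * Record that a watch event is wanted for a node. The first request
 * sets both flags. Later requests can only widen the match from exact
 * to non-exact, never narrow it again.
 */
void sketch_queue_watch(struct sketch_flags *f, bool watch_exact)
{
	if (!f->fire_watch) {
		f->fire_watch = true;
		f->watch_exact = watch_exact;
	} else if (!watch_exact) {
		f->watch_exact = false;
	}
}
```

With this rule each modified node yields at most one fire_watches()
call at commit time, using the widest match that was requested.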

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 6498bf603666..5157a7527f58 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1261,7 +1261,7 @@ static void delete_child(struct connection *conn,
 }
 
 static int delete_node(struct connection *conn, const void *ctx,
-		       struct node *parent, struct node *node)
+		       struct node *parent, struct node *node, bool watch_exact)
 {
 	char *name;
 
@@ -1273,7 +1273,7 @@ static int delete_node(struct connection *conn, const void *ctx,
 				       node->children);
 		child = name ? read_node(conn, node, name) : NULL;
 		if (child) {
-			if (delete_node(conn, ctx, node, child))
+			if (delete_node(conn, ctx, node, child, true))
 				return errno;
 		} else {
 			trace("delete_node: Error deleting child '%s/%s'!\n",
@@ -1285,7 +1285,12 @@ static int delete_node(struct connection *conn, const void *ctx,
 		talloc_free(name);
 	}
 
-	fire_watches(conn, ctx, node->name, node, true, NULL);
+	/*
+	 * Fire the watches now, when we can still see the node permissions.
+	 * This is fine as we are single threaded and the next possible read
+	 * will be handled only after the node has been really removed.
+	 */
+	fire_watches(conn, ctx, node->name, node, watch_exact, NULL);
 	delete_node_single(conn, node);
 	delete_child(conn, parent, basename(node->name));
 	talloc_free(node);
@@ -1311,13 +1316,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 		return (errno == ENOMEM) ? ENOMEM : EINVAL;
 	node->parent = parent;
 
-	/*
-	 * Fire the watches now, when we can still see the node permissions.
-	 * This fine as we are single threaded and the next possible read will
-	 * be handled only after the node has been really removed.
-	 */
-	fire_watches(conn, ctx, name, node, false, NULL);
-	return delete_node(conn, ctx, parent, node);
+	return delete_node(conn, ctx, parent, node, false);
 }
 
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index faf6c930e42a..54432907fc76 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -130,6 +130,10 @@ struct accessed_node
 
 	/* Transaction node in data base? */
 	bool ta_node;
+
+	/* Watch event flags. */
+	bool fire_watch;
+	bool watch_exact;
 };
 
 struct changed_domain
@@ -324,6 +328,29 @@ int access_node(struct connection *conn, struct node *node,
 }
 
 /*
+ * A watch event should be fired for a node modified inside a transaction.
+ * Set the corresponding information. A non-exact event is replacing an exact
+ * one, but not the other way round.
+ */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact)
+{
+	struct accessed_node *i;
+
+	i = find_accessed_node(conn->transaction, name);
+	if (!i) {
+		conn->transaction->fail = true;
+		return;
+	}
+
+	if (!i->fire_watch) {
+		i->fire_watch = true;
+		i->watch_exact = watch_exact;
+	} else if (!watch_exact) {
+		i->watch_exact = false;
+	}
+}
+
+/*
  * Finalize transaction:
  * Walk through accessed nodes and check generation against global data.
  * If all entries match, read the transaction entries and write them without
@@ -377,15 +404,15 @@ static int finalize_transaction(struct connection *conn,
 				ret = tdb_store(tdb_ctx, key, data,
 						TDB_REPLACE);
 				talloc_free(data.dptr);
-				if (ret)
-					goto err;
-				fire_watches(conn, trans, i->node, NULL, false,
-					     i->perms.p ? &i->perms : NULL);
 			} else {
-				fire_watches(conn, trans, i->node, NULL, false,
+				ret = tdb_delete(tdb_ctx, key);
+			}
+			if (ret)
+				goto err;
+			if (i->fire_watch) {
+				fire_watches(conn, trans, i->node, NULL,
+					     i->watch_exact,
 					     i->perms.p ? &i->perms : NULL);
-				if (tdb_delete(tdb_ctx, key))
-					goto err;
 			}
 		}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 14062730e3c9..0093cac807e3 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -42,6 +42,9 @@ void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 int access_node(struct connection *conn, struct node *node,
                 enum node_access_type type, TDB_DATA *key);
 
+/* Queue watches for a modified node. */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact);
+
 /* Prepend the transaction to name if appropriate. */
 int transaction_prepend(struct connection *conn, const char *name,
                         TDB_DATA *key);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index a116f967dc66..bc6d833028a3 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -29,6 +29,7 @@
 #include "xenstore_lib.h"
 #include "utils.h"
 #include "xenstored_domain.h"
+#include "xenstored_transaction.h"
 
 extern int quota_nb_watch_per_domain;
 
@@ -143,9 +144,11 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 	struct connection *i;
 	struct watch *watch;
 
-	/* During transactions, don't fire watches. */
-	if (conn && conn->transaction)
+	/* During transactions, don't fire watches, but queue them. */
+	if (conn && conn->transaction) {
+		queue_watches(conn, name, exact);
 		return;
+	}
 
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
From 43dd7e4d3952e4f2100e6f04de2b9febb2c5c50a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: let unread watch events time out

A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.

To prevent a malicious guest from blocking other guests by not reading
watch events, add a timeout for watch events. If a watch event hasn't
been consumed when this timeout expires, it is deleted. Set the default
timeout to 20 seconds (an arbitrary value that is not too high).
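
The deadline arithmetic behind such a timeout can be sketched as
follows. This is an illustration under stated assumptions, not the
patch itself: the sketch_* names are hypothetical, times are plain
millisecond counters supplied by the caller, and a deadline of 0 means
"no timeout", which mirrors the convention used for timeout_msec.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Absolute deadline for an event queued at now_msec; 0 = no timeout. */
uint64_t sketch_deadline(uint64_t now_msec, unsigned int timeout_msec)
{
	return timeout_msec ? now_msec + timeout_msec : 0;
}

/* True if the event has a deadline and it has been reached. */
bool sketch_expired(uint64_t deadline_msec, uint64_t now_msec)
{
	return deadline_msec && deadline_msec <= now_msec;
}

/*
 * Merge a pending deadline into a poll()-style timeout, where -1 means
 * "wait forever". Returns the smaller of the current timeout and the
 * time left until the deadline.
 */
int sketch_poll_timeout(uint64_t deadline_msec, uint64_t now_msec, int cur)
{
	uint64_t delta;

	if (!deadline_msec || deadline_msec <= now_msec)
		return cur;

	delta = deadline_msec - now_msec;
	if (delta > INT_MAX)
		delta = INT_MAX;	/* clamp to what poll() can express */

	if (cur == -1 || (uint64_t)cur > delta)
		return (int)delta;

	return cur;
}
```

Using a monotonic clock for now_msec, as the patch does with
CLOCK_MONOTONIC, keeps the deadlines unaffected by wall-clock changes.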

To allow other timeout values to be specified in future, use a
generic command line option for that purpose:

--timeout|-w watch-event=<seconds>
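
Parsing an argument of this form can be sketched as below. The function
sketch_parse_timeout() is hypothetical and differs from the patch in
two deliberate ways: it reports bad input through its return value
instead of calling barf(), and it rejects values whose conversion to
milliseconds would not fit into an int, which is a choice made for this
sketch only.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse "watch-event=<seconds>" and store the value in milliseconds.
 * Returns false for an unknown <what>, a missing '=', an empty or
 * non-numeric value, a negative value, or a value too large to convert.
 */
bool sketch_parse_timeout(const char *arg, unsigned int *msec)
{
	static const char what[] = "watch-event";
	size_t what_len = sizeof(what) - 1;
	char *end;
	long val;

	if (strncmp(arg, what, what_len) || arg[what_len] != '=')
		return false;

	arg += what_len + 1;	/* skip "watch-event=" */
	val = strtol(arg, &end, 10);
	if (!*arg || *end || val < 0 || val > INT_MAX / 1000)
		return false;

	*msec = (unsigned int)val * 1000;

	return true;
}
```

A value of 0 is accepted and yields a timeout of 0 msec, which in the
patch means that watch events do not time out at all.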

This is part of XSA-326 / CVE-2022-42311.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 5157a7527f58..ee3396fefa94 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -108,6 +108,8 @@ int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 
+unsigned int timeout_watch_event_msec = 20000;
+
 void trace(const char *fmt, ...)
 {
 	va_list arglist;
@@ -211,19 +213,92 @@ void reopen_log(void)
 	}
 }
 
+static uint64_t get_now_msec(void)
+{
+	struct timespec now_ts;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now_ts))
+		barf_perror("Could not find time (clock_gettime failed)");
+
+	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
+}
+
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
+	struct buffered_data *req;
+
 	list_del(&out->list);
+
+	/*
+	 * Update conn->timeout_msec with the next found timeout value in the
+	 * queued pending requests.
+	 */
+	if (out->timeout_msec) {
+		conn->timeout_msec = 0;
+		list_for_each_entry(req, &conn->out_list, list) {
+			if (req->timeout_msec) {
+				conn->timeout_msec = req->timeout_msec;
+				break;
+			}
+		}
+	}
+
 	talloc_free(out);
 }
 
+static void check_event_timeout(struct connection *conn, uint64_t msecs,
+				int *ptimeout)
+{
+	uint64_t delta;
+	struct buffered_data *out, *tmp;
+
+	if (!conn->timeout_msec)
+		return;
+
+	delta = conn->timeout_msec - msecs;
+	if (conn->timeout_msec <= msecs) {
+		delta = 0;
+		list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
+			/*
+			 * Only look at buffers with timeout and no data
+			 * already written to the ring.
+			 */
+			if (out->timeout_msec && out->inhdr && !out->used) {
+				if (out->timeout_msec > msecs) {
+					conn->timeout_msec = out->timeout_msec;
+					delta = conn->timeout_msec - msecs;
+					break;
+				}
+
+				/*
+				 * Free out without updating conn->timeout_msec,
+				 * as the update is done in this loop already.
+				 */
+				out->timeout_msec = 0;
+				trace("watch event path %s for domain %u timed out\n",
+				      out->buffer, conn->id);
+				free_buffered_data(out, conn);
+			}
+		}
+		if (!delta) {
+			conn->timeout_msec = 0;
+			return;
+		}
+	}
+
+	if (*ptimeout == -1 || *ptimeout > delta)
+		*ptimeout = delta;
+}
+
 void conn_free_buffered_data(struct connection *conn)
 {
 	struct buffered_data *out;
 
 	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
 		free_buffered_data(out, conn);
+
+	conn->timeout_msec = 0;
 }
 
 static bool write_messages(struct connection *conn)
@@ -382,6 +457,7 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 {
 	struct connection *conn;
 	struct wrl_timestampt now;
+	uint64_t msecs;
 
 	if (fds)
 		memset(fds, 0, sizeof(struct pollfd) * current_array_size);
@@ -402,10 +478,12 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 
 	wrl_gettime_now(&now);
 	wrl_log_periodic(now);
+	msecs = get_now_msec();
 
 	list_for_each_entry(conn, &connections, list) {
 		if (conn->domain) {
 			wrl_check_timeout(conn->domain, now, ptimeout);
+			check_event_timeout(conn, msecs, ptimeout);
 			if (domain_can_read(conn) ||
 			    (domain_can_write(conn) &&
 			     !list_empty(&conn->out_list)))
@@ -760,6 +838,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		return;
 	bdata->inhdr = true;
 	bdata->used = 0;
+	bdata->timeout_msec = 0;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -811,6 +890,12 @@ void send_event(struct connection *conn, const char *path, const char *token)
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
 }
@@ -2099,6 +2184,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
+"                          allowed timeout candidates are:\n"
+"                          watch-event: time a watch-event is kept pending\n"
 "  -R, --no-recovery       to request that no recovery should be attempted when\n"
 "                          the store is corrupted (debug only),\n"
 "  -I, --internal-db       store database in memory, not on disk\n"
@@ -2121,6 +2209,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
 	{ "verbose", 0, NULL, 'V' },
@@ -2135,6 +2224,39 @@ int dom0_domid = 0;
 int dom0_event = 0;
 int priv_domid = 0;
 
+static int get_optval_int(const char *arg)
+{
+	char *end;
+	long val;
+
+	val = strtol(arg, &end, 10);
+	if (!*arg || *end || val < 0 || val > INT_MAX)
+		barf("invalid parameter value \"%s\"\n", arg);
+
+	return val;
+}
+
+static bool what_matches(const char *arg, const char *what)
+{
+	unsigned int what_len = strlen(what);
+
+	return !strncmp(arg, what, what_len) && arg[what_len] == '=';
+}
+
+static void set_timeout(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("timeouts must be specified via <what>=<seconds>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "watch-event"))
+		timeout_watch_event_msec = val * 1000;
+	else
+		barf("unknown timeout \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2149,7 +2271,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:U", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2198,6 +2320,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'w':
+			set_timeout(optarg);
+			break;
 		case 'e':
 			dom0_event = strtol(optarg, NULL, 10);
 			break;
@@ -2642,6 +2767,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 		barf("error restoring buffered data");
 
 	memcpy(bdata->buffer, data, len);
+	if (bdata->hdr.msg.type == XS_WATCH_EVENT && timeout_watch_event_msec &&
+	    domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 0ba5b783d4d1..2db577928fc6 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -27,6 +27,7 @@
 #include <dirent.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <time.h>
 #include <errno.h>
 
 #include "xenstore_lib.h"
@@ -67,6 +68,8 @@ struct buffered_data
 		char raw[sizeof(struct xsd_sockmsg)];
 	} hdr;
 
+	uint64_t timeout_msec;
+
 	/* The actual data. */
 	char *buffer;
 	char default_buffer[DEFAULT_BUFFER_SIZE];
@@ -110,6 +113,7 @@ struct connection
 
 	/* Buffered output data */
 	struct list_head out_list;
+	uint64_t timeout_msec;
 
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
@@ -237,6 +241,8 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 
+extern unsigned int timeout_watch_event_msec;
+
 /* Map the kernel's xenstore page. */
 void *xenbus_map(void);
 void unmap_xenbus(void *interface);
From 4bfc8b2cf25f2c418dc2c8a11cab6cd12d428b61 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: limit outstanding requests

Add another quota for limiting the number of outstanding requests of a
guest. As specifying quotas on the command line is becoming rather
unwieldy, switch to a new scheme using [--quota|-Q] <what>=<val>, which
makes it easy to add more quotas in the future.
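
The <what>=<val> scheme can be illustrated with a small standalone
sketch. It is not the code from this patch: the example_* names are
hypothetical, and errors are returned to the caller instead of
terminating the daemon via barf().

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for xenstored's quota_req_outstanding. */
static int example_quota_outstanding = 20;

/* Parse a non-negative decimal int; return -1 on malformed input. */
static int example_parse_int(const char *arg)
{
	char *end;
	long val;

	val = strtol(arg, &end, 10);
	if (!*arg || *end || val < 0 || val > INT_MAX)
		return -1;

	return (int)val;
}

/* Return 0 on success, -1 if arg is not a known "<what>=<nb>" pair. */
static int example_set_quota(const char *arg)
{
	const char *eq = strchr(arg, '=');
	int val;

	if (!eq)
		return -1;
	val = example_parse_int(eq + 1);
	if (val < 0)
		return -1;

	/* Only the quota name "outstanding" is known in this sketch. */
	if ((size_t)(eq - arg) == strlen("outstanding") &&
	    !strncmp(arg, "outstanding", eq - arg)) {
		example_quota_outstanding = val;
		return 0;
	}

	return -1;
}
```

Supporting a further quota would then only need another name
comparison in example_set_quota(), not a new option letter.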

Set the default value to 20 (an arbitrary value that seems neither too
high nor too low).

A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
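
The definition of "outstanding" above can be modelled roughly as
follows. This is a simplified sketch with hypothetical ex_* names, not
the xenstored data structures: a request stays accounted against its
domain until both its direct response and all watch events it caused
have been written out.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a domain's accounting state. */
struct ex_domain {
	int nboutstanding;	/* requests not yet fully delivered */
	int quota;		/* cf. quota_req_outstanding */
};

struct ex_request {
	struct ex_domain *dom;
	unsigned int event_cnt;	/* watch events still queued */
	bool reply_pending;	/* direct response still queued */
};

/* A new request is outstanding as soon as its reply is queued. */
static void ex_request_start(struct ex_request *req, struct ex_domain *dom)
{
	req->dom = dom;
	req->event_cnt = 0;
	req->reply_pending = true;
	dom->nboutstanding++;
}

/* A watch event caused by the request has been queued. */
static void ex_event_queued(struct ex_request *req)
{
	req->event_cnt++;
}

/* Drop the accounting once nothing related to the request is queued. */
static void ex_maybe_complete(struct ex_request *req)
{
	if (!req->reply_pending && !req->event_cnt)
		req->dom->nboutstanding--;
}

static void ex_reply_written(struct ex_request *req)
{
	req->reply_pending = false;
	ex_maybe_complete(req);
}

static void ex_event_written(struct ex_request *req)
{
	req->event_cnt--;
	ex_maybe_complete(req);
}

/* New requests are only read while the domain is below its quota. */
static bool ex_can_read(const struct ex_domain *dom)
{
	return dom->nboutstanding < dom->quota;
}
```

In this model a guest that does not drain its ring keeps its own
requests outstanding and is throttled once the quota is reached.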

Note that across a live update the relation to buffered watch events
for other domains is lost.

Use talloc_zero() for allocating the domain structure so that all
per-domain quota counters are initially zero.

This is part of XSA-326 / CVE-2022-42312.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index ee3396fefa94..d871f217af9c 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -107,6 +107,7 @@ int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
+int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -223,12 +224,24 @@ static uint64_t get_now_msec(void)
 	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
 }
 
+/*
+ * Remove a struct buffered_data from the list of outgoing data.
+ * A struct buffered_data related to a request having caused watch events to be
+ * sent is kept until all those events have been written out.
+ * Each watch event is referencing the related request via pend.req, while the
+ * number of watch events caused by a request is kept in pend.ref.event_cnt
+ * (those two cases are mutually exclusive, so the two fields can share memory
+ * via a union).
+ * The struct buffered_data is freed only if no related watch event is
+ * referencing it. The related return data can be freed right away.
+ */
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
 	struct buffered_data *req;
 
 	list_del(&out->list);
+	out->on_out_list = false;
 
 	/*
 	 * Update conn->timeout_msec with the next found timeout value in the
@@ -244,6 +257,30 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	if (out->hdr.msg.type == XS_WATCH_EVENT) {
+		req = out->pend.req;
+		if (req) {
+			req->pend.ref.event_cnt--;
+			if (!req->pend.ref.event_cnt && !req->on_out_list) {
+				if (req->on_ref_list) {
+					domain_outstanding_domid_dec(
+						req->pend.ref.domid);
+					list_del(&req->list);
+				}
+				talloc_free(req);
+			}
+		}
+	} else if (out->pend.ref.event_cnt) {
+		/* Detach out from conn. */
+		talloc_steal(NULL, out);
+		if (out->buffer != out->default_buffer)
+			talloc_free(out->buffer);
+		list_add(&out->list, &conn->ref_list);
+		out->on_ref_list = true;
+		return;
+	} else
+		domain_outstanding_dec(conn);
+
 	talloc_free(out);
 }
 
@@ -399,6 +436,7 @@ int delay_request(struct connection *conn, struct buffered_data *in,
 static int destroy_conn(void *_conn)
 {
 	struct connection *conn = _conn;
+	struct buffered_data *req;
 
 	/* Flush outgoing if possible, but don't block. */
 	if (!conn->domain) {
@@ -412,6 +450,11 @@ static int destroy_conn(void *_conn)
 				break;
 		close(conn->fd);
 	}
+
+	conn_free_buffered_data(conn);
+	list_for_each_entry(req, &conn->ref_list, list)
+		req->on_ref_list = false;
+
         if (conn->target)
                 talloc_unlink(conn, conn->target);
 	list_del(&conn->list);
@@ -859,6 +902,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	domain_outstanding_inc(conn);
 }
 
 /*
@@ -866,7 +911,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
  * As this is not directly related to the current command, errors can't be
  * reported.
  */
-void send_event(struct connection *conn, const char *path, const char *token)
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token)
 {
 	struct buffered_data *bdata;
 	unsigned int len;
@@ -896,8 +942,13 @@ void send_event(struct connection *conn, const char *path, const char *token)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->pend.req = req;
+	if (req)
+		req->pend.ref.event_cnt++;
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
@@ -1658,6 +1709,7 @@ static void handle_input(struct connection *conn)
 			return;
 	}
 	in = conn->in;
+	in->pend.ref.domid = conn->id;
 
 	/* Not finished header yet? */
 	if (in->inhdr) {
@@ -1727,6 +1779,7 @@ struct connection *new_connection(connwritefn_t *write, connreadfn_t *read)
 	new->is_ignored = false;
 	new->transaction_started = 0;
 	INIT_LIST_HEAD(&new->out_list);
+	INIT_LIST_HEAD(&new->ref_list);
 	INIT_LIST_HEAD(&new->watches);
 	INIT_LIST_HEAD(&new->transaction_list);
 	INIT_LIST_HEAD(&new->delayed);
@@ -2184,6 +2237,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
+"                          quotas are:\n"
+"                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2209,6 +2265,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "quota", 1, NULL, 'Q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2257,6 +2314,20 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
+static void set_quota(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("quotas must be specified via <what>=<nb>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "outstanding"))
+		quota_req_outstanding = val;
+	else
+		barf("unknown quota \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2271,8 +2342,8 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
-				  NULL)) != -1) {
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
 			no_domain_init = true;
@@ -2320,6 +2391,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'Q':
+			set_quota(optarg);
+			break;
 		case 'w':
 			set_timeout(optarg);
 			break;
@@ -2776,6 +2850,14 @@ static void add_buffered_data(struct buffered_data *bdata,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	/*
+	 * Watch events are never "outstanding"; instead, the request causing
+	 * them is kept "outstanding" until all watch events caused by that
+	 * request have been delivered.
+	 */
+	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
+		domain_outstanding_inc(conn);
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 2db577928fc6..fcb27399f116 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -56,6 +56,8 @@ struct xs_state_connection;
 struct buffered_data
 {
 	struct list_head list;
+	bool on_out_list;
+	bool on_ref_list;
 
 	/* Are we still doing the header? */
 	bool inhdr;
@@ -63,6 +65,17 @@ struct buffered_data
 	/* How far are we? */
 	unsigned int used;
 
+	/* Outstanding request accounting. */
+	union {
+		/* ref is being used for requests. */
+		struct {
+			unsigned int event_cnt; /* # of outstanding events. */
+			unsigned int domid;     /* domid of request. */
+		} ref;
+		/* req is being used for watch events. */
+		struct buffered_data *req;      /* request causing event. */
+	} pend;
+
 	union {
 		struct xsd_sockmsg msg;
 		char raw[sizeof(struct xsd_sockmsg)];
@@ -115,6 +128,9 @@ struct connection
 	struct list_head out_list;
 	uint64_t timeout_msec;
 
+	/* Referenced requests no longer pending. */
+	struct list_head ref_list;
+
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
 
@@ -184,7 +200,8 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
-void send_event(struct connection *conn, const char *path, const char *token);
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
@@ -240,6 +257,7 @@ extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
+extern int quota_req_outstanding;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 72a5cd3b9aaf..979f8c629835 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -78,6 +78,9 @@ struct domain
 	/* number of watch for this domain */
 	int nbwatch;
 
+	/* Number of outstanding requests. */
+	int nboutstanding;
+
 	/* write rate limit */
 	wrl_creditt wrl_credit; /* [ -wrl_config_writecost, +_dburst ] */
 	struct wrl_timestampt wrl_timestamp;
@@ -287,8 +290,12 @@ bool domain_can_read(struct connection *conn)
 {
 	struct xenstore_domain_interface *intf = conn->domain->interface;
 
-	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
-		return false;
+	if (domain_is_unprivileged(conn)) {
+		if (conn->domain->wrl_credit < 0)
+			return false;
+		if (conn->domain->nboutstanding >= quota_req_outstanding)
+			return false;
+	}
 
 	if (conn->is_ignored)
 		return false;
@@ -337,7 +344,7 @@ static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
 
-	domain = talloc(context, struct domain);
+	domain = talloc_zero(context, struct domain);
 	if (!domain) {
 		errno = ENOMEM;
 		return NULL;
@@ -398,9 +405,6 @@ static int new_domain(struct domain *domain, int port, bool restore)
 	domain->conn->domain = domain;
 	domain->conn->id = domain->domid;
 
-	domain->nbentry = 0;
-	domain->nbwatch = 0;
-
 	return 0;
 }
 
@@ -944,6 +948,28 @@ int domain_watch(struct connection *conn)
 		: 0;
 }
 
+void domain_outstanding_inc(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding++;
+}
+
+void domain_outstanding_dec(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding--;
+}
+
+void domain_outstanding_domid_dec(unsigned int domid)
+{
+	struct domain *d = find_domain_by_domid(domid);
+
+	if (d)
+		d->nboutstanding--;
+}
+
 static wrl_creditt wrl_config_writecost      = WRL_FACTOR;
 static wrl_creditt wrl_config_rate           = WRL_RATE   * WRL_FACTOR;
 static wrl_creditt wrl_config_dburst         = WRL_DBURST * WRL_FACTOR;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index dc9759171317..5757a6557146 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -68,6 +68,9 @@ int domain_entry(struct connection *conn);
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
+void domain_outstanding_inc(struct connection *conn);
+void domain_outstanding_dec(struct connection *conn);
+void domain_outstanding_domid_dec(unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index bc6d833028a3..1d664e3d6b72 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -142,6 +142,7 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		  struct node *node, bool exact, struct node_perms *perms)
 {
 	struct connection *i;
+	struct buffered_data *req;
 	struct watch *watch;
 
 	/* During transactions, don't fire watches, but queue them. */
@@ -150,6 +151,8 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		return;
 	}
 
+	req = domain_is_unprivileged(conn) ? conn->in : NULL;
+
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
 		/* introduce/release domain watches */
@@ -164,12 +167,12 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			}
@@ -269,8 +272,12 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	trace_create(watch, "watch");
 	send_ack(conn, XS_WATCH);
 
-	/* We fire once up front: simplifies clients and restart. */
-	send_event(conn, get_watch_path(watch, watch->node), watch->token);
+	/*
+	 * We fire once up front: simplifies clients and restart.
+	 * This event will not be linked to the XS_WATCH request.
+	 */
+	send_event(NULL, conn, get_watch_path(watch, watch->node),
+		   watch->token);
 
 	return 0;
 }
From 4522b9e5c05f12bca0c7d1c2c9fea15c7bc41358 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: don't buffer multiple identical watch events

A guest that does not read its Xenstore response buffer fast enough
might cause a large number of watch events to pile up in xenstored's
buffers. Reduce the generated load by dropping new events that already
have an identical copy pending.

The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
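
The duplicate check can be sketched as below. This is only an
illustration: the fixed-size ex_queue and its limits are hypothetical
(xenstored keeps buffered data on a linked list), and the payload
layout of path and token, each NUL-terminated, is an assumption based
on how the event length is computed.

```c
#include <stdbool.h>
#include <string.h>

#define EX_MAX_PENDING 16
#define EX_MAX_LEN     64

/* Hypothetical pending-event queue of raw event payloads. */
struct ex_queue {
	char buf[EX_MAX_PENDING][EX_MAX_LEN];
	size_t len[EX_MAX_PENDING];
	unsigned int count;
};

/* Return true if the event was queued, false if it was dropped. */
static bool ex_queue_event(struct ex_queue *q, const char *path,
			   const char *token)
{
	char msg[EX_MAX_LEN];
	size_t plen = strlen(path) + 1, tlen = strlen(token) + 1;
	size_t len = plen + tlen;
	unsigned int i;

	if (len > EX_MAX_LEN || q->count == EX_MAX_PENDING)
		return false;

	/* Assumed payload layout: path '\0' token '\0'. */
	memcpy(msg, path, plen);
	memcpy(msg + plen, token, tlen);

	/* Special "@..." events are always queued individually. */
	if (path[0] != '@') {
		for (i = 0; i < q->count; i++)
			if (q->len[i] == len &&
			    !memcmp(q->buf[i], msg, len))
				return false;
	}

	memcpy(q->buf[q->count], msg, len);
	q->len[q->count] = len;
	q->count++;

	return true;
}
```

Comparing the complete payload means that two events for the same path
but with different tokens are both kept.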

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index d871f217af9c..6ea06e20df91 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -882,6 +882,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->inhdr = true;
 	bdata->used = 0;
 	bdata->timeout_msec = 0;
+	bdata->watch_event = false;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -914,7 +915,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 void send_event(struct buffered_data *req, struct connection *conn,
 		const char *path, const char *token)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata, *bd;
 	unsigned int len;
 
 	len = strlen(path) + 1 + strlen(token) + 1;
@@ -936,12 +937,29 @@ void send_event(struct buffered_data *req, struct connection *conn,
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	/*
+	 * Check whether an identical event is pending already.
+	 * Special events are excluded from that check.
+	 */
+	if (path[0] != '@') {
+		list_for_each_entry(bd, &conn->out_list, list) {
+			if (bd->watch_event && bd->hdr.msg.len == len &&
+			    !memcmp(bdata->buffer, bd->buffer, len)) {
+				trace("dropping duplicate watch %s %s for domain %u\n",
+				      path, token, conn->id);
+				talloc_free(bdata);
+				return;
+			}
+		}
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->watch_event = true;
 	bdata->pend.req = req;
 	if (req)
 		req->pend.ref.event_cnt++;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index fcb27399f116..afbd982c2654 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -62,6 +62,9 @@ struct buffered_data
 	/* Are we still doing the header? */
 	bool inhdr;
 
+	/* Is this a watch event? */
+	bool watch_event;
+
 	/* How far are we? */
 	unsigned int used;
 
From b28ad9eb7615d05716bd728e6b2df0f84d0711a0 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: fix connection->id usage

Use domain_is_unprivileged() instead of conn->id for privilege checks.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 8e470f2b2056..211fe1fd9b37 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -821,7 +821,7 @@ int do_control(struct connection *conn, struct buffered_data *in)
 	unsigned int cmd, num, off;
 	char **vec = NULL;
 
-	if (conn->id != 0)
+	if (domain_is_unprivileged(conn))
 		return EACCES;
 
 	off = get_string(in, 0);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index afbd982c2654..c0a056ce13fe 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -118,7 +118,7 @@ struct connection
 	/* The index of pollfd in global pollfd array */
 	int pollfd_idx;
 
-	/* Who am I? 0 for socket connections. */
+	/* Who am I? Domid of connection. */
 	unsigned int id;
 
 	/* Is this connection ignored? */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 54432907fc76..ee1b09031a3b 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -477,7 +477,8 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 	if (conn->transaction)
 		return EBUSY;
 
-	if (conn->id && conn->transaction_started > quota_max_transaction)
+	if (domain_is_unprivileged(conn) &&
+	    conn->transaction_started > quota_max_transaction)
 		return ENOSPC;
 
 	/* Attach transaction to input for autofree until it's complete */
From 0e724a79645d05f117f0af832b24bc334f762dbc Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: simplify and fix per domain node accounting

The accounting of nodes can be simplified now that each connection
holds the associated domid.

Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.

In particular, the node accounting has to be fixed in some error
paths.
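
The error path handling can be illustrated with the following sketch
of an ownership change that restores the previous accounting if the
new owner cannot be accounted. The ex_* names and the fixed-size
counter table are hypothetical; an out-of-range domid merely stands in
for a failed domain structure allocation.

```c
#include <errno.h>

struct ex_node {
	unsigned int owner;
};

/* Hypothetical per-domain node counters, indexed by domid. */
#define EX_NDOMS 4
static int ex_nbentry[EX_NDOMS];

/* Account one node for domid; ENOMEM models an allocation failure. */
static int ex_entry_inc(unsigned int domid)
{
	if (domid >= EX_NDOMS)
		return ENOMEM;
	ex_nbentry[domid]++;

	return 0;
}

static void ex_entry_dec(unsigned int domid)
{
	ex_nbentry[domid]--;
}

/* Change the owner, restoring the old accounting on failure. */
static int ex_set_owner(struct ex_node *node, unsigned int new_owner)
{
	unsigned int old_owner = node->owner;

	ex_entry_dec(old_owner);
	node->owner = new_owner;
	if (ex_entry_inc(new_owner)) {
		node->owner = old_owner;
		/* Cannot fail: the old owner was accounted before. */
		ex_entry_inc(old_owner);
		return ENOMEM;
	}

	return 0;
}
```

The re-increment in the failure branch relies on the old owner's
counter having existed immediately before, so nothing new needs to be
allocated at that point.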

This is part of XSA-326 / CVE-2022-42313.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 6ea06e20df91..85c0d2f38fac 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -603,7 +603,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(node)) {
+	if (domain_adjust_node_perms(conn, node)) {
 		talloc_free(node);
 		return NULL;
 	}
@@ -625,7 +625,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	void *p;
 	struct xs_tdb_record_hdr *hdr;
 
-	if (domain_adjust_node_perms(node))
+	if (domain_adjust_node_perms(conn, node))
 		return errno;
 
 	data.dsize = sizeof(*hdr)
@@ -1238,13 +1238,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static int destroy_node(struct connection *conn, struct node *node)
+static void destroy_node_rm(struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
 	tdb_delete(tdb_ctx, node->key);
+}
 
+static int destroy_node(struct connection *conn, struct node *node)
+{
+	destroy_node_rm(node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1294,8 +1298,12 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 			goto err;
 
 		/* Account for new node */
-		if (i->parent)
-			domain_entry_inc(conn, i);
+		if (i->parent) {
+			if (domain_entry_inc(conn, i)) {
+				destroy_node_rm(i);
+				return NULL;
+			}
+		}
 	}
 
 	return node;
@@ -1580,10 +1588,27 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in)
 	old_perms = node->perms;
 	domain_entry_dec(conn, node);
 	node->perms = perms;
-	domain_entry_inc(conn, node);
+	if (domain_entry_inc(conn, node)) {
+		node->perms = old_perms;
+		/*
+		 * This should never fail because we had a reference on the
+		 * domain before and Xenstored is single-threaded.
+		 */
+		domain_entry_inc(conn, node);
+		return ENOMEM;
+	}
+
+	if (write_node(conn, node, false)) {
+		int saved_errno = errno;
 
-	if (write_node(conn, node, false))
+		domain_entry_dec(conn, node);
+		node->perms = old_perms;
+		/* No failure possible as above. */
+		domain_entry_inc(conn, node);
+
+		errno = saved_errno;
 		return errno;
+	}
 
 	fire_watches(conn, in, name, node, false, &old_perms);
 	send_ack(conn, XS_SET_PERMS);
@@ -3003,7 +3028,9 @@ void read_state_node(const void *ctx, const void *state)
 	set_tdb_key(name, &key);
 	if (write_node_raw(NULL, &key, node, true))
 		barf("write node error restoring node");
-	domain_entry_inc(&conn, node);
+
+	if (domain_entry_inc(&conn, node))
+		barf("node accounting error restoring node");
 
 	talloc_free(node);
 }
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 979f8c629835..3c27973fb836 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -16,6 +16,7 @@
     along with this program; If not, see <http://www.gnu.org/licenses/>.
 */
 
+#include <assert.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
@@ -369,6 +370,18 @@ static struct domain *find_or_alloc_domain(const void *ctx, unsigned int domid)
 	return domain ? : alloc_domain(ctx, domid);
 }
 
+static struct domain *find_or_alloc_existing_domain(unsigned int domid)
+{
+	struct domain *domain;
+	xc_dominfo_t dominfo;
+
+	domain = find_domain_struct(domid);
+	if (!domain && get_domain_info(domid, &dominfo))
+		domain = alloc_domain(NULL, domid);
+
+	return domain;
+}
+
 static int new_domain(struct domain *domain, int port, bool restore)
 {
 	int rc;
@@ -788,30 +801,28 @@ void domain_deinit(void)
 		xenevtchn_unbind(xce_handle, virq_port);
 }
 
-void domain_entry_inc(struct connection *conn, struct node *node)
+int domain_entry_inc(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
-		return;
+		return 0;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d)
-				d->nbentry++;
-		}
-	} else if (conn->domain) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				conn->domain->domid);
- 		} else {
- 			conn->domain->nbentry++;
-		}
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_inc(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_or_alloc_existing_domain(domid);
+		if (d)
+			d->nbentry++;
+		else
+			return ENOMEM;
 	}
+
+	return 0;
 }
 
 /*
@@ -847,7 +858,7 @@ static int chk_domain_generation(unsigned int domid, uint64_t gen)
  * Remove permissions for no longer existing domains in order to avoid a new
  * domain with the same domid inheriting the permissions.
  */
-int domain_adjust_node_perms(struct node *node)
+int domain_adjust_node_perms(struct connection *conn, struct node *node)
 {
 	unsigned int i;
 	int ret;
@@ -857,8 +868,14 @@ int domain_adjust_node_perms(struct node *node)
 		return errno;
 
 	/* If the owner doesn't exist any longer give it to priv domain. */
-	if (!ret)
+	if (!ret) {
+		/*
+		 * In theory we'd need to update the number of dom0 nodes here,
+		 * but we could be called for a read of the node. So better
+		 * avoid the risk of overflowing the node count of dom0.
+		 */
 		node->perms.p[0].id = priv_domid;
+	}
 
 	for (i = 1; i < node->perms.num; i++) {
 		if (node->perms.p[i].perms & XS_PERM_IGNORE)
@@ -877,25 +894,25 @@ int domain_adjust_node_perms(struct node *node)
 void domain_entry_dec(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
 		return;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d && d->nbentry)
-				d->nbentry--;
-		}
-	} else if (conn->domain && conn->domain->nbentry) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				conn->domain->domid);
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_dec(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_domain_struct(domid);
+		if (d) {
+			d->nbentry--;
 		} else {
-			conn->domain->nbentry--;
+			errno = ENOENT;
+			corrupt(conn,
+				"Node \"%s\" owned by non-existing domain %u\n",
+				node->name, domid);
 		}
 	}
 }
@@ -905,13 +922,23 @@ int domain_entry_fix(unsigned int domid, int num, bool update)
 	struct domain *d;
 	int cnt;
 
-	d = find_domain_by_domid(domid);
-	if (!d)
-		return 0;
+	if (update) {
+		d = find_domain_struct(domid);
+		assert(d);
+	} else {
+		/*
+		 * We are called first with update == false in order to catch
+		 * any error. So do a possible allocation and check for error
+		 * only in this case, as in the case of update == true nothing
+		 * can go wrong anymore as the allocation already happened.
+		 */
+		d = find_or_alloc_existing_domain(domid);
+		if (!d)
+			return -1;
+	}
 
 	cnt = d->nbentry + num;
-	if (cnt < 0)
-		cnt = 0;
+	assert(cnt >= 0);
 
 	if (update)
 		d->nbentry = cnt;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 5757a6557146..cce13d14f016 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -58,10 +58,10 @@ bool domain_can_write(struct connection *conn);
 bool domain_is_unprivileged(struct connection *conn);
 
 /* Remove node permissions for no longer existing domains. */
-int domain_adjust_node_perms(struct node *node);
+int domain_adjust_node_perms(struct connection *conn, struct node *node);
 
 /* Quota manipulation */
-void domain_entry_inc(struct connection *conn, struct node *);
+int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ee1b09031a3b..86caf6c398be 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -519,8 +519,12 @@ static int transaction_fix_domains(struct transaction *trans, bool update)
 
 	list_for_each_entry(d, &trans->changed_domains, list) {
 		cnt = domain_entry_fix(d->domid, d->nbentry, update);
-		if (!update && cnt >= quota_nb_entry_per_domain)
-			return ENOSPC;
+		if (!update) {
+			if (cnt >= quota_nb_entry_per_domain)
+				return ENOSPC;
+			if (cnt < 0)
+				return ENOMEM;
+		}
 	}
 
 	return 0;
From f56b0aa0430d0ee78e6582b323a552084361901a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: limit max number of nodes accessed in a transaction

Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore,
as Xenstore needs to keep track of all nodes accessed during a
transaction.

In oxenstored the number of requests in a transaction is limited via
the quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes instead.

In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
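
As a rough illustration of the counting rule described above, here is a
minimal standalone sketch. It is not the code from this patch: the
sketch_* names are hypothetical, and the lookup that decides whether a
node is already tracked is assumed to have happened in the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Same definition the patch adds to xen-tools/libs.h. */
#ifndef __must_check
#define __must_check __attribute__((__warn_unused_result__))
#endif

/* Hypothetical stand-in for struct transaction; only the counter is kept. */
struct sketch_trans {
	unsigned int nodes;	/* distinct nodes accessed so far */
};

/* Default taken from the patch (quota_trans_nodes). */
static unsigned int sketch_quota_trans_nodes = 1024;

/*
 * Account the first access to a node inside a transaction.
 * Returns 0 on success, or ENOSPC if an unprivileged caller has already
 * reached the quota. Repeated accesses to a node that is already tracked
 * are assumed not to reach this function.
 */
static int __must_check sketch_account_new_node(struct sketch_trans *trans,
						bool unprivileged)
{
	assert(trans);

	if (trans->nodes >= sketch_quota_trans_nodes && unprivileged)
		return ENOSPC;

	trans->nodes++;
	return 0;
}
```

Because of __must_check, a caller that ignores the return value gets a
compiler warning, which is how a missed quota error would be noticed at
build time.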

This is part of XSA-326 / CVE-2022-42314.

Reported-by: Julien Grall <jgrall@amazon.com>
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/include/xen-tools/libs.h b/tools/include/xen-tools/libs.h
index a16e0c380709..bafc90e2f603 100644
--- a/tools/include/xen-tools/libs.h
+++ b/tools/include/xen-tools/libs.h
@@ -63,4 +63,8 @@
 #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
 #endif
 
+#ifndef __must_check
+#define __must_check __attribute__((__warn_unused_result__))
+#endif
+
 #endif	/* __XEN_TOOLS_LIBS__ */
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 85c0d2f38fac..050d6f651ae9 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -106,6 +106,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
 
@@ -560,6 +561,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	TDB_DATA key, data;
 	struct xs_tdb_record_hdr *hdr;
 	struct node *node;
+	int err;
 
 	node = talloc(ctx, struct node);
 	if (!node) {
@@ -581,14 +583,13 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	if (data.dptr == NULL) {
 		if (tdb_error(tdb_ctx) == TDB_ERR_NOEXIST) {
 			node->generation = NO_GENERATION;
-			access_node(conn, node, NODE_ACCESS_READ, NULL);
-			errno = ENOENT;
+			err = access_node(conn, node, NODE_ACCESS_READ, NULL);
+			errno = err ? : ENOENT;
 		} else {
 			log("TDB error on read: %s", tdb_errorstr(tdb_ctx));
 			errno = EIO;
 		}
-		talloc_free(node);
-		return NULL;
+		goto error;
 	}
 
 	node->parent = NULL;
@@ -603,19 +604,36 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(conn, node)) {
-		talloc_free(node);
-		return NULL;
-	}
+	if (domain_adjust_node_perms(conn, node))
+		goto error;
 
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
 	node->children = node->data + node->datalen;
 
-	access_node(conn, node, NODE_ACCESS_READ, NULL);
+	if (access_node(conn, node, NODE_ACCESS_READ, NULL))
+		goto error;
 
 	return node;
+
+ error:
+	err = errno;
+	talloc_free(node);
+	errno = err;
+	return NULL;
+}
+
+static bool read_node_can_propagate_errno(void)
+{
+	/*
+	 * 2 error cases for read_node() can always be propagated up:
+	 * ENOMEM, because this has nothing to do with the node being in the
+	 * data base or not, but is caused by a general lack of memory.
+	 * ENOSPC, because this is related to hitting quota limits which need
+	 * to be respected.
+	 */
+	return errno == ENOMEM || errno == ENOSPC;
 }
 
 int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
@@ -732,7 +750,7 @@ static int ask_parents(struct connection *conn, const void *ctx,
 		node = read_node(conn, ctx, name);
 		if (node)
 			break;
-		if (errno == ENOMEM)
+		if (read_node_can_propagate_errno())
 			return errno;
 	} while (!streq(name, "/"));
 
@@ -795,7 +813,7 @@ static struct node *get_node(struct connection *conn,
 		}
 	}
 	/* Clean up errno if they weren't supposed to know. */
-	if (!node && errno != ENOMEM)
+	if (!node && !read_node_can_propagate_errno())
 		errno = errno_from_parents(conn, ctx, name, errno, perm);
 	return node;
 }
@@ -1201,7 +1219,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 
 	/* If parent doesn't exist, create it. */
 	parent = read_node(conn, parentname, parentname);
-	if (!parent)
+	if (!parent && errno == ENOENT)
 		parent = construct_node(conn, ctx, parentname);
 	if (!parent)
 		return NULL;
@@ -1475,7 +1493,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 
 	parent = read_node(conn, ctx, parentname);
 	if (!parent)
-		return (errno == ENOMEM) ? ENOMEM : EINVAL;
+		return read_node_can_propagate_errno() ? errno : EINVAL;
 	node->parent = parent;
 
 	return delete_node(conn, ctx, parent, node, false);
@@ -1505,7 +1523,7 @@ static int do_rm(struct connection *conn, struct buffered_data *in)
 				return 0;
 			}
 			/* Restore errno, just in case. */
-			if (errno != ENOMEM)
+			if (!read_node_can_propagate_errno())
 				errno = ENOENT;
 		}
 		return errno;
@@ -2282,6 +2300,8 @@ static void usage(void)
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
 "  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
 "                          quotas are:\n"
+"                          transaction-nodes: number of accessed node per\n"
+"                                             transaction\n"
 "                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
@@ -2367,6 +2387,8 @@ static void set_quota(const char *arg)
 	val = get_optval_int(eq + 1);
 	if (what_matches(arg, "outstanding"))
 		quota_req_outstanding = val;
+	else if (what_matches(arg, "transaction-nodes"))
+		quota_trans_nodes = val;
 	else
 		barf("unknown quota \"%s\"\n", arg);
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index c0a056ce13fe..1b3bd5ca563a 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -261,6 +261,7 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
+extern int quota_trans_nodes;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 86caf6c398be..7bd41eb475e3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -156,6 +156,9 @@ struct transaction
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
+	/* Node counter. */
+	unsigned int nodes;
+
 	/* Generation when transaction started. */
 	uint64_t generation;
 
@@ -260,6 +263,11 @@ int access_node(struct connection *conn, struct node *node,
 
 	i = find_accessed_node(trans, node->name);
 	if (!i) {
+		if (trans->nodes >= quota_trans_nodes &&
+		    domain_is_unprivileged(conn)) {
+			ret = ENOSPC;
+			goto err;
+		}
 		i = talloc_zero(trans, struct accessed_node);
 		if (!i)
 			goto nomem;
@@ -297,6 +305,7 @@ int access_node(struct connection *conn, struct node *node,
 				i->ta_node = true;
 			}
 		}
+		trans->nodes++;
 		list_add_tail(&i->list, &trans->accessed);
 	}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 0093cac807e3..e3cbd6b23095 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -39,8 +39,8 @@ void transaction_entry_inc(struct transaction *trans, unsigned int domid);
 void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 
 /* This node was accessed. */
-int access_node(struct connection *conn, struct node *node,
-                enum node_access_type type, TDB_DATA *key);
+int __must_check access_node(struct connection *conn, struct node *node,
+                             enum node_access_type type, TDB_DATA *key);
 
 /* Queue watches for a modified node. */
 void queue_watches(struct connection *conn, const char *name, bool watch_exact);
From 7327806a83071af4105e8c323ccea5b4d439ddc8 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: move the call of setup_structure() to dom0
 introduction

Setting up the basic structure when introducing dom0 makes it possible
to add proper node memory accounting for the added nodes later.

This makes it possible to do proper node accounting, too.

For this to work, the owner of the created nodes additionally needs to
be corrected to dom0_domid instead of domid 0.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 050d6f651ae9..51af74390cbe 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1940,7 +1940,8 @@ static int tdb_flags;
 static void manual_node(const char *name, const char *child)
 {
 	struct node *node;
-	struct xs_permissions perms = { .id = 0, .perms = XS_PERM_NONE };
+	struct xs_permissions perms = { .id = dom0_domid,
+					.perms = XS_PERM_NONE };
 
 	node = talloc_zero(NULL, struct node);
 	if (!node)
@@ -1979,7 +1980,7 @@ static void tdb_logger(TDB_CONTEXT *tdb, int level, const char * fmt, ...)
 	}
 }
 
-static void setup_structure(bool live_update)
+void setup_structure(bool live_update)
 {
 	char *tdbname;
 
@@ -2002,6 +2003,7 @@ static void setup_structure(bool live_update)
 		manual_node("/", "tool");
 		manual_node("/tool", "xenstored");
 		manual_node("/tool/xenstored", NULL);
+		domain_entry_fix(dom0_domid, 3, true);
 	}
 
 	check_store();
@@ -2512,9 +2514,6 @@ int main(int argc, char *argv[])
 
 	init_pipe(reopen_log_pipe);
 
-	/* Setup the database */
-	setup_structure(live_update);
-
 	/* Listen to hypervisor. */
 	if (!no_domain_init && !live_update) {
 		domain_init(-1);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 1b3bd5ca563a..459698d8407a 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -224,6 +224,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 struct node *read_node(struct connection *conn, const void *ctx,
 		       const char *name);
 
+void setup_structure(bool live_update);
 struct connection *new_connection(connwritefn_t *write, connreadfn_t *read);
 struct connection *get_connection_by_id(unsigned int conn_id);
 void check_store(void);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 3c27973fb836..0dd75a6a2194 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -476,6 +476,9 @@ static struct domain *introduce_domain(const void *ctx,
 		}
 		domain->interface = interface;
 
+		if (is_master_domain)
+			setup_structure(restore);
+
 		/* Now domain belongs to its connection. */
 		talloc_steal(domain->conn, domain);
 
From e9dd60538abe7193eaf2c5eb72cc1f18749e7c1a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add infrastructure to keep track of per domain memory
 usage

The amount of memory a domain can consume in Xenstore is limited by
various quotas today, but even with sane quotas a domain can still
consume rather large amounts of memory.

Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so internal administration data needed
only by Xenstore is not accounted for.

There are two quotas defined: a soft quota, which results in a warning
being issued via syslog() when it is exceeded, and a hard quota, which
stops Xenstore from accepting further requests or watch events for as
long as accepting them would violate the hard quota.

Setting any of those quotas to 0 will disable it.

As default values use 2MB per domain for the soft limit (this basically
covers the allowed case to create 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
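
The two thresholds can be pictured with the following standalone
sketch. It only classifies a memory value against the default limits
named above. The sketch_* names are hypothetical, and the syslog()
reporting and its rate limiting are left out on purpose.

```c
#include <assert.h>

enum sketch_quota_state {
	SKETCH_OK,
	SKETCH_SOFT_EXCEEDED,	/* would only trigger a warning */
	SKETCH_HARD_EXCEEDED	/* would stall the domain's interface */
};

/* Defaults from the commit message; a value of 0 disables the quota. */
static int sketch_quota_soft = 2 * 1024 * 1024;			/* 2 MB */
static int sketch_quota_hard = 2 * 1024 * 1024 + 512 * 1024;	/* 2.5 MB */

/* Classify the memory a domain would use after a pending allocation. */
static enum sketch_quota_state sketch_classify(int mem)
{
	assert(mem >= 0);

	if (sketch_quota_hard && mem >= sketch_quota_hard)
		return SKETCH_HARD_EXCEEDED;
	if (sketch_quota_soft && mem >= sketch_quota_soft)
		return SKETCH_SOFT_EXCEEDED;
	return SKETCH_OK;
}
```

With these defaults, 1000 nodes of 2048 bytes each (2048000 bytes) stay
just below the soft limit of 2097152 bytes.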

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 51af74390cbe..eeb0d893e8c3 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -109,6 +109,8 @@ int quota_nb_perms_per_node = 5;
 int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
+int quota_memory_per_domain_soft = 2 * 1024 * 1024; /* 2 MB */
+int quota_memory_per_domain_hard = 2 * 1024 * 1024 + 512 * 1024; /* 2.5 MB */
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -2304,7 +2306,14 @@ static void usage(void)
 "                          quotas are:\n"
 "                          transaction-nodes: number of accessed node per\n"
 "                                             transaction\n"
+"                          memory: total used memory per domain for nodes,\n"
+"                                  transactions, watches and requests, above\n"
+"                                  which Xenstore will stop talking to domain\n"
 "                          outstanding: number of outstanding requests\n"
+"  -q, --quota-soft <what>=<nb> set a soft quota <what> to the value <nb>,\n"
+"                          causing a warning to be issued via syslog() if the\n"
+"                          limit is violated, allowed quotas are:\n"
+"                          memory: see above\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2331,6 +2340,7 @@ static struct option options[] = {
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
 	{ "quota", 1, NULL, 'Q' },
+	{ "quota-soft", 1, NULL, 'q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2379,7 +2389,7 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
-static void set_quota(const char *arg)
+static void set_quota(const char *arg, bool soft)
 {
 	const char *eq = strchr(arg, '=');
 	int val;
@@ -2387,11 +2397,16 @@ static void set_quota(const char *arg)
 	if (!eq)
 		barf("quotas must be specified via <what>=<nb>\n");
 	val = get_optval_int(eq + 1);
-	if (what_matches(arg, "outstanding"))
+	if (what_matches(arg, "outstanding") && !soft)
 		quota_req_outstanding = val;
-	else if (what_matches(arg, "transaction-nodes"))
+	else if (what_matches(arg, "transaction-nodes") && !soft)
 		quota_trans_nodes = val;
-	else
+	else if (what_matches(arg, "memory")) {
+		if (soft)
+			quota_memory_per_domain_soft = val;
+		else
+			quota_memory_per_domain_hard = val;
+	} else
 		barf("unknown quota \"%s\"\n", arg);
 }
 
@@ -2409,7 +2424,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:q:T:RVW:w:U",
 				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2459,7 +2474,10 @@ int main(int argc, char *argv[])
 						 quota_max_path_len);
 			break;
 		case 'Q':
-			set_quota(optarg);
+			set_quota(optarg, false);
+			break;
+		case 'q':
+			set_quota(optarg, true);
 			break;
 		case 'w':
 			set_timeout(optarg);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 459698d8407a..2fb37dbfe847 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -263,6 +263,8 @@ extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
+extern int quota_memory_per_domain_soft;
+extern int quota_memory_per_domain_hard;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 0dd75a6a2194..ec542df6a67e 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -76,6 +76,13 @@ struct domain
 	/* number of entry from this domain in the store */
 	int nbentry;
 
+	/* Amount of memory allocated for this domain. */
+	int memory;
+	bool soft_quota_reported;
+	bool hard_quota_reported;
+	time_t mem_last_msg;
+#define MEM_WARN_MINTIME_SEC 10
+
 	/* number of watch for this domain */
 	int nbwatch;
 
@@ -296,6 +303,9 @@ bool domain_can_read(struct connection *conn)
 			return false;
 		if (conn->domain->nboutstanding >= quota_req_outstanding)
 			return false;
+		if (conn->domain->memory >= quota_memory_per_domain_hard &&
+		    quota_memory_per_domain_hard)
+			return false;
 	}
 
 	if (conn->is_ignored)
@@ -956,6 +966,89 @@ int domain_entry(struct connection *conn)
 		: 0;
 }
 
+static bool domain_chk_quota(struct domain *domain, int mem)
+{
+	time_t now;
+
+	if (!domain || !domid_is_unprivileged(domain->domid) ||
+	    (domain->conn && domain->conn->is_ignored))
+		return false;
+
+	now = time(NULL);
+
+	if (mem >= quota_memory_per_domain_hard &&
+	    quota_memory_per_domain_hard) {
+		if (domain->hard_quota_reported)
+			return true;
+		syslog(LOG_ERR, "Domain %u exceeds hard memory quota, Xenstore interface to domain stalled\n",
+		       domain->domid);
+		domain->mem_last_msg = now;
+		domain->hard_quota_reported = true;
+		return true;
+	}
+
+	if (now - domain->mem_last_msg >= MEM_WARN_MINTIME_SEC) {
+		if (domain->hard_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->hard_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below hard memory quota again\n",
+			       domain->domid);
+		}
+		if (mem >= quota_memory_per_domain_soft &&
+		    quota_memory_per_domain_soft &&
+		    !domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = true;
+			syslog(LOG_WARNING, "Domain %u exceeds soft memory quota\n",
+			       domain->domid);
+		}
+		if (mem < quota_memory_per_domain_soft &&
+		    domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below soft memory quota again\n",
+			       domain->domid);
+		}
+
+	}
+
+	return false;
+}
+
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check)
+{
+	struct domain *domain;
+
+	domain = find_domain_struct(domid);
+	if (domain) {
+		/*
+		 * domain_chk_quota() will print warning and also store whether
+		 * the soft/hard quota has been hit. So check no_quota_check
+		 * *after*.
+		 */
+		if (domain_chk_quota(domain, domain->memory + mem) &&
+		    !no_quota_check)
+			return ENOMEM;
+		domain->memory += mem;
+	} else {
+		/*
+		 * The domain the memory is to be accounted for should always
+		 * exist, as accounting is done either for a domain related to
+		 * the current connection, or for the domain owning a node
+		 * (which is always existing, as the owner of the node is
+		 * tested to exist and replaced by domid 0 if not).
+		 * So not finding the related domain MUST be an error in the
+		 * data base.
+		 */
+		errno = ENOENT;
+		corrupt(NULL, "Accounting called for non-existing domain %u\n",
+			domid);
+		return ENOENT;
+	}
+
+	return 0;
+}
+
 void domain_watch_inc(struct connection *conn)
 {
 	if (!conn || !conn->domain)
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index cce13d14f016..571aa46d158e 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -65,6 +65,26 @@ int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check);
+
+/*
+ * domain_memory_add_chk(): to be used when memory quota should be checked.
+ * Not to be used when specifying a negative mem value, as lowering the used
+ * memory should always be allowed.
+ */
+static inline int domain_memory_add_chk(unsigned int domid, int mem)
+{
+	return domain_memory_add(domid, mem, false);
+}
+/*
+ * domain_memory_add_nochk(): to be used when memory quota should not be
+ * checked, e.g. when lowering memory usage, or in an error case for undoing
+ * a previous memory adjustment.
+ */
+static inline void domain_memory_add_nochk(unsigned int domid, int mem)
+{
+	domain_memory_add(domid, mem, true);
+}
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
From 4b403268d9d078a8bfd295b6a43735a8cbed9341 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add memory accounting for responses

Add the memory accounting for queued responses.

If adding a watch event for a guest would violate the hard memory
quota of that guest, the event is dropped. This ensures that it is
impossible to drive another guest past its memory
quota by generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
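
A simplified model of that drop rule, assuming a per-domain byte
counter and a per-domain hard quota, might look as follows. The
sketch_* names and the explicit hdr_size parameter are hypothetical and
only serve the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-domain accounting state. */
struct sketch_domain {
	int memory;	/* bytes currently accounted to the domain */
	int hard_quota;	/* 0 disables the check */
};

/*
 * Try to queue a watch event of len payload bytes plus a header.
 * Returns true if the event was accounted and queued, false if it was
 * dropped because the receiving domain would reach its hard quota.
 */
static bool sketch_queue_event(struct sketch_domain *d, size_t len,
			       size_t hdr_size)
{
	int cost = (int)(len + hdr_size);

	assert(d);

	if (d->hard_quota && d->memory + cost >= d->hard_quota)
		return false;	/* drop the event, nothing is accounted */

	d->memory += cost;
	return true;
}

/* Release the accounted bytes once the event has left the queue. */
static void sketch_event_sent(struct sketch_domain *d, size_t len,
			      size_t hdr_size)
{
	assert(d);
	d->memory -= (int)(len + hdr_size);
}
```

The point of the design is that the cost is charged to the receiver of
the event, so a guest generating events cannot push the receiver past
its limit.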

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index eeb0d893e8c3..2e02b577c912 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -260,6 +260,8 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	domain_memory_add_nochk(conn->id, -out->hdr.msg.len - sizeof(out->hdr));
+
 	if (out->hdr.msg.type == XS_WATCH_EVENT) {
 		req = out->pend.req;
 		if (req) {
@@ -904,11 +906,14 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->timeout_msec = 0;
 	bdata->watch_event = false;
 
-	if (len <= DEFAULT_BUFFER_SIZE)
+	if (len <= DEFAULT_BUFFER_SIZE) {
 		bdata->buffer = bdata->default_buffer;
-	else {
+		/* Don't check quota, path might be used for returning error. */
+		domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
+	} else {
 		bdata->buffer = talloc_array(bdata, char, len);
-		if (!bdata->buffer) {
+		if (!bdata->buffer ||
+		    domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
 			send_error(conn, ENOMEM);
 			return;
 		}
@@ -973,6 +978,11 @@ void send_event(struct buffered_data *req, struct connection *conn,
 		}
 	}
 
+	if (domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
+		talloc_free(bdata);
+		return;
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
@@ -2940,6 +2950,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 	 */
 	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
 		domain_outstanding_inc(conn);
+	/*
+	 * We are restoring the state after Live-Update and the new quota may
+	 * be smaller. So ignore it. The limit will be applied for any resource
+	 * after the state has been fully restored.
+	 */
+	domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
From 61b64c457431be0a444b2a771b766bad3e5abf82 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for watches

Add the memory accounting for registered watches.

When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().
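
Why the explicit removal matters can be seen in a small model. In this
sketch each watch is charged the string lengths of its path and token,
and only an explicit walk over the watches returns the counter to zero.
All sketch_* names are hypothetical, and the fixed-size array merely
replaces the real list handling.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_WATCHES 4

struct sketch_watch {
	const char *path;
	const char *token;
};

/* Hypothetical connection with its watches and accounted memory. */
struct sketch_conn {
	struct sketch_watch watches[SKETCH_MAX_WATCHES];
	unsigned int nr_watches;
	int memory;
};

/* A watch is charged the length of its path plus the length of its token. */
static int sketch_watch_cost(const struct sketch_watch *w)
{
	return (int)(strlen(w->path) + strlen(w->token));
}

/* Register a watch and account its cost; returns -1 if the table is full. */
static int sketch_add_watch(struct sketch_conn *c, const char *path,
			    const char *token)
{
	struct sketch_watch *w;

	assert(c && path && token);
	if (c->nr_watches == SKETCH_MAX_WATCHES)
		return -1;

	w = &c->watches[c->nr_watches++];
	w->path = path;
	w->token = token;
	c->memory += sketch_watch_cost(w);
	return 0;
}

/* Remove every watch one by one so each cost is given back. */
static void sketch_delete_all_watches(struct sketch_conn *c)
{
	assert(c);
	while (c->nr_watches) {
		struct sketch_watch *w = &c->watches[--c->nr_watches];

		c->memory -= sketch_watch_cost(w);
	}
}
```

If the connection were simply freed without this walk, the accounted
bytes would stay charged to the domain.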

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 2e02b577c912..b1a4575929bd 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -457,6 +457,7 @@ static int destroy_conn(void *_conn)
 	}
 
 	conn_free_buffered_data(conn);
+	conn_delete_all_watches(conn);
 	list_for_each_entry(req, &conn->ref_list, list)
 		req->on_ref_list = false;
 
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 1d664e3d6b72..0d5858df5bdd 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -211,7 +211,7 @@ static int check_watch_path(struct connection *conn, const void *ctx,
 }
 
 static struct watch *add_watch(struct connection *conn, char *path, char *token,
-			       bool relative)
+			       bool relative, bool no_quota_check)
 {
 	struct watch *watch;
 
@@ -222,6 +222,9 @@ static struct watch *add_watch(struct connection *conn, char *path, char *token,
 	watch->token = talloc_strdup(watch, token);
 	if (!watch->node || !watch->token)
 		goto nomem;
+	if (domain_memory_add(conn->id, strlen(path) + strlen(token),
+			      no_quota_check))
+		goto nomem;
 
 	if (relative)
 		watch->relative_path = get_implicit_path(conn);
@@ -265,7 +268,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	if (domain_watch(conn) > quota_nb_watch_per_domain)
 		return E2BIG;
 
-	watch = add_watch(conn, vec[0], vec[1], relative);
+	watch = add_watch(conn, vec[0], vec[1], relative, false);
 	if (!watch)
 		return errno;
 
@@ -296,6 +299,8 @@ int do_unwatch(struct connection *conn, struct buffered_data *in)
 	list_for_each_entry(watch, &conn->watches, list) {
 		if (streq(watch->node, node) && streq(watch->token, vec[1])) {
 			list_del(&watch->list);
+			domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+							  strlen(watch->token));
 			talloc_free(watch);
 			domain_watch_dec(conn);
 			send_ack(conn, XS_UNWATCH);
@@ -311,6 +316,8 @@ void conn_delete_all_watches(struct connection *conn)
 
 	while ((watch = list_top(&conn->watches, struct watch, list))) {
 		list_del(&watch->list);
+		domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+						  strlen(watch->token));
 		talloc_free(watch);
 		domain_watch_dec(conn);
 	}
@@ -373,7 +380,7 @@ void read_state_watch(const void *ctx, const void *state)
 	if (!path)
 		barf("allocation error for read watch");
 
-	if (!add_watch(conn, path, token, relative))
+	if (!add_watch(conn, path, token, relative, true))
 		barf("error adding watch");
 }
 
From 87bfccac57f9addc1a45eb7222c5402e45d2a88a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for nodes

Add the memory accounting for Xenstore nodes. To keep this from
becoming too complicated, allow for some sloppiness when writing nodes.
Any hard quota violation will result in no further requests being
accepted.
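
The bookkeeping for rewriting a node can be sketched as below: the old
record is released from its previous owner, the new record is charged
to the new owner, and the release is undone if the charge fails. This
is a simplified model under stated assumptions. The sketch_* names are
hypothetical, and key sizes, transactions and the database write itself
are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain accounting state. */
struct sketch_dom {
	int memory;	/* bytes currently accounted */
	int hard_quota;	/* 0 disables the check */
};

/* Adjust the accounted memory; only checked additions can fail. */
static int sketch_mem_add(struct sketch_dom *d, int mem, bool check)
{
	if (check && d->hard_quota && d->memory + mem >= d->hard_quota)
		return ENOMEM;
	d->memory += mem;
	return 0;
}

/*
 * Move the accounting of a node record from its old owner (old_size
 * bytes, 0 if the node did not exist) to its new owner (new_size bytes).
 * Returns 0 on success or ENOMEM if the new owner hits its hard quota,
 * in which case the old owner's accounting is restored.
 */
static int sketch_rewrite_node(struct sketch_dom *old_owner, int old_size,
			       struct sketch_dom *new_owner, int new_size)
{
	int ret;

	assert(old_size >= 0 && new_size >= 0);

	/* Lowering usage is always allowed, so no quota check here. */
	if (old_size)
		(void)sketch_mem_add(old_owner, -old_size, false);

	ret = sketch_mem_add(new_owner, new_size, true);
	if (ret && old_size)
		/* Error path: undo the release without a quota check. */
		(void)sketch_mem_add(old_owner, old_size, false);

	return ret;
}
```

Releasing the old size before charging the new one means that
overwriting a node with a record of similar size does not transiently
count both copies against the owner.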

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index b1a4575929bd..f27d5c0101bc 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -556,6 +556,117 @@ void set_tdb_key(const char *name, TDB_DATA *key)
 	key->dsize = strlen(name);
 }
 
+static void get_acc_data(TDB_DATA *key, struct node_account_data *acc)
+{
+	TDB_DATA old_data;
+	struct xs_tdb_record_hdr *hdr;
+
+	if (acc->memory < 0) {
+		old_data = tdb_fetch(tdb_ctx, *key);
+		/* No check for error, as the node might not exist. */
+		if (old_data.dptr == NULL) {
+			acc->memory = 0;
+		} else {
+			hdr = (void *)old_data.dptr;
+			acc->memory = old_data.dsize;
+			acc->domid = hdr->perms[0].id;
+		}
+		talloc_free(old_data.dptr);
+	}
+}
+
+/*
+ * Per-transaction nodes need to be accounted for the transaction owner.
+ * Those nodes are stored in the data base with the transaction generation
+ * count prepended (e.g. 123/local/domain/...). So testing for the node's
+ * key not to start with "/" is sufficient.
+ */
+static unsigned int get_acc_domid(struct connection *conn, TDB_DATA *key,
+				  unsigned int domid)
+{
+	return (!conn || key->dptr[0] == '/') ? domid : conn->id;
+}
+
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check)
+{
+	struct xs_tdb_record_hdr *hdr = (void *)data->dptr;
+	struct node_account_data old_acc = {};
+	unsigned int old_domid, new_domid;
+	int ret;
+
+	if (!acc)
+		old_acc.memory = -1;
+	else
+		old_acc = *acc;
+
+	get_acc_data(key, &old_acc);
+	old_domid = get_acc_domid(conn, key, old_acc.domid);
+	new_domid = get_acc_domid(conn, key, hdr->perms[0].id);
+
+	/*
+	 * Don't check for ENOENT, as we want to be able to switch orphaned
+	 * nodes to new owners.
+	 */
+	if (old_acc.memory)
+		domain_memory_add_nochk(old_domid,
+					-old_acc.memory - key->dsize);
+	ret = domain_memory_add(new_domid, data->dsize + key->dsize,
+				no_quota_check);
+	if (ret) {
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		return ret;
+	}
+
+	/* TDB should set errno, but doesn't even set ecode AFAICT. */
+	if (tdb_store(tdb_ctx, *key, *data, TDB_REPLACE) != 0) {
+		domain_memory_add_nochk(new_domid, -data->dsize - key->dsize);
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc) {
+		/* Don't use new_domid, as it might be a transaction node. */
+		acc->domid = hdr->perms[0].id;
+		acc->memory = data->dsize;
+	}
+
+	return 0;
+}
+
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc)
+{
+	struct node_account_data tmp_acc;
+	unsigned int domid;
+
+	if (!acc) {
+		acc = &tmp_acc;
+		acc->memory = -1;
+	}
+
+	get_acc_data(key, acc);
+
+	if (tdb_delete(tdb_ctx, *key)) {
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc->memory) {
+		domid = get_acc_domid(conn, key, acc->domid);
+		domain_memory_add_nochk(domid, -acc->memory - key->dsize);
+	}
+
+	return 0;
+}
+
 /*
  * If it fails, returns NULL and sets errno.
  * Temporary memory allocations will be done with ctx.
@@ -609,9 +720,15 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
+	node->acc.domid = node->perms.p[0].id;
+	node->acc.memory = data.dsize;
 	if (domain_adjust_node_perms(conn, node))
 		goto error;
 
+	/* If owner is gone reset currently accounted memory size. */
+	if (node->acc.domid != node->perms.p[0].id)
+		node->acc.memory = 0;
+
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
@@ -680,12 +797,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	p += node->datalen;
 	memcpy(p, node->children, node->childlen);
 
-	/* TDB should set errno, but doesn't even set ecode AFAICT. */
-	if (tdb_store(tdb_ctx, *key, data, TDB_REPLACE) != 0) {
-		corrupt(conn, "Write of %s failed", key->dptr);
-		errno = EIO;
-		return errno;
-	}
+	if (do_tdb_write(conn, key, &data, &node->acc, no_quota_check))
+		return EIO;
+
 	return 0;
 }
 
@@ -1188,7 +1302,7 @@ static void delete_node_single(struct connection *conn, struct node *node)
 	if (access_node(conn, node, NODE_ACCESS_DELETE, &key))
 		return;
 
-	if (tdb_delete(tdb_ctx, key) != 0) {
+	if (do_tdb_delete(conn, &key, &node->acc) != 0) {
 		corrupt(conn, "Could not delete '%s'", node->name);
 		return;
 	}
@@ -1261,6 +1375,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	/* No children, no data */
 	node->children = node->data = NULL;
 	node->childlen = node->datalen = 0;
+	node->acc.memory = 0;
 	node->parent = parent;
 	return node;
 
@@ -1269,17 +1384,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static void destroy_node_rm(struct node *node)
+static void destroy_node_rm(struct connection *conn, struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
-	tdb_delete(tdb_ctx, node->key);
+	do_tdb_delete(conn, &node->key, &node->acc);
 }
 
 static int destroy_node(struct connection *conn, struct node *node)
 {
-	destroy_node_rm(node);
+	destroy_node_rm(conn, node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1331,7 +1446,7 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 		/* Account for new node */
 		if (i->parent) {
 			if (domain_entry_inc(conn, i)) {
-				destroy_node_rm(i);
+				destroy_node_rm(conn, i);
 				return NULL;
 			}
 		}
@@ -2192,7 +2307,7 @@ static int clean_store_(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA val,
 	if (!hashtable_search(reachable, name)) {
 		log("clean_store: '%s' is orphaned!", name);
 		if (recovery) {
-			tdb_delete(tdb, key);
+			do_tdb_delete(NULL, &key, NULL);
 		}
 	}
 
@@ -3030,6 +3145,7 @@ void read_state_node(const void *ctx, const void *state)
 	if (!node)
 		barf("allocation error restoring node");
 
+	node->acc.memory = 0;
 	node->name = name;
 	node->generation = ++generation;
 	node->datalen = sn->data_len;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 2fb37dbfe847..5c1b574bffe6 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -169,6 +169,11 @@ struct node_perms {
 	struct xs_permissions *p;
 };
 
+struct node_account_data {
+	unsigned int domid;
+	int memory;		/* -1 if unknown */
+};
+
 struct node {
 	const char *name;
 	/* Key used to update TDB */
@@ -191,6 +196,9 @@ struct node {
 	/* Children, each nul-terminated. */
 	unsigned int childlen;
 	char *children;
+
+	/* Allocation information for node currently in store. */
+	struct node_account_data acc;
 };
 
 /* Return the only argument in the input. */
@@ -300,6 +308,10 @@ extern xengnttab_handle **xgt_handle;
 int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check);
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc);
 
 void conn_free_buffered_data(struct connection *conn);
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 7bd41eb475e3..ace9a11d77bb 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -153,6 +153,9 @@ struct transaction
 	/* List of all transactions active on this connection. */
 	struct list_head list;
 
+	/* Connection this transaction is associated with. */
+	struct connection *conn;
+
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
@@ -286,6 +289,8 @@ int access_node(struct connection *conn, struct node *node,
 
 		introduce = true;
 		i->ta_node = false;
+		/* acc.memory < 0 means "unknown, get size from TDB". */
+		node->acc.memory = -1;
 
 		/*
 		 * Additional transaction-specific node for read type. We only
@@ -410,11 +415,11 @@ static int finalize_transaction(struct connection *conn,
 					goto err;
 				hdr = (void *)data.dptr;
 				hdr->generation = ++generation;
-				ret = tdb_store(tdb_ctx, key, data,
-						TDB_REPLACE);
+				ret = do_tdb_write(conn, &key, &data, NULL,
+						   true);
 				talloc_free(data.dptr);
 			} else {
-				ret = tdb_delete(tdb_ctx, key);
+				ret = do_tdb_delete(conn, &key, NULL);
 			}
 			if (ret)
 				goto err;
@@ -425,7 +430,7 @@ static int finalize_transaction(struct connection *conn,
 			}
 		}
 
-		if (i->ta_node && tdb_delete(tdb_ctx, ta_key))
+		if (i->ta_node && do_tdb_delete(conn, &ta_key, NULL))
 			goto err;
 		list_del(&i->list);
 		talloc_free(i);
@@ -453,7 +458,7 @@ static int destroy_transaction(void *_transaction)
 							       i->node);
 			if (trans_name) {
 				set_tdb_key(trans_name, &key);
-				tdb_delete(tdb_ctx, key);
+				do_tdb_delete(trans->conn, &key, NULL);
 			}
 		}
 		list_del(&i->list);
@@ -497,6 +502,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 
 	INIT_LIST_HEAD(&trans->accessed);
 	INIT_LIST_HEAD(&trans->changed_domains);
+	trans->conn = conn;
 	trans->fail = false;
 	trans->generation = ++generation;
 
From fed629259c64d91dfb26bd478c260b66dfad4dae Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add exports for quota variables

Some quota variables are not exported via header files.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 5c1b574bffe6..1eb3708f82dd 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -268,6 +268,11 @@ extern TDB_CONTEXT *tdb_ctx;
 extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
+extern int quota_nb_watch_per_domain;
+extern int quota_max_transaction;
+extern int quota_max_entry_size;
+extern int quota_nb_perms_per_node;
+extern int quota_max_path_len;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ace9a11d77bb..28774813de83 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -175,7 +175,6 @@ struct transaction
 	bool fail;
 };
 
-extern int quota_max_transaction;
 uint64_t generation;
 
 static struct accessed_node *find_accessed_node(struct transaction *trans,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 0d5858df5bdd..4970e9f1a1b9 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -31,8 +31,6 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 
-extern int quota_nb_watch_per_domain;
-
 struct watch
 {
 	/* Watches on this connection */
From e7d84673f757cd38ad02391fe079f291b8197d54 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add control command for setting and showing quota

Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain

Note that if the new quota is lower than the existing one, Xenstored
may continue to handle requests from a domain exceeding the new limit
(depending on which quota is exceeded), and the amount of resources
already in use will not change. However, the domain will not be able to
create more of the resource associated with the quota until its usage is
back below the limit.
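
As a rough illustration of the table-driven "set" handling and of the
behaviour described in the note above, here is a minimal standalone
sketch. All demo_* names and the initial values are hypothetical and are
not part of xenstored; the sketch only assumes that a limit blocks
further allocations once usage has reached it.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the daemon's global quota variables. */
static int demo_quota_nodes = 1000;
static int demo_quota_watches = 128;

struct demo_quota {
	const char *name;
	int *value;
};

/* NULL-terminated table mapping quota names to the variables they set. */
static const struct demo_quota demo_quotas[] = {
	{ "nodes", &demo_quota_nodes },
	{ "watches", &demo_quota_watches },
	{ NULL, NULL }
};

/* Returns 0 on success, EINVAL for an unknown name or a value below 1. */
int demo_quota_set(const char *name, const char *val_str)
{
	unsigned int i;
	int val;

	assert(name != NULL && val_str != NULL);

	val = atoi(val_str);
	if (val < 1)
		return EINVAL;

	for (i = 0; demo_quotas[i].name; i++) {
		if (!strcmp(name, demo_quotas[i].name)) {
			*demo_quotas[i].value = val;
			return 0;
		}
	}

	return EINVAL;
}

/*
 * Assumed check: lowering a limit reclaims nothing, it only refuses
 * further growth while usage is at or above the limit.
 */
int demo_may_allocate(int in_use, int limit)
{
	return in_use < limit;
}
```

In this sketch, lowering "nodes" below a domain's current usage leaves
the existing nodes in place; demo_may_allocate() simply returns 0 until
the usage has dropped below the new limit.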

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 334dc8b6fdf5..a7d006519ae8 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -366,6 +366,17 @@ CONTROL			<command>|[<parameters>|]
 	print|<string>
 		print <string> to syslog (xenstore runs as daemon) or
 		to console (xenstore runs as stubdom)
+	quota|[set <name> <val>|<domid>]
+		without parameters: print the current quota settings
+		with "set <name> <val>": set the quota <name> to new value
+		<val> (The admin should make sure all the domain usage is
+		below the quota. If it is not, then Xenstored may continue to
+		handle requests from the domain as long as the resource
+		violating the new quota setting isn't increased further)
+		with "<domid>": print quota related accounting data for
+		the domain <domid>
+	quota-soft|[set <name> <val>]
+		like the "quota" command, but for soft-quota.
 	help			<supported-commands>
 		return list of supported commands for CONTROL
 
diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 211fe1fd9b37..980279fa53ff 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -148,6 +148,115 @@ static int do_control_log(void *ctx, struct connection *conn,
 	return 0;
 }
 
+struct quota {
+	const char *name;
+	int *quota;
+	const char *descr;
+};
+
+static const struct quota hard_quotas[] = {
+	{ "nodes", &quota_nb_entry_per_domain, "Nodes per domain" },
+	{ "watches", &quota_nb_watch_per_domain, "Watches per domain" },
+	{ "transactions", &quota_max_transaction, "Transactions per domain" },
+	{ "outstanding", &quota_req_outstanding,
+		"Outstanding requests per domain" },
+	{ "transaction-nodes", &quota_trans_nodes,
+		"Max. number of accessed nodes per transaction" },
+	{ "memory", &quota_memory_per_domain_hard,
+		"Total Xenstore memory per domain (error level)" },
+	{ "node-size", &quota_max_entry_size, "Max. size of a node" },
+	{ "path-max", &quota_max_path_len, "Max. length of a node path" },
+	{ "permissions", &quota_nb_perms_per_node,
+		"Max. number of permissions per node" },
+	{ NULL, NULL, NULL }
+};
+
+static const struct quota soft_quotas[] = {
+	{ "memory", &quota_memory_per_domain_soft,
+		"Total Xenstore memory per domain (warning level)" },
+	{ NULL, NULL, NULL }
+};
+
+static int quota_show_current(const void *ctx, struct connection *conn,
+			      const struct quota *quotas)
+{
+	char *resp;
+	unsigned int i;
+
+	resp = talloc_strdup(ctx, "Quota settings:\n");
+	if (!resp)
+		return ENOMEM;
+
+	for (i = 0; quotas[i].quota; i++) {
+		resp = talloc_asprintf_append(resp, "%-17s: %8d %s\n",
+					      quotas[i].name, *quotas[i].quota,
+					      quotas[i].descr);
+		if (!resp)
+			return ENOMEM;
+	}
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
+static int quota_set(const void *ctx, struct connection *conn,
+		     char **vec, int num, const struct quota *quotas)
+{
+	unsigned int i;
+	int val;
+
+	if (num != 2)
+		return EINVAL;
+
+	val = atoi(vec[1]);
+	if (val < 1)
+		return EINVAL;
+
+	for (i = 0; quotas[i].quota; i++) {
+		if (!strcmp(vec[0], quotas[i].name)) {
+			*quotas[i].quota = val;
+			send_ack(conn, XS_CONTROL);
+			return 0;
+		}
+	}
+
+	return EINVAL;
+}
+
+static int quota_get(const void *ctx, struct connection *conn,
+		     char **vec, int num)
+{
+	if (num != 1)
+		return EINVAL;
+
+	return domain_get_quota(ctx, conn, atoi(vec[0]));
+}
+
+static int do_control_quota(void *ctx, struct connection *conn,
+			    char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, hard_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, hard_quotas);
+
+	return quota_get(ctx, conn, vec, num);
+}
+
+static int do_control_quota_s(void *ctx, struct connection *conn,
+			      char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, soft_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, soft_quotas);
+
+	return EINVAL;
+}
+
 #ifdef __MINIOS__
 static int do_control_memreport(void *ctx, struct connection *conn,
 				char **vec, int num)
@@ -777,6 +886,8 @@ static struct cmd_s cmds[] = {
 	{ "memreport", do_control_memreport, "[<file>]" },
 #endif
 	{ "print", do_control_print, "<string>" },
+	{ "quota", do_control_quota, "[set <name> <val>|<domid>]" },
+	{ "quota-soft", do_control_quota_s, "[set <name> <val>]" },
 	{ "help", do_control_help, "" },
 };
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index ec542df6a67e..3d5142581332 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -31,6 +31,7 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 #include "xenstored_watch.h"
+#include "xenstored_control.h"
 
 #include <xenevtchn.h>
 #include <xenctrl.h>
@@ -351,6 +352,38 @@ static struct domain *find_domain_struct(unsigned int domid)
 	return NULL;
 }
 
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid)
+{
+	struct domain *d = find_domain_struct(domid);
+	char *resp;
+	int ta;
+
+	if (!d)
+		return ENOENT;
+
+	ta = d->conn ? d->conn->transaction_started : 0;
+	resp = talloc_asprintf(ctx, "Domain %u:\n", domid);
+	if (!resp)
+		return ENOMEM;
+
+#define ent(t, e) \
+	resp = talloc_asprintf_append(resp, "%-16s: %8d\n", #t, e); \
+	if (!resp) return ENOMEM
+
+	ent(nodes, d->nbentry);
+	ent(watches, d->nbwatch);
+	ent(transactions, ta);
+	ent(outstanding, d->nboutstanding);
+	ent(memory, d->memory);
+
+#undef ent
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
 static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 571aa46d158e..0f883936f413 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -91,6 +91,8 @@ int domain_watch(struct connection *conn);
 void domain_outstanding_inc(struct connection *conn);
 void domain_outstanding_dec(struct connection *conn);
 void domain_outstanding_domid_dec(unsigned int domid);
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
From 8d6bb4ac40619877130533b11655829101b31d04 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:01 +0100
Subject: tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently have two different sets of defaults in the upstream Xen git tree:
* those defined in the source code, only used if there is no config file
* those defined in oxenstored.conf.in in upstream Xen

An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.

Resync the defaults from oxenstored.conf.in into the source code.

This is part of XSA-326 / CVE-2022-42316.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ebe18b8e312c..6b06f808595b 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -21,9 +21,9 @@ let xs_daemon_socket = Paths.xen_run_stored ^ "/socket"
 
 let default_config_dir = Paths.xen_config_dir
 
-let maxwatch = ref (50)
-let maxtransaction = ref (20)
-let maxrequests = ref (-1)   (* maximum requests per transaction *)
+let maxwatch = ref (100)
+let maxtransaction = ref (10)
+let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index abcac912805a..6e3d6401ae89 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -20,8 +20,8 @@ exception Transaction_opened
 
 let warn fmt = Logging.warn "quota" fmt
 let activate = ref true
-let maxent = ref (10000)
-let maxsize = ref (4096)
+let maxent = ref (1000)
+let maxsize = ref (2048)
 
 type t = {
 	maxent: int;               (* max entities per domU *)
From 78d5af44ab13bb18c87b6ad75e505bd374379cb3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Thu, 28 Jul 2022 17:08:15 +0100
Subject: tools/ocaml/xenstored: Check for maxrequests before performing
 operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.

If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded

It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there.  Then the transaction
record won't grow.

This is part of XSA-326 / CVE-2022-42317.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 27790d4a5c41..dd58e6979cf9 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -389,6 +389,7 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
 	let reply_error e =
 		Packet.Error e in
 	try
+		Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 		fct con t doms cons req.Packet.data
 	with
 	| Define.Invalid_path          -> reply_error "EINVAL"
@@ -681,9 +682,10 @@ let process_packet ~store ~cons ~doms ~con ~req =
 		in
 
 		let response = try
+			Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 			if tid <> Transaction.none then
 				(* Remember the request and response for this operation in case we need to replay the transaction *)
-				Transaction.add_operation ~perm:(Connection.get_perm con) t req response;
+				Transaction.add_operation t req response;
 			response
 		with Quota.Limit_reached ->
 			Packet.Error "EQUOTA"
diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml
index 17b1bdf2eaf9..294143e2335b 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -85,6 +85,7 @@ type t = {
 	oldroot: Store.Node.t;
 	mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
 	mutable operations: (Packet.request * Packet.response) list;
+	mutable quota_reached: bool;
 	mutable read_lowpath: Store.Path.t option;
 	mutable write_lowpath: Store.Path.t option;
 }
@@ -127,6 +128,7 @@ let make ?(internal=false) id store =
 		oldroot = Store.get_root store;
 		paths = [];
 		operations = [];
+		quota_reached = false;
 		read_lowpath = None;
 		write_lowpath = None;
 	} in
@@ -143,13 +145,19 @@ let get_root t = Store.get_root t.store
 
 let is_read_only t = t.paths = []
 let add_wop t ty path = t.paths <- (ty, path) :: t.paths
-let add_operation ~perm t request response =
+let get_operations t = List.rev t.operations
+
+let check_quota_exn ~perm t =
 	if !Define.maxrequests >= 0
 		&& not (Perms.Connection.is_dom0 perm)
-		&& List.length t.operations >= !Define.maxrequests
-		then raise Quota.Limit_reached;
+		&& (t.quota_reached || List.length t.operations >= !Define.maxrequests)
+		then begin
+			t.quota_reached <- true;
+			raise Quota.Limit_reached;
+		end
+
+let add_operation t request response =
 	t.operations <- (request, response) :: t.operations
-let get_operations t = List.rev t.operations
 let set_read_lowpath t path = t.read_lowpath <- get_lowest path t.read_lowpath
 let set_write_lowpath t path = t.write_lowpath <- get_lowest path t.write_lowpath
 
From 600c45e49c2060e077c06ab19078da89aa8e2e08 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:07 +0100
Subject: tools/ocaml: GC parameter tuning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory.  Tweak this to 120% instead, which
would match the major GC speed.
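
The arithmetic behind these percentages can be sketched as below. This
is a rough model only, assuming compaction is triggered once unused
memory exceeds live * max_overhead / 100 as described above; the demo_*
helpers are hypothetical and written in C purely for illustration.

```c
#include <assert.h>

/* Largest amount of unused memory tolerated before compaction. */
unsigned long demo_max_unreturned(unsigned long live,
				  unsigned int overhead_pct)
{
	return live * overhead_pct / 100;
}

/*
 * Share of the whole heap (in percent, rounded down) that is unused
 * when the threshold is reached: unused / (live + unused).
 */
unsigned int demo_free_share_pct(unsigned int overhead_pct)
{
	/* Guard so that the multiplication below cannot overflow. */
	assert(overhead_pct <= 1000000);

	return overhead_pct * 100 / (100 + overhead_pct);
}
```

With 100 MiB of live data, the 500% default tolerates up to 500 MiB of
unused memory (about 5/6 of the heap), while 120% caps it at 120 MiB
(a little over half of the heap).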

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 6b06f808595b..ba63a8147e09 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -25,6 +25,7 @@ let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
+let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
 let conflict_rate_limit_is_aggregate = ref true
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index d44ae673c42a..3b57ad016dfb 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -104,6 +104,7 @@ let parse_config filename =
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
 		("quota-path-max", Config.Set_int Define.path_max);
+		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
 		("persistent", Config.Set_bool Disk.enable);
 		("xenstored-log-file", Config.String Logging.set_xenstored_log_destination);
@@ -265,6 +266,67 @@ let to_file store cons fds file =
 	        (fun () -> close_out channel)
 end
 
+(*
+	By default OCaml's GC only returns memory to the OS when it exceeds a
+	configurable 'max overhead' setting.
+	The default is 500%, that is 5/6th of the OCaml heap needs to be free
+	and only 1/6th live for a compaction to be triggered that would
+	release memory back to the OS.
+	If the limit is not hit then the OCaml process can reuse that memory
+	for its own purposes, but other processes won't be able to use it.
+
+	There is also a 'space overhead' setting that controls how much work
+	each major GC slice does, and by default aims at having no more than
+	80% or 120% (depending on version) garbage values compared to live
+	values.
+	This doesn't have as much relevance to memory returned to the OS as
+	long as space_overhead <= max_overhead, because compaction is only
+	triggered at the end of major GC cycles.
+
+	The defaults are too large once the program starts using ~100MiB of
+	memory, at which point ~500MiB would be unavailable to other processes
+	(which would be fine if this was the main process in this VM, but it is
+	not).
+
+	Max overhead can also be set to 0, however this is for testing purposes
+	only (setting it lower than 'space overhead' wouldn't help because the
+	major GC wouldn't run fast enough, and compaction does have a
+	performance cost: we can only compact contiguous regions, so memory has
+	to be moved around).
+
+	Max overhead controls how often the heap is compacted, which is useful
+	if there are bursts of activity followed by long periods of idle state,
+	or if a domain quits, etc. Compaction returns memory to the OS.
+
+	wasted = live * space_overhead / 100
+
+	For globally overriding the GC settings one can use OCAMLRUNPARAM,
+	however we provide a config file override to be consistent with other
+	oxenstored settings.
+
+	One might want to dynamically adjust the overhead setting based on used
+	memory, i.e. to use a fixed upper bound in bytes, not percentage. However
+	measurements show that such adjustments increase GC overhead massively,
+	while still not guaranteeing that memory is returned any more quickly
+	than with a percentage based setting.
+
+	The allocation policy could also be tweaked, e.g. first fit would reduce
+	fragmentation and thus memory usage, but the documentation warns that it
+	can be sensibly slower, and indeed one of our own testcases can trigger
+	such a corner case where it is multiple times slower, so it is best to keep
+	the default allocation policy (next-fit/best-fit depending on version).
+
+	There are other tweaks that can be attempted in the future, e.g. setting
+	'ulimit -v' to 75% of RAM, however getting the kernel to actually return
+	NULL from allocations is difficult even with that setting, and without a
+	NULL the emergency GC won't be triggered.
+	Perhaps cgroup limits could help, but for now tweak the safest only.
+*)
+
+let tweak_gc () =
+	Gc.set { (Gc.get ()) with Gc.max_overhead = !Define.gc_max_overhead }
+
+
 let _ =
 	let cf = do_argv in
 	let pidfile =
@@ -274,6 +336,8 @@ let _ =
 			default_pidfile
 		in
 
+	tweak_gc ();
+
 	(try
 		Unixext.mkdir_rec (Filename.dirname pidfile) 0o755
 	with _ ->
From fd6d9cd3d20e496bdbf3e0a07354f65de0bcf4ae Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Fri, 29 Jul 2022 18:53:29 +0100
Subject: tools/ocaml/libs/xb: hide type of Xb.t
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)

Reported-by: Julien Grall <jgrall@amazon.com>
diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 104d319d7747..8404ddd8a682 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -196,6 +196,9 @@ let peek_output con = Queue.peek con.pkt_out
 let input_len con = Queue.length con.pkt_in
 let has_in_packet con = Queue.length con.pkt_in > 0
 let get_in_packet con = Queue.pop con.pkt_in
+let has_partial_input con = match con.partial_in with
+	| HaveHdr _ -> true
+	| NoHdr (n, _) -> n < Partial.header_size ()
 let has_more_input con =
 	match con.backend with
 	| Fd _         -> false
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 3a00da6cddc1..794e35bb343e 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,13 +66,7 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
-type t = {
-  backend : backend;
-  pkt_in : Packet.t Queue.t;
-  pkt_out : Packet.t Queue.t;
-  mutable partial_in : partial_buf;
-  mutable partial_out : string;
-}
+type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
 val queue : t -> Packet.t -> unit
@@ -97,6 +91,7 @@ val has_output : t -> bool
 val peek_output : t -> Packet.t
 val input_len : t -> int
 val has_in_packet : t -> bool
+val has_partial_input : t -> bool
 val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 65f99ea6f28a..38b47363a173 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -125,9 +125,7 @@ let get_perm con =
 let set_target con target_domid =
 	con.perm <- Perms.Connection.set_target (get_perm con) ~perms:[Perms.READ; Perms.WRITE] target_domid
 
-let is_backend_mmap con = match con.xb.Xenbus.Xb.backend with
-	| Xenbus.Xb.Xenmmap _ -> true
-	| _ -> false
+let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
 let send_reply con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
@@ -280,9 +278,7 @@ let get_transaction con tid =
 
 let do_input con = Xenbus.Xb.input con.xb
 let has_input con = Xenbus.Xb.has_in_packet con.xb
-let has_partial_input con = match con.xb.Xenbus.Xb.partial_in with
-	| HaveHdr _ -> true
-	| NoHdr (n, _) -> n < Xenbus.Partial.header_size ()
+let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
 let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
From f13fe5903361953e4ccf8602b9c8df7e64568d55 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:02 +0100
Subject: tools/ocaml: Change Xb.input to return Packet.t option
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The queue here would only ever hold at most one element.  This will simplify
follow-up patches.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 8404ddd8a682..165fd4a1edf4 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -45,7 +45,6 @@ type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 type t =
 {
 	backend: backend;
-	pkt_in: Packet.t Queue.t;
 	pkt_out: Packet.t Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
@@ -62,7 +61,6 @@ let reconnect t = match t.backend with
 		Xs_ring.close backend.mmap;
 		backend.eventchn_notify ();
 		(* Clear our old connection state *)
-		Queue.clear t.pkt_in;
 		Queue.clear t.pkt_out;
 		t.partial_in <- init_partial_in ();
 		t.partial_out <- ""
@@ -124,7 +122,6 @@ let output con =
 
 (* NB: can throw Reconnect *)
 let input con =
-	let newpacket = ref false in
 	let to_read =
 		match con.partial_in with
 		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
@@ -143,21 +140,19 @@ let input con =
 		if Partial.to_complete partial_pkt = 0 then (
 			let pkt = Packet.of_partialpkt partial_pkt in
 			con.partial_in <- init_partial_in ();
-			Queue.push pkt con.pkt_in;
-			newpacket := true
-		)
+			Some pkt
+		) else None
 	| NoHdr (i, buf)      ->
 		(* we complete the partial header *)
 		if sz > 0 then
 			Bytes.blit b 0 buf (Partial.header_size () - i) sz;
 		con.partial_in <- if sz = i then
-			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf)
-	);
-	!newpacket
+			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf);
+		None
+	)
 
 let newcon backend = {
 	backend = backend;
-	pkt_in = Queue.create ();
 	pkt_out = Queue.create ();
 	partial_in = init_partial_in ();
 	partial_out = "";
@@ -193,9 +188,6 @@ let has_output con = has_new_output con || has_old_output con
 
 let peek_output con = Queue.peek con.pkt_out
 
-let input_len con = Queue.length con.pkt_in
-let has_in_packet con = Queue.length con.pkt_in > 0
-let get_in_packet con = Queue.pop con.pkt_in
 let has_partial_input con = match con.partial_in with
 	| HaveHdr _ -> true
 	| NoHdr (n, _) -> n < Partial.header_size ()
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 794e35bb343e..91c682162cea 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -77,7 +77,7 @@ val write_fd : backend_fd -> 'a -> string -> int -> int
 val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
-val input : t -> bool
+val input : t -> Packet.t option
 val newcon : backend -> t
 val open_fd : Unix.file_descr -> t
 val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
@@ -89,10 +89,7 @@ val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
 val peek_output : t -> Packet.t
-val input_len : t -> int
-val has_in_packet : t -> bool
 val has_partial_input : t -> bool
-val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index d982fb24dbb1..451f8b38dbcc 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -94,26 +94,18 @@ let pkt_send con =
 	done
 
 (* receive one packet - can sleep *)
-let pkt_recv con =
-	let workdone = ref false in
-	while not !workdone
-	do
-		workdone := Xb.input con.xb
-	done;
-	Xb.get_in_packet con.xb
+let rec pkt_recv con =
+	match Xb.input con.xb with
+	| Some packet -> packet
+	| None -> pkt_recv con
 
 let pkt_recv_timeout con timeout =
 	let fd = Xb.get_fd con.xb in
 	let r, _, _ = Unix.select [ fd ] [] [] timeout in
 	if r = [] then
 		true, None
-	else (
-		let workdone = Xb.input con.xb in
-		if workdone then
-			false, (Some (Xb.get_in_packet con.xb))
-		else
-			false, None
-	)
+	else
+		false, Xb.input con.xb
 
 let queue_watchevent con data =
 	let ls = split_string ~limit:2 '\000' data in
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 38b47363a173..cc20e047d2b9 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -277,9 +277,7 @@ let get_transaction con tid =
 	Hashtbl.find con.transactions tid
 
 let do_input con = Xenbus.Xb.input con.xb
-let has_input con = Xenbus.Xb.has_in_packet con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
-let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
 let has_output con = Xenbus.Xb.has_output con.xb
@@ -307,7 +305,7 @@ let is_bad con = match con.dom with None -> false | Some dom -> Domain.is_bad_do
    Restrictions below can be relaxed once xenstored learns to dump more
    of its live state in a safe way *)
 let has_extra_connection_data con =
-	let has_in = has_input con || has_partial_input con in
+	let has_in = has_partial_input con in
 	let has_out = has_output con in
 	let has_socket = con.dom = None in
 	let has_nondefault_perms = make_perm con.dom <> con.perm in
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 6a3435c265d3..2d67456a2aa0 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -195,10 +195,9 @@ let parse_live_update args =
 			| _ when Unix.gettimeofday () < t.deadline -> false
 			| l ->
 				warn "timeout reached: have to wait, migrate or shutdown %d domains:" (List.length l);
-				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, in: %b, out: %b, perm: %s"
+				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, out: %b, perm: %s"
 					(Connection.get_domstr con)
 					(Connection.number_of_transactions con)
-					(Connection.has_input con)
 					(Connection.has_output con)
 					(Connection.get_perm con |> Perms.Connection.to_string)
 					) l in
@@ -705,16 +704,17 @@ let do_input store cons doms con =
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
 			info "%s reconnection complete" (Connection.get_domstr con);
-			false
+			None
 		| Failure exp ->
 			error "caught exception %s" exp;
 			error "got a bad client %s" (sprintf "%-8s" (Connection.get_domstr con));
 			Connection.mark_as_bad con;
-			false
+			None
 	in
 
-	if newpacket then (
-		let packet = Connection.pop_in con in
+	match newpacket with
+	| None -> ()
+	| Some packet ->
 		let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
 		let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; Packet.data=data} in
 
@@ -724,8 +724,7 @@ let do_input store cons doms con =
 		         (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
 		process_packet ~store ~cons ~doms ~con ~req;
 		write_access_log ~ty ~tid ~con:(Connection.get_domstr con) ~data;
-		Connection.incr_ops con;
-	)
+		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
 	if Connection.has_output con then (
From 2440a8b69a118fe14e73eb6cab4a050922866f1a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:03 +0100
Subject: tools/ocaml/xb: Add BoundedQueue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ensures we cannot store more than [capacity] elements in a [Queue].  Replacing
every Queue with this module will then ensure at compile time that all queues
are bounds-checked.

Each element in the queue has a class with its own limits.  This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
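
As an illustration only (not part of the patch), the per-class limits can be
sketched with a simplified stand-in for BoundedQueue.  The names `cls`, `bq`,
`pkt` and `demo` are hypothetical and chosen for this sketch; the real module
is in the diff below.

```ocaml
(* Simplified stand-in for BoundedQueue: an overall capacity plus a
   per-class limit.  Illustration only, not the code from the patch. *)
type cls = Reply | Event

type 'a bq = {
  q : 'a Queue.t;
  capacity : int;
  classify : 'a -> cls;
  limit : cls -> int;
  counts : (cls, int) Hashtbl.t;
}

let create ~capacity ~classify ~limit =
  { q = Queue.create (); capacity; classify; limit; counts = Hashtbl.create 3 }

let count t c = try Hashtbl.find t.counts c with Not_found -> 0

(* Returns [Some ()] if the element was queued, [None] if a limit was hit. *)
let push e t =
  let c = t.classify e in
  let n = count t c in
  if Queue.length t.q < t.capacity && n < t.limit c then begin
    Queue.push e t.q;
    Hashtbl.replace t.counts c (n + 1);
    Some ()
  end else None

(* Hypothetical packet representation, for illustration only. *)
type pkt = R of string | E of string

let demo () =
  let classify = function R _ -> Reply | E _ -> Event in
  let limit = function Reply -> 1 | Event -> 2 in
  let t = create ~capacity:3 ~classify ~limit in
  let r1 = push (E "a") t in
  let r2 = push (E "b") t in
  let r3 = push (E "c") t in   (* the event class is full *)
  let r4 = push (R "ok") t in  (* the reply class still has room *)
  let r5 = push (R "ok2") t in (* the reply class is now full too *)
  [r1; r2; r3; r4; r5]
```

In this sketch a third event is rejected while a reply is still accepted,
which is the property the class limits are meant to provide.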

No functional change.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 165fd4a1edf4..4197a3888a68 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -17,6 +17,98 @@
 module Op = struct include Op end
 module Packet = struct include Packet end
 
+module BoundedQueue : sig
+	type ('a, 'b) t
+
+	(** [create ~capacity ~classify ~limit] creates a queue with maximum [capacity] elements.
+	    This is burst capacity, each element is further classified according to [classify],
+	    and each class can have its own [limit].
+	    [capacity] is enforced as an overall limit.
+	    The [limit] can be dynamic, and can be smaller than the number of elements already queued of that class,
+	    in which case those elements are considered to use "burst capacity".
+	  *)
+	val create: capacity:int -> classify:('a -> 'b) -> limit:('b -> int) -> ('a, 'b) t
+
+	(** [clear q] discards all elements from [q] *)
+	val clear: ('a, 'b) t -> unit
+
+	(** [can_push q cls] is true when [length q < capacity] and class [cls] is below its [limit]. *)
+	val can_push: ('a, 'b) t -> 'b -> bool
+
+	(** [push e q] adds [e] at the end of queue [q] if [can_push] holds for the class of [e], or returns [None]. *)
+	val push: 'a -> ('a, 'b) t -> unit option
+
+	(** [pop q] removes and returns first element in [q], or raises [Queue.Empty]. *)
+	val pop: ('a, 'b) t -> 'a
+
+	(** [peek q] returns the first element in [q], or raises [Queue.Empty].  *)
+	val peek : ('a, 'b) t -> 'a
+
+	(** [length q] returns the current number of elements in [q] *)
+	val length: ('a, 'b) t -> int
+
+	(** [debug string_of_class q] prints queue usage statistics in an unspecified internal format. *)
+	val debug: ('b -> string) -> (_, 'b) t -> string
+end = struct
+	type ('a, 'b) t =
+		{ q: 'a Queue.t
+		; capacity: int
+		; classify: 'a -> 'b
+		; limit: 'b -> int
+		; class_count: ('b, int) Hashtbl.t
+		}
+
+	let create ~capacity ~classify ~limit =
+		{ capacity; q = Queue.create (); classify; limit; class_count = Hashtbl.create 3 }
+
+	let get_count t classification = try Hashtbl.find t.class_count classification with Not_found -> 0
+
+	let can_push_internal t classification class_count =
+		Queue.length t.q < t.capacity && class_count < t.limit classification
+
+	let ok = Some ()
+
+	let push e t =
+		let classification = t.classify e in
+		let class_count = get_count t classification in
+		if can_push_internal t classification class_count then begin
+			Queue.push e t.q;
+			Hashtbl.replace t.class_count classification (class_count + 1);
+			ok
+		end
+		else
+			None
+
+	let can_push t classification =
+		can_push_internal t classification @@ get_count t classification
+
+	let clear t =
+		Queue.clear t.q;
+		Hashtbl.reset t.class_count
+
+	let pop t =
+		let e = Queue.pop t.q in
+		let classification = t.classify e in
+		let () = match get_count t classification - 1 with
+		| 0 -> Hashtbl.remove t.class_count classification (* reduces memusage *)
+		| n -> Hashtbl.replace t.class_count classification n
+		in
+		e
+
+	let peek t = Queue.peek t.q
+	let length t = Queue.length t.q
+
+	let debug string_of_class t =
+		let b = Buffer.create 128 in
+		Printf.bprintf b "BoundedQueue capacity: %d, used: {" t.capacity;
+		Hashtbl.iter (fun packet_class count ->
+			Printf.bprintf b "	%s: %d" (string_of_class packet_class) count
+		) t.class_count;
+		Printf.bprintf b "}";
+		Buffer.contents b
+end
+
+
 exception End_of_file
 exception Eagain
 exception Noent
From bc0f05e6f3a3c93c853ceffd1f6d2022dc30fb77 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:04 +0100
Subject: tools/ocaml: Limit maximum in-flight requests / outstanding replies
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a limit on the number of outstanding reply packets in the xenbus
queue.  This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
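
A minimal sketch of this gating, assuming a hypothetical connection record
with only the fields needed here (`pending_replies`, `max_outstanding` and
`input` are names invented for the illustration, not the types used by
xenstored):

```ocaml
(* Hypothetical connection state: only the fields needed for the sketch. *)
type con = {
  pending_replies : string Queue.t; (* replies not yet read by the client *)
  max_outstanding : int;
  input : string Queue.t;           (* stands in for the request ring *)
}

(* Input may only be processed if the reply is guaranteed to fit. *)
let can_input con = Queue.length con.pending_replies < con.max_outstanding

(* Process at most one request; returns whether a request was handled. *)
let do_input con =
  if can_input con && not (Queue.is_empty con.input) then begin
    let req = Queue.pop con.input in
    Queue.push ("reply to " ^ req) con.pending_replies;
    true
  end else false

let demo () =
  let con =
    { pending_replies = Queue.create (); max_outstanding = 2;
      input = Queue.create () } in
  List.iter (fun r -> Queue.push r con.input) ["a"; "b"; "c"];
  let first = do_input con in
  let second = do_input con in
  let third = do_input con in             (* blocked: two replies unread *)
  ignore (Queue.pop con.pending_replies); (* the client reads one reply *)
  let fourth = do_input con in
  (first, second, third, fourth, Queue.length con.input)
```

The third request stays on the ring until the client has consumed a reply;
nothing is dropped, the request is simply not read yet.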

To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it.  Even though Dom0
is trusted and exempt from quotas, a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.

Therefore the xenbus queue limit must apply to all connections.  Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.

This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring.  However, if it
can cause watch events to be generated by other means (e.g. by Dom0, or
another cooperating guest) and stops reading its own ring, then watch events
would still queue up without limit.

The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events.  In fact, dropping watch events is known to break
some pieces of normal functionality.  This leaves little choice to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.

Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
  deliver

Items are queued up on one end and are sent as far along the pipe as possible:

  source domain -> watch -> xenbus of target -> xenstore ring/socket of target

If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up.  For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.

Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
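
The flush-before-enqueue behaviour can be sketched as follows.  This is an
illustration under simplifying assumptions, not the BoundedPipe module from
the diff: the `pipe` record, `to_ring` and `demo` are hypothetical names, and
the target ring is modelled as a plain queue with room for two items.

```ocaml
(* A bounded sender accepts an item ([Some ()]) or refuses it ([None]). *)
type 'a bounded_sender = 'a -> unit option

type 'a pipe = {
  q : 'a Queue.t;
  capacity : int;
  destination : 'a bounded_sender;
}

let create ~capacity ~destination =
  { q = Queue.create (); capacity; destination }

(* Send buffered items downstream, oldest first, until one is refused. *)
let rec flush t =
  if not (Queue.is_empty t.q) then
    match t.destination (Queue.peek t.q) with
    | None -> ()
    | Some () -> ignore (Queue.pop t.q); flush t

(* Flush first so older items are always delivered before newer ones. *)
let push t item =
  flush t;
  if Queue.length t.q < t.capacity then begin
    Queue.push item t.q;
    flush t;
    Some ()
  end else None

let demo () =
  let ring = Queue.create () in (* hypothetical ring with room for 2 *)
  let to_ring x =
    if Queue.length ring < 2 then (Queue.push x ring; Some ()) else None in
  let watch = create ~capacity:2 ~destination:to_ring in
  let results = List.map (push watch) [1; 2; 3; 4; 5] in
  let buffered = Queue.length watch.q in
  (* the reader consumes the ring, then pending items are flushed in order *)
  Queue.clear ring;
  flush watch;
  let delivered = List.of_seq (Queue.to_seq ring) in
  (results, buffered, delivered)
```

Here items 1 and 2 reach the ring directly, 3 and 4 wait in the pipe, and 5 is
refused, which is the point where back-pressure reaches the source.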

We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue.  This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.

When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).

There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.

Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
  source will not be allowed to queue up more watch events.
* quota-maxoutstanding which is the number of responses not read from the ring:
  once exceeded, no more inputs are processed until all outstanding replies
  are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
  on destination are queued up here, a single command can trigger multiple
  watches (e.g. due to recursion).

The overflow queue currently doesn't have an upper bound: it is difficult to
accurately calculate one, as it depends on whether you are Dom0, how many
watches each path has registered, and how many watch events you can trigger
with a single command (e.g. a commit).  However, these events were already
using memory; this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.

Note that Dom0 is not excluded from these checks.  It is especially important
that Dom0 is not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.

This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.

The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.

Similarly a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well as long as it follows the PV
protocol.  (Note, however, that protection against malicious backends is only
best-effort at the moment.)

This is part of XSA-326 / CVE-2022-42318.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 4197a3888a68..b292ed7a874d 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -134,14 +134,44 @@ type backend = Fd of backend_fd | Xenmmap of backend_mmap
 
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 
+(*
+	separate capacity reservation for replies and watch events:
+	this allows a domain to keep working even when under a constant flood of
+	watch events
+*)
+type capacity = { maxoutstanding: int; maxwatchevents: int }
+
+module Queue = BoundedQueue
+
+type packet_class =
+	| CommandReply
+	| Watchevent
+
+let string_of_packet_class = function
+	| CommandReply -> "command_reply"
+	| Watchevent -> "watch_event"
+
 type t =
 {
 	backend: backend;
-	pkt_out: Packet.t Queue.t;
+	pkt_out: (Packet.t, packet_class) Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
+	capacity: capacity
 }
 
+let to_read con =
+	match con.partial_in with
+		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
+		| NoHdr   (i, _)    -> i
+
+let debug t =
+	Printf.sprintf "XenBus state: partial_in: %d needed, partial_out: %d bytes, pkt_out: %d packets, %s"
+		(to_read t)
+		(String.length t.partial_out)
+		(Queue.length t.pkt_out)
+		(BoundedQueue.debug string_of_packet_class t.pkt_out)
+
 let init_partial_in () = NoHdr
 	(Partial.header_size (), Bytes.make (Partial.header_size()) '\000')
 
@@ -199,7 +229,8 @@ let output con =
 	let s = if String.length con.partial_out > 0 then
 			con.partial_out
 		else if Queue.length con.pkt_out > 0 then
-			Packet.to_string (Queue.pop con.pkt_out)
+			let pkt = Queue.pop con.pkt_out in
+			Packet.to_string pkt
 		else
 			"" in
 	(* send data from s, and save the unsent data to partial_out *)
@@ -212,12 +243,15 @@ let output con =
 	(* after sending one packet, partial is empty *)
 	con.partial_out = ""
 
+(* we can only process an input packet if we're guaranteed to have room
+   to store the response packet *)
+let can_input con = Queue.can_push con.pkt_out CommandReply
+
 (* NB: can throw Reconnect *)
 let input con =
-	let to_read =
-		match con.partial_in with
-		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
-		| NoHdr   (i, _)    -> i in
+	if not (can_input con) then None
+	else
+	let to_read = to_read con in
 
 	(* try to get more data from input stream *)
 	let b = Bytes.make to_read '\000' in
@@ -243,11 +277,22 @@ let input con =
 		None
 	)
 
-let newcon backend = {
+let classify t =
+	match t.Packet.ty with
+	| Op.Watchevent -> Watchevent
+	| _ -> CommandReply
+
+let newcon ~capacity backend =
+	let limit = function
+		| CommandReply -> capacity.maxoutstanding
+		| Watchevent -> capacity.maxwatchevents
+	in
+	{
 	backend = backend;
-	pkt_out = Queue.create ();
+	pkt_out = Queue.create ~capacity:(capacity.maxoutstanding + capacity.maxwatchevents) ~classify ~limit;
 	partial_in = init_partial_in ();
 	partial_out = "";
+	capacity = capacity;
 	}
 
 let open_fd fd = newcon (Fd { fd = fd; })
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 91c682162cea..71b2754ca788 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,10 +66,11 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
+type capacity = { maxoutstanding: int; maxwatchevents: int }
 type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
-val queue : t -> Packet.t -> unit
+val queue : t -> Packet.t -> unit option
 val read_fd : backend_fd -> 'a -> bytes -> int -> int
 val read_mmap : backend_mmap -> 'a -> bytes -> int -> int
 val read : t -> bytes -> int -> int
@@ -78,13 +79,14 @@ val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
 val input : t -> Packet.t option
-val newcon : backend -> t
-val open_fd : Unix.file_descr -> t
-val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
+val newcon : capacity:capacity -> backend -> t
+val open_fd : Unix.file_descr -> capacity:capacity -> t
+val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> capacity:capacity -> t
 val close : t -> unit
 val is_fd : t -> bool
 val is_mmap : t -> bool
 val output_len : t -> int
+val can_input: t -> bool
 val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
@@ -93,3 +95,4 @@ val has_partial_input : t -> bool
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
+val debug: t -> string
diff --git a/tools/ocaml/libs/xs/queueop.ml b/tools/ocaml/libs/xs/queueop.ml
index 9ff5bbd529ce..4e532cdaeacb 100644
--- a/tools/ocaml/libs/xs/queueop.ml
+++ b/tools/ocaml/libs/xs/queueop.ml
@@ -16,9 +16,10 @@
 open Xenbus
 
 let data_concat ls = (String.concat "\000" ls) ^ "\000"
+let queue con pkt = let r = Xb.queue con pkt in assert (r <> None)
 let queue_path ty (tid: int) (path: string) con =
 	let data = data_concat [ path; ] in
-	Xb.queue con (Xb.Packet.create tid 0 ty data)
+	queue con (Xb.Packet.create tid 0 ty data)
 
 (* operations *)
 let directory tid path con = queue_path Xb.Op.Directory tid path con
@@ -27,48 +28,48 @@ let read tid path con = queue_path Xb.Op.Read tid path con
 let getperms tid path con = queue_path Xb.Op.Getperms tid path con
 
 let debug commands con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
 
 let watch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
 
 let unwatch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
 
 let transaction_start con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
 
 let transaction_end tid commit con =
 	let data = data_concat [ (if commit then "T" else "F"); ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
 
 let introduce domid mfn port con =
 	let data = data_concat [ Printf.sprintf "%u" domid;
 	                         Printf.sprintf "%nu" mfn;
 	                         string_of_int port; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
 
 let release domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
 
 let resume domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
 
 let getdomainpath domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
 
 let write tid path value con =
 	let data = path ^ "\000" ^ value (* no NULL at the end *) in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
 
 let mkdir tid path con = queue_path Xb.Op.Mkdir tid path con
 let rm tid path con = queue_path Xb.Op.Rm tid path con
 
 let setperms tid path perms con =
 	let data = data_concat [ path; perms ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index 451f8b38dbcc..cbd17280600c 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -36,8 +36,10 @@ type con = {
 let close con =
 	Xb.close con.xb
 
+let capacity = { Xb.maxoutstanding = 1; maxwatchevents = 0; }
+
 let open_fd fd = {
-	xb = Xb.open_fd fd;
+	xb = Xb.open_fd ~capacity fd;
 	watchevents = Queue.create ();
 }
 
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index cc20e047d2b9..9624a5f9da2c 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -20,12 +20,84 @@ open Stdext
 
 let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *)
 
+type 'a bounded_sender = 'a -> unit option
+(** a bounded sender accepts an ['a] item and returns:
+    None - if there is no room to accept the item
+    Some () -  if it has successfully accepted/sent the item
+ *)
+
+module BoundedPipe : sig
+	type 'a t
+
+	(** [create ~capacity ~destination] creates a bounded pipe with a
+	    local buffer holding at most [capacity] items.  Once the buffer is
+	    full it will not accept further items.  Items from the pipe are
+	    flushed into [destination] as long as it accepts items.  The
+	    destination could be another pipe.
+	 *)
+	val create: capacity:int -> destination:'a bounded_sender -> 'a t
+
+	(** [is_empty t] returns whether the local buffer of [t] is empty. *)
+	val is_empty : _ t -> bool
+
+	(** [length t] the number of items in the internal buffer *)
+	val length: _ t -> int
+
+	(** [flush_pipe t] sends as many items from the local buffer as possible,
+			which could be none. *)
+	val flush_pipe: _ t -> unit
+
+	(** [push t item] tries to [flush_pipe] and then push [item]
+	    into the pipe if its [capacity] allows.
+	    Returns [None] if there is no more room
+	 *)
+	val push : 'a t -> 'a bounded_sender
+end = struct
+	(* items are enqueued in [q], and then flushed to [destination] *)
+	type 'a t =
+		{ q: 'a Queue.t
+		; destination: 'a bounded_sender
+		; capacity: int
+		}
+
+	let create ~capacity ~destination =
+		{ q = Queue.create (); capacity; destination }
+
+	let rec flush_pipe t =
+		if not Queue.(is_empty t.q) then
+			let item = Queue.peek t.q in
+			match t.destination item with
+			| None -> () (* no room *)
+			| Some () ->
+				(* successfully sent item to next stage *)
+				let _ = Queue.pop t.q in
+				(* continue trying to send more items *)
+				flush_pipe t
+
+	let push t item =
+		(* first try to flush as many items from this pipe as possible to make room,
+		   it is important to do this first to preserve the order of the items
+		 *)
+		flush_pipe t;
+		if Queue.length t.q < t.capacity then begin
+			(* enqueue, instead of sending directly.
+			   this ensures that [destination] sees the items in the same order as we receive them
+			 *)
+			Queue.push item t.q;
+			Some (flush_pipe t)
+		end else None
+
+	let is_empty t = Queue.is_empty t.q
+	let length t = Queue.length t.q
+end
+
 type watch = {
 	con: t;
 	token: string;
 	path: string;
 	base: string;
 	is_relative: bool;
+	pending_watchevents: Xenbus.Xb.Packet.t BoundedPipe.t;
 }
 
 and t = {
@@ -38,8 +110,36 @@ and t = {
 	anonid: int;
 	mutable stat_nb_ops: int;
 	mutable perm: Perms.Connection.t;
+	pending_source_watchevents: (watch * Xenbus.Xb.Packet.t) BoundedPipe.t
 }
 
+module Watch = struct
+	module T = struct
+		type t = watch
+
+		let compare w1 w2 =
+			(* cannot compare watches from different connections *)
+			assert (w1.con == w2.con);
+			match String.compare w1.token w2.token with
+			| 0 -> String.compare w1.path w2.path
+			| n -> n
+	end
+	module Set = Set.Make(T)
+
+	let flush_events t =
+		BoundedPipe.flush_pipe t.pending_watchevents;
+		not (BoundedPipe.is_empty t.pending_watchevents)
+
+	let pending_watchevents t =
+		BoundedPipe.length t.pending_watchevents
+end
+
+let source_flush_watchevents t =
+	BoundedPipe.flush_pipe t.pending_source_watchevents
+
+let source_pending_watchevents t =
+	BoundedPipe.length t.pending_source_watchevents
+
 let mark_as_bad con =
 	match con.dom with
 	|None -> ()
@@ -67,7 +167,8 @@ let watch_create ~con ~path ~token = {
 	token = token;
 	path = path;
 	base = get_path con;
-	is_relative = path.[0] <> '/' && path.[0] <> '@'
+	is_relative = path.[0] <> '/' && path.[0] <> '@';
+	pending_watchevents = BoundedPipe.create ~capacity:!Define.maxwatchevents ~destination:(Xenbus.Xb.queue con.xb)
 }
 
 let get_con w = w.con
@@ -93,6 +194,9 @@ let make_perm dom =
 	Perms.Connection.create ~perms:[Perms.READ; Perms.WRITE] domid
 
 let create xbcon dom =
+	let destination (watch, pkt) =
+		BoundedPipe.push watch.pending_watchevents pkt
+	in
 	let id =
 		match dom with
 		| None -> let old = !anon_id_next in incr anon_id_next; old
@@ -109,6 +213,16 @@ let create xbcon dom =
 	anonid = id;
 	stat_nb_ops = 0;
 	perm = make_perm dom;
+
+	(* the actual capacity will be lower, this is used as an overflow
+	   buffer: anything that doesn't fit elsewhere gets put here, only
+	   limited by the amount of watches that you can generate with a
+	   single xenstore command (which is finite, although possibly very
+	   large in theory for Dom0).  Once the pipe here has any contents the
+	   domain is blocked from sending more commands until it is empty
+	   again though.
+	 *)
+	pending_source_watchevents = BoundedPipe.create ~capacity:Sys.max_array_length ~destination
 	}
 	in
 	Logging.new_connection ~tid:Transaction.none ~con:(get_domstr con);
@@ -127,11 +241,17 @@ let set_target con target_domid =
 
 let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
-let send_reply con tid rid ty data =
+let packet_of con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000")
+		Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000"
 	else
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid ty data)
+		Xenbus.Xb.Packet.create tid rid ty data
+
+let send_reply con tid rid ty data =
+	let result = Xenbus.Xb.queue con.xb (packet_of con tid rid ty data) in
+	(* should never happen: we only process an input packet when there is room for an output packet *)
+	(* and the limit for replies is different from the limit for watch events *)
+	assert (result <> None)
 
 let send_error con tid rid err = send_reply con tid rid Xenbus.Xb.Op.Error (err ^ "\000")
 let send_ack con tid rid ty = send_reply con tid rid ty "OK\000"
@@ -181,11 +301,11 @@ let del_watch con path token =
 	apath, w
 
 let del_watches con =
-  Hashtbl.clear con.watches;
+  Hashtbl.reset con.watches;
   con.nb_watches <- 0
 
 let del_transactions con =
-  Hashtbl.clear con.transactions
+  Hashtbl.reset con.transactions
 
 let list_watches con =
 	let ll = Hashtbl.fold
@@ -208,21 +328,29 @@ let lookup_watch_perm path = function
 let lookup_watch_perms oldroot root path =
 	lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root)
 
-let fire_single_watch_unchecked watch =
+let fire_single_watch_unchecked source watch =
 	let data = Utils.join_by_null [watch.path; watch.token; ""] in
-	send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data
+	let pkt = packet_of watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data in
+
+	match BoundedPipe.push source.pending_source_watchevents (watch, pkt) with
+	| Some () -> () (* packet queued *)
+	| None ->
+			(* a well behaved Dom0 shouldn't be able to trigger this,
+			   if it happens it is likely a Dom0 bug causing runaway memory usage
+			 *)
+			failwith "watch event overflow, cannot happen"
 
-let fire_single_watch (oldroot, root) watch =
+let fire_single_watch source (oldroot, root) watch =
 	let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in
 	let perms = lookup_watch_perms oldroot root abspath in
 	if Perms.can_fire_watch watch.con.perm perms then
-		fire_single_watch_unchecked watch
+		fire_single_watch_unchecked source watch
 	else
 		let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in
 		let con = get_domstr watch.con in
 		Logging.watch_not_fired ~con perms (Store.Path.to_string abspath)
 
-let fire_watch roots watch path =
+let fire_watch source roots watch path =
 	let new_path =
 		if watch.is_relative && path.[0] = '/'
 		then begin
@@ -232,7 +360,7 @@ let fire_watch roots watch path =
 		end else
 			path
 	in
-	fire_single_watch roots { watch with path = new_path }
+	fire_single_watch source roots { watch with path = new_path }
 
 (* Search for a valid unused transaction id. *)
 let rec valid_transaction_id con proposed_id =
@@ -280,6 +408,7 @@ let do_input con = Xenbus.Xb.input con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
+let can_input con = Xenbus.Xb.can_input con.xb && BoundedPipe.is_empty con.pending_source_watchevents
 let has_output con = Xenbus.Xb.has_output con.xb
 let has_old_output con = Xenbus.Xb.has_old_output con.xb
 let has_new_output con = Xenbus.Xb.has_new_output con.xb
@@ -323,7 +452,7 @@ let prevents_live_update con = not (is_bad con)
 	&& (has_extra_connection_data con || has_transaction_data con)
 
 let has_more_work con =
-	has_more_input con || not (has_old_output con) && has_new_output con
+	(has_more_input con && can_input con) || not (has_old_output con) && has_new_output con
 
 let incr_ops con = con.stat_nb_ops <- con.stat_nb_ops + 1
 
diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml
index 3c7429fe7f61..7d68c583b43a 100644
--- a/tools/ocaml/xenstored/connections.ml
+++ b/tools/ocaml/xenstored/connections.ml
@@ -22,22 +22,30 @@ type t = {
 	domains: (int, Connection.t) Hashtbl.t;
 	ports: (Xeneventchn.t, Connection.t) Hashtbl.t;
 	mutable watches: Connection.watch list Trie.t;
+	mutable has_pending_watchevents: Connection.Watch.Set.t
 }
 
 let create () = {
 	anonymous = Hashtbl.create 37;
 	domains = Hashtbl.create 37;
 	ports = Hashtbl.create 37;
-	watches = Trie.create ()
+	watches = Trie.create ();
+	has_pending_watchevents = Connection.Watch.Set.empty;
 }
 
+let get_capacity () =
+	(* not multiplied by maxwatch on purpose: 2nd queue in watch itself! *)
+	{ Xenbus.Xb.maxoutstanding = !Define.maxoutstanding; maxwatchevents = !Define.maxwatchevents }
+
 let add_anonymous cons fd =
-	let xbcon = Xenbus.Xb.open_fd fd in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_fd fd ~capacity in
 	let con = Connection.create xbcon None in
 	Hashtbl.add cons.anonymous (Xenbus.Xb.get_fd xbcon) con
 
 let add_domain cons dom =
-	let xbcon = Xenbus.Xb.open_mmap (Domain.get_interface dom) (fun () -> Domain.notify dom) in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_mmap ~capacity (Domain.get_interface dom) (fun () -> Domain.notify dom) in
 	let con = Connection.create xbcon (Some dom) in
 	Hashtbl.add cons.domains (Domain.get_id dom) con;
 	match Domain.get_port dom with
@@ -48,7 +56,9 @@ let select ?(only_if = (fun _ -> true)) cons =
 	Hashtbl.fold (fun _ con (ins, outs) ->
 		if (only_if con) then (
 			let fd = Connection.get_fd con in
-			(fd :: ins,  if Connection.has_output con then fd :: outs else outs)
+			let in_fds = if Connection.can_input con then fd :: ins else ins in
+			let out_fds = if Connection.has_output con then fd :: outs else outs in
+			in_fds, out_fds
 		) else (ins, outs)
 	)
 	cons.anonymous ([], [])
@@ -67,10 +77,17 @@ let del_watches_of_con con watches =
 	| [] -> None
 	| ws -> Some ws
 
+let del_watches cons con =
+	Connection.del_watches con;
+	cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter @@ fun w ->
+		Connection.get_con w != con
+
 let del_anonymous cons con =
 	try
 		Hashtbl.remove cons.anonymous (Connection.get_fd con);
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del anonymous %s" (Printexc.to_string exn)
@@ -85,7 +102,7 @@ let del_domain cons id =
 		    | Some p -> Hashtbl.remove cons.ports p
 		    | None -> ())
 		 | None -> ());
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del domain %u: %s" id (Printexc.to_string exn)
@@ -136,31 +153,33 @@ let del_watch cons con path token =
 		cons.watches <- Trie.set cons.watches key watches;
  	watch
 
-let del_watches cons con =
-	Connection.del_watches con;
-	cons.watches <- Trie.map (del_watches_of_con con) cons.watches
-
 (* path is absolute *)
-let fire_watches ?oldroot root cons path recurse =
+let fire_watches ?oldroot source root cons path recurse =
 	let key = key_of_path path in
 	let path = Store.Path.to_string path in
 	let roots = oldroot, root in
 	let fire_watch _ = function
 		| None         -> ()
-		| Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches
+		| Some watches -> List.iter (fun w -> Connection.fire_watch source roots w path) watches
 	in
 	let fire_rec _x = function
 		| None         -> ()
 		| Some watches ->
-			List.iter (Connection.fire_single_watch roots) watches
+			List.iter (Connection.fire_single_watch source roots) watches
 	in
 	Trie.iter_path fire_watch cons.watches key;
 	if recurse then
 		Trie.iter fire_rec (Trie.sub cons.watches key)
 
+let send_watchevents cons con =
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter Connection.Watch.flush_events;
+	Connection.source_flush_watchevents con
+
 let fire_spec_watches root cons specpath =
+	let source = find_domain cons 0 in
 	iter cons (fun con ->
-		List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath))
+		List.iter (Connection.fire_single_watch source (None, root)) (Connection.get_watches con specpath))
 
 let set_target cons domain target_domain =
 	let con = find_domain cons domain in
@@ -197,6 +216,16 @@ let debug cons =
 	let domains = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.domains [] in
 	String.concat "" (domains @ anonymous)
 
+let debug_watchevents cons con =
+	(* == (physical equality)
+	   has to be used here because w.con.xb.backend might contain a [unit->unit] value causing regular
+	   comparison to fail due to having a 'functional value' which cannot be compared.
+	 *)
+	let s = cons.has_pending_watchevents |> Connection.Watch.Set.filter (fun w -> w.con == con) in
+	let pending = s |> Connection.Watch.Set.elements
+		|> List.map (fun w -> Connection.Watch.pending_watchevents w) |> List.fold_left (+) 0 in
+	Printf.sprintf "Watches with pending events: %d, pending events total: %d" (Connection.Watch.Set.cardinal s) pending
+
 let filter ~f cons =
 	let fold _ v acc = if f v then v :: acc else acc in
 	[]
diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ba63a8147e09..327b6d795ec7 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -24,6 +24,13 @@ let default_config_dir = Paths.xen_config_dir
 let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
+let maxoutstanding = ref (1024) (* maximum outstanding requests, i.e. in-flight requests / domain *)
+let maxwatchevents = ref (1024)
+(*
+	maximum outstanding watch events per watch,
+	recommended >= maxoutstanding to avoid blocking backend transactions due to
+	malicious frontends
+ *)
 
 let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in
index 4ae48e42d47d..9d034e744b4b 100644
--- a/tools/ocaml/xenstored/oxenstored.conf.in
+++ b/tools/ocaml/xenstored/oxenstored.conf.in
@@ -62,6 +62,8 @@ quota-maxwatch = 100
 quota-transaction = 10
 quota-maxrequests = 1024
 quota-path-max = 1024
+quota-maxoutstanding = 1024
+quota-maxwatchevents = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 2d67456a2aa0..6dcedfda86e4 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -57,7 +57,7 @@ let split_one_path data con =
 	| path :: "" :: [] -> Store.Path.create path (Connection.get_path con)
 	| _                -> raise Invalid_Cmd_Args
 
-let process_watch t cons =
+let process_watch source t cons =
 	let oldroot = t.Transaction.oldroot in
 	let newroot = Store.get_root t.store in
 	let ops = Transaction.get_paths t |> List.rev in
@@ -67,8 +67,9 @@ let process_watch t cons =
 		| Xenbus.Xb.Op.Rm       -> true, None, oldroot
 		| Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot
 		| _              -> raise (Failure "huh ?") in
-		Connections.fire_watches ?oldroot root cons (snd op) recurse in
-	List.iter (fun op -> do_op_watch op cons) ops
+		Connections.fire_watches ?oldroot source root cons (snd op) recurse in
+	List.iter (fun op -> do_op_watch op cons) ops;
+	Connections.send_watchevents cons source
 
 let create_implicit_path t perm path =
 	let dirname = Store.Path.get_parent path in
@@ -234,6 +235,20 @@ let do_debug con t _domains cons data =
 	| "watches" :: _ ->
 		let watches = Connections.debug cons in
 		Some (watches ^ "\000")
+	| "xenbus" :: domid :: _ ->
+		let domid = int_of_string domid in
+		let con = Connections.find_domain cons domid in
+		let s = Printf.sprintf "xenbus: %s; overflow queue length: %d, can_input: %b, has_more_input: %b, has_old_output: %b, has_new_output: %b, has_more_work: %b. pending: %s"
+			(Xenbus.Xb.debug con.xb)
+			(Connection.source_pending_watchevents con)
+			(Connection.can_input con)
+			(Connection.has_more_input con)
+			(Connection.has_old_output con)
+			(Connection.has_new_output con)
+			(Connection.has_more_work con)
+			(Connections.debug_watchevents cons con)
+		in
+		Some s
 	| "mfn" :: domid :: _ ->
 		let domid = int_of_string domid in
 		let con = Connections.find_domain cons domid in
@@ -342,7 +357,7 @@ let reply_ack fct con t doms cons data =
 	fct con t doms cons data;
 	Packet.Ack (fun () ->
 		if Transaction.get_id t = Transaction.none then
-			process_watch t cons
+			process_watch con t cons
 	)
 
 let reply_data fct con t doms cons data =
@@ -501,7 +516,7 @@ let do_watch con t _domains cons data =
 	Packet.Ack (fun () ->
 		(* xenstore.txt says this watch is fired immediately,
 		   implying even if path doesn't exist or is unreadable *)
-		Connection.fire_single_watch_unchecked watch)
+		Connection.fire_single_watch_unchecked con watch)
 
 let do_unwatch con _t _domains cons data =
 	let (node, token) =
@@ -532,7 +547,7 @@ let do_transaction_end con t domains cons data =
 	if not success then
 		raise Transaction_again;
 	if commit then begin
-		process_watch t cons;
+		process_watch con t cons;
 		match t.Transaction.ty with
 		| Transaction.No ->
 			() (* no need to record anything *)
@@ -699,7 +714,8 @@ let process_packet ~store ~cons ~doms ~con ~req =
 let do_input store cons doms con =
 	let newpacket =
 		try
-			Connection.do_input con
+			if Connection.can_input con then Connection.do_input con
+			else None
 		with Xenbus.Xb.Reconnect ->
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
@@ -727,6 +743,7 @@ let do_input store cons doms con =
 		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
+	Connection.source_flush_watchevents con;
 	if Connection.has_output con then (
 		if Connection.has_new_output con then (
 			let packet = Connection.peek_output con in
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 3b57ad016dfb..c799e20f1145 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -103,6 +103,8 @@ let parse_config filename =
 		("quota-maxentity", Config.Set_int Quota.maxent);
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
+		("quota-maxoutstanding", Config.Set_int Define.maxoutstanding);
+		("quota-maxwatchevents", Config.Set_int Define.maxwatchevents);
 		("quota-path-max", Config.Set_int Define.path_max);
 		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
From 09aa10649f75a262028e9a9b7d859ef7efb23d54 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 29 Sep 2022 13:07:35 +0200
Subject: SUPPORT.md: clarify support of untrusted driver domains with
 oxenstored

Add a statement on the scope of support for the different Xenstore
variants. In particular, oxenstored does not (yet) have security
support for untrusted driver domains, as those might drive oxenstored
out of memory by creating lots of watch events for the guests they are
servicing.

Add a statement regarding Live Update support of oxenstored.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index 85726102eab8..7d0cb34c8f6f 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -179,13 +179,18 @@ Support for running qemu-xen device model in a linux stubdomain.
 
     Status: Tech Preview
 
-## Liveupdate of C xenstored daemon
+## Xenstore
 
-    Status: Tech Preview
+### C xenstored daemon
 
-## Liveupdate of OCaml xenstored daemon
+    Status: Supported
+    Status, Liveupdate: Tech Preview
 
-    Status: Tech Preview
+### OCaml xenstored daemon
+
+    Status: Supported
+    Status, untrusted driver domains: Supported, not security supported
+    Status, Liveupdate: Not functional
 
 ## Toolstack/3rd party
 
From 5192f13a41661b1c1b9e0889d57c0f5b41925c39 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: split up send_reply()

Today send_reply() is used for both normal request replies and watch
events.

Split it up into send_reply() and send_event(). This will be used to
add some event specific handling.

add_event() can be merged into send_event(), removing the need for an
intermediate memory allocation.
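
As an illustration of the event payload layout ("path\0token\0") that
send_event() builds directly in the output buffer, here is a minimal
sketch. pack_watch_event() and SKETCH_PAYLOAD_MAX are hypothetical names
used only for this example; the limit is assumed to stand in for
XENSTORE_PAYLOAD_MAX and is not taken from the Xen headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed stand-in for XENSTORE_PAYLOAD_MAX (illustration only). */
#define SKETCH_PAYLOAD_MAX 4096

/*
 * Hypothetical helper, not part of the patch: pack "path\0token\0"
 * into buf. Returns the payload length, or 0 if the event would be
 * over-long or would not fit into buf.
 */
static size_t pack_watch_event(char *buf, size_t buflen,
                               const char *path, const char *token)
{
    size_t plen, tlen, len;

    assert(buf && path && token);

    plen = strlen(path) + 1;   /* path including its NUL */
    tlen = strlen(token) + 1;  /* token including its NUL */
    len = plen + tlen;

    /* Like send_event(): silently refuse over-long events. */
    if (len > SKETCH_PAYLOAD_MAX || len > buflen)
        return 0;

    memcpy(buf, path, plen);
    memcpy(buf + plen, token, tlen);
    return len;
}
```

Writing both strings straight into the destination buffer is what makes
the intermediate allocation of the old add_event() unnecessary.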

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index e9c9695fd16e..249ad5ec6fb1 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -767,49 +767,32 @@ static void send_error(struct connection *conn, int error)
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata = conn->in;
+
+	assert(type != XS_WATCH_EVENT);
 
 	if ( len > XENSTORE_PAYLOAD_MAX ) {
 		send_error(conn, E2BIG);
 		return;
 	}
 
-	/* Replies reuse the request buffer, events need a new one. */
-	if (type != XS_WATCH_EVENT) {
-		bdata = conn->in;
-		/* Drop asynchronous responses, e.g. errors for watch events. */
-		if (!bdata)
-			return;
-		bdata->inhdr = true;
-		bdata->used = 0;
-		conn->in = NULL;
-	} else {
-		/* Message is a child of the connection for auto-cleanup. */
-		bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+	bdata->inhdr = true;
+	bdata->used = 0;
 
-		/*
-		 * Allocation failure here is unfortunate: we have no way to
-		 * tell anybody about it.
-		 */
-		if (!bdata)
-			return;
-	}
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
-	else
+	else {
 		bdata->buffer = talloc_array(bdata, char, len);
-	if (!bdata->buffer) {
-		if (type == XS_WATCH_EVENT) {
-			/* Same as above: no way to tell someone. */
-			talloc_free(bdata);
+		if (!bdata->buffer) {
+			send_error(conn, ENOMEM);
 			return;
 		}
-		/* re-establish request buffer for sending ENOMEM. */
-		conn->in = bdata;
-		send_error(conn, ENOMEM);
-		return;
 	}
 
+	conn->in = NULL;
+
 	/* Update relevant header fields and fill in the message body. */
 	bdata->hdr.msg.type = type;
 	bdata->hdr.msg.len = len;
@@ -817,8 +800,39 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+}
 
-	return;
+/*
+ * Send a watch event.
+ * As this is not directly related to the current command, errors can't be
+ * reported.
+ */
+void send_event(struct connection *conn, const char *path, const char *token)
+{
+	struct buffered_data *bdata;
+	unsigned int len;
+
+	len = strlen(path) + 1 + strlen(token) + 1;
+	/* Don't try to send over-long events. */
+	if (len > XENSTORE_PAYLOAD_MAX)
+		return;
+
+	bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+
+	bdata->buffer = talloc_array(bdata, char, len);
+	if (!bdata->buffer) {
+		talloc_free(bdata);
+		return;
+	}
+	strcpy(bdata->buffer, path);
+	strcpy(bdata->buffer + strlen(path) + 1, token);
+	bdata->hdr.msg.type = XS_WATCH_EVENT;
+	bdata->hdr.msg.len = len;
+
+	/* Queue for later transmission. */
+	list_add_tail(&bdata->list, &conn->out_list);
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 0004fa848c83..9af9af4390bd 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -187,6 +187,7 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
+void send_event(struct connection *conn, const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index aca0a71bada1..99a2c266b28a 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -86,35 +86,6 @@ static const char *get_watch_path(const struct watch *watch, const char *name)
 }
 
 /*
- * Send a watch event.
- * Temporary memory allocations are done with ctx.
- */
-static void add_event(struct connection *conn,
-		      const void *ctx,
-		      struct watch *watch,
-		      const char *name)
-{
-	/* Data to send (node\0token\0). */
-	unsigned int len;
-	char *data;
-
-	name = get_watch_path(watch, name);
-
-	len = strlen(name) + 1 + strlen(watch->token) + 1;
-	/* Don't try to send over-long events. */
-	if (len > XENSTORE_PAYLOAD_MAX)
-		return;
-
-	data = talloc_array(ctx, char, len);
-	if (!data)
-		return;
-	strcpy(data, name);
-	strcpy(data + strlen(name) + 1, watch->token);
-	send_reply(conn, XS_WATCH_EVENT, data, len);
-	talloc_free(data);
-}
-
-/*
  * Check permissions of a specific watch to fire:
  * Either the node itself or its parent have to be readable by the connection
  * the watch has been setup for. In case a watch event is created due to
@@ -190,10 +161,14 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			}
 		}
 	}
@@ -292,7 +267,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	send_ack(conn, XS_WATCH);
 
 	/* We fire once up front: simplifies clients and restart. */
-	add_event(conn, in, watch, watch->node);
+	send_event(conn, get_watch_path(watch, watch->node), watch->token);
 
 	return 0;
 }
From 0a4c86f8a8febd85610496470123adfc4fbc1c5d Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: add helpers to free struct buffered_data

Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.

This avoids duplicated code and will help later, when freeing a
struct buffered_data requires more actions.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 249ad5ec6fb1..527a1ebdeded 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -211,6 +211,21 @@ void reopen_log(void)
 	}
 }
 
+static void free_buffered_data(struct buffered_data *out,
+			       struct connection *conn)
+{
+	list_del(&out->list);
+	talloc_free(out);
+}
+
+void conn_free_buffered_data(struct connection *conn)
+{
+	struct buffered_data *out;
+
+	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
+		free_buffered_data(out, conn);
+}
+
 static bool write_messages(struct connection *conn)
 {
 	int ret;
@@ -254,8 +269,7 @@ static bool write_messages(struct connection *conn)
 
 	trace_io(conn, out, 1);
 
-	list_del(&out->list);
-	talloc_free(out);
+	free_buffered_data(out, conn);
 
 	return true;
 }
@@ -1506,18 +1520,12 @@ static struct {
  */
 void ignore_connection(struct connection *conn)
 {
-	struct buffered_data *out, *tmp;
-
 	trace("CONN %p ignored\n", conn);
 
 	conn->is_ignored = true;
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 	conn->in = NULL;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 9af9af4390bd..e7ee87825c3b 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -276,6 +276,8 @@ int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
 
+void conn_free_buffered_data(struct connection *conn);
+
 const char *dump_state_global(FILE *fp);
 const char *dump_state_buffered_data(FILE *fp, const struct connection *c,
 				     struct xs_state_connection *sc);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index d03c7d93a9e7..93c4c1edcdd1 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -411,15 +411,10 @@ static struct domain *find_domain_by_domid(unsigned int domid)
 static void domain_conn_reset(struct domain *domain)
 {
 	struct connection *conn = domain->conn;
-	struct buffered_data *out;
 
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	while ((out = list_top(&conn->out_list, struct buffered_data, list))) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 
From a6c4198242bf69bea1825492b7665b559023390c Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: reduce number of watch events

When a watched node is removed outside of a transaction, two watch
events are produced instead of just one.

When a transaction is finalized, watch events can be generated for each
modified node, even if the same modifications outside a transaction
would not have resulted in a watch event.

This happens e.g.:

- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
  is implicitly creating a/b, resulting in watch events for a, a/b and
  a/b/c instead of a/b/c only)

Avoid these additional watch events in order to reduce the memory
Xenstore needs for queueing them.

This is achieved by adding event flags to struct accessed_node that
specify whether an event should be triggered and whether it should
be an exact match of the modified path. Both flags are now set
explicitly via fire_watches() instead of only being implied.
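
The way repeated modifications of the same node combine their flags can
be sketched as follows. This is a simplified illustration of the rule
"a non-exact event replaces an exact one, but not the other way round";
struct event_flags and merge_event() are hypothetical names for this
example, and the lookup of the accessed node in the transaction is left
out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced form of the two flags kept per accessed node. */
struct event_flags {
    bool fire_watch;   /* an event is to be fired at commit time */
    bool watch_exact;  /* only watches on exactly this path match */
};

/* Record one more modification of the node in the flags. */
static void merge_event(struct event_flags *f, bool watch_exact)
{
    assert(f);

    if (!f->fire_watch) {
        /* First event for this node: take the request as it is. */
        f->fire_watch = true;
        f->watch_exact = watch_exact;
    } else if (!watch_exact) {
        /* A non-exact event overrides an earlier exact one. */
        f->watch_exact = false;
    }
    /* An exact event never narrows an earlier non-exact one. */
}
```

With this rule each node carries at most one pending event per
transaction, regardless of how often it is modified.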

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 527a1ebdeded..bf2243873901 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1295,7 +1295,7 @@ static void delete_child(struct connection *conn,
 }
 
 static int delete_node(struct connection *conn, const void *ctx,
-		       struct node *parent, struct node *node)
+		       struct node *parent, struct node *node, bool watch_exact)
 {
 	char *name;
 
@@ -1307,7 +1307,7 @@ static int delete_node(struct connection *conn, const void *ctx,
 				       node->children);
 		child = name ? read_node(conn, node, name) : NULL;
 		if (child) {
-			if (delete_node(conn, ctx, node, child))
+			if (delete_node(conn, ctx, node, child, true))
 				return errno;
 		} else {
 			trace("delete_node: Error deleting child '%s/%s'!\n",
@@ -1319,7 +1319,12 @@ static int delete_node(struct connection *conn, const void *ctx,
 		talloc_free(name);
 	}
 
-	fire_watches(conn, ctx, node->name, node, true, NULL);
+	/*
+	 * Fire the watches now, when we can still see the node permissions.
+	 * This fine as we are single threaded and the next possible read will
+	 * be handled only after the node has been really removed.
+	 */
+	fire_watches(conn, ctx, node->name, node, watch_exact, NULL);
 	delete_node_single(conn, node);
 	delete_child(conn, parent, basename(node->name));
 	talloc_free(node);
@@ -1345,13 +1350,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 		return (errno == ENOMEM) ? ENOMEM : EINVAL;
 	node->parent = parent;
 
-	/*
-	 * Fire the watches now, when we can still see the node permissions.
-	 * This fine as we are single threaded and the next possible read will
-	 * be handled only after the node has been really removed.
-	 */
-	fire_watches(conn, ctx, name, node, false, NULL);
-	return delete_node(conn, ctx, parent, node);
+	return delete_node(conn, ctx, parent, node, false);
 }
 
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index faf6c930e42a..54432907fc76 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -130,6 +130,10 @@ struct accessed_node
 
 	/* Transaction node in data base? */
 	bool ta_node;
+
+	/* Watch event flags. */
+	bool fire_watch;
+	bool watch_exact;
 };
 
 struct changed_domain
@@ -324,6 +328,29 @@ int access_node(struct connection *conn, struct node *node,
 }
 
 /*
+ * A watch event should be fired for a node modified inside a transaction.
+ * Set the corresponding information. A non-exact event is replacing an exact
+ * one, but not the other way round.
+ */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact)
+{
+	struct accessed_node *i;
+
+	i = find_accessed_node(conn->transaction, name);
+	if (!i) {
+		conn->transaction->fail = true;
+		return;
+	}
+
+	if (!i->fire_watch) {
+		i->fire_watch = true;
+		i->watch_exact = watch_exact;
+	} else if (!watch_exact) {
+		i->watch_exact = false;
+	}
+}
+
+/*
  * Finalize transaction:
  * Walk through accessed nodes and check generation against global data.
  * If all entries match, read the transaction entries and write them without
@@ -377,15 +404,15 @@ static int finalize_transaction(struct connection *conn,
 				ret = tdb_store(tdb_ctx, key, data,
 						TDB_REPLACE);
 				talloc_free(data.dptr);
-				if (ret)
-					goto err;
-				fire_watches(conn, trans, i->node, NULL, false,
-					     i->perms.p ? &i->perms : NULL);
 			} else {
-				fire_watches(conn, trans, i->node, NULL, false,
+				ret = tdb_delete(tdb_ctx, key);
+			}
+			if (ret)
+				goto err;
+			if (i->fire_watch) {
+				fire_watches(conn, trans, i->node, NULL,
+					     i->watch_exact,
 					     i->perms.p ? &i->perms : NULL);
-				if (tdb_delete(tdb_ctx, key))
-					goto err;
 			}
 		}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 14062730e3c9..0093cac807e3 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -42,6 +42,9 @@ void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 int access_node(struct connection *conn, struct node *node,
                 enum node_access_type type, TDB_DATA *key);
 
+/* Queue watches for a modified node. */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact);
+
 /* Prepend the transaction to name if appropriate. */
 int transaction_prepend(struct connection *conn, const char *name,
                         TDB_DATA *key);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 99a2c266b28a..205d9d8ea116 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -29,6 +29,7 @@
 #include "xenstore_lib.h"
 #include "utils.h"
 #include "xenstored_domain.h"
+#include "xenstored_transaction.h"
 
 extern int quota_nb_watch_per_domain;
 
@@ -143,9 +144,11 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 	struct connection *i;
 	struct watch *watch;
 
-	/* During transactions, don't fire watches. */
-	if (conn && conn->transaction)
+	/* During transactions, don't fire watches, but queue them. */
+	if (conn && conn->transaction) {
+		queue_watches(conn, name, exact);
 		return;
+	}
 
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
From 2feed737530592688382c655680982e10951c1ec Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: let unread watch events time out

A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.

To prevent a malicious guest from blocking other guests by not reading
watch events, add a timeout for watch events. If a watch event hasn't
been consumed when this timeout expires, it is deleted. Set the default
timeout to 20 seconds (an arbitrary value that is not too high).
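
How a per-connection deadline can feed into the poll timeout of the
main loop is sketched below. update_poll_timeout() is a hypothetical
helper for this example only; it assumes a deadline of 0 means "no
timeout armed" and a poll timeout of -1 means "wait forever". Unlike
the patch, it does not drop expired events but merely reports an
expired deadline as a zero timeout.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: shrink *ptimeout (milliseconds, -1 = infinite)
 * so that poll() returns no later than deadline_msec.
 */
static void update_poll_timeout(uint64_t deadline_msec, uint64_t now_msec,
                                int *ptimeout)
{
    uint64_t delta;

    assert(ptimeout);

    /* Assumption: 0 means no watch event with a timeout is queued. */
    if (!deadline_msec)
        return;

    /* Remaining time; an already expired deadline yields 0. */
    delta = deadline_msec > now_msec ? deadline_msec - now_msec : 0;
    if (delta > INT_MAX)
        delta = INT_MAX;

    if (*ptimeout == -1 || (uint64_t)*ptimeout > delta)
        *ptimeout = (int)delta;
}
```

Comparing the deadline against the current time before subtracting
avoids an unsigned wrap-around for deadlines that have already passed.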

To allow other timeout values to be specified in the future, use a
generic command line option for that purpose:

--timeout|-w watch-event=<seconds>

This is part of XSA-326 / CVE-2022-42311.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index bf2243873901..45244c021cd3 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -108,6 +108,8 @@ int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 
+unsigned int timeout_watch_event_msec = 20000;
+
 void trace(const char *fmt, ...)
 {
 	va_list arglist;
@@ -211,19 +213,92 @@ void reopen_log(void)
 	}
 }
 
+static uint64_t get_now_msec(void)
+{
+	struct timespec now_ts;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now_ts))
+		barf_perror("Could not find time (clock_gettime failed)");
+
+	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
+}
+
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
+	struct buffered_data *req;
+
 	list_del(&out->list);
+
+	/*
+	 * Update conn->timeout_msec with the next found timeout value in the
+	 * queued pending requests.
+	 */
+	if (out->timeout_msec) {
+		conn->timeout_msec = 0;
+		list_for_each_entry(req, &conn->out_list, list) {
+			if (req->timeout_msec) {
+				conn->timeout_msec = req->timeout_msec;
+				break;
+			}
+		}
+	}
+
 	talloc_free(out);
 }
 
+static void check_event_timeout(struct connection *conn, uint64_t msecs,
+				int *ptimeout)
+{
+	uint64_t delta;
+	struct buffered_data *out, *tmp;
+
+	if (!conn->timeout_msec)
+		return;
+
+	delta = conn->timeout_msec - msecs;
+	if (conn->timeout_msec <= msecs) {
+		delta = 0;
+		list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
+			/*
+			 * Only look at buffers with timeout and no data
+			 * already written to the ring.
+			 */
+			if (out->timeout_msec && out->inhdr && !out->used) {
+				if (out->timeout_msec > msecs) {
+					conn->timeout_msec = out->timeout_msec;
+					delta = conn->timeout_msec - msecs;
+					break;
+				}
+
+				/*
+				 * Free out without updating conn->timeout_msec,
+				 * as the update is done in this loop already.
+				 */
+				out->timeout_msec = 0;
+				trace("watch event path %s for domain %u timed out\n",
+				      out->buffer, conn->id);
+				free_buffered_data(out, conn);
+			}
+		}
+		if (!delta) {
+			conn->timeout_msec = 0;
+			return;
+		}
+	}
+
+	if (*ptimeout == -1 || *ptimeout > delta)
+		*ptimeout = delta;
+}
+
 void conn_free_buffered_data(struct connection *conn)
 {
 	struct buffered_data *out;
 
 	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
 		free_buffered_data(out, conn);
+
+	conn->timeout_msec = 0;
 }
 
 static bool write_messages(struct connection *conn)
@@ -411,6 +486,7 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 {
 	struct connection *conn;
 	struct wrl_timestampt now;
+	uint64_t msecs;
 
 	if (fds)
 		memset(fds, 0, sizeof(struct pollfd) * current_array_size);
@@ -431,10 +507,12 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 
 	wrl_gettime_now(&now);
 	wrl_log_periodic(now);
+	msecs = get_now_msec();
 
 	list_for_each_entry(conn, &connections, list) {
 		if (conn->domain) {
 			wrl_check_timeout(conn->domain, now, ptimeout);
+			check_event_timeout(conn, msecs, ptimeout);
 			if (conn_can_read(conn) ||
 			    (conn_can_write(conn) &&
 			     !list_empty(&conn->out_list)))
@@ -794,6 +872,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		return;
 	bdata->inhdr = true;
 	bdata->used = 0;
+	bdata->timeout_msec = 0;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -845,6 +924,12 @@ void send_event(struct connection *conn, const char *path, const char *token)
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
 }
@@ -2201,6 +2286,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
+"                          allowed timeout candidates are:\n"
+"                          watch-event: time a watch-event is kept pending\n"
 "  -R, --no-recovery       to request that no recovery should be attempted when\n"
 "                          the store is corrupted (debug only),\n"
 "  -I, --internal-db       store database in memory, not on disk\n"
@@ -2223,6 +2311,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
 	{ "verbose", 0, NULL, 'V' },
@@ -2236,6 +2325,39 @@ int dom0_domid = 0;
 int dom0_event = 0;
 int priv_domid = 0;
 
+static int get_optval_int(const char *arg)
+{
+	char *end;
+	long val;
+
+	val = strtol(arg, &end, 10);
+	if (!*arg || *end || val < 0 || val > INT_MAX)
+		barf("invalid parameter value \"%s\"\n", arg);
+
+	return val;
+}
+
+static bool what_matches(const char *arg, const char *what)
+{
+	unsigned int what_len = strlen(what);
+
+	return !strncmp(arg, what, what_len) && arg[what_len] == '=';
+}
+
+static void set_timeout(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("timeouts must be specified via <what>=<seconds>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "watch-event"))
+		timeout_watch_event_msec = val * 1000;
+	else
+		barf("unknown timeout \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2250,7 +2372,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:U", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2300,6 +2422,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'w':
+			set_timeout(optarg);
+			break;
 		case 'e':
 			dom0_event = strtol(optarg, NULL, 10);
 			break;
@@ -2741,6 +2866,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 		barf("error restoring buffered data");
 
 	memcpy(bdata->buffer, data, len);
+	if (bdata->hdr.msg.type == XS_WATCH_EVENT && timeout_watch_event_msec &&
+	    domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index e7ee87825c3b..8a81fc693f01 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -27,6 +27,7 @@
 #include <fcntl.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <time.h>
 #include <errno.h>
 
 #include "xenstore_lib.h"
@@ -67,6 +68,8 @@ struct buffered_data
 		char raw[sizeof(struct xsd_sockmsg)];
 	} hdr;
 
+	uint64_t timeout_msec;
+
 	/* The actual data. */
 	char *buffer;
 	char default_buffer[DEFAULT_BUFFER_SIZE];
@@ -118,6 +121,7 @@ struct connection
 
 	/* Buffered output data */
 	struct list_head out_list;
+	uint64_t timeout_msec;
 
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
@@ -244,6 +248,8 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 
+extern unsigned int timeout_watch_event_msec;
+
 /* Map the kernel's xenstore page. */
 void *xenbus_map(void);
 void unmap_xenbus(void *interface);
From 2eee122a45eb4a218596b103ce7f0759a824cf2e Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: limit outstanding requests

Add another quota limiting the number of outstanding requests of a
guest. As specifying quotas on the command line is becoming rather
unwieldy, switch to a new scheme using [--quota|-Q] <what>=<val>,
which makes it easy to add more quotas in the future.

Set the default value to 20 (an arbitrary value that seems neither
too high nor too low).

A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
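
The definition of "outstanding" above can be modelled as a small
reference-counting scheme. The following is a simplified sketch, not
the xenstored data structures: struct request, struct guest and all
function names are hypothetical, and delivery is reduced to explicit
calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one guest request. */
struct request {
	unsigned int event_cnt; /* watch events caused, not yet delivered */
	bool reply_pending;     /* direct response not yet fully written */
};

/* Hypothetical model of the per-domain accounting. */
struct guest {
	int nboutstanding;      /* requests counted against the quota */
	int quota;              /* e.g. 20, the default chosen by the patch */
};

/* A request becomes outstanding when its response is queued. */
static void reply_queued(struct guest *g, struct request *req)
{
	req->reply_pending = true;
	g->nboutstanding++;
}

/* Each watch event caused by the request keeps it outstanding. */
static void event_queued(struct request *req)
{
	req->event_cnt++;
}

/* Drop the request from the count once nothing references it anymore. */
static void maybe_complete(struct guest *g, struct request *req)
{
	if (!req->reply_pending && !req->event_cnt) {
		assert(g->nboutstanding > 0);
		g->nboutstanding--;
	}
}

static void reply_delivered(struct guest *g, struct request *req)
{
	assert(req->reply_pending);
	req->reply_pending = false;
	maybe_complete(g, req);
}

static void event_delivered(struct guest *g, struct request *req)
{
	assert(req->event_cnt > 0);
	req->event_cnt--;
	maybe_complete(g, req);
}

/* New requests are only read while the guest is below its quota. */
static bool can_read_request(const struct guest *g)
{
	return g->nboutstanding < g->quota;
}
```

In this model the initial event fired when a watch is registered would
simply never be passed to event_queued(), which corresponds to the
exception mentioned above.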

Note that across a live update the relation to buffered watch events
for other domains is lost.

Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota counters zeroed initially.

This is part of XSA-326 / CVE-2022-42312.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 45244c021cd3..488d540f3a32 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -107,6 +107,7 @@ int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
+int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -223,12 +224,24 @@ static uint64_t get_now_msec(void)
 	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
 }
 
+/*
+ * Remove a struct buffered_data from the list of outgoing data.
+ * A struct buffered_data related to a request having caused watch events to be
+ * sent is kept until all those events have been written out.
+ * Each watch event is referencing the related request via pend.req, while the
+ * number of watch events caused by a request is kept in pend.ref.event_cnt
+ * (those two cases are mutually exclusive, so the two fields can share memory
+ * via a union).
+ * The struct buffered_data is freed only if no related watch event is
+ * referencing it. The related return data can be freed right away.
+ */
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
 	struct buffered_data *req;
 
 	list_del(&out->list);
+	out->on_out_list = false;
 
 	/*
 	 * Update conn->timeout_msec with the next found timeout value in the
@@ -244,6 +257,30 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	if (out->hdr.msg.type == XS_WATCH_EVENT) {
+		req = out->pend.req;
+		if (req) {
+			req->pend.ref.event_cnt--;
+			if (!req->pend.ref.event_cnt && !req->on_out_list) {
+				if (req->on_ref_list) {
+					domain_outstanding_domid_dec(
+						req->pend.ref.domid);
+					list_del(&req->list);
+				}
+				talloc_free(req);
+			}
+		}
+	} else if (out->pend.ref.event_cnt) {
+		/* Detach out from conn. */
+		talloc_steal(NULL, out);
+		if (out->buffer != out->default_buffer)
+			talloc_free(out->buffer);
+		list_add(&out->list, &conn->ref_list);
+		out->on_ref_list = true;
+		return;
+	} else
+		domain_outstanding_dec(conn);
+
 	talloc_free(out);
 }
 
@@ -405,6 +442,7 @@ int delay_request(struct connection *conn, struct buffered_data *in,
 static int destroy_conn(void *_conn)
 {
 	struct connection *conn = _conn;
+	struct buffered_data *req;
 
 	/* Flush outgoing if possible, but don't block. */
 	if (!conn->domain) {
@@ -418,6 +456,11 @@ static int destroy_conn(void *_conn)
 				break;
 		close(conn->fd);
 	}
+
+	conn_free_buffered_data(conn);
+	list_for_each_entry(req, &conn->ref_list, list)
+		req->on_ref_list = false;
+
         if (conn->target)
                 talloc_unlink(conn, conn->target);
 	list_del(&conn->list);
@@ -893,6 +936,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	domain_outstanding_inc(conn);
 }
 
 /*
@@ -900,7 +945,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
  * As this is not directly related to the current command, errors can't be
  * reported.
  */
-void send_event(struct connection *conn, const char *path, const char *token)
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token)
 {
 	struct buffered_data *bdata;
 	unsigned int len;
@@ -930,8 +976,13 @@ void send_event(struct connection *conn, const char *path, const char *token)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->pend.req = req;
+	if (req)
+		req->pend.ref.event_cnt++;
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
@@ -1740,6 +1791,7 @@ static void handle_input(struct connection *conn)
 			return;
 	}
 	in = conn->in;
+	in->pend.ref.domid = conn->id;
 
 	/* Not finished header yet? */
 	if (in->inhdr) {
@@ -1808,6 +1860,7 @@ struct connection *new_connection(const struct interface_funcs *funcs)
 	new->is_stalled = false;
 	new->transaction_started = 0;
 	INIT_LIST_HEAD(&new->out_list);
+	INIT_LIST_HEAD(&new->ref_list);
 	INIT_LIST_HEAD(&new->watches);
 	INIT_LIST_HEAD(&new->transaction_list);
 	INIT_LIST_HEAD(&new->delayed);
@@ -2286,6 +2339,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
+"                          quotas are:\n"
+"                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2311,6 +2367,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "quota", 1, NULL, 'Q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2358,6 +2415,20 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
+static void set_quota(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("quotas must be specified via <what>=<nb>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "outstanding"))
+		quota_req_outstanding = val;
+	else
+		barf("unknown quota \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2372,8 +2443,8 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
-				  NULL)) != -1) {
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
 			no_domain_init = true;
@@ -2422,6 +2493,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'Q':
+			set_quota(optarg);
+			break;
 		case 'w':
 			set_timeout(optarg);
 			break;
@@ -2875,6 +2949,14 @@ static void add_buffered_data(struct buffered_data *bdata,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	/*
+	 * Watch events are never "outstanding", but the request causing them
+	 * is instead kept "outstanding" until all watch events caused by that
+	 * request have been delivered.
+	 */
+	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
+		domain_outstanding_inc(conn);
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 8a81fc693f01..db09f463a657 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -56,6 +56,8 @@ struct xs_state_connection;
 struct buffered_data
 {
 	struct list_head list;
+	bool on_out_list;
+	bool on_ref_list;
 
 	/* Are we still doing the header? */
 	bool inhdr;
@@ -63,6 +65,17 @@ struct buffered_data
 	/* How far are we? */
 	unsigned int used;
 
+	/* Outstanding request accounting. */
+	union {
+		/* ref is being used for requests. */
+		struct {
+			unsigned int event_cnt; /* # of outstanding events. */
+			unsigned int domid;     /* domid of request. */
+		} ref;
+		/* req is being used for watch events. */
+		struct buffered_data *req;      /* request causing event. */
+	} pend;
+
 	union {
 		struct xsd_sockmsg msg;
 		char raw[sizeof(struct xsd_sockmsg)];
@@ -123,6 +136,9 @@ struct connection
 	struct list_head out_list;
 	uint64_t timeout_msec;
 
+	/* Referenced requests no longer pending. */
+	struct list_head ref_list;
+
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
 
@@ -191,7 +207,8 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
-void send_event(struct connection *conn, const char *path, const char *token);
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
@@ -247,6 +264,7 @@ extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
+extern int quota_req_outstanding;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 93c4c1edcdd1..850085a92c76 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -78,6 +78,9 @@ struct domain
 	/* number of watch for this domain */
 	int nbwatch;
 
+	/* Number of outstanding requests. */
+	int nboutstanding;
+
 	/* write rate limit */
 	wrl_creditt wrl_credit; /* [ -wrl_config_writecost, +_dburst ] */
 	struct wrl_timestampt wrl_timestamp;
@@ -183,8 +186,12 @@ static bool domain_can_read(struct connection *conn)
 {
 	struct xenstore_domain_interface *intf = conn->domain->interface;
 
-	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
-		return false;
+	if (domain_is_unprivileged(conn)) {
+		if (conn->domain->wrl_credit < 0)
+			return false;
+		if (conn->domain->nboutstanding >= quota_req_outstanding)
+			return false;
+	}
 
 	return (intf->req_cons != intf->req_prod);
 }
@@ -331,7 +338,7 @@ static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
 
-	domain = talloc(context, struct domain);
+	domain = talloc_zero(context, struct domain);
 	if (!domain) {
 		errno = ENOMEM;
 		return NULL;
@@ -392,9 +399,6 @@ static int new_domain(struct domain *domain, int port, bool restore)
 	domain->conn->domain = domain;
 	domain->conn->id = domain->domid;
 
-	domain->nbentry = 0;
-	domain->nbwatch = 0;
-
 	return 0;
 }
 
@@ -938,6 +942,28 @@ int domain_watch(struct connection *conn)
 		: 0;
 }
 
+void domain_outstanding_inc(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding++;
+}
+
+void domain_outstanding_dec(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding--;
+}
+
+void domain_outstanding_domid_dec(unsigned int domid)
+{
+	struct domain *d = find_domain_by_domid(domid);
+
+	if (d)
+		d->nboutstanding--;
+}
+
 static wrl_creditt wrl_config_writecost      = WRL_FACTOR;
 static wrl_creditt wrl_config_rate           = WRL_RATE   * WRL_FACTOR;
 static wrl_creditt wrl_config_dburst         = WRL_DBURST * WRL_FACTOR;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 1e929b8f8c6f..4f51b005291a 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -64,6 +64,9 @@ int domain_entry(struct connection *conn);
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
+void domain_outstanding_inc(struct connection *conn);
+void domain_outstanding_dec(struct connection *conn);
+void domain_outstanding_domid_dec(unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 205d9d8ea116..0755ffa375ba 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -142,6 +142,7 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		  struct node *node, bool exact, struct node_perms *perms)
 {
 	struct connection *i;
+	struct buffered_data *req;
 	struct watch *watch;
 
 	/* During transactions, don't fire watches, but queue them. */
@@ -150,6 +151,8 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		return;
 	}
 
+	req = domain_is_unprivileged(conn) ? conn->in : NULL;
+
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
 		/* introduce/release domain watches */
@@ -164,12 +167,12 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			}
@@ -269,8 +272,12 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	trace_create(watch, "watch");
 	send_ack(conn, XS_WATCH);
 
-	/* We fire once up front: simplifies clients and restart. */
-	send_event(conn, get_watch_path(watch, watch->node), watch->token);
+	/*
+	 * We fire once up front: simplifies clients and restart.
+	 * This event will not be linked to the XS_WATCH request.
+	 */
+	send_event(NULL, conn, get_watch_path(watch, watch->node),
+		   watch->token);
 
 	return 0;
 }
From c8057cb483abf2cd4060b39616423e19283fbd0a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: don't buffer multiple identical watch events

A guest not reading its Xenstore response buffer fast enough might
cause lots of buffered Xenstore watch events to pile up. Reduce the
generated load by dropping new events for which an identical copy is
already pending.

The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
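
The duplicate check described above can be sketched in isolation as
follows. This is a minimal model under stated assumptions: the pending
queue is a fixed-size array instead of the connection's out_list, and
struct pending_events, queue_event() and the two size limits are
hypothetical. The event payload is "<path>\0<token>\0", matching the
length computed in send_event().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING   8   /* hypothetical capacity, for the sketch only */
#define MAX_EVENT_LEN 64  /* hypothetical payload limit */

struct pending_events {
	char buf[MAX_PENDING][MAX_EVENT_LEN];
	size_t len[MAX_PENDING];
	size_t count;
};

/*
 * Queue the event "<path>\0<token>\0".
 * Returns true if it was queued, false if it was dropped as a duplicate
 * of a pending event or because it does not fit into the sketch's queue.
 */
static bool queue_event(struct pending_events *q, const char *path,
			const char *token)
{
	char ev[MAX_EVENT_LEN];
	size_t plen, tlen, len, i;

	assert(q && path && token);

	plen = strlen(path) + 1;
	tlen = strlen(token) + 1;
	len = plen + tlen;
	if (len > MAX_EVENT_LEN || q->count == MAX_PENDING)
		return false;

	memcpy(ev, path, plen);
	memcpy(ev + plen, token, tlen);

	/* Special events ("@...") are always sent individually. */
	if (path[0] != '@') {
		for (i = 0; i < q->count; i++) {
			if (q->len[i] == len && !memcmp(q->buf[i], ev, len))
				return false;
		}
	}

	memcpy(q->buf[q->count], ev, len);
	q->len[q->count++] = len;
	return true;
}
```

Comparing the length first keeps the memcmp() from being run on most
non-matching entries; the scan is still linear in the number of pending
events.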

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 488d540f3a32..f1fa97b8cf50 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -916,6 +916,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->inhdr = true;
 	bdata->used = 0;
 	bdata->timeout_msec = 0;
+	bdata->watch_event = false;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -948,7 +949,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 void send_event(struct buffered_data *req, struct connection *conn,
 		const char *path, const char *token)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata, *bd;
 	unsigned int len;
 
 	len = strlen(path) + 1 + strlen(token) + 1;
@@ -970,12 +971,29 @@ void send_event(struct buffered_data *req, struct connection *conn,
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	/*
+	 * Check whether an identical event is pending already.
+	 * Special events are excluded from that check.
+	 */
+	if (path[0] != '@') {
+		list_for_each_entry(bd, &conn->out_list, list) {
+			if (bd->watch_event && bd->hdr.msg.len == len &&
+			    !memcmp(bdata->buffer, bd->buffer, len)) {
+				trace("dropping duplicate watch %s %s for domain %u\n",
+				      path, token, conn->id);
+				talloc_free(bdata);
+				return;
+			}
+		}
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->watch_event = true;
 	bdata->pend.req = req;
 	if (req)
 		req->pend.ref.event_cnt++;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index db09f463a657..b9b50e81c7b4 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -62,6 +62,9 @@ struct buffered_data
 	/* Are we still doing the header? */
 	bool inhdr;
 
+	/* Is this a watch event? */
+	bool watch_event;
+
 	/* How far are we? */
 	unsigned int used;
 
From 5eac692b841633be3e85f0125c59fa02af103989 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: fix connection->id usage

Don't use conn->id for privilege checks, but domain_is_unprivileged().

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 7b4300ef7777..adb8d51b043b 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -891,7 +891,7 @@ int do_control(struct connection *conn, struct buffered_data *in)
 	unsigned int cmd, num, off;
 	char **vec = NULL;
 
-	if (conn->id != 0)
+	if (domain_is_unprivileged(conn))
 		return EACCES;
 
 	off = get_string(in, 0);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index b9b50e81c7b4..b1a70488b989 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -123,7 +123,7 @@ struct connection
 	/* The index of pollfd in global pollfd array */
 	int pollfd_idx;
 
-	/* Who am I? 0 for socket connections. */
+	/* Who am I? Domid of connection. */
 	unsigned int id;
 
 	/* Is this connection ignored? */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 54432907fc76..ee1b09031a3b 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -477,7 +477,8 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 	if (conn->transaction)
 		return EBUSY;
 
-	if (conn->id && conn->transaction_started > quota_max_transaction)
+	if (domain_is_unprivileged(conn) &&
+	    conn->transaction_started > quota_max_transaction)
 		return ENOSPC;
 
 	/* Attach transaction to input for autofree until it's complete */
From f9f3171441b5fcb3339cf612400794fc26cd2ec2 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: simplify and fix per domain node accounting

The accounting of nodes can be simplified now that each connection
holds the associated domid.

Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.

The node accounting also has to be fixed in some cases, especially in
error paths.
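
The error handling pattern this introduces (decrement the old owner,
try to increment the new one, restore the old count on failure) can be
illustrated with a reduced model. struct owner, the alloc_fails flag
and all function names below are hypothetical and only mimic the
possible allocation failure of domain_entry_inc().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain node counter. */
struct owner {
	int nbentry;
	bool alloc_fails; /* simulates a failing domain struct allocation */
};

/* Account one node; may fail like the reworked domain_entry_inc(). */
static int entry_inc(struct owner *o)
{
	if (!o || o->alloc_fails)
		return ENOMEM;
	o->nbentry++;
	return 0;
}

static void entry_dec(struct owner *o)
{
	assert(o && o->nbentry > 0);
	o->nbentry--;
}

/*
 * Move one node from old_owner to new_owner.
 * On failure both counters are left as they were before the call.
 */
static int change_owner(struct owner *old_owner, struct owner *new_owner)
{
	int ret;

	entry_dec(old_owner);
	ret = entry_inc(new_owner);
	if (ret) {
		/* Cannot fail: old_owner was accounted a moment ago. */
		int undo = entry_inc(old_owner);

		assert(!undo);
		(void)undo;
		return ret;
	}

	return 0;
}
```

The rollback relies on the daemon being single-threaded, so the old
owner cannot disappear between the decrement and the restoring
increment.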

This is part of XSA-326 / CVE-2022-42313.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index f1fa97b8cf50..692d863fce35 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -638,7 +638,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(node)) {
+	if (domain_adjust_node_perms(conn, node)) {
 		talloc_free(node);
 		return NULL;
 	}
@@ -660,7 +660,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	void *p;
 	struct xs_tdb_record_hdr *hdr;
 
-	if (domain_adjust_node_perms(node))
+	if (domain_adjust_node_perms(conn, node))
 		return errno;
 
 	data.dsize = sizeof(*hdr)
@@ -1272,13 +1272,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static int destroy_node(struct connection *conn, struct node *node)
+static void destroy_node_rm(struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
 	tdb_delete(tdb_ctx, node->key);
+}
 
+static int destroy_node(struct connection *conn, struct node *node)
+{
+	destroy_node_rm(node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1328,8 +1332,12 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 			goto err;
 
 		/* Account for new node */
-		if (i->parent)
-			domain_entry_inc(conn, i);
+		if (i->parent) {
+			if (domain_entry_inc(conn, i)) {
+				destroy_node_rm(i);
+				return NULL;
+			}
+		}
 	}
 
 	return node;
@@ -1614,10 +1622,27 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in)
 	old_perms = node->perms;
 	domain_entry_dec(conn, node);
 	node->perms = perms;
-	domain_entry_inc(conn, node);
+	if (domain_entry_inc(conn, node)) {
+		node->perms = old_perms;
+		/*
+		 * This should never fail because we had a reference on the
+		 * domain before and Xenstored is single-threaded.
+		 */
+		domain_entry_inc(conn, node);
+		return ENOMEM;
+	}
+
+	if (write_node(conn, node, false)) {
+		int saved_errno = errno;
 
-	if (write_node(conn, node, false))
+		domain_entry_dec(conn, node);
+		node->perms = old_perms;
+		/* No failure possible as above. */
+		domain_entry_inc(conn, node);
+
+		errno = saved_errno;
 		return errno;
+	}
 
 	fire_watches(conn, in, name, node, false, &old_perms);
 	send_ack(conn, XS_SET_PERMS);
@@ -3122,7 +3147,9 @@ void read_state_node(const void *ctx, const void *state)
 	set_tdb_key(name, &key);
 	if (write_node_raw(NULL, &key, node, true))
 		barf("write node error restoring node");
-	domain_entry_inc(&conn, node);
+
+	if (domain_entry_inc(&conn, node))
+		barf("node accounting error restoring node");
 
 	talloc_free(node);
 }
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 850085a92c76..260952e09096 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -16,6 +16,7 @@
     along with this program; If not, see <http://www.gnu.org/licenses/>.
 */
 
+#include <assert.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
@@ -363,6 +364,18 @@ static struct domain *find_or_alloc_domain(const void *ctx, unsigned int domid)
 	return domain ? : alloc_domain(ctx, domid);
 }
 
+static struct domain *find_or_alloc_existing_domain(unsigned int domid)
+{
+	struct domain *domain;
+	xc_dominfo_t dominfo;
+
+	domain = find_domain_struct(domid);
+	if (!domain && get_domain_info(domid, &dominfo))
+		domain = alloc_domain(NULL, domid);
+
+	return domain;
+}
+
 static int new_domain(struct domain *domain, int port, bool restore)
 {
 	int rc;
@@ -782,30 +795,28 @@ void domain_deinit(void)
 		xenevtchn_unbind(xce_handle, virq_port);
 }
 
-void domain_entry_inc(struct connection *conn, struct node *node)
+int domain_entry_inc(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
-		return;
+		return 0;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d)
-				d->nbentry++;
-		}
-	} else if (conn->domain) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				conn->domain->domid);
- 		} else {
- 			conn->domain->nbentry++;
-		}
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_inc(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_or_alloc_existing_domain(domid);
+		if (d)
+			d->nbentry++;
+		else
+			return ENOMEM;
 	}
+
+	return 0;
 }
 
 /*
@@ -841,7 +852,7 @@ static int chk_domain_generation(unsigned int domid, uint64_t gen)
  * Remove permissions for no longer existing domains in order to avoid a new
  * domain with the same domid inheriting the permissions.
  */
-int domain_adjust_node_perms(struct node *node)
+int domain_adjust_node_perms(struct connection *conn, struct node *node)
 {
 	unsigned int i;
 	int ret;
@@ -851,8 +862,14 @@ int domain_adjust_node_perms(struct node *node)
 		return errno;
 
 	/* If the owner doesn't exist any longer give it to priv domain. */
-	if (!ret)
+	if (!ret) {
+		/*
+		 * In theory we'd need to update the number of dom0 nodes here,
+		 * but we could be called for a read of the node. So better
+		 * avoid the risk to overflow the node count of dom0.
+		 */
 		node->perms.p[0].id = priv_domid;
+	}
 
 	for (i = 1; i < node->perms.num; i++) {
 		if (node->perms.p[i].perms & XS_PERM_IGNORE)
@@ -871,25 +888,25 @@ int domain_adjust_node_perms(struct node *node)
 void domain_entry_dec(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
 		return;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d && d->nbentry)
-				d->nbentry--;
-		}
-	} else if (conn->domain && conn->domain->nbentry) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				conn->domain->domid);
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_dec(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_domain_struct(domid);
+		if (d) {
+			d->nbentry--;
 		} else {
-			conn->domain->nbentry--;
+			errno = ENOENT;
+			corrupt(conn,
+				"Node \"%s\" owned by non-existing domain %u\n",
+				node->name, domid);
 		}
 	}
 }
@@ -899,13 +916,23 @@ int domain_entry_fix(unsigned int domid, int num, bool update)
 	struct domain *d;
 	int cnt;
 
-	d = find_domain_by_domid(domid);
-	if (!d)
-		return 0;
+	if (update) {
+		d = find_domain_struct(domid);
+		assert(d);
+	} else {
+		/*
+		 * We are called first with update == false in order to catch
+		 * any error. So do a possible allocation and check for error
+		 * only in this case, as in the case of update == true nothing
+		 * can go wrong anymore as the allocation already happened.
+		 */
+		d = find_or_alloc_existing_domain(domid);
+		if (!d)
+			return -1;
+	}
 
 	cnt = d->nbentry + num;
-	if (cnt < 0)
-		cnt = 0;
+	assert(cnt >= 0);
 
 	if (update)
 		d->nbentry = cnt;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4f51b005291a..d6519904d831 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -54,10 +54,10 @@ const char *get_implicit_path(const struct connection *conn);
 bool domain_is_unprivileged(struct connection *conn);
 
 /* Remove node permissions for no longer existing domains. */
-int domain_adjust_node_perms(struct node *node);
+int domain_adjust_node_perms(struct connection *conn, struct node *node);
 
 /* Quota manipulation */
-void domain_entry_inc(struct connection *conn, struct node *);
+int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ee1b09031a3b..86caf6c398be 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -519,8 +519,12 @@ static int transaction_fix_domains(struct transaction *trans, bool update)
 
 	list_for_each_entry(d, &trans->changed_domains, list) {
 		cnt = domain_entry_fix(d->domid, d->nbentry, update);
-		if (!update && cnt >= quota_nb_entry_per_domain)
-			return ENOSPC;
+		if (!update) {
+			if (cnt >= quota_nb_entry_per_domain)
+				return ENOSPC;
+			if (cnt < 0)
+				return ENOMEM;
+		}
 	}
 
 	return 0;
From 71aac6f7e89d5c101adb9e82eea7031e16d34e46 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: limit max number of nodes accessed in a transaction

Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore,
as all nodes accessed during a transaction need to be tracked.

In oxenstored the number of requests in a transaction is limited via
the quota maxrequests (default is 1024). As multiple accesses of the
same node are not problematic in C Xenstore, limit the number of
accessed nodes instead.
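
The check described above can be pictured with a minimal sketch. It is
not the code from the patch: struct txn_sketch, txn_account_new_node()
and sketch_quota_trans_nodes are hypothetical names used only for
illustration, and only the first access to a node is assumed to be
counted.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-transaction state. */
struct txn_sketch {
	unsigned int nodes;	/* distinct nodes accessed so far */
};

/* Mirrors the default value of quota_trans_nodes in the patch. */
static unsigned int sketch_quota_trans_nodes = 1024;

/*
 * Account for the first access to a node inside a transaction.
 * Returns 0 on success, or ENOSPC if an unprivileged caller has
 * already reached the quota. Privileged callers are never limited.
 */
static int txn_account_new_node(struct txn_sketch *trans, bool unprivileged)
{
	assert(trans);

	if (unprivileged && trans->nodes >= sketch_quota_trans_nodes)
		return ENOSPC;

	trans->nodes++;
	return 0;
}
```

Repeated accesses to an already tracked node would simply skip this
function, which is why the limit applies to nodes rather than requests.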

In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
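
The error handling pattern described above can be sketched roughly as
follows. This is an illustration under assumptions, not the patched
read_node(): sketch_access() and sketch_read() are hypothetical
stand-ins, malloc()/free() replace talloc, and the __must_check
definition relies on a GCC/Clang attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Same definition as in the patch; GCC and Clang specific. */
#ifndef __must_check
#define __must_check __attribute__((__warn_unused_result__))
#endif

/*
 * Hypothetical stand-in for access_node(): returns 0 or an errno value.
 * Ignoring the result triggers a compiler warning due to __must_check.
 */
static int __must_check sketch_access(int simulated_error)
{
	return simulated_error;
}

/* Hypothetical stand-in for read_node(): NULL plus errno on failure. */
static void *sketch_read(int simulated_error)
{
	void *node = malloc(16);
	int err;

	if (!node) {
		errno = ENOMEM;
		return NULL;
	}

	err = sketch_access(simulated_error);
	if (err) {
		errno = err;
		goto error;
	}

	return node;

 error:
	/* Save errno across the free, which may modify it on some libcs. */
	err = errno;
	free(node);
	errno = err;
	assert(errno != 0);
	return NULL;
}
```

Funnelling all failures through one label keeps the errno seen by the
caller identical to the one reported by the failing step.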

This is part of XSA-326 / CVE-2022-42314.

Reported-by: Julien Grall <jgrall@amazon.com>
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/include/xen-tools/libs.h b/tools/include/xen-tools/libs.h
index a16e0c380709..bafc90e2f603 100644
--- a/tools/include/xen-tools/libs.h
+++ b/tools/include/xen-tools/libs.h
@@ -63,4 +63,8 @@
 #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
 #endif
 
+#ifndef __must_check
+#define __must_check __attribute__((__warn_unused_result__))
+#endif
+
 #endif	/* __XEN_TOOLS_LIBS__ */
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 692d863fce35..f835aa1b2f1f 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -106,6 +106,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
 
@@ -595,6 +596,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	TDB_DATA key, data;
 	struct xs_tdb_record_hdr *hdr;
 	struct node *node;
+	int err;
 
 	node = talloc(ctx, struct node);
 	if (!node) {
@@ -616,14 +618,13 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	if (data.dptr == NULL) {
 		if (tdb_error(tdb_ctx) == TDB_ERR_NOEXIST) {
 			node->generation = NO_GENERATION;
-			access_node(conn, node, NODE_ACCESS_READ, NULL);
-			errno = ENOENT;
+			err = access_node(conn, node, NODE_ACCESS_READ, NULL);
+			errno = err ? : ENOENT;
 		} else {
 			log("TDB error on read: %s", tdb_errorstr(tdb_ctx));
 			errno = EIO;
 		}
-		talloc_free(node);
-		return NULL;
+		goto error;
 	}
 
 	node->parent = NULL;
@@ -638,19 +639,36 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(conn, node)) {
-		talloc_free(node);
-		return NULL;
-	}
+	if (domain_adjust_node_perms(conn, node))
+		goto error;
 
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
 	node->children = node->data + node->datalen;
 
-	access_node(conn, node, NODE_ACCESS_READ, NULL);
+	if (access_node(conn, node, NODE_ACCESS_READ, NULL))
+		goto error;
 
 	return node;
+
+ error:
+	err = errno;
+	talloc_free(node);
+	errno = err;
+	return NULL;
+}
+
+static bool read_node_can_propagate_errno(void)
+{
+	/*
+	 * 2 error cases for read_node() can always be propagated up:
+	 * ENOMEM, because this has nothing to do with the node being in the
+	 * data base or not, but is caused by a general lack of memory.
+	 * ENOSPC, because this is related to hitting quota limits which need
+	 * to be respected.
+	 */
+	return errno == ENOMEM || errno == ENOSPC;
 }
 
 int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
@@ -767,7 +785,7 @@ static int ask_parents(struct connection *conn, const void *ctx,
 		node = read_node(conn, ctx, name);
 		if (node)
 			break;
-		if (errno == ENOMEM)
+		if (read_node_can_propagate_errno())
 			return errno;
 	} while (!streq(name, "/"));
 
@@ -829,7 +847,7 @@ static struct node *get_node(struct connection *conn,
 		}
 	}
 	/* Clean up errno if they weren't supposed to know. */
-	if (!node && errno != ENOMEM)
+	if (!node && !read_node_can_propagate_errno())
 		errno = errno_from_parents(conn, ctx, name, errno, perm);
 	return node;
 }
@@ -1235,7 +1253,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 
 	/* If parent doesn't exist, create it. */
 	parent = read_node(conn, parentname, parentname);
-	if (!parent)
+	if (!parent && errno == ENOENT)
 		parent = construct_node(conn, ctx, parentname);
 	if (!parent)
 		return NULL;
@@ -1509,7 +1527,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 
 	parent = read_node(conn, ctx, parentname);
 	if (!parent)
-		return (errno == ENOMEM) ? ENOMEM : EINVAL;
+		return read_node_can_propagate_errno() ? errno : EINVAL;
 	node->parent = parent;
 
 	return delete_node(conn, ctx, parent, node, false);
@@ -1539,7 +1557,7 @@ static int do_rm(struct connection *conn, struct buffered_data *in)
 				return 0;
 			}
 			/* Restore errno, just in case. */
-			if (errno != ENOMEM)
+			if (!read_node_can_propagate_errno())
 				errno = ENOENT;
 		}
 		return errno;
@@ -2384,6 +2402,8 @@ static void usage(void)
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
 "  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
 "                          quotas are:\n"
+"                          transaction-nodes: number of accessed node per\n"
+"                                             transaction\n"
 "                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
@@ -2468,6 +2488,8 @@ static void set_quota(const char *arg)
 	val = get_optval_int(eq + 1);
 	if (what_matches(arg, "outstanding"))
 		quota_req_outstanding = val;
+	else if (what_matches(arg, "transaction-nodes"))
+		quota_trans_nodes = val;
 	else
 		barf("unknown quota \"%s\"\n", arg);
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index b1a70488b989..245f9258235f 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -268,6 +268,7 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
+extern int quota_trans_nodes;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 86caf6c398be..7bd41eb475e3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -156,6 +156,9 @@ struct transaction
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
+	/* Node counter. */
+	unsigned int nodes;
+
 	/* Generation when transaction started. */
 	uint64_t generation;
 
@@ -260,6 +263,11 @@ int access_node(struct connection *conn, struct node *node,
 
 	i = find_accessed_node(trans, node->name);
 	if (!i) {
+		if (trans->nodes >= quota_trans_nodes &&
+		    domain_is_unprivileged(conn)) {
+			ret = ENOSPC;
+			goto err;
+		}
 		i = talloc_zero(trans, struct accessed_node);
 		if (!i)
 			goto nomem;
@@ -297,6 +305,7 @@ int access_node(struct connection *conn, struct node *node,
 				i->ta_node = true;
 			}
 		}
+		trans->nodes++;
 		list_add_tail(&i->list, &trans->accessed);
 	}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 0093cac807e3..e3cbd6b23095 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -39,8 +39,8 @@ void transaction_entry_inc(struct transaction *trans, unsigned int domid);
 void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 
 /* This node was accessed. */
-int access_node(struct connection *conn, struct node *node,
-                enum node_access_type type, TDB_DATA *key);
+int __must_check access_node(struct connection *conn, struct node *node,
+                             enum node_access_type type, TDB_DATA *key);
 
 /* Queue watches for a modified node. */
 void queue_watches(struct connection *conn, const char *name, bool watch_exact);
From 90013d6a735491a7b93a6832eb2a51e5633254f5 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: move the call of setup_structure() to dom0
 introduction

Setting up the basic structure when introducing dom0 makes it possible
to add proper node memory accounting for the added nodes later.

It makes proper accounting of the number of nodes possible, too.

For this to work, the owner of the created nodes must be corrected to
dom0_domid instead of domid 0.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index f835aa1b2f1f..5171d34c947e 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -2039,7 +2039,8 @@ static int tdb_flags;
 static void manual_node(const char *name, const char *child)
 {
 	struct node *node;
-	struct xs_permissions perms = { .id = 0, .perms = XS_PERM_NONE };
+	struct xs_permissions perms = { .id = dom0_domid,
+					.perms = XS_PERM_NONE };
 
 	node = talloc_zero(NULL, struct node);
 	if (!node)
@@ -2078,7 +2079,7 @@ static void tdb_logger(TDB_CONTEXT *tdb, int level, const char * fmt, ...)
 	}
 }
 
-static void setup_structure(bool live_update)
+void setup_structure(bool live_update)
 {
 	char *tdbname;
 
@@ -2101,6 +2102,7 @@ static void setup_structure(bool live_update)
 		manual_node("/", "tool");
 		manual_node("/tool", "xenstored");
 		manual_node("/tool/xenstored", NULL);
+		domain_entry_fix(dom0_domid, 3, true);
 	}
 
 	check_store();
@@ -2614,9 +2616,6 @@ int main(int argc, char *argv[])
 
 	init_pipe(reopen_log_pipe);
 
-	/* Setup the database */
-	setup_structure(live_update);
-
 	/* Listen to hypervisor. */
 	if (!no_domain_init && !live_update) {
 		domain_init(-1);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 245f9258235f..2c77ec7ee0f4 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -231,6 +231,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 struct node *read_node(struct connection *conn, const void *ctx,
 		       const char *name);
 
+void setup_structure(bool live_update);
 struct connection *new_connection(const struct interface_funcs *funcs);
 struct connection *get_connection_by_id(unsigned int conn_id);
 void ignore_connection(struct connection *conn);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 260952e09096..f04b7aae8a32 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -470,6 +470,9 @@ static struct domain *introduce_domain(const void *ctx,
 		}
 		domain->interface = interface;
 
+		if (is_master_domain)
+			setup_structure(restore);
+
 		/* Now domain belongs to its connection. */
 		talloc_steal(domain->conn, domain);
 
From 6af17b8bf52b9dfdc6a5ecd3efbcea9fddd57d91 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add infrastructure to keep track of per domain memory
 usage

The amount of memory a domain can consume in Xenstore is limited by
various quotas today, but even with sane quotas a domain can still
consume rather large amounts of memory.

Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so internal administration data needed
only by Xenstore is not accounted for.

There are two quotas defined: a soft quota which results in a warning
issued via syslog() when it is exceeded, and a hard quota which stops
Xenstore from accepting further requests or watch events for as long
as accepting them would violate the hard quota.

Setting any of those quotas to 0 will disable it.

As default values use 2MB per domain for the soft limit (this basically
covers the allowed case of creating 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
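
The relation between the two limits and the "0 disables" rule can be
sketched as below. The enum and sketch_mem_quota_state() are
hypothetical and only classify a memory value; the logging and rate
limiting done by the real code are left out.

```c
#include <assert.h>

/* Hypothetical classification of a domain's accounted memory. */
enum sketch_quota_state {
	SKETCH_QUOTA_OK,
	SKETCH_QUOTA_SOFT_EXCEEDED,	/* warn only */
	SKETCH_QUOTA_HARD_EXCEEDED	/* stop accepting requests */
};

/*
 * Classify "mem" bytes against a soft and a hard limit.
 * A limit of 0 means the respective quota is disabled.
 */
static enum sketch_quota_state sketch_mem_quota_state(int mem, int soft,
						      int hard)
{
	assert(mem >= 0);

	if (hard && mem >= hard)
		return SKETCH_QUOTA_HARD_EXCEEDED;
	if (soft && mem >= soft)
		return SKETCH_QUOTA_SOFT_EXCEEDED;
	return SKETCH_QUOTA_OK;
}
```

With the defaults, 1000 nodes of 2048 bytes each add up to 2048000
bytes, which stays below the soft limit of 2097152 bytes.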

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 5171d34c947e..b2bf6740d430 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -109,6 +109,8 @@ int quota_nb_perms_per_node = 5;
 int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
+int quota_memory_per_domain_soft = 2 * 1024 * 1024; /* 2 MB */
+int quota_memory_per_domain_hard = 2 * 1024 * 1024 + 512 * 1024; /* 2.5 MB */
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -2406,7 +2408,14 @@ static void usage(void)
 "                          quotas are:\n"
 "                          transaction-nodes: number of accessed node per\n"
 "                                             transaction\n"
+"                          memory: total used memory per domain for nodes,\n"
+"                                  transactions, watches and requests, above\n"
+"                                  which Xenstore will stop talking to domain\n"
 "                          outstanding: number of outstanding requests\n"
+"  -q, --quota-soft <what>=<nb> set a soft quota <what> to the value <nb>,\n"
+"                          causing a warning to be issued via syslog() if the\n"
+"                          limit is violated, allowed quotas are:\n"
+"                          memory: see above\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2433,6 +2442,7 @@ static struct option options[] = {
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
 	{ "quota", 1, NULL, 'Q' },
+	{ "quota-soft", 1, NULL, 'q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2480,7 +2490,7 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
-static void set_quota(const char *arg)
+static void set_quota(const char *arg, bool soft)
 {
 	const char *eq = strchr(arg, '=');
 	int val;
@@ -2488,11 +2498,16 @@ static void set_quota(const char *arg)
 	if (!eq)
 		barf("quotas must be specified via <what>=<nb>\n");
 	val = get_optval_int(eq + 1);
-	if (what_matches(arg, "outstanding"))
+	if (what_matches(arg, "outstanding") && !soft)
 		quota_req_outstanding = val;
-	else if (what_matches(arg, "transaction-nodes"))
+	else if (what_matches(arg, "transaction-nodes") && !soft)
 		quota_trans_nodes = val;
-	else
+	else if (what_matches(arg, "memory")) {
+		if (soft)
+			quota_memory_per_domain_soft = val;
+		else
+			quota_memory_per_domain_hard = val;
+	} else
 		barf("unknown quota \"%s\"\n", arg);
 }
 
@@ -2510,7 +2525,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:q:T:RVW:w:U",
 				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2561,7 +2576,10 @@ int main(int argc, char *argv[])
 						 quota_max_path_len);
 			break;
 		case 'Q':
-			set_quota(optarg);
+			set_quota(optarg, false);
+			break;
+		case 'q':
+			set_quota(optarg, true);
 			break;
 		case 'w':
 			set_timeout(optarg);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 2c77ec7ee0f4..373af18297bf 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -270,6 +270,8 @@ extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
+extern int quota_memory_per_domain_soft;
+extern int quota_memory_per_domain_hard;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index f04b7aae8a32..94fd561e9de4 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -76,6 +76,13 @@ struct domain
 	/* number of entry from this domain in the store */
 	int nbentry;
 
+	/* Amount of memory allocated for this domain. */
+	int memory;
+	bool soft_quota_reported;
+	bool hard_quota_reported;
+	time_t mem_last_msg;
+#define MEM_WARN_MINTIME_SEC 10
+
 	/* number of watch for this domain */
 	int nbwatch;
 
@@ -192,6 +199,9 @@ static bool domain_can_read(struct connection *conn)
 			return false;
 		if (conn->domain->nboutstanding >= quota_req_outstanding)
 			return false;
+		if (conn->domain->memory >= quota_memory_per_domain_hard &&
+		    quota_memory_per_domain_hard)
+			return false;
 	}
 
 	return (intf->req_cons != intf->req_prod);
@@ -950,6 +960,89 @@ int domain_entry(struct connection *conn)
 		: 0;
 }
 
+static bool domain_chk_quota(struct domain *domain, int mem)
+{
+	time_t now;
+
+	if (!domain || !domid_is_unprivileged(domain->domid) ||
+	    (domain->conn && domain->conn->is_ignored))
+		return false;
+
+	now = time(NULL);
+
+	if (mem >= quota_memory_per_domain_hard &&
+	    quota_memory_per_domain_hard) {
+		if (domain->hard_quota_reported)
+			return true;
+		syslog(LOG_ERR, "Domain %u exceeds hard memory quota, Xenstore interface to domain stalled\n",
+		       domain->domid);
+		domain->mem_last_msg = now;
+		domain->hard_quota_reported = true;
+		return true;
+	}
+
+	if (now - domain->mem_last_msg >= MEM_WARN_MINTIME_SEC) {
+		if (domain->hard_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->hard_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below hard memory quota again\n",
+			       domain->domid);
+		}
+		if (mem >= quota_memory_per_domain_soft &&
+		    quota_memory_per_domain_soft &&
+		    !domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = true;
+			syslog(LOG_WARNING, "Domain %u exceeds soft memory quota\n",
+			       domain->domid);
+		}
+		if (mem < quota_memory_per_domain_soft &&
+		    domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below soft memory quota again\n",
+			       domain->domid);
+		}
+
+	}
+
+	return false;
+}
+
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check)
+{
+	struct domain *domain;
+
+	domain = find_domain_struct(domid);
+	if (domain) {
+		/*
+		 * domain_chk_quota() will print warning and also store whether
+		 * the soft/hard quota has been hit. So check no_quota_check
+		 * *after*.
+		 */
+		if (domain_chk_quota(domain, domain->memory + mem) &&
+		    !no_quota_check)
+			return ENOMEM;
+		domain->memory += mem;
+	} else {
+		/*
+		 * The domain the memory is to be accounted for should always
+		 * exist, as accounting is done either for a domain related to
+		 * the current connection, or for the domain owning a node
+		 * (which is always existing, as the owner of the node is
+		 * tested to exist and replaced by domid 0 if not).
+		 * So not finding the related domain MUST be an error in the
+		 * data base.
+		 */
+		errno = ENOENT;
+		corrupt(NULL, "Accounting called for non-existing domain %u\n",
+			domid);
+		return ENOENT;
+	}
+
+	return 0;
+}
+
 void domain_watch_inc(struct connection *conn)
 {
 	if (!conn || !conn->domain)
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index d6519904d831..633c9a0a0a1f 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -61,6 +61,26 @@ int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check);
+
+/*
+ * domain_memory_add_chk(): to be used when memory quota should be checked.
+ * Not to be used when specifying a negative mem value, as lowering the used
+ * memory should always be allowed.
+ */
+static inline int domain_memory_add_chk(unsigned int domid, int mem)
+{
+	return domain_memory_add(domid, mem, false);
+}
+/*
+ * domain_memory_add_nochk(): to be used when memory quota should not be
+ * checked, e.g. when lowering memory usage, or in an error case for undoing
+ * a previous memory adjustment.
+ */
+static inline void domain_memory_add_nochk(unsigned int domid, int mem)
+{
+	domain_memory_add(domid, mem, true);
+}
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
From ae7042f024af7584251f776a12d9bb24d13fecaf Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add memory accounting for responses

Add the memory accounting for queued responses.

If adding a watch event for a guest would cause the hard memory quota
of that guest to be violated, the event is dropped. This ensures that
it is impossible to drive another guest past its memory quota by
generating excessive amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
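
The drop-on-quota behaviour can be illustrated with a small sketch.
All names here (struct dom_sketch, sketch_memory_add_chk(),
sketch_queue_event(), sketch_event_delivered()) are hypothetical and
stand in for the real accounting and queueing code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain accounting state. */
struct dom_sketch {
	int memory;	/* bytes currently accounted */
	int hard_quota;	/* 0 disables the check */
};

/* Add "mem" bytes unless that would reach the hard quota. */
static int sketch_memory_add_chk(struct dom_sketch *d, int mem)
{
	if (d->hard_quota && d->memory + mem >= d->hard_quota)
		return ENOMEM;
	d->memory += mem;
	return 0;
}

/* Returns true if the event was queued, false if it was dropped. */
static bool sketch_queue_event(struct dom_sketch *d, int len, int hdr_len)
{
	if (sketch_memory_add_chk(d, len + hdr_len))
		return false;	/* receiver is over quota: drop the event */
	return true;
}

/* Undo the accounting once the event has been sent and freed. */
static void sketch_event_delivered(struct dom_sketch *d, int len, int hdr_len)
{
	d->memory -= len + hdr_len;
	assert(d->memory >= 0);
}
```

The memory is charged to the receiving domain, so a guest flooding a
driver domain with events only causes those events to be dropped.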

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index b2bf6740d430..ecab6cfbbe15 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -260,6 +260,8 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	domain_memory_add_nochk(conn->id, -out->hdr.msg.len - sizeof(out->hdr));
+
 	if (out->hdr.msg.type == XS_WATCH_EVENT) {
 		req = out->pend.req;
 		if (req) {
@@ -938,11 +940,14 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->timeout_msec = 0;
 	bdata->watch_event = false;
 
-	if (len <= DEFAULT_BUFFER_SIZE)
+	if (len <= DEFAULT_BUFFER_SIZE) {
 		bdata->buffer = bdata->default_buffer;
-	else {
+		/* Don't check quota, path might be used for returning error. */
+		domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
+	} else {
 		bdata->buffer = talloc_array(bdata, char, len);
-		if (!bdata->buffer) {
+		if (!bdata->buffer ||
+		    domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
 			send_error(conn, ENOMEM);
 			return;
 		}
@@ -1007,6 +1012,11 @@ void send_event(struct buffered_data *req, struct connection *conn,
 		}
 	}
 
+	if (domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
+		talloc_free(bdata);
+		return;
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
@@ -3039,6 +3049,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 	 */
 	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
 		domain_outstanding_inc(conn);
+	/*
+	 * We are restoring the state after Live-Update and the new quota may
+	 * be smaller. So ignore it. The limit will be applied for any resource
+	 * after the state has been fully restored.
+	 */
+	domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
From 4628ae0a56b037dcdc8a3e42c543c5b9fd9990cf Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for watches

Add the memory accounting for registered watches.

When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index ecab6cfbbe15..d86942f5aa77 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -463,6 +463,7 @@ static int destroy_conn(void *_conn)
 	}
 
 	conn_free_buffered_data(conn);
+	conn_delete_all_watches(conn);
 	list_for_each_entry(req, &conn->ref_list, list)
 		req->on_ref_list = false;
 
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 0755ffa375ba..fdf9b2d653a0 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -211,7 +211,7 @@ static int check_watch_path(struct connection *conn, const void *ctx,
 }
 
 static struct watch *add_watch(struct connection *conn, char *path, char *token,
-			       bool relative)
+			       bool relative, bool no_quota_check)
 {
 	struct watch *watch;
 
@@ -222,6 +222,9 @@ static struct watch *add_watch(struct connection *conn, char *path, char *token,
 	watch->token = talloc_strdup(watch, token);
 	if (!watch->node || !watch->token)
 		goto nomem;
+	if (domain_memory_add(conn->id, strlen(path) + strlen(token),
+			      no_quota_check))
+		goto nomem;
 
 	if (relative)
 		watch->relative_path = get_implicit_path(conn);
@@ -265,7 +268,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	if (domain_watch(conn) > quota_nb_watch_per_domain)
 		return E2BIG;
 
-	watch = add_watch(conn, vec[0], vec[1], relative);
+	watch = add_watch(conn, vec[0], vec[1], relative, false);
 	if (!watch)
 		return errno;
 
@@ -296,6 +299,8 @@ int do_unwatch(struct connection *conn, struct buffered_data *in)
 	list_for_each_entry(watch, &conn->watches, list) {
 		if (streq(watch->node, node) && streq(watch->token, vec[1])) {
 			list_del(&watch->list);
+			domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+							  strlen(watch->token));
 			talloc_free(watch);
 			domain_watch_dec(conn);
 			send_ack(conn, XS_UNWATCH);
@@ -311,6 +316,8 @@ void conn_delete_all_watches(struct connection *conn)
 
 	while ((watch = list_top(&conn->watches, struct watch, list))) {
 		list_del(&watch->list);
+		domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+						  strlen(watch->token));
 		talloc_free(watch);
 		domain_watch_dec(conn);
 	}
@@ -373,7 +380,7 @@ void read_state_watch(const void *ctx, const void *state)
 	if (!path)
 		barf("allocation error for read watch");
 
-	if (!add_watch(conn, path, token, relative))
+	if (!add_watch(conn, path, token, relative, true))
 		barf("error adding watch");
 }
 
From b8bd74e5e962955211ab0c5c1924ebf2bb526799 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for nodes

Add the memory accounting for Xenstore nodes. In order to keep this
from getting too complicated, allow for some sloppiness when writing
nodes. Any hard quota violation will result in no further requests
being accepted.
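
The accounting done when a node is written can be sketched as a
release, charge and rollback sequence. This is a simplified
illustration: struct acct_sketch, sketch_memory_add() and
sketch_account_write() are hypothetical, and the sizes are assumed to
already include the key length.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-domain accounting state. */
struct acct_sketch {
	int memory;	/* bytes currently accounted */
	int hard_quota;	/* 0 disables the check */
};

/* Adjust the counter; fail only if a quota check is requested. */
static int sketch_memory_add(struct acct_sketch *d, int mem,
			     bool no_quota_check)
{
	bool over = d->hard_quota && d->memory + mem >= d->hard_quota;

	if (over && !no_quota_check)
		return ENOMEM;
	d->memory += mem;
	return 0;
}

/*
 * Move the accounting of a node from its old owner to its new owner.
 * old_size == 0 means the node did not exist before.
 */
static int sketch_account_write(struct acct_sketch *old_owner, int old_size,
				struct acct_sketch *new_owner, int new_size)
{
	int ret;

	assert(old_size >= 0 && new_size >= 0);

	/* Releasing memory is always allowed, so no quota check. */
	if (old_size)
		sketch_memory_add(old_owner, -old_size, true);

	ret = sketch_memory_add(new_owner, new_size, false);
	if (ret && old_size) {
		/* Error path: restore the old accounting unchecked. */
		sketch_memory_add(old_owner, old_size, true);
	}

	return ret;
}
```

Releasing the old size first means that overwriting a node with data
of the same size does not fail just because the owner is near its
quota.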

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index d86942f5aa77..16504de42017 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -591,6 +591,117 @@ void set_tdb_key(const char *name, TDB_DATA *key)
 	key->dsize = strlen(name);
 }
 
+static void get_acc_data(TDB_DATA *key, struct node_account_data *acc)
+{
+	TDB_DATA old_data;
+	struct xs_tdb_record_hdr *hdr;
+
+	if (acc->memory < 0) {
+		old_data = tdb_fetch(tdb_ctx, *key);
+		/* No check for error, as the node might not exist. */
+		if (old_data.dptr == NULL) {
+			acc->memory = 0;
+		} else {
+			hdr = (void *)old_data.dptr;
+			acc->memory = old_data.dsize;
+			acc->domid = hdr->perms[0].id;
+		}
+		talloc_free(old_data.dptr);
+	}
+}
+
+/*
+ * Per-transaction nodes need to be accounted for the transaction owner.
+ * Those nodes are stored in the data base with the transaction generation
+ * count prepended (e.g. 123/local/domain/...). So testing for the node's
+ * key not to start with "/" is sufficient.
+ */
+static unsigned int get_acc_domid(struct connection *conn, TDB_DATA *key,
+				  unsigned int domid)
+{
+	return (!conn || key->dptr[0] == '/') ? domid : conn->id;
+}
+
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check)
+{
+	struct xs_tdb_record_hdr *hdr = (void *)data->dptr;
+	struct node_account_data old_acc = {};
+	unsigned int old_domid, new_domid;
+	int ret;
+
+	if (!acc)
+		old_acc.memory = -1;
+	else
+		old_acc = *acc;
+
+	get_acc_data(key, &old_acc);
+	old_domid = get_acc_domid(conn, key, old_acc.domid);
+	new_domid = get_acc_domid(conn, key, hdr->perms[0].id);
+
+	/*
+	 * Don't check for ENOENT, as we want to be able to switch orphaned
+	 * nodes to new owners.
+	 */
+	if (old_acc.memory)
+		domain_memory_add_nochk(old_domid,
+					-old_acc.memory - key->dsize);
+	ret = domain_memory_add(new_domid, data->dsize + key->dsize,
+				no_quota_check);
+	if (ret) {
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		return ret;
+	}
+
+	/* TDB should set errno, but doesn't even set ecode AFAICT. */
+	if (tdb_store(tdb_ctx, *key, *data, TDB_REPLACE) != 0) {
+		domain_memory_add_nochk(new_domid, -data->dsize - key->dsize);
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc) {
+		/* Don't use new_domid, as it might be a transaction node. */
+		acc->domid = hdr->perms[0].id;
+		acc->memory = data->dsize;
+	}
+
+	return 0;
+}
+
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc)
+{
+	struct node_account_data tmp_acc;
+	unsigned int domid;
+
+	if (!acc) {
+		acc = &tmp_acc;
+		acc->memory = -1;
+	}
+
+	get_acc_data(key, acc);
+
+	if (tdb_delete(tdb_ctx, *key)) {
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc->memory) {
+		domid = get_acc_domid(conn, key, acc->domid);
+		domain_memory_add_nochk(domid, -acc->memory - key->dsize);
+	}
+
+	return 0;
+}
+
 /*
  * If it fails, returns NULL and sets errno.
  * Temporary memory allocations will be done with ctx.
@@ -644,9 +755,15 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
+	node->acc.domid = node->perms.p[0].id;
+	node->acc.memory = data.dsize;
 	if (domain_adjust_node_perms(conn, node))
 		goto error;
 
+	/* If owner is gone reset currently accounted memory size. */
+	if (node->acc.domid != node->perms.p[0].id)
+		node->acc.memory = 0;
+
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
@@ -715,12 +832,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	p += node->datalen;
 	memcpy(p, node->children, node->childlen);
 
-	/* TDB should set errno, but doesn't even set ecode AFAICT. */
-	if (tdb_store(tdb_ctx, *key, data, TDB_REPLACE) != 0) {
-		corrupt(conn, "Write of %s failed", key->dptr);
-		errno = EIO;
-		return errno;
-	}
+	if (do_tdb_write(conn, key, &data, &node->acc, no_quota_check))
+		return EIO;
+
 	return 0;
 }
 
@@ -1222,7 +1336,7 @@ static void delete_node_single(struct connection *conn, struct node *node)
 	if (access_node(conn, node, NODE_ACCESS_DELETE, &key))
 		return;
 
-	if (tdb_delete(tdb_ctx, key) != 0) {
+	if (do_tdb_delete(conn, &key, &node->acc) != 0) {
 		corrupt(conn, "Could not delete '%s'", node->name);
 		return;
 	}
@@ -1295,6 +1409,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	/* No children, no data */
 	node->children = node->data = NULL;
 	node->childlen = node->datalen = 0;
+	node->acc.memory = 0;
 	node->parent = parent;
 	return node;
 
@@ -1303,17 +1418,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static void destroy_node_rm(struct node *node)
+static void destroy_node_rm(struct connection *conn, struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
-	tdb_delete(tdb_ctx, node->key);
+	do_tdb_delete(conn, &node->key, &node->acc);
 }
 
 static int destroy_node(struct connection *conn, struct node *node)
 {
-	destroy_node_rm(node);
+	destroy_node_rm(conn, node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1365,7 +1480,7 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 		/* Account for new node */
 		if (i->parent) {
 			if (domain_entry_inc(conn, i)) {
-				destroy_node_rm(i);
+				destroy_node_rm(conn, i);
 				return NULL;
 			}
 		}
@@ -2291,7 +2406,7 @@ static int clean_store_(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA val,
 	if (!hashtable_search(reachable, name)) {
 		log("clean_store: '%s' is orphaned!", name);
 		if (recovery) {
-			tdb_delete(tdb, key);
+			do_tdb_delete(NULL, &key, NULL);
 		}
 	}
 
@@ -3149,6 +3264,7 @@ void read_state_node(const void *ctx, const void *state)
 	if (!node)
 		barf("allocation error restoring node");
 
+	node->acc.memory = 0;
 	node->name = name;
 	node->generation = ++generation;
 	node->datalen = sn->data_len;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 373af18297bf..da9ecce67f31 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -176,6 +176,11 @@ struct node_perms {
 	struct xs_permissions *p;
 };
 
+struct node_account_data {
+	unsigned int domid;
+	int memory;		/* -1 if unknown */
+};
+
 struct node {
 	const char *name;
 	/* Key used to update TDB */
@@ -198,6 +203,9 @@ struct node {
 	/* Children, each nul-terminated. */
 	unsigned int childlen;
 	char *children;
+
+	/* Allocation information for node currently in store. */
+	struct node_account_data acc;
 };
 
 /* Return the only argument in the input. */
@@ -306,6 +314,10 @@ extern xengnttab_handle **xgt_handle;
 int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check);
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc);
 
 void conn_free_buffered_data(struct connection *conn);
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 7bd41eb475e3..ace9a11d77bb 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -153,6 +153,9 @@ struct transaction
 	/* List of all transactions active on this connection. */
 	struct list_head list;
 
+	/* Connection this transaction is associated with. */
+	struct connection *conn;
+
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
@@ -286,6 +289,8 @@ int access_node(struct connection *conn, struct node *node,
 
 		introduce = true;
 		i->ta_node = false;
+		/* acc.memory < 0 means "unknown, get size from TDB". */
+		node->acc.memory = -1;
 
 		/*
 		 * Additional transaction-specific node for read type. We only
@@ -410,11 +415,11 @@ static int finalize_transaction(struct connection *conn,
 					goto err;
 				hdr = (void *)data.dptr;
 				hdr->generation = ++generation;
-				ret = tdb_store(tdb_ctx, key, data,
-						TDB_REPLACE);
+				ret = do_tdb_write(conn, &key, &data, NULL,
+						   true);
 				talloc_free(data.dptr);
 			} else {
-				ret = tdb_delete(tdb_ctx, key);
+				ret = do_tdb_delete(conn, &key, NULL);
 			}
 			if (ret)
 				goto err;
@@ -425,7 +430,7 @@ static int finalize_transaction(struct connection *conn,
 			}
 		}
 
-		if (i->ta_node && tdb_delete(tdb_ctx, ta_key))
+		if (i->ta_node && do_tdb_delete(conn, &ta_key, NULL))
 			goto err;
 		list_del(&i->list);
 		talloc_free(i);
@@ -453,7 +458,7 @@ static int destroy_transaction(void *_transaction)
 							       i->node);
 			if (trans_name) {
 				set_tdb_key(trans_name, &key);
-				tdb_delete(tdb_ctx, key);
+				do_tdb_delete(trans->conn, &key, NULL);
 			}
 		}
 		list_del(&i->list);
@@ -497,6 +502,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 
 	INIT_LIST_HEAD(&trans->accessed);
 	INIT_LIST_HEAD(&trans->changed_domains);
+	trans->conn = conn;
 	trans->fail = false;
 	trans->generation = ++generation;
 
From c55a1ea0a5ea7f6a3dc850cb015a49ba9ec571ab Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add exports for quota variables

Some quota variables are not exported via header files.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index da9ecce67f31..bfd3fc1e9df3 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -275,6 +275,11 @@ extern TDB_CONTEXT *tdb_ctx;
 extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
+extern int quota_nb_watch_per_domain;
+extern int quota_max_transaction;
+extern int quota_max_entry_size;
+extern int quota_nb_perms_per_node;
+extern int quota_max_path_len;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ace9a11d77bb..28774813de83 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -175,7 +175,6 @@ struct transaction
 	bool fail;
 };
 
-extern int quota_max_transaction;
 uint64_t generation;
 
 static struct accessed_node *find_accessed_node(struct transaction *trans,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index fdf9b2d653a0..85362bcce314 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -31,8 +31,6 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 
-extern int quota_nb_watch_per_domain;
-
 struct watch
 {
 	/* Watches on this connection */
From 05cc2af50ba43431d6d50aff758e968833aab9c6 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add control command for setting and showing quota

Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain

Note that if a new quota is lower than the existing one, Xenstored may
continue to handle requests from a domain that exceeds the new limit
(depending on which quota is exceeded), and the amount of resource
already in use will not change. However, the domain will not be able to
create more of the resource associated with that quota until its usage
is back below the limit.

This is part of XSA-326.
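
The table-driven lookup behind a "set <name> <val>" request, and the
effect of lowering a limit below current usage, can be sketched as below.
This is only an illustration under stated assumptions: the example_*
names, the two quota variables and their values are hypothetical, and
strtol() is used for stricter parsing than the atoi() call in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical quota variables with made-up values. */
static int example_quota_nodes = 1000;
static int example_quota_watches = 128;

struct example_quota {
	const char *name;
	int *value;
};

/* NULL-terminated table mapping a quota name to its variable. */
static const struct example_quota example_quotas[] = {
	{ "nodes", &example_quota_nodes },
	{ "watches", &example_quota_watches },
	{ NULL, NULL }
};

/* Set the named quota; reject unknown names and values below 1. */
static int example_quota_set(const struct example_quota *quotas,
			     const char *name, const char *val_str)
{
	unsigned int i;
	char *end;
	long val;

	errno = 0;
	val = strtol(val_str, &end, 10);
	if (errno || end == val_str || *end != '\0' ||
	    val < 1 || val > INT_MAX)
		return EINVAL;

	for (i = 0; quotas[i].value; i++) {
		if (!strcmp(name, quotas[i].name)) {
			*quotas[i].value = (int)val;
			return 0;
		}
	}

	return EINVAL;
}

/* A new resource is allowed only while usage is below the limit. */
static bool example_may_create(int used, int quota)
{
	return used < quota;
}
```

Setting a lower limit only changes the variable consulted by later
checks. A domain already using 800 nodes keeps them after the limit drops
to 500, but example_may_create() refuses further nodes until usage falls
below 500.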

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 334dc8b6fdf5..a7d006519ae8 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -366,6 +366,17 @@ CONTROL			<command>|[<parameters>|]
 	print|<string>
 		print <string> to syslog (xenstore runs as daemon) or
 		to console (xenstore runs as stubdom)
+	quota|[set <name> <val>|<domid>]
+		without parameters: print the current quota settings
+		with "set <name> <val>": set the quota <name> to new value
+		<val> (The admin should make sure all the domain usage is
+		below the quota. If it is not, then Xenstored may continue to
+		handle requests from the domain as long as the resource
+		violating the new quota setting isn't increased further)
+		with "<domid>": print quota related accounting data for
+		the domain <domid>
+	quota-soft|[set <name> <val>]
+		like the "quota" command, but for soft-quota.
 	help			<supported-commands>
 		return list of supported commands for CONTROL
 
diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index adb8d51b043b..1031a81c3874 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -196,6 +196,115 @@ static int do_control_log(void *ctx, struct connection *conn,
 	return 0;
 }
 
+struct quota {
+	const char *name;
+	int *quota;
+	const char *descr;
+};
+
+static const struct quota hard_quotas[] = {
+	{ "nodes", &quota_nb_entry_per_domain, "Nodes per domain" },
+	{ "watches", &quota_nb_watch_per_domain, "Watches per domain" },
+	{ "transactions", &quota_max_transaction, "Transactions per domain" },
+	{ "outstanding", &quota_req_outstanding,
+		"Outstanding requests per domain" },
+	{ "transaction-nodes", &quota_trans_nodes,
+		"Max. number of accessed nodes per transaction" },
+	{ "memory", &quota_memory_per_domain_hard,
+		"Total Xenstore memory per domain (error level)" },
+	{ "node-size", &quota_max_entry_size, "Max. size of a node" },
+	{ "path-max", &quota_max_path_len, "Max. length of a node path" },
+	{ "permissions", &quota_nb_perms_per_node,
+		"Max. number of permissions per node" },
+	{ NULL, NULL, NULL }
+};
+
+static const struct quota soft_quotas[] = {
+	{ "memory", &quota_memory_per_domain_soft,
+		"Total Xenstore memory per domain (warning level)" },
+	{ NULL, NULL, NULL }
+};
+
+static int quota_show_current(const void *ctx, struct connection *conn,
+			      const struct quota *quotas)
+{
+	char *resp;
+	unsigned int i;
+
+	resp = talloc_strdup(ctx, "Quota settings:\n");
+	if (!resp)
+		return ENOMEM;
+
+	for (i = 0; quotas[i].quota; i++) {
+		resp = talloc_asprintf_append(resp, "%-17s: %8d %s\n",
+					      quotas[i].name, *quotas[i].quota,
+					      quotas[i].descr);
+		if (!resp)
+			return ENOMEM;
+	}
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
+static int quota_set(const void *ctx, struct connection *conn,
+		     char **vec, int num, const struct quota *quotas)
+{
+	unsigned int i;
+	int val;
+
+	if (num != 2)
+		return EINVAL;
+
+	val = atoi(vec[1]);
+	if (val < 1)
+		return EINVAL;
+
+	for (i = 0; quotas[i].quota; i++) {
+		if (!strcmp(vec[0], quotas[i].name)) {
+			*quotas[i].quota = val;
+			send_ack(conn, XS_CONTROL);
+			return 0;
+		}
+	}
+
+	return EINVAL;
+}
+
+static int quota_get(const void *ctx, struct connection *conn,
+		     char **vec, int num)
+{
+	if (num != 1)
+		return EINVAL;
+
+	return domain_get_quota(ctx, conn, atoi(vec[0]));
+}
+
+static int do_control_quota(void *ctx, struct connection *conn,
+			    char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, hard_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, hard_quotas);
+
+	return quota_get(ctx, conn, vec, num);
+}
+
+static int do_control_quota_s(void *ctx, struct connection *conn,
+			      char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, soft_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, soft_quotas);
+
+	return EINVAL;
+}
+
 #ifdef __MINIOS__
 static int do_control_memreport(void *ctx, struct connection *conn,
 				char **vec, int num)
@@ -847,6 +956,8 @@ static struct cmd_s cmds[] = {
 	{ "memreport", do_control_memreport, "[<file>]" },
 #endif
 	{ "print", do_control_print, "<string>" },
+	{ "quota", do_control_quota, "[set <name> <val>|<domid>]" },
+	{ "quota-soft", do_control_quota_s, "[set <name> <val>]" },
 	{ "help", do_control_help, "" },
 };
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 94fd561e9de4..e7c6886ccf47 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -31,6 +31,7 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 #include "xenstored_watch.h"
+#include "xenstored_control.h"
 
 #include <xenevtchn.h>
 #include <xenctrl.h>
@@ -345,6 +346,38 @@ static struct domain *find_domain_struct(unsigned int domid)
 	return NULL;
 }
 
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid)
+{
+	struct domain *d = find_domain_struct(domid);
+	char *resp;
+	int ta;
+
+	if (!d)
+		return ENOENT;
+
+	ta = d->conn ? d->conn->transaction_started : 0;
+	resp = talloc_asprintf(ctx, "Domain %u:\n", domid);
+	if (!resp)
+		return ENOMEM;
+
+#define ent(t, e) \
+	resp = talloc_asprintf_append(resp, "%-16s: %8d\n", #t, e); \
+	if (!resp) return ENOMEM
+
+	ent(nodes, d->nbentry);
+	ent(watches, d->nbwatch);
+	ent(transactions, ta);
+	ent(outstanding, d->nboutstanding);
+	ent(memory, d->memory);
+
+#undef ent
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
 static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 633c9a0a0a1f..904faa923afb 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -87,6 +87,8 @@ int domain_watch(struct connection *conn);
 void domain_outstanding_inc(struct connection *conn);
 void domain_outstanding_dec(struct connection *conn);
 void domain_outstanding_domid_dec(unsigned int domid);
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
From 5aec1a37a8ccc51e613641decf99a10d77052f3f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:01 +0100
Subject: tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently have two different sets of defaults in the upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in the upstream oxenstored.conf.in

An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.

Resync the defaults from oxenstored.conf.in into the source code.

This is part of XSA-326 / CVE-2022-42316.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ebe18b8e312c..6b06f808595b 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -21,9 +21,9 @@ let xs_daemon_socket = Paths.xen_run_stored ^ "/socket"
 
 let default_config_dir = Paths.xen_config_dir
 
-let maxwatch = ref (50)
-let maxtransaction = ref (20)
-let maxrequests = ref (-1)   (* maximum requests per transaction *)
+let maxwatch = ref (100)
+let maxtransaction = ref (10)
+let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
diff --git a/tools/ocaml/xenstored/quota.ml b/tools/ocaml/xenstored/quota.ml
index abcac912805a..6e3d6401ae89 100644
--- a/tools/ocaml/xenstored/quota.ml
+++ b/tools/ocaml/xenstored/quota.ml
@@ -20,8 +20,8 @@ exception Transaction_opened
 
 let warn fmt = Logging.warn "quota" fmt
 let activate = ref true
-let maxent = ref (10000)
-let maxsize = ref (4096)
+let maxent = ref (1000)
+let maxsize = ref (2048)
 
 type t = {
 	maxent: int;               (* max entities per domU *)
From f9156d3a9de198e9dbb6274eeb233a1e12d96229 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Thu, 28 Jul 2022 17:08:15 +0100
Subject: tools/ocaml/xenstored: Check for maxrequests before performing
 operations
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.

If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded

It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there.  Then the transaction
record won't grow.

This is part of XSA-326 / CVE-2022-42317.
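
The "check first, then perform" ordering can be sketched in C as below
(the patch itself is OCaml). All example_* names, the counters and the
EXAMPLE_EQUOTA value are hypothetical and only model the idea: once the
limit is hit, a sticky flag keeps rejecting requests, and a rejected
request neither modifies the transaction's tree nor grows the record.

```c
#include <assert.h>
#include <stdbool.h>

#define EXAMPLE_EQUOTA 1	/* stand-in for the EQUOTA error reply */

struct example_txn {
	int nr_operations;	/* operations recorded for a later replay */
	int tree_changes;	/* stand-in for changes to the transaction's tree */
	bool quota_reached;	/* sticky once the limit has been hit */
};

/* A negative max_requests means "no limit"; dom0 is never limited. */
static int example_check_quota(struct example_txn *t, int max_requests,
			       bool is_dom0)
{
	if (max_requests >= 0 && !is_dom0 &&
	    (t->quota_reached || t->nr_operations >= max_requests)) {
		t->quota_reached = true;
		return EXAMPLE_EQUOTA;
	}
	return 0;
}

/* Check the quota before touching any transaction state. */
static int example_process(struct example_txn *t, int max_requests,
			   bool is_dom0)
{
	int ret = example_check_quota(t, max_requests, is_dom0);

	if (ret)
		return ret;	/* nothing performed, nothing recorded */

	t->tree_changes++;	/* perform the operation */
	t->nr_operations++;	/* record it for replay */
	return 0;
}
```

With a limit of 2, the third request is rejected and both counters stay
at 2, so the per-transaction state remains bounded by the limit.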

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 86eed024137b..d0400419ab4f 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -389,6 +389,7 @@ let input_handle_error ~cons ~doms ~fct ~con ~t ~req =
 	let reply_error e =
 		Packet.Error e in
 	try
+		Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 		fct con t doms cons req.Packet.data
 	with
 	| Define.Invalid_path          -> reply_error "EINVAL"
@@ -682,9 +683,10 @@ let process_packet ~store ~cons ~doms ~con ~req =
 		in
 
 		let response = try
+			Transaction.check_quota_exn ~perm:(Connection.get_perm con) t;
 			if tid <> Transaction.none then
 				(* Remember the request and response for this operation in case we need to replay the transaction *)
-				Transaction.add_operation ~perm:(Connection.get_perm con) t req response;
+				Transaction.add_operation t req response;
 			response
 		with Quota.Limit_reached ->
 			Packet.Error "EQUOTA"
diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml
index 17b1bdf2eaf9..294143e2335b 100644
--- a/tools/ocaml/xenstored/transaction.ml
+++ b/tools/ocaml/xenstored/transaction.ml
@@ -85,6 +85,7 @@ type t = {
 	oldroot: Store.Node.t;
 	mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list;
 	mutable operations: (Packet.request * Packet.response) list;
+	mutable quota_reached: bool;
 	mutable read_lowpath: Store.Path.t option;
 	mutable write_lowpath: Store.Path.t option;
 }
@@ -127,6 +128,7 @@ let make ?(internal=false) id store =
 		oldroot = Store.get_root store;
 		paths = [];
 		operations = [];
+		quota_reached = false;
 		read_lowpath = None;
 		write_lowpath = None;
 	} in
@@ -143,13 +145,19 @@ let get_root t = Store.get_root t.store
 
 let is_read_only t = t.paths = []
 let add_wop t ty path = t.paths <- (ty, path) :: t.paths
-let add_operation ~perm t request response =
+let get_operations t = List.rev t.operations
+
+let check_quota_exn ~perm t =
 	if !Define.maxrequests >= 0
 		&& not (Perms.Connection.is_dom0 perm)
-		&& List.length t.operations >= !Define.maxrequests
-		then raise Quota.Limit_reached;
+		&& (t.quota_reached || List.length t.operations >= !Define.maxrequests)
+		then begin
+			t.quota_reached <- true;
+			raise Quota.Limit_reached;
+		end
+
+let add_operation t request response =
 	t.operations <- (request, response) :: t.operations
-let get_operations t = List.rev t.operations
 let set_read_lowpath t path = t.read_lowpath <- get_lowest path t.read_lowpath
 let set_write_lowpath t path = t.write_lowpath <- get_lowest path t.write_lowpath
 
From fe783d1e5a69352da305fadd345b26e48aab2380 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:07 +0100
Subject: tools/ocaml: GC parameter tuning
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory.  Tweak this to 120% instead, which
would match the major GC speed.

This is part of XSA-326.
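
As a rough worked example of the two settings, the amount of unused
memory that may build up before memory is returned to the OS can be
estimated as a percentage of live memory. The helper below is a
hypothetical simplification written in C for illustration; the OCaml
runtime works from its own estimate of wasted memory.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Approximate amount of unused heap (in MiB) that can accumulate before
 * a compaction is triggered, for a given amount of live data and a
 * max overhead setting expressed in percent.
 */
static uint64_t example_unused_before_compaction(uint64_t live_mib,
						 unsigned int max_overhead_pct)
{
	return live_mib * max_overhead_pct / 100;
}
```

For 100 MiB of live data this gives about 500 MiB at the default of 500%
and about 120 MiB at the new setting of 120%.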

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index 6b06f808595b..ba63a8147e09 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -25,6 +25,7 @@ let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
 
+let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
 let conflict_max_history_seconds = ref 0.05
 let conflict_rate_limit_is_aggregate = ref true
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index d44ae673c42a..3b57ad016dfb 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -104,6 +104,7 @@ let parse_config filename =
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
 		("quota-path-max", Config.Set_int Define.path_max);
+		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
 		("persistent", Config.Set_bool Disk.enable);
 		("xenstored-log-file", Config.String Logging.set_xenstored_log_destination);
@@ -265,6 +266,67 @@ let to_file store cons fds file =
 	        (fun () -> close_out channel)
 end
 
+(*
+	By default OCaml's GC only returns memory to the OS when it exceeds a
+	configurable 'max overhead' setting.
+	The default is 500%, that is 5/6th of the OCaml heap needs to be free
+	and only 1/6th live for a compaction to be triggerred that would
+	release memory back to the OS.
+	If the limit is not hit then the OCaml process can reuse that memory
+	for its own purposes, but other processes won't be able to use it.
+
+	There is also a 'space overhead' setting that controls how much work
+	each major GC slice does, and by default aims at having no more than
+	80% or 120% (depending on version) garbage values compared to live
+	values.
+	This doesn't have as much relevance to memory returned to the OS as
+	long as space_overhead <= max_overhead, because compaction is only
+	triggerred at the end of major GC cycles.
+
+	The defaults are too large once the program starts using ~100MiB of
+	memory, at which point ~500MiB would be unavailable to other processes
+	(which would be fine if this was the main process in this VM, but it is
+	not).
+
+	Max overhead can also be set to 0, however this is for testing purposes
+	only (setting it lower than 'space overhead' wouldn't help because the
+	major GC wouldn't run fast enough, and compaction does have a
+	performance cost: we can only compact contiguous regions, so memory has
+	to be moved around).
+
+	Max overhead controls how often the heap is compacted, which is useful
+	if there are burst of activity followed by long periods of idle state,
+	or if a domain quits, etc. Compaction returns memory to the OS.
+
+	wasted = live * space_overhead / 100
+
+	For globally overriding the GC settings one can use OCAMLRUNPARAM,
+	however we provide a config file override to be consistent with other
+	oxenstored settings.
+
+	One might want to dynamically adjust the overhead setting based on used
+	memory, i.e. to use a fixed upper bound in bytes, not percentage. However
+	measurements show that such adjustments increase GC overhead massively,
+	while still not guaranteeing that memory is returned any more quickly
+	than with a percentage based setting.
+
+	The allocation policy could also be tweaked, e.g. first fit would reduce
+	fragmentation and thus memory usage, but the documentation warns that it
+	can be sensibly slower, and indeed one of our own testcases can trigger
+	such a corner case where it is multiple times slower, so it is best to keep
+	the default allocation policy (next-fit/best-fit depending on version).
+
+	There are other tweaks that can be attempted in the future, e.g. setting
+	'ulimit -v' to 75% of RAM, however getting the kernel to actually return
+	NULL from allocations is difficult even with that setting, and without a
+	NULL the emergency GC won't be triggerred.
+	Perhaps cgroup limits could help, but for now tweak the safest only.
+*)
+
+let tweak_gc () =
+	Gc.set { (Gc.get ()) with Gc.max_overhead = !Define.gc_max_overhead }
+
+
 let _ =
 	let cf = do_argv in
 	let pidfile =
@@ -274,6 +336,8 @@ let _ =
 			default_pidfile
 		in
 
+	tweak_gc ();
+
 	(try
 		Unixext.mkdir_rec (Filename.dirname pidfile) 0o755
 	with _ ->
From e9af39f0b4d47022babe3dba38d83d7eb82d8a3e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:02 +0100
Subject: tools/ocaml: Change Xb.input to return Packet.t option
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The queue here would only ever hold at most one element.  This will simplify
follow-up patches.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 8404ddd8a682..165fd4a1edf4 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -45,7 +45,6 @@ type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 type t =
 {
 	backend: backend;
-	pkt_in: Packet.t Queue.t;
 	pkt_out: Packet.t Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
@@ -62,7 +61,6 @@ let reconnect t = match t.backend with
 		Xs_ring.close backend.mmap;
 		backend.eventchn_notify ();
 		(* Clear our old connection state *)
-		Queue.clear t.pkt_in;
 		Queue.clear t.pkt_out;
 		t.partial_in <- init_partial_in ();
 		t.partial_out <- ""
@@ -124,7 +122,6 @@ let output con =
 
 (* NB: can throw Reconnect *)
 let input con =
-	let newpacket = ref false in
 	let to_read =
 		match con.partial_in with
 		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
@@ -143,21 +140,19 @@ let input con =
 		if Partial.to_complete partial_pkt = 0 then (
 			let pkt = Packet.of_partialpkt partial_pkt in
 			con.partial_in <- init_partial_in ();
-			Queue.push pkt con.pkt_in;
-			newpacket := true
-		)
+			Some pkt
+		) else None
 	| NoHdr (i, buf)      ->
 		(* we complete the partial header *)
 		if sz > 0 then
 			Bytes.blit b 0 buf (Partial.header_size () - i) sz;
 		con.partial_in <- if sz = i then
-			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf)
-	);
-	!newpacket
+			HaveHdr (Partial.of_string (Bytes.to_string buf)) else NoHdr (i - sz, buf);
+		None
+	)
 
 let newcon backend = {
 	backend = backend;
-	pkt_in = Queue.create ();
 	pkt_out = Queue.create ();
 	partial_in = init_partial_in ();
 	partial_out = "";
@@ -193,9 +188,6 @@ let has_output con = has_new_output con || has_old_output con
 
 let peek_output con = Queue.peek con.pkt_out
 
-let input_len con = Queue.length con.pkt_in
-let has_in_packet con = Queue.length con.pkt_in > 0
-let get_in_packet con = Queue.pop con.pkt_in
 let has_partial_input con = match con.partial_in with
 	| HaveHdr _ -> true
 	| NoHdr (n, _) -> n < Partial.header_size ()
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 794e35bb343e..91c682162cea 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -77,7 +77,7 @@ val write_fd : backend_fd -> 'a -> string -> int -> int
 val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
-val input : t -> bool
+val input : t -> Packet.t option
 val newcon : backend -> t
 val open_fd : Unix.file_descr -> t
 val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
@@ -89,10 +89,7 @@ val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
 val peek_output : t -> Packet.t
-val input_len : t -> int
-val has_in_packet : t -> bool
 val has_partial_input : t -> bool
-val get_in_packet : t -> Packet.t
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index d982fb24dbb1..451f8b38dbcc 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -94,26 +94,18 @@ let pkt_send con =
 	done
 
 (* receive one packet - can sleep *)
-let pkt_recv con =
-	let workdone = ref false in
-	while not !workdone
-	do
-		workdone := Xb.input con.xb
-	done;
-	Xb.get_in_packet con.xb
+let rec pkt_recv con =
+	match Xb.input con.xb with
+	| Some packet -> packet
+	| None -> pkt_recv con
 
 let pkt_recv_timeout con timeout =
 	let fd = Xb.get_fd con.xb in
 	let r, _, _ = Unix.select [ fd ] [] [] timeout in
 	if r = [] then
 		true, None
-	else (
-		let workdone = Xb.input con.xb in
-		if workdone then
-			false, (Some (Xb.get_in_packet con.xb))
-		else
-			false, None
-	)
+	else
+		false, Xb.input con.xb
 
 let queue_watchevent con data =
 	let ls = split_string ~limit:2 '\000' data in
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 64180bb2d5f6..3f6a8f1ad0f7 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -277,9 +277,7 @@ let get_transaction con tid =
 	Hashtbl.find con.transactions tid
 
 let do_input con = Xenbus.Xb.input con.xb
-let has_input con = Xenbus.Xb.has_in_packet con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
-let pop_in con = Xenbus.Xb.get_in_packet con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
 let has_output con = Xenbus.Xb.has_output con.xb
@@ -307,7 +305,7 @@ let is_bad con = match con.dom with None -> false | Some dom -> Domain.is_bad_do
    Restrictions below can be relaxed once xenstored learns to dump more
    of its live state in a safe way *)
 let has_extra_connection_data con =
-	let has_in = has_input con || has_partial_input con in
+	let has_in = has_partial_input con in
 	let has_out = has_output con in
 	let has_nondefault_perms = make_perm con.dom <> con.perm in
 	has_in || has_out
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 3aef4e4673f9..69a96f2da8e9 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -195,10 +195,9 @@ let parse_live_update args =
 			| _ when Unix.gettimeofday () < t.deadline -> false
 			| l ->
 				warn "timeout reached: have to wait, migrate or shutdown %d domains:" (List.length l);
-				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, in: %b, out: %b, perm: %s"
+				let msgs = List.rev_map (fun con -> Printf.sprintf "%s: %d tx, out: %b, perm: %s"
 					(Connection.get_domstr con)
 					(Connection.number_of_transactions con)
-					(Connection.has_input con)
 					(Connection.has_output con)
 					(Connection.get_perm con |> Perms.Connection.to_string)
 					) l in
@@ -706,16 +705,17 @@ let do_input store cons doms con =
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
 			info "%s reconnection complete" (Connection.get_domstr con);
-			false
+			None
 		| Failure exp ->
 			error "caught exception %s" exp;
 			error "got a bad client %s" (sprintf "%-8s" (Connection.get_domstr con));
 			Connection.mark_as_bad con;
-			false
+			None
 	in
 
-	if newpacket then (
-		let packet = Connection.pop_in con in
+	match newpacket with
+	| None -> ()
+	| Some packet ->
 		let tid, rid, ty, data = Xenbus.Xb.Packet.unpack packet in
 		let req = {Packet.tid=tid; Packet.rid=rid; Packet.ty=ty; Packet.data=data} in
 
@@ -725,8 +725,7 @@ let do_input store cons doms con =
 		         (Xenbus.Xb.Op.to_string ty) (sanitize_data data); *)
 		process_packet ~store ~cons ~doms ~con ~req;
 		write_access_log ~ty ~tid ~con:(Connection.get_domstr con) ~data;
-		Connection.incr_ops con;
-	)
+		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
 	if Connection.has_output con then (
From 40998535b6d1b8e7670da1c4ea81b6d8d8994c18 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:03 +0100
Subject: tools/ocaml/xb: Add BoundedQueue
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ensure that we cannot store more than [capacity] elements in a [Queue].
Replacing every use of Queue with this module then guarantees at compile time
that all queues are bounds-checked.

Each element in the queue has a class with its own limits.  This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
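
As an illustration of the per-class limits, here is a minimal standalone
sketch.  It is not the module added by this patch: the names [bq], [Reply] and
[Event] are made up for the example, and strings stand in for packets.

```ocaml
(* Simplified, standalone sketch of a class-aware bounded queue.
   Names (bq, Reply, Event, ...) are illustrative, not the patch's API. *)
type cls = Reply | Event

type 'a bq = {
  q : 'a Queue.t;
  capacity : int;                 (* overall limit *)
  classify : 'a -> cls;
  limit : cls -> int;             (* per-class limit *)
  counts : (cls, int) Hashtbl.t;  (* elements currently queued per class *)
}

let create ~capacity ~classify ~limit =
  { q = Queue.create (); capacity; classify; limit; counts = Hashtbl.create 3 }

let count t c = try Hashtbl.find t.counts c with Not_found -> 0

(* Push succeeds only if both the overall and the per-class limit allow it. *)
let push e t =
  let c = t.classify e in
  let n = count t c in
  if Queue.length t.q < t.capacity && n < t.limit c then begin
    Queue.push e t.q;
    Hashtbl.replace t.counts c (n + 1);
    Some ()
  end else None

(* Pop the oldest element and release its slot in its class. *)
let pop t =
  let e = Queue.pop t.q in
  let c = t.classify e in
  Hashtbl.replace t.counts c (count t c - 1);
  e

let () =
  (* hypothetical classification: strings starting with 'w' are events *)
  let classify s = if String.length s > 0 && s.[0] = 'w' then Event else Reply in
  let limit = function Reply -> 1 | Event -> 2 in
  let t = create ~capacity:3 ~classify ~limit in
  assert (push "w1" t = Some ());
  assert (push "w2" t = Some ());
  assert (push "w3" t = None);      (* Event class is full *)
  assert (push "r1" t = Some ());   (* Reply class still has room *)
  assert (push "r2" t = None);      (* Reply limit and overall capacity reached *)
  assert (pop t = "w1")
```

In the sketch a full Event class does not stop a Reply from being queued,
which is the property the subsequent change relies on.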

No functional change.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 165fd4a1edf4..4197a3888a68 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -17,6 +17,98 @@
 module Op = struct include Op end
 module Packet = struct include Packet end
 
+module BoundedQueue : sig
+	type ('a, 'b) t
+
+	(** [create ~capacity ~classify ~limit] creates a queue with maximum [capacity] elements.
+	    This is burst capacity, each element is further classified according to [classify],
+	    and each class can have its own [limit].
+	    [capacity] is enforced as an overall limit.
+	    The [limit] can be dynamic, and can be smaller than the number of elements already queued of that class,
+	    in which case those elements are considered to use "burst capacity".
+	  *)
+	val create: capacity:int -> classify:('a -> 'b) -> limit:('b -> int) -> ('a, 'b) t
+
+	(** [clear q] discards all elements from [q] *)
+	val clear: ('a, 'b) t -> unit
+
+	(** [can_push q cls] is [true] when [length q < capacity] and class [cls] is below its [limit]. *)
+	val can_push: ('a, 'b) t -> 'b -> bool
+
+	(** [push e q] adds [e] at the end of queue [q] if [can_push q], or returns [None]. *)
+	val push: 'a -> ('a, 'b) t -> unit option
+
+	(** [pop q] removes and returns first element in [q], or raises [Queue.Empty]. *)
+	val pop: ('a, 'b) t -> 'a
+
+	(** [peek q] returns the first element in [q], or raises [Queue.Empty].  *)
+	val peek : ('a, 'b) t -> 'a
+
+	(** [length q] returns the current number of elements in [q] *)
+	val length: ('a, 'b) t -> int
+
+	(** [debug string_of_class q] prints queue usage statistics in an unspecified internal format. *)
+	val debug: ('b -> string) -> (_, 'b) t -> string
+end = struct
+	type ('a, 'b) t =
+		{ q: 'a Queue.t
+		; capacity: int
+		; classify: 'a -> 'b
+		; limit: 'b -> int
+		; class_count: ('b, int) Hashtbl.t
+		}
+
+	let create ~capacity ~classify ~limit =
+		{ capacity; q = Queue.create (); classify; limit; class_count = Hashtbl.create 3 }
+
+	let get_count t classification = try Hashtbl.find t.class_count classification with Not_found -> 0
+
+	let can_push_internal t classification class_count =
+		Queue.length t.q < t.capacity && class_count < t.limit classification
+
+	let ok = Some ()
+
+	let push e t =
+		let classification = t.classify e in
+		let class_count = get_count t classification in
+		if can_push_internal t classification class_count then begin
+			Queue.push e t.q;
+			Hashtbl.replace t.class_count classification (class_count + 1);
+			ok
+		end
+		else
+			None
+
+	let can_push t classification =
+		can_push_internal t classification @@ get_count t classification
+
+	let clear t =
+		Queue.clear t.q;
+		Hashtbl.reset t.class_count
+
+	let pop t =
+		let e = Queue.pop t.q in
+		let classification = t.classify e in
+		let () = match get_count t classification - 1 with
+		| 0 -> Hashtbl.remove t.class_count classification (* reduces memusage *)
+		| n -> Hashtbl.replace t.class_count classification n
+		in
+		e
+
+	let peek t = Queue.peek t.q
+	let length t = Queue.length t.q
+
+	let debug string_of_class t =
+		let b = Buffer.create 128 in
+		Printf.bprintf b "BoundedQueue capacity: %d, used: {" t.capacity;
+		Hashtbl.iter (fun packet_class count ->
+			Printf.bprintf b "	%s: %d" (string_of_class packet_class) count
+		) t.class_count;
+		Printf.bprintf b "}";
+		Buffer.contents b
+end
+
+
 exception End_of_file
 exception Eagain
 exception Noent
From 35808f876d525514cbc13d2b8dadd364fb2040c1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edvin.torok@citrix.com>
Date: Wed, 12 Oct 2022 19:13:04 +0100
Subject: tools/ocaml: Limit maximum in-flight requests / outstanding replies
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a limit on the number of outstanding reply packets in the xenbus
queue.  This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
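
A rough sketch of that gating follows.  The [conn] record and [step] function
are hypothetical stand-ins for the real connection state, and strings stand in
for packets.

```ocaml
(* Toy connection: [pending] holds replies the client has not read yet.
   All names here are hypothetical and only illustrate the gating idea. *)
type conn = {
  inbox : string Queue.t;     (* requests waiting in the ring *)
  pending : string Queue.t;   (* replies not yet consumed by the client *)
  max_outstanding : int;
}

(* Input may only be processed while there is room for one more reply. *)
let can_input c = Queue.length c.pending < c.max_outstanding

(* Process at most one request, and only if its reply is guaranteed a slot.
   Returns whether a request was processed. *)
let step c =
  if can_input c && not (Queue.is_empty c.inbox) then begin
    let req = Queue.pop c.inbox in
    Queue.push ("reply:" ^ req) c.pending;
    true
  end else false

let () =
  let c = { inbox = Queue.create (); pending = Queue.create (); max_outstanding = 2 } in
  List.iter (fun r -> Queue.push r c.inbox) ["a"; "b"; "c"];
  assert (step c);
  assert (step c);
  assert (not (step c));          (* output queue full: "c" stays unread *)
  ignore (Queue.pop c.pending);   (* the client reads one reply *)
  assert (step c)                 (* processing resumes *)
```

The unread request is not dropped in the sketch; it simply stays in the input
queue until the client drains a reply.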

To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it.  Even though Dom0
is trusted and exempt from quotas, a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.

Therefore the xenbus queue limit must apply to all connections.  Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.

This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring.  However, if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stops reading its own ring, then watch events
would still queue up without limit.

The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events.  In fact, dropping watch events is known to break
some pieces of normal functionality.  This leaves few options for implementing
the xenstore protocol safely without exposing the xenstore daemon to
out-of-memory attacks.

Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
  deliver

Items are queued up on one end and are sent as far along the pipe as possible:

  source domain -> watch -> xenbus of target -> xenstore ring/socket of target

If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up.  For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.

Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
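
The chain can be sketched with a few lines of standalone OCaml.  This is a
simplified model, not the code from this patch: the names [pipe], [make] and
[ring_send] are invented for the example, the capacities are tiny so the
back-pressure is visible, and integers stand in for watch events.

```ocaml
(* A sender accepts an item (Some ()) or refuses it because it is full (None). *)
type 'a sender = 'a -> unit option

type 'a pipe = { q : 'a Queue.t; capacity : int; destination : 'a sender }

let make ~capacity ~destination = { q = Queue.create (); capacity; destination }

(* Move items downstream, oldest first, until the destination refuses one. *)
let rec flush p =
  if not (Queue.is_empty p.q) then begin
    match p.destination (Queue.peek p.q) with
    | None -> ()
    | Some () ->
      ignore (Queue.pop p.q);
      flush p
  end

(* Flush first so older items go ahead, then enqueue if there is room. *)
let push p item =
  flush p;
  if Queue.length p.q < p.capacity then begin
    Queue.push item p.q;
    flush p;
    Some ()
  end else None

let () =
  (* hypothetical target ring that holds a single item *)
  let ring = Queue.create () in
  let ring_send x =
    if Queue.length ring < 1 then (Queue.push x ring; Some ()) else None in
  let watch = make ~capacity:1 ~destination:ring_send in
  let source = make ~capacity:2 ~destination:(push watch) in
  assert (push source 1 = Some ());  (* travels all the way to the ring *)
  assert (push source 2 = Some ());  (* ring full: waits in the watch pipe *)
  assert (push source 3 = Some ());  (* watch full: waits in the source pipe *)
  assert (push source 4 = Some ());
  assert (push source 5 = None);     (* source full: back-pressure *)
  assert (Queue.pop ring = 1);       (* the target consumes one item *)
  flush source;
  assert (Queue.pop ring = 2)        (* order is preserved *)
```

In this model the source would stop accepting commands as soon as its own
buffer is non-empty, which happens here from the third push onwards.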

We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue.  This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.

When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).

There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.
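
One way to picture that cache is a set that is filtered on every cycle.  The
sketch below is only a model: watches are represented by integer ids, and
[pending], [flush_events] and [cycle] are hypothetical names rather than
functions from this patch.

```ocaml
module IntSet = Set.Make (struct type t = int let compare = compare end)

(* Hypothetical stand-in: watch id -> number of events still waiting. *)
let pending : (int, int ref) Hashtbl.t = Hashtbl.create 7

(* Deliver up to [room] events for watch [id]; return true if some remain. *)
let flush_events ~room id =
  let n = Hashtbl.find pending id in
  n := max 0 (!n - room);
  !n > 0

(* One cycle: flush every cached watch, keep only those that still have events. *)
let cycle ~room cache = IntSet.filter (flush_events ~room) cache

let () =
  Hashtbl.replace pending 1 (ref 3);
  Hashtbl.replace pending 2 (ref 1);
  let cache = IntSet.of_list [1; 2] in
  let cache = cycle ~room:2 cache in
  assert (IntSet.elements cache = [1]);   (* watch 2 drained, watch 1 did not *)
  let cache = cycle ~room:2 cache in
  assert (IntSet.is_empty cache)
```

Watches that drain completely fall out of the set, so later cycles only visit
watches that still have something to deliver.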

Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
  source will not be allowed to queue up more watch events.
* quota-maxoutstanding, which is the number of responses not read from the
  ring: once exceeded, no more inputs are processed until all outstanding
  replies are consumed by the client.
* overflow queue on the watch event source: all watch events that cannot be
  stored on the destination are queued up here; a single command can trigger
  multiple watches (e.g. due to recursion).

The overflow queue currently doesn't have an upper bound: it is difficult to
accurately calculate one, as it depends on whether you are Dom0, how many
watches each path has registered, and how many watch events you can trigger
with a single command (e.g. a commit).  However, these events were already
using memory; this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.

Note that Dom0 is not excluded from these checks.  It is especially important
that Dom0 is not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.

This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.

The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.

Similarly, a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well, as long as it follows the PV
protocol.  (Note, however, that protection against malicious backends is only
best effort at the moment.)

This is part of XSA-326 / CVE-2022-42318.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/tools/ocaml/libs/xb/xb.ml b/tools/ocaml/libs/xb/xb.ml
index 4197a3888a68..b292ed7a874d 100644
--- a/tools/ocaml/libs/xb/xb.ml
+++ b/tools/ocaml/libs/xb/xb.ml
@@ -134,14 +134,44 @@ type backend = Fd of backend_fd | Xenmmap of backend_mmap
 
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
 
+(*
+	separate capacity reservation for replies and watch events:
+	this allows a domain to keep working even when under a constant flood of
+	watch events
+*)
+type capacity = { maxoutstanding: int; maxwatchevents: int }
+
+module Queue = BoundedQueue
+
+type packet_class =
+	| CommandReply
+	| Watchevent
+
+let string_of_packet_class = function
+	| CommandReply -> "command_reply"
+	| Watchevent -> "watch_event"
+
 type t =
 {
 	backend: backend;
-	pkt_out: Packet.t Queue.t;
+	pkt_out: (Packet.t, packet_class) Queue.t;
 	mutable partial_in: partial_buf;
 	mutable partial_out: string;
+	capacity: capacity
 }
 
+let to_read con =
+	match con.partial_in with
+		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
+		| NoHdr   (i, _)    -> i
+
+let debug t =
+	Printf.sprintf "XenBus state: partial_in: %d needed, partial_out: %d bytes, pkt_out: %d packets, %s"
+		(to_read t)
+		(String.length t.partial_out)
+		(Queue.length t.pkt_out)
+		(BoundedQueue.debug string_of_packet_class t.pkt_out)
+
 let init_partial_in () = NoHdr
 	(Partial.header_size (), Bytes.make (Partial.header_size()) '\000')
 
@@ -199,7 +229,8 @@ let output con =
 	let s = if String.length con.partial_out > 0 then
 			con.partial_out
 		else if Queue.length con.pkt_out > 0 then
-			Packet.to_string (Queue.pop con.pkt_out)
+			let pkt = Queue.pop con.pkt_out in
+			Packet.to_string pkt
 		else
 			"" in
 	(* send data from s, and save the unsent data to partial_out *)
@@ -212,12 +243,15 @@ let output con =
 	(* after sending one packet, partial is empty *)
 	con.partial_out = ""
 
+(* we can only process an input packet if we're guaranteed to have room
+   to store the response packet *)
+let can_input con = Queue.can_push con.pkt_out CommandReply
+
 (* NB: can throw Reconnect *)
 let input con =
-	let to_read =
-		match con.partial_in with
-		| HaveHdr partial_pkt -> Partial.to_complete partial_pkt
-		| NoHdr   (i, _)    -> i in
+	if not (can_input con) then None
+	else
+	let to_read = to_read con in
 
 	(* try to get more data from input stream *)
 	let b = Bytes.make to_read '\000' in
@@ -243,11 +277,22 @@ let input con =
 		None
 	)
 
-let newcon backend = {
+let classify t =
+	match t.Packet.ty with
+	| Op.Watchevent -> Watchevent
+	| _ -> CommandReply
+
+let newcon ~capacity backend =
+	let limit = function
+		| CommandReply -> capacity.maxoutstanding
+		| Watchevent -> capacity.maxwatchevents
+	in
+	{
 	backend = backend;
-	pkt_out = Queue.create ();
+	pkt_out = Queue.create ~capacity:(capacity.maxoutstanding + capacity.maxwatchevents) ~classify ~limit;
 	partial_in = init_partial_in ();
 	partial_out = "";
+	capacity = capacity;
 	}
 
 let open_fd fd = newcon (Fd { fd = fd; })
diff --git a/tools/ocaml/libs/xb/xb.mli b/tools/ocaml/libs/xb/xb.mli
index 91c682162cea..71b2754ca788 100644
--- a/tools/ocaml/libs/xb/xb.mli
+++ b/tools/ocaml/libs/xb/xb.mli
@@ -66,10 +66,11 @@ type backend_mmap = {
 type backend_fd = { fd : Unix.file_descr; }
 type backend = Fd of backend_fd | Xenmmap of backend_mmap
 type partial_buf = HaveHdr of Partial.pkt | NoHdr of int * bytes
+type capacity = { maxoutstanding: int; maxwatchevents: int }
 type t
 val init_partial_in : unit -> partial_buf
 val reconnect : t -> unit
-val queue : t -> Packet.t -> unit
+val queue : t -> Packet.t -> unit option
 val read_fd : backend_fd -> 'a -> bytes -> int -> int
 val read_mmap : backend_mmap -> 'a -> bytes -> int -> int
 val read : t -> bytes -> int -> int
@@ -78,13 +79,14 @@ val write_mmap : backend_mmap -> 'a -> string -> int -> int
 val write : t -> string -> int -> int
 val output : t -> bool
 val input : t -> Packet.t option
-val newcon : backend -> t
-val open_fd : Unix.file_descr -> t
-val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> t
+val newcon : capacity:capacity -> backend -> t
+val open_fd : Unix.file_descr -> capacity:capacity -> t
+val open_mmap : Xenmmap.mmap_interface -> (unit -> unit) -> capacity:capacity -> t
 val close : t -> unit
 val is_fd : t -> bool
 val is_mmap : t -> bool
 val output_len : t -> int
+val can_input: t -> bool
 val has_new_output : t -> bool
 val has_old_output : t -> bool
 val has_output : t -> bool
@@ -93,3 +95,4 @@ val has_partial_input : t -> bool
 val has_more_input : t -> bool
 val is_selectable : t -> bool
 val get_fd : t -> Unix.file_descr
+val debug: t -> string
diff --git a/tools/ocaml/libs/xs/queueop.ml b/tools/ocaml/libs/xs/queueop.ml
index 9ff5bbd529ce..4e532cdaeacb 100644
--- a/tools/ocaml/libs/xs/queueop.ml
+++ b/tools/ocaml/libs/xs/queueop.ml
@@ -16,9 +16,10 @@
 open Xenbus
 
 let data_concat ls = (String.concat "\000" ls) ^ "\000"
+let queue con pkt = let r = Xb.queue con pkt in assert (r <> None)
 let queue_path ty (tid: int) (path: string) con =
 	let data = data_concat [ path; ] in
-	Xb.queue con (Xb.Packet.create tid 0 ty data)
+	queue con (Xb.Packet.create tid 0 ty data)
 
 (* operations *)
 let directory tid path con = queue_path Xb.Op.Directory tid path con
@@ -27,48 +28,48 @@ let read tid path con = queue_path Xb.Op.Read tid path con
 let getperms tid path con = queue_path Xb.Op.Getperms tid path con
 
 let debug commands con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Debug (data_concat commands))
 
 let watch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Watch data)
 
 let unwatch path data con =
 	let data = data_concat [ path; data; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Unwatch data)
 
 let transaction_start con =
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
+	queue con (Xb.Packet.create 0 0 Xb.Op.Transaction_start (data_concat []))
 
 let transaction_end tid commit con =
 	let data = data_concat [ (if commit then "T" else "F"); ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Transaction_end data)
 
 let introduce domid mfn port con =
 	let data = data_concat [ Printf.sprintf "%u" domid;
 	                         Printf.sprintf "%nu" mfn;
 	                         string_of_int port; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Introduce data)
 
 let release domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Release data)
 
 let resume domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Resume data)
 
 let getdomainpath domid con =
 	let data = data_concat [ Printf.sprintf "%u" domid; ] in
-	Xb.queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
+	queue con (Xb.Packet.create 0 0 Xb.Op.Getdomainpath data)
 
 let write tid path value con =
 	let data = path ^ "\000" ^ value (* no NULL at the end *) in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Write data)
 
 let mkdir tid path con = queue_path Xb.Op.Mkdir tid path con
 let rm tid path con = queue_path Xb.Op.Rm tid path con
 
 let setperms tid path perms con =
 	let data = data_concat [ path; perms ] in
-	Xb.queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
+	queue con (Xb.Packet.create tid 0 Xb.Op.Setperms data)
diff --git a/tools/ocaml/libs/xs/xsraw.ml b/tools/ocaml/libs/xs/xsraw.ml
index 451f8b38dbcc..cbd17280600c 100644
--- a/tools/ocaml/libs/xs/xsraw.ml
+++ b/tools/ocaml/libs/xs/xsraw.ml
@@ -36,8 +36,10 @@ type con = {
 let close con =
 	Xb.close con.xb
 
+let capacity = { Xb.maxoutstanding = 1; maxwatchevents = 0; }
+
 let open_fd fd = {
-	xb = Xb.open_fd fd;
+	xb = Xb.open_fd ~capacity fd;
 	watchevents = Queue.create ();
 }
 
diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml
index 3f6a8f1ad0f7..54f7f765167b 100644
--- a/tools/ocaml/xenstored/connection.ml
+++ b/tools/ocaml/xenstored/connection.ml
@@ -20,12 +20,84 @@ open Stdext
 
 let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *)
 
+type 'a bounded_sender = 'a -> unit option
+(** a bounded sender accepts an ['a] item and returns:
+    None - if there is no room to accept the item
+    Some () -  if it has successfully accepted/sent the item
+ *)
+
+module BoundedPipe : sig
+	type 'a t
+
+	(** [create ~capacity ~destination] creates a bounded pipe with a
+	    local buffer holding at most [capacity] items.  Once the buffer is
+	    full it will not accept further items.  items from the pipe are
+	    flushed into [destination] as long as it accepts items.  The
+	    destination could be another pipe.
+	 *)
+	val create: capacity:int -> destination:'a bounded_sender -> 'a t
+
+	(** [is_empty t] returns whether the local buffer of [t] is empty. *)
+	val is_empty : _ t -> bool
+
+	(** [length t] the number of items in the internal buffer *)
+	val length: _ t -> int
+
+	(** [flush_pipe t] sends as many items from the local buffer as possible,
+			which could be none. *)
+	val flush_pipe: _ t -> unit
+
+	(** [push t item] tries to [flush_pipe] and then push [item]
+	    into the pipe if its [capacity] allows.
+	    Returns [None] if there is no more room
+	 *)
+	val push : 'a t -> 'a bounded_sender
+end = struct
+	(* items are enqueued in [q], and then flushed to [destination] *)
+	type 'a t =
+		{ q: 'a Queue.t
+		; destination: 'a bounded_sender
+		; capacity: int
+		}
+
+	let create ~capacity ~destination =
+		{ q = Queue.create (); capacity; destination }
+
+	let rec flush_pipe t =
+		if not Queue.(is_empty t.q) then
+			let item = Queue.peek t.q in
+			match t.destination item with
+			| None -> () (* no room *)
+			| Some () ->
+				(* successfully sent item to next stage *)
+				let _ = Queue.pop t.q in
+				(* continue trying to send more items *)
+				flush_pipe t
+
+	let push t item =
+		(* first try to flush as many items from this pipe as possible to make room,
+		   it is important to do this first to preserve the order of the items
+		 *)
+		flush_pipe t;
+		if Queue.length t.q < t.capacity then begin
+			(* enqueue, instead of sending directly.
+			   this ensures that [destination] sees the items in the same order as we receive them
+			 *)
+			Queue.push item t.q;
+			Some (flush_pipe t)
+		end else None
+
+	let is_empty t = Queue.is_empty t.q
+	let length t = Queue.length t.q
+end
+
 type watch = {
 	con: t;
 	token: string;
 	path: string;
 	base: string;
 	is_relative: bool;
+	pending_watchevents: Xenbus.Xb.Packet.t BoundedPipe.t;
 }
 
 and t = {
@@ -38,8 +110,36 @@ and t = {
 	anonid: int;
 	mutable stat_nb_ops: int;
 	mutable perm: Perms.Connection.t;
+	pending_source_watchevents: (watch * Xenbus.Xb.Packet.t) BoundedPipe.t
 }
 
+module Watch = struct
+	module T = struct
+		type t = watch
+
+		let compare w1 w2 =
+			(* cannot compare watches from different connections *)
+			assert (w1.con == w2.con);
+			match String.compare w1.token w2.token with
+			| 0 -> String.compare w1.path w2.path
+			| n -> n
+	end
+	module Set = Set.Make(T)
+
+	let flush_events t =
+		BoundedPipe.flush_pipe t.pending_watchevents;
+		not (BoundedPipe.is_empty t.pending_watchevents)
+
+	let pending_watchevents t =
+		BoundedPipe.length t.pending_watchevents
+end
+
+let source_flush_watchevents t =
+	BoundedPipe.flush_pipe t.pending_source_watchevents
+
+let source_pending_watchevents t =
+	BoundedPipe.length t.pending_source_watchevents
+
 let mark_as_bad con =
 	match con.dom with
 	|None -> ()
@@ -67,7 +167,8 @@ let watch_create ~con ~path ~token = {
 	token = token;
 	path = path;
 	base = get_path con;
-	is_relative = path.[0] <> '/' && path.[0] <> '@'
+	is_relative = path.[0] <> '/' && path.[0] <> '@';
+	pending_watchevents = BoundedPipe.create ~capacity:!Define.maxwatchevents ~destination:(Xenbus.Xb.queue con.xb)
 }
 
 let get_con w = w.con
@@ -93,6 +194,9 @@ let make_perm dom =
 	Perms.Connection.create ~perms:[Perms.READ; Perms.WRITE] domid
 
 let create xbcon dom =
+	let destination (watch, pkt) =
+		BoundedPipe.push watch.pending_watchevents pkt
+	in
 	let id =
 		match dom with
 		| None -> let old = !anon_id_next in incr anon_id_next; old
@@ -109,6 +213,16 @@ let create xbcon dom =
 	anonid = id;
 	stat_nb_ops = 0;
 	perm = make_perm dom;
+
+	(* the actual capacity will be lower, this is used as an overflow
+	   buffer: anything that doesn't fit elsewhere gets put here, only
+	   limited by the amount of watches that you can generate with a
+	   single xenstore command (which is finite, although possibly very
+	   large in theory for Dom0).  Once the pipe here has any contents the
+	   domain is blocked from sending more commands until it is empty
+	   again though.
+	 *)
+	pending_source_watchevents = BoundedPipe.create ~capacity:Sys.max_array_length ~destination
 	}
 	in
 	Logging.new_connection ~tid:Transaction.none ~con:(get_domstr con);
@@ -127,11 +241,17 @@ let set_target con target_domid =
 
 let is_backend_mmap con = Xenbus.Xb.is_mmap con.xb
 
-let send_reply con tid rid ty data =
+let packet_of con tid rid ty data =
 	if (String.length data) > xenstore_payload_max && (is_backend_mmap con) then
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000")
+		Xenbus.Xb.Packet.create tid rid Xenbus.Xb.Op.Error "E2BIG\000"
 	else
-		Xenbus.Xb.queue con.xb (Xenbus.Xb.Packet.create tid rid ty data)
+		Xenbus.Xb.Packet.create tid rid ty data
+
+let send_reply con tid rid ty data =
+	let result = Xenbus.Xb.queue con.xb (packet_of con tid rid ty data) in
+	(* should never happen: we only process an input packet when there is room for an output packet *)
+	(* and the limit for replies is different from the limit for watch events *)
+	assert (result <> None)
 
 let send_error con tid rid err = send_reply con tid rid Xenbus.Xb.Op.Error (err ^ "\000")
 let send_ack con tid rid ty = send_reply con tid rid ty "OK\000"
@@ -181,11 +301,11 @@ let del_watch con path token =
 	apath, w
 
 let del_watches con =
-  Hashtbl.clear con.watches;
+  Hashtbl.reset con.watches;
   con.nb_watches <- 0
 
 let del_transactions con =
-  Hashtbl.clear con.transactions
+  Hashtbl.reset con.transactions
 
 let list_watches con =
 	let ll = Hashtbl.fold
@@ -208,21 +328,29 @@ let lookup_watch_perm path = function
 let lookup_watch_perms oldroot root path =
 	lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root)
 
-let fire_single_watch_unchecked watch =
+let fire_single_watch_unchecked source watch =
 	let data = Utils.join_by_null [watch.path; watch.token; ""] in
-	send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data
+	let pkt = packet_of watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data in
+
+	match BoundedPipe.push source.pending_source_watchevents (watch, pkt) with
+	| Some () -> () (* packet queued *)
+	| None ->
+			(* a well behaved Dom0 shouldn't be able to trigger this,
+			   if it happens it is likely a Dom0 bug causing runaway memory usage
+			 *)
+			failwith "watch event overflow, cannot happen"
 
-let fire_single_watch (oldroot, root) watch =
+let fire_single_watch source (oldroot, root) watch =
 	let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in
 	let perms = lookup_watch_perms oldroot root abspath in
 	if Perms.can_fire_watch watch.con.perm perms then
-		fire_single_watch_unchecked watch
+		fire_single_watch_unchecked source watch
 	else
 		let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in
 		let con = get_domstr watch.con in
 		Logging.watch_not_fired ~con perms (Store.Path.to_string abspath)
 
-let fire_watch roots watch path =
+let fire_watch source roots watch path =
 	let new_path =
 		if watch.is_relative && path.[0] = '/'
 		then begin
@@ -232,7 +360,7 @@ let fire_watch roots watch path =
 		end else
 			path
 	in
-	fire_single_watch roots { watch with path = new_path }
+	fire_single_watch source roots { watch with path = new_path }
 
 (* Search for a valid unused transaction id. *)
 let rec valid_transaction_id con proposed_id =
@@ -280,6 +408,7 @@ let do_input con = Xenbus.Xb.input con.xb
 let has_partial_input con = Xenbus.Xb.has_partial_input con.xb
 let has_more_input con = Xenbus.Xb.has_more_input con.xb
 
+let can_input con = Xenbus.Xb.can_input con.xb && BoundedPipe.is_empty con.pending_source_watchevents
 let has_output con = Xenbus.Xb.has_output con.xb
 let has_old_output con = Xenbus.Xb.has_old_output con.xb
 let has_new_output con = Xenbus.Xb.has_new_output con.xb
@@ -322,7 +451,7 @@ let prevents_live_update con = not (is_bad con)
 	&& (has_extra_connection_data con || has_transaction_data con)
 
 let has_more_work con =
-	has_more_input con || not (has_old_output con) && has_new_output con
+	(has_more_input con && can_input con) || not (has_old_output con) && has_new_output con
 
 let incr_ops con = con.stat_nb_ops <- con.stat_nb_ops + 1
 
diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml
index 3c7429fe7f61..7d68c583b43a 100644
--- a/tools/ocaml/xenstored/connections.ml
+++ b/tools/ocaml/xenstored/connections.ml
@@ -22,22 +22,30 @@ type t = {
 	domains: (int, Connection.t) Hashtbl.t;
 	ports: (Xeneventchn.t, Connection.t) Hashtbl.t;
 	mutable watches: Connection.watch list Trie.t;
+	mutable has_pending_watchevents: Connection.Watch.Set.t
 }
 
 let create () = {
 	anonymous = Hashtbl.create 37;
 	domains = Hashtbl.create 37;
 	ports = Hashtbl.create 37;
-	watches = Trie.create ()
+	watches = Trie.create ();
+	has_pending_watchevents = Connection.Watch.Set.empty;
 }
 
+let get_capacity () =
+	(* not multiplied by maxwatch on purpose: 2nd queue in watch itself! *)
+	{ Xenbus.Xb.maxoutstanding = !Define.maxoutstanding; maxwatchevents = !Define.maxwatchevents }
+
 let add_anonymous cons fd =
-	let xbcon = Xenbus.Xb.open_fd fd in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_fd fd ~capacity in
 	let con = Connection.create xbcon None in
 	Hashtbl.add cons.anonymous (Xenbus.Xb.get_fd xbcon) con
 
 let add_domain cons dom =
-	let xbcon = Xenbus.Xb.open_mmap (Domain.get_interface dom) (fun () -> Domain.notify dom) in
+	let capacity = get_capacity () in
+	let xbcon = Xenbus.Xb.open_mmap ~capacity (Domain.get_interface dom) (fun () -> Domain.notify dom) in
 	let con = Connection.create xbcon (Some dom) in
 	Hashtbl.add cons.domains (Domain.get_id dom) con;
 	match Domain.get_port dom with
@@ -48,7 +56,9 @@ let select ?(only_if = (fun _ -> true)) cons =
 	Hashtbl.fold (fun _ con (ins, outs) ->
 		if (only_if con) then (
 			let fd = Connection.get_fd con in
-			(fd :: ins,  if Connection.has_output con then fd :: outs else outs)
+			let in_fds = if Connection.can_input con then fd :: ins else ins in
+			let out_fds = if Connection.has_output con then fd :: outs else outs in
+			in_fds, out_fds
 		) else (ins, outs)
 	)
 	cons.anonymous ([], [])
@@ -67,10 +77,17 @@ let del_watches_of_con con watches =
 	| [] -> None
 	| ws -> Some ws
 
+let del_watches cons con =
+	Connection.del_watches con;
+	cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter @@ fun w ->
+		Connection.get_con w != con
+
 let del_anonymous cons con =
 	try
 		Hashtbl.remove cons.anonymous (Connection.get_fd con);
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del anonymous %s" (Printexc.to_string exn)
@@ -85,7 +102,7 @@ let del_domain cons id =
 		    | Some p -> Hashtbl.remove cons.ports p
 		    | None -> ())
 		 | None -> ());
-		cons.watches <- Trie.map (del_watches_of_con con) cons.watches;
+		del_watches cons con;
 		Connection.close con
 	with exn ->
 		debug "del domain %u: %s" id (Printexc.to_string exn)
@@ -136,31 +153,33 @@ let del_watch cons con path token =
 		cons.watches <- Trie.set cons.watches key watches;
  	watch
 
-let del_watches cons con =
-	Connection.del_watches con;
-	cons.watches <- Trie.map (del_watches_of_con con) cons.watches
-
 (* path is absolute *)
-let fire_watches ?oldroot root cons path recurse =
+let fire_watches ?oldroot source root cons path recurse =
 	let key = key_of_path path in
 	let path = Store.Path.to_string path in
 	let roots = oldroot, root in
 	let fire_watch _ = function
 		| None         -> ()
-		| Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches
+		| Some watches -> List.iter (fun w -> Connection.fire_watch source roots w path) watches
 	in
 	let fire_rec _x = function
 		| None         -> ()
 		| Some watches ->
-			List.iter (Connection.fire_single_watch roots) watches
+			List.iter (Connection.fire_single_watch source roots) watches
 	in
 	Trie.iter_path fire_watch cons.watches key;
 	if recurse then
 		Trie.iter fire_rec (Trie.sub cons.watches key)
 
+let send_watchevents cons con =
+	cons.has_pending_watchevents <-
+		cons.has_pending_watchevents |> Connection.Watch.Set.filter Connection.Watch.flush_events;
+	Connection.source_flush_watchevents con
+
 let fire_spec_watches root cons specpath =
+	let source = find_domain cons 0 in
 	iter cons (fun con ->
-		List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath))
+		List.iter (Connection.fire_single_watch source (None, root)) (Connection.get_watches con specpath))
 
 let set_target cons domain target_domain =
 	let con = find_domain cons domain in
@@ -197,6 +216,16 @@ let debug cons =
 	let domains = Hashtbl.fold (fun _ con accu -> Connection.debug con :: accu) cons.domains [] in
 	String.concat "" (domains @ anonymous)
 
+let debug_watchevents cons con =
+	(* == (physical equality)
+	   has to be used here because w.con.xb.backend might contain a [unit->unit] value causing regular
+	   comparison to fail due to having a 'functional value' which cannot be compared.
+	 *)
+	let s = cons.has_pending_watchevents |> Connection.Watch.Set.filter (fun w -> w.con == con) in
+	let pending = s |> Connection.Watch.Set.elements
+		|> List.map (fun w -> Connection.Watch.pending_watchevents w) |> List.fold_left (+) 0 in
+	Printf.sprintf "Watches with pending events: %d, pending events total: %d" (Connection.Watch.Set.cardinal s) pending
+
 let filter ~f cons =
 	let fold _ v acc = if f v then v :: acc else acc in
 	[]
diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml
index ba63a8147e09..327b6d795ec7 100644
--- a/tools/ocaml/xenstored/define.ml
+++ b/tools/ocaml/xenstored/define.ml
@@ -24,6 +24,13 @@ let default_config_dir = Paths.xen_config_dir
 let maxwatch = ref (100)
 let maxtransaction = ref (10)
 let maxrequests = ref (1024)   (* maximum requests per transaction *)
+let maxoutstanding = ref (1024) (* maximum outstanding requests, i.e. in-flight requests / domain *)
+let maxwatchevents = ref (1024)
+(*
+	maximum outstanding watch events per watch,
+	recommended >= maxoutstanding to avoid blocking backend transactions due to
+	malicious frontends
+ *)
 
 let gc_max_overhead = ref 120 (* 120% see comment in xenstored.ml *)
 let conflict_burst_limit = ref 5.0
diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in
index 4ae48e42d47d..9d034e744b4b 100644
--- a/tools/ocaml/xenstored/oxenstored.conf.in
+++ b/tools/ocaml/xenstored/oxenstored.conf.in
@@ -62,6 +62,8 @@ quota-maxwatch = 100
 quota-transaction = 10
 quota-maxrequests = 1024
 quota-path-max = 1024
+quota-maxoutstanding = 1024
+quota-maxwatchevents = 1024
 
 # Activate filed base backend
 persistent = false
diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml
index 69a96f2da8e9..5f439fe59f47 100644
--- a/tools/ocaml/xenstored/process.ml
+++ b/tools/ocaml/xenstored/process.ml
@@ -57,7 +57,7 @@ let split_one_path data con =
 	| path :: "" :: [] -> Store.Path.create path (Connection.get_path con)
 	| _                -> raise Invalid_Cmd_Args
 
-let process_watch t cons =
+let process_watch source t cons =
 	let oldroot = t.Transaction.oldroot in
 	let newroot = Store.get_root t.Transaction.store in
 	let ops = Transaction.get_paths t |> List.rev in
@@ -67,8 +67,9 @@ let process_watch t cons =
 		| Xenbus.Xb.Op.Rm       -> true, None, oldroot
 		| Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot
 		| _              -> raise (Failure "huh ?") in
-		Connections.fire_watches ?oldroot root cons (snd op) recurse in
-	List.iter (fun op -> do_op_watch op cons) ops
+		Connections.fire_watches ?oldroot source root cons (snd op) recurse in
+	List.iter (fun op -> do_op_watch op cons) ops;
+	Connections.send_watchevents cons source
 
 let create_implicit_path t perm path =
 	let dirname = Store.Path.get_parent path in
@@ -234,6 +235,20 @@ let do_debug con t _domains cons data =
 	| "watches" :: _ ->
 		let watches = Connections.debug cons in
 		Some (watches ^ "\000")
+	| "xenbus" :: domid :: _ ->
+		let domid = int_of_string domid in
+		let con = Connections.find_domain cons domid in
+		let s = Printf.sprintf "xenbus: %s; overflow queue length: %d, can_input: %b, has_more_input: %b, has_old_output: %b, has_new_output: %b, has_more_work: %b. pending: %s"
+			(Xenbus.Xb.debug con.xb)
+			(Connection.source_pending_watchevents con)
+			(Connection.can_input con)
+			(Connection.has_more_input con)
+			(Connection.has_old_output con)
+			(Connection.has_new_output con)
+			(Connection.has_more_work con)
+			(Connections.debug_watchevents cons con)
+		in
+		Some s
 	| "mfn" :: domid :: _ ->
 		let domid = int_of_string domid in
 		let con = Connections.find_domain cons domid in
@@ -342,7 +357,7 @@ let reply_ack fct con t doms cons data =
 	fct con t doms cons data;
 	Packet.Ack (fun () ->
 		if Transaction.get_id t = Transaction.none then
-			process_watch t cons
+			process_watch con t cons
 	)
 
 let reply_data fct con t doms cons data =
@@ -501,7 +516,7 @@ let do_watch con _t _domains cons data =
 	Packet.Ack (fun () ->
 		(* xenstore.txt says this watch is fired immediately,
 		   implying even if path doesn't exist or is unreadable *)
-		Connection.fire_single_watch_unchecked watch)
+		Connection.fire_single_watch_unchecked con watch)
 
 let do_unwatch con _t _domains cons data =
 	let (node, token) =
@@ -532,7 +547,7 @@ let do_transaction_end con t domains cons data =
 	if not success then
 		raise Transaction_again;
 	if commit then begin
-		process_watch t cons;
+		process_watch con t cons;
 		match t.Transaction.ty with
 		| Transaction.No ->
 			() (* no need to record anything *)
@@ -700,7 +715,8 @@ let process_packet ~store ~cons ~doms ~con ~req =
 let do_input store cons doms con =
 	let newpacket =
 		try
-			Connection.do_input con
+			if Connection.can_input con then Connection.do_input con
+			else None
 		with Xenbus.Xb.Reconnect ->
 			info "%s requests a reconnect" (Connection.get_domstr con);
 			History.reconnect con;
@@ -728,6 +744,7 @@ let do_input store cons doms con =
 		Connection.incr_ops con
 
 let do_output _store _cons _doms con =
+	Connection.source_flush_watchevents con;
 	if Connection.has_output con then (
 		if Connection.has_new_output con then (
 			let packet = Connection.peek_output con in
diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml
index 3b57ad016dfb..c799e20f1145 100644
--- a/tools/ocaml/xenstored/xenstored.ml
+++ b/tools/ocaml/xenstored/xenstored.ml
@@ -103,6 +103,8 @@ let parse_config filename =
 		("quota-maxentity", Config.Set_int Quota.maxent);
 		("quota-maxsize", Config.Set_int Quota.maxsize);
 		("quota-maxrequests", Config.Set_int Define.maxrequests);
+		("quota-maxoutstanding", Config.Set_int Define.maxoutstanding);
+		("quota-maxwatchevents", Config.Set_int Define.maxwatchevents);
 		("quota-path-max", Config.Set_int Define.path_max);
 		("gc-max-overhead", Config.Set_int Define.gc_max_overhead);
 		("test-eagain", Config.Set_bool Transaction.test_eagain);
From a773ccb663f26db829c8126d92cd7e038dcd895c Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Thu, 29 Sep 2022 13:07:35 +0200
Subject: SUPPORT.md: clarify support of untrusted driver domains with
 oxenstored

Add a support statement describing the scope of support for the
different Xenstore variants. In particular, oxenstored does not (yet)
have security support for untrusted driver domains, as those might
drive oxenstored out of memory by creating lots of watch events for
the guests they are servicing.

Add a statement regarding Live Update support of oxenstored.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index cf2ddfacaf09..ab71464cf672 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -193,13 +193,18 @@ Support for running qemu-xen device model in a linux stubdomain.
 
     Status: Tech Preview
 
-## Liveupdate of C xenstored daemon
+## Xenstore
 
-    Status: Tech Preview
+### C xenstored daemon
 
-## Liveupdate of OCaml xenstored daemon
+    Status: Supported
+    Status, Liveupdate: Tech Preview
 
-    Status: Tech Preview
+### OCaml xenstored daemon
+
+    Status: Supported
+    Status, untrusted driver domains: Supported, not security supported
+    Status, Liveupdate: Not functional
 
 ## Toolstack/3rd party
 
From 2c2a703f6b40a7d8ffde3e4799e6d18c438d3007 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: split up send_reply()

Today send_reply() is used for both normal request replies and watch
events.

Split it up into send_reply() and send_event(). This will be used to
add some event specific handling.

add_event() can be merged into send_event(), removing the need for an
intermediate memory allocation.
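
The watch event body built by send_event() is the path and the token,
each terminated by a NUL byte. A minimal standalone sketch of that
layout follows; build_event_payload() and PAYLOAD_MAX are hypothetical
names used only for illustration (PAYLOAD_MAX stands in for
XENSTORE_PAYLOAD_MAX and its value here is an assumption), and malloc()
replaces the talloc allocation used by xenstored:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for XENSTORE_PAYLOAD_MAX; the value is assumed. */
#define PAYLOAD_MAX 4096

/*
 * Build the body of a watch event: "<path>\0<token>\0".
 * Returns a malloc()ed buffer and stores its length in *len_out, or
 * returns NULL if the event would be over-long or allocation fails.
 */
char *build_event_payload(const char *path, const char *token,
			  size_t *len_out)
{
	size_t path_len, token_len, len;
	char *buf;

	assert(path && token && len_out);

	path_len = strlen(path) + 1;	/* include the terminating NUL */
	token_len = strlen(token) + 1;	/* include the terminating NUL */
	len = path_len + token_len;

	/* Same policy as send_event(): silently drop over-long events. */
	if (len > PAYLOAD_MAX)
		return NULL;

	buf = malloc(len);
	if (!buf)
		return NULL;

	memcpy(buf, path, path_len);
	memcpy(buf + path_len, token, token_len);
	*len_out = len;

	return buf;
}
```

Writing both strings straight into the final buffer is what removes the
intermediate allocation mentioned above.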

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 55b79e4c032e..ed742d9dfc2e 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -763,49 +763,32 @@ static void send_error(struct connection *conn, int error)
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata = conn->in;
+
+	assert(type != XS_WATCH_EVENT);
 
 	if ( len > XENSTORE_PAYLOAD_MAX ) {
 		send_error(conn, E2BIG);
 		return;
 	}
 
-	/* Replies reuse the request buffer, events need a new one. */
-	if (type != XS_WATCH_EVENT) {
-		bdata = conn->in;
-		/* Drop asynchronous responses, e.g. errors for watch events. */
-		if (!bdata)
-			return;
-		bdata->inhdr = true;
-		bdata->used = 0;
-		conn->in = NULL;
-	} else {
-		/* Message is a child of the connection for auto-cleanup. */
-		bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+	bdata->inhdr = true;
+	bdata->used = 0;
 
-		/*
-		 * Allocation failure here is unfortunate: we have no way to
-		 * tell anybody about it.
-		 */
-		if (!bdata)
-			return;
-	}
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
-	else
+	else {
 		bdata->buffer = talloc_array(bdata, char, len);
-	if (!bdata->buffer) {
-		if (type == XS_WATCH_EVENT) {
-			/* Same as above: no way to tell someone. */
-			talloc_free(bdata);
+		if (!bdata->buffer) {
+			send_error(conn, ENOMEM);
 			return;
 		}
-		/* re-establish request buffer for sending ENOMEM. */
-		conn->in = bdata;
-		send_error(conn, ENOMEM);
-		return;
 	}
 
+	conn->in = NULL;
+
 	/* Update relevant header fields and fill in the message body. */
 	bdata->hdr.msg.type = type;
 	bdata->hdr.msg.len = len;
@@ -813,8 +796,39 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+}
 
-	return;
+/*
+ * Send a watch event.
+ * As this is not directly related to the current command, errors can't be
+ * reported.
+ */
+void send_event(struct connection *conn, const char *path, const char *token)
+{
+	struct buffered_data *bdata;
+	unsigned int len;
+
+	len = strlen(path) + 1 + strlen(token) + 1;
+	/* Don't try to send over-long events. */
+	if (len > XENSTORE_PAYLOAD_MAX)
+		return;
+
+	bdata = new_buffer(conn);
+	if (!bdata)
+		return;
+
+	bdata->buffer = talloc_array(bdata, char, len);
+	if (!bdata->buffer) {
+		talloc_free(bdata);
+		return;
+	}
+	strcpy(bdata->buffer, path);
+	strcpy(bdata->buffer + strlen(path) + 1, token);
+	bdata->hdr.msg.type = XS_WATCH_EVENT;
+	bdata->hdr.msg.len = len;
+
+	/* Queue for later transmission. */
+	list_add_tail(&bdata->list, &conn->out_list);
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 7d0fe77e7989..99a0373944b2 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -187,6 +187,7 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
+void send_event(struct connection *conn, const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index aca0a71bada1..99a2c266b28a 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -86,35 +86,6 @@ static const char *get_watch_path(const struct watch *watch, const char *name)
 }
 
 /*
- * Send a watch event.
- * Temporary memory allocations are done with ctx.
- */
-static void add_event(struct connection *conn,
-		      const void *ctx,
-		      struct watch *watch,
-		      const char *name)
-{
-	/* Data to send (node\0token\0). */
-	unsigned int len;
-	char *data;
-
-	name = get_watch_path(watch, name);
-
-	len = strlen(name) + 1 + strlen(watch->token) + 1;
-	/* Don't try to send over-long events. */
-	if (len > XENSTORE_PAYLOAD_MAX)
-		return;
-
-	data = talloc_array(ctx, char, len);
-	if (!data)
-		return;
-	strcpy(data, name);
-	strcpy(data + strlen(name) + 1, watch->token);
-	send_reply(conn, XS_WATCH_EVENT, data, len);
-	talloc_free(data);
-}
-
-/*
  * Check permissions of a specific watch to fire:
  * Either the node itself or its parent have to be readable by the connection
  * the watch has been setup for. In case a watch event is created due to
@@ -190,10 +161,14 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					add_event(i, ctx, watch, name);
+					send_event(i,
+						   get_watch_path(watch, name),
+						   watch->token);
 			}
 		}
 	}
@@ -292,7 +267,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	send_ack(conn, XS_WATCH);
 
 	/* We fire once up front: simplifies clients and restart. */
-	add_event(conn, in, watch, watch->node);
+	send_event(conn, get_watch_path(watch, watch->node), watch->token);
 
 	return 0;
 }
From df0c107fbf65f61bb1d31c9a34ecee05f38526a7 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: add helpers to free struct buffered_data

Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.

This avoids duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.
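
The pattern can be sketched outside of xenstored as follows. This is a
simplified illustration only: struct msg, struct conn and the function
names are hypothetical, a singly linked list replaces the list.h
helpers, and malloc()/free() replace talloc. The point is that all
per-message bookkeeping lives in one helper, and the "free everything"
helper is just a loop around it:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for buffered_data and connection. */
struct msg {
	struct msg *next;
};

struct conn {
	struct msg *out_head;
	unsigned int out_count;	/* example of per-message bookkeeping */
};

/* Queue one message; returns 0 on success, -1 on allocation failure. */
int conn_queue_msg(struct conn *c)
{
	struct msg *m = malloc(sizeof(*m));

	if (!m)
		return -1;
	m->next = c->out_head;
	c->out_head = m;
	c->out_count++;

	return 0;
}

/* The single place that unlinks, accounts for and frees one message. */
void free_first_msg(struct conn *c)
{
	struct msg *m = c->out_head;

	assert(m != NULL);
	c->out_head = m->next;
	c->out_count--;
	free(m);
}

/* Free all queued messages of a connection via the helper above. */
void conn_free_all_msgs(struct conn *c)
{
	while (c->out_head)
		free_first_msg(c);
}
```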

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index ed742d9dfc2e..61fc368e8c28 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -207,6 +207,21 @@ void reopen_log(void)
 	}
 }
 
+static void free_buffered_data(struct buffered_data *out,
+			       struct connection *conn)
+{
+	list_del(&out->list);
+	talloc_free(out);
+}
+
+void conn_free_buffered_data(struct connection *conn)
+{
+	struct buffered_data *out;
+
+	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
+		free_buffered_data(out, conn);
+}
+
 static bool write_messages(struct connection *conn)
 {
 	int ret;
@@ -250,8 +265,7 @@ static bool write_messages(struct connection *conn)
 
 	trace_io(conn, out, 1);
 
-	list_del(&out->list);
-	talloc_free(out);
+	free_buffered_data(out, conn);
 
 	return true;
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 99a0373944b2..c9ea796185e8 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -271,6 +271,8 @@ int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
 
+void conn_free_buffered_data(struct connection *conn);
+
 const char *dump_state_global(FILE *fp);
 const char *dump_state_buffered_data(FILE *fp, const struct connection *c,
 				     struct xs_state_connection *sc);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index ead4c237d233..de349e2a77a5 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -411,15 +411,10 @@ static struct domain *find_domain_by_domid(unsigned int domid)
 static void domain_conn_reset(struct domain *domain)
 {
 	struct connection *conn = domain->conn;
-	struct buffered_data *out;
 
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	while ((out = list_top(&conn->out_list, struct buffered_data, list))) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 
@@ -436,8 +431,6 @@ static void domain_conn_reset(struct domain *domain)
  */
 void ignore_connection(struct connection *conn, unsigned int err)
 {
-	struct buffered_data *out, *tmp;
-
 	trace("CONN %p ignored, reason %u\n", conn, err);
 
 	if (conn->domain && conn->domain->interface)
@@ -446,11 +439,7 @@ void ignore_connection(struct connection *conn, unsigned int err)
 	conn->is_ignored = true;
 	conn_delete_all_watches(conn);
 	conn_delete_all_transactions(conn);
-
-	list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
-		list_del(&out->list);
-		talloc_free(out);
-	}
+	conn_free_buffered_data(conn);
 
 	talloc_free(conn->in);
 	conn->in = NULL;
From 16a8a1854bc8b335ffb82786a9dd90ad268aa558 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: reduce number of watch events

When removing a watched node outside of a transaction, two watch events
are produced instead of just a single one.

When finalizing a transaction, watch events can be generated for each
node which has been modified, even if such modifications would not have
resulted in a watch event outside a transaction.

This happens e.g.:

- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
  is implicitly creating a/b, resulting in watch events for a, a/b and
  a/b/c instead of a/b/c only)

Avoid these additional watch events in order to reduce the memory
needed inside Xenstore for queueing them.

This is achieved by adding event flags to struct accessed_node that
specify whether an event should be triggered and whether it should be
an exact match of the modified path. Both flags are now set explicitly
from fire_watches() instead of being implied.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 61fc368e8c28..b9a0ff5e05cf 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -1291,7 +1291,7 @@ static void delete_child(struct connection *conn,
 }
 
 static int delete_node(struct connection *conn, const void *ctx,
-		       struct node *parent, struct node *node)
+		       struct node *parent, struct node *node, bool watch_exact)
 {
 	char *name;
 
@@ -1303,7 +1303,7 @@ static int delete_node(struct connection *conn, const void *ctx,
 				       node->children);
 		child = name ? read_node(conn, node, name) : NULL;
 		if (child) {
-			if (delete_node(conn, ctx, node, child))
+			if (delete_node(conn, ctx, node, child, true))
 				return errno;
 		} else {
 			trace("delete_node: Error deleting child '%s/%s'!\n",
@@ -1315,7 +1315,12 @@ static int delete_node(struct connection *conn, const void *ctx,
 		talloc_free(name);
 	}
 
-	fire_watches(conn, ctx, node->name, node, true, NULL);
+	/*
+	 * Fire the watches now, when we can still see the node permissions.
+	 * This is fine as we are single threaded and the next possible read
+	 * will be handled only after the node has been really removed.
+	 */
+	fire_watches(conn, ctx, node->name, node, watch_exact, NULL);
 	delete_node_single(conn, node);
 	delete_child(conn, parent, basename(node->name));
 	talloc_free(node);
@@ -1341,13 +1346,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 		return (errno == ENOMEM) ? ENOMEM : EINVAL;
 	node->parent = parent;
 
-	/*
-	 * Fire the watches now, when we can still see the node permissions.
-	 * This fine as we are single threaded and the next possible read will
-	 * be handled only after the node has been really removed.
-	 */
-	fire_watches(conn, ctx, name, node, false, NULL);
-	return delete_node(conn, ctx, parent, node);
+	return delete_node(conn, ctx, parent, node, false);
 }
 
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index faf6c930e42a..54432907fc76 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -130,6 +130,10 @@ struct accessed_node
 
 	/* Transaction node in data base? */
 	bool ta_node;
+
+	/* Watch event flags. */
+	bool fire_watch;
+	bool watch_exact;
 };
 
 struct changed_domain
@@ -324,6 +328,29 @@ int access_node(struct connection *conn, struct node *node,
 }
 
 /*
+ * A watch event should be fired for a node modified inside a transaction.
+ * Set the corresponding information. A non-exact event is replacing an exact
+ * one, but not the other way round.
+ */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact)
+{
+	struct accessed_node *i;
+
+	i = find_accessed_node(conn->transaction, name);
+	if (!i) {
+		conn->transaction->fail = true;
+		return;
+	}
+
+	if (!i->fire_watch) {
+		i->fire_watch = true;
+		i->watch_exact = watch_exact;
+	} else if (!watch_exact) {
+		i->watch_exact = false;
+	}
+}
+
+/*
  * Finalize transaction:
  * Walk through accessed nodes and check generation against global data.
  * If all entries match, read the transaction entries and write them without
@@ -377,15 +404,15 @@ static int finalize_transaction(struct connection *conn,
 				ret = tdb_store(tdb_ctx, key, data,
 						TDB_REPLACE);
 				talloc_free(data.dptr);
-				if (ret)
-					goto err;
-				fire_watches(conn, trans, i->node, NULL, false,
-					     i->perms.p ? &i->perms : NULL);
 			} else {
-				fire_watches(conn, trans, i->node, NULL, false,
+				ret = tdb_delete(tdb_ctx, key);
+			}
+			if (ret)
+				goto err;
+			if (i->fire_watch) {
+				fire_watches(conn, trans, i->node, NULL,
+					     i->watch_exact,
 					     i->perms.p ? &i->perms : NULL);
-				if (tdb_delete(tdb_ctx, key))
-					goto err;
 			}
 		}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 14062730e3c9..0093cac807e3 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -42,6 +42,9 @@ void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 int access_node(struct connection *conn, struct node *node,
                 enum node_access_type type, TDB_DATA *key);
 
+/* Queue watches for a modified node. */
+void queue_watches(struct connection *conn, const char *name, bool watch_exact);
+
 /* Prepend the transaction to name if appropriate. */
 int transaction_prepend(struct connection *conn, const char *name,
                         TDB_DATA *key);
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 99a2c266b28a..205d9d8ea116 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -29,6 +29,7 @@
 #include "xenstore_lib.h"
 #include "utils.h"
 #include "xenstored_domain.h"
+#include "xenstored_transaction.h"
 
 extern int quota_nb_watch_per_domain;
 
@@ -143,9 +144,11 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 	struct connection *i;
 	struct watch *watch;
 
-	/* During transactions, don't fire watches. */
-	if (conn && conn->transaction)
+	/* During transactions, don't fire watches, but queue them. */
+	if (conn && conn->transaction) {
+		queue_watches(conn, name, exact);
 		return;
+	}
 
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
From 508f58a92597e2ca727752ebdf9adba59cf3fb23 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:07 +0200
Subject: tools/xenstore: let unread watch events time out

A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response to the
request or any resulting watch event hasn't been consumed yet.

To prevent a malicious guest from blocking other guests by not reading
watch events, add a timeout for watch events. If a watch event hasn't
been consumed when this timeout expires, it is deleted. Set the default
timeout to 20 seconds (an arbitrary value that is not too high).

To support specifying other timeout values in the future, use a
generic command line option for that purpose:

--timeout|-w watch-event=<seconds>

This is part of XSA-326 / CVE-2022-42311.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index b9a0ff5e05cf..cce02f24b51c 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -108,6 +108,8 @@ int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 
+unsigned int timeout_watch_event_msec = 20000;
+
 void trace(const char *fmt, ...)
 {
 	va_list arglist;
@@ -207,19 +209,92 @@ void reopen_log(void)
 	}
 }
 
+static uint64_t get_now_msec(void)
+{
+	struct timespec now_ts;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now_ts))
+		barf_perror("Could not find time (clock_gettime failed)");
+
+	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
+}
+
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
+	struct buffered_data *req;
+
 	list_del(&out->list);
+
+	/*
+	 * Update conn->timeout_msec with the next found timeout value in the
+	 * queued pending requests.
+	 */
+	if (out->timeout_msec) {
+		conn->timeout_msec = 0;
+		list_for_each_entry(req, &conn->out_list, list) {
+			if (req->timeout_msec) {
+				conn->timeout_msec = req->timeout_msec;
+				break;
+			}
+		}
+	}
+
 	talloc_free(out);
 }
 
+static void check_event_timeout(struct connection *conn, uint64_t msecs,
+				int *ptimeout)
+{
+	uint64_t delta;
+	struct buffered_data *out, *tmp;
+
+	if (!conn->timeout_msec)
+		return;
+
+	delta = conn->timeout_msec - msecs;
+	if (conn->timeout_msec <= msecs) {
+		delta = 0;
+		list_for_each_entry_safe(out, tmp, &conn->out_list, list) {
+			/*
+			 * Only look at buffers with timeout and no data
+			 * already written to the ring.
+			 */
+			if (out->timeout_msec && out->inhdr && !out->used) {
+				if (out->timeout_msec > msecs) {
+					conn->timeout_msec = out->timeout_msec;
+					delta = conn->timeout_msec - msecs;
+					break;
+				}
+
+				/*
+				 * Free out without updating conn->timeout_msec,
+				 * as the update is done in this loop already.
+				 */
+				out->timeout_msec = 0;
+				trace("watch event path %s for domain %u timed out\n",
+				      out->buffer, conn->id);
+				free_buffered_data(out, conn);
+			}
+		}
+		if (!delta) {
+			conn->timeout_msec = 0;
+			return;
+		}
+	}
+
+	if (*ptimeout == -1 || *ptimeout > delta)
+		*ptimeout = delta;
+}
+
 void conn_free_buffered_data(struct connection *conn)
 {
 	struct buffered_data *out;
 
 	while ((out = list_top(&conn->out_list, struct buffered_data, list)))
 		free_buffered_data(out, conn);
+
+	conn->timeout_msec = 0;
 }
 
 static bool write_messages(struct connection *conn)
@@ -407,6 +482,7 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 {
 	struct connection *conn;
 	struct wrl_timestampt now;
+	uint64_t msecs;
 
 	if (fds)
 		memset(fds, 0, sizeof(struct pollfd) * current_array_size);
@@ -427,10 +503,12 @@ static void initialize_fds(int *p_sock_pollfd_idx, int *ptimeout)
 
 	wrl_gettime_now(&now);
 	wrl_log_periodic(now);
+	msecs = get_now_msec();
 
 	list_for_each_entry(conn, &connections, list) {
 		if (conn->domain) {
 			wrl_check_timeout(conn->domain, now, ptimeout);
+			check_event_timeout(conn, msecs, ptimeout);
 			if (conn_can_read(conn) ||
 			    (conn_can_write(conn) &&
 			     !list_empty(&conn->out_list)))
@@ -790,6 +868,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		return;
 	bdata->inhdr = true;
 	bdata->used = 0;
+	bdata->timeout_msec = 0;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -841,6 +920,12 @@ void send_event(struct connection *conn, const char *path, const char *token)
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
 }
@@ -2185,6 +2270,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
+"                          allowed timeout candidates are:\n"
+"                          watch-event: time a watch-event is kept pending\n"
 "  -R, --no-recovery       to request that no recovery should be attempted when\n"
 "                          the store is corrupted (debug only),\n"
 "  -I, --internal-db       store database in memory, not on disk\n"
@@ -2207,6 +2295,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
 	{ "verbose", 0, NULL, 'V' },
@@ -2220,6 +2309,39 @@ int dom0_domid = 0;
 int dom0_event = 0;
 int priv_domid = 0;
 
+static int get_optval_int(const char *arg)
+{
+	char *end;
+	long val;
+
+	val = strtol(arg, &end, 10);
+	if (!*arg || *end || val < 0 || val > INT_MAX)
+		barf("invalid parameter value \"%s\"\n", arg);
+
+	return val;
+}
+
+static bool what_matches(const char *arg, const char *what)
+{
+	unsigned int what_len = strlen(what);
+
+	return !strncmp(arg, what, what_len) && arg[what_len] == '=';
+}
+
+static void set_timeout(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("timeouts must be specified via <what>=<seconds>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "watch-event"))
+		timeout_watch_event_msec = val * 1000;
+	else
+		barf("unknown timeout \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2234,7 +2356,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:U", options,
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
 				  NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2284,6 +2406,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'w':
+			set_timeout(optarg);
+			break;
 		case 'e':
 			dom0_event = strtol(optarg, NULL, 10);
 			break;
@@ -2714,6 +2839,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 		barf("error restoring buffered data");
 
 	memcpy(bdata->buffer, data, len);
+	if (bdata->hdr.msg.type == XS_WATCH_EVENT && timeout_watch_event_msec &&
+	    domain_is_unprivileged(conn)) {
+		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
+		if (!conn->timeout_msec)
+			conn->timeout_msec = bdata->timeout_msec;
+	}
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index c9ea796185e8..745262af96fd 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -27,6 +27,7 @@
 #include <fcntl.h>
 #include <stdbool.h>
 #include <stdint.h>
+#include <time.h>
 #include <errno.h>
 
 #include "xenstore_lib.h"
@@ -67,6 +68,8 @@ struct buffered_data
 		char raw[sizeof(struct xsd_sockmsg)];
 	} hdr;
 
+	uint64_t timeout_msec;
+
 	/* The actual data. */
 	char *buffer;
 	char default_buffer[DEFAULT_BUFFER_SIZE];
@@ -118,6 +121,7 @@ struct connection
 
 	/* Buffered output data */
 	struct list_head out_list;
+	uint64_t timeout_msec;
 
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
@@ -242,6 +246,8 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 
+extern unsigned int timeout_watch_event_msec;
+
 /* Map the kernel's xenstore page. */
 void *xenbus_map(void);
 void unmap_xenbus(void *interface);
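
The watch event timeout handling of the patch above can be illustrated
in isolation. The following is a minimal sketch, not the xenstored code:
struct event, expire_events() and the array based queue are hypothetical
stand-ins for struct buffered_data, check_event_timeout() and
conn->out_list. It assumes that deadlines are queued in non-decreasing
order and that a deadline of 0 means "no timeout".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued message; deadline 0 means "never". */
struct event {
	uint64_t deadline_msec;
	int dropped;
};

/*
 * Drop all events whose deadline has passed and return the poll timeout in
 * milliseconds until the next pending deadline, or -1 if none is pending.
 */
static int expire_events(struct event *ev, size_t n, uint64_t now_msec)
{
	uint64_t delta;
	size_t i;

	assert(ev || !n);

	for (i = 0; i < n; i++) {
		if (ev[i].dropped || !ev[i].deadline_msec)
			continue;
		if (ev[i].deadline_msec > now_msec) {
			/* Clamp the result, as poll() takes an int timeout. */
			delta = ev[i].deadline_msec - now_msec;
			return delta > INT32_MAX ? INT32_MAX : (int)delta;
		}
		ev[i].dropped = 1;
	}

	return -1;
}
```

As in the patch, the scan stops at the first event that has not yet
expired, so the returned value is only the earliest deadline if the
queue is ordered by time.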
From 8273d5019234cba308d7fc96470069544949f60d Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: limit outstanding requests

Add another quota for limiting the number of outstanding requests of a
guest. As the way to specify quotas on the command line is becoming
rather unwieldy, switch to a new scheme using [--quota|-Q] <what>=<val>,
which makes it easy to add more quotas in the future.

Set the default value to 20 (a somewhat arbitrary value that seems to
be neither too high nor too low).

A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.

Note that across a live update the relation to buffered watch events
for other domains is lost.

Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota counters zeroed initially.

This is part of XSA-326 / CVE-2022-42312.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
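
The accounting rule described above (a request stays outstanding until
its reply and all watch events it caused have been written out) can be
sketched as follows. This is a simplified model under stated
assumptions, not the code of this patch: struct dom, struct request and
all request_*() helpers are hypothetical, and the request is counted
when it starts rather than when its reply is queued.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-domain state; only the outstanding counter is modeled. */
struct dom {
	int nboutstanding;
	int quota;
};

/* Hypothetical request: its reply plus the watch events it caused. */
struct request {
	struct dom *owner;
	unsigned int event_cnt;	/* Events not yet written out. */
	bool reply_pending;	/* Reply not yet written out. */
};

/* New requests are only read while the domain is below its quota. */
static bool dom_can_read(const struct dom *d)
{
	return d->nboutstanding < d->quota;
}

static void request_start(struct request *req, struct dom *d)
{
	req->owner = d;
	req->event_cnt = 0;
	req->reply_pending = true;
	d->nboutstanding++;
}

static void request_add_event(struct request *req)
{
	req->event_cnt++;
}

/* Drop the accounting once nothing generated by the request is pending. */
static void request_check_done(struct request *req)
{
	if (!req->reply_pending && !req->event_cnt)
		req->owner->nboutstanding--;
}

static void request_reply_written(struct request *req)
{
	assert(req->reply_pending);
	req->reply_pending = false;
	request_check_done(req);
}

static void request_event_written(struct request *req)
{
	assert(req->event_cnt > 0);
	req->event_cnt--;
	request_check_done(req);
}
```

With a quota of 1, a domain whose single request has caused a still
undelivered watch event cannot have further requests read, even after
the direct reply has been written.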

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index cce02f24b51c..54e6add1a157 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -107,6 +107,7 @@ int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
+int quota_req_outstanding = 20;
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -219,12 +220,24 @@ static uint64_t get_now_msec(void)
 	return now_ts.tv_sec * 1000 + now_ts.tv_nsec / 1000000;
 }
 
+/*
+ * Remove a struct buffered_data from the list of outgoing data.
+ * A struct buffered_data related to a request having caused watch events to be
+ * sent is kept until all those events have been written out.
+ * Each watch event is referencing the related request via pend.req, while the
+ * number of watch events caused by a request is kept in pend.ref.event_cnt
+ * (those two cases are mutually exclusive, so the two fields can share memory
+ * via a union).
+ * The struct buffered_data is freed only if no related watch event is
+ * referencing it. The related return data can be freed right away.
+ */
 static void free_buffered_data(struct buffered_data *out,
 			       struct connection *conn)
 {
 	struct buffered_data *req;
 
 	list_del(&out->list);
+	out->on_out_list = false;
 
 	/*
 	 * Update conn->timeout_msec with the next found timeout value in the
@@ -240,6 +253,30 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	if (out->hdr.msg.type == XS_WATCH_EVENT) {
+		req = out->pend.req;
+		if (req) {
+			req->pend.ref.event_cnt--;
+			if (!req->pend.ref.event_cnt && !req->on_out_list) {
+				if (req->on_ref_list) {
+					domain_outstanding_domid_dec(
+						req->pend.ref.domid);
+					list_del(&req->list);
+				}
+				talloc_free(req);
+			}
+		}
+	} else if (out->pend.ref.event_cnt) {
+		/* Detach out from conn. */
+		talloc_steal(NULL, out);
+		if (out->buffer != out->default_buffer)
+			talloc_free(out->buffer);
+		list_add(&out->list, &conn->ref_list);
+		out->on_ref_list = true;
+		return;
+	} else
+		domain_outstanding_dec(conn);
+
 	talloc_free(out);
 }
 
@@ -401,6 +438,7 @@ int delay_request(struct connection *conn, struct buffered_data *in,
 static int destroy_conn(void *_conn)
 {
 	struct connection *conn = _conn;
+	struct buffered_data *req;
 
 	/* Flush outgoing if possible, but don't block. */
 	if (!conn->domain) {
@@ -414,6 +452,11 @@ static int destroy_conn(void *_conn)
 				break;
 		close(conn->fd);
 	}
+
+	conn_free_buffered_data(conn);
+	list_for_each_entry(req, &conn->ref_list, list)
+		req->on_ref_list = false;
+
         if (conn->target)
                 talloc_unlink(conn, conn->target);
 	list_del(&conn->list);
@@ -889,6 +932,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	domain_outstanding_inc(conn);
 }
 
 /*
@@ -896,7 +941,8 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
  * As this is not directly related to the current command, errors can't be
  * reported.
  */
-void send_event(struct connection *conn, const char *path, const char *token)
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token)
 {
 	struct buffered_data *bdata;
 	unsigned int len;
@@ -926,8 +972,13 @@ void send_event(struct connection *conn, const char *path, const char *token)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->pend.req = req;
+	if (req)
+		req->pend.ref.event_cnt++;
+
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
 }
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
@@ -1714,6 +1765,7 @@ static void handle_input(struct connection *conn)
 			return;
 	}
 	in = conn->in;
+	in->pend.ref.domid = conn->id;
 
 	/* Not finished header yet? */
 	if (in->inhdr) {
@@ -1787,6 +1839,7 @@ struct connection *new_connection(const struct interface_funcs *funcs)
 	new->is_stalled = false;
 	new->transaction_started = 0;
 	INIT_LIST_HEAD(&new->out_list);
+	INIT_LIST_HEAD(&new->ref_list);
 	INIT_LIST_HEAD(&new->watches);
 	INIT_LIST_HEAD(&new->transaction_list);
 	INIT_LIST_HEAD(&new->delayed);
@@ -2270,6 +2323,9 @@ static void usage(void)
 "  -t, --transaction <nb>  limit the number of transaction allowed per domain,\n"
 "  -A, --perm-nb <nb>      limit the number of permissions per node,\n"
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
+"  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
+"                          quotas are:\n"
+"                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2295,6 +2351,7 @@ static struct option options[] = {
 	{ "transaction", 1, NULL, 't' },
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
+	{ "quota", 1, NULL, 'Q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2342,6 +2399,20 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
+static void set_quota(const char *arg)
+{
+	const char *eq = strchr(arg, '=');
+	int val;
+
+	if (!eq)
+		barf("quotas must be specified via <what>=<nb>\n");
+	val = get_optval_int(eq + 1);
+	if (what_matches(arg, "outstanding"))
+		quota_req_outstanding = val;
+	else
+		barf("unknown quota \"%s\"\n", arg);
+}
+
 int main(int argc, char *argv[])
 {
 	int opt;
@@ -2356,8 +2427,8 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:T:RVW:w:U", options,
-				  NULL)) != -1) {
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
 			no_domain_init = true;
@@ -2406,6 +2477,9 @@ int main(int argc, char *argv[])
 			quota_max_path_len = min(XENSTORE_REL_PATH_MAX,
 						 quota_max_path_len);
 			break;
+		case 'Q':
+			set_quota(optarg);
+			break;
 		case 'w':
 			set_timeout(optarg);
 			break;
@@ -2848,6 +2922,14 @@ static void add_buffered_data(struct buffered_data *bdata,
 
 	/* Queue for later transmission. */
 	list_add_tail(&bdata->list, &conn->out_list);
+	bdata->on_out_list = true;
+	/*
+	 * Watch events are never "outstanding", but the request causing them
+	 * is instead kept "outstanding" until all watch events caused by that
+	 * request have been delivered.
+	 */
+	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
+		domain_outstanding_inc(conn);
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 745262af96fd..acb6b9fe2ac3 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -56,6 +56,8 @@ struct xs_state_connection;
 struct buffered_data
 {
 	struct list_head list;
+	bool on_out_list;
+	bool on_ref_list;
 
 	/* Are we still doing the header? */
 	bool inhdr;
@@ -63,6 +65,17 @@ struct buffered_data
 	/* How far are we? */
 	unsigned int used;
 
+	/* Outstanding request accounting. */
+	union {
+		/* ref is being used for requests. */
+		struct {
+			unsigned int event_cnt; /* # of outstanding events. */
+			unsigned int domid;     /* domid of request. */
+		} ref;
+		/* req is being used for watch events. */
+		struct buffered_data *req;      /* request causing event. */
+	} pend;
+
 	union {
 		struct xsd_sockmsg msg;
 		char raw[sizeof(struct xsd_sockmsg)];
@@ -123,6 +136,9 @@ struct connection
 	struct list_head out_list;
 	uint64_t timeout_msec;
 
+	/* Referenced requests no longer pending. */
+	struct list_head ref_list;
+
 	/* Transaction context for current request (NULL if none). */
 	struct transaction *transaction;
 
@@ -191,7 +207,8 @@ unsigned int get_string(const struct buffered_data *data, unsigned int offset);
 
 void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 		const void *data, unsigned int len);
-void send_event(struct connection *conn, const char *path, const char *token);
+void send_event(struct buffered_data *req, struct connection *conn,
+		const char *path, const char *token);
 
 /* Some routines (write, mkdir, etc) just need a non-error return */
 void send_ack(struct connection *conn, enum xsd_sockmsg_type type);
@@ -245,6 +262,7 @@ extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
+extern int quota_req_outstanding;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index de349e2a77a5..c0a37712f89b 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -78,6 +78,9 @@ struct domain
 	/* number of watch for this domain */
 	int nbwatch;
 
+	/* Number of outstanding requests. */
+	int nboutstanding;
+
 	/* write rate limit */
 	wrl_creditt wrl_credit; /* [ -wrl_config_writecost, +_dburst ] */
 	struct wrl_timestampt wrl_timestamp;
@@ -183,8 +186,12 @@ static bool domain_can_read(struct connection *conn)
 {
 	struct xenstore_domain_interface *intf = conn->domain->interface;
 
-	if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0)
-		return false;
+	if (domain_is_unprivileged(conn)) {
+		if (conn->domain->wrl_credit < 0)
+			return false;
+		if (conn->domain->nboutstanding >= quota_req_outstanding)
+			return false;
+	}
 
 	return (intf->req_cons != intf->req_prod);
 }
@@ -331,7 +338,7 @@ static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
 
-	domain = talloc(context, struct domain);
+	domain = talloc_zero(context, struct domain);
 	if (!domain) {
 		errno = ENOMEM;
 		return NULL;
@@ -392,9 +399,6 @@ static int new_domain(struct domain *domain, int port, bool restore)
 	domain->conn->domain = domain;
 	domain->conn->id = domain->domid;
 
-	domain->nbentry = 0;
-	domain->nbwatch = 0;
-
 	return 0;
 }
 
@@ -970,6 +974,28 @@ int domain_watch(struct connection *conn)
 		: 0;
 }
 
+void domain_outstanding_inc(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding++;
+}
+
+void domain_outstanding_dec(struct connection *conn)
+{
+	if (!conn || !conn->domain)
+		return;
+	conn->domain->nboutstanding--;
+}
+
+void domain_outstanding_domid_dec(unsigned int domid)
+{
+	struct domain *d = find_domain_by_domid(domid);
+
+	if (d)
+		d->nboutstanding--;
+}
+
 static wrl_creditt wrl_config_writecost      = WRL_FACTOR;
 static wrl_creditt wrl_config_rate           = WRL_RATE   * WRL_FACTOR;
 static wrl_creditt wrl_config_dburst         = WRL_DBURST * WRL_FACTOR;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 4a37de67a09e..617d0acfd75b 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -65,6 +65,9 @@ int domain_entry(struct connection *conn);
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
+void domain_outstanding_inc(struct connection *conn);
+void domain_outstanding_dec(struct connection *conn);
+void domain_outstanding_domid_dec(unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 205d9d8ea116..0755ffa375ba 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -142,6 +142,7 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		  struct node *node, bool exact, struct node_perms *perms)
 {
 	struct connection *i;
+	struct buffered_data *req;
 	struct watch *watch;
 
 	/* During transactions, don't fire watches, but queue them. */
@@ -150,6 +151,8 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		return;
 	}
 
+	req = domain_is_unprivileged(conn) ? conn->in : NULL;
+
 	/* Create an event for each watch. */
 	list_for_each_entry(i, &connections, list) {
 		/* introduce/release domain watches */
@@ -164,12 +167,12 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name,
 		list_for_each_entry(watch, &i->watches, list) {
 			if (exact) {
 				if (streq(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			} else {
 				if (is_child(name, watch->node))
-					send_event(i,
+					send_event(req, i,
 						   get_watch_path(watch, name),
 						   watch->token);
 			}
@@ -269,8 +272,12 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	trace_create(watch, "watch");
 	send_ack(conn, XS_WATCH);
 
-	/* We fire once up front: simplifies clients and restart. */
-	send_event(conn, get_watch_path(watch, watch->node), watch->token);
+	/*
+	 * We fire once up front: simplifies clients and restart.
+	 * This event will not be linked to the XS_WATCH request.
+	 */
+	send_event(NULL, conn, get_watch_path(watch, watch->node),
+		   watch->token);
 
 	return 0;
 }
From 7c82c195e0b97be4c07728ba741beb54afbee1b7 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: don't buffer multiple identical watch events

A guest not reading its Xenstore response buffer fast enough might
cause lots of buffered Xenstore watch events to pile up. Reduce the
generated load by dropping new events which already have an identical
copy pending.

The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
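
The duplicate check can be shown with a small stand-alone sketch. It
is an illustration only: struct pending, queue_event() and the fixed
size limits are hypothetical and replace the talloc based
struct buffered_data list used by xenstored. The comparison itself
follows the patch: two events are identical if their "path\0token\0"
payloads have the same length and content, and paths starting with '@'
are never dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PENDING 8
#define MAX_EVENT_LEN 64

/* Hypothetical fixed-size queue of pending "path\0token\0" payloads. */
struct pending {
	char buf[MAX_PENDING][MAX_EVENT_LEN];
	size_t len[MAX_PENDING];
	size_t count;
};

/*
 * Queue a watch event unless an identical one is already pending.
 * Returns true if the event was queued, false if it was dropped (as a
 * duplicate or, in this sketch only, because the queue is full).
 */
static bool queue_event(struct pending *q, const char *path, const char *token)
{
	char ev[MAX_EVENT_LEN];
	size_t plen = strlen(path) + 1, tlen = strlen(token) + 1;
	size_t len = plen + tlen, i;

	assert(len <= MAX_EVENT_LEN);
	memcpy(ev, path, plen);
	memcpy(ev + plen, token, tlen);

	/* Special "@..." events are always delivered individually. */
	if (path[0] != '@') {
		for (i = 0; i < q->count; i++)
			if (q->len[i] == len && !memcmp(q->buf[i], ev, len))
				return false;
	}

	if (q->count == MAX_PENDING)
		return false;

	memcpy(q->buf[q->count], ev, len);
	q->len[q->count++] = len;
	return true;
}
```

The linear scan costs time proportional to the number of pending
events, which is bounded in practice by the other quotas and the watch
event timeout.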

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 54e6add1a157..45feae313ae6 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -912,6 +912,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->inhdr = true;
 	bdata->used = 0;
 	bdata->timeout_msec = 0;
+	bdata->watch_event = false;
 
 	if (len <= DEFAULT_BUFFER_SIZE)
 		bdata->buffer = bdata->default_buffer;
@@ -944,7 +945,7 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 void send_event(struct buffered_data *req, struct connection *conn,
 		const char *path, const char *token)
 {
-	struct buffered_data *bdata;
+	struct buffered_data *bdata, *bd;
 	unsigned int len;
 
 	len = strlen(path) + 1 + strlen(token) + 1;
@@ -966,12 +967,29 @@ void send_event(struct buffered_data *req, struct connection *conn,
 	bdata->hdr.msg.type = XS_WATCH_EVENT;
 	bdata->hdr.msg.len = len;
 
+	/*
+	 * Check whether an identical event is pending already.
+	 * Special events are excluded from that check.
+	 */
+	if (path[0] != '@') {
+		list_for_each_entry(bd, &conn->out_list, list) {
+			if (bd->watch_event && bd->hdr.msg.len == len &&
+			    !memcmp(bdata->buffer, bd->buffer, len)) {
+				trace("dropping duplicate watch %s %s for domain %u\n",
+				      path, token, conn->id);
+				talloc_free(bdata);
+				return;
+			}
+		}
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
 			conn->timeout_msec = bdata->timeout_msec;
 	}
 
+	bdata->watch_event = true;
 	bdata->pend.req = req;
 	if (req)
 		req->pend.ref.event_cnt++;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index acb6b9fe2ac3..e1d47f88445f 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -62,6 +62,9 @@ struct buffered_data
 	/* Are we still doing the header? */
 	bool inhdr;
 
+	/* Is this a watch event? */
+	bool watch_event;
+
 	/* How far are we? */
 	unsigned int used;
 
From 52553b65f4838e2d7ddcf69de7b4f90f28c4c4ca Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: fix connection->id usage

Don't use conn->id for privilege checks, but domain_is_unprivileged().

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index f0e00db633ec..61bcbc069d75 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -878,7 +878,7 @@ int do_control(struct connection *conn, struct buffered_data *in)
 	unsigned int cmd, num, off;
 	char **vec = NULL;
 
-	if (conn->id != 0)
+	if (domain_is_unprivileged(conn))
 		return EACCES;
 
 	off = get_string(in, 0);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index e1d47f88445f..aa0dedde644b 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -123,7 +123,7 @@ struct connection
 	/* The index of pollfd in global pollfd array */
 	int pollfd_idx;
 
-	/* Who am I? 0 for socket connections. */
+	/* Who am I? Domid of connection. */
 	unsigned int id;
 
 	/* Is this connection ignored? */
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 54432907fc76..ee1b09031a3b 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -477,7 +477,8 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 	if (conn->transaction)
 		return EBUSY;
 
-	if (conn->id && conn->transaction_started > quota_max_transaction)
+	if (domain_is_unprivileged(conn) &&
+	    conn->transaction_started > quota_max_transaction)
 		return ENOSPC;
 
 	/* Attach transaction to input for autofree until it's complete */
From 5ad365b9c2a8f1a1c356000ee19df5dcb2ce5c87 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:08 +0200
Subject: tools/xenstore: simplify and fix per domain node accounting

The accounting of nodes can be simplified now that each connection
holds the associated domid.

Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.

The node accounting also has to be fixed in some cases, especially in
error paths.

This is part of XSA-326 / CVE-2022-42313.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
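
The error handling pattern used for ownership changes (uncharge the old
owner, try to charge the new one, roll back on failure) can be sketched
as below. All names here (struct dom_acct, doms[], entry_inc(),
entry_dec(), change_owner()) are hypothetical; in particular the sketch
models "domain structure cannot be allocated" simply as an unknown
domid, while the patch allocates a structure for existing domains on
demand.

```c
#include <assert.h>
#include <errno.h>

#define NR_DOMS 4

/* Hypothetical per-domain node counters; "exists" models a known domain. */
struct dom_acct {
	int exists;
	int nbentry;
};

static struct dom_acct doms[NR_DOMS];

/* Charge one node to domid; fails if the domain is unknown. */
static int entry_inc(unsigned int domid)
{
	if (domid >= NR_DOMS || !doms[domid].exists)
		return ENOMEM;
	doms[domid].nbentry++;
	return 0;
}

static void entry_dec(unsigned int domid)
{
	assert(domid < NR_DOMS && doms[domid].nbentry > 0);
	doms[domid].nbentry--;
}

/*
 * Move ownership of a node from *owner to new_owner, keeping the counters
 * consistent even if charging the new owner fails.
 */
static int change_owner(unsigned int *owner, unsigned int new_owner)
{
	unsigned int old_owner = *owner;
	int ret;

	entry_dec(old_owner);
	ret = entry_inc(new_owner);
	if (ret) {
		/* Cannot fail: old_owner was charged a moment ago. */
		entry_inc(old_owner);
		return ret;
	}
	*owner = new_owner;
	return 0;
}
```

The rollback relies on the same argument as the comment in
do_set_perms(): re-charging the previous owner cannot fail, because it
was known immediately before and nothing else runs in between.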

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 45feae313ae6..0a684450bca6 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -634,7 +634,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(node)) {
+	if (domain_adjust_node_perms(conn, node)) {
 		talloc_free(node);
 		return NULL;
 	}
@@ -656,7 +656,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	void *p;
 	struct xs_tdb_record_hdr *hdr;
 
-	if (domain_adjust_node_perms(node))
+	if (domain_adjust_node_perms(conn, node))
 		return errno;
 
 	data.dsize = sizeof(*hdr)
@@ -1268,13 +1268,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static int destroy_node(struct connection *conn, struct node *node)
+static void destroy_node_rm(struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
 	tdb_delete(tdb_ctx, node->key);
+}
 
+static int destroy_node(struct connection *conn, struct node *node)
+{
+	destroy_node_rm(node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1324,8 +1328,12 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 			goto err;
 
 		/* Account for new node */
-		if (i->parent)
-			domain_entry_inc(conn, i);
+		if (i->parent) {
+			if (domain_entry_inc(conn, i)) {
+				destroy_node_rm(i);
+				return NULL;
+			}
+		}
 	}
 
 	return node;
@@ -1610,10 +1618,27 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in)
 	old_perms = node->perms;
 	domain_entry_dec(conn, node);
 	node->perms = perms;
-	domain_entry_inc(conn, node);
+	if (domain_entry_inc(conn, node)) {
+		node->perms = old_perms;
+		/*
+		 * This should never fail because we had a reference on the
+		 * domain before and Xenstored is single-threaded.
+		 */
+		domain_entry_inc(conn, node);
+		return ENOMEM;
+	}
+
+	if (write_node(conn, node, false)) {
+		int saved_errno = errno;
 
-	if (write_node(conn, node, false))
+		domain_entry_dec(conn, node);
+		node->perms = old_perms;
+		/* No failure possible as above. */
+		domain_entry_inc(conn, node);
+
+		errno = saved_errno;
 		return errno;
+	}
 
 	fire_watches(conn, in, name, node, false, &old_perms);
 	send_ack(conn, XS_SET_PERMS);
@@ -3095,7 +3120,9 @@ void read_state_node(const void *ctx, const void *state)
 	set_tdb_key(name, &key);
 	if (write_node_raw(NULL, &key, node, true))
 		barf("write node error restoring node");
-	domain_entry_inc(&conn, node);
+
+	if (domain_entry_inc(&conn, node))
+		barf("node accounting error restoring node");
 
 	talloc_free(node);
 }
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index c0a37712f89b..44ce267ec557 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -16,6 +16,7 @@
     along with this program; If not, see <http://www.gnu.org/licenses/>.
 */
 
+#include <assert.h>
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
@@ -363,6 +364,18 @@ static struct domain *find_or_alloc_domain(const void *ctx, unsigned int domid)
 	return domain ? : alloc_domain(ctx, domid);
 }
 
+static struct domain *find_or_alloc_existing_domain(unsigned int domid)
+{
+	struct domain *domain;
+	xc_dominfo_t dominfo;
+
+	domain = find_domain_struct(domid);
+	if (!domain && get_domain_info(domid, &dominfo))
+		domain = alloc_domain(NULL, domid);
+
+	return domain;
+}
+
 static int new_domain(struct domain *domain, int port, bool restore)
 {
 	int rc;
@@ -814,30 +827,28 @@ void domain_deinit(void)
 		xenevtchn_unbind(xce_handle, virq_port);
 }
 
-void domain_entry_inc(struct connection *conn, struct node *node)
+int domain_entry_inc(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
-		return;
+		return 0;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d)
-				d->nbentry++;
-		}
-	} else if (conn->domain) {
-		if (conn->transaction) {
-			transaction_entry_inc(conn->transaction,
-				conn->domain->domid);
- 		} else {
- 			conn->domain->nbentry++;
-		}
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_inc(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_or_alloc_existing_domain(domid);
+		if (d)
+			d->nbentry++;
+		else
+			return ENOMEM;
 	}
+
+	return 0;
 }
 
 /*
@@ -873,7 +884,7 @@ static int chk_domain_generation(unsigned int domid, uint64_t gen)
  * Remove permissions for no longer existing domains in order to avoid a new
  * domain with the same domid inheriting the permissions.
  */
-int domain_adjust_node_perms(struct node *node)
+int domain_adjust_node_perms(struct connection *conn, struct node *node)
 {
 	unsigned int i;
 	int ret;
@@ -883,8 +894,14 @@ int domain_adjust_node_perms(struct node *node)
 		return errno;
 
 	/* If the owner doesn't exist any longer give it to priv domain. */
-	if (!ret)
+	if (!ret) {
+		/*
+		 * In theory we'd need to update the number of dom0 nodes here,
+		 * but we could be called for a read of the node. So better
+		 * avoid the risk of overflowing the node count of dom0.
+		 */
 		node->perms.p[0].id = priv_domid;
+	}
 
 	for (i = 1; i < node->perms.num; i++) {
 		if (node->perms.p[i].perms & XS_PERM_IGNORE)
@@ -903,25 +920,25 @@ int domain_adjust_node_perms(struct node *node)
 void domain_entry_dec(struct connection *conn, struct node *node)
 {
 	struct domain *d;
+	unsigned int domid;
 
 	if (!conn)
 		return;
 
-	if (node->perms.p && node->perms.p[0].id != conn->id) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				node->perms.p[0].id);
-		} else {
-			d = find_domain_by_domid(node->perms.p[0].id);
-			if (d && d->nbentry)
-				d->nbentry--;
-		}
-	} else if (conn->domain && conn->domain->nbentry) {
-		if (conn->transaction) {
-			transaction_entry_dec(conn->transaction,
-				conn->domain->domid);
+	domid = node->perms.p ? node->perms.p[0].id : conn->id;
+
+	if (conn->transaction) {
+		transaction_entry_dec(conn->transaction, domid);
+	} else {
+		d = (domid == conn->id && conn->domain) ? conn->domain
+		    : find_domain_struct(domid);
+		if (d) {
+			d->nbentry--;
 		} else {
-			conn->domain->nbentry--;
+			errno = ENOENT;
+			corrupt(conn,
+				"Node \"%s\" owned by non-existing domain %u\n",
+				node->name, domid);
 		}
 	}
 }
@@ -931,13 +948,23 @@ int domain_entry_fix(unsigned int domid, int num, bool update)
 	struct domain *d;
 	int cnt;
 
-	d = find_domain_by_domid(domid);
-	if (!d)
-		return 0;
+	if (update) {
+		d = find_domain_struct(domid);
+		assert(d);
+	} else {
+		/*
+		 * We are called first with update == false in order to catch
+		 * any error. So do a possible allocation and check for error
+		 * only in this case, as in the case of update == true nothing
+		 * can go wrong anymore as the allocation already happened.
+		 */
+		d = find_or_alloc_existing_domain(domid);
+		if (!d)
+			return -1;
+	}
 
 	cnt = d->nbentry + num;
-	if (cnt < 0)
-		cnt = 0;
+	assert(cnt >= 0);
 
 	if (update)
 		d->nbentry = cnt;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 617d0acfd75b..593793131494 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -55,10 +55,10 @@ const char *get_implicit_path(const struct connection *conn);
 bool domain_is_unprivileged(struct connection *conn);
 
 /* Remove node permissions for no longer existing domains. */
-int domain_adjust_node_perms(struct node *node);
+int domain_adjust_node_perms(struct connection *conn, struct node *node);
 
 /* Quota manipulation */
-void domain_entry_inc(struct connection *conn, struct node *);
+int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ee1b09031a3b..86caf6c398be 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -519,8 +519,12 @@ static int transaction_fix_domains(struct transaction *trans, bool update)
 
 	list_for_each_entry(d, &trans->changed_domains, list) {
 		cnt = domain_entry_fix(d->domid, d->nbentry, update);
-		if (!update && cnt >= quota_nb_entry_per_domain)
-			return ENOSPC;
+		if (!update) {
+			if (cnt >= quota_nb_entry_per_domain)
+				return ENOSPC;
+			if (cnt < 0)
+				return ENOMEM;
+		}
 	}
 
 	return 0;
From 5a9d0ae4879fb2b5a0da8a592a3573ff96f57c05 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: limit max number of nodes accessed in a transaction

Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore,
as it needs to keep track of all nodes that have been accessed during a
transaction.

In oxenstored the number of requests in a transaction is being limited
via a quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes.

In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
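
The counting rule described above can be sketched in isolation as
follows. This is only an illustration under stated assumptions, not the
xenstored code: struct sketch_trans, sketch_seen(), sketch_access_node()
and SKETCH_MAX_NODES are hypothetical names, and a fixed-size array
stands in for the real list of accessed nodes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_NODES 8	/* hypothetical table size for this sketch */

/* Hypothetical stand-in for the per-transaction state. */
struct sketch_trans {
	unsigned int nodes;			/* distinct nodes accessed */
	const char *seen[SKETCH_MAX_NODES];	/* names already counted */
};

/* Has this node already been accessed in the transaction? */
bool sketch_seen(const struct sketch_trans *t, const char *name)
{
	for (unsigned int i = 0; i < t->nodes; i++)
		if (!strcmp(t->seen[i], name))
			return true;
	return false;
}

/*
 * Returns 0 on success, ENOSPC when an unprivileged caller would exceed
 * the quota, ENOMEM when the sketch's fixed table is full.
 * Repeated accesses of the same node are not counted again.
 */
int sketch_access_node(struct sketch_trans *t, const char *name,
		       unsigned int quota, bool unprivileged)
{
	assert(t && name);

	if (sketch_seen(t, name))
		return 0;
	if (t->nodes >= quota && unprivileged)
		return ENOSPC;
	if (t->nodes >= SKETCH_MAX_NODES)
		return ENOMEM;

	t->seen[t->nodes++] = name;
	return 0;
}
```

Counting distinct nodes rather than requests means that re-reading the
same node never moves a transaction closer to the limit.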

This is part of XSA-326 / CVE-2022-42314.

Reported-by: Julien Grall <jgrall@amazon.com>
Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/include/xen-tools/libs.h b/tools/include/xen-tools/libs.h
index a16e0c380709..bafc90e2f603 100644
--- a/tools/include/xen-tools/libs.h
+++ b/tools/include/xen-tools/libs.h
@@ -63,4 +63,8 @@
 #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1))
 #endif
 
+#ifndef __must_check
+#define __must_check __attribute__((__warn_unused_result__))
+#endif
+
 #endif	/* __XEN_TOOLS_LIBS__ */
diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 0a684450bca6..d4fd005f599d 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -106,6 +106,7 @@ int quota_nb_watch_per_domain = 128;
 int quota_max_entry_size = 2048; /* 2K */
 int quota_max_transaction = 10;
 int quota_nb_perms_per_node = 5;
+int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
 
@@ -591,6 +592,7 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	TDB_DATA key, data;
 	struct xs_tdb_record_hdr *hdr;
 	struct node *node;
+	int err;
 
 	node = talloc(ctx, struct node);
 	if (!node) {
@@ -612,14 +614,13 @@ struct node *read_node(struct connection *conn, const void *ctx,
 	if (data.dptr == NULL) {
 		if (tdb_error(tdb_ctx) == TDB_ERR_NOEXIST) {
 			node->generation = NO_GENERATION;
-			access_node(conn, node, NODE_ACCESS_READ, NULL);
-			errno = ENOENT;
+			err = access_node(conn, node, NODE_ACCESS_READ, NULL);
+			errno = err ? : ENOENT;
 		} else {
 			log("TDB error on read: %s", tdb_errorstr(tdb_ctx));
 			errno = EIO;
 		}
-		talloc_free(node);
-		return NULL;
+		goto error;
 	}
 
 	node->parent = NULL;
@@ -634,19 +635,36 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
-	if (domain_adjust_node_perms(conn, node)) {
-		talloc_free(node);
-		return NULL;
-	}
+	if (domain_adjust_node_perms(conn, node))
+		goto error;
 
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
 	node->children = node->data + node->datalen;
 
-	access_node(conn, node, NODE_ACCESS_READ, NULL);
+	if (access_node(conn, node, NODE_ACCESS_READ, NULL))
+		goto error;
 
 	return node;
+
+ error:
+	err = errno;
+	talloc_free(node);
+	errno = err;
+	return NULL;
+}
+
+static bool read_node_can_propagate_errno(void)
+{
+	/*
+	 * 2 error cases for read_node() can always be propagated up:
+	 * ENOMEM, because this has nothing to do with the node being in the
+	 * data base or not, but is caused by a general lack of memory.
+	 * ENOSPC, because this is related to hitting quota limits which need
+	 * to be respected.
+	 */
+	return errno == ENOMEM || errno == ENOSPC;
 }
 
 int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
@@ -763,7 +781,7 @@ static int ask_parents(struct connection *conn, const void *ctx,
 		node = read_node(conn, ctx, name);
 		if (node)
 			break;
-		if (errno == ENOMEM)
+		if (read_node_can_propagate_errno())
 			return errno;
 	} while (!streq(name, "/"));
 
@@ -825,7 +843,7 @@ static struct node *get_node(struct connection *conn,
 		}
 	}
 	/* Clean up errno if they weren't supposed to know. */
-	if (!node && errno != ENOMEM)
+	if (!node && !read_node_can_propagate_errno())
 		errno = errno_from_parents(conn, ctx, name, errno, perm);
 	return node;
 }
@@ -1231,7 +1249,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 
 	/* If parent doesn't exist, create it. */
 	parent = read_node(conn, parentname, parentname);
-	if (!parent)
+	if (!parent && errno == ENOENT)
 		parent = construct_node(conn, ctx, parentname);
 	if (!parent)
 		return NULL;
@@ -1505,7 +1523,7 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node,
 
 	parent = read_node(conn, ctx, parentname);
 	if (!parent)
-		return (errno == ENOMEM) ? ENOMEM : EINVAL;
+		return read_node_can_propagate_errno() ? errno : EINVAL;
 	node->parent = parent;
 
 	return delete_node(conn, ctx, parent, node, false);
@@ -1535,7 +1553,7 @@ static int do_rm(struct connection *conn, struct buffered_data *in)
 				return 0;
 			}
 			/* Restore errno, just in case. */
-			if (errno != ENOMEM)
+			if (!read_node_can_propagate_errno())
 				errno = ENOENT;
 		}
 		return errno;
@@ -2368,6 +2386,8 @@ static void usage(void)
 "  -M, --path-max <chars>  limit the allowed Xenstore node path length,\n"
 "  -Q, --quota <what>=<nb> set the quota <what> to the value <nb>, allowed\n"
 "                          quotas are:\n"
+"                          transaction-nodes: number of accessed node per\n"
+"                                             transaction\n"
 "                          outstanding: number of outstanding requests\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
@@ -2452,6 +2472,8 @@ static void set_quota(const char *arg)
 	val = get_optval_int(eq + 1);
 	if (what_matches(arg, "outstanding"))
 		quota_req_outstanding = val;
+	else if (what_matches(arg, "transaction-nodes"))
+		quota_trans_nodes = val;
 	else
 		barf("unknown quota \"%s\"\n", arg);
 }
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index aa0dedde644b..9c572a3c6e2a 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -266,6 +266,7 @@ extern int dom0_event;
 extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
+extern int quota_trans_nodes;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 86caf6c398be..7bd41eb475e3 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -156,6 +156,9 @@ struct transaction
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
+	/* Node counter. */
+	unsigned int nodes;
+
 	/* Generation when transaction started. */
 	uint64_t generation;
 
@@ -260,6 +263,11 @@ int access_node(struct connection *conn, struct node *node,
 
 	i = find_accessed_node(trans, node->name);
 	if (!i) {
+		if (trans->nodes >= quota_trans_nodes &&
+		    domain_is_unprivileged(conn)) {
+			ret = ENOSPC;
+			goto err;
+		}
 		i = talloc_zero(trans, struct accessed_node);
 		if (!i)
 			goto nomem;
@@ -297,6 +305,7 @@ int access_node(struct connection *conn, struct node *node,
 				i->ta_node = true;
 			}
 		}
+		trans->nodes++;
 		list_add_tail(&i->list, &trans->accessed);
 	}
 
diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h
index 0093cac807e3..e3cbd6b23095 100644
--- a/tools/xenstore/xenstored_transaction.h
+++ b/tools/xenstore/xenstored_transaction.h
@@ -39,8 +39,8 @@ void transaction_entry_inc(struct transaction *trans, unsigned int domid);
 void transaction_entry_dec(struct transaction *trans, unsigned int domid);
 
 /* This node was accessed. */
-int access_node(struct connection *conn, struct node *node,
-                enum node_access_type type, TDB_DATA *key);
+int __must_check access_node(struct connection *conn, struct node *node,
+                             enum node_access_type type, TDB_DATA *key);
 
 /* Queue watches for a modified node. */
 void queue_watches(struct connection *conn, const char *name, bool watch_exact);
From 4e1fc1fef11b9ee9ecc100f1ac1a7abc3cff8c0a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: move the call of setup_structure() to dom0
 introduction

Setting up the basic structure when introducing dom0 makes it possible
to add proper node memory accounting for the added nodes later.

It makes proper node accounting possible, too.

An additional requirement to make that work fine is to correct the
owner of the created nodes to be dom0_domid instead of domid 0.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index d4fd005f599d..844ae396a0d5 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -2018,7 +2018,8 @@ static int tdb_flags;
 static void manual_node(const char *name, const char *child)
 {
 	struct node *node;
-	struct xs_permissions perms = { .id = 0, .perms = XS_PERM_NONE };
+	struct xs_permissions perms = { .id = dom0_domid,
+					.perms = XS_PERM_NONE };
 
 	node = talloc_zero(NULL, struct node);
 	if (!node)
@@ -2057,7 +2058,7 @@ static void tdb_logger(TDB_CONTEXT *tdb, int level, const char * fmt, ...)
 	}
 }
 
-static void setup_structure(bool live_update)
+void setup_structure(bool live_update)
 {
 	char *tdbname;
 
@@ -2080,6 +2081,7 @@ static void setup_structure(bool live_update)
 		manual_node("/", "tool");
 		manual_node("/tool", "xenstored");
 		manual_node("/tool/xenstored", NULL);
+		domain_entry_fix(dom0_domid, 3, true);
 	}
 
 	check_store();
@@ -2598,9 +2600,6 @@ int main(int argc, char *argv[])
 
 	init_pipe(reopen_log_pipe);
 
-	/* Setup the database */
-	setup_structure(live_update);
-
 	/* Listen to hypervisor. */
 	if (!no_domain_init && !live_update) {
 		domain_init(-1);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 9c572a3c6e2a..a772f3b8ead2 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -231,6 +231,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 struct node *read_node(struct connection *conn, const void *ctx,
 		       const char *name);
 
+void setup_structure(bool live_update);
 struct connection *new_connection(const struct interface_funcs *funcs);
 struct connection *get_connection_by_id(unsigned int conn_id);
 void check_store(void);
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 44ce267ec557..5c79eed3dc34 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -496,6 +496,9 @@ static struct domain *introduce_domain(const void *ctx,
 		}
 		domain->interface = interface;
 
+		if (is_master_domain)
+			setup_structure(restore);
+
 		/* Now domain belongs to its connection. */
 		talloc_steal(domain->conn, domain);
 
From f30edd5452a3226d18ff98441013c9e86e4e34f2 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add infrastructure to keep track of per domain memory
 usage

The amount of memory a domain can consume in Xenstore is limited by
various quotas today, but even with sane quotas a domain can still
consume rather large quantities of memory.

Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.

There are two quotas defined: a soft quota, which results in a warning
issued via syslog() when it is exceeded, and a hard quota, which stops
further requests or watch events from being accepted for as long as
accepting them would violate the hard quota.

Setting any of those quotas to 0 will disable it.

As default values use 2MB per domain for the soft limit (this basically
covers the allowed case to create 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
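
As a worked check of those defaults, the following sketch classifies a
memory value against the two limits. It is a simplified illustration:
sketch_classify() and the SKETCH_* names are hypothetical, and the
message rate limiting and the exemption of privileged domains are left
out.

```c
#include <assert.h>

/* Default limits as stated above, in bytes. */
#define SKETCH_SOFT_DEFAULT (2 * 1024 * 1024)			/* 2 MB */
#define SKETCH_HARD_DEFAULT (2 * 1024 * 1024 + 512 * 1024)	/* 2.5 MB */

enum sketch_quota_state { SKETCH_OK, SKETCH_SOFT, SKETCH_HARD };

/*
 * Classify a domain's memory usage. A limit of 0 disables that limit.
 * The hard limit is tested first, as it is the stricter outcome.
 */
enum sketch_quota_state sketch_classify(int mem, int soft, int hard)
{
	assert(mem >= 0);

	if (hard && mem >= hard)
		return SKETCH_HARD;
	if (soft && mem >= soft)
		return SKETCH_SOFT;
	return SKETCH_OK;
}
```

With these values, 1000 nodes of 2048 bytes each add up to 2048000
bytes, which stays just below the soft limit of 2097152 bytes.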

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 844ae396a0d5..f03ad93b4385 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -109,6 +109,8 @@ int quota_nb_perms_per_node = 5;
 int quota_trans_nodes = 1024;
 int quota_max_path_len = XENSTORE_REL_PATH_MAX;
 int quota_req_outstanding = 20;
+int quota_memory_per_domain_soft = 2 * 1024 * 1024; /* 2 MB */
+int quota_memory_per_domain_hard = 2 * 1024 * 1024 + 512 * 1024; /* 2.5 MB */
 
 unsigned int timeout_watch_event_msec = 20000;
 
@@ -2390,7 +2392,14 @@ static void usage(void)
 "                          quotas are:\n"
 "                          transaction-nodes: number of accessed node per\n"
 "                                             transaction\n"
+"                          memory: total used memory per domain for nodes,\n"
+"                                  transactions, watches and requests, above\n"
+"                                  which Xenstore will stop talking to domain\n"
 "                          outstanding: number of outstanding requests\n"
+"  -q, --quota-soft <what>=<nb> set a soft quota <what> to the value <nb>,\n"
+"                          causing a warning to be issued via syslog() if the\n"
+"                          limit is violated, allowed quotas are:\n"
+"                          memory: see above\n"
 "  -w, --timeout <what>=<seconds>   set the timeout in seconds for <what>,\n"
 "                          allowed timeout candidates are:\n"
 "                          watch-event: time a watch-event is kept pending\n"
@@ -2417,6 +2426,7 @@ static struct option options[] = {
 	{ "perm-nb", 1, NULL, 'A' },
 	{ "path-max", 1, NULL, 'M' },
 	{ "quota", 1, NULL, 'Q' },
+	{ "quota-soft", 1, NULL, 'q' },
 	{ "timeout", 1, NULL, 'w' },
 	{ "no-recovery", 0, NULL, 'R' },
 	{ "internal-db", 0, NULL, 'I' },
@@ -2464,7 +2474,7 @@ static void set_timeout(const char *arg)
 		barf("unknown timeout \"%s\"\n", arg);
 }
 
-static void set_quota(const char *arg)
+static void set_quota(const char *arg, bool soft)
 {
 	const char *eq = strchr(arg, '=');
 	int val;
@@ -2472,11 +2482,16 @@ static void set_quota(const char *arg)
 	if (!eq)
 		barf("quotas must be specified via <what>=<nb>\n");
 	val = get_optval_int(eq + 1);
-	if (what_matches(arg, "outstanding"))
+	if (what_matches(arg, "outstanding") && !soft)
 		quota_req_outstanding = val;
-	else if (what_matches(arg, "transaction-nodes"))
+	else if (what_matches(arg, "transaction-nodes") && !soft)
 		quota_trans_nodes = val;
-	else
+	else if (what_matches(arg, "memory")) {
+		if (soft)
+			quota_memory_per_domain_soft = val;
+		else
+			quota_memory_per_domain_hard = val;
+	} else
 		barf("unknown quota \"%s\"\n", arg);
 }
 
@@ -2494,7 +2509,7 @@ int main(int argc, char *argv[])
 	orig_argc = argc;
 	orig_argv = argv;
 
-	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:T:RVW:w:U",
+	while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:M:Q:q:T:RVW:w:U",
 				  options, NULL)) != -1) {
 		switch (opt) {
 		case 'D':
@@ -2545,7 +2560,10 @@ int main(int argc, char *argv[])
 						 quota_max_path_len);
 			break;
 		case 'Q':
-			set_quota(optarg);
+			set_quota(optarg, false);
+			break;
+		case 'q':
+			set_quota(optarg, true);
 			break;
 		case 'w':
 			set_timeout(optarg);
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index a772f3b8ead2..ec52d8d3ff03 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -268,6 +268,8 @@ extern int priv_domid;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
+extern int quota_memory_per_domain_soft;
+extern int quota_memory_per_domain_hard;
 
 extern unsigned int timeout_watch_event_msec;
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 5c79eed3dc34..42423808863f 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -76,6 +76,13 @@ struct domain
 	/* number of entry from this domain in the store */
 	int nbentry;
 
+	/* Amount of memory allocated for this domain. */
+	int memory;
+	bool soft_quota_reported;
+	bool hard_quota_reported;
+	time_t mem_last_msg;
+#define MEM_WARN_MINTIME_SEC 10
+
 	/* number of watch for this domain */
 	int nbwatch;
 
@@ -192,6 +199,9 @@ static bool domain_can_read(struct connection *conn)
 			return false;
 		if (conn->domain->nboutstanding >= quota_req_outstanding)
 			return false;
+		if (conn->domain->memory >= quota_memory_per_domain_hard &&
+		    quota_memory_per_domain_hard)
+			return false;
 	}
 
 	return (intf->req_cons != intf->req_prod);
@@ -982,6 +992,89 @@ int domain_entry(struct connection *conn)
 		: 0;
 }
 
+static bool domain_chk_quota(struct domain *domain, int mem)
+{
+	time_t now;
+
+	if (!domain || !domid_is_unprivileged(domain->domid) ||
+	    (domain->conn && domain->conn->is_ignored))
+		return false;
+
+	now = time(NULL);
+
+	if (mem >= quota_memory_per_domain_hard &&
+	    quota_memory_per_domain_hard) {
+		if (domain->hard_quota_reported)
+			return true;
+		syslog(LOG_ERR, "Domain %u exceeds hard memory quota, Xenstore interface to domain stalled\n",
+		       domain->domid);
+		domain->mem_last_msg = now;
+		domain->hard_quota_reported = true;
+		return true;
+	}
+
+	if (now - domain->mem_last_msg >= MEM_WARN_MINTIME_SEC) {
+		if (domain->hard_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->hard_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below hard memory quota again\n",
+			       domain->domid);
+		}
+		if (mem >= quota_memory_per_domain_soft &&
+		    quota_memory_per_domain_soft &&
+		    !domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = true;
+			syslog(LOG_WARNING, "Domain %u exceeds soft memory quota\n",
+			       domain->domid);
+		}
+		if (mem < quota_memory_per_domain_soft &&
+		    domain->soft_quota_reported) {
+			domain->mem_last_msg = now;
+			domain->soft_quota_reported = false;
+			syslog(LOG_INFO, "Domain %u below soft memory quota again\n",
+			       domain->domid);
+		}
+
+	}
+
+	return false;
+}
+
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check)
+{
+	struct domain *domain;
+
+	domain = find_domain_struct(domid);
+	if (domain) {
+		/*
+		 * domain_chk_quota() will print warning and also store whether
+		 * the soft/hard quota has been hit. So check no_quota_check
+		 * *after*.
+		 */
+		if (domain_chk_quota(domain, domain->memory + mem) &&
+		    !no_quota_check)
+			return ENOMEM;
+		domain->memory += mem;
+	} else {
+		/*
+		 * The domain the memory is to be accounted for should always
+		 * exist, as accounting is done either for a domain related to
+		 * the current connection, or for the domain owning a node
+		 * (which is always existing, as the owner of the node is
+		 * tested to exist and replaced by domid 0 if not).
+		 * So not finding the related domain MUST be an error in the
+		 * data base.
+		 */
+		errno = ENOENT;
+		corrupt(NULL, "Accounting called for non-existing domain %u\n",
+			domid);
+		return ENOENT;
+	}
+
+	return 0;
+}
+
 void domain_watch_inc(struct connection *conn)
 {
 	if (!conn || !conn->domain)
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index 593793131494..d342e5e867ed 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -62,6 +62,26 @@ int domain_entry_inc(struct connection *conn, struct node *);
 void domain_entry_dec(struct connection *conn, struct node *);
 int domain_entry_fix(unsigned int domid, int num, bool update);
 int domain_entry(struct connection *conn);
+int domain_memory_add(unsigned int domid, int mem, bool no_quota_check);
+
+/*
+ * domain_memory_add_chk(): to be used when memory quota should be checked.
+ * Not to be used when specifying a negative mem value, as lowering the used
+ * memory should always be allowed.
+ */
+static inline int domain_memory_add_chk(unsigned int domid, int mem)
+{
+	return domain_memory_add(domid, mem, false);
+}
+/*
+ * domain_memory_add_nochk(): to be used when memory quota should not be
+ * checked, e.g. when lowering memory usage, or in an error case for undoing
+ * a previous memory adjustment.
+ */
+static inline void domain_memory_add_nochk(unsigned int domid, int mem)
+{
+	domain_memory_add(domid, mem, true);
+}
 void domain_watch_inc(struct connection *conn);
 void domain_watch_dec(struct connection *conn);
 int domain_watch(struct connection *conn);
From 8356fb51c9993a6276af25de3d183a972cb9c49a Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:09 +0200
Subject: tools/xenstore: add memory accounting for responses

Add the memory accounting for queued responses.

In case adding a watch event for a guest is causing the hard memory
quota of that guest to be violated, the event is dropped. This will
ensure that it is impossible to drive another guest past its memory
quota by generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
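
A minimal sketch of that behaviour, assuming a simplified domain
structure: queueing an event is charged with a quota check and the event
is dropped on failure, while freeing an event is always allowed. The
names struct sketch_domain, sketch_memory_add(), sketch_queue_event()
and sketch_free_event() are hypothetical and only loosely mirror the
checked and unchecked accounting helpers of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced view of the per-domain accounting state. */
struct sketch_domain {
	int memory;	/* bytes currently accounted to the domain */
	int hard_quota;	/* 0 disables the limit */
};

/* Adjust the accounted memory; refuse if the hard quota would be hit. */
int sketch_memory_add(struct sketch_domain *d, int mem, bool no_quota_check)
{
	assert(d);

	if (!no_quota_check && d->hard_quota &&
	    d->memory + mem >= d->hard_quota)
		return ENOMEM;

	d->memory += mem;
	return 0;
}

/* Returns true if the event was queued, false if it was dropped. */
bool sketch_queue_event(struct sketch_domain *d, int len)
{
	return sketch_memory_add(d, len, false) == 0;
}

/* Lowering the usage must always succeed, so skip the quota check. */
void sketch_free_event(struct sketch_domain *d, int len)
{
	sketch_memory_add(d, -len, true);
}
```

Dropping the event leaves the receiving domain's accounted memory
unchanged, so a sender cannot push the receiver over its limit.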

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index f03ad93b4385..009eaa8e5f53 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -256,6 +256,8 @@ static void free_buffered_data(struct buffered_data *out,
 		}
 	}
 
+	domain_memory_add_nochk(conn->id, -out->hdr.msg.len - sizeof(out->hdr));
+
 	if (out->hdr.msg.type == XS_WATCH_EVENT) {
 		req = out->pend.req;
 		if (req) {
@@ -934,11 +936,14 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type,
 	bdata->timeout_msec = 0;
 	bdata->watch_event = false;
 
-	if (len <= DEFAULT_BUFFER_SIZE)
+	if (len <= DEFAULT_BUFFER_SIZE) {
 		bdata->buffer = bdata->default_buffer;
-	else {
+		/* Don't check quota, path might be used for returning error. */
+		domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
+	} else {
 		bdata->buffer = talloc_array(bdata, char, len);
-		if (!bdata->buffer) {
+		if (!bdata->buffer ||
+		    domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
 			send_error(conn, ENOMEM);
 			return;
 		}
@@ -1003,6 +1008,11 @@ void send_event(struct buffered_data *req, struct connection *conn,
 		}
 	}
 
+	if (domain_memory_add_chk(conn->id, len + sizeof(bdata->hdr))) {
+		talloc_free(bdata);
+		return;
+	}
+
 	if (timeout_watch_event_msec && domain_is_unprivileged(conn)) {
 		bdata->timeout_msec = get_now_msec() + timeout_watch_event_msec;
 		if (!conn->timeout_msec)
@@ -3012,6 +3022,12 @@ static void add_buffered_data(struct buffered_data *bdata,
 	 */
 	if (bdata->hdr.msg.type != XS_WATCH_EVENT)
 		domain_outstanding_inc(conn);
+	/*
+	 * We are restoring the state after Live-Update and the new quota may
+	 * be smaller. So ignore it. The limit will be applied for any resource
+	 * after the state has been fully restored.
+	 */
+	domain_memory_add_nochk(conn->id, len + sizeof(bdata->hdr));
 }
 
 void read_state_buffered_data(const void *ctx, struct connection *conn,
From 0f576f3376b38298234cc74a372f0bb2ea186a30 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for watches

Add the memory accounting for registered watches.

When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 009eaa8e5f53..1a5ba4aba839 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -459,6 +459,7 @@ static int destroy_conn(void *_conn)
 	}
 
 	conn_free_buffered_data(conn);
+	conn_delete_all_watches(conn);
 	list_for_each_entry(req, &conn->ref_list, list)
 		req->on_ref_list = false;
 
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index 0755ffa375ba..fdf9b2d653a0 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -211,7 +211,7 @@ static int check_watch_path(struct connection *conn, const void *ctx,
 }
 
 static struct watch *add_watch(struct connection *conn, char *path, char *token,
-			       bool relative)
+			       bool relative, bool no_quota_check)
 {
 	struct watch *watch;
 
@@ -222,6 +222,9 @@ static struct watch *add_watch(struct connection *conn, char *path, char *token,
 	watch->token = talloc_strdup(watch, token);
 	if (!watch->node || !watch->token)
 		goto nomem;
+	if (domain_memory_add(conn->id, strlen(path) + strlen(token),
+			      no_quota_check))
+		goto nomem;
 
 	if (relative)
 		watch->relative_path = get_implicit_path(conn);
@@ -265,7 +268,7 @@ int do_watch(struct connection *conn, struct buffered_data *in)
 	if (domain_watch(conn) > quota_nb_watch_per_domain)
 		return E2BIG;
 
-	watch = add_watch(conn, vec[0], vec[1], relative);
+	watch = add_watch(conn, vec[0], vec[1], relative, false);
 	if (!watch)
 		return errno;
 
@@ -296,6 +299,8 @@ int do_unwatch(struct connection *conn, struct buffered_data *in)
 	list_for_each_entry(watch, &conn->watches, list) {
 		if (streq(watch->node, node) && streq(watch->token, vec[1])) {
 			list_del(&watch->list);
+			domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+							  strlen(watch->token));
 			talloc_free(watch);
 			domain_watch_dec(conn);
 			send_ack(conn, XS_UNWATCH);
@@ -311,6 +316,8 @@ void conn_delete_all_watches(struct connection *conn)
 
 	while ((watch = list_top(&conn->watches, struct watch, list))) {
 		list_del(&watch->list);
+		domain_memory_add_nochk(conn->id, -strlen(watch->node) -
+						  strlen(watch->token));
 		talloc_free(watch);
 		domain_watch_dec(conn);
 	}
@@ -373,7 +380,7 @@ void read_state_watch(const void *ctx, const void *state)
 	if (!path)
 		barf("allocation error for read watch");
 
-	if (!add_watch(conn, path, token, relative))
+	if (!add_watch(conn, path, token, relative, true))
 		barf("error adding watch");
 }
 
From 3c9de01807e23fe7d11fbf0cab64f4274d260e91 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add memory accounting for nodes

Add the memory accounting for Xenstore nodes. To keep this from getting
too complicated, allow for some sloppiness when writing nodes. Any hard
quota violation will result in no further requests being accepted.

This is part of XSA-326 / CVE-2022-42315.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
index 1a5ba4aba839..f7f1e00c715b 100644
--- a/tools/xenstore/xenstored_core.c
+++ b/tools/xenstore/xenstored_core.c
@@ -587,6 +587,117 @@ void set_tdb_key(const char *name, TDB_DATA *key)
 	key->dsize = strlen(name);
 }
 
+static void get_acc_data(TDB_DATA *key, struct node_account_data *acc)
+{
+	TDB_DATA old_data;
+	struct xs_tdb_record_hdr *hdr;
+
+	if (acc->memory < 0) {
+		old_data = tdb_fetch(tdb_ctx, *key);
+		/* No check for error, as the node might not exist. */
+		if (old_data.dptr == NULL) {
+			acc->memory = 0;
+		} else {
+			hdr = (void *)old_data.dptr;
+			acc->memory = old_data.dsize;
+			acc->domid = hdr->perms[0].id;
+		}
+		talloc_free(old_data.dptr);
+	}
+}
+
+/*
+ * Per-transaction nodes need to be accounted for the transaction owner.
+ * Those nodes are stored in the data base with the transaction generation
+ * count prepended (e.g. 123/local/domain/...). So testing for the node's
+ * key not to start with "/" is sufficient.
+ */
+static unsigned int get_acc_domid(struct connection *conn, TDB_DATA *key,
+				  unsigned int domid)
+{
+	return (!conn || key->dptr[0] == '/') ? domid : conn->id;
+}
+
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check)
+{
+	struct xs_tdb_record_hdr *hdr = (void *)data->dptr;
+	struct node_account_data old_acc = {};
+	unsigned int old_domid, new_domid;
+	int ret;
+
+	if (!acc)
+		old_acc.memory = -1;
+	else
+		old_acc = *acc;
+
+	get_acc_data(key, &old_acc);
+	old_domid = get_acc_domid(conn, key, old_acc.domid);
+	new_domid = get_acc_domid(conn, key, hdr->perms[0].id);
+
+	/*
+	 * Don't check for ENOENT, as we want to be able to switch orphaned
+	 * nodes to new owners.
+	 */
+	if (old_acc.memory)
+		domain_memory_add_nochk(old_domid,
+					-old_acc.memory - key->dsize);
+	ret = domain_memory_add(new_domid, data->dsize + key->dsize,
+				no_quota_check);
+	if (ret) {
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		return ret;
+	}
+
+	/* TDB should set errno, but doesn't even set ecode AFAICT. */
+	if (tdb_store(tdb_ctx, *key, *data, TDB_REPLACE) != 0) {
+		domain_memory_add_nochk(new_domid, -data->dsize - key->dsize);
+		/* Error path, so no quota check. */
+		if (old_acc.memory)
+			domain_memory_add_nochk(old_domid,
+						old_acc.memory + key->dsize);
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc) {
+		/* Don't use new_domid, as it might be a transaction node. */
+		acc->domid = hdr->perms[0].id;
+		acc->memory = data->dsize;
+	}
+
+	return 0;
+}
+
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc)
+{
+	struct node_account_data tmp_acc;
+	unsigned int domid;
+
+	if (!acc) {
+		acc = &tmp_acc;
+		acc->memory = -1;
+	}
+
+	get_acc_data(key, acc);
+
+	if (tdb_delete(tdb_ctx, *key)) {
+		errno = EIO;
+		return errno;
+	}
+
+	if (acc->memory) {
+		domid = get_acc_domid(conn, key, acc->domid);
+		domain_memory_add_nochk(domid, -acc->memory - key->dsize);
+	}
+
+	return 0;
+}
+
 /*
  * If it fails, returns NULL and sets errno.
  * Temporary memory allocations will be done with ctx.
@@ -640,9 +751,15 @@ struct node *read_node(struct connection *conn, const void *ctx,
 
 	/* Permissions are struct xs_permissions. */
 	node->perms.p = hdr->perms;
+	node->acc.domid = node->perms.p[0].id;
+	node->acc.memory = data.dsize;
 	if (domain_adjust_node_perms(conn, node))
 		goto error;
 
+	/* If owner is gone reset currently accounted memory size. */
+	if (node->acc.domid != node->perms.p[0].id)
+		node->acc.memory = 0;
+
 	/* Data is binary blob (usually ascii, no nul). */
 	node->data = node->perms.p + hdr->num_perms;
 	/* Children is strings, nul separated. */
@@ -711,12 +828,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node,
 	p += node->datalen;
 	memcpy(p, node->children, node->childlen);
 
-	/* TDB should set errno, but doesn't even set ecode AFAICT. */
-	if (tdb_store(tdb_ctx, *key, data, TDB_REPLACE) != 0) {
-		corrupt(conn, "Write of %s failed", key->dptr);
-		errno = EIO;
-		return errno;
-	}
+	if (do_tdb_write(conn, key, &data, &node->acc, no_quota_check))
+		return EIO;
+
 	return 0;
 }
 
@@ -1218,7 +1332,7 @@ static void delete_node_single(struct connection *conn, struct node *node)
 	if (access_node(conn, node, NODE_ACCESS_DELETE, &key))
 		return;
 
-	if (tdb_delete(tdb_ctx, key) != 0) {
+	if (do_tdb_delete(conn, &key, &node->acc) != 0) {
 		corrupt(conn, "Could not delete '%s'", node->name);
 		return;
 	}
@@ -1291,6 +1405,7 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	/* No children, no data */
 	node->children = node->data = NULL;
 	node->childlen = node->datalen = 0;
+	node->acc.memory = 0;
 	node->parent = parent;
 	return node;
 
@@ -1299,17 +1414,17 @@ static struct node *construct_node(struct connection *conn, const void *ctx,
 	return NULL;
 }
 
-static void destroy_node_rm(struct node *node)
+static void destroy_node_rm(struct connection *conn, struct node *node)
 {
 	if (streq(node->name, "/"))
 		corrupt(NULL, "Destroying root node!");
 
-	tdb_delete(tdb_ctx, node->key);
+	do_tdb_delete(conn, &node->key, &node->acc);
 }
 
 static int destroy_node(struct connection *conn, struct node *node)
 {
-	destroy_node_rm(node);
+	destroy_node_rm(conn, node);
 	domain_entry_dec(conn, node);
 
 	/*
@@ -1361,7 +1476,7 @@ static struct node *create_node(struct connection *conn, const void *ctx,
 		/* Account for new node */
 		if (i->parent) {
 			if (domain_entry_inc(conn, i)) {
-				destroy_node_rm(i);
+				destroy_node_rm(conn, i);
 				return NULL;
 			}
 		}
@@ -2270,7 +2385,7 @@ static int clean_store_(TDB_CONTEXT *tdb, TDB_DATA key, TDB_DATA val,
 	if (!hashtable_search(reachable, name)) {
 		log("clean_store: '%s' is orphaned!", name);
 		if (recovery) {
-			tdb_delete(tdb, key);
+			do_tdb_delete(NULL, &key, NULL);
 		}
 	}
 
@@ -3122,6 +3237,7 @@ void read_state_node(const void *ctx, const void *state)
 	if (!node)
 		barf("allocation error restoring node");
 
+	node->acc.memory = 0;
 	node->name = name;
 	node->generation = ++generation;
 	node->datalen = sn->data_len;
diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index ec52d8d3ff03..031a8213586c 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -176,6 +176,11 @@ struct node_perms {
 	struct xs_permissions *p;
 };
 
+struct node_account_data {
+	unsigned int domid;
+	int memory;		/* -1 if unknown */
+};
+
 struct node {
 	const char *name;
 	/* Key used to update TDB */
@@ -198,6 +203,9 @@ struct node {
 	/* Children, each nul-terminated. */
 	unsigned int childlen;
 	char *children;
+
+	/* Allocation information for node currently in store. */
+	struct node_account_data acc;
 };
 
 /* Return the only argument in the input. */
@@ -301,6 +309,10 @@ extern xengnttab_handle **xgt_handle;
 int remember_string(struct hashtable *hash, const char *str);
 
 void set_tdb_key(const char *name, TDB_DATA *key);
+int do_tdb_write(struct connection *conn, TDB_DATA *key, TDB_DATA *data,
+		 struct node_account_data *acc, bool no_quota_check);
+int do_tdb_delete(struct connection *conn, TDB_DATA *key,
+		  struct node_account_data *acc);
 
 void conn_free_buffered_data(struct connection *conn);
 
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index 7bd41eb475e3..ace9a11d77bb 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -153,6 +153,9 @@ struct transaction
 	/* List of all transactions active on this connection. */
 	struct list_head list;
 
+	/* Connection this transaction is associated with. */
+	struct connection *conn;
+
 	/* Connection-local identifier for this transaction. */
 	uint32_t id;
 
@@ -286,6 +289,8 @@ int access_node(struct connection *conn, struct node *node,
 
 		introduce = true;
 		i->ta_node = false;
+		/* acc.memory < 0 means "unknown, get size from TDB". */
+		node->acc.memory = -1;
 
 		/*
 		 * Additional transaction-specific node for read type. We only
@@ -410,11 +415,11 @@ static int finalize_transaction(struct connection *conn,
 					goto err;
 				hdr = (void *)data.dptr;
 				hdr->generation = ++generation;
-				ret = tdb_store(tdb_ctx, key, data,
-						TDB_REPLACE);
+				ret = do_tdb_write(conn, &key, &data, NULL,
+						   true);
 				talloc_free(data.dptr);
 			} else {
-				ret = tdb_delete(tdb_ctx, key);
+				ret = do_tdb_delete(conn, &key, NULL);
 			}
 			if (ret)
 				goto err;
@@ -425,7 +430,7 @@ static int finalize_transaction(struct connection *conn,
 			}
 		}
 
-		if (i->ta_node && tdb_delete(tdb_ctx, ta_key))
+		if (i->ta_node && do_tdb_delete(conn, &ta_key, NULL))
 			goto err;
 		list_del(&i->list);
 		talloc_free(i);
@@ -453,7 +458,7 @@ static int destroy_transaction(void *_transaction)
 							       i->node);
 			if (trans_name) {
 				set_tdb_key(trans_name, &key);
-				tdb_delete(tdb_ctx, key);
+				do_tdb_delete(trans->conn, &key, NULL);
 			}
 		}
 		list_del(&i->list);
@@ -497,6 +502,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in)
 
 	INIT_LIST_HEAD(&trans->accessed);
 	INIT_LIST_HEAD(&trans->changed_domains);
+	trans->conn = conn;
 	trans->fail = false;
 	trans->generation = ++generation;
 
From 1268b81309928e8f59a5358c8d0b0b50074b5eef Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add exports for quota variables

Some quota variables are not exported via header files. Declare them in
xenstored_core.h and drop the local extern declarations.

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>

diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
index 031a8213586c..f7c37fe3b565 100644
--- a/tools/xenstore/xenstored_core.h
+++ b/tools/xenstore/xenstored_core.h
@@ -273,6 +273,11 @@ extern TDB_CONTEXT *tdb_ctx;
 extern int dom0_domid;
 extern int dom0_event;
 extern int priv_domid;
+extern int quota_nb_watch_per_domain;
+extern int quota_max_transaction;
+extern int quota_max_entry_size;
+extern int quota_nb_perms_per_node;
+extern int quota_max_path_len;
 extern int quota_nb_entry_per_domain;
 extern int quota_req_outstanding;
 extern int quota_trans_nodes;
diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c
index ace9a11d77bb..28774813de83 100644
--- a/tools/xenstore/xenstored_transaction.c
+++ b/tools/xenstore/xenstored_transaction.c
@@ -175,7 +175,6 @@ struct transaction
 	bool fail;
 };
 
-extern int quota_max_transaction;
 uint64_t generation;
 
 static struct accessed_node *find_accessed_node(struct transaction *trans,
diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c
index fdf9b2d653a0..85362bcce314 100644
--- a/tools/xenstore/xenstored_watch.c
+++ b/tools/xenstore/xenstored_watch.c
@@ -31,8 +31,6 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 
-extern int quota_nb_watch_per_domain;
-
 struct watch
 {
 	/* Watches on this connection */
From e902946fb0e432b4cbb58f2e99ab562fe100bd75 Mon Sep 17 00:00:00 2001
From: Juergen Gross <jgross@suse.com>
Date: Tue, 13 Sep 2022 07:35:10 +0200
Subject: tools/xenstore: add control command for setting and showing quota

Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain

Note that if the new quota is lower than the existing one, Xenstored
may continue to handle requests from a domain exceeding the new limit
(depending on which quota has been exceeded), and the amount of
resources already in use will not change. However, the domain will not
be able to create more of the resource associated with the quota until
its usage is back below the limit.
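
The effect of lowering a quota can be sketched as follows. This is a toy
model, not the xenstored implementation: the toy_* names, the use of a
watch counter as the example resource, and the E2BIG return value are
assumptions chosen for illustration. Setting the quota only changes the
limit; existing usage is left alone, and new allocations are refused until
usage has dropped below the new limit.

```c
#include <assert.h>
#include <errno.h>

struct toy_domain {
	int nbwatch;			/* watches currently registered */
};

static int toy_quota_watches = 8;	/* hypothetical initial limit */

/* Set a quota: only the limit changes, current usage is not trimmed. */
static int toy_quota_set(int *quota, int val)
{
	if (val < 1)
		return EINVAL;
	*quota = val;
	return 0;
}

/* Register one more watch, refused while usage is at or above the limit. */
static int toy_watch_add(struct toy_domain *d)
{
	if (d->nbwatch >= toy_quota_watches)
		return E2BIG;		/* assumed error code for the sketch */
	d->nbwatch++;
	return 0;
}

/* Remove one watch; always allowed, this is how a domain gets below quota. */
static void toy_watch_del(struct toy_domain *d)
{
	assert(d->nbwatch > 0);
	d->nbwatch--;
}
```

With six watches registered and the limit lowered to four, the six watches
stay in place; toy_watch_add() only succeeds again once the domain has
removed enough watches to be below four.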

This is part of XSA-326.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>

diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt
index 4bc262fd5db1..988ef89cba2d 100644
--- a/docs/misc/xenstore.txt
+++ b/docs/misc/xenstore.txt
@@ -410,6 +410,17 @@ CONTROL			<command>|[<parameters>|]
 	print|<string>
 		print <string> to syslog (xenstore runs as daemon) or
 		to console (xenstore runs as stubdom)
+	quota|[set <name> <val>|<domid>]
+		without parameters: print the current quota settings
+		with "set <name> <val>": set the quota <name> to new value
+		<val> (The admin should make sure all the domain usage is
+		below the quota. If it is not, then Xenstored may continue to
+		handle requests from the domain as long as the resource
+		violating the new quota setting isn't increased further)
+		with "<domid>": print quota related accounting data for
+		the domain <domid>
+	quota-soft|[set <name> <val>]
+		like the "quota" command, but for soft-quota.
 	help			<supported-commands>
 		return list of supported commands for CONTROL
 
diff --git a/tools/xenstore/xenstored_control.c b/tools/xenstore/xenstored_control.c
index 61bcbc069d75..264bb39d7b0e 100644
--- a/tools/xenstore/xenstored_control.c
+++ b/tools/xenstore/xenstored_control.c
@@ -196,6 +196,115 @@ static int do_control_log(void *ctx, struct connection *conn,
 	return 0;
 }
 
+struct quota {
+	const char *name;
+	int *quota;
+	const char *descr;
+};
+
+static const struct quota hard_quotas[] = {
+	{ "nodes", &quota_nb_entry_per_domain, "Nodes per domain" },
+	{ "watches", &quota_nb_watch_per_domain, "Watches per domain" },
+	{ "transactions", &quota_max_transaction, "Transactions per domain" },
+	{ "outstanding", &quota_req_outstanding,
+		"Outstanding requests per domain" },
+	{ "transaction-nodes", &quota_trans_nodes,
+		"Max. number of accessed nodes per transaction" },
+	{ "memory", &quota_memory_per_domain_hard,
+		"Total Xenstore memory per domain (error level)" },
+	{ "node-size", &quota_max_entry_size, "Max. size of a node" },
+	{ "path-max", &quota_max_path_len, "Max. length of a node path" },
+	{ "permissions", &quota_nb_perms_per_node,
+		"Max. number of permissions per node" },
+	{ NULL, NULL, NULL }
+};
+
+static const struct quota soft_quotas[] = {
+	{ "memory", &quota_memory_per_domain_soft,
+		"Total Xenstore memory per domain (warning level)" },
+	{ NULL, NULL, NULL }
+};
+
+static int quota_show_current(const void *ctx, struct connection *conn,
+			      const struct quota *quotas)
+{
+	char *resp;
+	unsigned int i;
+
+	resp = talloc_strdup(ctx, "Quota settings:\n");
+	if (!resp)
+		return ENOMEM;
+
+	for (i = 0; quotas[i].quota; i++) {
+		resp = talloc_asprintf_append(resp, "%-17s: %8d %s\n",
+					      quotas[i].name, *quotas[i].quota,
+					      quotas[i].descr);
+		if (!resp)
+			return ENOMEM;
+	}
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
+static int quota_set(const void *ctx, struct connection *conn,
+		     char **vec, int num, const struct quota *quotas)
+{
+	unsigned int i;
+	int val;
+
+	if (num != 2)
+		return EINVAL;
+
+	val = atoi(vec[1]);
+	if (val < 1)
+		return EINVAL;
+
+	for (i = 0; quotas[i].quota; i++) {
+		if (!strcmp(vec[0], quotas[i].name)) {
+			*quotas[i].quota = val;
+			send_ack(conn, XS_CONTROL);
+			return 0;
+		}
+	}
+
+	return EINVAL;
+}
+
+static int quota_get(const void *ctx, struct connection *conn,
+		     char **vec, int num)
+{
+	if (num != 1)
+		return EINVAL;
+
+	return domain_get_quota(ctx, conn, atoi(vec[0]));
+}
+
+static int do_control_quota(void *ctx, struct connection *conn,
+			    char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, hard_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, hard_quotas);
+
+	return quota_get(ctx, conn, vec, num);
+}
+
+static int do_control_quota_s(void *ctx, struct connection *conn,
+			      char **vec, int num)
+{
+	if (num == 0)
+		return quota_show_current(ctx, conn, soft_quotas);
+
+	if (!strcmp(vec[0], "set"))
+		return quota_set(ctx, conn, vec + 1, num - 1, soft_quotas);
+
+	return EINVAL;
+}
+
 #ifdef __MINIOS__
 static int do_control_memreport(void *ctx, struct connection *conn,
 				char **vec, int num)
@@ -847,6 +956,8 @@ static struct cmd_s cmds[] = {
 	{ "memreport", do_control_memreport, "[<file>]" },
 #endif
 	{ "print", do_control_print, "<string>" },
+	{ "quota", do_control_quota, "[set <name> <val>|<domid>]" },
+	{ "quota-soft", do_control_quota_s, "[set <name> <val>]" },
 	{ "help", do_control_help, "" },
 };
 
diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index 42423808863f..983b348ee59c 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -31,6 +31,7 @@
 #include "xenstored_domain.h"
 #include "xenstored_transaction.h"
 #include "xenstored_watch.h"
+#include "xenstored_control.h"
 
 #include <xenevtchn.h>
 #include <xenctrl.h>
@@ -345,6 +346,38 @@ static struct domain *find_domain_struct(unsigned int domid)
 	return NULL;
 }
 
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid)
+{
+	struct domain *d = find_domain_struct(domid);
+	char *resp;
+	int ta;
+
+	if (!d)
+		return ENOENT;
+
+	ta = d->conn ? d->conn->transaction_started : 0;
+	resp = talloc_asprintf(ctx, "Domain %u:\n", domid);
+	if (!resp)
+		return ENOMEM;
+
+#define ent(t, e) \
+	resp = talloc_asprintf_append(resp, "%-16s: %8d\n", #t, e); \
+	if (!resp) return ENOMEM
+
+	ent(nodes, d->nbentry);
+	ent(watches, d->nbwatch);
+	ent(transactions, ta);
+	ent(outstanding, d->nboutstanding);
+	ent(memory, d->memory);
+
+#undef ent
+
+	send_reply(conn, XS_CONTROL, resp, strlen(resp) + 1);
+
+	return 0;
+}
+
 static struct domain *alloc_domain(const void *context, unsigned int domid)
 {
 	struct domain *domain;
diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h
index d342e5e867ed..5b86a92e1b5b 100644
--- a/tools/xenstore/xenstored_domain.h
+++ b/tools/xenstore/xenstored_domain.h
@@ -88,6 +88,8 @@ int domain_watch(struct connection *conn);
 void domain_outstanding_inc(struct connection *conn);
 void domain_outstanding_dec(struct connection *conn);
 void domain_outstanding_domid_dec(unsigned int domid);
+int domain_get_quota(const void *ctx, struct connection *conn,
+		     unsigned int domid);
 
 /* Special node permission handling. */
 int set_perms_special(struct connection *conn, const char *name,