22.11.2021 19:30, Eric Blake wrote:
> Reviving this thread, as a good as place as any for my question:
>
> On Thu, Jun 10, 2021 at 01:07:50PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Add an option for a thread to retry connection until succeeds. We'll
>> use nbd/client-connection both for reconnect and for initial connection
>> in nbd_open(), so we need a possibility to use same NBDClientConnection
>> instance to connect once in nbd_open() and then use retry semantics for
>> reconnect.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>> include/block/nbd.h | 2 ++
>> nbd/client-connection.c | 56 +++++++++++++++++++++++++++++++----------
>> 2 files changed, 45 insertions(+), 13 deletions(-)
>
>> NBDClientConnection *nbd_client_connection_new(const SocketAddress *saddr,
>> bool do_negotiation,
>> const char *export_name,
>> @@ -154,23 +164,43 @@ static void *connect_thread_func(void *opaque)
>> NBDClientConnection *conn = opaque;
>> int ret;
>> bool do_free;
>> + uint64_t timeout = 1;
>> + uint64_t max_timeout = 16;
>>
>> - conn->sioc = qio_channel_socket_new();
>> + while (true) {
>> + conn->sioc = qio_channel_socket_new();
>>
>> - error_free(conn->err);
>> - conn->err = NULL;
>> - conn->updated_info = conn->initial_info;
>> + error_free(conn->err);
>> + conn->err = NULL;
>> + conn->updated_info = conn->initial_info;
>>
>> - ret = nbd_connect(conn->sioc, conn->saddr,
>> - conn->do_negotiation ? &conn->updated_info : NULL,
>> - conn->tlscreds, &conn->ioc, &conn->err);
>
> This says that on each retry attempt, we reset whether to ask the
> server for structured replies back to our original initial_info
> values.
>
> But when dealing with NBD retries in general, I suspect we have a bug.
> Consider what happens if our first connection requests structured
> replies and base:allocation block status, and we are successful. But
> later, the server disconnects, triggering a retry. Suppose that on
> our retry, we encounter a different server that no longer supports
> structured replies. We would no longer be justified in sending
> NBD_CMD_BLOCK_STATUS requests to the reconnected server. But I can't
> find anywhere in the code base that ensures that on a reconnect, the
> new server supplies at least as many extensions as the original
> server, nor anywhere that we would be able to gracefully handle an
> in-flight block status command that can no longer be successfully
> continued because the reconnect landed on a downgraded server.
>
> In general, you don't expect a server to downgrade its capabilities
> across restarts, so assuming that a retried connection will hit a
> server at least as capable as the original server is typical, even if
> unsafe. But it is easy enough to use nbdkit to write a server that
> purposefully downgrades its abilities after the first client
> connection, for testing how qemu falls apart if it continues making
> assumptions about the current server based solely on what it learned
> prior to retrying from the first server.
>
> Is this something we need to address quickly for inclusion in 6.2?
> Maybe by having a retry connect fail if the new server does not have
> the same capabilities as the old? Do we also need to care about a
> server reporting a different size export than the old server?
>
Yes that's a problem. We previously noted it here https://lists.gnu.org/archive/html/qemu-block/2021-06/msg00458.html
Honestly, I didn't start any fix for that :(.. I agree, it would be good to fix it somehow in 6.2. I'll try to make something simple this week. Or did you already started doing some fix?
--
Best regards,
Vladimir