Hi,
On Mon, Sep 30, 2024 at 4:49 PM John Stoffel <john@stoffel.org> wrote:
>
> >>>>> "Alexander" == Alexander Aring <aahringo@redhat.com> writes:
>
> > Hi,
> > this patch-series is huge but brings a lot of basic "fun" net-namespace
> > functionality to DLM. Currently you need a couple of Linux kernel
>
> Please spell out TLAs like DLM the first time you use them. In this
> case I'm suer you mean Distributed Lock Manager, right?
>
Yes, DLM stands for Distributed Lock Manager; it currently lives in "fs/dlm".
> > instances running in e.g. Virtual Machines. With this patch-series I
> > want to break out of this virtual machine world dealing with multiple
> > kernels need to boot them all individually, etc. Now you can use DLM in
> > only one Linux kernel instance and each "node" (previously represented
> > by a virtual machine) is separate by a net-namespace. Why
> > net-namespaces? It just fits to the DLM design for now, you need to have
> > them anyway because the internal DLM socket handling on a per node
> > basis. What we do additionally is to separate the DLM lockspaces (the
> > lockspace that is being registered) by net-namespaces as this represents
> > a "network entity" (node). There might be reasons to introduce a
> > complete new kind of namespaces (locking namespace?) but I don't want to
> > do this step now and as I said net-namespaces are required anyway for
> > the DLM sockets.
>
> This section needs to be re-written to more clearly explain what
> you're trying to accomplish here, and how this is different or better
> than what went before. I realize you probably have this knowledge all
> internalized, but spelling it out in a clear and simple manner would
> be helpful to everyone.
>
Okay, I'll try my best next time.
Usually each node is a separate "network entity" with its own lockspace
instance, for example a virtual machine that runs its own Linux kernel
instance. With this series I separate the lockspaces by net-namespace
instead, so a single kernel can host several "network entities".
There might be a question whether DLM lockspaces should be separated by
net-namespace or whether yet another "locking" namespace should be
introduced. I don't want to take that step yet, as lockspaces are
separated by "network entity" anyway.
> > You need some new user space tooling as a new netlink net-namespace
> > aware UAPI is introduced (but can co-exist with configfs that operates
> > on init_net only). See [0] for more steps, there is a copr repo for the
> > new tooling and can be enabled by:
>
> What the heck is a 'copr'?
>
That is just a binary repo for rpm packages. Some users may find it handy.
>
> > $ dnf copr enable aring/nldlm
> > $ dnf install nldlm
>
> > or compile it yourself.
>
> These steps really entirely ignore the _why_ you would do this. And
> assume RedHad based systems.
>
That is correct. I will mention that those steps are only for those
specific systems.
> > Then there is currently a very simple script [1] to show a 3 nodes cluster
>
> nit: 3 node cluster
>
> > using gfs2 on a multiple loop block devices on a shared loop block device
> > image (sounds weird but I do something like that). There are currently
> > some user space synchronization issues that I solve by simple sleeps,
> > but they are only user space problems.
>
> Can you give the example on how to do this setup? Ideally in another
> patch which updates the Documentation/??? file to in the kernel tree.
>
https://gitlab.com/netcoder/gfs2ns-examples/-/blob/main/three_nodes
That is the script I referenced as [1]. Okay, I will move the examples
out of my separate repository and add them to Documentation/.
> > To test it I recommend some virtual machine "but only one" and run the
>
> I'm having a hard time parsing this, please be more careful with
> singular or plural usage. English is hard! :-)
>
> > [1] script. Afterwards you have in your executed net-namespace the 3
> > mountpoints /cluster/node1, /cluster/node2/ and /cluster/node3. Any vfs
> > operations on those mountpoints acts as a per node entity operation.
>
> Which means what? So if I write to /cluster/node1/foo, it shows up in
> the other two mount points? Or do I need to create a filesystem on
> top?
>
Now we are at a point where I think nobody has done it this way
before. I create a "fake" shared block device out of 3 block devices,
/dev/loop1, /dev/loop2 and /dev/loop3, which all point to the same
filesystem image. The gfs2 filesystem is then created on it only once.
Afterwards you can call mount for each block device from a process
context that runs in the "network entity" (net-namespace) assigned to
that device. The example script does one mount from each net-namespace,
and you can then access the per "network entity" mountpoints as
/cluster/node1, /cluster/node2 and /cluster/node3 in the net-namespace
context the script was executed in.
Yes, when you call touch /cluster/node1/foo it should show up in the
other mountpoints.
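To make the steps above more concrete, here is a rough dry-run sketch.
It is not the script from [1]; it only prints the commands. The image
path, the cluster and filesystem names, the function name print_setup
and the net-namespace names node1..nodeN under /var/run/netns are all
assumptions for illustration. Creating the net-namespaces and the DLM
configuration with the new tooling are left out on purpose.

```shell
#!/bin/sh
# Dry-run sketch: prints the setup commands instead of running them.
# IMG, CLUSTER, FSNAME and the netns names node1..nodeN are hypothetical.
IMG=${IMG:-/tmp/gfs2.img}
CLUSTER=${CLUSTER:-mycluster}
FSNAME=${FSNAME:-myfs}

print_setup() {
    nodes=$1

    # One image file, attached to one loop device per node.
    echo "truncate -s 1G $IMG"
    i=1
    while [ "$i" -le "$nodes" ]; do
        echo "losetup /dev/loop$i $IMG"
        i=$((i + 1))
    done

    # The filesystem is created only once, with one journal per node.
    echo "mkfs.gfs2 -p lock_dlm -t $CLUSTER:$FSNAME -j $nodes /dev/loop1"

    # Each mount is issued from a process inside that node's net-namespace.
    # nsenter only switches the net-namespace, so the mountpoints stay
    # visible in the mount namespace of the caller.
    i=1
    while [ "$i" -le "$nodes" ]; do
        echo "mkdir -p /cluster/node$i"
        echo "nsenter --net=/var/run/netns/node$i mount -t gfs2 /dev/loop$i /cluster/node$i"
        i=$((i + 1))
    done
}
```

Calling "print_setup 3" prints the command sequence for the 3 node case
so it can be reviewed before anything is run as root.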
> > We can use it for testing, development and also scale testing to have a
> > large number of nodes joining a lockspace (which seems to be a problem
> > right now). Instead of running 1000 vms, we can run 1000 net-namespaces
> > in a more resource limited environment. For me it seems gfs2 can handle
> > several mounts and still separate the resource according their global
> > variables. Their data structures e.g. glock hash seems to have in their
> > key a separation for that (fsid?). However this is still an experimental
> > feature we might run into issues that requires more separation related
> > to net-namespaces. However basic testing seems to run just fine.
>
> So is this all just to make testing and development easier so you
> don't need 10 or 1000 nodes to do stress testing? Would anyone use
> this in real life?
>
Stress testing maybe, easier development for sure. There are scaling
issues in the recovery handling at around 100 nodes, related to the
fact that DLM stops all lockspace activity when nodes join/leave. That
is something I want to look at once I hopefully have this patch series
upstream.
Another example is the DLM lock verifier [0], and I need to be careful
with the name "lock verifier". It verifies that only compatible lock
modes are in use at the same time on a per "network entity" basis.
This is the fundamental mechanism of DLM; if it does not work, DLM is
broken. We can do that now because we know the state of the whole
cluster. We can confirm with any payload that DLM works correctly.
For me, this alone is worth having this feature.
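To illustrate what "only compatible lock modes" means, here is a small
sketch of such a check. It is not the verifier from [0]; it only encodes
the usual DLM mode compatibility (NL, CR, CW, PR, PW, EX). The function
names and the "node:mode" input format are made up for this example.

```shell
#!/bin/sh
# modes_compatible A B: returns 0 if modes A and B may be granted on the
# same resource at the same time, 1 if they conflict, 2 on unknown modes.
modes_compatible() {
    for m in "$1" "$2"; do
        case "$m" in
            NL|CR|CW|PR|PW|EX) ;;
            *) return 2 ;;
        esac
    done
    case "$1:$2" in
        NL:*|*:NL) return 0 ;;   # NL is compatible with everything
        EX:*|*:EX) return 1 ;;   # EX only tolerates NL
        CR:*|*:CR) return 0 ;;   # CR tolerates everything except EX
        CW:CW|PR:PR) return 0 ;; # CW and PR only with themselves
        *) return 1 ;;           # CW/PR, anything with PW, PW/PW
    esac
}

# verify_granted node:mode...: checks all granted locks on one resource
# pairwise. Hypothetical input format, one unique entry per node.
verify_granted() {
    for a in "$@"; do
        for b in "$@"; do
            [ "$a" = "$b" ] && continue
            if ! modes_compatible "${a#*:}" "${b#*:}"; then
                echo "conflict: $a vs $b"
                return 1
            fi
        done
    done
    return 0
}
```

For example "verify_granted node1:PR node2:PR node3:CR" succeeds, while
"verify_granted node1:EX node2:PR" reports a conflict.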
For example, we can introduce a new sort of cluster file system
xfstests: touch /cluster/node1/foo and check if the file shows up as
/cluster/node2/foo. That is an easy example; sometimes we need to
synchronize vfs operations and check them on the other "network
entity". With this feature we no longer need to synchronize our
"testing script" over the network with other processes running on
other "network entities".
In real life there is maybe not an example yet. Maybe when people
start to use DLM for user space locking on a per container basis, but
this requires net-namespace aware user space locking functionality,
which is a future step.
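The touch-and-check test mentioned above could look roughly like this.
It is only a sketch; check_visible is a made-up helper and not part of
xfstests.

```shell
#!/bin/sh
# check_visible NAME DIR1 DIR2...: create NAME in DIR1 and expect it to
# show up in every other directory. Returns 0 on success, 1 otherwise.
check_visible() {
    name=$1
    first=$2
    shift 2

    touch "$first/$name" || return 1

    for dir in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "FAIL: $name not visible in $dir"
            return 1
        fi
    done
    echo "OK: $name visible on all nodes"
}
```

With the mountpoints from the example script this would be called as
"check_visible foo /cluster/node1 /cluster/node2 /cluster/node3", all
from a single process and without any synchronization over the network.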
> > Limitations
>
> > I disable any functionality for the DLM character device that allow
> > plock handling or do DLM locking from user space. Just don't use any
> > plock locking in gfs2 for now. But basic vfs operations should work. You
> > can even sniff DLM traffic on the created "dlmsw" virtual bridge.
>
> So... what functionality is exposed by this patchset? And Maybe add
> in an "Advantages" section to explain why this is so good.
>
Sure, it is important to mention that this net-namespace functionality
is experimental. If you use DLM without changing the net-namespace of
the process context, it should work as before; in this case there are
no limitations.
- Alex
[0] https://lore.kernel.org/gfs2/20240827180236.316946-1-aahringo@redhat.com/T/#t