[v2] dlm: net-namespace functionality

[PATCHv2 dlm/next 00/12] dlm: net-namespace functionality

Posted by Alexander Aring 1 year, 4 months ago

Hi,

this patch series is huge, but it brings a lot of basic "fun" net-namespace
functionality to DLM. Currently you need several Linux kernel instances,
running e.g. in virtual machines. With this patch series I want to break
out of this virtual machine world, where you deal with multiple kernels
and need to boot them all individually, etc. Now you can use DLM in a
single Linux kernel instance, and each "node" (previously represented by
a virtual machine) is separated by a net-namespace. Why net-namespaces?
They fit the DLM design for now, and you need them anyway because the
internal DLM socket handling works on a per-node basis. What we do
additionally is separate the DLM lockspaces (the lockspace that is being
registered) by net-namespace, as this represents a "network entity"
(node). There might be reasons to introduce a completely new kind of
namespace (locking namespace?), but I don't want to take this step now
and, as I said, net-namespaces are required anyway for the DLM sockets.

You need some new user space tooling, as a new net-namespace aware
netlink UAPI is introduced (it can co-exist with configfs, which operates
on init_net only). See [0] for more steps. There is a copr repo for the
new tooling, which can be enabled by:

$ dnf copr enable aring/nldlm
$ dnf install nldlm

or compile it yourself.

Then there is currently a very simple script [1] to show a 3 nodes cluster
using gfs2 on multiple loop block devices that all share one loop block
device image (sounds weird but I do something like that). There are
currently some user space synchronization issues that I work around with
simple sleeps, but they are only user space problems.

To test it I recommend some virtual machine "but only one" and run the
[1] script. Afterwards the net-namespace you executed it in has the 3
mountpoints /cluster/node1, /cluster/node2 and /cluster/node3. Any vfs
operation on one of those mountpoints acts as an operation of that node.

We can use it for testing, development and also scale testing, to have a
large number of nodes joining a lockspace (which seems to be a problem
right now). Instead of running 1000 VMs, we can run 1000 net-namespaces
in a more resource limited environment. To me it seems gfs2 can handle
several mounts and still separate the resources held in its global
variables. Its data structures, e.g. the glock hash, seem to have a
separation for that in their key (fsid?). However, this is still an
experimental feature and we might run into issues that require more
separation related to net-namespaces. Basic testing seems to run just
fine, though.

Limitations

I disable all functionality of the DLM character device that allows
plock handling or DLM locking from user space. Just don't use any plock
locking in gfs2 for now. Basic vfs operations should work, though. You
can even sniff DLM traffic on the created "dlmsw" virtual bridge.

- Alex

[0] https://gitlab.com/netcoder/nldlm
[1] https://gitlab.com/netcoder/gfs2ns-examples/-/blob/main/three_nodes

changes since v1:
 - move to ynl and introduce and use netlink yaml spec
 - put the nldlm.h DLM netlink header under UAPI directory
 - fix build issues with CONFIG_NET disabled
 - fix possible null pointer dereference if lookup of lockspace failed

Alexander Aring (12):
  dlm: introduce dlm_find_lockspace_name()
  dlm: disallow different configs nodeid storages
  dlm: add struct net to dlm_new_lockspace()
  dlm: handle port as __be16 network byte order
  dlm: use dlm_config as only cluster configuration
  dlm: dlm_config_info config fields to unsigned int
  dlm: rename config to configfs
  kobject: add kset_type_create_and_add() helper
  kobject: export generic helper ops
  dlm: separate dlm lockspaces per net-namespace
  dlm: add nldlm net-namespace aware UAPI
  gfs2: separate mount context by net-namespaces

 Documentation/netlink/specs/nldlm.yaml |  438 ++++++++
 drivers/md/md-cluster.c                |    3 +-
 fs/dlm/Makefile                        |    3 +
 fs/dlm/config.c                        | 1291 +++++++++--------------
 fs/dlm/config.h                        |  215 +++-
 fs/dlm/configfs.c                      |  882 ++++++++++++++++
 fs/dlm/configfs.h                      |   19 +
 fs/dlm/debug_fs.c                      |   24 +-
 fs/dlm/dir.c                           |    4 +-
 fs/dlm/dlm_internal.h                  |   24 +-
 fs/dlm/lock.c                          |   64 +-
 fs/dlm/lock.h                          |    3 +-
 fs/dlm/lockspace.c                     |  220 ++--
 fs/dlm/lockspace.h                     |   12 +-
 fs/dlm/lowcomms.c                      |  525 +++++-----
 fs/dlm/lowcomms.h                      |   29 +-
 fs/dlm/main.c                          |    5 -
 fs/dlm/member.c                        |   36 +-
 fs/dlm/midcomms.c                      |  287 ++---
 fs/dlm/midcomms.h                      |   31 +-
 fs/dlm/netlink2.c                      | 1330 ++++++++++++++++++++++++
 fs/dlm/nldlm-kernel.c                  |  290 ++++++
 fs/dlm/nldlm-kernel.h                  |   50 +
 fs/dlm/nldlm.c                         |  847 +++++++++++++++
 fs/dlm/plock.c                         |    2 +-
 fs/dlm/rcom.c                          |   16 +-
 fs/dlm/rcom.h                          |    3 +-
 fs/dlm/recover.c                       |   17 +-
 fs/dlm/user.c                          |   63 +-
 fs/dlm/user.h                          |    2 +-
 fs/gfs2/glock.c                        |    8 +
 fs/gfs2/incore.h                       |    2 +
 fs/gfs2/lock_dlm.c                     |    6 +-
 fs/gfs2/ops_fstype.c                   |    5 +
 fs/gfs2/sys.c                          |   35 +-
 fs/ocfs2/stack_user.c                  |    2 +-
 include/linux/dlm.h                    |    9 +-
 include/linux/kobject.h                |   10 +-
 include/uapi/linux/nldlm.h             |  153 +++
 lib/kobject.c                          |   65 +-
 40 files changed, 5566 insertions(+), 1464 deletions(-)
 create mode 100644 Documentation/netlink/specs/nldlm.yaml
 create mode 100644 fs/dlm/configfs.c
 create mode 100644 fs/dlm/configfs.h
 create mode 100644 fs/dlm/netlink2.c
 create mode 100644 fs/dlm/nldlm-kernel.c
 create mode 100644 fs/dlm/nldlm-kernel.h
 create mode 100644 fs/dlm/nldlm.c
 create mode 100644 include/uapi/linux/nldlm.h

-- 
2.43.0

Re: [PATCHv2 dlm/next 00/12] dlm: net-namespace functionality

Posted by John Stoffel 1 year, 4 months ago

>>>>> "Alexander" == Alexander Aring <aahringo@redhat.com> writes:

> Hi,
> this patch-series is huge but brings a lot of basic "fun" net-namespace
> functionality to DLM. Currently you need a couple of Linux kernel

Please spell out TLAs like DLM the first time you use them.  In this
case I'm sure you mean Distributed Lock Manager, right?

> instances running in e.g. Virtual Machines. With this patch-series I
> want to break out of this virtual machine world dealing with multiple
> kernels need to boot them all individually, etc. Now you can use DLM in
> only one Linux kernel instance and each "node" (previously represented
> by a virtual machine) is separate by a net-namespace. Why
> net-namespaces? It just fits to the DLM design for now, you need to have
> them anyway because the internal DLM socket handling on a per node
> basis. What we do additionally is to separate the DLM lockspaces (the
> lockspace that is being registered) by net-namespaces as this represents
> a "network entity" (node). There might be reasons to introduce a
> complete new kind of namespaces (locking namespace?) but I don't want to
> do this step now and as I said net-namespaces are required anyway for
> the DLM sockets.

This section needs to be re-written to more clearly explain what
you're trying to accomplish here, and how this is different or better
than what went before.  I realize you probably have this knowledge all
internalized, but spelling it out in a clear and simple manner would
be helpful to everyone.  

> You need some new user space tooling as a new netlink net-namespace
> aware UAPI is introduced (but can co-exist with configfs that operates
> on init_net only). See [0] for more steps, there is a copr repo for the
> new tooling and can be enabled by:

What the heck is a 'copr'?   


> $ dnf copr enable aring/nldlm
> $ dnf install nldlm

> or compile it yourself.

These steps entirely ignore _why_ you would do this, and they assume
Red Hat based systems.

> Then there is currently a very simple script [1] to show a 3 nodes cluster

nit: 3 node cluster

> using gfs2 on a multiple loop block devices on a shared loop block device
> image (sounds weird but I do something like that). There are currently
> some user space synchronization issues that I solve by simple sleeps,
> but they are only user space problems.

Can you give an example of how to do this setup?  Ideally in another
patch which updates the Documentation/??? file in the kernel tree.

> To test it I recommend some virtual machine "but only one" and run the

I'm having a hard time parsing this, please be more careful with
singular or plural usage.  English is hard!  :-)

> [1] script. Afterwards you have in your executed net-namespace the 3
> mountpoints /cluster/node1, /cluster/node2/ and /cluster/node3. Any vfs
> operations on those mountpoints acts as a per node entity operation.

Which means what?  So if I write to /cluster/node1/foo, it shows up in
the other two mount points?  Or do I need to create a filesystem on
top?   

> We can use it for testing, development and also scale testing to have a
> large number of nodes joining a lockspace (which seems to be a problem
> right now). Instead of running 1000 vms, we can run 1000 net-namespaces
> in a more resource limited environment. For me it seems gfs2 can handle
> several mounts and still separate the resource according their global
> variables. Their data structures e.g. glock hash seems to have in their
> key a separation for that (fsid?). However this is still an experimental
> feature we might run into issues that requires more separation related
> to net-namespaces. However basic testing seems to run just fine.

So is this all just to make testing and development easier so you
don't need 10 or 1000 nodes to do stress testing?  Would anyone use
this in real life?  

> Limitations

> I disable any functionality for the DLM character device that allow
> plock handling or do DLM locking from user space. Just don't use any
> plock locking in gfs2 for now. But basic vfs operations should work. You
> can even sniff DLM traffic on the created "dlmsw" virtual bridge.

So... what functionality is exposed by this patchset?  And maybe add
an "Advantages" section to explain why this is so good.

Thanks!
John


Re: [PATCHv2 dlm/next 00/12] dlm: net-namespace functionality

Posted by Alexander Aring 1 year, 4 months ago

Hi,

On Mon, Sep 30, 2024 at 4:49 PM John Stoffel <john@stoffel.org> wrote:
>
> >>>>> "Alexander" == Alexander Aring <aahringo@redhat.com> writes:
>
> > Hi,
> > this patch-series is huge but brings a lot of basic "fun" net-namespace
> > functionality to DLM. Currently you need a couple of Linux kernel
>
> Please spell out TLAs like DLM the first time you use them.  In this
> case I'm sure you mean Distributed Lock Manager, right?
>

Yes, DLM stands for Distributed Lock Manager, which currently lives in "fs/dlm".

> > instances running in e.g. Virtual Machines. With this patch-series I
> > want to break out of this virtual machine world dealing with multiple
> > kernels need to boot them all individually, etc. Now you can use DLM in
> > only one Linux kernel instance and each "node" (previously represented
> > by a virtual machine) is separate by a net-namespace. Why
> > net-namespaces? It just fits to the DLM design for now, you need to have
> > them anyway because the internal DLM socket handling on a per node
> > basis. What we do additionally is to separate the DLM lockspaces (the
> > lockspace that is being registered) by net-namespaces as this represents
> > a "network entity" (node). There might be reasons to introduce a
> > complete new kind of namespaces (locking namespace?) but I don't want to
> > do this step now and as I said net-namespaces are required anyway for
> > the DLM sockets.
>
> This section needs to be re-written to more clearly explain what
> you're trying to accomplish here, and how this is different or better
> than what went before.  I realize you probably have this knowledge all
> internalized, but spelling it out in a clear and simple manner would
> be helpful to everyone.
>

Okay, I'll try my best next time.

Usually each node has its own instance of a lockspace, because each node
is a different "network entity". With this series I use net-namespaces
to separate them, instead of building each "network entity" as a virtual
machine that runs its own Linux kernel instance.

There might be the question whether DLM lockspaces should be separated
by net-namespace, or whether yet another "locking" namespace should be
introduced. I don't want to take that step yet, as lockspaces are
separated by "network entity" anyway.

> > You need some new user space tooling as a new netlink net-namespace
> > aware UAPI is introduced (but can co-exist with configfs that operates
> > on init_net only). See [0] for more steps, there is a copr repo for the
> > new tooling and can be enabled by:
>
> What the heck is a 'copr'?
>

That is just a binary repo for rpm packages. Some users may find it handy.

>
> > $ dnf copr enable aring/nldlm
> > $ dnf install nldlm
>
> > or compile it yourself.
>
> These steps entirely ignore _why_ you would do this, and they assume
> Red Hat based systems.
>

That is correct. I will mention that those steps are only for those
specific systems.

> > Then there is currently a very simple script [1] to show a 3 nodes cluster
>
> nit: 3 node cluster
>
> > using gfs2 on a multiple loop block devices on a shared loop block device
> > image (sounds weird but I do something like that). There are currently
> > some user space synchronization issues that I solve by simple sleeps,
> > but they are only user space problems.
>
> Can you give an example of how to do this setup?  Ideally in another
> patch which updates the Documentation/??? file in the kernel tree.
>

https://gitlab.com/netcoder/gfs2ns-examples/-/blob/main/three_nodes

That is what I referenced as [1]. Okay, I will move the examples out of
my separate repository and add them to Documentation/.

> > To test it I recommend some virtual machine "but only one" and run the
>
> I'm having a hard time parsing this, please be more careful with
> singular or plural usage.  English is hard!  :-)
>
> > [1] script. Afterwards you have in your executed net-namespace the 3
> > mountpoints /cluster/node1, /cluster/node2/ and /cluster/node3. Any vfs
> > operations on those mountpoints acts as a per node entity operation.
>
> Which means what?  So if I write to /cluster/node1/foo, it shows up in
> the other two mount points?  Or do I need to create a filesystem on
> top?
>

Now we are at a point where I think nobody has done it this way before.
I create a "fake" shared block device out of 3 block devices,
/dev/loop1, /dev/loop2 and /dev/loop3, which all point to the same
filesystem image. Then I create the gfs2 filesystem on it only once.
Afterwards mount can be called for each block device from a process
context in the "network entity" that the device is "imagined" to be
assigned to. The example script does one mount from each net-namespace,
and you can access each per "network entity" mountpoint as
/cluster/node1, /cluster/node2 and /cluster/node3 from the context the
script was executed in.

Yes, when you call "touch /cluster/node1/foo" the file should show up in
the other mountpoints.
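
A rough sketch of that setup as a shell script, for illustration only.
It is not the three_nodes script from [1]: the image path, the
"testcluster:testfs" lock table name and the use of nsenter are
assumptions, and the DLM cluster configuration done with the nldlm
tooling is left out because its commands are not shown in this thread
(the gfs2 mounts will not succeed without it). By default the script
only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. IMG, the lock table name and nsenter are assumptions.
IMG=${IMG:-/tmp/gfs2-shared.img}
DRY_RUN=${DRY_RUN:-1}   # 1: only print commands, 0: execute (needs root)

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One net-namespace, one loop device and one mountpoint per "node".
# Every loop device is attached to the same backing image.
setup_node() {
    n=$1
    run ip netns add "node$n"
    run losetup "/dev/loop$n" "$IMG"
    run mkdir -p "/cluster/node$n"
}

# nsenter --net switches only the net-namespace, so the mount stays
# visible in the mount namespace of the caller.
mount_node() {
    n=$1
    run nsenter --net="/run/netns/node$n" \
        mount -t gfs2 "/dev/loop$n" "/cluster/node$n"
}

plan() {
    run truncate -s 1G "$IMG"
    for n in 1 2 3; do setup_node "$n"; done
    # The filesystem is created exactly once, with one journal per node.
    run mkfs.gfs2 -O -p lock_dlm -t testcluster:testfs -j 3 /dev/loop1
    for n in 1 2 3; do mount_node "$n"; done
}

plan
```

Running it with DRY_RUN=0 would execute the commands instead of printing
them, which needs root and the missing DLM configuration step.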

> > We can use it for testing, development and also scale testing to have a
> > large number of nodes joining a lockspace (which seems to be a problem
> > right now). Instead of running 1000 vms, we can run 1000 net-namespaces
> > in a more resource limited environment. For me it seems gfs2 can handle
> > several mounts and still separate the resource according their global
> > variables. Their data structures e.g. glock hash seems to have in their
> > key a separation for that (fsid?). However this is still an experimental
> > feature we might run into issues that requires more separation related
> > to net-namespaces. However basic testing seems to run just fine.
>
> So is this all just to make testing and development easier so you
> don't need 10 or 1000 nodes to do stress testing?  Would anyone use
> this in real life?
>

Stress testing maybe, easier development for sure. There are scaling
issues in the recovery handling with about ~100 nodes, related to the
fact that DLM stops all lockspace activity when nodes join or leave.
That is something I want to look at once this patch series is hopefully
upstream.

Another example is the DLM lock verifier [0], and I need to be careful
with the name "lock verifier". It verifies that only compatible lock
modes can be in use at the same time on a per "network entity" basis.
This is the fundamental mechanism of DLM; if it does not work, DLM is
broken. We can do this now because a single kernel knows the state of
the whole cluster, so we can confirm for any workload that DLM works
correctly. For me, this alone is worth having this feature.
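
To make "only compatible lock modes at the same time" concrete, here is
a small sketch of the standard DLM mode compatibility rules (NL, CR, CW,
PR, PW, EX) as a shell function, plus a pairwise check over the modes
granted on one resource. It is not the verifier from [0]; the function
names are made up for illustration and the inputs are assumed to be
valid mode names.

```shell
#!/bin/sh
# Returns 0 if two DLM lock modes may be granted at the same time.
dlm_modes_compatible() {
    case "$1:$2" in
        NL:*|*:NL) return 0 ;;    # null lock conflicts with nothing
        EX:*|*:EX) return 1 ;;    # exclusive conflicts with all but NL
        CR:*|*:CR) return 0 ;;    # concurrent read conflicts only with EX
        CW:CW|PR:PR) return 0 ;;  # same-mode sharing for CW and for PR
        *) return 1 ;;            # CW/PR, CW/PW, PR/PW, PW/PW
    esac
}

# Takes the modes granted on one resource, one argument per holder.
# Prints the first conflicting pair and returns 1, else returns 0.
verify_granted() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do
            if ! dlm_modes_compatible "$first" "$other"; then
                echo "conflict: $first vs $other"
                return 1
            fi
        done
    done
    return 0
}
```

For example, "verify_granted NL CR PR PR" succeeds, while
"verify_granted PR CR EX" reports a conflict between PR and EX.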

For example, we can introduce a new sort of xfstests for cluster file
systems: touch /cluster/node1/foo and check if the file shows up as
/cluster/node2/foo. That is an easy example; sometimes we need to
synchronize vfs operations and check them on the other "network
entity". With this feature we don't need to synchronize our "testing
script" over the network anymore with other processes running on other
"network entities".
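
A minimal sketch of such a check, assuming the mountpoints from the
example script exist. The helper name check_visible is hypothetical and
this is not an actual xfstests test case.

```shell
#!/bin/sh
# check_visible SRC_MNT DST_MNT NAME
# Creates NAME under SRC_MNT and checks that it is visible under DST_MNT.
# Returns 0 if visible, 1 if not, 2 if the file could not be created.
check_visible() {
    src=$1
    dst=$2
    name=$3
    touch "$src/$name" || return 2
    [ -e "$dst/$name" ]
}
```

With the three mounts in place this would be called as
"check_visible /cluster/node1 /cluster/node2 foo", all from one process
and without any synchronization over the network.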

In real life there is maybe no example yet. There might be one when
people start to use DLM for user space locking on a per-container basis,
but this requires net-namespace aware user space locking functionality,
which is a future step.

> > Limitations
>
> > I disable any functionality for the DLM character device that allow
> > plock handling or do DLM locking from user space. Just don't use any
> > plock locking in gfs2 for now. But basic vfs operations should work. You
> > can even sniff DLM traffic on the created "dlmsw" virtual bridge.
>
> So... what functionality is exposed by this patchset?  And maybe add
> an "Advantages" section to explain why this is so good.
>

Sure, it is important to mention that this net-namespace functionality
is experimental. If you use DLM without changing the net-namespace of
the process context, it should work as before; in this case there are no
limitations.

- Alex

[0] https://lore.kernel.org/gfs2/20240827180236.316946-1-aahringo@redhat.com/T/#t