Request for Feedback on Design Issue with Systemd and "Consistent Network Device Naming"
Simon Foley
2021-04-21 09:13:41 UTC
Hi all,

    I wonder if you can help. I'm trying to find a contact in systemd
dev who has been involved in the "Consistent Network Device Naming"
initiative.

As an HPC compute architect I was surprised to come across some changes
in RHEL8 during testing that seem to originate from systemd work.

While I applaud the initiative, I think that there has been some
fundamental oversight on real world use cases for network device management.

Rather than create a more *consistent* OS environment for applications,
the implementation will, in the real world, make the environment
fundamentally more confusing and divergent for users. More importantly, for
commercial businesses there will be a $ impact on managing the
changes in the data center, and people will be pushed to invalidate their
commercial support by disabling the feature via the kernel boot argument
net.ifnames=0.


### PROBLEM ###

The issue is around the deprecation of support for the HWADDR=
argument in the ifcfg files (RHEL, other distros are available).

This feature is used in the real world to migrate device names around
physical NIC cards and ports in *order to create a more consistent
environment* for application users in multi homed servers. In HPC one of
the challenges we face is the fact that our server farms are depreciated
over 3-5 years and during that time capacity expansions mean we don't
have 100% consistent hardware, especially when it comes to NIC
implementation.  Dedicated on-board NICs, discrete PCIe NIC cards and server
flex-LOM/riser cards, along with their firmware, change constantly with each
version iteration. This means that the systemd project can never
control the following choices made by server HW manufacturers:

1. PCIe implementations and lane allocations to specific slots on the
motherboard.
2. Decisions on the number of "on board" chipset NICs (typically RJ45
1GbE, though I'm sure we will soon see 10GbE SFP+ becoming the norm)
3. Default "FlexLom/Riser" cards (can vary from 2 to 4 1GbE Ports or 1
to 2 x 10GbE SFP+ Ports)
4. The number of ports (RJ45 and SFP+) that NIC manufacturers put on
their cards by default as their model iterations increase.
5. Firmware changes on NIC cards that can affect the order of the
initialization of ports on the PCIe bus for each CPU.
6. The OEM relationships that server manufacturers have with NIC makers,
where on-board and FlexLOM NIC chipsets change regularly with each base
revision (Broadcom, Realtek, Intel etc.)

Now in HPC one of the biggest challenges we face is to maximize
performance on the increasing number of compute cores we get per socket,
while maximizing efficiency and lowering latency.
In order to do this a common approach (see attached diagrams for use
cases) is to separate data flows into an ingress and egress paradigm.
We can then use multi-homed servers with discrete high-performance PCIe
NICs that exploit the full bandwidth of 16 lanes going directly into a processor.
Dual-socket servers then allow us to split the compute data flows into
reader and writer threads and dedicate a processor, DDR RAM banks, and a
NIC card to each thread type.
Typically a sweet spot is a Dual Socket white box server where HPC
Designers in the OS Space target interfaces for functional roles

Processor 0 ->  PCIe Slot 1 (Full 16 lane) => Ingress Threads.
Processor 1 ->  PCIe Slot 4 (Full 16 lane) => Egress Threads.

Now because of all the issues listed (1-6) we can *never* guarantee
which interface device name Linux will allocate to these key NIC ports.
And yet we want to create a consistent environment for the application
team to know which processor and interface they need to pin their
processes to.
They need to know this in order to minimize memory NUMA latency and
irrelevant NIC interrupts.
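
To make the pinning step concrete, here is a minimal Python sketch of how an
application could look up the CPUs that sit on the same NUMA node as a given
NIC and restrict itself to them. It assumes a PCI NIC on Linux and reads the
standard sysfs files /sys/class/net/<iface>/device/numa_node and
/sys/devices/system/node/node<N>/cpulist. The function names and the interface
name "feed0" are hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Expand a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_local_to(iface: str, sysfs: Path = Path("/sys")) -> set[int]:
    """Return the CPUs on the NIC's NUMA node, or an empty set if it is unknown."""
    node_file = sysfs / "class/net" / iface / "device/numa_node"
    node = int(node_file.read_text().strip())
    if node < 0:  # the kernel reports -1 when it has no NUMA information
        return set()
    cpulist = sysfs / "devices/system/node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


def pin_current_process_to(iface: str) -> None:
    """Restrict the calling process to the CPUs local to the given NIC."""
    cpus = cpus_local_to(iface)
    if cpus:  # leave the affinity untouched when locality is unknown
        os.sched_setaffinity(0, cpus)
```

A reader process would call something like pin_current_process_to("feed0")
before starting its threads. This only works as a stable recipe if "feed0"
always refers to the same physical role on every server, which is exactly the
naming consistency being asked for here.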

How HPC architects try to help sysadmins and application teams in the
process is to have post build modifications.
Here we can use the HWADDR= variable in the ifcfg-[device name] files to
move a *specific* device name to these targeted NIC cards and ports.
This way application teams can always associate a *specific* device
name with a specific functional purpose (Feed, Backbone, Access) and know
where to tie their reader and writer threads.
Also we can always standardize that a given interface is always the
"default route" interface for a specific server blueprint.

It would appear in RHEL8 that due to systemd the HWADDR= is no longer
supported and we have lost this fundamentally important feature.

### REQUIREMENT ###

Sysadmins and HPC designers need a supported way to swap or move
kernel-allocated device names around the physical NIC cards and ports to
create consistent compute environments.
The HWADDR= solution was a rather brutal but effective way of achieving
this, but it would appear that this is no longer supported in systemd.
A better solution would be support for the user to define unique
device names for NIC interfaces so they can be more explicit in
their naming conventions.


e.g.

Ethernet:     enofeed1.0, enofeed1.1, enoback1.0, enoback1.1,
enoaccess1.0, enoaccess1.1,
Infiniband:     ibofeed1.0, ibofeed1.1, iboback1.0, iboback1.1,
iboaccess1.0, iboaccess1.1,

### THE FUTURE ###
The industry is moving compute *closer to the network*,
and NIC cards now have FPGAs, DDR memory banks, GPUs and many-core
processors all integrated on the PCB attached to the PCIe slot. The Linux kernel needs
to enable sysadmins and HPC architects to create consistent compute
environments across heterogeneous server environments.

Who can I discuss these design issues with in the systemd space ?

Yours Sincerely
Axel
Andrei Borzenkov
2021-04-21 09:57:57 UTC
On Wed, Apr 21, 2021 at 12:20 PM Simon Foley <***@simonfoley.net> wrote:
...
> Here we can use the HWADDR= variable in the ifcfg-[device name] files to
> move a *specific* device name to these targeted NIC cards and ports.

Read man systemd.link.

[Match]
MACAddress= (or PermanentMACAddress=)

[Link]
Name=whatever-name-you-need
Lennart Poettering
2021-04-21 10:07:23 UTC
> The issue is around the deprecation of the support for the HWADDR= argument
> in the ifcfg files (RHEL, other distros are available).

systemd upstream is not involved in that. ifcfg is specific to Red Hat
distributions and systemd doesn't mandate the concept to be
deprecated. It doesn't support them natively, but there's no need to.

The .link concept systemd provides is more powerful and works across
distributions. You can use that to name your interfaces by MAC
address, it's very well supported.

> How HPC architects try to help sysadmins and application teams in the
> process is to have post build modifications.
> Here we can use the HWADDR= variable in the ifcfg-[device name] files to
> move a *specific* device name to these targeted NIC cards and ports.

systemd doesn't stop you from doing that.

It provides a more generic way to do this via .link files, but from
systemd's PoV you don't have to migrate if you don't want to.

You could easily write a conversion script btw, that takes your ifcfg
files and converts them to .link files in /run, if you like.
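
A minimal sketch of such a conversion script, in Python, might look like the
following. It assumes the usual RHEL layout (ifcfg-* files under
/etc/sysconfig/network-scripts containing simple DEVICE= and HWADDR= lines) and
writes one .link file per interface into /run/systemd/network. The function
names and the "70-ifcfg-" file name prefix are my own choices for illustration,
and shell expansions inside ifcfg values are not handled.

```python
import shlex
from pathlib import Path


def parse_ifcfg(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines; shell quoting is handled, expansions are not."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        tokens = shlex.split(raw, comments=True)  # strips quotes and trailing comments
        values[key.strip()] = tokens[0] if tokens else ""
    return values


def link_unit(mac: str, name: str) -> str:
    """Render a .link file that names the interface with this MAC address."""
    # Lowercasing the MAC is only a cosmetic choice made in this sketch.
    return f"[Match]\nMACAddress={mac.lower()}\n\n[Link]\nName={name}\n"


def convert(ifcfg_dir: Path, out_dir: Path) -> list[Path]:
    """Write one .link file for every ifcfg file that sets both HWADDR and DEVICE."""
    written: list[Path] = []
    out_dir.mkdir(parents=True, exist_ok=True)
    for cfg in sorted(ifcfg_dir.glob("ifcfg-*")):
        values = parse_ifcfg(cfg.read_text())
        mac, name = values.get("HWADDR"), values.get("DEVICE")
        if not mac or not name:
            continue  # nothing to pin for this file
        target = out_dir / f"70-ifcfg-{name}.link"
        target.write_text(link_unit(mac, name))
        written.append(target)
    return written
```

Run as root, convert(Path("/etc/sysconfig/network-scripts"),
Path("/run/systemd/network")) would generate the files. Note that the glob also
matches backup copies such as ifcfg-eth0.bak, so a real script would want to
filter those out.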

> It would appear in RHEL8 that due to systemd the HWADDR= is no longer
> supported and we have lost this fundamentally important feature.

If RHEL deprecated this, that's a decision by RHEL, and the upstream
systemd project does not mandate anything in this area. It provides a
generic mechanism to do the same, but you can use whatever you want.

Anyway, the upstream systemd project is the wrong forum to discuss any
of this. You are apparently upset by a RHEL decision. While I
sympathize with the decision, it's not a decision the systemd project
took, but RHEL did, and technically nothing in systemd mandates
this.

Lennart

--
Lennart Poettering, Berlin
