[systemd-devel] Request for Feedback on Design Issue with Systemd and "Consistent Network Device Naming"

Simon Foley

2021-04-21 09:13:41 UTC

Hi all,

I wonder if you can help. I'm trying to find a contact in systemd
dev who has been involved in the "Consistent Network Device Naming"
initiative.

As an HPC compute architect I was surprised, when testing RHEL8, to come
across some changes that seem to originate from systemd work.

While I applaud the initiative, I think there has been a fundamental
oversight of real-world use cases for network device management.

Rather than create a more *consistent* OS environment for applications,
the implementation will, in the real world, make the environment
fundamentally more confusing and divergent for users. More importantly, for
commercial businesses there will be a $ impact on managing the
changes in the data center, and people will be forced to invalidate
commercial support by disabling the feature via the kernel boot argument
net.ifnames=0.

### PROBLEM ###

The issue is the deprecation of support for the HWADDR=
argument in the ifcfg files (RHEL; other distros are available).

This feature is used in the real world to move device names around
physical NIC cards and ports *in order to create a more consistent
environment* for application users on multi-homed servers. In HPC, one of
the challenges we face is that our server farms are depreciated
over 3-5 years, and during that time capacity expansions mean we don't
have 100% consistent hardware, especially when it comes to NIC
implementation. Dedicated on-board NICs, discrete PCIe NIC cards, server
FlexLOM/riser cards and their firmware are constantly changing with each
version iteration. The systemd project can never control the following
decisions made by server hardware manufacturers:

1. PCIe implementations and lane allocations to specific slots on the
motherboard.
2. Decisions on the number of "on board" chipset NICs (typically RJ45
1GbE, though I'm sure we will soon see 10GbE SFP+ becoming the norm).
3. Default "FlexLOM/Riser" cards (can vary from 2 to 4 1GbE ports or 1
to 2 x 10GbE SFP+ ports).
4. The number of ports (RJ45 and SFP+) that NIC manufacturers put on their
cards by default as their model iterations increase.
5. Firmware changes on NIC cards that can affect the order of
initialization of ports on the PCIe bus for each CPU.
6. The OEM relationships that server manufacturers have with NIC makers,
where on-board and FlexLOM NIC chipsets change regularly with each base
revision (Broadcom, Realtek, Intel, etc.).

Now in HPC one of the biggest challenges we face is to maximize
performance from the increasing number of compute cores we get per socket,
and to maximize efficiency and lower latency.
In order to do this, a common approach (see attached diagrams for use
cases) is to separate data flows into an ingress and egress paradigm.
We can then use multi-homed servers with discrete high-performance PCIe
NICs, each exploiting a full-bandwidth 16-lane slot wired directly into a
processor.
Dual-socket servers then allow us to split the compute data flows into
reader and writer threads and dedicate a processor, DDR RAM banks, and a
NIC card to each thread type.
Typically a sweet spot is a dual-socket white box server where HPC
designers in the OS space target interfaces for functional roles:

Processor 0 -> PCIe Slot 1 (Full 16 lane) => Ingress Threads.
Processor 1 -> PCIe Slot 4 (Full 16 lane) => Egress Threads.

Now because of all the issues listed (1-6) we can *never* guarantee
which interface device name Linux will allocate to these key NIC ports.
And yet we want to create a consistent environment for the application
team to know which processor and interface they need to pin their
processes to.
They need to know this in order to minimize NUMA memory latency and
avoid irrelevant NIC interrupts.
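
To make the pinning step concrete, here is a minimal sketch of how an
application team might look up the CPUs local to a NIC. It assumes the
kernel exposes /sys/class/net/<ifname>/device/numa_node and
/sys/devices/system/node/node<N>/cpulist on the server in question; the
function names and the sysfs_root parameter are my own, for illustration
only.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A single CPU ("8") has no "-", so last is empty and the range is one wide.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def nic_numa_node(ifname, sysfs_root="/sys"):
    """Return the NUMA node a NIC is attached to, or None if it is unknown."""
    node_file = Path(sysfs_root) / "class" / "net" / ifname / "device" / "numa_node"
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when it has no NUMA information for the device.
    return node if node >= 0 else None


def cpus_near_nic(ifname, sysfs_root="/sys"):
    """Return the set of CPU ids on the same NUMA node as the NIC (may be empty)."""
    node = nic_numa_node(ifname, sysfs_root)
    if node is None:
        return set()
    cpulist = (Path(sysfs_root) / "devices" / "system" / "node"
               / f"node{node}" / "cpulist")
    try:
        return parse_cpulist(cpulist.read_text())
    except OSError:
        return set()
```

On Linux, a reader or writer process could then pass a non-empty result to
os.sched_setaffinity(0, cpus). None of this helps, of course, unless the
interface name handed to cpus_near_nic() means the same thing on every
server.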

HPC architects try to help sysadmins and application teams in this
process by applying post-build modifications.
Here we can use the HWADDR= variable in the ifcfg-[device name] files to
move a *specific* device name to these targeted NIC cards and ports.
This way application teams can always associate a *specific* device
name with a specific functional purpose (Feed, Backbone, Access) and know
where to tie their reader and writer threads.
We can also standardize that a given interface is always the
"default route" interface for a specific server blueprint.
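
As an illustration of that post-build step, the sketch below renders the
kind of ifcfg content we generate per server. DEVICE=, HWADDR= and ONBOOT=
are ordinary ifcfg keys; the blueprint dictionary, the device names and
the MAC addresses are placeholders I made up for the example.

```python
import re

# Six colon-separated hex bytes, e.g. 00:11:22:33:44:55
_MAC_RE = re.compile(r"^[0-9A-F]{2}(:[0-9A-F]{2}){5}$")


def render_ifcfg(device, hwaddr, onboot=True):
    """Return ifcfg file content that ties a device name to one port's MAC."""
    hwaddr = hwaddr.upper()
    if not _MAC_RE.match(hwaddr):
        raise ValueError(f"not a 6-byte MAC address: {hwaddr!r}")
    return (
        f"DEVICE={device}\n"
        f"HWADDR={hwaddr}\n"
        f"ONBOOT={'yes' if onboot else 'no'}\n"
    )


# Hypothetical blueprint: functional role -> (device name, MAC of the target port)
BLUEPRINT = {
    "feed": ("eth0", "00:11:22:33:44:55"),
    "backbone": ("eth1", "00:11:22:33:44:66"),
}

for role, (device, mac) in BLUEPRINT.items():
    print(f"# ifcfg-{device} ({role})")
    print(render_ifcfg(device, mac))
```

The point is that the same device name lands on the same functional port
regardless of which slot or chipset a particular server generation uses.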

It would appear that in RHEL8, due to systemd, HWADDR= is no longer
supported and we have lost this fundamentally important feature.

### REQUIREMENT ###

Sysadmins and HPC designers need a supported way to swap or move
kernel-allocated device names around the physical NIC cards and ports to
create consistent compute environments.
The HWADDR= solution was a rather brutal but effective way of achieving
this, but it would appear that it is no longer supported under systemd.
A better solution would be to let the user define unique
device names for NIC card interfaces so they can be more explicit in
their naming conventions,

e.g.

Ethernet:   enofeed1.0, enofeed1.1, enoback1.0, enoback1.1,
            enoaccess1.0, enoaccess1.1
InfiniBand: ibofeed1.0, ibofeed1.1, iboback1.0, iboback1.1,
            iboaccess1.0, iboaccess1.1
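
To show the shape of what I am asking for, here is a rough sketch that
renders such a name as a systemd .link unit, on the assumption that the
[Match] MACAddress= and [Link] Name= keys described in systemd.link(5)
are the intended mechanism. I have not verified how this interacts with
the RHEL8 tooling, and the helper name and MAC address are made up for
illustration.

```python
import re

IFNAMSIZ = 16  # kernel limit on interface names, including the trailing NUL
# Six colon-separated hex bytes; InfiniBand hardware addresses are longer
# and are not handled by this sketch.
_MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def render_link_unit(name, mac):
    """Return .link file content that names the NIC port with the given MAC."""
    mac = mac.lower()
    if not _MAC_RE.match(mac):
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    if (not name or len(name) >= IFNAMSIZ or "/" in name
            or any(ch.isspace() for ch in name)):
        raise ValueError(f"not a usable interface name: {name!r}")
    return f"[Match]\nMACAddress={mac}\n\n[Link]\nName={name}\n"


# Placeholder MAC address for the first feed port of a hypothetical server
print(render_link_unit("enofeed1.0", "00:11:22:33:44:55"))
```

All of the example names above fit within the kernel's 15-character
limit, so the length check only guards against more ambitious
conventions.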

### THE FUTURE ###
The industry is moving compute *closer to the network*,
and NIC cards now have FPGAs, DDR memory banks, GPUs and many-core
processors all integrated on the PCB attached to the PCIe slot. The Linux
kernel needs to enable sysadmins and HPC architects to create consistent
compute environments across heterogeneous server environments.

Who can I discuss these design issues with in the systemd space?

Yours Sincerely
Axel