Discussion: [systemd-devel] RFC: one more time: SCSI device identification
Martin Wilck
2021-03-29 09:58:49 UTC
Hello,

[sorry for cross-posting, I think this is relevant to multiple
communities.]

I'm referring to the recent discussion about SCSI device identification
for multipath-tools 
(https://listman.redhat.com/archives/dm-devel/2021-March/msg00332.html)

As you all know, there are different designators to identify SCSI LUNs,
and the specs don't mandate priorities for devices that support
multiple designator types. There are various implementations for device
identification, which use different priorities (summarized below).

It's highly desirable to clean up this confusion and settle on a single
instance and a unique priority order. I believe this instance should be
the kernel.

OTOH, changing device WWIDs is highly dangerous for production systems.
The WWID is prominently used in multipath-tools, but also in lots of
other important places such as fstab, grub.cfg, dracut, etc. No doubt
that we'll be stuck with the different algorithms for years, especially
for LTS distributions. But perhaps we can figure out a long-term exit
strategy?

The kernel's preference for type 8 designators (see below) is in
contrast with the established user space algorithms, which in practice
determine SCSI WWIDs on production systems. User space can try to
adapt to the kernel logic, but it will necessarily be a slow and
painful path if we want to avoid breaking user setups.

In principle, I believe the kernel is "right" to prefer type 8. But
because the "wwid" attribute isn't actually used for device
identification today, changing the kernel logic would be less prone to
regressions than changing user space, even if it violates the principle
that the kernel's user space API must remain stable.

Would it be an option to modify the kernel logic?

If we can't, I think we should start by making the "wwid" attribute
part of the udev rule logic, and letting distros configure whether the
kernel logic or the traditional udev logic should be used.
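For illustration, a rough sketch of what such a rule could look like
(untested; the file name, labels, and the ID_LEGACY property are made
up, only the "wwid" sysfs attribute exists today):

# 59-scsi-kernel-wwid.rules (hypothetical sketch)
ACTION=="remove", GOTO="kernel_wwid_end"
KERNEL!="sd*", GOTO="kernel_wwid_end"
# Distros/admins could set ID_LEGACY=1 in an earlier rules file to keep
# the traditional scsi_id/sg3_utils logic.
ENV{ID_LEGACY}=="1", GOTO="kernel_wwid_end"
# "wwid" lives on the SCSI device, i.e. the parent of the block device.
ENV{ID_WWID}=="", SUBSYSTEMS=="scsi", ATTRS{wwid}=="?*", ENV{ID_WWID}="$attr{wwid}"
LABEL="kernel_wwid_end"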

Please tell me your thoughts on this matter.

Regards,
Martin

PS: Incomplete list of algorithms for SCSI designator priorities:

The kernel ("wwid" sysfs attribute) prefers "SCSI name string" (type 8)
designators over other types
(https://elixir.bootlin.com/linux/latest/A/ident/designator_prio).
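In rough C, the effect is something like the following (an illustrative
sketch of the ordering only, not the verbatim kernel code; see
designator_prio() in drivers/scsi/scsi_lib.c for the real thing):

static int designator_prio_sketch(const unsigned char *d)
{
        /* d points at a VPD 0x83 designator descriptor; the designator
         * type is the low nibble of byte 1. Higher value = preferred. */
        switch (d[1] & 0xf) {
        case 0x8: return 6;     /* SCSI name string */
        case 0x3: return 5;     /* NAA */
        case 0x2: return 4;     /* EUI-64 */
        case 0x1: return 3;     /* T10 vendor ID */
        case 0x0: return 2;     /* vendor specific */
        default:  return 1;
        }
}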

The current set of udev rules in sg3_utils
(https://github.com/hreinecke/sg3_utils/blob/master/scripts/55-scsi-sg3_id.rules)
doesn't use the kernel's wwid attribute; the rules parse VPD pages 0x83
and 0x80 instead and prioritize types 36, 35, and 32 (NAA designators,
distinguished by NAA subtype) and type 2 (EUI-64) over type 8.

udev's "scsi_id" tool, historically the first attempt to implement a
priority for this, doesn't look at the SCSI name attribute at all:
https://github.com/systemd/systemd/blob/main/src/udev/scsi_id/scsi_serial.c

There's a "fallback" logic in multipath-tools in case udev doesn't
provide a WWID:
https://github.com/opensvc/multipath-tools/blob/a41a61e8482def33e3ca8c9e3639ad2c37611551/libmultipath/discovery.c#L1040
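All of the above boil down to walking the designator list in the VPD
0x83 payload and keeping the "best" descriptor; roughly like this
(sketch only, prio() standing in for whichever policy is chosen):

#include <stddef.h>

extern int prio(const unsigned char *desig);    /* the contested part */

const unsigned char *pick_designator(const unsigned char *vpd, size_t len)
{
        const unsigned char *best = NULL;
        size_t off = 4;   /* designators start after the 4-byte page header */

        if (len < 4 || vpd[1] != 0x83)
                return NULL;
        while (off + 4 <= len) {
                const unsigned char *d = vpd + off;
                size_t dlen = 4 + d[3];   /* 4-byte header + identifier */

                if (off + dlen > len)
                        break;
                if (!best || prio(d) > prio(best))
                        best = d;
                off += dlen;
        }
        return best;
}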
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Martin K. Petersen
2021-04-06 04:47:17 UTC
Martin,
Post by Martin Wilck
The kernel's preference for type 8 designators (see below) is in
contrast with the established user space algorithms, which in practice
determine SCSI WWIDs on production systems. User space can try to
adapt to the kernel logic, but it will necessarily be a slow and
painful path if we want to avoid breaking user setups.
I was concerned when you changed the kernel prioritization a while back
and I still don't think that we should tweak that code any further.

If the kernel picks one ID over another, that should be for the kernel's
use. Letting the kernel decide which ID is best for userland is not a
good approach.

So while I originally liked the idea of exposing a transport and
protocol agnostic wwid for each block device, I think that all the
descriptors and ID formats available in both SCSI and NVMe have shown
that that approach is fraught with peril.

Descriptors that provide "good uniqueness" on one device may be a
completely sub-optimal choice for another (zero-padded values, full of
spaces, vendors getting things wrong in general).

So I think my inclination would be to leave the current wwid as-is to
avoid the risk of breaking things. And then export in sysfs all the ID
descriptors the device reports. Even though vpd83 is already exported in its
entirety, I don't have any particular concerns about the individual
values being exported separately. That makes many userland things so
much easier. And I think the kernel is in a good position to disseminate
information reported by the hardware.

This puts the prioritization entirely in the distro/udev/scripting
domain. Taking the kernel out of the picture will make migration
easier. And it allows a user to pick their descriptor of choice should a
device report something completely unusable in type 8.
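For illustration only (the file names below are made up, nothing like
this exists yet), that could look like:

  /sys/class/scsi_device/2:0:0:1/device/designator_naa
  /sys/class/scsi_device/2:0:0:1/device/designator_eui64
  /sys/class/scsi_device/2:0:0:1/device/designator_t10
  /sys/class/scsi_device/2:0:0:1/device/designator_scsi_name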
--
Martin K. Petersen Oracle Linux Engineering
Martin Wilck
2021-04-16 23:28:50 UTC
Hello Martin,

Sorry for the late response, still recovering from a week out of
office.
Post by Martin K. Petersen
Martin,
Post by Martin Wilck
The kernel's preference for type 8 designators (see below) is in
contrast with the established user space algorithms, which in practice
determine SCSI WWIDs on production systems. User space can try to
adapt to the kernel logic, but it will necessarily be a slow and
painful path if we want to avoid breaking user setups.
I was concerned when you changed the kernel prioritization a while back
and I still don't think that we should tweak that code any further.
Ok.
Post by Martin K. Petersen
If the kernel picks one ID over another, that should be for the kernel's
use. Letting the kernel decide which ID is best for userland is not a
good approach.
Well, the kernel itself doesn't make any use of this property currently
(and user space doesn't much either, afaik).
Post by Martin K. Petersen
So I think my inclination would be to leave the current wwid as-is to
avoid the risk of breaking things. And then export all ID descriptors
reported in sysfs. Even though vpd83 is already exported in its
entirety, I don't have any particular concerns about the individual
values being exported separately. That makes many userland things so
much easier. And I think the kernel is in a good position to
disseminate
information reported by the hardware.
This puts the prioritization entirely in the distro/udev/scripting
domain. Taking the kernel out of the picture will make migration
easier. And it allows a user to pick their descriptor of choice should a
device report something completely unusable in type 8.
Hm, it sounds intriguing, but it has issues in its own right. For years
to come, user space will have to probe whether these attributes exist,
and fall back to the current ones ("wwid", "vpd_pg83") otherwise. So
user space can't be simplified any time soon. Speaking for an important
user space consumer of WWIDs (multipathd), I doubt that this would
improve matters for us. We'd be happy if the kernel could just pick the
"best" designator for us. But I understand that the kernel can't
guarantee a good choice (user space can't either).
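Just to spell out the probe-and-fall-back dance user space would be
stuck with (a sketch; "designator" is a made-up name for whatever the
new attribute would be called):

#include <stdio.h>

static int read_attr(const char *dev, const char *attr,
                     char *buf, size_t len)
{
        char path[256];
        FILE *f;
        int ok;

        snprintf(path, sizeof(path), "/sys/block/%s/device/%s", dev, attr);
        f = fopen(path, "r");
        if (!f)
                return -1;
        ok = fgets(buf, (int)len, f) != NULL;
        fclose(f);
        return ok ? 0 : -1;
}

int get_wwid(const char *dev, char *buf, size_t len)
{
        if (read_attr(dev, "designator", buf, len) == 0)  /* hypothetical */
                return 0;
        return read_attr(dev, "wwid", buf, len);          /* today's attribute */
}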

What is your idea how these new sysfs attributes should be named? Just
enumerate, or name them by type somehow?

Thanks,
Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Martin K. Petersen
2021-04-22 02:46:27 UTC
Martin,
Post by Martin Wilck
Hm, it sounds intriguing, but it has issues in its own right. For
years to come, user space will have to probe whether these attributes
exist, and fall back to the current ones ("wwid", "vpd_pg83")
otherwise. So user space can't be simplified any time soon. Speaking
for an important user space consumer of WWIDs (multipathd), I doubt
that this would improve matters for us. We'd be happy if the kernel
could just pick the "best" designator for us. But I understand that
the kernel can't guarantee a good choice (user space can't either).
But user space can be adapted at runtime to pick one designator over the
other (ha!).

We could do that in the kernel too, of course, but I'm afraid of what
the resulting BLIST changes would end up looking like over time.

I am also very concerned about changing what the kernel currently
exports in a given variable like "wwid". A seemingly innocuous change to
the reported value could lead to a system no longer booting after
updating the kernel.

(Ignoring for a moment that some arrays will helpfully add a new ID
designator after a firmware upgrade and thus change what the kernel
reports. *sigh*)
Post by Martin Wilck
What is your idea how these new sysfs attributes should be named? Just
enumerate, or name them by type somehow?
Up to you. Whatever you think would be easiest for userland to deal
with. I don't have a good feeling for how common vendor specific ones
are in practice. Things would obviously be easier if SCSI didn't have so
many choices :(

But taking a step back: Other than "it's not what userland currently
does", what specifically is the problem with designator_prio()? We've
picked the priority list once and for all. If we promise never to change
it, what is the issue?
--
Martin K. Petersen Oracle Linux Engineering
Martin Wilck
2021-04-22 09:07:15 UTC
Post by Martin K. Petersen
Martin,
Post by Martin Wilck
Hm, it sounds intriguing, but it has issues in its own right. For
years to come, user space will have to probe whether these attributes
exist, and fall back to the current ones ("wwid", "vpd_pg83")
otherwise. So user space can't be simplified any time soon. Speaking
for an important user space consumer of WWIDs (multipathd), I doubt
that this would improve matters for us. We'd be happy if the kernel
could just pick the "best" designator for us. But I understand that
the kernel can't guarantee a good choice (user space can't either).
But user space can be adapted at runtime to pick one designator over the
other (ha!).
And that's exactly the problem. Effectively, all user space relies on
udev today, because that's where this "adaptation" is taking place. It
happens

1) either in systemd's scsi_id built-in 
(https://github.com/systemd/systemd/blob/7feb1dd6544d1bf373dbe13dd33cf563ed16f891/src/udev/scsi_id/scsi_serial.c#L37)
2) or in the udev rules coming with sg3_utils 
(https://github.com/hreinecke/sg3_utils/blob/master/scripts/55-scsi-sg3_id.rules)

1) is just as opaque and un-"adaptable" as the kernel, and the logic is
suboptimal. 2) is of course "adaptable", but that is a problem in
practice if udev fails to provide a WWID. multipath-tools goes through
various twists for this case to figure out "fallback" WWIDs, guessing
whether that "fallback" matches what udev would have returned if it had
worked.

That's the gist of it - the general frustration about udev among some
of its heaviest users (talk to the LVM2 maintainers).

I suppose 99.9% of users never bother with customizing the udev rules.
IOW, these users might as well just use a kernel-provided value. But
the remaining 0.1% causes headaches for user-space applications, which
can't make solid assumptions about the rules. Thus, in a way, the
flexibility of the rules does more harm than good.
Post by Martin K. Petersen
We could do that in the kernel too, of course, but I'm afraid of what
the resulting BLIST changes would end up looking like over time.
That's something we want to avoid, sure.

But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set a udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine the WWID
differently. User-space apps like multipath could check the ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.
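To sketch the consumer side with libudev calls that exist today
(ID_LEGACY itself being the proposed, not yet existing, property;
compile with -ludev):

#include <stddef.h>
#include <libudev.h>
#include <string.h>

/* Return a WWID that is consistent with udev's view of the device. */
static const char *get_wwid(struct udev_device *blk)
{
        const char *legacy = udev_device_get_property_value(blk, "ID_LEGACY");
        struct udev_device *scsi_dev;

        if (legacy && !strcmp(legacy, "1"))
                /* traditional path: whatever the rules computed */
                return udev_device_get_property_value(blk, "ID_SERIAL");

        /* default path: trust the kernel's choice on the SCSI parent */
        scsi_dev = udev_device_get_parent_with_subsystem_devtype(blk, "scsi",
                                                                 "scsi_device");
        return scsi_dev ? udev_device_get_sysattr_value(scsi_dev, "wwid")
                        : NULL;
}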
Post by Martin K. Petersen
I am also very concerned about changing what the kernel currently
exports in a given variable like "wwid". A seemingly innocuous change to
the reported value could lead to a system no longer booting after
updating the kernel.
AFAICT, no major distribution uses "wwid" for this purpose (yet). I
just recently realized that the kernel's ALUA code refers to it. (*)

In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype: "naa." name strings should map to the respective NAA
descriptors, and "eui." likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).
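I.e. something along these lines (illustrative priorities, not actual
kernel code):

#include <string.h>

/* Rank a type 8 "SCSI name string" by its subtype so that "naa." scores
 * like a binary NAA designator, "eui." like EUI-64, and "iqn." below
 * NAA, as discussed above. The numbers themselves are arbitrary. */
static int name_string_prio(const char *s)
{
        if (!strncmp(s, "naa.", 4))
                return 5;       /* same prio as the NAA descriptor */
        if (!strncmp(s, "eui.", 4))
                return 4;       /* same prio as the EUI-64 descriptor */
        if (!strncmp(s, "iqn.", 4))
                return 3;       /* no binary counterpart: below NAA */
        return 0;               /* unknown subtype */
}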

I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).

I don't like the idea of having separate sysfs attributes for
designators of different types, that's impractical for user space.
Post by Martin K. Petersen
But taking a step back: Other than "it's not what userland currently
does", what specifically is the problem with designator_prio()? We've
picked the priority list once and for all. If we promise never to change
it, what is the issue?
If the prioritization in kernel and user space were the same, we could
migrate away from udev more easily without risking boot failures.

Thanks,
Martin

(*) which is an argument for using "wwid" in user space too - just to
be consistent with the kernel's internal logic.
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Benjamin Marzinski
2021-04-22 16:14:20 UTC
Post by Martin Wilck
Post by Martin K. Petersen
Martin,
Post by Martin Wilck
Hm, it sounds intriguing, but it has issues in its own right. For
years to come, user space will have to probe whether these attributes
exist, and fall back to the current ones ("wwid", "vpd_pg83")
otherwise. So user space can't be simplified any time soon. Speaking
for an important user space consumer of WWIDs (multipathd), I doubt
that this would improve matters for us. We'd be happy if the kernel
could just pick the "best" designator for us. But I understand that
the kernel can't guarantee a good choice (user space can't either).
But user space can be adapted at runtime to pick one designator over the
other (ha!).
And that's exactly the problem. Effectively, all user space relies on
udev today, because that's where this "adaptation" is taking place. It
happens
1) either in systemd's scsi_id built-in 
(https://github.com/systemd/systemd/blob/7feb1dd6544d1bf373dbe13dd33cf563ed16f891/src/udev/scsi_id/scsi_serial.c#L37)
2) or in the udev rules coming with sg3_utils 
(https://github.com/hreinecke/sg3_utils/blob/master/scripts/55-scsi-sg3_id.rules)
1) is just as opaque and un-"adaptable" as the kernel, and the logic is
suboptimal. 2) is of course "adaptable", but that's a problem in
practice, if udev fails to provide a WWID. multipath-tools go through
various twists for this case to figure out "fallback" WWIDs, guessing
whether that "fallback" matches what udev would have returned if it had
worked.
That's the gist of it - the general frustration about udev among some
of its heaviest users (talk to the LVM2 maintainers).
I suppose 99.9% of users never bother with customizing the udev rules.
IOW, these users might as well just use a kernel-provided value. But
the remaining 0.1% causes headaches for user-space applications, which
can't make solid assumptions about the rules. Thus, in a way, the
flexibility of the rules does more harm than it helps.
Post by Martin K. Petersen
We could do that in the kernel too, of course, but I'm afraid of what
the resulting BLIST changes would end up looking like over time.
That's something we want to avoid, sure.
But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set a udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine the WWID
differently. User-space apps like multipath could check the ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.
Yeah, as long as ID_LEGACY is handled in a careful manner, so WWIDs
don't simply change without warning because of an upgrade, a path out
of this complexity is definitely helpful.

-Ben
Post by Martin Wilck
Post by Martin K. Petersen
I am also very concerned about changing what the kernel currently
exports in a given variable like "wwid". A seemingly innocuous change to
the reported value could lead to a system no longer booting after
updating the kernel.
AFAICT, no major distribution uses "wwid" for this purpose (yet). I
just recently realized that the kernel's ALUA code refers to it. (*)
In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype. "naa." name strings should map to the respective NAA
descriptors, and "eui.", likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).
I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).
I don't like the idea of having separate sysfs attributes for
designators of different types, that's impractical for user space.
Post by Martin K. Petersen
But taking a step back: Other than "it's not what userland currently
does", what specifically is the problem with designator_prio()? We've
picked the priority list once and for all. If we promise never to change
it, what is the issue?
If the prioritization in kernel and user space was the same, we could
migrate away from udev more easily without risking boot failure.
Thanks,
Martin
(*) which is an argument for using "wwid" in user space too - just to
be consistent with the kernel's internal logic.
--
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Martin K. Petersen
2021-04-23 01:40:03 UTC
Martin,
Post by Martin Wilck
I suppose 99.9% of users never bother with customizing the udev rules.
Except for the other 99.9%, perhaps? :) We definitely have many users
that tweak udev storage rules for a variety of reasons. Including being
able to use RII for LUN naming purposes.
Post by Martin Wilck
But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set an udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine WWID
differently. User-space apps like multipath could check the ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.
That's fine with me.
Post by Martin Wilck
AFAICT, no major distribution uses "wwid" for this purpose (yet).
We definitely have users that currently rely on wwid, although probably
not through standard distro scripts.
Post by Martin Wilck
In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype. "naa." name strings should map to the respective NAA
descriptors, and "eui.", likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).
I like what NVMe did wrt exporting eui, nguid, uuid separately from
the best-effort wwid. That's why I suggested separate sysfs files for
the various page 0x83 descriptors. I like the idea of being able to
explicitly ask for an eui if that's what I need. But that appears to be
somewhat orthogonal to your request.
Post by Martin Wilck
I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).
That's fine. I am not a big fan of the idea that block/foo/wwid and
block/foo/device/wwid could end up being different. But I do think that
from a userland tooling perspective the consistency with NVMe is more
important.
--
Martin K. Petersen Oracle Linux Engineering
Martin Wilck
2021-04-23 10:28:19 UTC
Post by Martin K. Petersen
Martin,
Post by Martin Wilck
I suppose 99.9% of users never bother with customizing the udev rules.
Except for the other 99.9%, perhaps? :) We definitely have many users
that tweak udev storage rules for a variety of reasons. Including being
able to use RII for LUN naming purposes.
Post by Martin Wilck
But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set a udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine WWID
differently. User-space apps like multipath could check the
ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.
That's fine with me.
Post by Martin Wilck
AFAICT, no major distribution uses "wwid" for this purpose (yet).
We definitely have users that currently rely on wwid, although
probably
not through standard distro scripts.
Post by Martin Wilck
In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype. "naa." name strings should map to the respective NAA
descriptors, and "eui.", likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).
I like what NVMe did wrt exporting eui, nguid, uuid separately from
the best-effort wwid. That's why I suggested separate sysfs files for
the various page 0x83 descriptors. I like the idea of being able to
explicitly ask for an eui if that's what I need. But that appears to be
somewhat orthogonal to your request.
Post by Martin Wilck
I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute
following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).
That's fine. I am not a big fan of the idea that block/foo/wwid and
block/foo/device/wwid could end up being different. But I do think that
from a userland tooling perspective the consistency with NVMe is more
important.
OK, then here's the plan: change SCSI (block) device identification to
work similarly to NVMe (in addition to what we have now).

1. Add a new sysfs attribute for SCSI block devices as
/sys/block/sd$X/wwid, with the value derived similarly to the current
"wwid" SCSI device attribute, but using the same prio for SCSI name
strings as for their binary counterparts, as described above.

2. Add "naa" and "eui" attributes, too, for user-space applications
that are interested in these specific attributes.
Fixme: should we differentiate between different "naa" or "eui"
subtypes, like "naa_regext", "eui64" or similar? If the device defines
multiple "naa" designators, which one should we choose?

3. Change udev rules such that they primarily look at the attribute
from 1.) on new installations, and introduce a variable ID_LEGACY to
tell the rules to fall back to the current algorithm. I suppose it
makes sense to have at least ID_VENDOR and ID_PRODUCT available when
making this decision, so that it doesn't have to be a global setting
on a given host. A rough sketch of the resulting layout follows.
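To illustrate with made-up values (attribute names as in 1. and 2.
above, the NAA value is just an example):

  /sys/block/sdc/wwid -> naa.60060e80123b050050403b0500000709
  /sys/block/sdc/naa  -> naa.60060e80123b050050403b0500000709
  /sys/block/sdc/eui  -> (empty or absent if the device reports no
                          EUI-64 designator)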

While we're at it, I'd like to mention another issue: WWID changes.

This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID than the
one in sysfs, which is fatal. If we plan to rely more on sysfs for
device identification in the future, the problem gets worse.

I wonder if there's a chance that future kernels would automatically
update the attributes if a corresponding UNIT ATTENTION condition such
as INQUIRY DATA HAS CHANGED is received (*), or if we can find some
other way to avoid data corruption resulting from writing to the wrong
device.
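For comparison with the kernel's cached copy in "vpd_pg83", a fresh
VPD 0x83 can be fetched with a direct INQUIRY via SG_IO. Minimal
sketch, error handling and retries omitted:

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

static int inquiry_vpd83(int fd, unsigned char *buf, unsigned short len)
{
        unsigned char cdb[6] = { 0x12, 0x01, 0x83,   /* INQUIRY, EVPD, page */
                                 len >> 8, len & 0xff, 0 };
        unsigned char sense[32];
        struct sg_io_hdr io;

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.dxfer_direction = SG_DXFER_FROM_DEV;
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxferp = buf;
        io.dxfer_len = len;
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 5000;                           /* milliseconds */

        if (ioctl(fd, SG_IO, &io) < 0 || io.masked_status != 0)
                return -1;
        return 0;
}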

Regards,
Martin

(*) I've been told that WWID changes can happen even without receiving
a UA. But in that case I'm inclined to put the blame on the storage.
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Ulrich Windl
2021-04-26 11:14:58 UTC
Post by Martin Wilck
Post by Martin K. Petersen
Martin,
Post by Martin Wilck
I suppose 99.9% of users never bother with customizing the udev rules.
Except for the other 99.9%, perhaps? :) We definitely have many users
that tweak udev storage rules for a variety of reasons. Including being
able to use RII for LUN naming purposes.
Post by Martin Wilck
But we can actually combine both approaches. If "wwid" yields a good
value most of the time (which is true IMO), we could make user space
rely on it by default, and make it possible to set a udev property
(e.g. ENV{ID_LEGACY}="1") to tell udev rules to determine the WWID
differently. User-space apps like multipath could check the
ID_LEGACY
property to determine whether or not reading the "wwid" attribute would
be consistent with udev. That would simplify matters a lot for us (Ben,
do you agree?), without the need of adding endless BLIST entries.
That's fine with me.
Post by Martin Wilck
AFAICT, no major distribution uses "wwid" for this purpose (yet).
We definitely have users that currently rely on wwid, although probably
not through standard distro scripts.
Post by Martin Wilck
In a recent discussion with Hannes, the idea came up that the priority
of "SCSI name string" designators should actually depend on their
subtype. "naa." name strings should map to the respective NAA
descriptors, and "eui.", likewise (only "iqn." descriptors have no
binary counterpart; we thought they should rather be put below NAA,
prio-wise).
I like what NVMe did wrt exporting eui, nguid, uuid separately from
the best-effort wwid. That's why I suggested separate sysfs files for
the various page 0x83 descriptors. I like the idea of being able to
explicitly ask for an eui if that's what I need. But that appears to be
somewhat orthogonal to your request.
Post by Martin Wilck
I wonder if you'd agree with a change made that way for "wwid". I
suppose you don't. I'd then propose to add a new attribute
following
this logic. It could simply be an additional attribute with a different
name. Or this new attribute could be a property of the block device
rather than the SCSI device, like NVMe does it
(/sys/block/nvme0n2/wwid).
That's fine. I am not a big fan of the idea that block/foo/wwid and
block/foo/device/wwid could end up being different. But I do think that
from a userland tooling perspective the consistency with NVMe is more
important.
OK, then here's the plan: Change SCSI (block) device identification to
work similar to NVMe (in addition to what we have now).
1. add a new sysfs attribute for SCSI block devices as
/sys/block/sd$X/wwid, the value derived similar to the current "wwid"
SCSI device attribute, but using the same prio for SCSI name strings as
for their binary counterparts, as described above.
2. add "naa" and "eui" attributes, too, for user-space applications
that are interested in these specific attributes.
Fixme: should we differentiate between different "naa" or eui subtypes,
like "naa_regext", "eui64" or similar? If the device defines multiple
"naa" designators, which one should we choose?
3. Change udev rules such that they primarily look at the attribute in
1.) on new installations, and introduce a variable ID_LEGACY to tell the
rules to fall back to the current algorithm. I suppose it makes sense
to have at least ID_VENDOR and ID_PRODUCT available when making this
decision, so that it doesn't have to be a global setting on a given
host.
While we're at it, I'd like to mention another issue: WWID changes.
This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID than the
one in sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse.
I think many devices rely on the fact that they are identified by
vendor/model/serial_nr, because most professional SAN storage systems
let you pre-set the serial number to a custom value; so if you want a
new disk (maybe a snapshot) to be compatible with the old one, you just
assign the same serial number. I guess that's the idea behind it.
Post by Martin Wilck
I wonder if there's a chance that future kernels would automatically
update the attributes if a corresponding UNIT ATTENTION condition such
as INQUIRY DATA HAS CHANGED is received (*), or if we can find some
other way to avoid data corruption resulting from writing to the wrong
device.
Regards,
Martin
(*) I've been told that WWID changes can happen even without receiving
an UA. But in that case I'm inclined to put the blame on the storage.
--
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Martin Wilck
2021-04-26 13:16:58 UTC
Post by Ulrich Windl
Post by Martin Wilck
While we're at it, I'd like to mention another issue: WWID changes.
This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID as in
sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse.
I think many devices rely on the fact that they are identified by
Vendor/model/serial_nr, because in most professional SAN storage systems you
can pre-set the serial number to a custom value; so if you want a new disk
(maybe a snapshot) to be compatible with the old one, just assign the same
serial number. I guess that's the idea behind.
What you are saying sounds dangerous to me. If a snapshot has the same
WWID as the device it's a snapshot of, it must not be exposed to any
host(s) at the same time with its origin, otherwise the host may
happily combine it with the origin into one multipath map, and data
corruption will almost certainly result.

My argument is about how the host is supposed to deal with a WWID
change if it happens. Here, "WWID change" means that a given H:C:T:L
suddenly exposes different device designators than it used to, while
this device is in use by a host. Here, too, data corruption is
imminent, and can happen in the blink of an eye. To avoid this, several
things are needed:

1) the host needs to get notified about the change (likely by a UA of
some sort)
2) the kernel needs to react to the notification immediately, e.g. by
blocking IO to the device,
3) userspace tooling such as udev or multipathd needs to figure out how
to deal with the situation cleanly, and eventually unblock it.

Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.

Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Ulrich Windl
2021-04-27 07:02:10 UTC
Post by Martin Wilck
Post by Ulrich Windl
Post by Martin Wilck
While we're at it, I'd like to mention another issue: WWID changes.
This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID as in
sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse.
I think many devices rely on the fact that they are identified by
vendor/model/serial_nr, because most professional SAN storage systems
let you pre-set the serial number to a custom value; so if you want a
new disk (maybe a snapshot) to be compatible with the old one, you just
assign the same serial number. I guess that's the idea behind it.
What you are saying sounds dangerous to me. If a snapshot has the same
WWID as the device it's a snapshot of, it must not be exposed to any
host(s) at the same time with its origin, otherwise the host may
happily combine it with the origin into one multipath map, and data
corruption will almost certainly result.
My argument is about how the host is supposed to deal with a WWID
change if it happens. Here, "WWID change" means that a given H:C:T:L
suddenly exposes different device designators than it used to, while
this device is in use by a host. Here, too, data corruption is
imminent, and can happen in the blink of an eye. To avoid this, several
things are needed:
1) the host needs to get notified about the change (likely by a UA of
some sort)
2) the kernel needs to react to the notification immediately, e.g. by
blocking IO to the device,
3) userspace tooling such as udev or multipathd needs to figure out how
to deal with the situation cleanly, and eventually unblock it.
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change. If a snapshot is created, it
should obtain a new WWID. An example from a Hitachi array:

  designator type: T10 vendor identification, code set: ASCII
    vendor id: HITACHI
    vendor specific: 50403B050709
  designator type: NAA, code set: Binary
    0x60060e80123b050050403b0500000709

The majority of the naa wwid is tied to the storage subsystem and
identifies the vendor oui, model, serial etc. The last 4 digits in this
example indicate the LDEV ID (sorry, mainframe heritage here..). When a
snapshot is taken, these 4 will change as a new LDEV ID is assigned to
the snapshot. This sort of behaviour should be consistent across all
storage vendors imho.
It's getting off-topic, but in automatic disaster recovery scenarios
one might want the "new disk" (maybe a snapshot of the original disk
before it got corrupted) to look like the "old disk", so that the OS
can boot without needing any adjustments.

Regards,
Ulrich
Post by Martin Wilck
Martin
Martin Wilck
2021-04-27 08:10:53 UTC
Post by Martin Wilck
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change. 
In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See 
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html 
and follow-ups, for example.

Regards,
Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Hannes Reinecke
2021-04-27 08:21:03 UTC
Post by Martin Wilck
Post by Martin Wilck
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change. 
In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See 
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html 
and follow-ups, for example.
And it's actually something which might happen quite easily.
The storage array can unmap a LUN, delete it, create a new one, and map
that one to the same LUN number as the old one.
If we didn't do I/O during that interval, upon the next I/O we will get
the dreaded 'Power-On/Reset' sense code.
_And nothing else_, due to the arcane rules for sense code generation in
SAM.
But we end up with a completely different device.

The only way out of it is to do a rescan for every POR sense code, and
disable the device e.g. via DID_NO_CONNECT whenever we find that the
identification has changed. We already have a copy of the original VPD
page 0x83 at hand, so that should be reasonably easy.

I had a rather lengthy discussion with Fred Knight @ NetApp about
Power-On/Reset handling, what with him complaining that we don't handle
it correctly. So this really is something we should be looking into,
even independently of multipathing.
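For reference, the relevant codes are easy to pick apart from
fixed-format sense data (a sketch; descriptor-format sense omitted):

enum ua_kind { UA_NONE, UA_POR, UA_INQ_CHANGED, UA_LUNS_CHANGED, UA_OTHER };

static enum ua_kind classify_ua(const unsigned char *sense, int len)
{
        if (len < 14 || (sense[2] & 0x0f) != 0x06)  /* not UNIT ATTENTION */
                return UA_NONE;
        if (sense[12] == 0x29)                      /* POWER ON, RESET, OR BUS
                                                       DEVICE RESET OCCURRED */
                return UA_POR;
        if (sense[12] == 0x3f && sense[13] == 0x03) /* INQUIRY DATA HAS CHANGED */
                return UA_INQ_CHANGED;
        if (sense[12] == 0x3f && sense[13] == 0x0e) /* REPORTED LUNS DATA
                                                       HAS CHANGED */
                return UA_LUNS_CHANGED;
        return UA_OTHER;
}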

But actually I like the idea from Martin Petersen to expose the parsed
VPD identifiers to sysfs; that would allow us to drop sg_inq completely
from the udev rules.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
***@suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
Ulrich Windl
2021-04-27 10:52:33 UTC
Post by Hannes Reinecke
Post by Martin Wilck
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change.
In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html
and follow-ups, for example.
And it's actually something which might happen quite easily.
The storage array can unmap a LUN, delete it, create a new one, and map
that one to the same LUN number as the old one.
If we didn't do I/O during that interval, upon the next I/O we will get
the dreaded 'Power-On/Reset' sense code.
_And nothing else_, due to the arcane rules for sense code generation in
SAM.
But we end up with a completely different device.
The only way out of it is to do a rescan for every POR sense code, and
disable the device eg via DID_NO_CONNECT whenever we find that the
identification has changed. We already have a copy of the original VPD
page 0x83 at hand, so that should be reasonably easy.
I don't know the depths of the SCSI or FC protocols, but storage systems
typically signal such events, maybe either via some unit attention or
some FC event. Older kernels logged that there was a change, but a
manual SCSI bus scan was needed, while newer kernels find new devices
"automagically" for some products. The HP EVA 6000 series worked that
way, and the 3PAR StoreServ 8000 series also seems to work that way, but
not the Pure Storage X70 R3. For the latter you need something like an
FC LIP to make the kernel detect the new devices (LUNs).
I'm unsure where the problem is, but in principle the kernel can be
notified...
Post by Hannes Reinecke
Power-On/Reset handling, what with him complaining that we don't handle
it correctly. So this really is something we should be looking into,
even independently of multipathing.
But actually I like the idea from Martin Petersen to expose the parsed
VPD identifiers to sysfs; that would allow us to drop sg_inq completely
from the udev rules.
Talking of VPDs: somewhere in the last 12 years (within SLES 11) there
was a kernel change regarding trailing blanks in VPD data. That change
blew up several configurations, which were unable to re-recognize their
devices. In one case the software had even bound a license to a specific
device serial number, and that software found "new" devices while
missing the "old" ones...

Regards,
Ulrich
Post by Hannes Reinecke
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
Ewan D. Milne
2021-04-27 20:04:19 UTC
Post by Ulrich Windl
Post by Hannes Reinecke
Post by Martin Wilck
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change.
In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html
Post by Ulrich Windl
Post by Hannes Reinecke
and follow-ups, for example.
And it's actually something which might happen quite easily.
The storage array can unmap a LUN, delete it, create a new one, and map
that one to the same LUN number as the old one.
If we didn't do I/O during that interval, upon the next I/O we will get
the dreaded 'Power-On/Reset' sense code.
_And nothing else_, due to the arcane rules for sense code generation in
SAM.
But we end up with a completely different device.
The only way out of it is to do a rescan for every POR sense code, and
disable the device eg via DID_NO_CONNECT whenever we find that the
identification has changed. We already have a copy of the original VPD
page 0x83 at hand, so that should be reasonably easy.
I don't know the depths of the SCSI or FC protocols, but storage systems
typically signal such events, maybe either via some unit attention or
some FC event. Older kernels logged that there was a change, but a
manual SCSI bus scan was needed, while newer kernels find new devices
"automagically" for some products. The HP EVA 6000 series worked that
way, and the 3PAR StoreServ 8000 series also seems to work that way, but
not the Pure Storage X70 R3. For the latter you need something like an
FC LIP to make the kernel detect the new devices (LUNs).
I'm unsure where the problem is, but in principle the kernel can be
notified...
There has to be some command on which the Unit Attention status
can be returned. (In a multipath configuration, the path checker
commands may do this.) In the absence of a command, there is no
asynchronous mechanism in SCSI to report the status.

On FC, things related to finding a remote port will trigger a rescan.

-Ewan
Post by Ulrich Windl
Post by Hannes Reinecke
Power-On/Reset handling, what with him complaining that we don't handle
it correctly. So this really is something we should be looking into,
even independently of multipathing.
But actually I like the idea from Martin Petersen to expose the parsed
VPD identifiers to sysfs; that would allow us to drop sg_inq
completely
from the udev rules.
Talking of VPDs: somewhere in the last 12 years (within SLES 11) there
was a kernel change regarding trailing blanks in VPD data. That change
blew up several configurations, which were unable to re-recognize their
devices. In one case the software had even bound a license to a specific
device serial number, and that software found "new" devices while
missing the "old" ones...
Regards,
Ulrich
Post by Hannes Reinecke
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
***@suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
Hannes Reinecke
2021-05-04 07:32:25 UTC
Post by Ulrich Windl
Post by Hannes Reinecke
Post by Martin Wilck
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
In my view the WWID should never change.
In an ideal world, perhaps not. But in the dm-multipath realm, we know
that WWID changes can happen with certain storage arrays. See
https://listman.redhat.com/archives/dm-devel/2021-February/msg00116.html
and follow-ups, for example.
And it's actually something which might happen quite easily.
The storage array can unmap a LUN, delete it, create a new one, and map
that one to the same LUN number as the old one.
If we didn't do I/O during that interval, upon the next I/O we will get
the dreaded 'Power-On/Reset' sense code.
_And nothing else_, due to the arcane rules for sense code generation in
SAM.
But we end up with a completely different device.
The only way out of it is to do a rescan for every POR sense code, and
disable the device eg via DID_NO_CONNECT whenever we find that the
identification has changed. We already have a copy of the original VPD
page 0x83 at hand, so that should be reasonably easy.
I don't know the depths of the SCSI or FC protocols, but storage systems
typically signal such events, maybe either via some unit attention or
some FC event. Older kernels logged that there was a change, but a
manual SCSI bus scan was needed, while newer kernels find new devices
"automagically" for some products. The HP EVA 6000 series worked that
way, and the 3PAR StoreServ 8000 series also seems to work that way, but
not the Pure Storage X70 R3. For the latter you need something like an
FC LIP to make the kernel detect the new devices (LUNs).
I'm unsure where the problem is, but in principle the kernel can be
notified...
My point was that while there _is_ a unit attention with the sense code
'INQUIRY DATA CHANGED' (and that indeed will generate a kernel message),
it might be obscured by a subsequent unit attention with the sense code
'Power-On/Reset': per the SCSI spec, the latter might cause the previous
ones to _not_ be sent.
So from that reasoning we will need to rescan the device upon
'Power-on/Reset'.
But 'Power-On/Reset' is a sense code which we also get during initial
device scan, so the problem is that we will be triggering a rescan while
_doing_ a rescan, and as such it would need some really careful testing.

As for the PureStorage behaviour: the problem with changing the LUN
mapping on the array is that we might not _have_ a device to send
unit attentions to.
If the array already exports LUNs to some other hosts, it doesn't need
to re-initialize the FC port when starting to export LUNs to _this_
host. And as _this_ host doesn't have a LUN on which unit attentions can
be sent, _and_ the FC port is already registered, there are no events
whatsoever which would cause the host to initiate a rescan.
To resolve that, the array would need to induce e.g. an RSCN, but that
will only be triggered if an FC port is (re-)registered.
Which is what HPE arrays do: initiate a link-bounce on the attached
ports, which will cause the attached hosts to initiate a rescan.
Of course, _all_ hosts will need to rescan (thereby causing an
interruption even on unrelated hosts), which is why this is not done by
all vendors.
Post by Ulrich Windl
Post by Hannes Reinecke
Power-On/Reset handling, what with him complaining that we don't handle
it correctly. So this really is something we should be looking into,
even independently of multipathing.
But actually I like the idea from Martin Petersen to expose the parsed
VPD identifiers to sysfs; that would allow us to drop sg_inq completely
from the udev rules.
Talking of VPDs: somewhere in the last 12 years (within SLES 11) there
was a kernel change regarding trailing blanks in VPD data. That change
blew up several configurations, which were unable to re-recognize their
devices. In one case the software had even bound a license to a specific
device serial number, and that software found "new" devices while
missing the "old" ones...
That's probably just for VPD page 0x80. Page 0x83 has pretty strict
rules on how the entries are formatted, so chopping off trailing blanks
is not easily done there.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
***@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Martin Wilck
2021-04-28 06:34:41 UTC
The way out of this is to chuck the array in the bin. As I mentioned
in one of my other emails, when a scenario like the one you described
above happens and the array does not inform the initiator, it goes
against the SAM-5 standard.
5.14 Unit attention conditions
5.14.1 Unit attention conditions that are not coalesced
Each logical unit shall establish a unit attention condition whenever:
a) a power on (see 6.3.1), hard reset (see 6.3.2), logical unit reset
(see 6.3.3), I_T nexus loss (see 6.3.4), or power loss expected (see
6.3.5) occurs;
b) commands received on this I_T nexus have been cleared by a command
or a task management function associated with another I_T nexus and
the TAS bit was set to zero in the Control mode page associated with
this I_T nexus (see 5.6);
c) the portion of the logical unit inventory that consists of
administrative logical units and hierarchical logical units has been
changed (see 4.6.18.1); or
d) any other event requiring the attention of the SCSI initiator
device.
Especially the I_T nexus loss under a) is an important trigger.
---
6.3.4 I_T nexus loss
a) a hard reset condition (see 6.3.2);
b) an I_T nexus loss event (e.g., logout) indicated by a Nexus Loss
event notification (see 6.4);
c) indication that an I_T NEXUS RESET task management request (see 7.6)
has been processed; or
d) an indication that a REMOVE I_T NEXUS command (see SPC-4) has been
processed.
An I_T nexus loss event is an indication from the SCSI transport
protocol to the SAL that an I_T nexus no longer exists. SCSI transport
protocols may define I_T nexus loss events.
Each SCSI transport protocol standard that defines I_T nexus loss
events should specify when those events result in the delivery of a
Nexus Loss event notification to the SAL.
The I_T nexus loss condition applies to both SCSI initiator devices
and SCSI target devices.
If a SCSI target port detects an I_T nexus loss, then a Nexus Loss
event notification shall be delivered to each logical unit to which
the I_T nexus has access.
In response to an I_T nexus loss condition a logical unit shall take
the following actions:
a) abort all commands received on the I_T nexus as described in 5.6;
b) abort all background third-party copy operations (see SPC-4) that
are using the I_T nexus;
c) terminate all task management functions received on the I_T nexus;
d) clear all ACA conditions (see 5.9.5) associated with the I_T nexus;
e) establish a unit attention condition for the SCSI initiator port
associated with the I_T nexus (see 5.14 and 6.2); and
f) perform any additional functions required by the applicable command
standards.
---
This also means that any underlying transport protocol issues, like on
FC or TCP for iSCSI, will very often trigger aborted commands or UAs as
well, which will be picked up by the kernel/respective drivers.
Thanks a lot. I'm not quite certain which of these paragraphs would
apply to the situation I had in mind (administrator remapping an
existing LUN on a storage array to a different volume). That scenario
wouldn't necessarily involve transport-level errors, or an I_T nexus
loss. 5.14.1 c) or d) might apply, is that what you meant?

Regards
Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Ewan D. Milne
2021-04-27 20:14:59 UTC
Post by Martin Wilck
Post by Ulrich Windl
Post by Martin Wilck
While we're at it, I'd like to mention another issue: WWID
changes.
This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID than the
one in sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse.
I think many devices rely on the fact that they are identified by
vendor/model/serial_nr, because most professional SAN storage systems
let you pre-set the serial number to a custom value; so if you want a
new disk (maybe a snapshot) to be compatible with the old one, you just
assign the same serial number. I guess that's the idea behind it.
What you are saying sounds dangerous to me. If a snapshot has the same
WWID as the device it's a snapshot of, it must not be exposed to any
host(s) at the same time with its origin, otherwise the host may
happily combine it with the origin into one multipath map, and data
corruption will almost certainly result.
My argument is about how the host is supposed to deal with a WWID
change if it happens. Here, "WWID change" means that a given H:C:T:L
suddenly exposes different device designators than it used to, while
this device is in use by a host. Here, too, data corruption is
imminent, and can happen in the blink of an eye. To avoid this, several
things are needed:
1) the host needs to get notified about the change (likely by a UA of
some sort)
2) the kernel needs to react to the notification immediately, e.g. by
blocking IO to the device,
There's no way to do that, in principle, because there could be
other I/Os in flight. You might (somehow) avoid retrying an I/O
that got a UA until you figured out if something changed, but other
I/Os can already have been sent to the target, or issued before you
get to look at the status.

-Ewan
Post by Martin Wilck
3) userspace tooling such as udev or multipathd needs to figure out how
to deal with the situation cleanly, and eventually unblock it.
Wrt 1), we can only hope that it's the case. But 2) and 3) need work,
afaics.
Martin
Martin Wilck
2021-04-27 20:33:43 UTC
Post by Martin Wilck
Post by Ulrich Windl
Post by Martin Wilck
While we're at it, I'd like to mention another issue: WWID changes.
This is a big problem for multipathd. The gist is that the device
identification attributes in sysfs only change after rescanning the
device. Thus if a user changes LUN assignments on a storage system,
it can happen that a direct INQUIRY returns a different WWID than the
one in sysfs, which is fatal. If we plan to rely more on sysfs for device
identification in the future, the problem gets worse.
I think many devices rely on the fact that they are identified by
vendor/model/serial_nr, because most professional SAN storage systems
let you pre-set the serial number to a custom value; so if you want a
new disk (maybe a snapshot) to be compatible with the old one, you just
assign the same serial number. I guess that's the idea behind it.
What you are saying sounds dangerous to me. If a snapshot has the same
WWID as the device it's a snapshot of, it must not be exposed to any
host(s) at the same time with its origin, otherwise the host may
happily combine it with the origin into one multipath map, and data
corruption will almost certainly result.
My argument is about how the host is supposed to deal with a WWID
change if it happens. Here, "WWID change" means that a given H:C:T:L
suddenly exposes different device designators than it used to, while
this device is in use by a host. Here, too, data corruption is
imminent, and can happen in the blink of an eye. To avoid this, several
things are needed:
1) the host needs to get notified about the change (likely by a UA of
some sort)
2) the kernel needs to react to the notification immediately, e.g. by
blocking IO to the device,
There's no way to do that, in principle, because there could be
other I/Os in flight.  You might (somehow) avoid retrying an I/O
that got a UA until you figured out if something changed, but other
I/Os can already have been sent to the target, or issued before you
get to look at the status.
Right. But in practice, a WWID change will hardly happen under full IO
load. The storage side will probably have to block IO while this
happens, at least for a short time period. So blocking and quiescing
the queue upon a UA might still work, most of the time. Even if we
were too late already, the sooner we stop the queue, the better.

The current algorithm in multipath-tools needs to detect a path going
down and being reinstated. The time interval during which a WWID change
will go unnoticed is one or more path checker intervals, typically on
the order of 5-30 seconds. If we could decrease this interval to a sub-
second or even millisecond range by blocking the queue in the kernel
quickly, we'd have made a big step forward.

Regards
Martin
Ewan D. Milne
2021-04-27 20:41:45 UTC
Post by Martin Wilck
Post by Ewan D. Milne
There's no way to do that, in principle. Because there could be
other I/Os in flight. You might (somehow) avoid retrying an I/O
that got a UA until you figured out if something changed, but other
I/Os can already have been sent to the target, or issued before you
get to look at the status.
Right. But in practice, a WWID change will hardly happen under full IO
load. The storage side will probably have to block IO while this
happens, at least for a short time period. So blocking and quiescing
the queue upon a UA might still work, most of the time. Even if we
are already too late, the sooner we stop the queue, the better.
The current algorithm in multipath-tools needs to detect a path going
down and being reinstated. The time interval during which a WWID change
will go unnoticed is one or more path checker intervals, typically on
the order of 5-30 seconds. If we could decrease this interval to a
sub-second or even millisecond range by blocking the queue in the
kernel quickly, we'd have taken a big step forward.
Yes, and in many situations this may help. But in the general case
we can't protect against a storage array misconfiguration,
where something like this can happen. So I worry about people
believing the host software will protect them against a mistake,
when we can't really do that.

All it takes is one I/O (a discard) to make a thorough mess of the LUN.

-Ewan
Post by Martin Wilck
Regards
Martin
Martin Wilck
2021-04-28 06:30:28 UTC
Permalink
Post by Martin Wilck
There's no way to do that, in principle.  Because there could be
other I/Os in flight.  You might (somehow) avoid retrying an I/O
that got a UA until you figured out if something changed, but other
I/Os can already have been sent to the target, or issued before you
get to look at the status.
Right. But in practice, a WWID change will hardly happen under full IO
load. The storage side will probably have to block IO while this
happens, at least for a short time period. So blocking and quiescing
the queue upon a UA might still work, most of the time. Even if we
are already too late, the sooner we stop the queue, the better.
The current algorithm in multipath-tools needs to detect a path going
down and being reinstated. The time interval during which a WWID change
will go unnoticed is one or more path checker intervals, typically on
the order of 5-30 seconds. If we could decrease this interval to a
sub-second or even millisecond range by blocking the queue in the
kernel quickly, we'd have taken a big step forward.
Yes, and in many situations this may help.  But in the general case
we can't protect against a storage array misconfiguration,
where something like this can happen.  So I worry about people
believing the host software will protect them against a mistake,
when we can't really do that.
I agree. I expressed a similar notion in the following thread about
multipathd's WWID change detection capabilities in the face of really
bad mistakes on the administrator's (or storage array's, for that matter) part:
https://listman.redhat.com/archives/dm-devel/2021-February/msg00248.html
But others stressed that nonetheless we should try our best to
avoid customer data corruption (which I agree with, too), and thus we
settled on the current algorithm, which at least suited the needs of
the affected user(s) in that specific case.

Personally I think that the current "5-30s" time period for WWID change
detection in multipathd is unsafe both theoretically and practically,
and may lure users into a false sense of safety. Therefore I'd
strongly welcome a kernel-side solution that might still not be safe
theoretically, but would cover most practical problem scenarios much
better than we currently do.
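
To make the detection window concrete, here is a toy polling loop in C
(hypothetical, not multipathd's actual checker; CHECK_INTERVAL and the
reaction to a change are placeholders). A change is noticed only at the
next poll, so the window is one or more checker intervals; note also
that the sysfs "wwid" attribute itself only updates on rescan, so a
real checker would have to issue a fresh INQUIRY, as in the sketch from
my previous mail, rather than re-read sysfs:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHECK_INTERVAL 5	/* seconds, cf. multipathd's polling_interval */

static int read_wwid(const char *dev, char *buf, int len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/device/wwid", dev);
	if (!(f = fopen(path, "r")))
		return -1;
	if (!fgets(buf, len, f))
		buf[0] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	char last[256], cur[256];

	if (argc != 2 || read_wwid(argv[1], last, sizeof(last)) < 0)
		return 1;
	for (;;) {
		sleep(CHECK_INTERVAL);	/* worst-case detection latency */
		if (read_wwid(argv[1], cur, sizeof(cur)) < 0)
			continue;
		if (strcmp(last, cur) != 0) {
			/* here multipathd would fail/remove the path */
			fprintf(stderr, "%s: WWID changed: '%s' -> '%s'\n",
				argv[1], last, cur);
			strcpy(last, cur);
		}
	}
}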

Regards
Martin
--
Dr. Martin Wilck <***@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Software Solutions Germany GmbH
HRB 36809, AG Nürnberg GF: Felix Imendörffer
Ewan D. Milne
2021-04-30 23:44:48 UTC
Permalink
Post by Ewan D. Milne
Post by Martin Wilck
Post by Ewan D. Milne
There's no way to do that, in principle. Because there could be
other I/Os in flight. You might (somehow) avoid retrying an I/O
that got a UA until you figured out if something changed, but other
I/Os can already have been sent to the target, or issued before you
get to look at the status.
If something happens on the storage side where a LUN gets its
attributes changed (any of them, it doesn't matter which one), a UA
should be sent. Also, all outstanding I/Os on that LUN should be
returned with an abort, as the array can no longer guarantee the
validity of any I/O due to these changes, especially when parameters
like reservations (PRs) are involved. If that does not happen on the
array side, all bets are off, as the only way to get back in business
is to start from scratch.
Perhaps an array might abort I/Os it has received in the Device Server
when something changes. I have no idea if most or any arrays actually
do that.
But what about I/O that has already been queued from the host to the
host bus adapter? I don't see how we can abort those I/Os properly.
Most high-performance HBAs have a queue of commands and a queue of
responses; there could be lots of commands queued before we manage to
notice an interesting status. And AFAIK there is no conditional
mechanism that could hold them off (and they could be in-flight on the
wire anyway).
I get what you are saying about what SAM describes, I just don't see
how we can guarantee we don't send any further commands after the
status with the UA is sent back, before we can understand what
happened.
-Ewan
Post by Ewan D. Milne
Post by Martin Wilck
Right. But in practice, a WWID change will hardly happen under full IO
load. The storage side will probably have to block IO while this
happens, at least for a short time period. So blocking and quiescing
the queue upon a UA might still work, most of the time. Even if we
are already too late, the sooner we stop the queue, the better.
I think in most cases when something happens on the array side you
will see I/Os being aborted. That might be a good time to start doing
TURs, and if these come back OK, do a new INQUIRY. From the host side
there is only so much you can do.
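
In outline, such a probe might look like the following C sketch
(hypothetical; no such code exists in multipath-tools today). It takes
a /dev/sdX argument, polls with TEST UNIT READY via SG_IO, and re-reads
VPD page 0x83 once the LUN answers, so the caller can compare the
designators against the WWID recorded before the aborts:

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the SCSI status byte (0 = GOOD), or -1 on transport error. */
static int do_cmd(int fd, unsigned char *cdb, int cdb_len,
		  unsigned char *buf, int buf_len)
{
	unsigned char sense[32];
	struct sg_io_hdr io = {
		.interface_id = 'S',
		.cmd_len = cdb_len,
		.cmdp = cdb,
		.dxfer_direction = buf ? SG_DXFER_FROM_DEV : SG_DXFER_NONE,
		.dxferp = buf,
		.dxfer_len = buf_len,
		.sbp = sense,
		.mx_sb_len = sizeof(sense),
		.timeout = 5000,	/* ms */
	};

	if (ioctl(fd, SG_IO, &io) < 0 || io.host_status)
		return -1;
	return io.status;
}

int main(int argc, char **argv)
{
	unsigned char tur[6] = { 0 };			/* TEST UNIT READY */
	unsigned char inq[6] = { 0x12, 0x01, 0x83, 0x04, 0x00, 0 };
	unsigned char vpd[1024];
	int fd, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY | O_NONBLOCK)) < 0)
		return 1;

	/* TUR polling; CHECK CONDITIONs here also drain pending UAs. */
	for (i = 0; i < 10; i++) {
		if (do_cmd(fd, tur, sizeof(tur), NULL, 0) == 0)
			break;
		usleep(200000);
	}
	if (i == 10)
		return 1;	/* LUN did not come back: fail the path */

	/* LUN answers again: re-read the designators before trusting
	 * the path again. */
	if (do_cmd(fd, inq, sizeof(inq), vpd, sizeof(vpd)) != 0)
		return 1;
	printf("VPD 83 re-read OK, %d designator bytes\n",
	       (vpd[2] << 8) + vpd[3]);
	close(fd);
	return 0;
}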
Post by Ewan D. Milne
Post by Martin Wilck
The current algorithm in multipath-tools needs to detect a path going
down and being reinstated. The time interval during which a WWID
change will go unnoticed is one or more path checker intervals,
typically on the order of 5-30 seconds. If we could decrease this
interval to a sub-second or even millisecond range by blocking the
queue in the kernel quickly, we'd have taken a big step forward.
Yes, and in many situations this may help. But in the general case
we can't protect against a storage array misconfiguration,
where something like this can happen. So I worry about people
believing the host software will protect them against a mistake,
when we can't really do that.
My thought exactly.
Post by Ewan D. Milne
All it takes is one I/O (a discard) to make a thorough mess of the LUN.
-Ewan
Post by Martin Wilck
Regards
Martin
--
dm-devel mailing list
https://listman.redhat.com/mailman/listinfo/dm-devel