Should values be well-typed? #561

ia0 opened this issue Mar 24, 2025 · 8 comments


ia0 commented Mar 24, 2025

The UCG currently requires that produced values satisfy their validity invariant. This issue asks whether the UCG should also recommend that produced values satisfy their safety invariant. Here are examples of values that would follow the requirement but not the recommendation:

  • A &str whose contents are not valid UTF-8.
  • A &mut T that points to pinned (or aliased) data.
  • A &mut [u8] that points to uninitialized memory.
  • A &'static T that does not actually live for 'static.

Those values can be considered "unsafe" because they cannot be passed to arbitrary safe code (i.e., safe code outside the scope of unsafe) at their current type without possibly breaking soundness.
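
As a concrete illustration of the first example, here is a minimal sketch (not taken from the issue) in which the produced reference is non-null, aligned, and dereferenceable, so it satisfies the validity invariant of &str, while its bytes are not UTF-8, so it does not satisfy the safety invariant:

```rust
fn main() {
    let bytes: &[u8] = &[0xFF, 0xFE];
    // Calling `from_utf8_unchecked` on non-UTF-8 bytes violates that function's
    // documented safety contract (library UB) and produces a `&str` that
    // satisfies its validity invariant but not its safety invariant -- exactly
    // the kind of value this issue is about.
    let s: &str = unsafe { std::str::from_utf8_unchecked(bytes) };
    // Passing `s` to arbitrary safe code (e.g. `s.chars()`) could then lead to
    // language UB; here it is never used.
    let _ = s;
}
```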

The decision is not obvious, because there are already many violations of this recommendation (and more are being added). Beyond this legacy issue, there is a non-trivial trade-off between making the language simpler to explain on one side and harder to use on the other (most probably why there are so many violations in the first place).

Difference with library UB

This issue is not about the UCG recommending that programs not have library UB; that would be a different issue.

Library UB occurs when the safety contract of a library API is violated. This issue is about producing a value that does not satisfy its safety invariant. It is independent of the notion of library APIs and their safety contracts; it is a notion of the type system, in particular of types (possibly library types) and their safety invariants.

Language simpler to explain

The rules for unsafe Rust are simpler for users who follow recommendations, because they don't need to know about the validity invariant (since you can't produce a value violating the validity invariant without also violating the safety invariant). This is valuable for a few reasons:

  • The validity invariant is unstable while the safety invariant is stable
  • The validity invariant doesn't leak to users, letting language designers do what they want with it
  • Users can get confused with 2 invariants (which one is UB? which one is maybe later UB to violate?)
  • Most type systems have only one notion of type invariant

The language is also simpler to make sense of. There is no need to account for ill-typed values (no need for an update type when a refinement type would suffice, see also this thread).

This would also be a way to provide partial guarantees about validity invariants in the form of new and weaker types whose safety invariant matches the partial guarantees we want to provide for that validity invariant. Note that multiple types (of the type system) can have the same underlying calculus type (the opsem concept that gives the notion of validity invariant). This is for example the case with pointers. This gives programmers access to a hierarchy of types with different safety invariants for the same underlying calculus type, to choose the most appropriate type for the needed operation (see next section).
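
For example, here is a minimal sketch of the pointer case: *const T, NonNull<T>, and &T all share the same underlying calculus type (a pointer), but carry successively stronger safety invariants.

```rust
use std::ptr::NonNull;

fn pointer_hierarchy(x: &i32) {
    // The same underlying pointer, viewed at three types with different
    // safety invariants:
    let raw: *const i32 = x;                      // may be null, dangling, unaligned
    let nonnull: NonNull<i32> = NonNull::from(x); // non-null, but may dangle
    let shared: &i32 = x;                         // non-null, aligned, dereferenceable,
                                                  // pointing to a live, valid i32
    let _ = (raw, nonnull, shared);
}
```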

Language harder to use

The type of an ill-typed value needs to be changed (weakened in the hierarchy of types) until the value becomes well-typed at that new type. In the initial examples, that would be:

  • Using &[u8] instead of &str.
  • Using Pin<&mut T> or NonNull<T> instead of &mut T.
  • Using &mut [MaybeUninit<u8>] instead of &mut [u8].
  • Using NonNull<T> or &'a T (with a correct and possibly too strict 'a) instead of &'static T.

Besides the inconvenience of propagating those changes (when starting from an ill-typed program), one needs to find a suitable weakened type (even when starting a program from scratch). To maximize the chances of finding a suitable type, the hierarchy of types needs to be as fine-grained as possible and extend all the way up (or down, depending on how you look at it) to the most weakened types, namely the calculus types. The situation has already improved in that regard, for example with the introduction of MaybeUninit, and will probably continue, for example with UnsafeAlias. One can hope it keeps improving, so that a suitable type can always be found without having to weaken too much, e.g. being forced to use *mut T because NonNull<T>, which would suffice, does not exist.

Values are usually only temporarily ill-typed, since they should eventually reach safe code. When values cross a safe API, their type may need to be adjusted so that they can be used at their maximum capability without unsafe. For example, if unsafe code took a &mut [MaybeUninit<u8>] as an argument, it could return a &mut [u8] to the initialized part as a convenience to its users, maximizing what a safe user can do compared to an unsafe user.
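
A minimal sketch of that pattern (the helper name and the zero-filling are made up for illustration): the unsafe code accepts the weaker type &mut [MaybeUninit<u8>], initializes a prefix, and returns the stronger type &mut [u8], so its safe callers never need unsafe.

```rust
use std::mem::MaybeUninit;

fn fill_prefix_with_zeros(buf: &mut [MaybeUninit<u8>], n: usize) -> &mut [u8] {
    let n = n.min(buf.len());
    for slot in &mut buf[..n] {
        slot.write(0); // initialize this element
    }
    // SAFETY: the first `n` elements were just initialized above, and
    // `MaybeUninit<u8>` has the same layout as `u8`.
    unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr().cast::<u8>(), n) }
}
```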

Changing the type of a value crossing a safe API is only possible in some situations. If the type is under a type constructor, one needs layout guarantees like those of slices. If unsafe code receives a Foo<Dirty>, cleans it, and returns a Foo<Clean> (where Clean is a semantic subtype of Dirty), it can't simply transmute the value if Foo doesn't provide any layout guarantees. It needs to deconstruct and reconstruct the value.
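
A minimal sketch with hypothetical types (Foo, Dirty, and Clean are made up): in the absence of layout guarantees on Foo, the value is taken apart and rebuilt at the stronger type rather than transmuted.

```rust
struct Dirty(Vec<u8>); // bytes with no promised property
struct Clean(Vec<u8>); // bytes promised (say) to be valid UTF-8

struct Foo<T> {
    payload: T,
    count: usize,
}

fn clean(foo: Foo<Dirty>) -> Option<Foo<Clean>> {
    // No layout guarantee on `Foo`, so no transmute: deconstruct...
    let Foo { payload: Dirty(bytes), count } = foo;
    if std::str::from_utf8(&bytes).is_ok() {
        // ...and reconstruct at the new type.
        Some(Foo { payload: Clean(bytes), count })
    } else {
        None
    }
}
```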

Those difficulties seem quite substantial, for multiple reasons:

  • We are used to ignoring them.
  • The type hierarchy is not granular enough.
  • We lack subtyping relations in the hierarchy.
  • We lack strong updates.

The main question (at least for me) is whether we believe those difficulties will be lifted over time and it is already time to start practicing type hygiene, or whether we have to wait and keep wearing our cowboy hats.

Terminology

The term EB (#512) could be generalized the same way the term UB has been. We would get:

  • UB = a requirement of an X safety contract was violated (you get no guarantees)
  • EB = a recommendation of an X safety contract was violated (you get guarantees but you're operating outside recommendations)
  • X can be calculus, type system, language, library, program, environment, anything that can have a contract
  • When "X = language", those requirements and recommendations must be checkable

The UCG rules mentioned at the beginning of the issue would read:

  • Producing a value violating its validity invariant is language UB
  • Producing a value violating its safety invariant is type system EB

RalfJung commented Mar 24, 2025

Pretty hard "no" for this from me. "Erroneous behavior" should be defined in the operational semantics and checkable with sanitizers like Miri. "Does this type satisfy its library/safety invariant" is in many cases not decidable (e.g. for f: fn(i32) -> i32), and in other cases it's entirely unclear how to check it operationally (e.g. for x: &mut i32).


ia0 commented Mar 24, 2025

"Erroneous behavior" should be defined in the operational semantics

This is type system EB, not opsem EB. This is similar to the situation we have with library UB (not checkable) and opsem UB (checkable). (I'm using "opsem" instead of "language" to be clear that I mean calculus.)

And regardless of this "not opsem" consideration, once we have contracts support, some people start using it, and some of those contracts are checkable, we will be able to check both library UB and type system EB (to the extent that they are checkable and that people use them).


RalfJung commented Mar 24, 2025

I don't understand what "type system EB" should be or mean. You pointed at #512, which is strictly an opsem concept. If I make a guess, it sounds like it would require e.g. methods on str to either abort execution on non-UTF-8 inputs or behave in some well-specified manner (much like how opsem EB requires the program to either behave in some well-defined, clearly documented way or abort execution). That runs completely counter to the point of the UTF-8 invariant; we have existing code which will cause UB if the input is not UTF-8.

Apart from the status quo, we could make ill-typed values UB instead of just EB.

The status quo is that violating the library invariant is library UB. So I don't know what you are trying to imply here about the status quo. Saying that this is EB means providing more guarantees to users than we currently do -- guarantees we do not want to provide.

Or maybe you're using the terms EB and "ill-typed values" in completely different ways, I can't tell.


ia0 commented Mar 24, 2025

You pointed at #512, which is strictly an opsem concept.

I might have generalized EB the same way UB was generalized:

  • UB = a MUST in the contract was violated (you get no guarantees)
  • EB = a SHOULD in the contract was violated (you get guarantees but you're operating outside recommendations)

it sounds like it would require e.g. methods on str to either abort execution on non-UTF-8 inputs or behave in some well-specified manner

That's not what I'm suggesting for 2 reasons:

  • "methods on str" is a library concept, ill-typed values is a type system concept (so it's not just at function calls)
  • declaring something EB doesn't change/require any behavior, it's just a recommendation

Declaring ill-typed values EB means that if a value in your program is ill-typed (not necessarily when calling a library function), you are operating outside recommendations. The behavior is the same as today.

The status quo is that violating the library invariant is library UB.

This is only true when calling a function (that is not robust). Saying that ill-typed values are EB covers the remaining cases, where Γ ⊢ e : τ is your typing derivation and e is not directly a function call argument. I guess that's what you were missing: this only makes a difference for ill-typed values that are not function call arguments.


ia0 commented Mar 24, 2025

Thinking about it, a much simpler way to formulate it, is to draw the parallel with language UB:

  • Producing a value that violates the validity invariant is language UB.
  • Producing a value that violates the safety invariant is type system EB (or library EB for user-defined types).

EDIT: I've updated OP with the feedback received so far.

CAD97 commented Mar 25, 2025

First off: violating API contracts is erroneous behavior. If the violation does not require using unsafe, then it must only be EB if the system is sound. If the violation required an incorrect unsafe assertion of some program property, then the violation may be undefined behavior independently of being erroneous.

Aside: I believe the main disconnect between how you're conceptualizing things and how Ralf and I are conceptualizing the same concepts is that you're trying to extend the concepts which we're applying exclusively to soundness w.r.t. automated compiler optimizations (more often called memory safety) in order to also reason about arbitrary program correctness properties. You do this primarily for the purpose of formally reasoning that arbitrary program correctness feeding back into memory safety still maintains program soundness, but this overgeneralization makes it much more difficult to emphasize the most important parts of direct soundness.

Soundness can rely on correctness. It isn't necessary to have one unified system which can formally reason about them as one thing.

The terms UB (undefined behavior) and EB (erroneous behavior) are to be understood in their generalized meaning

TL;DR: your generalization of EB is incorrect, as I understand EB to be defined precisely so that it is useful as a tool. What you are describing is instead almost exactly what #539 proposes to call "language UB" and "library UB" as they are handled by tooling for arbitrary user code.

Reframing Library UB as Erroneous Behavior is definitely interesting — from the language and compiler view, library UB "doesn't exist" and still has defined behavior, so EB may actually be a better description of what is occurring; any user defined invariants cannot add new cases of UB to the language, only new erroneous states.

However, a strong part of the reason that EB is useful as a tool is that it is not UB. If we say that "using"[1] a &str that contains non-well-formed UTF-8 is EB, we're also saying that an allowed outcome of EB is UB (at a later point in time). But if UB is a possible outcome, that's not EB, it's "latent," "deferred," or "library" UB.

The allowed outcomes of EB as discussed in #512 are either a diagnosis (as defined by the sanitizing implementation; it may be a program crash or halt, a panic!, or perhaps other things we have yet to conceptualize, but critically, UB is not an allowed diagnosis) or the specified defined behavior. After that point, if it wasn't diagnosed, the EB doesn't exist anymore, and we only have the defined fallback behavior left to influence program behavior moving forward.

Users can get confused with 2 invariants (which one is UB? which one is fine to violate?)

It isn't "fine" to violate a library invariant. If you have a side-channel which provides you with a weaker library invariant to work with (usually called "relying on implementation details), then it can seem "fine" to violate what the library invariant is for other people, but you aren't ever actually violating invariants, you're simply just working with a different set of invariants.

The difference is only one thing: the full set of library invariants is extensible, and thus tooling cannot diagnose library invariants unless you tell it how to do so.

It may seem like side-channel guarantees allow you to violate the library invariants, but not any language invariants. But this isn't the case; we have explicitly reserved the right to make cases of library UB into defined behavior in the future (your correct program must not rely on absence of language UB to prove its own correctness), and there are multiple cases of Rust-level UB which we do not currently exploit nor tell LLVM about, where you could rely on the side channel of knowing implementation details to write a program that works as long as it is only compiled under a toolchain that uses those same implementation details.

Should ill-typed values be considered erroneous behavior?

Generally? No. If some sort of future contracts functionality is applied to the type? In that case, likely yes. But this also allows the user to in turn distinguish between the different levels that the invariants may exist at.

Pinned &mut T cannot be considered EB without massive breaking changes to how pinning works, due to how Pin::as_ref goes through DerefMut.
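
A minimal sketch of that constraint (NotUnpin is a made-up type): Pin<&mut T> stores a &mut T to pinned data, and the safe method Pin::as_mut, which is bounded on DerefMut, reborrows it even for !Unpin types, so a rule treating such a &mut T as ill-typed would reach into ordinary safe pinning code.

```rust
use std::marker::PhantomPinned;
use std::pin::Pin;

struct NotUnpin(PhantomPinned); // `!Unpin`, so pinning it actually matters

fn reborrow(mut x: Pin<&mut NotUnpin>) {
    // Safe code: `as_mut` is bounded on `DerefMut` and internally reborrows the
    // inner `&mut NotUnpin` that points at pinned data.
    let _reborrowed: Pin<&mut NotUnpin> = x.as_mut();
}
```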

Footnotes

  [1] Here, the term "using" is deliberately left ambiguous.

ia0 changed the title from "Should ill-typed values be considered erroneous behavior?" to "Should values be well-typed?" on Mar 26, 2025

ia0 commented Mar 26, 2025

I updated OP to account for your feedback. The main changes are:

  • Avoid using EB until the generalization is approved.
  • Explain how this differs from library UB.
  • Explain how this would provide partial guarantees for the validity invariant (that's independent of your feedback).

First off: violating API contracts is erroneous behavior.

This issue is not about violating API contracts, but about producing a value that violates its safety invariant. This doesn't only happen at API boundaries. I consider this part of the type system's contract with the programmer because it's about types and their safety invariants. This is why I called it "type system EB". What you are referring to is "library UB". And I guess you are saying that library UB is (or could be) language EB. But that would be a different issue.

makes it much more difficult to emphasize the most important parts of direct soundness.

It seems to me you're confusing soundness and language UB. There is no such thing as "direct soundness". It's simply language UB. Soundness is the type system property that well-typed programs don't have language UB. Rust also uses the word soundness for programs, which can either mean that the program doesn't have language UB, or a proof of that fact. The distinction you want to make is, in the context of an operation at a specific program point, between the following:

  • There is language UB at that specific program point for that operation.
  • There is some violation at that specific program point for that operation such that there will (or there may, if "soundness of a program" is just about the proof) be language UB at a possibly different program point for a possibly different operation. This is library UB and not the subject of this issue.

Soundness can rely on correctness. It isn't necessary to have one unified system which can formally reason about them as one thing.

The fact that one part of a program relies on guarantees of another part to ensure that the whole program does not have language UB is not correctness. Not all guarantees are about correctness. In particular, those guarantees (used to prove the absence of language UB) are usually weaker than those used to prove that the program is correct.

TL;DR: your generalization of EB is incorrect as I understand the tool of EB to be defined such that it may be useful.

No, the generalization is perfectly parallel with UB:

|                                            | UB                        | EB                        |
| ------------------------------------------ | ------------------------- | ------------------------- |
| language                                   | must be checkable         | must be checkable         |
| anything else (library, type system, ...)  | not necessarily checkable | not necessarily checkable |

It isn't "fine" to violate a library invariant.

My bad, that wording was meant in opposition to the (immediate) UB of the first question. I changed it to "maybe later UB" to make that clear.

Pinned &mut T cannot be considered EB without massive breaking changes to how pinning works, due to how Pin::as_ref goes through DerefMut.

Yes, I believe Pin is a major problem at this time. It is unsound in many different ways and will need heavy fixing regardless of the decision here. Pin is actually why this all started. Pin would indeed be a good reason to delay such a decision if we don't want to make it violate UCG recommendations, besides the fact that it already violates UCG requirements (to avoid making it worse).


ia0 commented Mar 26, 2025

I suspect a possible source of confusion is that the unsafe code guidelines repository has been used almost exclusively for opsem rules, so it has become the default assumption that issues here are about opsem rules. This issue is not about opsem rules; it is about unsafe code guidelines, namely how to write unsafe code. It is not about which unsafe code has language UB (which is a narrower sense of how to write unsafe code).

It was my mistake to mention erroneous behavior, since its current narrow definition is exclusively about opsem. That mention is now removed, so hopefully this source of confusion is gone. But I feel that clarifying the reasoning may help too.
