Does re-use of the same MIME types constitute a breaking change? #141

RubenVerborgh opened this issue Jan 14, 2025 · 24 comments

@RubenVerborgh
Member

RubenVerborgh commented Jan 14, 2025

Summary

In rdfjs/N3.js#484, I learned that the specifications intend to redefine the set of valid documents under the text/turtle media type (and presumably others).

Such a change might not be possible or desirable; at the very least, it should be acknowledged as a breaking change.

Definitions

  • text/turtle as the media type defined by https://www.w3.org/TR/turtle/
  • valid-turtle as the (infinite) set of valid Turtle 1.1 documents
  • invalid-turtle as the (infinite) set of documents that are not in valid-turtle
  • spec-compliant Turtle parser as a piece of software that:
    • for each document in valid-turtle, produces the corresponding set of triples
    • for each document in invalid-turtle, rejects it (possibly with details on the syntax error)
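
For concreteness, a minimal member of each set (illustrative examples of mine, not taken from the official test suite):

```
# in valid-turtle: parses to exactly one triple
<#s> <#p> "o" .

# in invalid-turtle: literals cannot be subjects in the 1.1 grammar,
# so a spec-compliant parser must reject this document
"o" <#p> <#s> .
```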

Note that the above definition includes rejection; the 1.1 specification text does not require rejection, but its test cases do.

Potential problems

  1. Retroactively changing the definition of text/turtle breaks existing spec-compliant Turtle parsers, as they will incorrectly label valid text/turtle documents as invalid.
  2. There is no way to distinguish Turtle 1.1 from Turtle 1.2.
     • While 1 could be argued away as "1.1 parsers only break on 1.2 Turtle", it's a problem that the parser cannot tell you why it breaks. Does it break because the document is invalid Turtle 1.1? Because it is valid Turtle 1.2? Because it is invalid Turtle 1.2, despite intending to be within the 1.1 subset? In other words: should or shouldn't it have worked with this particular text/turtle document, given no other context?
  3. Building on 2, neither new nor old parsers will be able to fully automatically validate Turtle documents, since they need to be told out of band whether to validate for 1.1 or 1.2.
  4. Because of the closed-set nature of text/turtle in the Turtle 1.1 spec, any changes to that set (whether deletions or additions) would contradict the Turtle 1.1 spec itself and render it invalid.
  5. The problem will happen again in RDF 1.3.
  6. As a more specific instance of 5, there is no standards-based way for clients or servers to indicate that they only support Turtle 1.1, nor to discover whether recipients support Turtle 1.1 or 1.2 (or 1.3), as Accept: text/turtle does not tell them. Nor does Content-Type: text/turtle tell them whether their parser can handle the contents, and we could be 20 gigabytes in before we notice it doesn't. (See the illustration right after this list.)
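
To make 6 concrete, consider this minimal HTTP exchange; the response body uses the base-direction literal syntax from the current 1.2 draft, and is shown purely as an assumed illustration:

```
GET /dataset HTTP/1.1
Accept: text/turtle

HTTP/1.1 200 OK
Content-Type: text/turtle

<#s> <#p> "hello"@en--ltr .
```

Nothing in the request or the response signals a version; a 1.1 parser only discovers the problem when it hits the literal.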

Analysis

Unlike formats such as HTML, Turtle 1.1 contains no provisions for upgrading. The specification assumes a closed set of valid documents. We find further evidence in a number of bad test cases (https://www.w3.org/2013/TurtleTests/), which explicitly consider more permissive parsers to be non-compliant.

There is a note in the spec (but only a note, and thus explicitly non-normative):

This specification does not define how Turtle parsers handle non-conforming input documents.

but this non-normative statement is contradicted by the bad test cases, which parsers need to reject in order to produce a compliant report.

Although the considered changes for 1.2 are presumably not in contradiction with those bad cases, the test suite was not designed to be exhaustive. Rather, the 1.1 specification considers text/turtle to be a closed set, and the test cases consider a handful of examples to verify the set is indeed closed.

In particular, no extension points were left open on purpose.
Therefore, the 1.1 spec is not only defining “Turtle 1.1”, but also strictly finalizing text/turtle.

(The IANA submission's reservation that "The W3C reserves change control over this specifications [sic]." does not change the above arguments.)

Potential solutions

A set of non-mutually exclusive solutions, which each cover part or all of the problem space:

  1. Factual disagreements with the above.

  2. The introduction of a new media type.

  3. The introduction of a new profile on top of the existing text/turtle media type.

  4. A change to the Turtle 1.1 spec that adds extension points or otherwise opens the set of text/turtle.

  5. Syntactical support in Turtle 1.2 for extension and/or versioning.

@afs
Contributor

afs commented Jan 14, 2025

Thank you for the analysis.

We do have https://www.w3.org/TR/rdf12-turtle/#changes-12 and the matter will be in "RDF 1.2 New".

The WG is discussing levels of conformance.

@afs
Contributor

afs commented Jan 14, 2025

Related:
There are links to specific versions of documents for both RDF and SPARQL:

https://www.w3.org/TR/rdf12-turtle/ -- currently, the 1.2 working draft. This will be the REC when published. Title "RDF 1.2 Turtle".
https://www.w3.org/TR/rdf11-turtle/ -- The RDF 1.1 published standard. Title "RDF 1.1 Turtle".
https://www.w3.org/TR/rdf-turtle/ -- Tracks the latest publication. Currently, 1.1.
https://www.w3.org/TR/turtle/ -- old name, tracks "rdf-turtle".

@RubenVerborgh
Member Author

The WG is discussing levels of conformance.

Interesting; may I suggest a standards-based mechanism for agents to indicate this level? (A media type or profile comes to mind.)

Or would “classic conformance” de facto amount to parsing only the RDF 1.1 subset (in which case it would be equivalent to one of the points above)? This does not seem to be the case, however, with for instance base directions being added to literals (in which case “classic” might be a confusing or misleading term).

@afs
Contributor

afs commented Jan 15, 2025

[This is not a WG response]

Any approach for versioning can have costs on both the reader-side and the writer-side.

For example, anything in the HTTP header that makes the data consumer's task easier puts a requirement on the data producer. In the same way that RDF 1.2 syntax can appear a long way into the delivered stream for a reader, having an HTTP header carry the information makes the writer's life harder, because the writer may need to see all the data first: there is no stream writing without tracking which version appears in the data, which would itself be a producer-side burden.

One way to publish data is to use a web server's support for mapping file extensions to Content-Type headers -- .htaccess (httpd), types {} (nginx), etc. The same situation arises with data dumps in archives such as zip.

Today, a toolkit may need to "know" to look at the URL to determine the content type when no reliable Content-Type is available.

Given the file-extension situation, I think any solution will not help RDF that much. Software will want to handle the static/non-profile/file-extension/... cases anyway. Only a domain-specific deployment (i.e. one that controls both consumer and producer) can be sure the global rules are in play.

There is a trade-off as to whether the long-term continued provision of a migration solution is a greater burden than the evolution itself. Such a migration solution should never be withdrawn -- "The web is not versioned".

@RubenVerborgh
Member Author

Thanks, @afs. I want to leave space for others so will be brief, but quickly:

  • Not explicitly indicating feature/version/… support also incurs costs.
  • Your answer covers the case where such indications happen in the message headers; different arguments and trade-offs apply when they happen in the body. As a quick example, a first-line @version 1.2 or @features literal-direction would cause a desired fail-fast on 1.1 parsers, and assist 1.2 and future parsers (see the sketch after this list).
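
To illustrate the fail-fast behavior, here is a minimal sketch in TypeScript; the @version directive and this API are hypothetical, not part of any Turtle specification:

```typescript
// Hypothetical fail-fast check: inspect the first line of a Turtle document
// and reject declared versions that this parser does not implement,
// before any triples are parsed.
const SUPPORTED_VERSIONS = new Set(["1.1"]);

function checkVersionDirective(firstLine: string): void {
  // Assumed directive shape: `@version 1.2` as the document's first line.
  const match = /^@version\s+(\d+\.\d+)/.exec(firstLine.trim());
  if (match && !SUPPORTED_VERSIONS.has(match[1])) {
    throw new Error(
      `Document declares Turtle ${match[1]}; this parser supports only: ` +
        [...SUPPORTED_VERSIONS].join(", ")
    );
  }
  // No directive: fall back to the implicit version (e.g. "latest REC"),
  // which is exactly the ambiguity discussed in this thread.
}

try {
  checkVersionDirective("@version 1.2");
} catch (e) {
  console.error((e as Error).message); // fails fast on a 1.1-only parser
}
```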

@niklasl

niklasl commented Jan 16, 2025

Maybe an optional version or feature declaration, to support fail-fast detection? With the implicit being "latest REC". It should perhaps be clearly stated that implementations are required to follow the evolution of the format; with the reciprocal requirement of evolving the format responsibly, aspiring to standardize once "sufficient" implementation coverage has been established. AFAIK, there is a requirement of multiple independent implementations; perhaps that number should be a function of the "cardinality of known deployments" and "how viable it is to upgrade them"? (I know it is a practical impossibility to quantify that on the web scale, but it goes to show awareness of the complexity underlying these judgement calls. And that we (W3C members) have a responsibility to care and cater for cooperative evolution to ensure web interop.)

I think this follows the conventions @afs referenced, which is a trade-off I'm cautiously in agreement with. Defining a new format (mime-type + suffix) is the only other viable option AFAICS; and while that caters for more overlap in deployments, it also induces a certain inertia and growing technical debt. (When is the previous format "sunset"? How is the data quality impacted during the overlap period? How do applications take the difference in expressivity into account?)

I see no practical way around some form of social contracts, as even content negotiation is not merely technical (q=0.9 ...). The most important contract is for publishers to avoid utilizing new features until their consumers have been notified and been able to upgrade; balanced with the need for precision in the domain of discourse among those who already have (we form a web after all).

@RubenVerborgh
Member Author

With the implicit being "latest REC"

There is a trade-off of whether the long term continued provision of a migration solution is a greater burden than the evolution itself.

"The web is not versioned".

The key difference being that—for example—HTTP, HTML, and CSS have explicit behaviors on how to deal with unsupported constructs. HTTP proxies have rules on how to deal with unknown headers, HTTP has version negotiation, HTML has rules for unknown tags and attributes, CSS has rules for unsupported properties and even syntax.
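
For instance, CSS's error-handling rules make forward-compatible authoring possible; this well-known fallback pattern (shown only to illustrate the design principle) stays safe on old parsers:

```css
.panel {
  /* A parser that does not understand the second declaration simply
     ignores it and keeps the first: unknown values are dropped,
     not fatal, so newer syntax degrades gracefully. */
  background-color: purple;
  background-color: color-mix(in oklab, purple, white 20%);
}
```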

So the Web's ability to be non-versioned is baked into the design of those technologies. Conversely, RDF adopting the non-versioned philosophy does not equate to doing nothing on the feature-support/versioning front, but rather means being very explicit about how non-versioning is to be made possible.

In summary, not doing anything puts us on neither a versioned nor a non-versioned trajectory. The two are not binary opposites: there is a third option, “incompatible with both versioning and non-versioning”, and it is the unfortunate default choice.

@lisp

lisp commented Jan 16, 2025

to take a concrete example as precedent, i do not recall that, in the transition from sparql 1.0 to sparql 1.1, the continued use of the same media type designators was problematic.

in what sense, other than the concern about "late failure" for large documents, should that matter for document media types?
the notion that 1.2 documents would be marked may seem attractive, but failing early would still require a change to import control flow.
and that, in a situation where the inability to modify deployed 1.0 version resources is central to the problem.

@RubenVerborgh
Member Author

to take a concrete example as precedent, i do not recall that, in the transition from sparql 1.0 to sparql 1.1, the continued use of the same media type designators was problematic.

Apples and oranges.
SPARQL is not a data language, nor is it problem-free.
The context of a query language is very different, including:

  • limited average and typical document length
  • different consequence of failure, with immediate and specific feedback
    • failure is in fact sometimes triggered deliberately for endpoint feature discovery
  • absence of streaming parsing
  • different reuse context: individual queries tend to be sent to specific endpoints

So the upgrade path of SPARQL is much more similar to that of SQL, with similar challenges and non-issues.
Not comparable to that of HTTP, HTML, CSS, RDF.

And quite a pain in practice: one typically needs to know out-of-band what precise SPARQL endpoint software an interface is running, which determines how well certain SPARQL 1.0 or 1.1 features are supported.

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.
If anything, let's not go the SPARQL route.

in what sense […] should that matter for document media types?

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

In the pre-1.1 days, “Turtle” had been around as a format for over a decade, and parsers were incompatible with each other. It was quite the nightmare, trying to exchange data or write parsers. There was no established (let alone standard) way of knowing what subset was supported by everyone. The Turtle standard solved this by bringing certainty about what is and isn't text/turtle.

The proposed re-definition of text/turtle without any explicit indication sends us back on a path where parsers may or may not be compatible with a certain Turtle version, and they can't even tell us. We cannot ask servers or clients; we have to know what software they are running. Not exactly the automated interoperability goal.

other than the concern about "late failure" for large documents

One might not even know. One could've parsed a 1.2 document wrongly without ever knowing. One could've rejected or accepted a document based on the wrong assumption (because assumptions are all you have, in band). One doesn't know if downstream systems are compatible with 1.1 or 1.2, because they can't tell.

It's an absolute interoperability nightmare that systems don't even have the words to express what they do and do not support. In a context where we're advocating for semantic interoperability, failing at syntactic interoperability is a serious flaw from a technical and strategic perspective. It adds a serious degree of brittleness, the details of which only a small group of people understand, which carries a major risk of reflecting badly on RDF as a whole for not being a sustainable—let alone interoperable—technology. People will say that RDF doesn't work reliably across systems, and they will be right.

@lisp

lisp commented Jan 16, 2025

SPARQL is not a data language, ...

that may be, unless one is concerned with sparql processors.

@lisp

lisp commented Jan 16, 2025

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

we agree - vehemently.
as much as ambiguous recommendations are not the answer, neither is error signalling and handling.
would graph store protocol endpoint service descriptions provide sufficient information to the architectures which you envision, in order for them to more effectively control requests?

@Ostrzyciel

2 cents from someone who did implement a non-standard RDF format that has an analogue of Ruben's proposed @version 1.2 or @features literal-direction – it sounds like a nice idea, but implementing a serializer that would reliably set such flags is a pain. You essentially need to predict the future: "will this document need 1.2 features or not?" This may seem like a trivial question if we are dealing with a small piece of metadata on the Web, but is completely impossible if we have something like a database dump or any other long stream of data.

I ended up making the serializer always claim that all features are used by default. Then, it's up to the user to tell the serializer that "this and that" feature won't be needed. This creates an obvious compatibility problem, because parsers will simply refuse to read these files, even though in practice the feature may not be used. I have not found a better solution to this problem. I think this is a sensible compromise for my ugly format, but I would be against this in W3C formats. More details here.
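
A minimal sketch of that compromise (hypothetical option names; not the actual implementation referenced above):

```typescript
// Hypothetical streaming-serializer options: claim every feature by default,
// and let the user explicitly rule out the ones the stream will not need.
interface SerializerOptions {
  usesBaseDirections?: boolean; // e.g. literals with base directions
  usesTripleTerms?: boolean;    // e.g. RDF 1.2 triple terms
}

function declaredFeatures(opts: SerializerOptions = {}): string[] {
  // Safe default: assume a feature is used unless explicitly disabled,
  // because a streaming writer cannot look ahead to verify.
  const features: string[] = [];
  if (opts.usesBaseDirections !== false) features.push("literal-direction");
  if (opts.usesTripleTerms !== false) features.push("triple-terms");
  return features;
}

console.log(declaredFeatures());
// → ["literal-direction", "triple-terms"]: maximally cautious, minimally compatible
console.log(declaredFeatures({ usesTripleTerms: false }));
// → ["literal-direction"]: the user took responsibility for narrowing the claim
```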

Overall, I think a sensible solution would be to embrace the mess and just live with the fact that RDF formats can evolve. I would also like to ask the WG to kindly consider producing some "best practices" for how to mark that an RDF file is 1.2, in a use case-specific manner. I like the suggestion from @lisp for adding some info in graph store protocol descriptions. I'm also curious if something like a non-mandatory HTTP header would be an option. Or maybe a comment at the start of the file (like a shebang in .sh files) – of course, entirely optional. (disclaimer: I did not think these ideas through, they may be VERY bad)

@HolgerKnublauch

Intuitively, it sounds to me like TTL documents that use any of the new features need a new media type and file ending.

@lisp

lisp commented Jan 16, 2025

I'm also curious if something like a non-mandatory HTTP header would be an option.

legacy software will not see them.
placing them such that the control flow of those components would have to be aware of them is not effective.
a service description, based on which a higher-level process can orchestrate operations, would be much more effective.

@namedgraph

namedgraph commented Jan 16, 2025

Isn't the situation with Turtle 1.1 and Turtle 1.2 a bit like Turtle and TriG? In both cases the former syntax is a subset of the latter.
With Turtle and TriG we got distinct media types (text/turtle and application/trig to be exact). Why shouldn't the same apply to Turtle 1.2?

@dr0i

dr0i commented Jan 16, 2025

Consuming data that is suddenly Turtle 1.2 (arriving with the unchanged media type text/turtle), which now breaks my formerly working Turtle parser (say, a widely used library), is like an API break resulting in a non-working program.
So this is bad.
To avoid this, developers ship different versions of their libraries over time, using semantic versioning to mark which releases contain API breaks and which should remain compatible.
It's unlikely that data deliverers would provide different Turtle versions even if there were an HTTP header (or another mechanism) for that.
I ACK the problem, but tend to see it like @niklasl ("I see no practical way around some form of social contracts").
(BTW, even when we only change our data schema, not the RDF version, we call this out as a possible API break to our customers, as even this can break consumers' programs.)

@coolharsh55
Copy link

Hi. My thoughts on this from a practicality perspective: I echo Ruben's argument that we should be aiming to support interoperability and backwards compatibility - especially when we know exactly how and why an existing system will break due to new changes. For Turtle, the mime type can be versioned - there is precedent for this if we look at existing mime types.

If we don't version the mime type, existing systems will break. They will need to be updated to support Turtle 1.2. There is no way to distinguish between Turtle 1.1 and Turtle 1.2, so there is no way for them to silently fail on or ignore Turtle 1.2. There is also no way to fail with context, i.e. to report failure because the parser doesn't handle Turtle 1.2 - it will fail identically for valid Turtle 1.2 and for invalid Turtle 1.1. So this is not a trivially fixable change. Not desirable IMO.

If we do version the mime type, existing systems will not break. If they have to support Turtle 1.2, then they MUST be updated anyway, and hence there is an opportunity for these systems to add the mime-type handling alongside the Turtle 1.2 handling changes. It might result in some extra work, and potentially some complex cases in the mime-type handling. However, we know for sure that existing systems won't break (assuming the mime type is used as intended here), and if they do receive an incorrectly assigned mime type, the fix is simply to use the correct one. So this should be the desirable state.

This also brings up the question of what should happen when Turtle 1.3 eventually is required. Again versioning the mime type is an option, but pragmatically, having the version in the document itself is the best forwards-compatible solution and a known best practice. It would be ideal to have it here.

@rubensworks
Member

rubensworks commented Jan 17, 2025

Another important consideration here is the length of Accept headers when doing requests within a browser.

Long accept headers in browsers are problematic

The Fetch spec (CORS section) specifies that each header (including the Accept header) is limited to 128 characters.
But even this limit is already causing issues in practice when just taking into account today's RDF media types for content negotiation.

As an example, the Comunica query engine uses the following Accept header by default, which contains 324 characters:

Accept: application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,application/json;q=0.135,text/shaclc;q=0.1,text/shaclc-ext;q=0.05

Hence, when we do these requests in a browser, we must slice this Accept header down to 128 characters, which causes some (valid) RDF media types to not even be requested from the server.
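
A minimal sketch of that workaround (assumed logic, not necessarily the engine's actual code): drop whole entries from the end, at comma boundaries, until the header fits the budget.

```typescript
// Trim an Accept header to a length budget by dropping whole entries from
// the end (entries are assumed ordered by descending q-value), so the
// remaining header stays well-formed.
function trimAcceptHeader(accept: string, maxLength = 128): string {
  const entries = accept.split(",");
  while (entries.length > 1 && entries.join(",").length > maxLength) {
    entries.pop(); // sacrifice the lowest-preference media types
  }
  return entries.join(",");
}

const full =
  "application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9," +
  "application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5";
console.log(trimAcceptHeader(full));
// Everything beyond the 128-character budget is silently never requested.
```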

New media types exacerbate this problem

As such, I believe introducing new media types for each RDF serialization in 1.2 is not the right way forward, because it would essentially halve the number of formats that can be requested from a server within a browser environment.

For example, the following (which contains some arbitrary new media types for 1.2) already reaches the limit according to CORS:

Accept: application/n-quads,application/n-quads-12,application/trig;q=0.95,application/trig-12;q=0.95,application/ld+json;q=0.9

And this problem would only get worse for every new RDF version:

Accept: application/n-quads,application/n-quads-12,application/n-quads-13,application/n-quads-14

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution,
but this is not very compatible with CORS either (a longer Accept header, or new headers that are not allowed by default under CORS).

From this perspective, my feeling is that new media types or profile-based negotiation are not the way to go, and that in-band solutions such as @version might be better (there is precedent for this in JSON-LD's @version).
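
For reference, the JSON-LD precedent is an in-band marker in the context; per the JSON-LD 1.1 spec, a processor running in json-ld-1.0 mode rejects such a context with a processing-mode conflict error:

```json
{
  "@context": {
    "@version": 1.1,
    "name": "http://schema.org/name"
  },
  "name": "An in-band version marker"
}
```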


Not only does this problem apply to RDF serializations; it also applies to SPARQL result serializations: SPARQL/JSON, SPARQL/XML, SPARQL/CSV, SPARQL/TSV.

@namedgraph

profile-based negotiation could be a good solution

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

@lisp

lisp commented Jan 17, 2025

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

which is why it is better to implement the logic that verifies availability of the required media type at a higher level.
as long as there are legacy applications, a client application framework will have to validate the service endpoint before the request is made.

@kasei
Contributor

kasei commented Jan 18, 2025

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.

While that's true for the spec version, I don't think the same can be said for the widespread use of the Team Submission that predates the spec. The same media type was in use for years before Turtle 1.1 was introduced and brought with it changes to the syntax. I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

@hvdsomp

hvdsomp commented Jan 19, 2025

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution, but this is not very compatible with CORS either (a longer Accept header, or new headers that are not allowed by default under CORS).

Just felt like pointing out that there’s also an IETF Internet Draft on profile based negotiation, of which @RubenVerborgh is co-author. It’s been in the works for quite a long time. There’s been renewed interest from the cultural heritage community and even from the W3C where some consider this a topic that falls in the IETF realm. See https://datatracker.ietf.org/doc/draft-svensson-profiled-representations/.

@RubenVerborgh
Member Author

widespread use of the Team Submission that predates the spec. […] I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

Technologically? Been there, done that.

Reputationally? Not so much.
Fifteen years ago, at least we could say: “All of this mess happens because Turtle isn't yet a standard.”
Without a solution, we'll have to say henceforth: “All of this mess happens because Turtle is a standard. Twice—so far.”

@TallTed
Member

TallTed commented Jan 23, 2025

@RubenVerborgh — You pointed to "Extended discussion at https://ruben.verborgh.org/articles/fine-grained-content-negotiation/"

— which included —

Particularly exciting is that multiple profiles can be combined in a single response, in contrast to the single-dimensional nature of MIME types.

First thing, your writing betrays a limited understanding of your topic, as you refer consistently to "MIME types", which are actually "media types", though they are used in a universe of MIME.

Next, I bear relatively recent scars of a years-long effort to convince IETF to follow their own documentation and work with a number of folks (including me) who wanted to extend media types by defining how to interpret multiple + therein. Part of the scarring came from IETF rejecting their own pre-existing profile extension, especially when the value(s) of profile are URIs, because there's a relatively SMALL character count beyond which those profile values are now to be considered malware(!).

In other words -- your "extended discussion" (which is really an extended monologue) has been overtaken by events, and is no longer (if it ever was) applicable.
