The basis of RSS3 project implementation: RFC3986 Uniform Resource Identifier

RSS3 is regarded as a promising project in the web3 field. Recently I have been trying out web3 applications, looking for definition paradigms for the core elements of web3. While looking at RSS3, I found the kind of protocol I was looking for: RFC3986, Uniform Resource Identifier. It can be described as a data specification protocol, or understood as a generic syntax.

The original document is long and not easy to read, so I condensed it heavily and made an example in order to understand the data format of web3.

Note that this specification is an Internet standard that has been in use for a long time. RSS3 builds on it with development practices aimed at the field of web3.

RSS3

RSS3 is an open information syndication protocol designed to support efficient and decentralized information distribution in Web3. It defines a format in which information is presented and communicated so that other consumers can easily access various content sources in a uniform format without requiring extensive compatibility logic.

In the RSS3 protocol, information is divided into four types: configuration files, links, assets, and comments.

RSS3 applications use the RSS3 SDK to access and publish data in the format defined by the RSS3 protocol. The RSS3 SDK obtains data from the RSS3 Network and publishes data to RSS3 Supported Networks. The RSS3 Network crawls data from the various RSS3 Supported Networks, caches it in its own high-efficiency database, and does some preprocessing, such as applying artificial intelligence recommendation algorithms and providing search functions.

In such a product design, the most basic data specification is achieved by defining the details of the data transmitted over the network. Once the data is defined, the basic data availability part is complete and upper-layer applications can be implemented more easily. Let us look at this protocol: RFC3986, Uniform Resource Identifier. The condensed version below aims to give a brief understanding of the relevant requirements for Internet data processing.

RFC3986: Uniform Resource Identifier

This specification is derived from RFC2396 [RFC2396], RFC1808 [RFC1808], and RFC1738 [RFC1738], and also contains updates (and corrections) for IPv6 literals in host syntax.

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource, providing a simple and extensible method for identifying resources. The specification defines the generic URI syntax and a process for resolving URI references that might be in relative form, along with guidelines and security considerations for the use of URIs.

The URI syntax defines a grammar that is a superset of all valid URIs, allowing an implementation to parse the common components of a URI reference without knowing the scheme-specific requirements of every possible identifier. The specification does not define a generative grammar for URIs; that task is left to the specification of each URI scheme.

Uniform Resource Identifier (URI) semantics are derived from the concepts introduced by the World Wide Web global information initiative, and the syntax is designed to meet the recommendations laid out in “Functional Recommendations for Internet Resource Locators” [RFC1736] and “Functional Requirements for Uniform Resource Names” [RFC1737].

This document obsoletes [RFC2396], which merged “Uniform Resource Locators” [RFC1738] and “Relative Uniform Resource Locators” [RFC1808] to define a single generic syntax for all URIs. It also obsoletes [RFC2732], which introduced the syntax for IPv6 addresses.

Features of URIs

uniformity

It allows different types of resource identifiers to be used in the same context, even when the mechanisms used to access those resources differ.

It allows uniform semantic interpretation of common syntactic conventions across different types of resource identifiers.

It allows the introduction of new types of resource identifiers without interfering with the way existing identifiers are used.

It allows identifiers to be reused in many different contexts, allowing new applications or protocols to take advantage of the existing, large and widely used set of resource identifiers.

resource

The term “resource” in the general sense refers to anything that might be identified by a URI. Familiar examples include electronic documents, images, information sources, services, and collections of other resources. Resources are not necessarily accessible over the Internet. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, types of relationships (for example, “parent” or “employee”), or numeric values (for example, zero, one, infinity).

identifier

An identifier embodies the information required to distinguish what is being identified from all other things within its scope of identification. This should not be mistaken for an assumption that an identifier defines or embodies the identity of what is referenced, nor that a system using URIs will access the identified resource: in many cases, URIs are used to denote a resource without any intention that it be accessed. Likewise, the “one” resource identified may not be singular in nature (e.g., a resource may be a named set or a mapping that varies over time).

URIs have global scope and are interpreted consistently regardless of context, although the result of that interpretation may depend on the end user’s context. For example, “http://localhost/” has the same interpretation for every user of that reference, even though the network interface corresponding to “localhost” may be different for each user. In other words, interpretation is independent of access.

common grammar

The URI grammar is a federated and extensible naming system in which the specification of each scheme can further restrict the syntax and semantics of identifiers using that scheme.

The specification defines a scheme-independent parsing mechanism for URI references, so that scheme-dependent handling can be postponed until the scheme-dependent semantics are needed. Protocols and data formats that use URI references can refer to this specification as the definition of the range of syntax allowed for all URIs, including schemes that are not yet defined.

A parser for the generic URI syntax can parse any URI reference into its major components. Once the scheme is determined, further scheme-specific parsing can be performed on the components. In other words, the generic URI syntax is a superset of the syntax of all URI schemes.

URI, URL, and URN

URIs can be further categorized as locators, names, or both.

“Uniform Resource Locator” (URL) refers to a subset of URIs. In addition to identifying the resource, it also provides a way to locate the resource (eg, its network “location”) by describing its access mechanism.

“Uniform Resource Name” (URN) has been used historically to refer both to URIs under the “urn” scheme, which are required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable, and to any other URI with the properties of a name.

URI characters are drawn from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters.

URIs can be represented in many forms; for example, ink on paper, pixels on a screen, or a sequence of character-encoded octets. The interpretation of a URI depends only on the characters used, not on how those characters are represented. In local or regional contexts, and as technology advances, users may benefit from being able to use a wider range of characters.

Separation of identification and interaction

A common misconception about URIs is that they are only used to refer to accessible resources. The URI itself only provides identification; access to the resource is neither guaranteed nor implied by the existence of a URI. Instead, any operation associated with a URI reference is defined by the protocol element, data format attribute, or natural language text in which it appears.

Given a URI, the system may attempt to perform various operations on the resource, possibly characterized by words such as “access”, “update”, “replace”, or “find attribute”. Such operations are defined by the protocol using URIs.

Hierarchical identifier

The URI syntax is organized hierarchically, with components in order of decreasing importance from left to right.

The generic syntax uses the slash (“/”), question mark (“?”), and number sign (“#”) characters to delimit components that are significant to the generic parser’s hierarchical interpretation of an identifier. Besides making such identifiers more readable through the consistent use of a familiar syntax, this uniform representation of hierarchy across naming schemes allows scheme-independent references to be made relative to that hierarchy.

Typically, a set or “tree” of documents is constructed to serve a common purpose, and the vast majority of URI references in these documents point to resources within the tree rather than outside it. Similarly, documents located at a particular site are more likely to reference other resources on that site than resources at remote sites. Relative referencing of URIs allows document trees to be partially independent of their location and access scheme.

Syntax notation

The specification uses ABNF notation [RFC2234], including the following core ABNF rules:

ALPHA (letters), CR (carriage return), DIGIT (decimal digits), DQUOTE (double quotes), HEXDIG (hexadecimal digits), LF (line feed), SP (space), etc.

The URI syntax provides a way to encode data, presumably for the purpose of identifying a resource, as a sequence of characters. The URI characters are, in turn, often encoded as octets for transmission or presentation.

ABNF notation defines its terminal values as non-negative integers (code points) based on the US-ASCII coded character set [ASCII]. Because a URI is a sequence of characters, we must invert that relationship in order to understand the URI syntax. Therefore, the integer values used by ABNF must be mapped back to their corresponding US-ASCII characters to complete the syntax rules.
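
To make the mapping from ABNF integer values back to characters concrete, here is a minimal Python sketch. The ranges are the core ABNF rules ALPHA (%x41-5A / %x61-7A) and DIGIT (%x30-39); the helper name `characters` is only an illustration and does not come from the specification.

```python
# Core ABNF rules expressed as inclusive code point ranges.
ALPHA_RANGES = [(0x41, 0x5A), (0x61, 0x7A)]  # A-Z and a-z
DIGIT_RANGES = [(0x30, 0x39)]                # 0-9


def characters(ranges):
    """Map ABNF integer ranges back to their US-ASCII characters."""
    return "".join(
        chr(code_point)
        for low, high in ranges
        for code_point in range(low, high + 1)
    )


alpha = characters(ALPHA_RANGES)
digit = characters(DIGIT_RANGES)
```

Reading the grammar this way, a rule such as ALPHA is simply a set of characters that a URI parser may accept at that position.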

reserved characters

A URI includes components and subcomponents that are delimited by characters in the “reserved” set.

The purpose of reserved characters is to provide a set of delimiting characters that can be distinguished from other data within a URI. A subset of the reserved characters (gen-delims) is used as delimiters for the generic URI components. The ABNF rule for a component does not use the reserved or gen-delims rule names directly; instead, each rule lists the characters that are allowed within that component (i.e., that do not delimit it). Any of those characters that are also in the reserved set are reserved for use as subcomponent delimiters, which can be defined by the specification of a URI scheme.

unreserved characters

Characters that are allowed in a URI but have no reserved purpose are called unreserved. They include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ only in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.
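
The normalizer behaviour described above can be sketched in a few lines of Python. This is an illustrative sketch, not a complete normalizer: the function name is made up for this example, and uppercasing the hex digits of the remaining percent-encodings is an extra step taken from the specification's case normalization rules.

```python
import re
import string

# The "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PERCENT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Decode percent-encoded unreserved characters; leave the rest encoded."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # e.g. %7E -> ~
        # Keep the encoding, but use uppercase hex digits for consistency.
        return "%" + match.group(1).upper()

    return PERCENT_ENCODED.sub(replace, uri)


normalized = normalize_percent_encoding("http://example.com/%7Euser/%41bc%2fd")
```

Here `%7E` and `%41` become `~` and `A`, while `%2f` stays encoded (as `%2F`) because the slash is a reserved character and decoding it would change the structure of the path.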

identifying data

URI characters provide the identifying data for each URI component, serving as an external interface for identification between systems.

Several character encodings are involved in the production and transmission of URIs: local name and data encoding, public interface encoding, URI character encoding, data format encoding, and protocol encoding.

Local names (eg file system names) are stored in the local character encoding. URI-generating applications (eg, origin servers) typically use the local encoding as a basis for generating meaningful names. The URI producer will convert the native encoding to an encoding suitable for the public interface, and then convert the public interface encoding to a restricted set of URI characters (reserved, unreserved, and percent-encoded).

These characters, in turn, are encoded into octets for use as references in data formats (eg, document character sets), etc. Data formats such as references are often then encoded for transmission over Internet protocols.

In some cases, the relationship between a URI component and the identifying data it represents is far less direct than a character encoding translation.
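
The chain of encodings described above can be traced with Python's standard `urllib.parse` module. This is only a sketch under assumptions: the file name is invented, and UTF-8 is assumed as the public interface encoding, which the text above does not mandate.

```python
from urllib.parse import quote, unquote

# 1. A local name, as an application might read it from a file system.
local_name = "résumé 2022.txt"

# 2. Public interface encoding (assumed here to be UTF-8).
octets = local_name.encode("utf-8")

# 3. URI character encoding: unreserved characters stay as they are,
#    every other octet is percent-encoded.
uri_path_segment = quote(octets, safe="")

# 4. Data format / protocol encoding: the URI characters become octets again.
wire_octets = uri_path_segment.encode("ascii")

# A consumer reverses the steps to recover the original name.
recovered = unquote(wire_octets.decode("ascii"))
```

After step 3 the segment contains only characters from the restricted URI set, so it can be embedded safely in any data format or protocol that carries ASCII.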

Syntax component

The generic URI syntax consists of a hierarchical sequence of components: scheme, authority, path, query, and fragment.

The scheme and path components are required, although the path may be empty (no characters). When the authority is present, the path must be either empty or begin with a slash (“/”) character. When the authority is not present, the path cannot begin with two slash characters (“//”). These restrictions result in five different ABNF rules for a path, only one of which matches any given URI reference.
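
As an illustration of the five generic components, the sketch below uses the regular expression given in Appendix B of RFC3986 to split a URI reference. The function name and the dictionary layout are choices made for this example only.

```python
import re

# Regular expression from RFC3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(reference):
    """Split a URI reference into components; None marks an absent component."""
    match = URI_REFERENCE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),  # always present, but may be empty
        "query": match.group(7),
        "fragment": match.group(9),
    }


with_authority = split_uri("foo://example.com:8042/over/there?name=ferret#nose")
without_authority = split_uri("urn:example:animal:ferret:nose")
```

The first reference yields all five components, while the second has only a scheme and a path, which matches the rule that those two are the required ones.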

Scheme

Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme.

The scheme name consists of a sequence of characters beginning with a letter, followed by any combination of letters, digits, plus signs (“+”), periods (“.”), or hyphens (“-”).

scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
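
The scheme rule translates almost directly into a regular expression. The following is a small sketch; the function name is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name):
    """Return True if the whole string matches the scheme rule."""
    return SCHEME.fullmatch(name) is not None


checks = {name: is_valid_scheme(name) for name in ("http", "coap+tcp", "3d", "my_scheme")}
```

Names such as `http` and `coap+tcp` pass, while `3d` (starts with a digit) and `my_scheme` (contains an underscore) do not.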

authority

Many URI schemes include a hierarchical element for a naming authority, so that governance of the name space defined by the remainder of the URI is delegated to that authority. The generic syntax provides a common way to distinguish an authority based on a registered name or server address, along with optional port and user information.

The authority component is preceded by a double slash (“//”) and is terminated by the next slash (“/”), question mark (“?”), or number sign (“#”) character, or by the end of the URI.

authority = [ userinfo "@" ] host [ ":" port ]

host

The host subcomponent of the authority is identified by an IP literal enclosed in square brackets, an IPv4 address in dotted-decimal form, or a registered name. In many cases, the host syntax is used only to reuse the existing registration process created and deployed for DNS, which yields a globally unique name without the expense of deploying another registry.

host = IP-literal / IPv4address / reg-name

IP-literal = "[" ( IPv6address / IPvFuture ) "]"

IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
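
The way an authority breaks down into userinfo, host, and port can be sketched as follows. This is a simplified illustration that assumes a well-formed authority string and does not validate the host itself; the function name is invented for the example.

```python
def split_authority(authority):
    """Split an authority into (userinfo, host, port); None marks absence."""
    userinfo, at_sign, hostport = authority.rpartition("@")
    if not at_sign:
        userinfo = None

    if hostport.startswith("["):
        # IP literal: the colons inside the brackets belong to the address.
        end = hostport.index("]")
        host = hostport[: end + 1]
        rest = hostport[end + 1:]
        port = rest[1:] if rest.startswith(":") else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None
    return userinfo, host, port


registered_name = split_authority("user@example.com:8042")
ip_literal = split_authority("[2001:db8::7]:80")
```

The square brackets are what make IPv6 literals parseable: without them, the colons inside the address could not be told apart from the colon that introduces the port.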

Query

The query component contains non-hierarchical data that, along with data in the path component, serves to identify a resource within the scope of the URI scheme and naming authority.

The query component is indicated by the first question mark (“?”) character and terminated by a number sign (“#”) character or by the end of the URI.

query = *( pchar / "/" / "?" )
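
A short sketch with Python's `urllib.parse` shows where the query begins and ends. Note that splitting the query into `key=value` pairs is a widespread convention rather than something the generic syntax defines; the example URI is made up.

```python
from urllib.parse import parse_qsl, urlsplit

parts = urlsplit("http://example.com/over/there?name=ferret&color=brown#nose")

# Everything between the first "?" and the "#" is the query.
query = parts.query

# Interpreting the query as key=value pairs is a convention, not part of
# the generic URI syntax.
pairs = parse_qsl(query)
```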

usage

When applications reference a URI, they do not always use the full form of reference defined by the “URI” syntax rule. To save space and take advantage of hierarchical locality, many Internet protocol elements and media type formats allow an abbreviated form of a URI, while others restrict the syntax to a particular form of URI.

Build base URI

Except for fragment-only references, relative references are usable only when a base URI is known. The parser must establish a base URI before parsing URI references that might be relative. The base URI must conform to the <absolute-URI> syntax rule.

The base URI can be established in one of four ways, in order of precedence:

Base URI embedded in content

The base URI of the encapsulated entity

URI for retrieving the entity

Default base URI (depending on application)
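
Once a base URI is established, relative references are resolved against it. The sketch below uses `urljoin` from Python's standard library with the base URI and a few of the reference examples given in RFC3986 itself.

```python
from urllib.parse import urljoin

# Base URI used by the reference resolution examples in RFC3986.
BASE = "http://a/b/c/d;p?q"

resolved = {
    reference: urljoin(BASE, reference)
    for reference in ("g", "../g", "?y", "#s")
}
```

A fragment-only reference such as `#s` keeps the whole base URI and only adds the fragment, which is why it is the one case that does not change which document is being referred to.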

Normalize and compare

One of the most common operations on URIs is simple comparison: determining whether two URIs are equivalent without using the URIs to access their respective resources. Extensive normalization is often performed before comparing URIs. URI comparison is always performed for some specific purpose.

equivalence

Because URIs exist to identify resources, they are considered equivalent when they identify the same resource. However, this definition of equivalence is of little practical use, since an implementation has no way to compare two resources unless it has full knowledge or control over them.

Even though it is possible to determine that two URIs are equivalent, URI comparison is not sufficient to determine whether two URIs identify different resources.

grammar-based normalization

Grammar-based normalization includes the following techniques: case normalization, percent-encoding normalization, and dot-segment removal.
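
Two of these techniques are sketched below: lowercasing the scheme and host, and removing dot-segments from the path following the step-by-step algorithm in RFC3986. Percent-encoding normalization is left out to keep the example short, and the function names are chosen for this illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Remove "." and ".." segments from a path, one prefix at a time."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]  # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]  # "/../" becomes "/" and drops the last segment
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize(uri):
    """Lowercase scheme and host, then remove dot-segments from the path."""
    parts = urlsplit(uri)
    # Only the host is case-insensitive; userinfo must be left untouched.
    userinfo, at_sign, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at_sign + hostport.lower()
    path = remove_dot_segments(parts.path)
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))


normalized_uri = normalize("HTTP://Example.COM/a/b/../c/./d?x=1#Frag")
```

After normalization, two URIs that differ only in host case or in redundant dot-segments compare equal as plain strings.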

Security considerations

URIs by themselves do not pose a security threat. However, because URIs are often used to provide a compact set of instructions for accessing network resources, care must be taken to properly interpret the data within a URI, to prevent that data from causing unintended access, and to avoid including data that should not be revealed in plain text.

Sensitive information

URI producers should not provide a URI that contains a username or password that is intended to be kept secret. URIs are often displayed by browsers, stored in clear text bookmarks, and logged by user agent history and intermediate applications (proxies).
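
One way an application might act on this advice is to strip the userinfo subcomponent before a URI is displayed or logged. The sketch below is only an illustration of that idea; the function name and the example credentials are invented.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_userinfo(uri):
    """Return the URI without its userinfo subcomponent."""
    parts = urlsplit(uri)
    # Keep only what follows the last "@" in the authority.
    _, _, hostport = parts.netloc.rpartition("@")
    return urlunsplit((parts.scheme, hostport, parts.path, parts.query, parts.fragment))


safe_to_log = redact_userinfo("ftp://alice:secret@example.com/file.txt")
```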

semantic attack

Because the userinfo subcomponent is rarely used and appears before the host in the authority component, it can be used to construct a URI that misleads the user into trusting it, e.g.

ftp://cnn.example.com&story=break_news@10.0.0.1/top_story.htm

may cause the user to assume the host is “cnn.example.com” when it is actually “10.0.0.1”. A misleading URI may be an attack on the user’s preconceived notions about the meaning of a URI rather than an attack on the software itself. Software can reduce the impact of such attacks by visually distinguishing the various components of the URI when it is displayed.
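
Parsing the misleading URI with Python's `urllib.parse` shows which part is really the host. This is a small sketch of the kind of component separation a user agent could perform before showing a URI to the user.

```python
from urllib.parse import urlsplit

parts = urlsplit("ftp://cnn.example.com&story=break_news@10.0.0.1/top_story.htm")

# Everything before the "@" is userinfo, not the host.
userinfo = parts.username
host = parts.hostname
```

The text that looks like a trusted host name ends up in `userinfo`, while `host` holds the address that would actually be contacted.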

Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/the-basis-of-rss3-project-implementation-rfc3986-uniform-resource-identifier/