Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add registry proxying section #66

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dmcgowan
Copy link
Member

@dmcgowan dmcgowan commented Jul 26, 2019

Define repository namespace query parameter for proxying.

Closes #12

Giving time for registry operators to weigh in

Maintainer approval

Copy link
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments & questions.

spec.md Outdated
@@ -901,6 +902,20 @@ If the image had already been deleted or did not exist, a `404 Not Found` respon

> for more details, see: [compatibility.md](https://github.com/docker/distribution/blob/master/docs/spec/manifest-v2-2.md#backward-compatibility)

### Registry Mirroring

A registry MAY operate as a mirror for an upstream registry to support pull-through caching or proxying of fetch operations (such as fetching tags, manifest, or blobs).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/manifest/manifests/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first time for the term fetch operations in this document

Maybe http GET requests/operations instead? If agree need to also change below.. otherwise maybe introduce the term earlier on in the spec?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine, I'll update to pull. That is more consistent

spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated Show resolved Hide resolved
spec.md Outdated

A client SHOULD be aware of whether a registry host being used is a mirror.
A client SHOULD avoid sending `ns` query parameters to non-mirrored registries.
A client SHOULD NOT send authorization credentials for an upstream registry to a mirrored registry.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/mirrored/mirror/

@jzelinskie
Copy link
Member

Why do clients need to know anything about pull-through caching if its implemented server-side?

@dmcgowan
Copy link
Member Author

dmcgowan commented Aug 9, 2019

Why do clients need to know anything about pull-through caching if its implemented server-side?

The clients should know how the registry host was resolved from a given image reference. The clients don't care how the server is implemented, but they SHOULD provide information to the server which indicates what the reference being asked for is. Just as when an HTTP client connects with a PROXY server it must communicate what the upstream server is, the same is true here. Today the protocol doesn't define anyway to communicate what the upstream is and proxies end up be hardcoded to a single upstream. In a few cases you can see proxies use custom domains per upstream and require users to change the name of their images in order to use them.

@josephschorr
Copy link

Today the protocol doesn't define anyway to communicate what the upstream is and proxies end up be hardcoded to a single upstream.

Right... isn't that the point? If I encode that "myregistry/mynamespace/myrepo" goes to "upstream/foo/bar", that's a detail for the maintainer of myrepo and one that the client, ideally, doesn't need to know; the whole point is the puller thinks they are getting "myregistry/mynamespace/myrepo".

If the goal is to allow the client to specify "upstream/foo/bar", then I'd that the target is not really a repository anymore, but simply a working proxy, and thus, a different protocol parameter might be useful, but registries should therefore have the option to not support said parameter.

@dmcgowan
Copy link
Member Author

Right... isn't that the point?

That is one use case that will still work. In the example you mentioned, when a repository is proxied in that fashion, the puller often does know of this detail as they must explicitly provide myregistry with the intent of getting some upstream content. The use case where myregistry is some sort of blessed version of upstream is reasonable, but not the intent of the namespace parameter here.

If the goal is to allow the client to specify "upstream/foo/bar"

This is the use case here and proxy may be better terminology here, but that is really a detail of the registry. The registry may act as a proxy, proxy-cache, or active mirror, that is out of scope for definition here. This parameter just enables all of those features to work across multiple namespaces. For example if you want public images from both docker.io/* and quay.io/* to be cached in the same registry proxy today, you would need the server to have two hostnames (something like docker.io.myproxy and quay.io.myproxy) then have clients configured to do that mapping for each namespace. This simple query parameter provides a much simpler option to clients and servers. If a server does not support it, it ignores the parameter. If a client is configured to send all requests to a server which does not support it, that is not different than any other misconfiguration by clients today.

@josephschorr
Copy link

For example if you want public images from both docker.io/* and quay.io/* to be cached in the same registry proxy today, you would need the server to have two hostnames (something like docker.io.myproxy and quay.io.myproxy) then have clients configured to do that mapping for each namespace

Or configure two repositories, one for each? (especially since combining them could lead to merge conflicts).

I'm concerned we're adding quite a bit of complexity to address a use case that has simpler solutions when configured on the registry side.

@dmcgowan
Copy link
Member Author

I'm concerned we're adding quite a bit of complexity to address a use case that has simpler solutions when configured on the registry side.

Can you elaborate here? Configuring a repository for each mirror is non-trivial. Configuring a domain for each upstream and routing to the upstream based on the domain is not easier, that would still requires the same routing on the server side that an implementation of this would require. The client side implementation to support per-registry configuration is not simple and inherently requires catch -all conditions when trying to enforce proxying through a gateway.

I did do a client side implementation of this to demonstrate the feature and allow server side implementations a client to test against. On the client side, it is not complex at all since clients should already know how to handle 404s when multiple registry endpoints are configured. On the server side, the complexity to support this isn't much more than existing proxy-cache support.

@josephschorr
Copy link

Configuring a repository for each mirror is non-trivial.

Its non-trivial, but its not that difficult either :)

My concern remains around complexity: the document as outlined, for example, says that the ns should not be sent to non-mirroring registries... but how does the client know that? Is it the registry's job to report back an error if that argument is found but unsupported? How will clients know to be able to check for this capability?

If we feel that pass-through proxying of other registries is, in and of itself, a feature of the protocol (rather than something configured on the registry side), then I suspect we need to give significantly more thought to the end-to-end user experience. For example, I could imagine some paths supporting proxying and others not.

@dmcgowan
Copy link
Member Author

how does the client know that?

The clients have the most context and really does not need to be defined here, only that a client SHOULD make that distinction to avoid sending unnecessary redundant information. The clients themselves have both the configuration and endpoint resolution logic, so it has multiple options for determining this. In the implementation I sent I just simply did this by checking whether the endpoint was configured without push support, as this could indicate the registry being communicated with may not be the upstream source. However, I will probably add a check there for ns != host since there are never push configurations (such as with a Kubernetes runtime). Either way, this is trivial and not required.

Is it the registry's job to report back an error if that argument is found but unsupported?

No, the registry can simply ignore it. This is like asking a registry today which was configured to mirror docker.io to return an error if the client actually meant quay.io, the registry just isn't expected to have the same amount of context as a client in regards to the intent of the entire pull process. If the registry chooses to be handle the ns parameter and not support it, it is as easy as returning a 404 for unconfigured upstreams.

How will clients know to be able to check for this capability?

They aren't expected to check for it, but rather be explicitly configured for it. A client will know if it is configured to always use a specific mirror or a mirror for multiple namespaces. I think what you are suggesting here though is the idea of registry discovery. That is a much larger topic that I would still love to see happen, in that feature a client could start with zero knowledge (except of course the domain quay.io, docker.io, etc) and discover registry capabilities and endpoints.

spec.md Outdated Show resolved Hide resolved
@vbatts
Copy link
Member

vbatts commented Aug 14, 2019

discussion ensues on the call today.
This sounds like a decent addition, but with a clear use-case for the behavior, and whether a registry implementation MUST support it.

@jzelinskie
Copy link
Member

pinging @thomasmckay and @kurtismullins who are implementing mirroring on Quay -- they probably have feedback and want to track this thread

@RCMariko
Copy link

Pulp container plugin team will want to keep an eye on this thread as well. Any feedback @ipanova @dkliban @asmacdo ?

@vbatts
Copy link
Member

vbatts commented Aug 15, 2019 via email

@dmcgowan
Copy link
Member Author

@vbatts I use ns or namespace here because namespace is a very generic concept. Certainly as a generic concept it has been used to mean many things. However generally namespace would refer to additional context (such as a prefix) on another name. In the distribution spec case, the name given to the registry would be the part on the URL path, the namespace just gives that additional context to the name. Existing distribution clients today parse the name as you described <sometransport/host/whatever>/<name given to the registry>, in which case when given just <name given to registry>, the <sometransport/host/whatever> would be the namespace of that. You could continue to divide those parts in smaller names in other namespaces, such as the Docker hub does with usernames/reponames, but that is out of scope here. Capturing this in elegant words is kind of tough, recommendations on which parts are unclear or how to make it better are appreciated.

stevvooe
stevvooe previously approved these changes Aug 24, 2019
Copy link
Contributor

@stevvooe stevvooe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think this is a great addition. It might be a good idea to add a few combinatoric examples of how ns and the the repo name are combined to calculate the upstream and local mirroring location.

@vbatts
Copy link
Member

vbatts commented Dec 16, 2019

@dmcgowan one thing i'm unclear on here is: can i have a single registry mirror that will be usable for more than one remote registry (i.e. remote of docker.io/..., quay.io/..., etc)

spec.md Outdated Show resolved Hide resolved
@vbatts
Copy link
Member

vbatts commented Apr 1, 2020

Notes from today's call:

  • (jz) quay does something different for mirroring. This should be called "proxying"
  • (dmcg) this is really for client side caching proxy, and is needed.

Please lets find a way to classify this language (whether client or server side). So we can close out or merge this

@jzelinskie
Copy link
Member

Reiterating our convo from the meeting:

I actually see a lot of value in adding this query parameter, but removing any connotation that it is the blessed solution for repository mirroring. I think that by including this value, a proxy could implement lots of different behavior for the client that need not be directly related to repository mirroring.

@amouat
Copy link
Contributor

amouat commented Apr 9, 2020

I note that harbor uses the term "replication" rather than proxy-cache or mirroring, which I quite like https://goharbor.io/docs/1.10/administration/configuring-replication/


A client SHOULD be aware of whether a registry host is being used is a proxy, such as when the `ns` query parameter differs from the `Host` header.
A client SHOULD avoid sending `ns` query parameters to non-proxy registries.
A client SHOULD NOT unintentionally send authorization credentials for an upstream registry to a proxy registry.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop "unintentionally", replace with "... except when explicitly indicated (e.g. via configuration)"

Security sensitive requirements being labeled as SHOULD / SHOULD NOT, SHOULD also give you pause. Can this be upgraded to a MUST?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative wording:

Clients SHOULD NOT assume that proxy registries can handle authentication credentials issued by an upstream registry. Authentication credentials MUST NOT be sent to different hostnames except when prior knowledge or a configuration signal of authentication support is present.

@jdolitsky
Copy link
Member

big time rebase needed @dmcgowan

| `Host` | header | Standard HTTP Host Header. SHOULD be set to the registry host. |
| `Authorization` | header | An RFC7235 compliant authorization header. |
| `name` | path | Name of the target repository. |
| Name | Kind | Description |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we lost these tables in the spec.md rewrite

An example of delegating functionality is proxying pull operations to another registry.
An example of adding functionality is implementing a pull-through cache of pulls to another registry.
When operating as a proxy, the `Host` header passed to the registry will be the host of the PROXY and NOT the host in the repository name used by the client.
A `ns` query parameter on pull operations is OPTIONAL, but when used specifies the host in a repository name used by a client.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the key piece - we should document at least this query param

@jdolitsky jdolitsky modified the milestones: v1.0.0-next, v1.2.0 Feb 17, 2022
@jdolitsky
Copy link
Member

Pushing this out to 1.2 while we figure out how to properly word this

@dcarrion87
Copy link

Hi team, just checking to see when this is likely to be merged in? Thanks

@sudo-bmitch
Copy link
Contributor

@dcarrion87 it's going to be a while. Effort is still continuing on the 1.1 release. After that, we'll revisit this.

@dmcgowan
Copy link
Member Author

I don't mind rebasing this after 1.1 to target 1.2. It really is a pretty light change, it has been included by containerd since 1.4. The issue on the client is whenever a proxy is used, the client is throwing away information that could be important to the registry. We only include ns whenever the proxy host name does not match the requested image host name. I would love to see a registry proxy make use of it even before it hits an official 1.2.0 release.

| `Range` | header | HTTP Range header specifying blob chunk. |
| `name` | path | Name of the target repository. |
| `digest` | path | Digest of desired blob. |
| `ns` | query | (OPTIONAL) Namespace of repository. SHOULD be set to source host. |

###### On Success: Partial Content

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quickly please!

@phillebaba
Copy link

I just wanted to check in if this proposed change has just stalled or is abandoned? I know that these things take a long time, so it can at times be hard to tell.

I have been working in depth with mirroring OCI artifacts over the last year now. The addition of an ns query parameter solves a lot of problems seen in different projects. The major issue right now seems to be that all workarounds are non standard, causing incompatibility. Containerd adds the query parameters whil CRIO does not. Some registries will look for the query parameter while others either only support a single registry or suggest including the root url in the namespace.

I just wanted to add my input that I think that adding this to the spec would be very useful and could potentially help pain points that exist today when mirroring.

If there is any additional work required before merging this, I would be happy to help.

@brandond
Copy link

I've been watching this as well, and would like to voice my support for getting this officially included in the spec.

@darzanebor
Copy link

+1

@BenTheElder
Copy link

Operating https://registry.k8s.io we see a TON of requests that are obviously from pull-through caches/mirrors like artifactory for images that we don't host. If this API was standardized we could eventually start requesting that these tools switch to ns-aware pulls and stop spamming us with requests for images we don't provide.

While not the top of our concerns, this feels wasteful and generates a lot of log noise, and isn't free.

The current state where pull-through mirroring pretty much has to request all upstreams feels bad, what concerns remain with this proposal?

@sudo-bmitch
Copy link
Contributor

Operating https://registry.k8s.io we see a TON of requests that are obviously from pull-through caches/mirrors like artifactory for images that we don't host. If this API was standardized we could eventually start requesting that these tools switch to ns-aware pulls and stop spamming us with requests for images we don't provide.

I'd guess this is from users thinking they have setup a proxy, and docker automatically falling back when the proxy pull fails. This is made worse by docker only supporting a proxy to Docker Hub. The result is the proxy does nothing, never used to pull k8s images (since the client would pull directly) and never finding a match (since the proxy isn't pointing to Hub) for the Hub pulls.

The current state where pull-through mirroring pretty much has to request all upstreams feels bad, what concerns remain with this proposal?

Are there proxies doing this today? The few I've used are configured for a single upstream registry, at least for a given namespace or port.

I worry we'll all agree this is a good change, but the key players in the space won't implement it. It would be good to see this being actively used by both clients and proxies.

We also need to consider existing clients and proxies in various combinations. E.g. if a proxy for a single registry is working with older clients and gets upgraded to use the headers, those older clients still need to go to the original registry. Similarly, newer clients configured with an older proxy should not pull content from the wrong upstream registry, since that would expose them to attacks by dependency confusion and namespace squatters.

@BenTheElder
Copy link

I'd guess this is from users thinking they have setup a proxy, and docker automatically falling back when the proxy pull fails. This is made worse by docker only supporting a proxy to Docker Hub. The result is the proxy does nothing, never used to pull k8s images (since the client would pull directly) and never finding a match (since the proxy isn't pointing to Hub) for the Hub pulls.

The issue where docker only supports configuring a mirror for dockerhub is also pretty unfortunate but not what I'm referring to, the headers are for agents like artifactory (and other similar tools).

My understanding is that users setup a central catch-all pull-through cache and have no way to disambiguate which upstream the pulls were meant for, so we get pulls for ... a lot of images clearly not associated with us.

This seems to be common with self hosted options, versus something like ECR/AR remote repositories. It's understandable that adding a distinct IP / hostname for every proxied host is heavy handed, and I don't think most users even think to use a port other than 5000.

Are there proxies doing this today? The few I've used are configured for a single upstream registry, at least for a given namespace or port.

AFAICT, yes, for example:

https://jfrog.com/help/r/jfrog-artifactory-documentation/virtual-docker-repositories

"This allows you to access images that are hosted locally on local Docker repositories, as well as remote images that are proxied by remote Docker repositories, and access all of them from a single URL defined for the virtual repository. Using virtual repositories can be very useful since users will continue to work with the virtual repository while the admin can manage the included repositories, replace the default deployment target and those changes will be transparent to the users. [...]"

I worry we'll all agree this is a good change, but the key players in the space won't implement it. It would be good to see this being actively used by both clients and proxies.

This seems a bit circular? If we don't document the spec for this optional feature then most projects are not going to enable support (e.g. distribution/distribution#3864 (comment))

We also need to consider existing clients and proxies in various combinations. E.g. if a proxy for a single registry is working with older clients and gets upgraded to use the headers, those older clients still need to go to the original registry. Similarly, newer clients configured with an older proxy should not pull content from the wrong upstream registry, since that would expose them to attacks by dependency confusion and namespace squatters.

That's true, but a proxy can behave more intelligently when the namespace parameter is present and over time more clients will have it, there are various references to waiting for this to be standard. The proxies can fall back to existing behavior such as querying all upstreams when the header is missing.

Containerd alone represents a lot of clients that already have this header in place. Were it to become standard and proxies begin supporting it there's actually a pretty large client base sending it already, but there's resistance to supporting it more widely referring back to waiting for this PR.

@sudo-bmitch
Copy link
Contributor

The issue where docker only supports configuring a mirror for dockerhub is also pretty unfortunate but not what I'm referring to, the headers are for agents like artifactory (and other similar tools).

I'm not sure how you'd differentiate the scenarios from the k8s registry. It would just look like a bunch of failed image pulls to that registry with either scenario.

Are there proxies doing this today? The few I've used are configured for a single upstream registry, at least for a given namespace or port.

AFAICT, yes, for example:

https://jfrog.com/help/r/jfrog-artifactory-documentation/virtual-docker-repositories

Ugh, someone should open a dependency confusion CVE against them if that's really what's happening.

I worry we'll all agree this is a good change, but the key players in the space won't implement it. It would be good to see this being actively used by both clients and proxies.

This seems a bit circular? If we don't document the spec for this optional feature then most projects are not going to enable support (e.g. distribution/distribution#3864 (comment))

We've always standardized things that were already implemented somewhere to show the feasibility first. You can't untag a standard, and implementations are effectively the CI for a standard. If the distribution project doesn't want to add it, or maintain a feature branch, then we'd need another project.

We also need to consider existing clients and proxies in various combinations. E.g. if a proxy for a single registry is working with older clients and gets upgraded to use the headers, those older clients still need to go to the original registry. Similarly, newer clients configured with an older proxy should not pull content from the wrong upstream registry, since that would expose them to attacks by dependency confusion and namespace squatters.

That's true, but a proxy can behave more intelligently when the namespace parameter is present and over time more clients will have it, there are various references to waiting for this to be standard. The proxies can fall back to existing behavior such as querying all upstreams when the header is missing.

Registries falling back when the header isn't seen is probably easy enough. That needs to be to a single registry to avoid a dependency confusion attack.

The fallback for new clients talking to older registries is less clear. The API needs to ensure new clients either know to fallback when talking to older proxies, or that the API doesn't exist on old proxies to avoid the security risk of sending requests to the wrong upstream registry.

Containerd alone represents a lot of clients that already have this header in place.

With Docker updating that integration, the single proxy setting to Hub will hopefully stop being an issue in the future.

@phillebaba
Copy link

If you need an example of the ns query parameter being used in the wild here is one. I am currently using it in Spegel do enable tag resolving in a simple way. This is less of an issue for me as I only support Containerd which has been adding this parameter to all of its image pulling request for a few years now. This has been working really well as from a user perspective it becomes a seamless experience.

https://github.com/spegel-org/spegel/blob/ecfdd57cb2a22513c5785cce89e5c8eb6771dbf3/pkg/registry/registry.go#L180-L185

The other solution that I have seen used out in the wild is to add a header to the mirror configuration so that the receiving registry is aware of the original registry the image originates from. This is how Drangonfly solves the problem for example. Which means that they would not be able to support wild card mirror configuration.

https://d7y.io/docs/next/operations/integrations/container-runtime/containerd/#multiple-registries

Right now every registry out there seems to come up with their own standard for how to deal with mirroring multiple namespaces. A common practices is to to join the original image name reference with the mirror registry. For example docker.io/library/ubuntu:20.04 would become example.com/docker.io/library/ubuntu:20.04. While it may work it creates another set of challenges forcing users to modify image reference and creating incompatibility with other tools that do not understand the logic implemented in the name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

spec: support for passing client image name for mirroring use case