Skip to content

Dockerfile: Various bug fixes, clean ups, and an integration test #79

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 20, 2021

Conversation

Krinkle
Copy link
Member

@Krinkle Krinkle commented Aug 17, 2021

TLDR:

  • Fix all the headers for correctness (mime + charset), performance, and security.
  • Use much simpler redirect logic and apply the same approach as we use today.
  • Verify it all with an integration test.

Differences in file headers

In summary:

  • Missing GZIP support and no "Vary: Accept-Encoding" to avoid cache poisoning.
  • Missing CORS support (Access-Control-Allow-Origin).
  • Missing caching allowance (Cache-Control public 10 years, max Expires).
  • Missing disablement of server_tokens.
  • Missing charset utf-8.
  • Unstable E-Tag and Last-Modified which would cause serious cache churn
    after every rebuild.

Status quo:

krinkle@wp-03:~$ curl -H 'Host: codeorigin.jquery.com' -i 'http://localhost/jquery-3.0.0.js' | head -n20

Server: nginx
Date: Mon, 16 Aug 2021 23:12:52 GMT
Connection: keep-alive
Access-Control-Allow-Origin: *
Cache-Control: max-age=315360000
Cache-Control: public
Content-Length: 263268
Content-Type: application/javascript; charset=utf-8
ETag: "28feccc0-40464"
Expires: Thu, 31 Dec 2037 23:55:55 GMT
Last-Modified: Fri, 18 Oct 1991 12:00:00 GMT
Vary: Accept-Encoding
Accept-Ranges: bytes

Previously, without this patch:

dockerize$ curl -i 'http://localhost:4000/jquery-3.0.0.js' | head -n20

Server: nginx/1.21.1
Date: Mon, 16 Aug 2021 23:13:30 GMT
Connection: keep-alive
Content-Length: 263268
Content-Type: application/javascript
ETag: "5fdbec22-40464"
Last-Modified: Thu, 17 Dec 2020 23:39:14 GMT
Accept-Ranges: bytes

Match existing Nginx config

Most of the above was addressed by simply using the same Nginx config as we have in the private infrastructure.git repo for provisioning legacy codeorigin (copied from wordpress-header.conf.erb and wp/jquery.pp, respectively).

However, the last two points (charset, and unstable etag), require additional changes.

Nginx: charset

The default charset from Nginx differs between Debian and Alpine, possibly due to changes in some of the shared libraries. Fix this by explicitly setting charset utf-8. Note that this does (and should) only apply to files with a MIME type in charset_types list, which by default contains HTML, JS, and CSS.

More importantly, it does not (and should not) cause PNG files and other binary assets to be served with a charset, which would be wrong.

Nginx: E-Tag and Last-Modified

This was a tricky one. Git does not store file modification timestamps, which means the on-disk file mtimes are somewhat arbitrarily set to when the file was created by the local Git client. For most files, this means the time you cloned the repository, and for files added later, the time you ran git pull.

This was de-facto "stable" on the legacy server because it is a persistent server with a persistent git directory (we always use the same clone and just update it to fetch new files after a commit happens). The timestamps themselves don't matter
as long as they remain constant for any given file, and they were.

This is important because E-Tag is computed in Nginx based on file mtime and size and (unlike Apache) this behaviour is not configurable in Nginx (e.g. to make it compute a content hash instead). And Last-Modified naturally uses the
mtime as well.

As a workaround, I've added a touch command to assign all CDN files a fixed timestamp far in the past. This is safe for us because all files are expected to remain static. Even if we would change a file, it wouldn't propagate to the CDN without a manual CDN API purge (given the all-important high values for Cache-Control max-age and Expires), and after that would not propagate to any browser that has previously fetched it unless the user clears their own cache (given the all-important Cache-Control telling the browser to re-use the file blindly, which is critical for performance). As such, whether the timestamp remains constant or changes after the file is changed, would make no difference for how it propagates.

New state

After, with this patch (and verified by a simple test suite).

dockerize$ curl -i 'http://localhost:4000/jquery-3.0.0.js' | head -n20

Server: nginx
Date: Mon, 16 Aug 2021 23:49:43 GMT
Connection: keep-alive
Access-Control-Allow-Origin: *
Cache-Control: max-age=315360000
Cache-Control: public
Content-Length: 263268
Content-Type: application/javascript; charset=utf-8
ETag: "28feccc0-40464"
Expires: Thu, 31 Dec 2037 23:55:55 GMT
Last-Modified: Fri, 18 Oct 1991 12:00:00 GMT
Vary: Accept-Encoding
Accept-Ranges: bytes

Misc changes

  • Remove left-over Nginx config for purge.php.

  • Remove left-over php-fpm script.

  • Use our own document root at /var/www/cdn instead of mixing our files into the default /usr/share/nginx/html/ which contains files from the "docker-nginx" project that we may not want to serve blindly.
    This mainly affects the odd index.html file that could otherwise be served.

  • Pin the Nginx base image version at 1.x. This is mostly a no-op since they've been on 1.x for 10 years, but if it does rollover
    we wouldn't want an unmanaged upgrade to a new major version. We could pun this more narrowly at some point, but 1.x should be fine for now.

  • Invert the config file (let Dockerfile remove code instead of comment out) so that the most critical lines can be easily
    linted.

  • Ensure all files listed are redirected correctly, as per https://github.com/jquery/infrastructure/issues/554 and verify this with an integration test. The expected values are populated based on running curl against localhost from a shell on the legacy codeorigin. I specifically did not compare against live code.jquery.com, since we care about matching what we feed to the CDN (although the above problems were surfacing through there as well). Some of the CDN behaviours can't and don't need to be mimicked from the origin. Prior to launch we will (also) compare the live code.jquery.com against the new staging CDN endpoint, but that's separate from the test for the container.

Ref https://github.com/jquery/infrastructure/issues/554.

TLDR:

* Fix all the headers for correctness (mime + charset), performance, and security.
* Use much simpler redirect logic and apply the same approach as we use today.
* Verify it all with an integration test.

== Differences in file headers

In summary:

* Missing GZIP support and no "Vary: Accept-Encoding" to avoid cache poisoning.
* Missing CORS support (Access-Control-Allow-Origin).
* Missing caching allowance (Cache-Control public 10 years, max Expires).
* Missing disablement of server_tokens.
* Missing charset utf-8.
* Unstable E-Tag and Last-Modified which would cause serious cache churn
  after every rebuild.

Status quo:

```
krinkle@wp-03:~$ curl -H 'Host: codeorigin.jquery.com' -i 'http://localhost/jquery-3.0.0.js' | head -n20

Server: nginx
Date: Mon, 16 Aug 2021 23:12:52 GMT
Connection: keep-alive
Access-Control-Allow-Origin: *
Cache-Control: max-age=315360000
Cache-Control: public
Content-Length: 263268
Content-Type: application/javascript; charset=utf-8
ETag: "28feccc0-40464"
Expires: Thu, 31 Dec 2037 23:55:55 GMT
Last-Modified: Fri, 18 Oct 1991 12:00:00 GMT
Vary: Accept-Encoding
Accept-Ranges: bytes
```

Previously, without this patch:

```
dockerize$ curl -i 'http://localhost:4000/jquery-3.0.0.js' | head -n20

Server: nginx/1.21.1
Date: Mon, 16 Aug 2021 23:13:30 GMT
Connection: keep-alive
Content-Length: 263268
Content-Type: application/javascript
ETag: "5fdbec22-40464"
Last-Modified: Thu, 17 Dec 2020 23:39:14 GMT
Accept-Ranges: bytes
```

== Match existing Nginx config

Most of the above was addressed by simply using the same Nginx config as we have
in the private infrastructure.git repo for provisioning legacy codeorigin
(copied from wordpress-header.conf.erb and wp/jquery.pp, respectively).

However, the last two points (charset, and unstable etag), require additional
changes.

== Nginx: charset

The default charset from Nginx differs between Debian and Alpine, possibly
due to changes in some of the shared libraries. Fix this by explicitly setting
`charset utf-8`. Note that this does (and should) only apply to files with a
MIME type in `charset_types` list, which by default contains HTML, JS, and CSS.

More importantly, it does not (and should not) cause PNG files and other binary
assets to be served with a charset, which would be wrong.

== Nginx: E-Tag and Last-Modified

This was a tricky one. Git does not store file modification timestamps, which
means the on-disk file mtimes are somewhat arbitrarily set to when the file
was created by the local Git client. For most files, this means the time you
cloned the repository, and for files added later, the time you ran `git pull`.

This was de-facto "stable" on the legacy server because it is a persistent server
with a persistent git directory (we always use the same clone and just update it
to fetch new files after a commit happens). The timestamps themselves don't matter
as long as they remain constant for any given file, and they were.

This is important because E-Tag is computed in Nginx based on file mtime and
size and (unlike Apache) this behaviour is not configurable in Nginx (e.g. to
make it compute a content hash instead). And Last-Modified naturally uses the
mtime as well.

As a workaround, I've added a `touch` command to assign all CDN files a fixed
timestamp far in the past. This is safe for us because all files are expected
to remain static. Even if we would change a file, it wouldn't propagate to the
CDN without a manual CDN API purge (given the all-important high values for
Cache-Control max-age and Expires), and after that would not propagate to
any browser that has previously fetched it unless the user clears their own
cache (given the all-important Cache-Control telling the browser to re-use the
file blindly, which is critical for performance). As such, whether the timestamp
remains constant or changes after the file is changed, would make no difference
for how it propagates.

== New state

After, with this patch (and verified by a simple test suite).

```
dockerize$ curl -i 'http://localhost:4000/jquery-3.0.0.js' | head -n20

Server: nginx
Date: Mon, 16 Aug 2021 23:49:43 GMT
Connection: keep-alive
Access-Control-Allow-Origin: *
Cache-Control: max-age=315360000
Cache-Control: public
Content-Length: 263268
Content-Type: application/javascript; charset=utf-8
ETag: "28feccc0-40464"
Expires: Thu, 31 Dec 2037 23:55:55 GMT
Last-Modified: Fri, 18 Oct 1991 12:00:00 GMT
Vary: Accept-Encoding
Accept-Ranges: bytes
```

== Misc changes

* Remove left-over Nginx config for purge.php.

* Remove left-over php-fpm script.

* Use our own document root at /var/www/cdn instead of mixing our
  files into the default /usr/share/nginx/html/ which contains files
  from the "docker-nginx" project that we may not want to serve blindly.
  This mainly affects the odd index.html file that could otherwise be
  served.

* Pin the Nginx base image version at 1.x. This is mostly a no-op
  since they've been on 1.x for 10 years, but if it does rollover
  we wouldn't want an unmanaged upgrade to a new major version.
  We could pun this more narrowly at some point, but 1.x should be
  fine for now.

* Invert the config file (let Dockerfile remove code instead of
  comment out) so that the most critical lines can be easily
  linted.

* Ensure all files listed are redirected correctly,
  as per https://github.com/jquery/infrastructure/issues/474#issuecomment-844582865
  and verify this with an integration test. The expected values
  are populated based on running `curl` against localhost from a shell
  on the legacy codeorigin. I specifically did not compare against
  live code.jquery.com, since we care about matching what we feed to
  the CDN (although the above problems were surfacing through there
  as well). Some of the CDN behaviours can't and don't need to be
  mimicked from the origin. Prior to launch we will (also) compare the
  live code.jquery.com against the new staging CDN endpoint, but that's
  separate from the test for the container.

Ref https://github.com/jquery/infrastructure/issues/474.
@Krinkle Krinkle requested a review from brianwarner August 17, 2021 03:58
@Krinkle Krinkle requested a review from mgol August 17, 2021 04:08
@Krinkle Krinkle merged commit c928ad9 into jquery:dockerize Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants