@adqm
Contributor

@adqm adqm commented Sep 15, 2025

Fixes

Description

This PR fixes the whitespace issues described in #485, where translated strings and inline HTML elements were often surrounded by extra whitespace.

The most noticeable effects are the removal of whitespace around some punctuation, and the removal of whitespace from the ends of link text.

Technical details

The main change is avoiding the use of BeautifulSoup's prettify function, which can add whitespace that affects the rendered HTML. To my eyes, the HTML is plenty 'pretty' without calling that function 🙂
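To illustrate why indentation-style formatting can change what the reader sees, here is a minimal sketch using only the standard library. The `indented` string is hand-written to resemble the one-node-per-line layout an indenting formatter produces (it is not actual output from this project), and `rendered_text` is a hypothetical helper that only approximates how a browser collapses whitespace in inline content.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def rendered_text(html):
    """Approximate visible inline text: whitespace runs collapse to one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


# Compact markup: no whitespace between the link and the period.
compact = '<p>See the <a href="/licenses/">license list</a>.</p>'

# Hand-written imitation of one-node-per-line, indented output.
indented = """<p>
 See the
 <a href="/licenses/">
  license list
 </a>
 .
</p>"""

print(rendered_text(compact))   # See the license list.
print(rendered_text(indented))  # See the license list .
```

The newline inserted before the period becomes a visible space, and the newlines inside the `<a>` element become part of the link text, which matches the two symptoms described above.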

Additional changes to specific templates catch several other related issues (manually adjusting whitespace, replacing {% trans ... %} with {% blocktrans trimmed %}...{% endblocktrans %}).
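As a rough sketch of what the `trimmed` option buys, the snippet below mimics its documented behavior (strip the block, then merge lines using a single space) in plain Python. The `trim_whitespace` function here is a stand-in written for illustration, not Django's own code, and the sample string is hypothetical.

```python
import re


def trim_whitespace(text):
    """Mimic `trimmed`: strip the ends, then join lines with one space."""
    return re.sub(r"\s*\n\s*", " ", text.strip())


# What sits between {% blocktrans %} and {% endblocktrans %} when the
# template author wraps and indents the text for readability.
block_contents = """
    This license is acceptable for
    Free Cultural Works.
"""

print(repr(block_contents))                   # newlines and indentation kept
print(repr(trim_whitespace(block_contents)))  # one clean line
```

Without trimming, the leading and trailing newlines and the indentation end up both in the rendered page and in the message string that translators see.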

Screenshots

Some sample screenshots:

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@adqm
Contributor Author

adqm commented Sep 15, 2025

For now, outside of just not calling prettify, I only modified templates/includes/legalcode_licenses_4.0.html; hoping for some feedback on the approach before committing to making a similar pass through all the other templates 🙂.

@TimidRobot
Member

@adqm Thank you for the work you put into this. The links to examples in the description are very helpful.

However, the HTML output during publishing must be normalized/tidy and consistent. There must be a formatting step. If bs4.HTMLFormatter is removed, it must be replaced by something else.

@adqm
Contributor Author

adqm commented Sep 16, 2025

Thanks for taking a look! That makes sense. I'll look into a few alternatives and see if I can find a reasonable solution.

@adqm
Contributor Author

adqm commented Sep 20, 2025

@TimidRobot I've been trying a few things over the past few days. Nothing is perfect, but I wanted to give a little bit of an update. Apologies for the long message 😅.

I've pushed what I think is the least-objectionable method I've tried so far, but it still definitely has some downsides:

  • Most notably, bin/publish.sh takes around 5-6x longer than before on the machine I used for testing.
  • This approach (described below) definitely adds complexity to the codebase.

That said, I'm quite happy with the output (both the rendered output and the source code look pretty good to me). Since everything is ultimately served statically, my opinion is that the improvement in the output is worth the slowdown and the extra complexity at build time.

(Maybe also worth mentioning that I didn't yet try to update the README's instructions for manual setup, which will need adjustment if this approach is ultimately used).

Method

Based on prior experience, my first thought was to use Prettier to do the HTML formatting; it seems to be the popular tool for this kind of thing nowadays, and it doesn't have the same issues with adding extra whitespace that BeautifulSoup's .prettify method does.

My initial attempt, which called out to a new subprocess for every page, naturally slowed things down a lot. I knew ahead of time that this would be the case, but I didn't expect it to be as slow as it ended up being (40 minutes for bin/publish.sh on my machine, as opposed to 2 minutes for the current tip of main).

I explored Biome as an alternative as well, but while it did seem to be substantially faster, its HTML support is still a work in progress and did not work great on our input (I was getting parse errors on well-formed files). Maybe something to consider for the future, but I don't think it'll work right now.

So what I've just pushed still uses prettier, but it sets up a persistent server in the Docker container instead of starting a new prettier process for each page load. Avoiding that startup overhead brought the time down quite a bit, to around 11 minutes on that same machine.
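The general pattern (start one long-lived formatter, then send every page to it) can be sketched as follows. This is not the code in this PR: `StandInFormatter` is a toy HTTP service written for illustration that only strips trailing whitespace, standing in for a real prettier-backed server, and the function names are hypothetical.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StandInFormatter(BaseHTTPRequestHandler):
    """Toy stand-in for a long-running formatter service (NOT prettier)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        html = self.rfile.read(length).decode("utf-8")
        # Placeholder "formatting": strip trailing whitespace on each line.
        formatted = "\n".join(line.rstrip() for line in html.splitlines())
        body = formatted.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_server():
    """Start the service once; port 0 asks the OS for any free port."""
    server = HTTPServer(("127.0.0.1", 0), StandInFormatter)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def format_html(html, port):
    """Send one page to the already-running service and return the result."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request("POST", "/", body=html.encode("utf-8"))
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


server = start_server()  # startup cost is paid once
port = server.server_address[1]
pages = ["<p>one</p>   \n<p>two</p>  ", "<p>three</p> "]
formatted_pages = [format_html(page, port) for page in pages]  # reused per page
server.shutdown()
server.server_close()
```

The point of the design is that each page costs one local request rather than one process launch, which is where the per-page startup overhead goes away.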

I didn't find any pure-Python HTML formatters that seemed to be on prettier's level, though it's entirely possible I've missed some. Happy to look more deeply, though, if the approach I'm proposing here doesn't work for you (I'll admit it doesn't feel particularly elegant...).

Malformed HTML

prettier also errors out on malformed HTML instead of just "fixing" it, which is quite nice; it helped me fix a few places where the HTML in the templates wasn't well-formed.

For right now, the approach in the code is that malformed HTML simply isn't formatted; it falls back to the unformatted version while printing out an error message. The remaining instances of malformed HTML are all in translation files. I'm not sure how to fix these, though (nor do I speak most of these languages, so in a few cases it's not obvious to me whether the seemingly obvious fix will affect the meaning). But regardless of what comes of this PR, those might be worth fixing.
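A minimal sketch of that fallback behavior, under stated assumptions: `strict_format` is a hypothetical stand-in that rejects unbalanced tags (a much cruder check than prettier's parser), and `format_or_fallback` is an illustrative name, not a function from this PR.

```python
import sys
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for the demo only).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class BalanceChecker(HTMLParser):
    """Raise ValueError when closing tags do not match opening tags."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if not self.stack or self.stack.pop() != tag:
            raise ValueError(f"unexpected </{tag}>")


def strict_format(html):
    """Stand-in formatter: refuse malformed input instead of repairing it."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    if checker.stack:
        raise ValueError(f"unclosed <{checker.stack[-1]}>")
    return html.strip() + "\n"


def format_or_fallback(html, path):
    """Format if possible; otherwise report the problem and keep the input."""
    try:
        return strict_format(html)
    except ValueError as error:
        print(f"{path}: not formatted ({error})", file=sys.stderr)
        return html


good = "  <p>Hello <a href='#'>world</a>.</p>  "
bad = "<p>Hello <a>world.</p>"
good_result = format_or_fallback(good, "good.html")
bad_result = format_or_fallback(bad, "bad.html")  # logs an error, returns input
```

Logging the file path alongside the error keeps the build going while still leaving a list of pages that need manual attention.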

Should I open a separate issue for the issues I've found there? If so, should it be in this repo, the data repo, or someplace else?

@adqm
Contributor Author

adqm commented Oct 9, 2025

@TimidRobot, sorry to pester, but just wondering if you might be able to take a look and share your thoughts on the current approach here?

@TimidRobot
Member

@adqm Thank you for the reminder (not a pester at all). This is a really interesting approach! I would like to do some testing with it. I suggest we proceed as follows:

  1. You move the template corrections to a separate pull request (PR) against this repository
  2. Then, I'll merge this PR into a feature branch

@adqm
Contributor Author

adqm commented Oct 9, 2025

Thanks! I can certainly split things up. I should be able to get to that later today, maybe tomorrow.

These will go into a separate PR.
@adqm
Contributor Author

adqm commented Oct 13, 2025

Sorry for the delay here, @TimidRobot! I went ahead and made the changes you requested (new PR is #538), and I also opened a separate issue creativecommons/cc-legal-tools-data#260 for some related issues in some of the translation files.

Feedback is welcome; happy to adjust the approach here if need be.

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

[Bug] Extraneous spaces in HTML

2 participants