Sounds good. We can only expect to get more submodules in the future, and especially for CI/CD container setups (not that I use any of that) it is quite useful not to have to fetch everything every time.

> TLDR, we can decrease the submodule data size by over 900GB by using shallow submodules.

Not sure what you mean by that though.

greetings
Max


On January 21, 2024 5:16:53 AM GMT+01:00, Martin Roth via coreboot <coreboot@coreboot.org> wrote:
One of the tasks I've had in my queue for a bit is to look into shallow submodules for coreboot.

TLDR, we can decrease the submodule data size by over 900MB by using shallow submodules.

The idea behind shallow submodules is that by default, we don't pull down as much data. For most submodules, most people typically only use the single commit pointed to by the submodule pointer. This means that people are spending time pulling down commits and then storing data that they don't typically need.

The downside of shallow submodules (and shallow git repos in general) is that they don't contain the full data for the repository, so you can't immediately look at the full history or check out other versions.

There are a couple of ways to limit the amount of data being fetched by a git repository.
- You can fetch a single branch
- You can fetch a limited amount of the git repo's history, either by date or by number of commits.

This explores those possibilities.
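
As a rough illustration of those two knobs (not part of the measurements below), here is a minimal sketch that builds a throwaway local repository and compares a full clone with a depth-limited, single-branch clone. The repository contents, the `extra` branch, and the depth of 2 are made up for the demo; `--shallow-since=<date>` can be used in place of `--depth` to limit by date instead.

```shell
#!/bin/sh
# Sketch only: everything lives in a temporary directory, no network needed.
set -eu
work=$(mktemp -d)

# Build a small "upstream" repository with five commits and a second branch.
git init -q "$work/upstream"
cd "$work/upstream"
git config user.email "demo@example.com"
git config user.name "Demo"
for i in 1 2 3 4 5; do
    echo "$i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done
main_branch=$(git symbolic-ref --short HEAD)
git branch extra    # a branch that a single-branch clone should not fetch
cd "$work"

# Full clone: all branches, all history.
git clone -q "file://$work/upstream" full

# Limited clone: one branch, only the two most recent commits.
# (file:// is used because plain local paths ignore --depth.)
git clone -q --depth=2 --single-branch --branch="$main_branch" \
    "file://$work/upstream" shallow

full_count=$(git -C full rev-list --count HEAD)
shallow_count=$(git -C shallow rev-list --count HEAD)
echo "full clone commits:    $full_count"
echo "shallow clone commits: $shallow_count"
```

The shallow clone can still be built and committed to as usual; only the older history and the other branches are absent until they are fetched.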

Currently, if you download everything in the coreboot tree along with the full submodules, you get 1.4GB of data:
226MB of data after downloading just the coreboot repo, before downloading submodules.
Then another 1.2GB of data for all of the submodules.


With the submodules changed to pull down a depth of 1 commit, we fetch just 208MB of data, a much more reasonable size than 1.2GB.

Pulling down the full branch used for each submodule increases the submodule size to 769MB.

Here's a spreadsheet with the submodule size data, along with recommendations for each:
https://docs.google.com/spreadsheets/d/1DAnFFnoLxdLE15CsUTE_AuCZeKG-UdrBndiFxhPDEO8/edit

Submodules recommended to use --depth=1 are: arm-trusted-firmware, intel-sec-tools, stm, chromeec, blobs, fsp, vboot

All other submodules would fetch the current branch.
My thought is that we set the default for each submodule to the recommended value in the sheet, then we can add a Make target or Kconfig option that pulls down the rest, for anyone who wants the full data. Once you've pulled down the entire history of a repo, it won't get removed, so the update (mostly) only needs to be done once for historic data. You'd need to do it again after any submodule update to get the full recent commit history though.
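
One possible way to express this, sketched here as an assumption rather than as the actual proposed change, is git's `shallow` key in `.gitmodules`, which makes `git submodule update --init` clone that submodule with a depth of 1 by default. The demo below uses a hypothetical submodule called `lib` in a throwaway superproject; the `protocol.file.allow=always` override is only there because the demo uses local `file://` URLs. The `foreach` loop at the end is roughly what an opt-in "fetch everything" target could run.

```shell
#!/bin/sh
# Sketch only: hypothetical repos "lib" and "super" in a temporary directory.
set -eu
work=$(mktemp -d)
# Needed only for this local demo; newer git blocks file:// submodules by default.
allow="-c protocol.file.allow=always"

# A stand-in submodule repository with four commits.
git init -q "$work/lib"
cd "$work/lib"
git config user.email "demo@example.com"
git config user.name "Demo"
for i in 1 2 3 4; do
    echo "$i" > data.txt
    git add data.txt
    git commit -q -m "lib commit $i"
done

# A stand-in superproject that marks the submodule as shallow by default.
git init -q "$work/super"
cd "$work/super"
git config user.email "demo@example.com"
git config user.name "Demo"
git $allow submodule --quiet add "file://$work/lib" lib
git config -f .gitmodules submodule.lib.shallow true
git add .gitmodules lib
git commit -q -m "add lib as a shallow submodule"

# A fresh checkout picks up the shallow recommendation automatically.
cd "$work"
git clone -q "file://$work/super" checkout
cd checkout
git $allow submodule --quiet update --init
shallow_commits=$(git -C lib rev-list --count HEAD)

# Opt-in: pull down the full history for every submodule that is still shallow.
git submodule --quiet foreach '
    if [ "$(git rev-parse --is-shallow-repository)" = "true" ]; then
        git -c protocol.file.allow=always fetch --quiet --unshallow
    fi'
full_commits=$(git -C lib rev-list --count HEAD)

echo "lib commits after default update: $shallow_commits"
echo "lib commits after unshallow:      $full_commits"
```

The shallow check matters because `git fetch --unshallow` reports an error when run on a repository that already has its full history. Anyone who wants to skip the recommendation entirely can pass `--no-recommend-shallow` to `git submodule update`.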



Unrelated to the submodules or any proposed changes, to thin the coreboot repo out a bit, you can limit how far back the coreboot history goes, and ignore branches other than main by running a command like this:

`git clone https://review.coreboot.org/coreboot.git coreboot-https --shallow-since=2020-01-01 --branch=main --single-branch`

This pulls down 127MB of data instead of the full 226MB. Obviously you can change the date as desired to get more or less history.
If at some point you decide you want the rest of the data for that repo, you can run `git fetch --unshallow`.

One note on pulling less history for the coreboot repo: make sure you get the latest tag, or your coreboot version id will be wrong and confusing.
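
A quick way to check for this, assuming the version id is derived from `git describe`, is to see whether `git describe` can find a tag in the shallow clone at all. The sketch below reproduces the problem on a throwaway repository with a made-up `v1.0` tag and then fixes it by fetching the full history and tags; in a real clone, a later `--shallow-since` date or a smaller deepening step may be enough.

```shell
#!/bin/sh
# Sketch only: a temporary repository with a hypothetical annotated tag v1.0.
set -eu
work=$(mktemp -d)

git init -q "$work/upstream"
cd "$work/upstream"
git config user.email "demo@example.com"
git config user.name "Demo"
for i in 1 2 3 4 5; do
    echo "$i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
    # Tag the second commit, so the tag sits three commits behind the tip.
    if [ "$i" -eq 2 ]; then
        git tag -a -m "demo release" v1.0
    fi
done

# A depth-1 clone has no history reaching the tag, so describe fails.
cd "$work"
git clone -q --depth=1 "file://$work/upstream" shallow
cd shallow
if git describe >/dev/null 2>&1; then
    before=ok
else
    before=missing
fi

# Fetch the rest of the history plus all tags, then describe works again.
git fetch -q --unshallow --tags
after=$(git describe)

echo "describe before: $before"
echo "describe after:  $after"
```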



Let me know what you think about updating the submodules as recommended.
Martin