One of the tasks I've had in my queue for a bit is to look into shallow submodules for coreboot.
TLDR, we can decrease the submodule data size by over 900MB by using shallow submodules.
The idea behind shallow submodules is that by default, we don't pull down as much data. For most submodules, most people typically only use the single commit pointed to by the submodule pointer. This means that people are spending time pulling down commits and then storing data that they don't typically need.
The downside of shallow submodules (and shallow git repos in general) is that they don't contain the full data for the repository, so you can't immediately look at the full history or check out other versions.
There are a couple of ways to limit the amount of data being fetched by a git repository.
- You can fetch a single branch
- You can fetch a limited amount of the git repo's history, either by date or by number of commits.
This explores those possibilities.
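As a rough illustration of the two options above, here is a minimal sketch that builds a throwaway local repository and clones it three ways. The temporary paths, the `extra` branch, and the commit messages are made up for the demo and have nothing to do with the coreboot tree.

```shell
#!/bin/sh
# Throwaway demo: build a tiny local repo, then clone it three ways.
set -e
work=$(mktemp -d)
git init -q "$work/src"
git -C "$work/src" symbolic-ref HEAD refs/heads/main
for i in 1 2 3; do
    echo "$i" > "$work/src/file.txt"
    git -C "$work/src" add file.txt
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q -m "commit $i"
done
# A second branch that a single-branch clone should not pick up.
git -C "$work/src" branch extra

# Full clone: all history, all branches.
git clone -q "file://$work/src" "$work/full"
# Only the main branch, with its full history.
git clone -q --single-branch --branch=main "file://$work/src" "$work/single"
# Only the newest commit (--depth implies --single-branch).
git clone -q --depth=1 "file://$work/src" "$work/shallow"

full_count=$(git -C "$work/full" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
# Count remote-tracking refs for the extra branch in the single-branch clone.
single_extra=$(git -C "$work/single" branch -r | grep -c 'origin/extra' || true)
echo "full: $full_count commits, shallow: $shallow_count commit(s)"
echo "origin/extra refs in single-branch clone: $single_extra"
```

The `file://` URL matters here: with a plain local path, git copies the repository directly and ignores `--depth`.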
Currently, if you download everything in the coreboot tree along with the full submodules, you get 1.4GB of data: 226MB after downloading just the coreboot repo, before downloading submodules, then another 1.2GB for all of the submodules.
Changing the submodules to pull down a depth of 1 commit, we fetch just 208MB of data, a much more reasonable size than 1.2GB.
Pulling down the full branch used for each submodule increases the submodule size to 769MB.
Here's a spreadsheet with the submodule size data, along with recommendations for each: https://docs.google.com/spreadsheets/d/1DAnFFnoLxdLE15CsUTE_AuCZeKG-UdrBndiFxhPDEO8/edit?usp=sharing
Submodules recommended to use --depth=1 are: arm-trusted-firmware, intel-sec-tools, stm, chromeec, blobs, fsp, vboot
All other submodules would fetch the current branch. My thought is that we set the default for each submodule to the recommended value in the sheet, then we can add a Make target or Kconfig option that pulls down the rest, for anyone who wants the full data. Once you've pulled down the entire history of a repo, it won't get removed, so the update (mostly) only needs to be done once for historic data. You'd need to do it again after any submodule update to get the full recent commit history though.
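One way such per-submodule defaults could be recorded is through the `shallow` and `branch` keys that git reads from `.gitmodules`. The sketch below writes them into a temporary file rather than a real `.gitmodules`; the submodule names, paths, and branch are placeholders and are not copied from coreboot's actual configuration.

```shell
#!/bin/sh
# Sketch: record per-submodule defaults in a .gitmodules-style file.
# Names, paths and the branch are placeholders for illustration only.
set -e
modules=$(mktemp)

# A submodule that should default to a depth-1 clone.
git config -f "$modules" submodule.vboot.path 3rdparty/vboot
git config -f "$modules" submodule.vboot.url ../vboot.git
git config -f "$modules" submodule.vboot.shallow true

# A submodule that tracks one named branch. Note that "branch" only
# records which branch to follow; it does not limit the clone by itself.
git config -f "$modules" submodule.example.path 3rdparty/example
git config -f "$modules" submodule.example.url ../example.git
git config -f "$modules" submodule.example.branch main

cat "$modules"
vboot_shallow=$(git config -f "$modules" --get submodule.vboot.shallow)
example_branch=$(git config -f "$modules" --get submodule.example.branch)
```

With `shallow = true` recorded, `git submodule update --init` should clone that submodule at depth 1 by default, and `--no-recommend-shallow` asks for the full history instead. Restricting the other submodules to a single branch would still need `--single-branch` on the update command.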
Unrelated to the submodules or any proposed changes, to thin the coreboot repo out a bit, you can limit how far back the coreboot history goes, and ignore branches other than main by running a command like this:
`git clone https://review.coreboot.org/coreboot.git coreboot-https --shallow-since=2020-01-01 --branch=main --single-branch`
This pulls down 127MB of data instead of the full 226MB. Obviously you can change the date as desired to get more or less history. If at some point, you decide you want the rest of the data for that repo, you can run `git fetch --unshallow`.
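The effect of `git fetch --unshallow` can be checked on a small scale. This is a minimal sketch against a throwaway local repository (the paths and commits are invented for the demo), not against the coreboot repo itself.

```shell
#!/bin/sh
# Demo: a depth-1 clone becomes a complete clone after --unshallow.
set -e
work=$(mktemp -d)
git init -q "$work/src"
for i in 1 2 3; do
    echo "$i" > "$work/src/file.txt"
    git -C "$work/src" add file.txt
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q -m "commit $i"
done

git clone -q --depth=1 "file://$work/src" "$work/clone"
before=$(git -C "$work/clone" rev-list --count HEAD)
shallow_before=$(git -C "$work/clone" rev-parse --is-shallow-repository)

# Fetch the rest of the history.
git -C "$work/clone" fetch -q --unshallow
after=$(git -C "$work/clone" rev-list --count HEAD)
shallow_after=$(git -C "$work/clone" rev-parse --is-shallow-repository)
echo "commits: $before -> $after, shallow: $shallow_before -> $shallow_after"
```

Running `--unshallow` on a repository that is already complete is an error, so a script would want to check `git rev-parse --is-shallow-repository` first.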
One note on pulling less history for the coreboot repo - Make sure you get the latest tag, or your coreboot version id will be wrong and confusing.
Let me know what you think about updating the submodules as recommended. Martin
Sounds good. We can only expect to get more submodules in the future, and especially for CI/CD container stuff (not that I use any of that) it is quite useful not to have to fetch it all every time.
TLDR, we can decrease the submodule data size by over 900GB by using shallow submodules.
Not sure what you mean by that though.
greetings Max
coreboot mailing list -- coreboot@coreboot.org To unsubscribe send an email to coreboot-leave@coreboot.org
Hi Martin,
sounds like a really good idea!
On 21.01.24 05:16, Martin Roth via coreboot wrote:
Here's a spreadsheet with the submodule size data, along with recommendations for each: https://docs.google.com/spreadsheets/d/1DAnFFnoLxdLE15CsUTE_AuCZeKG-UdrBndiFxhPDEO8/edit?usp=sharing
Submodules recommended to use --depth=1 are: arm-trusted-firmware, intel-sec-tools, stm, chromeec, blobs, fsp, vboot
Not sure if we actually need to make a distinction. Most people probably won't look into the submodules anyway. Your selection seems fine to me, though.
All other submodules would fetch the current branch. My thought is that we set the default for each submodule to the recommended value in the sheet, then we can add a Make target or Kconfig option that pulls down the rest, for anyone who wants the full data.
Most people who edit submodules probably know Git well enough. So I would only invest into the Kconfig/Makefile logic if people request it.
Cheers, Nico
Hi Martin,
thanks - that's a great idea. While we are touching submodules anyway - can we make the path a bit more agnostic? While forking coreboot into my org I found that I either:
1. Have to clone all submodules as well
2. or touch the submodules file to point to the upstream coreboot repo, e.g. instead of '../flashrom.git' use '../../coreboot/flashrom.git'.
That works on github and should also work within our upstream coreboot repo.
Maybe I am using git submodules wrong here.. :)
Best,
Chris
Hi Chris,
On 21.01.24 14:57, Christian Walter wrote:
thanks - that's a great idea. While we are touching submodules anyway - can we make the path a bit more agnostic? While forking coreboot into my org I found that I either:
- Have to clone all submodules as well
you mean `fork' in GitHub parlance, right?
- or touch the submodules file to point to the upstream coreboot repo
e.g. instead of '../flashrom.git' use '../../coreboot/flashrom.git'. That works on github and should also work within our upstream coreboot repo.
I don't think this would work with the current upstream setup, because upstream we don't have a coreboot/ namespace. And worse, the double ../ seems to even kill the host part of the URL. I can't imagine a backwards compatible way right now. I hope I'm missing something, though.
The basic problem lies in how submodule URLs are built: a relative path is more or less appended to the URL of the outer repository. E.g. with https://review.coreboot.org/coreboot.git cloned, ../flashrom.git becomes https://review.coreboot.org/coreboot.git/../flashrom.git and, after removing the `coreboot.git/../' part, this works as a URL.
But https://review.coreboot.org/coreboot.git/../../coreboot/flashrom.git results in https://coreboot/flashrom.git
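The resolution described above can be reproduced with a small shell function. This is a simplified sketch of the rule, not git's actual implementation: `resolve_relative_url` is a name chosen for this example, and scp-style `host:path` URLs are ignored. The GitHub URL in the last call is an assumption about where the mirror lives.

```shell
#!/bin/sh
# Simplified sketch of relative submodule URL resolution:
# each leading "../" strips one trailing path component from the
# superproject URL, then the remainder is appended after a "/".
resolve_relative_url() {
    base="${1%/}"
    rel="$2"
    while :; do
        case "$rel" in
            ../*) base="${base%/*}"; rel="${rel#../}" ;;
            ./*)  rel="${rel#./}" ;;
            *)    break ;;
        esac
    done
    printf '%s/%s\n' "$base" "$rel"
}

# One "../" replaces coreboot.git with the sibling repository.
resolve_relative_url https://review.coreboot.org/coreboot.git ../flashrom.git
# prints https://review.coreboot.org/flashrom.git

# Two "../" eat into the "https://" part, so the host is lost.
resolve_relative_url https://review.coreboot.org/coreboot.git ../../coreboot/flashrom.git
# prints https://coreboot/flashrom.git

# With an extra namespace level in the URL, two "../" work out.
resolve_relative_url https://github.com/coreboot/coreboot.git ../../coreboot/flashrom.git
# prints https://github.com/coreboot/flashrom.git
```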
Maybe I am using git submodules wrong here.. :)
No, I think you're doing it right. Just our setup targets GitHub only as a mirror, not actually as a tool to work on coreboot. :-/
Nico
Hi Nico,
oh yeah - I was probably too fast suggesting this - sorry about that! Yeah, seems like there is no straightforward solution right now.. nevermind - I'll carry around a patch then that fixes up the submodules.
Thanks, and sorry for interrupting :)
Chris
Is there a reason not to use absolute paths in .gitmodules for upstream? That's what I currently have to do for my fork, for the submodules which I haven't forked.
Hi,
On 21.01.24 21:29, Matt DeVillier wrote:
Is there a reason not to use absolute paths in .gitmodules for upstream? That's what I have to do currently for my fork, for the submodules which I haven't forked
with some people cloning from coreboot.org and some from github.com, there is no correct absolute path.
The manpage git-submodule(1) gives a hint about `init': "You can then customize the submodule clone URLs in .git/config for your local setup and proceed to git submodule update; [...]"
We could indeed script something after running `submodule init'. Like a Kconfig that replaces the default `../' prefix. Would that help? One could then have the path (prefix) in site-local/.
Nico
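A script along the lines suggested above might look roughly like the following. It is only a sketch under assumptions: `rewrite_submodule_urls` is a hypothetical helper, the flashrom path is a placeholder, and the demo writes to temporary files, whereas a real script would run after `git submodule init` and write to `.git/config`.

```shell
#!/bin/sh
# Sketch: replace the leading "../" of relative submodule URLs with a
# site-specific prefix. All names here are illustrative placeholders.
set -e

# $1: .gitmodules-style file to read
# $2: config file to write (.git/config in a real checkout)
# $3: prefix that replaces the leading "../"
rewrite_submodule_urls() {
    git config -f "$1" --get-regexp '^submodule\..*\.url$' |
    while read -r key url; do
        case "$url" in
            ../*) git config -f "$2" "$key" "$3${url#../}" ;;
        esac
    done
}

modules=$(mktemp)
local_config=$(mktemp)
git config -f "$modules" submodule.flashrom.path 3rdparty/flashrom
git config -f "$modules" submodule.flashrom.url ../flashrom.git

rewrite_submodule_urls "$modules" "$local_config" "https://review.coreboot.org/"
new_url=$(git config -f "$local_config" --get submodule.flashrom.url)
echo "$new_url"
# prints https://review.coreboot.org/flashrom.git
```

The prefix could then come from a Kconfig value or a file in site-local/. Submodule names containing spaces would need more careful parsing than `read -r key url` provides.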