Dear coreboot readers!
This is the automated build check service of coreboot.
The developer "oxygene" checked in revision 3777 to the coreboot source repository and caused the following changes:
Change Log: This patch fixes the ugly race condition created through build_opt_tbl running twice at the same time, overwriting its output files. This caused a depending rule to produce an object file with no symbols in it.
This should silence up the regularly happening build failure messages on the mailing list since we moved to the newer, much faster server.
Signed-off-by: Stefan Reinauer stepan@coresystems.de Signed-off-by: Patrick Georgi patrick.georgi@coresystems.de Acked-by: Peter Stuge peter@stuge.se
Build Log: Compilation of gigabyte:m57sli has been broken See the error log at http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=m57sli&... Compilation of via:pc2500e has been broken See the error log at http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=pc2500e&...
If something broke during this checkin please be a pain in oxygene's neck until the issue is fixed.
If this issue is not fixed within 24h the revision should be backed out.
Best regards, coreboot automatic build system
On 28.11.2008 13:13, coreboot information wrote:
The developer "oxygene" checked in revision 3777
Change Log: This patch fixes the ugly race condition created through build_opt_tbl running twice at the same time, overwriting its output files. This caused a depending rule to produce an object file with no symbols in it.
This should silence up the regularly happening build failure messages on the mailing list since we moved to the newer, much faster server.
And it didn't work. What now?
Signed-off-by: Stefan Reinauer stepan@coresystems.de Signed-off-by: Patrick Georgi patrick.georgi@coresystems.de Acked-by: Peter Stuge peter@stuge.se
Build Log: Compilation of gigabyte:m57sli has been broken See the error log at http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=m57sli&... Compilation of via:pc2500e has been broken See the error log at http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=pc2500e&...
This is exactly the error the change was supposed to fix.
Regards, Carl-Daniel
On Fri, Nov 28, 2008 at 9:10 AM, Carl-Daniel Hailfinger < c-d.hailfinger.devel.2006@gmx.net> wrote:
On 28.11.2008 13:13, coreboot information wrote:
The developer "oxygene" checked in revision 3777
Change Log: This patch fixes the ugly race condition created through build_opt_tbl running twice at the same time, overwriting its output files. This caused a depending rule to produce an object file with no symbols in it.
This should silence up the regularly happening build failure messages on the mailing list since we moved to the newer, much faster server.
And it didn't work. What now?
Signed-off-by: Stefan Reinauer stepan@coresystems.de Signed-off-by: Patrick Georgi patrick.georgi@coresystems.de Acked-by: Peter Stuge peter@stuge.se
Build Log: Compilation of gigabyte:m57sli has been broken See the error log at
http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=m57sli&...
Compilation of via:pc2500e has been broken See the error log at
http://qa.coreboot.org/log_buildbrd.php?revision=3777&device=pc2500e&...
This is exactly the error the change was supposed to fix.
Is it possible the build system didn't rebuild abuild or some part of the config system before it ran?
-Corey
Corey Osgood wrote:
This is exactly the error the change was supposed to fix.
Is it possible the build system didn't rebuild abuild or some part of the config system before it ran?
-Corey
It's odd. I couldn't reproduce any failure in 5 complete abuild runs after we made the patch.
The problem is that we end up with a valid option_table.o file (not truncated) but it has no symbols and no data in there. The file is 900 something bytes, which is exactly the size of an elf object created by touch foo.c gcc -c foo.c -o foo.o
At least in one of the targets that failed, buid_opt_table was called 5 times, whereas it should only be called twice in the whole build process.
This makes me think make gets confused during the build process. Patrick suggested it might be some serialization issue, but I'm slowly running out of ideas for things that we can check without wasting a lot of time.
Stefan
On 28.11.2008 23:03, Stefan Reinauer wrote:
Corey Osgood wrote:
This is exactly the error the change was supposed to fix.
Is it possible the build system didn't rebuild abuild or some part of the config system before it ran?
-Corey
It's odd. I couldn't reproduce any failure in 5 complete abuild runs after we made the patch.
The problem is that we end up with a valid option_table.o file (not truncated) but it has no symbols and no data in there. The file is 900 something bytes, which is exactly the size of an elf object created by touch foo.c gcc -c foo.c -o foo.o
At least in one of the targets that failed, buid_opt_table was called 5 times, whereas it should only be called twice in the whole build process.
This makes me think make gets confused during the build process. Patrick suggested it might be some serialization issue, but I'm slowly running out of ideas for things that we can check without wasting a lot of time.
I think I can fix that serialization problem, but it seems r3777 makes the situation a bit better than it was before. That alone is an improvement. Stefan, you are right about the time/result tradeoff. If the failure rate stays low enough, we might want to leave your fix in place and simply accept the occassional failure.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
I think I can fix that serialization problem,
Aha? Mind to share the insight?
but it seems r3777 makes the situation a bit better than it was before. That alone is an improvement. Stefan, you are right about the time/result tradeoff. If the failure rate stays low enough, we might want to leave your fix in place and simply accept the occassional failure.
Oh, we do want to leave my fix in place, even if you come up with another fix on top of that.
Having two rules for the same set of files is quite unhealthy.
Stefan
On 29.11.2008 11:53, Stefan Reinauer wrote:
Carl-Daniel Hailfinger wrote:
I think I can fix that serialization problem,
Aha? Mind to share the insight?
Sure. You found the bug. Really. Remember what you wrote earlier? Let me quote you:
The problem is that we end up with a valid option_table.o file (not truncated) but it has no symbols and no data in there. The file is 900 something bytes, which is exactly the size of an elf object created by touch foo.c gcc -c foo.c -o foo.o
And that's exactly what was happening. How?
Look at how option_table.c and option_table.o are generated. util/options/build_opt_tbl.c is doing that and it seems to be doing this in a way that confuses make. How can creating a simple file confuse make? GCC does it all the time. The answer is that gcc does it differently. Make totally depends on timestamps. It also assumes that if a file is present, it is usable. (Reread the last sentence, it is important.) The only way to make sure that a file is usable directly in the instant it is created (to avoid race conditions) is to demand creating and writing the whole file has to be one atomic operation. There is no way to do that directly with the standard fopen/fwrite/fclose. There is one way out: Use an atomic operation to make the whole file available. Rename is atomic. Create(open)/write/close a file with a temporary unique name, and after that is done, rename it to the file you wanted to create in the first place. gcc does it that way and make is happy. build_opt_tbl creates the files directly and has a HUGE race between fopen and fclose. We're hitting that race condition.
How do we solve it? Two ways are possible: 1. Fix build_opt_tbl.c to use fopen/fwrite/fclose/rename. 2. Perform that logic in the makefile. Fixing build_opt_tbl.c is IMHO the preferred course because it avoids hacks in makefiles.
I'd provide a patch, but my right hand is injured and I am typing with my left hand only (no worries ;-)).
but it seems r3777 makes the situation a bit better than it was before. That alone is an improvement. Stefan, you are right about the time/result tradeoff. If the failure rate stays low enough, we might want to leave your fix in place and simply accept the occassional failure.
Oh, we do want to leave my fix in place, even if you come up with another fix on top of that.
Absolutely. (Technically, the problem will be fixed completely by fixing build_opt_tbl.c even if r3777 is reverted.) Improving make rules for better readability and less overhead is something I'd call a requirement as well, so r3777 is a keeper.
Having two rules for the same set of files is quite unhealthy.
Yes, that's true. We had a similar problem some time back in v2. v3 also has them, but they are not showing up yet. I should resend my v3 dependency fix RFC, maybe it gathers more interest this time.
Regards, Carl-Daniel
Carl-Daniel Hailfinger wrote:
Make totally depends on timestamps.
This is interesting information. I falsely assumed it would make sure that it executed a rule to the end if it's pulled in through a dependency before assuming that dependency to be satisfied.
How do we solve it? Two ways are possible:
- Fix build_opt_tbl.c to use fopen/fwrite/fclose/rename.
- Perform that logic in the makefile.
Fixing build_opt_tbl.c is IMHO the preferred course because it avoids hacks in makefiles.
Ok. Well, 2. was definitely easier to hack up correctly, but here you go. A somewhat race safe build_opt_tbl. Check r3783.
I'm attaching the Makefile hack fix as a reference (Not required anymore, DO NOT APPLY)
my right hand is injured and I am typing with my left hand only (no worries ;-)).
Get well soon!