On Sun, Feb 21, 2010 at 04:18:38PM -0700, Brandon Bennett wrote:
On Sat, Feb 20, 2010 at 9:05 PM, Kevin O'Connor kevin@koconnor.net wrote:
Should a kernel fail during boot, I'd suspect it doesn't like one of the apm/pcibios callbacks, or it doesn't like one of the smbios/mptable/acpi tables. You could try compiling the SeaBIOS code (see http://seabios.org/Download ) and increasing the debugging by modifying src/config.h. Specifically, you could increase CONFIG_DEBUG_LEVEL, and set DEBUG_HDL_pcibios32 and DEBUG_HDL_apm to 1. Also, you could try disabling some of the features to see if that prevents the fault (eg, disabling CONFIG_ACPI / CONFIG_SMBIOS / CONFIG_MPTABLE).
I have narrowed it down to SMBIOS. If I disable CONFIG_SMBIOS the image boots up fine.
Gleb, have you seen this thread?
Some of the recent changes to smbios that look like possible culprits are:
  Make SMBIOS table pass MS SVVP test
  Use MaxCountCPUs during building of per cpu tables.
  Add malloc_high/fseg() and rework bios table creation to use them.
There were other changes, but the comments indicate they were only ports of changes already in bochs. I suppose it's also possible the lack of smbios is turning off some other feature in the guest (eg, acpi) that's the real culprit.
-Kevin
On Sun, Feb 28, 2010 at 02:39:04PM -0500, Kevin O'Connor wrote:
On Sun, Feb 21, 2010 at 04:18:38PM -0700, Brandon Bennett wrote:
On Sat, Feb 20, 2010 at 9:05 PM, Kevin O'Connor kevin@koconnor.net wrote:
Should a kernel fail during boot, I'd suspect it doesn't like one of the apm/pcibios callbacks, or it doesn't like one of the smbios/mptable/acpi tables. You could try compiling the SeaBIOS code (see http://seabios.org/Download ) and increasing the debugging by modifying src/config.h. Specifically, you could increase CONFIG_DEBUG_LEVEL, and set DEBUG_HDL_pcibios32 and DEBUG_HDL_apm to 1. Also, you could try disabling some of the features to see if that prevents the fault (eg, disabling CONFIG_ACPI / CONFIG_SMBIOS / CONFIG_MPTABLE).
I have narrowed it down to SMBIOS. If I disable CONFIG_SMBIOS the image boots up fine.
Gleb, have you seen this thread?
Some of the recent changes to smbios that look like possible culprits are:
  Make SMBIOS table pass MS SVVP test
  Use MaxCountCPUs during building of per cpu tables.
  Add malloc_high/fseg() and rework bios table creation to use them.
If there is any seabios revision that works, then it is possible to bisect to find the problematic commit.
There were other changes, but the comments indicate they were only ports of changes already in bochs. I suppose it's also possible the lack of smbios is turning off some other feature in the guest (eg, acpi) that's the real culprit.
-- Gleb.
Gleb Natapov gleb@redhat.com writes:
On Sun, Feb 28, 2010 at 02:39:04PM -0500, Kevin O'Connor wrote:
On Sun, Feb 21, 2010 at 04:18:38PM -0700, Brandon Bennett wrote:
I have narrowed it down to SMBIOS. If I disable CONFIG_SMBIOS the image boots up fine.
Gleb, have you seen this thread?
Some of the recent changes to smbios that look like possible culprits are:
  Make SMBIOS table pass MS SVVP test
  Use MaxCountCPUs during building of per cpu tables.
  Add malloc_high/fseg() and rework bios table creation to use them.
If there is any seabios revision that works, then it is possible to bisect to find the problematic commit.
Hello,
I'm also bitten by this and have now attempted to bisect it. The result was:
c95d2cee36db79c88253d43a90230ceadf0c26cf is first bad commit
commit c95d2cee36db79c88253d43a90230ceadf0c26cf
Author: Kevin O'Connor kevin@koconnor.net
Date: Wed Jul 29 20:41:39 2009 -0400

    Add auto-generated version info to each build.

    Add versioning info to initial debug and screen banner output.

:100644 100644 2e2ba1dbf46cb7a12eeaafbc394f3f279c10abf4 37589097295dde049f8512d99561a0320955fb41 M Makefile
:040000 040000 29c4ea1e03416fce9b3aa963ece74050a7b5769a 2b5425a20726caeadba524e14756c8827dd66595 M src
Which looks a bit strange... I'm afraid I'm unable to revert this on the current HEAD due to other changes in the same area and my complete lack of understanding any assembler. But I have verified that commit 8c8a880b584ccf8958d67e99a6750ba32d0b6454 is working for me, while commit c95d2cee36db79c88253d43a90230ceadf0c26cf has the mentioned problems.
However, I am not completely sure that I didn't mess up something which affected the result. I do note that the *working* version logs a "Bad SMBIOS data checksum" message:
olive@canardo:~$ kvm -L /usr/local/src/git/seabios/out -m 1024 -hda olive2.img -nographic -serial mon:stdio -monitor null
Could not open option rom 'vapic.bin': No such file or directory
pci_add_option_rom: failed to find romfile "vgabios-cirrus.bin"
pci_add_option_rom: failed to find romfile "pxe-rtl8139.bin"
Consoles: serial port
BIOS drive C: is disk0
BIOS 639kB/1047544kB available memory

FreeBSD/i386 bootstrap loader, Revision 1.1
(builder@ormonth.juniper.net, Tue Nov 3 08:19:23 UTC 2009)
Loading /boot/defaults/loader.conf
/kernel text=0x74574c smbios_init: Bad SMBIOS data checksum
data=0x42ba0+0x92e34 syms=[0x4+0x7f570+0x4+0xb12b2]
/boot/modules/if_bge.ko text=0xa98c data=0x364+0xc syms=[0x4+0xd50+0x4+0xd18]
/boot/modules/mpt_core.ko text=0x18dbc data=0x488+0x358
/boot/modules/if_bce.ko text=0xd07c data=0x16d94+0x24e4 syms=[0x4+0x14d0+0x4+0x1787]
/boot/modules/acb.ko text=0x3d78 data=0x284+0x80 syms=[0x4+0xa70+0x4+0x9a4]
/boot/modules/mcs.ko text=0x4dc0 data=0x391+0xeb syms=[0x4+0xc40+0x4+0xbc2]
/boot/modules/scs.ko text=0x7c08 data=0x564+0x164 syms=[0x4+0x1110+0x4+0x1179]
/boot/modules/rcb.ko text=0x29c8 data=0x184+0x2c syms=[0x4+0x7e0+0x4+0x704]
/boot/modules/cb.ko text=0x63fc data=0x3b8+0x11c syms=[0x4+0xf20+0x4+0xe69]
/boot/modules/mesw.ko text=0x630c data=0x344+0x58 syms=[0x4+0xba0+0x4+0xe7a]
/boot/modules/cbd.ko text=0x1fcc data=0x9c+0xc syms=[0x4+0x540+0x4+0x445]
/boot/modules/sfccb.ko text=0xe30 data=0x1b0+0x14 syms=[0x4+0x540+0x4+0x4a5]
/boot/modules/mac_runasnonroot.ko text=0x7b4 data=0x4d0 syms=[0x4+0x310+0x4+0x39d]

Hit [Enter] to boot immediately, or space bar for command prompt.
Booting [/kernel]...
platform_early_bootinit: M/T Series Early Boot Initialization
Olive CPU
GDB: debug ports: sio
[snip]
while the failing version does not cause this message:
olive@canardo:~$ kvm -L /usr/local/src/git/seabios/out -m 1024 -hda olive2.img -nographic -serial mon:stdio -monitor null
Could not open option rom 'vapic.bin': No such file or directory
pci_add_option_rom: failed to find romfile "vgabios-cirrus.bin"
pci_add_option_rom: failed to find romfile "pxe-rtl8139.bin"
Consoles: serial port
BIOS drive C: is disk0
BIOS 639kB/1047544kB available memory

FreeBSD/i386 bootstrap loader, Revision 1.1
(builder@ormonth.juniper.net, Tue Nov 3 08:19:23 UTC 2009)
Loading /boot/defaults/loader.conf
/kernel text=0x74574c data=0x42ba0+0x92e34 syms=[0x4+0x7f570+0x4+0xb12b2]
/boot/modules/if_bge.ko text=0xa98c data=0x364+0xc syms=[0x4+0xd50+0x4+0xd18]
/boot/modules/mpt_core.ko text=0x18dbc data=0x488+0x358
/boot/modules/if_bce.ko text=0xd07c data=0x16d94+0x24e4 syms=[0x4+0x14d0+0x4+0x1787]
/boot/modules/acb.ko text=0x3d78 data=0x284+0x80 syms=[0x4+0xa70+0x4+0x9a4]
/boot/modules/mcs.ko text=0x4dc0 data=0x391+0xeb syms=[0x4+0xc40+0x4+0xbc2]
/boot/modules/scs.ko text=0x7c08 data=0x564+0x164 syms=[0x4+0x1110+0x4+0x1179]
/boot/modules/rcb.ko text=0x29c8 data=0x184+0x2c syms=[0x4+0x7e0+0x4+0x704]
/boot/modules/cb.ko text=0x63fc data=0x3b8+0x11c syms=[0x4+0xf20+0x4+0xe69]
/boot/modules/mesw.ko text=0x630c data=0x344+0x58 syms=[0x4+0xba0+0x4+0xe7a]
/boot/modules/cbd.ko text=0x1fcc data=0x9c+0xc syms=[0x4+0x540+0x4+0x445]
/boot/modules/sfccb.ko text=0xe30 data=0x1b0+0x14 syms=[0x4+0x540+0x4+0x4a5]
/boot/modules/mac_runasnonroot.ko text=0x7b4 data=0x4d0 syms=[0x4+0x310+0x4+0x39d]

Hit [Enter] to boot immediately, or space bar for command prompt.
Booting [/kernel]...
platform_early_bootinit: M/T Series Early Boot Initialization
kernel trap 12 with interrupts disabled

Fatal trap 30: reserved (unknown) fault while in kernel mode
instruction pointer = 0x20:0xc09ba131
stack pointer = 0x28:0xc1021c6c
frame pointer = 0x28:0xc1021ca4
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, IOPL = 0
current process = 0 ()
trap number = 30
dog: ERROR - reset of uninitialized watchdog
panic: reserved (unknown) fault
(null)(c0ba9440,c0ba9440,c0b40fd0,c1021bb4,5) at 0xc09a5757
(null)(c0b40fd0,1e,c1021c2c,1,1) at 0xc0593caf
(null)(c0b23132,0,c1021d14,c05b41e0,a) at 0xc09b9747
(null)(c1021c2c) at 0xc09ba631
(null)(c1021cb0) at 0xc09a6d8f
(null)(c1021d44,c0aa42a9,c0b42b37,c1021d34,c1021d30) at 0xc09a6d8f
(null)(c0b42b37,c1021d34,c1021d30,a,c1021d54) at 0xc09a001e
(null)(c0b21154,c0ad7f78,c1021d84,c09af90e,80) at 0xc0aa42a9
(null)(80,c09a6dd0,f,3,20) at 0xc0aa488e
(null)(1026000) at 0xc09af90e
(null)() at 0xc049bb8d
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0xf000ff53
fault code = supervisor write, page not present
instruction pointer = 0x20:0xc05bc3bd
stack pointer = 0x28:0xc1021920
frame pointer = 0x28:0xc1021940
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = resume, IOPL = 0
current process = 0 ()
trap number = 12
dog: ERROR - reset of uninitialized watchdog
panic: page fault
XXXXX: ERROR platform_boot_mastership_relinquish not defined:XXXXXdog: ERROR - reset of uninitialized watchdog
dog: ERROR - reset of uninitialized watchdog
Uptime: 1s
I guess this can affect the result if it means that SMBIOS is being disabled/ignored. I can confirm Brandon's findings that disabling CONFIG_SMBIOS on current head also gives me a working image.
Bjørn
On Tue, Apr 13, 2010 at 08:48:35PM +0200, Bjørn Mork wrote:
Gleb Natapov gleb@redhat.com writes:
On Sun, Feb 28, 2010 at 02:39:04PM -0500, Kevin O'Connor wrote:
On Sun, Feb 21, 2010 at 04:18:38PM -0700, Brandon Bennett wrote:
I have narrowed it down to SMBIOS. If I disable CONFIG_SMBIOS the image boots up fine.
If there is any seabios revision that works, then it is possible to bisect to find the problematic commit.
I'm also bitten by this and have now attempted to bisect it. The result was:
c95d2cee36db79c88253d43a90230ceadf0c26cf is first bad commit
commit c95d2cee36db79c88253d43a90230ceadf0c26cf
It looks like memory layout changes in the f-segment are tickling the underlying bug. I don't think SMBIOS, the above commit, or the other commit identified earlier are the root cause of the problem. Instead, I'd guess these commits just change the memory layout enough to avoid the root cause.
This looks like it is going to require some careful study with a debugger and some execution traces. Unfortunately, since this image isn't available for download it makes it difficult to track down.
-Kevin
"Kevin O'Connor" kevin@koconnor.net writes:
It looks like memory layout changes in the f-segment are tickling the underlying bug. I don't think SMBIOS, the above commit, or the other commit identified earlier are the root cause of the problem. Instead, I'd guess these commits just change the memory layout enough to avoid the root cause.
Sounds like a reasonable explanation. Is there anything I can do to try to pin down the problem?
This looks like it is going to require some careful study with a debugger and some execution traces. Unfortunately, since this image isn't available for download it makes it difficult to track down.
I understand that. But I'm afraid I can't provide any such image.
I don't think Juniper has done anything extraordinary at this point however, so I'll try to see if I can create a freely distributable image triggering the bug using the same tools as them. Will let you know if I succeed.
Bjørn
Bjørn Mork bjorn@mork.no writes:
"Kevin O'Connor" kevin@koconnor.net writes:
It looks like memory layout changes in the f-segment are tickling the underlying bug. I don't think SMBIOS, the above commit, or the other commit identified earlier are the root cause of the problem. Instead, I'd guess these commits just change the memory layout enough to avoid the root cause.
Sounds like a reasonable explanation. Is there anything I can do to try to pin down the problem?
This looks like it is going to require some careful study with a debugger and some execution traces. Unfortunately, since this image isn't available for download it makes it difficult to track down.
I understand that. But I'm afraid I can't provide any such image.
I don't think Juniper has done anything extraordinary at this point however, so I'll try to see if I can create a freely distributable image triggering the bug using the same tools as them. Will let you know if I succeed.
It's been a while with little work and little progress on my side... But I looked at this again today, and found that it may be related to the SMBIOS table being allocated with malloc_high(). Does that make sense?
Anyway, the problematic OS boots without problems with current seabios from git if I make this change:
diff --git a/src/smbios.c b/src/smbios.c
index 8df0f2d..c96deb5 100644
--- a/src/smbios.c
+++ b/src/smbios.c
@@ -17,7 +17,7 @@ smbios_entry_point_init(u16 max_structure_size, u16 number_of_structures)
 {
     struct smbios_entry_point *ep = malloc_fseg(sizeof(*ep));
-    void *finaltable = malloc_high(structure_table_length);
+    void *finaltable = malloc_fseg(structure_table_length);
     if (!ep || !finaltable) {
         warn_noalloc();
         free(ep);
I tried malloc_low() too, and that works as well. But malloc_fseg() seems appropriate, unless I've misunderstood something here. Which very well can be. I am not going to claim any understanding at all.
Does the above make any sense, or is this just another example of "tickling the underlying bug"?
Bjørn
On Thu, Jul 07, 2011 at 05:45:02PM +0200, Bjørn Mork wrote:
It's been a while with little work and little progress on my side... But I looked at this again today, and found that it may be related to the SMBIOS table being allocated with malloc_high(). Does that make sense?
Anyway, the problematic OS boots without problems with current seabios from git if I make this change:
diff --git a/src/smbios.c b/src/smbios.c
index 8df0f2d..c96deb5 100644
--- a/src/smbios.c
+++ b/src/smbios.c
@@ -17,7 +17,7 @@ smbios_entry_point_init(u16 max_structure_size, u16 number_of_structures)
 {
     struct smbios_entry_point *ep = malloc_fseg(sizeof(*ep));
-    void *finaltable = malloc_high(structure_table_length);
+    void *finaltable = malloc_fseg(structure_table_length);
     if (!ep || !finaltable) {
         warn_noalloc();
         free(ep);
Thanks.
It's possible that the OS has an error in handling the SMBIOS when it is in high-memory (located above 1meg). (For example, older versions of Linux crash when the mptable is in high memory.)
However, it would be really odd for the OS to work sometimes with the SMBIOS in high memory and sometimes fail.
I tried malloc_low() too, and that works as well. But malloc_fseg() seems appropriate, unless I've misunderstood something here. Which very well can be. I am not going to claim any understanding at all.
malloc_low and malloc_fseg would both put the table in the first megabyte of physical ram. Of the two, malloc_fseg would be preferable.
Does the above make any sense, or is this just another example of "tickling the underlying bug"?
I have to wonder if the reorganization of memory just caused the bug to not pop up. If you disable SMBIOS, can you confirm the problem reliably goes away on multiple versions of SeaBIOS?
-Kevin
"Kevin O'Connor" kevin@koconnor.net writes:
On Thu, Jul 07, 2011 at 05:45:02PM +0200, Bjørn Mork wrote:
It's been a while with little work and little progress on my side... But I looked at this again today, and found that it may be related to the SMBIOS table being allocated with malloc_high(). Does that make sense?
Anyway, the problematic OS boots without problems with current seabios from git if I make this change:
diff --git a/src/smbios.c b/src/smbios.c
index 8df0f2d..c96deb5 100644
--- a/src/smbios.c
+++ b/src/smbios.c
@@ -17,7 +17,7 @@ smbios_entry_point_init(u16 max_structure_size, u16 number_of_structures)
 {
     struct smbios_entry_point *ep = malloc_fseg(sizeof(*ep));
-    void *finaltable = malloc_high(structure_table_length);
+    void *finaltable = malloc_fseg(structure_table_length);
     if (!ep || !finaltable) {
         warn_noalloc();
         free(ep);
Thanks.
It's possible that the OS has an error in handling the SMBIOS when it is in high-memory (located above 1meg). (For example, older versions of Linux crash when the mptable is in high memory.)
I looked at a couple of physical machines with vendor BIOSes, and they seem to put the table in low memory:
# dmidecode 2.9
SMBIOS 2.4 present.
71 structures occupying 2506 bytes.
Table at 0x000F06F0.

# dmidecode 2.9
SMBIOS 2.4 present.
80 structures occupying 2858 bytes.
Table at 0x000E0010.
Makes me think that this would be the safest approach for SeaBIOS as well. With the patch above, I get this location:
# dmidecode 2.9
SMBIOS 2.4 present.
10 structures occupying 263 bytes.
Table at 0x000FDA00.
Without it, I get:
# dmidecode 2.9
SMBIOS 2.4 present.
10 structures occupying 263 bytes.
Table at 0x1FFFFEF0.
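To make the four "Table at" addresses above easier to compare, here is a small C sketch that sorts a physical address into f-segment, other low memory, or high memory. The function and enum names are made up for illustration and are not part of SeaBIOS or dmidecode; only the boundaries (1 MiB, and 0xF0000 for the start of the f-segment) are the conventional PC values.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional PC memory-map boundaries (not taken from the SeaBIOS source). */
#define ONE_MIB    0x00100000u
#define FSEG_START 0x000F0000u   /* f-segment spans 0xF0000..0xFFFFF */

/* Hypothetical classification used only in this sketch. */
enum table_region { REGION_LOW, REGION_FSEG, REGION_HIGH };

/* Classify a physical address such as the one dmidecode prints. */
enum table_region classify_table_addr(uint32_t addr)
{
    if (addr >= ONE_MIB)
        return REGION_HIGH;      /* above 1 MiB: "high memory" */
    if (addr >= FSEG_START)
        return REGION_FSEG;      /* inside the f-segment */
    return REGION_LOW;           /* elsewhere in the first megabyte */
}

/* Human-readable name for a region. */
const char *region_name(enum table_region r)
{
    switch (r) {
    case REGION_LOW:  return "low memory (below 1 MiB)";
    case REGION_FSEG: return "f-segment (below 1 MiB)";
    default:
        assert(r == REGION_HIGH);
        return "high memory (above 1 MiB)";
    }
}
```

With these definitions, 0x000F06F0 and 0x000FDA00 fall in the f-segment, 0x000E0010 is elsewhere in the first megabyte, and 0x1FFFFEF0 (the unpatched SeaBIOS case) is the only address in high memory.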
However, it would be really odd for the OS to work sometimes with the SMBIOS in high memory and sometimes fail.
Yes. Just to be perfectly clear: The crash with SMBIOS in high memory happens every time with "recent" (anything from 2009 or later) SeaBIOS versions.
I must admit that I right now am wondering whether I somehow screwed up the previous testing of older versions. I am not at all sure under what circumstances older SeaBIOS would work with SMBIOS enabled.
I tried malloc_low() too, and that works as well. But malloc_fseg() seems appropriate, unless I've misunderstood something here. Which very well can be. I am not going to claim any understanding at all.
malloc_low and malloc_fseg would both put the table in the first megabyte of physical ram. Of the two, malloc_fseg would be preferable.
That's what I thought. Glad I could be right about something :-)
Does the above make any sense, or is this just another example of "tickling the underlying bug"?
I have to wonder if the reorganization of memory just caused the bug to not pop up. If you disable SMBIOS, can you confirm the problem reliably goes away on multiple versions of SeaBIOS?
Yes. Tested with current HEAD and with a number of revisions around the beginning of 2009, i.e. version 0.4.0. Just to be sure, I selected an intermediate version as well: 0.5.1. And I can confirm that the problem goes away there too when I disable SMBIOS.
Bjørn
On Fri, Jul 08, 2011 at 09:46:32AM +0200, Bjørn Mork wrote:
"Kevin O'Connor" kevin@koconnor.net writes:
It's possible that the OS has an error in handling the SMBIOS when it is in high-memory (located above 1meg). (For example, older versions of Linux crash when the mptable is in high memory.)
[...]
However, it would be really odd for the OS to work sometimes with the SMBIOS in high memory and sometimes fail.
Yes. Just to be perfectly clear: The crash with SMBIOS in high memory happens every time with "recent" (anything from 2009 or later) SeaBIOS versions.
I must admit that I right now am wondering whether I somehow screwed up the previous testing of older versions. I am not at all sure under what circumstances older SeaBIOS would work with SMBIOS enabled.
I investigated this a bit and I believe the Juniper OS has two SMBIOS bugs: it crashes when the table is in high memory, and when searching for the table it only checks the first 16-byte-aligned "_SM_" signature it finds.
The first bug looks likely because when one runs QEMU with "-d in_asm,int,exec" there is a "check_exception" near the end of the log with an address that is similar to the address of the high-memory SMBIOS table.
The second bug causes the OS to ignore the SMBIOS table if the SeaBIOS code layout happens to put an "_SM_" signature on a 16-byte boundary before the real SMBIOS table; this led to the "Bad SMBIOS data checksum" messages. When this happened, the OS would go on to boot okay because it then didn't try to read the real SMBIOS table from high memory. Since SeaBIOS has the signature in its code (it needs it to build the table), there was roughly a 1 in 16 chance that a random build would happen to place code in a place that would confuse the OS and cause it to ignore the real SMBIOS table.
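The second bug can be illustrated with a short C sketch. This is not the Juniper loader's code (which nobody in this thread has seen); it only models the behaviour guessed at above. The function names and the idea of passing the f-segment as a byte buffer are assumptions made for the example. The 16-byte scan stride, the "_SM_" anchor, the length byte at offset 5 and the byte-sum checksum follow the SMBIOS 2.x entry point layout; the sketch rejects any entry point shorter than 0x1F bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EP_LEN_OFFSET 5     /* entry point length field */
#define EP_MIN_LEN    0x1F  /* SMBIOS 2.x entry point size (sketch assumption) */

/* Return 1 if p holds an "_SM_" anchor whose bytes sum to zero. */
int ep_valid(const uint8_t *p, size_t avail)
{
    if (avail < EP_MIN_LEN || memcmp(p, "_SM_", 4) != 0)
        return 0;
    size_t len = p[EP_LEN_OFFSET];
    if (len < EP_MIN_LEN || len > avail)
        return 0;
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum == 0;
}

/* Robust scan: an anchor with a bad checksum is skipped, not fatal.
 * Returns the offset of the entry point, or -1 if none is found. */
long find_ep_robust(const uint8_t *seg, size_t size)
{
    assert(seg != NULL);
    for (size_t off = 0; off + 4 <= size; off += 16)
        if (ep_valid(seg + off, size - off))
            return (long)off;
    return -1;
}

/* Assumed buggy scan: stop at the first anchor, valid or not. */
long find_ep_first_only(const uint8_t *seg, size_t size)
{
    assert(seg != NULL);
    for (size_t off = 0; off + 4 <= size; off += 16)
        if (memcmp(seg + off, "_SM_", 4) == 0)
            return ep_valid(seg + off, size - off) ? (long)off : -1;
    return -1;
}
```

Given a buffer with a stray "_SM_" on a 16-byte boundary ahead of the real entry point, find_ep_robust() still returns the real offset, while find_ep_first_only() returns -1. That second result corresponds to the "Bad SMBIOS data checksum" case, where the OS gives up on SMBIOS and therefore never touches the table in high memory.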
Long story short, the Juniper OS probably started failing at commit 2929c352b but bisect didn't find it because random other builds would pass due to the second bug.
I tried malloc_low() too, and that works as well. But malloc_fseg() seems appropriate, unless I've misunderstood something here. Which very well can be. I am not going to claim any understanding at all.
So, moving the SMBIOS back to the f-segment would fix this. But, there's an issue with that.
The SMBIOS table is normally pretty small (eg, 263 bytes), but it increases with the amount of RAM and number of CPUs. If one starts QEMU with 255 CPUs, the SMBIOS size is over 11K. Fitting that into the f-segment (64K shared with all the other 16bit code and data) is going to be a real problem. This is also an issue for the mptable (which is over 5K at 255 CPUs), but newer OSes don't care about mptable - not so for the smbios table though.
I'm not sure what to do.
-Kevin
Kevin O'Connor wrote:
I investigated this a bit and I believe the Juniper OS has two SMBIOS bugs: it crashes when the table is in high memory, and when searching
Are all versions based on FreeBSD 4.11? Are newer versions still affected?
So, moving the SMBIOS back to the f-segment would fix this. But, there's an issue with that.
The SMBIOS table is normally pretty small (eg, 263 bytes), but it increases with the amount of RAM and number of CPUs. If one starts QEMU with 255 CPUs, the SMBIOS size is over 11K. Fitting that into the f-segment (64K shared with all the other 16bit code and data) is going to be a real problem. This is also an issue for the mptable (which is over 5K at 255 CPUs), but newer OSes don't care about mptable - not so for the smbios table though.
You might want to introduce CONFIG_SMBIOS_LOCATION and check for space shortage in the F-segment case. Bochs BIOS panics if all requested tables don't fit into the available space.
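One way to picture the suggestion is the C sketch below: prefer the f-segment, and handle the space shortage explicitly. Everything here is hypothetical. The demo_malloc_fseg() and demo_malloc_high() functions are stand-in bump allocators, not SeaBIOS's malloc_fseg()/malloc_high(), the 4096-byte f-segment budget is a made-up number, and falling back to high memory (instead of panicking as Bochs BIOS does) is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placement preference, in the spirit of CONFIG_SMBIOS_LOCATION. */
enum smbios_location { SMBIOS_LOC_FSEG, SMBIOS_LOC_HIGH };

#define FSEG_FREE_BYTES 4096           /* made-up f-segment budget */
#define HIGH_FREE_BYTES (64 * 1024)    /* made-up high-memory budget */

static unsigned char fseg_pool[FSEG_FREE_BYTES];
static unsigned char high_pool[HIGH_FREE_BYTES];
static size_t fseg_used, high_used;

/* Stand-in allocator for the f-segment: returns NULL when space runs out. */
void *demo_malloc_fseg(size_t len)
{
    if (len > FSEG_FREE_BYTES - fseg_used)
        return NULL;
    void *p = fseg_pool + fseg_used;
    fseg_used += len;
    return p;
}

/* Stand-in allocator for high memory. */
void *demo_malloc_high(size_t len)
{
    if (len > HIGH_FREE_BYTES - high_used)
        return NULL;
    void *p = high_pool + high_used;
    high_used += len;
    return p;
}

/* Allocate the table in the preferred zone; on f-segment shortage fall
 * back to high memory and report where the table actually ended up. */
void *alloc_smbios_table(size_t len, enum smbios_location pref,
                         enum smbios_location *used)
{
    assert(used != NULL);
    if (pref == SMBIOS_LOC_FSEG) {
        void *p = demo_malloc_fseg(len);
        if (p) {
            *used = SMBIOS_LOC_FSEG;
            return p;
        }
        /* Space shortage: fall through to high memory. */
    }
    *used = SMBIOS_LOC_HIGH;
    return demo_malloc_high(len);
}
```

With this budget a 263-byte table lands in the f-segment, while an 11K table (the 255-CPU case mentioned above) ends up in high memory. The fallback keeps large guests booting but would bring back the crash for an OS that cannot read a high-memory table, so a panic or a warning at that point may be the better choice.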
Sebastian
Are all versions based on FreeBSD 4.11? Are newer versions still affected?
Newer versions should be based on 6.1 but there are a lot of changes. I haven't had a chance to test with something newer yet.
-Brandon