Hi Michael & Lists,
I'd like to ask for ideas about the following problem.
(1) There is a functional iPXE + WDS setup, with iPXE built as a traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.
I don't know anything about the particulars of the WDS setup at this point, only that the boot loader program it exposes is WDSNBP.COM.
(2) The setup works fine when iPXE is built at commit 4e85b2708fa0 ("[virtio] Use host-specified MTU when available", 2017-01-23).
(3) When iPXE is built at commit 133f4c47baef ("[build] Handle R_X86_64_PLT32 from binutils 2.31", 2018-09-17), the setup breaks.
The symptom is that iPXE fetches WDSNBP.COM just fine, but WDSNBP.COM, rather than doing whatever it does otherwise, keeps PXE-booting itself (3+ times), and finally aborts.
Consider the following log output (my understanding is that all this is logged by WDSNBP.COM):
Downloaded WDSNBP...
Press F12 for network service boot
Architecture: x64
WDSNBP started using DHCP Referral.
Contacting Server: ... (Gateway: ...)
Contacting Server: ...
TFTP Download: boot\x86\wdsnbp.com
This block repeats approx. 3 times, after which the following is displayed:
Windows Deployment Services: PXE Boot Aborted.
Could not boot image: Error 0x7f8d8101 (http://ipxe.org/7f8d8101)
No more network devices
No bootable device
My understanding is that the first line from this last block is printed by WDSNBP.COM, the second line by iPXE (in pxe_start_nbp()), the third line also by iPXE, and the last one by SeaBIOS.
This seems to indicate that WDSNBP.COM exits with an error code, and pxe_start_nbp() logs it as "Error 0x7f8d8101".
(4) Now, after a bit of searching the web, I've found the following articles, which indicate that the WDS (= server side) setup is incorrect:
(4a) "disable NetBios over TCPIP, on the WDS server"
https://techthoughts.info/pxe-booting-wds-dhcp-scope-vs-ip-helpers/#comment-...
https://social.technet.microsoft.com/Forums/ie/en-US/f3883e8b-1039-477d-999d...
(4b) "cover all combinations of forward and backwards slashes in ReadFilter, on the WDS server"
http://ipxe.org/appnote/chainload_wds#tftp_loops
However: the regression appears to be a function of *only* the git commit at which we build iPXE. It seems so deterministic that we bisected commit range 4e85b2708fa0..133f4c47baef. (Hence we have not captured the network traffic yet, nor have we investigated the WDS server config.)
The "culprit" commit is ea29122a70c6 ("[http] Include error messages for 4xx and 5xx response codes", 2017-12-28).
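For readers unfamiliar with bisection, the search over the commit range can be sketched as a binary search. This is only an illustration, not the procedure actually used: the history below is a hypothetical linear list in which only the two endpoints and the culprit are real commit IDs from this message, and boots_badly() is a stand-in for "build the ROM at this commit and try the WDS boot".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


# Hypothetical history: "c1".."c5" are placeholders, not real commits.
history = ["4e85b2708fa0", "c1", "c2", "ea29122a70c6", "c3", "c4", "c5",
           "133f4c47baef"]
culprit_index = history.index("ea29122a70c6")


def boots_badly(commit: str) -> bool:
    # Stand-in for building iPXE at this commit and observing the boot loop.
    return history.index(commit) >= culprit_index


print(first_bad(history, boots_badly))
```

The search only gives a meaningful answer if the good/bad property is monotonic over the range, which is exactly the assumption that turned out to be questionable here.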
(5) Which makes no sense to me, unfortunately. :(
Commit ea29122a70c6 adds the "http_errors" array to the code. According to
src/include/ipxe/tables.h
and the build artifact
src/bin/1af41000.rom.tmp.map
this new array is placed in a new section called
.textdata.tbl.errortab.01
Trying to retro-fit those facts to the symptom encountered, I came up with the idea that *maybe* the new array (or section) causes a memory allocation failure in WDSNBP.COM -- due to increased memory footprint of iPXE. Which then leads to the misbehavior of WDSNBP.COM.
After all, WDSNBP.COM is a 16-bit real-mode program:
https://support.microsoft.com/en-us/help/4468601/pxe-boot-in-configuration-m...
so it could be susceptible to the size & fragmentation of the RAM that is under 640KB.
(6) Unfortunately, this "low RAM exhaustion" idea doesn't seem to hold water. There are at least two counter-arguments:
(6a) if I revert commit ea29122a70c6 on top of commit 133f4c47baef, then the issue does *not* go away.
(The issue also does not go away if I remove the "netdev_errors" array, also on top of commit 133f4c47baef -- that's a larger array.)
(... In theory, anyway, this does not necessarily disprove the memory exhaustion idea. What if the iPXE footprint grows so much over the range ea29122a70c6..133f4c47baef, for independent reasons, that reverting ea29122a70c6 at the end cannot compensate for the increase?)
(6b) I added "DEBUG=pxe_call:1" to the "make" command, and compared the debug messages printed by pxe_start_nbp(), between 4e85b2708fa0 and 133f4c47baef. Alas, the debug messages are identical:
PXE NBP starting with netdev net0, code 9c6c:0802, data 9cf0:2ce0
which to me suggests that there is no change in the amount of memory that is made available to WDSNBP.COM -- its code and data continue to start at 0x9_CEC2 and 0x9_FBE0, respectively.
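As a quick check of the arithmetic, a real-mode segment:offset pair maps to a linear address as segment * 16 + offset. The sketch below (the helper name linear() is made up for illustration) converts the two pairs from the debug message and reports how far each lies below the 640 KiB boundary (0xA0000).

```python
def linear(seg_off: str) -> int:
    """Convert a real-mode 'segment:offset' string to a linear address."""
    seg, off = (int(part, 16) for part in seg_off.split(":"))
    return (seg << 4) + off


CONVENTIONAL_TOP = 0xA0000  # 640 KiB, the top of conventional memory

for label, addr in (("code", "9c6c:0802"), ("data", "9cf0:2ce0")):
    lin = linear(addr)
    print(f"{label}: {addr} -> {lin:#07x}, "
          f"{CONVENTIONAL_TOP - lin} bytes below 640 KiB")
```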
Any hints as to what could be going wrong?
Thanks! Laszlo
On 15/11/2019 08:45, Laszlo Ersek wrote:
> (1) There is a functional iPXE + WDS setup, with iPXE built as a traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.
> I don't know anything about the particulars of the WDS setup at this point, only that the boot loader program it exposes is WDSNBP.COM.
> <snip>
> Any hints as to what could be going wrong?
Your analysis appears correct to me throughout. No idea what might be the root cause at this stage.
Do you have an easy set of instructions for reproducing the problem? A copy of the precise version of WDSNBP.COM that you are using may be sufficient.
Michael
Hello Michael,
On 11/15/19 18:03, Michael Brown wrote:
> On 15/11/2019 08:45, Laszlo Ersek wrote:
>> (1) There is a functional iPXE + WDS setup, with iPXE built as a traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.
>> I don't know anything about the particulars of the WDS setup at this point, only that the boot loader program it exposes is WDSNBP.COM.
>> <snip>
>> Any hints as to what could be going wrong?
> Your analysis appears correct to me throughout. No idea what might be the root cause at this stage.
> Do you have an easy set of instructions for reproducing the problem? A copy of the precise version of WDSNBP.COM that you are using may be sufficient.
Thank you very much for responding!
I will get to work on collecting the details of the actual WDS environment. It will probably take some time. I'll make an attempt to reproduce the issue directly in my environment too, so I can provide working instructions.
(Right now, my own env is a minimal "mock" setup, with a semi-random WDSNBP.COM binary extracted from a long-term Windows Server 2012 R2 virtual machine of mine. Nothing beyond serving that binary up with libvirt / dnsmasq is configured, so the setup is not nearly a "real" WDS one.)
I'll report back.
Thank you again! Laszlo
On 15/11/2019 21:49, Laszlo Ersek wrote:
> (Right now, my own env is a minimal "mock" setup, with a semi-random WDSNBP.COM binary extracted from a long-term Windows Server 2012 R2 virtual machine of mine. Nothing beyond serving that binary up with libvirt / dnsmasq is configured, so the setup is not nearly a "real" WDS one.)
I'm very happy to work in a minimal mock setup, if it's sufficient to reproduce the problem. Most of my test setups are built that way already.
Michael
On 11/16/19 01:59, Michael Brown wrote:
> On 15/11/2019 21:49, Laszlo Ersek wrote:
>> (Right now, my own env is a minimal "mock" setup, with a semi-random WDSNBP.COM binary extracted from a long-term Windows Server 2012 R2 virtual machine of mine. Nothing beyond serving that binary up with libvirt / dnsmasq is configured, so the setup is not nearly a "real" WDS one.)
> I'm very happy to work in a minimal mock setup, if it's sufficient to reproduce the problem. Most of my test setups are built that way already.
Apologies, I was unclear. My personal env is minimal to the point of not reproducing the problem. I used my environment only for checking the DBGC output (code and data size) in pxe_start_nbp(). In my env, the above-referenced WDSNBP.COM binary tries to contact the WDS server, and then cleanly gives up. I don't see any looping.
I've sent out some requests internally, for more information.
Thank you! Laszlo
Hi All,
On 11/18/19 18:58, Laszlo Ersek wrote:
> On 11/16/19 01:59, Michael Brown wrote:
>> On 15/11/2019 21:49, Laszlo Ersek wrote:
>>> (Right now, my own env is a minimal "mock" setup, with a semi-random WDSNBP.COM binary extracted from a long-term Windows Server 2012 R2 virtual machine of mine. Nothing beyond serving that binary up with libvirt / dnsmasq is configured, so the setup is not nearly a "real" WDS one.)
>> I'm very happy to work in a minimal mock setup, if it's sufficient to reproduce the problem. Most of my test setups are built that way already.
> Apologies, I was unclear. My personal env is minimal to the point of not reproducing the problem. I used my environment only for checking the DBGC output (code and data size) in pxe_start_nbp(). In my env, the above-referenced WDSNBP.COM binary tries to contact the WDS server, and then cleanly gives up. I don't see any looping.
> I've sent out some requests internally, for more information.
This issue has now been resolved. The problem was related to WDSNBP.COM.
There are apparently multiple versions of WDSNBP.COM in common (?) use. For example:
(1) size: 31,140 bytes
    sha256: 75ccf88f9ceefcf02089b6f859ebbdb39eba05f63ebeda48c3f7cc318e4bf2b4
    shipped with: Windows Server 2008 R2 (possibly as an upgrade?)
(2) size: 31,124 bytes
    sha256: 44d07502bb87c9e89c68f0d101fb33636dc389b5607e6c173524e6506bcb2f1c
    shipped with: Windows Server 2008 R2 (possibly as an upgrade?)
(3) size: 30,832 bytes
    sha256: 2b2fb3a7cfba1ef640bcb5d75050e57c79ff639ec621b0162f837c2c889ca178
    shipped with: Windows Server 2012 R2
With WDSNBP.COM consistently upgraded to version (3), in the WDS environment that originally experienced the issue, the symptoms have disappeared.
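For anyone who wants to check which of these versions a given server is handing out, a minimal sketch follows. It compares a file's size and SHA-256 against the three entries listed above; the function names and the command-line usage are illustrative assumptions, and other versions of WDSNBP.COM may well exist that are not in this table.

```python
import hashlib
import sys
from pathlib import Path

# (size in bytes, SHA-256) -> description, copied from the list in this message.
KNOWN = {
    (31140, "75ccf88f9ceefcf02089b6f859ebbdb39eba05f63ebeda48c3f7cc318e4bf2b4"):
        "(1) Windows Server 2008 R2 (possibly as an upgrade?)",
    (31124, "44d07502bb87c9e89c68f0d101fb33636dc389b5607e6c173524e6506bcb2f1c"):
        "(2) Windows Server 2008 R2 (possibly as an upgrade?)",
    (30832, "2b2fb3a7cfba1ef640bcb5d75050e57c79ff639ec621b0162f837c2c889ca178"):
        "(3) Windows Server 2012 R2",
}


def identify(data: bytes) -> str:
    """Match raw file contents against the known size/hash pairs."""
    key = (len(data), hashlib.sha256(data).hexdigest())
    return KNOWN.get(key, "unknown version")


def identify_file(path: str) -> str:
    """Read a file from disk and identify it."""
    return identify(Path(path).read_bytes())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python identify_wdsnbp.py /path/to/WDSNBP.COM
    print(identify_file(sys.argv[1]))
```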
Thank you Michael for your help! Laszlo