Hi Michael & Lists,
I'd like to ask for ideas with the following problem we have.
(1) There is a functional iPXE + WDS setup, with iPXE built as a traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.
I don't know anything about the particulars of the WDS setup at this point, only that the boot loader program it exposes is WDSNBP.COM.
(2) The setup works fine when iPXE is built at commit 4e85b2708fa0 ("[virtio] Use host-specified MTU when available", 2017-01-23).
(3) When iPXE is built at commit 133f4c47baef ("[build] Handle R_X86_64_PLT32 from binutils 2.31", 2018-09-17), the setup breaks.
The symptom is that iPXE fetches WDSNBP.COM just fine, but WDSNBP.COM, rather than doing whatever it does otherwise, keeps PXE-booting itself (3+ times), and finally aborts.
Consider the following log output (my understanding is that all this is logged by WDSNBP.COM):
Downloaded WDSNBP...
Press F12 for network service boot
Architecture: x64
WDSNBP started using DHCP Referral.
Contacting Server: ... (Gateway: ...)
Contacting Server: ...
TFTP Download: boot\x86\wdsnbp.com
This block repeats approx. 3 times, after which the following is displayed:
Windows Deployment Services: PXE Boot Aborted.
Could not boot image: Error 0x7f8d8101 (http://ipxe.org/7f8d8101)
No more network devices
No bootable device
My understanding is that the first line from this last block is printed by WDSNBP.COM, the second line by iPXE (in pxe_start_nbp()), the third line also by iPXE, and the last one by SeaBIOS.
This seems to indicate that WDSNBP.COM exits with an error code, and pxe_start_nbp() logs it as "Error 0x7f8d8101".
(4) Now, after a bit of searching the web, I've found the following articles, which indicate that the WDS (= server side) setup is incorrect:
(4a) "disable NetBios over TCPIP, on the WDS server"
https://techthoughts.info/pxe-booting-wds-dhcp-scope-vs-ip-helpers/#comment-...
https://social.technet.microsoft.com/Forums/ie/en-US/f3883e8b-1039-477d-999d...
(4b) "cover all combinations of forward and backwards slashes in ReadFilter, on the WDS server"
http://ipxe.org/appnote/chainload_wds#tftp_loops
However: the regression appears to be a function of *only* the git commit at which we build iPXE. It seems so deterministic that we bisected commit range 4e85b2708fa0..133f4c47baef. (Hence we have not captured the network traffic yet, nor have we investigated the WDS server config.)
The "culprit" commit is ea29122a70c6 ("[http] Include error messages for 4xx and 5xx response codes", 2017-12-28).
(5) Which makes no sense to me, unfortunately. :(
Commit ea29122a70c6 adds the "http_errors" array to the code. According to
src/include/ipxe/tables.h
and the build artifact
src/bin/1af41000.rom.tmp.map
this new array is placed in a new section called
.textdata.tbl.errortab.01
Trying to retro-fit those facts to the symptom encountered, I came up with the idea that *maybe* the new array (or section) causes a memory allocation failure in WDSNBP.COM -- due to increased memory footprint of iPXE. Which then leads to the misbehavior of WDSNBP.COM.
After all, WDSNBP.COM is a 16-bit real-mode program:
https://support.microsoft.com/en-us/help/4468601/pxe-boot-in-configuration-m...
so it could be susceptible to the size & fragmentation of the RAM that is under 640KB.
(6) Unfortunately, this "low RAM exhaustion" idea doesn't seem to hold water. There are at least two counter-arguments:
(6a) if I revert commit ea29122a70c6 on top of commit 133f4c47baef, then the issue does *not* go away.
(The issue also does not go away if I remove the "netdev_errors" array, also on top of commit 133f4c47baef -- that's a larger array.)
(... In theory anyway, this might not necessarily disprove the memory exhaustion idea. What if the iPXE footprint grows so much over the ea29122a70c6..133f4c47baef range, for independent reasons, that reverting ea29122a70c6 at the end cannot compensate for that increase?)
(6b) I added "DEBUG=pxe_call:1" to the "make" command, and compared the debug messages printed by pxe_start_nbp(), between 4e85b2708fa0 and 133f4c47baef. Alas, the debug messages are identical:
PXE NBP starting with netdev net0, code 9c6c:0802, data 9cf0:2ce0
which to me suggests that there is no change in the amount of memory that is made available to WDSNBP.COM -- its code and data continue to start at 0x9_CEC2 and 0x9_FBE0, respectively.
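For anyone who wants to double-check the address arithmetic above, here is a minimal Python sketch of the real-mode segment:offset conversion (linear = segment * 16 + offset). The helper name linear() is just something I made up for illustration; it is not taken from iPXE. The sketch also ignores 20-bit wraparound, which does not matter for these values.

```python
def linear(seg_off: str) -> int:
    """Convert a real-mode 'ssss:oooo' hex string to a linear address."""
    seg, off = (int(part, 16) for part in seg_off.split(":"))
    # Real mode: the segment is shifted left by 4 bits, then the offset is added.
    return (seg << 4) + off


# The two values from the pxe_start_nbp() debug message quoted above.
for label, addr in (("code", "9c6c:0802"), ("data", "9cf0:2ce0")):
    print(f"{label}: {addr} -> {linear(addr):#07x}")
```

Both results lie below the 640KB boundary (0xA0000), so they describe the same conventional memory area in both builds.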
Any hints as to what could be going wrong?
Thanks!
Laszlo