Hi Michael & Lists,
I'd like to ask for ideas with the following problem we have.
(1) There is a functional iPXE + WDS setup, with iPXE built as a
traditional BIOS PCI option ROM, using CONFIG=qemu. Accordingly the
platform is qemu, with SeaBIOS, and the NIC is virtio-net-pci.
I don't know anything about the particulars of the WDS setup at this
point, only that the boot loader program it exposes is WDSNBP.COM.
(2) The setup works fine when iPXE is built at commit 4e85b2708fa0
("[virtio] Use host-specified MTU when available", 2017-01-23).
(3) When iPXE is built at commit 133f4c47baef ("[build] Handle
R_X86_64_PLT32 from binutils 2.31", 2018-09-17), the setup breaks.
The symptom is that iPXE fetches WDSNBP.COM just fine, but WDSNBP.COM,
rather than doing whatever it does otherwise, keeps PXE-booting itself
(3+ times), and finally aborts.
Consider the following log output (my understanding is that all this is
logged by WDSNBP.COM):
Press F12 for network service boot
WDSNBP started using DHCP Referral.
Contacting Server: ... (Gateway: ...)
Contacting Server: ...
TFTP Download: boot\x86\wdsnbp.com
This block repeats approx. 3 times, after which the following is
logged:
Windows Deployment Services: PXE Boot Aborted.
Could not boot image: Error 0x7f8d8101 (http://ipxe.org/7f8d8101)
No more network devices
No bootable device
My understanding is that the first line from this last block is printed
by WDSNBP.COM, the second line by iPXE (in pxe_start_nbp()), the third
line also by iPXE, and the last one by SeaBIOS.
This seems to indicate that WDSNBP.COM exits with an error code, and
pxe_start_nbp() logs it as "Error 0x7f8d8101".
(4) Now, after a bit of searching the web, I've found the following
articles, which indicate that the WDS (= server side) setup is at
fault:
(4a) "disable NetBios over TCPIP, on the WDS server"
(4b) "cover all combinations of forward and backwards slashes in
ReadFilter, on the WDS server"
However: the regression appears to be a function of *only* the git
commit at which we build iPXE. It seems so deterministic that we
bisected commit range 4e85b2708fa0..133f4c47baef. (Hence we have not
captured the network traffic yet, nor have we investigated the WDS
setup.)
The "culprit" commit is ea29122a70c6 ("[http] Include error messages for
4xx and 5xx response codes", 2017-12-28).
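
For readers who want to repeat the search, the bisection logic can be
sketched roughly as below. This is only an illustration of the binary
search idea, not what "git bisect" runs internally. It assumes a linear
history ordered oldest to newest, a known-good first commit, a
known-bad last commit, and a breakage that never heals in between. The
function name first_bad() and the is_bad() predicate (build iPXE at
that commit, boot the guest, see whether WDSNBP.COM loops) are
hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() reports a failure.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return commits[hi]
```

Each iteration halves the candidate range, so the number of builds and
boot tests grows only logarithmically with the number of commits in the
range.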
(5) Which makes no sense to me, unfortunately. :(
Commit ea29122a70c6 adds the "http_errors" array to the code. According
and the build artifact
this new array is placed in a new section called
Trying to retro-fit those facts to the symptom encountered, I came up
with the idea that *maybe* the new array (or section) causes a memory
allocation failure in WDSNBP.COM -- due to increased memory footprint of
iPXE. Which then leads to the misbehavior of WDSNBP.COM.
After all, WDSNBP.COM is a 16-bit real-mode program, so it could be
susceptible to the size & fragmentation of the RAM that is under 640KB.
(6) Unfortunately, this "low RAM exhaustion" idea doesn't seem to hold
water. There are at least two counter-arguments:
(6a) if I revert commit ea29122a70c6 on top of commit 133f4c47baef, then
the issue does *not* go away.
(The issue also does not go away if I remove the "netdev_errors" array,
also on top of commit 133f4c47baef -- that's a larger array.)
(... In theory anyway, this might not necessarily disprove the memory
exhaustion idea. What if the iPXE footprint grows so much over the
ea29122a70c6..133f4c47baef range, for independent reasons, that
reverting ea29122a70c6 at the end cannot compensate for that increase?)
(6b) I added "DEBUG=pxe_call:1" to the "make" command, and compared
debug messages printed by pxe_start_nbp(), between 4e85b2708fa0 and
133f4c47baef. Alas, the debug messages are identical:
PXE NBP starting with netdev net0, code 9c6c:0802,
which to me suggests that there is no change in the amount of memory
that is made available to WDSNBP.COM -- its code and data continue to
start at 0x9_CEC2 and 0x9_FBE0, respectively.
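
For anyone double-checking the arithmetic: the linear address above
follows from the real-mode rule "segment * 16 + offset". A minimal
sketch is below; the helper names linear_address() and parse_seg_off()
are made up for this illustration, and only the "code" pair from the
debug message is used.

```python
CONVENTIONAL_TOP = 0xA0000  # 640KB, the top of conventional memory


def linear_address(segment: int, offset: int) -> int:
    """Convert a real-mode segment:offset pair to a linear address."""
    return (segment << 4) + offset


def parse_seg_off(text: str) -> int:
    """Parse a 'ssss:oooo' hex pair as printed in the debug message."""
    seg, off = text.split(":")
    return linear_address(int(seg, 16), int(off, 16))


code_start = parse_seg_off("9c6c:0802")
print(f"code starts at {code_start:#x}")
print(f"bytes between code start and 640KB: {CONVENTIONAL_TOP - code_start}")
```

So 9c6c:0802 maps to 0x9_CEC2, which leaves 12606 bytes between the
start of the NBP's code and the 640KB boundary, the same in both builds.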
Any hints as to what could be going wrong?