On Thu, Sep 26, 2019 at 8:51 AM Kyösti Mälkki <kyosti.malkki@gmail.com> wrote:
Hi Matt,

thanks for bringing the topic up. Please also contact your Intel reps
about this via commercial support channel as well. I believe Patrick
G. once stated that he could act as a relay when it comes to disputes
between commercial and community development 'strategies'.

On Wed, Sep 25, 2019 at 8:48 PM Matt DeVillier <matt.devillier@gmail.com> wrote:
>
> Commit 903b40a [soc/intel: Replace uses of dev_find_slot()] replaced calls to dev_find_slot() with calls to pcidev_path_on_root() for all Intel common SoCs, but it seems unclear exactly what problem this was intended to fix, and has created problems with locating hidden PCI devices.

I believe [1] commit explains dev_find_slot() with enough details.

> Commit f2ac0137 [soc/intel: Fix regression with hidden PCI devices] fixed this partially by creating a debug version of pcidev_path_on_root() which falls back on dev_find_slot(), but did not apply it to all devices which need it. On SKL/KBL alone, there are at least 5 different function calls accessing a dozen PCI devices which require falling back on dev_find_slot().
>
> Kyosti has expressed a desire to eliminate the use of dev_find_slot() since it can potentially return false positives prior to device enumeration in ramstage, but as currently implemented the cure seems worse than the disease.
>
> Short term, it seems like having pcidev_path_on_root() fall back on dev_find_slot() with a loud warning (like pcidev_path_on_root_debug() does now) makes the most sense, vs having two nearly identical function calls used inconsistently. Long term, we need a better strategy for dealing with PCI devices which get hidden by FSP / are in violation of PCI spec.
>
> But since discussion on Gerrit seems to have died, reviving it here for a larger audience.

I just noticed these commits and have commented there. :)

I will quote below what I already wrote in gerrit [2], [3]:

The trouble is the entire PCI subsystem in ramstage is based on
matching the vendor/device ID register with a PCI driver and to the
source we want to control that device with. To allow this hiding of
PCI devices will ultimately force us to write the control somewhere in
the scope of SOC instead. Oh, but wait, perhaps the intention is for
us to __not__ write that control anymore but let the FSP blob do all
that too!

I think we can provide a semantic of "possibly hidden" devices and provide operations which do the correct thing provided the platform constraints. I do want to understand the complete problem, though.

My current understanding is that we're forcing a device to be unhidden after a sequence we know that hides a particular device. And we're attempting to put it back to hidden permanently because of policy enforced by a silicon vendor. However, if an access comes after the hide sequence we trip over the NULL check if the device was removed from the topology. What I couldn't figure out was what caused the need for f2ac0137? p2sb was working still? The commits don't have more information as to the circumstances of the failure.

AFAICS, hardware that does not respond to vendor/device ID register
reads does not meet PCI compliance. I am willing to hit +2 on removing
support for platforms that do not meet PCI compliance, specially when
in the cases here, it is a matter of broken FSP blobs and not broken
silicon per-se.

I'm not in agreement on the policies/thinking from all the hardware vendors. Many issues have arisen in not thinking through pci dev/fun allocation. e.g. function 0 can be hidden but functions >1 should be around. Definitely a violation of PCI spec and many things have been worked around that.

That said, computer architecture/design is messy, and I think it's naive to think everything is spec compliant. There are complex fabrics and fabric policies in all these devices that are utilized to provide a "pci". And as firmware people we need to deal with the realities of the hardware. That's pretty much the point of firmware: prepare the abstractions for the OS to make those assumptions of aligning with specs. Even then, some things just can't be fixed (check out all the quirks and layering violations in the drivers).

 

Also, I should not be accepting new callers for dev_find_slot() due
the ill semantics it has. Prior to device enumeration, it can return
false positives because all devices appear on bus 0. So please look
for alternative solution if you want to support Intel's initiative of
more blob less FOSS.

dev_find_slot() failures have nothing to do with blobs or not. Certainly, some of the circumstances have tripped up some of the assumptions made in that code. And I agree we should move away from using that API and/or removing it.

I suggest you post on the mailing list. That active PCI devices no
longer respond to Vendor ID / Device ID queries does not meet PCI
compliance, as I understand the specs. Unfortunately, someone with
enough authority inside the org will likely decide it's just fine that
ramstage will no longer be designed using PCI drivers and allow use of
shim calling massive FSP blob.


As for solutions:

Preferably fix FSP and not allow this PCI hide-and-seek.

or

Rewrite PCI device enumeration such that devices that do not respond
to ID queries are not permanently removed from topology links. You

I thought that the devices being removed was part of the original problem?
need platform hooks to decide whether to remove a particular device
(based on B:D.F) or not. This would avoid ill dev_find_slot() then.


I have some ideas on some solutions, but I want to understand the failure that occurred after 903b40a.

or

Revert to __SIMPLE_DEVICE__ for PCI access. But __SIMPLE_DEVICE__ is
also on my kill list due to uglyness and high maintenance whenever we
attempt to move sources to earlier stages. Unfortunately things found
in soc/intel has slowed down the process to standstill and work is not
funded.

[1] https://review.coreboot.org/c/coreboot/+/26447
[2] https://review.coreboot.org/c/coreboot/+/35087
[3] https://review.coreboot.org/c/coreboot/+/35088

Regards,
Kyösti Mälkki
_______________________________________________
coreboot mailing list -- coreboot@coreboot.org
To unsubscribe send an email to coreboot-leave@coreboot.org