While looking at VM bootup times, we stumbled over the fact that the NVMe code only does I/O operations of up to 4kb at a time. This is usually ok, but if you have an OS that loads a lot of data on boot in combination with network backed storage, it shows in bootup times.
There is no need to restrict ourselves to 4kb though. The INT13 calls we receive give us much larger chunks, which we can map directly into a single larger NVMe I/O request if the request buffer is page aligned.
This patch implements all logic required to do the above and gives a substantial performance boost on boot.
v1 -> v2:
  - fix dprintf
  - Fix bounds check on PRP list add logic
  - Reduce PRP list to 15 entries embedded in the ns struct. This reduces
    BIOS reserved memory footprint by 4kb.
v2 -> v3:
  - new patch: nvme: Split requests by maximum allowed size
  - Adjust the maximum request size to sector units. The hardware field
    describes it in pages.
Alexander Graf (4):
  nvme: Record maximum allowed request size
  nvme: Allow to set PRP2
  nvme: Pass large I/O requests as PRP lists
  nvme: Split requests by maximum allowed size

 src/hw/nvme-int.h |  16 ++++++-
 src/hw/nvme.c     | 122 +++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 121 insertions(+), 17 deletions(-)
NVMe controllers have a limit on how many sectors they can handle within a single request. Remember that number, so that a follow-up patch can verify that we don't exceed it.
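To illustrate the unit conversion (a sketch only; the mdts value and block size below are made-up examples, not taken from any specific controller), the value reported by the hardware is a power-of-two page count, which gets converted into sectors:

    /* Hypothetical example: mdts = 5, 512-byte blocks.
     *   max bytes   = (1 << 5) * 4096 = 131072
     *   max sectors = 131072 / 512    = 256
     * mdts == 0 means the controller reports no limit, so the cap stays
     * effectively unlimited in that case. */
    u32 max_req_size = mdts ? ((1U << mdts) * NVME_PAGE_SIZE) / block_size
                            : -1U;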
Signed-off-by: Alexander Graf <graf@amazon.com>
---
v1 -> v2:
- fix dprintf
v2 -> v3:
- Adjust the maximum request size to sector units. The hardware field
  describes it in pages.
---
 src/hw/nvme-int.h |  8 +++++++-
 src/hw/nvme.c     | 13 +++++++++++--
 2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 9f95dd8..674008a 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -117,6 +117,7 @@ struct nvme_namespace {
     u32 block_size;
     u32 metadata_size;
+    u32 max_req_size;
 
     /* Page aligned buffer of size NVME_PAGE_SIZE. */
     char *dma_buffer;
@@ -131,7 +132,12 @@ struct nvme_identify_ctrl {
     char mn[40];
     char fr[8];
 
-    char _boring[516 - 72];
+    u8 rab;
+    u8 ieee[3];
+    u8 cmic;
+    u8 mdts;
+
+    char _boring[516 - 78];
 
     u32 nn; /* number of namespaces */
 };
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 6a01204..5bc2586 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -238,7 +238,8 @@ nvme_admin_identify_ns(struct nvme_ctrl *ctrl, u32 ns_id)
 }
 
 static void
-nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
+nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id,
+              u8 mdts)
 {
     ns->ctrl = ctrl;
     ns->ns_id = ns_id;
@@ -281,6 +282,14 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, struct nvme_namespace *ns, u32 ns_id)
     ns->drive.blksize = ns->block_size;
     ns->drive.sectors = ns->lba_count;
 
+    if (mdts) {
+        ns->max_req_size = ((1U << mdts) * NVME_PAGE_SIZE) / ns->block_size;
+        dprintf(3, "NVME NS %u max request size: %d sectors\n",
+                ns_id, ns->max_req_size);
+    } else {
+        ns->max_req_size = -1U;
+    }
+
     ns->dma_buffer = zalloc_page_aligned(&ZoneHigh, NVME_PAGE_SIZE);
 
     char *desc = znprintf(MAXDESCSIZE, "NVMe NS %u: %llu MiB (%llu %u-byte "
@@ -567,7 +576,7 @@ nvme_controller_enable(struct nvme_ctrl *ctrl)
     /* Populate namespace IDs */
     int ns_idx;
     for (ns_idx = 0; ns_idx < ctrl->ns_count; ns_idx++) {
-        nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1);
+        nvme_probe_ns(ctrl, &ctrl->ns[ns_idx], ns_idx + 1, identify->mdts);
     }
 
     dprintf(3, "NVMe initialization complete!\n");
When creating a PRP based I/O request, we pass in the pointer to operate on. Going forward, we will want to be able to pass additional pointers for mappings above 4k.
This patch adds a parameter to nvme_get_next_sqe() to pass in the PRP2 value of an NVMe I/O request, paving the way for a future patch to implement PRP lists.
Signed-off-by: Alexander Graf <graf@amazon.com>
Reviewed-by: Filippo Sironi <sironi@amazon.de>
---
 src/hw/nvme.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index 5bc2586..f1b9331 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -152,7 +152,7 @@ nvme_wait(struct nvme_sq *sq)
 /* Returns the next submission queue entry (or NULL if the queue is full).
    It also fills out Command Dword 0 and clears the rest. */
 static struct nvme_sqe *
-nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
+nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *data2)
 {
     if (((sq->head + 1) & sq->common.mask) == sq->tail) {
         dprintf(3, "submission queue is full");
@@ -166,6 +166,7 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data)
     sqe->cdw0 = opc | (sq->tail << 16 /* CID */);
     sqe->mptr = (u32)metadata;
     sqe->dptr_prp1 = (u32)data;
+    sqe->dptr_prp2 = (u32)data2;
 
     if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
         /* Data buffer not page aligned. */
@@ -200,7 +201,7 @@ nvme_admin_identify(struct nvme_ctrl *ctrl, u8 cns, u32 nsid)
     struct nvme_sqe *cmd_identify;
     cmd_identify = nvme_get_next_sqe(&ctrl->admin_sq,
                                      NVME_SQE_OPC_ADMIN_IDENTIFY, NULL,
-                                     identify_buf);
+                                     identify_buf, NULL);
 
     if (!cmd_identify) {
         warn_internalerror();
@@ -338,7 +339,7 @@ nvme_create_io_cq(struct nvme_ctrl *ctrl, struct nvme_cq *cq, u16 q_idx)
 
     cmd_create_cq = nvme_get_next_sqe(&ctrl->admin_sq,
                                       NVME_SQE_OPC_ADMIN_CREATE_IO_CQ, NULL,
-                                      cq->cqe);
+                                      cq->cqe, NULL);
     if (!cmd_create_cq) {
         goto err_destroy_cq;
     }
@@ -382,7 +383,7 @@ nvme_create_io_sq(struct nvme_ctrl *ctrl, struct nvme_sq *sq, u16 q_idx, struct
 
     cmd_create_sq = nvme_get_next_sqe(&ctrl->admin_sq,
                                       NVME_SQE_OPC_ADMIN_CREATE_IO_SQ, NULL,
-                                      sq->sqe);
+                                      sq->sqe, NULL);
     if (!cmd_create_sq) {
         goto err_destroy_sq;
     }
@@ -429,7 +430,7 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
     struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
                                                  write ? NVME_SQE_OPC_IO_WRITE
                                                        : NVME_SQE_OPC_IO_READ,
-                                                 NULL, buf);
+                                                 NULL, buf, NULL);
     io_read->nsid = ns->ns_id;
     io_read->dword[10] = (u32)lba;
     io_read->dword[11] = (u32)(lba >> 32);
Today, we split every I/O request into at most 4kb chunks and wait for these requests to finish. We encountered issues where the backing storage is network based, so every I/O request needs to go over the network with the associated latency cost. A few ms of latency when loading a 100MB initrd in 4kb chunks does add up.
NVMe implements a feature to allow I/O requests spanning multiple pages, called PRP lists. This patch takes larger I/O operations and checks whether they can be passed directly to the NVMe backing device as a PRP list. At least for grub, read operations can always be mapped directly into PRP list items.
This reduces the number of I/O operations required during a typical boot path by roughly a factor of 5.
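As a rough sketch of how such a request is described (example values only; this mirrors the PRP scheme the patch below uses, it is not code from the patch): PRP1 points at the first page of the buffer, and for transfers beyond two pages PRP2 points at a list holding the remaining page addresses.

    /* Hypothetical 24kb read from a page-aligned buffer at 0x7a000000:
     * the first 4kb page goes into PRP1, the other five pages into a
     * PRP list whose address is written to PRP2. */
    u32 buf = 0x7a000000;                 /* assumed page aligned */
    u64 prpl[5];
    int i;
    for (i = 0; i < 5; i++)
        prpl[i] = buf + NVME_PAGE_SIZE * (i + 1);
    /* sqe->dptr_prp1 = buf; sqe->dptr_prp2 = (u32)prpl; */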
Signed-off-by: Alexander Graf <graf@amazon.com>
---
v1 -> v2:
- Fix bounds check on PRP list add logic
- Reduce PRP list to 15 entries embedded in the ns struct. This reduces
  BIOS reserved memory footprint by 4kb.
---
 src/hw/nvme-int.h |  8 ++++++
 src/hw/nvme.c     | 84 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 82 insertions(+), 10 deletions(-)
diff --git a/src/hw/nvme-int.h b/src/hw/nvme-int.h
index 674008a..7947b29 100644
--- a/src/hw/nvme-int.h
+++ b/src/hw/nvme-int.h
@@ -10,6 +10,8 @@
 #include "types.h" // u32
 #include "pcidevice.h" // struct pci_device
 
+#define NVME_MAX_PRPL_ENTRIES 15 /* Allows requests up to 64kb */
+
 /* Data structures */
 
 /* The register file of a NVMe host controller. This struct follows the naming
@@ -121,6 +123,11 @@ struct nvme_namespace {
 
     /* Page aligned buffer of size NVME_PAGE_SIZE. */
     char *dma_buffer;
+
+    /* Page List */
+    u32 prpl_len;
+    void *prp1;
+    u64 prpl[NVME_MAX_PRPL_ENTRIES];
 };
 
 /* Data structures for NVMe admin identify commands */
@@ -195,6 +202,7 @@ union nvme_identify {
 #define NVME_CQE_DW3_P (1U << 16)
 
 #define NVME_PAGE_SIZE 4096
+#define NVME_PAGE_MASK ~(NVME_PAGE_SIZE - 1)
 
 /* Length for the queue entries. */
 #define NVME_SQE_SIZE_LOG 6
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f1b9331..b92ca52 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -168,11 +168,6 @@ nvme_get_next_sqe(struct nvme_sq *sq, u8 opc, void *metadata, void *data, void *
     sqe->dptr_prp1 = (u32)data;
     sqe->dptr_prp2 = (u32)data2;
 
-    if (sqe->dptr_prp1 & (NVME_PAGE_SIZE - 1)) {
-        /* Data buffer not page aligned. */
-        warn_internalerror();
-    }
-
     return sqe;
 }
 
@@ -418,19 +413,29 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
                   int write)
 {
     u32 buf_addr = (u32)buf;
+    void *prp2;
 
-    if ((buf_addr & 0x3) ||
-        ((buf_addr & ~(NVME_PAGE_SIZE - 1)) !=
-         ((buf_addr + ns->block_size * count - 1) & ~(NVME_PAGE_SIZE - 1)))) {
-        /* Buffer is misaligned or crosses page boundary */
+    if (buf_addr & 0x3) {
+        /* Buffer is misaligned */
         warn_internalerror();
         return DISK_RET_EBADTRACK;
     }
 
+    if ((ns->block_size * count) > (NVME_PAGE_SIZE * 2)) {
+        /* We need to describe more than 2 pages, rely on PRP List */
+        prp2 = ns->prpl;
+    } else if ((ns->block_size * count) > NVME_PAGE_SIZE) {
+        /* Directly embed the 2nd page if we only need 2 pages */
+        prp2 = (void *)(long)ns->prpl[0];
+    } else {
+        /* One page is enough, don't expose anything else */
+        prp2 = NULL;
+    }
+
     struct nvme_sqe *io_read = nvme_get_next_sqe(&ns->ctrl->io_sq,
                                                  write ? NVME_SQE_OPC_IO_WRITE
                                                        : NVME_SQE_OPC_IO_READ,
-                                                 NULL, buf, NULL);
+                                                 NULL, buf, prp2);
     io_read->nsid = ns->ns_id;
     io_read->dword[10] = (u32)lba;
     io_read->dword[11] = (u32)(lba >> 32);
@@ -450,6 +455,60 @@ nvme_io_readwrite(struct nvme_namespace *ns, u64 lba, char *buf, u16 count,
     return DISK_RET_SUCCESS;
 }
 
+static void nvme_reset_prpl(struct nvme_namespace *ns)
+{
+    ns->prpl_len = 0;
+}
+
+static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
+{
+    if (ns->prpl_len >= NVME_MAX_PRPL_ENTRIES)
+        return -1;
+
+    ns->prpl[ns->prpl_len++] = base;
+
+    return 0;
+}
+
+int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
+{
+    int first_page = 1;
+    u32 base = (long)op->buf_fl;
+    s32 size = op->count * ns->block_size;
+
+    if (op->count > ns->max_req_size)
+        return -1;
+
+    nvme_reset_prpl(ns);
+
+    /* Special case for transfers that fit into PRP1, but are unaligned */
+    if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
+        ns->prp1 = op->buf_fl;
+        return 0;
+    }
+
+    /* Every request has to be page aligned */
+    if (base & ~NVME_PAGE_MASK)
+        return -1;
+
+    /* Make sure a full block fits into the last chunk */
+    if (size & (ns->block_size - 1ULL))
+        return -1;
+
+    for (; size > 0; base += NVME_PAGE_SIZE, size -= NVME_PAGE_SIZE) {
+        if (first_page) {
+            /* First page is special */
+            ns->prp1 = (void*)base;
+            first_page = 0;
+            continue;
+        }
+        if (nvme_add_prpl(ns, base))
+            return -1;
+    }
+
+    return 0;
+}
+
 static int
 nvme_create_io_queues(struct nvme_ctrl *ctrl)
 {
@@ -668,6 +727,11 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    if (!nvme_build_prpl(ns, op)) {
+        /* Request goes via PRP List logic */
+        return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
+    }
+
     for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {
         u16 blocks_remaining = op->count - i;
         u16 blocks = blocks_remaining < max_blocks ? blocks_remaining
Some NVMe controllers only support small maximum request sizes, such as the AWS EBS NVMe implementation which only supports NVMe requests of up to 32 pages (256kb) at once.
BIOS callers can exceed those request sizes by defining sector counts above this threshold. Currently we fall back to the bounce buffer implementation for those. This is slow.
This patch introduces splitting logic to the NVMe I/O request code so that every NVMe I/O request gets handled in a chunk size that is consumable by the NVMe adapter, while maintaining the fast path PRPL logic we just introduced.
Signed-off-by: Alexander Graf <graf@amazon.com>
---
 src/hw/nvme.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
+
     if (!nvme_build_prpl(ns, op)) {
         /* Request goes via PRP List logic */
         return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
Some NVMe controllers only support small maximum request sizes, such as the AWS EBS NVMe implementation which only supports NVMe requests of up to 32 pages (256kb) at once.
BIOS callers can exceed those request sizes by defining sector counts above this threshold. Currently we fall back to the bounce buffer implementation for those. This is slow.
This patch introduces splitting logic to the NVMe I/O request code so that every NVMe I/O request gets handled in a chunk size that is consumable by the NVMe adapter, while maintaining the fast path PRPL logic we just introduced.
Signed-off-by: Alexander Graf graf@amazon.com
 src/hw/nvme.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
Otherwise, the patch series looks okay to me. (I don't have enough knowledge of the nvme code to give a full review though.)
-Kevin
Hey Kevin,
On 03.10.2020 at 03:47, Kevin O'Connor kevin@koconnor.net wrote:
On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
Some NVMe controllers only support small maximum request sizes, such as the AWS EBS NVMe implementation which only supports NVMe requests of up to 32 pages (256kb) at once.
BIOS callers can exceed those request sizes by defining sector counts above this threshold. Currently we fall back to the bounce buffer implementation for those. This is slow.
This patch introduces splitting logic to the NVMe I/O request code so that every NVMe I/O request gets handled in a chunk size that is consumable by the NVMe adapter, while maintaining the fast path PRPL logic we just introduced.
Signed-off-by: Alexander Graf graf@amazon.com
 src/hw/nvme.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Otherwise, the patch series looks okay to me. (I don't have enough knowledge of the nvme code to give a full review though).
Thanks :)
Alex
On Sat, Oct 03, 2020 at 05:40:58AM +0000, Graf (AWS), Alexander wrote:
Hey Kevin,
On 03.10.2020 at 03:47, Kevin O'Connor kevin@koconnor.net wrote:
On Wed, Sep 30, 2020 at 11:10:56PM +0200, Alexander Graf wrote:
Some NVMe controllers only support small maximum request sizes, such as the AWS EBS NVMe implementation which only supports NVMe requests of up to 32 pages (256kb) at once.
BIOS callers can exceed those request sizes by defining sector counts above this threshold. Currently we fall back to the bounce buffer implementation for those. This is slow.
This patch introduces splitting logic to the NVMe I/O request code so that every NVMe I/O request gets handled in a chunk size that is consumable by the NVMe adapter, while maintaining the fast path PRPL logic we just introduced.
Signed-off-by: Alexander Graf graf@amazon.com
 src/hw/nvme.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index b92ca52..cc37bca 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
Otherwise, the patch series looks okay to me. (I don't have enough knowledge of the nvme code to give a full review though).
Thanks :)
I'll leave a few more days to allow others to comment, but otherwise I'm fine with committing.
Cheers, -Kevin
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
It's still kind of icky. This used to be a purely iterative function. Now for a large unaligned request (which nvme_build_prpl refuses to handle) you get a mixture of (mostly tail) recursion and iteration.
It'll recurse for each max_req_size, then iterate over each page within that.
I'd much rather see just a simple iterative loop. Something like this (incremental, completely untested):
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -472,19 +472,20 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
 
 int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
 {
-    int first_page = 1;
+    int first_page = 1, count = op->count;
     u32 base = (long)op->buf_fl;
-    s32 size = op->count * ns->block_size;
+    s32 size;
 
-    if (op->count > ns->max_req_size)
-        return -1;
+    if (count > ns->max_req_size)
+        count = ns->max_req_size;
 
     nvme_reset_prpl(ns);
 
+    size = count * ns->block_size;
     /* Special case for transfers that fit into PRP1, but are unaligned */
     if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
         ns->prp1 = op->buf_fl;
-        return 0;
+        return count;
     }
 
     /* Every request has to be page aligned */
@@ -506,7 +507,7 @@ int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
             return -1;
     }
 
-    return 0;
+    return count;
 }
 
 static int
@@ -725,27 +726,16 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
 {
     int res = DISK_RET_SUCCESS;
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
-    u16 i;
-
-    /* Split up requests that are larger than the device can handle */
-    if (op->count > ns->max_req_size) {
-        u16 count = op->count;
+    u16 i, count;
 
-        /* Handle the first max_req_size elements */
-        op->count = ns->max_req_size;
-        if (nvme_cmd_readwrite(ns, op, write))
-            return res;
-
-        /* Handle the remainder of the request */
-        op->count = count - ns->max_req_size;
-        op->lba += ns->max_req_size;
-        op->buf_fl += (ns->max_req_size * ns->block_size);
-        return nvme_cmd_readwrite(ns, op, write);
-    }
+    while (op->count && ((count = nvme_build_prpl(ns, op)) > 0)) {
+        res = nvme_io_readwrite(ns, op->lba, ns->prp1, count, write);
+        if (res != DISK_RET_SUCCESS)
+            break;
 
-    if (!nvme_build_prpl(ns, op)) {
-        /* Request goes via PRP List logic */
-        return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
+        op->count -= count;
+        op->lba += count;
+        op->buf_fl += (count * ns->block_size);
     }
 
     for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {
On Thu, Oct 29, 2020 at 05:21:04PM +0000, David Woodhouse wrote:
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
It's still kind of icky. This used to be a purely iterative function. Now for a large unaligned request (which nvme_build_prpl refuses to handle) you get a mixture of (mostly tail) recursion and iteration.
It'll recurse for each max_req_size, then iterate over each page within that.
I'd much rather see just a simple iterative loop. Something like this (incremental, completely untested):
FWIW, I agree that a version without recursion (even tail recursion) would be nice. If someone wants to test and confirm that, then that would be great.
-Kevin
From: David Woodhouse <dwmw@amazon.co.uk>
This ended up with an odd mix of recursion (albeit *mostly* tail recursion) and iteration that could have been prettier. Make it prettier by having nvme_build_prpl() return the number of blocks it consumes, and just looping.
If nvme_build_prpl() doesn't want to play, just fall through to the original loop.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
On Fri, 2020-10-30 at 19:09 -0400, Kevin O'Connor wrote:
FWIW, I agree that a version without recursion (even tail recursion) would be nice. If someone wants to test and confirm that, then that would be great.
Seems to work here... I just tidied it up a little more and added a dprintf() to match the original loop. Didn't see it actually limiting the requests in my testing so I made nvme_build_prpl() limit to 32 blocks and then it did... and still seems to put things in the right place.
 src/hw/nvme.c | 47 +++++++++++++++++++++--------------------------
 1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index cc37bca..5363d3f 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -472,19 +472,20 @@ static int nvme_add_prpl(struct nvme_namespace *ns, u64 base)
 
 int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
 {
-    int first_page = 1;
+    int first_page = 1, count = op->count;
     u32 base = (long)op->buf_fl;
-    s32 size = op->count * ns->block_size;
+    s32 size;
 
-    if (op->count > ns->max_req_size)
-        return -1;
+    if (count > ns->max_req_size)
+        count = ns->max_req_size;
 
     nvme_reset_prpl(ns);
 
+    size = count * ns->block_size;
     /* Special case for transfers that fit into PRP1, but are unaligned */
     if (((size + (base & ~NVME_PAGE_MASK)) <= NVME_PAGE_SIZE)) {
         ns->prp1 = op->buf_fl;
-        return 0;
+        return count;
     }
 
     /* Every request has to be page aligned */
@@ -506,7 +507,7 @@ int nvme_build_prpl(struct nvme_namespace *ns, struct disk_op_s *op)
             return -1;
     }
 
-    return 0;
+    return count;
 }
 
 static int
@@ -725,34 +726,28 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
 {
     int res = DISK_RET_SUCCESS;
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
-    u16 i;
+    u16 i, blocks;
 
-    /* Split up requests that are larger than the device can handle */
-    if (op->count > ns->max_req_size) {
-        u16 count = op->count;
+    while (op->count && ((blocks = nvme_build_prpl(ns, op)) > 0)) {
+        res = nvme_io_readwrite(ns, op->lba, ns->prp1, blocks, write);
+        dprintf(5, "ns %u %s lba %llu+%u: %d\n", ns->ns_id, write ? "write"
+                                                                  : "read",
+                op->lba, blocks, res);
 
-        /* Handle the first max_req_size elements */
-        op->count = ns->max_req_size;
-        if (nvme_cmd_readwrite(ns, op, write))
+        if (res != DISK_RET_SUCCESS)
             return res;
 
-        /* Handle the remainder of the request */
-        op->count = count - ns->max_req_size;
-        op->lba += ns->max_req_size;
-        op->buf_fl += (ns->max_req_size * ns->block_size);
-        return nvme_cmd_readwrite(ns, op, write);
-    }
-
-    if (!nvme_build_prpl(ns, op)) {
-        /* Request goes via PRP List logic */
-        return nvme_io_readwrite(ns, op->lba, ns->prp1, op->count, write);
+        op->count -= blocks;
+        op->lba += blocks;
+        op->buf_fl += (blocks * ns->block_size);
     }
 
     for (i = 0; i < op->count && res == DISK_RET_SUCCESS;) {
-        u16 blocks_remaining = op->count - i;
-        u16 blocks = blocks_remaining < max_blocks ? blocks_remaining
-                                                   : max_blocks;
         char *op_buf = op->buf_fl + i * ns->block_size;
+        u16 blocks_remaining = op->count - i;
+
+        blocks = blocks_remaining < max_blocks ? blocks_remaining
+                                               : max_blocks;
 
         if (write) {
             memcpy(ns->dma_buffer, op_buf, blocks * ns->block_size);
Greetings! Was this patch set tested on bare metal hardware? I'm seeing "GRUB loading: Read Error" when attempting to boot from NVMe on various Purism Librem devices (Intel Skylake thru Cometlake hardware). Reverting this series resolves the issue. I'm not seeing anything in coreboot's cbmem console log (even with the debug level raised), which I suppose isn't unexpected given it's a grub error and not in SeaBIOS itself. NVMe is a run of the mill Samsung EVO (960 or 970) and SeaBIOS reports the max request size is 4096 sectors.
cheers, Matt
On Fri, Oct 30, 2020 at 6:09 PM Kevin O'Connor kevin@koconnor.net wrote:
On Thu, Oct 29, 2020 at 05:21:04PM +0000, David Woodhouse wrote:
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
     u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
     u16 i;
 
+    /* Split up requests that are larger than the device can handle */
+    if (op->count > ns->max_req_size) {
+        u16 count = op->count;
+
+        /* Handle the first max_req_size elements */
+        op->count = ns->max_req_size;
+        if (nvme_cmd_readwrite(ns, op, write))
+            return res;
+
+        /* Handle the remainder of the request */
+        op->count = count - ns->max_req_size;
+        op->lba += ns->max_req_size;
+        op->buf_fl += (ns->max_req_size * ns->block_size);
+        return nvme_cmd_readwrite(ns, op, write);
+    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
It's still kind of icky. This used to be a purely iterative function. Now for a large unaligned request (which nvme_build_prpl refuses to handle) you get a mixture of (mostly tail) recursion and iteration.
It'll recurse for each max_req_size, then iterate over each page within that.
I'd much rather see just a simple iterative loop. Something like this (incremental, completely untested):
FWIW, I agree that a version without recursion (even tail recursion) would be nice. If someone wants to test and confirm that, then that would be great.
-Kevin
here's the NVMe init log, was able to get it off a Chromebox with CCD :)
/7aa26000\ Start thread
|7aa26000| Found NVMe controller with version 1.3.0.
|7aa26000| Capabilities 000000303c033fff
|7aa26000| q 0x7aa77bce q_idx 1 dbl 0x91201004
|7aa26000| q 0x7aa77bbc q_idx 0 dbl 0x91201000
|7aa26000| sq 0x7aa77bbc q_idx 0 sqe 0x7aa53000
|7aa26000| admin submission queue: 0x7aa53000
|7aa26000| admin completion queue: 0x7aa54000
|7aa26000| sq 0x7aa77bbc next_sqe 0
|7aa26000| sq 0x7aa77bbc commit_sqe 0
|7aa26000| cq 0x7aa77bce head 0 -> 1
|7aa26000| sq 0x7aa77bbc advanced to 1
|7aa26000| NVMe has 1 namespace.
|7aa26000| q 0x7aa77bf1 q_idx 3 dbl 0x9120100c
|7aa26000| sq 0x7aa77bbc next_sqe 1
|7aa26000| sq 0x7aa77bbc commit_sqe 1
|7aa26000| cq 0x7aa77bce head 1 -> 2
|7aa26000| sq 0x7aa77bbc advanced to 2
|7aa26000| q 0x7aa77bdf q_idx 2 dbl 0x91201008
|7aa26000| sq 0x7aa77bdf q_idx 2 sqe 0x7aa4e000
|7aa26000| sq 0x7aa77bbc next_sqe 2
|7aa26000| sq 0x7aa77bdf create dword10 00ff0001 dword11 00010001
|7aa26000| sq 0x7aa77bbc commit_sqe 2
|7aa26000| cq 0x7aa77bce head 2 -> 3
|7aa26000| sq 0x7aa77bbc advanced to 3
|7aa26000| sq 0x7aa77bbc next_sqe 3
|7aa26000| sq 0x7aa77bbc commit_sqe 3
|7aa26000| cq 0x7aa77bce head 3 -> 4
|7aa26000| sq 0x7aa77bbc advanced to 4
|7aa26000| NVME NS 1 max request size: 4096 sectors
|7aa26000| NVMe NS 1: 476940 MiB (976773168 512-byte blocks + 0-byte metadata)
|7aa26000| Searching bootorder for: /pci@i0cf8/pci-bridge@1c,4/*@0
|7aa26000| Registering bootable: NVMe NS 1: 476940 MiB (976773168 512-byte blocks + 0-byte metadata) (type:2 prio:103 data:f59e0)
|7aa26000| NVMe initialization complete!
\7aa26000/ End thread
On Mon, Jan 17, 2022 at 7:24 PM Matt DeVillier matt.devillier@gmail.com wrote:
Greetings! Was this patch set tested on bare metal hardware? I'm seeing "GRUB loading: Read Error" when attempting to boot from NVMe on various Purism Librem devices (Intel Skylake thru Cometlake hardware). Reverting this series resolves the issue. I'm not seeing anything in coreboot's cbmem console log (even with the debug level raised), which I suppose isn't unexpected given it's a grub error and not in SeaBIOS itself. NVMe is a run of the mill Samsung EVO (960 or 970) and SeaBIOS reports the max request size is 4096 sectors.
cheers, Matt
On Fri, Oct 30, 2020 at 6:09 PM Kevin O'Connor kevin@koconnor.net wrote:
On Thu, Oct 29, 2020 at 05:21:04PM +0000, David Woodhouse wrote:
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
> --- a/src/hw/nvme.c
> +++ b/src/hw/nvme.c
> @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
>      u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
>      u16 i;
>
> +    /* Split up requests that are larger than the device can handle */
> +    if (op->count > ns->max_req_size) {
> +        u16 count = op->count;
> +
> +        /* Handle the first max_req_size elements */
> +        op->count = ns->max_req_size;
> +        if (nvme_cmd_readwrite(ns, op, write))
> +            return res;
> +
> +        /* Handle the remainder of the request */
> +        op->count = count - ns->max_req_size;
> +        op->lba += ns->max_req_size;
> +        op->buf_fl += (ns->max_req_size * ns->block_size);
> +        return nvme_cmd_readwrite(ns, op, write);
> +    }
Depending on the disk access, this code could run with a small stack. I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
It's still kind of icky. This used to be a purely iterative function. Now for a large unaligned request (which nvme_build_prpl refuses to handle) you get a mixture of (mostly tail) recursion and iteration.
It'll recurse for each max_req_size, then iterate over each page within that.
I'd much rather see just a simple iterative loop. Something like this (incremental, completely untested):
FWIW, I agree that a version without recursion (even tail recursion) would be nice. If someone wants to test and confirm that, then that would be great.
-Kevin
and attempting to boot from it:
Booting from Hard Disk...
sq 0x7aa77bdf next_sqe 0
sq 0x7aa77bdf commit_sqe 0
cq 0x7aa77bf1 head 0 -> 1
sq 0x7aa77bdf advanced to 1
ns 1 read lba 0+1: 0
Booting from 0000:7c00
sq 0x7aa77bdf next_sqe 1
sq 0x7aa77bdf commit_sqe 1
cq 0x7aa77bf1 head 1 -> 2
sq 0x7aa77bdf advanced to 2
ns 1 read lba 1+1: 0
sq 0x7aa77bdf next_sqe 2
sq 0x7aa77bdf commit_sqe 2
cq 0x7aa77bf1 head 2 -> 3
sq 0x7aa77bdf advanced to 3
read io: 00000000 00000000 00010003 40270002
ns 1 read lba 2+105: 12
On Tue, Jan 18, 2022 at 11:45 AM Matt DeVillier matt.devillier@gmail.com wrote:
here's the NVMe init log, was able to get it off a Chromebox with CCD :)
/7aa26000\ Start thread
|7aa26000| Found NVMe controller with version 1.3.0.
|7aa26000| Capabilities 000000303c033fff
|7aa26000| q 0x7aa77bce q_idx 1 dbl 0x91201004
|7aa26000| q 0x7aa77bbc q_idx 0 dbl 0x91201000
|7aa26000| sq 0x7aa77bbc q_idx 0 sqe 0x7aa53000
|7aa26000| admin submission queue: 0x7aa53000
|7aa26000| admin completion queue: 0x7aa54000
|7aa26000| sq 0x7aa77bbc next_sqe 0
|7aa26000| sq 0x7aa77bbc commit_sqe 0
|7aa26000| cq 0x7aa77bce head 0 -> 1
|7aa26000| sq 0x7aa77bbc advanced to 1
|7aa26000| NVMe has 1 namespace.
|7aa26000| q 0x7aa77bf1 q_idx 3 dbl 0x9120100c
|7aa26000| sq 0x7aa77bbc next_sqe 1
|7aa26000| sq 0x7aa77bbc commit_sqe 1
|7aa26000| cq 0x7aa77bce head 1 -> 2
|7aa26000| sq 0x7aa77bbc advanced to 2
|7aa26000| q 0x7aa77bdf q_idx 2 dbl 0x91201008
|7aa26000| sq 0x7aa77bdf q_idx 2 sqe 0x7aa4e000
|7aa26000| sq 0x7aa77bbc next_sqe 2
|7aa26000| sq 0x7aa77bdf create dword10 00ff0001 dword11 00010001
|7aa26000| sq 0x7aa77bbc commit_sqe 2
|7aa26000| cq 0x7aa77bce head 2 -> 3
|7aa26000| sq 0x7aa77bbc advanced to 3
|7aa26000| sq 0x7aa77bbc next_sqe 3
|7aa26000| sq 0x7aa77bbc commit_sqe 3
|7aa26000| cq 0x7aa77bce head 3 -> 4
|7aa26000| sq 0x7aa77bbc advanced to 4
|7aa26000| NVME NS 1 max request size: 4096 sectors
|7aa26000| NVMe NS 1: 476940 MiB (976773168 512-byte blocks + 0-byte metadata)
|7aa26000| Searching bootorder for: /pci@i0cf8/pci-bridge@1c,4/*@0
|7aa26000| Registering bootable: NVMe NS 1: 476940 MiB (976773168 512-byte blocks + 0-byte metadata) (type:2 prio:103 data:f59e0)
|7aa26000| NVMe initialization complete!
\7aa26000/ End thread
On Mon, Jan 17, 2022 at 7:24 PM Matt DeVillier matt.devillier@gmail.com wrote:
Greetings! Was this patch set tested on bare metal hardware? I'm seeing "GRUB loading: Read Error" when attempting to boot from NVMe on various Purism Librem devices (Intel Skylake thru Cometlake hardware). Reverting this series resolves the issue. I'm not seeing anything in coreboot's cbmem console log (even with the debug level raised), which I suppose isn't unexpected given it's a grub error and not in SeaBIOS itself. NVMe is a run of the mill Samsung EVO (960 or 970) and SeaBIOS reports the max request size is 4096 sectors.
cheers, Matt
On Fri, Oct 30, 2020 at 6:09 PM Kevin O'Connor kevin@koconnor.net wrote:
On Thu, Oct 29, 2020 at 05:21:04PM +0000, David Woodhouse wrote:
On Wed, 2020-10-07 at 12:13 -0400, Kevin O'Connor wrote:
> > --- a/src/hw/nvme.c
> > +++ b/src/hw/nvme.c
> > @@ -727,6 +727,22 @@ nvme_cmd_readwrite(struct nvme_namespace *ns, struct disk_op_s *op, int write)
> >      u16 const max_blocks = NVME_PAGE_SIZE / ns->block_size;
> >      u16 i;
> >
> > +    /* Split up requests that are larger than the device can handle */
> > +    if (op->count > ns->max_req_size) {
> > +        u16 count = op->count;
> > +
> > +        /* Handle the first max_req_size elements */
> > +        op->count = ns->max_req_size;
> > +        if (nvme_cmd_readwrite(ns, op, write))
> > +            return res;
> > +
> > +        /* Handle the remainder of the request */
> > +        op->count = count - ns->max_req_size;
> > +        op->lba += ns->max_req_size;
> > +        op->buf_fl += (ns->max_req_size * ns->block_size);
> > +        return nvme_cmd_readwrite(ns, op, write);
> > +    }
>
> Depending on the disk access, this code could run with a small stack.
> I would avoid recursion.
This is tail recursion, which any reasonable compiler converts into a jmp :). Take a look at the .o file - for me it did become a plain jump.
Okay, that's fine then.
It's still kind of icky. This used to be a purely iterative function. Now for a large unaligned request (which nvme_build_prpl refuses to handle) you get a mixture of (mostly tail) recursion and iteration.
It'll recurse for each max_req_size, then iterate over each page within that.
I'd much rather see just a simple iterative loop. Something like this (incremental, completely untested):
FWIW, I agree that a version without recursion (even tail recursion) would be nice. If someone wants to test and confirm that, then that would be great.
-Kevin
Hi Matt,
On 18.01.22 18:46, Matt DeVillier wrote:
and attempting to boot from it:
Booting from Hard Disk...
sq 0x7aa77bdf next_sqe 0
sq 0x7aa77bdf commit_sqe 0
cq 0x7aa77bf1 head 0 -> 1
sq 0x7aa77bdf advanced to 1
ns 1 read lba 0+1: 0
Booting from 0000:7c00
sq 0x7aa77bdf next_sqe 1
sq 0x7aa77bdf commit_sqe 1
cq 0x7aa77bf1 head 1 -> 2
sq 0x7aa77bdf advanced to 2
ns 1 read lba 1+1: 0
sq 0x7aa77bdf next_sqe 2
sq 0x7aa77bdf commit_sqe 2
cq 0x7aa77bf1 head 2 -> 3
sq 0x7aa77bdf advanced to 3
read io: 00000000 00000000 00010003 40270002
ns 1 read lba 2+105: 12
This error means
"PRP Offset Invalid: The Offset field for a PRP entry is invalid. This may occur when there is a PRP entry with a non-zero offset after the first entry or when the Offset field in any PRP entry is not dword aligned (i.e., bits 1:0 are not cleared to 00b)."
which points towards a misaligned PRP entry. The block size of this request is 105 sectors (52.5kb), so it must be a PRPL. I don't see any obvious code paths that would allow us to ever get into a misaligned request here.
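For reference, the status decode works out like this, assuming the last value printed by the "read io:" line is completion queue entry dword 3 (a sketch; the field names are descriptive, not identifiers from the SeaBIOS source):

    /* CQE dword 3 from the log: 0x40270002
     *   bits 15:0  command identifier = 0x0002
     *   bit  16    phase tag          = 0
     *   bits 24:17 status code        = 0x13 -> PRP Offset Invalid
     *   bits 27:25 status code type   = 0    -> generic command status */
    u32 dw3 = 0x40270002;
    u8 sc  = (dw3 >> 17) & 0xff;   /* 0x13 */
    u8 sct = (dw3 >> 25) & 0x7;    /* 0x0  */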
That said, while trying to understand what is happening here I did stumble over a different weird effect: ns->prp1 gets overwritten with 0 by some interrupt handler. Could you please try the patch / hack below and see if it fixes the problem for you? That way I at least know we're hunting the same ghost.
If it doesn't help, I'll send you a debug patch that will give us some more information about the PRP list SeaBIOS assembles.
Thanks,
Alex
---
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f035fa2..99ad7a8 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -257,7 +257,7 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, u32 ns_idx, u8 mdts)
         goto free_buffer;
     }
 
-    struct nvme_namespace *ns = malloc_fseg(sizeof(*ns));
+    struct nvme_namespace *ns = malloc_low(sizeof(*ns));
     if (!ns) {
         warn_noalloc();
         goto free_buffer;
On Tue, Jan 18, 2022 at 4:06 PM Alexander Graf graf@amazon.de wrote:
Hi Matt,
hi Alex,
On 18.01.22 18:46, Matt DeVillier wrote:
and attempting to boot from it:
Booting from Hard Disk...
sq 0x7aa77bdf next_sqe 0
sq 0x7aa77bdf commit_sqe 0
cq 0x7aa77bf1 head 0 -> 1
sq 0x7aa77bdf advanced to 1
ns 1 read lba 0+1: 0
Booting from 0000:7c00
sq 0x7aa77bdf next_sqe 1
sq 0x7aa77bdf commit_sqe 1
cq 0x7aa77bf1 head 1 -> 2
sq 0x7aa77bdf advanced to 2
ns 1 read lba 1+1: 0
sq 0x7aa77bdf next_sqe 2
sq 0x7aa77bdf commit_sqe 2
cq 0x7aa77bf1 head 2 -> 3
sq 0x7aa77bdf advanced to 3
read io: 00000000 00000000 00010003 40270002
ns 1 read lba 2+105: 12
This error means
"PRP Offset Invalid: The Offset field for a PRP entry is invalid. This may occur when there is a PRP entry with a non-zero offset after the first entry or when the Offset field in any PRP entry is not dword aligned (i.e., bits 1:0 are not cleared to 00b)."
which points towards a misaligned PRP entry. The block size of this request is 105 sectors (52.5kb), so it must be a PRPL. I don't see any obvious code paths that would allow us to ever get into a misaligned request here.
That said, while trying to understand what is happening here I did stumble over a different weird effect: ns->prp1 gets overwritten with 0 by some interrupt handler. Could you please try the patch / hack below and see if it fixes the problem for you? That way I at least know we're hunting the same ghost.
The hack below does indeed work. Not sure if that's good or bad :)
If it doesn't help, I'll send you a debug patch that will give us some more information about the PRP list SeaBIOS assembles.
Thanks,
Alex
diff --git a/src/hw/nvme.c b/src/hw/nvme.c
index f035fa2..99ad7a8 100644
--- a/src/hw/nvme.c
+++ b/src/hw/nvme.c
@@ -257,7 +257,7 @@ nvme_probe_ns(struct nvme_ctrl *ctrl, u32 ns_idx, u8 mdts)
         goto free_buffer;
     }
 
-    struct nvme_namespace *ns = malloc_fseg(sizeof(*ns));
+    struct nvme_namespace *ns = malloc_low(sizeof(*ns));
     if (!ns) {
         warn_noalloc();
         goto free_buffer;
On Wed, Sep 30, 2020 at 11:10:52PM +0200, Alexander Graf wrote:
While looking at VM bootup times, we stumbled over the fact that the NVMe code only does I/O operations of up to 4kb at a time. This is usually ok, but if you have an OS that loads a lot of data on boot in combination with network backed storage, it shows in bootup times.
There is no need to restrict ourselves to 4kb though. The INT13 calls we receive give us much larger chunks, which we can map directly into a single larger NVMe I/O request if the request buffer is page aligned.
This patch implements all logic required to do the above and gives a substantial performance boost on boot.
Thanks. I committed this series.
-Kevin