
CVE-2025-40019 is an integer underflow vulnerability in the crypto subsystem of the Linux kernel. Specifically, the bug sits in the decryption path of the ESSIV wrapper.

The bug was exploited as part of kernelCTF. Below is my write-up of the vulnerability and its exploitation. The exploit was not developed entirely independently: I drew on ideas and approaches from the kernelCTF submission to complete it.

ESSIV

Encrypted salt-sector initialization vector (ESSIV) is an IV generation method for block encryption, mostly used in the context of disk encryption. For this exploit we will be using essiv(authenc(hmac(sha256),cbc(aes)),sha256). This is an encrypt-then-MAC construction where data is encrypted using AES-CBC with per-block ESSIV-derived IVs, and an HMAC-SHA256 tag is computed over the ciphertext and, optionally, the associated data.

For each encrypted block, the IV is derived roughly like the following pseudocode:

// init
salt = sha256(enc_key || auth_key)
IV_in = req->iv

// encryption & decryption
IV_out = AES_ECB_256_encrypt(key = salt, block = IV_in)
IV_in = IV_out

Buggy code location

Inside essiv_aead_crypt(), the subtraction req->assoclen - ivsize is only checked for underflow in the else block. We control the value of req->assoclen and can trigger the underflow in the if block by passing, for example, a value of 0.

static int essiv_aead_crypt(struct aead_request *req, bool enc)
{
	struct crypto_aead *tfm = crypto_aead_reqtfm(req);
	const struct essiv_tfm_ctx *tctx = crypto_aead_ctx(tfm);
	struct essiv_aead_request_ctx *rctx = aead_request_ctx(req);
	struct aead_request *subreq = &rctx->aead_req;
	struct scatterlist *src = req->src;
	int err;

	crypto_cipher_encrypt_one(tctx->essiv_cipher, req->iv, req->iv);

	/*
	 * dm-crypt embeds the sector number and the IV in the AAD region, so
	 * we have to copy the converted IV into the right scatterlist before
	 * we pass it on.
	 */
	rctx->assoc = NULL;
	if (req->src == req->dst || !enc) {
		scatterwalk_map_and_copy(req->iv, req->dst,
					 req->assoclen - crypto_aead_ivsize(tfm),
					 crypto_aead_ivsize(tfm), 1);
	} else {
		u8 *iv = (u8 *)aead_request_ctx(req) + tctx->ivoffset;
		int ivsize = crypto_aead_ivsize(tfm);
		int ssize = req->assoclen - ivsize;
		struct scatterlist *sg;
		int nents;

		if (ssize < 0)
			return -EINVAL;

		...
	}
}

The patch for this vulnerability simply adds the underflow check before the if block so that it covers both cases.

Effects of the underflow

As a result of the underflow, the start parameter of scatterwalk_map_and_copy(), which in the decryption case forwards to memcpy_to_sglist(), becomes huge.

void memcpy_to_sglist(struct scatterlist *sg, unsigned int start,
		      const void *buf, unsigned int nbytes)
{
	struct scatter_walk walk;

	if (unlikely(nbytes == 0)) /* in case sg == NULL */
		return;

	scatterwalk_start_at_pos(&walk, sg, start);
	memcpy_to_scatterwalk(&walk, buf, nbytes);
}

As the name implies, memcpy_to_sglist() is like memcpy(), but the destination is a scatterlist rather than a plain buffer. A scatterlist is a kernel data structure that enables working on physical memory frames scattered through memory as if they were contiguous. Each segment is represented by one struct scatterlist entry. Multiple entries are chained to describe a logically contiguous byte stream. Leaving out some details, struct scatterlist can be represented like this:

typedef struct {
    uint64_t page_link;
    uint32_t offset;
    uint32_t length;
    uint8_t pad[0x10];
} scatter_list_t;

The page_link member encodes a struct page* as well as scatterlist control bits.

To figure out where to start writing, memcpy_to_sglist() calls scatterwalk_start_at_pos() with our underflowed start parameter. scatterwalk_start_at_pos() advances over scatterlist segments until it reaches the segment containing byte offset pos:

static inline void scatterwalk_start_at_pos(struct scatter_walk *walk,
					    struct scatterlist *sg,
					    unsigned int pos)
{
	while (pos > sg->length) {
		pos -= sg->length;
		sg = sg_next(sg);
	}
	walk->sg = sg;
	walk->offset = sg->offset + pos;
}

To understand how sg_next() works, it helps to treat a scatterlist as one logical stream made of many struct scatterlist entries, where each entry describes one segment. Those entries are often stored in fixed-size arrays inside wrapper structs (for example af_alg_sgl). To support more entries, multiple arrays can be linked via chaining.

struct af_alg_sgl {
	struct sg_table sgt;
	struct scatterlist sgl[ALG_MAX_PAGES + 1];
	bool need_unpin;
};

If more than ALG_MAX_PAGES pages are required, the last member of the sgl array acts as a chain entry pointing to the next array.

Equipped with this knowledge, the sg_next() function is straightforward:

static inline struct scatterlist *sg_next(struct scatterlist *sg)
{
	if (sg_is_last(sg))
		return NULL;

	sg++;
	if (unlikely(sg_is_chain(sg)))
		sg = sg_chain_ptr(sg);

	return sg;
}

The function returns NULL if this is the last entry; otherwise it advances to the next entry in the array and, if that entry is a chain entry, follows it to the next array.

After scatterwalk_start_at_pos() has found the starting scatterlist segment, memcpy_to_scatterwalk() writes the input bytes into that segment and keeps advancing through subsequent segments as needed. For each segment, it temporarily maps the referenced page memory into kernel address space, copies the relevant range, and continues until all data is written.

If we trigger the underflow bug, we get the following kernel panic:

[    1.810997] Oops: general protection fault, probably for non-canonical address 0xca4284654a22: 0000 [#1] SMP NOPTI
[    1.811539] CPU: 0 UID: 1000 PID: 201 Comm: exp Not tainted 6.17.0-rc1+ #66 NONE
[    1.811933] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
[    1.812536] RIP: 0010:memcpy_to_scatterwalk+0x12c/0x1d0
[    1.812819] Code: 0f b6 37 45 0f b6 47 03 e8 c1 de 72 ff 49 8b 0c 24 89 d8 83 fb 08 0f 82 01 ff ff ff 49 8b 17 48 8d 79 08 4c 89 fe 48 83 e7 f8 <48> 89 11 49 8b 54 07 f8 48 89 54 01 f8 48 29 f9 48 29 ce 01 d9 c1
[    1.813794] RSP: 0018:ffffa7a4c06efb58 EFLAGS: 00010206
[    1.814078] RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000ca4284654a22
[    1.814453] RDX: 62c79569ff9e8cd0 RSI: ffffa093438dd070 RDI: 0000ca4284654a28
[    1.814827] RBP: ffffa7a4c06efb88 R08: 00000000ffffdfff R09: 0000000000000001
[    1.815211] R10: 00000000ffffdfff R11: ffffffffb64a7160 R12: ffffa7a4c06efb98
[    1.815590] R13: 0000000000000010 R14: ffffa09345cd4a90 R15: ffffa093438dd070
[    1.815973] FS:  00000000116ab3c0(0000) GS:ffffa093a4e4f000(0000) knlGS:0000000000000000
[    1.816397] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.816700] CR2: 00007ffff9a56320 CR3: 0000000103945003 CR4: 0000000000772ef0
[    1.817080] PKRU: 55555554
[    1.817229] Call Trace:
[    1.817372]  <TASK>
[    1.817495]  memcpy_to_sglist+0xe2/0x120
[    1.817706]  essiv_aead_crypt+0x1c4/0x320
[    1.817929]  aead_recvmsg+0x52f/0x660
[    1.818136]  sock_recvmsg+0xad/0xc0
[    1.818333]  ____sys_recvmsg+0x97/0x1f0
[    1.818543]  ___sys_recvmsg+0x90/0xe0
[    1.818742]  __sys_recvmsg+0x84/0xe0
[    1.818944]  do_syscall_64+0x5d/0x200
[    1.819161]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The panic happens inside memcpy_to_scatterwalk() which tries to access an out-of-bounds struct scatterlist returned by sg_next() due to the underflowed pos. memcpy_to_scatterwalk() dereferences an invalid struct page pointer encoded in the page_link field, leading to a general protection fault.

Depending on how you set up the arguments to trigger the bug, the stack trace might look different, but all crashes originate from the resulting out-of-bounds traversal of the scatterlist and access to invalid struct scatterlist entries.

Linux kernel crypto API

Communication with the Linux kernel crypto API from userspace works via sockets of type AF_ALG. Cryptographic operations are submitted using the send system call family, and results are obtained with the recv system call family. A request is stored by the kernel inside a TX SGL (transmit scatter-gather list) represented by struct af_alg_tsgl. The result is written via an RX SGL (receive scatter-gather list) represented by struct af_alg_rsgl, which is backed by user-supplied buffers mapped into kernel space.

Exploitation

To exploit this vulnerability and elevate privileges to root, the best-case scenario would be to control the page_link member of a struct scatterlist and have memcpy_to_scatterwalk() write controlled data to a chosen physical page frame. For this vulnerability we can gain a write of IV size, which is 16 bytes in this case.

To achieve this, the exploit has two parts:

  1. Get the corruption of the scatterlist into a controllable state to control the struct scatterlist which memcpy_to_scatterwalk() uses for the write.
  2. Leverage this control to leak the physical base address of the kernel and use it to write chosen data to arbitrary kernel memory by corrupting a PTE.

1. Gaining control over the corrupted scatterlist

Gaining control over the corrupted scatterlist is only made possible by areq->rsgl_list being left uninitialized inside af_alg_get_rsgl() if the maxsize parameter is 0. areq->rsgl_list is a list of struct af_alg_rsgl entries that contain scatterlists pointing to userspace recvmsg destination buffers where the result of a crypto request will be written. When this member is left uninitialized, it may contain a pointer to the rsgl_list of a previous request.

We can exploit this by reclaiming the chunk referenced by the pointer and thereby gaining control over the corrupted scatterlist. This is achieved by writing a fake struct scatterlist array into the reclaimed chunk. To do so, we trigger an allocation from the kmalloc-1k cache, since struct af_alg_rsgl has a size of 592 bytes and is allocated from this cache.

To force maxsize to be 0, we create a request that tries to decrypt data whose length equals the authentication tag length, so that outlen, which is passed as maxsize to af_alg_get_rsgl(), is 0.

The issue of struct af_alg_async_req being left partly uninitialized has also since been patched.

2. Leveraging control over the corrupted scatterlist to leak the physical base address of the kernel and gaining arbitrary write to kernel memory

Control over the corrupted scatterlist gives us control over the pos value inside scatterwalk_start_at_pos(). We cannot write to arbitrary physical addresses just yet, however, since we lack a kernel heap leak that would let us craft a pointer to a struct page. We can nonetheless abuse this control to write controlled data to the physical frame that backed a struct scatterlist of a previous request, by leveraging the chaining behavior of scatterlists: to achieve in-place decryption, the RX SGL is chained to the part of the TX SGL that holds the authentication tag. By crafting our fake struct scatterlist array in a specific way, we can achieve a write to the physical frame backing the scatterlist of that TX SGL.

The comments in my exploit best explain how this was achieved based on creating a fake struct scatterlist array:

// static inline void scatterwalk_start_at_pos(struct scatter_walk *walk,
//					    struct scatterlist *sg,
//					    unsigned int pos)
//{
//	while (pos > sg->length) {
//		pos -= sg->length;
//		sg = sg_next(sg);
//	}
//	walk->sg = sg;
//	walk->offset = sg->offset + pos;
// }
//
// Fake scatterlist array. Because of underflow `pos` == 0xffffffe0, so set
// `length` to the same value to make the loop inside
// `scatterwalk_start_at_pos()` stop. `sg = sg_next(sg)` will **chain** to
// stale TX SGL that used to contain authentication tag of a previous request.

scatter_list_t *li = (scatter_list_t *)(spray_payload + 0x10);
// 0xf = ALG_MAX_PAGES => next entry is chain entry
li[0xf].page_link = 0x41414140;
li[0xf].length = 0xffffffe0;
li[0xf].offset = 0x0;

We can leverage the controlled 16-byte write to the page frame of the freed scatterlist to obtain a physical address leak: we cause the frame to be reclaimed as a page table and corrupt a PTE to point to a frame containing a physical address at a known offset from the kernel base. For more information about this technique, refer to this blog post (ctrl-f for 0x9c000).

Once we have obtained a leak of the kernel's physical base address, we rerun the exploit on a different CPU to obtain a fresh SLUB state and corrupt a PTE to point to core_pattern, elevating privileges.

My exploit can be found on GitHub. The kernelCTF exploit and write-up are available here.
