Linux Kernel 6.x — x86_64

Page Table Hierarchy

A reference walk of PGD → P4D → PUD → PMD → PTE → physical page: kernel vs. userspace mappings, virtual-address decoding, and PTE flags.

CR3 — physical address of the PGD; reloaded on every context switch
PGD
Level 4

Page Global Directory

VA[47:39] — 9 bits → 512 entries × 8 bytes = 4 KB table

The base comes from the CR3 register. Each entry holds the physical address of a P4D (or PUD with 4-level paging) plus flags.
pgd_t *pgd = pgd_offset(mm, addr)

PGD — technical details
Kernel 6.x: with 4-level paging, the PGD is the root. Entries 256-511 (upper half) map kernel space, shared by all processes via swapper_pg_dir. Entries 0-255 map user space, unique per process. pgd_offset(mm, addr) = mm->pgd + pgd_index(addr). On fork(), kernel entries are copied from init_mm.pgd; user entries start empty (demand paging). CR3 is reloaded on context switch → TLB flush unless the PCID matches.
P4D
Level 4.5

Page 4th-level Directory

VA[47:39] — folded into the PGD with 4-level paging; a real level with 5-level (LA57), where the PGD moves up to VA[56:48]

On CPUs without LA57 this level is folded away: pgtable-nop4d.h makes p4d_offset() return the pgd. On kernel 6.x with CONFIG_X86_5LEVEL it becomes a real level.
p4d_t *p4d = p4d_offset(pgd, addr)

P4D — 5-level paging (LA57)
Enable with CONFIG_X86_5LEVEL=y. The CPU must support CPUID leaf 0x7, ECX bit 16 (LA57). When the kernel detects LA57 at boot, CR4.LA57 is set and the virtual address space grows to 57 bits (64 PiB per half). Requires Linux ≥ 4.17 and glibc ≥ 2.28. Without it, p4d_offset() simply returns its pgd argument: zero overhead.
PUD
Level 3

Page Upper Directory

VA[38:30] — can be a 1 GB huge-page leaf (PS=1)

If the PS bit is set in a PUD entry → 1 GB huge page and the walk stops here. Used by the kernel's direct map region (physmap). pud_huge() checks for this.
pud_t *pud = pud_offset(p4d, addr)

PUD huge page — 1 GB leaf
The kernel physmap (direct physical mapping) almost always uses 1 GB PUD entries for TLB efficiency: one TLB entry covers 1 GB. Modern Intel CPUs advertise support via the PDPE1GB CPUID bit. Entry format: bits[51:30] = physical base (shifted 30), bit 7 (PS) = 1, plus the standard flags. Kernel check: pud_trans_huge(*pud). In userspace, 1 GB huge pages are rare (they need MAP_HUGETLB plus an explicit size hint).
PMD
Level 2

Page Middle Directory

VA[29:21] — can be a 2 MB huge-page leaf (PS=1)

2 MB THP (Transparent HugePages) work at this level — the kernel auto-promotes 512 contiguous 4K pages. pmd_trans_huge() checks for this; split via __split_huge_pmd().
pmd_t *pmd = pmd_offset(pud, addr)

THP — Transparent HugePage di PMD
The kernel automatically collapses 512 × 4 KB pages into one 2 MB PMD entry when conditions are met (contiguous, aligned, no active page-table walkers). Triggered by the khugepaged daemon. Benefit: drastically reduced TLB pressure. Drawback: internal fragmentation (allocate 2 MB, use 4 KB = 2044 KB wasted). Control via /sys/kernel/mm/transparent_hugepage/enabled. The kernel uses pmd_devmap() for DAX (Device-DAX), which also uses PMD leaves.
PTE
Level 1

Page Table Entry

VA[20:12] — 64-bit entry: bits[51:12] = PFN, bits[11:0] + [63:52] = flags

Leaf node for a 4 KB page. Bit 63 = NX, bit 2 = US (user/supervisor). Physical address = PFN × 4096.
pte_t *pte = pte_offset_map(pmd, addr)

Flag bits: P, RW, US, NX, Dirty, Accessed, Global, protection key.
PTE — swap entry encoding
When a page is swapped out (P=0), the remaining PTE bits encode the swap location: a swap type field plus a swap offset (on x86_64 the type occupies a few bits with the offset above it; see the layout comment in arch/x86/include/asm/pgtable_64.h). The kernel decodes this via pte_to_swp_entry(). On swap-in, do_swap_page() restores the PTE with a fresh physical frame. Also: P=0 plus special marker bits = a "migration entry" (the page is mid-migration between NUMA nodes).
PHYS
4 KB frame

Physical memory access

PA = PFN × 4096 + VA[11:0]

The MMU hardware performs this walk automatically on a TLB miss. The kernel can walk page tables manually via walk_page_range() / follow_pte().
PA = (pte_pfn(*pte) << PAGE_SHIFT) | (addr & ~PAGE_MASK)

Physical frame — struct page
Every physical frame has a struct page in the kernel (8-64 bytes per frame). Access it via pfn_to_page(pfn). With 5-level paging and large RAM, the struct page array itself can reach tens of GB; the kernel uses memory sections and the sparse memory model (CONFIG_SPARSEMEM_VMEMMAP) to keep this efficient. page_address(page) returns the frame's virtual address in the physmap region.

CONFIG_PGTABLE_LEVELS — Folded levels

2-level

PGD → PTE. P4D, PUD, and PMD folded. 32-bit non-PAE configuration; not used on modern x86_64 kernel 6.x.

4-level (default)

PGD → PUD → PMD → PTE. P4D folded. VA space: 128 TiB user + 128 TiB kernel. The default on all modern x86_64 distros.

5-level LA57 ✦

PGD → P4D → PUD → PMD → PTE. CONFIG_X86_5LEVEL=y. VA space: 128 PiB total. CPU must report LA57 in CPUID. Linux ≥ 4.17.


Canonical address rule — x86_64

With 4-level paging, only bits [47:0] are translated. Bits [63:48] must be a sign extension of bit 47. A violation → #GP fault directly from the CPU (before any page fault handler runs).

0x0000000000000000 – 0x00007fffffffffff = userspace canonical
0xffff800000000000 – 0xffffffffffffffff = kernel canonical
0x0000800000000000 – 0xffff7fffffffffff = NON-CANONICAL — #GP fault

With 5-level paging (LA57): bits [63:57] must sign-extend bit 56. User range: 0x0000000000000000 – 0x00ffffffffffffff; kernel range: 0xff00000000000000 – 0xffffffffffffffff (64 PiB each).

Linux x86_64 virtual address space layout (kernel 6.x, 4-level paging); the upper half is kernel space (ring 0).


Real examples — how the kernel sets PTE flags

mmap(PROT_READ|PROT_WRITE, MAP_ANONYMOUS) — heap/stack page
_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_NX | _PAGE_ACCESSED

Data page — writable, accessible from ring 3, not executable (NX=1). Initially mapped with RW=0 until the first write (demand paging), then the dirty bit is set.

mmap(PROT_READ|PROT_EXEC) — program text (.text section)
_PAGE_PRESENT | _PAGE_USER

Code page — read-only (RW=0), executable (NX=0), user accessible. Shared across fork() via copy-on-write.

Kernel text segment (.text at ffff...80000000)
_PAGE_PRESENT | _PAGE_GLOBAL

RW=0 (made read-only post-init via mark_rodata_ro), NX=0 (executable code), US=0 (kernel-only). Global=1 so the TLB entry survives CR3 reloads instead of being flushed on context switch.

Kernel data/BSS segment
_PAGE_PRESENT | _PAGE_RW | _PAGE_GLOBAL | _PAGE_NX

Writable data, not executable (NX=1), kernel-only (US=0). Per-CPU data gets one copy per CPU, addressed relative to the gs segment base.

Copy-On-Write (COW) page after fork()
_PAGE_PRESENT | _PAGE_USER | _PAGE_NX ← RW=0 (write-protected)

When the child or parent writes → #PF → do_wp_page() → allocate a new frame → copy the contents → set RW=1 in the new PTE. The VMA stays writable, but the PTE has RW=0.

SMEP + SMAP enforcement (CR4 bits)
CR4.SMEP: the kernel cannot execute US=1 pages (ring-3 pages from ring 0). CR4.SMAP: the kernel cannot access US=1 pages unless EFLAGS.AC=1 (stac/clac).

These protections are enforced at the hardware level, not in the PTE. The kernel lifts SMAP explicitly via copy_from_user() / copy_to_user(), which wrap the stac/clac instructions.

KPTI — Kernel Page Table Isolation (Meltdown mitigation)
Two PGDs per process: a "user shadow" PGD plus the full kernel PGD

Ring 3 runs on a shadow PGD that maps only the kernel entry trampoline and IDT. On a syscall/interrupt the CPU switches to the full PGD. Overhead: roughly 1-3% on I/O-heavy workloads. Disable with the nopti boot parameter; the kernel also turns it off automatically on CPUs not vulnerable to Meltdown.