“Huge Dirty COW” (CVE-2017-1000405)

The incomplete Dirty COW patch

Published in Bindecy, Nov 27, 2017

The “Dirty COW” vulnerability (CVE-2016-5195) is one of the most hyped and branded vulnerabilities ever published. Every Linux version from the last decade was vulnerable, including Android, desktops and servers. The impact was vast: millions of users could be compromised easily and reliably, bypassing common exploit defenses.

Plenty of information was published about the vulnerability, but its patch was not analyzed in detail.

We at Bindecy were interested in studying the patch and all of its implications. Surprisingly, despite the enormous publicity the bug had received, we discovered that the patch was incomplete.

“Dirty COW” recap

First, we need a full understanding of the original Dirty COW exploit. We’ll assume a basic understanding of the Linux memory manager. We won’t go over all of the original gory details again, as talented people have already done so.

The original vulnerability was in the get_user_pages function. This function is used to get the physical pages behind virtual addresses in user processes. The caller has to specify what kind of actions it intends to perform on these pages (touch, write, lock, etc.), so the memory manager can prepare the pages accordingly. Specifically, when planning to perform a write action on a page inside a private mapping, the page may need to go through a COW (Copy-On-Write) cycle: the original, “read-only” page is copied to a new page which is writable. The original page could be “privileged”: it could be mapped in other processes as well, and might even be written back to the disk after it’s modified.

Let’s now take a look at the relevant code in __get_user_pages:

The while loop’s goal is to fetch each page in the requested page range. Each page has to be faulted in until our requirements are satisfied — that’s what the retry label is used for.

follow_page_mask’s role is to scan the page tables to get the physical page for the given address (while taking into account the PTE permissions), or fail in case the request can’t be satisfied. During follow_page_mask’s operation the PTE’s spinlock is acquired; this guarantees the physical page won’t be released before we grab a reference.

faultin_page requests the memory manager to handle the fault in the given address with the specified permissions (also under the PTE’s spinlock). Note that after a successful call to faultin_page the lock is released — it’s not guaranteed that follow_page_mask will succeed in the next retry; another piece of code might have messed with our page.

The original vulnerable code resided at the end of faultin_page:

The reason for removing the FOLL_WRITE flag is to handle the case where the FOLL_FORCE flag is applied to a read-only VMA (when the VM_MAYWRITE flag is set in the VMA). In that case, the maybe_mkwrite function won’t set the write bit, but the faulted-in page is indeed ready for writing.

If the page went through a COW cycle (marked by the VM_FAULT_WRITE flag) while performing faultin_page and the VMA is not writable, the FOLL_WRITE flag is removed from the next attempt to access the page — only read permissions will be requested.

If the first follow_page_mask fails because the page was read-only or not present, we’ll try to fault it in. Now imagine that in the window before the next attempt to get the page, we get rid of the COW version (e.g. by using madvise(MADV_DONTNEED)).

The next call to faultin_page will be made without the FOLL_WRITE flag, so we’ll get the read-only version of the page from the page cache. Now, the next call to follow_page_mask will also happen without the FOLL_WRITE flag, so it will return the privileged read-only page — as opposed to the caller’s original request for a writable version of the page.
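To make the flow above concrete, here is a minimal userspace sketch of it. It is not kernel code: the `toy_*` names, the `toy_pte` structure and the flag values are assumptions made up for illustration, and the race is replaced by a deterministic switch. It only shows how dropping FOLL_WRITE lets the retry loop hand back the shared read-only page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values are arbitrary here; only the names mirror the kernel's. */
#define FOLL_WRITE 0x1u
#define FOLL_FORCE 0x2u

enum toy_page { PAGE_NONE, PAGE_SHARED_RO, PAGE_COW_COPY };

struct toy_pte {
    enum toy_page page;  /* what the entry currently maps */
    bool writable;       /* hardware write bit */
};

/* Like follow_page_mask: hand out the page only if permissions allow. */
static enum toy_page toy_follow_page(const struct toy_pte *pte, unsigned flags)
{
    assert(pte != NULL);
    if (pte->page == PAGE_NONE)
        return PAGE_NONE;
    if ((flags & FOLL_WRITE) && !pte->writable)
        return PAGE_NONE;
    return pte->page;
}

/* Like the pre-patch faultin_page, for a read-only VMA with FOLL_FORCE. */
static void toy_faultin_page_vulnerable(struct toy_pte *pte, unsigned *flags)
{
    if (*flags & FOLL_WRITE) {
        pte->page = PAGE_COW_COPY;   /* COW broken: private copy installed */
        pte->writable = false;       /* VMA is read-only, so no write bit */
        *flags &= ~FOLL_WRITE;       /* the flaw: later lookups act as reads */
    } else if (pte->page == PAGE_NONE) {
        pte->page = PAGE_SHARED_RO;  /* read fault maps the original page */
        pte->writable = false;
    }
}

/* Models madvise(MADV_DONTNEED) issued by a racing thread. */
static void toy_madvise_dontneed(struct toy_pte *pte)
{
    pte->page = PAGE_NONE;
    pte->writable = false;
}

/* The retry loop for a single page that the caller wants to write to. */
static enum toy_page toy_gup_for_write(bool race_once)
{
    struct toy_pte pte = { PAGE_NONE, false };
    unsigned flags = FOLL_WRITE | FOLL_FORCE;
    enum toy_page page;

    while ((page = toy_follow_page(&pte, flags)) == PAGE_NONE) {
        toy_faultin_page_vulnerable(&pte, &flags);
        if (race_once) {             /* the attacker wins the race once */
            toy_madvise_dontneed(&pte);
            race_once = false;
        }
    }
    return page;
}
```

Without the race, `toy_gup_for_write(false)` ends up with the private COW copy; with it, `toy_gup_for_write(true)` ends up with the shared read-only page even though a writable page was requested.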

Basically, the aforementioned flow is the Dirty COW vulnerability — it allows us to write to the read-only privileged version of a page. The following fix was introduced in faultin_page:

And a new function, which is called by follow_page_mask, was added:

Instead of reducing the requested permissions, get_user_pages now remembers the fact that we went through a COW cycle. On the next iteration, we would be able to get a read-only page for a write operation only if the FOLL_FORCE and FOLL_COW flags are specified, and the PTE is marked as dirty.

This patch assumes that the read-only privileged copy of a page will never have a PTE pointing to it with the dirty bit on — a reasonable assumption… or is it?
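The rule described above can be written down as a small predicate. The following is a userspace sketch under stated assumptions: `toy_pte`, `toy_note_cow` and the flag values are hypothetical stand-ins, and only the boolean logic follows the description of the patch in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values are arbitrary here; only the names mirror the kernel's. */
#define FOLL_WRITE 0x1u
#define FOLL_FORCE 0x2u
#define FOLL_COW   0x4u

struct toy_pte {
    bool write;  /* write bit */
    bool dirty;  /* dirty bit */
};

/* Patched bookkeeping: remember the COW cycle instead of dropping FOLL_WRITE. */
static void toy_note_cow(unsigned *flags, bool cow_broken, bool vma_writable)
{
    assert(flags != NULL);
    if (cow_broken && !vma_writable)
        *flags |= FOLL_COW;
}

/* A read-only entry may serve a write request only after a forced COW,
 * and only if the entry is dirty. */
static bool toy_can_follow_write_pte(struct toy_pte pte, unsigned flags)
{
    return pte.write ||
           ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte.dirty);
}
```

The whole check rests on the dirty bit being a trustworthy sign that the entry points to the COW copy, which is exactly the assumption questioned above.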

Transparent Huge Pages (THP)

Linux normally uses 4096-byte pages. To let the system manage large amounts of memory, we can either increase the number of page table entries, or use larger pages. We focus on the second method, which is implemented in Linux by using huge pages.

A huge page is 2MB long. One of the ways to utilize this feature is through the Transparent Huge Pages mechanism. While there are other ways to get huge pages, they are outside of our scope.

The kernel will attempt to satisfy relevant memory allocations using huge pages. THP are swappable and “breakable” (i.e. can be split into normal 4096-byte pages), and can be used in anonymous, shmem and tmpfs mappings (the latter two are true only in newer kernel versions).

Usually (depending on the compilation flags and the machine configuration) the default THP support is for anonymous mappings only. Shmem and tmpfs support can be turned on manually, and in general THP support can be turned on and off while the system is running by writing to special kernel files in sysfs.
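As an illustration of checking this configuration at runtime, here is a small sketch that reads one of the THP sysfs files and extracts the active value, which the kernel prints in square brackets (for example `always [madvise] never`). The paths `/sys/kernel/mm/transparent_hugepage/enabled` and `/sys/kernel/mm/transparent_hugepage/shmem_enabled` are the standard locations; the helper names `thp_active_value` and `thp_read_setting` are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (currently active) value out of a line such as
 * "always [madvise] never". Returns 0 on success, -1 if no bracketed
 * value is found or it does not fit into out. */
static int thp_active_value(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (!open || !close)
        return -1;

    size_t len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the first line of a THP sysfs file and extract its active value. */
static int thp_read_setting(const char *path, char *out, size_t outlen)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? thp_active_value(line, out, outlen) : -1;
}
```

Calling `thp_read_setting("/sys/kernel/mm/transparent_hugepage/enabled", buf, sizeof buf)` on a machine with THP compiled in should fill `buf` with the current mode; on a kernel without THP the file is absent and the helper returns -1.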

An important optimization opportunity is to coalesce normal pages into huge pages. A special daemon called khugepaged scans constantly for possible candidate pages that could be merged into huge pages. Obviously, to be a candidate, a VMA must cover a whole, aligned 2MB memory range.
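The alignment requirement is simple arithmetic. The sketch below checks whether an address range contains at least one whole, naturally aligned 2MB block; it is only a model of that single condition (the real khugepaged applies more criteria), and the macro and function names are assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE      4096ULL
#define TOY_HUGE_PAGE_SIZE (2ULL * 1024 * 1024)

/* True if [start, end) contains a whole, 2MB-aligned 2MB block. */
static bool range_has_aligned_huge_page(uint64_t start, uint64_t end)
{
    assert(start <= end);
    /* Round start up to the next 2MB boundary. */
    uint64_t first = (start + TOY_HUGE_PAGE_SIZE - 1) & ~(TOY_HUGE_PAGE_SIZE - 1);
    if (first < start)  /* the rounding wrapped around */
        return false;
    return end >= first && end - first >= TOY_HUGE_PAGE_SIZE;
}
```

Note that one huge page replaces 512 normal pages, since 2MB / 4096 = 512.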

THP is implemented by turning on the _PAGE_PSE bit of the PMD (Page Middle Directory, one level above the PTE level). The PMD thus points to a 2MB physical page, instead of a directory of PTEs. Each time the page tables are scanned, the PMDs must be checked with the pmd_trans_huge function, so we can decide whether the PMD points to a pfn or a directory of PTEs. On some architectures, huge PUDs (Page Upper Directory) exist as well, resulting in 1GB pages.
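A simplified model of that check, assuming x86-64 with 4096-byte base pages: the PSE bit is bit 7 of the entry, and it decides whether the low 21 bits or the low 12 bits of the virtual address are the offset inside the mapped page. The `toy_*` names are hypothetical, and the real pmd_trans_huge looks at more than this one bit on some kernels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_PSE  (1ULL << 7)  /* bit 7 is _PAGE_PSE on x86 */
#define TOY_PTE_SHIFT 12           /* 4096-byte pages */
#define TOY_PMD_SHIFT 21           /* 2MB pages */

/* Simplified stand-in for pmd_trans_huge. */
static bool toy_pmd_trans_huge(uint64_t pmd)
{
    return (pmd & TOY_PAGE_PSE) != 0;
}

/* Offset of vaddr inside the page that the walk ends at. */
static uint64_t toy_offset_in_page(uint64_t pmd, uint64_t vaddr)
{
    unsigned shift = toy_pmd_trans_huge(pmd) ? TOY_PMD_SHIFT : TOY_PTE_SHIFT;
    return vaddr & ((1ULL << shift) - 1);
}
```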

THP has been supported since kernel 2.6.38. On most Android devices the THP subsystem is not enabled.

The bug 🐞

Delving into the Dirty COW patch code that deals with THP, we can see that the same logic of can_follow_write_pte was applied to huge PMDs. A matching function called can_follow_write_pmd was added:

However, in the huge PMD case, a page can be marked dirty without going through a COW cycle, using the touch_pmd function:

This function is reached by follow_page_mask, which will be called each time get_user_pages tries to get a huge page. Obviously, the comment is incorrect and nowadays the dirty bit is NOT meaningless. In particular — when using get_user_pages to read a huge page, that page will be marked dirty without going through a COW cycle, and can_follow_write_pmd’s logic is now broken.

At this point, exploiting the bug is straightforward: we can use a pattern similar to the original Dirty COW race. This time, after we get rid of the copied version of the page, we have to fault the original page twice: first to make it present, and then to turn on the dirty bit.
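The sequence can be walked through with a toy state machine. This is a userspace sketch, not kernel code and not our POC: the `toy_*` names, the flag values and the strictly sequential ordering of the steps are assumptions for illustration (a real attack has to win these steps as races between threads). The `fixed` switch models a touch function that sets the dirty bit only for write requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values are arbitrary here; only the names mirror the kernel's. */
#define FOLL_WRITE 0x1u
#define FOLL_FORCE 0x2u
#define FOLL_COW   0x4u

enum toy_page { PAGE_NONE, PAGE_HUGE_ZERO, PAGE_COW_COPY };

struct toy_pmd {
    enum toy_page page;
    bool write;
    bool dirty;
};

static bool toy_can_follow_write_pmd(const struct toy_pmd *pmd, unsigned flags)
{
    return pmd->write ||
           ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd->dirty);
}

/* fixed == false: every access marks the entry dirty (the behaviour described
 * in the text). fixed == true: only write requests mark it dirty. */
static void toy_touch_pmd(struct toy_pmd *pmd, unsigned flags, bool fixed)
{
    if (!fixed || (flags & FOLL_WRITE))
        pmd->dirty = true;
}

/* Lookup of a huge page, touching it on success. */
static enum toy_page toy_follow_huge(struct toy_pmd *pmd, unsigned flags, bool fixed)
{
    assert(pmd != NULL);
    if (pmd->page == PAGE_NONE)
        return PAGE_NONE;
    if ((flags & FOLL_WRITE) && !toy_can_follow_write_pmd(pmd, flags))
        return PAGE_NONE;
    toy_touch_pmd(pmd, flags, fixed);
    return pmd->page;
}

/* Returns the page that the pending write request finally obtains. */
static enum toy_page toy_huge_dirty_cow(bool fixed)
{
    unsigned wflags = FOLL_WRITE | FOLL_FORCE;

    /* 1. The write request faults, COW is broken, FOLL_COW is remembered. */
    struct toy_pmd pmd = { PAGE_COW_COPY, false, true };
    wflags |= FOLL_COW;
    /* 2. A racing madvise(MADV_DONTNEED) discards the copy. */
    pmd = (struct toy_pmd){ PAGE_NONE, false, false };
    /* 3. First fault: a plain read maps the huge zero page, clean. */
    pmd = (struct toy_pmd){ PAGE_HUGE_ZERO, false, false };
    /* 4. Second fault: a read-only lookup touches the entry. */
    (void)toy_follow_huge(&pmd, 0, fixed);
    /* 5. The write request retries its lookup. */
    return toy_follow_huge(&pmd, wflags, fixed);
}
```

In this model `toy_huge_dirty_cow(false)` returns the huge zero page for a write request, while `toy_huge_dirty_cow(true)` is refused at step 5 and would have to go through another COW cycle.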

Now comes the inevitable question — how bad is this?

Bug implications

In order to exploit the bug, we have to choose an interesting read-only huge page as a target for the writing. The only constraint is that we need to be able to fetch it after it’s discarded with madvise(MADV_DONTNEED). Anonymous huge pages that were inherited from a parent process after a fork are a valuable target, however once they are discarded they are lost for good — we can’t fetch them again.

We found two interesting targets that should not be written into:

The huge zero page
Sealed (read-only) huge pages

The zero page

When issuing a read fault on an anonymous mapping before it was ever written, we get a special physical page called the zero page. This optimization prevents the system from having to allocate multiple zeroed out pages in the system, which might never be written to. Thus, the exact same zero page is mapped in many different processes, which have different security levels.

The same principle applies to huge pages as well — there’s no need to create another huge page if no write fault has occurred yet — a special page called the huge zero page will be mapped, instead. Note that this feature can be turned off as well.

THP, shmem and sealed files

shmem and tmpfs files can be mapped using THP as well. shmem files can be created using the memfd_create syscall, or by mmapping anonymous shared mappings. tmpfs files can be created using the mount point of the tmpfs (usually /dev/shm). Both can be mapped with huge pages, depending on the system configuration.

shmem files can be sealed: sealing a file restricts the set of operations allowed on the file in question. This mechanism allows processes that don’t trust each other to communicate via shared memory without having to take extra measures to deal with unexpected manipulations of the shared memory region (see man memfd_create for more info). Three types of seals are relevant here:

F_SEAL_SHRINK: file size cannot be reduced
F_SEAL_GROW: file size cannot be increased
F_SEAL_WRITE: file content cannot be modified

These seals can be added to the shmem file using the fcntl syscall.
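For readers unfamiliar with sealing, the following sketch creates a memfd, seals it, and checks that an ordinary write is then rejected. It assumes Linux with a glibc that provides the memfd_create wrapper; the function name `sealed_memfd_rejects_write` and the file name "sealed-demo" are made up for this example. It shows only the intended behaviour of seals, not the bug.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if the kernel rejects a write to a write-sealed memfd,
 * -1 if any step fails or the write unexpectedly succeeds. */
static int sealed_memfd_rejects_write(void)
{
    int result = -1;
    int seals;
    int fd = memfd_create("sealed-demo", MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    /* Put some content in while the file is still writable. */
    if (write(fd, "data", 4) != 4)
        goto out;

    /* Freeze both the size and the content. */
    if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0)
        goto out;

    seals = fcntl(fd, F_GET_SEALS);
    if (seals < 0 || !(seals & F_SEAL_WRITE))
        goto out;

    /* A regular write must now fail with EPERM. */
    errno = 0;
    if (pwrite(fd, "X", 1, 0) == -1 && errno == EPERM)
        result = 0;
out:
    close(fd);
    return result;
}
```

A process that receives such a descriptor can verify the seals with F_GET_SEALS and then rely on the content staying unchanged, which is the guarantee a write through a read-only huge page mapping would break.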

POC

Our POC demonstrates overwriting the huge zero page. Overwriting shmem should be equally possible and would lead to an alternative exploit path.

Note that after the first write page-fault to the zero page, it will be replaced with a new fresh (and zeroed) THP. Using this primitive, we successfully crashed several processes. A likely consequence of overwriting the huge zero page is having improper initial values inside large BSS sections. A common vulnerable pattern would be using the zero value as an indicator that a global variable hasn’t been initialized yet.
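The pattern can be sketched in a few lines. This is a generic illustration with hypothetical names (`lazy_state`, `lazy_init`, `lazy_state_backing`), not code taken from any of the crashed programs: the object trusts a zero flag to mean "not initialized yet", so if its backing memory starts out as 0x41 bytes instead of zeros, initialization is skipped and a garbage pointer is left in place.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct lazy_state {
    unsigned char initialized;  /* 0 means "not set up yet" */
    const char *name;           /* meaningful only once initialized != 0 */
};

/* Model the initial contents of the backing memory: 0 for a healthy
 * zero page, anything else for a corrupted one. */
static void lazy_state_backing(struct lazy_state *s, int byte)
{
    memset(s, byte, sizeof *s);
}

/* Returns true if this call performed the initialization. */
static bool lazy_init(struct lazy_state *s)
{
    assert(s != NULL);
    if (s->initialized)
        return false;           /* trusts the flag and skips the setup */
    s->name = "default";
    s->initialized = 1;
    return true;
}
```

With zeroed backing memory `lazy_init` runs the setup once; with memory filled with 0x41 it returns false immediately, and any later use of `name` dereferences attacker-controlled bytes.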

The following crash example demonstrates that pattern. In this example, the JS Helper thread of Firefox makes a NULL-deref, probably because the boolean pointed to by %rdx erroneously says the object was initialized:

Thread 10 "JS Helper" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe2aee700 (LWP 14775)]
0x00007ffff13233d3 in ?? () from /opt/firefox/libxul.so
(gdb) i r
rax            0x7fffba7ef080 140736322269312
rbx            0x0 0
rcx            0x22 34
rdx            0x7fffba7ef080 140736322269312
rsi            0x400000000 17179869184
rdi            0x7fffe2aede10 140736996498960
rbp            0x0 0x0
rsp            0x7fffe2aede10 0x7fffe2aede10
r8             0x20000 131072
r9             0x7fffba900000 140736323387392
r10            0x7fffba700000 140736321290240
r11            0x7fffe2aede50 140736996499024
r12            0x1 1
r13            0x7fffba7ef090 140736322269328
r14            0x2 2
r15            0x7fffe2aee700 140736996501248
rip            0x7ffff13233d3 0x7ffff13233d3
eflags         0x10246 [ PF ZF IF RF ]
cs             0x33 51
ss             0x2b 43
ds             0x0 0
es             0x0 0
fs             0x0 0
gs             0x0 0
(gdb) x/10i $pc-0x10
   0x7ffff13233c3: mov    %rax,0x10(%rsp)
   0x7ffff13233c8: mov    0x8(%rdx),%rbx
   0x7ffff13233cc: mov    %rbx,%rbp
   0x7ffff13233cf: and    $0xfffffffffffffffe,%rbp
=> 0x7ffff13233d3: mov    0x0(%rbp),%eax
   0x7ffff13233d6: and    $0x28,%eax
   0x7ffff13233d9: cmp    $0x28,%eax
   0x7ffff13233dc: je     0x7ffff1323440
   0x7ffff13233de: mov    %rbx,%r13
   0x7ffff13233e1: and    $0xfffffffffff00000,%r13
(gdb) x/10w $rdx
0x7fffba7ef080: 0x41414141 0x00000000 0x00000000 0x00000000
0x7fffba7ef090: 0xeef93bba 0x00000000 0xda95dd80 0x00007fff
0x7fffba7ef0a0: 0x778513f1 0x00000000

This is another crash example — gdb crashes while loading the symbols for a Firefox debugging session:

(gdb) r
Starting program: /opt/firefox/firefox 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x0000555555825487 in eq_demangled_name_entry (a=0x4141414141414141, b=<optimized out>) at symtab.c:697
697   return strcmp (da->mangled, db->mangled) == 0;
(gdb) i s
#0  0x0000555555825487 in eq_demangled_name_entry (a=0x4141414141414141, b=<optimized out>) at symtab.c:697
#1  0x0000555555955203 in htab_find_slot_with_hash (htab=0x555557008e60, element=element@entry=0x7fffffffdb00, hash=4181413748, insert=insert@entry=INSERT) at ./hashtab.c:659
#2  0x0000555555955386 in htab_find_slot (htab=<optimized out>, element=element@entry=0x7fffffffdb00, insert=insert@entry=INSERT) at ./hashtab.c:703
#3  0x00005555558273e5 in symbol_set_names (gsymbol=gsymbol@entry=0x5555595b3778, linkage_name=linkage_name@entry=0x7ffff2ac5254 "_ZN7mozilla3dom16HTMLTableElement11CreateTHeadEv", len=len@entry=48, 
    copy_name=copy_name@entry=0, objfile=<optimized out>) at symtab.c:818
#4  0x00005555557d186f in minimal_symbol_reader::record_full (this=0x7fffffffdce0, this@entry=0x1768bd6, name=<optimized out>, 
    name@entry=0x7ffff2ac5254 "_ZN7mozilla3dom16HTMLTableElement11CreateTHeadEv", name_len=<optimized out>, copy_name=copy_name@entry=48, address=24546262, ms_type=ms_type@entry=mst_file_text, 
    section=13) at minsyms.c:1010
#5  0x00005555556959ec in record_minimal_symbol (reader=..., name=name@entry=0x7ffff2ac5254 "_ZN7mozilla3dom16HTMLTableElement11CreateTHeadEv", name_len=<optimized out>, copy_name=copy_name@entry=false, 
    address=<optimized out>, address@entry=24546262, ms_type=ms_type@entry=mst_file_text, bfd_section=<optimized out>, objfile=0x555557077860) at elfread.c:209
#6  0x0000555555696ac6 in elf_symtab_read (reader=..., objfile=objfile@entry=0x555557077860, type=type@entry=0, number_of_symbols=number_of_symbols@entry=365691, 
    symbol_table=symbol_table@entry=0x7ffff6a6d020, copy_names=copy_names@entry=false) at elfread.c:462
#7  0x00005555556970c4 in elf_read_minimal_symbols (symfile_flags=<optimized out>, ei=0x7fffffffdcd0, objfile=0x555557077860) at elfread.c:1084
#8  elf_symfile_read (objfile=0x555557077860, symfile_flags=...) at elfread.c:1194
#9  0x000055555581f559 in read_symbols (objfile=objfile@entry=0x555557077860, add_flags=...) at symfile.c:861
#10 0x000055555581f00b in syms_from_objfile_1 (add_flags=..., addrs=0x555557101b00, objfile=0x555557077860) at symfile.c:1062
#11 syms_from_objfile (add_flags=..., addrs=0x555557101b00, objfile=0x555557077860) at symfile.c:1078
#12 symbol_file_add_with_addrs (abfd=<optimized out>, name=name@entry=0x55555738c1d0 "/opt/firefox/libxul.so", add_flags=..., addrs=addrs@entry=0x555557101b00, flags=..., parent=parent@entry=0x0)
    at symfile.c:1177
#13 0x000055555581f63d in symbol_file_add_from_bfd (abfd=<optimized out>, name=name@entry=0x55555738c1d0 "/opt/firefox/libxul.so", add_flags=..., addrs=addrs@entry=0x555557101b00, flags=..., 
    parent=parent@entry=0x0) at symfile.c:1268
#14 0x000055555580b256 in solib_read_symbols (so=so@entry=0x55555738bfc0, flags=...) at solib.c:712
#15 0x000055555580be9b in solib_add (pattern=pattern@entry=0x0, from_tty=from_tty@entry=0, readsyms=1) at solib.c:1016
#16 0x000055555580c678 in handle_solib_event () at solib.c:1301
#17 0x00005555556f9db4 in bpstat_stop_status (aspace=0x555555ff5670, bp_addr=bp_addr@entry=140737351961185, ptid=..., ws=ws@entry=0x7fffffffe1d0) at breakpoint.c:5712
#18 0x00005555557ad1ef in handle_signal_stop (ecs=0x7fffffffe1b0) at infrun.c:5963
#19 0x00005555557aec8a in handle_inferior_event_1 (ecs=0x7fffffffe1b0) at infrun.c:5392
#20 handle_inferior_event (ecs=ecs@entry=0x7fffffffe1b0) at infrun.c:5427
#21 0x00005555557afd57 in fetch_inferior_event (client_data=<optimized out>) at infrun.c:3932
#22 0x000055555576ade5 in gdb_wait_for_event (block=block@entry=0) at event-loop.c:859
#23 0x000055555576aef7 in gdb_do_one_event () at event-loop.c:322
#24 0x000055555576b095 in gdb_do_one_event () at ./common/common-exceptions.h:221
#25 start_event_loop () at event-loop.c:371
#26 0x00005555557c3938 in captured_command_loop (data=data@entry=0x0) at main.c:325
#27 0x000055555576d243 in catch_errors (func=func@entry=0x5555557c3910 <captured_command_loop(void*)>, func_args=func_args@entry=0x0, errstring=errstring@entry=0x555555a035da "", 
    mask=mask@entry=RETURN_MASK_ALL) at exceptions.c:236
#28 0x00005555557c49ae in captured_main (data=<optimized out>) at main.c:1150
#29 gdb_main (args=<optimized out>) at main.c:1160
#30 0x00005555555ed628 in main (argc=<optimized out>, argv=<optimized out>) at gdb.c:32
(gdb) list
692   const struct demangled_name_entry *da
693     = (const struct demangled_name_entry *) a;
694   const struct demangled_name_entry *db
695     = (const struct demangled_name_entry *) b;
696 
697   return strcmp (da->mangled, db->mangled) == 0;
698 }
699 
700 /* Create the hash table used for demangled names.  Each hash entry is
701    a pair of strings; one for the mangled name and one for the demangled
(gdb)

Link to our POC

Summary

This bug demonstrates the importance of patch auditing in the security development life-cycle. As the Dirty COW case and other past cases show, even hyped vulnerabilities may get incomplete patches. The situation is not reserved for closed source software only; open source software suffers just as much.

Feel free to comment with any question or idea about the issue ☺

Disclosure timeline

The initial report was sent to the kernel and distros mailing lists on 22.11.17. The response was immediate and professional, with a patch ready within a few days. The patch fixes the touch_pmd function so that it sets the dirty bit of the PMD entry only when the caller asks for write access.

Thanks to the Security team and the distros for their time and effort in maintaining a high standard of security.

22.11.17: Initial report to security@kernel.org and linux-distros@vs.openwall.org
22.11.17: CVE-2017-1000405 was assigned
27.11.17: Patch was committed to mainline kernel
29.11.17: Public announcement