Xu Ji's Blog | On Disabling Transparent Hugepages
徐霁 (Xu Ji)

On Disabling Transparent Hugepages

https://blog.nelhage.com/post/transparent-hugepages/


“Transparent Hugepages” is a Linux kernel feature intended to improve performance by making more efficient use of your processor’s memory-mapping hardware. It is enabled (“enabled=always”) by default in most Linux distributions.

Transparent Hugepages gives some applications a small performance improvement (~ 10% at best, 0-3% more typically), but can cause significant performance problems, or even apparent memory leaks at worst.

To avoid these problems, you should set enabled=madvise on your servers by running

echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

and setting transparent_hugepage=madvise on your kernel command line (e.g. in /etc/default/grub).

This change will allow applications that are optimized for transparent hugepages to obtain the performance benefits, and prevent the associated problems otherwise.
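To confirm which mode is active after making the change, you can read the same sysfs file back. The kernel lists all modes and marks the active one with square brackets, e.g. `always [madvise] never`. The following is a minimal sketch, assuming Python 3 on Linux; the helper names `parse_thp_mode` and `current_thp_mode` are made up for illustration.

```python
import re
from pathlib import Path

# Path taken from the command above.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the bracketed (active) mode from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path=THP_ENABLED):
    """Read the active mode, or None if the file cannot be read
    (for example on a kernel built without THP support)."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(current_thp_mode())
```

On a server configured as recommended, this should print `madvise`.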

Read on for more details.

What are transparent hugepages? 

What are hugepages? 

For decades now, processors and operating systems have collaborated to use virtual memory to provide a layer of indirection between memory as seen by applications (the “virtual address space”), and the underlying physical memory of the hardware. This indirection protects applications from each other, and enables a whole host of powerful features.

x86 processors, like many others, implement virtual memory by a page table scheme that stores the mapping as a large table in memory. Traditionally, on x86 processors, each table entry controls the mapping of a single 4KB “page” of memory.

While these page tables are themselves stored in memory, the processor caches a subset of the page table entries in a cache on the processor itself, called the TLB. A look through the output of cpuid(1) on my laptop reveals that its lowest-level TLB contains 64 entries for 4KB data pages. 64*4KB is only a quarter-megabyte, much smaller than the working memory of most useful applications in 2017. This size mismatch means that applications accessing large amounts of memory may regularly “miss” the TLB, requiring expensive fetches from main memory just to locate their data in memory.

Primarily in an effort to improve TLB efficiency, therefore, x86 (and other) processors have long supported creating “huge pages”, in which a single page-table entry maps a larger segment of address space to physical memory. Depending on how the OS configures it, most recent chips can map 2MB, 4MB, or even 1GB pages. Using large pages means more data fits into the TLB, which means better performance for certain workloads.
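The arithmetic behind "more data fits into the TLB" can be sketched as follows. This is only a back-of-the-envelope illustration: it assumes the same 64 entries are available for 2MB pages, which is not generally true (real processors often have a different number of TLB entries per page size), and `tlb_reach` is a name invented for this example.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


# 64 entries of 4KB pages: the quarter-megabyte figure from the text.
small_reach = tlb_reach(64, 4 * KIB)
# Hypothetical: the same 64 entries, each mapping a 2MB hugepage.
huge_reach = tlb_reach(64, 2 * MIB)

print(small_reach // KIB, "KiB")  # 256 KiB
print(huge_reach // MIB, "MiB")   # 128 MiB
```

Under that assumption, the same number of entries covers 512 times more memory.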

What are transparent hugepages? 

The existence of multiple flavors of page table management means that the operating system needs to determine how to map address space to physical memory. Because application memory management interfaces (like mmap(2)) have historically been based on the smallest 4KB pages, the kernel must always support mapping data in 4KB increments. The simplest and most flexible (in terms of supported memory layouts) solution, therefore, is to just always use 4KB pages, and not benefit from hugepages for application memory mappings. And for a long time this has been the strategy adopted by the general-purpose memory management code in the kernel.

For applications (such as certain databases or scientific computing programs) that are known to require large amounts of memory and be performance-sensitive, the kernel introduced the hugetlbfs feature, which allows administrators to explicitly configure certain applications to use hugepages.

Transparent Hugepages (“THP” for short), as the name suggests, is intended to bring hugepage support automatically to applications, without requiring custom configuration. Transparent hugepage support works by scanning memory mappings in the background (via the “khugepaged” kernel thread), attempting to find or create (by moving memory around) contiguous 2MB ranges of 4KB mappings that can be replaced with a single hugepage.

What goes wrong? 

When transparent hugepage support works well, it can garner up to about a 10% performance improvement on certain benchmarks. However, it also comes with at least two serious failure modes:

Memory Leaks 

THP attempts to create 2MB mappings. However, it’s overly greedy in doing so, and too unwilling to break them back up if necessary. If an application maps a large range but only touches the first few bytes, it would traditionally consume only a single 4KB page of physical memory. With THP enabled, khugepaged can come along and extend that 4KB page into a 2MB page, effectively bloating memory usage by 512x (an example reproducer on this bug report actually demonstrates the 512x worst case!).

This behavior isn’t hypothetical; Go’s GC had to include an explicit workaround for it, and Digital Ocean documented their woes with Redis, THP, and the jemalloc allocator.
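If you suspect this kind of bloat, one rough starting point is the `AnonHugePages` line in /proc/meminfo, which reports how much anonymous memory is currently backed by transparent hugepages. The sketch below parses that field and also spells out where the 512x figure comes from; the function names are illustrative, and the parsing assumes the usual `Name:   value kB` layout of /proc/meminfo.

```python
def anon_hugepages_kb(meminfo_text):
    """Return the AnonHugePages value in kB, or None if the field is absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("AnonHugePages:"):
            return int(line.split()[1])
    return None


def worst_case_bloat(huge_page=2 * 1024 * 1024, base_page=4 * 1024):
    """Growth factor when one touched 4KB page is promoted to a 2MB page."""
    return huge_page // base_page


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print("AnonHugePages (kB):", anon_hugepages_kb(f.read()))
    print("worst-case bloat factor:", worst_case_bloat())
```

A large `AnonHugePages` value is not by itself a problem, but comparing it before and after switching to madvise can show how much memory THP was responsible for.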

Pauses and CPU usage 

In steady-state usage by applications with fairly static memory allocation, the work done by khugepaged is minimal. However, on certain workloads that involve aggressive memory remapping or short-lived processes, khugepaged can end up doing huge amounts of work to merge and/or split memory regions, which ends up being entirely short-lived and useless. This manifests as excessive CPU usage, and can also manifest as long pauses, as the kernel is forced to break up a 2MB page back into 4KB pages before performing what would otherwise have been a fast operation on a single page.

Several applications have seen 30% performance degradations or worse with THP enabled, for these reasons.
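One way to get a feel for how much merging and splitting work the kernel is doing is to watch the THP-related counters in /proc/vmstat, whose names start with `thp_`. The following is a sketch only: it collects whatever `thp_`-prefixed counters the running kernel exposes (the exact set varies by kernel version) and computes the change between two snapshots. The helper names are hypothetical.

```python
def thp_counters(vmstat_text):
    """Collect every 'thp_*' counter from /proc/vmstat-style text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("thp_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Per-counter change between two snapshots taken some time apart."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    import time

    with open("/proc/vmstat") as f:
        first = thp_counters(f.read())
    time.sleep(10)  # sampling interval chosen arbitrarily for illustration
    with open("/proc/vmstat") as f:
        second = thp_counters(f.read())
    for name, delta in sorted(counter_deltas(first, second).items()):
        print(name, delta)
```

Counters that climb quickly on a workload with short-lived processes or frequent remapping suggest that THP work is being done and then thrown away.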

So what now? 

The THP authors were aware of the potential downsides of transparent hugepages (although, with hindsight, we might argue that they underestimated them). They therefore opted to make the behavior configurable via the /sys/kernel/mm/transparent_hugepage/enabled sysfs file.

Even more importantly, they implemented an “opt-in” mode for transparent hugepages. With the madvise setting in /sys/kernel/mm/transparent_hugepage/enabled, khugepaged will leave memory alone by default, but applications can use the madvise system call to specifically request THP behavior for selected ranges of memory.
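From the application side, opting in looks roughly like the sketch below. It assumes Linux and Python 3.8 or later, where the standard `mmap` module exposes `madvise()` and the `MADV_HUGEPAGE` constant; the function name `map_with_thp_hint` and the 4MB size are made up for illustration. The hint is only a request: the kernel may reject it (for example when THP is not compiled in), so the sketch treats failure as non-fatal.

```python
import mmap


def map_with_thp_hint(length=4 * 1024 * 1024):
    """Create a private anonymous mapping and ask for hugepage backing.

    Returns (region, hinted). hinted is False when the hint is not
    available on this platform or the kernel rejected it.
    """
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = False
    if hasattr(mmap, "MADV_HUGEPAGE") and hasattr(region, "madvise"):
        try:
            # With no start/length arguments the advice covers the whole mapping.
            region.madvise(mmap.MADV_HUGEPAGE)
            hinted = True
        except OSError:
            pass  # fall back to ordinary 4KB pages
    return region, hinted


if __name__ == "__main__":
    region, hinted = map_with_thp_hint()
    region[0:5] = b"hello"  # touch the mapping so pages are actually allocated
    print("hugepage hint accepted:", hinted)
    region.close()
```

With enabled=madvise, only regions marked this way are candidates for hugepages; everything else in the process keeps using 4KB pages.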

Since, for the most part, only a few specialized applications receive substantial benefits from hugepage support, this option gives us the best of both worlds. The authors of those applications can opt in using madvise, and the rest of us can remain free from the undesirable side-effects of transparent hugepages.

Thus, I recommend that all users set their transparent hugepage setting to madvise, as described in the tl;dr section at the top. I also hope to persuade the major distributions to disable them by default, to save many more administrators, operators, and developers from having to rediscover these failure modes for themselves.

Postscript 

This post is also available in translation into Russian: http://howtorecover.me/otklucenie-prozracnyh-vizitnyh-kartocek


Writing takes effort; if you repost this, please credit Xu Ji's Blog, "On Disabling Transparent Hugepages".
