On Tue, Oct 01, 2019 at 08:47:07AM -0400, Alex Pilon wrote: > Richard suggested I just post this here in advance of the meeting. Mix > of merge commits from Linus and the commits proper. […] More of these > on my work laptop. Many of them somewhat relevant at work too. Attached.
Plenty of more detailed commits in there should you git log v5.3.. --merges --author='Linus Torvalds', like some sched, perf, and RCU ones. I only highlighted the funny or otherwise interesting ones. commit 110ea1d833ad291272d52e0a40a06157a3d9ba17 Author: Alexander Schremmer <alex [ at ] alexanderweb [ dot ] de> Date: Thu Aug 22 13:48:33 2019 +0200 platform/x86: thinkpad_acpi: Add ThinkPad PrivacyGuard This feature is found optionally in T480s, T490, T490s. The feature is called lcdshadow and visible via /proc/acpi/ibm/lcdshadow. The ACPI methods \_SB.PCI0.LPCB.EC.HKEY.{GSSS,SSSS,TSSS,CSSS} are available in these machines. They get, set, toggle or change the state apparently. The patch was tested on a 5.0 series kernel on a T480s. commit e86c2c8b9380440bbe761b8e2f63ab6b04a45ac2 Author: Brendan Shanks <bshanks [ at ] codeweavers [ dot ] com> Date: Thu Sep 5 16:22:21 2019 -0700 x86/umip: Add emulation (spoofing) for UMIP covered instructions in 64-bit processes as well Add emulation (spoofing) of the SGDT, SIDT, and SMSW instructions for 64-bit processes. Wine users have encountered a number of 64-bit Windows games that use these instructions (particularly SGDT), and were crashing when run on UMIP-enabled systems. 
commit e0d60a1e68a3fbf42cdf3546004e00230d9048ba Merge: 22331f895298 6365b842aae4 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 16 19:06:29 2019 -0700 Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 entry updates from Ingo Molnar: "This contains x32 and compat syscall improvements, the biggest one of which splits x32 syscalls into their own table, which allows new syscalls to share the x32 and x86-64 number - which turns the 512-547 special syscall numbers range into a legacy wart that won't be extended going forward" commit f240652b6032b48ad7fa35c5e701cc4c8d697c0b Author: Dave Hansen <dave [ dot ] hansen [ at ] linux [ dot ] intel [ dot ] com> Date: Fri Jul 5 10:53:21 2019 -0700 x86/mpx: Remove MPX APIs MPX is being removed from the kernel due to a lack of support in the toolchain going forward (gcc). The first step is to remove the userspace-visible ABIs so that applications will stop using it. The most visible one are the enable/disable prctl()s. Remove them first. This is the most minimal and least invasive change needed to ensure that apps stop using MPX with new kernels. commit 7e67a859997aad47727aff9c5a32e160da079ce3 Merge: 772c1d06bd40 563c4f85f9f0 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 16 17:25:49 2019 -0700 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: […] - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though. […] - Improve load-balancing on AMD EPYC systems. 
commit eb92692b2544d3f415887dbbc98499843dfe568b Author: Quentin Perret <quentin [ dot ] perret [ at ] arm [ dot ] com> Date: Thu Sep 12 11:44:04 2019 +0200 sched/fair: Speed-up energy-aware wake-ups EAS computes the energy impact of migrating a waking task when deciding on which CPU it should run. However, the current approach is known to have a high algorithmic complexity, which can result in prohibitively high wake-up latencies on systems with complex energy models, such as systems with per-CPU DVFS. On such systems, the algorithm complexity is in O(n^2) (ignoring the cost of searching for performance states in the EM) with 'n' the number of CPUs. To address this, re-factor the EAS wake-up path to compute the energy 'delta' (with and without the task) on a per-performance domain basis, rather than system-wide, which brings the complexity down to O(n). No functional changes intended. Test results ~~~~~~~~~~~~ * Setup: Tested on a Google Pixel 3, with a Snapdragon 845 (4+4 CPUs, A55/A75). Base kernel is 5.3-rc5 + Pixel3 specific patches. Android userspace, no graphics. * Test case: Run a periodic rt-app task, with 16ms period, ramping down from 70% to 10%, in 5% steps of 500 ms each (json avail. at [1]). Frequencies of all CPUs are pinned to max (using scaling_min_freq CPUFreq sysfs entries) to reduce variability. The time to run select_task_rq_fair() is measured using the function profiler (/sys/kernel/debug/tracing/trace_stat/function*). See the test script for more details [2]. Test 1: I hacked the DT to 'fake' per-CPU DVFS. That is, we end up with one CPUFreq policy per CPU (8 policies in total). Since all frequencies are pinned to max for the test, this should have no impact on the actual frequency selection, but it does in the EAS calculation. 
      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 274 | 38.303   | 1750.239 | 401 | 14.126 (-63.1%) | 146.625  |
|  1  | 197 | 49.529   | 1695.852 | 314 | 16.135 (-67.4%) | 167.525  |
|  2  | 142 | 34.296   | 1758.665 | 302 | 14.133 (-58.8%) | 130.071  |
|  3  | 172 | 31.734   | 1490.975 | 641 | 14.637 (-53.9%) | 139.189  |
|  4  | 316 | 7.834    | 178.217  | 425 | 5.413 (-30.9%)  | 20.803   |
|  5  | 447 | 8.424    | 144.638  | 556 | 5.929 (-29.6%)  | 27.301   |
|  6  | 581 | 14.886   | 346.793  | 456 | 5.711 (-61.6%)  | 23.124   |
|  7  | 456 | 10.005   | 211.187  | 997 | 4.708 (-52.9%)  | 21.144   |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

Test 2: I also ran the same test with a normal DT, with 2 CPUFreq policies, to see if this causes regressions in the most common case.
      +---------------------------+----------------------------------+
      | Without patch             | With patch                       |
+-----+-----+----------+----------+-----+-----------------+----------+
| CPU | Hit | Avg (us) | s^2 (us) | Hit | Avg (us)        | s^2 (us) |
|-----+-----+----------+----------+-----+-----------------+----------+
|  0  | 345 | 22.184   | 215.321  | 580 | 18.635 (-16.0%) | 146.892  |
|  1  | 358 | 18.597   | 200.596  | 438 | 12.934 (-30.5%) | 104.604  |
|  2  | 359 | 25.566   | 200.217  | 397 | 10.826 (-57.7%) | 74.021   |
|  3  | 362 | 16.881   | 200.291  | 718 | 11.455 (-32.1%) | 102.280  |
|  4  | 457 | 3.822    | 9.895    | 757 | 4.616 (+20.8%)  | 13.369   |
|  5  | 344 | 4.301    | 7.121    | 594 | 5.320 (+23.7%)  | 18.798   |
|  6  | 472 | 4.326    | 7.849    | 464 | 5.648 (+30.6%)  | 22.022   |
|  7  | 331 | 4.630    | 13.937   | 408 | 5.299 (+14.4%)  | 18.273   |
+-----+-----+----------+----------+-----+-----------------+----------+
* Hit, Avg and s^2 are as reported by the function profiler

In addition to these two tests, I also ran 50 iterations of the Lisa EAS functional test suite [3] with this patch applied on Arm Juno r0, Arm Juno r2, Arm TC2 and Hikey960, and could not see any regressions (all EAS functional tests are passing).
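To make the complexity argument in the sched/fair commit above concrete, here is a toy sketch, not the kernel code. It assumes integer per-CPU utilisation values grouped into performance domains and a made-up cost function `energy_of()`. All names are hypothetical. The first picker re-evaluates every domain for every candidate placement. The second only compares each domain's own "with task" versus "without task" delta. Because the system-wide total is a constant plus that delta, both pick the same CPU.

```python
# Toy model only: integer "utilisation" per CPU, grouped into performance
# domains. energy_of() is a made-up cost function, not the kernel's energy model.

def energy_of(domain):
    """Hypothetical domain energy: the busiest CPU sets the cost for the domain."""
    return max(domain) * sum(domain)


def place(domain, task):
    """Return (copy of the domain with the task on its least-loaded CPU, that CPU)."""
    cpu = domain.index(min(domain))
    placed = list(domain)
    placed[cpu] += task
    return placed, cpu


def pick_system_wide(domains, task):
    """Re-evaluate the whole system per candidate: D * D domain evaluations."""
    best, evals = None, 0
    for d, domain in enumerate(domains):
        placed, cpu = place(domain, task)
        total = 0
        for other_idx, other in enumerate(domains):
            total += energy_of(placed if other_idx == d else other)
            evals += 1
        if best is None or total < best[0]:
            best = (total, d, cpu)
    return best[1:], evals


def pick_per_domain(domains, task):
    """Compare only each domain's own delta: 2 * D domain evaluations."""
    best, evals = None, 0
    for d, domain in enumerate(domains):
        placed, cpu = place(domain, task)
        delta = energy_of(placed) - energy_of(domain)
        evals += 2
        if best is None or delta < best[0]:
            best = (delta, d, cpu)
    return best[1:], evals
```

With eight single-CPU domains (the "fake per-CPU DVFS" shape of Test 1), the first picker performs 64 domain evaluations and the second 16, and both return the same (domain, CPU) pair.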
[1] https://paste.debian.net/1100055/ [2] https://paste.debian.net/1100057/ [3] https://github.com/ARM-software/lisa/blob/master/lisa/tests/scheduler/eas_behaviour.py Signed-off-by: Quentin Perret <quentin [ dot ] perret [ at ] arm [ dot ] com> Cc: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Cc: Peter Zijlstra <peterz [ at ] infradead [ dot ] org> Cc: Thomas Gleixner <tglx [ at ] linutronix [ dot ] de> Cc: dietmar [ dot ] eggemann [ at ] arm [ dot ] com Cc: juri [ dot ] lelli [ at ] redhat [ dot ] com Cc: morten [ dot ] rasmussen [ at ] arm [ dot ] com Cc: qais [ dot ] yousef [ at ] arm [ dot ] com Cc: qperret [ at ] qperret [ dot ] net Cc: rjw [ at ] rjwysocki [ dot ] net Cc: tkjos [ at ] google [ dot ] com Cc: valentin [ dot ] schneider [ at ] arm [ dot ] com Cc: vincent [ dot ] guittot [ at ] linaro [ dot ] org Link: https://lkml.kernel.org/r/20190912094404 [ dot ] 13802-1-qperret [ at ] qperret [ dot ] net Signed-off-by: Ingo Molnar <mingo [ at ] kernel [ dot ] org> End of an era. commit cf07cb1ff4ea008abf06c95878c700cf1dd65c3e Author: Christoph Hellwig <hch [ at ] lst [ dot ] de> Date: Tue Aug 13 09:25:01 2019 +0200 ia64: remove support for the SGI SN2 platform The SGI SN2 (early Altix) is a very non-standard IA64 platform that was at the very high end of even IA64 hardware, and has been discontinued a long time ago. Remove it because there no upstream users left, and it has magic hooks all over the kernel. commit e77fafe9afb53b7f4d8176c5cd5c10c43a905bc8 Merge: 52a5525214d0 e376897f424a Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 16 14:31:40 2019 -0700 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "Although there isn't tonnes of code in terms of line count, there are a fair few headline features which I've noted both in the tag and also in the merge commits when I pulled everything together. 
The part I'm most pleased with is that we had 35 contributors this time around, which feels like a big jump from the usual small group of core arm64 arch developers. Hopefully they all enjoyed it so much that they'll continue to contribute, but we'll see. It's probably worth highlighting that we've pulled in a branch from the risc-v folks which moves our CPU topology code out to where it can be shared with others. Summary: - 52-bit virtual addressing in the kernel - New ABI to allow tagged user pointers to be dereferenced by syscalls - Early RNG seeding by the bootloader […] - Fix TLB invalidation in light of recent architectural clarifications […] - Relaxation of implicit I/O memory barriers - Build with RELR relocations when toolchain supports them - Numerous cleanups and non-critical fixes" commit c17112a5c413f20188da276c138484e7127cdc84 Merge: 4d856f72c10e 821cc7b0b205 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 16 09:28:19 2019 -0700 Merge tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd/waitid updates from Christian Brauner: "This contains two features and various tests. First, it adds support for waiting on process through pidfds by adding the P_PIDFD type to the waitid() syscall. This completes the basic functionality of the pidfd api (cf. [1]). In the meantime we also have a new addition to the userspace projects that make use of the pidfd api. The qt project was nice enough to send a mail pointing out that they have a pr up to switch to the pidfd api (cf. [2]). Second, this tag contains an extension to the waitid() syscall to make it possible to wait on the current process group in a race free manner (even though the actual problem is very unlikely) by specifying 0 together with the P_PGID type. This extension traces back to a discussion on the glibc development mailing list. There are also a range of tests for the features above.
Additionally, the test-suite which detected the pidfd-polling race we fixed in [3] is included in this tag" [1] https://lwn.net/Articles/794707/ [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456 [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state") * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: waitid: Add support for waiting for the current process group tests: add pidfd poll tests tests: move common definitions and functions into pidfd.h pidfd: add pidfd_wait tests pidfd: add P_PIDFD to waitid() commit e444d51b14c4795074f485c79debd234931f0e49 Merge: c6b48dad92ae 1dce2df3ee06 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 10:50:47 2019 -0700 Merge tag 'tty-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Even in this age, people are still making new serial port silicon, why... commit e6874fc29410fabfdbc8c12b467f41a16cbcfd2b Merge: e444d51b14c4 3fb73eddba10 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 11:05:34 2019 -0700 Merge tag 'staging-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO driver updates from Greg KH: "Here is the big staging/iio driver update for 5.4-rc1. Lots of churn here, with a few driver/filesystems moving out of staging finally: - erofs moved out of staging - greybus core code moved out of staging Along with that, a new filesystem has been added: - exfat to provide support for those devices requiring that filesystem (i.e.
transfer devices to/from windows systems or printers) commit c6b48dad92aedaa9bdc013ee495cb5b1bbdf1f11 Merge: 1f7d290a7275 fb9617edf6c0 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 10:33:46 2019 -0700 Merge tag 'usb-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB updates from Greg KH: "Here is the big set of USB patches for 5.4-rc1. Two major chunks of code are moving out of the tree and into the staging directory, uwb and wusb (wireless USB support), because there are no devices that actually use this protocol anymore, and what we have today probably doesn't work at all given that the maintainers left many many years ago. So move it to staging where it will be removed in a few releases if no one screams. commit e6bc9de714972cac34daa1dc1567ee48a47a9342 Merge: b6c0d3577246 dc617f29dbe5 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 17:35:20 2019 -0700 Merge tag 'vfs-5.4-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull swap access updates from Darrick Wong: "Prohibit writing to active swap files and swap partitions. There's no non-malicious use case for allowing userspace to scribble on storage that the kernel thinks it owns" commit f60c55a94e1d127186566f06294f2dadd966e9b4 Merge: 734d1ed83e1f 95ae251fe828 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 16:59:14 2019 -0700 Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fs-verity support from Eric Biggers: "fs-verity is a filesystem feature that provides Merkle tree based hashing (similar to dm-verity) for individual readonly files, mainly for the purpose of efficient authenticity verification. This pull request includes: (a) The fs/verity/ support layer and documentation. (b) fs-verity support for ext4 and f2fs. 
Compared to the original fs-verity patchset from last year, the UAPI to enable fs-verity on a file has been greatly simplified. Lots of other things were cleaned up too. fs-verity is planned to be used by two different projects on Android; most of the userspace code is in place already. Another userspace tool ("fsverity-utils"), and xfstests, are also available. e2fsprogs and f2fs-tools already have fs-verity support. Other people have shown interest in using fs-verity too. I've tested this on ext4 and f2fs with xfstests, both the existing tests and the new fs-verity tests. This has also been in linux-next since July 30 with no reported issues except a couple minor ones I found myself and folded in fixes for. Ted and I will be co-maintaining fs-verity" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file commit 734d1ed83e1f9b7bafb650033fb87c657858cf5b Merge: d013cc800a2a 0642ea2409f3 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Sep 18 16:08:52 2019 -0700 Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level 
keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" commit 40144e49ff84c3bd6bd091b58115257670be8803 Author: Jan Kara <jack [ at ] suse [ dot ] cz> Date: Thu Aug 29 09:04:12 2019 -0700 xfs: Fix stale data exposure when readahead races with hole punch Hole punching currently evicts pages from page cache and then goes on to remove blocks from the inode. This happens under both XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL which provides appropriate serialization with racing reads or page faults. However there is currently nothing that prevents readahead triggered by fadvise() or madvise() from racing with the hole punch and instantiating page cache page after hole punching has evicted page cache in xfs_flush_unmap_range() but before it has removed blocks from the inode.
This page cache page will be mapping a soon-to-be-freed block and that can lead to returning stale data to userspace or even filesystem corruption. Fix the problem by protecting handling of readahead requests by XFS_IOLOCK_SHARED similarly as we protect reads. CC: stable [ at ] vger [ dot ] kernel [ dot ] org Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQNmxqmtA_VbYW0Su9rKRk2zobJmahcyeaEVOFKVQ5dw@mai> Reported-by: Amir Goldstein <amir73il [ at ] gmail [ dot ] com> Signed-off-by: Jan Kara <jack [ at ] suse [ dot ] cz> Reviewed-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> Signed-off-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> commit ddbca70cc45c0ac97ff6d9529e45f10b8ae73ad4 Author: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Date: Thu Aug 29 09:04:10 2019 -0700 xfs: allocate xattr buffer on demand When doing file lookups and checking for permissions, we end up in xfs_get_acl() to see if there are any ACLs on the inode. This requires an xattr lookup, and to do that we have to supply a buffer large enough to hold a maximum sized xattr. On workloads where we are accessing a wide range of cache cold files under memory pressure (e.g. NFS fileservers) we end up spending a lot of time allocating the buffer. The buffer is 64k in length, so is a contiguous multi-page allocation, and if that then fails we fall back to vmalloc(). Hence the allocation here is /expensive/ when we are looking up hundreds of thousands of files a second. Initial numbers from a bpf trace show average time in xfs_get_acl() is ~32us, with ~19us of that in the memory allocation. Note these are average times, so they are going to be affected by the worst case allocations more than the common fast case... To avoid this, we could just do a "null" lookup to see if the ACL xattr exists and then only do the allocation if it exists. This, however, optimises the path for the "no ACL present" case at the expense of the "acl present" case. i.e.
we can halve the time in xfs_get_acl() for the no acl case (i.e. down to ~10-15us), but that then increases the ACL case by 30% (i.e. up to 40-45us). To solve this and speed up both cases, drive the xattr buffer allocation into the attribute code once we know what the actual xattr length is. For the no-xattr case, we avoid the allocation completely, speeding up that case. For the common ACL case, we'll end up with a fast heap allocation (because it'll be smaller than a page), and only for the rarer "we have a remote xattr" will we have a multi-page allocation occur. Hence the common ACL case will be much faster, too. Signed-off-by: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Reviewed-by: Christoph Hellwig <hch [ at ] lst [ dot ] de> Reviewed-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> Signed-off-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> commit 756c6f0f7efe8759ff6dda35d220e2e753e2b0e3 Author: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Date: Thu Aug 29 09:04:08 2019 -0700 xfs: reverse search directory freespace indexes When a directory is growing rapidly, new blocks tend to get added at the end of the directory. These end up at the end of the freespace index, and when the directory gets large finding these new freespaces gets expensive. The code does a linear search across the freespace index from the first block in the directory to the last, hence meaning the newly added space is the last index searched. Instead, do a reverse order index search, starting from the last block and index in the freespace index. This makes most lookups for free space on rapidly growing directories O(1) instead of O(N), but should not have any impact on random insert workloads because the average search length is the same regardless of which end of the array we start at.
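As a rough illustration of the forward versus reverse scan described above (a sketch under simplifying assumptions, not the XFS code): the freespace index is modelled as a plain list holding each block's largest free region in bytes, and the hypothetical helpers below return the matching block together with the number of entries probed.

```python
def find_forward(free_bytes, needed):
    """Scan from the first block; return (block index or None, entries probed)."""
    for probes, idx in enumerate(range(len(free_bytes)), start=1):
        if free_bytes[idx] >= needed:
            return idx, probes
    return None, len(free_bytes)


def find_reverse(free_bytes, needed):
    """Scan from the last block; return (block index or None, entries probed)."""
    for probes, idx in enumerate(range(len(free_bytes) - 1, -1, -1), start=1):
        if free_bytes[idx] >= needed:
            return idx, probes
    return None, len(free_bytes)


# A directory that only ever grew: every block is full except the newest one.
grown = [0] * 9999 + [4000]
print(find_forward(grown, 64))   # probes all 10000 entries
print(find_reverse(grown, 64))   # probes a single entry
```

When several blocks could satisfy the request the two scans may return different blocks. That is expected and matches the commit's remark that random insert workloads see the same average search length from either end.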
The result is a major improvement in large directory grow rates:

               create time(sec) / rate (files/s)
File count     vanilla          Prev commit      Patched
10k            0.41 / 24.3k     0.42 / 23.8k     0.41 / 24.3k
20k            0.74 / 27.0k     0.76 / 26.3k     0.75 / 26.7k
100k           3.81 / 26.4k     3.47 / 28.8k     3.27 / 30.6k
200k           8.58 / 23.3k     7.19 / 27.8k     6.71 / 29.8k
1M            85.69 / 11.7k    48.53 / 20.6k    37.67 / 26.5k
2M           280.31 /  7.1k   130.14 / 15.3k    79.55 / 25.2k
10M         3913.26 /  2.5k                    552.89 / 18.1k

Signed-off-by: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Reviewed-by: Christoph Hellwig <hch [ at ] lst [ dot ] de> Reviewed-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> Signed-off-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> commit 610125ab1e4b1b48dcffe74d9d82b0606bf1b923 Author: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Date: Thu Aug 29 09:04:07 2019 -0700 xfs: speed up directory bestfree block scanning When running a "create millions inodes in a directory" test recently, I noticed we were spending a huge amount of time converting freespace block headers from disk format to in-memory format: […] commit f8f9ee479439c1be9e33c4404912a2a112c46200 Author: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Date: Mon Aug 26 12:08:39 2019 -0700 xfs: add kmem_alloc_io() Memory we use to submit for IO needs strict alignment to the underlying driver constraints. Worst case, this is 512 bytes. Given that all allocations for IO are always a power of 2 multiple of 512 bytes, the kernel heap provides natural alignment for objects of these sizes and that suffices. Until, of course, memory debugging of some kind is turned on (e.g. red zones, poisoning, KASAN) and then the alignment of the heap objects is thrown out the window. Then we get weird IO errors and data corruption problems because drivers don't validate alignment and do the wrong thing when passed unaligned memory buffers in bios.
To fix this, introduce kmem_alloc_io(), which will guarantee at least 512 byte alignment of buffers for IO, even if memory debugging options are turned on. It is assumed that the minimum allocation size will be 512 bytes, and that sizes will be power of 2 multiples of 512 bytes. Use this everywhere we allocate buffers for IO. This no longer fails with log recovery errors when KASAN is enabled due to the brd driver not handling unaligned memory buffers: # mkfs.xfs -f /dev/ram0 ; mount /dev/ram0 /mnt/test Signed-off-by: Dave Chinner <dchinner [ at ] redhat [ dot ] com> Reviewed-by: Christoph Hellwig <hch [ at ] lst [ dot ] de> Reviewed-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> Signed-off-by: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com> commit e2079e93f562c7f7a030eb7642017ee5eabaaa10 Author: Nathan Chancellor <natechancellor [ at ] gmail [ dot ] com> Date: Mon Aug 26 17:41:55 2019 -0700 kbuild: Do not enable -Wimplicit-fallthrough for clang for now This functionally reverts commit bfd77145f35c ("Makefile: Convert -Wimplicit-fallthrough=3 to just -Wimplicit-fallthrough for clang"). clang enabled support for -Wimplicit-fallthrough in C in r369414 [1], which causes a lot of warnings when building the kernel for two reasons: 1. Clang does not support the /* fall through */ comments. There seems to be a general consensus in the LLVM community that this is not something they want to support. Joe Perches wrote a script to convert all of the comments to a "fallthrough" keyword that will be added to compiler_attributes.h [2] [3], which catches the vast majority of the comments. There doesn't appear to be any consensus in the kernel community when to do this conversion. 2. Clang and GCC disagree about falling through to final case statements with no content or cases that simply break: https://godbolt.org/z/c8csDu This difference contributes at least 50 warnings in an allyesconfig build for x86, not considering other architectures.
This difference will need to be discussed to see which compiler is right [4] [5]. [1]: https://github.com/llvm/llvm-project/commit/1e0affb6e564b7361b0aadb38805f26deff4ecfc [2]: https://lore.kernel.org/lkml/61ddbb86d5e68a15e24ccb06d9b399bbf5ce2da7 [ dot ] camel [ at ] perches [ dot ] com/ [3]: https://lore.kernel.org/lkml/1d2830aadbe9d8151728a7df5b88528fc72a0095 [ dot ] 1564549413 [ dot ] git [ dot ] joe [ at ] perches [ dot ] com/ [4]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432 [5]: https://github.com/ClangBuiltLinux/linux/issues/636 Given these two problems need discussion and coordination, do not enable -Wimplicit-fallthrough with clang right now. Add a comment to explain what is going on as well. This commit should be reverted once these two issues are fully flushed out and resolved. Suggested-by: Masahiro Yamada <yamada [ dot ] masahiro [ at ] socionext [ dot ] com> Signed-off-by: Nathan Chancellor <natechancellor [ at ] gmail [ dot ] com> Acked-by: Miguel Ojeda <miguel [ dot ] ojeda [ dot ] sandonis [ at ] gmail [ dot ] com> Acked-by: Nick Desaulniers <ndesaulniers [ at ] google [ dot ] com> Acked-by: Gustavo A. R. Silva <gustavo [ at ] embeddedor [ dot ] com> Signed-off-by: Masahiro Yamada <yamada [ dot ] masahiro [ at ] socionext [ dot ] com> commit aec256d0ecd561036f188dbc8fa7924c47a9edfd Author: Joao Moreno <mail [ at ] joaomoreno [ dot ] com> Date: Tue Sep 3 16:46:32 2019 +0200 HID: apple: Fix stuck function keys when using FN This fixes an issue in which key down events for function keys would be repeatedly emitted even after the user has raised the physical key. For example, the driver fails to emit the F5 key up event when going through the following steps: - fnmode=1: hold FN, hold F5, release FN, release F5 - fnmode=2: hold F5, hold FN, release F5, release FN The repeated F5 key down events can be easily verified using xev. 
commit 1b5fb415442eb3ec946d48afe8c87b0f2fd42d7c Merge: 5825a95fe925 21ab8580b383 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 23 11:39:56 2019 -0700 Merge tag 'safesetid-bugfix-5.4' of git://github.com/micah-morton/linux Pull SafeSetID fix from Micah Morton: "Jann Horn sent some patches to fix some bugs in SafeSetID for 5.3. After he had done his testing there were a couple small code tweaks that went in and caused this bug. From what I can see SafeSetID is broken in 5.3 and crashes the kernel every time during initialization if you try to use it. I came across this bug when backporting Jann's changes for 5.3 to older kernels (4.14 and 4.19). I've tested on a Chrome OS device with those kernels and verified that this change fixes things. It doesn't seem super useful to have this bake in linux-next, since it is completely broken in 5.3 and nobody noticed" * tag 'safesetid-bugfix-5.4' of git://github.com/micah-morton/linux: LSM: SafeSetID: Stop releasing uninitialized ruleset Ouch. Harsh. Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Mon Sep 23 11:21:04 2019 -0700 Merge tag 'selinux-pr-20190917' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add LSM hooks, and SELinux access control hooks, for dnotify, fanotify, and inotify watches. This has been discussed with both the LSM and fs/notify folks and everybody is good with these new hooks. […] - Improve our network object labeling cache so that we always return the object's label, even when under memory pressure. Previously we would return an error if we couldn't allocate a new cache entry, now we always return the label even if we can't create a new cache entry for it. 
[…] commit 99cb0dbd47a15d395bf3faa78dc122bc5efe3fc0 Author: Song Liu <songliubraving [ at ] fb [ dot ] com> Date: Mon Sep 23 15:38:00 2019 -0700 mm,thp: add read-only THP support for (non-shmem) FS This patch is (hopefully) the first step to enable THP for non-shmem filesystems. This patch enables an application to put part of its text sections to THP via madvise, for example: madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE); We tried to reuse the logic for THP on tmpfs. Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections. The next patch will handle writes, which would only happen when the all the vmas with VM_DENYWRITE are unmapped. An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature. commit 1c6c15971e4709953f75082a5d44212536b1c2b7 Author: Hillf Danton <hdanton [ at ] sina [ dot ] com> Date: Mon Sep 23 15:37:26 2019 -0700 mm, reclaim: make should_continue_reclaim perform dryrun detection Patch series "address hugetlb page allocation stalls", v2. Allocation of hugetlb pages via sysctl or procfs can stall for minutes or hours. A simple example on a two node system with 8GB of memory is as follows: echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages echo 4096 > /proc/sys/vm/nr_hugepages Obviously, both allocation attempts will fall short of their 8GB goal. However, one or both of these commands may stall and not be interruptible. The issues were initially discussed in mail thread [1] and RFC code at [2]. This series addresses the issues causing the stalls. There are two distinct fixes, a cleanup, and an optimization. The reclaim patch by Hillf and compaction patch by Vlasitmil address corner cases in their respective areas. 
hugetlb page allocation could stall due to either of these issues. Vlasitmil added a cleanup patch after Hillf's modifications. The hugetlb patch by Mike is an optimization suggested during the debug and development process. [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07 [ at ] oracle [ dot ] com [2] http://lkml.kernel.org/r/20190724175014 [ dot ] 9935-1-mike [ dot ] kravetz [ at ] oracle [ dot ] com This patch (of 4): Address the issue of should_continue_reclaim returning true too often for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was observed during hugetlb page allocation causing stalls for minutes or hours. We can stop reclaiming pages if compaction reports it can make a progress. There might be side-effects for other high-order allocations that would potentially benefit from reclaiming more before compaction so that they would be faster and less likely to stall. However, the consequences of premature/over-reclaim are considered worse. We can also bail out of reclaiming pages if we know that there are not enough inactive lru pages left to satisfy the costly allocation. We can give up reclaiming pages too if we see dryrun occur, with the certainty of plenty of inactive pages. IOW with dryrun detected, we are sure we have reclaimed as many pages as we could. commit 70cb6d2677905121bfc7fdf5babfd8444218edd9 Author: Edward Chron <echron [ at ] arista [ dot ] com> Date: Mon Sep 23 15:37:11 2019 -0700 mm/oom: add oom_score_adj and pgtables to Killed process message For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. 
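For anyone who wants to grep for this at work: a minimal sketch of pulling the two new fields out of a "Killed process" line. The sample line is the one quoted in this commit message; the regex and the name parse_killed_line are my own, and the exact field layout may differ between kernel versions, so treat it as an assumption rather than a stable format.

```python
import re

# Hypothetical pattern, written against the sample line in the commit message.
# Note: a process name containing ")" would confuse this simple pattern.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?pgtables:(?P<pgtables_kb>\d+)kB"
    r"\s+oom_score_adj:(?P<oom_score_adj>-?\d+)"
)


def parse_killed_line(line):
    """Return the interesting fields of an OOM 'Killed process' line, or None."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "pgtables_kb": int(m["pgtables_kb"]),
        "oom_score_adj": int(m["oom_score_adj"]),
    }


SAMPLE = (
    "Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 "
    "(systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, "
    "shmem-rss:0kB pgtables:22kB oom_score_adj:1000"
)

info = parse_killed_line(SAMPLE)
print(info)
```

An oom_score_adj of 1000 on something like systemd-udevd is exactly the kind of thing you would want to alert on.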
Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157 [ dot ] 1569-1-echron [ at ] arista [ dot ] com Signed-off-by: Edward Chron <echron [ at ] arista [ dot ] com> Acked-by: Michal Hocko <mhocko [ at ] suse [ dot ] com> Acked-by: David Rientjes <rientjes [ at ] google [ dot ] com> Signed-off-by: Andrew Morton <akpm [ at ] linux-foundation [ dot ] org> Signed-off-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> commit 1ba6fc9af35bf97c84567d9b3eeb26629d1e3af0 Author: Johannes Weiner <hannes [ at ] cmpxchg [ dot ] org> Date: Mon Sep 23 15:35:01 2019 -0700 mm: vmscan: do not share cgroup iteration between reclaimers One of our services observed a high rate of cgroup OOM kills in the presence of large amounts of clean cache. Debugging showed that the culprit is the shared cgroup iteration in page reclaim. Under high allocation concurrency, multiple threads enter reclaim at the same time. 
Fearing overreclaim when we first switched from the single global LRU to cgrouped LRU lists, we introduced a shared iteration state for reclaim invocations - whether 1 or 20 reclaimers are active concurrently, we only walk the cgroup tree once: the 1st reclaimer reclaims the first cgroup, the second the second one etc. With more reclaimers than cgroups, we start another walk from the top. This sounded reasonable at the time, but the problem is that reclaim concurrency doesn't scale with allocation concurrency. As reclaim concurrency increases, the amount of memory individual reclaimers get to scan gets smaller and smaller. Individual reclaimers may only see one cgroup per cycle, and that may not have much reclaimable memory. We see individual reclaimers declare OOM when there is plenty of reclaimable memory available in cgroups they didn't visit. This patch does away with the shared iterator, and every reclaimer is allowed to scan the full cgroup tree and see all of reclaimable memory, just like it would on a non-cgrouped system. This way, when OOM is declared, we know that the reclaimer actually had a chance. To still maintain fairness in reclaim pressure, disallow cgroup reclaim from bailing out of the tree walk early. Kswapd and regular direct reclaim already don't bail, so it's not clear why limit reclaim would have to, especially since it only walks subtrees to begin with. This change completely eliminates the OOM kills on our service, while showing no signs of overreclaim - no increased scan rates, %sys time, or abrupt free memory spikes. I tested across 100 machines that have 64G of RAM and host about 300 cgroups each. [ It's possible overreclaim never was a *practical* issue to begin with - it was simply a concern we had on the mailing lists at the time, with no real data to back it up. But we have also added more bail-out conditions deeper inside reclaim (e.g. the proportional exit in shrink_node_memcg) since. 
Regardless, now we have data that suggests full walks are more reliable and scale just fine. ] Link: http://lkml.kernel.org/r/20190812192316 [ dot ] 13615-1-hannes [ at ] cmpxchg [ dot ] org Signed-off-by: Johannes Weiner <hannes [ at ] cmpxchg [ dot ] org> Reviewed-by: Roman Gushchin <guro [ at ] fb [ dot ] com> Acked-by: Michal Hocko <mhocko [ at ] suse [ dot ] com> Cc: Vladimir Davydov <vdavydov [ dot ] dev [ at ] gmail [ dot ] com> Signed-off-by: Andrew Morton <akpm [ at ] linux-foundation [ dot ] org> Signed-off-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Some bfq commits that didn't claim to be a big deal and that I cannot evaluate the impact of. Interesting… commit 0183eb8bb59d45f26ec4fc73aaa416067fe6c0be Author: Jean Delvare <jdelvare [ at ] suse [ dot ] de> Date: Fri Aug 2 14:55:26 2019 +0200 i2c: piix4: Add ACPI support Enable the i2c-piix4 SMBus controller driver to enumerate I2C slave devices using ACPI. It builds on the related I2C mux device work in commit 8eb5c87a92c0 ("i2c: add ACPI support for I2C mux ports") In the i2c-piix4 driver the adapters are enumerated as: Main SMBus adapter Port 0, Port 2, ..., aux port (i.e., ASF adapter) commit 97f9a3c4eee55b0178b518ae7114a6a53372913d (HEAD -> master, origin/master) Merge: 1eb80d6ffb17 dc925a36060e Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sun Sep 29 19:52:52 2019 -0700 Merge tag 'char-misc-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull Documentation/process update from Greg KH: "Here are two small Documentation/process/embargoed-hardware-issues.rst file updates that missed my previous char/misc pull request. 
The first one adds an Intel representative for the process, and the second one cleans up the text a bit more when it comes to how the disclosure rules work, as it was a bit confusing to some companies" * tag 'char-misc-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: Documentation/process: Clarify disclosure rules Documentation/process: Volunteer as the ambassador for Intel commit dc925a36060e8cef050a9d05c64dae1c30dc5027 Author: Thomas Gleixner <tglx [ at ] linutronix [ dot ] de> Date: Wed Sep 25 10:29:49 2019 +0200 Documentation/process: Clarify disclosure rules The role of the contact list provided by the disclosing party and how it affects the disclosure process and the ability to include experts into the development process is not really well explained. Neither is it entirely clear when the disclosing party will be informed about the fact that a developer who is not covered by an employer NDA needs to be brought in and disclosed. Explain the role of the contact list and the information policy along with an eventual conflict resolution better. Reported-by: Dave Hansen <dave [ dot ] hansen [ at ] linux [ dot ] intel [ dot ] com> Signed-off-by: Thomas Gleixner <tglx [ at ] linutronix [ dot ] de> Acked-by: Dave Hansen <dave [ dot ] hansen [ at ] linux [ dot ] intel [ dot ] com> Link: https://lore.kernel.org/r/alpine [ dot ] DEB [ dot ] 2 [ dot ] 21 [ dot ] 1909251028390 [ dot ] 10825 [ at ] nanos [ dot ] tec [ dot ] linutronix [ dot ] de Signed-off-by: Greg Kroah-Hartman <gregkh [ at ] linuxfoundation [ dot ] org> diff --git a/Documentation/process/embargoed-hardware-issues.rst b/Documentation/process/embargoed-hardware-issues.rst index e57b9f39c69f..a3c3349046c4 100644 --- a/Documentation/process/embargoed-hardware-issues.rst +++ b/Documentation/process/embargoed-hardware-issues.rst @@ -143,6 +143,20 @@ via their employer, they cannot enter individual non-disclosure agreements in their role as Linux kernel developers. 
They will, however, agree to adhere to this documented process and the Memorandum of Understanding. +The disclosing party should provide a list of contacts for all other +entities who have already been, or should be, informed about the issue. +This serves several purposes: + + - The list of disclosed entities allows communication accross the + industry, e.g. other OS vendors, HW vendors, etc. + + - The disclosed entities can be contacted to name experts who should + participate in the mitigation development. + + - If an expert which is required to handle an issue is employed by an + listed entity or member of an listed entity, then the response teams can + request the disclosure of that expert from that entity. This ensures + that the expert is also part of the entity's response team. Disclosure """""""""" @@ -158,10 +172,7 @@ Mitigation development """""""""""""""""""""" The initial response team sets up an encrypted mailing-list or repurposes -an existing one if appropriate. The disclosing party should provide a list -of contacts for all other parties who have already been, or should be, -informed about the issue. The response team contacts these parties so they -can name experts who should be subscribed to the mailing-list. +an existing one if appropriate. Using a mailing-list is close to the normal Linux development process and has been successfully used in developing mitigations for various hardware @@ -175,9 +186,24 @@ development branch against the mainline kernel and backport branches for stable kernel versions as necessary. The initial response team will identify further experts from the Linux -kernel developer community as needed and inform the disclosing party about -their participation. Bringing in experts can happen at any time of the -development process and often needs to be handled in a timely manner. +kernel developer community as needed. 
Bringing in experts can happen at any +time of the development process and needs to be handled in a timely manner. + +If an expert is employed by or member of an entity on the disclosure list +provided by the disclosing party, then participation will be requested from +the relevant entity. + +If not, then the disclosing party will be informed about the experts +participation. The experts are covered by the Memorandum of Understanding +and the disclosing party is requested to acknowledge the participation. In +case that the disclosing party has a compelling reason to object, then this +objection has to be raised within five work days and resolved with the +incident team immediately. If the disclosing party does not react within +five work days this is taken as silent acknowledgement. + +After acknowledgement or resolution of an objection the expert is disclosed +by the incident team and brought into the development process. + Coordinated release """"""""""""""""""" commit 3f2dc2798b81531fd93a3b9b7c39da47ec689e55 Merge: a3c0e7b1fe1f 02f03c4206c1 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sun Sep 29 19:25:39 2019 -0700 Merge branch 'entropy' Merge active entropy generation updates. This is admittedly partly "for discussion". We need to have a way forward for the boot time deadlocks where user space ends up waiting for more entropy, but no entropy is forthcoming because the system is entirely idle just waiting for something to happen. While this was triggered by what is arguably a user space bug with GDM/gnome-session asking for secure randomness during early boot, when they didn't even need any such truly secure thing, the issue ends up being that our "getrandom()" interface is prone to that kind of confusion, because people don't think very hard about whether they want to block for sufficient amounts of entropy. 
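To make the "think about whether you want to block" point concrete, here is a rough userspace sketch that asks for random bytes without blocking and lets the caller decide what to do when the pool is not ready. It uses Python's os.getrandom wrapper with GRND_NONBLOCK; the helper name try_random_bytes is made up for illustration.

```python
import os


def try_random_bytes(n):
    """Return n random bytes, or None if they cannot be had without blocking."""
    getrandom = getattr(os, "getrandom", None)
    if getrandom is None:
        # Not Linux, or a Python build without the wrapper.
        return None
    try:
        # GRND_NONBLOCK: fail instead of sleeping until the pool is initialised.
        return getrandom(n, os.GRND_NONBLOCK)
    except OSError:
        # BlockingIOError (pool not ready) is an OSError subclass; other
        # failures such as an unsupported syscall are folded in here too.
        return None


token = try_random_bytes(16)
if token is None:
    print("no entropy yet; caller decides whether to wait or do something else")
else:
    print(token.hex())
```

The design point is only that the decision to wait is explicit in the caller, instead of a hidden hang during early boot.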
The approach here-in is to decide to not just passively wait for entropy to happen, but to start actively collecting it if it is missing. This is not necessarily always possible, but if the architecture has a CPU cycle counter, there is a fair amount of noise in the exact timings of reasonably complex loads. We may end up tweaking the load and the entropy estimates, but this should be at least a reasonable starting point. As part of this, we also revert the revert of the ext4 IO pattern improvement that ended up triggering the reported lack of external entropy. * getrandom() active entropy waiting: Revert "Revert "ext4: make __ext4_get_inode_loc plug"" random: try to actively add entropy rather than passively wait for it commit 02f03c4206c1b2a7451d3b3546f86c9c783eac13 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sun Sep 29 17:59:23 2019 -0700 Revert "Revert "ext4: make __ext4_get_inode_loc plug"" This reverts commit 72dbcf72156641fde4d8ea401e977341bfd35a05. Instead of waiting forever for entropy that may just not happen, we now try to actively generate entropy when required, and are thus hopefully avoiding the problem that caused the nice ext4 IO pattern fix to be reverted. So revert the revert. Cc: Ahmed S. Darwish <darwish [ dot ] 07 [ at ] gmail [ dot ] com> Cc: Ted Ts'o <tytso [ at ] mit [ dot ] edu> Cc: Willy Tarreau <w [ at ] 1wt [ dot ] eu> Cc: Alexander E. 
Patrakov <patrakov [ at ] gmail [ dot ] com> Signed-off-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> commit 50ee7529ec4500c88f8664560770a7a1b65db72b Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sat Sep 28 16:53:52 2019 -0700 random: try to actively add entropy rather than passively wait for it For 5.3 we had to revert a nice ext4 IO pattern improvement, because it caused a bootup regression due to lack of entropy at bootup together with arguably broken user space that was asking for secure random numbers when it really didn't need to. See commit 72dbcf721566 (Revert "ext4: make __ext4_get_inode_loc plug"). This aims to solve the issue by actively generating entropy noise using the CPU cycle counter when waiting for the random number generator to initialize. This only works when you have a high-frequency time stamp counter available, but that's the case on all modern x86 CPU's, and on most other modern CPU's too. What we do is to generate jitter entropy from the CPU cycle counter under a somewhat complex load: calling the scheduler while also guaranteeing a certain amount of timing noise by also triggering a timer. I'm sure we can tweak this, and that people will want to look at other alternatives, but there's been a number of papers written on jitter entropy, and this should really be fairly conservative by crediting one bit of entropy for every timer-induced jump in the cycle counter. Not because the timer itself would be all that unpredictable, but because the interaction between the timer and the loop is going to be. Even if (and perhaps particularly if) the timer actually happens on another CPU, the cacheline interaction between the loop that reads the cycle counter and the timer itself firing is going to add perturbations to the cycle counter values that get mixed into the entropy pool. 
As Thomas pointed out, with a modern out-of-order CPU, even quite simple loops show a fair amount of hard-to-predict timing variability even in the absense of external interrupts. But this tries to take that further by actually having a fairly complex interaction. This is not going to solve the entropy issue for architectures that have no CPU cycle counter, but it's not clear how (and if) that is solvable, and the hardware in question is largely starting to be irrelevant. And by doing this we can at least avoid some of the even more contentious approaches (like making the entropy waiting time out in order to avoid the possibly unbounded waiting). Cc: Ahmed Darwish <darwish [ dot ] 07 [ at ] gmail [ dot ] com> Cc: Thomas Gleixner <tglx [ at ] linutronix [ dot ] de> Cc: Theodore Ts'o <tytso [ at ] mit [ dot ] edu> Cc: Nicholas Mc Guire <hofrat [ at ] opentech [ dot ] at> Cc: Andy Lutomirski <luto [ at ] kernel [ dot ] org> Cc: Kees Cook <keescook [ at ] chromium [ dot ] org> Cc: Willy Tarreau <w [ at ] 1wt [ dot ] eu> Cc: Alexander E. Patrakov <patrakov [ at ] gmail [ dot ] com> Cc: Lennart Poettering <mzxreary [ at ] 0pointer [ dot ] de> Signed-off-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
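A toy userspace illustration of the jitter idea, not the kernel code: read a high-resolution counter in a loop while giving the scheduler a chance to interfere, and fold the timing deltas into a hash. The function name sample_jitter and the round count are my own, and nothing here estimates or credits entropy, so do not use it as a random number generator.

```python
import hashlib
import time


def sample_jitter(rounds=2000):
    """Mix loop timing deltas into SHA-256; return (digest, distinct_delta_count)."""
    h = hashlib.sha256()
    seen = set()
    last = time.perf_counter_ns()
    for _ in range(rounds):
        time.sleep(0)  # yield, so scheduling noise perturbs the timing
        now = time.perf_counter_ns()
        delta = now - last  # never negative: the counter is monotonic
        last = now
        seen.add(delta)
        h.update(delta.to_bytes(8, "little"))
    return h.digest(), len(seen)


digest, distinct = sample_jitter(200)
print(digest.hex(), distinct)
```

The number of distinct deltas only shows that the timings vary from run to run; how much of that variation is actually unpredictable is the hard question the commit message is hedging about.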
If only I could order commits by their message length.

Saw RGB's patches go in. Do explain if it's a fundamental or important change.

Usual Spectre:

commit 223cea6a4f0552b86fb25e3b8bbd00469816cd7a (HEAD -> master, origin/master)
Merge: 2f0f6503e375 993773d11d45
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 12:23:00 2019 -0700

    Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x86 pti updates from Thomas Gleixner:
     "The speculative paranoia departement delivers a few more plugs for
      possible (probably theoretical) spectre/mds leaks"

    * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/tls: Fix possible spectre-v1 in do_get_thread_area()
      x86/ptrace: Fix possible spectre-v1 in ptrace_get_debugreg()
      x86/speculation/mds: Eliminate leaks by trace_hardirqs_on()

Fun description:

commit 2f0f6503e37551eb8d8d5e4d27c78d28a30fed5a
Merge: 13324c42c140 e44252f4fe79
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 12:16:40 2019 -0700

    Merge branch 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x86 timer updates from Thomas Gleixner:
     "A rather large series consolidating the HPET code, which was
      triggered by the attempt to bolt HPET NMI watchdog support on to the
      existing maze with the usual duct tape and super glue approach.

      This mainly removes two separate partially redundant storage layers
      and consolidates them into a single one which provides a consistent
      view of the different HPET channels and their usage and allows to
      integrate HPET NMI watchdog support (if it turns out to be feasible)
      in a non intrusive way"

The maximum time a MWAIT can halt in userspace is controlled by the kernel
and can be adjusted by the sysadmin. Spinlocks in userspace, manually? Why?
Thought this was what futex was for:

commit 13324c42c1401ad838208ee1e98f3821fce1cd86
Merge: ab2486a9ee32 049331f277fe
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 11:59:59 2019 -0700

    Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x86 CPU feature updates from Thomas Gleixner:
     "Updates for x86 CPU features:

       - Support for UMWAIT/UMONITOR, which allows to use MWAIT and MONITOR
         instructions in user space to save power e.g. in HPC workloads
         which spin wait on synchronization points.

First new one in a while?

       - Support for the new x86 vendor Zhaoxin who develops processors
         based on the VIA Centaur technology.

Bluntness:

       - The addition and late revert of the FSGSBASE support. The revert
         was required as it turned out that the code still has hard to
         diagnose issues. Yet another engineering trainwreck...

But I disabled it entirely on mine…

commit 0d37dde70655be73575d011be1bffaf0e3b16ea9
Merge: 0902d5011cfa 7f0a5e075583
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 11:42:09 2019 -0700

    Merge branch 'x86-entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x86 vsyscall updates from Thomas Gleixner:
     "Further hardening of the legacy vsyscall by providing support for
      execute only mode and switching the default to it.

      This prevents a certain class of attacks which rely on the vsyscall
      page being accessible at a fixed address in the canonical kernel
      address space"

Okay, but what does this mean for me?
commit 0902d5011cfaabd6a09326299ef77e1c8735fb89
Merge: 927ba67a63c7 f8a8fe61fec8
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 11:22:57 2019 -0700

    Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x96 apic updates from Thomas Gleixner:
     "Updates for the x86 APIC interrupt handling and APIC timer:

       - Fix a long standing issue with spurious interrupts which was
         caused by the big vector management rework a few years ago.
         Robert Hodaszi provided finally enough debug data and an
         excellent initial failure analysis which allowed to understand
         the underlying issues.

Who cares? We're all stuck on x86 anyway!

commit 927ba67a63c72ee87d655e30183d1576c3717d3e
Merge: 2a1ccd31420a 9176ab1b8480
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 11:06:29 2019 -0700

    Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull timer updates from Thomas Gleixner:
     "The timer and timekeeping departement delivers:

      Core:

       - […] This gets rid of the unnecessary different copies of the same
         code and brings all architectures on the same level of VDSO
         functionality.

Hey Ben, is this supposed to compete with AMD's Ryzen/Threadripper?

commit 222a21d29521d144f3dd7a0bc4d4020e448f0126
Merge: 8faef7125d02 eb876fbc248e
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 18:28:44 2019 -0700

    Merge branch 'x86-topology-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull x86 topology updates from Ingo Molnar:
     "Implement multi-die topology support on Intel CPUs and expose the die
      topology to user-space tooling, by Len Brown, Kan Liang and Zhang Rui.

      These changes should have no effect on the kernel's existing
      understanding of topologies, i.e. there should be no behavioral
      impact on cache, NUMA, scheduler, perf and other topologies and
      overall system performance"

Holy shit!
commit e1928328699a582a540b105e5f4c160832a7fdcb
Merge: 46f1ec23a469 9156e545765e
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Mon Jul 8 16:12:03 2019 -0700

    Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull locking updates from Ingo Molnar:
     "The main changes in this cycle are:

       - rwsem scalability improvements, phase #2, by Waiman Long, which
         are rather impressive:

           "On a 2-socket 40-core 80-thread Skylake system with 40 reader
            and writer locking threads, the min/mean/max locking operations
            done in a 5-second testing window before the patchset were:

             40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
             40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

            After the patchset, they became:

             40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
             40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

         There's a lot of changes to the locking implementation that makes
         it similar to qrwlock, including owner handoff for more fair
         locking.

Oh, wait, microbenchmark!

         Another microbenchmark shows how across the spectrum the
         improvements are:

           "With a locking microbenchmark running on 5.1 based kernel, the
            total locking rates (in kops/s) on a 2-socket Skylake system
            with equal numbers of readers and writers (mixed) before and
            after this patchset were:

            # of Threads   Before Patch   After Patch
            ------------   ------------   -----------
                  2            2,618          4,193
                  4            1,202          3,726
                  8              802          3,622
                 16              729          3,359
                 32              319          2,826
                 64              102          2,744"

         The changes are extensive and the patch-set has been through
         several iterations addressing various locking workloads. There
         might be more regressions, but unless they are pathological I
         believe we want to use this new implementation as the baseline
         going forward.

But does this matter to you guys, as programmers?
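To put the mixed reader/writer numbers in perspective, a quick bit of arithmetic on the table quoted above. The figures are copied from the merge message; the helper name speedups is mine.

```python
# (threads, before kops/s, after kops/s), copied from the quoted table
RATES = [
    (2, 2618, 4193),
    (4, 1202, 3726),
    (8, 802, 3622),
    (16, 729, 3359),
    (32, 319, 2826),
    (64, 102, 2744),
]


def speedups(rows):
    """Map thread count to the after/before throughput ratio."""
    return {threads: after / before for threads, before, after in rows}


ratios = speedups(RATES)
for threads, ratio in ratios.items():
    print(f"{threads:>2} threads: {ratio:.1f}x")
```

So roughly 1.6x at 2 threads and about 27x at 64 threads, with the usual caveat that this is a locking microbenchmark and not a real workload.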
- atomic64_t cross-arch type cleanups by Mark Rutland: over the last ~10 years of atomic64_t existence the various types used by the APIs only had to be self-consistent within each architecture - which means they became wildly inconsistent across architectures. Mark puts and end to this by reworking all the atomic64 implementations to use 's64' as the base type for atomic64_t, and to ensure that this type is consistently used for parameters and return values in the API, avoiding further problems in this area. Does this IOMMU stuff matter? commit 6b04014f3f151ed62878327813859e76e8e23d78 Merge: c6b6cebbc597 d95c3885865b Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Tue Jul 9 09:21:02 2019 -0700 Merge tag 'iommu-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Make the dma-iommu code more generic so that it can be used outside of the ARM context with other IOMMU drivers. Goal is to make use of it on x86 too. - Generic IOMMU domain support for the Intel VT-d driver. This driver now makes more use of common IOMMU code to allocate default domains for the devices it handles. - An IOMMU fault reporting API to userspace. With that the IOMMU fault handling can be done in user-space, for example to forward the faults to a VM. - Better handling for reserved regions requested by the firmware. These can be 'relaxed' now, meaning that those don't prevent a device being attached to a VM. Kernel people have higher standards: commit e9a83bd2322035ed9d7dcf35753d3f984d76c6a5 (HEAD -> master, origin/master) Merge: 7011b7e1b702 454f96f2b738 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Tue Jul 9 12:34:26 2019 -0700 Merge tag 'docs-5.3' of git://git.lwn.net/linux Pull Documentation updates from Jonathan Corbet: "It's been a relatively busy cycle for docs: - […] - A new document on how to use merges and rebases in kernel repos, and one on Spectre vulnerabilities. 
- Various improvements to the build system, including automatic markup of function() references because some people, for reasons I will never understand, were of the opinion that :c:func:``function()`` is unattractive and not fun to type. Ewwwwwwwwwwwwwwwww: commit b7d5c9239855f99762e8a547bea03a436e8a12e8 Merge: 608745f12462 8ff80fbe7e98 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Tue Jul 9 11:35:38 2019 -0700 Merge branch 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 boot updates from Thomas Gleixner: "Assorted updates to kexec/kdump: - Proper kexec support for 4/5-level paging and jumping from a 5-level to a 4-level paging kernel. You didn't do this the first time? Not impressed. commit 5450e8a316a64cddcbc15f90733ebc78aa736545 Merge: 29cd581b5949 172bb24a4f48 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Jul 10 22:17:21 2019 -0700 Merge tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull pidfd updates from Christian Brauner: "This adds two main features. - First, it adds polling support for pidfds. This allows process managers to know when a (non-parent) process dies in a race-free way. The notification mechanism used follows the same logic that is currently used when the parent of a task is notified of a child's death. With this patchset it is possible to put pidfds in an {e}poll loop and get reliable notifications for process (i.e. thread-group) exit. - The second feature compliments the first one by making it possible to retrieve pollable pidfds for processes that were not created using CLONE_PIDFD. A lot of processes get created with traditional PID-based calls such as fork() or clone() (without CLONE_PIDFD). For these processes a caller can currently not create a pollable pidfd. This is a problem for Android's low memory killer (LMK) and service managers such as systemd. 
Both patchsets are accompanied by selftests. It's perhaps worth noting that the work done so far and the work done in this branch for pidfd_open() and polling support do already see some adoption: - Android is in the process of backporting this work to all their LTS kernels [1] - Service managers make use of pidfd_send_signal but will need to wait until we enable waiting on pidfds for full adoption. - And projects I maintain make use of both pidfd_send_signal and CLONE_PIDFD [2] and will use polling support and pidfd_open() too" [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22 https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22 https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22 [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753 * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: tests: add pidfd_open() tests arch: wire-up pidfd_open() pid: add pidfd_open() pidfd: add polling selftests pidfd: add polling support commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 Author: Joel Fernandes (Google) <joel [ at ] joelfernandes [ dot ] org> Date: Tue Apr 30 12:21:53 2019 -0400 pidfd: add polling support This patch adds polling support to pidfd. Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, if a PID is reused between when LMK sends a kill signal and checks for existence of the PID, since the wrong PID is now possibly checked for existence. Using the polling support, LMK will be able to get notified when a process exists in race-free and fast way, and allows the LMK to do other things (such as by polling on other fds) while awaiting the process being killed to die. 
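A rough sketch of what waiting on a pidfd looks like from userspace, using Python's os.pidfd_open wrapper (Python 3.9+, and it needs a kernel with pidfd_open). The helper name wait_for_exit and the timeout are my own, and it returns None when pidfds are not available rather than pretending to work.

```python
import os
import select
import subprocess
import sys


def wait_for_exit(pid, timeout_ms=5000):
    """True if pid exited, False on timeout, None if pidfds are unavailable."""
    try:
        fd = os.pidfd_open(pid)
    except (AttributeError, OSError):
        # Old Python, old kernel, or a sandbox that blocks the syscall.
        return None
    try:
        poller = select.poll()
        # A pidfd becomes readable once the process (thread group) has exited.
        poller.register(fd, select.POLLIN)
        return bool(poller.poll(timeout_ms))
    finally:
        os.close(fd)


child = subprocess.Popen([sys.executable, "-c", "pass"])
exited = wait_for_exit(child.pid)
child.wait()  # still reap the child; polling does not do that for us
print(exited)
```

Because the fd refers to the process itself and not to a PID number, there is no window in which a recycled PID can be mistaken for the original process, which is the race the commit message describes with /proc/pid checks.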
For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch. We have decided to include the waitqueue in struct pid for the following reasons: 1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be option in this case because the task can be reaped and destroyed before the poll returns. 2. By including the struct pid for the waitqueue means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different. Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling. Cc: Andy Lutomirski <luto [ at ] amacapital [ dot ] net> Cc: Steven Rostedt <rostedt [ at ] goodmis [ dot ] org> Cc: Daniel Colascione <dancol [ at ] google [ dot ] com> Cc: Jann Horn <jannh [ at ] google [ dot ] com> Cc: Tim Murray <timmurray [ at ] google [ dot ] com> Cc: Jonathan Kowalski <bl0pbl33p [ at ] gmail [ dot ] com> Cc: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Cc: Al Viro <viro [ at ] zeniv [ dot ] linux [ dot ] org [ dot ] uk> Cc: Kees Cook <keescook [ at ] chromium [ dot ] org> Cc: David Howells <dhowells [ at ] redhat [ dot ] com> Cc: Oleg Nesterov <oleg [ at ] redhat [ dot ] com> Cc: kernel-team [ at ] android [ dot ] com Reviewed-by: Oleg Nesterov <oleg [ at ] redhat [ dot ] com> Co-developed-by: Daniel Colascione <dancol [ at ] google [ dot ] com> Signed-off-by: Daniel Colascione <dancol [ at ] google [ dot ] com> Signed-off-by: Joel Fernandes (Google) <joel [ at ] joelfernandes [ dot ] org> Signed-off-by: Christian Brauner <christian [ at ] brauner [ dot ] io> This feature comes too late for Ben: commit d2b6b4c832f7e3067709e8d4970b7b82b44419ac Merge: 0248a8be6d21 b78fa45d4edb 
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Jul 10 21:22:43 2019 -0700 Merge tag 'nfsd-5.3' of git://linux-nfs.org/~bfields/linux Pull nfsd updates from Bruce Fields: "Highlights: - Add a new /proc/fs/nfsd/clients/ directory which exposes some long-requested information about NFSv4 clients (like open files) and allows forced revocation of client state. It's not OCFS2 like Ben asked, but GFS is being updated. AFS is still being updated AF. Still funny, this casefold feature: commit 3ae72562ad917df36a1b1247d749240e3b4865db Author: Gabriel Krisman Bertazi <krisman [ at ] collabora [ dot ] com> Date: Wed Jun 19 23:45:09 2019 -0400 ext4: optimize case-insensitive lookups Temporarily cache a casefolded version of the file name under lookup in ext4_filename, to avoid repeatedly casefolding it. I got up to 30% speedup on lookups of large directories (>100k entries), depending on the length of the string under lookup. Signed-off-by: Gabriel Krisman Bertazi <krisman [ at ] collabora [ dot ] com> Signed-off-by: Theodore Ts'o <tytso [ at ] mit [ dot ] edu> Scary. Is my data okay!?!?!??!? commit 40f06c799539739a08a56be8a096f56aeed05731 Merge: a47f5c56b2eb fe0da9c09b2d Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Wed Jul 10 20:32:37 2019 -0700 Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull copy_file_range updates from Darrick Wong: "This fixes numerous parameter checking problems and inconsistent behaviors in the new(ish) copy_file_range system call. Now the system call will actually check its range parameters correctly; refuse to copy into files for which the caller does not have sufficient privileges; update mtime and strip setuid like file writes are supposed to do; and allows copying up to the EOF of the source file instead of failing the call like we used to. 
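For anyone who has not used the call: a minimal sketch of copy_file_range()
from userspace, assuming a glibc new enough to ship the wrapper (2.27 or
later). Both function names and the /tmp file templates are made up for
illustration. The loop deliberately asks for more bytes than the file holds
and treats a short count as normal and a return of 0 as end of file:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy from in_fd to out_fd until EOF, using each fd's current offset.
 * Returns the number of bytes copied, or -1 with errno set. */
ssize_t copy_until_eof(int in_fd, int out_fd)
{
    ssize_t total = 0;

    for (;;) {
        /* Request 1 MiB per call; the kernel may copy less. */
        ssize_t n = copy_file_range(in_fd, NULL, out_fd, NULL, 1 << 20, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)             /* nothing left to copy */
            return total;
        total += n;
    }
}

/* Hypothetical self-check: 0 = copied and verified, 1 = the call is not
 * usable in this environment, -1 = unexpected failure. */
int copy_file_range_demo(void)
{
    static const char msg[] = "hello, copy_file_range\n";
    const ssize_t len = (ssize_t)(sizeof(msg) - 1);
    char src[] = "/tmp/cfr-src-XXXXXX";
    char dst[] = "/tmp/cfr-dst-XXXXXX";
    char buf[sizeof(msg)] = { 0 };
    ssize_t copied;
    int rc = -1;

    int in_fd = mkstemp(src);
    int out_fd = mkstemp(dst);
    if (in_fd < 0 || out_fd < 0)
        goto out;
    if (write(in_fd, msg, (size_t)len) != len)
        goto out;
    if (lseek(in_fd, 0, SEEK_SET) < 0)
        goto out;

    copied = copy_until_eof(in_fd, out_fd);
    if (copied < 0) {
        if (errno == ENOSYS || errno == EXDEV ||
            errno == EOPNOTSUPP || errno == EINVAL)
            rc = 1;
        goto out;
    }
    if (copied == len &&
        pread(out_fd, buf, (size_t)len, 0) == len &&
        memcmp(buf, msg, (size_t)len) == 0)
        rc = 0;
out:
    if (in_fd >= 0) {
        close(in_fd);
        unlink(src);
    }
    if (out_fd >= 0) {
        close(out_fd);
        unlink(dst);
    }
    return rc;
}
```

Callers that care about portability usually keep a read()/write() fallback
for the "not usable here" case.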
LOL, still fixing ext2:

commit 682f7c5c465d7ac4107e51dbf2a847a026b384e8
Merge: e6983afd9254 fa33cdbf3ece
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Wed Jul 10 20:27:07 2019 -0700

    Merge tag 'for_v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

    Pull ext2, udf and quota updates from Jan Kara:

     - some ext2 fixes and cleanups

clone3:

commit 8f6ccf6159aed1f04c6d179f61f6fb2691261e84
Merge: 5450e8a316a6 d68dbb0c9ac8
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Thu Jul 11 10:09:44 2019 -0700

    Merge tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

    Pull clone3 system call from Christian Brauner:
     "This adds the clone3 syscall which is an extensible successor to
      clone after we snagged the last flag with CLONE_PIDFD during the 5.2
      merge window for clone(). It cleanly supports all of the flags from
      clone() and thus all legacy workloads.

      There are few user visible differences between clone3 and clone.
      First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
      this flag. Second, the CSIGNAL flag is deprecated and will cause
      EINVAL to be reported. It is superseded by a dedicated "exit_signal"
      argument in struct clone_args thus freeing up even more flags. And
      third, clone3 gives CLONE_PIDFD a dedicated return argument in
      struct clone_args instead of abusing CLONE_PARENT_SETTID's
      parent_tidptr argument.
      The clone3 uapi is designed to be easy to handle on 32- and 64 bit:

        /* uapi */
        struct clone_args {
                __aligned_u64 flags;
                __aligned_u64 pidfd;
                __aligned_u64 child_tid;
                __aligned_u64 parent_tid;
                __aligned_u64 exit_signal;
                __aligned_u64 stack;
                __aligned_u64 stack_size;
                __aligned_u64 tls;
        };

      and a separate kernel struct is used that uses proper kernel typing:

        /* kernel internal */
        struct kernel_clone_args {
                u64 flags;
                int __user *pidfd;
                int __user *child_tid;
                int __user *parent_tid;
                int exit_signal;
                unsigned long stack;
                unsigned long stack_size;
                unsigned long tls;
        };

      The system call comes with a size argument which enables the kernel
      to detect what version of clone_args userspace is passing in. clone3
      validates that any additional bytes a given kernel does not know
      about are set to zero and that the size never exceeds a page.

      A nice feature is that this patchset allowed us to cleanup and
      simplify various core kernel codepaths in kernel/fork.c by making
      the internal _do_fork() function take struct kernel_clone_args even
      for legacy clone().

      This patch also unblocks the time namespace patchset which wants to
      introduce a new CLONE_TIMENS flag.

      Note, that clone3 has only been wired up for x86{_32,64}, arm{64},
      and xtensa. These were the architectures that did not require
      special massaging. Other architectures treat fork-like system calls
      individually and after some back and forth neither Arnd nor I felt
      confident that we dared to add clone3 unconditionally to all
      architectures. We agreed to leave this up to individual architecture
      maintainers. This is why there's an additional patch that introduces
      __ARCH_WANT_SYS_CLONE3 which any architecture can set once it has
      implemented support for clone3. The patch also adds a
      cond_syscall(clone3) for architectures such as nios2 or h8300 that
      generate their syscall table by simply including
      asm-generic/unistd.h.
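(My aside, not part of the pull message.) A minimal sketch of calling
clone3() directly, assuming a 5.3 or later kernel on an architecture that
wires the syscall up. No libc wrapper is assumed, so this goes through
syscall() with a local copy of the 64-byte struct quoted above. The names
clone3_args_v0 and fork_via_clone3() are made up, and the fallback syscall
number 435 is the x86-64 and asm-generic value. With no flags and no stack
it behaves like fork():

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435 /* x86-64 / asm-generic number; older headers lack it */
#endif

/* Local stand-in for the first version of the uapi struct clone_args. */
struct clone3_args_v0 {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
};

/*
 * Hypothetical helper: create a child with clone3() and check its exit code.
 * Returns 0 on success, 1 if clone3() is unavailable here (ENOSYS/EPERM),
 * -1 on any other failure.
 */
int fork_via_clone3(void)
{
    struct clone3_args_v0 args;
    int status;

    /* Unused fields must be zero. */
    memset(&args, 0, sizeof(args));
    /* The exit signal has its own field instead of living in the flags. */
    args.exit_signal = SIGCHLD;

    /* The size argument tells the kernel which struct version this is. */
    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid < 0)
        return (errno == ENOSYS || errno == EPERM) ? 1 : -1;

    if (pid == 0)
        _exit(42);              /* child: leave without touching libc state */

    if (waitpid((pid_t)pid, &status, 0) != (pid_t)pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 42) ? 0 : -1;
}
```

Call fork_via_clone3() from main(). Container runtimes with older seccomp
profiles may reject clone3(), which is why the sketch reports that case
separately instead of treating it as an error.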
The hope is to get rid of __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon" * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: arch: handle arches who do not yet define clone3 arch: wire-up clone3() syscall fork: add clone3 commit 7f192e3cd316ba58c88dfa26796cf77789dd9872 Author: Christian Brauner <christian [ at ] brauner [ dot ] io> Date: Sat May 25 11:36:41 2019 +0200 fork: add clone3 This adds the clone3 system call. As mentioned several times already (cf. [7], [8]) here's the promised patchset for clone3(). We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last free flag from clone(). Independent of the CLONE_PIDFD patchset a time namespace has been discussed at Linux Plumber Conference last year and has been sent out and reviewed (cf. [5]). It is expected that it will go upstream in the not too distant future. However, it relies on the addition of the CLONE_NEWTIME flag to clone(). The only other good candidate - CLONE_DETACHED - is currently not recyclable as we have identified at least two large or widely used codebases that currently pass this flag (cf. [2], [3], and [4]). Given that CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively blocked. clone3() has the advantage that it will unblock this patchset again. In general, clone3() is extensible and allows for the implementation of new features. The idea is to keep clone3() very simple and close to the original clone(), specifically, to keep on supporting old clone()-based workloads. We know there have been various creative proposals how a new process creation syscall or even api is supposed to look like. Some people even going so far as to argue that the traditional fork()+exec() split should be abandoned in favor of an in-kernel version of spawn(). Independent of whether or not we personally think spawn() is a good idea this patchset has and does not want to have anything to do with this. 
    One stance we take is that there's no real good alternative to
    clone()+exec() and we need and want to support this model going
    forward; independent of spawn().

    The following requirements guided clone3():
    - bump the number of available flags
    - move arguments that are currently passed as separate arguments
      in clone() into a dedicated struct clone_args
      - choose a struct layout that is easy to handle on 32 and on 64 bit
      - choose a struct layout that is extensible
      - give new flags that currently need to abuse another flag's
        dedicated return argument in clone() their own dedicated return
        argument (e.g. CLONE_PIDFD)
      - use a separate kernel internal struct kernel_clone_args that is
        properly typed according to current kernel conventions in fork.c
        and is different from the uapi struct clone_args
    - port _do_fork() to use kernel_clone_args so that all process
      creation syscalls such as fork(), vfork(), clone(), and clone3()
      behave identical (Arnd suggested, that we can probably also port
      do_fork() itself in a separate patchset.)
    - ease of transition for userspace from clone() to clone3()
      This very much means that we do *not* remove functionality that
      userspace currently relies on as the latter is a good way of
      creating a syscall that won't be adopted.
    - do not try to be clever or complex: keep clone3() as dumb as
      possible

    In accordance with Linus suggestions (cf.
    [11]), clone3() has the following signature:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

    long sys_clone3(struct clone_args __user *uargs, size_t size)

    clone3() cleanly supports all of the supported flags from clone() and
    thus all legacy workloads.
    The advantage of sticking close to the old clone() is the low cost for
    userspace to switch to this new api. Quite a lot of userspace apis
    (e.g. pthreads) are based on the clone() syscall. With the new clone3()
    syscall supporting all of the old workloads and opening up the ability
    to add new features should make switching to it for userspace more
    appealing. In essence, glibc can just write a simple wrapper to switch
    from clone() to clone3().

    There has been some interest in this patchset already. We have received
    a patch from the CRIU corner for clone3() that would set the PID/TID of
    a restored process without /proc/sys/kernel/ns_last_pid to eliminate a
    race.

    /* User visible differences to legacy clone() */
    - CLONE_DETACHED will cause EINVAL with clone3()
    - CSIGNAL is deprecated
      It is superseded by a dedicated "exit_signal" argument in struct
      clone_args freeing up space for additional flags.
      This is based on a suggestion from Andrei and Linus (cf.
[9] and [10]) /* References */ [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343 [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233 [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740 [5]: https://lore.kernel.org/lkml/20190425161416 [ dot ] 26600-1-dima [ at ] arista [ dot ] com/ [6]: https://lore.kernel.org/lkml/20190425161416 [ dot ] 26600-2-dima [ at ] arista [ dot ] com/ [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ [ at ] mail [ dot ] gmail [ dot ] com/ [8]: https://lore.kernel.org/lkml/20190524102756 [ dot ] qjsjxukuq2f4t6bo [ at ] brauner [ dot ] io/ [9]: https://lore.kernel.org/lkml/20190529222414 [ dot ] GA6492 [ at ] gmail [ dot ] com/ [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw [ at ] mail [ dot ] gmail [ dot ] com/ [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw [ at ] mail [ dot ] gmail [ dot ] com/ Suggested-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Signed-off-by: Christian Brauner <christian [ at ] brauner [ dot ] io> Acked-by: Arnd Bergmann <arnd [ at ] arndb [ dot ] de> Acked-by: Serge Hallyn <serge [ at ] hallyn [ dot ] com> Cc: Kees Cook <keescook [ at ] chromium [ dot ] org> Cc: Pavel Emelyanov <xemul [ at ] virtuozzo [ dot ] com> Cc: Jann Horn <jannh [ at ] google [ dot ] com> Cc: David Howells <dhowells [ at ] redhat [ dot ] com> Cc: Andrew Morton <akpm [ at ] linux-foundation [ dot ] org> Cc: Oleg Nesterov <oleg [ at ] redhat [ dot ] com> Cc: Adrian Reber <adrian [ at ] lisas [ dot ] de> Cc: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Cc: Andrei Vagin <avagin [ at ] gmail [ dot ] com> Cc: Al Viro <viro [ at ] zeniv [ dot ] linux [ dot ] org [ dot ] uk> Cc: Florian Weimer <fweimer [ at ] redhat [ dot ] com> Cc: linux-api [ at ] 
vger [ dot ] kernel [ dot ] org

commit 70e6e1b971e46f5c1c2d72217ba62401a2edc22b
Merge: 07ab9d5bc53d a50a3f4b6a31
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Sat Jul 20 10:33:44 2019 -0700

    Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull CONFIG_PREEMPT_RT stub config from Thomas Gleixner:
     "The real-time preemption patch set exists for almost 15 years now and
      while the vast majority of infrastructure and enhancements have found
      their way into the mainline kernel, the final integration of RT is
      still missing.

      Over the course of the last few years, we have worked on reducing the
      intrusiveness of the RT patches by refactoring kernel infrastructure
      to be more real-time friendly. Almost all of these changes were
      beneficial to the mainline kernel on their own, so there was no
      objection to integrate them.

      Though except for the still ongoing printk refactoring, the remaining
      changes which are required to make RT a first class mainline citizen
      are no longer arguable as immediately beneficial for the mainline
      kernel. Most of them are either reordering code flows or adding RT
      specific functionality.

      But this now has hit a wall and turned into a classic hen and egg
      problem:

         Maintainers are rightfully wary vs. these changes as they make
         only sense if the final integration of RT into the mainline
         kernel takes place.

      Adding CONFIG_PREEMPT_RT aims to solve this as a clear sign that RT
      will be fully integrated into the mainline kernel. The final
      integration of the missing bits and pieces will be of course done
      with the same careful approach as we have used in the past.

      While I'm aware that you are not entirely enthusiastic about that, I
      think that RT should receive the same treatment as any other widely
      used out of tree functionality, which we have accepted into mainline
      over the years.
      RT has become the de-facto standard real-time enhancement and is
      shipped by enterprise, embedded and community distros. It's in use
      throughout a wide range of industries: telecommunications, industrial
      automation, professional audio, medical devices, data acquisition,
      automotive - just to name a few major use cases.

      RT development is backed by a Linux Foundation project which is
      supported by major stakeholders of this technology. The funding will
      continue over the actual inclusion into mainline to make sure that
      the functionality is neither introducing regressions, regressing
      itself, nor becomes subject to bitrot. There is also a lively user
      community around RT as well, so contrary to the grim situation 5
      years ago, it's a healthy project.

      As RT is still a good vehicle to exercise rarely used code paths and
      to detect hard to trigger issues, you could at least view it as a QA
      tool if nothing else"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT

Purism Librem5 devkit now has a DT in kernel

    10) Use promisc for unsupported number of filters, from Justin Chen.

commit 933a90bf4f3505f8ec83bda21a3c7d70d7c2b426
Merge: 5f4fc6d440d7 037f11b4752f
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Fri Jul 19 10:42:02 2019 -0700

    Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

    Pull vfs mount updates from Al Viro:
     "The first part of mount updates.

      Convert filesystems to use the new mount API"

    Features:

     - Allow NFS client to set up multiple TCP connections to the server
       using a new 'nconnect=X' mount option. Queue length is used to
       balance load.

     - Enhance statistics reporting to report on all transports when using
       multiple connections.

     - Speed up SUNRPC by removing bh-safe spinlocks

     - Add a mechanism to allow NFSv4 to request that containers set a
       unique per-host identifier for when the hostname is not set.
     - Ensure NFSv4 updates the lease_time after a clientid update

SMB 3.1.1 GCM instead of CCM

commit a29a0a467e2c02fe4287c2d4eff86c9eb6beff0c
Merge: bed38c3e2dca d7852fbd0f04
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Thu Jul 25 08:36:29 2019 -0700

    Merge branch 'access-creds'

    The access() (and faccessat()) credentials change can cause an
    unnecessary load on the RCU machinery because every access() call ends
    up freeing the temporary access credential using RCU.

    This isn't really noticeable on small machines, but if you have
    hundreds of cores you can cause huge slowdowns due to RCU storms.

    It's easy to avoid: the temporary access credentials aren't actually
    normally accessed using RCU at all, so we can avoid the whole issue by
    just marking them as such.

    * access-creds:
      access: avoid the RCU grace period for the temporary subjective credentials

commit d7852fbd0f0423937fa287a598bfde188bb68c22
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Thu Jul 11 09:54:40 2019 -0700

    access: avoid the RCU grace period for the temporary subjective credentials

    It turns out that 'access()' (and 'faccessat()') can cause a lot of RCU
    work because it installs a temporary credential that gets allocated and
    freed for each system call.

    The allocation and freeing overhead is mostly benign, but because
    credentials can be accessed under the RCU read lock, the freeing
    involves a RCU grace period.

    Which is not a huge deal normally, but if you have a lot of access()
    calls, this causes a fair amount of secondary damage: instead of having
    a nice alloc/free patterns that hits in hot per-CPU slab caches, you
    have all those delayed free's, and on big machines with hundreds of
    cores, the RCU overhead can end up being enormous.

    But it turns out that all of this is entirely unnecessary.
Exactly because access() only installs the credential as the thread-local subjective credential, the temporary cred pointer doesn't actually need to be RCU free'd at all. Once we're done using it, we can just free it synchronously and avoid all the RCU overhead. So add a 'non_rcu' flag to 'struct cred', which can be set by users that know they only use it in non-RCU context (there are other potential users for this). We can make it a union with the rcu freeing list head that we need for the RCU case, so this doesn't need any extra storage. Note that this also makes 'get_current_cred()' clear the new non_rcu flag, in case we have filesystems that take a long-term reference to the cred and then expect the RCU delayed freeing afterwards. It's not entirely clear that this is required, but it makes for clear semantics: the subjective cred remains non-RCU as long as you only access it synchronously using the thread-local accessors, but you _can_ use it as a generic cred if you want to. It is possible that we should just remove the whole RCU markings for ->cred entirely. Only ->real_cred is really supposed to be accessed through RCU, and the long-term cred copies that nfs uses might want to explicitly re-enable RCU freeing if required, rather than have get_current_cred() do it implicitly. But this is a "minimal semantic changes" change for the immediate problem. Acked-by: Peter Zijlstra (Intel) <peterz [ at ] infradead [ dot ] org> Acked-by: Eric Dumazet <edumazet [ at ] google [ dot ] com> Acked-by: Paul E. 
McKenney <paulmck [ at ] linux [ dot ] ibm [ dot ] com> Cc: Oleg Nesterov <oleg [ at ] redhat [ dot ] com> Cc: Jan Glauber <jglauber [ at ] marvell [ dot ] com> Cc: Jiri Kosina <jikos [ at ] kernel [ dot ] org> Cc: Jayachandran Chandrasekharan Nair <jnair [ at ] marvell [ dot ] com> Cc: Greg KH <greg [ at ] kroah [ dot ] com> Cc: Kees Cook <keescook [ at ] chromium [ dot ] org> Cc: David Howells <dhowells [ at ] redhat [ dot ] com> Cc: Miklos Szeredi <miklos [ at ] szeredi [ dot ] hu> Cc: Al Viro <viro [ at ] zeniv [ dot ] linux [ dot ] org [ dot ] uk> Signed-off-by: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> On the other hand, I love PG commit messages: On the other hand, it emerges that FreeBSD and possibly other packagers are so wedded to backwards compatibility that they hack the IANA data to keep the old spelling --- and not just that old spelling, but even older spellings that IANA used back in the stone age. This caused the filter logic to fail to suppress "Factory" at all on such platforms, though the formatting problem is definitely real in that case. commit 0432a0a066b05361b6d4d26522233c3c76c9e5da Merge: af42e7450f4b 33a58980ff3c Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sat Aug 3 10:51:29 2019 -0700 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vdso timer fixes from Thomas Gleixner: "A series of commits to deal with the regression caused by the generic VDSO implementation. The usage of clock_gettime64() for 32bit compat fallback syscalls caused seccomp filters to kill innocent processes because they only allow clock_gettime(). Handle the compat syscalls with clock_gettime() as before, which is not a functional problem for the VDSO as the legacy compat application interface is not y2038 safe anyway. It's just extra fallback code which needs to be implemented on every architecture. 
      It's opt in for now so that it does not break the compile of already
      converted architectures in linux-next. Once these are fixed, the
      #ifdeffery goes away.

      So much for trying to be smart and reuse code..."

I thought we had stack leak plugins ages ago?

commit 2e616d9f9ce8d469db4cd0a019cdc2ff3feab577
Author: Darrick J. Wong <darrick [ dot ] wong [ at ] oracle [ dot ] com>
Date:   Sun Jul 28 21:12:32 2019 -0700

    xfs: fix stack contents leakage in the v1 inumber ioctls

    Explicitly initialize the onstack structures to zero so we don't leak
    kernel memory into userspace when converting the in-core inumbers
    structure to the v1 inogrp ioctl structure. Add a comment about why we
    have to use memset to ensure that the padding holes in the structures
    are set to zero.

Woof. Shots fired.

commit 0e31225f99e077d0b8c7f8577aab39e766e2477b
Merge: 4f1a6ef1df6f 9c8c9c7cdb4c
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Fri Aug 2 18:53:51 2019 -0700

    Merge tag 'drm-fixes-2019-08-02-1' of git://anongit.freedesktop.org/drm/drm

    Pull more drm fixes from Daniel Vetter:
     "Dave sends his pull, everyone realizes they've been asleep at the
      wheel and hits send on their own pulls :-/

      Normally I'd just ignore these all because w/e for me and Dave. But
      this time around the latecomers also included drm-intel-fixes, which
      failed to send out a -fixes pull thus far for this release (screwed
      up vacation coverage, despite that 2/3 maintainers were around ...
      they all look appropriately guilty), and that really is overdue to
      get landed.

When Linus comments on your commits:

commit 6e6d05360b80f196ed07061327f03346b204abea
Merge: 10e5ddd71fb3 e82f04ec6ba9
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Fri Aug 2 14:46:33 2019 -0700

    Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

    Pull SCSI fixes from James Bottomley:
     "Seven fixes to four drivers with no core changes.
      The mpt3sas one is theoretical until we get a CPU that goes up to 64
      bits physical, the qla2xxx one fixes an oops in a driver
      initialization error leg and the others are mostly cosmetic"

    [ The fcoe patches may be worth highlighting - they may be "just"
      cleanups, but they simplify and fix the odd fc_rport_priv structure
      handling rules so that the new gcc-9 warnings about memset crossing
      structure boundaries are gone.

      The old code was hard for humans to understand too, and really
      confused the compiler sanity checks - Linus ]

Good old GCC.

People still doing manual array lists, like Vim.

commit 7086751c5e4eb3cfee0b98df0d3cedc8bff47d35
Author: Jan Edmund Lazo <jan [ dot ] lazo [ at ] mail [ dot ] utoronto [ dot ] ca>
Date:   Mon Aug 5 17:42:41 2019 -0400

    vim-patch:8.1.1439: ga_grow(): 1.5x growth rate #10699

    Problem:    Json_encode() is very slow for large results.
    Solution:   In the growarray use a growth of at least 50%. (Ken Takata,
                closes vim/vim#4461)
    https://github.com/vim/vim/commit/c47ed44be76a520ded90913099771999c8a79eeb

diff --git a/src/nvim/garray.c b/src/nvim/garray.c
index 74fd9d89c..1cfc2b617 100644
--- a/src/nvim/garray.c
+++ b/src/nvim/garray.c
@@ -89,6 +89,14 @@ void ga_grow(garray_T *gap, int n)
   if (n < gap->ga_growsize) {
     n = gap->ga_growsize;
   }
+
+  // A linear growth is very inefficient when the array grows big. This
+  // is a compromise between allocating memory that won't be used and too
+  // many copy operations. A factor of 1.5 seems reasonable.
+  if (n < gap->ga_len / 2) {
+    n = gap->ga_len / 2;
+  }
+
   int new_maxlen = gap->ga_len + n;
 
   size_t new_size = (size_t)gap->ga_itemsize * (size_t)new_maxlen;

commit 4368c4bc9d36821690d6bb2e743d5a075b6ddb55
Merge: 0eb0ce0a78e1 4c92057661a3
Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org>
Date:   Tue Aug 6 11:22:22 2019 -0700

    Merge branch 'x86/grand-schemozzle' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull pti updates from Thomas Gleixner:
     "The performance deterioration department is not proud at all to
      present yet another set of speculation fences to mitigate the next
      chapter in the 'what could possibly go wrong' story.

      The new vulnerability belongs to the Spectre class and affects GS
      based data accesses and has therefore been dubbed 'Grand Schemozzle'
      for secret communication purposes. It's officially listed as
      CVE-2019-1125.

      Conditional branches in the entry paths which contain a SWAPGS
      instruction (interrupts and exceptions) can be mis-speculated which
      results in speculative accesses with a wrong GS base.

      This can happen on entry from user mode through a mis-speculated
      branch which takes the entry from kernel mode path and therefore does
      not execute the SWAPGS instruction. The following speculative
      accesses are done with user GS base.

      On entry from kernel mode the mis-speculated branch executes the
      SWAPGS instruction in the entry from user mode path which has the
      same effect that the following GS based accesses are done with user
      GS base.

      If there is a disclosure gadget available in these code paths the
      mis-speculated data access can be leaked through the usual side
      channels.

      The entry from user mode issue affects all CPUs which have
      speculative execution. The entry from kernel mode issue affects only
      Intel CPUs which can speculate through SWAPGS. On CPUs from other
      vendors SWAPGS has semantics which prevent that.

      SMAP mitigates both problems but only when the CPU is not affected by
      the Meltdown vulnerability.
The mitigation is to issue LFENCE instructions in the entry from kernel mode path for all affected CPUs and on the affected Intel CPUs also in the entry from user mode path unless PTI is enabled because the CR3 write is serializing. The fences are as usual enabled conditionally and can be completely disabled on the kernel command line. The Spectre V1 documentation is updated accordingly. A big "Thank You!" goes to Josh for doing the heavy lifting for this round of hardware misfeature 'repair'. Of course also "Thank You!" to everybody else who contributed in one way or the other" * 'x86/grand-schemozzle' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: Add swapgs description to the Spectre v1 documentation x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS x86/entry/64: Use JMP instead of JMPQ x86/speculation: Enable Spectre v1 swapgs mitigations x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations Been seeing these clang things become more popular over the years: commit 23df57afe8eebff6ece05a815934f2f70a851e0a Merge: bf1881cf484d ed4289e8b488 Author: Linus Torvalds <torvalds [ at ] linux-foundation [ dot ] org> Date: Sat Aug 10 10:17:19 2019 -0700 Merge tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fix from Michael Ellerman: "Just one fix, a revert of a commit that was meant to be a minor improvement to some inline asm, but ended up having no real benefit with GCC and broke booting 32-bit machines when using Clang. Thanks to: Arnd Bergmann, Christophe Leroy, Nathan Chancellor, Nick Desaulniers, Segher Boessenkool" * tag 'powerpc-5.3-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: Revert "powerpc: slightly improve cache helpers" Secure RPC: Disable insecure Kerberos encryption types (SUNRPC_DISABLE_INSECURE_ENCTYPES) [N/y/?] (NEW) ? 
CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES: Choose Y here to disable the use of deprecated encryption types with the Kerberos version 5 GSS-API mechanism (RFC 1964). The deprecated encryption types include DES-CBC-MD5, DES-CBC-CRC, and DES-CBC-MD4. These types were deprecated by RFC 6649 because they were found to be insecure. N is the default because many sites have deployed KDCs and keytabs that contain only these deprecated encryption types. Choosing Y prevents the use of known-insecure encryption types but might result in compatibility problems.