plusterkopp/Java-Thread-Affinity

This fork aims to support another use case: binding to groups of CPUs instead of single cores. This is useful for exploiting cache locality among groups of threads with similar data access patterns.

Another feature missing in 2013, when I started working on this fork, was CPU layout introspection for the Windows platform. On Linux, this is done by reading and parsing /proc/cpuinfo, which does not exist on Windows outside of Cygwin.

Instead, Windows provides GetLogicalProcessorInformationEx. It returns a number of SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structs that describe the CPU layout in terms of caches, NUMA nodes, cores and packages (sockets). There are also processor groups, which Windows uses to keep processor masks within 64 bits; systems with more than 64 logical CPUs (lCPUs) therefore have more than one group. Since the scheduler normally runs a process on the CPUs of a single group only, Runtime.availableProcessors() will return 36 for a process on a machine with 72 lCPUs (such as 2x E5-4669 v3, where each CPU socket counts as a group of 36 lCPUs). Without affinities, no thread will ever run on the other half of the CPUs. This gets correspondingly worse on four-socket configurations.
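To illustrate the effect (a minimal demo snippet, not part of the library; the class name is made up):

public class GroupDemo {
	public static void main( String[] args) {
		// On a two-group machine (e.g. 2x 36 lCPUs), a JVM started without
		// any affinity handling only sees the CPUs of the group it landed in:
		System.out.println( Runtime.getRuntime().availableProcessors()); // 36, not 72
	}
}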

We gather all SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structs using JNA and a small DLL built from affinityInfo.cpp. From them we construct a number of LayoutEntities, each having its own CPU mask. At the time of the fork, there was only support for 64 lCPUs on the Linux side using BitSets. On the Windows side, I had to create a GroupAffinityMask comprising a groupId and a long mask.
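Roughly, such a mask looks like this (a sketch; field and method names are illustrative rather than the fork's verbatim API):

// Sketch of a group-aware affinity mask: one Windows processor group
// plus a 64-bit mask of the logical CPUs within that group.
public class GroupAffinityMask {
	private final int groupId;  // Windows processor group (0, 1, ...)
	private final long mask;    // one bit per logical CPU in the group

	public GroupAffinityMask( int groupId, long mask) {
		this.groupId = groupId;
		this.mask = mask;
	}

	public int getGroupId() { return groupId; }
	public long getMask() { return mask; }

	// Masks can only overlap if they refer to the same group.
	public boolean intersects( GroupAffinityMask other) {
		return groupId == other.groupId && ( mask & other.mask) != 0;
	}
}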

As per Peter's request at the time, I left all the original interfaces untouched and provided my own for the new features. There is a new kind of CpuLayout, WindowsCpuLayout, which is constructed from the SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structures and extends VanillaCpuLayout with lists of Groups, Caches and NumaNodes. To access each of them, there is an interface: NumaCpuLayout, GroupedCpuLayout and CacheCpuLayout. Likewise, WindowsCpuInfo extends VanillaCpuInfo with numaId, groupId and mask.
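In outline, the per-CPU record then carries the extra fields named above (a sketch, not the verbatim source of this fork):

// Per-CPU info extended with the Windows-specific fields described above.
public class WindowsCpuInfo extends VanillaCpuInfo {
	int numaId;    // NUMA node this logical CPU belongs to
	int groupId;   // Windows processor group of this CPU
	long mask;     // this CPU's bit within its group's 64-bit mask
	// accessors omitted
}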

When I started this work, the only relevant machines had Intel CPUs, where the relations are rather simple: one socket has a number of cores, possibly with SMT, and each core has its own L1 and L2 caches. The socket then has a hopefully large L3 cache and its own path to memory, i.e. its own NUMA node. To do what I wanted (exploit caches), one could focus either on L2 caches, which were equivalent to cores, or on the L3 cache, which was equivalent to the socket. The Linux implementation provides access to both and is therefore fine as long as one uses Intel hardware.

Enter AMD Zen.

Now, on an Epyc 7301, two cores are organized into a core complex, and each core complex has its own L3 cache. Two core complexes form a NUMA node, and each socket has four of those. There is no longer a simple 1:1 mapping of socket to NUMA node and L3 cache. Using only naive socket binding on hardware like this yields huge performance losses.

Unfortunately, Linux currently does not provide the same one-stop introspection API as Windows.

Caches do not execute code and are therefore not referenced in CpuInfos; instead, they form their own hierarchy. We can only relate LayoutEntities (including caches) to each other by their masks.
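For example, whether an L3 cache "contains" a core can only be decided by mask arithmetic; a sketch building on the GroupAffinityMask sketch above:

// An entity contains another iff the inner mask is a subset of the
// outer mask and both live in the same processor group.
static boolean contains( GroupAffinityMask outer, GroupAffinityMask inner) {
	return outer.getGroupId() == inner.getGroupId()
			&& ( outer.getMask() & inner.getMask()) == inner.getMask();
}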

To bind threads to LayoutEntities, each entity provides a bind() method that binds the current thread and keeps track of the threads currently bound to it. Alternatively, the AffinityManager provides methods to bind to each kind of LayoutEntity, returning true on success. This functionality is vaguely similar to AffinityStrategy, but better matched to what I wanted to do.
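In use, this boils down to something like the following (only bind() is the entry point described here; the lookup helper is hypothetical):

// Bind the calling thread to an L3 cache entity. How the entity is
// obtained depends on the layout classes; this helper is made up.
LayoutEntity l3 = lookupL3ForMyDataPartition(); // hypothetical lookup
l3.bind(); // binds the current thread and records it on the entity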

Not contained in this package is how to bind worker threads in a pool. Since a thread can only bind itself, I subclassed Thread as AffinityThread. For constructors with a Runnable argument, I wrap that Runnable in one that first binds the new thread:

public class AffinityThread extends Thread {

	public AffinityThread( Runnable target, String name, LayoutEntity bindTo) {
		super( createBindingRunnable( bindTo, target), name);
	}

	private static Runnable createBindingRunnable( LayoutEntity bindTo, Runnable target) {
		if ( bindTo == null) {
			return target;   // nothing to bind to: run the target as-is
		}
		// bind the new thread to its entity before running the payload
		return () -> {
			bindTo.bind();
			target.run();
		};
	}
}

To use these threads, thread pools need a custom ThreadFactory that creates AffinityThreads.
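A minimal sketch of such a factory (the factory itself is not part of this package; only AffinityThread and LayoutEntity are):

import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class AffinityThreadFactory implements ThreadFactory {
	private final LayoutEntity bindTo;
	private final String namePrefix;
	private final AtomicInteger counter = new AtomicInteger();

	public AffinityThreadFactory( LayoutEntity bindTo, String namePrefix) {
		this.bindTo = bindTo;
		this.namePrefix = namePrefix;
	}

	@Override
	public Thread newThread( Runnable r) {
		return new AffinityThread( r, namePrefix + "-" + counter.incrementAndGet(), bindTo);
	}
}

// usage: all four workers end up bound to the same entity, e.g.
// ExecutorService pool = Executors.newFixedThreadPool( 4, new AffinityThreadFactory( l3, "l3-worker"));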

Usually, I do not unbind or rebind threads after they have been initially bound. However, this is possible and covered in unit tests.

I would like to see this fork merged back into the original package if my use case becomes more relevant to the people maintaining it. If I need to add more comments to my already totally self-explanatory code to help this effort, please contact me.


Overdue update Oct 2023

The hwloc package provides layout introspection capabilities similar to those of Windows. When running under Linux, we therefore call lstopo-no-graphics -v --no-io and parse its output.
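Sketched in Java, the call amounts to something like this (the actual parsing is omitted; the method name is illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

static List<String> readTopology() throws Exception {
	Process p = new ProcessBuilder( "lstopo-no-graphics", "-v", "--no-io")
			.redirectErrorStream( true)
			.start();
	List<String> lines = new ArrayList<>();
	try ( BufferedReader r = new BufferedReader( new InputStreamReader( p.getInputStream()))) {
		for ( String line; ( line = r.readLine()) != null; ) {
			lines.add( line);  // one topology object per line: packages, caches, cores, PUs
		}
	}
	p.waitFor();
	return lines;
}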
