Python programs can sometimes be compute bound in surprising ways. Recently I
tried refactoring a program that downloaded 4 JSON files, parsed them, and
made them available to be used in a larger program. When I rolled out my
“improvement”, it actually made the code slower, and I had to quickly fix it.
How could I have avoided this?
What We Should Expect from a Good Program
A few things would make our lives easier. Python has not traditionally made
the following easy, but we are right on the cusp of having our cake and eating
it too. Here’s what I would expect from a good program:
Easy to Parallelize. If the code is slow, we should be able to split it up.
Easy to Profile. If the code is slow, it should be easy to figure out why.
Let’s see if we can get both at the same time.
Hard to Parallelize
The original authors had used os.fork() to achieve parallelism, which has
problems. I assumed this was done to avoid using threads directly, or for some
other reason, but that turned out not to be the case. “Downloading some JSON and
sticking it in Redis? That’s definitely IO-bound.” Wrong. The JSON parser in
Python is slow enough that downloading and parsing all 4 files ended up taking
more than 60 seconds. The refresh interval for this code was only 1 minute
long. When I replaced the fork-based code with a ThreadPoolExecutor, the code
started taking anywhere from minutes to nearly hours to finish. It seemed IO
bound, but it was actually CPU bound.
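For context, here is a minimal sketch of the shape of that refactor. The URLs and the fetch_and_parse helper are placeholders of my own, not the real code:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs; the real program had 4 large JSON feeds.
URLS = [
    "https://example.com/feed1.json",
    "https://example.com/feed2.json",
]

def fetch_and_parse(url: str) -> dict:
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
    # The download releases the GIL, but json.loads() holds it,
    # so the threads end up parsing one file at a time.
    return json.loads(raw)

with ThreadPoolExecutor(max_workers=4) as pool:
    parsed = list(pool.map(fetch_and_parse, URLS))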
Hard to Profile
A more seasoned engineer might point out that I should have profiled this code
before trying to “optimize” it. However, Python only
recently gained the
ability to integrate with perf. Unfortunately, the implementation creates a
new PID-named file, at an unconfigurable location, each time the process
starts. In a fork-based concurrency world, that’s a lot of PIDs. And because
these perf files aren’t small, you run the risk of filling up the disk of the
server you are profiling on. Secondly, these forked processes flare into and
out of existence quickly (i.e. within seconds), so it’s hard to catch them in
the act of whatever they’re doing. A long-lived process would be much easier to
observe.
And Still Hard to Parallelize?
When I replaced my ThreadPoolExecutor with a ProcessPoolExecutor, this problem
reared its head again. Because the pool’s worker processes aren’t tied to
particular tasks, it’s hard to identify which processes to profile; tracking
down all the PIDs associated with my pool is just as tricky as before. Secondly,
switching from ThreadPoolExecutor to ProcessPoolExecutor is not straightforward.
All the functions and arguments now need to be pickle-able, meaning things like
lambdas, nested functions, and methods bound to un-picklable objects no longer
work.
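To make the pickling constraint concrete, here is a hedged sketch; parse_feed is a made-up helper, not the real code:

from concurrent.futures import ProcessPoolExecutor

def parse_feed(raw: str) -> dict:
    # A module-level function pickles fine: it is sent by qualified name.
    return {"length": len(raw)}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        ok = pool.submit(parse_feed, "{}")        # works
        bad = pool.submit(lambda raw: raw, "{}")  # lambdas can't be pickled
        print(ok.result())
        print(bad.result())                       # this future raises a pickling error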
Parallel, Profile-able Python
Python 3.14 adds a new module and APIs for creating sub-interpreters
(e.g. InterpreterPoolExecutor; a short sketch follows the list below).
Significant work has gone into CPython to make the interpreter state
thread-local, meaning it’s possible to run multiple
“Pythons” in the same process. This helps us a lot because it means we can get
the parallelism we want, without the system overhead of running multiple
processes. Specifically:
There’s no overhead of starting up multiple processes. Everything stays in one
process, sharing page tables, signal handlers, file descriptors, and so on.
PIDs are way more stable. The process ID of the main thread is the same as the
ID of the worker (sub-interpreter) threads.
Memory sharing is (or will be) easier. Rather than having to convert Python
objects in one interpreter to a serialized (cough Pickle cough) form, it will be
much easier to share state with other workers. (Also, shout out to Ray, which
has done the hard work to make this kind of sharing a lot easier.)
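As a sketch, and assuming the 3.14 API mirrors the other executors, moving the CPU-heavy parsing onto sub-interpreters looks roughly like this:

import json
from concurrent.futures import InterpreterPoolExecutor

def parse(raw: str) -> dict:
    # Each worker runs in its own sub-interpreter, so one worker's
    # json.loads() no longer blocks another's.
    return json.loads(raw)

if __name__ == "__main__":
    payloads = ['{"a": 1}', '{"b": 2}']
    with InterpreterPoolExecutor(max_workers=4) as pool:
        print(list(pool.map(parse, payloads)))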
The multiple-runtimes-in-one-process model is not new; the most notable example
is NodeJS. But it is a very welcome addition to Python. Given the amazing
progress on GIL removal and the new JIT in Python 3.13, Python is becoming a
much more workable language for server development.
After watching Brian Goetz’s Presentation
on Valhalla, I started thinking more seriously about how value classes work. There are a few things
that are exciting, but a few that are pretty concerning too. Below are my thoughts; please
reach out if I missed something!
Equality (==) is No Longer Cheap
Pre-Valhalla, checking if two variables referred to the same object was cheap: a
single word comparison. Valhalla changes that to depend on the runtime type of
the object. It also implies an extra null check, since the VM can’t load the
class word without first checking for null. Even with a segfault handler to try
to skip the null check, the performance of == would no longer be consistent.
This isn’t the end of the world for high performance computing, but it doesn’t
seem like that big of a win. Everyone’s code bears the cost.
It appears most of the performance optimizations available to Valhalla are not yet in, so it’s
hard to tell if the memory layout improvements are worth the expense.
Minor: IdentityHashMap is now a performance liability. Don’t accidentally put a
value object in one, or else.
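To make the semantic change concrete, here is a sketch of what I understand == to mean for a value class under JEP 401 (the syntax may still shift before it ships):

value record Point(int x, int y, int z) {}

class EqualityDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2, 3);
        Point b = new Point(1, 2, 3);
        // For identity objects, two distinct instances are never ==.
        // For value objects, == compares field values, so this should
        // print true, at the cost of a null check plus a field-by-field
        // comparison instead of a single word comparison.
        System.out.println(a == b);
    }
}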
AtomicReference
How value classes will interact with AtomicReference seems to be an issue. While value objects
can be passed around by value, they can also be passed by reference, depending on the VM.
However, AtomicReference is defined in terms of == for ops like compareAndSet. Value objects
no longer have an atomic comparison. What will happen? Consider the following sequence of
events:
value record Point(int x, int y, int z) {}

static final AtomicReference<Point> POINT =
    new AtomicReference<>(new Point(1, 2, 3));
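The interleaving I have in mind goes roughly like this (the T1/T2 names and the exact calls are my own illustration):

// T1: read the current value and prepare a compare-and-set.
Point expected = POINT.get();                 // Point(1, 2, 3)

// T2: replace the stored Point with a *different instance*
//     that happens to have the same field values.
POINT.set(new Point(1, 2, 3));

// T1: attempt the swap.
boolean swapped = POINT.compareAndSet(expected, new Point(4, 5, 6));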
A regular AtomicReference would return false for T1, despite the value being the
expected value before, during, and after the call; that failed compareAndSet is
exactly what lets us detect and resolve the race. A value-based object, though:
what could it do?
Where is the Class Word?
Without object identity, most of the object header isn’t needed. The identity hash code,
synchronization bits, and probably any GC bits aren’t needed any more. But, what about
valueObj.getClass()?
I can’t see an easy way of implementing it. If the class word is adjacent to the object state in
memory, we don’t get nearly the memory savings we wanted.
Even if we had a single class pointer for a whole array of value objects, it still wouldn’t help. Consider:
value record Point(int x, int y, int z) {}

Object[] points =
    new Object[]{new Point(1, 2, 3), new Point(4, 5, 6)};
for (Object p : points) { System.out.println(p.getClass()); }
The VM would have to either prove every object in the array has the same class, or else store it
per object.
It would be great to see how the class pointer is elided in real life.
Intrusive Linked Lists and Trees
Value objects’ state is implicitly final, which means they can’t really be used
for mutable data structures. One of the things I miss from my C days is
embedding a value directly inside a linked-list node. That saves space, but
doesn’t appear to work for value objects. The same goes for trees.
I haven’t thought extensively about it, but denser data-structures don’t seem to be served by the
Valhalla update.
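For example, here is the kind of intrusive node I’d like to write. As I understand the current design, this doesn’t work, because a value class’s fields are implicitly final (the Node class is my own sketch, not from the JEP):

value class Node {
    int element;   // the payload, stored inline with the node
    Node next;

    Node(int element, Node next) {
        this.element = element;
        this.next = next;
    }

    void append(Node n) {
        this.next = n;   // not allowed: value class fields are implicitly final
    }
}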
Values Really Don’t Have Identities.
Ending on a positive note, one of the things I liked about JEP 401
was the attention called to mutating a value object. Specifically:
Field mutation is closely tied to identity: an object whose field is being updated is the same
object before and after the update
Many years ago, I had an argument with a coworker about Go’s non-reentrant mutex, v.s. Java’s
reentrant synchronizers. As most [civil] arguments go, both of us learned something new: Go’s
mutexes can be locked multiple times. Behold!
package main

import (
    "fmt"
    "sync"
)

func main() {
    var m sync.Mutex
    m.Lock()               // first lock of the original mutex
    m = *(new(sync.Mutex)) // reassignment: m is now a brand new, unlocked mutex
    m.Lock()               // so this second Lock succeeds instead of deadlocking
    defer m.Unlock()
    fmt.Println("Hello")
}
This code shows the problem. The mutex becomes a new object upon reassignment, despite being
the same variable. If the second .Lock() call is removed, this code actually panics, despite
the Lock call coming before the Unlock, and there being the same number of Locks and Unlocks.
Java is saying the same thing here. Mutability implies identity.
Conclusion
At this point, I think the Valhalla branch is interesting, but not enough to carry its own weight.
Without being able to see the awesome performance and memory improvements, it’s hard to tell if
the language and VM complexity are justified.
import java.time.Duration;
import java.time.Instant;

public class Timer {
    public static void main(String[] args) throws Exception {
        Instant start = Instant.now();
        System.err.println("Starting at " + start);
        Thread.sleep(Duration.ofSeconds(10));
        Instant end = Instant.now();
        System.out.println("Slept for " + Duration.between(start, end));
    }
}
On the surface, it looks correct. The code tries to sleep for 10 seconds, and then prints out how long it actually slept for. However, there is a subtle bug: it’s using calendar time instead of monotonic time.
Instant.now() is Calendar Time
Instant.now() seems like a good API to use. It’s typesafe, modern, and has nanosecond resolution! All good, right? The problem is that the time comes from the computer’s clock, which can move around unpredictably. To show this, I recorded running this program:
As we can see, the program takes a little over 10 seconds to run. However, what would happen if the system clock were to be adjusted? Let’s look:
Time went backwards and our program didn’t measure the duration correctly! This can happen when NTP steps the clock, when a user changes the system clock manually, and even when returning from sleep or hibernate power states.
Use System.nanoTime to Measure Duration
To avoid clock drift affecting our measurement, we can use System.nanoTime(). This API returns a timestamp whose origin is arbitrary, but which increases monotonically during the run of our program. Here’s how to use it:
import java.time.Duration;

public class Timer {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        System.err.println("Starting at " + start);
        Thread.sleep(Duration.ofSeconds(10));
        long end = System.nanoTime();
        System.out.println("Slept for " + Duration.ofNanos(end - start));
    }
}
We don’t get to use the object-oriented time APIs, but those weren’t meant for recording durations anyway. It feels a little more raw to use long primitives, but the result is always correct. If you are looking for a typesafe way to do this, consider using Guava’s Stopwatch class.
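For reference, a minimal sketch of the Stopwatch approach, assuming Guava is on the classpath:

import com.google.common.base.Stopwatch;
import java.time.Duration;

public class StopwatchTimer {
    public static void main(String[] args) throws Exception {
        Stopwatch stopwatch = Stopwatch.createStarted();  // backed by a monotonic ticker
        Thread.sleep(Duration.ofSeconds(10));
        System.out.println("Slept for " + stopwatch.elapsed());  // a java.time.Duration
    }
}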
The nanoTime() call is great in lots of situations:
Logging how long a Function takes to run
Calculating how long to wait in an exponential back-off retry loop
Picking a time to schedule future work.
Recording in metrics how long a Function takes to run
What about System.currentTimeMillis()?
While this function worked well for a long time, it has been superseded by Instant.now(). I usually see other programmers use this function because they only care about millisecond granularity. However, it suffers from the same clock drift problem as Instant.now().