Updates when stitching the world
Posted by Jeff Disher
The change to OctoberProject's item addressing is complete and, while I still don't like it (the addressing still feels "not quite right" in some way), it does everything required, probably in the most reasonable way possible.

Moving on from that, tool applicability to block materials has been completed (meaning a shovel is now different from an axe, in terms of what it does). In terms of feature work, the next release really just requires the hotbar slots and armour slots, which are conceptually straightforward enough. I still need to do some general block toughness and inventory sizing reworks, but that is just playing with values in data files until they "feel right". Beyond that, the UI will need some serious attention, although I am not sure how much of that to do now, versus a larger overhaul later (not to mention that I am terrible at UI work and find it gruelling).

However, I decided to stop and take a look at a bizarre performance outlier I have seen on occasion: while a tick normally completes in 1-2 milliseconds (and no optimization work has been done here, either), I have seen outliers get as high as about 250 milliseconds. Given that the target scheduling expects ticks to complete in under 100 milliseconds, this is a problem.
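
As a rough idea of the kind of instrumentation involved, a minimal tick-timing sketch might look like the following (the names here are hypothetical, not OctoberProject's actual internals):

```java
// Minimal sketch of per-tick timing instrumentation.
// Hypothetical names - this is not OctoberProject's actual code.
public class TickTimer {
	// The scheduling budget mentioned above.
	private static final long BUDGET_MILLIS = 100L;

	public void runInstrumentedTick(Runnable tickLogic) {
		long startNanos = System.nanoTime();
		tickLogic.run();
		long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
		if (elapsedMillis > BUDGET_MILLIS) {
			// Outliers like the ~250 ms case would show up here.
			System.err.println("Tick overran budget: " + elapsedMillis + " ms");
		}
	}
}
```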

After spending today adding instrumentation and doing some investigation, I found that the problem is related to cuboid loading. Specifically, it is in the way that newly-loaded cuboids are dynamically stitched into the logical world. The cost comes from a mechanism which nicely handles a bunch of corner cases (which something like Minecraft does NOT handle, by the way): when a block on the border with an unloaded cuboid is updated, that update logically would have changed the neighbouring block, had it been loaded, so something must account for it when the neighbour finally loads.

The way OctoberProject does this is by synthesizing block update events for every block on the face of every loaded cuboid which borders a newly-loaded cuboid. A cuboid is 32x32x32, so each face is 1024 blocks. While the parallel executor does a good job of spreading this work across the world, the unit of parallelism is a single cuboid, meaning that each of these faces must be processed by a single thread. The bizarre thing is that this seems to take between about 10 and 200 ms, and the variation has no obvious explanation. Even in isolated testing, the variation is wild and heavily JIT-sensitive, but rarely quite THAT bad (more like ~15 ms, and only above 100 ms on the first run - this is looping on new worlds, so there is no caching).
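
To make the cost concrete, the shape of the work is roughly the following (all names are hypothetical stand-ins, not the actual OctoberProject code, but the loop structure is the point):

```java
// Rough sketch of per-face update synthesis when a neighbour loads.
// All names are hypothetical stand-ins, not the actual OctoberProject code.
public final class FaceStitcher {
	public static final int EDGE = 32;

	// Minimal stand-in so the sketch compiles on its own.
	public interface EventQueue {
		void enqueueBlockUpdate(int x, int y, int z);
	}

	// Synthesize a block update event for each of the 32x32 = 1024 blocks
	// on the bottom face of a loaded cuboid whose base is at
	// (baseX, baseY, baseZ), because a new cuboid just loaded below it.
	public static void synthesizeBottomFaceUpdates(int baseX, int baseY, int baseZ, EventQueue queue) {
		for (int x = 0; x < EDGE; x += 1) {
			for (int y = 0; y < EDGE; y += 1) {
				// Each synthesized event later inspects all 6 neighbours of
				// its block, so a single face costs 1024 such checks, all on
				// one thread (the unit of parallelism is the cuboid).
				queue.enqueueBlockUpdate(baseX + x, baseY + y, baseZ);
			}
		}
	}
}
```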

I will need to find a way to explain this wild variation, but even the common case should be optimized somehow. Some thoughts:
1) Technically, only the faces along the vertical axis NEED this update, to account for things falling; the side faces could just ignore it, like Minecraft does, which would cause rare, and fairly harmless, world inconsistencies. This just means fewer threads would do this work, but it doesn't improve the outliers, so it isn't really a solution.
2) I could specialize this update event with knowledge of which face matters, so that fewer checks would be needed: the initial check currently looks at all 6 surrounding blocks and then almost never does anything, when at most 3 of them could actually matter (and only 1 in over 85% of cases, since most face blocks are not on an edge or corner). Similarly, there are some checks around falling block inventories which could be skipped for all but the bottom face. This is likely worth doing, but an analysis of the benefit would be required (see the first sketch after this list).
3) A cuboid could actually record which blocks on a given face have been updated since its neighbour was last loaded, replacing this heavy approach with a precise one (see the second sketch after this list). While this is possible, it is actually pretty complicated and ugly, so I would like to avoid it (there are lots of moving parts and bizarre knowledge plumbing in this).
4) I could use this as a way to investigate and optimize lower-level common routines hit in these heavy cases. While this probably won't provide much of a win, it might be a good opportunity to look at some detailed profiles.
5) The unit of parallelism could be made more granular (like 1/8 of a cuboid instead of a whole cuboid, for a simple example). While this would improve the overall performance in cases where only a few cuboids have loaded in a given tick (by nearly 4x in this example, since a 32x32 face would split across four 16x16 sub-cuboid faces), it wouldn't improve it much when many cuboids have been loaded in a single tick. It would also potentially cause a more fundamental problem later on, with the unit of parallelism being too granular, leading to overhead. In fact, I suspect that even a single cuboid may be too granular for future scalability. Hence, I do not want to change this.
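
On option 2, the idea would be for the synthesized event to carry the face it crossed, so the handler inspects exactly one neighbour instead of all 6. A sketch, under the same hypothetical naming as above:

```java
// Sketch of option 2: a face-specialized update event.
// All names are hypothetical, not the actual OctoberProject code.
public final class FaceSpecializedUpdate {
	public enum Face { UP, DOWN, NORTH, SOUTH, EAST, WEST }

	// Minimal stand-in so the sketch compiles on its own.
	public interface BlockReader {
		short blockTypeAt(int x, int y, int z);
	}

	public static void applyUpdate(BlockReader world, int x, int y, int z, Face loadedSide) {
		// Only the block across the newly-loaded face could have changed,
		// so only that single neighbour needs to be read.
		switch (loadedSide) {
			case DOWN -> {
				short below = world.blockTypeAt(x, y, z - 1);
				// The falling-block inventory checks would only apply here,
				// on the bottom face - every other face can skip them.
				// ... react to "below" here ...
			}
			case UP -> {
				short above = world.blockTypeAt(x, y, z + 1);
				// ... react to "above" here ...
			}
			// ... the 4 side faces follow the same single-neighbour pattern.
			default -> { }
		}
	}
}
```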
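
And on option 3, the "precise" replacement would amount to keeping a small dirty bitmap per face and only synthesizing updates for the marked blocks when the neighbour loads. A sketch of that bookkeeping (again, hypothetical names; the real complication would be in persisting and plumbing this state around, which is what makes me want to avoid it):

```java
import java.util.BitSet;

// Sketch of option 3: track which blocks on each face changed while the
// neighbour on that side was unloaded. Hypothetical names throughout.
public final class DirtyFaceTracker {
	public static final int EDGE = 32;

	// One 32x32 bitmap per face (6 faces per cuboid).
	private final BitSet[] dirtyByFace = new BitSet[6];

	public DirtyFaceTracker() {
		for (int i = 0; i < 6; i += 1) {
			dirtyByFace[i] = new BitSet(EDGE * EDGE);
		}
	}

	// Called when a block on a face is modified while that neighbour is unloaded.
	public void markDirty(int face, int u, int v) {
		dirtyByFace[face].set((u * EDGE) + v);
	}

	// Called when the neighbour loads: synthesize updates for only the dirty
	// blocks (usually far fewer than 1024), then clear the bitmap.
	public void drainDirty(int face, FaceVisitor visitor) {
		BitSet dirty = dirtyByFace[face];
		for (int bit = dirty.nextSetBit(0); bit >= 0; bit = dirty.nextSetBit(bit + 1)) {
			visitor.visit(bit / EDGE, bit % EDGE);
		}
		dirty.clear();
	}

	public interface FaceVisitor {
		void visit(int u, int v);
	}
}
```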

I will need to think about this a bit more, as the performance is so far very good but these outliers are definitely a problem,
Jeff.