You use ECS to iterate quickly through the same kind of component across multiple objects (Entity) instead of the old OOP way. Could the same concept be applied at an even smaller level: each field in the struct?
For example, if you use ECS normally and your Data : IComponentData has int a; int b; and int c;, it is already faster than usual, because you can iterate through the Data of multiple entities linearly without encountering other unrelated components. Great! But what about the things inside it?
Suppose you are going to add 1 to the b integer across all Data that match your query. When your job works on each Data, it encounters a b c | a b c | a b c ... like this. This is AOS, Array of Structures. A job rarely does something to every field in the struct, other than initializing them all to 0 or something.
Structure of Array (SOA) optimization
At the struct level, wouldn't it be nice if we could arrange the data as multiple streams, one per struct field, when we have a lot of them? Then when we mass b += 1, we could select just the stream of b. It could then be turned into SIMD instructions, because SIMD wants homogeneous data (the same type sitting next to each other).
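To make the two layouts concrete, here is a minimal sketch in plain C (not Unity code). The names Data, DataStreams, increment_b_aos and increment_b_soa are made up for illustration, and int is assumed to be 4 bytes as it is in C#.

```c
#include <stddef.h>

/* AOS: one struct per entity, fields interleaved as a b c | a b c | ... */
typedef struct {
    int a;
    int b;
    int c;
} Data;

/* SOA: one contiguous stream per field (hypothetical layout for illustration). */
typedef struct {
    int *a;
    int *b;
    int *c;
    size_t count;
} DataStreams;

/* Adds 1 to b in AOS form: every step also walks past a and c,
   so consecutive b values sit 12 bytes apart. */
void increment_b_aos(Data *items, size_t count) {
    for (size_t i = 0; i < count; i++) {
        items[i].b += 1;
    }
}

/* Adds 1 to b in SOA form: only the b stream is touched and consecutive
   values sit 4 bytes apart, a shape compilers can usually vectorize. */
void increment_b_soa(DataStreams *streams) {
    for (size_t i = 0; i < streams->count; i++) {
        streams->b[i] += 1;
    }
}
```

Both functions produce the same result. The only difference is which bytes the loop has to walk over to get it.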
Besides NativeArray, you can use NativeArrayFullSOA or NativeArrayChunked8 for data ordering customization!
For the ECS database, however, as far as I know the most granular layout is AOS per component. It does not know about each field in the struct. ECS could SIMD each struct in parallel for you, but not each struct field in parallel.
CPU and the cache line unit
The CPU does not fetch exactly the bytes you want to work on. It fetches the whole cache line. A cache line is usually 64 bytes, and in this case when you access Data you usually get something like 5 whole Data plus a partial one (magnified: a b c | a b c | a b c | a b c | a b c | a). Each one contains 3 ints, which takes 12 bytes, so we get 5 full structs (60 bytes). Of the last one you only get 1 int, pulled in to fill the cache line. (If you really need that last element later on, the rest of it has to come from RAM.)
If what you want is in the middle of the cache line "grid", the CPU must move back until it is aligned with a 64-byte block and fetch the whole block. Ideally your data should be cache line aligned, and the leftover data that comes along with that fetch should be what you want next.
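The arithmetic above can be written down as a small sketch in plain C. The 64-byte line size is an assumption (typical, but not universal), and the function names are hypothetical helpers, not part of any Unity API.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u /* assumed line size; check your target CPU */

/* Start address of the cache line that contains addr:
   round down to the nearest multiple of 64. */
uintptr_t cache_line_start(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
}

/* Number of whole structs that fit in one line when the first struct
   starts exactly at the line start. */
size_t whole_structs_per_line(size_t struct_size) {
    return CACHE_LINE_SIZE / struct_size;
}

/* Bytes of the next, incomplete struct that tag along at the end of the line. */
size_t leftover_bytes_per_line(size_t struct_size) {
    return CACHE_LINE_SIZE % struct_size;
}
```

For the 12-byte Data this gives 5 whole structs and 4 leftover bytes, which is the lone trailing a in the picture. An address such as 100 falls in the block that starts at 64.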
When you put things in the full SOA container (NativeArrayFullSOA), they are sliced and arranged SOA style, per field. (It contains a reflection-like routine to inspect your fields.) When you use the indexer to get at the data, it reassembles those parts to give you one full struct in C# like nothing happened.
This whole slicing and reassembling would normally make things slower, but we are using the fact that the cache line now brings us different data, data that we are likely to use next. And Burst can understand it and optimize it.
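The slice-on-write, reassemble-on-read idea can be sketched in plain C as follows. This is not the Unity implementation: SoaArray and its functions are hypothetical, the three fields are hard-coded instead of discovered by reflection, and the streams are assumed to sit back to back in one buffer.

```c
#include <stdlib.h>

typedef struct {
    int a;
    int b;
    int c;
} Data;

#define FIELD_COUNT 3

/* Hypothetical full-SOA array. One buffer, field streams back to back:
   a a a ... | b b b ... | c c c ... */
typedef struct {
    int *buffer;
    size_t count;
} SoaArray;

/* Allocates zeroed storage for count elements; returns 1 on success. */
int soa_init(SoaArray *arr, size_t count) {
    arr->buffer = calloc(count * FIELD_COUNT, sizeof(int));
    arr->count = count;
    return arr->buffer != NULL;
}

void soa_free(SoaArray *arr) {
    free(arr->buffer);
    arr->buffer = NULL;
    arr->count = 0;
}

/* "Indexer" write: slices the struct and stores each field in its own stream. */
void soa_set(SoaArray *arr, size_t i, Data value) {
    arr->buffer[0 * arr->count + i] = value.a;
    arr->buffer[1 * arr->count + i] = value.b;
    arr->buffer[2 * arr->count + i] = value.c;
}

/* "Indexer" read: gathers one value from each stream into a full struct. */
Data soa_get(const SoaArray *arr, size_t i) {
    Data value;
    value.a = arr->buffer[0 * arr->count + i];
    value.b = arr->buffer[1 * arr->count + i];
    value.c = arr->buffer[2 * arr->count + i];
    return value;
}
```

The caller still sees whole Data values going in and out. Only the storage underneath is rearranged.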
The point is its memory representation. You could now have a native array of float3 where the x values really line up in memory one after another. Even in the normal ECS database, a Translation component would be x y z | x y z | ... inside. (AOS)
So when the CPU gets things from RAM in units of cache lines, you want the leftover data that tags along to be what you use next. In this case you get a bunch of x when you ask for na[i].x. If you are iterating through everything in order to do something on only x, you get a longer streak of cache hits because you didn't waste space on y and z in that 64-byte cache line. On the other hand, if you are changing x y z together, the AOS way could be better. So these containers are pretty specialized.
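To put a rough number on that "longer streak", the sketch below counts how many distinct cache lines are touched when only one 4-byte field is read from every element. It is plain C with a hypothetical function name, and it assumes 64-byte lines and a buffer that starts on a line boundary.

```c
#include <stddef.h>

/* Counts the distinct 64-byte cache lines touched when reading one 4-byte
   field from each of count elements, where consecutive values of that field
   sit stride bytes apart (12 for an AOS float3, 4 for an SOA x stream).
   Assumes the buffer starts on a cache line boundary. */
size_t lines_touched(size_t count, size_t stride) {
    size_t lines = 0;
    size_t last_line = (size_t)-1; /* sentinel: no line visited yet */
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * stride) / 64;
        if (line != last_line) {
            lines++;
            last_line = line;
        }
    }
    return lines;
}
```

For 1600 elements, reading only x touches 300 lines with a 12-byte stride and 100 lines with a 4-byte stride. If the loop needs x, y and z together, that advantage goes away, which matches the caveat above.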
The hard limitation right now is that your struct must be composed of only 4-byte data types. The logic is optimized to jump around a cached layout of 4-byte data types. (short would not work, for example.)
Test it out
These 2 jobs differ only in the native container type. The task is to iterate through my quite large struct with 10+ fields, finding the first item where a specific field equals 555.
NativeArrayFullSOA is in using Unity.Collections.Experimental; so who knows, it could disappear someday.
Using the enhanced assembly option, we get segmented yellow text for each section of code.
The real highlight is in the 3rd image, where you see a cmp with 555 and, if it matches, a je (jump if equal) to the long routine that writes to the native array. The majority of the long preparation step in the 1st image is for the safety check code of NativeArray in the 2nd image. (That should disappear in the real build, right?)
You can see we arrive at cmp 555 much earlier! This is what I thought it was going to be.
- It prepares the data to be iterated in r14 from the start, and also the loop iterator rbx. (smart!)
- Compare the length field (+8 from the beginning of NativeArrayFullSOA, it seems); if not equal to the length yet… (test that does not trigger …)
- rcx acts as an anchor for the starting point, and i in the loop code is multiplied by 4 to directly get my .laneIndex for comparison! That's the power of SOA, since the next .laneIndex is right at the next multiple of 4. Incrementing rbx gets the next one right away. Now you know why the 4-byte limitation is there.
That assembly was generated from code where the field I want is the 1st field in the struct, serializedLaneIndex. It is actually accessed through the property laneIndex, but a property does not count in the data layout, and the StructLayout attribute seems not to matter.
What if I rearrange it?
Things changed a bit with the introduction of lea, load effective address. Some preparation is needed to get rcx to the correct “stream”, then we can go SOA from that point. Note that if the cmp fails, after the inc it jumps right back to the next cmp without doing lea again. So yeah, it is still smart when the field you want is elsewhere.
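The shape of that loop can be sketched in plain C. This is only an illustration of the pattern and not a reconstruction of the Burst output: find_first_in_field is a hypothetical name, and the layout (field streams back to back, each holding count 4-byte values) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the first element whose field number field_index
   equals target, or -1 if there is none. buffer is assumed to hold the
   field streams back to back, each with count 4-byte values. */
ptrdiff_t find_first_in_field(const int32_t *buffer, size_t count,
                              size_t field_index, int32_t target) {
    /* One-time setup: move to the start of the wanted stream.
       This plays the role of the lea before the loop. */
    const int32_t *stream = buffer + field_index * count;

    /* Hot loop: compare, step 4 bytes, compare again.
       The stream address is never recomputed. */
    for (size_t i = 0; i < count; i++) {
        if (stream[i] == target) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```

When field_index is 0 the setup is a no-op, which matches the first-field case needing no extra preparation.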
Back to NativeArray
Now we can understand why NativeArray is worse than the full SOA container here.
First of all, the preparation for eax is much longer. So many offsets (r15, r14, r13, r12) are all needed to access data from NativeArray! And from label .LB_02 onwards, I don't know why accessing the data has to be that long, but it spans one more image.
Only then can we finally compare with 555, and if that fails, we go back around the loop.
NativeArrayChunked8 rearranges things too, but now in chunks of 32 bytes each. Each field must still be 4 bytes in size.
Say Data is now a big 100-byte struct. (So 25 fields, since each field must take 4 bytes.) If you use NativeArrayChunked8<Data> and have 5 elements of Data in it, you get (5 * 100) / 32 = 15.625, rounded up to 16 chunks of 32 bytes each. Each chunk is then able to hold 8 fields (32 / 4 = 8). Now you know where the "8" in the name came from.
When you use the indexer on this thing, it calculates the correct chunk, then the correct offset within the chunk (+4 bytes per offset) to get you to the correct field.
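The chunk arithmetic can be sketched in plain C. This follows only the description above and is not taken from the Unity source: the names are hypothetical, and how the container maps a given element and field onto a 4-byte slot is not covered here, so the sketch starts from a flat slot number.

```c
#include <stddef.h>

#define CHUNK_BYTES 32u
#define SLOT_BYTES 4u
#define SLOTS_PER_CHUNK (CHUNK_BYTES / SLOT_BYTES) /* 8 slots of 4 bytes */

typedef struct {
    size_t chunk;       /* which 32-byte chunk */
    size_t byte_offset; /* offset inside that chunk, a multiple of 4 */
} ChunkPosition;

/* Number of 32-byte chunks needed for element_count structs of
   struct_bytes bytes each, rounded up. */
size_t chunk_count(size_t element_count, size_t struct_bytes) {
    size_t total = element_count * struct_bytes;
    return (total + CHUNK_BYTES - 1) / CHUNK_BYTES;
}

/* Splits a flat 4-byte slot number into a chunk index and a byte offset. */
ChunkPosition locate_slot(size_t slot) {
    ChunkPosition pos;
    pos.chunk = slot / SLOTS_PER_CHUNK;
    pos.byte_offset = (slot % SLOTS_PER_CHUNK) * SLOT_BYTES;
    return pos;
}
```

With 5 elements of 100 bytes, chunk_count gives 16. Slot 19, for example, lands in chunk 2 at byte offset 12.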
Now what? Isn't this achieving the same purpose as the full SOA one? Because both are in the Experimental namespace and there is no documentation, I can only guess, but 32 bytes is the magic number. You see, the cache line is usually 64 bytes per fetch from RAM to the CPU cache.
I can see that if you use NativeArrayChunked8, you might very consistently get cache line aligned memory allocations no matter what the size of your struct is. When getting data, you are then more likely to start at exactly the beginning of the data you want.
I didn't check the Burst assembly on this one.
It could be interesting if we could make an ECS chunk reorder itself to SOA, then lock the chunk to optimize some routine. But for now this move is limited to these 2 hand-allocated containers, full SOA and chunked 8 (which you can still throw into an IJob to do something, even outside the ECS database system).
The relevant talk
This talk is super fun! I recommend watching it even if you are not an optimization geek. He explains things very simply.