You used ECS to iterate fast through the same kind of component across multiple objects (entities) instead of the old OOP way. Could the same concept be applied at an even smaller level: each field in the struct?

For example, if you use ECS normally and your Data : IComponentData has int a; int b; and int c;, it is already faster than usual to iterate through the Data of multiple entities linearly without encountering other, unrelated components. Great! But what about the things inside it?

Suppose you are going to add +1 to the b integer across all Data that match your query. When your job works on each Data, it encounters a b c | a b c | a b c .. like this. This is AOS, Array of Structures. A job rarely does things to all fields in the structure, other than initializing them all to 0 or something.
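
To make that concrete, here is a minimal sketch of the setup (the names are mine, just for illustration), using a plain IJob over a NativeArray for simplicity:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;

public struct Data : IComponentData
{
    public int a;
    public int b;
    public int c;
}

// In memory a NativeArray<Data> is AOS: a b c | a b c | a b c ...
// so touching every b means striding over a and c the whole way.
[BurstCompile]
public struct AddOneToB : IJob
{
    public NativeArray<Data> data;

    public void Execute()
    {
        for (int i = 0; i < data.Length; i++)
        {
            var d = data[i];   // copy the whole 12-byte struct out
            d.b += 1;
            data[i] = d;       // write the whole struct back
        }
    }
}
```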

Structure of Array (SOA) optimization

At the struct level, wouldn’t it be nice if we could arrange the data into multiple streams, one per struct field, when we have a lot of them? So when we mass b += 1, we could select just the stream of b. It could then be turned into SIMD instructions, because SIMD wants homogeneous data. (The same type being next to each other)
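
Conceptually, that is just one stream per field. A hand-rolled sketch of the idea (before reaching for the special containers below; names are mine):

```csharp
using Unity.Collections;

// SOA by hand: one stream per field instead of one array of structs.
// Memory becomes  a a a ... | b b b ... | c c c ...
public struct DataSoA
{
    public NativeArray<int> a;
    public NativeArray<int> b;
    public NativeArray<int> c;
}

public static class SoAOps
{
    // b += 1 now walks one homogeneous stream, which is exactly the
    // shape SIMD wants: the same type sitting next to each other.
    public static void AddOneToB(ref DataSoA soa)
    {
        for (int i = 0; i < soa.b.Length; i++)
        {
            soa.b[i] += 1;
        }
    }
}
```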

For NativeArray, you can use NativeArrayFullSOA and NativeArrayChunked8 to customize the data ordering!

For the ECS database, however, as far as I know the most granular layout is AOS per component. It does not go down to each field in the struct. ECS could SIMD each struct for you in parallel, but not each struct field in parallel.

CPU and the cache line unit

How the CPU works is that it fetches not only exactly what you want to work on, but the whole cache line. It is usually 64 bytes in size, and in this case when you access Data you usually get something like Data Data Data Data Data Data (partial) (Magnified: a b c | a b c | a b c | a b c | a b c | a). Because each one contains 3 ints, taking 12 bytes, we get 5 full structs, and from the last one only 1 int gets pulled in at the cache line boundary. (If you really need that last element later on, the CPU has to fetch the rest from RAM.)

If what you want is in the middle of the cache line "grid", the fetch moves back until it is aligned to a 64-byte block and grabs the whole block. Ideally your data should be cache line aligned, and the leftover data you get along with that cache fetch should be what you want next.
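
As a quick sanity check of that arithmetic (a sketch, assuming the usual 64-byte line):

```csharp
// Cache lines are fetched at 64-byte-aligned addresses, so an access
// in the middle of a line pulls in the whole aligned block.
public static class CacheMath
{
    const long CacheLine = 64;

    // Round a byte address down to the start of its cache line.
    public static long LineStart(long address) => address & ~(CacheLine - 1);

    // e.g. LineStart(1000) == 960; the fetch covers bytes 960..1023.
    // For the 12-byte Data above: 64 / 12 = 5 full structs per line,
    // plus 4 leftover bytes = one int of a sixth struct.
}
```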

NativeArrayFullSOA

When you put things in this special native container, they get sliced and arranged SOA-style, per field. (It contains a reflection-like routine to inspect your fields.) When you use the indexer to get at the data, it reassembles those parts to hand C# one full struct, like nothing happened.

This whole slicing and reassembling would normally make things slower, but we are exploiting the fact that the cache line now brings in different data that we are likely to use next. And Burst understands it and optimizes accordingly.
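
Usage then looks like any other native container. A minimal sketch, assuming the experimental type keeps the usual (length, allocator) constructor and indexer; Data is the 3-int struct from earlier:

```csharp
using Unity.Collections;
using Unity.Collections.Experimental; // NativeArrayFullSOA lives here

public static class FullSoaDemo
{
    public static void Run()
    {
        var soa = new NativeArrayFullSOA<Data>(1000, Allocator.Temp);

        // Writes get sliced into per-field streams behind the scenes...
        soa[0] = new Data { a = 1, b = 2, c = 3 };

        // ...and reads reassemble the full struct, like nothing happened.
        Data d = soa[0];

        soa.Dispose();
    }
}
```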

The point is its memory representation. You could now have a NativeArray of float3 where the x values really do line up in memory one after another. Even the normal ECS database with a Translation component would be x y z | x y z | ... inside. (AOS)

So when the CPU gets things from RAM in units of a cache line, you wish the leftover data that tags along is what you want to use next. In this case you are getting a bunch of x when asking for na[i].x. If you are iterating through everything to work on only x, you get a longer streak of cache hits because you didn't waste that 64-byte cache line on y and z. On the other hand, if you are changing x, y, and z together, the AOS way could be better. So these containers are pretty specialized.
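
For example, a job that consumes only x could look like this sketch (assuming the experimental container exposes Length and an indexer like a normal NativeArray):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.Experimental;
using Unity.Jobs;
using Unity.Mathematics;

[BurstCompile]
public struct SumX : IJob
{
    public NativeArrayFullSOA<float3> positions;
    public NativeArray<float> result;

    public void Execute()
    {
        float sum = 0;
        for (int i = 0; i < positions.Length; i++)
        {
            // Only x is consumed; in SOA all the x values sit
            // contiguously, so each 64-byte line brings 16 useful floats.
            sum += positions[i].x;
        }
        result[0] = sum;
    }
}
```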

The hard limitation right now is that your struct must be composed of only 4-byte data types. The logic is optimized to jump around a cached layout of 4-byte lanes. (byte, double, or short would not work, for example)
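
In other words (my own illustration of the rule):

```csharp
// OK: every field is exactly 4 bytes.
public struct Fits
{
    public int   a;   // 4 bytes
    public float b;   // 4 bytes
    public uint  c;   // 4 bytes
}

// Not OK: the layout logic assumes 4-byte lanes.
public struct DoesNotFit
{
    public byte   a;  // 1 byte
    public short  b;  // 2 bytes
    public double c;  // 8 bytes
}
```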

Test it out

These 2 jobs differ only in the native container type. The task is to iterate through my quite large struct with 10+ fields, finding the first item where a specific field equals 555.

Note that NativeArrayFullSOA is in using Unity.Collections.Experimental; who knows, it could disappear someday.
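
For context, here is a minimal sketch of the two jobs. The struct and field names are placeholders of mine, and only the container type differs:

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.Experimental;
using Unity.Jobs;

// A stand-in for my "quite large" struct: 10+ four-byte fields.
public struct BigData
{
    public int serializedLaneIndex;
    public int f01, f02, f03, f04, f05, f06, f07, f08, f09, f10;

    public int laneIndex => serializedLaneIndex;
}

[BurstCompile]
public struct FindInNativeArray : IJob
{
    public NativeArray<BigData> data;
    public NativeArray<int> foundIndex;

    public void Execute()
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i].laneIndex == 555)
            {
                foundIndex[0] = i; // the write the je below jumps to
                return;
            }
        }
    }
}

[BurstCompile]
public struct FindInFullSOA : IJob
{
    public NativeArrayFullSOA<BigData> data;
    public NativeArray<int> foundIndex;

    public void Execute()
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i].laneIndex == 555)
            {
                foundIndex[0] = i;
                return;
            }
        }
    }
}
```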

NativeArray version

Using the enhanced assembly option, we get segmented yellow text for each section of code.

The real highlight is in the 3rd image, where you see it cmp with 555 and, if equal, je (jump if equal) to the long routine that writes to the native array. The majority of the long preparation in the 1st image is the safety check code of NativeArray shown in the 2nd image. (That should disappear in a real build, right?)

NativeArrayFullSOA version

You can see we arrive at the cmp with 555 much earlier! This is what I thought it would be.

  • It prepares the data to be iterated in r14 from the start, and also the loop iterator rbx. (smart!)
  • It compares against the length field (+8 from the beginning of NativeArrayFullSOA, it seems) to see if we are not at the end yet… (a test that does not trigger the jle)
  • rcx acts as an anchor for the starting point; rbx, which is i in the loop code, is multiplied by 4 to directly get my .laneIndex for comparison! That’s the power of SOA, since the next .laneIndex is right at the next multiple of 4. An inc‘d rbx can get the next one right away. Now you know why the 4-byte limitation is there.

That assembly was generated from code where I want the 1st field in the struct, serializedLaneIndex. It is actually accessed through the property laneIndex, but properties do not count in the data layout, and the StructLayout attribute seems to not matter.
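
That is, something like this:

```csharp
using System.Runtime.InteropServices;

// Properties take no space; only fields decide the stream layout.
// And at least in my test, StructLayout didn't change what Burst produced.
[StructLayout(LayoutKind.Sequential)]
public struct BigData
{
    public int serializedLaneIndex;     // 1st field -> 1st stream
    public int f01, f02, f03;           // ...more 4-byte fields...

    public int laneIndex => serializedLaneIndex; // no bytes in the layout
}
```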

What if I rearrange it?

Things changed a bit with the introduction of lea, load effective address. Some preparation is needed to get rcx to the correct “stream”, and then we can go SOA from that point. Note that if the cmp fails, after the inc it jumps right back to the next cmp without doing lea again. So yeah, it is still smart when your wanted field is elsewhere.

Back to the NativeArray version

Now we can understand why NativeArray is worse than NativeArrayFullSOA.

First of all, the preparation of rbx and eax is much longer. So many offsets, r15 r14 r13 r12, are all needed to access data from a NativeArray! And you see from label .LB_02 onwards, I don’t know why accessing the data has to take that long, but it spans one more image.

Until we finally can compare with 555; if that fails, we are back to .LB_02 again.

NativeArrayChunked8

This rearranges things too, but now in chunks of 32 bytes each. Each field must still be 4 bytes in size.

Imagine your Data is now a big 100-byte struct. (So 25 fields, since each field must take 4 bytes.) If you use NativeArrayChunked8<Data> and you have 5 elements of Data in it, you get (5 * 100)/32 = 15.625, rounded up to 16 chunks of 32 bytes each. Each chunk can then hold 8 fields. (32/4 = 8) Now you know where that "8" in the name came from.

When you use the indexer on this thing, it calculates the correct chunk, then the correct offset within the chunk (+4 bytes per offset step) to get you to the correct field.
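
Based purely on that description (there is no documentation, so this is a guess at the arithmetic, not the verified internals):

```csharp
// My guess at the indexer arithmetic of NativeArrayChunked8, derived
// only from the behavior described above: data lives in 32-byte chunks,
// each holding eight 4-byte fields.
public static class Chunked8Math
{
    const int ChunkSize      = 32;
    const int BytesPerField  = 4;
    const int FieldsPerChunk = ChunkSize / BytesPerField; // = 8

    public static int ByteOffset(int element, int fieldIndex, int fieldCount)
    {
        int slot   = element * fieldCount + fieldIndex;       // overall 4-byte slot
        int chunk  = slot / FieldsPerChunk;                   // which 32-byte chunk
        int inside = (slot % FieldsPerChunk) * BytesPerField; // offset inside it
        return chunk * ChunkSize + inside;
    }
}
```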

Now what? Isn't this achieving the same purpose as the full SOA one? Because both are in the Experimental namespace and there is no documentation, I can only guess, but 32 bytes is the magic number. You see, the cache line is usually 64 bytes per fetch from RAM to the CPU cache.

I can see that if you use NativeArrayChunked8, you might very consistently allocate cache-line-aligned memory no matter the size of your struct. When getting data, you are then more likely to start at exactly the beginning of the data you want.

I didn't check Burst assembly on this one.

It could be interesting if we could make an ECS chunk reorder itself to SOA and then lock the chunk to optimize some routine. But for now this move is limited to these 2 hand-allocated containers, full SOA and chunked 8. (Which you can still throw into an IJob to do something, even outside the ECS database system.)

The relevant talk

This talk is super fun! I recommend watching it even if you are not an optimization geek. He talks very simply.

Also 

AOS and SOA - Wikipedia (en.wikipedia.org)
Structure of arrays (or SoA) is a layout separating elements of a record (or 'struct' in the C programming language)…