Batched operations on EntityManager

Jobs aren't always the fastest for every operation. EntityManager operations mostly have to cause structural changes on the main thread. Let's see how we could speed this up.

Latest update : Entities 0.14 | 19/08/2020 (https://docs.unity3d.com/Packages/com.unity.entities@0.14)

Don't schedule a job just to queue EntityCommandBuffer

It is true that worker threads are great, but the mistake here is thinking that EntityCommandBuffer does the work inside the job. It may feel like you are doing things in jobs efficiently; however, you are just delaying them for later.

EntityCommandBuffer is for remembering what to do and then doing it later, simply because those operations are not possible in a job. Unity's "foolproof" C# Job System requires structural changes to happen on the main thread to ensure safety.

When "played back", it is as if EntityManager performed DestroyEntity one by one on the main thread. The entities are not being destroyed in a job. (Pay attention to the words "command buffer": the commands are not being executed just yet.)

No, EntityCommandBuffer.ParallelWriter won't help you much either; you are just remembering what to do in parallel. (And ParallelWriter blocks the other threads on write when they clash.)

The purpose of EntityCommandBuffer is to defer EntityManager commands, not to speed them up. Therefore, scheduling a job just to add/remove/destroy entities via EntityCommandBuffer is a complete waste of time. You are better off doing it without the job.

However, if you are already in a job that is working on something else and you find yourself in just the right if branch to decide which command to issue, that is the right spot to queue up commands on an EntityCommandBuffer.
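
For example, here is a minimal sketch of that "right spot". The Health component, the threshold logic, and the system itself are hypothetical; the point is that the real work is the health tick, and the destroy command just rides along:

using Unity.Entities;

public struct Health : IComponentData { public float Value; }

public class DamageSystem : SystemBase
{
    EndSimulationEntityCommandBufferSystem ecbs;

    protected override void OnCreate()
    {
        ecbs = World.GetOrCreateSystem<EndSimulationEntityCommandBufferSystem>();
    }

    protected override void OnUpdate()
    {
        float dt = Time.DeltaTime;
        var ecb = ecbs.CreateCommandBuffer().AsParallelWriter();

        //The job's real work is ticking down Health. Queuing the destroy
        //command is a bonus on top of work already worth jobifying.
        Entities.ForEach((Entity e, int entityInQueryIndex, ref Health health) =>
        {
            health.Value -= dt;
            if (health.Value <= 0f)
            {
                ecb.DestroyEntity(entityInQueryIndex, e);
            }
        }).ScheduleParallel();

        ecbs.AddJobHandleForProducer(Dependency);
    }
}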

Just use EntityManager in the main thread

Even iterating and destroying with EntityManager directly, naively, will be faster, because you didn't schedule a job!
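
For contrast, a sketch of the naive main-thread route (deadQuery here is a hypothetical EntityQuery matching the entities you want gone):

using (var entities = deadQuery.ToEntityArray(Allocator.TempJob))
{
    //Naive: one structural change per entity, but zero scheduling overhead.
    for (int i = 0; i < entities.Length; i++)
    {
        EntityManager.DestroyEntity(entities[i]);
    }
    //Even better: the batched overloads described in the next section.
    //EntityManager.DestroyEntity(entities);
}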

Use the NativeArray overloads

The strength of these methods is that you can put whatever Entity you like in the NativeArray<Entity>: one you allocated yourself with handpicked entities in it, one from ToEntityArray of an EntityQuery, or a NativeSlice subset of a NativeArray<Entity>.

//Add-remove types on handpicked entities.
//Adding carries no data; components start at the default value of that type.
AddComponent(NativeArray<Entity> entities, ComponentType componentType)
AddComponent<T>(NativeArray<Entity> entities)

RemoveComponent(NativeArray<Entity> entities, ComponentType componentType)
RemoveComponent<T>(NativeArray<Entity> entities)

//Mass-create. One fills the input array's length, the other returns a new array based on your count.
CreateEntity(EntityArchetype archetype, NativeArray<Entity> entities)
CreateEntity(EntityArchetype archetype, int entityCount, Allocator allocator)

//Like the create variants but clones components. Has built-in behavior
//for some special components such as Prefab and LinkedEntityGroup.
Instantiate(Entity srcEntity, NativeArray<Entity> outputEntities)
Instantiate(Entity srcEntity, int instanceCount, Allocator allocator)
Instantiate(NativeArray<Entity> srcEntities, NativeArray<Entity> outputEntities)

//Dumbly copies everything, even Prefab or LinkedEntityGroup.
CopyEntities(NativeArray<Entity> srcEntities, NativeArray<Entity> outputEntities)

DestroyEntity(NativeArray<Entity> entities)

The insides of these methods are Bursted; don't underestimate them!
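
Here is a minimal sketch of feeding these overloads; the Enemy and Frozen components are hypothetical:

using Unity.Collections;
using Unity.Entities;

public struct Enemy : IComponentData { }
public struct Frozen : IComponentData { }

//...inside a system's OnUpdate:
EntityQuery enemyQuery = GetEntityQuery(ComponentType.ReadOnly<Enemy>());
using (NativeArray<Entity> enemies = enemyQuery.ToEntityArray(Allocator.TempJob))
{
    //One batched call instead of enemies.Length structural changes.
    EntityManager.AddComponent<Frozen>(enemies);

    //A handpicked subset works too, e.g. just the first half.
    var firstHalf = new NativeArray<Entity>(enemies.Length / 2, Allocator.Temp);
    NativeArray<Entity>.Copy(enemies, firstHalf, firstHalf.Length);
    EntityManager.RemoveComponent<Frozen>(firstHalf);
    firstHalf.Dispose();
}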

Analysis : CreateEntity

The characteristics of this method are simple since there is no Entity input. It starts by finding existing chunks of the target archetype with some space left, works on that contiguous space, and deducts it from the creation count you asked for. If no existing chunk has space, you pay for allocating a fresh new chunk. Therefore performance depends on your creation count relative to the archetype's chunk capacity, plus how many existing chunks have room.
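
A sketch of that cost model with illustrative numbers, using Translation/Rotation from Unity.Transforms and assuming this archetype fits about 100 entities per chunk:

EntityArchetype archetype = EntityManager.CreateArchetype(
    typeof(Translation), typeof(Rotation));

//Roughly: first fill whatever space remains in existing chunks of this
//archetype, then pay ceil(remaining / capacity) fresh chunk allocations.
using (var created = EntityManager.CreateEntity(archetype, 1050, Allocator.Temp))
{
    //With capacity ~100 and no partially filled chunks: 11 new chunks.
}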

Analysis : Add/RemoveComponent

If you can picture ECS memory in your head: unlike setting a component, adding/removing a component is never "just" that. It is a huge operation, because the entity's data has to move to a chunk with an entirely different layout, rearranged so that there is the right capacity behind every contiguous run of a data type. Picture the chunk: AAA _ _ BBB _ _ CCC _ _ . If you remove C, even though it came last, the layout now has to become something like AAA _ _ _ _ BBB _ _ _ _ . See that even BBB can't stay in its old location; those bytes are now the spare capacity for the A component.

  • If you add/remove a zero-sized component (a tag component), you save time since the new archetype is layout-compatible. If the destination chunk already exists, it is just a matter of moving a section of the memory block to a new place!
  • If the entity count is 10 or below, it loops in a plain for loop and handles them one by one on the main thread.
  • If over 10 entities, it enters the batching operation. As for why: batching is quite troublesome and overkill when the amount is low. You can input arbitrary, jumbled Entity values in the array, but the algorithm inside tries to figure out which chunk each one is in and at which position, then sorts them and outputs the work in terms of chunk-startIndex-count. The actual worker then runs faster the more contiguous the entities you want to add/remove are.
    Now at least you know how to speed this operation up: include entities that came from the same chunk and are as contiguous as possible (see the sketch after this list). Still, it is troublesome for the algorithm to accept a wild NativeArray<Entity>. "Just doing it" sequentially, without caring about chunks and whatnot, wins when entities number below 10, because the cost of the intelligent sorting outweighs its benefit there.
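
A sketch of that contiguity advice, reusing the hypothetical enemyQuery and Frozen from before:

//ToEntityArray walks the query chunk by chunk, so the array is already in
//chunk order and the batching pass can emit a few large
//chunk-startIndex-count ranges.
using (var ordered = enemyQuery.ToEntityArray(Allocator.TempJob))
{
    EntityManager.AddComponent<Frozen>(ordered);
}
//The same entities shuffled would still be correct, but the internal sort
//would see many tiny fragments and lose most of the batching win.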

Analysis : Instantiate/Copy

Instantiate(Entity srcEntity, NativeArray<Entity> outputEntities)
Instantiate(Entity srcEntity, int instanceCount, Allocator allocator)
Instantiate(NativeArray<Entity> srcEntities, NativeArray<Entity> outputEntities)

The futuristic version of MonoBehaviour instantiation. The hope of mankind. All three variants use the same innards; so does the version that instantiates a single entity, but that one doesn't fit this article's topic.

Increasing instanceCount is not entirely O(n) on the expensive part; it can work on multiple instances at the same time, up to a chunk's boundary. For example, if your chunk capacity is 100, instantiating 1200 entities has to loop the allocate-and-clone 12 times (not 1 time, and not 1200 times). That's some of the magic of contiguous memory!
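
In code, with a hypothetical prefab entity:

//With a chunk capacity of 100, this clones in 12 chunk-sized strides
//rather than 1200 per-entity copies.
using (var clones = EntityManager.Instantiate(prefab, 1200, Allocator.Temp))
{
    //clones holds the 1200 new entities, already laid out chunk by chunk.
}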

Use ExclusiveEntityTransaction

But if you want to destroy (or do other EntityManager operations) in a job for real, then use ExclusiveEntityTransaction.

ExclusiveEntityTransaction is like an inverse of what normally occurs. Normally, to do things to EntityManager, we have to "come back" to the main thread for a moment (at an EntityCommandBufferSystem, in a ComponentSystem, etc.).

With ExclusiveEntityTransaction, we can "lock the EntityManager" for one thread to work on and prevent the main thread from using it. At the same time, the main thread of that world can go on and do other things that do not touch the EntityManager.

The main thread of another world can touch its own EntityManager, though! Remember that EntityManager is a singleton per World, not per Unity instance. It manages the entities of one world.

So the use of ExclusiveEntityTransaction is heavily geared towards having multiple worlds. It renders one World nearly unusable (its EntityManager becomes busy in-job), but your other worlds can still carry on with their own EntityManager and the remaining worker threads. Now you see something that only multiple worlds can achieve! Worker threads steal work automatically and are a shared resource for all Worlds.

But to make multiple worlds useful to each other requires more careful planning of how they communicate.
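
Here is a minimal sketch of the locking dance. The stagingWorld is hypothetical, and I am assuming the batched DestroyEntity overload on the transaction, which mirrors EntityManager's:

using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;

struct DestroyAllJob : IJob
{
    public ExclusiveEntityTransaction Transaction;
    [DeallocateOnJobCompletion] public NativeArray<Entity> Entities;

    public void Execute()
    {
        //Structural change in a job, legal because the world is locked.
        Transaction.DestroyEntity(Entities);
    }
}

//...on the main thread:
EntityManager stagingManager = stagingWorld.EntityManager;
NativeArray<Entity> doomed =
    stagingManager.UniversalQuery.ToEntityArray(Allocator.TempJob);

ExclusiveEntityTransaction transaction =
    stagingManager.BeginExclusiveEntityTransaction();
JobHandle handle =
    new DestroyAllJob { Transaction = transaction, Entities = doomed }.Schedule();
stagingManager.ExclusiveEntityTransactionDependency = handle;

//Later (e.g. next frame), give the manager back to the main thread;
//this completes the registered dependency first.
stagingManager.EndExclusiveEntityTransaction();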

Cross-world entity operations

Overcome the limitations of EntityManager with multiple worlds. Unlock full power with ExclusiveEntityTransaction. What's left is the final glue: moving entities from one EntityManager to another (that is, from one World to another).

MoveEntitiesFrom

public void MoveEntitiesFrom(EntityManager srcEntities)
public void MoveEntitiesFrom(out NativeArray<Entity> output, EntityManager srcEntities)
public void MoveEntitiesFrom(out NativeArray<Entity> output, EntityManager srcEntities, NativeArray<EntityRemapUtility.EntityRemapInfo> entityRemapping)
public void MoveEntitiesFrom(EntityManager srcEntities, NativeArray<EntityRemapUtility.EntityRemapInfo> entityRemapping)

public void MoveEntitiesFrom(EntityManager srcEntities, EntityQuery filter)
public void MoveEntitiesFrom(EntityManager srcEntities, EntityQuery filter, NativeArray<EntityRemapUtility.EntityRemapInfo> entityRemapping)
public void MoveEntitiesFrom(out NativeArray<Entity> output, EntityManager srcEntities, EntityQuery filter)

That's a ton of overloads...

Moving entities from another world will not retain their Index and Version, since each World's EntityManager manages its own increasing indexes. It would be disastrous if we moved worlds and an incoming Entity randomly overrode an Entity already occupying the same index.

Therefore this operation performs a remapping. Once you know that, it is clear that the first 4 overloads just ask: do you want the remapping result in output? And do you want to provide your own "remapping workspace" in entityRemapping, or let the method allocate one and destroy it completely inside? (You may reuse the workspace array if you have a lot of moves to do.)

The remaining 3 overloads allow you to move only some entities, not all. The "some" is selected by choosing relevant chunks with an EntityQuery. Remember that the EntityQuery must be created from the world that owns the things you want to move! An EntityQuery is not interchangeable between worlds.

From a rough reading, these methods run a job that cuts off the queried chunk pointers (or everything, if there is no EntityQuery) and hands them to the second EntityManager. No copying or anything. This way you can "get rid" of entities quickly, or just set things up for ExclusiveEntityTransaction. This is why moving has a huge advantage over copying.

Still, don't overestimate it: the entities are technically new in the destination even though the method is called "move", and they need some reservation there. "Entities are just indexes", yes, but they need the right place for their data that the index can point to.

Also, the filtered (EntityQuery) version is not just filtered in the entity selection phase. There are many more subtle details in moving that all need to be filtered as well. The deepest code paths of the all-entities version and the filtered version are completely different, even though the code looks similar! They are heavily jobified, and the jobs without a filter are certainly leaner.
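
A minimal sketch of a filtered move, with a hypothetical stagingWorld and ReadyTag component:

public struct ReadyTag : IComponentData { }

//The EntityQuery must come from the SOURCE world's EntityManager.
EntityManager srcManager = stagingWorld.EntityManager;
EntityQuery readyQuery = srcManager.CreateEntityQuery(typeof(ReadyTag));

//Matched chunks change ownership wholesale; the moved entities receive
//remapped Index/Version values valid in the destination world.
World.DefaultGameObjectInjectionWorld.EntityManager
    .MoveEntitiesFrom(srcManager, readyQuery);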

CopyEntitiesFrom

public void CopyEntitiesFrom(EntityManager srcEntityManager, NativeArray<Entity> srcEntities, NativeArray<Entity> outputEntities = default)
public void CopyAndReplaceEntitiesFrom(EntityManager srcEntityManager)

Not to be confused with CopyEntities, which operates within the same manager.

The 1st overload... actually calls that same-manager CopyEntities first. The copied entities are then tagged with an IsolateCopiedEntities component, and the rest you can probably deduce: it reuses the MoveEntitiesFrom code path with an EntityQuery (one that looks for the IsolateCopiedEntities component).

Remember how the "move" versions could be considered "copies", since the entities are created anew and then the component data is moved over? This 2nd overload, CopyAndReplaceEntitiesFrom, is an on-steroids version. Since "replace", unlike "move", no longer has to care about remapping, the operation can retain everything, down to Entity identity, from the source world. The code comment says it could be used for deterministic rollback: we can easily back up and restore an entire world. It's quite a special method we have here.
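
A rough sketch of that rollback idea, with a hypothetical backupWorld reserved as the snapshot:

//Snapshot: the backup world becomes a stand-in for the live world,
//retaining everything down to Entity Index/Version.
backupWorld.EntityManager.CopyAndReplaceEntitiesFrom(liveWorld.EntityManager);

//...simulate, and when a rollback is needed...

//Restore: overwrite the live world with the snapshot.
liveWorld.EntityManager.CopyAndReplaceEntitiesFrom(backupWorld.EntityManager);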

Use the EntityQuery overload of EntityManager

Any time you use EntityManager with an EntityQuery instead of an Entity (or an array of them), you are doing the operation to everything in a whole chunk at the same time instead of to each entity one by one.

Remember that an EntityQuery can return multiple chunks, if you have so many entities that they span several chunks in one EntityQuery. (I like to think of the type as if it were named "ChunkQuery".)

Here are all of them you can use.

//Add just the type without data; everything starts at the default value of that struct.
//If it is a tag component, there is no need to reshape the chunk's memory arrangement.
AddComponent(EntityQuery entityQuery, ComponentType componentType)
AddComponent<T>(EntityQuery entityQuery)

//Add with tailor-made data for each entity matched in the EQ.
//Requires extreme accuracy about which data goes to which target entity!
//Length mismatch will throw.
AddComponentData<T>(EntityQuery entityQuery, NativeArray<T> componentArray) where T : struct, IComponentData

//If removing tag component, no need to reshape the chunk's memory arrangement.
RemoveComponent(EntityQuery entityQuery, ComponentType componentType)
RemoveComponent<T>(EntityQuery entityQuery)

//You can add/remove multiple types at once too.
AddComponent(EntityQuery entityQuery, ComponentTypes types)
RemoveComponent(EntityQuery entityQuery, ComponentTypes types)

//It is natural that a chunk component operation can be done to multiple chunks at the same time.
AddChunkComponentData<T>(EntityQuery entityQuery, T componentData) where T : unmanaged, IComponentData
RemoveChunkComponentData<T>(EntityQuery entityQuery)

//Swaps SCD value for the whole chunk without any data movement.
//Similar performance to tag component add-remove.
AddSharedComponentData<T>(EntityQuery entityQuery, T componentData) where T : struct, ISharedComponentData

//Very good one as it just throw the matched chunk away in a big unit.
DestroyEntity(EntityQuery entityQuery)

//There is a NativeArray<Entity> overload but no EntityQuery overload for CreateEntity, for a very logical reason.

Versus the NativeArray<Entity> overloads? These are much better, since things like add and remove really happen on the whole chunk, and not a single entity is moving. (The data inside each entity will still need to be moved around, however, if the component you add/remove is not a tag, i.e. it has a size.)

As for AddSharedComponentData, why is there no NativeArray version? Because this kind of component actually lives on the chunk, not on any individual entity. Having an EntityQuery version is natural. If a NativeArray version existed, it would likely be no better than looping over the array and doing it one by one.

AddChunkComponentData and RemoveChunkComponentData are obviously chunk things, so it is only logical to have an EntityQuery version and no NativeArray<Entity> version.
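
A minimal sketch of the query-based calls, reusing the hypothetical Enemy and Frozen types:

EntityQuery enemyQuery = GetEntityQuery(ComponentType.ReadOnly<Enemy>());

//Tags every matched chunk at once; no per-entity loop on your side.
EntityManager.AddComponent<Frozen>(enemyQuery);

//...later: throws all matched chunks away in big units.
EntityManager.DestroyEntity(enemyQuery);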

Analysis

You can read the analysis of the NativeArray versions above: at the very end of each, they want to work in terms of an array of "chunk + start index + count". Now, though, there is no preprocessing required to get the workload into that shape, since an EntityQuery already gets you chunks! The start index is 0. The count is the whole chunk. Therefore, I think it is quite safe to assume the EntityQuery versions are faster.

To illustrate the difference: both the EntityQuery and NativeArray<Entity> versions of DestroyEntity end up using the same DestroyBatch in EntityComponentStoreCreateDestroyEntities.cs. However, that method wants 3 things: which chunk, a start index, and a count. Coming from an EntityQuery, of course, it is a whole chunk starting at 0 and running to the end. Easy! In the case of NativeArray<Entity>, you may have input completely arbitrary entities, not in the same chunk, or maybe only some of them in the same chunk. There is code that tries to figure out contiguous entities in the array so it can form chunk-start-count batches. (Now you know how to at least use the NativeArray<Entity> version fast, if you must.) The point is that operating in chunks should be faster in most cases; it is just too rough for some surgical operations.

Read more about why tag components have better performance in some of these operations here : https://gametorrahod.com/tag-component/.

Together with ISharedComponentData filter

It even works when the EntityQuery has been .SetSharedComponentFilter-ed (formerly .SetFilter) too. So you can do selective mass operations based on your SharedComponentData or on changed-version filter criteria. Just don't forget to .ResetFilter when you want the query to go back to normal.

As an example, say I have 10000 things to show, but only a subset (1000) of them is visible+processed at any given time (governed by a Process tag component), and this subset moves forward from 0, 1000, 2000, 3000, ... until the end.

So for each block of 1000 entities I add a Group ISharedComponentData with an integer 1~10. When it is time to remove Process from the previous 1000 entities and add it to the next 1000 at once, I can achieve that with 1 RemoveComponent and 1 AddComponent on a filtered EntityQuery (set the filter, do the remove, set the new filter, do the add, reset the filter if needed) instead of 2000 iterations.
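
A sketch of that rotation. The Group shared component and Process tag are the hypothetical types described above; previousGroup and nextGroup are whichever block indexes you are rotating between:

public struct Group : ISharedComponentData { public int Value; }
public struct Process : IComponentData { }

//...inside a system:
EntityQuery groupQuery = GetEntityQuery(ComponentType.ReadOnly<Group>());

//Remove Process from the previous block of 1000...
groupQuery.SetSharedComponentFilter(new Group { Value = previousGroup });
EntityManager.RemoveComponent<Process>(groupQuery);

//...and add it to the next block. Two batched calls total.
groupQuery.SetSharedComponentFilter(new Group { Value = nextGroup });
EntityManager.AddComponent<Process>(groupQuery);

groupQuery.ResetFilter();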

Main article about SCD and its filter ability : https://gametorrahod.com/everything-about-isharedcomponentdata#filtering

EntityCommandBuffer with EntityQuery

You can plan a deferred multi-chunk operation. The required EntityQuery is a class, so this is only for out-of-job EntityCommandBuffer use.

EntityManager with EntityQuery is already fast. But remember that it creates a sync point right in the middle of the frame that completes all jobs. If you don't need the operation's result right now, it is 100% better to play it back later, still in chunks. Nothing's better than a chunk-based operation that also doesn't disturb running jobs. A sync point completes ALL jobs (not just the jobs related to the operation that caused the sync, unfortunately). Therefore, if you decide to sync anyway, you are troubling all the other systems too.

  • You don't need the result at all this frame, because it signifies the completed work of this system or a cleanup of this system. A typical scenario is a system that works on newcomer entities once, then tags them so they are not worked on in the next round (it has ComponentType.Exclude of that tag). You know this is the only system that cares about this tag, therefore you should make your EntityCommandBuffer from BeginInitializationEntityCommandBufferSystem. This target is the best considering the sync point position; in other words, make it effective next frame. It is also common for a system that "consumes" some message entities to make them disappear by destroying them, throwing chunks away. You can queue that destroy command to the begin-init target.
  • If you kind of need it for various systems you haven't thought carefully about yet, think now whether those systems could be classified into PresentationSystemGroup or not. If they could, then target your playback at EndSimulationEntityCommandBufferSystem. Keep in mind that if you do this, no job can survive the border from the simulation to the presentation update flow. The best position for a sync point is always going to be the beginning of the next frame.
    (There was an EndPresentationEntityCommandBufferSystem available at some point in time, but since it is the same in meaning as BeginInitializationEntityCommandBufferSystem, Unity removed it.)
  • The last resort is to introduce your own EntityCommandBufferSystem. Usually you do this when you want to put [UpdateAfter(typeof(YourECBS))] on systems in the simulation group, because you can't afford to not have this data available for more systems to compute on (those systems may then use EndSimulationEntityCommandBufferSystem this time), finally producing something for the folks in the presentation group.

It is also usable inside Entities.ForEach or Job.WithCode that ends with .Run(), including the .WithStructuralChanges() variety.

Deferred query timing

But beware that a deferred EntityQuery command considers the query at playback time, not at enqueue time. (The command really remembers the query, not the query's result.) For example: you just used ForEach with WithStoreEntityQueryInField to take the query out, for use with an EntityCommandBuffer just below it that targets begin-init. You expect the things you just worked on in the ForEach above to receive the command. But that may not be the case if, by the time the next frame arrives, the entities have received more changes and no longer match the query!

Here's a "tag flashing" pattern I use. Usually, when you tag entities as a message so a later system does something once, the message receiver has to remove the tag to signify the message has been received. If the tag is designed to be consumed by multiple systems, it becomes a hassle to decide whose responsibility it is to remove it, and that reduces modularity. Regardless, the EntityQuery overload is ideal for this kind of pattern, since tagging the chunk is cheap.

Another tactic is to have the tagger clean it up the next frame, so the message only works for one round. That is also a bit of a hassle. Instead, I can leave the removal work to ECB playback: have the tagger enqueue the removal right at the line where I tag, targeting begin-init to minimize the job completion impact. The message then lasts until the beginning of the next frame and "automatically disappears". It is now impossible to forget removing the tag, and everyone later than this system in the same frame is ensured to receive the message exactly once.

//Imagine a deserializer system that produces fresh entities.
//All entities just deserialized receive the Deserialized tag.

//But there is more work to be done, and it could bloat this
//deserializer system. Instead, I have more systems that UpdateAfter
//this one to add their own "post-deserialize" tasks. Therefore, I
//would like them to be able to just `RequireForUpdate` the
//`PostProcessNeeded` tag.

EntityManager.AddComponent<PostProcessNeeded>(deserializedQuery);
var ecb = ecbs.CreateCommandBuffer();
ecb.RemoveComponent<PostProcessNeeded>(deserializedQuery);

...

EntityManager.RemoveComponent<Deserialized>(deserializedQuery);

It sounds like a neat idea to "flash" the PostProcessNeeded tag onto everything I just worked on in deserializedQuery, except that I forgot I remove the Deserialized component that makes the query match right after. At the beginning of the next frame, the tag isn't cleaned up, since the query no longer matches anything. The tag stays, and systems keep working on PostProcessNeeded every frame.

Either use the non-EntityQuery version, so the command remembers each individual Entity at enqueue time (they are then free to change archetype as they like; the command no longer cares, but it is no longer a batched command). Or also put the Deserialized removal on the ECB with the EntityQuery overload, so the query still matches when the commands play back.
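
A sketch of the second fix, keeping everything batched (same hypothetical Deserialized, PostProcessNeeded, ecbs, and deserializedQuery as above):

EntityManager.AddComponent<PostProcessNeeded>(deserializedQuery);

var ecb = ecbs.CreateCommandBuffer();
//Playback order follows enqueue order: the flash tag is removed while
//Deserialized still makes the query match...
ecb.RemoveComponent<PostProcessNeeded>(deserializedQuery);
//...and only then is Deserialized itself removed.
ecb.RemoveComponent<Deserialized>(deserializedQuery);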

Where is the batched set?

So far, everything deals with component type addition or removal. If it is an add, everything starts at the default value. That's understandable, since how could the API know which individual Entity wants which value of your new component? It's true that none of the Set variants of EntityManager have an EntityQuery overload.

The closest is the AddComponentData<T>(EntityQuery entityQuery, NativeArray<T> componentArray) overload, which can carry data (though it is not a set, it is an add). But its implementation just gets the entity array of that query, then iterates, setting each datum at the same index. You'd think there must be a more "batch" way to do this.

These methods are available on EntityQuery :

public void CopyFromComponentDataArray<T>(NativeArray<T> componentDataArray)
public void CopyFromComponentDataArrayAsync<T>(NativeArray<T> componentDataArray, out JobHandle jobhandle)

Typical usage of EntityQuery.CopyFromComponentDataArray

This is a method on EntityQuery instead of EntityManager that can potentially do the equivalent of a batched set.

It seems to be designed symmetrically with ToComponentDataArray, so that a NativeArray of equal length that was copied out can be applied back with modifications. The apply-back is quite efficient, because the API schedules a parallel unsafe job for you. Each job copies a run of linear memory bluntly across the boundaries of each Entity. The only thing it can't cross into is another chunk, which the other threads are probably working on in parallel.

But this apply-back is also quite brittle. Remember that the ECS database may be segmented into multiple chunks, while a NativeArray is linear. It works thanks to the entityOffset each chunk receives in the IJobChunk inside CopyFromComponentDataArray, which can be gymnastic-ed to map perfectly onto the linearized NativeArray. That's how it knows which value goes to which Entity without needing a dictionary-style NativeHashMap<Entity, T> to apply things back. If you just made the array with ToComponentDataArray and apply it back before any chunk changes, you are automatically good to go.
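
A minimal round-trip sketch, reusing the hypothetical Health component:

EntityQuery healthQuery = GetEntityQuery(typeof(Health));

using (var healths = healthQuery.ToComponentDataArray<Health>(Allocator.TempJob))
{
    //Modify the linear copy however you like...
    for (int i = 0; i < healths.Length; i++)
    {
        healths[i] = new Health { Value = 100f };
    }
    //...then apply back; indices still line up because no chunk changed.
    healthQuery.CopyFromComponentDataArray(healths);
}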

Advanced usage of EntityQuery.CopyFromComponentDataArray

Now, however, suppose you want to use CopyFromComponentDataArray without a prior ToComponentDataArray. First of all, you will have to maintain a NativeArray with the same length as the linearized entities of that EntityQuery.

Of course, you could do one ToComponentDataArray to kickstart this, then proceed to fill it with the same repeating value for a fast reset, or with different values as you like. But its length is only guaranteed up to date at this moment. For this solution to work, you must be very disciplined that no new entity of this archetype gets added (CopyFromComponentDataArray throws on length mismatch). Done correctly, the NativeArray of components is your portal to batch-setting multiple components efficiently. You must know which element in the array goes to which Entity, though, to perform this gymnastic.

I think this is quite dangerous, because even if the length matches, if entities switch places through whatever phenomenon (like Entity index reuse with a new Version), the copy-back will pass but may not land on the Entity you are expecting.

Bonus : memory deserialization

There is still one more way, faster than anything else, and it is only possible because of the beauty of data-oriented design. Unity can serialize chunk memory as-is to a file. If we deserialize this file and put the memory back, you instantly get everything back, in a way that even the EntityManager would be confused about what just happened. It's almost like grabbing my brain out and putting it into another person later, where it instantly functions perfectly as me. This direct memory-loading approach is also the basis of the ECS SubScene system.

ECS serialization is currently not documented much, maybe because it is still not stable, but you can access it from SerializeUtility. For now, I haven't had time to benchmark this against the previous best performer, CopyFromComponentDataArray, but to have memory to deserialize in the first place, the values must be known before runtime. It may be less flexible.
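
A rough sketch of the round trip through SerializeUtility. Treat this as an assumption-laden outline: the StreamBinaryWriter/StreamBinaryReader helpers and exact signatures may differ between Entities versions, and deserialization wants a pristine, empty world plus an exclusive transaction:

using Unity.Entities;
using Unity.Entities.Serialization;

//Save: dump the whole world's chunk memory to disk.
using (var writer = new StreamBinaryWriter("world.snapshot"))
{
    SerializeUtility.SerializeWorld(sourceWorld.EntityManager, writer);
}

//Load: must target a freshly created, empty world.
var loadedWorld = new World("Loaded");
using (var reader = new StreamBinaryReader("world.snapshot"))
{
    var transaction = loadedWorld.EntityManager.BeginExclusiveEntityTransaction();
    SerializeUtility.DeserializeWorld(transaction, reader);
    loadedWorld.EntityManager.EndExclusiveEntityTransaction();
}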