Generating an id is a usual task when coding. Depending on the project, these ids can be internal to a component, shared inside a process, shared between several process, or even shared between several systems. A usual use-case in embedded development is generating ids to expose objects via IPC or RPC. This article details a solution to this task on linux, based on a distributed algorithm.
Let’s consider that you are developing a service that needs to expose some objects to clients, and methods can be called on these objects. You probably need to generate some ids to expose each object to the clients. Without thinking a lot about this simple task it will most likely ends up like this :
uint32_t generateId(void) { static uint32_t counter = 0; return counter++; }
This is simple and efficient but has several potential issues. The first one is that it leads to undefined behavior when counter has reached UINT32_MAX. In the - rare - situations where the author was aware about this, the last line will contain a comment (@todo : handle wrap around) to help debugging when the issue will occur. Some people may also consider that this cannot happen. This may be the case depending on the quantity of ids that will ever be generated, but you cannot say when the issue will occur.
The second potential issue is that if this generator is used in several processes, then there will be collisions. If the generated ids are used in different namespaces then is is not an issue. But if you want to put several of these ids in a common list, then it does not work.
Finally the generated ids are very easy to guess. This is usually not an issue if no security is involved, but it can promote poor design choices when handling these ids. With such ids you will surprisingly see after some time (months or years) that some developers saw that these ids are increasing numbers and use them in unexpected ways : Using them as C array indices or stealing ids that they should not use.
Ideally such a generator would generate non obvious ids, that can be shared on the system without collision.
Existing Solutions
The problem of generating unique ids is also usual on distributed systems and it has many existing solutions. There are 2 ways to implement an id generator:
- As a centralized service.
- As a decentralized service.
In a centralized service, one system is handling all the requests of all other systems. This make the generation easy to implement but causes scalability issues on big systems. In a decentralized service, there is no unique point where the ids are generated. Each system can generate unique ids independently from the over ones. Several public algorithms are available, among which simple flake and flake. They both are - more or less - derivatives of the snowflake generator that was developed by twitter.
These solutions and principles can be easily translated to a single system : replace the systems with processes and you end up with the same issue at the scale of a single system. There is however one important difference : All the aforementioned algorithms can lead to collisions. This was not an issue for their intended use-cases because collisions are detected when inserting the associated objects in a data base : The ids are the primary keys of the tables, so the database detects collisions during the insertion. When generating an IPC or RPC id to expose an object, there is no easy way to detect such a collision and fix it by re-generating a new id.
The centralized vs distributed design choice remains important but for another reason than scalability. Using a daemon to serve ids means than an IPC must be used, which adds complexity to the code. A distributed algorithm allows each client to directly use the generator function, embedded in a shared library.
A Solution : Sysflake
So let’s continue the analogy between distributed systems and our single system. The distributed algorithms for distributed systems mentioned before use 3 information to build an id:
- a timestamp : the time since a chosen origin.
- a machine id : the mac address of the network card.
- a random number.
We can use a very similar model, but with few adaptations. There is no need for a machine id since there is only one machine. However we have multiple processes and threads. So this information can be replaced with the process id and thread id. The fact is that on linux the processes and the threads share the same space of ids, so we can simply use the thread id (or the process id in a single threaded process). The random number is an issue because it can lead to collisions. The random number can be replaced with a counter like on the sample code on the start of the article (but taking care of overflows).
These information are concatenated to generate 64 bits ids in a distributed way (distributed in term of process). This generator is not just a counter, and ensures no collision. Here is the structure that I chose :
The timestamp is the uptime of the system, in milli-seconds. This allows to generate ids for (1<<37))/1000/60/60/24/365 = 4.3 years.
The thread id is encoded with 17 bits because this is the default value of /proc/sys/kernel/pid_max on a 64 bits linux kernel. on a 32 bits kernel this value is 2^15.
Finally there are 10 bits left to encode the ids as a 64 bits integer. They are used as a counter to ensure no collision happens when generating several ids in a thread within the same millisecond.
So this sysflake algorithm can generate 1024 unique ids per thread per milli-second for more than 4 years. If for some reason one thread tries to generate more than 1024 id within a milli-second (remember that it is intended for an embedded system), this can be detected (the returned flake is 0) and another id can be generated during the next milli-second.
The code
Sysflake is implemented in C and is available on github. The algorithm is distributed and the implementation uses thread local variables to avoid the need of any locking on multi-threaded processes.