And at that point, you are in for quite a bit of complexity.
are you going to persist the data? The easiest way to do it – just creating a file for every value in the database – is going to be problematic on most systems. So you need to put a lot of data in a small set of files. Which means that you have to decide how you are going to put the data together.
In general, there is the fixed size option in which you divide the file(s) into pages and work around that. The good thing about this is that it gives you the ability to reuse space in the file after deletes / updates. The bad thing is that it is quite complex. Alternatively, you can just write the data out as needed, but then you can’t really update written data and would need to run compactions.
And we haven’t talked about searching yet. Some DBs, like Bitcask / Munin, would actually store the keys in memory, and store the position on the disk for retrieving the value. But for the most part, both keys and values tend to be on disk in some form. In CouchDB, they are held inside an append only B+Tree. In LevelDB, they are held in Sorted String Tables. LMDB uses Copy-On-Write B+Tree. Esent uses a B+Tree with a Log file.
In each of those cases, the actual semantics for persistent data involve at least two concerns. You need to actually be able to search the data (that usually mean more than just O(1) access. You want to be able to go back and forth on the keys) and you need to be able to do a transactional save. This is so you can recover in case of a crash, most of the time.
But there is actually a lot more that goes into the selection of the proper persistence format. To start with, how you store the data on disk will have a big effect on your performance. If you store the data as a linked list, for example, you might as well kiss your performance goodbye. Beyond that, we also have issues with things like how is the format going to scale when we have concurrent readers. For example, if you have something that does a lot of seeks, and relies on the seeks always going forward to ensure performance, that is going to be hit hard the moment that you have concurrent readers doing concurrent reads on different parts of the system. You would be forcing the system to do random seeks.
There are other considerations, as well. For example, if you are using something like B+Tree, it is likely that you’ll be overwriting the same section on the disk multiple times. That is fine with HDD, but SSD would just give up and die on you at some point. And we haven’t started talking about secondary indexes yet…