A database “is a structured collection of data. Card indices, printed catalogues of archaeological artefacts and telephone directories are all examples of databases”.
We are going to examine databases that are prevalent in computer systems and go through the core differences between Local and Distributed Databases.
A local database is an isolated database that is stored on a single computer and does not share its data outside of itself or its direct access network.
Examples of a local database can be any of the following:
- Preference/Settings data local to an application, or system permissions.
- A local registry used by the Operating System.
- An isolated database such as MySQL, MSSQL or MongoDB that is not mirrored on a network.
A distributed database is stored across multiple computers that are networked together to act as a single logical database.
A distributed database is often used in one or more of the following scenarios:
- When the database must handle high throughput.
- When using Slave Nodes to maintain incremental backups of uncorrupted data.
- When data mirroring is required.
- For data redundancy if a computer system crashes or becomes inaccessible.
- General fault tolerance.
If a database is bundled with a locally installed application on a single computer in order to save Preferences/Settings and the like, then it is usually ideal to store this data in either a Flat File or a SQLite local database. These database types don't require connection strings or much of the more complex setup needed elsewhere, and they are well suited to storing smaller amounts of data.
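As a minimal sketch of the SQLite approach, the standard library's `sqlite3` module can store preferences in a single local file with no server or connection string (the file name, table and key names here are illustrative):

```python
import sqlite3

# Open (or create) a local database file; no server or connection string needed.
conn = sqlite3.connect("preferences.db")
conn.execute("CREATE TABLE IF NOT EXISTS preferences (key TEXT PRIMARY KEY, value TEXT)")

def set_pref(key, value):
    # Upsert: replace the row if the key already exists.
    conn.execute("INSERT OR REPLACE INTO preferences (key, value) VALUES (?, ?)",
                 (key, value))
    conn.commit()

def get_pref(key, default=None):
    row = conn.execute("SELECT value FROM preferences WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default

set_pref("theme", "dark")
print(get_pref("theme"))  # dark
```

Because the whole database lives in one file on disk, it fits the "local, isolated" category described above.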
There are three distinguishable types of database.
- SQL – often referred to as Relational Databases.
Some examples are MySQL, MSSQL, PostgreSQL, SQLite.
The power of SQL databases lies in joining tables together so that data is not duplicated. Data can be stored in multiple tables and joined together later using primary keys, foreign keys and constraints.
This strength can also be a weakness, as performing joins on large datasets can be very time-consuming and can slow down the entire database.
- NoSQL – often interpreted as Not Only SQL.
Some examples are MongoDB, SimpleDB, Cassandra.
The power of NoSQL is that there are no joins as in relational databases, which can bring dramatic speed improvements. However, to make up for the lack of joins, data is often replicated across records, and it can prove difficult to keep numerous copies of the same data up to date.
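The trade-off can be illustrated with plain dictionaries shaped like NoSQL documents (field names are illustrative): reads need no join because the customer data is embedded, but every embedded copy must be updated when it changes.

```python
# In a document store, the customer's name is embedded (duplicated) in every
# order document so that reads need no join.
orders = [
    {"_id": 10, "item": "Keyboard", "customer": {"id": 1, "name": "Alice"}},
    {"_id": 11, "item": "Mouse",    "customer": {"id": 1, "name": "Alice"}},
]

# Reading is a single lookup -- no join required.
print(orders[0]["customer"]["name"])  # Alice

# But renaming the customer means touching every replicated copy,
# which is the consistency cost described above.
def rename_customer(docs, customer_id, new_name):
    for doc in docs:
        if doc["customer"]["id"] == customer_id:
            doc["customer"]["name"] = new_name

rename_customer(orders, 1, "Alicia")
print([d["customer"]["name"] for d in orders])  # ['Alicia', 'Alicia']
```

Miss one copy during such an update and two "sources of truth" silently disagree, which is exactly the difficulty noted above.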
- Flat File – Anything from a text file to a locally stored and re-interpreted Tuple or JSON structure.
This is very useful for applications that need to store and retrieve configuration details. It is not recommended for anything with a high volume of reads/writes, or where multiple clients access the same data: flat files have no built-in locking around read/write operations, so data can easily become corrupted and incorrect revisions can easily be written to disk.
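A minimal sketch of the flat-file approach for configuration data, using JSON and a write-to-temp-then-rename step to reduce the corruption risk just mentioned (the file name is illustrative):

```python
import json
import os
import tempfile

CONFIG_PATH = "config.json"  # illustrative path

def save_config(config):
    # Write to a temporary file first, then rename over the target.
    # os.replace is atomic on POSIX, so a crash mid-write cannot leave
    # a half-written config.json behind.
    fd, tmp_path = tempfile.mkstemp(dir=".", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f, indent=2)
    os.replace(tmp_path, CONFIG_PATH)

def load_config():
    with open(CONFIG_PATH) as f:
        return json.load(f)

save_config({"theme": "dark", "font_size": 12})
print(load_config())  # {'theme': 'dark', 'font_size': 12}
```

Note that this only protects a single writer against crashes; it does not coordinate multiple concurrent clients, which is why flat files remain unsuitable for multi-client workloads.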
For anything that needs to store, retrieve and search (index) large amounts of data, it is best to set up a dedicated database system and scale as required.
There are two main ways to scale a database system.
- Vertical Scaling (Often the most expensive way to scale)
This is the most commonly used method of database scaling and is done by adding more system resources (CPU, RAM, disk) to a single machine.
In order to Vertically Scale, one would need to make sure that the following criteria are met:
– RAM: Should be as large as the database on disk, or at minimum larger than the total size of all indexes.
– Disk space: Should be larger than the total database size plus the total size of all indexes, with additional room for expansion.
– Swap space: Should be double the size of the system RAM, to allow swapping to take place should the RAM become overloaded.
– Network: Upload and download throughput should be high enough not to be maxed out when clients are requesting data and responses are sent.
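The sizing rules above can be sketched as a quick capacity check; all figures and the headroom value are illustrative, and real sizing should be measured against the actual working set.

```python
# Rough capacity check for the vertical-scaling rules of thumb above.
# All figures are illustrative and in gigabytes.

def check_vertical_capacity(ram_gb, disk_gb, swap_gb,
                            db_size_gb, index_size_gb, growth_headroom_gb=50):
    checks = {
        # RAM should cover the database on disk, or at least all indexes.
        "ram": ram_gb >= db_size_gb or ram_gb > index_size_gb,
        # Disk should hold data plus indexes with room to grow.
        "disk": disk_gb > db_size_gb + index_size_gb + growth_headroom_gb,
        # Swap at double RAM, per the rule of thumb above.
        "swap": swap_gb >= 2 * ram_gb,
    }
    return checks

result = check_vertical_capacity(ram_gb=64, disk_gb=500, swap_gb=128,
                                 db_size_gb=40, index_size_gb=12)
print(result)  # {'ram': True, 'disk': True, 'swap': True}
```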
- Horizontal Scaling (Can be the cheapest way to scale)
This can be the most cost-effective way to scale a database, as expensive machines are not required. It is also the more fault-tolerant way of scaling, because there is no reliance on any single piece of hardware: if a single machine stops operating correctly, it can be removed from the cluster automatically and the other machines take over from it.
In order to set this up correctly, one would need to make sure that:
– DNS and/or Load Balancing is set up to distribute load across the multiple machines in a cluster.
– Machines are set up as either Master or Slave nodes, so that they do not all overwrite each other's data.
In order to distribute load between multiple machines, traffic needs to be forked off in different directions. This is possible through either DNS or Load Balancing.
DNS (Domain Name System) can handle this by means of techniques such as round robin, which rotates through the list of child nodes, diverting each new request to the next node in turn.
A potentially better and more technical way of achieving this is to use dedicated Load Balancing technology, or to set up a High Availability reverse proxy server that checks for "healthy" child nodes and diverts traffic to whichever machine has the most resources currently available to handle the incoming request(s).
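The two strategies can be contrasted with a small sketch; the hostnames, health flags and load figures here are hypothetical placeholders for what a real health check would report.

```python
# Illustrative cluster state: each node reports health and current load (0..1).
nodes = [
    {"host": "db1.internal", "healthy": True,  "load": 0.72},
    {"host": "db2.internal", "healthy": True,  "load": 0.31},
    {"host": "db3.internal", "healthy": False, "load": 0.05},  # failed health check
]

def round_robin(nodes):
    """DNS-style rotation: cycle through nodes regardless of health or load."""
    i = 0
    while True:
        yield nodes[i % len(nodes)]
        i += 1

def pick_least_loaded(nodes):
    """Load-balancer style: consider only healthy nodes; lowest load wins."""
    healthy = [n for n in nodes if n["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy child nodes available")
    return min(healthy, key=lambda n: n["load"])

print(pick_least_loaded(nodes)["host"])  # db2.internal
```

Note that plain round robin would still hand requests to the unhealthy `db3.internal`, which is exactly why health-checked load balancing is described above as the potentially better option.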
Let’s look at a few scenarios:
If all computers are localised in a single building, then a simple setup would be a single master with a slave for replication and read prioritisation.
If the computers are spread across a country, a collection of countries or the world, then having a single master per major geographical region, with multiple slaves in each region for additional load, would be advisable.
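The regional setup can be sketched as a simple routing table that sends writes to the regional master and rotates reads across that region's slaves (all hostnames here are hypothetical):

```python
import itertools

# Illustrative topology: one master per major region, several slaves for reads.
topology = {
    "eu": {"master": "db-master.eu.internal",
           "slaves": ["db-ro-1.eu.internal", "db-ro-2.eu.internal"]},
    "us": {"master": "db-master.us.internal",
           "slaves": ["db-ro-1.us.internal"]},
}

# One rotating iterator of slaves per region, so reads are spread evenly.
_read_cycles = {region: itertools.cycle(c["slaves"])
                for region, c in topology.items()}

def route(region, operation):
    """Send writes to the regional master; rotate reads across regional slaves."""
    if operation == "write":
        return topology[region]["master"]
    return next(_read_cycles[region])

print(route("eu", "write"))  # db-master.eu.internal
print(route("eu", "read"))   # db-ro-1.eu.internal
```

Keeping reads within the client's own region avoids cross-continent latency, while each regional master replicates writes out to its slaves.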
Database mirroring, which is sometimes referred to as database replication, “is the frequent electronic copying of data from a database in one computer or server to a database in another so that all users share the same level of information. The result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others”.
Having database mirroring or replication setup is highly advantageous to any system or organisation that values its data and would suffer should the data become invalid or corrupt.