One of the most important, and most misunderstood, parts of working with a computer is the backup, that is, the process of backing up data.
As most people know, a backup is usually a complete copy of the important data on a computer. At a minimum, a backup must have a source and a destination. This may seem self-evident, but both deserve a closer look.
The source of a backup must be all the data that one works with. If the data is lost, the backup should contain everything needed to continue one’s work. While this might seem straightforward, there are cases when the files needed are scattered across the computer’s disk(s).
Take an accounting program as an example: one might simply select its folder (under Program Files) and create the backup. Usually, though, the program saves its working data in other folders, such as Application Data, so the backup ends up incomplete. This quick scenario shows why choosing the right source for the backup matters.
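To make the point concrete, here is a minimal Python sketch of a backup source defined as a list of folders rather than a single one. The function name `collect_source_files` and the idea of silently skipping missing folders are assumptions made for illustration; they are not taken from any particular backup program.

```python
from pathlib import Path


def collect_source_files(roots):
    """Return a sorted list of every regular file found under the given folders."""
    files = []
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            # Assumption: a missing folder is skipped silently.
            # A real tool would more likely warn the user about it.
            continue
        # rglob("*") walks the folder recursively; keep regular files only.
        files.extend(p for p in root.rglob("*") if p.is_file())
    return sorted(files)
```

For the accounting example, the list passed in would name both the program’s folder and the folder where it keeps its working data, so that neither is left out of the backup.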
Now suppose the user wants to back up the above-mentioned program every week. Forcing the user to select the same data over and over again would be counterproductive. Even if the backup source is easy to select, the user is likely to forget to run the backup, and there are plenty of reasons why that might happen: the user is in a hurry, the backup doesn’t seem so important after working a few extra hours, the power goes out, and so on. The solution is obvious: the backup program should allow its backups to be scheduled, which makes the process both easier and safer for the user.
Most backup programs offer the possibility of scheduling backups to run on a specific day of the week, or at specified intervals, reducing user interaction to a minimum, while ensuring that the backup is created.
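As an illustration of the "specific day of the week" case, the following Python sketch computes when a weekly backup should run next. The function name `next_weekly_run` and its parameters are hypothetical; an actual backup program would normally hand this job to its own scheduler or to the operating system’s.

```python
from datetime import datetime, time, timedelta


def next_weekly_run(now, weekday, at):
    """Return the first datetime strictly after `now` that falls on
    `weekday` (0 = Monday ... 6 = Sunday) at the clock time `at`."""
    # Start from today's date at the requested time of day.
    candidate = datetime.combine(now.date(), at)
    # Move forward to the requested weekday (0 days if it is today).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment has already passed, the next run is a week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, with `now` set to Monday 1 January 2024 at 12:00, a schedule of Friday (`weekday=4`) at 18:00 gives Friday 5 January at 18:00, while a schedule of Monday at 09:00 gives the following Monday, 8 January, because that morning has already passed.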
All backup programs (that I know of) support the creation of successive backups, each one overwriting the previous version. This type of backup goes by the name of replace backup (some call it a normal backup). This way, the backup destination always contains (only!) the most recent backup and, in most cases, the best data for a restore operation. As the saying goes, “A backup is just as important as its age: the newer, the better” (although this is not true in every case). While this method of creating a backup is undoubtedly among the easiest to program, it has its disadvantages.
One of the main problems with this approach is that every run of the backup takes about as long as the first one. If the backup’s source contains a lot of data, the operation can consume a lot of time and resources, and it can easily disturb the user’s work if it runs during work hours.
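A replace backup of this kind can be sketched in a few lines of Python. This is only a minimal illustration using the standard `shutil` module; the function name `replace_backup` is an assumption, and a real program would add error handling.

```python
import shutil
from pathlib import Path


def replace_backup(source, destination):
    """Overwrite `destination` with a complete, fresh copy of `source`."""
    source, destination = Path(source), Path(destination)
    if destination.exists():
        # Discard the previous backup entirely. Note that if the copy
        # below fails halfway, no complete backup is left behind.
        shutil.rmtree(destination)
    # Copy every file again, whether or not it changed since the last run.
    shutil.copytree(source, destination)
```

The sketch shows why the approach is easy to program, and also why every run costs about the same: nothing in it looks at what has actually changed.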
What if the backup source contains thousands of files and the user created or modified only one hundred (or fewer) files between two backup executions?
The little graphic above (note that its data is merely an example and does not reflect the performance of any specific backup program) shows that a full backup takes about the same time on every run, because this approach does not take into account how many files are new or changed since the first backup. It simply takes the whole source and copies it to the destination. Apart from the time each new backup needs to complete, this type of backup also uses far more resources than necessary. Obviously we need all the data backed up, but we want the process to be more efficient: faster, making better use of the available resources, and consuming less space for the backup.
A solution to this problem would be the following:
- create the first backup as a full backup
- on every later run of the backup, check the source against the destination and copy only the new or modified files
Checking the source against the destination can be done using an index file, which makes the operation a lot faster than verifying every byte of every file in the backup’s source and destination.
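The two steps above, together with the index file, can be sketched in Python as follows. Everything specific here is an assumption made for the example: the function name `incremental_backup`, the JSON index stored as `index.json`, the `files` subfolder, and the choice of file size plus modification time as the signature of a file. Real backup programs may record different information in their index.

```python
import json
import shutil
from pathlib import Path


def incremental_backup(source, destination):
    """Copy only new or modified files; return their relative paths."""
    source, destination = Path(source), Path(destination)
    files_dir = destination / "files"        # hypothetical layout
    index_path = destination / "index.json"  # hypothetical index file
    files_dir.mkdir(parents=True, exist_ok=True)

    # The index maps each relative path to the [size, mtime] seen at the
    # last backup. On the first run it is empty, so everything is copied.
    index = json.loads(index_path.read_text()) if index_path.exists() else {}

    copied = []
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(source).as_posix()
        stat = path.stat()
        signature = [stat.st_size, stat.st_mtime_ns]
        if index.get(rel) != signature:
            # New or modified since the last run: copy it and record it.
            target = files_dir / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            index[rel] = signature
            copied.append(rel)

    index_path.write_text(json.dumps(index, indent=2))
    return copied
```

The first call behaves like a full backup and returns every file; a second call with an unchanged source returns an empty list. Comparing sizes and timestamps is a shortcut and can miss a change that preserves both, which is the price paid for not reading every byte. The sketch also never removes files from the destination when they are deleted from the source.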
This way, only the first backup will take a long time and a lot of processing power. The next backups will only add information to the first backup, as necessary. This type of backup is usually called, somewhat inappropriately, an incremental backup.
Now we can rest assured, for we have a backup to use in case anything goes wrong – and the operation itself will not disturb anyone, especially if it’s scheduled to run at a convenient time.