sakana

very short memo

back up files

Here is another approach to creating backup files on Dropbox on a regular basis with cron. It is quite a simple script, but here are some remarks.

hostname

There are multiple approaches to obtaining the hostname; here I use socket.gethostname(), which appears to be a thin wrapper around the gethostname() system call.
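A minimal check of what socket.gethostname() returns:

```python
import socket

# hostname as reported by the gethostname() system call wrapper
host = socket.gethostname()
print(host)
```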

compression

Here I use bzip2 as the compression algorithm, which tends to achieve a better compression ratio on files like plain text than algorithms such as gzip. The trade-off for the better ratio is that compression takes more time.
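A quick sketch comparing compressed sizes of the two algorithms on a synthetic text sample; actual ratios and timings depend on the data, so this only illustrates how to measure, not which wins:

```python
import bz2
import gzip

# repetitive plain text, standing in for log files and the like
text = b"the quick brown fox jumps over the lazy dog\n" * 1000

bz2_size = len(bz2.compress(text))
gzip_size = len(gzip.compress(text))
print(bz2_size, gzip_size, len(text))
```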

temporary file

I first create the backup file under a temporary directory and then move it into the directory under Dropbox only if there is any change, because creating the backup file directly under the Dropbox directory is quite slow due to sync traffic.
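The script below hardcodes "/tmp"; a more portable way to pick the temporary directory is tempfile.gettempdir(). A sketch (the file name here is hypothetical):

```python
import os
import tempfile

# portable alternative to hardcoding "/tmp"
temp_dir = tempfile.gettempdir()
temp_file = os.path.join(temp_dir, "example_backup.tar.bz2")
print(temp_file)
```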

checksum

Finally I compare the temporary backup file with the existing one by the MD5 checksum of each. If they do not match, some file(s) may have changed, so the new backup replaces the old one. If they are identical, nothing under the target directory has changed, so the script does nothing.
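The comparison can also be done by reading each archive in chunks, so a large backup file need not be loaded entirely into memory. A sketch, where md5_of is a hypothetical helper, not part of the script below:

```python
import hashlib

def md5_of(path):
    """Return the MD5 hex digest of a file, read in binary mode."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # read in fixed-size chunks instead of slurping the whole file
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```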

import datetime
import hashlib
import os
import os.path
import shutil
import socket
import tarfile

if __name__ == "__main__":

    # base directory, which is one level upper than target directory
    base      = "<BASE_DIRECTORY_OF_YOUR_CHOICE>"
    # backup target directory under base directory
    target    = "<BACK_UP_TARGET_DIRECTORY>"
    # directories to exclude from backup file
    exclude   = ["<DIRECTORY_TO_EXCLUDE>"]

    # backup file name / HOSTNAME_DAY.tar.bz2
    backup    = socket.gethostname() + "_" +\
                datetime.date.today().strftime("%A") +\
                ".tar.bz2"
    # temporary directory & temporary file
    temp_dir  = "/tmp"
    temp_file = os.path.join(temp_dir, backup)
    # destination directory
    dest_dir  = os.path.join(base, "Dropbox/<TARGET_DIRECTORY>")
    dest_file = os.path.join(dest_dir, backup)

    # let's start backup

    # first create a backup file under the temporary directory
    tar = tarfile.open(temp_file, "w:bz2")

    os.chdir(base)
    for root, dirs, files in os.walk(target):
        # skip any path that contains an excluded directory
        if any(d in exclude for d in root.split(os.sep)):
            continue
        for name in files:
            tar.add(os.path.join(root, name))

    tar.close()

    # backup file creation has finished
    # now determine if file replacement is required or not
    if os.path.exists(dest_file):
        # dictionary to store md5 checksum of each file
        md5 = {}
        for f in temp_file, dest_file:
            # open in binary mode: the archive is binary data
            with open(f, "rb") as file_to_check:
                data = file_to_check.read()
                md5[f] = hashlib.md5(data).hexdigest()
        # compare checksum value of each file
        # copy temporary file under destination directory
        if md5[temp_file] != md5[dest_file]:
            shutil.copy(temp_file, dest_dir)
    else:
        # no backup of the same name exists yet, so just copy
        shutil.copy(temp_file, dest_dir)

If you would like to take a backup, say, every hour, add a line like the following to your cron table (edit it with crontab -e; crontab -l shows the result).

$ crontab -l
0 * * * * python <PATH_TO_SCRIPT>/backup.py