System Backup
Contents
- Overview
- Installation
- Setup
- Backup
- Cleanup
- Scripted Backups
- Scheduled Backups
- Manual Recovery
- Backend Portability
- References
Overview
There are a million and one ways to configure backups, and everything is situationally dependent. For our purposes - as a laptop user, cloud server administrator, or simply a FOSS advocate - we have these challenges and needs:
- Random uptime, laptops suspend/hibernate frequently
- Slow upload bandwidth on home DSL-style networks or spotty wifi
- Avoid vendor lock-in on both the client and server side
- Portability on the client and server side between distributions
- Reduce server administration - use a cloud files/storage provider
- Include standardized encryption algorithms in the backups
- Ability to tune backups granularly on a per-case basis
- Ability to verify backups and restore, even if it's a manual action (worst case)
Given these principles, the solution being implemented will use:
- fcron for running missed crons upon wakeup
- GnuPG for security/encryption
- duplicity as the main engine
- duply to make using duplicity easier
- boto (python-boto) for bucket storage:
- Google Cloud Storage
- Amazon S3 or Glacier
More specifically, this article uses Arch Linux and Google Cloud Storage with Durable Reduced Availability Storage to keep costs trivial (truly - under $5/mo, see the pricing) by excluding things like Downloads and Music. There are better ways to back up data like that; what you're really after are the things that matter (pictures, documents, configs, etc.) that can't be replaced.
All software is available in every distro (within reason); just replace the Arch pacman installs with yum/apt-get/emerge/etc. as appropriate. You may wish to consider using a custom PPA (ppa:duplicity-team/ppa) or similar source if your distribution's mainline version is too old.
Installation
These actions are typically performed as root or with sudo - adjust as needed.
fcron
Like anacron, fcron assumes the computer is not always running and, unlike anacron, it can schedule events at intervals shorter than a single day, which is useful for systems that suspend/hibernate regularly. For an always-on system like a cloud server, there's no real need to replace the standard cron package.
When replacing cronie with fcron be aware the spool directory is /var/spool/fcron and the fcrontab command is used instead of crontab to edit the user crontabs. These crontabs are stored in a binary format with the text version next to them as foo.orig in the spool directory. Any scripts which manually edit user crontabs may need to be adjusted due to this difference in behavior.
A quick scriptlet which will replace cronie and convert traditional user crontabs to fcron format:
systemctl stop cronie; systemctl disable cronie
pacman -Sy; pacman -S fcron
cd /var/spool/cron && (
    for ctab in *; do
        fcrontab "${ctab}" -u "${ctab}"
    done
)
systemctl start fcron; systemctl enable fcron
duplicity / boto / gnupg / duply
Duplicity has minimal dependencies for a Python application - there's no need to pip install a lot of extra modules. Many other backends are available, but we are only focusing on GCS and/or S3 here. If a duply package is not available in your distribution, it's a single bash script - visit the website and grab a copy, then place it somewhere handy (such as ~/bin/duply) as needed.
Duplicity 0.6.22 or newer is required for Google Cloud Storage
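After installation (see below), you can confirm the version in use meets this requirement:
duplicity --version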
Arch
These are in the default repositories - most folks have gnupg already installed:
pacman -Sy; pacman -S duplicity python2-boto gnupg --noconfirm
Duply is currently in AUR - many folks use pacaur (which uses cower for the heavy lifting):
pacaur -S duply
CentOS
These are split between the main repositories and EPEL repo:
It's common for the CentOS/EPEL repositories to be behind in versions - I recommend downloading the SRPMs from EPEL and rebuilding new RPMs using the latest versions of the duplicity, duply and python-boto packages.
yum -y install duplicity python-boto duply gnupg
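If you do rebuild from the SRPMs, here is a rough sketch of the process - it assumes the yum-utils and rpm-build packages are available and the EPEL source repository is reachable; exact package file names will differ:
yum -y install yum-utils rpm-build
yumdownloader --source duplicity duply python-boto
yum-builddep -y duplicity-*.src.rpm
rpmbuild --rebuild duplicity-*.src.rpm
yum -y localinstall ~/rpmbuild/RPMS/*/duplicity-*.rpm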
gsutil
While not strictly necessary, having gsutil installed and ready makes dealing with unexpected issues easy; for example, while working on your excludes you may try several times until you get them perfect -- gsutil makes it easy to delete objects en masse or list bucket contents. Highly recommended - use a tools/ subdirectory for things like this:
[ ! -d ~/tools ] && mkdir ~/tools; cd ~/tools
wget https://storage.googleapis.com/pub/gsutil.tar.gz
tar -zxf gsutil.tar.gz; rm gsutil.tar.gz
# Arch default is python v3, gsutil needs python v2
sed -i.orig '1 s/python/python2/g' gsutil/gsutil
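A couple of gsutil invocations that tend to come in handy while iterating on your excludes, once your credentials and bucket are set up in the Setup section below (the bucket name here matches the example used later):
cd ~/tools/gsutil
./gsutil ls gs://mylaptop/
./gsutil rm "gs://mylaptop/**"    # delete everything in the bucket - use with care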
Setup
These actions are all performed as yourself, not root, for a typical backup of a home directory. This is where your usage pattern of a Linux system comes into play -- I personally have a ~/System/ directory where I copy any config change made outside my home directory (i.e. at the system level). Keep your home directory on a separate partition or Logical Volume and encrypt it with LUKS.
This makes your home directory the single source of truth and keeps copying it to a new machine, a fresh laptop install, or even a move from one distro to another very contained. It also makes backups trivial - back up your home directory; the rest is disposable and more easily re-installed than restored.
This same method can apply to a cloud server -- implement a methodology so that your routine database dumps, git trees and so forth are all parented under a single higher directory instead of scattered about the filesystem. When making edits to system-level files such as /etc/httpd/conf/httpd.conf, /etc/php.ini, /etc/my.cnf, cron tasks and so on, copy them (or just use symlinks) over to a collected location under one tree for easier backup and restore. This has the side effect of making your whole infrastructure portable to another server with little work.
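As a rough illustration of that idea (the paths are only examples), populating the collected tree might look like:
mkdir -p ~/System/etc
cp /etc/httpd/conf/httpd.conf /etc/php.ini /etc/my.cnf ~/System/etc/
crontab -l > ~/System/etc/crontab.txt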
Google Cloud Storage
This section is always subject to change as it depends on the Google web links in question - they frequently update and shift things around. So as an overview, our mission is to:
- Enable Google Cloud Storage for your Google account
- Set up a Project and attach billing (credit card) to it
- Enable Interoperable Access and generate Storage Access Keys
- Configure gsutil for random operations
- Set up a Bucket for each backup (i.e. laptop)
The first four are one-time only; the last is repeated for each backupset you configure with duply/duplicity later. The setup generated by this section is then easily copied to another laptop for use in a second backup, etc.
Enable GCS
Log into your Google account, then visit https://cloud.google.com/products/cloud-storage/ and click the "Get Started" or "Go to my console" (or similar) link usually at the top and follow any instructions to get the basics set up. You may have to agree to Terms of Service and all that jazz - do the needful.
Create a Project
A Project is the higher-level umbrella under which you will be billed for usage; as of this writing the URL is
https://console.developers.google.com/project - click Create Project and choose a meta-level name like Backups (not a specific laptop name). A PROJECT ID will be displayed; jot that down in your notepad for use later.
On the left of Google's console, click Billing to connect the new Project to your credit card -- follow the instructions as appropriate.
Interoperable Access
This part can be the most confusing, as Google's webUI seems to be in flux a lot and the exact links change. As of this writing, the way to access the area:
- Click into your Project from the console
- On the left, click Storage then Cloud Storage
- Click Project Dashboard which opens a new tab
You're in a different UI at this point - on the left should be Google Cloud Storage with two sub-menus, Storage Access and Interoperable Access. Enable Interoperable Access, then generate new Keys and jot down both parts of the Key in the notepad (one is secret).
Configure gsutil
Use the gsutil tool to generate a default configuration:
cd ~/tools/gsutil
./gsutil config -a
This creates the ~/.boto file, which has a lot of comments. You need to insert the Project ID and Google keypair from the above steps. With the comments stripped out, here are the required portions - replace AAA, BBB and ZZZ with your data:
[Credentials]
gs_access_key_id = AAAAAAAAAAAAAAAAAAAA
gs_secret_access_key = BBBBBBBBBBBBBBBBBBBB
[Boto]
https_validate_certificates = True
[GSUtil]
content_language = en
default_api_version = 1
default_project_id = ZZZZZZZZZZZZ
[OAuth2]
Create a Bucket
A bucket is where the encrypted tarballs will actually be stored, so you want a Bucket for each backupset (system) you'll configure. A good choice might be the short name of the machine; "mylaptop" will be used here as the example. We'll enable Durable Reduced Availability on the bucket to save money as well. The gsutil tool can perform many actions - just run ./gsutil --help and investigate.
cd ~/tools/gsutil
./gsutil mb -c DRA gs://mylaptop/
The bucket namespace is global to all users, so you may get an error if the name chosen is already in use.
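To double-check the bucket was created with the intended storage class, list its metadata:
cd ~/tools/gsutil
./gsutil ls -L -b gs://mylaptop/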
GnuPG
Generate a standard key specifically for use with your backups; because the passphrase will be stored in plaintext in the duply config in your home directory, create a new key with a unique password rather than reusing one of your existing keys. This keypair can then be copied to your other systems so backups are encrypted with a common key.
Creating a key
Create a standard GPG key:
GnuPG 2.0 and earlier
echo "pinentry-program /usr/bin/pinentry-curses" >> ~/.gnupg/gpg-agent.conf
GPG_AGENT_INFO=""; gpg --gen-key
GnuPG 2.1 and later
echo "allow-loopback-pinentry" >> ~/.gnupg/gpg-agent.conf
gpg --gen-key
Creating a key requires entropy to be generated by the system. If this is a virtual instance (i.e. a VirtualBox guest), consider installing rng-tools and starting the rngd daemon to provide the required entropy. Using the haveged daemon is an alternate option to rngd as well.
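If you'd rather generate the key unattended (on a headless server, for instance), GnuPG's batch mode can be used instead of the interactive prompts. A minimal sketch - the name, email and passphrase are placeholders to replace:
cat > /tmp/duply-key.batch <<'EOF'
%echo Generating duply backup key
Key-Type: RSA
Key-Length: 2048
Subkey-Type: RSA
Subkey-Length: 2048
Name-Real: duply
Name-Email: duply@localhost
Expire-Date: 0
Passphrase: YYYYYYYYY
%commit
EOF
gpg --batch --gen-key /tmp/duply-key.batch
shred -u /tmp/duply-key.batch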
Check the key is available:
$ gpg --list-keys QQQQQQQQ
pub 2048R/QQQQQQQQ 2014-07-18
uid [ultimate] duply <duply@localhost>
sub 2048R/RRRRRRRR 2014-07-18
$ gpg --list-secret-keys QQQQQQQQ
sec 2048R/QQQQQQQQ 2014-07-18
uid duply <duply@localhost>
ssb 2048R/RRRRRRRR 2014-07-18
Migrating a key
If you are already using a key on another system, it can be exported and imported so that all your backups upstream are encrypted with the same key. First, export the public and private keys on the source and copy them to the new system:
gpg --export -a QQQQQQQQ > duply_public.asc
gpg --export-secret-keys -a QQQQQQQQ > duply_secret.asc
scp duply*.asc user@remote:
On the new device, import the keys:
gpg --import duply_public.asc
gpg --import duply_secret.asc
Finally, edit the key and set trust to Ultimate:
gpg --edit-key QQQQQQQQ
Command> trust
[...]
5 = I trust ultimately
Your decision? 5
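The same trust can be set non-interactively by piping an ownertrust record into gpg - the full fingerprint is required rather than the short key ID (QQQQQQQQ is the placeholder key used throughout):
gpg --with-colons --fingerprint QQQQQQQQ | awk -F: '/^fpr:/ {print $10 ":6:"; exit}' | gpg --import-ownertrust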
Import Preferences
During import, if your secret key was created with a newer GPG than the one on the destination, you may get a warning about preferences for unavailable algorithms and an offer to fix them - choose Yes:
$ gpg --import duply_secret.asc
gpg: key QQQQQQQQ: secret key imported
gpg: key QQQQQQQQ: "duply <duply@localhost>" not changed
gpg: WARNING: key QQQQQQQQ contains preferences for unavailable algorithms on these user IDs:
gpg: "duply <duply@localhost>": preference for cipher algorithm 1
gpg: it is strongly suggested that you update your preferences and
gpg: re-distribute this key to avoid potential algorithm mismatch problems
Set preference list to:
Cipher: AES256, AES192, AES, CAST5, 3DES
Digest: SHA256, SHA1, SHA384, SHA512, SHA224
Compression: ZLIB, BZIP2, ZIP, Uncompressed
Features: MDC, Keyserver no-modify
Really update the preferences? (y/N) y
For the curious, this usually means the newer GPG supports a cipher or digest that the older one does not and must therefore drop from the key's preferences; for example, here the key from the newer GPG lists the IDEA cipher:
source GPG 2.0.26
gpg> showpref
[ultimate] (1). duply <duply@localhost>
Cipher: AES256, AES192, AES, CAST5, 3DES, IDEA
Digest: SHA256, SHA1, SHA384, SHA512, SHA224
Compression: ZLIB, BZIP2, ZIP, Uncompressed
Features: MDC, Keyserver no-modify
destination GPG 2.0.14
Command> showpref
[ultimate] (1). duply <duply@localhost>
Cipher: AES256, AES192, AES, CAST5, 3DES
Digest: SHA256, SHA1, SHA384, SHA512, SHA224
Compression: ZLIB, BZIP2, ZIP, Uncompressed
Features: MDC, Keyserver no-modify
Duply
Generate a default configuration for the backupset - we'll use the same name as the laptop and Bucket, "mylaptop":
duply mylaptop create
This creates two files that need to be edited:
~/.duply/mylaptop/conf
~/.duply/mylaptop/exclude
There are two other files that can be configured, pre and post, which run commands before and after a duply backup. These are not created by default, but they can come in handy if you need to mount/umount a filesystem, dump a database, etc. as part of the process. See the duply documentation for further info.
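As a hedged sketch of what a pre script could contain (the database name and target path are placeholders), something like this in ~/.duply/mylaptop/pre would dump a local MySQL database into the backed-up tree before each run:
#!/bin/bash
# Dump the database somewhere under the backup SOURCE so duplicity picks it up
mysqldump --single-transaction mydatabase > /home/CCCCCC/System/db/mydatabase.sql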
conf
Very similar to the gsutil setup, you'll need to configure the GCS credentials in this file for storing your backups, as well as all the other settings: what should be backed up, retention periods and so forth. This part is situationally dependent -- I choose to manage my Full backups manually since they take over 9 hours to upload and I need to disable Suspend on my laptop. Given that, my configuration looks like this (without all the comments):
The use of GPG_OPTS='--pinentry-mode loopback' is required for GnuPG 2.1 and later, along with the allow-loopback-pinentry setting in ~/.gnupg/gpg-agent.conf shown above. Failure to configure these will result in the passphrase not working in unattended mode.
With duply 1.10 and above, do not set TARGET_USER and TARGET_PASS in this config file - they are now configured elsewhere; see below.
GPG_KEY='XXXXXXXX'
GPG_PW='YYYYYYYYY'
GPG_OPTS='--pinentry-mode loopback'
TARGET='gs://mylaptop'
TARGET_USER='AAAAAAAAAAAAAAAAAAAA'
TARGET_PASS='BBBBBBBBBBBBBBBBBBBB'
SOURCE='/home/CCCCCC'
FILENAME='.duplicity-ignore'
DUPL_PARAMS="$DUPL_PARAMS --exclude-if-present '$FILENAME'"
MAX_AGE=2M
MAX_FULL_BACKUPS=2
...where you're obviously replacing AAA, BBB, CCC, XXX and YYY with your information as created above. This file should be mode 0600 so that only you can read it, as it contains both your GPG key password and GCS access keypair.
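Setting that mode is a one-liner:
chmod 600 ~/.duply/mylaptop/conf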
exclude
Configure the exclude file to ignore things you do not want in the backup - it uses globs (wildcards) to make things a little easier. As an average MATE desktop user with the typical applications, here is a basic exclude file that tends to work as a good starting point:
- /home/*/Downloads
- /home/*/Misc
- /home/*/Movies
- /home/*/Music
- /home/*/VirtualBox**
- /home/*/abs
- /home/*/builds
- /home/*/tools/android-sdk-linux
- /home/*/tools/jdk**
- /home/*/.ICEauthority
- /home/*/.Xauthority
- /home/*/.adobe
- /home/*/.android/cache
- /home/*/.cache
- /home/*/.cddb*
- /home/*/.config/**metadata*
- /home/*/.config/*/sessions
- /home/*/.config/*session*
- /home/*/.config/VirtualBox
- /home/*/.config/libreoffice
- /home/*/.config/pulse
- /home/*/.gstreamer*
- /home/*/.hplip
- /home/*/.icons
- /home/*/.java/deployment
- /home/*/.java/fonts
- /home/*/.local/share/gvfs-metadata
- /home/*/.local/share/icons
- /home/*/.macromedia
- /home/*/.mozilla/firefox/Crash**
- /home/*/.mozilla/firefox/*/storage
- /home/*/.purple/icons
- /home/*/.thumbnails
- /home/*/.thunderbird/**.msf
- /home/*/.thunderbird/Crash**
- /home/*/.xsession-errors*
Everyone will have a slight variation on this file, adjust as needed. It's a bit difficult to get globbing to work right with dot-files so I tend to just avoid that specific pattern usage.
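Separately from the exclude file, the FILENAME='.duplicity-ignore' setting in conf (passed to duplicity via --exclude-if-present) lets you skip any directory just by dropping an empty marker file into it, for example:
touch ~/Videos/.duplicity-ignore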
Backup
This part is dead simple - just run duply with the name of the backupset; it will detect that this is the first run and trigger a full backup. Be sure to use screen, disable suspend/hibernate, etc. if your backup is going to take a really long time:
duply mylaptop backup
With duply 1.10 and above, you must first export the environment variables with your Google Cloud Storage API credentials. As such, skip to the "Scripted Backups" section below to see what's needed.
You may wish to increase verbosity and/or add --dry-run via the conf file the first time to ensure what you think is happening is actually happening. The default conf generated has the options and instructions present to set those up. For normal usage just use the default verbosity level.
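The relevant conf entries for that first run look roughly like this (comment them back out once you're satisfied):
VERBOSITY=5
DUPL_PARAMS="$DUPL_PARAMS --dry-run"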
From this point, every time you run duply mylaptop backup it will detect the existing full backup and run incrementals instead; how much is transferred and how long it takes depends on your changeset each time. You can also force a full or incremental run by using full or incr instead of backup as the command. The bkp action will skip the pre/post files' execution.
Cleanup
Duply has several actions to help keep track of your backups - verify to show changed local files since the backup, status to list the upstream full and incremental statistics, and various forms of purge to flush older backups. Keep in mind that you cannot purge incrementals until a full backup supersedes them, so there's a bit of an art to knowing when you should generate a new full backup.
For a laptop it's probably sufficient to perform a full backup once a month (or less) and roll with the incrementals, unless you have massive amounts of change. For a scenario where you're backing up databases and such it might make sense to perform full backups weekly, since the size of your incrementals will grow rapidly and consume space. After a full backup, purge the incrementals older than that full backup to save space.
duply mylaptop status
duply mylaptop full
duply mylaptop purge
duply mylaptop purge --force
The purge actions will respect the settings in conf as outlined above. From time to time you may also wish to use the cleanup action, which will attempt to find orphaned backup bits, and the fetch action to randomly test restoring files from the online backups to ensure everything is working as intended.
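A hedged example of each - cleanup needs --force to actually delete anything it finds, and fetch takes a path relative to SOURCE, a local destination and an optional age (the file path here is just a placeholder):
duply mylaptop cleanup --force
duply mylaptop fetch Documents/resume.pdf /tmp/resume.pdf 1M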
Scripted Backups
Create a small shell script that will be run via fcron to perform the backup, check status and email the results to yourself from the saved logfile. I use a very basic script with mailx (the standard commandline mail):
#!/bin/bash
MAILTO="me@mydomain.com"
LOGDIR="/home/CCCCCC/.logs"
TIMESTAMP=$(date +%Y-%m-%d_%H%M)
MAILSUB="mylaptop backup report: ${TIMESTAMP}"
LOGFILE="${LOGDIR}/duply_${TIMESTAMP}.log"
# Duply 1.10+ requires ENV vars
export GS_ACCESS_KEY_ID='<my API user, the old TARGET_USER in duply>'
export GS_SECRET_ACCESS_KEY='<my secret key, the old TARGET_PASS in duply>'
echo "" >> ${LOGFILE}
duply mylaptop backup 1>>${LOGFILE} 2>&1
duply mylaptop status 1>>${LOGFILE} 2>&1
echo "" >> ${LOGFILE}
cat ${LOGFILE} | mail -s "${MAILSUB}" ${MAILTO}
find "${LOGDIR}" -type f -mtime +30 -delete
exit 0
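Save it as something like ~/bin/mylaptopduply.sh (matching the path scheduled below), make it executable and give it a test run before scheduling:
chmod +x ~/bin/mylaptopduply.sh
~/bin/mylaptopduply.sh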
Scheduled Backups
Insert the above script into your fcrontab using the %daily keyword; with 0 in the minutes field and * in the hours, fcron will run any missed job at the top of the next hour after the system comes back online:
%daily,mail(no) 0 * /home/CCCCCC/bin/mylaptopduply.sh
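To add the entry, edit your user fcrontab and then list it back to confirm:
fcrontab -e
fcrontab -l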
See the fcrontab(5) man page for more information.
Manual Recovery
The encrypted backup files are uploaded in chunks of 25M by default (configurable in the duply backupset's conf file); in the worst-case scenario these files can be downloaded using a web browser from GCS, then decrypted and untarred manually. The GPG key is still required, so be sure your ~/.gnupg keyring is backed up in some fashion outside your duply/duplicity backups. Assuming the GPG key used to encrypt is still available:
- Go to the GCS console in a web browser
- Click the Project, Storage, Cloud Storage, Storage Browser
- Click into the Bucket, find a file you think might have what you need
- Download the file to your local system
Once downloaded, decrypt it (you will be prompted for the GPG secret key password) and untar:
gpg -d duplicity-inc.20140802T010001Z.to.20140802T170542Z.vol1.difftar.gpg > recover.tar
tar -xf recover.tar
This process is really only for recovery in a disaster; as the files upstream are chunked tarballs it's a rather random process to know which backup-file might have the specific file you need. There are manifests and signatures upstream as well, so downloading those first and perusing might help.
gpg -d duplicity-inc.20140802T010001Z.to.20140802T170542Z.manifest.gpg > recover.manifest
less recover.manifest
It would definitely be quicker to download all the manifests first, decrypt them, then just grep them to find the target; you can then download the specific backup file in question. If your gsutil is working, it can be used instead of a browser:
mkdir ~/recover; cd ~/recover
../tools/gsutil/gsutil cp gs://mylaptop/*.manifest.gpg .
for ii in *.gpg; do gpg -d "${ii}" > "${ii%%.gpg}"; done;
grep "some filename" *.manifest
It would be easier (and quicker) to set up another Linux instance and use duply/duplicity to recover the data properly, but the option is there to do it all by hand.
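If you do go that route, the restore action is the proper way to pull everything back down once the duply profile and GPG key are in place on the new instance (the target path is just an example; an age such as a date can be given as a second argument):
duply mylaptop restore /tmp/full-restore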
Backend Portability
An extension of the Manual Recovery approach is to copy all the GCS files down to a local path using gsutil (or a web browser), copy/upload them to a filesystem or a different provider, then reconfigure your duply config to use the new backend without losing your existing backups. This might even be used just to download a copy of everything onto a USB drive that is kept in a fireproof safe.
Given the duplicity gpg-tarball storage design, your solution is upstream provider independent - the backup files can be ported from one backend to another with a bit of scripting and elbow grease. This could also be leveraged to keep a backup on different providers at the same time or use different backends for short vs. long term storage.
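A rough sketch of such a migration to a USB drive, assuming it is mounted at /mnt/usb (duplicity's file:// backend handles local paths):
mkdir -p /mnt/usb/mylaptop-backup
~/tools/gsutil/gsutil -m cp "gs://mylaptop/*" /mnt/usb/mylaptop-backup/
# then point the profile at the new location in ~/.duply/mylaptop/conf:
# TARGET='file:///mnt/usb/mylaptop-backup'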