tecosystems

My Backup: Dropbox, JungleDisk, S3, and a Desperate Need for Deduping


When it comes to backing up my music, I have four problems. Apart from the backup, I mean.

  1. My music acquisition is done on a Linux laptop (both Amazon and the eMusic store provide Linux clients)
  2. My Linux laptop’s hard drive is only 128 GB, much of which is devoted to virtual machine images, ergo space for music is tight
  3. My primary music library is housed on a Mac (so as to be portable to my iPhone)
  4. My music library (~80 GB) is too large to be entirely synced with Dropbox (50 GB max)

In other words, I currently download to one machine – an Ubuntu-equipped X301 – which has only enough space to contain a subset of the master library. As for the master library, it’s attached to a Mac Mini running OS X, which in turn needs to be regularly updated with the newly acquired tracks from the Ubuntu machine.

Got all that?

Besides migrating newly acquired content to the master library, one of the requirements of my backup process is to push the content to the cloud. As I noted in 2007, my music may just be my most valuable material possession, and it’s certainly the least replaceable, so having an offsite copy is critical to me. Just as it will be to a horde of consumers in the years ahead, but that’s a different matter entirely.

So here’s my solution, warts and all – and yes, I’ll get to those.

Tools

  • Paid Dropbox account: $99/year
  • Paid Amazon S3 account: $0.150/GB storage | $0.100/GB transfer
  • JungleDisk client: have a lifetime account, which appears to be no longer available
  • Dropbox client
  • Rsync

Process

  • [Ubuntu laptop] Download music from Amazon/eMusic/etc into ~/Music
  • [Ubuntu laptop] ~/Music is symlinked to my Dropbox directory (instructions here), meaning everything downloaded is pushed to the cloud and to my other Dropbox-equipped machines, including the Mac Mini
  • [Mac Mini] Dropbox directory is synced to the master music library on an external drive using the following process. All credit for the process – scripts included – goes to Chris Tirpak. Anything wrong, particularly with the rsync options, is purely my stupidity:
    1. Create a backup rsync script. Mine is as follows:

      #!/bin/bash
      #
      # backup the home directory to copper
      #
      SUBJECT="Daily Backup Log"
      EMAIL="[email protected]"
      BACKUPLOGFILE="/Users/sog/bin/backupMac.log"

      # remove the old log file in case it is still there
      rm -f "$BACKUPLOGFILE"

      echo "Begin backup at: " > "$BACKUPLOGFILE"
      date >> "$BACKUPLOGFILE"

      # first pass: everything except the Amazon MP3 folder; second pass:
      # the Amazon MP3 folder itself. Both passes log to the same file.
      rsync -rltvz /Users/sog/Dropbox/Music/ --exclude "Amazon MP3" \
          /Volumes/"NO NAME"/"MUSICDIR"/ >> "$BACKUPLOGFILE" 2>&1

      rsync -rltvz /Users/sog/Dropbox/Music/"Amazon MP3"/ \
          /Volumes/"NO NAME"/"MUSICDIR"/ >> "$BACKUPLOGFILE" 2>&1

      echo "End backup at: " >> $BACKUPLOGFILE
      date >> $BACKUPLOGFILE

      # send the log in an email using /bin/mail
      /usr/bin/mail -s "$SUBJECT" "$EMAIL" < $BACKUPLOGFILE

      rm $BACKUPLOGFILE

    2. Copy this file to ~/bin
    3. Get launchd to run the script every night by creating a plist entry for it
    4. Put the plist file in ~/Library/LaunchAgents
    5. In a terminal: launchctl unload net.ogrady.backupMacSilent.plist
    6. In a terminal: launchctl load net.ogrady.backupMacSilent.plist
    7. In a terminal: launchctl list | grep -i net.ogrady
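The plist from step 3 is a short XML file. Here is a minimal sketch, assuming the script above is saved as ~/bin/backupMac.sh and a 2:00 a.m. run time (both assumptions; the Label matches the filename used in steps 5–7):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- should match the plist filename passed to launchctl load/unload -->
  <key>Label</key>
  <string>net.ogrady.backupMacSilent</string>
  <!-- path to the backup script is an assumption; adjust to your ~/bin -->
  <key>ProgramArguments</key>
  <array>
    <string>/Users/sog/bin/backupMac.sh</string>
  </array>
  <!-- run nightly at 2:00 a.m.; pick any quiet hour -->
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key>
    <integer>2</integer>
    <key>Minute</key>
    <integer>0</integer>
  </dict>
</dict>
</plist>
```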
  • With the master directory thus updated with any newly downloaded tracks, JungleDisk then reflects the master directory up to S3 for permanent backup nightly.
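For what it's worth, the symlink trick in the second step boils down to a couple of commands. This is a sketch using the stock paths, not the exact instructions linked above; adjust MUSIC and DROPBOX to your own layout:

```shell
# One-time setup: relocate the existing music folder into Dropbox,
# then symlink it back so downloads keep landing in ~/Music.
MUSIC="${MUSIC:-$HOME/Music}"
DROPBOX="${DROPBOX:-$HOME/Dropbox}"

mkdir -p "$MUSIC" "$DROPBOX"        # ensure both exist before moving
mv "$MUSIC" "$DROPBOX/Music"        # library now physically lives in Dropbox
ln -s "$DROPBOX/Music" "$MUSIC"     # ~/Music is now a pointer into Dropbox
```

From then on, anything saved to ~/Music is really saved inside Dropbox, so the client syncs it without the music player or download clients noticing anything changed.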

The good news about the above: it will dutifully run rsync nightly to grab the target Dropbox directories and copy them over to the master directory. The bad news? It creates duplicate files. Lots of duplicates. My master music directory – both from this process and from previous backup efforts – has a massive duplication problem, probably on the order of several thousand duplicate files.

Which brings me to the question: anyone got an outstanding de-duplication procedure that will let me preview the files to be removed? Because I need some serious help.
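For the record, a preview-only first pass needs nothing more than GNU coreutils. The sketch below (my own, not a tool from any of the above) checksums every file under a directory and prints only the groups that share a checksum; it deletes nothing:

```shell
# dedup_preview DIR: list groups of byte-identical files, one group per
# blank-line-separated block. Requires GNU md5sum/uniq; purely read-only.
dedup_preview() {
  find "$1" -type f -print0 \
    | xargs -0 -r md5sum \
    | sort \
    | uniq -w32 --all-repeated=separate \
    | cut -c35-
}
```

Dedicated tools such as fdupes do the same checksum grouping, and its interactive delete mode prompts per group, which fits the preview requirement.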

Otherwise, what would you improve, and where? What’s your backup routine look like?