Add rsync style --files-from=FILE to simplify explicit backup scope

Summary

It can be useful to have duplicity act on a specific set of files which constitute a subset of the backup source path, for example when duplicity is wrapped by another program to orchestrate/automate more complex jobs. Currently, this can be done by:

duplicity --include-filelist FILE --exclude /source/dir /source/dir url://backend

With the above, anything not explicitly mentioned in FILE will be cut out by the --exclude file selection option. It works, but there are two problems with this:

  1. it's somewhat less than intuitive, or at least far from obvious when first trying to achieve such
  2. it can be seriously non-performant when the included file set is large

Admittedly, backing up only a subset of the source is (probably) a corner use case and can already be achieved. Maybe it's not considered worth addressing this, but I would at least like to offer a solution... Would also be very happy to be told I'm just doing it wrong and that there's a better way already supported that I just haven't spotted :)

Performance Issues

As noted here, if the list passed to --include-filelist is more than a few thousand lines long, duplicity will spend more and more time crunching regex and less time doing backup I/O. In previous testing this became a seriously notable slowdown at about 10k files, and in some cases it can choke completely if pushed much further than that.

The issue seems to be that for m files in the source directory and n file selection rules, duplicity performs up to n*m operations, so the time taken can approach O(n^2) when m and n are of similar size and you're having a sufficiently bad day. Both shell globbing and regex modes involve compiling and processing regular expressions, but the performance hit for long lists is almost certainly just down to the loop-in-a-loop nature of the job. I have since determined that the small gains for literal matching in the linked post come mostly from not having to compile 20k regular expressions, and not from the matching being significantly faster.
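To make the loop-in-a-loop cost concrete, here is a toy model in Python. It is not duplicity's actual selection code: `select_by_rules` and `select_by_set` are hypothetical names, and the rules are simply compiled with `fnmatch.translate` as a stand-in for whatever duplicity does internally. It only shows how the number of pattern tests grows with both the number of paths and the number of rules, whereas a set lookup does one membership test per path.

```python
import fnmatch
import re


def select_by_rules(paths, rules):
    """Toy model: test each walked path against each rule until one matches."""
    compiled = [re.compile(fnmatch.translate(rule)) for rule in rules]
    comparisons = 0
    selected = []
    for path in paths:              # m walked paths
        for pattern in compiled:    # up to n rules per path
            comparisons += 1
            if pattern.match(path):
                selected.append(path)
                break
    return selected, comparisons


def select_by_set(paths, wanted):
    """Same result for literal paths, with one hash lookup per path."""
    wanted = set(wanted)
    return [path for path in paths if path in wanted]


# Hypothetical data: 200 walked paths, every second one listed as a rule.
paths = [f"dir/file{i}" for i in range(200)]
rules = [f"dir/file{i}" for i in range(0, 200, 2)]

selected, comparisons = select_by_rules(paths, rules)
print(len(selected), "paths selected using", comparisons, "pattern tests")
print("set lookup agrees:", selected == select_by_set(paths, rules))
```

Every path that matches no rule pays for a full pass over the rule list, which is where the n*m worst case comes from.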

To be fair the file selection system is not broken, it just isn't really designed for this use case. As admitted above, this is certainly a corner use case, but the below bug would suggest that I'm not the only twit on the block who tries to do things like this with duplicity. :D

https://bugs.launchpad.net/duplicity/+bug/1576389

I will also post separately some testing data demonstrating how not abusing the file selection system for this purpose can dramatically improve matters.

Proposed Solution

Rsync has an option --files-from=FILE (see rsync man page) which is documented to define the input file list, rather than building it by walking the source path. A similar command line option for duplicity would be more intuitive and explicit, whilst also providing a cue for duplicity to use a more efficient algorithm.

If --files-from is not specified, duplicity would behave exactly as it does now, i.e. am not proposing a breaking change for the many people unaffected by this and uninterested in it.

This option would reduce the scope of the backup source to the explicitly named files, but otherwise duplicity would behave the same. File selection rules could still be specified, but would work on the subset of the source folder defined by --files-from only. The following behaviours described in the rsync man page would probably make sense to implement for duplicity as well:

  • each line in FILE to specify a path relative to the backup source
  • any leading slashes are removed/ignored (in line with the above) or an error emitted
  • missing folders to be created where implied, i.e. don't error if a file is listed but its parent folder isn't
  • FILE may be - to permit piping a list in via stdin
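The four behaviours above could be sketched roughly as follows. This is only an illustration of the proposal, not existing duplicity code: `read_files_from` is a hypothetical helper, it takes the "strip leading slashes" option rather than the "emit an error" one, and it assumes one UTF-8 path per line.

```python
import io
import sys
from pathlib import PurePosixPath


def read_files_from(source, stdin=sys.stdin):
    """Return sorted relative paths from FILE (or stdin for "-"), plus implied parents."""
    if source == "-":
        lines = stdin.read().splitlines()
    else:
        with open(source, encoding="utf-8") as handle:
            lines = handle.read().splitlines()

    entries = set()
    for line in lines:
        rel = line.lstrip("/")      # leading slashes are ignored, not an error
        if not rel:
            continue                # skip blank lines
        path = PurePosixPath(rel)
        entries.add(path)
        # Add parent folders that are implied but not listed explicitly.
        entries.update(p for p in path.parents if p != PurePosixPath("."))
    return sorted(str(p) for p in entries)


# Usage with a list piped in, simulated here with an in-memory stream.
listed = read_files_from("-", io.StringIO("/a/b/c.txt\nd.txt\n\n"))
print(listed)
```

Every returned path is relative to the backup source, so the same list stays valid if the source is mounted somewhere else.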

I suspect the first two in the above list were about keeping the content of FILE mount point agnostic, but another advantage here is that if a user mixes up the files passed to --files-from and --include-filelist (and friends), then an error is quite likely to result under the above conditions. Specifically, --include-filelist would presumably reject relative paths as unable to match the backup source as a prefix. The inverse would work, but potentially produce an instant but empty backup (thus the above suggestion to error instead).

In processing --files-from, duplicity would not walk the filesystem tree. Instead, it would iterate the input list as part of the file selection system. To get the same result as the current workaround, one would no longer need to specify many (or any) file selection rules, and the loop-in-a-loop goes away.
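As a rough sketch of what iterating the list instead of walking the tree might look like, the Python below stats only the listed paths. `iter_listed` is a hypothetical helper and does not reflect how duplicity builds its path iterator. Skipping missing entries silently is also an assumption; a real implementation might prefer to warn or fail.

```python
import os
import tempfile


def iter_listed(source_dir, relative_paths):
    """Yield (relative_path, stat_result) for each listed path, without walking the tree."""
    for rel in relative_paths:
        full = os.path.join(source_dir, rel)
        try:
            yield rel, os.lstat(full)
        except FileNotFoundError:
            # Assumption: a listed path that no longer exists is skipped.
            continue


# Usage: only the listed files are touched; skip.txt is never examined.
with tempfile.TemporaryDirectory() as source_dir:
    os.makedirs(os.path.join(source_dir, "docs"))
    for name in ("docs/a.txt", "docs/b.txt", "skip.txt"):
        with open(os.path.join(source_dir, name), "w") as handle:
            handle.write("x")
    found = [rel for rel, _ in iter_listed(source_dir, ["docs/a.txt", "missing.txt"])]

print(found)
```

The work done is proportional to the length of the list rather than to the size of the source tree multiplied by the number of rules.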

For the sake of simplicity, I'd also suggest that each line in FILE is a literal relative path, with no support for shell globs or regex therein.

Edited by Jethro Donaldson