What is Syncthing doing when it says “Scanning”, and what’s the point of it?
To answer that we first need to talk about the index database.
The Index Database
Syncthing keeps an index database with information about each file, directory, and symlink it knows about. Each entry contains the name of the item; some metadata like size, timestamps, and permissions; internal information like a version vector and sequence number; and a list of the blocks making up the file. The index is keyed on folder ID, device ID, and file name. We can imagine it something like this:
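As a rough sketch in Go, the key and entry described above could be modelled as below. Every type and field name here is made up for illustration, and the in-memory map only stands in for the real on-disk database; the actual layout is the one defined in the protocol buffer schema.

```go
package main

import (
	"fmt"
	"time"
)

// IndexKey identifies one entry: the same file name can exist once per
// folder and per device. Hypothetical type, not Syncthing's own.
type IndexKey struct {
	FolderID string
	DeviceID string
	Name     string
}

// BlockInfo describes one block of a file (hypothetical).
type BlockInfo struct {
	Offset int64
	Size   int
	Hash   []byte
}

// IndexEntry holds the things the text lists for each item (hypothetical).
type IndexEntry struct {
	Name        string
	Size        int64
	ModTime     time.Time
	Permissions uint32
	Version     map[string]uint64 // simplified version vector: device -> counter
	Sequence    int64
	Blocks      []BlockInfo // empty for directories and symlinks
}

func main() {
	index := map[IndexKey]IndexEntry{}
	key := IndexKey{FolderID: "default", DeviceID: "LOCAL", Name: "docs/readme.txt"}
	index[key] = IndexEntry{Name: key.Name, Size: 300 * 1024, Sequence: 1}
	fmt.Println(len(index), index[key].Size)
}
```

Because the device ID is part of the key, the same map can hold both our own view of a file and what each remote device has announced about it.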
The blocks are all of a specific size, historically always 128 KiB each. Nowadays we sometimes use larger blocks, and of course the last block is usually smaller unless the file size happens to be an exact multiple of the block size. (Directories and symlinks don’t have any blocks; apart from that the handling is the same.)
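To make the block arithmetic concrete, here is a small sketch that computes how many blocks a file needs and how large the last one is. The function name `blockLayout` is made up for this example, and treating an empty file as having zero blocks is a simplification that may not match what Syncthing does.

```go
package main

import "fmt"

// blockLayout returns the number of blocks and the size of the last block
// for a file of fileSize bytes split into blockSize-byte blocks.
func blockLayout(fileSize, blockSize int64) (count, lastSize int64) {
	if fileSize == 0 {
		return 0, 0 // simplification for this sketch
	}
	count = (fileSize + blockSize - 1) / blockSize // round up
	lastSize = fileSize - (count-1)*blockSize
	return count, lastSize
}

func main() {
	const blockSize = 128 << 10 // 128 KiB
	n, last := blockLayout(300<<10, blockSize)
	fmt.Println(n, last) // 3 blocks; the last one is 45056 bytes (44 KiB)
}
```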
Keeping this database up to date is called scanning.
The index entries are stored in the same format that is used to exchange index information between devices. It is described in our protocol buffer schema.
Scanning is a three-step process.
- Walk the folder on disk, comparing each found item with the corresponding item in the index database. Queue any differing items for further inspection.
- Hash the files we queued in step one.
- Walk the index database for the folder, checking if each item in the database still exists on disk. Queue any missing items for deletion.
In steps one and two we find and hash new or updated files; in step three we find files that have been deleted. Each of these steps can be quick or take a long time, depending on the factors described below.
Step One - Walk the Folder
Walking the folder on disk and comparing to the database is quick if both the file metadata and the database are mostly cached in RAM. If not, and the folder is large, it can take a while and cause a lot of I/O. We can’t predict how long this step will take because we don’t know what the contents on disk are before we look - hence this step is shown simply as “Scanning” in the GUI, without any progress indication.
Step Two - Hash the Files
Once we’ve built a list of files to hash we know how much work is left to do in step two. The hashing process reads each changed file, computes the cryptographic hashes for each block, and periodically updates the index entries in the database. I say “periodically” because it’s done in batches instead of immediately for each file, for improved efficiency. The new index information is also sent to other devices when it is committed to the database. This has effects on rename detection.
During this step the GUI shows progress information - “Scanning (52%)” and similar. We also calculate the current hash rate and estimate how long the scan will take to complete.
Step Three - Scan for Deletes
Once the hashing is complete the third step kicks in to look for deleted files. This is yet another folder walk and the performance considerations are the same as in step one. The GUI shows “Scanning (100%)” while this is ongoing, which might be less than totally intuitive. Usually, however, this step is quick enough for it not to matter.
The process above describes what Syncthing has done pretty much since its inception. Nowadays Syncthing supports listening for filesystem notifications, which gives us faster response to changes and less need to scan. Internally it works much the same as the three step process described above - it’s just that the process is limited to the files or the subtree that has changed. That is, instead of scanning the whole folder on a set schedule we scan individual files and directories when notified about them having changed.
Events are aggregated and processed in batches. Aggregation means that in some cases, instead of scanning many changed files individually we will do a full scan of their parent folder. Batching means that there is a certain delay, mostly to wait for further changes affecting the same file or files in the vicinity, before processing the whole set of changes in one go.
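The aggregation half of this can be sketched as a pure function. The rule used here, "more than `maxPerDir` changed paths under one parent means scan the parent instead", and the `aggregate` function itself are assumptions made for illustration; the actual thresholds and the batching delay are not modelled.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// aggregate takes slash-separated changed paths and returns a sorted list of
// things to scan. Directories with more than maxPerDir distinct changed
// children are scanned as a whole instead of file by file.
func aggregate(changed []string, maxPerDir int) []string {
	byDir := map[string]map[string]bool{}
	for _, p := range changed {
		dir := path.Dir(p)
		if byDir[dir] == nil {
			byDir[dir] = map[string]bool{}
		}
		byDir[dir][p] = true // repeated events for one path collapse here
	}
	var out []string
	for dir, files := range byDir {
		if len(files) > maxPerDir {
			out = append(out, dir)
			continue
		}
		for p := range files {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	events := []string{"photos/1.jpg", "photos/2.jpg", "photos/3.jpg", "notes/todo.txt", "notes/todo.txt"}
	fmt.Println(aggregate(events, 2))
}
```

With a limit of two, the three changed photos turn into a single scan of `photos`, while the repeated events for `notes/todo.txt` result in that one file being scanned once.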
Now you know how scanning works. The next article in the series will be about how changes are synchronized between devices!