Added: Basic File-level Deduplication in our FileStore #914
Conversation
This will lead to our files being fragmented between multiple archives when downloading mod updates, however. At the very least: we don't test any of this code right now, so I'll probably add a test or two.
This should make our tests a little less flaky, as we shouldn't have failures on reused data now. It does make the tests a bit slower, though.
As discussed above:
- move the `HashSet` into `NxFileStore`,
- make it lazy,
- if already populated, update it when new files are backed up (sketched below)
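A minimal sketch of what that could look like, assuming hypothetical member names (`_backedUpHashes`, `EnumerateAllArchivedHashes`, and `OnFilesBackedUp` are illustrative, not the real API):

```csharp
// Illustrative sketch only: a lazily-populated set of already-archived hashes.
private HashSet<Hash>? _backedUpHashes; // stays null until the first query (lazy)
private readonly object _dedupeLock = new();

private HashSet<Hash> GetBackedUpHashes()
{
    if (_backedUpHashes != null) return _backedUpHashes;
    lock (_dedupeLock)
    {
        // Populate once from the on-disk store on first use.
        _backedUpHashes ??= new HashSet<Hash>(EnumerateAllArchivedHashes());
        return _backedUpHashes;
    }
}

// Keep the set in sync when new files are backed up, but only if it has
// already been populated; otherwise the lazy init will pick them up later.
private void OnFilesBackedUp(IEnumerable<Hash> hashes)
{
    if (_backedUpHashes == null) return;
    lock (_dedupeLock)
    {
        foreach (var hash in hashes)
            _backedUpHashes.Add(hash);
    }
}
```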
I'm pretty sure NxFileStore already has an existence check:

```csharp
private unsafe bool TryGetLocation(Hash hash, out AbsolutePath archivePath, out FileEntry fileEntry)
{
    var prefix = new Id64(EntityCategory.ArchivedFiles, (ulong)hash);
    foreach (var entry in _store.GetByPrefix<ArchivedFiles>(prefix))
    {
        foreach (var location in _archiveLocations)
        {
            var path = location.Combine(entry.File);
            if (!path.FileExists) continue;

            archivePath = path;
            fixed (byte* ptr = entry.FileEntryData.AsSpan())
            {
                var reader = new LittleEndianReader(ptr);
                FileEntry tmpEntry = default;
                tmpEntry.FromReaderV1(ref reader);
                fileEntry = tmpEntry;
                return true;
            }
        }
    }

    archivePath = default;
    fileEntry = default;
    return false;
}
```

So I don't think we need to keep anything in memory. That's backed by SQLite and has an O(n) search time in the number of duplicates in the system, which after this patch effectively becomes O(1). We can just call `TryGetLocation`.
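If so, the dedupe check could just be a thin wrapper over that, with no in-memory state; a rough sketch (`HaveFile` is an illustrative name, not an existing method):

```csharp
// Sketch: existence check built directly on TryGetLocation; no in-memory set.
public bool HaveFile(Hash hash) => TryGetLocation(hash, out _, out _);
```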
I'm half asleep but damn we really missed this. Didn't know there was a table for this. Well, that's kinda cool.
Anyway, my doggo woke me up from sleep because she wanted outside >.< While I wait half asleep, the following comes to mind:
I think I'll try that as a temp workaround, and measure the overhead (very roughly) with some quick logging (I'll fill the data store with 100k files). If it's fast enough, I'll just use that. Nap time.
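A rough shape of that measurement, assuming a `HaveFile`-style check and plain `Stopwatch` timing (the names here are illustrative):

```csharp
// Very rough overhead measurement: time the dedupe filter and log it.
var sw = System.Diagnostics.Stopwatch.StartNew();
var newFiles = allFiles.Where(f => !fileStore.HaveFile(f.Hash)).ToArray();
sw.Stop();
_logger.LogInformation(
    "Dedupe check for {Total} files took {Ms}ms ({New} new)",
    allFiles.Count, sw.ElapsedMilliseconds, newFiles.Length);
```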
To test or profile, you can try installing SMIM from Skyrim.
I don't like half-assing a job, so I went the full way.
I did a very, very rough benchmark by modifying one of the tests. (Edit: We now hash with cache too!)

tl;dr — File Dedupe Overhead: ~150ms for 30,000 mods with 300,000 files. (Basically free.)

Namely, I wrote a loop which:
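(The exact steps aren't preserved here, but a loop of roughly this shape would match the description; every helper name below is hypothetical, not the actual test code:)

```csharp
// Hypothetical benchmark loop: repeatedly install small synthetic mods
// until the data store holds ~30,000 mods / ~300,000 files.
for (var i = 0; i < 30_000; i++)
{
    // Create a small archive with ~10 files, then run it through the
    // normal install path so the dedupe check is exercised each time.
    var archive = CreateSyntheticModArchive(fileCount: 10);
    await InstallModAsync(archive);
}
```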
After around 45 minutes, I installed 30,000 mods through this loop, with approx. 300,000 files total. All in all, once you hit 30,000 mods, the overhead is around the ~150ms noted above. This was done with a SQL statement.

Archive Dedupe Overhead: (Edit: No longer valid; we use the download cache now. The problem still exists, but has been 'delegated' to another subsystem: downloads.)

Approximate overheads:
Unlike file dedupe, which is basically free and in practice improves performance by compressing less on mod updates, archive dedupe unfortunately carries pure overhead, as we don't really have much use for the archive hash outside of deduping future archives. I'm not sure this is the best way to go about it; part of me feels that getting the user to not download the same archive in the first place may be the better approach. Or hashing a subset of the file (e.g. 4096 bytes) and using that as the hash for deduplication instead; it'd probably bring better average performance in the long run, but that's a matter for another issue.
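For reference, the subset-hash idea could look something like this (using `XxHash64` from `System.IO.Hashing` purely as a stand-in for whatever hash we'd actually pick):

```csharp
using System.IO;
using System.IO.Hashing;

// Sketch: hash only the first 4096 bytes of a file as a cheap dedupe key.
// A prefix collision would need a full-hash fallback before reusing data.
static ulong PartialHash(string path)
{
    Span<byte> buffer = stackalloc byte[4096];
    using var stream = File.OpenRead(path);
    var read = stream.Read(buffer); // may read fewer bytes for small files
    return XxHash64.HashToUInt64(buffer[..read]);
}
```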
@Sewer56 Are the test failures due to the hashing bug? Should we wait for a fix on that to get merged first?
No, they're completely unrelated to this PR. Maybe something on Main.
Quick Summary:
- `FileOriginRegistry` will only back up new files that are not already backed up anywhere.
- Less `.nx` data written.
- Also added some basic tests around re-archiving our downloads (a rough sketch below).
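Such a test might look roughly like this (xUnit/FluentAssertions style; the helpers and `BackupFiles` usage are illustrative, not the exact test code):

```csharp
[Fact]
public async Task BackingUpSameFilesTwice_DoesNotWriteDuplicateData()
{
    // Illustrative only: back up the same files twice and assert that
    // the second pass writes no additional archive data.
    var files = await CreateTestFiles();
    await _fileStore.BackupFiles(files);
    var sizeAfterFirst = GetTotalArchiveSize();

    await _fileStore.BackupFiles(files); // re-archive identical data
    GetTotalArchiveSize().Should().Be(sizeAfterFirst);
}
```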