Added: Basic File-level Deduplication in our FileStore #914

Merged · 7 commits merged into main on Feb 13, 2024

Conversation

@Sewer56 (Member) commented Feb 7, 2024

Quick Summary:

  • Tests will now clear tables between runs, making the tests less flaky (no leftover data from previous runs).
    • Note that this is slow. I didn't add a 'fast' routine for this because we will be switching the underlying database code in the Event Sourcing migration.
  • FileOriginRegistry will only back up new files that are not already backed up anywhere (see the sketch below this list).
    • e.g. Installing the same mod twice results in no new .nx data being written.
    • e.g. Installing an update will only write the files that are new in that specific archive.
    • e.g. If multiple mods share the same files, those files will be deduplicated.
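A minimal sketch of the shape of that check (names like extractedFiles, backedUpHashes and BackupFiles are illustrative placeholders, not the actual FileOriginRegistry code):

    // Illustrative only: skip any file whose hash is already backed up somewhere,
    // so installing the same mod (or an update) only archives genuinely new data.
    var newFiles = extractedFiles
        .Where(f => !backedUpHashes.Contains(f.Hash))
        .ToList();

    if (newFiles.Count > 0)
        await fileStore.BackupFiles(newFiles);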

Also added some basic tests around re-archiving our downloads.

This will, at the very least, lead to our files being fragmented across multiple archives when downloading mod updates.

We don't test any of this code right now, so I'll probably add a test or two.
This should make our tests a little less flaky, as we shouldn't see failures from reused data anymore; it does make the tests a bit slower, though.
@Sewer56 Sewer56 requested a review from a team February 7, 2024 14:49
@Sewer56 Sewer56 self-assigned this Feb 7, 2024
@Sewer56 Sewer56 added the Bug and meta-performance labels Feb 7, 2024
@Al12rs (Contributor) left a comment

As discussed above:

  • move the HashSet into NxFileStore,
  • make it lazy,
  • if it's already populated, update it when new files are backed up (rough sketch below)
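A rough sketch of that suggestion, with ulong standing in for the Hash type and a placeholder delegate standing in for the actual store query (this is not the real NxFileStore API):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch only: a lazily populated cache of already-archived hashes.
    public class LazyHashCache
    {
        private readonly Func<IEnumerable<ulong>> _enumerateArchivedHashes; // placeholder for the store query
        private HashSet<ulong>? _known;

        public LazyHashCache(Func<IEnumerable<ulong>> enumerateArchivedHashes)
            => _enumerateArchivedHashes = enumerateArchivedHashes;

        // Populated lazily, on the first membership query only.
        public bool Contains(ulong hash)
            => (_known ??= _enumerateArchivedHashes().ToHashSet()).Contains(hash);

        // If already populated, keep it in sync when new files are backed up.
        public void Add(IEnumerable<ulong> newHashes)
        {
            if (_known is null) return;
            foreach (var h in newHashes)
                _known.Add(h);
        }
    }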
@halgari (Collaborator) commented Feb 8, 2024

@Sewer56 @Al12rs

I'm pretty sure NxFileStore already has an exists check:

    private unsafe bool TryGetLocation(Hash hash, out AbsolutePath archivePath, out FileEntry fileEntry)
    {
        // Look up the archived-file entries recorded for this hash.
        var prefix = new Id64(EntityCategory.ArchivedFiles, (ulong)hash);
        foreach (var entry in _store.GetByPrefix<ArchivedFiles>(prefix))
        {
            // Resolve the entry against each known archive location on disk.
            foreach (var location in _archiveLocations)
            {
                var path = location.Combine(entry.File);
                if (!path.FileExists) continue;

                archivePath = path;

                // Deserialize the stored FileEntry blob.
                fixed (byte* ptr = entry.FileEntryData.AsSpan())
                {
                    var reader = new LittleEndianReader(ptr);
                    FileEntry tmpEntry = default;

                    tmpEntry.FromReaderV1(ref reader);
                    fileEntry = tmpEntry;
                    return true;
                }
            }
        }

        archivePath = default;
        fileEntry = default;
        return false;
    }

So I don't think we need to keep anything in memory. That's backed by SQLite and has an O(n) search time in the number of duplicates in the system, which this patch will effectively make O(1). We can just call TryGetLocation and discard the output.
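If we go that route, the existence check could be as small as something like this (the method name is illustrative, not existing API):

    // Sketch: dedupe check built on the existing lookup, discarding the out params.
    public bool HaveFile(Hash hash) => TryGetLocation(hash, out _, out _);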

@Sewer56 (Member, Author) commented Feb 8, 2024

I'm half asleep but damn we really missed this. Didn't know there was a table for this. Well, that's kinda cool.

@Sewer56 (Member, Author) commented Feb 8, 2024

Anyway, my doggo woke me up from sleep because she wanted outside >.<

While I wait half asleep, the following comes to mind:

  • We decided earlier when talking that we'll deduplicate at the archive level too (avoiding extraction where possible).
  • Deduping at the archive level means we store the archive hash as finalHash in DownloadAnalysis.
  • Which means we have to fetch DownloadAnalysis either way, to extract the ID field.

I'd also add that TryGetLocation could become a serious bottleneck when used with those rare mods that have over 30,000 files (exact impact remains to be seen), due to the sheer overhead of hitting the DB separately for each file.

So I think a GetAll() method for a given entity is still probably desirable. GetByPrefix with an empty prefix might do the job, if the underlying DB is smart enough to eliminate it, but I have no way of knowing until I wake up and get to my PC again.

I think I'll try that as a temp workaround, and measure overhead (very roughly) with some quick logging. (I'll fill the data store with 100k files). If it's fast enough I'll just use that and switch over to a GetAll() method post DB migration. I don't really want to write 'obsolete on arrival' code.

Nap time.

@Al12rs (Contributor) commented Feb 8, 2024

To test or profile you can try installing SMIM from Skyrim:
https://www.nexusmods.com/skyrim/mods/8655?tab=files
Very popular and reasonably large without being absurd.

@Sewer56 (Member, Author) commented Feb 8, 2024

I don't like half-assing a job, so I went all the way.

  • Now dedupes on Archives AND Files.
  • I added a GetAll method to IDataStore myself.

I did a very, very rough benchmark by modifying one of the tests.

Edit: We now hash with the cache too.


**tl;dr File Dedupe Overhead: ~150 ms for 30,000 mods with 300,000 files. (Basically free)**

Namely, I wrote a loop which:

  • Generates a Mod with 7-14 random small files.
  • Packs it.
  • Installs it.

After around 45 minutes, I had installed 30,000 mods through this loop, with approximately 300,000 files total.
This is pretty much the worst-case scenario for the current implementation (it fares better with fewer mods that each have many files, like SMIM).

All in all, once you hit 30,000 mods, the overhead is around 150 ms (slow by my standards, but still 'acceptable' for 30,000 mods with 300,000 files).

This was done with the SQL statement `SELECT Id,Data FROM [{tableName}]`, which I added in GetAll.
Around 70% of CPU time was spent fetching items from the DataStore and around 29% deserializing; the remaining <0.4% went to expanding the HashSets used for deduping.
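For reference, a rough sketch of that kind of GetAll query written directly against Microsoft.Data.Sqlite (the column types and the real IDataStore shape are assumptions, not the actual implementation):

    using System.Collections.Generic;
    using Microsoft.Data.Sqlite;

    // Sketch only: stream every (Id, Data) row out of a table in a single query,
    // instead of issuing a separate lookup per file.
    public static IEnumerable<(byte[] Id, byte[] Data)> GetAll(SqliteConnection conn, string tableName)
    {
        using var cmd = conn.CreateCommand();
        cmd.CommandText = $"SELECT Id, Data FROM [{tableName}]";
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            yield return ((byte[])reader[0], (byte[])reader[1]);
    }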


**Archive Dedupe Overhead: sizeof(Archive) / DiskReadSpeed**

Edit: No longer valid; we use the download cache. The problem still exists, but it has been 'delegated' to another subsystem (downloads).

Approximate overheads:

| Storage Type | 1 GB | 5 GB | 10 GB | 20 GB |
|--------------|------|------|-------|-------|
| HDD          | 7.1 s | 35 s | 71.4 s | 142.9 s |
| SATA SSD     | 2.5 s | 12.5 s | 25 s | 50 s |
| NVMe         | 0.3 s | 1.92 s | 3.8 s | 7.6 s |
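(These figures appear to assume sequential read speeds of roughly 140 MB/s for an HDD, ~400 MB/s for a SATA SSD and ~2.6 GB/s for NVMe; e.g. 1 GB ÷ 140 MB/s ≈ 7.1 s.)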

Unlike file dedupe, which is basically free and in practice improves performance by compressing less on mod updates, archive dedupe unfortunately is pure overhead, as we don't really have much use for the archive hash outside of deduping future archives.

I'm not sure this is the best way to go about it; part of me feels that getting the user not to download the same archive in the first place might be a better approach. Or we could hash a subset of the file (e.g. the first 4096 bytes) and use that as the dedup hash instead; that would probably give better average performance in the long run, but that's a topic for another issue.
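A quick sketch of that 'hash a prefix' idea, assuming xxHash64 via the System.IO.Hashing package (the real hasher and the file-store integration would differ):

    using System;
    using System.IO;
    using System.IO.Hashing;

    // Sketch only: hash just the first 4096 bytes of an archive and use that
    // as the dedup key instead of re-reading and hashing the whole file.
    static ulong HashPrefix(string path, int prefixSize = 4096)
    {
        using var stream = File.OpenRead(path);
        var buffer = new byte[prefixSize];
        var read = stream.Read(buffer, 0, prefixSize); // small files may yield fewer bytes
        return XxHash64.HashToUInt64(buffer.AsSpan(0, read));
    }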

@Al12rs (Contributor) commented Feb 12, 2024

@Sewer56 Are the test failures due to the hashing bug? Should we wait for a fix on that to get merged first?

@Sewer56 (Member, Author) commented Feb 12, 2024

> @Sewer56 Are the test failures due to the hashing bug? Should we wait for a fix on that to get merged first?

No, they're completely unrelated to this PR. Maybe something on Main.

@Al12rs Al12rs closed this Feb 12, 2024
@Al12rs Al12rs reopened this Feb 12, 2024
@Sewer56 Sewer56 merged commit 29e9690 into main Feb 13, 2024
5 of 7 checks passed
@Al12rs Al12rs deleted the filestore-dedupe branch February 14, 2024 09:43