Added: Basic File-level Deduplication in our FileStore #914

Merged · 7 commits merged into main on Feb 13, 2024

Conversation

@Sewer56 (Member) commented Feb 7, 2024

Quick Summary:

  • Tests will now clear tables between runs, making the tests less flaky (no leftover data from previous runs).
    • Note that this is slow. I didn't add a 'fast' routine for this because we will be switching the underlying database code in the Event Sourcing migration.
  • FileOriginRegistry will only back up new files that are not already backed up anywhere (see the sketch below this list).
    • e.g. Installing the same mod twice results in no new .nx data being written.
    • e.g. Installing an update will only write the files that are new in that specific archive.
    • e.g. If multiple mods share the same files, those files will be deduplicated.
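A minimal sketch of the shape of that check (names like extractedFiles, backedUpHashes and BackupFiles are illustrative placeholders, not the actual FileOriginRegistry code):

    // Illustrative only: skip any file whose hash is already backed up somewhere,
    // so installing the same mod (or an update) only archives genuinely new data.
    var newFiles = extractedFiles
        .Where(f => !backedUpHashes.Contains(f.Hash))
        .ToList();

    if (newFiles.Count > 0)
        await fileStore.BackupFiles(newFiles);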

Also added some basic tests around re-archiving our downloads.

This will, at the very least, lead to our files being fragmented across multiple archives when downloading mod updates.

We don't test any of this code right now, so I'll probably add a test or two.
This should make our tests a little less flaky, as we shouldn't see failures from reused data anymore; it does make the tests a bit slower, though.
@Sewer56 Sewer56 requested a review from a team February 7, 2024 14:49
@Sewer56 Sewer56 self-assigned this Feb 7, 2024
@Sewer56 Sewer56 added the Bug and meta-performance labels Feb 7, 2024
@Al12rs (Contributor) left a comment

As discussed above:

  • move the HashSet into NxFileStore,
  • make it lazy,
  • if it's already populated, update it when new files are backed up (rough sketch below)
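A rough sketch of that suggestion, with ulong standing in for the Hash type and a placeholder delegate standing in for the actual store query (this is not the real NxFileStore API):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch only: a lazily populated cache of already-archived hashes.
    public class LazyHashCache
    {
        private readonly Func<IEnumerable<ulong>> _enumerateArchivedHashes; // placeholder for the store query
        private HashSet<ulong>? _known;

        public LazyHashCache(Func<IEnumerable<ulong>> enumerateArchivedHashes)
            => _enumerateArchivedHashes = enumerateArchivedHashes;

        // Populated lazily, on the first membership query only.
        public bool Contains(ulong hash)
            => (_known ??= _enumerateArchivedHashes().ToHashSet()).Contains(hash);

        // If already populated, keep it in sync when new files are backed up.
        public void Add(IEnumerable<ulong> newHashes)
        {
            if (_known is null) return;
            foreach (var h in newHashes)
                _known.Add(h);
        }
    }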
@halgari (Collaborator) commented Feb 8, 2024

@Sewer56 @Al12rs

I'm pretty sure NxFileStore already has an exists check:

    private unsafe bool TryGetLocation(Hash hash, out AbsolutePath archivePath, out FileEntry fileEntry)
    {
        // Look up the archived-file entries recorded for this hash.
        var prefix = new Id64(EntityCategory.ArchivedFiles, (ulong)hash);
        foreach (var entry in _store.GetByPrefix<ArchivedFiles>(prefix))
        {
            // Resolve the entry against each known archive location on disk.
            foreach (var location in _archiveLocations)
            {
                var path = location.Combine(entry.File);
                if (!path.FileExists) continue;

                archivePath = path;

                // Deserialize the stored FileEntry blob.
                fixed (byte* ptr = entry.FileEntryData.AsSpan())
                {
                    var reader = new LittleEndianReader(ptr);
                    FileEntry tmpEntry = default;

                    tmpEntry.FromReaderV1(ref reader);
                    fileEntry = tmpEntry;
                    return true;
                }
            }
        }

        archivePath = default;
        fileEntry = default;
        return false;
    }

So I don't think we need to keep anything in memory. That's backed by SQLite and has an O(n) search time in the number of duplicates in the system, which this patch will effectively make O(1). We can just call TryGetLocation and discard the output.
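If we go that route, the existence check could be as small as something like this (the method name is illustrative, not existing API):

    // Sketch: dedupe check built on the existing lookup, discarding the out params.
    public bool HaveFile(Hash hash) => TryGetLocation(hash, out _, out _);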

@Sewer56 (Member, Author) commented Feb 8, 2024

I'm half asleep but damn we really missed this. Didn't know there was a table for this. Well, that's kinda cool.

@Sewer56 (Member, Author) commented Feb 8, 2024

Anyway, my doggo woke me up from sleep because she wanted outside >.<

While I wait half asleep, the following comes to mind:

  • We decided earlier when talking that we'll deduplicate at the archive level too (avoiding extraction where possible).
  • Deduping at the archive level means we store the archive hash as finalHash in DownloadAnalysis.
  • Which means we have to fetch DownloadAnalysis either way, to extract the ID field.

I'd also add that TryGetLocation could become a serious bottleneck when used with those rare mods that have over 30,000 files (exact impact remains to be seen), due to the sheer overhead of hitting the DB separately for each file.

So I think a GetAll() method for a given entity is still probably desirable. GetByPrefix with an empty prefix might do the job, if the underlying DB is smart enough to eliminate it, but I have no way of knowing until I wake up and get to my PC again.

I think I'll try that as a temp workaround, and measure overhead (very roughly) with some quick logging. (I'll fill the data store with 100k files). If it's fast enough I'll just use that and switch over to a GetAll() method post DB migration. I don't really want to write 'obsolete on arrival' code.

Nap time.

@Al12rs (Contributor) commented Feb 8, 2024

To test or profile you can try installing SMIM from Skyrim:
https://www.nexusmods.com/skyrim/mods/8655?tab=files
Very popular and reasonably large without being absurd.

@Sewer56 (Member, Author) commented Feb 8, 2024

I don't like half-assing a job, so I went all the way.

  • Now dedupes on Archives AND Files.
  • I added a GetAll method to IDataStore myself.

I did a very, very rough benchmark by modifying one of the tests.

Edit: We now hash with the cache too.


**tl;dr File Dedupe Overhead: ~150 ms for 30,000 mods with 300,000 files. (Basically free)**

Namely, I wrote a loop which:

  • Generates a Mod with 7-14 random small files.
  • Packs it.
  • Installs it.

After around 45 minutes, I had installed 30,000 mods through this loop, with approximately 300,000 files total.
This is pretty much the worst-case scenario for the current implementation (it fares better with fewer mods that each have many files, like SMIM).

All in all, once you hit 30,000 mods, the overhead is around 150 ms (slow by my standards, but still 'acceptable' for 30,000 mods with 300,000 files).

This was done with the SQL statement `SELECT Id,Data FROM [{tableName}]`, which I added in GetAll.
Around 70% of CPU time was spent fetching items from the DataStore and around 29% deserializing; the remaining <0.4% went to expanding the HashSets used for deduping.
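For reference, a rough sketch of that kind of GetAll query written directly against Microsoft.Data.Sqlite (the column types and the real IDataStore shape are assumptions, not the actual implementation):

    using System.Collections.Generic;
    using Microsoft.Data.Sqlite;

    // Sketch only: stream every (Id, Data) row out of a table in a single query,
    // instead of issuing a separate lookup per file.
    public static IEnumerable<(byte[] Id, byte[] Data)> GetAll(SqliteConnection conn, string tableName)
    {
        using var cmd = conn.CreateCommand();
        cmd.CommandText = $"SELECT Id, Data FROM [{tableName}]";
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            yield return ((byte[])reader[0], (byte[])reader[1]);
    }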


**Archive Dedupe Overhead: sizeof(Archive) / DiskReadSpeed**

Edit: No longer valid; we use the download cache. The problem still exists, but it has been 'delegated' to another subsystem (downloads).

Approximate overheads:

| Storage Type | 1 GB | 5 GB | 10 GB | 20 GB |
|--------------|------|------|-------|-------|
| HDD          | 7.1 s | 35 s | 71.4 s | 142.9 s |
| SATA SSD     | 2.5 s | 12.5 s | 25 s | 50 s |
| NVMe         | 0.3 s | 1.92 s | 3.8 s | 7.6 s |
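(These figures appear to assume sequential read speeds of roughly 140 MB/s for an HDD, ~400 MB/s for a SATA SSD and ~2.6 GB/s for NVMe; e.g. 1 GB ÷ 140 MB/s ≈ 7.1 s.)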

Unlike file dedupe, which is basically free and in practice improves performance by compressing less on mod updates, archive dedupe unfortunately is pure overhead, as we don't really have much use for the archive hash outside of deduping future archives.

I'm not sure this is the best way to go about it; part of me feels that getting the user not to download the same archive in the first place might be a better approach. Or we could hash a subset of the file (e.g. the first 4096 bytes) and use that as the dedup hash instead; that would probably give better average performance in the long run, but that's a topic for another issue.
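A quick sketch of that 'hash a prefix' idea, assuming xxHash64 via the System.IO.Hashing package (the real hasher and the file-store integration would differ):

    using System;
    using System.IO;
    using System.IO.Hashing;

    // Sketch only: hash just the first 4096 bytes of an archive and use that
    // as the dedup key instead of re-reading and hashing the whole file.
    static ulong HashPrefix(string path, int prefixSize = 4096)
    {
        using var stream = File.OpenRead(path);
        var buffer = new byte[prefixSize];
        var read = stream.Read(buffer, 0, prefixSize); // small files may yield fewer bytes
        return XxHash64.HashToUInt64(buffer.AsSpan(0, read));
    }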

@Al12rs (Contributor) commented Feb 12, 2024

@Sewer56 Are the test failures due to the hashing bug? Should we wait for a fix on that to get merged first?

@Sewer56 (Member, Author) commented Feb 12, 2024

> @Sewer56 Are the test failures due to the hashing bug? Should we wait for a fix on that to get merged first?

No, they're completely unrelated to this PR. Maybe something on Main.

@Al12rs Al12rs closed this Feb 12, 2024
@Al12rs Al12rs reopened this Feb 12, 2024
@Sewer56 Sewer56 merged commit 29e9690 into main Feb 13, 2024
5 of 7 checks passed
@Al12rs Al12rs deleted the filestore-dedupe branch February 14, 2024 09:43