Inquiry Regarding Use of Topics API Model for HTTP Archive #305

Open
nrllh opened this issue Apr 18, 2024 · 4 comments

Comments

@nrllh

nrllh commented Apr 18, 2024

Hello,

I am Nurullah from HTTP Archive, and we are planning to use the Topics API model to categorize webpages for the 2024 Web Almanac project.

Our goal is to use the Topics API model to determine the categories of the CrUX origins in HTTP Archive. We intend to classify the origins using an approach similar to the one discussed here. The results of this classification will be stored and made publicly available in BigQuery, primarily for use by the Web Almanac analysts.

Before proceeding, we want to ensure that this use case does not violate any terms of use or raise other concerns regarding the Topics API. Could you provide guidance or confirm whether there are any potential issues with utilizing the Topics API in this manner?

Appreciate your support on this matter.

@leeronisrael
Contributor

Hi Nurullah - I'm looking into this and will get back to you soon.

@leeronisrael
Contributor

As you know, the Topics API classification model is shipped alongside the Chrome browser in order to facilitate the on-device generation of topics. All the code to use the model is within the Chromium source tree, which is subject to the Chromium open source license. There is no technical barrier to any party utilizing the model for purposes beyond the Topics API. In production, Chrome uses an override list in order to improve performance; this list does not exist in the Chromium source tree.
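
For readers reusing the model outside Chrome, the lookup order described above (override list first, model otherwise) could be sketched roughly as follows. This is only an illustration and not Chromium code: `classify_host`, `toy_model`, the host normalization, and the topic IDs are all hypothetical, and the real override list is not available in the source tree.

```python
from typing import Callable, Dict, List

# Hypothetical type: a classifier maps a hostname to a list of topic IDs.
Classifier = Callable[[str], List[int]]


def classify_host(
    host: str,
    override_list: Dict[str, List[int]],
    model: Classifier,
) -> List[int]:
    """Return topic IDs for a host, preferring an override entry over the model."""
    # Assumption: hosts are compared case-insensitively, without a trailing dot.
    key = host.lower().rstrip(".")
    if key in override_list:
        # Curated entry found: skip model inference entirely.
        return override_list[key]
    # No override entry: fall back to model inference.
    return model(key)


def toy_model(host: str) -> List[int]:
    # Stand-in for real model inference; the topic ID here is made up.
    return [999]
```

A party running only the open-source model would in effect be calling this with an empty `override_list`, so its output may not match what production Chrome reports for hosts that the override list covers.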

@nrllh
Author

nrllh commented May 16, 2024

Thank you, Leeron! That sounds cool. I will share our data with you as well once we are done.

@nrllh
Author

nrllh commented Jul 9, 2024

Thank you once again, @leeronisrael. We processed all the URLs in our dataset and made the results openly available. See the documentation: https://har.fyi/reference/functions/get_host_categories/

There have been some discussions on the accuracy of the model, but I couldn't find any related stats. Do you have any statistics on this that you can provide?
