Amazon's secret GitHub data grab

Amazon CEO Andy Jassy. Chelsea Jia Feng/BI
  • Amazon is trying to get around data-collection limits on Microsoft's GitHub.
  • Amazon wants GitHub metadata to train in-house AI models.
  • The company told employees its approach had been approved by Amazon lawyers.

To create powerful AI models, you need mountains of good data. Amazon is going to great lengths to collect this type of valuable information.

The company recently told employees to sign up for Microsoft's GitHub software-development platform and share their accounts so Amazon can scrape data from GitHub more quickly, Business Insider has learned.

This is a key step in Amazon's efforts to train its upcoming in-house AI model.

In an internal memo shared with employees last month, Amazon's Artificial General Intelligence Group wrote that it needed "quantitative and qualitative metadata from GitHub" for AI training purposes.

But there's a problem. A single GitHub account can make only 5,000 data-collection requests an hour. There are more than 150 million public data repositories on GitHub, so these account limitations mean scraping all this information would take too long, according to the memo.

To get around this, the Amazon AGI team is asking employees to create new GitHub accounts and share them with the company. Then, Amazon can run all these accounts simultaneously, reducing the time to collect data to just a "few weeks," according to the memo.
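
A back-of-the-envelope calculation shows why the limit matters. The sketch below uses the two figures cited from the memo (150 million repositories and 5,000 requests per hour per account). The number of requests needed per repository and the three-week target are illustrative assumptions, not figures from the memo.

```python
import math

REPOS = 150_000_000        # public repositories, as cited from the memo
REQUESTS_PER_HOUR = 5_000  # per-account request limit, as cited from the memo
HOURS_PER_YEAR = 24 * 365
HOURS_PER_WEEK = 24 * 7


def single_account_years(requests_per_repo: int) -> float:
    """Years one account would need at the hourly limit.

    requests_per_repo is an assumption: the memo does not say how many
    API calls each repository's metadata requires.
    """
    total_requests = REPOS * requests_per_repo
    return total_requests / REQUESTS_PER_HOUR / HOURS_PER_YEAR


def accounts_needed(requests_per_repo: int, target_weeks: float) -> int:
    """Accounts running in parallel needed to finish within target_weeks."""
    total_requests = REPOS * requests_per_repo
    budget_per_account = REQUESTS_PER_HOUR * HOURS_PER_WEEK * target_weeks
    return math.ceil(total_requests / budget_per_account)


if __name__ == "__main__":
    # Hypothetical scenarios: 1, 5, or 10 requests per repository,
    # with an assumed three-week collection window.
    for per_repo in (1, 5, 10):
        years = single_account_years(per_repo)
        accounts = accounts_needed(per_repo, target_weeks=3)
        print(f"{per_repo:>2} req/repo: {years:5.1f} years alone, "
              f"{accounts} accounts for 3 weeks")
```

Under these assumptions, one request per repository would keep a single account busy for about 3.4 years, and 10 requests per repository would take roughly 34 years. Finishing in three weeks would take about 60 accounts in the first case and close to 600 in the second.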

"Fetching all of this with a single account would take many years," the memo explained. "In order to increase the rate at which we can collect the metadata, we ask team members to create GitHub account[s] and share the API keys."

Rohit Prasad, the head scientist and senior vice president of Amazon's artificial-general-intelligence group. NurPhoto

Amazon's leadership is openly soliciting employee help with this workaround.

Rohit Prasad, Amazon's head scientist and senior vice president of the AGI group, encouraged employees to share their GitHub accounts to help "collect more high-quality code data for training our foundation models," according to an internal email from late May, titled "Help with data."

Another email from an Amazon AGI director urged employees: "It only takes 5 minutes!"

The episode highlights the rabid thirst for data among tech companies developing their own AI models. These models need lots of high-quality information to become more intelligent and human-like. There's a finite supply of this information, which is leading to a "war for data" among tech companies.

In Amazon's case, the company needs more data to train a yet-to-be-released AI model, internally dubbed its "most ambitious" AI project. Launching a new, more powerful AI model is important for Amazon, as the company is trying to catch up to its rivals Microsoft, Google, and Meta in the generative-AI space.

Alleged license violations

While the GitHub workaround will most likely speed up Amazon's AI-training process, it could raise ethical concerns over accessing data without appropriate permissions.

Microsoft is likely to be unhappy when it discovers that its archrival, Amazon, is leaning hard on GitHub for AI training data.

Even Microsoft itself is facing a lawsuit for allegedly violating license agreements when it used GitHub data to train its Copilot AI service.

"Amazon supports the protection of rightsholders and content creators, as well as established legal frameworks that facilitate the development of innovative and beneficial services," Amazon said in a statement. "Our LLMs are trained on data from a variety of sources, including licensed and proprietary data, open-source datasets, and publicly available data where appropriate. While this is an evolving area, we adhere to industry best practices around data collection to train our models."

The company also explained that it had created systems to "properly credit open-source developers if generated code suggestions are similar to their projects."

Spokespeople for GitHub and Microsoft didn't respond to requests for comment.

'Showing our hand'

In the internal memo, Amazon wrote that the GitHub workaround was approved by both the company's legal and security teams.

By following the guidelines, it said, Amazon was making sure to stay within GitHub's rate limits and avoid getting its accounts blocked.

The memo said that in terms of "showing our hand," Amazon's move "should not alarm anyone" because the company was working on multiple products at the same time.

For employees interested in helping, the memo said they should use an Amazon work email, not a personal account, to sign up for GitHub.

It also said Amazon employees should create a "classic personal token," not a "fine-grained personal token," when signing up. GitHub classic personal tokens give access to a broader set of code repositories, though they may be less secure, according to GitHub's website.

The Amazon instructions also said the tokens should be set to expire after one year and that no "scopes" should be selected, so that each token has only "read-only" access to public information.

Once they sign up, Amazon employees should copy and paste their personal access tokens into a shared company file, the memo added.
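
For readers unfamiliar with how such tokens work: a personal access token is sent in the Authorization header of each GitHub REST API request, and GitHub's /rate_limit endpoint reports how much of a token's hourly quota remains. The following is a minimal sketch of that mechanism, not Amazon's tooling. The function names and the placeholder token are hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_rate_limit_request(token: str) -> urllib.request.Request:
    """Build (without sending) a request for the token's rate-limit status."""
    return urllib.request.Request(
        f"{API_ROOT}/rate_limit",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def remaining_core_requests(payload: dict) -> int:
    """Read the remaining hourly REST quota from a /rate_limit response body."""
    return payload["resources"]["core"]["remaining"]


def fetch_remaining(token: str) -> int:
    """Send the request and return the remaining quota (needs network access)."""
    with urllib.request.urlopen(build_rate_limit_request(token)) as resp:
        return remaining_core_requests(json.load(resp))


if __name__ == "__main__":
    # "ghp_placeholder" is a made-up value, not a real token.
    req = build_rate_limit_request("ghp_placeholder")
    print(req.full_url)
```

Each token carries its own hourly quota, which is why running many tokens in parallel multiplies the total collection rate.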

'Most expansive' models

For Amazon, more data is crucial to its new AI model. Last year, Amazon CEO Andy Jassy wrote in an internal email that Prasad would lead the newly created AGI team, with the goal of building the "most expansive" large language models for the company. Prasad now reports directly to Jassy.

Amazon may be behind some of its AI competitors, which have spent years in a huge land grab for training data.

OpenAI, for example, has been striking a series of licensing deals with a long list of companies, including Reddit, Shutterstock, and News Corp, to use their content for AI-model training. Tech companies, hungry for even more training data, are also granting themselves new permissions to use a lot more of consumers' information.

Amazon's AGI team, meanwhile, has already gone through a major restructuring. In November, it laid off some of the employees who were working on Alexa-related projects, BI previously reported. At the time, Prasad also outlined six new focus areas for the AGI group, including foundational models and conversational assistant services.

A tricky position?

Though Amazon's legal team has approved the GitHub data-scraping workaround, the move could put Amazon in a tricky position.

In 2022, the programmer Matthew Butterick and the law firm Joseph Saveri filed a class-action lawsuit against Microsoft, alleging open-source license violations. Microsoft trained its Copilot AI service on publicly available code from GitHub without complying with the "underlying open-source licenses and other legal requirements," Joseph Saveri's website says.

While open-source code on GitHub is generally free to use, it comes with certain obligations, such as preserving accurate attribution of the source code, Butterick wrote on the website about the lawsuit. For Copilot, it's nearly impossible to credit the original source since it's built on billions of lines of code from GitHub, while Microsoft gets to sell it without giving back anything to the open-source community, he wrote.

"Like Neo plugged into the Matrix, or a cow on a farm, Copilot wants to convert us into nothing more than producers of a resource to be extracted (Well, until we can be disposed of entirely)," Butterick wrote. "And for what? Even the cows get food & shelter out of the deal. Copilot contributes nothing to our individual projects. And nothing to open source broadly."

Do you work at Amazon? Got a tip?

Contact the reporter, Eugene Kim, via the encrypted-messaging apps Signal or Telegram (+1-650-942-3061) or email (ekim@businessinsider.com). Reach out using a nonwork device. Check out Business Insider's source guide for other tips on sharing information securely.
