Trained on billions of lines of open source code
According to the official description, the GitHub Copilot tool is powered by OpenAI's new AI system, OpenAI Codex, which is based on the GPT-3 natural language processing (NLP) model and can independently generate various forms of text.
The company claims that Copilot is “much more capable than GPT-3 in terms of code generation.” Copilot is trained on billions of lines of publicly available code uploaded to GitHub and other sites. Once installed in an integrated development environment (IDE), Copilot draws on this vast code base to analyze the code in context and offer AI-based suggestions, which the programmer can accept or reject. GitHub does not guarantee that the generated code will work, as Copilot does not test the code.
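To make the workflow concrete, here is a hypothetical illustration (not an official example) of the kind of completion described above: the programmer types a function signature and docstring, and the assistant proposes a body that can be accepted or rejected.

```python
from datetime import date

def days_between(start: str, end: str) -> int:
    """Return the number of days between two ISO-format dates."""
    # Everything below this line stands in for the kind of body a tool
    # like Copilot might suggest from the signature and docstring alone.
    return abs((date.fromisoformat(end) - date.fromisoformat(start)).days)
```

In practice the suggestion appears as ghost text in the editor, and pressing Tab accepts it; the point is that the tool completes whole bodies, not just the next token.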
After Copilot was released, some early users said, “Copilot guessed about a tenth of the code I wanted to write, and other times gave suggestions that were pretty good or completely inappropriate. But when Copilot guessed correctly, it felt like it was reading my mind. Even though I was the only one coding, it was really like pair programming. I write better code, documentation, and tests. Copilot has made me a better programmer.”
Nat Friedman, CEO of GitHub, also said that hundreds of developers at GitHub are currently using Copilot throughout the day while coding, and most are taking suggestions rather than turning off the feature.
Alongside the enthusiasm, some developers have left comments expressing concern that the feature will put programmers out of work, and some are starting to think about how it will change programming. For now, however, Copilot is still mainly positioned to provide code completion and suggestion functions similar to IntelliSense/IntelliCode.
Is Microsoft violating the open source license?
Although Copilot is liked by many developers, some of them have raised questions.
The first is an efficiency issue. Developer “thu2111”, who had previously tried two AI-driven autocompletion engines in Java and Kotlin, posted that he removed the plugins for two reasons: first, AI suggestions are usually not as useful as type-driven IDE autocompletion (using IntelliJ); second, AI plugins aggressively push their own suggestions to the top of the list, even when they are less helpful than the defaults.
The second, and more controversial, issue is whether Microsoft is violating open source license agreements. Part of the reason Copilot generates better code than GPT-3 is that it was trained on a large dataset of publicly available source code; GitHub alone holds terabytes of public code and English-language examples.
GitHub Copilot is now available as a Visual Studio Code extension. It is free for developers during the beta, but Microsoft will charge for it once it launches, and Microsoft says the service currently draws only on code stored in public repositories. So, does Microsoft have the right to put this open source code into a commercial product of its own?
According to GitHub staffer Albert Ziegler, for GitHub Copilot to remember a piece of code, it must look at that snippet often. Because each file is only shown to GitHub Copilot once, the snippet needs to exist in many different files in the public code.
Ziegler said that the 41 main snippets tested each appeared in at least 10 different files, and 35 of them appeared more than 100 times. During testing, when given an empty file, GitHub Copilot would suggest the text of the GNU General Public License, which it had seen more than 700,000 times during training.
Copilot test data, source: Albert Ziegler
Since the code generated by Copilot is not a verbatim copy of GPL code, developers cannot identify which project a given suggestion came from.
A defining feature of the GPL is copyleft: any work that incorporates GPL code must itself be released as open source under the GPL. That is, if a company's product contains even a line of code that pulls in a library function licensed under the GPL, the company must open source the entire product. For many developers, “laundering” pieces of GPL code by copying and pasting them into a commercial project would be a violation of the license.
Game developer eevee pointed out that copyright covers not only copying and pasting but also derivative works. Microsoft has acknowledged that GitHub Copilot was trained on a large amount of GPL code, and everything it knows is extracted from that code. “So it’s not clear to me why this is not a form of converting open source code into a commercial product.”
However, Thomas Dickerson, a PhD in computing at Brown University, questioned eevee’s point: Does this mean that anyone who has read even a single line of GPL code can no longer work on closed-source projects, because those are derivative works?
Zac Skalko argued that because Copilot's suggestions require the user's acceptance, Copilot is not the real “author”; the user is the real committer, and Copilot is therefore exempt from liability.
The developer dragonwriter argues that Microsoft is playing a word game: Microsoft did not claim to use an “open source corpus” but rather “public code,” because such use is “fair use” and not subject to copyright.
Previously, there has been controversy over whether the use of copyrighted works by AI for training is an infringement of copyright, and the industry has not yet reached a consensus.
In 2015, Xiaomi was publicly accused of violating the GPL v2 license by the smart device community XDA. Although Xiaomi’s MIUI is derived from Android and is licensed under the Apache 2.0 license, Android uses a GPL v2-compliant Linux kernel. According to the GPL v2 license, the modified source code must also be made public, and Xiaomi has modified the Linux kernel source code, but Xiaomi has not made the source code public. Although this did not lead to a lawsuit, it had a significant impact on the community and Xiaomi’s image.
According to Red Hat’s “State of Enterprise Open Source 2021 Report,” 90% of IT leaders use open source software in their organizations, and 79% say that the use of open source software in emerging technologies, such as edge computing, the Internet of Things, artificial intelligence and machine learning, will increase in the coming years.
As open source adoption increases, disputes between developers and users of open source projects continue.
In the past few years, some cloud vendors have used open source software in commercial products without giving anything back to the community to help sustain those projects, and many companies, including Redis Labs, MongoDB, Cockroach Labs and Confluent, have modified or changed their open source licenses to prevent the code from being used without compensation.
“They’re just trying to restrict users from offering the software as a separate service. The purpose of these new licenses is to continue to leverage the popularity of the software and source code to gain customers and exclude SaaS services that are based on the same code,” said Justin Colannino, developer policy and legal counsel at GitHub. To this day, the years-long friction between the open source camp and cloud computing platforms continues.
Any company, startup, or individual developer needs sufficient incentive to do meaningful open source work; otherwise, the open source ecosystem will be unsustainable.
This article is from WeChat public number: InfoQ (ID: infoqchina), author: Chu Xingjuan
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/microsoft-programming-ai-just-launched-into-controversy/