“Reddit has long had a symbiotic relationship with the search engines of companies like Google and Microsoft. The search engines ‘crawl’ Reddit’s web pages in order to index information and make it available for search results. That crawling, or ‘scraping,’ isn’t always welcome by every site on the internet. But Reddit has benefited by appearing higher in search results. The dynamic is different with L.L.M.s — they gobble as much data as they can to create new A.I. systems like the chatbots. Reddit believes its data is particularly valuable because it is continuously updated. That newness and relevance, Mr. Huffman said, is what large language modeling algorithms need to produce the best results. ‘More than any other place on the internet, Reddit is a home for authentic conversation,’ Mr. Huffman said. ‘There’s a lot of stuff on the site that you’d only ever say in therapy, or A.A., or never at all.’”
“AI chatbots have exploded in popularity over the past four months, stunning the public with their awesome abilities, from writing sophisticated term papers to holding unnervingly lucid conversations. Chatbots cannot think like humans: They do not actually understand what they say. They can mimic human speech because the artificial intelligence that powers them has ingested a gargantuan amount of text, mostly scraped from the internet.
This text is the AI’s main source of information about the world as it is being built, and influences how it responds to users. If it aces the law school admissions test, for example, it’s probably because its training data included thousands of LSAT practice sites. Tech companies have grown secretive about what they feed the AI. So The Washington Post set out to analyze one of these data sets to fully reveal the types of proprietary, personal, and often offensive websites that go into an AI’s training data.”
“OpenAI, Google, and other companies building large-scale AI projects have traditionally paid nothing for much of their training data, scraping it from the web. But Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says.”
“Today, Stability AI released a new open-source language model, StableLM. The Alpha version of the model is available in 3 billion and 7 billion parameters, with 15 billion to 65 billion parameter models to follow. Developers can freely inspect, use, and adapt our StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA-4.0 license.”