Vana Plans to Allow Users to Rent Their Reddit Data to Train AI | TechCrunch

In the generative AI boom, data is the new oil. So why shouldn’t you be able to sell your own?

From large tech companies to startups, AI vendors are licensing e-books, images, videos, audio files and more from data brokers to build more capable (and more legally defensible) AI-powered products. Shutterstock has deals with Meta, Google, Amazon and Apple to provide millions of images for model training, while OpenAI has signed agreements with several news organizations to train its models on news archives.

In many cases, the individual creators and owners of this data have not seen a cent of the money changing hands. A startup called Vana wants to change that.

Anna Kazlauskas and Art Abal, who met in a course at the MIT Media Lab focused on developing technology for emerging markets, founded Vana in 2021. Before Vana, Kazlauskas studied computer science and economics at MIT, eventually leaving to launch a fintech automation startup, Iambiq, out of Y Combinator. Abal, a corporate lawyer by training, was an associate at The Cadmus Group, a Boston-based consulting firm, before leading impact sourcing at data annotation firm Appen.

With Vana, Kazlauskas and Abal set out to build a platform that allows users to “bundle” their data – including chats, voice recordings and photos – into datasets that can then be used for generative AI model training. They also want to create more personalized experiences—for example, a daily motivational voicemail based on your health goals or an art-generating app that understands your style preferences—by refining public models based on this data.

“Vana’s infrastructure actually creates a user-owned data treasure trove,” Kazlauskas told TechCrunch. “It does this by allowing users to aggregate their personal data in a non-custodial manner… Vana enables users to own AI models and use their data in AI applications.”

Here’s how Vana introduces its platform and API to developers:

The Vana API connects a user’s personal data across platforms… to allow you to personalize your application. Your app gets instant access to a user’s personalized AI model or underlying data, simplifying onboarding and eliminating concerns about computational costs… We believe that users should be able to bring their personal data from walled gardens like Instagram, Facebook and Google into your application, so that you can deliver an amazing personalized experience the first time a user interacts with your consumer AI application.

Creating an account with Vana is pretty easy. After you confirm your email, you can attach data to a digital avatar (e.g. selfies, a description of yourself, and voice recordings) and explore apps built on Vana’s platform and data sets. The app selection ranges from ChatGPT-style chatbots to interactive storybooks and a Hinge profile generator.

Image credit: Vana

Why, you may ask – in an age of increasing privacy awareness and ransomware attacks – would anyone ever hand their personal information to an anonymous startup, let alone a venture-backed one? (Vana has raised $20 million to date from Paradigm, Polychain Capital and other backers.) Can a for-profit company really be trusted not to misuse or mishandle the monetizable data that comes into its hands?

In response to this question, Kazlauskas emphasized that the purpose of Vana is for users to “take back control of their data.” Vana users have the option to self-host their data instead of storing it on Vana’s servers, and they control how their data is shared with apps and developers. She also argued that the company has no incentive to exploit users or the wealth of personal data they bring with them, because Vana makes money by charging users a monthly subscription (starting at $3.99) and by charging developers a “data transaction fee” (e.g. for transferring datasets for AI model training).

“We want to create models that are owned and governed by users who all contribute their data,” Kazlauskas said, “and allow users to take their data and models with them into any application.”

Now, while Vana doesn’t sell user data to companies for training generative AI models (or so it claims), it does want to let users do so themselves if they choose, starting with their Reddit posts.

This month, Vana launched what it calls the Reddit Data DAO (decentralized autonomous organization), a program that pools multiple users’ Reddit data (including their karma and post history) and lets them collectively decide how that combined data is used. After logging in with a Reddit account, submitting a request to Reddit for their data, and uploading that data to the DAO, users gain the right to vote alongside other members of the DAO on decisions such as licensing the combined data to generative AI companies for a shared profit.

It’s a response of sorts to Reddit’s recent moves to commercialize data on its platform.

Until recently, Reddit did not restrict access to posts and communities for generative AI training purposes. But late last year, ahead of its IPO, the company changed course. Since the policy change, Reddit has collected over $203 million in licensing fees from companies including Google.

“The broad idea [with the DAO is] to free user data from the big platforms that want to hoard and monetize it,” Kazlauskas said. “This is a first and part of our commitment to helping people combine their data into user-owned datasets to train AI models.”

Unsurprisingly, Reddit – which does not work with Vana in any official capacity – is not happy about the DAO.

Reddit has banned Vana’s subreddit dedicated to discussing the DAO. And a Reddit spokesperson accused Vana of “exploiting” its data export system, which is designed to comply with data protection regulations such as the GDPR and the California Consumer Privacy Act.

“Our data agreements allow us to place guardrails on such companies, even for public information,” the spokesperson told TechCrunch. “Reddit does not share non-public personal information with commercial companies, and when Reddit users request to have their data exported from us, they receive non-public personal information back from us in accordance with applicable law. Direct partnerships between Reddit and vetted organizations with clear terms and responsibilities are important, and these partnerships and agreements prevent misuse and abuse of people’s data.”

But does Reddit have any real reason to worry?

Kazlauskas expects the DAO to grow to the point where it will affect how much Reddit can charge its customers for its data. That’s a long way off, assuming it ever happens; the DAO has just over 141,000 members, a tiny fraction (roughly 0.2%) of Reddit’s 73 million users. And some of these members could be bots or duplicate accounts.

Then there is the question of how to fairly distribute the payments the DAO may receive from data buyers.

Currently, the DAO awards users “tokens” – a cryptocurrency – corresponding to their Reddit karma. But karma may not be the best measure of high-quality contributions to the dataset, especially in smaller Reddit communities with fewer opportunities to earn it.
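To make the fairness concern concrete, here is a minimal sketch of what a karma-weighted payout could look like. It is not Vana’s implementation: the pro rata rule, the `split_payment` function and the usernames are all hypothetical, chosen only to illustrate how a member of a small community could end up with a small share.

```python
def split_payment(payment_cents: int, karma_by_user: dict[str, int]) -> dict[str, int]:
    """Split a payment pro rata by karma (an assumed rule, not Vana's actual one)."""
    if any(k < 0 for k in karma_by_user.values()):
        raise ValueError("this sketch does not handle negative karma")
    total = sum(karma_by_user.values())
    if total == 0:
        raise ValueError("total karma must be positive")

    # Integer division keeps every share in whole cents.
    shares = {user: payment_cents * k // total for user, k in karma_by_user.items()}

    # Rounding down can leave a few cents unassigned; hand them out one at a
    # time, highest karma first (ties broken alphabetically).
    leftover = payment_cents - sum(shares.values())
    ranked = sorted(karma_by_user, key=lambda u: (-karma_by_user[u], u))
    for user in ranked[:leftover]:
        shares[user] += 1
    return shares


# Hypothetical members: a prolific poster in a huge subreddit, a specialist
# in a niche one, and an occasional commenter.
karma = {"big_sub_poster": 9000, "niche_expert": 900, "occasional": 100}
print(split_payment(100_000, karma))
```

Under this assumed rule, the niche specialist receives a tenth of what the high-volume poster does, regardless of how valuable their posts are as training data.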

Kazlauskas puts forward the idea that members of the DAO could choose to share their cross-platform and demographic data, potentially making the DAO more valuable and incentivizing sign-ups. To do this, users would have to trust even more that Vana handles their sensitive data responsibly.

Personally, I don’t think Vana’s DAO will reach critical mass. There are far too many obstacles standing in the way. However, I think it will not be the last grassroots attempt to assert control over the data that is increasingly being used to train generative AI models.

Startups like Spawning are working on ways to allow creators to set rules for how their data is used for training, while providers like Getty Images, Shutterstock and Adobe continue to experiment with compensation systems. But no one has cracked the code yet. Can it even be cracked? Given the cutthroat nature of the generative AI industry, this is certainly a tall order. But maybe someone will find a way – or policymakers will force one.
