I scraped all of OpenAI's Community Forum

I scraped all of OpenAI's Community Forum (julep-ai.github.io)

310 points by alt-glitch 2 years ago | 59 comments

xfalcox 2 years ago |

That's super cool, thanks for sharing! I will share this as an easy-to-follow example of what we can do with AI.

> Allowing a Q&A interface using these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P). Let's view some posts similar to this one complaining about function calling

That's indeed a great thing to surface, and that's exactly how the OpenAI forum selects the "Related Topics" to show at the end of every topic. We use embeddings for this feature, and the entire thing is open-source: https://github.com/discourse/discourse-ai/blob/main/lib/embe...

We also use embeddings for suggesting tags, categories, HyDE search and more. It's by far my favorite tech of this new AI/ML gen so far in terms of applicability.
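For readers new to the idea, here is a minimal sketch of picking "related topics" by embedding similarity. It is not the Discourse implementation: the function names `cosine_similarity` and `related_topics`, the topic slugs, and the three-dimensional toy vectors are all made up for illustration, and a real setup would get its vectors from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_topics(target_id, embeddings, k=3):
    """Return the ids of the k topics whose embeddings are closest to the target's."""
    target = embeddings[target_id]
    scored = [
        (cosine_similarity(target, vector), topic_id)
        for topic_id, vector in embeddings.items()
        if topic_id != target_id  # never suggest the topic itself
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [topic_id for _, topic_id in scored[:k]]


# Hypothetical toy embeddings; real ones have hundreds or thousands of dimensions.
toy_embeddings = {
    "function-calling-broken": [0.9, 0.1, 0.0],
    "function-calling-returns-bad-json": [0.8, 0.2, 0.1],
    "billing-question": [0.0, 0.1, 0.9],
    "rate-limit-errors": [0.1, 0.9, 0.2],
}

print(related_topics("function-calling-broken", toy_embeddings, k=2))
```

A brute-force scan like this is fine for a handful of topics; a forum-sized corpus would normally use a vector index instead.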

> Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive, neutral) and post_sentiment_score confidence score for each post.

We do the same, with even the same model, and conveniently show that information on the admin interface of the forum. Again all open source: https://github.com/discourse/discourse-ai/tree/main/lib/sent...

Disclaimer: I'm the tech lead on the AI parts of Discourse, the open source software that powers OpenAI's community forum.

wavyknife 2 years ago |

(disclaimer: I work for Discourse)

Discourse has an AI plugin that admins can run on their community to generate their own sentiment analysis (among other things), though it's not quite as thorough as this write up! https://meta.discourse.org/t/discourse-ai-plugin/259214

We're always interested to see how public data can be used like this. It's something that can be a lot more difficult on closed platforms.

Aachen 2 years ago | |

> helps you keep tabs on your community by analyzing posts and providing sentiment and emotional scores to give you an overall sense of your community for any period of time [...]

> Toxicity can scan both new posts and chat messages and classify them on a toxicity score across a variety of labels

Is that within the defined data processing purposes of all Discourse setups? Does the tool warn admins they might need to update their policies before being able to run this tool, perhaps needing to seek consent (depending on their jurisdiction and ethics)? It sounds somewhat objectionable, trying to guess my mental state from what I write without opt-in

Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?

> tagging NSFW image content in posts and chat messages

eddd-ddde 2 years ago | | |

I don't think there's anything left for you to consent once you decide to post on a public forum. If I can read your post and guess your mental state so can any other bot.

wavyknife 2 years ago | | |

Discourse is not a centralized platform, so it's up to individual sites to ensure they're compliant with data and privacy regulations.

BadHumans 2 years ago | | |

More companies and communities than you think already do this without your knowledge let alone consent.

xfalcox 2 years ago | | |

> Is that within the defined data processing purposes of all Discourse setups?

It's an optional plugin that can be enabled / disabled by the site admin. Those modules are all disabled by default, and each needs to be enabled by the site owner.

> Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?

Discourse PMs can be read by admins, see the definition here: https://meta.discourse.org/t/guidance-and-best-practices-on-...

SunlitCat 2 years ago |

I didn't even know they had community forums. Looking at the main homepage (openai.com), the only external links I can find are to chatgpt and their docs hosted on platform.openai.com. The other links lead to their socials, github and soundcloud (of all places).

Maybe I'm not looking thoroughly enough, so I may be wrong, tho!

hughesjj 2 years ago | |

I would also love to see these forums both to post and to lurk

djantje 2 years ago | | |

https://community.openai.com/ (when you are logged in on platform.openai.com, there is a link from the menu)

miduil 2 years ago |

That's an interesting write-up, I wonder how this would look for other big Discourse communities such as NixOS.

alt-glitch 2 years ago | |

This is definitely a workflow we can package into something open-source.

I wonder how the community moderators would like it.

dcreater 2 years ago | | |

I for one would love it!

klooney 2 years ago |

What's the "Day Knowledge Direction" cluster in the Atlas view?

alt-glitch 2 years ago | |

Neat find! That's actually a cluster of all the system messages notifying users about closing and re-opening of the thread. That's why they're so tightly clustered.

I believe the naming isn't perfect for this, but this was all automatic topic modelling!

Example: [1]: https://community.openai.com/t/read-this-before-posting-a-ne...

fzysingularity 2 years ago |

So epic, thank you for making this dataset available to everyone!

alright2565 2 years ago |

I saw this part:

> Every Discourse Discussion returns data in JSON if you append .json to the URL.

then this:

> Raw data was gathered into a single JSONL file by automating a browser using Playwright.

Kinda seems to me like having a whole browser instance for this isn't necessary? I would have been surprised if this .json pattern didn't continue for all pages, and it turns out that it does in fact also work for the topic list: https://community.openai.com/latest.json

The other place I've seen this sort of API pattern is reddit. For example, https://www.reddit.com/r/all.json or (randomly chosen) https://www.reddit.com/r/mildlyinfuriating/comments/1bqn3c0/...
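A rough sketch of the browser-free approach, using only the Python standard library. The helper names (`json_url`, `parse_topic_list`, `fetch_json`) and the User-Agent string are hypothetical, and the `topic_list` / `topics` / `id` / `slug` / `title` keys are an assumption about the response shape that should be checked against a real response before relying on it.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "https://community.openai.com"


def json_url(path):
    """Append .json to a forum path, keeping any query string after it."""
    path, sep, query = path.partition("?")
    return f"{BASE_URL}{path.rstrip('/')}.json{sep}{query}"


def parse_topic_list(payload):
    """Pull (id, slug, title) tuples out of a topic-list payload.

    Assumes the payload looks like {"topic_list": {"topics": [...]}}.
    """
    topics = payload.get("topic_list", {}).get("topics", [])
    return [(topic["id"], topic["slug"], topic["title"]) for topic in topics]


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body. Performs a real network request."""
    request = Request(url, headers={"User-Agent": "forum-research-script"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Something like `parse_topic_list(fetch_json(json_url("/latest")))` would then list the newest topics. A polite scraper would also sleep between requests and respect the site's rate limits.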

velid0 2 years ago |

Now train a gpt based on the data :D

testfrequency 2 years ago | |

But make sure to call it ClosedData or something so we know it’s not open source

(sorry, I think openai and sam are gross)

davely 2 years ago | | |

Maybe I don’t understand this sentiment, but are people really that hung up on the name?

I see this sort of thing posted a lot (i.e., “it should be ClosedAI instead of OpenAI, lol”)

What if it just means “Open for Business” instead of “Open Access for All”? Or maybe they should just make it an acronym?

I’m sorry for the confusion on my part, but there’s just been a lot of words dedicated toward expressing frustration with the company because they chose to use “open” in their name.

Personally, I don’t find it frustrating that Apple doesn’t sell fruit and Intel doesn’t actually give intelligence data.

garyiskidding 2 years ago |

This is really amazing. Pretty insightful. Thank you.

xandrius 2 years ago |

Love it, just for the sole reason of turning something OpenAI made into a dataset for everyone else :D

codetrotter 2 years ago | |

I don’t think OpenAI are gonna lose any sleep over this.

Isn’t a “community forum” like this basically just: “we’re not gonna spend money on providing adequate customer support so instead here is a forum where y’all can talk amongst yourselves and we’ll give you some badges and imaginary points for doing the customer support yourselves”?

alt-glitch 2 years ago | | |

I believe a community forum is absolutely vital for an "ecosystem" company. There needs to be a town square where people can discuss ideas and share feedback about that particular ecosystem.

OpenAI has a pretty active forum with moderators replying and helping out all the time.

solardev 2 years ago | | |

They probably just sic a customer service GPT on it and use it to train the other ones...

dorkwood 2 years ago |

I did a bit of data scraping for fun in the past, but I was never quite sure of the legality of what I was doing. What if I was breaking some law in some jurisdiction of some country? Was someone going to track me down and punish me?

OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.

alt-glitch 2 years ago | |

We were really heading someplace with The Semantic Web aka The Real Web 3.0 [1]

Alas we have to fight against the machines in order to properly read the internet thru machines.

I believe Discourse knowingly keeps its data easy to scrape though, so kudos to them!

[1]: https://en.wikipedia.org/wiki/Semantic_Web

bsuvc 2 years ago | |

> OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.

Cloudflare gives a shit.

My household had to use our 5G internet for most things for a week or two until our IP reputation recovered.

stoorafa 2 years ago | | |

Yeah it’s probably worth renting a server if there’s any doubt about whether it’s wholly appropriate to do something

ifyoubuildit 2 years ago | |

Do you think it would be better if someone did track you down and punish you? Which world do you want to live in?

n0sleep 2 years ago | | |

I think large companies should be punished for stealing from people to make themselves richer.

EcommerceFlow 2 years ago | |

A precursor to this would have been that LinkedIn lawsuit Microsoft lost, allowing that one company to scrape all of LinkedIn (technically "public information").

htrp 2 years ago | | |

hiQ Labs v. LinkedIn

enonimal 2 years ago |

> Number of Posts with negative sentiment, grouped by Topic

> # 1 Result: Python Packaging

Checks out

throwaway98797 2 years ago |

did they have the right to use all their data?