I scraped all of OpenAI's Community Forum (julep-ai.github.io)
> Allowing a Q&A interface using these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P). Let's view some posts similar to this one complaining about function calling
That's indeed a great thing to surface, and that's exactly how the OpenAI forum selects the "Related Topics" to show at the end of every topic. We use embeddings for this feature, and the entire thing is open-source: https://github.com/discourse/discourse-ai/blob/main/lib/embe...
We also use embeddings for suggesting tags, categories, HyDE search and more. It's by far my favorite tech of this new AI/ML gen so far in terms of applicability.
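For readers who haven't used embeddings this way, here is a minimal sketch of the general idea behind "related topics": rank the other topics by cosine similarity to the current one. This is not the discourse-ai implementation; the function names, topic ids, and three-dimensional toy vectors are all made up for illustration, and real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_topics(query_id, embeddings, k=3):
    """Return the ids of the k topics most similar to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), topic_id)
        for topic_id, vector in embeddings.items()
        if topic_id != query_id  # a topic is not related to itself
    ]
    scored.sort(reverse=True)
    return [topic_id for _, topic_id in scored[:k]]


# Hypothetical toy data: hand-written vectors standing in for model output.
toy_embeddings = {
    "function-calling-bug": [0.9, 0.1, 0.0],
    "function-calling-args": [0.8, 0.2, 0.1],
    "billing-question": [0.0, 0.1, 0.9],
}

print(related_topics("function-calling-bug", toy_embeddings, k=1))
```

At forum scale you would not compare against every topic in a Python loop; a vector index in the database is the usual choice, but the ranking idea is the same.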
> Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive, neutral) and post_sentiment_score confidence score for each post.
We do the same, with even the same model, and conveniently show that information on the admin interface of the forum. Again all open source: https://github.com/discourse/discourse-ai/tree/main/lib/sent...
Disclaimer: I'm the tech lead on the AI parts of Discourse, the open source software that powers OpenAI's community forum.
Discourse has an AI plugin that admins can run on their community to generate their own sentiment analysis (among other things), though it's not quite as thorough as this write up! https://meta.discourse.org/t/discourse-ai-plugin/259214
We're always interested to see how public data can be used like this. It's something that can be a lot more difficult on closed platforms.
> Toxicity can scan both new posts and chat messages and classify them on a toxicity score across a variety of labels
Is that within the defined data processing purposes of all Discourse setups? Does the tool warn admins they might need to update their policies before being able to run this tool, perhaps needing to seek consent (depending on their jurisdiction and ethics)? It sounds somewhat objectionable, trying to guess my mental state from what I write without opt-in
Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?
> tagging NSFW image content in posts and chat messages
It's an optional plugin that can be enabled / disabled by the site admin. Those modules are all disabled by default, and each needs to be enabled by the site owner.
> Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?
Discourse PMs can be read by admins, see the definition here: https://meta.discourse.org/t/guidance-and-best-practices-on-...
Maybe I'm not looking thoroughly enough, so I may be wrong, tho!
I wonder how the community moderators would like it.
I believe the naming isn't perfect for this, but this was all automatic topic modelling!
Example: [1]: https://community.openai.com/t/read-this-before-posting-a-ne...
> Every Discourse Discussion returns data in JSON if you append .json to the URL.
then this:
> Raw data was gathered into a single JSONL file by automating a browser using Playwright.
Kinda seems to me like having a whole browser instance for this isn't necessary? I would have been surprised if this .json pattern didn't continue for all pages, and it turns out that it does in fact also work for the topic list: https://community.openai.com/latest.json
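A rough sketch of what skipping the browser could look like, using only the Python standard library. The helper names (`json_url`, `fetch_json`) and the User-Agent string are made up for illustration; the only thing taken from the thread is the "append .json to the URL" pattern. Rate limits and bot protection are not handled here, so treat it as a starting point rather than a scraper.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def json_url(page_url):
    """Append .json to the path of a page URL, keeping any query string intact."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_json(page_url, timeout=10):
    """Fetch the JSON version of a page. Makes a network request when called."""
    request = Request(
        json_url(page_url),
        headers={"User-Agent": "forum-json-sketch/0.1"},  # hypothetical client name
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(json_url("https://community.openai.com/latest"))
```

Calling `fetch_json("https://community.openai.com/latest")` would then request the same `latest.json` URL mentioned above and return the parsed response.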
The other place I've seen this sort of API pattern is reddit. For example, https://www.reddit.com/r/all.json or (randomly chosen) https://www.reddit.com/r/mildlyinfuriating/comments/1bqn3c0/...
(sorry, I think openai and sam are gross)
I see this sort of thing posted a lot (i.e., “it should be ClosedAI instead of OpenAI, lol”)
What if it just means “Open for Business” instead of “Open Access for All”? Or maybe they should just make it an acronym?
I’m sorry for the confusion on my part, but there’s just been a lot of words dedicated toward expressing frustration with the company because they chose to use “open” in their name.
Personally, I don’t find it frustrating that Apple doesn’t sell fruit and Intel doesn’t actually give intelligence data.
Isn’t a “community forum” like this basically just: “we’re not gonna spend money on providing adequate customer support so instead here is a forum where y’all can talk amongst yourselves and we’ll give you some badges and imaginary points for doing the customer support yourselves”?
OpenAI has a pretty active forum with moderators replying and helping out all the time.
OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.
Alas we have to fight against the machines in order to properly read the internet thru machines.
I believe Discourse knowingly keeps its data easy to scrape though, so kudos to them!
Cloudflare gives a shit.
My household had to use our 5G internet for most things for a week or two until our IP reputation recovered.
> # 1 Result: Python Packaging
Checks out
/s
This shows laypeople piling into a hype thing and running immediately into the roadblock of programming.
Normal people don't want to like, put in effort to feel like they are a part of something.
They are used to "just" having to turn on Netflix to feel like they are a part of the biggest TV show, or "just" having to click a button to buy a Stanley Cup, or "just" having to click a button to buy Bitcoin. The API and performance issues, IMO, they're not noise, but they are meaningless. To me this also signals how badly Grok and Stability are doing it, they are doubling and tripling down on popular opinions that have a strong, objective meaninglessness to them (like how fast the tokens come out and how much porn you're allowed to make). Whereas the Grok people are looking at this analysis and feeling very validated right now.
I have no dog in this race, but I would hope that the OpenAI people do not waste any time on Python APIs for dumb people; instead, they should definitely improve their store and have a firmer opinion on how that would look. They almost certainly have a developing opinion on a programming paradigm for chatbots, but I feel like they are hamstrung by needing to quantize their models to meet demand, not by decisions about the look and feel of Python APIs or the crappiness of the Python packaging ecosystem. Another POV is that the Apple development experience continues to be notoriously crappy, and yet they are the most valuable platform for most companies in the world right now; and also, JetBrains could not sustain an audience for the AppCode IDE, because everyone uses middlewares anyway; so I really don't think Python APIs matter as much as the community says they do. It's a Nice to Have, but it Does Not Matter.
This was more a slam on Python packaging in general than on the OpenAI implementation.
I wouldn't be surprised if many of the issues under this topic are more related to Python package version nightmares than to OpenAI's Python implementation itself.
Keen for your feedback, either here or email: alex@stainlessapi.com
I was pretty disappointed to see this, as I work on the Python package and was hoping for a good place to find feedback (apart from the github issues, which I monitor pretty closely).
I'm not a data scientist; maybe someone from the Julep team could comment on the labeling? Or how I could find some more specific themes of problems with the Python package? (Was it just that people who have a problem of some kind just happen to also use the Python library?)
Nomic Atlas automatically generates the labels here. There could be different variations of posts involving the Python Packages.
But I did some manual digging and here's what I found: heading over to the map and filtering by posts around "Python Packages" leads to around 900 posts.
Sharing a few examples which do talk about people's posts related to the python package:
- https://community.openai.com/p/701058
- https://community.openai.com/p/652075
- https://community.openai.com/t/32442
- https://community.openai.com/p/143928
Note: My intuition is that most of the posts are very basic, probably user errors like "No API Key Found" etc.
If you write an article and post it on your blog, people can't just come along and take the text verbatim
If you license your blog as public domain, then someone takes the content and does something objectionable with it, you can (in many countries) still make use of moral rights if you'd wish to correct the situation
If I post something publicly on a forum, I'm well aware I may have agreed or consented (depending on the forum) to terms that allow this type of processing, but that is not the default. There exist restrictions, both legally and morally (some legal ones are even called moral rights and are inalienable). Hence my question how this plugin handles extending the allowed data processing to cover taking the content and making automated decisions and claims that may or may not be accurate. I would not be comfortable with that being an automated behind-the-scenes process flagging my posts as good or bad towards the moderators, since they likely won't care to read back hundreds of comments and see whether the computer did a good job
Gone are the days when you simply saw all the important links on the main page, it seems. :)
> Moderators can read PMs that have an active flag.
This system is now setting nsfw flags in an automated fashion, specifically seeking out content that the persons involved wouldn't want others to see. Clearly a forum is the wrong place for that content, but people don't always make good decisions (especially kids; I was a kid on forums too and would be very surprised if nothing ever transpired there). The receiving person can already flag anything they deem inappropriate. A system making automated decisions about messages that were intended to be private creates problems and it is not clear to me who this serves
customers
https://openai.com/blog/introducing-openai
« We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies. »
They never give an explicit explanation for their name, but it's pretty obvious.
This doesn’t seem to be compatible with continuing to loftily call themselves by the same name as under the initial nonprofit mission.
As in prepare for the end... THE END OF HIGH PRICES!
> to benefit humanity as a whole, unconstrained by a need to generate financial return
1. typesafety (for those using pyright/mypy) and autocomplete/intellisense
2. auto-retry (w/ backoff, intelligently so w/ rate limits) and error handling
3. auto-pagination (can save a lot of code if you make list calls)
4. SSE parsing for streaming
5. (coming soon) richer streaming & function-calling helpers (can save / clean up a lot of code)
Not all of these matter to everybody (e.g., I imagine you're not moved by such benefits as "dot notation over dictionary access", which some devs might really like).
I would argue that auto-retry would benefit a pretty large percentage of users, though, especially since the 429 handling can paper over a lot of rate limits to the point that you never actually "feel" them. And spurious/temporary network connections or 500s also ~disappear.
For some simple use-cases, none of these would really matter, and I agree with you - especially if it's not production code and you don't use a type-aware editor.
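To make the auto-retry point concrete, here is a minimal sketch of retrying with exponential backoff on 429s and 5xx responses. It is not the openai package's actual implementation; `HTTPStatusError`, `call_with_retries`, the status set, and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time

# Assumption: statuses treated as temporary and worth retrying.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical error carrying an HTTP status and an optional Retry-After value in seconds."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"HTTP {status}")
        self.status = status
        self.retry_after = retry_after


def call_with_retries(send, max_retries=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call send() and retry on retryable statuses, waiting longer after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except HTTPStatusError as err:
            out_of_retries = attempt == max_retries
            if err.status not in RETRYABLE_STATUSES or out_of_retries:
                raise
            if err.retry_after is not None:
                # The server said how long to wait, so honor it.
                delay = err.retry_after
            else:
                # Exponential backoff with jitter so clients don't retry in lockstep.
                delay = min(max_delay, base_delay * 2 ** attempt)
                delay *= random.uniform(0.5, 1.0)
            sleep(delay)


# Usage: a fake request that is rate limited twice before succeeding.
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise HTTPStatusError(429, retry_after=0.01)
    return "ok"


print(call_with_retries(flaky_request))
```

The `sleep` parameter is injectable only so the waiting can be stubbed out in tests; in normal use the default `time.sleep` is what you want.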
(my company provides first-party clients with a lot of polish; maybe we could help)