Show HN: Some blind hackers are bridging IRC to LMMs running locally (2mb.codes)

201 points by blindgeek 2 years ago | 40 comments

kgeist 2 years ago |

Interesting. I've also been running an IRC bot with multimodal capability for months now. It's not a real LMM but rather a combination of 3 models. It uses Llava for images and Whisper for audio. The pipeline is simple: if it finds a URL that looks like an image, it feeds it to Llava (the same goes for audio). Llava's response is injected back into the main LLM (a round robin of Solar 10.7B and Llama 13B) to produce the response in the style of the bot's character (persona) and in the context of the conversation. I run it locally on my RTX 3060 using llama.cpp. It can also search Wikipedia and the news (provided by Yahoo RSS), and can open HTML pages (if it sees a URL which is not an image or audio).
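A minimal sketch of the routing step described above, in Python. This is not the commenter's code: the extension lists, the function names (`classify_url`, `route_message`, `build_prompt`) and the prompt layout are all assumptions made for illustration. It only decides which handler a URL would go to and how a description might be folded back into the main LLM's prompt; the actual calls to Llava, Whisper and llama.cpp are left out.

```python
import posixpath
import re
from urllib.parse import urlparse

# Assumed extension lists; a real bot might check the Content-Type header instead.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
AUDIO_EXTS = {".mp3", ".wav", ".ogg", ".flac", ".m4a"}
URL_RE = re.compile(r"https?://\S+")


def classify_url(url: str) -> str:
    """Return 'image', 'audio' or 'html' based on the URL path's extension."""
    path = urlparse(url).path.lower()
    ext = posixpath.splitext(path)[1]
    if ext in IMAGE_EXTS:
        return "image"
    if ext in AUDIO_EXTS:
        return "audio"
    return "html"


def route_message(text: str) -> list[tuple[str, str]]:
    """Find every URL in a chat message and pair it with a handler name."""
    routed = []
    for raw in URL_RE.findall(text):
        url = raw.rstrip(".,;:!?)")  # drop punctuation that trails the URL in prose
        routed.append((classify_url(url), url))
    return routed


def build_prompt(persona: str, history: list[str], description: str) -> str:
    """Fold a vision/audio model's output into the prompt for the main LLM."""
    lines = [f"You are {persona}. Stay in character."]
    lines += history
    lines.append(f"[attachment description: {description}]")
    lines.append(f"{persona}:")
    return "\n".join(lines)
```

For example, `route_message("look at https://example.com/cat.png")` pairs the URL with the `"image"` handler; the text that handler returns would then be passed as `description` to `build_prompt`.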

Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images.

I made the bot just to explore how far I can go with local off-the-shelf LLMs; I never thought it could be useful for blind people. Interesting. A practical idea I had in mind was to hook it up to a webcam so that if something interesting happens in front of my house, I can be notified by the bot, for example. I guess it could also be useful for blind people if the camera is mounted on the body.

gs17 2 years ago | |

> Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images.

There's a creepypasta or an SCP entry in there. You do not recognize the people in the background.

dontupvoteme 2 years ago | | |

They're reverse vampires. Humans can only see their reflection.

jodrellblank 2 years ago | |

> "Llava is a surprisingly good model for its size. However, what I found is that it often hallucinates "2 people in the background" for many images."

Llamafile.exe[1] is Llava based, and I find it hallucinates handbags in images a lot. Asked to describe a random photo with "I cannot see. Please accurately and thoroughly describe this scene with 'just-the-facts' descriptions and without editorialising." it comes out with text that feels like an estate agent wrote it, often picking an imaginary handbag or two as a detail worth mentioning:

"The image depicts a street scene where three people are gathered around a white car parked near a building. One man is standing next to the vehicle, while another is holding a cell phone and talking to a woman in front of him. The third person is also nearby, participating in the conversation or observing the situation. The background showcases various elements such as a couple of handbags on the ground close to one of the individuals, as well as multiple chairs placed at different distances from each other. These objects further emphasize the social aspect of this outdoor gathering."

(Note there are "three people", made up of a man, another man, a woman, and a third person.) There were no handbags in that street scene of two tourist-looking people with a phone next to a car with a driver in it. Or:

"The scene depicts a group of people on the back of a boat, with a beautiful young woman riding in front. Several individuals are holding umbrellas above their heads as they enjoy the outing. The boat is located in shallow water close to shore, near brick buildings, possibly a hotel. A few chairs can be seen onboard along with several handbags carried by the passengers. Additionally, a couple of bottles and an orange are present in the scene, suggesting refreshments during the boating trip."

No handbags, chairs, bottles or orange were visible on the pleasure-trip boat going past. Or:

"Various vehicles can be spotted nearby, including cars parked or driving along the road, and a truck located further back in the scene. A handbag is also visible, possibly belonging to one of the shoppers at the market."

One woman off to the side was carrying a handbag with the strap diagonally across her body and the bag on her front. Possibly it belonged to her ... or possibly she nicked it?

"One person appears to be holding a backpack while standing with the rest of the group. A handbag can also be seen resting near another individual among the group."

Nope.

"a large number of pedestrians are walking up and down between shops and stores, likely engaging in various activities or running errands. Some people have handbags, which can be seen as they walk along the sidewalk."

Nobody visibly had a handbag.

It seems odd that it repeatedly picks out handbags as one of the few things worth describing. It's as if the training data contained lots of images tagged 'handbag' and that concept has survived into the small model.

See also the article and top comment in the discussion at [2], about fake photos in 1917; after running this query over and over on random pictures from my photo collection, I recognise the output style, and the template-feeling elements of it, much more now.

[1] https://github.com/Mozilla-Ocho/llamafile/

[2] https://news.ycombinator.com/item?id=19251755

codeofdusk 2 years ago |

I'm also totally blind and, somewhat relatedly, I've built Gptcmd, a small console app to ease GPT conversation and experimentation (see the readme for more on what it does, with inline demo). Version 2.0 will get GPT vision (image) support:

https://github.com/codeofdusk/gptcmd

simonw 2 years ago |

I had an interesting conversation the other day about how best to make ChatGPT style "streaming" interfaces accessible to screenreaders, where text updates as it streams in.

It's not easy! https://fedi.simonwillison.net/@simon/111836275974119220
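One idea that comes up for this problem is to buffer the streamed tokens and hand the screen reader whole sentences rather than fragments. The Python sketch below shows only that buffering step. It is an assumption for illustration, not something taken from the linked conversation; the name `sentences_from_stream` is made up here, and the naive sentence splitter will break on abbreviations such as "e.g. this".

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is known to be a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo wor", "ld. How ", "are you", "? Fine"]
    for sentence in sentences_from_stream(stream):
        print(sentence)  # each line could be handed to an announcer instead
```

The trade-off is latency: nothing is announced until a sentence finishes, so a long sentence is heard later than it would be seen.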

codeofdusk 2 years ago | |

I'm totally blind and built Gptcmd, a small console app to make interacting with GPT, manipulating context/conversations, etc. easier. Since it's just a console app, when streaming is enabled, output is seamlessly reported.

https://github.com/codeofdusk/gptcmd

jpsouth 2 years ago |

Hey! I don’t understand too much about AI/ML/LLMs (and now LMMs!) so hoping someone could explain a little further for me?

What I gather is this is an IRC bot/plugin/add-on that will allow a user to prompt an ‘LMM’ which is essentially an LLM with multiple output capabilities (text, audio, images etc) which on the surface sounds awesome.

How does an LMM benefit blind users over an LLM with voice capability? Is the addition of image/video just for accessibility to non-blind people?

What’s the difference between this and integrating an LLM with voice/image/video capability?

Is there any reason that this has been made over other available uncensored/free/local LLMs (aside from this being an LMM)?

Thanks in advance.

th0ma5 2 years ago |

Since there's no way to truly objectively tell if LLM output is correct, this seems like it would have its limits, even if it seems subjectively good, but I have that problem with all of the LLM stuff I guess.

DustinBrett 2 years ago |

You could run an LLM in the browser with WebLLM and then connect to IRC via WebSockets using something like KiwiIRC. Fully client side AI on IRC.
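Whichever transport carries the IRC traffic (a WebSocket gateway or a plain socket), a bridge has to do the same message handling: parse incoming lines, work out where to reply, and split the model's output to fit IRC's line limit. Below is a rough sketch of that part only, written in Python for illustration rather than as browser code. The names `IrcMessage`, `parse_irc_line`, `reply_target` and `format_privmsg` are hypothetical, and the 400-character default is an assumed safety margin under the protocol's 512-byte line limit (it counts characters, not bytes).

```python
import textwrap
from typing import NamedTuple, Optional


class IrcMessage(NamedTuple):
    prefix: Optional[str]  # e.g. "nick!user@host", or None if absent
    command: str           # e.g. "PRIVMSG"
    params: list           # middle parameters plus the trailing text, if any


def parse_irc_line(line: str) -> IrcMessage:
    """Parse one raw IRC line into prefix, command and parameters."""
    line = line.rstrip("\r\n")
    prefix = None
    if line.startswith(":"):
        prefix, _, line = line[1:].partition(" ")
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    if sep:
        parts.append(trailing)
    return IrcMessage(prefix, parts[0].upper(), parts[1:])


def reply_target(msg: IrcMessage, own_nick: str) -> str:
    """Reply in the channel, or to the sender if the message was private."""
    target = msg.params[0]
    if target.lower() == own_nick.lower() and msg.prefix:
        return msg.prefix.split("!", 1)[0]
    return target


def format_privmsg(target: str, text: str, max_len: int = 400) -> list[str]:
    """Wrap model output into one or more PRIVMSG lines ending in CRLF."""
    chunks = textwrap.wrap(text, width=max_len) or [""]
    return [f"PRIVMSG {target} :{chunk}\r\n" for chunk in chunks]
```

A bridge loop would call `parse_irc_line` on each incoming line, pass the trailing text of any `PRIVMSG` to the model, and send back the lines from `format_privmsg(reply_target(msg, nick), answer)`.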

BMSR 2 years ago |

Blind hackers really impress me. I also have an AI bot on IRC, but it uses OpenAI, which is fast, almost instant, but less impressive.

nathias 2 years ago |

I've been waiting 25 years for this

xpe 2 years ago |

If you didn't know... LMM = Large Multimodal Model

az09mugen 2 years ago | |

Thanks! I thought it was a typo. Your comment definitely deserves more upvotes.