Show HN: No Trash Search (notrashsearch.github.io)
'No Trash Search' is very focussed on STEM and not "for daily use". It's surprisingly good when you're looking for certain kinds of information. Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites.
So back to what web search was in the 1990s, roughly: an index from a curated selection of sites.
(BTW, are there any good search engines these days that aren't indirectly using Google or Bing?)
Sure - the web is now a cesspool optimized for advertising and attention. The traditional search engines made a lot more sense at the dawn of the internet when it was more about discovery. Now, for the most part, it's closer to an information retrieval tool, where a finite list of established sites have the bulk of what one is looking for. It only makes sense to have a tool that lets one navigate the established, legit internet, and not have to deal with all the crap.
That doesn't mean there is no use case for Google as it is, but some more focused competition is a no-brainer.
The code for Gigablast is open-source, including the crawler.
I could be wrong but I do not think search.marginalia.eu nor wiby.me use Google or Bing.
The comment about "hundreds of millions" is interesting. Assume hypothetically that a search engine claimed to be searching millions of sites for a given query, but in truth it was only searching the 120 sites that it had determined answered this query (i.e., were the most popular answer sources) for the majority of users. How would a user verify that the search engine's claim about searching millions of sites was true? What if the search engine only allowed the user to retrieve a maximum of about 230 results, no matter how many sites it claimed to search?
I searched for “3 hole punch review” [1] here, and the results have zero relevancy.
First one is a Chinese cell phone company, second a Wikipedia page for an episode of the office, third a thesaurus page with synonyms for ‘colorful’ and fourth a link to the Wikipedia page of Yellow Submarine.
I can’t even imagine how you get there from “3 hole punch review”
One way to improve is a "bring your own list" feature, and the ability to include vetted lists. Maybe some kind of web of trust: if your friends have whitelisted a site, it is whitelisted for you too. If you find a problem with that site, you let your friend know to remove it. If they don't respond, you can remove that friend from your trusted person list (maybe they got hacked?). Then maybe you can 'follow' a few lists of famous trusted people (e.g. paulg etc.) to build up a bigger slice of the internet you can search.
A spammer will want to come in then and create something that white lists their spam sites, but they need to convince you to add their list! And when you see the spam you can just unfollow them. They can't succeed.
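To make the "bring your own list" idea concrete, here is a minimal sketch. Everything in it is hypothetical (the names `own_lists`, `follows` and `effective_whitelist` are made up for illustration), and it only follows direct friends, not friends of friends:

```python
def effective_whitelist(user, own_lists, follows):
    """Return the user's own sites plus the sites of everyone they follow directly."""
    sites = set(own_lists.get(user, ()))
    for friend in follows.get(user, ()):
        sites |= set(own_lists.get(friend, ()))
    return sites


# Hypothetical data: each person publishes a list, each person follows some lists.
own_lists = {
    "alice": {"developer.mozilla.org"},
    "bob": {"stackoverflow.com", "en.wikipedia.org"},
    "spammer": {"spam.example"},
}
follows = {"alice": {"bob", "spammer"}}

before = effective_whitelist("alice", own_lists, follows)

# Alice notices the spam and unfollows; the spam site drops out immediately.
follows["alice"].discard("spammer")
after = effective_whitelist("alice", own_lists, follows)
```

Because the whitelist is recomputed from the follow list, unfollowing is the only cleanup step needed in this sketch.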
The GitHub repo was third and had to be scrolled to.
Seems pretty trash to me.
Also search.marginalia.nu puts a smile on my face almost every time I use it :-)
(I should try Neeva, I keep hearing good things about it.)
One suggestion that I have is to remove w3schools.com from the whitelist. MDN is a much better source for information about web development.
Ads are added automatically by Google. The whole thing is little more than a wrapper around the 'Programmable Search Element Control API' which is an HTML element you can just insert into any site, like an iframe. Unfortunately this is the only way to make Programmable Search available at scale as the API is restricted to either 10 sites or 10K queries / day, even when paid!
There is a paid version for the HTML plugin, but that would leak the API key and so it wouldn't work as a business.
There is an option to get a share of the revenue generated by a search engine. Maybe it's time for me to figure out how that works.
I was thinking of making a hosted, ad free, customizable version where people upload their own keys. Not sure if people would like that.
As a side-note, it's super easy to remove ads with 1 line of CSS, but I wasn't sure how Google would feel about that so it's not in the online version. TamperMonkey is an extension that allows people to insert their own CSS on different websites. Hmm.
You can view all offerings in the docs [0].
[0] https://developers.google.com/custom-search/docs/overview#su...
Right now, looking at your allow-list config, it feels a bit custom to you, but if I had an easy way to limit search to the sites I myself know and trust, I could see how that would be useful.
I know I could probably pick it out of my browser's history UI & poke it into Google's Programmable Search UI, but that seems like a hassle and a half.
With caching, I think you might be able to reduce the load.
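A rough sketch of what I mean by caching, assuming you run your own backend in front of whatever does the actual search. `TTLCache` and `fake_search` are hypothetical names, and the one-minute lifetime is an arbitrary choice:

```python
import time


class TTLCache:
    """Remember results per normalized query for a limited time."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch      # function that performs the real search
        self._ttl = ttl_seconds  # how long a cached answer stays valid
        self._clock = clock
        self._entries = {}       # normalized query -> (timestamp, results)

    def get(self, query):
        # Normalize case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        results = self._fetch(key)
        self._entries[key] = (now, results)
        return results


calls = []


def fake_search(query):
    # Stand-in for the expensive upstream call.
    calls.append(query)
    return [f"result for {query}"]


cache = TTLCache(fake_search, ttl_seconds=60)
first = cache.get("Python  GIL")
second = cache.get("python gil")  # served from the cache, no second upstream call
```

Whether caching is allowed at all depends on the terms of the upstream search provider, so treat this as an illustration of the load argument only.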
Also, why is w3fools in the list? It's an awful site.
The poster did say it was mostly for STEM subjects though...
More importantly though, I think "Best smartphone 2021" is really a search that has been conditioned on the crap google gives back now. At best you might expect to find a "best smartphones" listicle or something.
This is just a whitelisted search, so in my 5 min playing with it, it looks like popular or consumer queries are more likely to just provide reddit or wikipedia links, while more technical searches land on SO or documentation sites.
I think with a little tuning, this approach is great. Given the modern internet and all the crap there is, a manual whitelist of sites that are actually legit is always going to be superior to an algorithmic approach.
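For anyone wondering what a whitelist check looks like in practice, here is a generic sketch. It is not how this site does it (there the filtering happens on Google's side); `is_whitelisted` and `filter_results` are hypothetical helpers and the URLs are made up:

```python
from urllib.parse import urlsplit


def is_whitelisted(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_results(urls, allowed_domains):
    """Keep only results whose host passes the whitelist."""
    return [u for u in urls if is_whitelisted(u, allowed_domains)]


allowed = {"stackoverflow.com", "wikipedia.org"}
results = [
    "https://stackoverflow.com/questions/1",
    "https://en.wikipedia.org/wiki/Web_search_engine",
    "https://notstackoverflow.com/questions/1",  # lookalike domain, rejected
    "https://clone.example/stackoverflow.com/1",  # allowed name only in the path, rejected
]
kept = filter_results(results, allowed)
```

Matching on the parsed hostname rather than on a substring of the URL is what keeps lookalike domains and clone sites out.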
The block of ads is still at the top of the results. This is not the "no trash search" I'm looking for.
As explained in my other comment, this website is a wrapper around google programmable search. The actual searching happens on Google servers, and I can see why people have problems with that. The code you see on the website is the same as the repo, though. It's actually hosted by GitHub! You can verify this by opening the web inspector in any browser or looking at the `.github.io` portion of the URL.
You can learn more about Programmable Search here: https://developers.google.com/custom-search/docs/overview. NoTrashSearch uses the 'Programmable Search Element Control API', which is documented here: https://developers.google.com/custom-search/docs/element and can be used with very little code!
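NoTrashSearch itself embeds the search element in the page, so none of the following is the site's code. If you would rather query a Programmable Search Engine from your own program, the separate Custom Search JSON API takes an API key (`key`), an engine ID (`cx`) and a query (`q`). A sketch that only builds the request URL, with placeholder credentials and a hypothetical helper name:

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query, start=1, num=10):
    """Build a Custom Search JSON API request URL (no network call is made)."""
    params = {
        "key": api_key,   # placeholder, use your own API key
        "cx": engine_id,  # placeholder, the ID of your search engine
        "q": query,
        "start": start,   # index of the first result to return
        "num": num,       # results per page
    }
    return f"{ENDPOINT}?{urlencode(params)}"


url = build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "wget set http proxy")
```

The quota and pricing limits mentioned elsewhere in this thread apply to this API, which is why the site uses the embedded element instead.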
Stupid question though: where is the list of whitelisted sites? Is that something you set up separately with Google? I scanned through the code and expected to find a list somewhere, but obviously you do it in a different way.
Nice SEO campaign ;)
If only I could get NTS to whitelist my domain name (myfirstnamelastname dot com), the Big-G has hated it seemingly since even before I acquired it > ten years ago, even though it's ad-free and totally benign. Good thing I mostly just host go pkgs with it and use it for my email.
p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).
There is a form [0] on the about page that allows people to suggest websites to add :)
> p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).
Thanks! I think this is gonna be disappointing from an engineering perspective, and certainly not article worthy :) As further explained in my other comments, the website is basically a wrapper around google programmable search [1] where I whitelisted a set of sites I found useful personally, plus some suggestions from other users. It's really easy to set up.
As to why, I will quote some other comments of mine:
"I built this website a couple of months ago because I was annoyed by how hard it was to find useful things on Google."
"to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). ... When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries."
Let me know if you have other questions!
[0] https://docs.google.com/forms/d/e/1FAIpQLSdf8lAoShQz7Wjl9h60...
[1] https://developers.google.com/custom-search/docs/overview
The site uses a whitelist of URLs to (attempt to) keep results relevant to science and programming. In the context in which I'm using this search engine, I have no interest in (reviews on) 3 hole punches. (That's not to say I never do, but in that case I'd use Google, Reddit, etc.) The fact that results don't show up here means that they also won't show up when I'm not looking for them, which is 100% of the time when I'm using this search engine. That's a plus for me personally.
Best case would be to have relevant results in a single search engine, but that's not what I intended when building this site.
Honestly I just created this search engine for myself to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). I think those subjects also appeal to the HN audience, that's why I shared it here. When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries.
That being said, I'm definitely interested in helping improve other people's search as well (reason I'm posting at all), so let me know if you have suggestions for sites to add!
* easy add to my filter list (like maybe a browser plugin so I can see that the current site isn't in my filter, but I can click a button and now it's in my filter & opposite for remove for when sites start to suck)
* stats on which sites I visit after searching
* aggregate bing+google filtered searches
* curated site lists for different topics, top 100, etc. Maybe like a temporary search using these sets such that I can try them without affecting my own filters. Maybe sharing lists w friends
* some sort of search anonymization/log deletion feature
* integration with browser search on desktop & mobile
* search flags like duckduckgo so I can easily switch filter sets by typing like /news or /nerdshit in the query
* integration with archive.ph & wayback machine
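The search-flag item from the list above could be sketched roughly like this. The filter set names and `parse_query` are hypothetical; the point is only that a `/name` token picks a site list and everything else stays in the query:

```python
# Hypothetical filter sets; each would map to its own list of allowed sites.
FILTER_SETS = {
    "stem": {"stackoverflow.com", "wikipedia.org"},
    "news": {"news.example"},
    "docs": {"developer.mozilla.org"},
}


def parse_query(raw, default="stem"):
    """Split a raw query into (filter set name, remaining search terms)."""
    chosen = default
    terms = []
    for word in raw.split():
        name = word[1:].lower()
        if word.startswith("/") and name in FILTER_SETS:
            chosen = name       # a known flag switches the filter set
        else:
            terms.append(word)  # anything else is part of the search
    return chosen, " ".join(terms)


flagged = parse_query("rust borrow checker /docs")
unflagged = parse_query("/etc/passwd permissions")
```

Only known set names are treated as flags, so a path like `/etc/passwd` in a technical query is left alone.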
If you want to be cynical, just do Bing/DDG searches over Tor, and scrape that into the cache. This is $0/1000 searches, though it obviously violates some ToS somewhere. Unless they want to block Tor, you should be good.
I created a pastebin with all sites at the time of writing this comment if that's helpful: https://pastebin.com/qLC0wQ0t. If you're looking to create your own search engine, go here: https://programmablesearchengine.google.com/cse/all.
> I will look into fine tuning it for my needs. It is an interesting approach to a very annoying problem.
If a premium version were available with a customizable whitelist, would you pay for that? The API is around $5 / 1000 searches so it would cost about the same.
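To give a feel for what that would mean per user, here is the arithmetic as a tiny sketch. The $5 per 1000 searches is the figure quoted above; the 50 searches per day and the 30-day month are assumptions picked for illustration:

```python
def monthly_cost_usd(searches_per_day, usd_per_1000=5.0, days=30):
    """Estimated monthly API cost for one user at a flat per-search price."""
    return searches_per_day * days * usd_per_1000 / 1000


# Assumed usage: 50 searches a day for 30 days is 1500 searches.
estimate = monthly_cost_usd(50)
```

At that assumed usage the estimate comes to $7.50 a month, before any hosting costs.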
Search for things specifically on those pages, by very specific phrases and such.
Of course you have to find them yourself first for that verification.
I can say having set up some very teeny tiny websites here and there that the googlebot is hooked up to a lot of stuff. I'm not even sure how it found a couple of them as quickly as it did. Things like "if someone adds an RSS feed to Feed.ly" seem to do the trick. None of them were sites trying to "hide" or anything and I expected them to be found eventually, but they got found much faster than I expected. Or maybe they just scan new domain registrations, though it seemed to me it wasn't that that triggered it.
A search engine can tell users that some large number of sites were searched at the time of the user's query and that some large number of results exist, but what if it does not allow the user to actually view all the results?
To put it another way, the question is not what Google has discovered about the www,^1 but what Google is willing to let the user search and retrieve. If retrieving the 963rd result for a common string is not allowed, then it is impossible for the user to verify that the site containing that result was searched when the user submitted her query. Even if the search produced a 963rd result, what difference does it make if the user cannot retrieve it? What is the point of the search engine locating the 963rd result if it never has to show this result to the user querying a common string?
1. What Google has discovered about the www^2 and what Google users are able to discover about the www through Google may be two different things.^3 Google has its own interests to pursue in the name of online advertising and these may conflict with users' interests. "Censorship" is one concept that often draws negative connotations but there are many more subtle forms of filtering and manipulation that are possible here, including unintentional ones.
2. The most important focus would be what is "popular".
3. Some users might care less about what is "popular". Such users would, by and large, be less interesting to an advertising company. Individual interests might become subverted in favour of "popular" interests, to the extent they conflict. An advertising company (that runs a search engine) will favour the larger audience.
Far better than Bing or Google. It's not obvious why theirs is so terrible, unless that product is not a moneymaker for them, in which case it explains everything.
Big Russian or Chinese software is even more out of the question than the GAFAMs (if they're big, they definitely have authorities messing with the results).
Hmm, what about Baltic or Ukrainian or Israeli search engines?
It completely boggles my mind that the useless GitHub and SO clones rank first page on Google. Do engineers at Google not use their own product?
If I am right, they played stupid games and won stupid prizes. More specifically, they have allowed rampant deletionism for years, so while I am fairly certain the questions and answers originated on Stack Overflow, it wouldn't surprise me if a good number of those aren't visible on Stack Overflow anymore, which would explain why the clones rank higher in Google.
Done right this would actually be a service.
Sadly, some of them seem to mix together various questions and answers on the same page to generate text matches for unusual queries.
Do you have an example search leading to a GitHub clone?
A few weeks/months ago, however, while I was trying to solve an issue with a colleague who would search using French keywords, I noticed that some websites featured on the first page of the Google results were off.
In short, they were machine-translated versions of Stack Overflow threads. And they would appear in most of the searches using French keywords.
Those websites also appeared rarely in my searches while I was using English keywords, but most of the time I never bothered opening them. But now I notice them every time.
Some examples: when searching for "wget set http proxy" on Google, the fourth result leads me to qastack.fr, and the ninth to it-swarm-fr.com. Both are websites featuring scraped and machine-translated threads from Stack Overflow.
When searching deliberately in French for "Eclipse CDT stdout ne s'affiche pas" ("Eclipse CDT stdout not displayed [in console]"), the first result leads me to askcodez.com and the fourth one to qastack.fr (askcodez is the same kind of site as the other two).
I have not stumbled upon GitHub clones yet, however.
GitMemory is probably the most well-known example; it's just a thin layer over the GitHub API with a completely garbage UI, yet it often ranks higher than GitHub itself.