Shirt Without Stripes(github.com) |
Search results could be better? Sure.
Can we find adversarial examples? Almost always.
This just suggests to me that real humans haven't issued that type of search query enough for the AI to know what to do with it. Which wouldn't be so big of a problem.
You have to know to search for "solid colored shirt", but when you can't think of this variation of search, or maybe there isn't one, exclusion is your only option, and it's broken.
On Amazon's side of things I would also include the obnoxious "Hey you just bought a pair of sneakers so now I will change all your recommendations to sneakers".
If it's meaningful for some reason, then it works:
https://www.google.com/search?q=woman+without+makeup&tbm=isc...
If it's a user error (like a dumb query) it fails and it shouldn't be a surprise:
But there probably aren’t many images labelled as stripeless.
I’m not sure why BERT doesn’t try shirt -stripes.
=> "shirt -stripes" works pretty well on google at least
We are in a funny place with UIs.
"shirt without sleeves"
That's something that someone may actually search for. (At least the guys at my gym would!) And Amazon gets it mostly wrong.
Just my theory.
Nobody would describe a plain shirt as a shirt without stripes unless it’s within that context.
"Shirt without stripes" is harder to imagine it meaning "strips on the side. I'll add them myself".
Language is ever so complicated and contextual.
These days either my searches have become better or their interpretation/context has, as I rarely have to use that anymore.
I wouldn't say so. For the five images at the top of my search results, 3 of the 5 are striped, and 1 is plaid.
Not that there aren't a couple of striped shirts in there, but nothing like otherwise. So I mean, there is an extremely apparent difference for me, even though there is not the perfect result you might wish for.
Might your search history be so that you're so contrarian that Google suggests contrarian results? :D
The results, of course, show shirts with some kind of stripe, albeit not as prominent as in the English one.
Google can do this now, for example in a prototype. The tough thing is to get it to consumer-grade quality without messing up other searches. The QA process is utterly brutal because one weird search can be a scandal.
Or does input need to have basic filters applied before handing to ML? "without X" or "no X" = "-X"? Can be foiled with "shirt without having stripes".
But make a script that scrapes the top X results for these sites. Get your own AI / humans to rate it.
Make it competitive for these large sites <==> give them an incentive.
The real question: is “shirts without stripes” really a query people enter? Or representative of a real pattern in the data?
So it's not such a big deal that negation doesn't work.
Also, "shirts -stripes" does seem to work in both Amazon and Google. Or at least, I see no striped shirts.
> in particular, it shows clear insensitivity to the contextual impacts of negation.
As in, "X without Y" sounds like a common enough use case to have its own little parser branch in places as big as Google or Amazon.
So it's essentially the same input, and essentially the same expected output, but there must be quite a knot between understanding the word "without" and literally just using the - operator.
https://www.amazon.ca/s?k=shirt+without+stripes&ref=nb_sb_no...
shirt -stripes
> "Am I going crazy or is it the world around me!?"Fishbone - Drunk Skitzo https://youtu.be/SaPGH4Yd_zc?t=231
(Apologies for the snarky low-content flip reply.)
Citation needed.
As far as my personal observations go, Google is NOT optimized for long tail at all. It is always trying to return most popular results from cache of most popular results. Once the cache is exhausted, Google starts to return completely irrelevant trash (anything after first two pages of search is pure spam and meaningless keyword soup).
If you try to look up some obscure keyword and find nothing, try again after a couple of months. There is a very high likelihood that you will see dozens of "new" results, most of them from pages that are several years old. Perhaps the actual long-tail searches still happen somewhere in the background, but you are not going to see their output right away; instead you need to wait until they get committed to the nearby cache.
Another alarming change, that happened relatively recently (4-5 years ago), is tendency to increase number of results at expense of match precision. A long time ago Google actually returned exact results when you quoted search phrase. Then they started to ignore quotes. Then they started to ignore some of search terms, if doing so results in greater number of results. Finally, Google gained horrifying ability to ignore MOST of search terms. OP's example probably has the same cause — Google's NLP knows the meaning of word "without". But Alphabet Inc. can't afford to hose all those websites, that use AdWords to sell you STRIPED SHIRTS. This would mean a loss of money! THE LOSS OF MONEY!!!
And since most search applications are basically just finding you the results with the most keyword matches, with a little bit of extra magic thrown in, the above is basically what you see.
These queries are basically the equivalent of optical illusions in cognitive psychology when studying the visual system -- seeing how the systems break tells you a lot about how they work.
Have you ever seen shirt dresses in your dress shirt queries, or vice versa? The search application isn't caring enough about bigrams and compound words.
Have you ever seen bowls in your bowling queries or fish in your fishing queries? The search application is over-stemming.
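To make the over-stemming point concrete, here is a minimal Python sketch. The suffix list and the `naive_stem`, `index_terms`, and `matches` helpers are hypothetical and deliberately crude; no real search engine is claimed to work this way.

```python
def naive_stem(word):
    """Strip a few common English suffixes (deliberately too aggressive)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split on whitespace, and stem every token."""
    return {naive_stem(token) for token in text.lower().split()}


def matches(query, title):
    """A title matches when every stemmed query term appears in it."""
    return index_terms(query) <= index_terms(title)


# "bowling" and "bowls" collapse to the same stem, so a bowl listing
# matches a bowling query; the same happens with "fishing" and "fish".
print(matches("bowling", "ceramic bowls"))
print(matches("fishing", "fresh fish"))
```

Both calls report a match, which is exactly the failure described: the stemmer has thrown away the distinction the searcher cared about.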
Natural language search is a real pain on any general purpose search application, particularly ones that have to deal with titles. The obvious simple fix to this query is to rewrite [x without y] to [x -y], but then when someone goes to search for [a day without rain] or [a year without summer], you are going to totally break those queries.
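A rough Python sketch of that "obvious simple fix" and how it breaks. The `KNOWN_PHRASES` allowlist, the regular expression, and the `rewrite` function are all assumptions made for illustration, not anything a real engine is known to use.

```python
import re

# Hypothetical allowlist of queries where "without" is part of a name or title.
KNOWN_PHRASES = {"doctors without borders", "men without hats"}

# Matches "<anything> without <single word>" at the end of the query.
WITHOUT_RE = re.compile(r"^(?P<head>.+?)\s+without\s+(?P<excluded>\w+)$")


def rewrite(query):
    """Rewrite [x without y] to [x -y] unless the query is a known phrase."""
    normalized = " ".join(query.lower().split())
    if normalized in KNOWN_PHRASES:
        # Treat the whole thing as an exact phrase instead of negating.
        return '"' + normalized + '"'
    match = WITHOUT_RE.match(normalized)
    if match:
        return match.group("head") + " -" + match.group("excluded")
    return normalized


for q in ["shirt without stripes", "Doctors Without Borders",
          "a day without rain", "shirt without having stripes"]:
    print(q, "=>", rewrite(q))
```

The first query becomes `shirt -stripes` and the allowlisted name is protected, but `a day without rain` is turned into `a day -rain` because nobody listed it, and a slightly longer phrasing such as "shirt without having stripes" slips past the pattern untouched. An allowlist only covers the titles someone thought of in advance.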
To be honest I wish they actually were keyword searches and that the machine doesn't try to be smarter than you. Many times when I carefully specify which keywords must appear on the page, it'll ignore parts of the query or add unrelated synonyms. Usually one can work around it with operators but it's tedious and doesn't work reliably.
That is, learn to structure your query in a way that Google understands what you're trying to say. This used to be what you had to do, but now that Google tries to understand the intent of what you're trying to say and advertises as such, it's clearly a hack.
shirt -stripes
It would show pictures of shirts in pages that don't mention the word "stripes", whether the shirts have stripes on them or not.
In other words, it has little to do with what the article wants to show...
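A small Python sketch of the distinction being drawn here, assuming a toy index where exclusion is applied to page text only. The `Page` class, its `image_has_stripes` field, and the sample pages are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    image_has_stripes: bool  # ground truth that the text-only filter never sees


def search(pages, required, excluded):
    """Keep pages whose text contains `required` and does not contain `excluded`."""
    hits = []
    for page in pages:
        words = set(page.text.lower().split())
        if required in words and excluded not in words:
            hits.append(page)
    return hits


pages = [
    Page("blue oxford shirt", image_has_stripes=False),
    Page("classic breton shirt", image_has_stripes=True),
    Page("navy shirt with stripes", image_has_stripes=True),
    Page("plain shirt no stripes", image_has_stripes=False),
]

results = search(pages, required="shirt", excluded="stripes")
for page in results:
    print(page.text, "| striped image:", page.image_has_stripes)
```

The striped Breton shirt is returned because its page never uses the word, while the plain shirt whose description says "no stripes" is dropped. The operator filters on vocabulary, not on what is in the picture.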
I've run into so many variations of this. You can search for something only to have the recommendation/related results embedded on whatever page to throw off your results one way or another.
I genuinely think that whatever standard HTML/XHTML is at ought to have, either as an attribute or a semantic tag, some kind of "related" or "recommended" ability to set that content apart. My cynical thought is that even if it were adopted, it would probably get abused in some fashion.
What about a search query for “Doctors without Borders“ or “Men Without Hats“?
Surely interpreting “without“ as the negative operator would ruin those searches.
The author's intent exceeds both the capabilities and intended use case of search engines.
The query "shirts without stripes" if interpreted by human would require any search system to not only analyse the keywords and tags (of the products/images), but also the content, which is an infeasible task given its dynamic nature.
So the author wants: select all shirts where content analysis of returned images yields no stripes.
This is a context-sensitive image/product search based on arbitrary, dynamically created criteria and shows that user isn't aware of what the search functionality does as opposed to exposing weak "AI". [edit]To clarify: you cannot add all possible keywords/criteria in advance[/edit]
Achilles: Who could deny it?
Tortoise: Good. Likewise, “Cast-iron sinks” is a valid utterance, isn’t it?
Achilles: Indubitably.
Tortoise: Then, putting them together, we get “Politicians lie in cast iron sinks”. Now that’s not the case, is it?
---- Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books, 1979
"Vaguely similar to a joke from _the movie_ Ninotchka that _the Slovenian philosopher_ Zizek often uses...."
Give people context. Don't assume people know what you know.
As an aside, I think that there are much more important factors to consider in regard to clarity and readability. I believe some of those are accurate articulation and logical tractability of ideas communicated.
The difference between coffee without cream and coffee without milk is the same whether Ninotchka and Zizek are the things described above, or if they are a city and a taxi driver.
If Google wants to group words by semantics, they should have a semantical grouping operator. For example "shirts (without stripes)". What if I am looking for a song text with these exact words in random positions?
If what author wants was implemented, it would make my experience with Google even worse, unless it could think for me also. But then why would it need me in the first place?
You’d be surprised how effective NLP is for use when identifying query intent, and pulling out modifiers that should apply as metadata filters.
Weighted keyword search works a lot, but it fails hard for many long tail queries (especially in e-commerce and other attribute heavy domains).
IMO there really isn’t a good excuse for these firms to fail at queries like this. The query itself isn’t particularly difficult when using a decent NLP stack and following well known practices.
So NLP is totally a thing you want to have in search. Arguably, it's the whole point of search as it exists now.
"Shirts"
"Polka dot shirts"
"Floral shirts"
"Wikipedia list of clothing patterns"
"Houndstooth shirts"
If you go to Google's homepage and click the microphone at the end of the search input box you can search by speaking. All it does is convert speech to text, but it implies you might be able to search in a more "natural language" way.
Google have a blog post from October last year with some more complex examples of where more sophisticated NLP helps https://www.blog.google/products/search/search-language-unde...
After that I facepalmed myself and turned it off.
Normal humans do this all the time, and if I can't do it speaking to it becomes incredibly frustrating to the point that I never want to do it again. I don't want to plan ahead what to say before I say it.
Granted, it's been a couple of years since I last tried so maybe they're better now.
Larry: "I saw a red Lamborghini in the parking lot!"
Most people will assume Lisa is driving a red Lamborghini and back from Vacation, meanwhile, all the bots are searching for Lamborghini vacations and trying to figure out what's going on in the conversation.
"shirts -stripes" results: https://www.amazon.com/s?k=shirts+-stripes&ref=nb_sb_noss_2
So basically the AI doesn't convert "without x" to "-x" even though the basic capability needed is there. This is why AI is a hard problem, especially when it meets the real world.
It's 2020 and we're still quibbling about the terminology used in SQL, what did we expect?
The state of the art in machine translation (from what I've read at least) is translating from language-A to a language-less "concept space" and then from there to language-B. Could that be done where the output language is something a search engine can use to find what you want correctly?
Given that pattern, I suspect we could see much better results in cases like this.
Today, entering in any tech-related query at all takes you to StackOverflow, end of story. Not only are SO answers quite often outdated (or even terrible advice in general), most of the time I'm not looking for a "here's how you do X", I'm looking for background information on a topic.
Most non-tech queries I put into google are even _more_ useless as the results tend to fall into these categories:
* Wikipedia (okay for _very_ general things, useless for domain-specific knowledge)
* SEO-enhanced blogspam, (a.k.a. "8 Weird Ways to Earn Millions Through Gaming The System!")
* Tweets on twitter (!)
The dev/tech industry desperately needs a search engine that somehow prioritizes _quality_ content, not one-off answers, blogspam, and tweets.
At one point there was a TED talk explaining social networks were a solved problem now that facebook was dominant. Recycling was seen as a solved problem until it wasn't, etc.
I wonder how many actual "solved problems" we have.
"birds without flight"
"cars without wheels"
"cats without tails"
"dogs without hair"
"intersections without lights"
"poems without rhyme"
"shirts without collars" (also "sleeves", "shoulders", "buttons", "logos", "pockets", and more)
I'd also say nobody uses the phrase "birds without flight", but instead "flightless birds".
I'd imagine people say "hairless dogs" a lot more often than "dogs without hair".
And the shirt examples show that there are many uses of the lead phrase "shirts without" that work just fine, but for some reason "stripes" really stands out far beyond the others. "Shirts without shoulders" is kind of a bizarre term to match, when almost all search results show "off the shoulder" or "cold shoulder" as the thing being matched.
What if you walked into a store and asked an associate for a shirt without stripes? What would you get?
Probably some further questions for clarification. What about checked shirts? Floral prints? Plaid? Do you want no pattern at all? T-shirt? Polo shirt? Dress shirt?
Granted, the AI results are particularly bad because they give you the one thing that you specifically didn't ask for, but that's also the only information you provided. Defining a query in terms of what you don't want instead of what you do want isn't going to go well.
What if you went to google and said "Show me all the webpages that aren't about elephants"? Sure, you'd get something, but would it be anything useful?
Google has gotten better, it's just HNer expectations that have changed as they expect more and more magic.
For example, the subtitle on the repo is "Stupid AI" when this query has never worked in these search engines, and it won't anytime soon.
You'd think the technical HN crowd would be more advanced than to make the same mistakes that (they complain that) stakeholders/users/gamers make when they mistakenly think everything is much easier than it actually is. Things aren't "stupid" just because they can't yet read your mind.
I would expect there to be an e-commerce site or blog post somewhere containing a page with the exact title "shirts without stripes" and I'd expect it to be the first match.
This thread is an excellent example. The author of the linked page didn't have the decency to actually make a substantive point, instead sharing three screenshots and posting the link here, chumming the HN waters with the kind of stuff that brings in the sharks from far and wide.
Bashing on big cos: Check
Vague pronouncements about AI: Check
Generic side-swipes about 'ad revenue': Check
This is why a coherent thesis is required to even initiate a proper discussion, because in the absence of that it invariably devolves to lowest-common-denominator shit-flinging.
Negations sidestep almost all of the algorithms that try to provide an improved result set, and fall through to pure text relevancy. So try searching on amazon for shirt, then search for: shirt -xkxkxkxk. Since xkxkxkxk doesn't match any documents, the negation should have no effect, but it does, the effect it has is to sidestep all the fancy relevancy work and hardcoded query rewrite rules, domcat rules, demand and sales/impression statistics etcetc, and give you basically awful search results. You don't even get shirts.
Anyone with a programming background knows there is an art to forming useful search queries--it is an acquired skill. I'd personally much rather the engine bring back predictable results given mundane rules and keywords than attempt to understand sentences using an opaque method of understanding.
That said, this seems like an obvious place for improvement where both groups can be made happy.
Given the exact query, a human creates the environment, and thus the context, themselves.
It may also depend on whom you are asking. For example, I come to this site to find news about software & tech. Also, since 'Stripe' is a company name, I assumed the link would lead to a list of shirt shops that do not accept Stripe as a payment method/provider. (Thus some kind of protest-related thing.)
I literally thought about that yesterday and did not see the page thinking "That's too much for tonight".
Now seeing topic is somewhat very different.
The former returns lots of mixed race couples, mostly not white couples. However the latter returns black couples.
What is going on here? Similar phenomenon perhaps?
What is the expected result, can we agree?
It is a shirt with anything except stripes.
And it's only that smart assistant that automates coping with the deficiencies of a one-size-fits-all central solution, finding me shirts with no stripes by using a rather dumb search engine. (Or "a pizza I would like", etc.)
https://medium.com/pinterest-engineering/pinsage-a-new-graph...
"Don't think of a cow !"
What did you just think of ? A cow, of cowrse.
If you want a shirt w/o stripes, just google "plain shirt" or "dress shirt -stripes".
There are several reasons for this, including the following:
1) Natural language understanding for search has gotten a lot better, but it is still not as robust as keyword matching. The upside of delighting some users with natural language understanding doesn't yet justify the downside of making the experience worse for everyone else.
2) Most users today don't use natural language search queries. That is surely a chicken-and-egg problem: perhaps users would love to use natural language search if it worked as well or better than keyword search. But that's where we are today. So, until there's a breakthrough, most search engine developers see more incremental gain from optimizing some form of keyword search than from trying to support natural language search.
3) Even if the search engine understands the search query perfectly, it still has to match that interpretation against the document representation. In general, it's a lot easier to understand a query like "shirt with stripes" than to reliably know which of the shirts in the catalog do or don't have stripes. No one has perfectly clean, complete, or consistent data. We need not just query understanding, but item understanding too.
4) Negation is especially hard. A search index tends to focus on including accurate content rather than exhaustive content. That makes it impossible to distinguish negation from not knowing. It's the classic problem of absence of evidence not being evidence of absence. This is also a problem for keyword and boolean search -- negating a word generally won't negate synonyms or other variations of that word.
5) The people maintaining search indexes and searchers co-evolve to address -- or at least work around -- many of these issues. For example, most shoppers don't search for a "dress without sleeves"; they search for a "sleeveless dress". Everyone is motivated to drive towards a shared vocabulary, and that at least addresses the common cases.
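Point 4 can be sketched with a three-valued attribute. This is only an illustration in Python: the `Item` type, the `striped` field, and the `assume_unknown_is_plain` switch are hypothetical, and the catalog is made up.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    name: str
    striped: Optional[bool]  # True, False, or None when the catalog doesn't say


def without_stripes(items, assume_unknown_is_plain):
    """Return names of items that pass a "no stripes" filter.

    With assume_unknown_is_plain=False only items positively known to be
    plain are kept; with True, missing data is treated as "not striped".
    """
    kept = []
    for item in items:
        if item.striped is False:
            kept.append(item.name)
        elif item.striped is None and assume_unknown_is_plain:
            kept.append(item.name)
    return kept


catalog = [
    Item("oxford", striped=False),
    Item("breton", striped=True),
    Item("mystery tee", striped=None),  # nobody tagged the pattern
]

print(without_stripes(catalog, assume_unknown_is_plain=False))
print(without_stripes(catalog, assume_unknown_is_plain=True))
```

The strict reading returns only the oxford and hides every untagged item; the lenient reading also returns the mystery tee, which may well be striped. Neither choice is right when most of the catalog is in the "unknown" state.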
None of this is to say that we shouldn't be striving to improve the way people and search engines communicate. But I'm not convinced that an example like this one sheds much light on the problem.
If you're curious to learn more about query understanding, I suggest you check out https://queryunderstanding.com/introduction-c98740502103
Though I somewhat agree, I don't think that an average user even knows about existence of search operators in the first place, let alone being aware of this specific one and when to use it.
Google is already by far the most widely used search engine, so they don't really need to innovate or improve the search product very much in order to attract and retain users. Presumably capturing more advertising spending from the companies paying for ads is a bigger priority.
Microsoft under Satya Nadella has been all about enterprise and cloud, and I doubt Bing is a strategic priority any more, so it's not surprising that they wouldn't put a lot of resources into making it better.
Amazon is a little surprising. You'd think they'd have a lot to gain from making it easier for people to find what they're looking for. But maybe less than perfect search results are deliberate? Maybe it's like how supermarkets put basic items in the back of the store and high-margin impulse buys in the front - so you have to walk past chocolates and chips if you want to buy a carton of milk.
If Amazon is deliberately nerfing search results then maybe Google would stand to benefit from having better shopping-related results - people would get frustrated trying to find a shirt without stripes on Amazon and just use Google instead, letting Google profit from advertising in the process. But maybe people selling shirts aren't willing to pay much for ads, so there isn't much money for Google to make by getting better at finding specific types of shirts.
I dunno if any of these conjectures are anywhere near accurate, but it's interesting to think about.
AND keyword LIKE '%searchStr%'
Lots of focus on a general purpose mono-model, but I think a collection of specialized subsystems is a better representation and would produce better results, faster.
But Siri is a general domain problem, which is really really hard. Siri set the expectation you can ask it anything, and it works terribly and for most questions just gives up and runs a web search.
If you are an e-commerce company though, that's a narrow enough domain, because you know that for most people, they're looking for products to buy or compare. It's not an unbounded Q&A service.
At some point in the future marketers will learn about AGI, and we'll have to make yet another term, maybe artificial general practical intelligence?
There is nothing on the word "intelligence" to imply it's not specialized.
I know what agi is; I just find the terms backward.
It's perfectly ok to prefer the qualifier to go the other way around, keep intelligence general and change the name of the specialized form. But that's just not how our language evolved.
You can't assume that customers would type one thing or another - you need to gather lots of query log data and see what you find. You'd be surprised how much variation there is, but once you do have this data you can then find patterns to cover lots of (but not all) cases.
Zooming out, the language field breaks into several subfields:
- A large group of Chomsky followers in academia are all about logical rules but very little in the way of algorithmic applicability, or even interest in such.
- A large and well-funded group of ML practitioners, with a lot of algorithmic applicability, but arguably very shallow model of the language fails in cases like attribution. Neural networks might yet show improvement, but apparently didn't in this case.
- A small and poorly funded group of "comp ling", attempting to create formalisms (e.g. HPSG) that are still machine-verifiable, and even generative. My girlfriend is doing a PhD in this area, in particular dealing with modeling WH questions, so I get some glimpse into it; it's a pity the field is not seeing more interest (and funding).
https://www.google.com/search?q=%22Shirt+without+Stripes%22&
If you argue this is bad behavior: Maybe we need a web query which really only takes the query literally. Putting the query in quotes will not quite have this effect for Google. Maybe some other syntax?
Yes, the English grammatical rules make it unambiguous where it belongs. This is solvable.
Seems like a matter for logical inference. At which point it becomes fairly easy to find shirts made from material where that material's pattern is not stripes.
But yes, no AI I have seen works reliably on even basic queries like this.
Couldn't you just parse the sentence into a dependency tree and look at the relationships to figure that out? CoreNLP got both of your examples right (try it at http://nlp.stanford.edu:8080/corenlp/process, can't link the result directly).
To be useful, Google must solve natural language problems. You can't solve natural language problems by using formal language in some bits of the problem, at least not until we have a full Chomsky-style understanding of the whole of human language.
Well, one could argue, that it belongs exactly where anyone entering the query put it. Before "stripes".
The problem is often, that search engines try to be too clever, while not offering any kind of switch "exactly those words in this order" and that is just a bad user interface.
If it just disregards the word without, well, that's pretty bad.
I would not be surprised if millions of dollars are being lost per year because of this substandard query result.
“Shirt -stripes” is unambiguous to a system, yet the first result on Amazon(.ca) is a striped shirt, and the 3rd is sweatpants.
That's the sort of thing I'd expect Amazon to be doing?
I mean, context is key, right? You're on Amazon and your first search term is "shirts". Unless there is a band called "shirts without stripes", the user wants shirts. The rest of the query is probably some filter of that product. You know shirts sometimes have stripes. It's not a one-size-fits-all algorithm, but it's simple enough that the user should end up with the results they wanted.
> "no evidence of cancer" and "evidence of no cancer" are very different things.
Why is it not as simple as "no belongs to the word it precedes"? Like the unary operator ! (not) in typical computer languages.
- no textbook evidence of cancer
Statements have structure, parsing them with simple rules like this is akin to parsing C++ with regular expressions.
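For illustration, the "no binds to the next word" rule can be written down in a few lines of Python. The `negate_next_word` function is a hypothetical sketch of that naive rule, not a recommended parser.

```python
def negate_next_word(phrase):
    """Apply the naive rule: "no" negates exactly the token that follows it."""
    out = []
    negate = False
    for token in phrase.lower().split():
        if token == "no":
            negate = True
            continue
        out.append("!" + token if negate else token)
        negate = False
    return out


for phrase in ["no evidence of cancer",
               "evidence of no cancer",
               "no textbook evidence of cancer"]:
    print(phrase, "=>", negate_next_word(phrase))
```

The first two phrases come out distinct, as hoped, but in the third the negation lands on "textbook" rather than on "evidence", which is not what the phrase means. Getting the scope right needs the phrase structure, not just token adjacency.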
You'd also have quite a bit of fun trying to parse the phrase "no means no" or other usages where "no" is being used as a noun... And for bonus points, folks talk to search engines in broken english all the time so "shirts no single striped" is a totally reasonable query to submit to a server and expect to be parse-able.
Does she want ice cream? Answer: No, she doesn't. I added a not, so she's reversing the answer as Japanese people do.
The number of times I've been dumbstruck by this is larger than I'd like to admit, and I'm a coder.
Q: "Do you mind if I sit here?"
A1: "Not at all!"
A2: "Sure!"
Both are valid answers and mean the same thing, the person asking is welcome to sit there. This has always amused me.
There have been some lengthy discussions on HN about vertical search and how Google doesn't always buy up a small company; they litigate.
I'd be curious to see how many sentences with attribution problems actually have other structural issues. If I want to write clearly and without ambiguity, I rewrite sentences that have these problems. Why wouldn't I do the same for search queries?
The bad results are because they're not positively indexing the absence of the feature by deeply analyzing the images or products beyond the descriptions. "Shirt with stripes" yields almost exclusively striped shirts. Exclude those results from all "shirts" and there are still a lot of striped shirts that the search algorithm doesn't know enough to exclude.
There is no ambiguity in "not stripes", you can't invert it and write it in the positive form of what you want; the neatest way to describe the category of what you want to browse is "things which are not stripey".
Particular personal bugbear is car websites where you can filter in "petrol engine" or "diesel engine", but there is no support for negative filtering, so you can't choose "not LPG". In so many search-and-filter options you can't exclude your dealbreakers, and it's much more likely that I have a single dealbreaker which rejects a choice overriding all other considerations, than that I have a single dealmaker which makes a choice overriding all else.
What do you call a skyscraper like that if you want to refer to it? They exist, but you can't find them using that search term on Google.
Ok, so imagine one online retailer follows your advice and expect the users to write clear and unambiguous queries, while another retailer puts extra effort into attribution.
Which one will make more money?
That’s the best I can do, sorry.
I think you'd struggle to find anywhere Google claims to "understand everything", making your assertion a strawman.
Literally in the article you're quoting from Google:
> But you’ll still stump Google from time to time. Even with BERT, we don’t always get it right. If you search for “what state is south of Nebraska,” BERT’s best guess is a community called “South Nebraska.” (If you've got a feeling it's not in Kansas, you're right.)
"So that’s a lot of technical details, but what does it all mean for you? Well, by applying BERT models to both ranking and featured snippets in Search, we’re able to do a much better job helping you find useful information. In fact, when it comes to ranking results, BERT will help Search better understand one in 10 searches in the U.S. in English, and we’ll bring this to more languages and locales over time.
"Particularly for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning, Search will be able to understand the context of the words in your query. You can search in a way that feels natural for you.
...
"No matter what you’re looking for, or what language you speak, we hope you’re able to let go of some of your keyword-ese and search in a way that feels natural for you. But you’ll still stump Google from time to time. Even with BERT, we don’t always get it right. If you search for “what state is south of Nebraska,” BERT’s best guess is a community called “South Nebraska.” (If you've got a feeling it's not in Kansas, you're right.)
"Language understanding remains an ongoing challenge, and it keeps us motivated to continue to improve Search. We’re always getting better and working to find the meaning in-- and most helpful information for-- every query you send our way."
> sometimes still don’t quite get it right,
> Even with BERT, we don’t always get it right.
And nothing in the blog is about image search.
This is a case where, while it makes sense to say the sentence, it's not a common use of language, and at the end of the day, the search engine will find what's written down, it's not a natural language processor yet (despite any marketing).
Shirt stores don't advertise "Shirts without stripes - 20% off", they describe them as "Solid shirts" or "Plain shirts". Men's fashion blogs talk about picking "solid shirts" or "plain shirts" for a particular look. If I walked into a clothing store and asked for "shirts without stripes", the sales person would most likely laugh and say "er, you mean you want plain shirts?".
Plain shirts/solid shirts are the most common way to refer to these, and people seem to be searching this way:
https://trends.google.com/trends/explore?date=all&q=solid%20...
Regarding moving towards natural language processing - the "without" part is not as important as knowing the context.
My kids will ask me to get from the bakery things like "the round bread with a hole and seeds", which I know means "sesame bagel", or "the sticky bread", which means "cinnamon twists" - which I understand because I know the context. Sometimes they say "I want the red thingy", and I need to ask a bunch of questions to eventually get at what they want (sometimes it's a red sweater, sometimes it's cranberry juice).
Unless Google starts asking questions back, I don't think there is any way it can give you what you want right away.
Searching "pants" only shows me "trousers", that's a big fail for Google IMO, I'm accessing google.co.uk.
The joke from Zizek: https://www.youtube.com/watch?v=wmJVsaxoQSw
To extend on that, you can think of the human brain as just another (powerful) statistical model.
- Shirt Without Stripes: shirts where the description contains both "without" and "stripes". Example: a shirt without collar, with stripes.
- "Shirt Without Stripes": a mess, with and without stripes, suggesting an unusual search query. In fact, the linked article site is the first result in web search.
- Stripeless shirt: sexy women in strapless shirts
- "stripeless shirt": pictures of Invader Zim...
- "stripeless" shirt: mostly shirts without stripes, but there are some shirts with stripes that are described as stripeless...
The last one may give us a hint at the problem. If you have to mention a shirt is without stripes, you are probably comparing it to a shirt with stripes. For example imagine a forum, some guy is posting a picture of a shirt with stripes, I can expect some people to ask questions like "do they sell this shirt without stripes"? Or maybe the seller himself may have something like "shirt without stripes available here (link)" in the description. So the search engines tie "shirt without stripes" to pictures of shirts with stripes.
I remember an incident where searching for "jew" on Google led to antisemitic websites. The reason was simply that that exact word was rarely used in other contexts. Mainstream and Jewish sources tend to use the words "jews" and "jewish" but not "jew". And because Google doesn't look at the dictionary meanings of words but rather what people use them for, you get issues like that.
I had a similar problem when I was trying to convince a friend that homeopathy was a complete and utter fraud with absolutely no basis in reality. She was convinced that the internet's overwhelming consensus was that homeopathy was valuable and regular doctors were control-freaks who make things up when they don't know the answers.
To prove her point, she did an internet search for allopathic medicine and showed me how the majority of the results were negative.
https://en.wikipedia.org/wiki/Allopathic_medicine
Just a humorous anecdote, not trying to start any conversations about the relative value of different medical paradigms.
Sometimes I wonder how much my brain has changed to use search engines / how much of it is dedicated to effective googling. Makes me feel like a cyborg.
"Humans usually don't intuitively understand the word 'no'. Please imagine a non-pink elephant."
Eg: If you're doing a sport where leaning forward is bad, avoid telling yourself 'dont lean forward' as your mind only hears 'lean forward', therefore reinforcing the thing you're trying to avoid. Alternatively, tell yourself 'lean back' or 'stay straight' or whatever you're focusing on for that maneuver or drill.
It would be great if I could add negative keywords to a website, or mark text as "don't index" or "index with a negative weight". But probably, people would game this in ways I can't imagine.
There is probably a clever ML solution for this, like having meaning-vectors for distinct ideas, and pushing pages that are close to one meaning away from the other meaning. Classification is easy if you have a keywords like "painting" and "catholic", but if it is "virgin" or "prayer" then it could be either meaning, so there is never a bullet-proof solution.
The theme of this talk was how they did a study that showed prepositions and articles do have meaning. A big deal was made out of the results.
I think things like this happen when people consider engineering approximations such as bag of words to be the truth over time.
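To make the bag-of-words point concrete, here is a minimal sketch of how such an approximation can erase the difference between the two queries. The stop-word list and the `bag_of_words` helper are made up for illustration and are not how any particular search engine works.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "with", "without"}


def bag_of_words(query: str) -> frozenset:
    """Lowercase the query, split on whitespace, and drop stop words.

    Word order and prepositions are thrown away, which is exactly the
    approximation being criticized.
    """
    return frozenset(w for w in query.lower().split() if w not in STOP_WORDS)


with_stripes = bag_of_words("shirt with stripes")
without_stripes = bag_of_words("shirt without stripes")

# Both queries collapse to the same set of terms: {"shirt", "stripes"}.
print(with_stripes == without_stripes)
```

Once "with" and "without" are treated as noise, nothing downstream can tell the two requests apart.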
There were gasps in the room and a kind of depressed acquiescence: geez, he might be right. And the pendulum indeed swung in that direction, hard, and the field has been overwhelmingly dominated by the statistical machine learning folks on the CS side of the field, while the linguists kind of quietly keep the flame alive in their corner.
But I thought then, and I still think now, that it really just was another swing of the pendulum (which has gone back and forth a few times since the birth of the field in the 1960s). Perhaps it's now time again for someone to ring up the linguists and let them apply their expertise again?
Likewise, Google says I should log into their website for personalized search results, but after years of always clicking on Python 3 results over Python 2.7 results, it never learned to show me the correct result.
Eventually I realized that personalized recommendations are more or less just a thin cover for collecting vast amounts of data with no benefit to the consumer. I believe we have the technology to do better, but we don't use it. In fact, we seem to be using it less and less.
As humans we know immediately that the search is for documents about shirts where stripes are not present. But the term 'without' doesn't make it through to the term compositor step which is feeding terms in a binary relationship. We might make such a relationship as
Q = "shirt" AND NOT "stripes"
You could onebox it (the Google term for a search short circuit path that recognizes the query pattern and takes some specific action, for example calculations are a onebox) and then you get a box of shirts with no stripes and a bunch of query results with.
You can n-gram it, by ranking the without-stripes n-gram higher than the individual terms, but that doesn't help all that much because the English language documents don't call them "shirts without stripes", generally they are referred to as "plain shirts" or "solid shirts" (plain-shirt(s) and solid-shirt(s) respectively). But you might do okay punning without-stripes => plain or to solid.
From a query perspective you get better accuracy with the query "shirts -stripes". This algorithmic query uses unary minus to indicate a term that should not be on the document but it isn't very friendly to non-engineer searchers.
Finally you can build a punning database, which is often done with misspellings like "britney spears" (ok so I'm dating my tenure with that :-)) which takes construction terms like "without", "with", "except", "exactly" and creates an algorithmic query that is most like the original by simple substitution. This would map "<term> without <term>" => "<term> -<term>". The risk there is that "doctors without borders" might not return the organization on the first page (compare results from "doctors without borders" and "doctors -borders", ouch!)
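A rough sketch of that substitution, including the "doctors without borders" hazard, might look like the following. The `PROTECTED_PHRASES` set and the `rewrite_query` function are hypothetical names invented here, and the pattern only handles a single excluded term at the end of the query.

```python
import re

# Hypothetical allow-list of phrases that must never be rewritten,
# e.g. names of organizations.
PROTECTED_PHRASES = {"doctors without borders"}

# Matches "<terms> without <term>" where the excluded part is one word.
WITHOUT_PATTERN = re.compile(
    r"^(?P<head>.+?)\s+without\s+(?P<excluded>\S+)$", re.IGNORECASE
)


def rewrite_query(query: str) -> str:
    """Rewrite '<terms> without <term>' into '<terms> -<term>'."""
    normalized = " ".join(query.lower().split())
    if normalized in PROTECTED_PHRASES:
        return normalized  # leave known names alone
    match = WITHOUT_PATTERN.match(normalized)
    if match is None:
        return normalized  # nothing to rewrite
    return f"{match.group('head')} -{match.group('excluded')}"


print(rewrite_query("shirt without stripes"))
print(rewrite_query("Doctors Without Borders"))
```

The hard part is not the substitution but maintaining the list of exceptions, which is where the debate tends to go.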
When people get sucked into search it is this kind of problem that they spend a lot of time and debate on :-)
It's a completely artificial construct. Simply the fact that this hacker-news entry is the #1 search result shows that real human people do not perform this search in significant quantity. But we can quantify that with data to backup the assumption [1][2]. When people want to buy a shirt without stripes, they do not describe the shirt by what it doesn't have.
In fact, it's trivial to cherry pick a random selection of words that on the face of it sounds like something a human might search for, but it turns out never occurs in practice. Add to that the fact that the term is being searched without quotes [3], which results in the negation not actually being attached to anything.
Do you go to a store to buy it along with your Pants Without Suspenders, Socks Without Animal Print, and other items defined purely by what they don't have?
[1] https://trends.google.com/trends/explore?geo=US&q=%22white%2... [2] https://trends.google.com/trends/explore?geo=US&q=%22plain%2... [3] https://trends.google.com/trends/explore?geo=US&q=plain%20sh...
Likewise, here, I would search for solid-colored shirts.
And these services are limited to the content/terminology utilized by the cataloged sites/products.
If I am selling a "black shirt" or a "solid black shirt," it is not google's job to catalog it as a "shirt without stripes," unless I advertise it as a "black shirt without stripes."
I would use natural language to test a services' NLP ability.
There are too many products nowadays to be manually attributed (e.g. pattern=stripes), making it hard to return good results even with entity resolution for queries. We train classifiers to categorize products, including what something is not, using their images and descriptions.
I'm guessing one of those reflections looks like a turtle? Or maybe a pattern on the floor, wall, or rug?
Although there are examples where I'm unsure if the AI is dumber than my 4yo or smarter than me. This is a result for "truck": https://i.imgur.com/JcgXZAG.jpg
Even (especially?) my 4yo knows those are Brio trains, not trucks. However, trains have components called trucks! https://en.wikipedia.org/wiki/Steam_locomotive_components I'm unsure whether or not any of the wheel assemblies on these toy trains are considered trucks, so either the AI is extremely smart or slightly dumber than a 4yo.
i recently looked for dough scrapers. i wanted to see what's selling best and what's most rated. they are everywhere. in dessert & decoration, in utensils, in bakeware, and many other categories. i mean i get it...
it's not just search that's hard. categorization is also an issue here.
Joking aside, it doesn't surprise me that this isn't being picked up — aren't most of these AI teams more R&D than actual public-facing? Maybe I'm just cynical though.
'shirt no stripes'
on Google returned this web page at top of the organic results.
So at some point, searching for a shirt online will involve this conversation. Even more confusing.
(Although I expect my filter bubble will play a part in that)
''' Shirt Without Stripes | Hacker Newsnews.ycombinator.com › item 42 mins ago - The point that the author is making, in a very understated way, is that all three companies have PR websites that breathlessly describe their ...
'''
Is it surprising that very few of the results surprise me?
The net result of that Google search, combined with the "Shirt Without Stripes" repo, leaves me even more unimpressed with the capabilities of our AI overlords.
I never figured out what kind of mistake could have led to that.
If I search for 'person' it's a mixed-race woman, then a white woman (Greta Thunberg), then a white man.
Many interpreted this along tribal lines, but likely it is that there is constant tuning and lots of complex constraints.
[1] not to say that you implied the reason was racism, but often it is attributed to something along those lines
Something of a corollary to Brooksian egg-manning: with an infinite number of possible searches, you can find at least one whose results do not exactly match the current demographics of the state from which you place the search.
The google image search you did -- did not provide incorrect answers, unlike the OP's
There’s a nuanced argument that practitioners know how ML is so dependent on training data and accuracy tails off sharply, but that nuance tends to be removed from anything selling to potential customers, which has not been a great way to keep them in my experience.
Edit: "stripes" not "stripped" ugh
Nobody has solved the common sense knowledge problem yet. A solution for that would qualify as Artificial General Intelligence and pass the Turing Test.
But search engines have come a long way. I even suspect that when search engines place too much logical - or embedding relevance to stop words such as "without", that, on average, the relevant metrics would go down. It is not completely ignored as "shirt with stripes" surfaces more striped shirts than "shirt without stripes". "shirt -stripes" does what you want it to do.
Searching for "white family USA" shows a lot of interracial families. Here "white" is likely not ignored as much, and thus it surfaces pages with images where that word is explicitly mentioned, which is likely happening when describing race.
You can use Google to find Tori Amos when searching for "redhead female singer sings about rape". Bing surfaces porn sites. DDG surfaces lists (top 100 female singers) type results. The Wikipedia page that Google surfaces does not even contain the word "redhead", yet it falls back to list style results when removing "redhead" from your query, suggesting "redhead" and "Tori Amos" are close in their semantic space. That's impressive progress over 10-20 years back.
[1] https://en.wikipedia.org/wiki/Commonsense_knowledge_(artific...
EDIT: scrap that, I didn't mean Alexa, which is doing AI obviously, but the search engine of Amazon's retail website.
Anyway, NLP is hard and everyone sucks at it. Think about it: just building something that could work with any <N1> <preposition> <N2> or any other way to express the same requests would mean understanding the relationships of every possible combinations of N1 and N2. It means building a generalized world model that is quite different from simply applying ML to a narrow use case. Cracking that would more or less mean solving general AI which probably won't happen soon.
You're right that NLP is hard, but not everyone sucks at it.
Additionally, "shirt without stripes" is not the same as "solid color shirt"; as an example, take a look at:
Whereas all these services seem to be processing the input in such a superficial way that they give the searcher results that aren't just inaccurate but are the opposite of what was asked for.
Lol what? These are words a toddler would understand.
If your "ML algorithm" doesn't understand straightforward language, how is it any better than a couple if-then statements?
Beyond that, I'm unsure how you think "<something> without <something>" is at all unusual or difficult to decipher.
How am I supposed to explicitly search for a shirt without stripes, then?
Google has not yet discovered how to automate "is this a quality link?" evaluation or not, since they can't tell the difference between "an amateur who's put in 20 years and just writes haphazardly" and "an SEO professional who uses Markov-generated text to juice links". They have started to select "human-curated" sources of knowledge to promote above search results, which has resulted in various instances of e.g. a political party's search results showing a parody image. They simply cannot evaluate trust without the data they initially harvested to make their billions, and without curation their algorithm will continue to fail.
Google has so much more data than just the keywords and searches people make, it seems like this should be a problem they could solve.
Through tracking cookies (e.g. Google Analytics) they should be able to follow a single user's session from start to finish, and they also should be able to 'rank' users in some vague way where they'd learn which users very rarely fall for ads or spend time on the sites that they know are BS. Those sites that are showing up on page 5 or 6 of the search results, but still get far more attention than others on the first few pages, could get ranked higher.
But I don't think many of Google's problems these days are technical in nature. They're caused by the MBAs now having more power at Google than the techies, and thus increasing revenue is more important than accuracy.
Google is a dumbass nowadays, and regularly ignores half your search terms to present you with absolutely irrelevant results, that have gotten lots of visits in the past.
Page 2: Page 2 of about 86 results (0.36 seconds)
It seems they're really just trimming the web.
Google's job is not to give you great search results, it's to keep you clicking on ads. Ideally it would be the ads on the search results page directly, but if that doesn't work then a blogspam website with Google ads is the next best thing.
If Google was a paid service this problem would be solved the next day. Oh, and Pinterest would completely disappear from Google too. :)
Today, it will silently guess at what I want, and rewrite the query. If they have indexed pages that contain the words I put in, but don't meet their freshness/recency/goodness criteria, they will return OTHER pages with content that contains vaguely related words. "Oh, he couldn't have meant that, it's from 6 months ago, and it's niche!"
They'll even show this off by bolding the words I didn't want to search for.
So, if I'm looking for something that isn't popular -- duckduckgo it is. It doesn't do this kind of rewriting, so my queries still work.
I still continue to use it though since, as some here have already mentioned, Google's results became worse a few years ago and DDG was lean and good enough to switch. I do hope they'd consider more such feedback.
Related aside: It frustrates me no end that spellcheck still doesn't appear to use any probabilistic considerations, like Markov chains, to determine the intended word. And that when I click the next to last letter to make an adjustment it doesn't then change the suggestions to alternate endings, etc. Perhaps newer devices than I have do this.
Doing just that for 10 years, beating hand-coded systems: https://www-nlp.stanford.edu/pubs/SocherLinNgManning_ICML201... [pdf]
> I would guess that most modern NNs from the NLP area (Transformer or LSTM) would be able to correctly differentiate the meaning.
Yes. See demos like: https://demo.allennlp.org/constituency-parsing/MTczNjYyNA== and https://demo.allennlp.org/dependency-parsing/MTczNjYyNg==
> I think there is no fancy NN (yet) behind Google search,
During the deep learning boom, Google made a huge push towards NN-based NLP. SEOs and their PR call these efforts collectively RankBrain: https://en.wikipedia.org/wiki/RankBrain
I think we are on the cusp of combining symbolical/logical operations over the vectors produced by Neural Networks (or at least, major effort there). Could be by neatly tying up all these different NN-based NLP modules (parsing, semantic distance, knowledge bases, ...) with another set of decision layers stacked on top.
Chomsky: Statistical analysis of snowflakes falling outside the window may predict the next snowflake, but it will do very little for weather prediction, and nothing for climate analysis.
Norvig: Give us enough data and we will get close enough for all practical purposes.
It's easy to say, isn't it? Unfortunately, sticking the word "just" in there doesn't affect the difficulty. I do it all the time, too.
That said, "meaning" is not statistical.
Are you sure of that? After all, we don't have all the same interpretation for every word in the dictionary. Ask a hundred people in the street - well perhaps not these days - the meaning of the word "meaning", how many different explanations will you get? And will you be able to reduce them all to the same "fundamental" meaning?
I think you're overestimating Google's sophistication.
Knowing how the machine will interpret humans is just as important to finding your results.
I guess because they leave it up to the advertiser to determine the negative match words and that seems to always have priority.
https://help.twitter.com/en/using-twitter/advanced-twitter-m...
Everything is for the best, in this best of all possible search engines -- the Candide fallacy.
If Google isn't under survival pressure to get better (and they aren't) the incentives aren't aligned for them to improve or even to not get worse every year.
If Google is failing first gradually then suddenly it might not even be within the institutional power to notice how bad it's become before it's too late.
This assumes that AI wants truth. These three companies' AIs don’t necessarily want truth, they want revenue.
But let's also logically assume that most content on the web is telling the visitor what their page or product is, not what it is not.
I would expect a shirt to be advertised and described as what it is: “black shirt” not “shirt without stripes.”
So if a content creator does not include the exact term “without stripes” in the description of shirt, then you are relying on google to infer meaning on your behalf and the content creator.
Now, this is relatively inconsequential for a shirt and perhaps not well representative, as fashion-related searches are different from many searches. If I search for “news without coronavirus,” should I expect only articles that do not refer to coronavirus? I wouldn’t.
If I was allergic to peanuts and I searched for “food without peanuts,” I would expect results from content creators and sellers of products who took care to include the term “without peanuts,” because they are advertising their product as safe for those with peanut allergies. I would not rely on google or amazon to make that determination for me.
Both for google and individual sites, there are better options to further narrow results. If you don’t want to narrowly define your result to a specific pattern or color, then first search more broadly and then use advanced settings or filters to omit terms and/or include others.
Unfortunately people don't search for "solid shirts". At best they search for "plain shirts", but there's a lot of taste to clothing that means people often do want a shirt without stripes, but are open to patterned/plain.
I think searching "shirts without stripes" is very legitimate in fashion.
I say this having built a clothing search function for the company I work for, and one that does not support this sort of query.
Most likely, common sense reasoning will be required to get full natural language processing, since human communication relies extremely often on such reasoning. But building a knowledge base of common sense facts will be one of the hardest challenges ever attempted in machine learning/artificial intelligence.
"Not at all" == "I do Not [object to you sitting here] at all"
A3: “Sure I do, last time you sat next to me you wouldn’t shut up.”
As with including the word "stripes" in a search where you want to omit results with stripes, including the word "mini" is only causing unnecessary confusion. The adapter that works for a Mac Mini will also work for a Macbook, as an example.
Not actually true. ML is one area of study within the field of AI. Thanks to marketing departments and slightly shoddy journalism these two things are now casually treated as equivalents, but they're really not: ML is still very much a subset of AI.
If you were unfamiliar with them and searched "widgets" to find out more and got widgets of a single colour and form, it would not be an unreasonable assumption that widgets are mostly (if not entirely) that shape and colour, especially if there was nothing to indicate that this was a subset of potential widgets.
It's not so much "demand for diversity" as it is "more accurate and correct representation".
Most of the very top results seem to be of trump and greta thunberg.
That might explain a lot but I don't think so.
Just look to how they are messing up simple searches because of basic lack of quality controls:
- Why don't double quotes work anymore? Not because of dark SEO but because nobody cares.
- Same goes for the verbatim option.
- The last Android phone I liked was the Samsung SII, and last year I finally gave up and got the cheapest new iPhone I could get, an XR. My iPhone XR reliably does something my S3, S4, S7 Edge and at least one Samsung Note couldn't do: it just works as expected without unreasonable delays.
- Ads. They seem to be optimized to fleece advertisers for pay-per-views because a good number of the ads I've seen are ridiculous, especially given that I had reported those ads a number of times. I wonder what certain customers that probably paid a lot for those impressions would say if they knew that I had specifically tried to opt out from those ads and wasn't in the target group anyway.
Also, don't underestimate the adversaries. Ranking well on Google means earning a lot of money. So much so, that I'd argue the SEO-people are making significantly more money than Google loses by having spammy SERPs. They will happily throw money at the problem and work around the filters. I don't think you can really select for quality by statistical measures. Google tried and massively threw "trust" at traditional media companies and "brands". The SEO-people responded by simply paying the media companies to host their content, and now they rank top 3, pay less than they did by buying links previously, and never get penalties.
They already do this today for any venue where they can link “traffic volume” to “ranking increase without human review”.
Google's aim was to replace other sources of information with Google:
> People make decisions based on information they find on the Web. So companies that are in-between people and their information are in a very powerful position
Profit was on their minds from the very beginning:
> There are a lot of benefits for us, aside from potential financial success.
Revenue, however, was not urgent back then, to them or to their VCs:
> Right now, we’re thinking about generating some revenue. We have a number of ways to doing that. One thing is we can put up some advertising.
So over the past two decades, they executed a two-pronged approach: Become indispensable and Become profitable. But now they're trying to pivot from "at web search" to "at assisting human beings", and that's a much more difficult problem when their approach to "Become profitable" was to use algorithms rather than human beings.
Here's a useful litmus test for whether Google has succeeded at that pivot:
If you were in a foreign city and you suddenly wanted to propose marriage to your partner, would you trust Google Assistant to help you find a ring, make a dinner reservation, and ensure that the staff support the mood you want (Quiet or Loud, Private or Public)?
If so, then Google's pivot has been successful.
People still think we will have self driving cars "in two years" yet here we are talking about dumb shirts. AI winter is coming
But it isn't necessary to formalize any of it. At the current level of sophistication, our informal common ground of words like "understanding" suffices for a discussion. It's obvious Google Translate doesn't resemble human language processing.
I think the basic issue is that people just don't respect machines and want to minimize the amount of effort spent on communicating with them - I don't say "Alexa please bring up songs by Death Grips if you don't mind" I shout "Alexa! Play! Death Grips!" and then yell at it when it misunderstands.
A shirt lacking stripes would never be described or labeled as a "shirt without stripes."
In the absence of that actual description, you are asking google to assume what you mean.
I would just never expect that to work very well.
1) Gather all items labeled as "shirts" (among other labels) 2) Filter out any labels that includes "stripes"
A shirt doesn't have to be labeled "shirt without stripes" for this to work. A shirt labeled "shirt with stripes" or "striped shirt" would not match, and lots of other shirts (solid shirts, shirts with prints, whatever) would match just fine.
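The two steps above can be sketched roughly as follows. The `Item` class, the sample catalog, and the list of excluded terms are all invented for illustration; a real catalog would need proper stemming rather than a hand-written pair like "stripes"/"striped".

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    labels: set = field(default_factory=set)


def search(items, required, excluded_terms):
    """Keep items that carry the required label and no excluded term.

    An item is dropped if any of its labels contains any excluded term
    as a substring, so "with stripes" and "striped" are both caught.
    """
    return [
        item
        for item in items
        if required in item.labels
        and not any(
            term in label.lower()
            for label in item.labels
            for term in excluded_terms
        )
    ]


# Hypothetical catalog data.
catalog = [
    Item("Oxford", {"shirt", "solid", "blue"}),
    Item("Breton", {"shirt", "striped"}),
    Item("Rugby", {"shirt", "with stripes"}),
    Item("Polka", {"shirt", "polka dots"}),
    Item("Chinos", {"pants", "solid"}),
]

results = search(catalog, "shirt", ("stripes", "striped"))
print([item.name for item in results])
```

Note that none of the matching shirts needs a "without stripes" label; they only need to lack a stripe-related one, which of course assumes the striped items were labeled as such in the first place.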
Relatedly, one time I picked up a prescription for a cat. The cat's name was listed as CatFirstName MyLastName. They had another (human) client with that same first and last name. It turned out that on my previous visit they had "corrected" that client's record to indicate that he was a cat.
If vendors would use the term "shirt without stripes" then it would match great, but they call it "plain shirt".
Google advertises using BERT natural language models
https://blog.google/products/search/search-language-understa...
> ... but they call it "plain shirt".
Or polka dotted :)
https://www.google.com/search?q=windowless+skyscraper&tbm=is...
https://www.emporis.com/buildings/119453/seattle-tower-seatt...
Windowless is a superset of glassless.
"non-glass skyscraper":
1. No glass used in the exterior construction at all -> implying no windows
2. No glass used in the exterior construction at all -> implying the windows are made out of something other than glass
3. A skyscraper in which glass is not a prominent architectural feature, but the building does contain features like windows and doors that contain glass. (This comment)
That's the full glass buildings returned in your windowless query.
See how that works? That's not really what's going on. Sure, G. is incentivized to include pages quickly, but they are also incentivized to produce them accurately, and as the above poster indicates, this is quite a hard problem to solve generally.
A is also incentivized to sell items.
In many cases different algorithms will lead to quantifiably different results. The algorithm changes that work better for the measurement set will be kept and those changes which don't will be discarded. And both A and G do that within different constraints.
Pointing out the obvious: Google is an advertising company. If the cost of producing an accurate result outweighs the advertising income on a given term, there is no incentive for Google to produce better results.
Having a search engine that people go to whenever they want to search for things is incredibly valuable, because they will come to you when they want to buy things and you can sell ads. But unless you consistently give the best results for all queries, people will go to whoever does. It's worth investing strongly in all queries, not just highly monetizable ones.
(Disclosure: I work for Google, speaking only for myself)
A sales gimmick furniture stores would use in the past was to offer customers a free gallon of ice cream for visiting the store. The value was to the store offering the promotion, as shoppers would be drawn to the "free" gift, but on receiving the ice cream (too much to eat directly) would then have to go home to put the dessert in the freezer, and have less time to comparison shop at competing merchants' stores. Given limited shopping time (usually a weekend activity), this is an effective resource exhaustion attack.
Similar tricks to tie up time, patience, or cognitive reserve are common in sales. For a dominant vendor, tweaking the hassle factor of a site so long as defection rates are low could well be a net positive, if it makes the likelihood of a visitor going to other sites lower.
Still I insist that a business serving up more relevant search results for loosely phrased queries will make more money than the one relying on the user to formulate perfect queries.
That's my story and I'm sticking to it.
See Scott Adams, "Confusopoly" (2011): https://www.scottadamssays.com/2011/12/07/online-confusopoly...
I've touched on this: https://old.reddit.com/r/dredmorbius/comments/243in1/privacy...
The antipattern is sufficiently widely adopted that I've been looking for possible dark-pattern justifications.
Nope. Cable television was introduced with the promise of no ads. That didn't last long.
Search engines are a relatively competitive market. A paid Google with no extra perks will not fly when the majority of people will just flee to Bing. For a paid Google to be successful it has to provide additional value such as filtering out ads, blogspam, Pinterest and other wastes of time.
Subscription based services also require you to be authenticated and that enables fine grained invasive tracking. Something traditional media couldn't do.
If delivery costs were a factor then I shouldn't be charged $15 for an ebook with near zero distribution costs when a paperback was $5 before ebooks came onto the scene and introduced a new incentive for price gouging.
For example:
"The city councilmen refused the demonstrators a permit because they advocated violence."
Which party is "they"? There is no lexical information that can possibly answer this question. It depends entirely on an actual understanding of what "city councilmen" and "demonstrators" (in the context of city councilmen and permits!) are, and which one would be more likely to be advocating violence (and in which case that would lead to a permit denial).
Background: Until recently I worked at a symbolic AI company that was tackling this problem. I myself didn't work on this problem directly, but I became 100% convinced that their approach, while a long shot, was the only conceivable way of solving it in a fully generalized way.
https://www.reddit.com/r/NoStupidQuestions/comments/64ae8h/i...
“Yes, I would like an unstriped dress shirt please”
“How about this striped shirt?”
“No thank you, I would like an unstriped dress shirt please”
“I have some lovely jogging pants”
“Ok, I need to be clear here, I would like a dress shirt that has no stripes”
“Can I interest you in a white undershirt? People who buy dress shirts usually buy undershirts”
....
T: I think pink would look good on you, and it's very fashionable right now.
You: Just bring me some yellow shirts to try.
T: Oh, I got these, and brought this pink one anyway; try it!
But, of course Google isn't making fashion suggestions. But then, ... the tailor might also be just trying to shift excess stock or be on a bonus for selling that particular high-cost shirt.
They can certainly also bring some stock to shift, or offer suggestions while I’m trying something on, but if they aren’t listening when I make a direct request or when I clearly say no, then they aren’t really there for me, their customer.
I’m an odd one in that I already know specifically what I want to buy before I search for it, but I’m certainly not the only one (and I think everyone has done that at least once).
I mean, if we humans have difficulty parsing each other's statements, then why should machines do any better?
People want better results but don't want to be tracked, and those things are in opposition to each other.
But taking it as a given that Google's results are better, is that really because of lack of privacy, or just because Google has been pouring more money and talent into the problem longer than anyone else? Because I'm not convinced that personal data is particularly useful for generating search results. The example they always give is determining whether a search for "jaguar" means the cat or the car. But that always seemed silly to me, because most searches are going to give extra context to disambiguate ("jaguar habitat"), and even if they don't, the user is smart enough to type "jaguar car" if they're not getting the right results. Further, Google doesn't actually know whether I'm more interested in cars or cats; it just knows that I'm a woman in college, so it guesses that I'm less interested in cars. Is that really so useful?
Does searching Google through Tor give noticeably worse results than searching google while logged in? I would be genuinely surprised if it did.
I mean, that's probably why they are equivalent for you. You've chosen privacy over better results (which is a totally legit choice to make!).
It's funny because it's frequently mentioned how Google's tracking is what enables it to give such personalized search results, but often I question how effective that really is.
For instance I question if Google has some profile on me and shows results they _think_ I will want to see (e.g. news related), and thus leave out other results. If it works that way then I'm frequently seeing the same websites in my results and effectively being siloed and shielded from other results that I may find interesting.
Their new strategy of adding snippets for everything has truly gone insane. I searched for "covid us deaths" today and had to scroll about 3 viewport lengths down to even see the first result.
What happened to just a plain list of blue links?
From a marketing perspective, I feel like DDG needs to change its name or use a shortened alias. "Google" is an incredible word as it's easy to spell, remember, and it's short. Interestingly they own "duck.com"...
Alternative hypothesis: people only have had Google as reference for years, which means that Google represents "reality" to them. Anything that looks even slightly different is therefore worse.
Still though: This is not evidence for Google's search quality. I, too, feel like the results have gotten worse over the last few years.
Also: Afaik, DDG uses bing under the hood, not what I would call "search startup" in the sense of revolutionizing search quality.
In this very specific case I don't buy it. Sure, it probably applies for other queries, but if you approach a salesperson and ask him for "shirts without stripes" it's pretty clear what you want, and he won't bring you any piece with stripes on it.
The only difference is that those are all physical properties of a shirt while stripes is a type of pattern.
shirt without buttons - pretty much fail.
shirt without red button - as already expected, shirts with red buttons
It might be hard to come up with examples on the spot, but in everyday life you will routinely come across things you need to refer to by negation which are relatively uncommon.
The most amazing property in my opinion is the fact that it trains itself. (Whereas neural networks are trained by external systems).
I suspect it's related to imprinting. To take the example of filial imprinting, the brain must have some hardcoded notion of what a parent looks like. Then this is used to build a parent detector, and the hardcoded notion is thereafter discarded. Then the newly learnt parent detector is used in reinforcement learning (near parent = good). Keep in mind that this all happens just after birth or hatching, before the visual centres of the brain have had any chance to train.
Really cool stuff.
I'm not sure trying to confuse people about whether a shirt has stripes on it would make as much sense. The purchaser seems likely to give up on picking an ideal shirt and just go with the cheapest result.
Both though have the same essence: a manifestly confusing and annoying interface may be serving the merchant's interests.
See also Ling's Cars, possibly explaining awful Web design:
https://ello.co/dredmorbius/post/7tojtidef_l4r_sdbringw (HN discussion: https://news.ycombinator.com/item?id=16921212)
"Kind person" - pictures of men, women, children, of all ages and colors.
"good person" - Mostly pictures of two hands holding. No clear bias towards women at all. If anything, more of the hands look "male".
"Bad person" - Nearly 100% cartoon characters
Absolutely ridiculous that you would take the time to write up such fake nonsense.
Following the stereotype content model theory I would likely get a pretty decent prediction of what kind of culture and group perspective produced the data. You could also rerun the experiment in different locations to see if it differs.
I did use images.google.se in order to tell google which country I wanted my bias from since that is the culture and demographics I am most familiar with. I also only looked at photos of a person and ignored emojis.
I have also seen here on HN links to websites that have captured screen shots of word association from google images and published them so you could click a word see the screen shot. They tend to follow the same line as above, but with some subtle differences, and I suspect that is the country culture being just a bit different to mine.
I just submitted all your searches to google.com from Australia, and the results were nothing like what you described; all the results were very diverse.
This is to be expected, as Google has been criticised for years for reinforcing stereotypes in image search results, and has gone to great effort to adjust the algorithms to reduce this effect.
"The AI works in mysterious ways. Trust it."
It's rather surprising how often complex systems theories, be it AI, cosmology or economics, have an aspect where even the theorists resort to "because it is".
Sometimes those statements are based on measured data, but it's not always easy or possible to measure accurately in a highly interconnected system or, worse, a system where actors react to the theoretical model in a way that changes how the system behaves.
I don't have proof, but I strongly believe that a search algorithm that returns what a customer is actually searching for will drive more sales. I suppose it's possible that with time, consistently bad results will beat a customer into submission and drive more sales of stuff the customer doesn't want. But I don't believe that's true, and this would only be the case if the customer accepts that the thing they want doesn't exist. If the customer is pretty sure that solid color shirts exist, they'll just shop elsewhere until they find it.
edit: fixed typo born -> porn
I'd expect to see some of the much touted "search intelligence", NLP, term inference, vector term analysis, and AI in action...
Of the first ten hits:
Amazon: 7/10 aren't shirts (but all are stripeless)
Google: All shirts, one is checked
Bing: All shirts, all contain stripes in description.
Not exactly glowing for Amazon, clearly unsupported by Bing.
Presumably this would be after the algo devalued people who clicked on "Next Page" until they came to a page that had stripeless shirts on it, or who, after the search, only ever clicked on stripeless shirts. "Deeds not words," dontchaknow.
Which is not the case: searching for "plain shirts" does not give results similar to searching for "shirts without stripes".
When I’m looking up information, like an article, especially technical information, then yes I want the search engine to do as little interpreting as possible. That’s because any interpretation rules it uses won’t always be relevant so I’d rather have more control.
But for Amazon? I don’t see how someone can see the average user typing “shirts without stripes” and getting almost nothing but shirts with stripes, and going “yup, works as expected”.
How is interpreting "shirts without stripes" as "I wish to see shirts that don't have any stripes" guessing?
I would venture to say that most people (meaning your use case is in the minority) who type "shirts without stripes" want to see results showing shirts without stripes, not results containing the words "shirt", "stripes" and "without".
I think what is happening here is that you know how search engines work and so you are conditioned to expect them to do what they're doing.
Because I was looking for pages about the band called "Shirts without Stripes". Because I wanted pages with shirts that have stripes but where the page featured the word "without", because their shirts are without something else. Because I want to see striped shirts from the company called "Without". I don't want the machine to guess what I mean. It can never know.
Have you tried viewing pages past the first page? Often times it's just filled with what looks like foreign hacker scam websites.
* tie without paisley
* tie not paisley
* non-paisley ties
* ties that aren't paisley
* ties other than paisley
You guessed it, in each case, at least half of the results are paisley ties. The only way to actually get what you want -- the set described by X, minus the set described by Y -- is to use the exclusion operator in the search, "ties -paisley".
This is great, and makes intuitive sense to somebody with multiple computer science degrees. But not only is it hard to explain to an outsider, it's actually quite hard to get them to think in a way that accommodates this capability, that is, in terms of set theory.
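To make the set-theory point concrete, here is a minimal sketch in Python. The four-item catalog and the helper names (`build_index`, `bag_of_words_search`, `exclusion_search`) are hypothetical, made up for illustration; this is not how Google or any real engine is implemented. It only shows why a plain word matcher pushes paisley ties to the top for "ties without paisley", while "ties -paisley" computes the set for X minus the set for Y.

```python
from collections import defaultdict

# Hypothetical toy catalog, invented for illustration: id -> description.
CATALOG = {
    1: "blue paisley ties",
    2: "solid navy ties",
    3: "striped red ties",
    4: "paisley scarves",
}


def build_index(catalog):
    """Build an inverted index: word -> set of ids whose description contains it."""
    index = defaultdict(set)
    for doc_id, text in catalog.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def bag_of_words_search(index, query):
    """Rank ids by how many query words they contain; 'without' is just another word."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    # Highest score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def exclusion_search(index, query):
    """Treat '-term' as set subtraction: (all included terms) minus (any excluded term)."""
    include, exclude = [], []
    for term in query.lower().split():
        if term.startswith("-"):
            exclude.append(term[1:])
        else:
            include.append(term)
    if not include:
        return set()
    # Intersection returns a new set, so the index itself is never mutated.
    result = set.intersection(*(index.get(term, set()) for term in include))
    for term in exclude:
        result -= index.get(term, set())
    return result


INDEX = build_index(CATALOG)
print(bag_of_words_search(INDEX, "ties without paisley"))  # the paisley tie ranks first
print(exclusion_search(INDEX, "ties -paisley"))            # only non-paisley ties remain
```

In the word matcher, mentioning "paisley" at all raises the score of paisley items, which is the behaviour complained about above; the exclusion form has to be asked for explicitly.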
I have the same reservations about Google as anyone, but rewriting history is never the right move. Moving beyond text matching was what made search truly useful.
> Moving beyond text matching was
> what made search truly useful.
PageRank is what made Google more useful than its competition. They had it since the beginning, and I like it. What I don't like is to search for the band "Chrisma" and get results for "fruitcake sale!" because Google corrected my spelling to "Christmas", decided to look for related concepts, and then boost whichever result is the most mercantile.
Yes, they have. Read what I said again, I don't dispute this. What Google does now is an improvement over text matchers. I never claimed that what they do now is better than Google circa 2000 (though I don't care to register an opinion either way on that).
Whatever they've done since, their product remains better than text matchers. Mercantile search is better than terrible search.
I was about to say there are no such queries but then I remembered having to type a captcha for seemingly automated queries. The captcha page has no results on it obviously. This is because automated queries do not produce advertising revenue. You have to buy them.
I've typed an insane number of queries since the beginning. A decade ago I used to be able to find truly exotic articles; I could find every obscure blog posting on every blog with 3 readers, and I was pretty sure Google delivered all of it. The tiny communities that came with the super niche topics rarely produced a link I hadn't already found. If they did, it was new and I hadn't googled for a while.
Today Google feels like it is a pre-ordered list from which it removes the least matching articles. Only if the match is truly shit will it be moved slightly down the page. The most convincing evidence of this is typing first name + last name queries in images and getting celebrities who only have the first or the last name.
People won't go; it has to get much worse before they do.
edit:
With humans and pets a good slap over the head or a firm NO! will usually do the trick.
There are very clearly many queries with no advertising revenue, because there are many queries that show no ads. Trying some searches off the top of my head that I expected wouldn't have ads, I don't get any ads on [cabbage], [who is the president], [3+5], or [why is the sky blue]. On the other hand, if I search for a highly commercial query like [mesothelioma] the first four results are ads.
> A decade ago I use to be able to find truly exotic articles, I could find every obscure blog posting on every blog with 3 readers
My model of what happened is that SEO got a lot better. When Google first came out it was amazing because Page Rank was able to identify implicit ranking information in pages. Once it's valuable to have lots of backlinks, though, this gets heavily gamed. Staying ahead of efforts to game the algorithm is really hard, and I think a lot of times people's experience of a better search engine comes from a time when SEO was much less sophisticated.
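For anyone who wants to see the backlink-gaming effect in miniature, here is a rough sketch of the textbook, simplified PageRank (power iteration with a damping factor). It is not Google's production ranking, and the page names, the five "farm" pages and the 0.85 damping value are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b, c link in a cycle; "spam" has no backlinks.
honest_web = {"a": ["b"], "b": ["c"], "c": ["a"], "spam": ["a"]}

# Same web plus five made-up "farm" pages whose only purpose is linking to "spam".
gamed_web = dict(honest_web)
gamed_web.update({f"farm{i}": ["spam"] for i in range(5)})

before = pagerank(honest_web)
after = pagerank(gamed_web)
print(f"spam score without link farm: {before['spam']:.4f}")
print(f"spam score with link farm:    {after['spam']:.4f}")
```

Even in this toy graph, pages that nobody real links to can raise the score of a target simply by existing and pointing at it, which is roughly the kind of gaming a ranking based on backlinks invites.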
> The most convincing in this is typing first name + last name queries in imagines and getting celebrities who only have the first or the last name.
This hasn't been my experience, so I tried an image search for [tom cruise], curious if I would get other Toms. The first 45 responses were all of the celebrity, and image 46 was of Glen Powell in https://helenair.com/people/tom-cruise-helps-glen-powell-lea... which is a different kind of mistake. Do you remember what query you were seeing this on?
I believe what he means is that searching for first name + last name of someone who isn’t a celebrity gets you celebrities who match either the first name or last name.
Searching for Tim Neeson gets you a wall of photos of Liam Neeson: https://www.google.com/search?q=tim+neeson&tbm=isch
Searching for Tim Cruise blankets you with pictures of Tom Cruise, but it at least says “Showing results for tom cruise“ so you know it did an autocorrect. When I tried other first names + Cruise, the effect is less pronounced than with the Neeson example. Maybe it’s because cruise is a more common name as well as an English word.
Queries without ads do produce revenue. They are an essential part of the formula.
Think of people standing around in bars. We can't argue that just standing there doesn't produce revenue.
The flowers on the table in a restaurant produce revenue.
Free parking produces revenue.
If queries without ads didn't produce revenue they wouldn't exist. Often enough it doesn't even take an extra query; the ads will sit behind the links.
No, it would predict that a query that has no advertising income will have poor results. You can determine on your own whether that is the case.
- If I entered "person" I'd see a mix of images substantially similar to what I saw using google.co.uk up to and including Terry Crews, which was frankly a little weird, and otherwise mostly white
- If I entered "人", which Google Translate reliably informs me is Japanese for "person", I'd see a few white faces, but a substantial majority of Japanese people
So it seems possible that Google's trying to be smart in showing me images that reflect the ethnic makeup I might expect based on my language and location. I mean, it's doing a pretty imperfect job of it (men are overrepresented, for one) but viewed charitably it's possible that's what's going on.
Is the case for woke outrage against Google Image Search overstated? Possibly; possibly not. After these experiments I honestly don't feel like I have enough data to come to a conclusion either way, although it does seem like they may at least be trying to do a half decent job.
The TL;DR of it is that google crawls the internet for photos, associates those photos with text content pulled from the caption or from the surrounding page, and gives them a popularity score based on the popularity of the page/image. There are some cleverer bits trying to label objects in the images, but it's primarily a reflection of how frequently that image is accessed and how well the text content on the page matches your query. There's some additional localization, anti-spam, and freshness rating that influences the results too.
The majority of pages with "人" and a photo on it that has a machine-labeled person image would be a photo of a Japanese/Chinese person, and if you're being localized to Japan with a VPN, that would be even more true.
Google doesn't "know" what you're trying to search. It's a giant pattern matching game that slices and dices and rearranges text to find the closest match.
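A toy version of that description, as a hedged sketch: each image is stored with the text found near it and a popularity number, and the score is just word overlap times popularity. The `IndexedImage` fields, the file names, the popularity values and the scoring formula are all assumptions made up for illustration; the real system is far more involved.

```python
from dataclasses import dataclass


@dataclass
class IndexedImage:
    url: str            # hypothetical file name
    nearby_text: str    # caption or surrounding page text
    popularity: float   # made-up 0..1 popularity of the page/image


def text_match(query, text):
    """Fraction of query words that also appear in the nearby text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rank_images(query, images):
    """Order images by text match times popularity; drop images with no match."""
    scored = [(text_match(query, img.nearby_text) * img.popularity, img) for img in images]
    scored = [(score, img) for score, img in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [img.url for _, img in scored]


IMAGES = [
    IndexedImage("zebra.jpg", "zebra with black and white stripes", 0.9),
    IndexedImage("striped-person.jpg", "person wearing stripes", 0.6),
    IndexedImage("plain-person.jpg", "person in a plain white shirt", 0.5),
]

print(rank_images("person without stripes", IMAGES))
```

Nothing in this scoring understands "without": the word "stripes" in the query rewards exactly the images that have stripes in their text, so the plain-shirt photo ends up last.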
"Person without stripes" shows several zebras, tigers, a horse painted like a zebra, and a bunch of people with stripes.
Interestingly, duckduckgo shows me, as second result, an albino tiger with, you guessed it, no stripes. The page title has "[...] with NO stripes [...]" in it, so I assume that helped the algo a bit.
EDIT: I also got the painted horse (it looks spray-painted, if you ask me) and I must admit it's quite funny to look at
Unless things have really changed, [doctor] will be mostly white men and [nurse] will be mostly white and Filipino women.
But don't blame the AI. The AI has no morality. It simply reflects and amplifies the morality of the data it was given.
And in this case the data is the entirety of human knowledge that Google knows about.
So really you can't blame anyone but society for having such deeply engrained biases.
The question to ask is does the programmer of the AI have a moral obligation to change the answer, and if so, guided by whose morality?
(That's how I would do it if I wanted more accurate rather than more general results.)
The next few images contained Donald Trump, Terry Crews, Bill Gates and a French politician named Pierre Person.
After that it was actually quite a varied mix of men/women and color/white people.
I am still not very impressed with Google's search engine in this aspect, but it is not biased in the way you suggest.
At least it is not biased that way for me. As far as I am aware, and I might be completely wrong here, Google, in part, bases its search results on your prior search history and other stored profile information. It is entirely possible that your search results say more about your online profile than about Google engine :)
Well, she was the 2019 Time Person of the Year.
Likewise, Trump was the 2016 choice, and Crews and Gates have been featured as part of a group Person of the Year (“The Silence Breakers” and “The Good Samaritans” respectively).
There's not much diversity, assuming Terry Crews is from USA, then all the first viewport full of images are Western people; except Ms Thunberg they're all from USA AFAICT [I'm in UK].
The first non-Western person would be a Polish dude called Andrzej Person (the second Person called Person in my list after a USA dancer/actress), then Xi Jinping a few lines down. The population in my UK city is such that about 5/30 of my kids primary and secondary school, respectively, classmates have recent Asian (Indian/Pakistani) heritage. So, relative to our population, there are more black people, far fewer Indian-subcontinent no obviously local people.
Interesting for me is there are no boys. I see girls, men and women of various ages but no boys. 7 viewports down there's an anonymous boy in an image for "national short person day". The only other boys in the top 10 pages are [sexual and violent] crime victims.
The adjectives with thumbnails across the top are interesting too - beautiful, fake, anime, attractive, kawaii are women; short, skinny, obese, big [a hugely obese person on a scooter], cute, business are men.
Any sort of image search is going to tend to be biased toward stock photos, because those images are well labeled, and often created to match things people search for.
Key point right there. Unless Google is deliberately injecting racial and/or gender bias into their code, which seems extremely far fetched (to put it kindly), the real fault lies with us humans and what we choose to publish on the web.
Nurses it's 34 women to 5 men. Proportions of skin tones are what I'd expect to see in a city in my country.
I would contend that society is biased. There is no evidence that says men are better doctors than women, and in fact what little this has been studied says that women make better doctors than men (and is reflected in the more recent med school graduation classes which are majority women).
So it's a question of what you are asking for when you search for [doctor]. Are you asking for a statistical sampling or are you asking for a set of exemplars?
> So statistically, it would be correct to return mostly male doctors in an image search.
And that's exactly it. The AI has no morality. It's doing exactly what it should, and is amplifying our existing biases.
You can blame statistics for that. Beyond that, you can blame genetics for slightly skewing the gender ratios of certain fields and human social behavior to amplify this gap to an extreme degree.
IMO, wrapping it in a concept like "morality" because the pictures have people in them just serves to excuse the problem and obscure its (otherwise obvious) solution.
But here, not that I think it will help: https://www.recompile.se/~belorn/happyvscriminal.png
First is happy person. Out of 20 we have 14 women, 4 guys, 2 children.
Second is criminal person. The contrast to the first image should be obvious enough that I don't need to type it.
If I type in "person" only, I get the following persons in the first row, in this order: Pierre Person (male), Greta Thunberg (female), Greta Thunberg (female), Unnamed man (male), Unnamed woman (female), Mark Zuckerberg (male), Keanu Reeves (male), Greta Thunberg (female), Trump (male), Read Terry (male), Unnamed man (male), Greta Thunberg (female), Greta Thunberg (female), Unnamed woman (female), Unnamed woman (female).
Resulting in 8 pictures of females, 8 males, which I must say is very balanced (I don't care to take a screenshot, format and upload, so if you don't trust the result then don't).
Typing in doctor, as someone suggested in another thread, I get in order (f=female, m=male): fffmffmmmmfmmfffmfmfmmmff
and Nurse: fffmffmfmmffmffmfffmffmffff
Interestingly the first 5 images have the same order of gender and are both primarily female, though doctor tends to equalize a bit more later while nurse tends to remain a bit more female dominated.
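For anyone who wants to check the tallies rather than eyeball them, a small Python sketch that counts the two strings exactly as typed above (the `tally` helper is just an illustrative name):

```python
from collections import Counter

# The two result sequences exactly as posted (f=female, m=male).
doctor = "fffmffmmmmfmmfffmfmfmmmff"
nurse = "fffmffmfmmffmffmfffmffmffff"


def tally(sequence):
    """Return (female_count, male_count, female_share) for an f/m string."""
    counts = Counter(sequence)
    females, males = counts["f"], counts["m"]
    return females, males, females / len(sequence)


for label, sequence in (("doctor", doctor), ("nurse", nurse)):
    females, males, share = tally(sequence)
    print(f"{label}: {females} f, {males} m ({share:.0%} female)")
```

That gives 13 f to 12 m for doctor and 19 f to 8 m for nurse, so doctor is close to even while nurse is about 70% female.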
Your initial comment said "Happy person", women of color.
But your screenshot showed several white people, several men, and a diversity of ages. Yes, more women, which is probably reflective of the frequency of photos with that search term/description in stock photo libraries and articles/blog posts featuring them. No big deal.
You also said "Criminal person", Hispanic men
But the screenshot contains more photos of India's prime minister than it does of Hispanic men. In fact I can't see any obviously-Hispanic men, and the biggest category in that set seems to be white men (though some are ambiguous).
The doctor and nurse searches suggest Google is making some effort to de-bias the results against the stereotype.
To me the biggest takeaway is that image search results still aren't very good at all, for generic searches like this.
Indeed it's likely that they can't be, as it's so hard to discern the user's true intent (for something as broad as "happy person"), compared to something more specific like "roger federer" or "eiffel tower".
I'm not disputing that, and it certainly explains why it's "good enough" for some search queries whilst being totally gimpy for others.
My understanding was that Google does prioritise what it's classified as local search results though, on the basis that they're likely to be more relevant.
You don't have to bother creating anything new unless you have something to sell and are willing to invest (big).
Facebook is actually a pretty pathetic implementation where we can still find content created by normal people. If people made traditional websites instead of Facebook groups and Facebook pages NO ONE would be able to find it.
We've witnessed the great obliteration of what was once a nice place and now we have to hear google was not to blame?? The death by a thousand cuts is actually well documented.
We tell you what your site must look like or we'll gut it:
https://en.wikipedia.org/wiki/Google_Penguin Google Penguin is a codename[1] for a Google algorithm update that was first announced on April 24, 2012. The update was aimed at decreasing search engine rankings of websites that violate Google's Webmaster Guidelines[2]
There, this is what the entire internet must look like. We went from indexing to engineering here.
https://support.google.com/webmasters/answer/96569?hl=en
rel="ugc"
We recommend marking user-generated content (UGC) links, such as comments and forum posts, as ugc. If you want to recognize and reward trustworthy contributors, you might remove this attribute from links posted by members or users who have consistently made high-quality contributions over time. Read more about avoiding comment spam.
Before this those elaborately contributing got actual credit for it. Do you think you got a choice in it? Google clearly demands you ban credit for comments. OR ELSE!
rel="nofollow"
Use the nofollow value when other values don't apply, and you'd rather Google not associate your site with, or crawl the linked page from, your site. (For links within your own site, use robots.txt, as described below.)
Woah association! How did we go from linking-to to association? It was important enough for readers but be careful to hide it from google. Such little unimportant websites simply shouldn't exist in our index. We command you to help keep our index clean of such filth!
Then the magical: We won't actually tell you what is wrong with your website! Ferengi Rule of Acquisition 39 "Don't tell customers more than they need to know." Get a budget and hire someone to do SEO. Deal with it, we don't care. No, you don't have any feedback.