MongoDB Releases Queryable Encryption Preview (mongodb.com)

120 points by andrewbarba 4 years ago | 67 comments

SkyPuncher 4 years ago |

This is a really neat technology, but I don't understand its use case. I've worked in HealthTech and currently in the compliance space. I'm skeptical of Mongo's claims (and their familiarity with compliance laws). Kind of feels like a solution in search of a problem.

"In use" implies that you have a need to process that data. It doesn't matter if the end client is submitting queries in plain text (protected in transit) or this fancy encryption, the client (or server) still needs to be authorized to query that data. Translating from plain-text to encryption does not add additional protections from a compliance perspective.

snorkel 4 years ago | |

This seems more applicable for the SaaS hosting model where the database service is managed by a 3rd party. So the use case is "I trust your SaaS service is compliant with my legal obligations to protect my customer data, but it'd be easier for everyone involved if your database service also has no way of seeing sensitive data fields. That would make it easier for me to pass my compliance audits, otherwise I need to audit you." So the data is encrypted client-side before it's sent over to the database service, and the database service is not able to decrypt it, but can still include the encrypted value in a query.

SkyPuncher 4 years ago | | |

The problem is that situation doesn't really exist.

At an organizational level, it's extremely hard to control what information gets put into a SaaS. There are far too many ways in which data can be de-anonymized or inferred against (e.g. a field existing can have privacy implications).

It's far safer to use a SaaS provider that meets general control requirements than to try to shoe-horn encrypted data into them.

giaour 4 years ago | |

> It doesn't matter if the end client is submitting queries in plain text (protected in transit) or this fancy encryption

It's not just the query that is encrypted in this case, but the data being queried. From MongoDB's description, the server never receives or stores plaintext data, and the query results can only be decrypted by a client who has the same key that was used to encrypt the data in the first place. From a compliance perspective, that's amazing if it works. It means the server is never storing or processing anything but ciphertext.

gkop 4 years ago | | |

Yes, and in the context of Mongo-as-a-Service, it's amazing both to the client and also the service provider (less liability).

xhkkffbf 4 years ago | |

In one of the books about the general idea, _Translucent Databases_, the idea is to save the costs of securing the raw data. Someone might break into the database server (or listen on the wire) and find only encrypted values. This can make many different architectural use cases easier to deliver.

In the most extreme cases, the unencrypted values never leave the client. The database can concentrate on delivering storage and fast query answers without paying much attention to issues of security. Clients don't need to trust the database because they control the encryption.

redwood 4 years ago | |

I see this as more about fundamental trust: confidentiality from the service providers, not compliance.

015a 4 years ago | |

To be fair: the Compliance Regime didn't invent any of the technologies they create frameworks around, and if we followed every compliance framework's recommendations to a T, the systems produced would by-and-large be insecure. They paint with an extremely large brush; and it's a toss-up whether the auditor has even been involved in anything technology-related beyond auditing for, well, decades. There's good ones and bad ones, but the integrity of many audit processes relies to a significant degree on the goodwill of the SMEs of the systems and processes being audited.

Just as a dumb example; an auditor says passwords need to be hashed with bcrypt. They find a code sample that says "store(bcrypt(password))". Awesome; complied to a T. But true security goes beyond that: are we using a library for bcrypt, or an internal implementation? Is the internal implementation well-implemented? Is the library free of CVEs (maybe they check that)? Did we trace that call to ensure the data generated is what is inserted to the db, or was it intercepted by some middleware? Did we name that function 'bcrypt' but it's actually just MD5?

My point is really not to assert that auditing is pointless, but rather that it's fundamentally limited in what kind of attestations it can make.

One great example I can pull from a few recent audits I've been through: serverless tech like Fargate. This oftentimes blows auditors away (or, rather, it used to; nowadays they've seen it so often that they just know). It checks so many boxes. They'll present multi-page forms about data center colos and operating system security and operator SSH access and we'll say "We use Fargate". "Oh nice, ok we can check all of these and carve out with AWS's attestation for (ComplianceFrameworkX)". It saves hours, days, of time.

That's, I think, where homomorphic encryption can go. That isn't what this is, but it's a step toward that. It's not about meeting today's compliance frameworks; it's about evolving the framework. And, in the interim, as advanced R&D teams meet these auditors, they'll educate-up how, yeah, you've got a lot of questions here, but it's not that we do or don't meet them: it's that they're fundamentally the wrong questions to ask; but we understand the spirit, here's how we meet the spirit, and here's how we're actually better than if we had just checked Yes on all of them.

Third example: years ago, our team was the first time our auditor had ever seen LetsEncrypt and k8s certificate-manager (then it was called kube-lego). He wanted an attestation that TLS certificates were current and not near-expiration. We countered: they can't be near-expiration, because we have automated systems which renew them. He'd never seen anything like it; he was used to expensive certificates and operations runbooks for renewal; and we nerded out for ten minutes showing it all off. Instead of documenting a runbook for renewing certificates, he documented our runbook for maintaining this automated service and ensuring uptime. Win-win.

It's a slow process, and it's made even slower because there are tons of people in the industry who treat the frameworks as gospel. But, ultimately; we control the technology, not them. We decide what is secure; they just attest to it and double-check.

dandraper 4 years ago |

This feature is a result of MongoDB's acquisition of Aroki. It looks like a good product but we actually beat them to it with https://cipherstash.com/activestash

CipherStash works with any Database and also supports Range queries and sorting/ordering. We do it in the application layer. Only supports Ruby so far but C#, Java, Python, Rust are in the works.

metadat 4 years ago | |

What about Go, or even Tcl, and Ocaml? Do you have pointers to docs that'd help OSS efforts in this department?

dandraper 4 years ago | | |

Not yet but that's a good suggestion! The core client code is Rust so additional languages are (mostly) just native bindings to Rust. We will be releasing the Rust SDK publicly soon and welcome contributions!

throwaway2016a 4 years ago |

Help me understand this...

It says it will support prefix search, substring search, and the like. Can anyone point me in the right direction on what the algorithm may be here? I don't get how you could do those things without making the encryption less secure and/or decrypting every record on the fly.

Another interesting use case I found that isn't mentioned here is sort. I've had customers ask me to be able to sort the results by PII and we tell them... no, we can't do that because the field is encrypted.

blintz 4 years ago | |

These things are indeed possible while maintaining fully semantically secure encryption. Recent, mostly theoretical work shows that this is possible using fully homomorphic encryption. The basic idea is, the client can encrypt its query, the server can process the encrypted query and produce an encrypted result, and send this back to the client. It sounds impossible, but it isn’t! Very cool stuff. There are actually also some practical implementations that work… so it’s gradually exiting the “theoretical only” stage.

MongoDB is very short on details, and I suspect they do something worse than homomorphic encryption, that does indeed make some kind of compromise between privacy and convenience.

dweinus 4 years ago | | |

Yeah, they contrast their method with homomorphic encryption, which makes me share your suspicion

hapiri 4 years ago | |

It is less secure than your standard symmetric encryption. I guess they would use deterministic encryption, in which 2 entries with the same email address will have the same record string (this leaks information to an attacker). Prefix search & sort can be achieved by using order-preserving encryption. Not really sure about sub-string though.
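A minimal sketch of the equality leak described above. It uses a keyed HMAC as a stand-in for deterministic encryption, which is an assumption: MongoDB has not published its scheme in this thread, and the key, the sample emails and the `det_token` helper are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical client-side key; in this model the server never sees it.
KEY = b"client-side-secret"

def det_token(value: str) -> str:
    """Deterministic keyed token: the same input always yields the same output."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

emails = ["a@example.com", "b@example.com", "a@example.com"]
stored = [det_token(e) for e in emails]  # what the server would hold

# The server can answer an equality query by comparing opaque tokens...
matches = [i for i, t in enumerate(stored) if t == det_token("a@example.com")]

# ...but anyone who can read the column also sees which rows share a value.
repeat_counts = Counter(stored)
```

The server never learns the email itself, yet `repeat_counts` already exposes the frequency distribution of the column, which is the leak in question.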

throwaway2016a 4 years ago | | |

I've researched order preserving encryption before but the tradeoffs (mainly that the attacker can tell the order and use that to narrow the search space) always seemed like high risk.

jalcazar 4 years ago | |

Related video explaining encryption schemes to make encrypted data in a DB queryable:

CryptDB: Processing Queries on an Encrypted Database

https://youtu.be/xsaXMUelOEA?t=807

bawolff 4 years ago | | |

I was under the impression that CryptDB "encryption" was thoroughly broken. Am I mistaken?

E.g. googling I found http://cs.brown.edu/people/seny/pubs/edb.pdf

bincyber 4 years ago |

This is really neat. Recently I explored similar functionality for relational databases and only got as far as implementing column-level encryption [0] in this Go library [1], but without support for querying the encrypted data. HashiCorp Vault's transit secrets engine supports Convergent Encryption [2] which provides limited ability to query the encrypted data, but I haven't yet experimented with it. If anyone is doing something like this in production, would love to hear about your experience.

[0]: https://en.wikipedia.org/wiki/Column_Level_Encryption

[1]: https://github.com/bincyber/go-sqlcrypter

[2]: https://www.vaultproject.io/docs/secrets/transit#convergent-...

muchpir 4 years ago | |

The MuchPIR project (https://github.com/ReverseControl/MuchPIR) implements Information-Theoretic Private Information Retrieval (IT-PIR) in Postgresql; In addition to the demo there is a high performance version available for commercial use.

eknkc 4 years ago |

I didn't know this was a thing. The article mentions it can do equality, range, prefix, suffix and substring queries. Does this mean that the encryption scheme creates sortable 1:1 mapped results after encryption? Kind of like a shift cipher?

tyingq 4 years ago | |

They mention this:

"Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz"

Some related papers with those two as authors:

https://eprint.iacr.org/2016/453.pdf

https://cs.brown.edu/people/seny/pubs/sgx.pdf

GTP 4 years ago |

The question is whether the full query is encrypted, or just some values that are considered sensitive. I remember research from some years ago showing that an attacker who can still see the SQL code can recover the content of the database by looking at the queries and the responses and "putting the pieces together". If the goal was to get the exact values inside the database (think about employees' wages), it still required observing a very large number of queries, but if you were interested in a reasonable interval for each value, the number of queries needed became small enough to be doable in practice.

Unfortunately I don't seem to be able to find this again, but a quick search turned up two papers that say that just encrypting your db isn't enough: [0], [1]. In particular, [1] doesn't seem to go into the details of how you could recover the data, but mentions that many operations as performed by "normal" databases leak information if performed over encrypted data. Maybe someone who is more familiar with Queryable Encryption can comment on this?

[0] https://www.cs.cornell.edu/~shmat/shmat_hotos17.pdf

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/...

winrid 4 years ago |

Neat. Did they fix their blog's pagination yet? If you hit next enough times you may or may not be able to take down the site, don't ask me how I know.

(their pagination is implemented just by increasing the limit parameter).

api 4 years ago |

Is this actually possible? Couldn't you make many repeated queries and slowly decrypt the text by e.g. slowly narrowing the range?
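A toy sketch of the narrowing idea in this question, assuming the attacker can issue arbitrary range queries and observe only whether anything matched. The hidden value, the bounds and the `range_query` stand-in are hypothetical; nothing here models MongoDB's actual protocol.

```python
from typing import Tuple

SECRET = 73_500  # hypothetical hidden value, e.g. a salary

def range_query(lo: int, hi: int) -> bool:
    """Stand-in for an encrypted range query: reports only whether a match exists."""
    return lo <= SECRET <= hi

def recover(lo: int, hi: int) -> Tuple[int, int]:
    """Binary-search the hidden value using nothing but match / no-match answers."""
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if range_query(lo, mid):
            hi = mid          # value is in the lower half
        else:
            lo = mid + 1      # value is in the upper half
    return lo, queries

value, queries = recover(0, 1_000_000)
```

The number of queries grows with the logarithm of the range, so the attack is cheap for whoever is able to generate valid queries; whether the server can do that without the client's keys is a separate question.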

robmccoll 4 years ago | |

This is possible. The goal is that the server knows as little as possible, while the client has full information. It's order revealing encryption. The server side knows the ordering of the values, but doesn't know any specific value. When queried, it is always getting prefixes (or exact matches) following the same encryption scheme, so it can compare those to the corpus and select results since the query parameters fall into the same ordering. The server doesn't have access to the keys needed to generate query parameters, so in theory it would be difficult for the server to perform narrowing queries on its own. Over time the server could gather statistical results that may reveal more about the data it's holding. Also, these schemes may need to produce the same cipher text for the same input, so frequency distributions can be used to reveal information.

Diggsey 4 years ago | |

Yeah the article is very thin on technical details. To make this work as they describe, it must not be possible for any client to "forge" queries, or else they could trivially decode the content by sending prefix queries of increasing length.

It's also difficult to see how this could work on the server side without exposing some information about the encrypted fields. For example, if all documents have a value that begins with "a", then there must exist a prefix query that matches all those documents. I would expect it to be possible to figure out whether such a query is possible or not, only given access to the encrypted data, but even if that's not possible, the simple fact that a prefix query was issued that matched all documents gives away that information.

robmccoll 4 years ago | | |

You could have a larger range than domain and throw in some noise. Exact match queries would need to become range queries that are de-noised at decryption.
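A toy sketch of that idea, assuming integer plaintexts. Each value owns a bucket of `SPREAD` codes and is stored at a random position inside it. The mapping is unkeyed and not secure; it only illustrates the "larger range plus noise" shape, and every name in it is hypothetical.

```python
import secrets
from typing import Tuple

SPREAD = 1000  # hypothetical expansion factor: each plaintext owns SPREAD codes

def encode(value: int) -> int:
    """Map a plaintext to a random code inside its own bucket, preserving order."""
    return value * SPREAD + secrets.randbelow(SPREAD)

def exact_match_range(value: int) -> Tuple[int, int]:
    """An equality query becomes a range query over the whole bucket."""
    return value * SPREAD, value * SPREAD + SPREAD - 1

def decode(code: int) -> int:
    """De-noise by dropping the random offset."""
    return code // SPREAD

codes = [encode(42), encode(42), encode(43)]
lo, hi = exact_match_range(42)
hits = [c for c in codes if lo <= c <= hi]
```

Two rows holding 42 will usually get different codes, so equal values no longer collide, but the ordering between distinct values is still visible to the server.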

SkyPuncher 4 years ago | |

Yes. This is the fundamental problem with this.

For something like HIPAA, this adds very little value if fields are semi-known.

rafaelturk 4 years ago |

This looks really cool, although it feels like it is actually a feature implemented in the driver (client side), so my initial impression is that it is not a meaningful innovation on the server side. This can be implemented with any Database, even with current MongoDBs

gqewogpdqa 4 years ago | |

Nope it’s implemented on the server side. I think that they are going to talk more about it at a session and maybe even in a keynote

8jy89hui 4 years ago | |

> This can be implemented with any Database, even with current MongoDBs

Is it really all client side? How could they do things like substring matching without sending the entire index back and forth to the client? The graphic seems to show the query being executed solely on the server (although graphics often lie).

jayd16 4 years ago | | |

Perhaps encrypted trigrams (or some such thing) are sent during insert and search.

Then it's just a matter of counting matching trigrams/chunks. The server doesn't need to know how to read the trigrams.
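A rough sketch of how that guess could look, with keyed HMAC tokens standing in for "encrypted trigrams". This is speculation about a possible design, not MongoDB's scheme; the key, document names and helper functions are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, List, Set

KEY = b"client-side-secret"  # hypothetical key held only by the client

def trigram_tokens(text: str) -> Set[str]:
    """Client side: split text into trigrams and replace each with a keyed token."""
    text = text.lower()
    grams = {text[i:i + 3] for i in range(len(text) - 2)}
    return {hmac.new(KEY, g.encode(), hashlib.sha256).hexdigest() for g in grams}

# Tokens the client would send alongside each encrypted document on insert.
index: Dict[str, Set[str]] = {
    "doc1": trigram_tokens("alice smith"),
    "doc2": trigram_tokens("bob jones"),
}

def server_search(query_tokens: Set[str]) -> List[str]:
    """Server side: return documents containing every query token, without reading any."""
    return sorted(doc for doc, toks in index.items() if query_tokens <= toks)

candidates = server_search(trigram_tokens("smit"))
```

The result is a candidate set: a document can contain all the trigrams without containing the substring contiguously, so the client would still need to decrypt and verify. The tokens are deterministic, so the same repetition leak discussed elsewhere in the thread applies.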

rafaelturk 4 years ago | |

We use Mongoose; for sensitive data we have a wrapper around the .pre('save') hook that encrypts it before sending data to the downstream db. It feels like MongoDB implemented that in more elegant, structured code.

bawolff 4 years ago |

I call bullshit.

So let me get this right - it's encrypted but you can search prefix and suffix?

So all the attacker has to do is do it one letter at a time, see if it starts with A, B, C, once they figure that out, go to the next letter and so on. (I presume that the DB is not supposed to be trusted since they make such a big fuss about only being decryptable on the client side)
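A toy sketch of the letter-at-a-time attack described here, assuming the attacker can submit arbitrary prefix queries and see whether they match. The secret, the alphabet and the `prefix_query` stand-in are hypothetical.

```python
import string

SECRET = "hunter"  # hypothetical hidden field value
query_log = []     # every prefix the attacker tried

def prefix_query(prefix: str) -> bool:
    """Stand-in for an encrypted prefix query: reports only match / no match."""
    query_log.append(prefix)
    return SECRET.startswith(prefix)

def recover(max_len: int = 32) -> str:
    """Extend the known prefix one character at a time until nothing matches."""
    known = ""
    for _ in range(max_len):
        for ch in string.ascii_lowercase:
            if prefix_query(known + ch):
                known += ch
                break
        else:
            return known  # no letter extends the prefix, so the value is complete
    return known

recovered = recover()
```

The cost is at most 26 queries per character, so it grows linearly with the length of the value instead of exponentially. As the replies note, this only works for someone who can produce valid queries.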

Also there doesn't seem to be a whitepaper detailing algorithms or their threat model. Bitcoin scams try harder than this.

winrid 4 years ago | |

The use case you're outlining is someone already has access to the database. They can just do a find() in that case and get everything, no query required. You're basically describing an lz77 SSL hack that's like 20 years old, I'm pretty sure they would think of this.

The use case here is just "advanced encryption at rest". Encrypting at rest is one thing, but this means people are less likely to see PII by accident, for example.

bawolff 4 years ago | | |

That's not what their blog post says. To quote:

"Queryable Encryption implements a fast, searchable scheme that allows the server to process queries on fully encrypted data, without knowing anything about the data. The data and the query itself remain encrypted at all times on the server."

They are strongly implying that someone with access to the database should not be able to decrypt the data. According to their blog post that seems to be the entire value proposition compared to what they describe as traditional encryption at rest.

mushi 4 years ago | |

It’s already been mentioned that “Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz" - are you calling bullshit on their work? What are your qualifications?

bawolff 4 years ago | | |

So long as whatever system they designed has not been published and reviewed by independent experts, then yes. I don't have to be an expert in this space to recognize what the norms are for making new production-ready cryptosystems, and that this doesn't remotely meet them.

Designing secure cryptosystems is hard. Experts fail at it all the time. The lack of technical details is a major red flag.

Not to mention the distinct possibility that even if this group made a secure system, the mongodb marketing dept may very well be misrepresenting its security/limitations.

Redsquare 4 years ago |

If it is going to the likes of AWS KMS every time, it will blow budgets.

claudiug 4 years ago |

can this be done in postgres via client or via server? I found it really nice

uberdru 4 years ago |

seriously did not think we would see homomorphic encryption productized for a few more years. pretty impressive!

8jy89hui 4 years ago | |

> Some of the existing tools, such as homomorphic encryption or secure enclaves have performance unsuited to scalable encrypted search, require proprietary hardware, or have uncertain security properties.

I don't think this is exactly homomorphic. I hope they put out a whitepaper so researchers can properly evaluate its security.

uberdru 4 years ago | | |

Nice catch, I was scanning for homomorphic encryption, but missed this. Have no idea how else they would implement this.

muchpir 4 years ago | |

Homomorphic Encryption is available at large scale today for limited use cases.

See the MuchPIR project (https://github.com/ReverseControl/MuchPIR) which implements Information-Theoretic Private Information Retrieval (IT-PIR) in Postgresql; In addition to the demo there is a high performance version available for commercial use.

dandraper 4 years ago | |

It's not homomorphic but "structured encryption". Less useful than HE but faster.

snorkel 4 years ago | | |

Correct. It's not homomorphic encryption, but rather more like TDE (Transparent Data Encryption) except that MongoDB service isn't decrypting the data. This is essentially client-side encryption (at the driver) and without server-side decryption.

cvwright 4 years ago | | |

Faster has a usefulness all its own

samwillis 4 years ago | |

Homomorphic encryption allows you to modify the encrypted data without decrypting it or even knowing the content. I don’t think this is homomorphic encryption.

If they are able to do this without decrypting the data then I think you could describe this as a somewhat weak encryption that exposes some data attributes as queryable. You could not implement this with strong encryption without at least decrypting for indexing.