Transform Data by Example [video](microsoft.com) |
Something like this: Suppose we have a table of strings of digits, some including spaces, and we’d like to remove the spaces. From
123 456
234567
345 678
to 123456
234567
345678
Now, what happens if it encounters, say, 4567890?
Would the result be unchanged (as we would probably want), or would it “cheat” and remove the middle “7” character, giving “456890”?
For your example, it could list:
“Remove interior spaces from each item”
or it could say: “Remove the middle character from any 7-character strings to make them 6 characters in length”
You would be able to do something with that.
Learning algorithms that produce decision trees are usually used in this situation.
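To make the ambiguity discussed above concrete, here is a minimal sketch in Python. The function names and the `consistent` helper are made up for illustration; this is not how the Microsoft tool works internally. It only shows that both candidate explanations fit the three examples and disagree on the new input.

```python
from typing import Callable, List, Tuple

# The three input/output examples from the thread.
EXAMPLES: List[Tuple[str, str]] = [
    ("123 456", "123456"),
    ("234567", "234567"),
    ("345 678", "345678"),
]

def remove_spaces(s: str) -> str:
    """Candidate 1: remove interior spaces from each item."""
    return s.replace(" ", "")

def drop_middle_of_seven(s: str) -> str:
    """Candidate 2: remove the middle character from any 7-character string."""
    if len(s) == 7:
        mid = len(s) // 2
        return s[:mid] + s[mid + 1:]
    return s

def consistent(program: Callable[[str], str],
               examples: List[Tuple[str, str]]) -> bool:
    """True if the program reproduces every example output."""
    return all(program(inp) == out for inp, out in examples)

# Both candidates agree with all the examples...
both_fit = consistent(remove_spaces, EXAMPLES) and consistent(drop_middle_of_seven, EXAMPLES)

# ...but they diverge on an input the examples never covered.
honest = remove_spaces("4567890")           # unchanged
cheating = drop_middle_of_seven("4567890")  # the middle "7" is removed
```

Showing the user a disagreeing input like this one is one way to let them pick between the candidates without reading either description.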
This is the problem of lacking explanatory mechanisms in ML.
Note that some techniques that are very out of vogue at the moment, such as Genetic Programming, are much better than neural nets in this regard.
123 456
234567
345 678
and the program replies with something like what you wrote: “Remove the middle character from any 7-character strings to make them 6 characters in length”, it would actually take a programmer’s mind to be able to envision why this might in some cases be wrong. Most people who are not programmers would, I think, see this as equivalent to “Remove interior spaces from each item”. I suspect that the skill required to choose an algorithm correctly is the exact same skill required to actually be a programmer.
All this then buys you is that you don’t have to remember the function names.
[1] https://www.microsoft.com/en-us/research/publication/automat...
One way to eliminate this ambiguity is to also provide a natural language description of what you want, e.g. "remove the spaces".
In the natural language processing community, we call this semantic parsing.
But sometimes the semantic parser can misinterpret the language too and generate a program which still "cheats" in the same manner as you described. We call these "spurious programs".
Shameless plug-- my group has been working on how to deal with these spurious programs:
From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood https://arxiv.org/abs/1704.07926
Besides, it depends on the slope of "coding". If it gets really difficult really quick (exponentially say), this could just be forever stuck in the "low hanging fruit" stage.
This is where hinting is important. Metadata. Whether I read that sequence as a phone number or as a sequence of increasing digits depends a lot on metadata.
Given some reasonable sample size, I believe machine learning could provide hints as to some of the common types of formats. Semi-automated data hinting or structuring?
There is a bidirectional connection between interpreting your data and how your data is structured.
Is it possible to use your data column to statistically hint at metadata characteristics by some sort of clustering, then use that to automatically clean input data?
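As a very crude stand-in for the clustering idea above, one could group the values in a column by their "shape" and treat the most common shape as a format hint. This is only a sketch: `shape` and `shape_histogram` are hypothetical helpers, and a real system would need something much more robust.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def shape(value: str) -> str:
    """Map every digit to '9' and every letter to 'A'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def shape_histogram(values: Iterable[str]) -> List[Tuple[str, int]]:
    """Count how often each shape occurs, most common first."""
    return Counter(shape(v) for v in values).most_common()

column = ["123 456", "234567", "345 678"]
hints = shape_histogram(column)
```

Rows whose shape differs from the dominant one are the candidates to clean up or to ask the user about.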
The beauty of this product is that its adoption strategy is baked into the product itself: I'd share this with all Excel user friends of mine because I want the algorithm to get smarter, and I might even learn a bit of C# myself so that I can contribute and scratch my own itch. This in turn makes the product better (because of the larger training data), lending itself to more word of mouth.
One concern I have is security: I'd love to hear from folks who built this, or who are more familiar with it, about how to ensure the security of suggested transformations.
Either way, this looks very useful. Having spent more than my fair share of time massaging data prior to import, this looks pretty great.
Then they complain to Microsoft, who helpfully suggests the product they should upgrade to. This has always been a strong spot of Microsoft's. "I see you've scaled beyond the capacity of [Product A]. Well, fortunately for you we have [Product B] which can handle it, with a nice import wizard to get you started painlessly." It typically goes Excel > Access > On-prem SQL Server > Azure.
This sounds very negative and I swear I don't mean it that way. It's a great sales tactic if you offer products at every level of scale.
For example, given the rule `f "abcde" 2 == "aabbccddee"`, it even figures out the role of the parameter `2`, so `f "zq" 3` gives `"zzzqqq"`.
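For reference, the function that would have to be inferred from that single example can be written out by hand. This is just the target behaviour in Python, not what the synthesizer actually produces:

```python
def f(text: str, n: int) -> str:
    """Repeat each character of text n times, keeping the original order."""
    return "".join(ch * n for ch in text)

first = f("abcde", 2)
second = f("zq", 3)
```

The interesting part is that the role of `n` has to be guessed from one data point; `2` could just as plausibly have been a constant.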
https://support.office.com/en-us/article/Use-AutoFill-and-Fl...
[0] https://www.microsoft.com/en-us/research/blog/deep-learning-...
[1] HN Discussion: https://news.ycombinator.com/item?id=14168027
It's not production ready / launched yet, but it's getting there.
I'd be interested to hear who finds (or really doesn't find) this useful :)
It can't do miracles, but this is time saving in many cases like when you want to concatenate values from different columns in a new format into a single column and so on.
Ok, just realized somehow the site has vanished. Not working archived version: http://web.archive.org/web/20161028231256/https://www.transf...
For example, "sort all of the folders, so that Alan goes before Amy, etc". The rule ("sort") is pretty ambiguous, but one simple example in the context gives enough information to realise you probably mean alphabetically by first name.
Is there something like this example that could be combined with NLP to make things like these "intelligent assistants" we have now much more useful for data processing tasks?
It would be great to describe data manipulation to a machine the way that I would describe it to a colleague: give an overview of an algorithm, watch how they interpret it, and correct with a couple of examples in a feedback loop. Currently describing such things for a machine requires writing the algorithm manually in a programming language.
Being able to quickly solve the most common cases (which rely on such "common knowledge") would automate a lot of work that now requires writing a complex program in advance, and would allow the user to concentrate on the outliers that require more thought.
IIRC this is tested heavily in IQ tests.
What I mean is if every row had a date like "12 May 2002" and you wanted it turned into 2002.05.12 then it would be nice if it indicated when it added data. For example if one of the rows just read "15 May" then, since there is no year, it would not be completely absurd if it transformed into 2017.05.15 - or if all of the other data is 2002, then adding that. But I really think silently adding data that was not in the input is going too far. A transform shouldn't ever silently inject plausible data with no indication that this is interpolated. Bad things can result.
Otherwise great demo!
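The concern about silently injected years could be handled by having the transform report when it filled something in. A minimal sketch, assuming English month names and the two input layouts mentioned above; `reformat_date` and its `default_year` parameter are made up for illustration:

```python
from typing import Optional, Tuple

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

def reformat_date(text: str,
                  default_year: Optional[int] = None) -> Tuple[Optional[str], bool]:
    """Turn '12 May 2002' into '2002.05.12'.

    Returns (formatted, inferred). `inferred` is True when the year was not
    present in the input and default_year was used instead. If the input
    cannot be handled, returns (None, False) rather than guessing.
    """
    parts = text.split()
    if len(parts) == 3:
        day, month, year = parts
        inferred = False
    elif len(parts) == 2 and default_year is not None:
        day, month = parts
        year = str(default_year)
        inferred = True
    else:
        return None, False
    if month not in MONTHS or not day.isdigit() or not year.isdigit():
        return None, False
    return f"{int(year):04d}.{MONTHS[month]:02d}.{int(day):02d}", inferred

full = reformat_date("12 May 2002")
partial = reformat_date("15 May", default_year=2002)
refused = reformat_date("15 May")
```

A UI could then highlight every row where the second element is True, so the added year is never invisible.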
https://www.microsoft.com/en-us/research/publication/transfo...
Though it probably also uses more recent work from the same group:
"Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example. SIGMOD Conference 2017: 683-698"
This would be great for refactoring code.
The experimental Lapis editor[1] did exactly this, by the way.
- 123 456 = 123456 valid
- 1234567 = 123567 not valid (dropped 4)
Properties:
- output may not contain whitespace
- no number characters may be dropped
- characters may not be reordered
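The properties listed above can be checked mechanically on any proposed output. A small sketch (the function name is hypothetical); note that comparing the digit sequences also rejects outputs that add digits, which is slightly stricter than the list:

```python
def satisfies_properties(inp: str, out: str) -> bool:
    """Check a candidate output against the listed properties."""
    # Property 1: output may not contain whitespace.
    if any(ch.isspace() for ch in out):
        return False
    # Properties 2 and 3: no digit dropped, no characters reordered.
    # Comparing the digit sequences in order covers both at once.
    in_digits = [ch for ch in inp if ch.isdigit()]
    out_digits = [ch for ch in out if ch.isdigit()]
    return in_digits == out_digits

valid = satisfies_properties("123 456", "123456")
invalid = satisfies_properties("1234567", "123567")  # dropped the 4
```

A synthesized program that fails the check on any row can be discarded, which rules out the "remove the middle character" cheat.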
You could even say that it only needs to approximately reproduce the output with some tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees.
Take image classification as an example. CNNs can do it by finding nonlinear patterns that exist. A decision tree would have a very tough time doing it because the pixels have a complicated relationship with each other that defines what the image is of.
I think for something like the data transformations we're talking about, a neural network would be pretty overkill. It looks like this feature in Excel works by comparing the data to pre-defined formats, which is probably done by searching all known formats in a somewhat intelligent (not AI, just intelligent) way so that it's fast. Then it can output that type of data in whatever form you want.
Your comment gave me an interesting idea though: What if we put neural networks inside of decision trees?
You haven't fixed anything here. You've just encoded your training data in a neural net and then presented the same problem to the decision tree learner. Unless you're planning to transform your training data somehow?
I'm imagining a hypothetical example where generalization is easier to achieve with a neural network than with a decision tree using standard training techniques. Then a tree trained on the network might generalize better than a tree trained straight on the original data, with the additional benefit of being less of a black box than the network.
Actually that's somewhat less true for big decision trees. But the general point is that you can train interpretable models to mimic the output of uninterpretable black boxes.
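Here is a toy version of the "train an interpretable model to mimic a black box" idea, in plain Python. Everything in it is an assumption for illustration: `black_box` is a hypothetical stand-in for a trained network, and the "tree" is a single-threshold stump fitted to the black box's predictions rather than to any original labels.

```python
import math
from typing import Sequence, Tuple

def black_box(x: float) -> int:
    """Hypothetical opaque model; stands in for a trained network."""
    score = 1.0 / (1.0 + math.exp(-(4.0 * x - 10.0)))
    return 1 if score > 0.5 else 0

def fit_stump(xs: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Fit the rule 'predict 1 if x > t'.

    Returns (t, fidelity), where fidelity is the fraction of points on
    which the stump agrees with the labels it was given.
    """
    points = sorted(xs)
    # Candidate thresholds: midpoints between neighbouring inputs.
    candidates = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    best_t, best_hits = candidates[0], -1
    for t in candidates:
        hits = sum((1 if x > t else 0) == y for x, y in zip(xs, labels))
        if hits > best_hits:
            best_t, best_hits = t, hits
    return best_t, best_hits / len(xs)

# Query the black box, then train the stump on its answers.
xs = [i * 0.5 for i in range(11)]            # 0.0, 0.5, ..., 5.0
teacher_labels = [black_box(x) for x in xs]
threshold, fidelity = fit_stump(xs, teacher_labels)
```

The stump's threshold is something a person can read directly, and the fidelity score says how faithfully it reproduces the black box on the queried points, which is a different question from how accurate either model is on the real data.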
The biggest issue is that decision trees only work for data with fixed inputs and outputs. Recurrent NNs work on a time series and possibly even have attention mechanisms.
True
> tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees
True
However, my guess is you'll wind up doing only one of (a) getting a more accurate tree model than by training it directly, or (b) improving the understandability of your model significantly.
Also seems like it’s another magnitude of complexity in the neural net to have it not only train and learn on your inputs, but also train and learn on its own training and learning.
Consistency is enforced by the dataset, and also by the model. Both outputs would read from the same hidden layer--the one that encodes the desired transformation.
The third neural net would do the checking, obviously.
And how would you evaluate whether the explanation was correct or not?
But more importantly, the decision tree will model the behavior of the NN, not necessarily the original data. Which is what you want, if your goal is to understand what function the NN has learned.
But that's the whole point of this method! To understand what errors the NN might be making. It's also quite possible the NN's errors aren't really errors, if there are mistakes or noise in the labels.
This technique has been called "dark knowledge" and is really interesting. See http://www.kdnuggets.com/2015/05/dark-knowledge-neural-netwo... They train much simpler models to get the same accuracy as much bigger models, just by copying the predictions of the bigger model on the same data. In fact you can get crazy results like this:
>When they omitted all examples of the digit 3 during the transfer training, the distilled net gets 98.6% of the test 3s correct even though 3 is a mythical digit it has never seen.