Classes vs. Data Structures(blog.cleancoder.com) |
Classes vs. Data Structures(blog.cleancoder.com) |
I followed the link, it's so much worse.
Worse? Yes.
How much worse? So much.I find it difficult to explain pointers to people because I don't understand what they are missing. I could use some help understanding their lack of understanding.
When they don't get "indirect reference to the address of a data structure or object in memory", I'm stuck on how to proceed. Pointing people to the very elegant treatment of this topic in the old K&R doesn't always do the trick. Somehow.
It's only patronizing to people who are motivated self-learners, such as you. To people who are unmotivated and/or struggling with the concepts (such as high school students), it works better than most other styles I've tried.
I've spent a lot of time tutoring high school and elementary school students. Many of them are so frustrated that they just want you to give them the answer. But then if you do they write it down and declare "problem solved." In the long run, students who continually weasel the answers out of their tutors end up far, far behind those who work with a Socratic teacher and are persuaded to try to think of the answers for themselves.
Unfortunately, the most common usage is at best ineffective, for as you say, they will dictate some logical jump I don't agree with, so when they disprove it in favor of their point I remain unconvinced as my own argument hasn't been addressed at all.
At worst, it's infuriating.
If a teacher wants to convey a specific opinion or viewpoint, they should just share it. But if a teacher is genuinely willing to listen and wants to get their students to think critically about a topic -- without a desire to navigate the students in a specific direction -- then the Socratic method is very useful. But the teacher has to be truly willing to listen to their students. And that is hard for some teachers.
I couldn't get much further past this:
What is an object?
An object is a set of functions that operate upon encapsulated data elements.
...what?It's like HN guidelines - "assume good intent".
You must admit, you're throwing stones from a glass house when you write like that :p
Plato merely makes this process explicit in his Socratic dialogues.
So, the existence of poorly thought out or agenda driven dialogues do not in themselves show the method is a bad one, just like bad musicians do not discredit music as a whole.
The Socratic approach can be great when everyone's on the same page, but if you are teaching people from the ground up then all you'll really do is come off as a jerk.
1) Author says object and data structure are different things basing on stated difference, but doesn't explain why this couldn't be a different point of view on the same thing.
2) When author says "adding a class is easy when you need a new entity, while adding a function is easy when you deal with data structures" he doesn't mention that only "packaging" of code - into switch cases or separate methods - is different. To add Triangle, one still has to add both fields which contain Triangle data and all the methods which work with that data - area, perimeter, center. Similarly, adding data structure "Triangle" also needs all those additions - and only them - but in a different form.
And many derivations from this perceived difference are, um, questionable.
Yet I like author's explanation of impedance mismatch. True, data storage has to satisfy many needs (roughly, we don't want to copy data - normalization), while logic may focus on specific goal at hand.
I could write an in-depth critique of the whole post, but it's 5:30 pm and I would rather focus on finishing my work so I can go home and be with my family.
So instead of that, I'll just give you a summary of what I think about the post:
1) The post essentially compares two different approaches: a specific variant of object-oriented programming and a specific variant of functional programming.
2) There are various points in the post where the author plays fast and loose with definitions of certain concepts, in order to make them fit his narrative.
3) The Socratic dialogue format does not help make the reasoning clearer, it just makes the readers uncomfortably aware that they're being railroaded into reaching certain conclusions.
4) The conclusions reached in the post are a mixed bag of: a) useful and interesting, b) correct but incomplete, and c) correct only under very specific circumstances, rendering them potentially dangerous.
Lesson is just a transfer of knowledge - which may turn out to be useless, but only after the fact.
Of course.
If reader knows better, why reader reads at all?
Because the reading, as you yourself said:
...may turn out to be useless, but only after the fact.
I don't find it wrong, just irrelevant.
Oh, and using "data structure" to mean a "struct" is annoying for us old farts that learned the standard meaning, it's causing people to get confused. Actually, since classes are a data structures (in the old sense), the very title sounds like nonsense.
Nothing too bad, every day people does much worse in the Internet, but add the μαιευτικη´ and it comes out as inappropiate.
As a form of writing, you're left to guess at what your pupil might be thinking. Apart from Plato, few people have ever managed to succeed at this.
In a way though, Plato tries to tell a story rather than blabber away facts. It allows for the introduction of stupid or "less-intelligent" questions rather than setting up obvious straw men.
.02
That's because to them, _everything_ is an "indirect reference". Every time they use a variable it feels like an indirect reference, so what the hell is so special about a pointer?
Took me an intro class in computer systems before I could really understand pointers, because nobody wanted to bite the bullet and teach me about the stack. Pointers don't make sense if you don't know that the stack and the heap exist.
Now, THAT's interesting...
The C memory model and pointers come easily to those that understand what a memory address is and how long one is on their favorite architecture.
In my experience, people without this knowledge might understand how to use pointers, but not really why they are. And the answer to the "why pointers" question is _not_ "Imagine you want a function that swaps two parameters" nor "because they you would have to copy structs all of the time"
Then, it is simply "0x100 is what's in a pointer".
I don't think that is complex (it shouldn't be). It is intuitive to most. Seeing a picture should solidify things. And finally, it is semi-accurate to how a pointer actually works. Getting into the mud of "A pointer is like a reference in a scholarly paper" or "A pointer is like a link on a webpage" simply confuses what a pointer actually is.
And, if that isn't enough, simply write a program using pointers and print out pointer values. It really is a crazy simple concept that doesn't need a whole lot of metaphors.
What's worse is when the authors of "what's a pointer" don't know what a pointer is themselves. I've seen that far too often.
The correct simplification to make is not explaining virtual memory or MMIO. Pretend we have nicely-acting linear address space. The computer is a mail-sorting octopus and when it runs out of hands it only has mailboxes to put things in. Still not a great analogy.
Everytime I start to hear "addresses are like real addresses, ways to find where something lives" I know it will be an explanation for people that want to understand them well enough to not use them.
It is also rough because some people try to explain pointers with Java. AP CompSci in american high schools is happy to try to explain pointers to Java users, it's not surprising people have a rough time!
So is this not how it's explained nowadays? No wonder there's people that doesn't get it. So many concepts, not only about programming, I have learnt in my life first with "this is what happens", then with the precise definition...
Consider that p=f and p=&f both can be used to assign the address of a function f to a function pointer p. To a beginner it could look like p takes either type, or that the & really serves no purpose.
Consider these four lines of code:
char *p1 = "hello";
char a1[] = "hello";
char *p2 = &"hello";
char a2[] = &"hello"; // only this one is junk
For the assignment to p1 you get an implied & effectively added into your source code. It may be convenient, but it leads beginners astray with muddled thinking about types.A little-known limitation of C is that you can't really pass arrays to functions. You get that free & again, meaning that you end up passing a pointer. We should need to do printf(&"hello world\n") but that would be inconvenient, so instead we have a language that appears to behave in inconsistent ways.
Array dimensionality plays a role here. A real X by Y array, with two dimensions X and Y, can syntactically be used in the same way as an array of pointers to arrays. After being passed to a function, the sizeof operator breaks for both (since now you have just a pointer) but the 2-dimensional nature is lost for only one of the styles.
Even the bracket notation is trouble. It is why cute stuff like 4["foo"] can be used to mean 0. That seems harmless enough, but the oddities extend well beyond syntax that nobody should be using. Other issues are pointer arithmetic being surprising and some people being surprised that pointers have types beyond simply "pointer".
I find that the best hope for a clear understanding is to constantly remind the student of the above, right from the start. This is slower going, but it prevents serious misunderstandings from persisting and building on each other.
There you have it. Pointers in two sentences.
The "indirect reference" yada yada is too technical, requiring your audience to already understand those terms. It also ends up leading to cyclical definitions if you try going too much into detail (because data structures can hold pointers).
Maybe they don't know what they are missing.
When I was learning C two decades ago, my first thought was that pointers were solving no real problems, and that I will be able to do without them. Then I had to implement the "swap" function.
Analogies seem like they help, but often lead to error when you reach a point where the analogy is stretched to far.
To see if you understand pointers, try exercise #3 here: https://www.joelonsoftware.com/2005/12/29/test-yourself/ . Just step through the code in your head, and then use your mouse to highlight the text in the box below it. Hopefully that helps you to get started in the right direction.
You may also pass around the map to your friend in effect showing them how to get to the place, without possibly even having seen the place yourself.
And if you are a bad person, you could share maps that lead to dead end dumps
Maybe it's because I started in assembly language, but to me, data exists at an address in memory. Or starting at an address in memory. That data might be on integer, a float, a character, the beginning of a string, the head of an object/data-struct, OR an address of data somewhere else (which may be any of the above again). Pointers, to me, are simply an address of an address of a data thingie.
Why would you want to use these? One reason is that pushing large data-objects back-and-forth across function calls is expensive (pointers are usually just 32-bit or 64-bit numbers). Another is that one can have a pool of resources (think polygons in a large 3d scene), that are marked "used" or "unused". This gets a large performance boost from not having to do the whole malloc/free thing on lots and lots of small objects. Another (this is related to the first), the function can receive a data-structure, and modify it (side effects!) without having to copy it and pass the modified copy back.
I'm intrigued by your mental model. Having not thought it through - it seems as if lat/long is appropriate, but maybe even the idea that it is a coffee shop that you will find there is undetermined. (That is: the fact that it is a coffee shop is one of the things that is interesting about it, like the menu and hours. Again, I need to think this through).
Thanks!
It takes very little time to allocate space on the stack and not too much time to copy data from one register to the stack or another register. Very often, if you don't use pointers, the compiler will take your data and throw it into a register and reference that register directly in the method you just called (free).
Now, if you are going to allocate on the heap anyways, then sure, it makes sense to start looking into things like object pools. But that wouldn't be a baseline I'd start with.
Objects with a few ints or longs in them can be passed around free or nearly free with modern compilers.
Here is an example of that concept in action
You really don't get much faster than that.
That’s not to say your model is bad, it’s the same one I have in my head. But it’s no more ‘real’ than coordinates or library cards or P.O. Box number.
I do think one area that causes issues with C is any sort of string manipulation requires a reasonable understanding of pointers, and most introductory programming classes tend to focus on things like string manipulation.
And just like real world street addresses, you can't make blind assumptions about pointers.
The value may make logical sense, or it may be completely meaningless and you require direction from someone to find it.
You can't just write down a random address and expect a building to exist at it.
You can hold onto the street address for a coffee shop, put it in a notebook, then one day go to the address to find that the coffee shop was torn down and the street address is invalid.
If so, then you can think of memory like as a giant array of bytes and a pointer as the index in that array.
Hurray, you now understand pointers :)
Don't muddy it up beyond that.
In some languages, an array is dynamic. You don't have to declare the size. You can just put stuff at any index you desire, and the array will grow as needed. For example, just set foo[1000000000] to 42. That might take a gigabyte or more of RAM, or just address space, or neither.
The indexes might not be required to be numbers. Some languages allow strings or even arbitrary objects as indexes. You could index by a JPEG or a device handle or a float.
The values might not be required to have a consistent type. Some languages would let you throw random different-sized things into an array.
I feel like the prevalence of linked lists in discussions wrt pointers is proof that nobody is understanding these analogies.
At least there should be some regular pushback from beginners who conceptualize the linear addy space and say, "Hey, what is the distance between this item and the one I just inserted into the list?" Or, "Does it take time to jump around among all these links?"
Edit: And, "How does the time moving among links compare to the time iterating over an array?"
Doesn't really matter for the general concept.
* The indexes might not be required to be numbers.
In most of those languages, that data type is called a map, dictionary, or even associative array. I've seen few languages actually use the phrase "associative array" primarily because it is confusing.
But, to be clear, I'm talking about the common "array" definition. That is, a value indexed by a number.
* The values might not be required to have a consistent type.
Yet, I was clear in saying "a byte array". A pointer points to the byte address and doesn't care about what type is ultimately represented. That is why void pointers are a thing. It is a language level action to reinterpret the pointer into a concrete object type.
But again, that confuses the issue of what pointers are with concerns like memory layout and object sizes. To be clear, you don't need to those concepts to understand what a pointer is.
* Some languages would let you throw random different-sized things into an array.
I specifically said "array of bytes". It does not matter what a language lets you do. A pointer is the head of a byte address and the interpretation of that byte and subsequent bytes are a language level concern. All unneeded information for a discussion on "what is a pointer".
It's like the execution model, where processors do a lot of out-of-order execution, but do a lot of work so they can present a model to the programmer as if everything happens in-order.
Part of the confusion in teaching pointers (I think) is that we act like they are a lower level abstraction than they are (because they used to be). On modern commodity hardware with mainstream languages pointers have very little to do with memory addresses. Accessing memory via them will not happen in constant time, they may or may not perform better when used, even if the language doesn’t copy them on use the OS or processor might etc.
I just find when I try to teach people the memory address model of pointers there are so many exceptions that the model isn’t helpful. Like the OP I struggle with an alternative.
(I find that the P.O. Box number thing to be a useful general abstraction for me, but it doesn't seem to convey the concept well to people who are confused. Hence the intent behind the intent of my question: "Does anyone know how to effectively teach pointers?")
Then you get to data structures, and you're trying to explain the fundamental difference between dense, buffer/array-like structures and sparse, graph/pointer-like structures, and there might be no concept of an underlying memory model where the difference is clear.
This has always seemed like a horribly ambiguous and inconsistent mental model to me, and there must surely have been quite a few bugs caused by programmers not clearly understanding the differences between value and reference semantics and exactly when each applies in their chosen language(s). And yet for reasons I can't understand, many of the popular languages today do seem to follow something close to this model.
My favourite book at college about (the real) data structures started building the array, even if it was already implemented in the language.
Of course you also have to mention that even assembly is an abstraction. Memory isn't really a big array, cpus have caches and all that jazz. This matters when writing performant algorithms.
edit that being said, after your comment I went out and looked up "What's a pointer" to see how it is taught. Pretty much every web page I hit does exactly what I described, so, awesome :).
https://www.youtube.com/watch?v=QM1iUe6IofM https://www.youtube.com/watch?v=yy8jQgmhbAU https://www.youtube.com/watch?v=rX0ItVEVjHc
In fact I find good data driven designs easier to maintain than good OOP ones..
I'd really like it if Uncle Bob eventually has his fill of Clojure and moves on to explore what Common Lisp built decades earlier, then blogs about that too.
I think one very clear example is hybrid sum types like Scala's case classes (on sealed trait/abstract class), Kotlin's sealed classes and probably Swift enums too. They can all be used as a pure sum-type, but they don't forego inheritance and polymorphism.
I think the more important distinction relies on how you use it, regardless of whether the language allows you to do more. Do you make the data layout a contract (in the case of sum types: commit to closed set of cases) and make changing the "schema" harder? Or do you make the provided set of functions a contract and make it harder to add new functionality that will be supported by all data types?
Julia, builds on this tradition but allows you to have your cake and eat it too. It has multimethods/generic functions and they are the only option—all user defined functions are multimethods. They also have excellent performance (they're used for everything, they have to).
Of course, there's no free lunch and you do give up traditional separate compilation, but the degree composability it gives to the ecosystem is hard to comprehend without experiencing it. Simple, reusable data types are shared across the ecosystem with anyone adding whatever (external) methods they want. Generic code that handles a literally exponential explosion of argument types "just work"—and the compiler generates fast code. All without doing anything special, since multiple dispatch is the default and only way functions work.
I stopped using Clojure and don't consider it for new projects because I think types are invaluable documentation now, and it pains me that clojure and it's community doesn't believe the same way (typed clojure[1] does exist but it's contentious).
I was going to say that Haskell addresses this by putting values in data types and behaviors in "type classes", but then I remembered that functions can be values... which is now making me think that the reality is probably more abstract or complex than the author here is letting on.
> Classes make functions visible while keeping data implied. Data structures make data visible while keeping functions implied. > Classes make it easy to add types but hard to add functions. Data structures make it easy to add functions but hard to add types.
is known as the Expression Problem https://en.wikipedia.org/wiki/Expression_problem.
The last point:
> Data Structures expose callers to recompilation and redeployment. Classes isolate callers from recompilation and redeployment.
is only somewhat true. I suspect it would be more accurate to say that it's a matter of indirection: static dispatch isolates callers from recompilation; static dispatch exposes callers to recompilation; calling a function pointer isolates callers from recompilation; calling a function directly exposes callers to recompilation. (All of this in statically typed languages.) Though this isn't my area of expertise. Perhaps someone else knows more? [Edit: sounds like these "expose" cases often don't cause recompilation either.]
... most projects (maybe all?) I've worked on in 20+ years deployed a full set of new bits upon release, rather than trying to differentiate at the level of which source code files were and were not touched.
So this strikes me as a carryover from many years ago when working on large C++ projects with slow compile times was even more painful than it is today.
Any counterpoints?
But it seems to be a rather weak point, yes.
This reminds me of Linus's quote:
"
I'd also like to point out that unlike every single horror I've ever witnessed when looking closer at SCM products, git actually has a simple design, with stable and reasonably well-documented data structures. In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful ().
() I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
"
I'm one of those people who designs the data first and then the code. But when you design the code, you shouldn't be making it a one-to-one mapping with the data.
When you design your classes, they should be the best representation for the programmer to use and not necessarily just identical to storage format.
As well, the most convenient structure for the user of your classes is most likely not the best format for storage.
I don't understand: "but the existence of the data structure implies that some operations must exist."
Grounding it out to a specific data structure, the existence of `List` implies that e.g. `sort` exists?
That direction makes less sense than `sort` implies the existence of e.g. `List`(something to be sorted).
The existence of List implies that operations must exist to insert an element in a list, access to elements in a list, find out the size of the list, etc.
Edit: BTW 'implies' is the the magic word in the text. It's what creates all the appearance of meaning. Try to replace it what something else. Now I remember why I disliked Plato so much.
Not at all like the private members of an object, which I think was the analogy being made.
The question of where data is stored for an object is far less important than how to access data in a data structure. Like, the whole reason the data structure is designed the way it is, is to access it in particular ways.
> An Object is a set of functions that operate upon implied data elements
If this is replaced with:
> An Interface is a set of functions, often operate upon implied some implied data elements (but not necessary).
Everything in the article will probably be less confusing.
When I started using Rust I wasn’t used at all to seperate data and behaviour that strictly, but it makes sense. OOP paradigms were still hardwired in my head so the hardest part was actually wanting to do it in that decoupled way. Something about a car object that has wheel objects and a car.drive() function gives you a good feeling as a programmer, but sometimes it is more effective to stay with the data structure and describe the car as a struct of vectors which implement a Driveable trait..
Encapsulation means you never have the canonical data. The server is the system of record. You can get data back in response to queries, perhaps even a full data dump, but it's a snapshot. You can also send commands to mutate data on the server. Typical applications aren't allowed to do a complete replacement (restoring from backup).
On the other hand, data is better thought of as what's going over the network. Messages consist of data. Encapsulation is almost meaningless; if you want to keep something private from the receiver, don't include it in the message in the first place (or use encryption, maybe). Anyone reading the data has to be able to understand the format, or at least, ignore what they don't understand.
In the degenerate case where the caller and implementation live within the same process, the same types often get used both for message transfer (in function arguments and return values) and storage. There is widespread "cheating" for performance reasons, and it can get confusing. For a transient process, it might not make sense to think in these terms at all. (Traditional Smalltalk used persistent images and client-server style encapsulation makes somewhat more sense there.)
You can also "cheat" by using the same schema for data transfer and storage, or having a trivial mapping between them. This can introduce unnecessary coupling, but there are systems where it works. (Consider that you can make a full clone of a git repo and it doesn't encapsulate any data.)
I'm not sure about the switch approach described in the post:
function area(shape)
switch shape.type
case "square": return shape.side ** 2
case "circle": return 2 * PI * shape.radius
case "triangle": return ...
case "segment": return 0
case "polygon": return ...
...
case "oval": return ...
You can have a lot of cases, some of them requiring non trivial code...
Eventually you write a function for each case and it's more work than adding a method for each shape because you still need to write the switch...Classes seem work better than structures here.
But then you want to handle intersections
The switch approach doesn't seem realistic:
function intersection(shapeA, shapeB)
if(shapeA.type == "circle" AND shapeB.type == "circle")...
if(shapeA.type == "circle" AND shapeB.type == "square")...
if(shapeA.type == "square" AND shapeB.type == "circle")...
...//uh oh you have nShapes**2 cases to handle
But java classes or not better: where do you define Circle-Square intersection? In Circle? In Square?Even with multiple dispatch the solution is not ideal. You now have some things related to Circle (area, perimeter...) in the Circle.blub file, and intersection(Circle, Circle) wich only works with Circles is now in intersections.blub...
I don't see a good solution and sometimes I feel like the problem is more with our tools (code in text files) rather than programming paradigms
I don't know how we're quantifying so as to assess "most", but at least in some popular static languages there are circumstances (most notably inline functions) where callers are likely to be recompiled and circumstances (dynamic loading) where they clearly won't be.
> But if one changes data structure, then the callers has to recompiled.
Only if you're changing parts of the data structure that are visible to callers. For instance, if your API operates on opaque handles you can change the underlying data structure however you'd like without recompiling the caller.
Because they allow us to avoid discussing the complications that arise when entities have lifetimes, over which, at different stages, different operations are meaningful.
Data-structures are put most simply as ways of organizing data. The organization of the data implies a specific layout--a way of representing your data as a table of integers, ie. in RAM.
After reading the article, I don't see a meaningful distinction between objects and data structures. A data structure can be represented as an object, especially in OOP languages where it must be.
A general class doesn't necessarily lay out its data in any particular way--which allows abstraction over the data representation of the class.
However, some classes are made which are designed in a manner which guarantees a certain data layout. std::vector with its contiguous memory requirement comes to mind.
To add to this, the c++ conception of a "concept" or haskell's concept of a "typeclass", or even a generic class, is really what this article is talking about. Or even Java or Go's interfaces. There is absolutely no way to guarantee a specific data structure through an interface, typeclass, or concept, since they fundementally do not mention their data representation at all.
Perhaps I am nitpicking, or perhaps I am reading this wrong, but I would not design the square data structure to have a perimeter function. The square data structure should just expose the data that describes a square (length, width). Adding higher abstractions (perimeter, etc) on top of the data structure only serves to create the trumped up problem later described in the dialog. The perimeter method should be defined in the Square class, where perhaps a "StraightLinesOnlyPolygonMixin" could define the perimeter method.
In general, I cannot see why a Data Structure would define computational methods. You are tightly coupling logic to the underlying data source, which is wrong when that logic obviously could apply to any underlying data source (I don't care if my square is backed by RDB, S3, a hardcoded instance, etc) The perimeter method, and probably the Square class, should be the same.
And more pressingly, why would you ever pass a data structure to a function that had more than one algorithm to compute a result? The data structure (or perhaps some intermediary (an adapter?)) should own the algorithm within a function that computes only on that data structure. This ensures that all methods associated with your data structure are obviously and explicitly associated (in a single file, class, whatever). The alternative, as outlined in the dialog, is to spread a bunch of switches all around your code. Given these two possibilities, why would one choose to place switches in disparate places throughout your code?
I should have stuck with the same metaphor, but regardless of whether it is perimeter or area, the output is a function of the description.
Here it jumps to objects instead of database VIEWs, that can be tailored for each application. There's no need for complex object models when you embrace the db and you don't treat it as a dumb store.
Technically, ORMs are a set of waldos that you operate inside a glove box in order to manipulate the data in the DB without getting DB cooties on you.
A database table is both a data structure and a collection of functions for accessing/manipulating it (SQL). An ORM maps between the "Object Oriented" objects and the "Relational" objects that are tables.
Interesting the author thinks about the the api exposed by a table as the "data structure" itself rather than an object. Pragmatically, we tend to refer to the "objects" at the lower level of abstraction as data structures. Is [...] a function that defines an array, or the array itself?
Correct me if I'm wrong, but he seems to be saying that, to avoid breaking consumers of your library often by changing the implementation, hide the implementation details behind classes, right?
This seems to be a common reaction to someone who experiments with the "freedom of functional programming" where that freedom means operating on and returning raw data structures that OOP usually hides behind private variables.
That's still bad practice, even in code that heavily uses FP, and good code usually mixes FP and OOP properly, so that you're given functions when you're meant to have functions, and data when you're meant to have data. This is how I've been writing JavaScript for a few years now, and it's not how I've seen Java or Clojure usually written.
In Rust, we keep the same distinction by modeling data structures as structs and enums, and modeling the “object” side by traits (whether static- or dynamic-dispatch). Traits decouple a consumer from the particular data and emphasize a behavioral contract, allowing any data structure to implement the desired behavior.
Objects may possibly have broader uses that what is described here ("business data applications") but within the definition given, the description of Object and operations on object make a lot of sense. This post is worth re-reading a few times.
It's not particularly intuitive if you're not used to high level math or functional programming, but it really is a lot simpler (not that it doesn't have downsides/leaks in the abstraction, but that's another discussion).
In Smalltalk, you can have a TreeNode class and instantiate TreeNode instances. This is often the way that binary trees are implemented in Smalltalk. In that case, data structures exist as a design of interacting objects. In that case, you can molest them fairly directly using the object interface.
The same goes for Java and C#. Basically, you can design in such a way, that you can essentially have C or Fortran in any language, even a pure OO language like Smalltalk. The right way to do this in Smalltalk, is to use a Facade, which can be used to hide the guts of your data structure.
As a side note, I hope that the situation with pre- and postcondition checking improves (runtime verficiation like in Racket seems too heavyweight for me, dependent types alone are kind of clunky, type refinements are either bolted onto existing languages (LiquidHaskell, F7) and don't always fit, or exist in very obscure ones (F*)), and we can drop information hiding for the sake of safety alone. I don't like how today the "safe" option is to severly limit what you can do with a data type (want to safely index an array? just use an iterator! oh, you wanted to index two arrays at once? just zip them and hope the compiler makes the tuples go away! you wanted O(1) indexing? too bad, no can do).
I suspect it's because of the polymorphism. Why write a function that operates on some unknown a for which a Foo implementation exists when it's so easy to write a function that operates on any a for which a Foo implementation exists?
The only time I've seen existentials used is for safety - in particular ST-style monads where a fancy type is used to ensure that you can't "leak" state out of the monadic context.
In C#, you create a class when it's an object, and a struct when it's a data structure, or am I missing something?
ie:
// Data Structure
public struct Foo
{
public string Bar { get; set; }
}
// Object
public class Bar
{
public string Foo { get; set; }
}
You can check to see if something is a Data Structure like so: typeof(Foo).IsValueType (true)
typeof(Bar).IsValueType (false)Yes and no. Due to the technical limitation that (unlike in C++[0]) structs in C# cannot derive from other structs, DTOs in C# are often implemented (or rather: generated) as classes to allow derivation from a base class that has certain behavioral hooks (say, for misc. custom serializers).
[0] See https://www.fluentcpp.com/2017/06/13/the-real-difference-bet... (the real technical difference between struct and class in C++ is just that the default visibility modifier for a struct is public and the default visibility modifier for a class is private; both can use inheritance and have virtual methods)
That said, ultimately it doesn't really matter. If you're going to implement collision detection like this, then yes, you will have a combinatorial explosion. This is not a language issue. Switching from Java to some other language with a different form of dispatch will not save you from implementing a lot of algorithms when adding bezier curves into the mix.
The practical approach is reduce the problem to a common case, e.g. to turn the collision shapes into a set of triangles first, and then perform triangle-triangle collision detection.
Worse than faking it in (say) C++ using the visitor pattern?
oh come on. I can get a full build of qt5's main libraries (core, gui, widgets, network, xml, etc) in 15 minutes on my laptop. It took an hour a few years ago. Compilers are getting faster all the time.
And making use of binary libraries as well.
Eventually with C++20 modules the situation will improve, and maybe by 2025, those that still care might finally enjoy Delphi compile times with C++.
struct list {
float node;
struct list *next;
}
Above is the data structure; it implies operations.
Below is an interface (class, in the article); it implies data. #define LIST_H
float index(struct list *ls, int i);
int find(struct list *ls, float x);
void sort(struct list *ls);That said I don't think that any particular set of operations is implied. A data structure will be more suitable for some operations than others, but it won't really require anything in particular.
It ultimately boils down to the difference between static and dynamic polymorphism (Even more visible in Rust, where regular polymorphism works exactly like C++ templates, while trait objects pack a "v-table" together with a structure. In Haskell it's a little more blurry since "static" polymorphism is already implemented in a way that doesn't easily translate to templates, for example allowing polymorphic recursion).
You are touching on something called the "expression problem". You might be interested in reading some commentary about it. There is an inherent decision to be made any time you have many algorithms each operating on many data types. In most programming models, you have to choose between grouping your code based on your data types (but then any new algorithm needs to be implemented on each data type) or based on your algorithms (but then any new data type needs to be supported by a new case in each algorithm). Neither is the "right answer". You fundamentally have a two-dimensional system here, and you need to decide which axis you are going to prioritise in your design.
I think I have encountered this issue in the past but was turned of by the lack of formality in the discussion. I wish there was more academic, concrete discussion of these issues, because I feel that what I am doing now (informal discussion) likely has many holes.
If you bother to take a look to C APIs, you'll find the familiar "handle" parameter, equivalent to the "this" or "self" of OO languages. Setting aside all the philosophical mumbo jumbo, OO is syntactic sugar over those interfaces.
The text would have been a lot clearer if that would have been more explicit from the beginning!
Things like Java serialization and Python pickle attempt to do what you say and are considered failures (or at least security risks) because they allow a third party to act on object implementation internals.
Security aside, a denormalized representation of data could be different than the implementation specific representation that's encapsulated inside an object.
In practice, it is debatable how often that is helpful when you're implementing generic data structures. An alternative is to specify the representation explicitly and provide a set of functions designed to work with it, but also to allow direct access by other functions when that is useful.
You can still build a layer of more abstract interfaces on top and write more generic algorithms in terms of those interfaces rather than any specific concrete representation, as for example Haskell's typeclass system does.
I don't see that as an alternative. I consider this natural OOP. You don't lose anything by strictly enforcing the encapsulation because you can always explicitly provide low level methods into the data structures. Conversely, you lose all safety when you open up encapsulation. The method API of an object is the contract it provides. If you go around that contract it's much harder to make safe implementation changes.
Because of this, low level access should be opt in, in the way you describe.
Well, not really, in Java you had to opt in, and you could implement the readObject and writeObject methods to override the default behavior; similarly in Python, you can override __reduce__ and friends as needed.
The security problems are more these protocols are essentially executing arbitrary code when reading data. You'd think, "I can specify a root object and then it's going to specify what its attributes can be, and so forth."
But Python is a dynamically typed language so give a class Foo, its attributes can be anything. (Though now with annotations, it is possible to lock down precisely the types you allow; I wrote a module that does this.[1])
Java could have fixed this if they didn't go with type erasure. `List<Foo>` becomes an `List` at runtime, so there's no way to determine what it ought to contain, short of kludges[2] that other libraries use.
> Security aside, a denormalized representation of data could be different than the implementation specific representation that's encapsulated inside an object.
Yup, as soon as your process has to interact with other processes, even writing to a file and reading it later, it's possible to disagree on how data is represented or what it means. I'll plug another project of mine: I try to make the client smart[3] (enough) to allow versioning[4], as well as avoiding a class hierarchy.
[1]: https://pypi.org/project/json-syntax/
[2]: https://static.javadoc.io/com.google.code.gson/gson/2.8.5/co...
and this is because of that that OOP has a bad name.
It's ok to violate "great principles" when writing software. What I have never seen pan out is dogmatic application of principles.
That sounds appallingly inefficient for no design win. This is precisely the problem solved by Java's Iterator interface, and .Net's IEnumerable, isn't it? I can't imagine why we'd want to serialise and deserialise. Or am I misreading things?
But it's not the cleanest way by any measure. It has tighter internal coupling than any other easy to imagine way. Most of the times, people will use the list normal interface to take the data.
If performance is a requirement, most OOP people will create a database interface, that the linked list class can write without taking outside of its interface. Most FP people would do it the other way around, and create an interface to applying data to a function that the list can implement. Either way, it can tie with the dirty introspection in performance.
That doesn't seem like the cleanest to me. Instead, you'd use the interface to read/write to the list as you write/read it to storage. At no time do you want to reach into the LinkedList object and access it's internal mechanism.
But then you're not really gaining anything either, unless perhaps you have some mechanism to enforce that the low-level access should only be used in specific circumstances when it is deliberately intended. It's like writing classes that have a few data members, but then writing direct get and set accessors for each of them anyway. There's no more complicated invariant that you're enforcing at that point, and without any non-trivial invariants to be enforced, the whole argument for encapsulation and data hiding becomes moot.
Conversely, you lose all safety when you open up encapsulation. The method API of an object is the contract it provides.
Do you always need that safety, though? If your data is defined to be stored in a certain representation, and direct access to that underlying representation in that format is available if needed, isn't that representation now just another part of the contract? Again, you're balancing two competing priorities: is there some invariant to be enforced that is sufficiently complicated for data hiding to be a useful safeguard, and is it useful to interact efficiently with, or make safe assumptions about performance based on, the true data representation?
Which one is the more important consideration must surely depend on how complicated your representation and any related invariants are. Being able to pattern match against values of some relatively simple algebraic data type can be very useful, for example. On the other hand, it's not much fun to accidentally corrupt the super-efficient look-up structure at the heart of your whole system that now uses a complicated set of hash tables and the odd Bloom filter internally after someone spent three weeks optimising it for a 50% speed boost.
With collections, you can export contents through an Iterable, or to Array, or any number of other strategies, without coupling the consumer to your internal representation.
You can still provide standardised interfaces for things like iteration along with a data structure even if you choose to expose the specific representation, so I am not sure how strong an argument your second point makes. Depending on the situation, you may find your consumer is implicitly coupled to the true representation anyway, perhaps because it inadvertently relies on values being iterated in sorted order or insertion order or because it assumes certain performance characteristics even if these things are not strictly part of the documented interface. More than once in programming history, even standard libraries of popular programming languages have been updated to guarantee some behaviour that had been reliable in practice but was never actually part of the original specification.
The argument for abstraction at the procedure level seems obvious if you've ever used an interface or private members.
I think you have some picture in your head where everything has to be abstracted in a needlessly obtuse way to be an object. If you have some DTO, the fields are the contract so its fine if they're public. (In Java in particular its easier to refactor if you use the getter/setter pattern but that's not an OOP issue.)
I mean, I really don't understand what you're advocating here. Everything should be public?
You seem to be assuming a certain style of programming or type of programming language here, along the lines of mainstream OOP languages. There are other options, from low level languages like C to languages where algebraic data types and pattern matching are bread and butter, that do not attempt to conceal true representations of simple data structures in that way. The sky does not fall! :-)
Sometimes it is useful to be able to drop down to that level if the provided set of standard functions to work with that data structure does not meet your needs. You might want to support some new access pattern that is not offered in a convenient form by the tools available in the provided interface, for example.
Sometimes you might want to know, say, which order the axes go in for some multi-dimensional data structure, because writing your algorithms to match can have a profound effect on real world performance. And in this sort of situation, the advantages of being explicit about that can far outweigh the theoretical benefits of being able to replace the internal representation transparently, which is something you are extremely unlikely to be doing in an established system where this kind of issue arises anyway.
By using the standardised interfaces for everything you can change the implementation without changing how you work on it.
For example, my service that stores a collection in a database. I could write my service so that it takes in a Collection, rather than an ArrayList, because then anyone using it can pass in a Set (no duplicates), CopyOnWriteArrayList, TreeSet (ordered set).
By hiding internals and using the interface, I can pick the abstraction I need and give more freedom to those using my classes.
Another example. Once Project Valhalla and Value types come in, LinkedList might be changed to use value types for Nodes. Lets say I've tightly coupled my code to the implementation of LinkedList. This could potentially break my code.
Regarding your second example, if you are working with an explicit representation then you simply would not make a breaking change like that. Instead you would create a new data structure with the new representation, which other code can then choose to use instead if it wants to. Again, nothing about this prevents both versions from also providing equivalent functions to access them in the same way where that makes sense or writing other code in terms of those functions rather than tied directly to the specific implementation.