Never create Ruby strings longer than 23 characters

Never create Ruby strings longer than 23 characters (patshaughnessy.net)

54 points by ctaglia 12 years ago | 59 comments

nly 12 years ago |

This is known as the "small string optimisation" in C++, so you can see a similar implementation in Clang's libc++ [1].

One interesting corollary is that moving short strings in an implementation that does this could actually be ever so slightly (negligibly) slower than moving long ones (since byte copies are slower than word copies). But generally, this is a free lunch optimisation and can save you hundreds of megs of memory when writing programs dealing with millions of short strings.

[1] http://llvm.org/svn/llvm-project/libcxx/trunk/include/string - search for "union"
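For readers who want to see the shape of the trick, here is a minimal sketch in C. It is not libc++'s or Ruby's actual layout: `SmallString`, `HeapRep`, `SMALL_MAX` and the `ss_*` functions are hypothetical names, and the explicit `is_small` flag is an assumption made for clarity (real implementations usually hide that bit somewhere cheaper).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap representation: length, capacity, pointer to the bytes. */
struct HeapRep {
    size_t len;
    size_t cap;
    char  *ptr;
};

/* Hypothetical small-string type. The inline buffer shares storage with
   the heap representation, so short strings need no second allocation. */
typedef struct {
    unsigned char is_small;                   /* 1 if the bytes live in `small` */
    union {
        struct HeapRep heap;
        char small[sizeof(struct HeapRep)];   /* same size as the heap fields */
    } as;
} SmallString;

/* Longest string that fits inline; one byte is kept for the terminator. */
#define SMALL_MAX (sizeof(struct HeapRep) - 1)

static_assert(sizeof(((SmallString *)0)->as) == sizeof(struct HeapRep),
              "the inline buffer adds nothing to the size of the union");

/* Returns 0 on success, -1 if the heap allocation fails. */
int ss_init(SmallString *s, const char *text) {
    size_t len = strlen(text);
    if (len <= SMALL_MAX) {
        s->is_small = 1;
        memcpy(s->as.small, text, len + 1);   /* copy including the '\0' */
        return 0;
    }
    char *p = malloc(len + 1);
    if (p == NULL) return -1;
    memcpy(p, text, len + 1);
    s->is_small = 0;
    s->as.heap.len = len;
    s->as.heap.cap = len;
    s->as.heap.ptr = p;
    return 0;
}

/* Pointer to the characters, wherever they are stored. */
const char *ss_cstr(const SmallString *s) {
    return s->is_small ? s->as.small : s->as.heap.ptr;
}

/* Releases the heap buffer, if there is one. */
void ss_free(SmallString *s) {
    if (!s->is_small) free(s->as.heap.ptr);
}
```

Calling `ss_init` with a short literal leaves `is_small` set and never touches `malloc`; a longer one falls through to the heap path.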

Someone 12 years ago |

http://www.slideshare.net/nirusuma/what-lies-beneath-the-bea... (from march 2012) also discusses this.

Also (pedantic):

   #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

sizeof(char) is always 1, so that division is superfluous.

BudVVeezer 12 years ago | |

sizeof(char) is implementation defined; see limits.h for more information on the probable size of char for your target. If the sizeof(char) is 1, the division will be optimized away, so there's no loss by keeping the code portable.

Sharlin 12 years ago | | |

No, the size of char (in bits) is implementation defined, but sizeof(char) is defined to be 1, no matter what its size in bits.

chollida1 12 years ago | | |

No this is incorrect, see:

http://stackoverflow.com/q/4562249/25981

sizeof(char) is always defined to be one. This can't be altered by a conforming compiler.

EpicEng 12 years ago | | |

Wrong. sizeof(char) is defined to be one. The number of bits in a byte (char) is implementation defined (this is why CHAR_BIT exists). Not the same thing.
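The distinction made in the replies can be checked at compile time. A small sketch follows; `bits_in_bytes` is a hypothetical helper added only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* sizeof counts in units of char, so this holds on any conforming compiler. */
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

/* The width of a char in bits is what varies; the standard only
   requires at least 8. */
static_assert(CHAR_BIT >= 8, "a char has at least 8 bits");

/* Hypothetical helper: converts a size in bytes (chars) to a size in bits. */
size_t bits_in_bytes(size_t nbytes) {
    return nbytes * CHAR_BIT;
}
```

So dividing by sizeof(char) is a no-op everywhere, while anything that cares about bit width has to go through CHAR_BIT.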

danielweber 12 years ago |

More like "ruby optimizes for short strings, and chose 23 as the cut-off point for Reasons."

yapcguy 12 years ago | |

Can't wait for someone to write a new faster better string class which handles strings of any length by internally chopping them into 23 character portions....

fat0wl 12 years ago | | |

lol yes....... how reasonable. ahah when i saw the title of the article all i could think was "click comments to tune in for the most amusing flamewar this week" but so many of these comments are like "so... its 23 characters.... why not?!"

comeon peopleeeee, a bit of an arbitrary internal standard, no?

i understand the point about "CONCLUSION: it doesn't matter for a few strings!" but.... comeonnnn it must matter on some level, otherwise why is Rails such a pain in the ass to optimize? these things must add up...

vidarh 12 years ago | | |

I hope that is meant as a joke.

pothibo 12 years ago | |

More to the point, ruby always uses strings with more than 23 characters. It's strings that are passed to the client, and an HTML page is almost always bigger than 23 characters.

Xylakant 12 years ago | | |

Ruby is a general purpose scripting language that can be used for web development (rails, sinatra) but is often used for different purposes (puppet, chef, vagrant, shoes, ...).

And even if you'd assume web development as the only purpose, there are a lot of strings that are shorter than 23 characters: headers for requests and responses, form fields passed by the client (usernames, passwords, ...), field names passed in hashes, table and column names, template and file names, URLs or even the occasional, totally rare string in a JSON structure. It's an optimization with major gain and little loss.

ben0x539 12 years ago |

There's some discussion at https://news.ycombinator.com/item?id=3425164 , including some interesting technical/benchmarky comments.

ra88it 12 years ago |

Title: "Never create Ruby strings longer than 23 characters"

Conclusion: "Don’t worry! I don’t think you should refactor all your code to be sure you have strings of length 23 or less."

spoiler 12 years ago |

This is MRI (C Ruby) behaviour and not Ruby-specific, though. However, this is still interesting information.

anon4 12 years ago |

Wouldn't it be better to use this declaration though:

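    struct RString {
      struct RBasic basic;
      union {
        struct {
          long len;
          char *ptr;
          union {
            long capa;
            VALUE shared;
          } aux;
        } heap;
        /* NB: ISO C only allows a flexible array member in a struct,
           so a strict compiler may want a sized array here instead. */
        char ary[];
      } as;
    };

    /* apologies if I messed up the syntax here */
    #define RSTRING_EMBED_LEN_MAX (sizeof(((struct RString*)(0))->as) - 1)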

Then you can even use the padding the compiler added, if any, plus you can add more things to heap and the embed length will grow automatically.

markburns 12 years ago |

For anyone interested: he points to an older translation of the Ruby Hacking Guide, but there is a pretty much complete translation at

http://ruby-hacking-guide.github.com

alecdbrooks 12 years ago | |

Thanks for the link! I'm not interested in Ruby per se, but it's fascinating nonetheless from the perspective of data structures and how they are implemented in C.

On a related note, I've found a much less comprehensive (but still useful) guide to Python internals: http://tech.blog.aknin.name/category/my-projects/pythons-inn....

gaius 12 years ago |

I suppose the thing to do is analyse your app for the average string length, and just recompile your Ruby with that. Would be even better if it were a command line parameter.

throwaway0094 12 years ago | |

This isn't quite right. Even if your average string length is 1k+, you shouldn't change the embedded string size to 1k+. I think these objects sit on the C stack internally, which doesn't handle large objects like this well.

Also, I would guess the performance gains (from skipping malloc) would wash out the longer your average string gets -- even if the huge stack use doesn't kill your performance for some other reason (blowing the d-cache?).

ben0x539 12 years ago | | |

I don't think these strings ever sit on the C stack, except maybe if some C code/extension is being really clever. The standard representation for variables is a tagged pointer as far as I know, so I would assume that is all that goes on the stack. This optimization probably just saves another level of indirection.
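To make "tagged pointer" concrete, here is a minimal sketch in C. The encoding (low bit set means "small integer stored directly in the word", low bit clear means "pointer to a heap object") is an assumption for illustration, and `value_t`, `INT_TAG` and the helper names are all hypothetical rather than taken from the MRI source.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t value_t;        /* hypothetical stand-in for Ruby's VALUE */

#define INT_TAG ((value_t)1)      /* assumed tag bit for immediate integers */

/* Packs a small integer into the word itself: shift left, set the tag. */
value_t int_to_value(intptr_t n) {
    return ((value_t)n << 1) | INT_TAG;
}

/* True if the word holds an immediate integer rather than a pointer. */
int value_is_int(value_t v) {
    return (v & INT_TAG) != 0;
}

/* Unpacks the integer. Right-shifting a negative signed value is
   implementation-defined in C; mainstream compilers shift arithmetically. */
intptr_t value_to_int(value_t v) {
    return (intptr_t)v >> 1;
}

/* Heap objects are stored as the plain pointer. Allocators return aligned
   addresses, so the low bit is free to act as the tag. */
value_t ptr_to_value(void *p) {
    assert(((value_t)p & INT_TAG) == 0);
    return (value_t)p;
}
```

Under this scheme a variable on the stack is one machine word either way; the string's bytes (embedded or not) live in the heap object the word points at.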

pedrocr 12 years ago |

Why does "str2 = str" actually allocate a new RString instead of just pointing both str and str2 to the same RString?

alecdbrooks 12 years ago | |

That's what it is doing. The additional RString structure associates the label "str2" with the characters (on the heap) allocated for the original string.

Ruby experts can correct me if I'm wrong, but when Ruby sees a name like "str2" it looks it up in a table, which points it to the RString structure. From there, it can follow the pointer to the actual array of characters, which in this case is only stored once.

pedrocr 12 years ago | | |

According to the article both str and str2 will point to the same char[] on the heap, but they are represented by two different RString objects. As you said, when you want to access str and str2 you need to look them up in a table. So why not have both entries in the table point to the same RString, instead of pointing to two different RStrings that point to the same char[]?

pothibo 12 years ago | |

I haven't checked the code so I may be wrong but it's possible it's for multi-threading reasons.

microtonal 12 years ago | | |

MRI has a global interpreter lock, so that does not make much sense.

In fact, the diagram is simply wrong. This was rectified by the author in an article two weeks later:

http://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-s...

grosbisou 12 years ago |

Extremely interesting. But I cannot quite understand why RSTRING_EMBED_LEN_MAX is calculated that way.

VALUE seems to be unsigned int defined via "typedef uintptr_t VALUE;" and "typedef unsigned __int64 uintptr_t;"

But I don't get why it is calculated like that. Can anyone explain?

Sharlin 12 years ago | |

The small string buffer should be the same size as the "heap" struct so as not to waste memory -- remember, they share the memory as they're members of a union. The heap struct contains three members which, taking into account alignment restrictions, usually add up to three times the machine word size (which is basically what sizeof(uintptr_t) is). The "-1" is because C strings are null-terminated, so the maximum length is one less than the size of the buffer.

What I don't know is why they don't simply use sizeof(heap) as the buffer size.
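A quick way to compare the two approaches is to compute both. The sketch below mirrors only the `heap` member from the declaration quoted elsewhere in this thread; `struct heap_rep`, `EMBED_LEN_FROM_SIZEOF` and the two functions are hypothetical names, and the `VALUE` typedef follows the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t VALUE;          /* as quoted in the thread */

/* Mirror of the `heap` member only; RBasic is left out of this sketch. */
struct heap_rep {
    long  len;
    char *ptr;
    union {
        long  capa;
        VALUE shared;
    } aux;
};

/* A struct is never smaller than the sum of its members. */
static_assert(sizeof(struct heap_rep) >=
              sizeof(long) + sizeof(char *) + sizeof(VALUE),
              "heap_rep holds all three fields");

/* The macro as quoted in the thread. */
#define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

/* Hypothetical alternative: derive the limit from the struct itself. */
#define EMBED_LEN_FROM_SIZEOF ((int)(sizeof(struct heap_rep) - 1))

int embed_len_by_formula(void) { return RSTRING_EMBED_LEN_MAX; }
int embed_len_by_sizeof(void)  { return EMBED_LEN_FROM_SIZEOF; }
```

On a typical 64-bit Linux build I would expect both functions to return 23, since the three fields are each 8 bytes. The sizeof version would also pick up any padding the compiler inserts, which the hand-written formula cannot see.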

grosbisou 12 years ago | | |

Ah that was obvious. Thanks, very clear answer.

al2o3cr 12 years ago | |

It's using the storage in an RString struct that isn't otherwise occupied by the RBasic info:

https://github.com/ruby/ruby/blob/8f77cfb308061ff49de0a47e82...

Note the `as` union. The `heap` version has three VALUE-sized entries, so RSTRING_EMBED_LEN_MAX is calculated accordingly, with the -1 to account for the null terminator.

Dylan16807 12 years ago | |

Good question. In a really roundabout way it manages to be the same size as the alternative struct.

Edit: I missed that part of that was another union, removed what I said about it being off on 32 bit.

I still don't understand why they go so roundabout by dividing by one and casting to int...

Sharlin 12 years ago | | |

Actually in C and C++, longs are 32 bit on most 32-bit platforms. If you need a 64-bit integer type, you need either "long long" or some implementation-specific equivalent.

gesman 12 years ago |

I wonder why they didn't make cut-off optimization points at 33?

When programmers don't know in advance how long name/email/input/whatever field is going to be - they just use the magic "power of two" length :)

So 32 (or 33) in this case would be more reasonable.

gliese1337 12 years ago | |

Because of this line:

    #define RSTRING_EMBED_LEN_MAX ((int)((sizeof(VALUE)*3)/sizeof(char)-1))

23 wasn't chosen, it was calculated to be the size that would be required for a struct describing a heap string, and will actually be a different number for different architectures. Choosing to make it bigger would add unnecessary overhead to the RString struct.
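To see how the number moves with the architecture, the same arithmetic can be written with the word size as a parameter. `embed_len_max_for` is a hypothetical helper, not something in the Ruby source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the macro's arithmetic with sizeof(VALUE) passed in.
   The division by sizeof(char) is dropped because it is always 1. */
int embed_len_max_for(size_t value_size) {
    assert(value_size > 0);
    return (int)(value_size * 3 - 1);
}
```

With a 4-byte VALUE this gives 11, and with an 8-byte VALUE it gives the 23 from the article's title.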

njharman 12 years ago | |

> When programmers don't know

They did know. And there are many more cutoffs than powers of two, depending on storage backend.

badman_ting 12 years ago |

Reminds me of this Mr Show sketch :) https://www.youtube.com/watch?v=RkP_OGDCLY0

throwaway0094 12 years ago |

Is Ruby's internal encoding UTF-8, then?

sluukkonen 12 years ago | |

Each String in Ruby has its own encoding. But by default, it is UTF-8 these days.

jokoon 12 years ago |

"never use ruby" works well for me

ctrager 12 years ago | |

The designer of a string class in any language, C++ and Java included, has to deal with the same issue: heap allocations are slower than stack allocations. But to do a stack allocation means reserving some fixed-length memory, which is a waste if you have a lot of small strings. It's a tradeoff. The Ruby approach is reasonable. I think in Microsoft's C++ STL library, the limit is 16 rather than 23. Even with the low-level closer-to-the-metal power of C/C++, the string class designer still has to make a decision about the tradeoff.

jokoon 12 years ago | | |

Strings are overrated, they should never be used until you really need them.

drakaal 12 years ago |

Who needs more than 23?

drakaal 12 years ago | |

This comment is also 23

corresation 12 years ago |

This all sounds rather terrible for Ruby, doesn't it? It isn't so much that the short string is faster (though I'm left unclear whether it itself is on the stack/heap, though given the GC nature of Ruby and practical considerations of the language, it must be the heap), but rather that the cost of the short string is also added to the long string in the heap (assumed) allocation of the RString (which becomes larger and thus more difficult to malloc).

If this is intended to sit on the stack, then maybe. But I find that highly unlikely, especially given the timings, which look like the delta between one malloc and two and would differ much more if it were a stack allocation versus a heap allocation. This is not comparable to small string optimizations for the stack in C++. Otherwise it seems like a poorly considered hack.

The string type could as easily have been dynamically allocated based upon the length of the string, where the ptr by default points inside that same allocated block. If the string is expanded it can then be realloced and the string alloced somewhere else. No waste, a single allocation, etc.
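Roughly what I have in mind, as a minimal C sketch. All names here (`OneAllocString`, `oas_new`, `oas_append`, `oas_free`) are hypothetical, and the growth policy is one assumed reading of "realloced": the header stays where it is so existing references remain valid, and only the bytes move to a separate block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Header and characters come from one malloc; `ptr` starts out pointing
   at the bytes that follow the header in the same block. */
typedef struct {
    size_t len;
    size_t cap;
    char  *ptr;
    char   inline_bytes[];   /* flexible array member, sized at allocation */
} OneAllocString;

/* One allocation, sized to fit the text exactly. Returns NULL on failure. */
OneAllocString *oas_new(const char *text) {
    assert(text != NULL);
    size_t len = strlen(text);
    OneAllocString *s = malloc(sizeof *s + len + 1);
    if (s == NULL) return NULL;
    s->len = len;
    s->cap = len;
    s->ptr = s->inline_bytes;
    memcpy(s->ptr, text, len + 1);
    return s;
}

/* Appends `more`. If the string outgrows its block, the bytes move to a
   separate heap buffer. Returns 0 on success, -1 on allocation failure. */
int oas_append(OneAllocString *s, const char *more) {
    size_t extra = strlen(more);
    size_t new_len = s->len + extra;
    if (new_len > s->cap) {
        char *grown;
        if (s->ptr == s->inline_bytes) {
            grown = malloc(new_len + 1);
            if (grown == NULL) return -1;
            memcpy(grown, s->ptr, s->len + 1);
        } else {
            grown = realloc(s->ptr, new_len + 1);
            if (grown == NULL) return -1;
        }
        s->ptr = grown;
        s->cap = new_len;
    }
    memcpy(s->ptr + s->len, more, extra + 1);
    s->len = new_len;
    return 0;
}

/* Frees the separate buffer (if any) and then the header block. */
void oas_free(OneAllocString *s) {
    if (s == NULL) return;
    if (s->ptr != s->inline_bytes) free(s->ptr);
    free(s);
}
```

One caveat with this reading: once a string has grown, the original inline bytes sit unused until the header is freed, so it is not entirely waste-free for strings that are expanded.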