String tokenization in C

String tokenization in C (onebyezero.blogspot.com)

155 points by throwaway2419 7 years ago | 114 comments

kazinator 7 years ago |

The actions of strtok can easily be coded using strspn and strcspn.

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]

https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]

strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.

The bag is like a one-character regex class; that is to say, strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]*, and in the case of strcspn, that becomes [^abcd]*.
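For illustration, here is a minimal sketch of a non-destructive tokenizer built from those two calls. The name next_token and its signature are assumptions made for this example, not taken from the linked posts.

```c
#include <stddef.h>
#include <string.h>

/* Finds the next token in s without modifying it.
 * On success, stores the token length in *len and returns a pointer to the
 * token's first character; returns NULL when no tokens remain.
 * next_token is a hypothetical name, not a standard function. */
const char *next_token(const char *s, const char *delims, size_t *len)
{
    s += strspn(s, delims);       /* skip leading delimiters: [delims]*  */
    if (*s == '\0')
        return NULL;
    *len = strcspn(s, delims);    /* measure the token:       [^delims]* */
    return s;
}
```

A caller would loop with something like `for (t = next_token(in, ", ", &n); t; t = next_token(t + n, ", ", &n))` and print each token with `printf("%.*s\n", (int)n, t)`. Because nothing is written to the input, a const char * works and there is no hidden state.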

saagarjha 7 years ago | |

And it’s nicer, since you can pass in a const char * and use it in concurrent code.

jstimpfle 7 years ago |

strtok is one of the silliest parts of the standard library. (And there are many bad ones). It's broken. It's not thread safe (yes there is strtok_r). It's needlessly hard to use. And it writes zeros to the input array. The latter means it's unfit for most use cases, including non-trivial tokenization where you want e.g. to split "a+1" into three tokens.

If you program in C please just write those four obvious lines yourself.

yason 7 years ago | |

> If you program in C please just write those four obvious lines yourself.

Those lines are not necessarily obvious: there are several pitfalls to avoid, and for that reason strtok() is much longer than four lines. As standard library functions go, strtok() has well-defined behaviour that is easy to reason about, and it comes near-magically close to the string-splitting convenience of scripting languages.

In contrast, an example of a truly sickening part of stdlib is converting strings to numbers. The atoi()/atol() family doesn't check for errors at all so you want to use strtol(). But the way error checking works in strtol() is so complex that the man page has a specific example of how to do it correctly. All sane programmers quickly write a clean wrapper around strtol() to encode the complexity once. Now, strtok() is nothing like that.
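A wrapper of that kind might look roughly like the sketch below. The name parse_long and the 0/-1 return convention are assumptions for this example, and it deliberately rejects trailing characters, which not every caller wants.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses the whole of s as a base-10 long.
 * Returns 0 and stores the value in *out on success, -1 otherwise.
 * parse_long is a hypothetical helper name. */
int parse_long(const char *s, long *out)
{
    char *end;
    long v;

    errno = 0;                 /* strtol only sets errno on failure */
    v = strtol(s, &end, 10);
    if (end == s)              /* no digits were consumed */
        return -1;
    if (*end != '\0')          /* trailing characters after the number */
        return -1;
    if (errno == ERANGE)       /* value out of range for long */
        return -1;
    *out = v;
    return 0;
}
```

Note that strtol itself skips leading whitespace, so " 42" is still accepted by this sketch.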

In its simplicity, strtok() is quite versatile. A few strtok() calls can easily parse lines like:

    keyword=value1, value2, value3

that you might find in configuration files. And I mean truly in just a few lines which you might expect in Python but with C string handling? No.

jstimpfle 7 years ago | | |

Here is the musl implementation.

https://github.com/esmil/musl/blob/master/src/string/strtok....

It's a bit longer than 4 lines because strtok does things you should not want. If you insist on parsing that configuration line with strtok, go ahead and write that brittle code. It breaks as soon as you want empty strings (try "keyword=value1, , value3" with strtok) or escape sequences or other transformations, or as soon as you want to do something as basic as parsing from a stream instead of a string that is completely in memory.
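To make the empty-string point concrete, here is a small sketch that counts fields two ways on a simplified input such as "a,,b". Both function names are made up for this example. The strtok version treats a run of delimiters as one separator, while the strcspn version treats every delimiter as a field boundary.

```c
#include <stddef.h>
#include <string.h>

/* Counts the fields strtok() reports; buf is modified in place. */
size_t count_fields_strtok(char *buf, const char *delims)
{
    size_t n = 0;
    for (char *t = strtok(buf, delims); t != NULL; t = strtok(NULL, delims))
        n++;
    return n;
}

/* Counts fields when every delimiter separates two fields, so that
 * empty fields are counted too. The input is not modified. */
size_t count_fields_keep_empty(const char *s, const char *delims)
{
    size_t n = 1;
    for (s += strcspn(s, delims); *s != '\0'; s += strcspn(s, delims)) {
        s++;     /* step over exactly one delimiter */
        n++;
    }
    return n;
}
```

For "a,,b" with "," as the delimiter, the first function reports 2 fields and the second reports 3.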

So to clarify, of course you are never done with parsing in 4 lines. But even if it wasn't as braindead to overwrite the input string, the functionality strtok provides would not be worth more than 4 lines.

nly 7 years ago | | |

Unless you're using a pre-specified configuration file format (e.g. TOML), then parsing configuration files requires a general parsing library. This is a non-trivial task requiring a real parser operating over a well-specified grammar. A tokenization pipeline just won't cut it.

I worked on a project a few years ago that read its custom-format config file in line by line, chopped everything off each line following the first '#' character (to support comments), and then trimmed the whitespace. This sounds like a reasonable and elegant approach until you consider that now none of your user controlled fields (via a GUI in our case) can contain the '#' character. This affected customers, but nobody ever fixed it.

With the tools and languages out there now, there's just no excuse for this crap.

Someone 7 years ago | | |

> A few strtok() calls can easily parse lines like:

     keyword=value1, value2, value3

The challenge with parsing isn’t parsing correct inputs; it’s generating useful error messages and recovering on incorrect inputs such as

     keyword=,,value1, value2, value3,,,,

or even

     =keyword=,,value1, value2, value3,,,,

strtok isn’t the best tool for doing that.

(Yes, those could be valid inputs, but if they are, chances are they should be parsed differently)

simias 7 years ago | |

I use strtok_r from time to time, it does the job if you have a mutable input. Of course having to write zeroes is a bit cumbersome but it's one of the drawbacks of C-style strings.

The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it. Splitting, concatenating, regex-ing... All of that is a huge pain in C. If you need to write a high-performance parser then it might be worth it but if you're just parsing a fancy command line format and performance doesn't matter it's just incredibly frustrating and error prone.

Rust fares better here because its str type is not NUL-terminated but actually keeps track of the length separately which makes it significantly more flexible and versatile. Of course you could do that in C but you'll be incompatible with any code dealing with native C-strings.

And of course you make one mistake and you have a buffer overflow vulnerability...

So yeah, if you program in C please use strtok_r if applicable; otherwise consider offloading the parsing to another part of your application written in a language better suited for that, and hand over binary data to the C library. If everything else fails then consider handwriting your parser and may god be with you. Oh and if your grammar is complex enough to warrant it, there's always lex/yacc.

Matthias247 7 years ago | | |

> The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it.

It is. And it’s not even only C's fault. 80% of it is bad API design. Strings could be accepted as a struct consisting of a pointer and a length, aka string_view. And there could be some manipulation functions around it. That would make those APIs a lot more flexible (one no longer needs to care whether things are null terminated and there would be fewer pointless copies).
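As a rough sketch of what such a pointer-plus-length API could look like in plain C: the struct and function names below are invented for this example and do not come from any existing library.

```c
#include <stddef.h>
#include <string.h>

/* A non-owning view of a byte range; the names here are hypothetical. */
struct str_view {
    const char *ptr;
    size_t      len;
};

/* Splits off the part of *rest before the first occurrence of sep and
 * advances *rest past the separator. Returns 0 once *rest is exhausted.
 * Nothing is copied and the underlying bytes are never modified. */
int sv_next_field(struct str_view *rest, char sep, struct str_view *field)
{
    const char *hit;

    if (rest->ptr == NULL)
        return 0;
    hit = memchr(rest->ptr, sep, rest->len);
    field->ptr = rest->ptr;
    if (hit != NULL) {
        field->len = (size_t)(hit - rest->ptr);
        rest->len -= field->len + 1;
        rest->ptr = hit + 1;
    } else {
        field->len = rest->len;
        rest->ptr = NULL;      /* marks the view as exhausted */
        rest->len = 0;
    }
    return 1;
}
```

Since a field is just a pointer and a length into the original buffer, no terminator has to be written and empty fields come out naturally as views of length 0.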

For these reasons my estimate in the meantime is that the average C program which uses stdlib functions is less efficient than an implementation in another language, even though the authors would claim otherwise (it's C, it must be fast).

WalterBright 7 years ago | | |

One of the big performance wins of the D programming language over C is that arrays are length terminated instead of 0 terminated, so you can "slice" strings to get substrings, rather than allocate/copy/zero (and then get the free in the right place!).

paavoova 7 years ago | | |

> but you'll be incompatible with any code dealing with native C-strings

Not entirely, see https://github.com/antirez/sds

Basically, you have a header storing length, etc, but still null terminate, so library functions like strlen are none the wiser.
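The general idea can be sketched as follows. To be clear, this is not the actual sds layout or API; the lstr_* names and the single size_t header are assumptions chosen to keep the example short.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: a length header placed directly in front of the
 * character data. This is NOT the actual sds structure or API. */
struct lstr_hdr {
    size_t len;
    char   data[];      /* flexible array member; holds the bytes + NUL */
};

/* Allocates a copy of init and returns a pointer to its characters,
 * or NULL on allocation failure. */
char *lstr_new(const char *init)
{
    size_t len = strlen(init);
    struct lstr_hdr *h = malloc(sizeof *h + len + 1);

    if (h == NULL)
        return NULL;
    h->len = len;
    memcpy(h->data, init, len + 1);     /* copy including the NUL */
    return h->data;
}

/* Steps back from the character pointer to the header. */
static struct lstr_hdr *lstr_hdr_of(const char *s)
{
    return (struct lstr_hdr *)(s - offsetof(struct lstr_hdr, data));
}

size_t lstr_len(const char *s) { return lstr_hdr_of(s)->len; }   /* O(1) */
void   lstr_free(char *s)      { free(lstr_hdr_of(s)); }
```

The pointer handed to callers points at ordinary NUL-terminated characters, so it can be passed to strlen, printf and friends, while lstr_len reads the stored length without scanning.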

nly 7 years ago | |

Much of libc is terrible from an API design perspective, even given the limitations of C as a language.

libc has somehow managed to hit the sweet spot and have APIs that are both inconvenient to use properly, and perform poorly.

agumonkey 7 years ago | | |

Has any attempt been made to build a nicer foundational C/OS library since?

nwmcsween 7 years ago | | |

libc (aka ISO, POSIX, GNU) wasn't designed; it more or less evolved from what was in use.

tails4e 7 years ago | |

strncat is another good example. It's a buffer overflow waiting to happen, and worse it's the 'safe' version of strcat. The trick is the parameter you pass is not the length of the destination buffer, but the remaining length of the destination buffer. Most people really want the behaviour of strlcat.
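A minimal sketch of the "remaining length" arithmetic, wrapped in a helper whose name (append_bounded) is made up for this example:

```c
#include <string.h>

/* Appends src to the string in dst, truncating if needed.
 * dst_size is the total size of the dst array, and dst must already hold a
 * NUL-terminated string. append_bounded is a hypothetical helper name. */
void append_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t used = strlen(dst);

    /* strncat's third argument is the maximum number of characters to
     * append, NOT the size of dst; it then writes a terminating NUL on top
     * of that, hence the "- 1". */
    if (used + 1 < dst_size)
        strncat(dst, src, dst_size - used - 1);
}
```

Passing sizeof dst directly as the third argument of strncat is the classic mistake: with a partly filled buffer it allows writes past the end.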

rurban 7 years ago | | |

Nope, the safe versions of strcat are strcat_s and strncat_s, with bounds check and ensured NUL termination. strlcat does no bounds check.

loeg 7 years ago | | |

glibc's continued resistance to adding strlcat and strlcpy is a travesty. :-(

jstimpfle 7 years ago | | |

I'd say just use memcpy.

hermitdev 7 years ago | |

There are valuable use cases where this matters. For example, parsing FIX messages in finance, this allows you to parse the tag/value pairs with no memory allocations, which matters in low latency HFT applications.

quotemstr 7 years ago | | |

If you want speed, just build a lexer with ragel [1]. It's hard to go faster than a DFA.

[1] http://www.colm.net/open-source/ragel/

Gibbon1 7 years ago | |

> And it writes zeros to the input array.

If you pass strtok_r a const string it can and will bus fault on some systems. This happens when it tries to write a '\0' to the input string. Being an old crusty firmware guy I'm not sailing on the good cargo cult ship HMS Immutability, but generating side effects in your input data stream is terrible.

There is no way to back up/undo when using strtok_r. When your parsing involves a decision tree that kinda sucks.

avar 7 years ago | |

> it writes zeros to the input array [which] means it's unfit for most use cases[...]

Other issues with strtok() aside, this seems like a silly reason to discount a standard library function. If you don't want your input munged you can strdup() it. It's rare to find a C program that's so specialized that the performance hit of a strdup() would be unacceptable in a case where strtok() could otherwise have been used.

jstimpfle 7 years ago | | |

It would be wasteful to duplicate the input string which includes so much garbage. I would rather just go through the input string and append token by token to the output string, terminated with a single NUL.

Hypergraphe 7 years ago | |

Agreed, I basically avoid using strtok because of that. Why would you write zeros in my input...

fredmorcos 7 years ago | |

strtok is there for things like and similar to /etc/hosts and /etc/fstab

ygra 7 years ago | | |

Or have those file formats been designed around having to parse them in a C program?

fanf2 7 years ago | |

Use strcspn() instead

jstimpfle 7 years ago | | |

    Token tok;
    start_token(&tok);
    for (;;) {
        int c = look_next_char();
        if (('A' <= c && c <= 'Z') ||
            ('a' <= c && c <= 'z')) {  /* or whatever test */
            consume_char();
            add_to_token(&tok, c);
        } else {
            break;
        }
    }
    end_token(&tok);

Done. There's no point in going through a weird API.

stochastic_monk 7 years ago |

I recommend ksplit/ksplit_core from Heng Li’s excellent klib kstring.{h,c}[0]. It modifies the string in-place, adding null terminators, and provides a list of offsets into the string. This gives you the flexibility of accessing tokens by index without paying costs of copying or memory allocation.

[0] https://github.com/attractivechaos/klib

lixtra 7 years ago |

I have an obsession with unsafe example code:

  strcpy(str,"abc,def,ghi");
  token = strtok(str,",");
  printf("%s \n",token);

Even if the author knows how many tokens are returned I would prefer a check for NULL here since a good fraction might not read further than this bad example.
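Something along these lines would do, as a sketch; the function name print_first_token and its return convention are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Prints the first comma-separated token of str, or a notice if there is
 * none. Returns 1 if a token was found, 0 otherwise. str is modified. */
int print_first_token(char *str)
{
    char *token = strtok(str, ",");

    if (token == NULL) {          /* empty input or delimiters only */
        puts("(no tokens)");
        return 0;
    }
    printf("%s\n", token);
    return 1;
}
```

The check costs three lines and avoids passing NULL to printf's %s, which is undefined behaviour.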

enriquto 7 years ago | |

> I have an obsession with unsafe example code:

It is perfectly OK for example code to be unsafe. You do not wear a parachute when you learn to fly using a simulator. You realize that things will become more serious and complicated in the future, but you have to start with something simple and unsafe, no big deal. Otherwise you will never see the consequences of unsafe code in simple cases.

bqe 7 years ago | | |

I think you underestimate how many people blindly copy examples without understanding them. Safe example code results in more correct programs.

jfries 7 years ago |

Well, yes, using strtok works if the data happens to be structured in a certain simple way. Very often you want to do something more advanced though, and using regex for matching tokens is then necessary.

graycat 7 years ago |

A lot of experience shows that the string tokenization in Open Object Rexx is darned useful. E.g., for many years, IBM's internal computing was from about 3600 mainframe computers around the world running VM/CMS with a lot of service machines written in Rexx. Rexx is no toy but a powerful, polished, scripting language and really good at handling strings.

A little example of some Rexx code with some string parsing is in

https://news.ycombinator.com/item?id=18648999

pasokan 7 years ago |

It used to be that gcc would warn against strtok and recommend strsep instead. I do not know what the status is today.

tinus_hn 7 years ago | |

Strtok is not thread safe and can’t be made thread safe without changing the API. You should not use it.

morbusfonticuli 7 years ago | | |

> Strtok is not thread safe and can’t be made thread safe without changing the API. You should not use it.

Well, there is already a thread-safe variant [0]:

> The strtok() function uses a static buffer while parsing, so it's not thread safe. Use strtok_r() if this matters to you.

[0] https://linux.die.net/man/3/strtok_r

heinrichhartman 7 years ago | | |

Well, strtok could use thread local variables to store intermediate state, to make it threadsafe while maintaining the same API. Not saying this is a good idea, but technically it would work, no?
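A sketch of that idea, assuming a C11 compiler for _Thread_local. The name tl_strtok is made up, and this is not how any particular libc implements strtok.

```c
#include <stddef.h>
#include <string.h>

/* strtok-compatible interface whose saved position is per thread.
 * tl_strtok is a hypothetical name; _Thread_local requires C11. */
char *tl_strtok(char *str, const char *delims)
{
    static _Thread_local char *saved;   /* one copy per thread */
    char *tok;

    if (str == NULL)
        str = saved;                    /* continue the previous scan */
    if (str == NULL)
        return NULL;
    str += strspn(str, delims);         /* skip leading delimiters */
    if (*str == '\0') {
        saved = NULL;
        return NULL;
    }
    tok = str;
    str += strcspn(str, delims);        /* find the end of the token */
    if (*str != '\0')
        *str++ = '\0';                  /* terminate it in place */
    saved = str;
    return tok;
}
```

This keeps threads from trampling each other's state, but it is still not reentrant within one thread: tokenizing a second string inside the loop over the first one loses the outer position, which is the case strtok_r handles.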

zakk 7 years ago | | |

That’s not a good reason not to use it.

A function can be not thread-safe and still safe to use in single-threaded programs.

The point is that strtok is not a good choice even for single-threaded code.

caf 7 years ago |

Note though that strsep() is not as portable, because it is an extension to standard C.

tptacek 7 years ago | |

It's a tiny function, written in ANSI C, so if you're really concerned about this, just include it with your program. It's an extension to the standard C library, not to C itself.
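For reference, a portable version fits in about a dozen lines. This is a sketch written for this thread, not copied from any BSD or glibc source, and it uses the made-up name my_strsep so it cannot collide with a system declaration.

```c
#include <stddef.h>
#include <string.h>

/* Behaves like BSD strsep(); named my_strsep (a hypothetical name) so it
 * cannot clash with a declaration the platform's <string.h> may provide. */
char *my_strsep(char **stringp, const char *delims)
{
    char *start = *stringp;
    char *end;

    if (start == NULL)
        return NULL;
    end = start + strcspn(start, delims);
    if (*end != '\0') {
        *end = '\0';            /* terminate this field */
        *stringp = end + 1;     /* next call resumes after the delimiter */
    } else {
        *stringp = NULL;        /* that was the last field */
    }
    return start;
}
```

Unlike strtok, this returns empty fields: splitting "a,,b" on "," yields "a", "" and "b".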

beefhash 7 years ago | | |

Except then you have the issue about compilers complaining about double-declarations of the function, meaning you'll either have a lot of warning spam on every #include or now hard require some kind of header defines for HAVE_STRSEP. Once you go that way, there's no going back and it's only gonna become more and more.

beefhash 7 years ago | |

In fact, it's not even in POSIX.

teddyh 7 years ago | | |

To quote the GNU C library manual: “This function was introduced in 4.3BSD and therefore is widely available.”¹

1. https://www.gnu.org/software/libc/manual/html_node/Finding-T...

satyenr 7 years ago |

> Next, strtok is not thread-safe. That's because it uses a static buffer internally. So, you should take care that only one thread in your program calls strtok at a time.

I wonder why strtok() does not use an output parameter similar to scanf() — and return the number of tokens. Something like:

  int strtok(char *str, char *delim, char **tokens);

Granted, it would involve dynamic memory allocation and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?
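One way to sketch such an interface without dynamic allocation is to let the caller supply the token array. The name split_tokens and the extra max_tokens parameter are assumptions for this example, not part of the proposal above.

```c
#include <stddef.h>
#include <string.h>

/* Splits str in place and stores up to max_tokens token pointers in tokens.
 * Returns the number of tokens stored. No hidden state is kept between
 * calls. split_tokens is a hypothetical name and signature. */
size_t split_tokens(char *str, const char *delims,
                    char **tokens, size_t max_tokens)
{
    size_t n = 0;

    while (n < max_tokens) {
        str += strspn(str, delims);     /* skip delimiters */
        if (*str == '\0')
            break;
        tokens[n++] = str;
        str += strcspn(str, delims);    /* find the end of the token */
        if (*str == '\0')
            break;
        *str++ = '\0';                  /* terminate the token in place */
    }
    return n;
}
```

It still writes NULs into the input, like strtok, but all state lives in the arguments, so it is safe to call from several threads on different buffers.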

Does anyone here have the historical perspective?

megous 7 years ago |

Another approach, besides library calls and flex, is re2c. It preprocesses the source code and inlines regular expression parsing where you need it. It's very powerful in combination with goto.

saagarjha 7 years ago |

  str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));

  strcpy(str,TESTSTRING);

str = strdup(TESTSTRING)?

rurban 7 years ago |

AFAIK strtok has restrict on both args since C99. And the safe variants strtok_s and esp. wcstok_s are missing. Strings are unicode nowadays, not ASCII.

https://en.cppreference.com/w/c/string/byte/strtok

bsenftner 7 years ago |

...And then the application is required to implement variable length characters, a la Unicode, and you start your strings logic all over...

syrrim 7 years ago | |

As long as you're fine with ascii delimiters, strtok et al. work fine for utf-8 strings.
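The reason this works is that every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never compare equal to an ASCII delimiter. A small sketch, with both function names made up for this example:

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if every byte of delims is ASCII (below 0x80), 0 otherwise.
 * Only then is byte-wise splitting of UTF-8 text safe. */
int delims_are_ascii(const char *delims)
{
    for (; *delims != '\0'; delims++)
        if ((unsigned char)*delims >= 0x80)
            return 0;
    return 1;
}

/* Returns the length in bytes (not characters) of the first field of a
 * UTF-8 string whose fields are separated by ASCII delimiters. */
size_t first_field_bytes(const char *utf8, const char *delims)
{
    return strcspn(utf8, delims);
}
```

The returned length counts bytes, so a field containing one two-byte character is one longer than its visible character count.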

bsenftner 7 years ago | | |

Would you happen to be aware of good Unicode normalization function/lib in C/C++?

the_clarence 7 years ago |

Problem is that your token string is going to be quite large. Is there a built-in solution for when tokens are just single chars?

setquk 7 years ago |

I just use flex. You don’t have to ship flex as a dependency either.

alexandernst 7 years ago |

How about just using a properly suited language for string manipulation?

    start = p;
    while (p < eof && isspace(*p))  // [ ]*  (check bounds before reading *p)
        ++p;
    if (p == eof)
        return EOF;
    if (is_ident_start(*p)) {       // [a-z]
        ++p;
        while (is_ident(*p))        // [a-z0-9]*
            ++p;
        set_token(p, p - start);
        return IDENT;
    } else if (is_number(*p)) {     // [0-9]
        ++p;
        while (is_number(*p))       // [0-9]*
            ++p;
        set_token(p, p - start);
        return NUMBER;
    }
    // etc.

    %{
    #include "y.tab.h"
    int num_lines = 1;
    int comment_mode=0;
    int stack =0;
    %}
    digit ([0-9])
    integer ({digit}+)
    float_num ({digit}+\.{digit}+)
    %%
    {integer} { //deal with integer
        printf("#%d: NUM:",num_lines); ECHO;printf("\n");
        yylval.Integer = atoi(yytext);
        return INT;
    }
    {float_num} {// deal with float
        printf("#%d: NUM:",num_lines);ECHO;printf("\n");
        yylval.Float = atof(yytext);
        return FLOAT;
    }
    \n { ++num_lines; }
    . if(strcmp(yytext," "))ECHO;
    %%
    int yywrap() { return 1; }