String tokenization in C(onebyezero.blogspot.com) |
String tokenization in C(onebyezero.blogspot.com) |
https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2001]
https://groups.google.com/forum/message/raw?msg=comp.lang.c/... [2011 repost]
strspn(s, bag) calculates the length of the prefix of string s which consists only of the characters in string bag. strcspn(s, bag) calculates the length of the prefix of s consisting of characters not in bag.
The bag is like a one-character regex class; so that is to say strspn(s, "abcd") is like calculating the length of the token at the front of input s matching the regex [abcd]* , and in the case of strcspn, that becomes [^abcd]* .
If you program in C please just write those four obvious lines yourself.
Those are not necessarily obvious lines, there are several pitfalls to avoid, and for that reason strtok() is much longer than four lines. When it comes to the standard library functions strtok() has well-defined behaviour that is easy to reason with and near-magically approaches the string-splitting convenience close to scripting languages.
In contrast, an example of truly sickening part of stdlib is converting strings to number. The atoi()/atol() family doesn't check for errors at all so you want to use strtol(). But the way error checking works in strtol() is so complex that the man page has a specific example of how to do it correctly. All sane programmers quickly write a clean wrapper around strtol() to encode the complexity once. Now, strtok() is nothing like that.
In its simplicity, strtok() is quite versatile. A few strtok() calls can easily parse lines like:
keyword=value1, value2, value3
that you might find in configuration files. And I mean truly in just a few lines which you might expect in Python but with C string handling? No.> https://github.com/esmil/musl/blob/master/src/string/strtok....
It's a bit longer than 4 lines because strtok does things you should not want. If you insist on parsing that configuration line with strtok, go ahead and write that brittle code. It breaks as soon as you want empty strings (try "keyword=value1, , value3" with strtok) or escape sequences or other transformations, or as soon as you want to do something as basic as parsing from a stream instead of a string that is completely in memory.
So to clarify, of course you are never done with parsing in 4 lines. But even if it wasn't as braindead to overwrite the input string, the functionality strtok provides would not be worth more than 4 lines.
I worked on a project a few years ago that read its custom-format config file in line by line, chopped everything off each line following the first '#' character (to support comments), and then trimmed the whitespace. This sounds like a reasonable and elegant approach until you consider that now none of your user controlled fields (via a GUI in our case) can contain the '#' character. This effected customers, but nobody ever fixed it.
With the tools and languages out there now, there's just no excuse for this crap.
keyword=value1, value2, value3
The challenge with parsing isn’t parsing correct inputs; it’s generating useful error messages and recovering on incorrect inputs such as keyword=,,value1, value2, value3,,,,
or even =keyword=,,value1, value2, value3,,,,
strtok isn’t the best tool for doing that.(Yes, those could be valid inputs, but if they are, chances are they should be parsed differently)
The plain truth is that string handling in C is a huge pain in the ass no matter how you look at it. Splitting, concatenating, regex-ing... All of that is a huge pain in C. If you need to write a high-performance parser then it might be worth it but if you're just parsing a fancy command line format and performances don't matter it's just incredibly frustrating and error prone.
Rust fares better here because its str type is not NUL-terminated but actually keeps track of the length separately which makes it significantly more flexible and versatile. Of course you could do that in C but you'll be incompatible with any code dealing with native C-strings.
And of course you make one mistake and you have a buffer overflow vulnerability...
So yeah, if you program in C please use strtok_r if applicable, otherwise considering offloading the parsing to an other part of your application written in a language better suited for that and hand over binary data to the C library. If everything else fails then consider handwriting your parser and may god be with you. Oh and if your grammar is complex enough to warrant it, there's always lex/yacc.
It is. And it’s not even only Cs fault. 80% of it is bad API Design. Strings could be accepted as a struct consisting of a pointer and a length, aka string_view. And there could be some manipulation functions around it. That would make those APIs a lot more flexible (one no longer needs to care whether things are null terminated and there would be less pointless copies).
For these reasons my estimate in the meantime is that the average C program which uses stdlib functions is less efficient than an implementation in another language, even though the authors would claim otherwise (its C, it must be fast).
Not entirely, see https://github.com/antirez/sds
Basically, you have a header storing length, etc, but still null terminate, so library functions like strlen are none the wiser.
libc has somehow managed to hit the sweet spot and have APIs that are both inconvenient to use properly, and perform poorly.
If you pass strtok_r a const string it can and will bus fault in some systems. This happens when it tries to write a /0 to the input string. Being an old crusty firmware guy I'm not sailing on good cargo cult ship HMS Immutability, but generating side effects in your input data stream is terrible.
There is no way to back up/undo when using strtok_r. When your parsing involves a decision tree that kinda sucks.
Other issues with strtok() aside, this seems like a silly reason to discount a standard library function. If you don't want your input munged you can strdup() it. It's rare to find a C program that's so specialized that the performance hit of a strdup() would be unacceptable in a case where strtok() could otherwise have been used.
Token tok;
start_token(&tok);
for (;;) {
int c = look_next_char();
if (('A' <= c && c <= 'Z') ||
('a' <= c && c <= 'z')) { /* or whatever test */
consume_char();
add_to_token(tok, c);
} else {
break;
}
}
end_token(tok);
Done. There's no point in going through a weird API. strcpy(str,"abc,def,ghi");
token = strtok(str,",");
printf("%s \n",token);
Even if the author knows how many tokens are returned I would prefer a check for NULL here since a good fraction might not read further than this bad example.It is perfectly OK for example code to be unsafe. You do not wear a parachute when you learn to fly using a simulator. You realize that things will become more serious and complicated in the future, but you have to start with something simple and unsafe, no big deal. Otherwise you will never see the consequences of unsafe code in simple cases.
A little example of some Rexx code with some string parsing is in
Well, there is already a thread-safe variant [0]: > The strtok() function uses a static buffer while parsing, so it's not thread safe. Use strtok_r() if this matters to you.
A function can be not thread-safe and still safe to use in single-threaded programs.
The point is that strtok is not a good choice even for single-threaded code.
1. https://www.gnu.org/software/libc/manual/html_node/Finding-T...
I wonder why strtok() does not use an output parameter similar to scanf() — and return the number of tokens. Something like:
int strtok(char *str, char *delim, char **tokens);
Granted, it would involve dynamic memory allocation and the implementation that immediately comes to mind would be less efficient than the current implementation, but surely it’s worth eliminating the kind of bugs the current strtok() can introduce?Does anyone here have the historical prospective?
str = (char *) malloc(sizeof(char) * (strlen(TESTSTRING)+1));
strcpy(str,TESTSTRING);
str = strdup(TESTSTRING)?Regular expressions don't show up outside the spec, sure; but if you're writing the code (for implicit state machine), you need to know exactly where you are in the regular language that defines the tokens to write good code. Writing a regex matcher in code like this is like writing code in assembly - mentally, you're mapping to a different set of concepts all the time.
Speak for yourself. There’s a POSIX standard for regex that is more than 30 years old & a GNU implementation that comes with gcc. C++ has regex in the standard library.
Typical pattern:
start = p;
while (isspace(*p) && p < eof) // [ ]*
++p;
if (p == eof) return EOF;
if (is_ident_start(*p)) { // [a-z]
++p;
while (is_ident(*p)) // [a-z0-9]*
++p;
set_token(p, p - start);
return IDENT;
} else if (is_number(*p)) { // [0-9]
++p;
while (is_number(*p)) // [0-9]*
++p;
set_token(p, p - start);
return NUMBER;
} // etc.
Corresponds to: IDENT ::= [a-z][a-z0-9]* ;
NUMBER ::= [0-9][0-9]* ;
SPACE ::= [ ]* ;
TOKEN ::= SPACE (IDENT | NUMBER) ;
Inline those nonterminals, and guess what - regular expression!Your mileage may vary but it's also a common issue having to filter out empty items out of something like ",foo,bar ,,,,baz,,xyzzy," where you only really want those four words. You will especially encounter this in parsing user input which might have extra whitespace, or badly formed lists of items.
Use strtok() to split C strings into tokens delimited by the given set of delimiters. If you want to catch each comma but skip over any whitespace and split at the first '=' only, then use something else.
Let's say if I'm going to need a configuration file for my program I'd most certainly start with something I can parse with strtok(). I would really need very specific needs to warrant a more complex format that would require a more sophisticated parser in which case using one would be a no-brainer.
Even if this is true, the reasoning here is disturbingly short-sighted. Copying code that you do not understand is unacceptable behavior, and I'd say the sooner it blows up in your face, the better. The goal of code examples is to illustrate how things work in a simplified way, and code without error checks is often easier to understand at first. Imagine a hello world with all the possible error checks. That would be incomprehensible.
If you have needs that require a general parsing library then why are you criticizing strtok()? It doesn't parse XML either, not C source code, nor any unspecified configuration file formats.
You know what else isn't in POSIX? All the rest of your C code.
static char *p;
if (!s && !(s = p)) return NULL;
s += strspn(s, sep);
if (!*s) return p = 0;
p = s + strcspn(s, sep);
if (*p) *p++ = 0;
else p = 0;
return s;
Instead of carrying that code, or something similar, with my source code or my own utility library I'd much rather have the already debugged version from the standard library.Overwriting the input in C is more efficient than maintaining more internal state and returning a pointer and the length of each token which you would need to strncpy() to get the token into a C string. strtok() does not want to do the initial strdup() for you because only you will know whether your input can already be mutated or whether you need to use a copy.
As I pointed in the other reply, strtok() does not break on strings like "keyword=value1,, , value3" unless you skipped RTFM and expect it to do something completely different. And more often than not that's exactly what you want when parsing non-computer readable input which you can expect to take a specific form.
If you want to handle escape sequences, parse from a stream (without having the option to fgets() the next line into memory), or parse CSV tables without collapsin colunms then you will want to use something more specific to that. Luckily, strtok() was not advertised as a Swiss army knife so it's off the hook for specific parsing purposes like those.
while (i < len && !is_token_begin(buf[i]))
i++;
if (i == len)
error("End of input\n");
start_token(tok);
while (i < len && is_token_char(buf[i])) {
i++;
add_to_token(tok, buf[i]);
}
end_token(tok);
or something along those lines. Whatever you need. It's not rocket science. Putting highly fluctuating and project-specific code like this in a library would only have disadvantages. Not everything should be in a library. In fact, most things should not be.The results are often worse than if they would have used a higher level language right from the start.
Generally though, C is just a terrible language to do anything other than write extremely low level routines in. You should probably just never use strtok() and go straight to something like Ragel or re2c for building high performance tokenizers... then call those from a higher level language.
Being able to do that is exactly the point why it's so much simpler to avoid a silly API such as strcspn (or, god forbid, strtok).
> non-ASCII
yeah i know... Do you prefer strcspn("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ")? Do you think it's faster?
If you're pedantic, you could lex (0x41 <= c && c <= 0x5A). That way at least you consistently read ASCII, even on non-ASCII implementations. But I don't care and it's less readable.
> I suppose you consider isalpha() to have a weird API?
Yes. I do not even understand what it does.
>> isalpha() checks for an alphabetic character; in the standard "C" locale, it is equivalent to (isupper(c) || islower(c)). In some locales, there may be additional characters for which isalpha() is true-letters which are neither upper case nor lower case.
Well in any case I'm sure that's not what I wanted... By the way locale is super hard to use as well. Locale is a process global property. I'm not aware of any way to pass explicit locales to library functions.
'A' v.s. 0x41 makes no difference for portability. The thing that's unportable about that is that it assumes that the characters A..Z are continuous in your character encoding, which isn't portable C.
Although admittedly having to deal with EBCDIC these days is rare in anything except highly portable programs like C compilers or popular script interpreters.
This is why ctype.h functions exist. Just use them.
And you cannot really do transformation instead of only matching with RE. Those are needed already in simple cases like string and number literals. Now your code will look more like
%{
#include "y.tab.h"
int num_lines = 1;
int comment_mode=0;
int stack =0;
%}
digit ([0-9])
integer ({digit}+)
float_num ({digit}+\.{digit}+)
%%
{integer} { //deal with integer
printf("#%d: NUM:",num_lines); ECHO;printf("\n");
yylval.Integer = atoi(yytext);
return INT;
}
{float_num} {// deal with float
printf("#%d: NUM:",num_lines);ECHO;printf("\n");
yylval.Float = atof(yytext);
return FLOAT;
}
\n { ++num_lines; }
. if(strcmp(yytext," "))ECHO;
%%
int yywrap() {
return 1;
}
(copied from stackoverflow). And that's before preprocessing. Yeah. Thanks, but no thanks.Why isn't it a good choice exactly? Could you sum it up?
This is why it's crucial to get APIs right first time.
It's also a complete idiocy to possibly not copy everything. I've never wanted that. It's begging for bugs. If there's not enough room to copy all the source bytes, you either know that in advance or you want to crash hard.
Even memcpy itself is hardly necessary as a library function. It's a simple loop. But at least it's what you would write anyway, and in some cases compilers are able to optimize it (with questionable benefits).
Moreover, most of the time I’m either certain the output buffer is large enough (maybe because I just allocated it) or I really do want the output buffer to have a truncated string. Strncpy’s two dumb behaviors that I almost never want is to pad the destination buffer with zeros if the source string is smaller and to leave the destination I terminated if the source is longer.
If you're implying that we should then use a regex implementation instead: Coding up a lexer (for a mainstream programming language) using simple counter increments and such is not a lot of work. It has the advantage that it results in faster code (unless you're going for real heavy machinery) and that you can easily code up additional transformations. For example, how would you parse string literals (including escape sequences) with a regex?
You want to convert the string literal to an internal buffer, interpreting the escape sequences. In the same way, you want to parse integers. You cannot really do that with a regular expression. RE is for matching, not for transforming.
The one use-case I'm envisioning is quickly exposing POSIX conformant regexen to the command-line.
I’ll take you up on that bet. I see #include <regex.h> in every source repo of grep I can find right now.
http://git.savannah.gnu.org/cgit/grep.git/tree/src/search.h
https://opensource.apple.com/source/text_cmds/text_cmds-99/g...
https://android.googlesource.com/platform/system/core.git/+/...
https://github.com/c9/node-gnu-tools/blob/master/grep-src/sr...
You owe me a beer. :)
BTW, I do super agree with your comment to just not use strtok, and also the idea that most people are better off parsing text in perl or python...
Let's get that beer sometime, anyway.
I am finding it hard to believe that you think it's "just not in C's spirit to use canned libraries."
There are advantages and disadvantages of using libraries in any language. There are some aspects of C —e.g. the lack of garbage collection— that make it harder to reuse code compared to some languages. But it is definitely not against "C's spirit".
Wait, what? If C does not require A..Z to be contiguous, the distinction between 'A' and 0x41 is extremely significant to portable programs intending to parse ASCII when the native compiler encoding is whatever franken-coding doesn't have contiguous latin characters.
My response was to OPs moving the goal post to "portably parsing ASCII" in response to his suggested replacement for a C library function not being portable on non-ASCII systems, which make no sense.
Anyway I think most programming languages nowadays have their source encoding specified as UTF-8 or at least something ASCII-like, so ('A' <= c && c <= 'Z') is in fact what I would likely write, and using isalpha() would technically be a bug just as well.