Regular expression question

L7L7L7 · September 13, 2006, 5:55pm

In trying to parse a C source file I have the following section of
code:

…
…
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
  non_comments = line.split(/\/\*.*?\*\//).to_s
  process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
  comment = true
  next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
  comment = false
…
…

I am running into a problem with the multi-line comment sections.
While something like:

/*
A comment
*/

will work (i.e. gets properly parsed out)

/* A

comment */

OR

/* A *

comment */

will not.
My guess is that it is because of the [^(\*\/)] construct blocking the
leading or trailing '*' character. However, I thought that by placing
the */ within parentheses I avoided the characters being evaluated
individually.
Is there a way to look for the pattern '*/' without having a single '*'
break the search?
As an alternative, I use this:

when /^.*\/\*\*?[^(\*\/)]*\*?$/
  comment = true
  next
when /^.?[^(\/\*)]*\*\/.*$/
  comment = false

Which seems to solve the problem, but I can see where it is brittle:

/* A * comment
for * instance */

Any suggestions?
Thanks in advance.

L7L7L7 · September 13, 2006, 6:16pm

L7 wrote:

In trying to parse a C source file I have the following section of
code:

…
…
case line
when /^.*\/\*.*?\*\/.*$/ # single line comment(s)
  non_comments = line.split(/\/\*.*?\*\//).to_s
  process_code(non_comments)
when /^.*\/\*\*?[^(\*\/)]*$/ # multi-line start
  comment = true
  next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
  comment = false
…
…

I am running into a problem with the multi-line comment sections.

My eyes glaze over with these kinds of expressions, but this might help:

Various Regex Examples for Programmers - Source Code Syntax

Scroll down to the section on “Comments”. They seem to have a simpler
solution; I think the trick is to be able to use . as matching newlines.

And you can turn on newline matching in Ruby by putting an “m” after the
expression:

/my_pattern_here/m
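
A minimal sketch of how that could look, assuming the whole file is already in one string. The sample text and the variable names `source` and `stripped` are made up for illustration. It knows nothing about string literals, so a `/*` inside a C string would still be treated as the start of a comment:

```ruby
# Hypothetical sample input; in practice this might come from File.read.
source = <<~C
  int a; /* one-line comment */
  /* A *
     comment */
  int b;
C

# With the /m flag "." also matches "\n", so the lazy ".*?" can run
# across line breaks and stops at the first "*/" it finds.
stripped = source.gsub(%r{/\*.*?\*/}m, "")

puts stripped
```

The lazy `.*?` matters here: a greedy `.*` with `/m` would swallow everything from the first `/*` to the last `*/` in the file.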

Hope this helps…?

Jeff

L7L7L7 · September 13, 2006, 6:39pm

On 9/13/06, L7 [email protected] wrote:

In trying to parse a C source file I have the following section of
code:

Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-). Why don’t you take a pre-pass through your C
file and take out the comments yourself before you run your main
parse? A recursive-descent parser to do the job would probably take
almost no code at all in Ruby.
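
As a rough illustration of such a pre-pass, here is a sketch of a small hand-written scanner (a simple state machine rather than a true recursive-descent parser). The method name `strip_c_comments` is hypothetical, and the sketch assumes no trigraphs and no backslash-newline line splicing. Each block comment is replaced with a single space, which is how the C standard treats comments:

```ruby
# Hypothetical helper: remove /* ... */ and // comments from C source
# text while leaving string and character literals untouched.
def strip_c_comments(src)
  out = +""
  state = :code
  quote = nil
  i = 0
  while i < src.length
    ch = src[i]
    two = src[i, 2]
    case state
    when :code
      if two == "/*"
        state = :block
        i += 2
        next
      elsif two == "//"
        state = :line
        i += 2
        next
      elsif ch == '"' || ch == "'"
        state = :literal
        quote = ch
      end
      out << ch
    when :block
      if two == "*/"
        state = :code
        out << " "               # a comment becomes one space
        i += 2
        next
      end
    when :line
      if ch == "\n"
        state = :code
        out << ch                # keep the newline that ends the comment
      end
    when :literal
      out << ch
      if ch == "\\"
        out << src[i + 1].to_s   # copy the escaped character unchanged
        i += 2
        next
      elsif ch == quote
        state = :code
      end
    end
    i += 1
  end
  out
end

puts strip_c_comments('if (strcmp(x, "/*") == 0) /* real */ y();')
```

Because the scanner tracks whether it is inside a literal, the `/*` in the string passed to `strcmp` is left alone and only the real comment disappears.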

L7L7L7 · September 13, 2006, 6:35pm

Jeff C. wrote:

My eyes glaze over with these kinds of expressions, but this might help:

Various Regex Examples for Programmers - Source Code Syntax

Scroll down the section on “Comments”. They seem to have a simpler
solution, I think the trick is to be able to use . as matching newlines.

I don’t think it applies to this directly. I didn’t explicitly mention it,
but the processing is happening on a line-by-line basis. In order to
remove all commenting in the above manner I would first have to read
the file as a string, strip, split on newline, then parse the code.

L7L7L7 · September 13, 2006, 6:55pm

Francis C. wrote:

On 9/13/06, L7 [email protected] wrote:

In trying to parse a C source file I have the following section of
code:

Remember that in C, nested comment-blocks are not permitted, for the
incredibly good reason that they are not recognizable by
regular-expressions ;-).

Agreed. However, something with ‘*’ characters in it is allowed (so
long as they are not preceded or followed directly by ‘/’) and that is
where I would get clobbered.

Why don’t you take a pre-pass through your C
file and take out the comments yourself before you run your main

As I mentioned, that involves a bit of overhead. But with regard to the
project, I assume it is the ‘best fix’ for what I have.

L7L7L7 · September 13, 2006, 7:31pm

L7 wrote:

/ …

I don’t think it applies to this directly. I didn’t explicitly mention it,
but the processing is happening on a line-by-line basis. In order to
remove all commenting in the above manner I would first have to read
the file as a string, strip, split on newline, then parse the code.

Yes, and that may be the best way to approach this problem. There are a
number of problems where reading the entire file and processing it as a
long string is the best (one is tempted to say the only) way to proceed.

If you don’t read the entire file, then you are obliged to carry more
state information in your algorithm between lines. IMHO, it is much
better to eliminate multiline comments in one go than to construct and
maintain state flags for this and any other contingencies that may carry
over between lines.
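
To make the trade-off concrete, this is roughly what carrying that state looks like in a line-by-line version. It is only a sketch: the method name `strip_block_comments_by_line` is made up, and it ignores string literals and `//` comments. The `in_comment` flag is the state that has to survive from one line to the next:

```ruby
# Hypothetical helper: strip /* ... */ comments from an array of lines,
# remembering between lines whether a comment is still open.
def strip_block_comments_by_line(lines)
  in_comment = false
  lines.map do |line|
    kept = +""
    rest = line
    until rest.empty?
      # Look for whichever delimiter can change the current state.
      marker = in_comment ? "*/" : "/*"
      before, found, rest = rest.partition(marker)
      kept << before unless in_comment
      in_comment = !in_comment unless found.empty?
    end
    kept
  end
end

sample = ["int a; /* one */ int b;", "/* A *", "comment */ int c;", "int d;"]
puts strip_block_comments_by_line(sample)
```

Every further construct that can span lines would need another flag of this kind, which is the maintenance cost being described.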

Obviously this poses practical problems for huge source files, but,
again IMHO, huge source files should not exist anyway – they should be
broken up into manageable chunks.

L7L7L7 · September 13, 2006, 10:50pm

On 9/13/06, Paul L. [email protected] wrote:

Yes, and that may be the best way to approach this problem. There are a

Obviously this poses practical problems for huge source files, but, again
IMHO, huge source files should not exist anyway – they should be broken
up into manageable chunks.

–
Paul L.
http://www.arachnoid.com

I am intrigued. I believe that a regular expression to find all comments
in C must be very complex and is probably not the correct tool. Look at
these snippets:

// /*
if(strcmp(x,"/")
// "/
etc. etc.

BTW, I cannot find a reason why the job cannot be done by a regular
expression, but that does not mean it can.
Robert
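
One way a single regular expression can cope with snippets like these is to match string and character literals as well as comments, and then throw away only the comments. A sketch, with the caveats that the constant `C_TOKEN` and the method `strip_comments` are hypothetical names, and that trigraphs, backslash-newline splicing, and preprocessor oddities are ignored:

```ruby
# Hypothetical pattern: whichever alternative starts first at a given
# position wins, so a "/*" inside a string is consumed as part of the
# string and never seen as a comment opener.
C_TOKEN = %r{
    "(?:\\.|[^"\\])*"    # string literal, escapes included
  | '(?:\\.|[^'\\])*'    # character literal
  | /\*.*?\*/            # block comment
  | //[^\n]*             # line comment
}mx

def strip_comments(src)
  # Comments start with a slash, literals with a quote mark.
  src.gsub(C_TOKEN) { |tok| tok.start_with?("/") ? " " : tok }
end

code = "x = strcmp(s, \"/*\"); /* real */\n// \"/* not a string\ny = 1;\n"
puts strip_comments(code)
```

The block passed to `gsub` returns literals unchanged and replaces each comment with a single space.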

–
Two things are infinite: the universe and human stupidity; and as for
the universe, I have not acquired absolute certainty.

Albert Einstein

L7L7L7 · September 15, 2006, 4:14am

On Thu, 2006-09-14 at 05:49 +0900, Robert D. wrote:

On 9/13/06, Paul L. [email protected] wrote:
I am intrigued, I believe that the regular expression to find all comments
in C must be very complex and probably not the correct tool, look at these
snippets

// /*
if(strcmp(x,"/")
// "/
etc. etc.

I’m not sure if it’s impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I’ve seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

Yours,

Tom

(*) Digraphs and trigraphs - Wikipedia

L7L7L7 · September 14, 2006, 1:38am

On Sep 13, 2006, at 10:55 AM, L7 wrote:

comment = true
next
when /^[^(\/\*)]*\*\/.*$/ # multi-line end
comment = false
…
…

Is there a way to look for the pattern '*/' without having a single
'*' break the search?

If I’m not mistaken, what you need is a negative lookahead

try /^.*\/\*([^\/]|\/(?!\*))*$/ for multi-line start

and /^([^\*]|\*(?!\/))*\*\/.*$/ for multi-line end

the key difference (from the start pattern) is ([^\/]|\/(?!\*))

this breaks down like so:

(
[^\/] # anything but /
| # or
\/(?!\*) # a / not followed by an * (don’t eat the character after /,
         # just peek at it)
)

The pattern for multi-line end uses the same technique, but with the
characters reversed.

I’m sure this isn’t the be-all and end-all of C comment matching
regexes, but it handles all of the cases you described.
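
A quick way to check the lookahead idea against the sample lines from the original question. This is a sketch: the constant names are made up, and the patterns are written with `%r{...}` so that slashes need no escaping. They follow the breakdown above as best it can be read, so treat the exact patterns as an assumption:

```ruby
# Assumed form of the two patterns, using %r{} delimiters.
MULTI_START = %r{^.*/\*([^/]|/(?!\*))*$}   # "/*", then no further "/*"
MULTI_END   = %r{^([^*]|\*(?!/))*\*/.*$}   # no "*/" before the first "*/"

["/* A", "/* A *", "/* A * comment"].each do |line|
  puts "start? #{line.inspect} => #{line.match?(MULTI_START)}"
end

["comment */", "for * instance */", "int x = 1;"].each do |line|
  puts "end?   #{line.inspect} => #{line.match?(MULTI_END)}"
end
```

Note that `MULTI_START` also matches a complete one-line comment such as `/* x */`, so it relies on the single-line case being tested first in the `case` statement.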

Rod

L7L7L7 · September 15, 2006, 4:22am

On 9/14/06, Tom C. [email protected] wrote:

I’m not sure if it’s impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I’ve seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

A C-style block comment can indeed be recognized by a regex. In fact,
that’s how lexical states are generally invoked in scanners generated by
tools like flex. However, a nested C-style block comment cannot be
detected by a regex. A (theoretical) language supporting such a construct
would be a context-free language, not a regular language.

L7L7L7 · September 15, 2006, 4:33am

On 9/14/06, Tom C. [email protected] wrote:

I’m not sure if it’s impossible to parse out C-style comments using a
regular expression, but the various JavaCC grammars I’ve seen all use
lexical states to do it instead. Another complication is trigraphs (*),
although I think those are unrecognized by default in most C
preprocessors.

One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*…*/"); */

is actually context-free. Does anyone know for sure?

L7L7L7 · September 15, 2006, 4:43am

On Fri, Sep 15, 2006 at 11:32:33AM +0900, Francis C. wrote:

/* printf ("*/"); */

Pretty sure this would end up being a syntax error.

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*…*/"); */

This too.

gcc agrees with me at least:

% cat comments.c
#include <stdio.h>

int main(int argc, char *argv) {
    /* printf("*/"); */
    /* printf("/*…*/"); */
    return 0;
}
% gcc -c comments.c
comments.c: In function ‘main’:
comments.c:4: error: missing terminating " character
comments.c:5: error: missing terminating " character

is actually context-free. Does anyone know for sure?

As for whether or not it’s context-free, I don’t know, but I think you
overestimated how hard C tries. /* */ comments are not nestable, for instance.

L7L7L7 · September 15, 2006, 4:51am

On 9/14/06, Logan C. [email protected] wrote:

return 0;

I know these are syntax errors in C. I was talking about a hypothetical
language (not C) that defined such constructs as legal. I’m still not
sure that it’s impossible to use a regular language to generate this case:
/* "*/ */
I’m pretty convinced that the other case requires a context-free
language.

L7L7L7 · September 15, 2006, 3:30pm

“Francis C.” [email protected] writes:

One more point. Someone upthread gave an example similar to this:

/* printf ("*/"); */

Considered strictly as a lexical construction, I think this is regular.
However, I have a funny feeling that this:

/* printf ("/*…*/"); */

is actually context-free. Does anyone know for sure?

So you want to know if a grammar is regular or not? Sounds like you
need the Myhill-Nerode theorem
(Myhill–Nerode theorem - Wikipedia).

And according to that, a language that allows arbitrary nesting of
comment expressions like this is indeed not regular, and therefore not
parseable with regular expressions as traditionally defined in
computer science. To parse arbitrarily nested constructs you either
need something like perl’s evaluate-code-at-regexp-match-time feature
(which so far as I know exists in no other language), or an actual
grammar. (or anything else that can get as complicated
computationally as a pushdown automaton)
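
To illustrate the difference, here is a sketch for a hypothetical language in which `/* */` comments do nest (so not C). A lazy regex stops at the first `*/`, whereas a depth counter, which is effectively a one-symbol stack, keeps track of how many comments are still open. The method name `strip_nested_comments` is made up for this example, and string literals are ignored:

```ruby
# Hypothetical helper for a language with nestable /* ... */ comments.
def strip_nested_comments(src)
  out = +""
  depth = 0
  i = 0
  while i < src.length
    two = src[i, 2]
    if two == "/*"
      depth += 1                     # one more comment is open
      i += 2
    elsif two == "*/" && depth > 0
      depth -= 1                     # the innermost comment closes
      i += 2
    else
      out << src[i] if depth.zero?   # keep text only outside comments
      i += 1
    end
  end
  out
end

nested = "a /* x /* y */ z */ b"
puts strip_nested_comments(nested).inspect
puts nested.gsub(%r{/\*.*?\*/}m, "").inspect   # plain regex, for contrast
```

On this input the counter version leaves only the code outside the outermost comment, while the plain regex closes the comment at the inner `*/` and leaves ` z */` behind.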

L7L7L7 · September 15, 2006, 2:46pm

On Fri, Sep 15, 2006 at 11:51:16AM +0900, Francis C. wrote:

I know these are syntax errors in C. I was talking about a hypothetical
language (not C) that defined such constructs as legal. I’m still not sure
that it’s impossible to use a regular language to generate this case:
/* "*/ */
I’m pretty convinced that the other case requires a context-free language.

Well, for empirical evidence one could look at ML. (* comments (* are *)
nestable *).