Ruby has two ways to write if.
if foo then bar end
foo if bar
This reads naturally to humans, but not to machines. For example, can you tell whether this code is valid?
p if 1 then 2 else 3 end
The answer is:
$ ruby -e 'p if 1 then 2 else 3 end'
-e:1: syntax error, unexpected `then', expecting end-of-input
It is a syntax error, because the if here is recognized as a "modifier if", not a "keyword if". So how does Ruby decide the type of if?
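One way to observe the two kinds of if from Ruby itself is the standard library's Ripper, which reports them as different node types (:if and :if_mod). The following is only a sketch: the helper name if_kind is made up for this example, and it relies on Ripper.sexp returning nil when the source does not parse.

```ruby
require "ripper"

# Hypothetical helper: reports how Ripper parsed the `if` in src.
# Returns :if (keyword if), :if_mod (modifier if), or nil when the
# source has a syntax error or contains no `if` at all.
def if_kind(src)
  sexp = Ripper.sexp(src) # nil on a syntax error
  return nil if sexp.nil?

  nodes = sexp.flatten
  return :if_mod if nodes.include?(:if_mod)
  return :if if nodes.include?(:if)

  nil
end

p if_kind("if foo then bar end")
p if_kind("foo if bar")
p if_kind("p if 1 then 2 else 3 end")
p if_kind("p(if 1 then 2 else 3 end)")
```

Wrapping the argument in parentheses is enough to make the last example parse as a keyword if.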
parse.y
The answer should be in parse.y, which defines Ruby's grammar.
In parse.y, you will find two separate tokens, keyword_if and modifier_if. This means the type of if is decided by the lexer, not the parser.
lex.c.blt
By grepping for modifier_if, you will find that lex.c.blt has a table of keywords in the function rb_reserved_word.
#line 31 "defs/keywords"
{gperf_offsetof(stringpool, 33), {keyword_if, modifier_if}, EXPR_VALUE},
parse.y
The lexer starts from yylex. It calls parser_yylex, which handles symbols like + and -. If the character is not a symbol, parse_ident is called.
parse_ident uses rb_reserved_word to check whether the token at the current position is a keyword. The returned kw is an entry of the table we saw in lex.c.blt.
/* See if it is a reserved word. */
kw = rb_reserved_word(tok(p), toklen(p));
For the if keyword, kw->id[0] is keyword_if and kw->id[1] is modifier_if.
In other words, id holds two values so that the keyword form and the modifier form can be told apart. According to lex.c.blt, Ruby has five modifiers:
- x if y
- x unless y
- x while y
- x until y
- x rescue y
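The five modifier forms can also be checked from Ruby with the standard library's Ripper, which has a dedicated *_mod event for each of them. This is a small sketch, not part of parse.y: the constant MODIFIER_FORMS and the helper parsed_as? are names invented for this example.

```ruby
require "ripper"

# Each modifier form paired with the Ripper event we expect to see.
MODIFIER_FORMS = {
  "x if y"     => :if_mod,
  "x unless y" => :unless_mod,
  "x while y"  => :while_mod,
  "x until y"  => :until_mod,
  "x rescue y" => :rescue_mod,
}

# Hypothetical helper: true when Ripper's S-expression for src
# contains the given event name.
def parsed_as?(src, event)
  sexp = Ripper.sexp(src)
  !sexp.nil? && sexp.flatten.include?(event)
end

MODIFIER_FORMS.each do |src, event|
  puts "#{src.ljust(12)} #{event}: #{parsed_as?(src, event)}"
end
```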
When an if is a modifier
The following condition distinguishes keyword_if from modifier_if. In short, an if is a keyword if the lexer state includes EXPR_BEG (or EXPR_LABELED); otherwise, it is a modifier.
if (IS_lex_state_for(state, (EXPR_BEG | EXPR_LABELED)))
return kw->id[0];
else {
if (kw->id[0] != kw->id[1])
SET_LEX_STATE(EXPR_BEG | EXPR_LABEL);
return kw->id[1];
}
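To make the decision easier to play with, here is a rough Ruby model of the branch above. It is a sketch under assumptions: the bit values of the EXPR_* flags, the Keyword struct, and the function name token_for are all invented for illustration, and the SET_LEX_STATE side effect is left out.

```ruby
# Hypothetical bit flags; the real values are defined in parse.y.
EXPR_BEG     = 1 << 0
EXPR_END     = 1 << 1
EXPR_ARG     = 1 << 2
EXPR_LABELED = 1 << 3

# Simplified stand-in for one entry of the keyword table:
# ids[0] is the keyword token, ids[1] is the modifier token.
Keyword = Struct.new(:name, :ids)
IF_KEYWORD = Keyword.new("if", [:keyword_if, :modifier_if])

# Mirrors the C branch: any overlap with EXPR_BEG | EXPR_LABELED
# selects the keyword token, everything else selects the modifier.
def token_for(kw, state)
  if (state & (EXPR_BEG | EXPR_LABELED)) != 0
    kw.ids[0]
  else
    kw.ids[1]
  end
end

p token_for(IF_KEYWORD, EXPR_BEG)
p token_for(IF_KEYWORD, EXPR_END)
p token_for(IF_KEYWORD, EXPR_ARG)
```

For keywords that have no modifier form, both ids would be the same token, so the branch taken would not matter in this model.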
The lexer state
Among the states of the lexer, EXPR_BEG, EXPR_END and EXPR_ARG are the most important. They decide whether operators like + and - are unary or binary. For example:
- 1 - 2: This is binary minus because the state is EXPR_END after the 1.
- foo(-1): This is unary minus because the state is EXPR_BEG after the (.
EXPR_ARG is a bit tricky: in this state, the meaning of - depends on whether a space follows it.
- foo - 1: binary minus
- foo -1: unary minus
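The four minus examples above can be checked with the standard library's Ripper, which reports them as :binary or :unary nodes. This is a minimal sketch; the helper name minus_kind is made up for this example, and foo is simply an undefined identifier that Ripper is happy to parse.

```ruby
require "ripper"

# Hypothetical helper: reports whether Ripper parsed the minus in src
# as a unary or a binary operator (nil if neither node is present).
def minus_kind(src)
  nodes = Ripper.sexp(src)&.flatten || []
  return :unary if nodes.include?(:unary)
  return :binary if nodes.include?(:binary)

  nil
end

["1 - 2", "foo(-1)", "foo - 1", "foo -1"].each do |src|
  puts "#{src.ljust(8)} #{minus_kind(src)}"
end
```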
What is interesting is that this rule is not difficult for humans. The former "looks like" binary and the latter "looks like" unary, so in practice it will never bother you unless you are implementing a parser.
keyword if and modifier if
Now you can tell whether an if is a keyword or a modifier by checking the lexer state.
- foo() if ...: This is modifier_if because the state is EXPR_END after the ).
- foo(if ...): This is keyword_if because the state is EXPR_BEG after the (.
- foo if ...: This is modifier_if because the state is EXPR_ARG just before the if.
Why this matters to me
I think most Rubyists do not care about corner cases like this. However, I needed to figure this out because I'm making my own programming language, Shiika, which has Ruby-like syntax.
As you've seen, parsing Ruby-like syntax is not easy, especially method calls without parentheses. I'll be happy if this entry helps someone who wants to make a Rubyish language.