[lex] Better specify whitespace characters
This commit defines a grammar term for _whitespace-character_ and
uses it consistently where the plain text term whitespace character
is used.  A whitespace character is defined as one of the five
characters that are mentioned in the text closest to providing a
definition.  The Unicode character name is (mostly) consistently
used to name these characters, and for consistency, similar changes
were made to name Unicode characters rather than render specified
characters in code font throughout [lex].  The one exception is
backslash, which is retained as-is to avoid making more issues for
P2348.  Note that this commit is not a replacement for P2348,
merely a clearer statement of the existing specification without
any normative changes.
AlisdairM committed Oct 28, 2024
1 parent 324f564 commit 912443c
Showing 1 changed file with 30 additions and 23 deletions.
53 changes: 30 additions & 23 deletions source/lex.tex
@@ -110,9 +110,9 @@
\indextext{line splicing}%
If the first translation character is \unicode{feff}{byte order mark},
it is deleted.
Each sequence of a backslash character (\textbackslash)
Each sequence of a backslash character (\unicode{005c}{reverse solidus})
immediately followed by
zero or more whitespace characters other than new-line followed by
zero or more \grammarterm{whitespace-character}s other than new-line followed by
a new-line character is deleted, splicing
physical source lines to form \defnx{logical source lines}{source line!logical}. Only the last
backslash on any physical source line shall be eligible for being part
@@ -127,7 +127,7 @@
to the file.

\item The source file is decomposed into preprocessing
tokens\iref{lex.pptoken} and sequences of whitespace characters
tokens\iref{lex.pptoken} and sequences of \grammarterm{whitespace-character}s
(including comments). A source file shall not end in a partial
preprocessing token or in a partial comment.
\begin{footnote}
@@ -140,9 +140,9 @@
would arise from a source file ending with an unclosed \tcode{/*}
comment.
\end{footnote}
Each comment\iref{lex.comment} is replaced by one space character. New-line characters are
retained. Whether each nonempty sequence of whitespace characters other
than new-line is retained or replaced by one space character is
Each comment\iref{lex.comment} is replaced by one \unicode{0020}{space} character. New-line characters are
retained. Whether each nonempty sequence of \grammarterm{whitespace-character}s other
than new-line is retained or replaced by one \unicode{0020}{space} character is
unspecified.
As characters from the source file are consumed
to form the next preprocessing token
@@ -178,7 +178,7 @@
\item
Adjacent \grammarterm{string-literal} tokens are concatenated\iref{lex.string}.

\item Whitespace characters separating tokens are no longer
\item \grammarterm{whitespace-character}s separating tokens are no longer
significant. Each preprocessing token is converted into a
token\iref{lex.token}. The resulting tokens
constitute a \defn{translation unit} and
@@ -469,16 +469,25 @@

\rSec1[lex.comment]{Comments}

\pnum
\indextext{comment|(}%
\begin{bnf}
\nontermdef{whitespace-character}\br
\unicode{0009}{character tabulation}\br
\textnormal{new-line}\br
\unicode{000b}{line tabulation}\br
\unicode{000c}{form feed}\br
\unicode{0020}{space}\br
\end{bnf}

\pnum
\indextext{comment!\tcode{/*} \tcode{*/}}%
\indextext{comment!\tcode{//}}%
The characters \tcode{/*} start a comment, which terminates with the
characters \tcode{*/}. These comments do not nest.
\indextext{comment!\tcode{//}}%
The characters \tcode{//} start a comment, which terminates immediately before the
next new-line character. If there is a form-feed or a vertical-tab
character in such a comment, only whitespace characters shall appear
next new-line character. If there is a \unicode{000c}{form feed} or a \unicode{000b}{line tabulation}
character in such a comment, only \grammarterm{whitespace-character}s shall appear
between it and the new-line that terminates the comment; no diagnostic
is required.
\begin{note}
@@ -494,6 +503,7 @@

\indextext{token!preprocessing|(}%
\begin{bnf}

\nontermdef{preprocessing-token}\br
header-name\br
import-keyword\br
@@ -506,7 +516,7 @@
string-literal\br
user-defined-string-literal\br
preprocessing-op-or-punc\br
\textnormal{each non-whitespace character that cannot be one of the above}
\textnormal{each non-\grammarterm{whitespace-character} that cannot be one of the above}
\end{bnf}

\pnum
@@ -520,7 +530,7 @@
(\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
identifiers, preprocessing numbers, character literals (including user-defined character
literals), string literals (including user-defined string literals), preprocessing
operators and punctuators, and single non-whitespace characters that do not lexically
operators and punctuators, and single non-\grammarterm{whitespace-character}s that do not lexically
match the other preprocessing token categories.
If a \unicode{0027}{apostrophe} or a \unicode{0022}{quotation mark} character
matches the last category, the program is ill-formed.
@@ -530,12 +540,9 @@
\indextext{whitespace}%
whitespace;
\indextext{comment}%
this consists of comments\iref{lex.comment}, or whitespace characters
(\unicode{0020}{space},
\unicode{0009}{character tabulation},
new-line,
\unicode{000b}{line tabulation}, and
\unicode{000c}{form feed}), or both.
this consists of comments\iref{lex.comment},
\grammarterm{whitespace-character}s, or
both.
As described in \ref{cpp}, in certain
circumstances during translation phase 4, whitespace (or the absence
thereof) serves as more than preprocessing token separation. Whitespace
@@ -673,13 +680,13 @@
external source file names as specified in~\ref{cpp.include}.

\pnum
The appearance of either of the characters \tcode{'} or \tcode{\textbackslash} or of
The appearance of either of the characters \unicode{0027}{apostrophe} or \unicode{005c}{reverse solidus} or of
either of the character sequences \tcode{/*} or \tcode{//} in a
\grammarterm{q-char-sequence} or an \grammarterm{h-char-sequence}
is conditionally-supported with \impldef{meaning of \tcode{'}, \tcode{\textbackslash},
\tcode{/*}, or \tcode{//} in a \grammarterm{q-char-sequence} or an
\grammarterm{h-char-sequence}} semantics, as is the appearance of the character
\tcode{"} in an \grammarterm{h-char-sequence}.
\unicode{0022}{quotation mark} in an \grammarterm{h-char-sequence}.
\begin{footnote}
Thus, a sequence of characters
that resembles an escape sequence can result in an error, be interpreted as the
@@ -826,7 +833,7 @@
\end{footnote}
operators, and other separators.
\indextext{whitespace}%
Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
\grammarterm{whitespace-character}s and comments
(collectively, ``whitespace''), as described below, are ignored except
as they serve to separate tokens.
\begin{note}
@@ -1790,8 +1797,8 @@
\begin{bnf}
\nontermdef{d-char}\br
\textnormal{any member of the basic character set except:}\br
\bnfindent\textnormal{\unicode{0020}{space}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis}, \unicode{005c}{reverse solidus},}\br
\bnfindent\textnormal{\unicode{0009}{character tabulation}, \unicode{000b}{line tabulation}, \unicode{000c}{form feed}, and new-line}
\bnfindent\textnormal{a \grammarterm{whitespace-character}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis},}\br
\bnfindent\textnormal{and \unicode{005c}{reverse solidus}}
\end{bnf}

\pnum
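
Illustration (not part of the commit): the whitespace-character production added above, and the phase 2 line-splicing rule that now refers to it, can be sketched in C++ roughly as follows. The names is_whitespace_character and splice_lines are hypothetical helpers invented for this sketch, not anything taken from the standard or from an implementation. The sketch works on plain char input and ignores the other phase 2 rules (byte order mark removal, the handling of a source file that does not end in a new-line), so it should be read as an approximation of the wording, not as a conforming lexer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true for exactly the five characters listed in the
// whitespace-character production (character tabulation, new-line,
// line tabulation, form feed, space).
constexpr bool is_whitespace_character(char c) {
    return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == ' ';
}

static_assert(is_whitespace_character('\v'), "line tabulation is in the set");
static_assert(!is_whitespace_character('\r'), "carriage return is not in the set");

// Hypothetical helper: deletes each backslash that is immediately followed by
// zero or more whitespace-characters other than new-line and then a new-line,
// together with that trailing run and the new-line itself.
inline std::string splice_lines(const std::string& src) {
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            // Skip whitespace-characters other than new-line after the backslash.
            std::size_t j = i + 1;
            while (j < src.size() && src[j] != '\n' &&
                   is_whitespace_character(src[j])) {
                ++j;
            }
            // Splice only if the run ends at a new-line.
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    return out;
}
```

Because a backslash is spliced only when nothing but non-new-line whitespace-characters separates it from the new-line, an earlier backslash on the same physical line is left alone, which matches the "only the last backslash" sentence in the hunk at line 110.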