Diffstat (limited to 'Doc/reference')
-rw-r--r-- | Doc/reference/compound_stmts.rst | 22
-rw-r--r-- | Doc/reference/datamodel.rst | 18
-rw-r--r-- | Doc/reference/expressions.rst | 18
-rw-r--r-- | Doc/reference/grammar.rst | 18
-rw-r--r-- | Doc/reference/introduction.rst | 160
-rw-r--r-- | Doc/reference/lexical_analysis.rst | 413
6 files changed, 451 insertions, 198 deletions
diff --git a/Doc/reference/compound_stmts.rst b/Doc/reference/compound_stmts.rst index f36ed3e122f..e95fa3a6424 100644 --- a/Doc/reference/compound_stmts.rst +++ b/Doc/reference/compound_stmts.rst @@ -154,15 +154,15 @@ The :keyword:`for` statement is used to iterate over the elements of a sequence (such as a string, tuple or list) or other iterable object: .. productionlist:: python-grammar - for_stmt: "for" `target_list` "in" `starred_list` ":" `suite` + for_stmt: "for" `target_list` "in" `starred_expression_list` ":" `suite` : ["else" ":" `suite`] -The ``starred_list`` expression is evaluated once; it should yield an -:term:`iterable` object. An :term:`iterator` is created for that iterable. -The first item provided -by the iterator is then assigned to the target list using the standard -rules for assignments (see :ref:`assignment`), and the suite is executed. This -repeats for each item provided by the iterator. When the iterator is exhausted, +The :token:`~python-grammar:starred_expression_list` expression is evaluated +once; it should yield an :term:`iterable` object. An :term:`iterator` is +created for that iterable. The first item provided by the iterator is then +assigned to the target list using the standard rules for assignments +(see :ref:`assignment`), and the suite is executed. This repeats for each +item provided by the iterator. When the iterator is exhausted, the suite in the :keyword:`!else` clause, if present, is executed, and the loop terminates. @@ -1885,7 +1885,7 @@ expressions. The presence of annotations does not change the runtime semantics of the code, except if some mechanism is used that introspects and uses the annotations (such as :mod:`dataclasses` or :func:`functools.singledispatch`). -By default, annotations are lazily evaluated in a :ref:`annotation scope <annotation-scopes>`. +By default, annotations are lazily evaluated in an :ref:`annotation scope <annotation-scopes>`. This means that they are not evaluated when the code containing the annotation is evaluated. Instead, the interpreter saves information that can be used to evaluate the annotation later if requested. The :mod:`annotationlib` module provides tools for evaluating annotations. @@ -1898,6 +1898,12 @@ all annotations are instead stored as strings:: >>> f.__annotations__ {'param': 'annotation'} +This future statement will be deprecated and removed in a future version of Python, +but not before Python 3.13 reaches its end of life (see :pep:`749`). +When it is used, introspection tools like +:func:`annotationlib.get_annotations` and :func:`typing.get_type_hints` are +less likely to be able to resolve annotations at runtime. + .. rubric:: Footnotes diff --git a/Doc/reference/datamodel.rst b/Doc/reference/datamodel.rst index ea5b84b00c0..4a099e81dac 100644 --- a/Doc/reference/datamodel.rst +++ b/Doc/reference/datamodel.rst @@ -262,6 +262,8 @@ Booleans (:class:`bool`) a string, the strings ``"False"`` or ``"True"`` are returned, respectively. +.. _datamodel-float: + :class:`numbers.Real` (:class:`float`) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -1228,10 +1230,22 @@ Special attributes :attr:`__annotations__ attributes <object.__annotations__>`. For best practices on working with :attr:`~object.__annotations__`, - please see :mod:`annotationlib`. Where possible, use + please see :mod:`annotationlib`. Use :func:`annotationlib.get_annotations` instead of accessing this attribute directly. + .. warning:: + + Accessing the :attr:`!__annotations__` attribute directly + on a class object may return annotations for the wrong class, specifically + in certain cases where the class, its base class, or a metaclass + is defined under ``from __future__ import annotations``. + See :pep:`749 <749#pep749-metaclasses>` for details. + + This attribute does not exist on certain builtin classes. On + user-defined classes without ``__annotations__``, it is an + empty dictionary. + .. versionchanged:: 3.14 Annotations are now :ref:`lazily evaluated <lazy-evaluation>`. See :pep:`649`. @@ -3346,7 +3360,7 @@ left undefined. argument if the three-argument version of the built-in :func:`pow` function is to be supported. - .. versionchanged:: next + .. versionchanged:: 3.14 Three-argument :func:`pow` now tries calling :meth:`~object.__rpow__` if necessary. Previously it was only called in two-argument :func:`!pow` and the binary
diff --git a/Doc/reference/expressions.rst b/Doc/reference/expressions.rst index 8837344e5dd..24544a055c3 100644 --- a/Doc/reference/expressions.rst +++ b/Doc/reference/expressions.rst @@ -134,8 +134,7 @@ Literals Python supports string and bytes literals and various numeric literals: .. productionlist:: python-grammar - literal: `stringliteral` | `bytesliteral` - : | `integer` | `floatnumber` | `imagnumber` + literal: `stringliteral` | `bytesliteral` | `NUMBER` Evaluation of a literal yields an object of the given type (string, bytes, integer, floating-point number, complex number) with the given value. The value @@ -406,8 +405,9 @@ brackets or curly braces. Variables used in the generator expression are evaluated lazily when the :meth:`~generator.__next__` method is called for the generator object (in the same fashion as normal generators). However, the iterable expression in the -leftmost :keyword:`!for` clause is immediately evaluated, so that an error -produced by it will be emitted at the point where the generator expression +leftmost :keyword:`!for` clause is immediately evaluated, and the +:term:`iterator` is immediately created for that iterable, so that an error +produced while creating the iterator will be emitted at the point where the generator expression is defined, rather than at the point where the first value is retrieved. Subsequent :keyword:`!for` clauses and any filter condition in the leftmost :keyword:`!for` clause cannot be evaluated in the enclosing scope as they may @@ -625,8 +625,10 @@ is already executing raises a :exc:`ValueError` exception. .. method:: generator.close() - Raises a :exc:`GeneratorExit` at the point where the generator function was - paused. If the generator function catches the exception and returns a + Raises a :exc:`GeneratorExit` exception at the point where the generator + function was paused (equivalent to calling ``throw(GeneratorExit)``). + The exception is raised by the yield expression where the generator was paused. + If the generator function catches the exception and returns a value, this value is returned from :meth:`close`. If the generator function is already closed, or raises :exc:`GeneratorExit` (by not catching the exception), :meth:`close` returns :const:`None`. 
If the generator yields a @@ -1023,7 +1025,7 @@ series of :term:`arguments <argument>`: : ["," `keywords_arguments`] : | `starred_and_keywords` ["," `keywords_arguments`] : | `keywords_arguments` - positional_arguments: positional_item ("," positional_item)* + positional_arguments: `positional_item` ("," `positional_item`)* positional_item: `assignment_expression` | "*" `expression` starred_and_keywords: ("*" `expression` | `keyword_item`) : ("," "*" `expression` | "," `keyword_item`)* @@ -1928,7 +1930,7 @@ Expression lists single: , (comma); expression list .. productionlist:: python-grammar - starred_expression: ["*"] `or_expr` + starred_expression: "*" `or_expr` | `expression` flexible_expression: `assignment_expression` | `starred_expression` flexible_expression_list: `flexible_expression` ("," `flexible_expression`)* [","] starred_expression_list: `starred_expression` ("," `starred_expression`)* [","] diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst index b9cca4444c9..55c148801d8 100644 --- a/Doc/reference/grammar.rst +++ b/Doc/reference/grammar.rst @@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`). The version here omits details related to code generation and error recovery. -The notation is a mixture of `EBNF -<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_ -and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_. -In particular, ``&`` followed by a symbol, token or parenthesized -group indicates a positive lookahead (i.e., is required to match but -not consumed), while ``!`` indicates a negative lookahead (i.e., is -required *not* to match). We use the ``|`` separator to mean PEG's -"ordered choice" (written as ``/`` in traditional PEG grammars). See -:pep:`617` for more details on the grammar's syntax. +The notation used here is the same as in the preceding docs, +and is described in the :ref:`notation <notation>` section, +except for a few extra complications: + +* ``&e``: a positive lookahead (that is, ``e`` is required to match but + not consumed) +* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match) +* ``~`` ("cut"): commit to the current alternative and fail the rule + even if this fails to parse .. literalinclude:: ../../Grammar/python.gram :language: peg diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst index b7b70e6be5a..444acac374a 100644 --- a/Doc/reference/introduction.rst +++ b/Doc/reference/introduction.rst @@ -90,44 +90,122 @@ Notation .. index:: BNF, grammar, syntax, notation -The descriptions of lexical analysis and syntax use a modified -`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar -notation. This uses the following style of definition: - -.. productionlist:: notation - name: `lc_letter` (`lc_letter` | "_")* - lc_letter: "a"..."z" - -The first line says that a ``name`` is an ``lc_letter`` followed by a sequence -of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is -any of the single characters ``'a'`` through ``'z'``. (This rule is actually -adhered to for the names defined in lexical and grammar rules in this document.) - -Each rule begins with a name (which is the name defined by the rule) and -``::=``. A vertical bar (``|``) is used to separate alternatives; it is the -least binding operator in this notation. 
A star (``*``) means zero or more -repetitions of the preceding item; likewise, a plus (``+``) means one or more -repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or -one occurrences (in other words, the enclosed phrase is optional). The ``*`` -and ``+`` operators bind as tightly as possible; parentheses are used for -grouping. Literal strings are enclosed in quotes. White space is only -meaningful to separate tokens. Rules are normally contained on a single line; -rules with many alternatives may be formatted alternatively with each line after -the first beginning with a vertical bar. -.. index:: lexical definitions, ASCII -In lexical definitions (as the example above), two more conventions are used: -Two literal characters separated by three dots mean a choice of any single -character in the given (inclusive) range of ASCII characters. A phrase between -angular brackets (``<...>``) gives an informal description of the symbol -defined; e.g., this could be used to describe the notion of 'control character' -if needed. -Even though the notation used is almost the same, there is a big difference -between the meaning of lexical and syntactic definitions: a lexical definition -operates on the individual characters of the input source, while a syntax -definition operates on the stream of tokens generated by the lexical analysis. -All uses of BNF in the next chapter ("Lexical Analysis") are lexical -definitions; uses in subsequent chapters are syntactic definitions. - +The descriptions of lexical analysis and syntax use a grammar notation that +is a mixture of +`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_ +and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_. +For example: + +.. grammar-snippet:: + :group: notation + + name: `letter` (`letter` | `digit` | "_")* + letter: "a"..."z" | "A"..."Z" + digit: "0"..."9" + +In this example, the first line says that a ``name`` is a ``letter`` followed +by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores. +A ``letter`` in turn is any of the single characters ``'a'`` through +``'z'`` and ``'A'`` through ``'Z'``; a ``digit`` is a single character from ``'0'`` +to ``'9'``. + +Each rule begins with a name (which identifies the rule that's being defined) +followed by a colon, ``:``. +The definition to the right of the colon uses the following syntax elements: + +* ``name``: A name refers to another rule. + Where possible, it is a link to the rule's definition. + + * ``TOKEN``: An uppercase name refers to a :term:`token`. + For the purposes of grammar definitions, tokens are the same as rules. + +* ``"text"``, ``'text'``: Text in single or double quotes must match literally + (without the quotes). The type of quote is chosen according to the meaning + of ``text``: + + * ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`. + * ``"case"``: A name in double quotes denotes a + :ref:`soft-keyword <soft-keywords>`. + * ``'@'``: A non-letter symbol in single quotes denotes an + :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or + :ref:`operator <operators>`. + +* ``e1 e2``: Items separated only by whitespace denote a sequence. + Here, ``e1`` must be followed by ``e2``. +* ``e1 | e2``: A vertical bar is used to separate alternatives. + It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is + not considered. + In traditional PEG grammars, this is written as a slash, ``/``, rather than + a vertical bar. 
+ See :pep:`617` for more background and details. +* ``e*``: A star means zero or more repetitions of the preceding item. +* ``e+``: Likewise, a plus means one or more repetitions. +* ``[e]``: A phrase enclosed in square brackets means zero or + one occurrences. In other words, the enclosed phrase is optional. +* ``e?``: A question mark has exactly the same meaning as square brackets: + the preceding item is optional. +* ``(e)``: Parentheses are used for grouping. +* ``"a"..."z"``: Two literal characters separated by three dots mean a choice + of any single character in the given (inclusive) range of ASCII characters. + This notation is only used in + :ref:`lexical definitions <notation-lexical-vs-syntactic>`. +* ``<...>``: A phrase between angular brackets gives an informal description + of the matched symbol (for example, ``<any ASCII character except "\">``), + or an abbreviation that is defined in nearby text (for example, ``<Lu>``). + This notation is only used in + :ref:`lexical definitions <notation-lexical-vs-syntactic>`. + +The unary operators (``*``, ``+``, ``?``) bind as tightly as possible; +the vertical bar (``|``) binds most loosely. + +White space is only meaningful to separate tokens. + +Rules are normally contained on a single line, but rules that are too long +may be wrapped: + +.. grammar-snippet:: + :group: notation + + literal: stringliteral | bytesliteral + | integer | floatnumber | imagnumber + +Alternatively, rules may be formatted with the first line ending at the colon, +and each alternative beginning with a vertical bar on a new line. +For example: + + +.. grammar-snippet:: + :group: notation-alt + + literal: + | stringliteral + | bytesliteral + | integer + | floatnumber + | imagnumber + +This does *not* mean that there is an empty first alternative. + +.. index:: lexical definitions + +.. _notation-lexical-vs-syntactic: + +Lexical and Syntactic definitions +--------------------------------- + +There is some difference between *lexical* and *syntactic* analysis: +the :term:`lexical analyzer` operates on the individual characters of the +input source, while the *parser* (syntactic analyzer) operates on the stream +of :term:`tokens <token>` generated by the lexical analysis. +However, in some cases the exact boundary between the two phases is a +CPython implementation detail. + +The practical difference between the two is that in *lexical* definitions, +all whitespace is significant. +The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not +converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`. +*Syntactic* definitions then use these tokens, rather than source characters. + +This documentation uses the same BNF grammar for both styles of definitions. +All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions; +uses in subsequent chapters are syntactic definitions. diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index ff801a7d4fc..567c70111c2 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -35,11 +35,11 @@ Logical lines .. index:: logical line, physical line, line joining, NEWLINE token -The end of a logical line is represented by the token NEWLINE. Statements -cannot cross logical line boundaries except where NEWLINE is allowed by the -syntax (e.g., between statements in compound statements). A logical line is -constructed from one or more *physical lines* by following the explicit or -implicit *line joining* rules. 
+The end of a logical line is represented by the token :data:`~token.NEWLINE`. +Statements cannot cross logical line boundaries except where :data:`!NEWLINE` +is allowed by the syntax (e.g., between statements in compound statements). +A logical line is constructed from one or more *physical lines* by following +the explicit or implicit *line joining* rules. .. _physical-lines: @@ -99,7 +99,7 @@ which is recognized by Bram Moolenaar's VIM. If no encoding declaration is found, the default encoding is UTF-8. If the implicit or explicit encoding of a file is UTF-8, an initial UTF-8 byte-order -mark (b'\xef\xbb\xbf') is ignored rather than being a syntax error. +mark (``b'\xef\xbb\xbf'``) is ignored rather than being a syntax error. If an encoding is declared, the encoding name must be recognized by Python (see :ref:`standard-encodings`). The @@ -160,11 +160,12 @@ Blank lines .. index:: single: blank line A logical line that contains only spaces, tabs, formfeeds and possibly a -comment, is ignored (i.e., no NEWLINE token is generated). During interactive -input of statements, handling of a blank line may differ depending on the -implementation of the read-eval-print loop. In the standard interactive -interpreter, an entirely blank logical line (i.e. one containing not even -whitespace or a comment) terminates a multi-line statement. +comment, is ignored (i.e., no :data:`~token.NEWLINE` token is generated). +During interactive input of statements, handling of a blank line may differ +depending on the implementation of the read-eval-print loop. +In the standard interactive interpreter, an entirely blank logical line (that +is, one containing not even whitespace or a comment) terminates a multi-line +statement. .. _indentation: @@ -202,19 +203,20 @@ the space count to zero). .. index:: INDENT token, DEDENT token -The indentation levels of consecutive lines are used to generate INDENT and -DEDENT tokens, using a stack, as follows. +The indentation levels of consecutive lines are used to generate +:data:`~token.INDENT` and :data:`~token.DEDENT` tokens, using a stack, +as follows. Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and -one INDENT token is generated. If it is smaller, it *must* be one of the +one :data:`!INDENT` token is generated. If it is smaller, it *must* be one of the numbers occurring on the stack; all numbers on the stack that are larger are -popped off, and for each number popped off a DEDENT token is generated. At the -end of the file, a DEDENT token is generated for each number remaining on the -stack that is larger than zero. +popped off, and for each number popped off a :data:`!DEDENT` token is generated. +At the end of the file, a :data:`!DEDENT` token is generated for each number +remaining on the stack that is larger than zero. Here is an example of a correctly (though confusingly) indented piece of Python code:: @@ -254,8 +256,18 @@ Whitespace between tokens Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed can be used interchangeably to separate tokens. 
Whitespace is needed between two tokens only if their concatenation -could otherwise be interpreted as a different token (e.g., ab is one token, but -a b is two tokens). +could otherwise be interpreted as a different token. For example, ``ab`` is one +token, but ``a b`` is two tokens. However, ``+a`` and ``+ a`` both produce +two tokens, ``+`` and ``a``, as ``+a`` is not a valid token. + + +.. _endmarker-token: + +End marker +---------- + +At the end of non-interactive input, the lexical analyzer generates an +:data:`~token.ENDMARKER` token. .. _other-tokens: @@ -263,67 +275,94 @@ a b is two tokens). Other tokens ============ -Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist: -*identifiers*, *keywords*, *literals*, *operators*, and *delimiters*. Whitespace -characters (other than line terminators, discussed earlier) are not tokens, but -serve to delimit tokens. Where ambiguity exists, a token comprises the longest -possible string that forms a legal token, when read from left to right. +Besides :data:`~token.NEWLINE`, :data:`~token.INDENT` and :data:`~token.DEDENT`, +the following categories of tokens exist: +*identifiers* and *keywords* (:data:`~token.NAME`), *literals* (such as +:data:`~token.NUMBER` and :data:`~token.STRING`), and other symbols +(*operators* and *delimiters*, :data:`~token.OP`). +Whitespace characters (other than logical line terminators, discussed earlier) +are not tokens, but serve to delimit tokens. +Where ambiguity exists, a token comprises the longest possible string that +forms a legal token, when read from left to right. .. _identifiers: -Identifiers and keywords -======================== +Names (identifiers and keywords) +================================ .. index:: identifier, name -Identifiers (also referred to as *names*) are described by the following lexical -definitions. +:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and +*soft keywords*. -The syntax of identifiers in Python is based on the Unicode standard annex -UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for -further details. - -Within the ASCII range (U+0001..U+007F), the valid characters for identifiers -include the uppercase and lowercase letters ``A`` through -``Z``, the underscore ``_`` and, except for the first character, the digits +Within the ASCII range (U+0001..U+007F), the valid characters for names +include the uppercase and lowercase letters (``A-Z`` and ``a-z``), +the underscore ``_`` and, except for the first character, the digits ``0`` through ``9``. -Python 3.0 introduced additional characters from outside the ASCII range (see -:pep:`3131`). For these characters, the classification uses the version of the -Unicode Character Database as included in the :mod:`unicodedata` module. -Identifiers are unlimited in length. Case is significant. +Names must contain at least one character, but have no upper length limit. +Case is significant. -.. 
productionlist:: python-grammar - identifier: `xid_start` `xid_continue`* - id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property> - id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property> - xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*"> - xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*"> - -The Unicode category codes mentioned above stand for: - -* *Lu* - uppercase letters -* *Ll* - lowercase letters -* *Lt* - titlecase letters -* *Lm* - modifier letters -* *Lo* - other letters -* *Nl* - letter numbers -* *Mn* - nonspacing marks -* *Mc* - spacing combining marks -* *Nd* - decimal numbers -* *Pc* - connector punctuations -* *Other_ID_Start* - explicit list of characters in `PropList.txt - <https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards - compatibility -* *Other_ID_Continue* - likewise - -All identifiers are converted into the normal form NFKC while parsing; comparison -of identifiers is based on NFKC. - -A non-normative HTML file listing all valid identifier characters for Unicode -16.0.0 can be found at -https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt +Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like" +and "number-like" characters from outside the ASCII range, as detailed below. + +All identifiers are converted into the `normalization form`_ NFKC while +parsing; comparison of identifiers is based on NFKC. + +Formally, the first character of a normalized identifier must belong to the +set ``id_start``, which is the union of: + +* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``) +* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``) +* Unicode category ``<Lt>`` - titlecase letters +* Unicode category ``<Lm>`` - modifier letters +* Unicode category ``<Lo>`` - other letters +* Unicode category ``<Nl>`` - letter numbers +* {``"_"``} - the underscore +* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_ + to support backwards compatibility + +The remaining characters must belong to the set ``id_continue``, which is the +union of: + +* all characters in ``id_start`` +* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``) +* Unicode category ``<Pc>`` - connector punctuations +* Unicode category ``<Mn>`` - nonspacing marks +* Unicode category ``<Mc>`` - spacing combining marks +* ``<Other_ID_Continue>`` - another explicit set of characters in + `PropList.txt`_ to support backwards compatibility + +Unicode categories use the version of the Unicode Character Database as +included in the :mod:`unicodedata` module. + +These sets are based on the Unicode standard annex `UAX-31`_. +See also :pep:`3131` for further details. + +Even more formally, names are described by the following lexical definitions: + +.. 
grammar-snippet:: + :group: python-grammar + + NAME: `xid_start` `xid_continue`* + id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start> + id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue> + xid_start: <all characters in `id_start` whose NFKC normalization is + in (`id_start` `xid_continue`*)> + xid_continue: <all characters in `id_continue` whose NFKC normalization is + in (`id_continue`*)> + identifier: <`NAME`, except keywords> + +A non-normative listing of all valid identifier characters as defined by +Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode +Character Database. + + +.. _UAX-31: https://www.unicode.org/reports/tr31/ +.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt +.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt +.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms .. _keywords: @@ -335,7 +374,7 @@ Keywords single: keyword single: reserved word -The following identifiers are used as reserved words, or *keywords* of the +The following names are used as reserved words, or *keywords* of the language, and cannot be used as ordinary identifiers. They must be spelled exactly as written here: @@ -359,18 +398,19 @@ Soft Keywords .. versionadded:: 3.10 -Some identifiers are only reserved under specific contexts. These are known as -*soft keywords*. The identifiers ``match``, ``case``, ``type`` and ``_`` can -syntactically act as keywords in certain contexts, +Some names are only reserved under specific contexts. These are known as +*soft keywords*: + +- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement. +- ``type``, when used in the :keyword:`type` statement. + +These syntactically act as keywords in their specific contexts, but this distinction is done at the parser level, not when tokenizing. As soft keywords, their use in the grammar is possible while still preserving compatibility with existing code that uses these names as identifier names. -``match``, ``case``, and ``_`` are used in the :keyword:`match` statement. -``type`` is used in the :keyword:`type` statement. - .. versionchanged:: 3.12 ``type`` is now a soft keyword. @@ -449,8 +489,9 @@ String literals are described by the following lexical definitions: .. productionlist:: python-grammar stringliteral: [`stringprefix`](`shortstring` | `longstring`) - stringprefix: "r" | "u" | "R" | "U" | "f" | "F" + stringprefix: "r" | "u" | "R" | "U" | "f" | "F" | "t" | "T" : | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF" + : | "tr" | "Tr" | "tR" | "TR" | "rt" | "rT" | "Rt" | "RT" shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"' longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""' shortstringitem: `shortstringchar` | `stringescapeseq` @@ -881,11 +922,20 @@ Numeric literals floating-point literal, hexadecimal literal octal literal, binary literal, decimal literal, imaginary literal, complex literal -There are three types of numeric literals: integers, floating-point numbers, and -imaginary numbers. There are no complex literals (complex numbers can be formed -by adding a real number and an imaginary number). +:data:`~token.NUMBER` tokens represent numeric literals, of which there are +three types: integers, floating-point numbers, and imaginary numbers. + +.. 
grammar-snippet:: + :group: python-grammar + + NUMBER: `integer` | `floatnumber` | `imagnumber` -Note that numeric literals do not include a sign; a phrase like ``-1`` is +The numeric value of a numeric literal is the same as if it were passed as a +string to the :class:`int`, :class:`float` or :class:`complex` class +constructor, respectively. +Note that not all valid inputs for those constructors are also valid literals. + +Numeric literals do not include a sign; a phrase like ``-1`` is actually an expression composed of the unary operator '``-``' and the literal ``1``. @@ -899,38 +949,67 @@ actually an expression composed of the unary operator '``-``' and the literal .. _integers: Integer literals ----------------- +^^^^^^^^^^^^^^^^ -Integer literals are described by the following lexical definitions: +Integer literals denote whole numbers. For example:: -.. productionlist:: python-grammar - integer: `decinteger` | `bininteger` | `octinteger` | `hexinteger` - decinteger: `nonzerodigit` (["_"] `digit`)* | "0"+ (["_"] "0")* - bininteger: "0" ("b" | "B") (["_"] `bindigit`)+ - octinteger: "0" ("o" | "O") (["_"] `octdigit`)+ - hexinteger: "0" ("x" | "X") (["_"] `hexdigit`)+ - nonzerodigit: "1"..."9" - digit: "0"..."9" - bindigit: "0" | "1" - octdigit: "0"..."7" - hexdigit: `digit` | "a"..."f" | "A"..."F" + 7 + 3 + 2147483647 There is no limit for the length of integer literals apart from what can be -stored in available memory. +stored in available memory:: + + 7922816251426433759354395033679228162514264337593543950336 + +Underscores can be used to group digits for enhanced readability, +and are ignored for determining the numeric value of the literal. +For example, the following literals are equivalent:: + + 100_000_000_000 + 100000000000 + 1_00_00_00_00_000 -Underscores are ignored for determining the numeric value of the literal. They -can be used to group digits for enhanced readability. One underscore can occur -between digits, and after base specifiers like ``0x``. +Underscores can only occur between digits. +For example, ``_123``, ``321_``, and ``123__321`` are *not* valid literals. -Note that leading zeros in a non-zero decimal number are not allowed. This is -for disambiguation with C-style octal literals, which Python used before version -3.0. +Integers can be specified in binary (base 2), octal (base 8), or hexadecimal +(base 16) using the prefixes ``0b``, ``0o`` and ``0x``, respectively. +Hexadecimal digits 10 through 15 are represented by letters ``A``-``F``, +case-insensitive. For example:: -Some examples of integer literals:: + 0b100110111 + 0b_1110_0101 + 0o177 + 0o377 + 0xdeadbeef + 0xDead_Beef - 7 2147483647 0o177 0b100110111 - 3 79228162514264337593543950336 0o377 0xdeadbeef - 100_000_000_000 0b_1110_0101 +An underscore can follow the base specifier. +For example, ``0x_1f`` is a valid literal, but ``0_x1f`` and ``0x__1f`` are +not. + +Leading zeros in a non-zero decimal number are not allowed. +For example, ``0123`` is not a valid literal. +This is for disambiguation with C-style octal literals, which Python used +before version 3.0. + +Formally, integer literals are described by the following lexical definitions: + +.. 
grammar-snippet:: + :group: python-grammar + + integer: `decinteger` | `bininteger` | `octinteger` | `hexinteger` | `zerointeger` + decinteger: `nonzerodigit` (["_"] `digit`)* + bininteger: "0" ("b" | "B") (["_"] `bindigit`)+ + octinteger: "0" ("o" | "O") (["_"] `octdigit`)+ + hexinteger: "0" ("x" | "X") (["_"] `hexdigit`)+ + zerointeger: "0"+ (["_"] "0")* + nonzerodigit: "1"..."9" + digit: "0"..."9" + bindigit: "0" | "1" + octdigit: "0"..."7" + hexdigit: `digit` | "a"..."f" | "A"..."F" .. versionchanged:: 3.6 Underscores are now allowed for grouping purposes in literals. @@ -943,26 +1022,58 @@ Some examples of integer literals:: .. _floating: Floating-point literals ------------------------ +^^^^^^^^^^^^^^^^^^^^^^^ -Floating-point literals are described by the following lexical definitions: +Floating-point (float) literals, such as ``3.14`` or ``1.5``, denote +:ref:`approximations of real numbers <datamodel-float>`. -.. productionlist:: python-grammar - floatnumber: `pointfloat` | `exponentfloat` - pointfloat: [`digitpart`] `fraction` | `digitpart` "." - exponentfloat: (`digitpart` | `pointfloat`) `exponent` - digitpart: `digit` (["_"] `digit`)* - fraction: "." `digitpart` - exponent: ("e" | "E") ["+" | "-"] `digitpart` +They consist of *integer* and *fraction* parts, each composed of decimal digits. +The parts are separated by a decimal point, ``.``:: + + 2.71828 + 4.0 + +Unlike in integer literals, leading zeros are allowed in the numeric parts. +For example, ``077.010`` is legal, and denotes the same number as ``77.010``. + +As in integer literals, single underscores may occur between digits to help +readability:: + + 96_485.332_123 + 3.14_15_93 + +Either of these parts, but not both, can be empty. For example:: + + 10. # (equivalent to 10.0) + .001 # (equivalent to 0.001) + +Optionally, the integer and fraction may be followed by an *exponent*: +the letter ``e`` or ``E``, followed by an optional sign, ``+`` or ``-``, +and a number in the same format as the integer and fraction parts. +The ``e`` or ``E`` represents "times ten raised to the power of":: + + 1.0e3 # (represents 1.0×10³, or 1000.0) + 1.166e-5 # (represents 1.166×10⁻⁵, or 0.00001166) + 6.02214076e+23 # (represents 6.02214076×10²³, or 602214076000000000000000.) + +In floats with only integer and exponent parts, the decimal point may be +omitted:: -Note that the integer and exponent parts are always interpreted using radix 10. -For example, ``077e010`` is legal, and denotes the same number as ``77e10``. The -allowed range of floating-point literals is implementation-dependent. As in -integer literals, underscores are supported for digit grouping. + 1e3 # (equivalent to 1.e3 and 1.0e3) + 0e0 # (equivalent to 0.) -Some examples of floating-point literals:: +Formally, floating-point literals are described by the following +lexical definitions: - 3.14 10. .001 1e100 3.14e-10 0e0 3.14_15_93 +.. grammar-snippet:: + :group: python-grammar + + floatnumber: + | `digitpart` "." [`digitpart`] [`exponent`] + | "." `digitpart` [`exponent`] + | `digitpart` `exponent` + digitpart: `digit` (["_"] `digit`)* + exponent: ("e" | "E") ["+" | "-"] `digitpart` .. versionchanged:: 3.6 Underscores are now allowed for grouping purposes in literals. @@ -973,20 +1084,62 @@ Some examples of floating-point literals:: .. _imaginary: Imaginary literals ------------------- +^^^^^^^^^^^^^^^^^^ -Imaginary literals are described by the following lexical definitions: +Python has :ref:`complex number <typesnumeric>` objects, but no complex +literals. +Instead, *imaginary literals* denote complex numbers with a zero +real part. + +For example, in math, the complex number 3+4.2\ *i* is written +as the real number 3 added to the imaginary number 4.2\ *i*. +Python uses a similar syntax, except the imaginary unit is written as ``j`` +rather than *i*:: + + 3+4.2j + +This is an expression composed +of the :ref:`integer literal <integers>` ``3``, +the :ref:`operator <operators>` '``+``', +and the :ref:`imaginary literal <imaginary>` ``4.2j``. +Since these are three separate tokens, whitespace is allowed between them:: + + 3 + 4.2j + +No whitespace is allowed *within* each token. +In particular, the ``j`` suffix may not be separated from the number +before it. -An imaginary literal yields a complex number with a real part of 0.0. Complex -numbers are represented as a pair of floating-point numbers and have the same -restrictions on their range. To create a complex number with a nonzero real -part, add a floating-point number to it, e.g., ``(3+4j)``. Some examples of -imaginary literals:: +The number before the ``j`` has the same syntax as a floating-point literal. +Thus, the following are valid imaginary literals:: - 3.14j 10.j 10j .001j 1e100j 3.14e-10j 3.14_15_93j + 4.2j + 3.14j + 10.j + .001j + 1e100j + 3.14e-10j + 3.14_15_93j + +Unlike in a floating-point literal, the decimal point can be omitted if the +imaginary number only has an integer part. +The number is still evaluated as a floating-point number, not an integer:: + + 10j + 0j + 1000000000000000000000000j # equivalent to 1e+24j + +The ``j`` suffix is case-insensitive. +That means you can use ``J`` instead:: + + 3.14J # equivalent to 3.14j + +Formally, imaginary literals are described by the following lexical definition: + +.. grammar-snippet:: + :group: python-grammar + + imagnumber: (`floatnumber` | `digitpart`) ("j" | "J") .. _operators:
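The token categories referenced throughout this patch (:data:`~token.NUMBER`, :data:`~token.NAME`, :data:`~token.OP`, :data:`~token.NEWLINE`, :data:`~token.ENDMARKER`) can be inspected with the standard :mod:`tokenize` module. The following is a minimal illustrative sketch; the ``show_tokens`` helper is hypothetical scaffolding, not part of this patch or the stdlib::

    import io
    import tokenize

    def show_tokens(source):
        """Print the token name and text for each token in *source*."""
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            print(tokenize.tok_name[tok.type], repr(tok.string))

    # An imaginary literal is a single NUMBER token, so ``3 + 4.2j``
    # tokenizes as NUMBER, OP, NUMBER: there is no complex literal.
    show_tokens("3 + 4.2j")

    # ``+a`` is two tokens (OP, NAME), since ``+a`` is not a valid token,
    # while an underscore-grouped literal is one NUMBER token.
    show_tokens("+a")
    show_tokens("100_000_000_000")

Each call also ends with :data:`~token.NEWLINE` and :data:`~token.ENDMARKER` tokens, matching the logical-line and end-marker behavior documented above.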