Diffstat (limited to 'Doc/reference/lexical_analysis.rst')
-rw-r--r-- Doc/reference/lexical_analysis.rst 349
1 files changed, 243 insertions, 106 deletions
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
index 001e2547fe8..567c70111c2 100644
--- a/Doc/reference/lexical_analysis.rst
+++ b/Doc/reference/lexical_analysis.rst
@@ -288,58 +288,81 @@ forms a legal token, when read from left to right.
.. _identifiers:
-Identifiers and keywords
-========================
+Names (identifiers and keywords)
+================================
.. index:: identifier, name
-Identifiers (also referred to as *names*) are described by the following lexical
-definitions.
+:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
+*soft keywords*.
-The syntax of identifiers in Python is based on the Unicode standard annex
-UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for
-further details.
-
-Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
-include the uppercase and lowercase letters ``A`` through
-``Z``, the underscore ``_`` and, except for the first character, the digits
+Within the ASCII range (U+0001..U+007F), the valid characters for names
+include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
+the underscore ``_`` and, except for the first character, the digits
``0`` through ``9``.
-Python 3.0 introduced additional characters from outside the ASCII range (see
-:pep:`3131`). For these characters, the classification uses the version of the
-Unicode Character Database as included in the :mod:`unicodedata` module.
-Identifiers are unlimited in length. Case is significant.
+Names must contain at least one character, but have no upper length limit.
+Case is significant.
-.. productionlist:: python-grammar
- identifier: `xid_start` `xid_continue`*
- id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
- id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
- xid_start: <all characters in `id_start` whose NFKC normalization is in "id_start xid_continue*">
- xid_continue: <all characters in `id_continue` whose NFKC normalization is in "id_continue*">
-
-The Unicode category codes mentioned above stand for:
-
-* *Lu* - uppercase letters
-* *Ll* - lowercase letters
-* *Lt* - titlecase letters
-* *Lm* - modifier letters
-* *Lo* - other letters
-* *Nl* - letter numbers
-* *Mn* - nonspacing marks
-* *Mc* - spacing combining marks
-* *Nd* - decimal numbers
-* *Pc* - connector punctuations
-* *Other_ID_Start* - explicit list of characters in `PropList.txt
- <https://www.unicode.org/Public/16.0.0/ucd/PropList.txt>`_ to support backwards
- compatibility
-* *Other_ID_Continue* - likewise
-
-All identifiers are converted into the normal form NFKC while parsing; comparison
-of identifiers is based on NFKC.
-
-A non-normative HTML file listing all valid identifier characters for Unicode
-16.0.0 can be found at
-https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
+Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
+and "number-like" characters from outside the ASCII range, as detailed below.
+
+All identifiers are converted into the `normalization form`_ NFKC while
+parsing; comparison of identifiers is based on NFKC.
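The effect of NFKC normalization can be observed from Python itself. The following is a minimal sketch, assuming CPython and only the standard library; the variable names are illustrative and not part of the patch:

```python
import unicodedata

# U+FB01 is the single-character ligature "fi" (general category Ll).
ligature_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", ligature_name)

# Assign to the name using its ligature spelling, then inspect the namespace.
namespace = {}
exec(f"{ligature_name} = 1", namespace)

stored_under_ascii = "file" in namespace
stored_under_ligature = ligature_name in namespace
```

The binding is stored under the normalized spelling, so both spellings refer to the same variable in source code.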
+
+Formally, the first character of a normalized identifier must belong to the
+set ``id_start``, which is the union of:
+
+* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
+* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
+* Unicode category ``<Lt>`` - titlecase letters
+* Unicode category ``<Lm>`` - modifier letters
+* Unicode category ``<Lo>`` - other letters
+* Unicode category ``<Nl>`` - letter numbers
+* {``"_"``} - the underscore
+* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
+ to support backwards compatibility
+
+The remaining characters must belong to the set ``id_continue``, which is the
+union of:
+
+* all characters in ``id_start``
+* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
+* Unicode category ``<Pc>`` - connector punctuations
+* Unicode category ``<Mn>`` - nonspacing marks
+* Unicode category ``<Mc>`` - spacing combining marks
+* ``<Other_ID_Continue>`` - another explicit set of characters in
+ `PropList.txt`_ to support backwards compatibility
+
+Unicode categories use the version of the Unicode Character Database as
+included in the :mod:`unicodedata` module.
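As a rough illustration of the two sets, the sketch below approximates membership using :func:`unicodedata.category`. It is an assumption-laden simplification: the helper names are hypothetical, and the explicit ``Other_ID_Start`` and ``Other_ID_Continue`` sets as well as the NFKC condition are deliberately ignored:

```python
import unicodedata

ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def in_id_start(ch):
    """Approximate id_start membership (ignores Other_ID_Start)."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def in_id_continue(ch):
    """Approximate id_continue membership (ignores Other_ID_Continue)."""
    return (
        in_id_start(ch)
        or unicodedata.category(ch) in ID_CONTINUE_EXTRA_CATEGORIES
    )
```

For real code, ``str.isidentifier()`` is the authoritative check; this sketch only mirrors the category lists above.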
+
+These sets are based on the Unicode standard annex `UAX-31`_.
+See also :pep:`3131` for further details.
+
+Even more formally, names are described by the following lexical definitions:
+
+.. grammar-snippet::
+ :group: python-grammar
+
+ NAME: `xid_start` `xid_continue`*
+ id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
+ id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
+ xid_start: <all characters in `id_start` whose NFKC normalization is

+ in (`id_start` `xid_continue`*)>
+ xid_continue: <all characters in `id_continue` whose NFKC normalization is
+ in (`id_continue`*)>
+ identifier: <`NAME`, except keywords>
+
+A non-normative listing of all valid identifier characters as defined by
+Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
+Character Database.
+
+
+.. _UAX-31: https://www.unicode.org/reports/tr31/
+.. _PropList.txt: https://www.unicode.org/Public/16.0.0/ucd/PropList.txt
+.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt
+.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
.. _keywords:
@@ -351,7 +374,7 @@ Keywords
single: keyword
single: reserved word
-The following identifiers are used as reserved words, or *keywords* of the
+The following names are used as reserved words, or *keywords* of the
language, and cannot be used as ordinary identifiers. They must be spelled
exactly as written here:
@@ -375,18 +398,19 @@ Soft Keywords
.. versionadded:: 3.10
-Some identifiers are only reserved under specific contexts. These are known as
-*soft keywords*. The identifiers ``match``, ``case``, ``type`` and ``_`` can
-syntactically act as keywords in certain contexts,
+Some names are only reserved under specific contexts. These are known as
+*soft keywords*:
+
+- ``match``, ``case``, and ``_``, when used in the :keyword:`match` statement.
+- ``type``, when used in the :keyword:`type` statement.
+
+These syntactically act as keywords in their specific contexts,
but this distinction is done at the parser level, not when tokenizing.
As soft keywords, their use in the grammar is possible while still
preserving compatibility with existing code that uses these names as
identifier names.
-``match``, ``case``, and ``_`` are used in the :keyword:`match` statement.
-``type`` is used in the :keyword:`type` statement.
-
.. versionchanged:: 3.12
``type`` is now a soft keyword.
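The practical difference between hard and soft keywords can be checked with a small sketch. The helper ``compiles`` is hypothetical and only wraps the built-in :func:`compile`:

```python
import keyword


def compiles(source):
    """Return True if *source* compiles as a module without a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# A hard keyword cannot be an assignment target ...
hard_keyword_assignable = compiles("if = 1")
# ... but every soft keyword remains usable as an ordinary name.
soft_keywords_assignable = compiles("match = 1\ncase = 2\n_ = 3\ntype = int")

match_is_hard_keyword = keyword.iskeyword("match")
if_is_hard_keyword = keyword.iskeyword("if")
```

On Python versions that provide it, ``keyword.issoftkeyword()`` reports the soft keywords known to that version.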
@@ -465,8 +489,9 @@ String literals are described by the following lexical definitions:
.. productionlist:: python-grammar
stringliteral: [`stringprefix`](`shortstring` | `longstring`)
- stringprefix: "r" | "u" | "R" | "U" | "f" | "F"
+ stringprefix: "r" | "u" | "R" | "U" | "f" | "F" | "t" | "T"
: | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"
+ : | "tr" | "Tr" | "tR" | "TR" | "rt" | "rT" | "Rt" | "RT"
shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"'
longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""'
shortstringitem: `shortstringchar` | `stringescapeseq`
@@ -897,11 +922,20 @@ Numeric literals
floating-point literal, hexadecimal literal
octal literal, binary literal, decimal literal, imaginary literal, complex literal
-There are three types of numeric literals: integers, floating-point numbers, and
-imaginary numbers. There are no complex literals (complex numbers can be formed
-by adding a real number and an imaginary number).
+:data:`~token.NUMBER` tokens represent numeric literals, of which there are
+three types: integers, floating-point numbers, and imaginary numbers.
+
+.. grammar-snippet::
+ :group: python-grammar
-Note that numeric literals do not include a sign; a phrase like ``-1`` is
+ NUMBER: `integer` | `floatnumber` | `imagnumber`
+
+The numeric value of a numeric literal is the same as if it were passed as a
+string to the :class:`int`, :class:`float` or :class:`complex` class
+constructor, respectively.
+Note that not all valid inputs for those constructors are also valid literals.
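A short sketch of the relationship between literals and constructors follows. The variable names are illustrative; the constructor-only inputs shown are examples, not an exhaustive list:

```python
# Each literal has the value its constructor gives for the same text.
int_matches = 1_000 == int("1_000")
float_matches = 1.5e3 == float("1.5e3")
imag_matches = 2j == complex("2j")

# Some strings are accepted by the constructors but are not literals.
padded = int(" 12 ")         # surrounding whitespace is accepted
leading_zero = int("0123")   # leading zeros are accepted in base 10

try:
    compile("0123", "<example>", "eval")
    literal_0123_is_valid = True
except SyntaxError:
    literal_0123_is_valid = False
```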
+
+Numeric literals do not include a sign; a phrase like ``-1`` is
actually an expression composed of the unary operator '``-``' and the literal
``1``.
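This can be seen in the syntax tree. The sketch below uses the standard :mod:`ast` module; the variable names are illustrative:

```python
import ast

# Parse "-1" as an expression and look at the top-level node.
node = ast.parse("-1", mode="eval").body

is_unary_minus = isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub)
operand = node.operand
```

The operand is a constant node holding the value ``1``; the minus sign lives in the surrounding unary operation.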
@@ -915,38 +949,67 @@ actually an expression composed of the unary operator '``-``' and the literal
.. _integers:
Integer literals
-----------------
+^^^^^^^^^^^^^^^^
-Integer literals are described by the following lexical definitions:
+Integer literals denote whole numbers. For example::
-.. productionlist:: python-grammar
- integer: `decinteger` | `bininteger` | `octinteger` | `hexinteger`
- decinteger: `nonzerodigit` (["_"] `digit`)* | "0"+ (["_"] "0")*
- bininteger: "0" ("b" | "B") (["_"] `bindigit`)+
- octinteger: "0" ("o" | "O") (["_"] `octdigit`)+
- hexinteger: "0" ("x" | "X") (["_"] `hexdigit`)+
- nonzerodigit: "1"..."9"
- digit: "0"..."9"
- bindigit: "0" | "1"
- octdigit: "0"..."7"
- hexdigit: `digit` | "a"..."f" | "A"..."F"
+ 7
+ 3
+ 2147483647
There is no limit for the length of integer literals apart from what can be
-stored in available memory.
+stored in available memory::
+
+ 7922816251426433759354395033679228162514264337593543950336
+
+Underscores can be used to group digits for enhanced readability,
+and are ignored for determining the numeric value of the literal.
+For example, the following literals are equivalent::
+
+ 100_000_000_000
+ 100000000000
+ 1_00_00_00_00_000
+
+Underscores can only occur between digits.
+For example, ``_123``, ``321_``, and ``123__321`` are *not* valid literals.
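These rules can be checked with a small helper. This is a sketch, and ``is_integer_literal`` is a hypothetical name, not an existing API:

```python
import ast


def is_integer_literal(source):
    """Return True if *source* parses as a single integer literal."""
    try:
        node = ast.parse(source, mode="eval").body
    except SyntaxError:
        return False
    return isinstance(node, ast.Constant) and type(node.value) is int


same_value = 100_000_000_000 == 100000000000 == 1_00_00_00_00_000
```

Note that ``_123`` does parse, but as a name rather than a number, which is why the helper checks the node type.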
+
+Integers can be specified in binary (base 2), octal (base 8), or hexadecimal
+(base 16) using the prefixes ``0b``, ``0o`` and ``0x``, respectively.
+Hexadecimal digits 10 through 15 are represented by letters ``A``-``F``,
+case-insensitive. For example::
-Underscores are ignored for determining the numeric value of the literal. They
-can be used to group digits for enhanced readability. One underscore can occur
-between digits, and after base specifiers like ``0x``.
+ 0b100110111
+ 0b_1110_0101
+ 0o177
+ 0o377
+ 0xdeadbeef
+ 0xDead_Beef
-Note that leading zeros in a non-zero decimal number are not allowed. This is
-for disambiguation with C-style octal literals, which Python used before version
-3.0.
+An underscore can follow the base specifier.
+For example, ``0x_1f`` is a valid literal, but ``0_x1f`` and ``0x__1f`` are
+not.
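For reference, the decimal values of the examples above, together with a check of the underscore placement rule. The helper ``compiles`` is a hypothetical wrapper around the built-in :func:`compile`:

```python
binary = 0b100110111            # 311
grouped_binary = 0b_1110_0101   # 229
octal_a = 0o177                 # 127
octal_b = 0o377                 # 255
hexadecimal = 0xdeadbeef        # 3735928559
mixed_case_hex = 0xDead_Beef    # same value as 0xdeadbeef
after_prefix = 0x_1f            # 31


def compiles(source):
    """Return True if *source* compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True
```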
-Some examples of integer literals::
+Leading zeros in a non-zero decimal number are not allowed.
+For example, ``0123`` is not a valid literal.
+This is for disambiguation with C-style octal literals, which Python used
+before version 3.0.
- 7 2147483647 0o177 0b100110111
- 3 79228162514264337593543950336 0o377 0xdeadbeef
- 100_000_000_000 0b_1110_0101
+Formally, integer literals are described by the following lexical definitions:
+
+.. grammar-snippet::
+ :group: python-grammar
+
+ integer: `decinteger` | `bininteger` | `octinteger` | `hexinteger` | `zerointeger`
+ decinteger: `nonzerodigit` (["_"] `digit`)*
+ bininteger: "0" ("b" | "B") (["_"] `bindigit`)+
+ octinteger: "0" ("o" | "O") (["_"] `octdigit`)+
+ hexinteger: "0" ("x" | "X") (["_"] `hexdigit`)+
+ zerointeger: "0"+ (["_"] "0")*
+ nonzerodigit: "1"..."9"
+ digit: "0"..."9"
+ bindigit: "0" | "1"
+ octdigit: "0"..."7"
+ hexdigit: `digit` | "a"..."f" | "A"..."F"
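The grammar above translates almost directly into a regular expression. The following sketch is an illustration only and not how the tokenizer is implemented; ``INTEGER`` and ``matches_integer_grammar`` are hypothetical names:

```python
import re

INTEGER = re.compile(
    r"""
      [1-9](?:_?[0-9])*          # decinteger
    | 0[bB](?:_?[01])+           # bininteger
    | 0[oO](?:_?[0-7])+          # octinteger
    | 0[xX](?:_?[0-9a-fA-F])+    # hexinteger
    | 0+(?:_?0)*                 # zerointeger
    """,
    re.VERBOSE,
)


def matches_integer_grammar(text):
    """Return True if the whole of *text* matches the integer grammar."""
    return INTEGER.fullmatch(text) is not None
```

Because ``decinteger`` must start with a nonzero digit and ``zerointeger`` contains only zeros, a string such as ``0123`` matches none of the alternatives.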
.. versionchanged:: 3.6
Underscores are now allowed for grouping purposes in literals.
@@ -959,26 +1022,58 @@ Some examples of integer literals::
.. _floating:
Floating-point literals
------------------------
+^^^^^^^^^^^^^^^^^^^^^^^
-Floating-point literals are described by the following lexical definitions:
+Floating-point (float) literals, such as ``3.14`` or ``1.5``, denote
+:ref:`approximations of real numbers <datamodel-float>`.
-.. productionlist:: python-grammar
- floatnumber: `pointfloat` | `exponentfloat`
- pointfloat: [`digitpart`] `fraction` | `digitpart` "."
- exponentfloat: (`digitpart` | `pointfloat`) `exponent`
- digitpart: `digit` (["_"] `digit`)*
- fraction: "." `digitpart`
- exponent: ("e" | "E") ["+" | "-"] `digitpart`
+They consist of *integer* and *fraction* parts, each composed of decimal digits.
+The parts are separated by a decimal point, ``.``::
+
+ 2.71828
+ 4.0
+
+Unlike in integer literals, leading zeros are allowed in the numeric parts.
+For example, ``077.010`` is legal, and denotes the same number as ``77.01``.
+
+As in integer literals, single underscores may occur between digits to help
+readability::
+
+ 96_485.332_123
+ 3.14_15_93
+
+Either of these parts, but not both, can be empty. For example::
+
+ 10. # (equivalent to 10.0)
+ .001 # (equivalent to 0.001)
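The rules for the integer and fraction parts can be verified directly. This is a sketch with illustrative variable names; ``compiles`` is a hypothetical wrapper around the built-in :func:`compile`:

```python
leading_zeros = 077.010          # leading zeros are fine in a float literal
grouped = 96_485.332_123         # underscores between digits
empty_fraction = 10.             # fraction part omitted
empty_integer = .001             # integer part omitted


def compiles(source):
    """Return True if *source* compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# A lone decimal point has neither part and is not a literal.
lone_point_is_valid = compiles(".")
```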
+
+Optionally, the integer and fraction may be followed by an *exponent*:
+the letter ``e`` or ``E``, followed by an optional sign, ``+`` or ``-``,
+and a number in the same format as the integer and fraction parts.
+The ``e`` or ``E`` represents "times ten raised to the power of"::
+
+ 1.0e3 # (represents 1.0×10³, or 1000.0)
+ 1.166e-5 # (represents 1.166×10⁻⁵, or 0.00001166)
+ 6.02214076e+23 # (represents 6.02214076×10²³, or 602214076000000000000000.)
+
+In floats with only integer and exponent parts, the decimal point may be
+omitted::
+
+ 1e3 # (equivalent to 1.e3 and 1.0e3)
+ 0e0 # (equivalent to 0.)
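A brief check of the exponent forms, using only literals from the examples above (the variable names are illustrative):

```python
kilo = 1.0e3
small = 1.166e-5
avogadro = 6.02214076e+23
no_point = 1e3
zero = 0e0

# Omitting the decimal point still produces a float, never an int.
no_point_type = type(no_point)
```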
-Note that the integer and exponent parts are always interpreted using radix 10.
-For example, ``077e010`` is legal, and denotes the same number as ``77e10``. The
-allowed range of floating-point literals is implementation-dependent. As in
-integer literals, underscores are supported for digit grouping.
+Formally, floating-point literals are described by the following
+lexical definitions:
-Some examples of floating-point literals::
+.. grammar-snippet::
+ :group: python-grammar
- 3.14 10. .001 1e100 3.14e-10 0e0 3.14_15_93
+ floatnumber:
+ | `digitpart` "." [`digitpart`] [`exponent`]
+ | "." `digitpart` [`exponent`]
+ | `digitpart` `exponent`
+ digitpart: `digit` (["_"] `digit`)*
+ exponent: ("e" | "E") ["+" | "-"] `digitpart`
.. versionchanged:: 3.6
Underscores are now allowed for grouping purposes in literals.
@@ -989,20 +1084,62 @@ Some examples of floating-point literals::
.. _imaginary:
Imaginary literals
-------------------
+^^^^^^^^^^^^^^^^^^
-Imaginary literals are described by the following lexical definitions:
+Python has :ref:`complex number <typesnumeric>` objects, but no complex
+literals.
+Instead, *imaginary literals* denote complex numbers with a zero
+real part.
-.. productionlist:: python-grammar
- imagnumber: (`floatnumber` | `digitpart`) ("j" | "J")
+For example, in math, the complex number 3+4.2\ *i* is written
+as the real number 3 added to the imaginary number 4.2\ *i*.
+Python uses a similar syntax, except the imaginary unit is written as ``j``
+rather than *i*::
+
+ 3+4.2j
+
+This is an expression composed
+of the :ref:`integer literal <integers>` ``3``,
+the :ref:`operator <operators>` '``+``',
+and the :ref:`imaginary literal <imaginary>` ``4.2j``.
+Since these are three separate tokens, whitespace is allowed between them::
+
+ 3 + 4.2j
+
+No whitespace is allowed *within* each token.
+In particular, the ``j`` suffix may not be separated from the number
+before it.
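The token boundaries can be inspected with the standard :mod:`tokenize` module. In this sketch, ``significant_tokens`` is a hypothetical helper that keeps only number, operator and name tokens:

```python
import io
import tokenize

KEPT_TYPES = {tokenize.NUMBER, tokenize.OP, tokenize.NAME}


def significant_tokens(source):
    """Return (type name, text) pairs for NUMBER, OP and NAME tokens."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type in KEPT_TYPES
    ]


spaced = significant_tokens("3 + 4.2j")
compact = significant_tokens("3+4.2j")
split_suffix = significant_tokens("4.2 j")
```

With a space before the ``j``, the tokenizer sees a float followed by a separate name, which is no longer an imaginary literal.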
+
+The number before the ``j`` has the same syntax as a floating-point literal.
+Thus, the following are valid imaginary literals::
-An imaginary literal yields a complex number with a real part of 0.0. Complex
-numbers are represented as a pair of floating-point numbers and have the same
-restrictions on their range. To create a complex number with a nonzero real
-part, add a floating-point number to it, e.g., ``(3+4j)``. Some examples of
-imaginary literals::
+ 4.2j
+ 3.14j
+ 10.j
+ .001j
+ 1e100j
+ 3.14e-10j
+ 3.14_15_93j
- 3.14j 10.j 10j .001j 1e100j 3.14e-10j 3.14_15_93j
+Unlike in a floating-point literal, the decimal point can be omitted if the
+imaginary number only has an integer part.
+The number is still evaluated as a floating-point number, not an integer::
+
+ 10j
+ 0j
+ 1000000000000000000000000j # equivalent to 1e+24j
+
+The ``j`` suffix is case-insensitive.
+That means you can use ``J`` instead::
+
+ 3.14J # equivalent to 3.14j
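A short sketch of these two points, with illustrative variable names. The large literal is written with underscores here purely to make the 24 zeros easy to count:

```python
ten_j = 10j

# The digits before the suffix are read as a float, even without a point.
imag_part_type = type(ten_j.imag)
real_part = ten_j.real

big = 1_000_000_000_000_000_000_000_000j
upper_suffix = 3.14J
```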
+
+Formally, imaginary literals are described by the following lexical definition:
+
+.. grammar-snippet::
+ :group: python-grammar
+
+ imagnumber: (`floatnumber` | `digitpart`) ("j" | "J")
.. _operators: