Python code coverage for Lib/pickletools.py

# count content
1n/a'''"Executable documentation" for the pickle module.
2n/a
3n/aExtensive comments about the pickle protocols and pickle-machine opcodes
4n/acan be found here. Some functions meant for external use:
5n/a
6n/agenops(pickle)
7n/a Generate all the opcodes in a pickle, as (opcode, arg, position) triples.
8n/a
9n/adis(pickle, out=None, memo=None, indentlevel=4)
10n/a Print a symbolic disassembly of a pickle.
11n/a'''
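# A minimal usage sketch of the two functions named in the docstring above,
# assuming the module is imported under its public name `pickletools`. The
# sample object and the variable names are illustrative only.

```python
import io
import pickle
import pickletools

data = pickle.dumps({"spam": [1, 2]}, protocol=2)

# genops() yields (OpcodeInfo, argument, byte position) triples.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# dis() writes its disassembly to `out` (sys.stdout when out is None).
listing = io.StringIO()
pickletools.dis(data, out=listing)
print(listing.getvalue())
```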
12n/a
13n/aimport codecs
14n/aimport io
15n/aimport pickle
16n/aimport re
17n/aimport sys
18n/a
19n/a__all__ = ['dis', 'genops', 'optimize']
20n/a
21n/abytes_types = pickle.bytes_types
22n/a
23n/a# Other ideas:
24n/a#
25n/a# - A pickle verifier: read a pickle and check it exhaustively for
26n/a# well-formedness. dis() does a lot of this already.
27n/a#
28n/a# - A protocol identifier: examine a pickle and return its protocol number
29n/a# (== the highest .proto attr value among all the opcodes in the pickle).
30n/a# dis() already prints this info at the end.
31n/a#
32n/a# - A pickle optimizer: for example, tuple-building code is sometimes more
33n/a# elaborate than necessary, catering for the possibility that the tuple
34n/a# is recursive. Or lots of times a PUT is generated that's never accessed
35n/a# by a later GET.
36n/a
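# A rough sketch of the "protocol identifier" idea above, built on genops().
# `highest_protocol` is a hypothetical helper, not part of the module. It
# reports the highest opcode protocol actually used, which can be lower than
# the protocol the pickler was asked for. The unused-PUT idea is what
# pickletools.optimize() addresses.

```python
import pickle
import pickletools

def highest_protocol(data):
    """Return the largest .proto value among the opcodes in a pickle."""
    return max(opcode.proto for opcode, arg, pos in pickletools.genops(data))

obj = [1, 2]
p0 = pickle.dumps(obj, protocol=0)
p2 = pickle.dumps(obj, protocol=2)

# optimize() drops PUT opcodes whose memo slot is never read back.
slim = pickletools.optimize(p2)
```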
37n/a
38n/a# "A pickle" is a program for a virtual pickle machine (PM, but more accurately
39n/a# called an unpickling machine). It's a sequence of opcodes, interpreted by the
40n/a# PM, building an arbitrarily complex Python object.
41n/a#
42n/a# For the most part, the PM is very simple: there are no looping, testing, or
43n/a# conditional instructions, no arithmetic and no function calls. Opcodes are
44n/a# executed once each, from first to last, until a STOP opcode is reached.
45n/a#
46n/a# The PM has two data areas, "the stack" and "the memo".
47n/a#
48n/a# Many opcodes push Python objects onto the stack; e.g., INT pushes a Python
49n/a# integer object on the stack, whose value is gotten from a decimal string
50n/a# literal immediately following the INT opcode in the pickle bytestream. Other
51n/a# opcodes take Python objects off the stack. The result of unpickling is
52n/a# whatever object is left on the stack when the final STOP opcode is executed.
53n/a#
54n/a# The memo is simply an array of objects, or it can be implemented as a dict
55n/a# mapping little integers to objects. The memo serves as the PM's "long term
56n/a# memory", and the little integers indexing the memo are akin to variable
57n/a# names. Some opcodes pop a stack object into the memo at a given index,
58n/a# and others push a memo object at a given index onto the stack again.
59n/a#
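# A small illustration of the memo at work, assuming protocol 2 and a list
# that is referenced twice. The opcode-name tuples below simply enumerate the
# store-like and fetch-like opcodes so the check does not depend on which
# variant the pickler picks.

```python
import pickle
import pickletools

shared = []
data = pickle.dumps([shared, shared], protocol=2)
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# The second reference to `shared` is not rebuilt; it is fetched from the memo.
stores = [n for n in names if n in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE")]
fetches = [n for n in names if n in ("GET", "BINGET", "LONG_BINGET")]
```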
60n/a# At heart, that's all the PM has. Subtleties arise for these reasons:
61n/a#
62n/a# + Object identity. Objects can be arbitrarily complex, and subobjects
63n/a# may be shared (for example, the list [a, a] refers to the same object a
64n/a# twice). It can be vital that unpickling recreate an isomorphic object
65n/a# graph, faithfully reproducing sharing.
66n/a#
67n/a# + Recursive objects. For example, after "L = []; L.append(L)", L is a
68n/a# list, and L[0] is the same list. This is related to the object identity
69n/a# point, and some sequences of pickle opcodes are subtle in order to
70n/a# get the right result in all cases.
71n/a#
72n/a# + Things pickle doesn't know everything about. Examples of things pickle
73n/a# does know everything about are Python's builtin scalar and container
74n/a# types, like ints and tuples. They generally have opcodes dedicated to
75n/a# them. For things like module references and instances of user-defined
76n/a# classes, pickle's knowledge is limited. Historically, many enhancements
77n/a# have been made to the pickle protocol in order to do a better (faster,
78n/a# and/or more compact) job on those.
79n/a#
80n/a# + Backward compatibility and micro-optimization. As explained below,
81n/a# pickle opcodes never go away, not even when better ways to do a thing
82n/a# get invented. The repertoire of the PM just keeps growing over time.
83n/a# For example, protocol 0 had two opcodes for building Python integers (INT
84n/a# and LONG), protocol 1 added three more for more-efficient pickling of short
85n/a# integers, and protocol 2 added two more for more-efficient pickling of
86n/a# long integers (before protocol 2, the only ways to pickle a Python long
87n/a# took time quadratic in the number of digits, for both pickling and
88n/a# unpickling). "Opcode bloat" isn't so much a subtlety as a source of
89n/a# wearying complication.
90n/a#
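# The object-identity and recursion points above can be checked with a short
# round trip. This is only a sketch using the default protocol; the variable
# names are illustrative.

```python
import pickle

# Shared subobject: both slots must refer to one list after unpickling.
a = [1]
pair = pickle.loads(pickle.dumps([a, a]))

# Recursive object: the list contains itself.
L = []
L.append(L)
loop = pickle.loads(pickle.dumps(L))
```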
91n/a#
92n/a# Pickle protocols:
93n/a#
94n/a# For compatibility, the meaning of a pickle opcode never changes. Instead new
95n/a# pickle opcodes get added, and each version's unpickler can handle all the
96n/a# pickle opcodes in all protocol versions to date. So old pickles continue to
97n/a# be readable forever. The pickler can generally be told to restrict itself to
98n/a# the subset of opcodes available under previous protocol versions too, so that
99n/a# users can create pickles under the current version readable by older
100n/a# versions. However, a pickle does not contain its version number embedded
101n/a# within it. If an older unpickler tries to read a pickle using a later
102n/a# protocol, the result is most likely an exception due to seeing an unknown (in
103n/a# the older unpickler) opcode.
104n/a#
105n/a# The original pickle used what's now called "protocol 0", and what was called
106n/a# "text mode" before Python 2.3. The entire pickle bytestream is made up of
107n/a# printable 7-bit ASCII characters, plus the newline character, in protocol 0.
108n/a# That's why it was called text mode. Protocol 0 is small and elegant, but
109n/a# sometimes painfully inefficient.
110n/a#
111n/a# The second major set of additions is now called "protocol 1", and was called
112n/a# "binary mode" before Python 2.3. This added many opcodes with arguments
113n/a# consisting of arbitrary bytes, including NUL bytes and unprintable "high bit"
114n/a# bytes. Binary mode pickles can be substantially smaller than equivalent
115n/a# text mode pickles, and sometimes faster too; e.g., BININT represents a 4-byte
116n/a# int as 4 bytes following the opcode, which is cheaper to unpickle than the
117n/a# (perhaps) 11-character decimal string attached to INT. Protocol 1 also added
118n/a# a number of opcodes that operate on many stack elements at once (like APPENDS
119n/a# and SETITEMS), and "shortcut" opcodes (like EMPTY_DICT and EMPTY_TUPLE).
120n/a#
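# A quick comparison of the two encodings described above, assuming a small
# int that fits in 4 bytes. Protocol 0 spells the value as decimal text;
# protocol 1 stores it as 4 raw little-endian bytes after BININT.

```python
import pickle

value = 1234567
text = pickle.dumps(value, protocol=0)    # INT opcode + decimal digits + newline
binary = pickle.dumps(value, protocol=1)  # BININT opcode + 4 raw bytes

# Protocol 0 output stays within 7-bit ASCII for this value.
is_ascii = all(byte < 128 for byte in text)
```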
121n/a# The third major set of additions came in Python 2.3, and is called "protocol
122n/a# 2". This added:
123n/a#
124n/a# - A better way to pickle instances of new-style classes (NEWOBJ).
125n/a#
126n/a# - A way for a pickle to identify its protocol (PROTO).
127n/a#
128n/a# - Time- and space- efficient pickling of long ints (LONG{1,4}).
129n/a#
130n/a# - Shortcuts for small tuples (TUPLE{1,2,3}).
131n/a#
132n/a# - Dedicated opcodes for bools (NEWTRUE, NEWFALSE).
133n/a#
134n/a# - The "extension registry", a vector of popular objects that can be pushed
135n/a# efficiently by index (EXT{1,2,4}). This is akin to the memo and GET, but
136n/a# the registry contents are predefined (there's nothing akin to the memo's
137n/a# PUT).
138n/a#
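# Several of the protocol 2 additions listed above show up in one small
# pickle. A sketch, assuming a 2-tuple holding a bool and an int too large
# for 4 bytes:

```python
import pickle
import pickletools

data = pickle.dumps((True, 2**70), protocol=2)
names = {opcode.name for opcode, arg, pos in pickletools.genops(data)}

# The PROTO opcode (0x80) and its one-byte argument lead the stream.
header = data[:2]
```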
139n/a# Another independent change with Python 2.3 is the abandonment of any
140n/a# pretense that it might be safe to load pickles received from untrusted
141n/a# parties -- no sufficient security analysis has been done to guarantee
142n/a# this and there isn't a use case that warrants the expense of such an
143n/a# analysis.
144n/a#
145n/a# To this end, all tests for __safe_for_unpickling__ or for
146n/a# copyreg.safe_constructors are removed from the unpickling code.
147n/a# References to these variables in the descriptions below are to be seen
148n/a# as describing unpickling in Python 2.2 and before.
149n/a
150n/a
151n/a# Meta-rule: Descriptions are stored in instances of descriptor objects,
152n/a# with plain constructors. No meta-language is defined from which
153n/a# descriptors could be constructed. If you want, e.g., XML, write a little
154n/a# program to generate XML from the objects.
155n/a
156n/a##############################################################################
157n/a# Some pickle opcodes have an argument, following the opcode in the
158n/a# bytestream. An argument is of a specific type, described by an instance
159n/a# of ArgumentDescriptor. These are not to be confused with arguments taken
160n/a# off the stack -- ArgumentDescriptor applies only to arguments embedded in
161n/a# the opcode stream, immediately following an opcode.
162n/a
163n/a# Represents the number of bytes consumed by an argument delimited by the
164n/a# next newline character.
165n/aUP_TO_NEWLINE = -1
166n/a
167n/a# Represents the number of bytes consumed by a two-argument opcode where
168n/a# the first argument gives the number of bytes in the second argument.
169n/aTAKEN_FROM_ARGUMENT1 = -2 # num bytes is 1-byte unsigned int
170n/aTAKEN_FROM_ARGUMENT4 = -3 # num bytes is 4-byte signed little-endian int
171n/aTAKEN_FROM_ARGUMENT4U = -4 # num bytes is 4-byte unsigned little-endian int
172n/aTAKEN_FROM_ARGUMENT8U = -5 # num bytes is 8-byte unsigned little-endian int
173n/a
174n/aclass ArgumentDescriptor(object):
175n/a __slots__ = (
176n/a # name of descriptor record, also a module global name; a string
177n/a 'name',
178n/a
179n/a # length of argument, in bytes; an int; UP_TO_NEWLINE and
180n/a # TAKEN_FROM_ARGUMENT{1,4,4U,8U} are negative values for variable-length
181n/a # cases
182n/a 'n',
183n/a
184n/a # a function taking a file-like object, reading this kind of argument
185n/a # from the object at the current position, advancing the current
186n/a # position by n bytes, and returning the value of the argument
187n/a 'reader',
188n/a
189n/a # human-readable docs for this arg descriptor; a string
190n/a 'doc',
191n/a )
192n/a
193n/a def __init__(self, name, n, reader, doc):
194n/a assert isinstance(name, str)
195n/a self.name = name
196n/a
197n/a assert isinstance(n, int) and (n >= 0 or
198n/a n in (UP_TO_NEWLINE,
199n/a TAKEN_FROM_ARGUMENT1,
200n/a TAKEN_FROM_ARGUMENT4,
201n/a TAKEN_FROM_ARGUMENT4U,
202n/a TAKEN_FROM_ARGUMENT8U))
203n/a self.n = n
204n/a
205n/a self.reader = reader
206n/a
207n/a assert isinstance(doc, str)
208n/a self.doc = doc
209n/a
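# A short look at how the descriptor fields are meant to be read, using the
# descriptor instances the module exposes as globals. The byte string fed to
# the reader is made up for illustration.

```python
import io
import pickletools

# Fixed-width argument: n is the byte count.
fixed_width = pickletools.uint2.n
# Variable-width arguments use the negative sentinels.
newline_terminated = pickletools.stringnl.n   # UP_TO_NEWLINE
counted = pickletools.string1.n               # TAKEN_FROM_ARGUMENT1

stream = io.BytesIO(b"\x34\x12rest")
value = pickletools.uint2.reader(stream)      # consumes exactly two bytes
remainder = stream.read()
```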
210n/afrom struct import unpack as _unpack
211n/a
212n/adef read_uint1(f):
213n/a r"""
214n/a >>> import io
215n/a >>> read_uint1(io.BytesIO(b'\xff'))
216n/a 255
217n/a """
218n/a
219n/a data = f.read(1)
220n/a if data:
221n/a return data[0]
222n/a raise ValueError("not enough data in stream to read uint1")
223n/a
224n/auint1 = ArgumentDescriptor(
225n/a name='uint1',
226n/a n=1,
227n/a reader=read_uint1,
228n/a doc="One-byte unsigned integer.")
229n/a
230n/a
231n/adef read_uint2(f):
232n/a r"""
233n/a >>> import io
234n/a >>> read_uint2(io.BytesIO(b'\xff\x00'))
235n/a 255
236n/a >>> read_uint2(io.BytesIO(b'\xff\xff'))
237n/a 65535
238n/a """
239n/a
240n/a data = f.read(2)
241n/a if len(data) == 2:
242n/a return _unpack("<H", data)[0]
243n/a raise ValueError("not enough data in stream to read uint2")
244n/a
245n/auint2 = ArgumentDescriptor(
246n/a name='uint2',
247n/a n=2,
248n/a reader=read_uint2,
249n/a doc="Two-byte unsigned integer, little-endian.")
250n/a
251n/a
252n/adef read_int4(f):
253n/a r"""
254n/a >>> import io
255n/a >>> read_int4(io.BytesIO(b'\xff\x00\x00\x00'))
256n/a 255
257n/a >>> read_int4(io.BytesIO(b'\x00\x00\x00\x80')) == -(2**31)
258n/a True
259n/a """
260n/a
261n/a data = f.read(4)
262n/a if len(data) == 4:
263n/a return _unpack("<i", data)[0]
264n/a raise ValueError("not enough data in stream to read int4")
265n/a
266n/aint4 = ArgumentDescriptor(
267n/a name='int4',
268n/a n=4,
269n/a reader=read_int4,
270n/a doc="Four-byte signed integer, little-endian, 2's complement.")
271n/a
272n/a
273n/adef read_uint4(f):
274n/a r"""
275n/a >>> import io
276n/a >>> read_uint4(io.BytesIO(b'\xff\x00\x00\x00'))
277n/a 255
278n/a >>> read_uint4(io.BytesIO(b'\x00\x00\x00\x80')) == 2**31
279n/a True
280n/a """
281n/a
282n/a data = f.read(4)
283n/a if len(data) == 4:
284n/a return _unpack("<I", data)[0]
285n/a raise ValueError("not enough data in stream to read uint4")
286n/a
287n/auint4 = ArgumentDescriptor(
288n/a name='uint4',
289n/a n=4,
290n/a reader=read_uint4,
291n/a doc="Four-byte unsigned integer, little-endian.")
292n/a
293n/a
294n/adef read_uint8(f):
295n/a r"""
296n/a >>> import io
297n/a >>> read_uint8(io.BytesIO(b'\xff\x00\x00\x00\x00\x00\x00\x00'))
298n/a 255
299n/a >>> read_uint8(io.BytesIO(b'\xff' * 8)) == 2**64-1
300n/a True
301n/a """
302n/a
303n/a data = f.read(8)
304n/a if len(data) == 8:
305n/a return _unpack("<Q", data)[0]
306n/a raise ValueError("not enough data in stream to read uint8")
307n/a
308n/auint8 = ArgumentDescriptor(
309n/a name='uint8',
310n/a n=8,
311n/a reader=read_uint8,
312n/a doc="Eight-byte unsigned integer, little-endian.")
313n/a
314n/a
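# The fixed-width readers above agree with int.from_bytes() on little-endian
# input. A sketch contrasting the signed and unsigned 4-byte readers on the
# same bytes, plus the short-read error:

```python
import io
import pickletools

raw = b"\x00\x00\x00\x80"
signed = pickletools.read_int4(io.BytesIO(raw))
unsigned = pickletools.read_uint4(io.BytesIO(raw))

# A truncated stream raises ValueError rather than returning a partial value.
try:
    pickletools.read_uint4(io.BytesIO(b"\x01\x02"))
    short_read_failed = False
except ValueError:
    short_read_failed = True
```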
315n/adef read_stringnl(f, decode=True, stripquotes=True):
316n/a r"""
317n/a >>> import io
318n/a >>> read_stringnl(io.BytesIO(b"'abcd'\nefg\n"))
319n/a 'abcd'
320n/a
321n/a >>> read_stringnl(io.BytesIO(b"\n"))
322n/a Traceback (most recent call last):
323n/a ...
324n/a ValueError: no string quotes around b''
325n/a
326n/a >>> read_stringnl(io.BytesIO(b"\n"), stripquotes=False)
327n/a ''
328n/a
329n/a >>> read_stringnl(io.BytesIO(b"''\n"))
330n/a ''
331n/a
332n/a >>> read_stringnl(io.BytesIO(b'"abcd"'))
333n/a Traceback (most recent call last):
334n/a ...
335n/a ValueError: no newline found when trying to read stringnl
336n/a
337n/a Embedded escapes are undone in the result.
338n/a >>> read_stringnl(io.BytesIO(br"'a\n\\b\x00c\td'" + b"\n'e'"))
339n/a 'a\n\\b\x00c\td'
340n/a """
341n/a
342n/a data = f.readline()
343n/a if not data.endswith(b'\n'):
344n/a raise ValueError("no newline found when trying to read stringnl")
345n/a data = data[:-1] # lose the newline
346n/a
347n/a if stripquotes:
348n/a for q in (b'"', b"'"):
349n/a if data.startswith(q):
350n/a if not data.endswith(q):
351n/a raise ValueError("string quote %r not found at both "
352n/a "ends of %r" % (q, data))
353n/a data = data[1:-1]
354n/a break
355n/a else:
356n/a raise ValueError("no string quotes around %r" % data)
357n/a
358n/a if decode:
359n/a data = codecs.escape_decode(data)[0].decode("ascii")
360n/a return data
361n/a
362n/astringnl = ArgumentDescriptor(
363n/a name='stringnl',
364n/a n=UP_TO_NEWLINE,
365n/a reader=read_stringnl,
366n/a doc="""A newline-terminated string.
367n/a
368n/a This is a repr-style string, with embedded escapes, and
369n/a bracketing quotes.
370n/a """)
371n/a
372n/adef read_stringnl_noescape(f):
373n/a return read_stringnl(f, stripquotes=False)
374n/a
375n/astringnl_noescape = ArgumentDescriptor(
376n/a name='stringnl_noescape',
377n/a n=UP_TO_NEWLINE,
378n/a reader=read_stringnl_noescape,
379n/a doc="""A newline-terminated string.
380n/a
381n/a This is a str-style string, without embedded escapes,
382n/a or bracketing quotes. It should consist solely of
383n/a printable ASCII characters.
384n/a """)
385n/a
386n/adef read_stringnl_noescape_pair(f):
387n/a r"""
388n/a >>> import io
389n/a >>> read_stringnl_noescape_pair(io.BytesIO(b"Queue\nEmpty\njunk"))
390n/a 'Queue Empty'
391n/a """
392n/a
393n/a return "%s %s" % (read_stringnl_noescape(f), read_stringnl_noescape(f))
394n/a
395n/astringnl_noescape_pair = ArgumentDescriptor(
396n/a name='stringnl_noescape_pair',
397n/a n=UP_TO_NEWLINE,
398n/a reader=read_stringnl_noescape_pair,
399n/a doc="""A pair of newline-terminated strings.
400n/a
401n/a These are str-style strings, without embedded
402n/a escapes, or bracketing quotes. They should
403n/a consist solely of printable ASCII characters.
404n/a The pair is returned as a single string, with
405n/a a single blank separating the two strings.
406n/a """)
407n/a
408n/a
409n/adef read_string1(f):
410n/a r"""
411n/a >>> import io
412n/a >>> read_string1(io.BytesIO(b"\x00"))
413n/a ''
414n/a >>> read_string1(io.BytesIO(b"\x03abcdef"))
415n/a 'abc'
416n/a """
417n/a
418n/a n = read_uint1(f)
419n/a assert n >= 0
420n/a data = f.read(n)
421n/a if len(data) == n:
422n/a return data.decode("latin-1")
423n/a raise ValueError("expected %d bytes in a string1, but only %d remain" %
424n/a (n, len(data)))
425n/a
426n/astring1 = ArgumentDescriptor(
427n/a name="string1",
428n/a n=TAKEN_FROM_ARGUMENT1,
429n/a reader=read_string1,
430n/a doc="""A counted string.
431n/a
432n/a The first argument is a 1-byte unsigned int giving the number
433n/a of bytes in the string, and the second argument is that many
434n/a bytes.
435n/a """)
436n/a
437n/a
438n/adef read_string4(f):
439n/a r"""
440n/a >>> import io
441n/a >>> read_string4(io.BytesIO(b"\x00\x00\x00\x00abc"))
442n/a ''
443n/a >>> read_string4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
444n/a 'abc'
445n/a >>> read_string4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
446n/a Traceback (most recent call last):
447n/a ...
448n/a ValueError: expected 50331648 bytes in a string4, but only 6 remain
449n/a """
450n/a
451n/a n = read_int4(f)
452n/a if n < 0:
453n/a raise ValueError("string4 byte count < 0: %d" % n)
454n/a data = f.read(n)
455n/a if len(data) == n:
456n/a return data.decode("latin-1")
457n/a raise ValueError("expected %d bytes in a string4, but only %d remain" %
458n/a (n, len(data)))
459n/a
460n/astring4 = ArgumentDescriptor(
461n/a name="string4",
462n/a n=TAKEN_FROM_ARGUMENT4,
463n/a reader=read_string4,
464n/a doc="""A counted string.
465n/a
466n/a The first argument is a 4-byte little-endian signed int giving
467n/a the number of bytes in the string, and the second argument is
468n/a that many bytes.
469n/a """)
470n/a
471n/a
472n/adef read_bytes1(f):
473n/a r"""
474n/a >>> import io
475n/a >>> read_bytes1(io.BytesIO(b"\x00"))
476n/a b''
477n/a >>> read_bytes1(io.BytesIO(b"\x03abcdef"))
478n/a b'abc'
479n/a """
480n/a
481n/a n = read_uint1(f)
482n/a assert n >= 0
483n/a data = f.read(n)
484n/a if len(data) == n:
485n/a return data
486n/a raise ValueError("expected %d bytes in a bytes1, but only %d remain" %
487n/a (n, len(data)))
488n/a
489n/abytes1 = ArgumentDescriptor(
490n/a name="bytes1",
491n/a n=TAKEN_FROM_ARGUMENT1,
492n/a reader=read_bytes1,
493n/a doc="""A counted bytes string.
494n/a
495n/a The first argument is a 1-byte unsigned int giving the number
496n/a of bytes in the string, and the second argument is that many
497n/a bytes.
498n/a """)
499n/a
500n/a
529n/adef read_bytes4(f):
530n/a r"""
531n/a >>> import io
532n/a >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x00abc"))
533n/a b''
534n/a >>> read_bytes4(io.BytesIO(b"\x03\x00\x00\x00abcdef"))
535n/a b'abc'
536n/a >>> read_bytes4(io.BytesIO(b"\x00\x00\x00\x03abcdef"))
537n/a Traceback (most recent call last):
538n/a ...
539n/a ValueError: expected 50331648 bytes in a bytes4, but only 6 remain
540n/a """
541n/a
542n/a n = read_uint4(f)
543n/a assert n >= 0
544n/a if n > sys.maxsize:
545n/a raise ValueError("bytes4 byte count > sys.maxsize: %d" % n)
546n/a data = f.read(n)
547n/a if len(data) == n:
548n/a return data
549n/a raise ValueError("expected %d bytes in a bytes4, but only %d remain" %
550n/a (n, len(data)))
551n/a
552n/abytes4 = ArgumentDescriptor(
553n/a name="bytes4",
554n/a n=TAKEN_FROM_ARGUMENT4U,
555n/a reader=read_bytes4,
556n/a doc="""A counted bytes string.
557n/a
558n/a The first argument is a 4-byte little-endian unsigned int giving
559n/a the number of bytes, and the second argument is that many bytes.
560n/a """)
561n/a
562n/a
563n/adef read_bytes8(f):
564n/a r"""
565n/a >>> import io, struct, sys
566n/a >>> read_bytes8(io.BytesIO(b"\x00\x00\x00\x00\x00\x00\x00\x00abc"))
567n/a b''
568n/a >>> read_bytes8(io.BytesIO(b"\x03\x00\x00\x00\x00\x00\x00\x00abcdef"))
569n/a b'abc'
570n/a >>> bigsize8 = struct.pack("<Q", sys.maxsize//3)
571n/a >>> read_bytes8(io.BytesIO(bigsize8 + b"abcdef")) #doctest: +ELLIPSIS
572n/a Traceback (most recent call last):
573n/a ...
574n/a ValueError: expected ... bytes in a bytes8, but only 6 remain
575n/a """
576n/a
577n/a n = read_uint8(f)
578n/a assert n >= 0
579n/a if n > sys.maxsize:
580n/a raise ValueError("bytes8 byte count > sys.maxsize: %d" % n)
581n/a data = f.read(n)
582n/a if len(data) == n:
583n/a return data
584n/a raise ValueError("expected %d bytes in a bytes8, but only %d remain" %
585n/a (n, len(data)))
586n/a
587n/abytes8 = ArgumentDescriptor(
588n/a name="bytes8",
589n/a n=TAKEN_FROM_ARGUMENT8U,
590n/a reader=read_bytes8,
591n/a doc="""A counted bytes string.
592n/a
593n/a The first argument is an 8-byte little-endian unsigned int giving
594n/a the number of bytes, and the second argument is that many bytes.
595n/a """)
596n/a
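# The counted readers above all follow the same pattern: read a length
# prefix, then exactly that many bytes. A sketch with read_bytes4, assuming a
# hand-built payload framed with struct.pack:

```python
import io
import struct
import pickletools

payload = b"abc"
framed = struct.pack("<I", len(payload)) + payload + b"trailing"

stream = io.BytesIO(framed)
got = pickletools.read_bytes4(stream)
rest = stream.read()          # the reader stops right after the payload

# A length prefix larger than the remaining data is reported as an error.
try:
    pickletools.read_bytes4(io.BytesIO(struct.pack("<I", 10) + b"short"))
    error = None
except ValueError as exc:
    error = str(exc)
```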
597n/adef read_unicodestringnl(f):
598n/a r"""
599n/a >>> import io
600n/a >>> read_unicodestringnl(io.BytesIO(b"abc\\uabcd\njunk")) == 'abc\uabcd'
601n/a True
602n/a """
603n/a
604n/a data = f.readline()
605n/a if not data.endswith(b'\n'):
606n/a raise ValueError("no newline found when trying to read "
607n/a "unicodestringnl")
608n/a data = data[:-1] # lose the newline
609n/a return str(data, 'raw-unicode-escape')
610n/a
611n/aunicodestringnl = ArgumentDescriptor(
612n/a name='unicodestringnl',
613n/a n=UP_TO_NEWLINE,
614n/a reader=read_unicodestringnl,
615n/a doc="""A newline-terminated Unicode string.
616n/a
617n/a This is raw-unicode-escape encoded, so consists of
618n/a printable ASCII characters, and may contain embedded
619n/a escape sequences.
620n/a """)
621n/a
622n/a
623n/adef read_unicodestring1(f):
624n/a r"""
625n/a >>> import io
626n/a >>> s = 'abcd\uabcd'
627n/a >>> enc = s.encode('utf-8')
628n/a >>> enc
629n/a b'abcd\xea\xaf\x8d'
630n/a >>> n = bytes([len(enc)]) # little-endian 1-byte length
631n/a >>> t = read_unicodestring1(io.BytesIO(n + enc + b'junk'))
632n/a >>> s == t
633n/a True
634n/a
635n/a >>> read_unicodestring1(io.BytesIO(n + enc[:-1]))
636n/a Traceback (most recent call last):
637n/a ...
638n/a ValueError: expected 7 bytes in a unicodestring1, but only 6 remain
639n/a """
640n/a
641n/a n = read_uint1(f)
642n/a assert n >= 0
643n/a data = f.read(n)
644n/a if len(data) == n:
645n/a return str(data, 'utf-8', 'surrogatepass')
646n/a raise ValueError("expected %d bytes in a unicodestring1, but only %d "
647n/a "remain" % (n, len(data)))
648n/a
649n/aunicodestring1 = ArgumentDescriptor(
650n/a name="unicodestring1",
651n/a n=TAKEN_FROM_ARGUMENT1,
652n/a reader=read_unicodestring1,
653n/a doc="""A counted Unicode string.
654n/a
655n/a The first argument is a 1-byte unsigned int
656n/a giving the number of bytes in the string, and the second
657n/a argument -- the UTF-8 encoding of the Unicode string --
658n/a contains that many bytes.
659n/a """)
660n/a
661n/a
662n/adef read_unicodestring4(f):
663n/a r"""
664n/a >>> import io
665n/a >>> s = 'abcd\uabcd'
666n/a >>> enc = s.encode('utf-8')
667n/a >>> enc
668n/a b'abcd\xea\xaf\x8d'
669n/a >>> n = bytes([len(enc), 0, 0, 0]) # little-endian 4-byte length
670n/a >>> t = read_unicodestring4(io.BytesIO(n + enc + b'junk'))
671n/a >>> s == t
672n/a True
673n/a
674n/a >>> read_unicodestring4(io.BytesIO(n + enc[:-1]))
675n/a Traceback (most recent call last):
676n/a ...
677n/a ValueError: expected 7 bytes in a unicodestring4, but only 6 remain
678n/a """
679n/a
680n/a n = read_uint4(f)
681n/a assert n >= 0
682n/a if n > sys.maxsize:
683n/a raise ValueError("unicodestring4 byte count > sys.maxsize: %d" % n)
684n/a data = f.read(n)
685n/a if len(data) == n:
686n/a return str(data, 'utf-8', 'surrogatepass')
687n/a raise ValueError("expected %d bytes in a unicodestring4, but only %d "
688n/a "remain" % (n, len(data)))
689n/a
690n/aunicodestring4 = ArgumentDescriptor(
691n/a name="unicodestring4",
692n/a n=TAKEN_FROM_ARGUMENT4U,
693n/a reader=read_unicodestring4,
694n/a doc="""A counted Unicode string.
695n/a
696n/a The first argument is a 4-byte little-endian unsigned int
697n/a giving the number of bytes in the string, and the second
698n/a argument -- the UTF-8 encoding of the Unicode string --
699n/a contains that many bytes.
700n/a """)
701n/a
702n/a
703n/adef read_unicodestring8(f):
704n/a r"""
705n/a >>> import io
706n/a >>> s = 'abcd\uabcd'
707n/a >>> enc = s.encode('utf-8')
708n/a >>> enc
709n/a b'abcd\xea\xaf\x8d'
710n/a >>> n = bytes([len(enc)]) + b'\0' * 7 # little-endian 8-byte length
711n/a >>> t = read_unicodestring8(io.BytesIO(n + enc + b'junk'))
712n/a >>> s == t
713n/a True
714n/a
715n/a >>> read_unicodestring8(io.BytesIO(n + enc[:-1]))
716n/a Traceback (most recent call last):
717n/a ...
718n/a ValueError: expected 7 bytes in a unicodestring8, but only 6 remain
719n/a """
720n/a
721n/a n = read_uint8(f)
722n/a assert n >= 0
723n/a if n > sys.maxsize:
724n/a raise ValueError("unicodestring8 byte count > sys.maxsize: %d" % n)
725n/a data = f.read(n)
726n/a if len(data) == n:
727n/a return str(data, 'utf-8', 'surrogatepass')
728n/a raise ValueError("expected %d bytes in a unicodestring8, but only %d "
729n/a "remain" % (n, len(data)))
730n/a
731n/aunicodestring8 = ArgumentDescriptor(
732n/a name="unicodestring8",
733n/a n=TAKEN_FROM_ARGUMENT8U,
734n/a reader=read_unicodestring8,
735n/a doc="""A counted Unicode string.
736n/a
737n/a The first argument is an 8-byte little-endian unsigned int
738n/a giving the number of bytes in the string, and the second
739n/a argument -- the UTF-8 encoding of the Unicode string --
740n/a contains that many bytes.
741n/a """)
742n/a
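# The counted Unicode readers decode with the 'surrogatepass' error handler,
# so a lone surrogate survives the trip. A sketch, assuming a one-character
# string that strict UTF-8 would reject:

```python
import io
import pickletools

lone = "\ud800"                                  # unpaired surrogate
enc = lone.encode("utf-8", "surrogatepass")      # strict UTF-8 would refuse this
framed = bytes([len(enc)]) + enc                 # 1-byte length prefix

decoded = pickletools.read_unicodestring1(io.BytesIO(framed))
```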
743n/a
744n/adef read_decimalnl_short(f):
745n/a r"""
746n/a >>> import io
747n/a >>> read_decimalnl_short(io.BytesIO(b"1234\n56"))
748n/a 1234
749n/a
750n/a >>> read_decimalnl_short(io.BytesIO(b"1234L\n56"))
751n/a Traceback (most recent call last):
752n/a ...
753n/a ValueError: invalid literal for int() with base 10: b'1234L'
754n/a """
755n/a
756n/a s = read_stringnl(f, decode=False, stripquotes=False)
757n/a
758n/a # There's a hack for True and False here.
759n/a if s == b"00":
760n/a return False
761n/a elif s == b"01":
762n/a return True
763n/a
764n/a return int(s)
765n/a
766n/adef read_decimalnl_long(f):
767n/a r"""
768n/a >>> import io
769n/a
770n/a >>> read_decimalnl_long(io.BytesIO(b"1234L\n56"))
771n/a 1234
772n/a
773n/a >>> read_decimalnl_long(io.BytesIO(b"123456789012345678901234L\n6"))
774n/a 123456789012345678901234
775n/a """
776n/a
777n/a s = read_stringnl(f, decode=False, stripquotes=False)
778n/a if s[-1:] == b'L':
779n/a s = s[:-1]
780n/a return int(s)
781n/a
782n/a
783n/adecimalnl_short = ArgumentDescriptor(
784n/a name='decimalnl_short',
785n/a n=UP_TO_NEWLINE,
786n/a reader=read_decimalnl_short,
787n/a doc="""A newline-terminated decimal integer literal.
788n/a
789n/a This never has a trailing 'L', and the integer fit
790n/a in a short Python int on the box where the pickle
791n/a was written -- but there's no guarantee it will fit
792n/a in a short Python int on the box where the pickle
793n/a is read.
794n/a """)
795n/a
796n/adecimalnl_long = ArgumentDescriptor(
797n/a name='decimalnl_long',
798n/a n=UP_TO_NEWLINE,
799n/a reader=read_decimalnl_long,
800n/a doc="""A newline-terminated decimal integer literal.
801n/a
802n/a This has a trailing 'L', and can represent integers
803n/a of any size.
804n/a """)
805n/a
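# The True/False hack in read_decimalnl_short exists because protocol 0
# writes booleans with the INT opcode. A sketch showing both sides; the
# inputs are illustrative:

```python
import io
import pickle
import pickletools

# Protocol 0 spells True as INT followed by the two characters "01".
true_pickle = pickle.dumps(True, protocol=0)

as_bool = pickletools.read_decimalnl_short(io.BytesIO(b"01\n"))
as_int = pickletools.read_decimalnl_short(io.BytesIO(b"1\n"))
```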
806n/a
807n/adef read_floatnl(f):
808n/a r"""
809n/a >>> import io
810n/a >>> read_floatnl(io.BytesIO(b"-1.25\n6"))
811n/a -1.25
812n/a """
813n/a s = read_stringnl(f, decode=False, stripquotes=False)
814n/a return float(s)
815n/a
816n/afloatnl = ArgumentDescriptor(
817n/a name='floatnl',
818n/a n=UP_TO_NEWLINE,
819n/a reader=read_floatnl,
820n/a doc="""A newline-terminated decimal floating literal.
821n/a
822n/a In general this requires 17 significant digits for roundtrip
823n/a identity, and pickling then unpickling infinities, NaNs, and
824n/a minus zero doesn't work across boxes, or on some boxes even
825n/a on itself (e.g., Windows can't read the strings it produces
826n/a for infinities or NaNs).
827n/a """)
828n/a
829n/adef read_float8(f):
830n/a r"""
831n/a >>> import io, struct
832n/a >>> raw = struct.pack(">d", -1.25)
833n/a >>> raw
834n/a b'\xbf\xf4\x00\x00\x00\x00\x00\x00'
835n/a >>> read_float8(io.BytesIO(raw + b"\n"))
836n/a -1.25
837n/a """
838n/a
839n/a data = f.read(8)
840n/a if len(data) == 8:
841n/a return _unpack(">d", data)[0]
842n/a raise ValueError("not enough data in stream to read float8")
843n/a
844n/a
845n/afloat8 = ArgumentDescriptor(
846n/a name='float8',
847n/a n=8,
848n/a reader=read_float8,
849n/a doc="""An 8-byte binary representation of a float, big-endian.
850n/a
851n/a The format is unique to Python, and shared with the struct
852n/a module (format string '>d') "in theory" (the struct and pickle
853n/a implementations don't share the code -- they should). It's
854n/a strongly related to the IEEE-754 double format, and, in normal
855n/a cases, is in fact identical to the big-endian 754 double format.
856n/a On other boxes the dynamic range is limited to that of a 754
857n/a double, and "add a half and chop" rounding is used to reduce
858n/a the precision to 53 bits. However, even on a 754 box,
859n/a infinities, NaNs, and minus zero may not be handled correctly
860n/a (may not survive roundtrip pickling intact).
861n/a """)
862n/a
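# The two float encodings can be compared directly. A sketch, assuming a
# value that is exactly representable so the text form is short:

```python
import pickle
import struct

binary = pickle.dumps(-1.25, protocol=1)   # BINFLOAT: 8 big-endian bytes
text = pickle.dumps(-1.25, protocol=0)     # FLOAT: decimal text + newline

# On an IEEE-754 platform the 8 bytes match struct's '>d' format.
expected_binary = b"G" + struct.pack(">d", -1.25) + b"."
```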
863n/a# Protocol 2 formats
864n/a
865n/afrom pickle import decode_long
866n/a
867n/adef read_long1(f):
868n/a r"""
869n/a >>> import io
870n/a >>> read_long1(io.BytesIO(b"\x00"))
871n/a 0
872n/a >>> read_long1(io.BytesIO(b"\x02\xff\x00"))
873n/a 255
874n/a >>> read_long1(io.BytesIO(b"\x02\xff\x7f"))
875n/a 32767
876n/a >>> read_long1(io.BytesIO(b"\x02\x00\xff"))
877n/a -256
878n/a >>> read_long1(io.BytesIO(b"\x02\x00\x80"))
879n/a -32768
880n/a """
881n/a
882n/a n = read_uint1(f)
883n/a data = f.read(n)
884n/a if len(data) != n:
885n/a raise ValueError("not enough data in stream to read long1")
886n/a return decode_long(data)
887n/a
888n/along1 = ArgumentDescriptor(
889n/a name="long1",
890n/a n=TAKEN_FROM_ARGUMENT1,
891n/a reader=read_long1,
892n/a doc="""A binary long, little-endian, using 1-byte size.
893n/a
894n/a This first reads one byte as an unsigned size, then reads that
895n/a many bytes and interprets them as a little-endian 2's-complement long.
896n/a If the size is 0, that's taken as a shortcut for the int 0.
897n/a """)
898n/a
899n/adef read_long4(f):
900n/a r"""
901n/a >>> import io
902n/a >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x00"))
903n/a 255
904n/a >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\xff\x7f"))
905n/a 32767
906n/a >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\xff"))
907n/a -256
908n/a >>> read_long4(io.BytesIO(b"\x02\x00\x00\x00\x00\x80"))
909n/a -32768
910n/a >>> read_long4(io.BytesIO(b"\x00\x00\x00\x00"))
911n/a 0
912n/a """
913n/a
914n/a n = read_int4(f)
915n/a if n < 0:
916n/a raise ValueError("long4 byte count < 0: %d" % n)
917n/a data = f.read(n)
918n/a if len(data) != n:
919n/a raise ValueError("not enough data in stream to read long4")
920n/a return decode_long(data)
921n/a
922n/along4 = ArgumentDescriptor(
923n/a name="long4",
924n/a n=TAKEN_FROM_ARGUMENT4,
925n/a reader=read_long4,
926n/a doc="""A binary representation of a long, little-endian.
927n/a
928n/a This first reads four bytes as a signed size (but requires the
929n/a size to be >= 0), then reads that many bytes and interprets them
930n/a as a little-endian 2's-complement long. If the size is 0, that's taken
931n/a as a shortcut for the int 0, although LONG1 should really be used
932n/a then instead (and in any case where # of bytes < 256).
933n/a """)
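The docstring above notes that LONG1 should be preferred whenever the encoded size fits in one byte. A small sketch, assuming CPython's standard pickler at protocol 2, can show which opcode is chosen for a given integer. `long_opcode_names` is a hypothetical helper; `pickle.dumps` and `pickletools.genops` are the real APIs.

```python
import pickle
import pickletools

def long_opcode_names(value, protocol=2):
    """Return the opcode names pickle emits for `value` at `protocol`."""
    data = pickle.dumps(value, protocol=protocol)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]

small = 1 << 100    # two's-complement encoding needs far fewer than 256 bytes
large = 1 << 2400   # two's-complement encoding needs more than 255 bytes

print(long_opcode_names(small))
print(long_opcode_names(large))
```

Under these assumptions the first pickle should contain LONG1 and the second LONG4, since the size prefix of long1 cannot represent more than 255 bytes.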
934n/a
935n/a
936n/a##############################################################################
937n/a# Object descriptors. The stack used by the pickle machine holds objects,
938n/a# and in the stack_before and stack_after attributes of OpcodeInfo
939n/a# descriptors we need names to describe the various types of objects that can
940n/a# appear on the stack.
941n/a
942n/aclass StackObject(object):
943n/a __slots__ = (
944n/a # name of descriptor record, for info only
945n/a 'name',
946n/a
947n/a # type of object, or tuple of type objects (meaning the object can
948n/a # be of any type in the tuple)
949n/a 'obtype',
950n/a
951n/a # human-readable docs for this kind of stack object; a string
952n/a 'doc',
953n/a )
954n/a
955n/a def __init__(self, name, obtype, doc):
956n/a assert isinstance(name, str)
957n/a self.name = name
958n/a
959n/a assert isinstance(obtype, type) or isinstance(obtype, tuple)
960n/a if isinstance(obtype, tuple):
961n/a for contained in obtype:
962n/a assert isinstance(contained, type)
963n/a self.obtype = obtype
964n/a
965n/a assert isinstance(doc, str)
966n/a self.doc = doc
967n/a
968n/a def __repr__(self):
969n/a return self.name
970n/a
971n/a
972n/apyint = pylong = StackObject(
973n/a name='int',
974n/a obtype=int,
975n/a doc="A Python integer object.")
976n/a
977n/apyinteger_or_bool = StackObject(
978n/a name='int_or_bool',
979n/a obtype=(int, bool),
980n/a doc="A Python integer or boolean object.")
981n/a
982n/apybool = StackObject(
983n/a name='bool',
984n/a obtype=bool,
985n/a doc="A Python boolean object.")
986n/a
987n/apyfloat = StackObject(
988n/a name='float',
989n/a obtype=float,
990n/a doc="A Python float object.")
991n/a
992n/apybytes_or_str = pystring = StackObject(
993n/a name='bytes_or_str',
994n/a obtype=(bytes, str),
995n/a doc="A Python bytes or (Unicode) string object.")
996n/a
997n/apybytes = StackObject(
998n/a name='bytes',
999n/a obtype=bytes,
1000n/a doc="A Python bytes object.")
1001n/a
1002n/apyunicode = StackObject(
1003n/a name='str',
1004n/a obtype=str,
1005n/a doc="A Python (Unicode) string object.")
1006n/a
1007n/apynone = StackObject(
1008n/a name="None",
1009n/a obtype=type(None),
1010n/a doc="The Python None object.")
1011n/a
1012n/apytuple = StackObject(
1013n/a name="tuple",
1014n/a obtype=tuple,
1015n/a doc="A Python tuple object.")
1016n/a
1017n/apylist = StackObject(
1018n/a name="list",
1019n/a obtype=list,
1020n/a doc="A Python list object.")
1021n/a
1022n/apydict = StackObject(
1023n/a name="dict",
1024n/a obtype=dict,
1025n/a doc="A Python dict object.")
1026n/a
1027n/apyset = StackObject(
1028n/a name="set",
1029n/a obtype=set,
1030n/a doc="A Python set object.")
1031n/a
1032n/apyfrozenset = StackObject(
1033n/a name="frozenset",
1034n/a obtype=frozenset,
1035n/a doc="A Python frozenset object.")
1036n/a
1037n/aanyobject = StackObject(
1038n/a name='any',
1039n/a obtype=object,
1040n/a doc="Any kind of object whatsoever.")
1041n/a
1042n/amarkobject = StackObject(
1043n/a name="mark",
1044n/a obtype=StackObject,
1045n/a doc="""'The mark' is a unique object.
1046n/a
1047n/aOpcodes that operate on a variable number of objects
1048n/agenerally don't embed the count of objects in the opcode,
1049n/aor pull it off the stack. Instead the MARK opcode is used
1050n/ato push a special marker object on the stack, and then
1051n/asome other opcodes grab all the objects from the top of
1052n/athe stack down to (but not including) the topmost marker
1053n/aobject.
1054n/a""")
1055n/a
1056n/astackslice = StackObject(
1057n/a name="stackslice",
1058n/a obtype=StackObject,
1059n/a doc="""An object representing a contiguous slice of the stack.
1060n/a
1061n/aThis is used in conjunction with markobject, to represent all
1062n/aof the stack following the topmost markobject. For example,
1063n/athe POP_MARK opcode changes the stack from
1064n/a
1065n/a [..., markobject, stackslice]
1066n/ato
1067n/a [...]
1068n/a
1069n/aNo matter how many objects are on the stack after the topmost
1070n/amarkobject, POP_MARK gets rid of all of them (including the
1071n/atopmost markobject too).
1072n/a""")
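To make the markobject and stackslice descriptions above more concrete, the sketch below lists the opcodes of a small list pickle and prints the declared stack effect of APPENDS. `opcode_names` is a hypothetical helper, and the opcode sequence assumes CPython's standard pickler at protocol 1; `pickletools.genops` and `pickletools.code2op` are existing module attributes.

```python
import pickle
import pickletools

def opcode_names(obj, protocol):
    """List the opcode names in the pickle of `obj` at `protocol`."""
    data = pickle.dumps(obj, protocol=protocol)
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# At protocol 1 a multi-item list is typically built as
# EMPTY_LIST, MARK, <items...>, APPENDS: the items pushed between
# MARK and APPENDS form the stack slice.
names = opcode_names([1, 2, 3], protocol=1)
print(names)

# The declared stack effect of APPENDS, expressed with StackObject descriptors.
appends = pickletools.code2op["e"]
print(appends.stack_before, "->", appends.stack_after)
```

Because the item count is never stored in the pickle, the only way APPENDS knows where its operands begin is the position of the topmost mark.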
1073n/a
1074n/a##############################################################################
1075n/a# Descriptors for pickle opcodes.
1076n/a
1077n/aclass OpcodeInfo(object):
1078n/a
1079n/a __slots__ = (
1080n/a # symbolic name of opcode; a string
1081n/a 'name',
1082n/a
1083n/a # the code used in a bytestream to represent the opcode; a
1084n/a # one-character string
1085n/a 'code',
1086n/a
1087n/a # If the opcode has an argument embedded in the byte string, an
1088n/a # instance of ArgumentDescriptor specifying its type. Note that
1089n/a # arg.reader(s) can be used to read and decode the argument from
1090n/a # the bytestream s, and arg.doc documents the format of the raw
1091n/a # argument bytes. If the opcode doesn't have an argument embedded
1092n/a # in the bytestream, arg should be None.
1093n/a 'arg',
1094n/a
1095n/a # what the stack looks like before this opcode runs; a list
1096n/a 'stack_before',
1097n/a
1098n/a # what the stack looks like after this opcode runs; a list
1099n/a 'stack_after',
1100n/a
1101n/a # the protocol number in which this opcode was introduced; an int
1102n/a 'proto',
1103n/a
1104n/a # human-readable docs for this opcode; a string
1105n/a 'doc',
1106n/a )
1107n/a
1108n/a def __init__(self, name, code, arg,
1109n/a stack_before, stack_after, proto, doc):
1110n/a assert isinstance(name, str)
1111n/a self.name = name
1112n/a
1113n/a assert isinstance(code, str)
1114n/a assert len(code) == 1
1115n/a self.code = code
1116n/a
1117n/a assert arg is None or isinstance(arg, ArgumentDescriptor)
1118n/a self.arg = arg
1119n/a
1120n/a assert isinstance(stack_before, list)
1121n/a for x in stack_before:
1122n/a assert isinstance(x, StackObject)
1123n/a self.stack_before = stack_before
1124n/a
1125n/a assert isinstance(stack_after, list)
1126n/a for x in stack_after:
1127n/a assert isinstance(x, StackObject)
1128n/a self.stack_after = stack_after
1129n/a
1130n/a assert isinstance(proto, int) and 0 <= proto <= pickle.HIGHEST_PROTOCOL
1131n/a self.proto = proto
1132n/a
1133n/a assert isinstance(doc, str)
1134n/a self.doc = doc
1135n/a
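Since every OpcodeInfo record carries the `proto` attribute described above, the opcode table can be queried like ordinary data. The following is a minimal sketch; `opcodes_by_protocol` is a hypothetical helper, and only the public `pickletools.opcodes` list is assumed.

```python
import pickletools
from collections import defaultdict

def opcodes_by_protocol():
    """Group opcode names by the protocol that introduced them."""
    groups = defaultdict(list)
    for info in pickletools.opcodes:
        groups[info.proto].append(info.name)
    return dict(groups)

groups = opcodes_by_protocol()
for proto in sorted(groups):
    print(proto, len(groups[proto]), "opcodes")
```

The same pattern works for any other slot, for example filtering on `info.arg is None` to find opcodes that take no inline argument.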
1136n/aI = OpcodeInfo
1137n/aopcodes = [
1138n/a
1139n/a # Ways to spell integers.
1140n/a
1141n/a I(name='INT',
1142n/a code='I',
1143n/a arg=decimalnl_short,
1144n/a stack_before=[],
1145n/a stack_after=[pyinteger_or_bool],
1146n/a proto=0,
1147n/a doc="""Push an integer or bool.
1148n/a
1149n/a The argument is a newline-terminated decimal literal string.
1150n/a
1151n/a The intent may have been that this always fit in a short Python int,
1152n/a but INT can be generated in pickles written on a 64-bit box that
1153n/a require a Python long on a 32-bit box. The difference between this
1154n/a and LONG then is that INT skips a trailing 'L', and produces a short
1155n/a int whenever possible.
1156n/a
1157n/a Another difference arises because, when bool was introduced as a
1158n/a distinct type in 2.3, builtin names True and False were also added to
1159n/a 2.2.2, mapping to ints 1 and 0. For compatibility in both directions,
1160n/a True gets pickled as "I01\\n", and False as "I00\\n".
1161n/a Leading zeroes are never produced for a genuine integer. The 2.3
1162n/a (and later) unpicklers special-case these and return bool instead;
1163n/a earlier unpicklers ignore the leading "0" and return the int.
1164n/a """),
1165n/a
1166n/a I(name='BININT',
1167n/a code='J',
1168n/a arg=int4,
1169n/a stack_before=[],
1170n/a stack_after=[pyint],
1171n/a proto=1,
1172n/a doc="""Push a four-byte signed integer.
1173n/a
1174n/a This handles the full range of Python (short) integers on a 32-bit
1175n/a box, directly as binary bytes (1 for the opcode and 4 for the integer).
1176n/a If the integer is non-negative and fits in 1 or 2 bytes, pickling via
1177n/a BININT1 or BININT2 saves space.
1178n/a """),
1179n/a
1180n/a I(name='BININT1',
1181n/a code='K',
1182n/a arg=uint1,
1183n/a stack_before=[],
1184n/a stack_after=[pyint],
1185n/a proto=1,
1186n/a doc="""Push a one-byte unsigned integer.
1187n/a
1188n/a This is a space optimization for pickling very small non-negative ints,
1189n/a in range(256).
1190n/a """),
1191n/a
1192n/a I(name='BININT2',
1193n/a code='M',
1194n/a arg=uint2,
1195n/a stack_before=[],
1196n/a stack_after=[pyint],
1197n/a proto=1,
1198n/a doc="""Push a two-byte unsigned integer.
1199n/a
1200n/a This is a space optimization for pickling small positive ints, in
1201n/a range(256, 2**16). Integers in range(256) can also be pickled via
1202n/a BININT2, but BININT1 instead saves a byte.
1203n/a """),
1204n/a
1205n/a I(name='LONG',
1206n/a code='L',
1207n/a arg=decimalnl_long,
1208n/a stack_before=[],
1209n/a stack_after=[pyint],
1210n/a proto=0,
1211n/a doc="""Push a long integer.
1212n/a
1213n/a The same as INT, except that the literal ends with 'L', and always
1214n/a unpickles to a Python long. There doesn't seem to be a real purpose
1215n/a to the trailing 'L'.
1216n/a
1217n/a Note that LONG takes time quadratic in the number of digits when
1218n/a unpickling (this is simply due to the nature of decimal->binary
1219n/a conversion). Proto 2 added linear-time (in C; still quadratic-time
1220n/a in Python) LONG1 and LONG4 opcodes.
1221n/a """),
1222n/a
1223n/a I(name="LONG1",
1224n/a code='\x8a',
1225n/a arg=long1,
1226n/a stack_before=[],
1227n/a stack_after=[pyint],
1228n/a proto=2,
1229n/a doc="""Long integer using one-byte length.
1230n/a
1231n/a A more efficient encoding of a Python long; the long1 encoding
1232n/a says it all."""),
1233n/a
1234n/a I(name="LONG4",
1235n/a code='\x8b',
1236n/a arg=long4,
1237n/a stack_before=[],
1238n/a stack_after=[pyint],
1239n/a proto=2,
1240n/a doc="""Long integer using four-byte length.
1241n/a
1242n/a A more efficient encoding of a Python long; the long4 encoding
1243n/a says it all."""),
1244n/a
1245n/a # Ways to spell strings (8-bit, not Unicode).
1246n/a
1247n/a I(name='STRING',
1248n/a code='S',
1249n/a arg=stringnl,
1250n/a stack_before=[],
1251n/a stack_after=[pybytes_or_str],
1252n/a proto=0,
1253n/a doc="""Push a Python string object.
1254n/a
1255n/a The argument is a repr-style string, with bracketing quote characters,
1256n/a and perhaps embedded escapes. The argument extends until the next
1257n/a newline character. These are usually decoded into a str instance
1258n/a using the encoding given to the Unpickler constructor, or the default,
1259n/a 'ASCII'. If the encoding given was 'bytes', however, they will be
1260n/a decoded as a bytes object instead.
1261n/a """),
1262n/a
1263n/a I(name='BINSTRING',
1264n/a code='T',
1265n/a arg=string4,
1266n/a stack_before=[],
1267n/a stack_after=[pybytes_or_str],
1268n/a proto=1,
1269n/a doc="""Push a Python string object.
1270n/a
1271n/a There are two arguments: the first is a 4-byte little-endian
1272n/a signed int giving the number of bytes in the string, and the
1273n/a second is that many bytes, which are taken literally as the string
1274n/a content. These are usually decoded into a str instance using the
1275n/a encoding given to the Unpickler constructor, or the default,
1276n/a 'ASCII'. If the encoding given was 'bytes', however, they will be
1277n/a decoded as a bytes object instead.
1278n/a """),
1279n/a
1280n/a I(name='SHORT_BINSTRING',
1281n/a code='U',
1282n/a arg=string1,
1283n/a stack_before=[],
1284n/a stack_after=[pybytes_or_str],
1285n/a proto=1,
1286n/a doc="""Push a Python string object.
1287n/a
1288n/a There are two arguments: the first is a 1-byte unsigned int giving
1289n/a the number of bytes in the string, and the second is that many
1290n/a bytes, which are taken literally as the string content. These are
1291n/a usually decoded into a str instance using the encoding given to
1292n/a the Unpickler constructor, or the default, 'ASCII'. If the
1293n/a encoding given was 'bytes', however, they will be decoded as a bytes
1294n/a object instead.
1295n/a """),
1296n/a
1297n/a # Bytes (protocol 3 and higher; older protocols don't support bytes at all)
1298n/a
1299n/a I(name='BINBYTES',
1300n/a code='B',
1301n/a arg=bytes4,
1302n/a stack_before=[],
1303n/a stack_after=[pybytes],
1304n/a proto=3,
1305n/a doc="""Push a Python bytes object.
1306n/a
1307n/a There are two arguments: the first is a 4-byte little-endian unsigned int
1308n/a giving the number of bytes, and the second is that many bytes, which are
1309n/a taken literally as the bytes content.
1310n/a """),
1311n/a
1312n/a I(name='SHORT_BINBYTES',
1313n/a code='C',
1314n/a arg=bytes1,
1315n/a stack_before=[],
1316n/a stack_after=[pybytes],
1317n/a proto=3,
1318n/a doc="""Push a Python bytes object.
1319n/a
1320n/a There are two arguments: the first is a 1-byte unsigned int giving
1321n/a the number of bytes, and the second is that many bytes, which are taken
1322n/a literally as the bytes content.
1323n/a """),
1324n/a
1325n/a I(name='BINBYTES8',
1326n/a code='\x8e',
1327n/a arg=bytes8,
1328n/a stack_before=[],
1329n/a stack_after=[pybytes],
1330n/a proto=4,
1331n/a doc="""Push a Python bytes object.
1332n/a
1333n/a There are two arguments: the first is an 8-byte little-endian unsigned
1334n/a int giving the number of bytes, and the second is that many bytes,
1335n/a which are taken literally as the bytes content.
1336n/a """),
1337n/a
1338n/a # Ways to spell None.
1339n/a
1340n/a I(name='NONE',
1341n/a code='N',
1342n/a arg=None,
1343n/a stack_before=[],
1344n/a stack_after=[pynone],
1345n/a proto=0,
1346n/a doc="Push None on the stack."),
1347n/a
1348n/a # Ways to spell bools, starting with proto 2. See INT for how this was
1349n/a # done before proto 2.
1350n/a
1351n/a I(name='NEWTRUE',
1352n/a code='\x88',
1353n/a arg=None,
1354n/a stack_before=[],
1355n/a stack_after=[pybool],
1356n/a proto=2,
1357n/a doc="""True.
1358n/a
1359n/a Push True onto the stack."""),
1360n/a
1361n/a I(name='NEWFALSE',
1362n/a code='\x89',
1363n/a arg=None,
1364n/a stack_before=[],
1365n/a stack_after=[pybool],
1366n/a proto=2,
1367n/a doc="""False.
1368n/a
1369n/a Push False onto the stack."""),
1370n/a
1371n/a # Ways to spell Unicode strings.
1372n/a
1373n/a I(name='UNICODE',
1374n/a code='V',
1375n/a arg=unicodestringnl,
1376n/a stack_before=[],
1377n/a stack_after=[pyunicode],
1378n/a proto=0, # this may be pure-text, but it's a later addition
1379n/a doc="""Push a Python Unicode string object.
1380n/a
1381n/a The argument is a raw-unicode-escape encoding of a Unicode string,
1382n/a and so may contain embedded escape sequences. The argument extends
1383n/a until the next newline character.
1384n/a """),
1385n/a
1386n/a I(name='SHORT_BINUNICODE',
1387n/a code='\x8c',
1388n/a arg=unicodestring1,
1389n/a stack_before=[],
1390n/a stack_after=[pyunicode],
1391n/a proto=4,
1392n/a doc="""Push a Python Unicode string object.
1393n/a
1394n/a There are two arguments: the first is a 1-byte unsigned int
1395n/a giving the number of bytes in the string. The second is that many
1396n/a bytes, and is the UTF-8 encoding of the Unicode string.
1397n/a """),
1398n/a
1399n/a I(name='BINUNICODE',
1400n/a code='X',
1401n/a arg=unicodestring4,
1402n/a stack_before=[],
1403n/a stack_after=[pyunicode],
1404n/a proto=1,
1405n/a doc="""Push a Python Unicode string object.
1406n/a
1407n/a There are two arguments: the first is a 4-byte little-endian unsigned int
1408n/a giving the number of bytes in the string. The second is that many
1409n/a bytes, and is the UTF-8 encoding of the Unicode string.
1410n/a """),
1411n/a
1412n/a I(name='BINUNICODE8',
1413n/a code='\x8d',
1414n/a arg=unicodestring8,
1415n/a stack_before=[],
1416n/a stack_after=[pyunicode],
1417n/a proto=4,
1418n/a doc="""Push a Python Unicode string object.
1419n/a
1420n/a There are two arguments: the first is an 8-byte little-endian unsigned int
1421n/a giving the number of bytes in the string. The second is that many
1422n/a bytes, and is the UTF-8 encoding of the Unicode string.
1423n/a """),
1424n/a
1425n/a # Ways to spell floats.
1426n/a
1427n/a I(name='FLOAT',
1428n/a code='F',
1429n/a arg=floatnl,
1430n/a stack_before=[],
1431n/a stack_after=[pyfloat],
1432n/a proto=0,
1433n/a doc="""Newline-terminated decimal float literal.
1434n/a
1435n/a The argument is repr(a_float), and in general requires 17 significant
1436n/a digits for roundtrip conversion to be an identity (this is so for
1437n/a IEEE-754 double precision values, which is what Python float maps to
1438n/a on most boxes).
1439n/a
1440n/a In general, FLOAT cannot be used to transport infinities, NaNs, or
1441n/a minus zero across boxes (or even on a single box, if the platform C
1442n/a library can't read the strings it produces for such things -- Windows
1443n/a is like that), but may do less damage than BINFLOAT on boxes with
1444n/a greater precision or dynamic range than IEEE-754 double.
1445n/a """),
1446n/a
1447n/a I(name='BINFLOAT',
1448n/a code='G',
1449n/a arg=float8,
1450n/a stack_before=[],
1451n/a stack_after=[pyfloat],
1452n/a proto=1,
1453n/a doc="""Float stored in binary form, with 8 bytes of data.
1454n/a
1455n/a This generally requires less than half the space of FLOAT encoding.
1456n/a In general, BINFLOAT cannot be used to transport infinities, NaNs, or
1457n/a minus zero, raises an exception if the exponent exceeds the range of
1458n/a an IEEE-754 double, and retains no more than 53 bits of precision (if
1459n/a there are more than that, "add a half and chop" rounding is used to
1460n/a cut it back to 53 significant bits).
1461n/a """),
1462n/a
1463n/a # Ways to build lists.
1464n/a
1465n/a I(name='EMPTY_LIST',
1466n/a code=']',
1467n/a arg=None,
1468n/a stack_before=[],
1469n/a stack_after=[pylist],
1470n/a proto=1,
1471n/a doc="Push an empty list."),
1472n/a
1473n/a I(name='APPEND',
1474n/a code='a',
1475n/a arg=None,
1476n/a stack_before=[pylist, anyobject],
1477n/a stack_after=[pylist],
1478n/a proto=0,
1479n/a doc="""Append an object to a list.
1480n/a
1481n/a Stack before: ... pylist anyobject
1482n/a Stack after: ... pylist+[anyobject]
1483n/a
1484n/a although pylist is really extended in-place.
1485n/a """),
1486n/a
1487n/a I(name='APPENDS',
1488n/a code='e',
1489n/a arg=None,
1490n/a stack_before=[pylist, markobject, stackslice],
1491n/a stack_after=[pylist],
1492n/a proto=1,
1493n/a doc="""Extend a list by a slice of stack objects.
1494n/a
1495n/a Stack before: ... pylist markobject stackslice
1496n/a Stack after: ... pylist+stackslice
1497n/a
1498n/a although pylist is really extended in-place.
1499n/a """),
1500n/a
1501n/a I(name='LIST',
1502n/a code='l',
1503n/a arg=None,
1504n/a stack_before=[markobject, stackslice],
1505n/a stack_after=[pylist],
1506n/a proto=0,
1507n/a doc="""Build a list out of the topmost stack slice, after markobject.
1508n/a
1509n/a All the stack entries following the topmost markobject are placed into
1510n/a a single Python list, which single list object replaces all of the
1511n/a stack from the topmost markobject onward. For example,
1512n/a
1513n/a Stack before: ... markobject 1 2 3 'abc'
1514n/a Stack after: ... [1, 2, 3, 'abc']
1515n/a """),
1516n/a
1517n/a # Ways to build tuples.
1518n/a
1519n/a I(name='EMPTY_TUPLE',
1520n/a code=')',
1521n/a arg=None,
1522n/a stack_before=[],
1523n/a stack_after=[pytuple],
1524n/a proto=1,
1525n/a doc="Push an empty tuple."),
1526n/a
1527n/a I(name='TUPLE',
1528n/a code='t',
1529n/a arg=None,
1530n/a stack_before=[markobject, stackslice],
1531n/a stack_after=[pytuple],
1532n/a proto=0,
1533n/a doc="""Build a tuple out of the topmost stack slice, after markobject.
1534n/a
1535n/a All the stack entries following the topmost markobject are placed into
1536n/a a single Python tuple, which single tuple object replaces all of the
1537n/a stack from the topmost markobject onward. For example,
1538n/a
1539n/a Stack before: ... markobject 1 2 3 'abc'
1540n/a Stack after: ... (1, 2, 3, 'abc')
1541n/a """),
1542n/a
1543n/a I(name='TUPLE1',
1544n/a code='\x85',
1545n/a arg=None,
1546n/a stack_before=[anyobject],
1547n/a stack_after=[pytuple],
1548n/a proto=2,
1549n/a doc="""Build a one-tuple out of the topmost item on the stack.
1550n/a
1551n/a This code pops one value off the stack and pushes a tuple of
1552n/a length 1 whose one item is that value back onto it. In other
1553n/a words:
1554n/a
1555n/a stack[-1] = tuple(stack[-1:])
1556n/a """),
1557n/a
1558n/a I(name='TUPLE2',
1559n/a code='\x86',
1560n/a arg=None,
1561n/a stack_before=[anyobject, anyobject],
1562n/a stack_after=[pytuple],
1563n/a proto=2,
1564n/a doc="""Build a two-tuple out of the top two items on the stack.
1565n/a
1566n/a This code pops two values off the stack and pushes a tuple of
1567n/a length 2 whose items are those values back onto it. In other
1568n/a words:
1569n/a
1570n/a stack[-2:] = [tuple(stack[-2:])]
1571n/a """),
1572n/a
1573n/a I(name='TUPLE3',
1574n/a code='\x87',
1575n/a arg=None,
1576n/a stack_before=[anyobject, anyobject, anyobject],
1577n/a stack_after=[pytuple],
1578n/a proto=2,
1579n/a doc="""Build a three-tuple out of the top three items on the stack.
1580n/a
1581n/a This code pops three values off the stack and pushes a tuple of
1582n/a length 3 whose items are those values back onto it. In other
1583n/a words:
1584n/a
1585n/a stack[-3:] = [tuple(stack[-3:])]
1586n/a """),
1587n/a
1588n/a # Ways to build dicts.
1589n/a
1590n/a I(name='EMPTY_DICT',
1591n/a code='}',
1592n/a arg=None,
1593n/a stack_before=[],
1594n/a stack_after=[pydict],
1595n/a proto=1,
1596n/a doc="Push an empty dict."),
1597n/a
1598n/a I(name='DICT',
1599n/a code='d',
1600n/a arg=None,
1601n/a stack_before=[markobject, stackslice],
1602n/a stack_after=[pydict],
1603n/a proto=0,
1604n/a doc="""Build a dict out of the topmost stack slice, after markobject.
1605n/a
1606n/a All the stack entries following the topmost markobject are placed into
1607n/a a single Python dict, which single dict object replaces all of the
1608n/a stack from the topmost markobject onward. The stack slice alternates
1609n/a key, value, key, value, .... For example,
1610n/a
1611n/a Stack before: ... markobject 1 2 3 'abc'
1612n/a Stack after: ... {1: 2, 3: 'abc'}
1613n/a """),
1614n/a
1615n/a I(name='SETITEM',
1616n/a code='s',
1617n/a arg=None,
1618n/a stack_before=[pydict, anyobject, anyobject],
1619n/a stack_after=[pydict],
1620n/a proto=0,
1621n/a doc="""Add a key+value pair to an existing dict.
1622n/a
1623n/a Stack before: ... pydict key value
1624n/a Stack after: ... pydict
1625n/a
1626n/a where pydict has been modified via pydict[key] = value.
1627n/a """),
1628n/a
1629n/a I(name='SETITEMS',
1630n/a code='u',
1631n/a arg=None,
1632n/a stack_before=[pydict, markobject, stackslice],
1633n/a stack_after=[pydict],
1634n/a proto=1,
1635n/a doc="""Add an arbitrary number of key+value pairs to an existing dict.
1636n/a
1637n/a The slice of the stack following the topmost markobject is taken as
1638n/a an alternating sequence of keys and values, added to the dict
1639n/a immediately under the topmost markobject. Everything at and after the
1640n/a topmost markobject is popped, leaving the mutated dict at the top
1641n/a of the stack.
1642n/a
1643n/a Stack before: ... pydict markobject key_1 value_1 ... key_n value_n
1644n/a Stack after: ... pydict
1645n/a
1646n/a where pydict has been modified via pydict[key_i] = value_i for i in
1647n/a 1, 2, ..., n, and in that order.
1648n/a """),
1649n/a
1650n/a # Ways to build sets
1651n/a
1652n/a I(name='EMPTY_SET',
1653n/a code='\x8f',
1654n/a arg=None,
1655n/a stack_before=[],
1656n/a stack_after=[pyset],
1657n/a proto=4,
1658n/a doc="Push an empty set."),
1659n/a
1660n/a I(name='ADDITEMS',
1661n/a code='\x90',
1662n/a arg=None,
1663n/a stack_before=[pyset, markobject, stackslice],
1664n/a stack_after=[pyset],
1665n/a proto=4,
1666n/a doc="""Add an arbitrary number of items to an existing set.
1667n/a
1668n/a The slice of the stack following the topmost markobject is taken as
1669n/a a sequence of items, added to the set immediately under the topmost
1670n/a markobject. Everything at and after the topmost markobject is popped,
1671n/a leaving the mutated set at the top of the stack.
1672n/a
1673n/a Stack before: ... pyset markobject item_1 ... item_n
1674n/a Stack after: ... pyset
1675n/a
1676n/a where pyset has been modified via pyset.add(item_i) for i in
1677n/a 1, 2, ..., n, and in that order.
1678n/a """),
1679n/a
1680n/a # Way to build frozensets
1681n/a
1682n/a I(name='FROZENSET',
1683n/a code='\x91',
1684n/a arg=None,
1685n/a stack_before=[markobject, stackslice],
1686n/a stack_after=[pyfrozenset],
1687n/a proto=4,
1688n/a doc="""Build a frozenset out of the topmost stack slice, after markobject.
1689n/a
1690n/a All the stack entries following the topmost markobject are placed into
1691n/a a single Python frozenset, which single frozenset object replaces all
1692n/a of the stack from the topmost markobject onward. For example,
1693n/a
1694n/a Stack before: ... markobject 1 2 3
1695n/a Stack after: ... frozenset({1, 2, 3})
1696n/a """),
1697n/a
1698n/a # Stack manipulation.
1699n/a
1700n/a I(name='POP',
1701n/a code='0',
1702n/a arg=None,
1703n/a stack_before=[anyobject],
1704n/a stack_after=[],
1705n/a proto=0,
1706n/a doc="Discard the top stack item, shrinking the stack by one item."),
1707n/a
1708n/a I(name='DUP',
1709n/a code='2',
1710n/a arg=None,
1711n/a stack_before=[anyobject],
1712n/a stack_after=[anyobject, anyobject],
1713n/a proto=0,
1714n/a doc="Push the top stack item onto the stack again, duplicating it."),
1715n/a
1716n/a I(name='MARK',
1717n/a code='(',
1718n/a arg=None,
1719n/a stack_before=[],
1720n/a stack_after=[markobject],
1721n/a proto=0,
1722n/a doc="""Push markobject onto the stack.
1723n/a
1724n/a markobject is a unique object, used by other opcodes to identify a
1725n/a region of the stack containing a variable number of objects for them
1726n/a to work on. See markobject.doc for more detail.
1727n/a """),
1728n/a
1729n/a I(name='POP_MARK',
1730n/a code='1',
1731n/a arg=None,
1732n/a stack_before=[markobject, stackslice],
1733n/a stack_after=[],
1734n/a proto=1,
1735n/a doc="""Pop all the stack objects at and above the topmost markobject.
1736n/a
1737n/a When an opcode using a variable number of stack objects is done,
1738n/a POP_MARK is used to remove those objects, and to remove the markobject
1739n/a that delimited their starting position on the stack.
1740n/a """),
1741n/a
1742n/a # Memo manipulation. There are really only two operations (get and put),
1743n/a # each in all-text, "short binary", and "long binary" flavors.
1744n/a
1745n/a I(name='GET',
1746n/a code='g',
1747n/a arg=decimalnl_short,
1748n/a stack_before=[],
1749n/a stack_after=[anyobject],
1750n/a proto=0,
1751n/a doc="""Read an object from the memo and push it on the stack.
1752n/a
1753n/a The index of the memo object to push is given by the newline-terminated
1754n/a decimal string following. BINGET and LONG_BINGET are space-optimized
1755n/a versions.
1756n/a """),
1757n/a
1758n/a I(name='BINGET',
1759n/a code='h',
1760n/a arg=uint1,
1761n/a stack_before=[],
1762n/a stack_after=[anyobject],
1763n/a proto=1,
1764n/a doc="""Read an object from the memo and push it on the stack.
1765n/a
1766n/a The index of the memo object to push is given by the 1-byte unsigned
1767n/a integer following.
1768n/a """),
1769n/a
1770n/a I(name='LONG_BINGET',
1771n/a code='j',
1772n/a arg=uint4,
1773n/a stack_before=[],
1774n/a stack_after=[anyobject],
1775n/a proto=1,
1776n/a doc="""Read an object from the memo and push it on the stack.
1777n/a
1778n/a The index of the memo object to push is given by the 4-byte unsigned
1779n/a little-endian integer following.
1780n/a """),
1781n/a
1782n/a I(name='PUT',
1783n/a code='p',
1784n/a arg=decimalnl_short,
1785n/a stack_before=[],
1786n/a stack_after=[],
1787n/a proto=0,
1788n/a doc="""Store the stack top into the memo. The stack is not popped.
1789n/a
1790n/a The index of the memo location to write into is given by the newline-
1791n/a terminated decimal string following. BINPUT and LONG_BINPUT are
1792n/a space-optimized versions.
1793n/a """),
1794n/a
1795n/a I(name='BINPUT',
1796n/a code='q',
1797n/a arg=uint1,
1798n/a stack_before=[],
1799n/a stack_after=[],
1800n/a proto=1,
1801n/a doc="""Store the stack top into the memo. The stack is not popped.
1802n/a
1803n/a The index of the memo location to write into is given by the 1-byte
1804n/a unsigned integer following.
1805n/a """),
1806n/a
1807n/a I(name='LONG_BINPUT',
1808n/a code='r',
1809n/a arg=uint4,
1810n/a stack_before=[],
1811n/a stack_after=[],
1812n/a proto=1,
1813n/a doc="""Store the stack top into the memo. The stack is not popped.
1814n/a
1815n/a The index of the memo location to write into is given by the 4-byte
1816n/a unsigned little-endian integer following.
1817n/a """),
1818n/a
1819n/a I(name='MEMOIZE',
1820n/a code='\x94',
1821n/a arg=None,
1822n/a stack_before=[anyobject],
1823n/a stack_after=[anyobject],
1824n/a proto=4,
1825n/a doc="""Store the stack top into the memo. The stack is not popped.
1826n/a
1827n/a The index of the memo location to write is the number of
1828n/a elements currently present in the memo.
1829n/a """),
1830n/a
1831n/a # Access the extension registry (predefined objects). Akin to the GET
1832n/a # family.
1833n/a
1834n/a I(name='EXT1',
1835n/a code='\x82',
1836n/a arg=uint1,
1837n/a stack_before=[],
1838n/a stack_after=[anyobject],
1839n/a proto=2,
1840n/a doc="""Extension code.
1841n/a
1842n/a This code and the similar EXT2 and EXT4 allow using a registry
1843n/a of popular objects that are pickled by name, typically classes.
1844n/a It is envisioned that through a global negotiation and
1845n/a registration process, third parties can set up a mapping between
1846n/a ints and object names.
1847n/a
1848n/a In order to guarantee pickle interchangeability, the extension
1849n/a code registry ought to be global, although a range of codes may
1850n/a be reserved for private use.
1851n/a
1852n/a EXT1 has a 1-byte integer argument. This is used to index into the
1853n/a extension registry, and the object at that index is pushed on the stack.
1854n/a """),
1855n/a
1856n/a I(name='EXT2',
1857n/a code='\x83',
1858n/a arg=uint2,
1859n/a stack_before=[],
1860n/a stack_after=[anyobject],
1861n/a proto=2,
1862n/a doc="""Extension code.
1863n/a
1864n/a See EXT1. EXT2 has a two-byte integer argument.
1865n/a """),
1866n/a
1867n/a I(name='EXT4',
1868n/a code='\x84',
1869n/a arg=int4,
1870n/a stack_before=[],
1871n/a stack_after=[anyobject],
1872n/a proto=2,
1873n/a doc="""Extension code.
1874n/a
1875n/a See EXT1. EXT4 has a four-byte integer argument.
1876n/a """),
1877n/a
1878n/a # Push a class object, or module function, on the stack, via its module
1879n/a # and name.
1880n/a
1881n/a I(name='GLOBAL',
1882n/a code='c',
1883n/a arg=stringnl_noescape_pair,
1884n/a stack_before=[],
1885n/a stack_after=[anyobject],
1886n/a proto=0,
1887n/a doc="""Push a global object (module.attr) on the stack.
1888n/a
1889n/a Two newline-terminated strings follow the GLOBAL opcode. The first is
1890n/a taken as a module name, and the second as a class name. The class
1891n/a object module.class is pushed on the stack. More accurately, the
1892n/a object returned by self.find_class(module, class) is pushed on the
1893n/a stack, so unpickling subclasses can override this form of lookup.
1894n/a """),
1895n/a
1896n/a I(name='STACK_GLOBAL',
1897n/a code='\x93',
1898n/a arg=None,
1899n/a stack_before=[pyunicode, pyunicode],
1900n/a stack_after=[anyobject],
1901n/a proto=4,
1902n/a doc="""Push a global object (module.attr) on the stack.
1903n/a """),
1904n/a
1905n/a # Ways to build objects of classes pickle doesn't know about directly
1906n/a # (user-defined classes). I despair of documenting this accurately
1907n/a # and comprehensibly -- you really have to read the pickle code to
1908n/a # find all the special cases.
1909n/a
1910n/a I(name='REDUCE',
1911n/a code='R',
1912n/a arg=None,
1913n/a stack_before=[anyobject, anyobject],
1914n/a stack_after=[anyobject],
1915n/a proto=0,
1916n/a doc="""Push an object built from a callable and an argument tuple.
1917n/a
1918n/a The opcode is named as a reminder of the __reduce__() method.
1919n/a
1920n/a Stack before: ... callable pytuple
1921n/a Stack after: ... callable(*pytuple)
1922n/a
1923n/a The callable and the argument tuple are the first two items returned
1924n/a by a __reduce__ method. Applying the callable to the argtuple is
1925n/a supposed to reproduce the original object, or at least get it started.
1926n/a If the __reduce__ method returns a 3-tuple, the last component is an
1927n/a argument to be passed to the object's __setstate__, and then the REDUCE
1928n/a opcode is followed by code to create setstate's argument, and then a
1929n/a BUILD opcode to apply __setstate__ to that argument.
1930n/a
1931n/a If not isinstance(callable, type), REDUCE complains unless the
1932n/a callable has been registered with the copyreg module's
1933n/a safe_constructors dict, or the callable has a magic
1934n/a '__safe_for_unpickling__' attribute with a true value. I'm not sure
1935n/a why it does this, but I've sure seen this complaint often enough when
1936n/a I didn't want to <wink>.
1937n/a """),
1938n/a
1939n/a I(name='BUILD',
1940n/a code='b',
1941n/a arg=None,
1942n/a stack_before=[anyobject, anyobject],
1943n/a stack_after=[anyobject],
1944n/a proto=0,
1945n/a doc="""Finish building an object, via __setstate__ or dict update.
1946n/a
1947n/a Stack before: ... anyobject argument
1948n/a Stack after: ... anyobject
1949n/a
1950n/a where anyobject may have been mutated, as follows:
1951n/a
1952n/a If the object has a __setstate__ method,
1953n/a
1954n/a anyobject.__setstate__(argument)
1955n/a
1956n/a is called.
1957n/a
1958n/a Else the argument must be a dict, the object must have a __dict__, and
1959n/a the object is updated via
1960n/a
1961n/a anyobject.__dict__.update(argument)
1962n/a """),
1963n/a
1964n/a I(name='INST',
1965n/a code='i',
1966n/a arg=stringnl_noescape_pair,
1967n/a stack_before=[markobject, stackslice],
1968n/a stack_after=[anyobject],
1969n/a proto=0,
1970n/a doc="""Build a class instance.
1971n/a
1972n/a This is the protocol 0 version of protocol 1's OBJ opcode.
1973n/a INST is followed by two newline-terminated strings, giving a
1974n/a module and class name, just as for the GLOBAL opcode (and see
1975n/a GLOBAL for more details about that). self.find_class(module, name)
1976n/a is used to get a class object.
1977n/a
1978n/a In addition, all the objects on the stack following the topmost
1979n/a markobject are gathered into a tuple and popped (along with the
1980n/a topmost markobject), just as for the TUPLE opcode.
1981n/a
1982n/a Now it gets complicated. If all of these are true:
1983n/a
1984n/a + The argtuple is empty (markobject was at the top of the stack
1985n/a at the start).
1986n/a
1987n/a + The class object does not have a __getinitargs__ attribute.
1988n/a
1989n/a then we want to create an old-style class instance without invoking
1990n/a its __init__() method (pickle has waffled on this over the years; not
1991n/a calling __init__() is current wisdom). In this case, an instance of
1992n/a an old-style dummy class is created, and then we try to rebind its
1993n/a __class__ attribute to the desired class object. If this succeeds,
1994n/a the new instance object is pushed on the stack, and we're done.
1995n/a
1996n/a Else (the argtuple is not empty, it's not an old-style class object,
1997n/a or the class object does have a __getinitargs__ attribute), the code
1998n/a first insists that the class object have a __safe_for_unpickling__
1999n/a attribute. Unlike as for the __safe_for_unpickling__ check in REDUCE,
2000n/a it doesn't matter whether this attribute has a true or false value, it
2001n/a only matters whether it exists (XXX this is a bug). If
2002n/a __safe_for_unpickling__ doesn't exist, UnpicklingError is raised.
2003n/a
2004n/a Else (the class object does have a __safe_for_unpickling__ attr),
2005n/a the class object obtained from INST's arguments is applied to the
2006n/a argtuple obtained from the stack, and the resulting instance object
2007n/a is pushed on the stack.
2008n/a
2009n/a NOTE: checks for __safe_for_unpickling__ went away in Python 2.3.
2010n/a NOTE: the distinction between old-style and new-style classes does
2011n/a not make sense in Python 3.
2012n/a """),
2013n/a
2014n/a I(name='OBJ',
2015n/a code='o',
2016n/a arg=None,
2017n/a stack_before=[markobject, anyobject, stackslice],
2018n/a stack_after=[anyobject],
2019n/a proto=1,
2020n/a doc="""Build a class instance.
2021n/a
2022n/a This is the protocol 1 version of protocol 0's INST opcode, and is
2023n/a very much like it. The major difference is that the class object
2024n/a is taken off the stack, allowing it to be retrieved from the memo
2025n/a repeatedly if several instances of the same class are created. This
2026n/a can be much more efficient (in both time and space) than repeatedly
2027n/a embedding the module and class names in INST opcodes.
2028n/a
2029n/a Unlike INST, OBJ takes no arguments from the opcode stream. Instead
2030n/a the class object is taken off the stack, immediately above the
2031n/a topmost markobject:
2032n/a
2033n/a Stack before: ... markobject classobject stackslice
2034n/a Stack after: ... new_instance_object
2035n/a
2036n/a As for INST, the remainder of the stack above the markobject is
2037n/a gathered into an argument tuple, and then the logic seems identical,
2038n/a except that no __safe_for_unpickling__ check is done (XXX this is
2039n/a a bug). See INST for the gory details.
2040n/a
2041n/a NOTE: In Python 2.3, INST and OBJ are identical except for how they
2042n/a get the class object. That was always the intent; the implementations
2043n/a had diverged for accidental reasons.
2044n/a """),
2045n/a
2046n/a I(name='NEWOBJ',
2047n/a code='\x81',
2048n/a arg=None,
2049n/a stack_before=[anyobject, anyobject],
2050n/a stack_after=[anyobject],
2051n/a proto=2,
2052n/a doc="""Build an object instance.
2053n/a
2054n/a The stack before should be thought of as containing a class
2055n/a object followed by an argument tuple (the tuple being the stack
2056n/a top). Call these cls and args. They are popped off the stack,
2057n/a and the value returned by cls.__new__(cls, *args) is pushed back
2058n/a onto the stack.
2059n/a """),
2060n/a
2061n/a I(name='NEWOBJ_EX',
2062n/a code='\x92',
2063n/a arg=None,
2064n/a stack_before=[anyobject, anyobject, anyobject],
2065n/a stack_after=[anyobject],
2066n/a proto=4,
2067n/a doc="""Build an object instance.
2068n/a
2069n/a The stack before should be thought of as containing a class
2070n/a object followed by an argument tuple and by a keyword argument dict
2071n/a (the dict being the stack top). Call these cls, args, and kwargs.
2072n/a They are popped off the stack, and the value returned by
2073n/a cls.__new__(cls, *args, **kwargs) is pushed back onto the stack.
2074n/a """),
2075n/a
2076n/a # Machine control.
2077n/a
2078n/a I(name='PROTO',
2079n/a code='\x80',
2080n/a arg=uint1,
2081n/a stack_before=[],
2082n/a stack_after=[],
2083n/a proto=2,
2084n/a doc="""Protocol version indicator.
2085n/a
2086n/a For protocol 2 and above, a pickle must start with this opcode.
2087n/a The argument is the protocol version, an int in range(2, 256).
2088n/a """),
2089n/a
2090n/a I(name='STOP',
2091n/a code='.',
2092n/a arg=None,
2093n/a stack_before=[anyobject],
2094n/a stack_after=[],
2095n/a proto=0,
2096n/a doc="""Stop the unpickling machine.
2097n/a
2098n/a Every pickle ends with this opcode. The object at the top of the stack
2099n/a is popped, and that's the result of unpickling. The stack should be
2100n/a empty then.
2101n/a """),
2102n/a
2103n/a # Framing support.
2104n/a
2105n/a I(name='FRAME',
2106n/a code='\x95',
2107n/a arg=uint8,
2108n/a stack_before=[],
2109n/a stack_after=[],
2110n/a proto=4,
2111n/a doc="""Indicate the beginning of a new frame.
2112n/a
2113n/a The unpickler may use this opcode to safely prefetch data from its
2114n/a underlying stream.
2115n/a """),
2116n/a
2117n/a # Ways to deal with persistent IDs.
2118n/a
2119n/a I(name='PERSID',
2120n/a code='P',
2121n/a arg=stringnl_noescape,
2122n/a stack_before=[],
2123n/a stack_after=[anyobject],
2124n/a proto=0,
2125n/a doc="""Push an object identified by a persistent ID.
2126n/a
2127n/a The pickle module doesn't define what a persistent ID means. PERSID's
2128n/a argument is a newline-terminated str-style (no embedded escapes, no
2129n/a bracketing quote characters) string, which *is* "the persistent ID".
2130n/a The unpickler passes this string to self.persistent_load(). Whatever
2131n/a object that returns is pushed on the stack. There is no implementation
2132n/a of persistent_load() in Python's unpickler: it must be supplied by an
2133n/a unpickler subclass.
2134n/a """),
2135n/a
2136n/a I(name='BINPERSID',
2137n/a code='Q',
2138n/a arg=None,
2139n/a stack_before=[anyobject],
2140n/a stack_after=[anyobject],
2141n/a proto=1,
2142n/a doc="""Push an object identified by a persistent ID.
2143n/a
2144n/a Like PERSID, except the persistent ID is popped off the stack (instead
2145n/a of being a string embedded in the opcode bytestream). The persistent
2146n/a ID is passed to self.persistent_load(), and whatever object that
2147n/a returns is pushed on the stack. See PERSID for more detail.
2148n/a """),
2149n/a]
2150n/adel I
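# Illustration only (not part of pickletools itself): a minimal sketch of how
# the REDUCE and BUILD opcodes described above cooperate. It relies on the
# fact that built-in exceptions reduce to (class, args) plus their __dict__
# when that dict is non-empty, so no user-defined class is needed. The
# names err, names and restored are chosen for this example.

```python
import pickle
import pickletools

# Give a built-in exception some extra instance state.
err = ValueError("bad input")
err.detail = 42  # lives in err.__dict__, so it travels through BUILD

data = pickle.dumps(err, protocol=2)

# Collect the opcode names in the order the unpickling machine sees them.
names = [opcode.name for opcode, arg, pos in pickletools.genops(data)]

# REDUCE calls ValueError(*args); the later BUILD restores the extra state.
print(names.index("REDUCE") < names.index("BUILD"))

restored = pickle.loads(data)
print(type(restored).__name__, restored.args, restored.detail)
```

# Running pickletools.dis(data) on the same bytes shows the full opcode
# stream, including the GLOBAL that pushes the exception class.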
2151n/a
2152n/a# Verify uniqueness of .name and .code members.
2153n/aname2i = {}
2154n/acode2i = {}
2155n/a
2156n/afor i, d in enumerate(opcodes):
2157n/a if d.name in name2i:
2158n/a raise ValueError("repeated name %r at indices %d and %d" %
2159n/a (d.name, name2i[d.name], i))
2160n/a if d.code in code2i:
2161n/a raise ValueError("repeated code %r at indices %d and %d" %
2162n/a (d.code, code2i[d.code], i))
2163n/a
2164n/a name2i[d.name] = i
2165n/a code2i[d.code] = i
2166n/a
2167n/adel name2i, code2i, i, d
2168n/a
2169n/a##############################################################################
2170n/a# Build a code2op dict, mapping opcode characters to OpcodeInfo records.
2171n/a# Also ensure we've got the same stuff as pickle.py, although the
2172n/a# introspection here is dicey.
2173n/a
2174n/acode2op = {}
2175n/afor d in opcodes:
2176n/a code2op[d.code] = d
2177n/adel d
2178n/a
2179n/adef assure_pickle_consistency(verbose=False):
2180n/a
2181n/a copy = code2op.copy()
2182n/a for name in pickle.__all__:
2183n/a if not re.match("[A-Z][A-Z0-9_]+$", name):
2184n/a if verbose:
2185n/a print("skipping %r: it doesn't look like an opcode name" % name)
2186n/a continue
2187n/a picklecode = getattr(pickle, name)
2188n/a if not isinstance(picklecode, bytes) or len(picklecode) != 1:
2189n/a if verbose:
2190n/a print(("skipping %r: value %r doesn't look like a pickle "
2191n/a "code" % (name, picklecode)))
2192n/a continue
2193n/a picklecode = picklecode.decode("latin-1")
2194n/a if picklecode in copy:
2195n/a if verbose:
2196n/a print("checking name %r w/ code %r for consistency" % (
2197n/a name, picklecode))
2198n/a d = copy[picklecode]
2199n/a if d.name != name:
2200n/a raise ValueError("for pickle code %r, pickle.py uses name %r "
2201n/a "but we're using name %r" % (picklecode,
2202n/a name,
2203n/a d.name))
2204n/a # Forget this one. Any left over in copy at the end are a problem
2205n/a # of a different kind.
2206n/a del copy[picklecode]
2207n/a else:
2208n/a raise ValueError("pickle.py appears to have a pickle opcode with "
2209n/a "name %r and code %r, but we don't" %
2210n/a (name, picklecode))
2211n/a if copy:
2212n/a msg = ["we appear to have pickle opcodes that pickle.py doesn't have:"]
2213n/a for code, d in copy.items():
2214n/a msg.append(" name %r with code %r" % (d.name, code))
2215n/a raise ValueError("\n".join(msg))
2216n/a
2217n/aassure_pickle_consistency()
2218n/adel assure_pickle_consistency
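# Illustration only: a small sketch of the relationship that the consistency
# check above enforces. pickle exposes each opcode as a one-byte bytes
# constant, while code2op is keyed by the equivalent one-character str.
# The helper name lookup() is hypothetical and exists only in this example.

```python
import pickle
import pickletools


def lookup(const_name):
    """Return the OpcodeInfo record for a pickle.<CONST> opcode constant."""
    code = getattr(pickle, const_name)  # e.g. pickle.REDUCE == b'R'
    return pickletools.code2op[code.decode("latin-1")]


for const_name in ("GLOBAL", "REDUCE", "BUILD", "STOP"):
    info = lookup(const_name)
    # The names must agree, which is what assure_pickle_consistency verifies.
    print(const_name, repr(info.code), info.name, info.proto)
```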
2219n/a
2220n/a##############################################################################
2221n/a# A pickle opcode generator.
2222n/a
2223n/adef _genops(data, yield_end_pos=False):
2224n/a if isinstance(data, bytes_types):
2225n/a data = io.BytesIO(data)
2226n/a
2227n/a if hasattr(data, "tell"):
2228n/a getpos = data.tell
2229n/a else:
2230n/a getpos = lambda: None
2231n/a
2232n/a while True:
2233n/a pos = getpos()
2234n/a code = data.read(1)
2235n/a opcode = code2op.get(code.decode("latin-1"))
2236n/a if opcode is None:
2237n/a if code == b"":
2238n/a raise ValueError("pickle exhausted before seeing STOP")
2239n/a else:
2240n/a raise ValueError("at position %s, opcode %r unknown" % (
2241n/a "<unknown>" if pos is None else pos,
2242n/a code))
2243n/a if opcode.arg is None:
2244n/a arg = None
2245n/a else:
2246n/a arg = opcode.arg.reader(data)
2247n/a if yield_end_pos:
2248n/a yield opcode, arg, pos, getpos()
2249n/a else:
2250n/a yield opcode, arg, pos
2251n/a if code == b'.':
2252n/a assert opcode.name == 'STOP'
2253n/a break
2254n/a
2255n/adef genops(pickle):
2256n/a """Generate all the opcodes in a pickle.
2257n/a
2258n/a 'pickle' is a file-like object, or bytes object, containing the pickle.
2259n/a
2260n/a Each opcode in the pickle is generated, from the current pickle position,
2261n/a stopping after a STOP opcode is delivered. A triple is generated for
2262n/a each opcode:
2263n/a
2264n/a opcode, arg, pos
2265n/a
2266n/a opcode is an OpcodeInfo record, describing the current opcode.
2267n/a
2268n/a If the opcode has an argument embedded in the pickle, arg is its decoded
2269n/a value, as a Python object. If the opcode doesn't have an argument, arg
2270n/a is None.
2271n/a
2272n/a If the pickle has a tell() method, pos was the value of pickle.tell()
2273n/a before reading the current opcode. If the pickle is a bytes object,
2274n/a it's wrapped in a BytesIO object, and the latter's tell() result is
2275n/a used. Else (the pickle doesn't have a tell(), and it's not obvious how
2276n/a to query its current position) pos is None.
2277n/a """
2278n/a return _genops(pickle)
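# Illustration only: a short usage sketch for genops(), assuming the pickle
# is passed as a bytes object so that positions are available. The variable
# names data and triples are chosen for this example.

```python
import pickle
import pickletools

data = pickle.dumps([1, 2, (3, 4)], protocol=2)

# Each item is an (OpcodeInfo, decoded argument or None, byte offset) triple.
triples = list(pickletools.genops(data))

for opcode, arg, pos in triples:
    print(pos, opcode.name, arg)

# Generation stops right after the STOP opcode has been delivered.
last_opcode, last_arg, last_pos = triples[-1]
print(last_opcode.name, last_pos == len(data) - 1)
```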
2279n/a
2280n/a##############################################################################
2281n/a# A pickle optimizer.
2282n/a
2283n/adef optimize(p):
2284n/a 'Optimize a pickle string by removing unused PUT opcodes'
2285n/a put = 'PUT'
2286n/a get = 'GET'
2287n/a oldids = set() # set of all PUT ids
2288n/a newids = {} # maps each id used by a GET opcode to its new id
2289n/a opcodes = [] # (op, idx) or (pos, end_pos)
2290n/a proto = 0
2291n/a protoheader = b''
2292n/a for opcode, arg, pos, end_pos in _genops(p, yield_end_pos=True):
2293n/a if 'PUT' in opcode.name:
2294n/a oldids.add(arg)
2295n/a opcodes.append((put, arg))
2296n/a elif opcode.name == 'MEMOIZE':
2297n/a idx = len(oldids)
2298n/a oldids.add(idx)
2299n/a opcodes.append((put, idx))
2300n/a elif 'FRAME' in opcode.name:
2301n/a pass
2302n/a elif 'GET' in opcode.name:
2303n/a if opcode.proto > proto:
2304n/a proto = opcode.proto
2305n/a newids[arg] = None
2306n/a opcodes.append((get, arg))
2307n/a elif opcode.name == 'PROTO':
2308n/a if arg > proto:
2309n/a proto = arg
2310n/a if pos == 0:
2311n/a protoheader = p[pos: end_pos]
2312n/a else:
2313n/a opcodes.append((pos, end_pos))
2314n/a else:
2315n/a opcodes.append((pos, end_pos))
2316n/a del oldids
2317n/a
2318n/a # Copy the opcodes except for PUTS without a corresponding GET
2319n/a out = io.BytesIO()
2320n/a # Write the PROTO header before any framing
2321n/a out.write(protoheader)
2322n/a pickler = pickle._Pickler(out, proto)
2323n/a if proto >= 4:
2324n/a pickler.framer.start_framing()
2325n/a idx = 0
2326n/a for op, arg in opcodes:
2327n/a if op is put:
2328n/a if arg not in newids:
2329n/a continue
2330n/a data = pickler.put(idx)
2331n/a newids[arg] = idx
2332n/a idx += 1
2333n/a elif op is get:
2334n/a data = pickler.get(newids[arg])
2335n/a else:
2336n/a data = p[op:arg]
2337n/a pickler.framer.commit_frame()
2338n/a pickler.write(data)
2339n/a pickler.framer.end_framing()
2340n/a return out.getvalue()
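# Illustration only: a minimal sketch of what optimize() achieves. The
# helper count_puts() is hypothetical and defined just for this example.
# The pickle below memoizes its containers but never reads the memo back,
# so every PUT-style opcode is expected to be dropped.

```python
import pickle
import pickletools


def count_puts(p):
    """Count PUT, BINPUT and LONG_BINPUT opcodes in the pickle bytes p."""
    return sum("PUT" in opcode.name for opcode, arg, pos in pickletools.genops(p))


original = [1, 2, (3, 4)]
data = pickle.dumps(original, protocol=2)
slim = pickletools.optimize(data)

print(count_puts(data), count_puts(slim))
print(len(data), len(slim))

# The optimized pickle still unpickles to an equal object.
print(pickle.loads(slim) == original)
```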
2341n/a
2342n/a##############################################################################
2343n/a# A symbolic pickle disassembler.
2344n/a
2345n/adef dis(pickle, out=None, memo=None, indentlevel=4, annotate=0):
2346n/a """Produce a symbolic disassembly of a pickle.
2347n/a
2348n/a 'pickle' is a file-like object, or bytes object, containing at least one
2349n/a pickle. The pickle is disassembled from the current position, through
2350n/a the first STOP opcode encountered.
2351n/a
2352n/a Optional arg 'out' is a file-like object to which the disassembly is
2353n/a printed. It defaults to sys.stdout.
2354n/a
2355n/a Optional arg 'memo' is a Python dict, used as the pickle's memo. It
2356n/a may be mutated by dis(), if the pickle contains PUT, BINPUT, LONG_BINPUT
2356n/a or MEMOIZE opcodes.
2357n/a Passing the same memo object to another dis() call then allows disassembly
2358n/a to proceed across multiple pickles that were all created by the same
2359n/a pickler with the same memo. Ordinarily you don't need to worry about this.
2360n/a
2361n/a Optional arg 'indentlevel' is the number of blanks by which to indent
2362n/a a new MARK level. It defaults to 4.
2363n/a
2364n/a Optional arg 'annotate', if nonzero, instructs dis() to add a short
2365n/a description of the opcode on each line of disassembled output.
2366n/a The value given to 'annotate' must be an integer and is used as a
2367n/a hint for the column where annotation should start. The default
2368n/a value is 0, meaning no annotations.
2369n/a
2370n/a In addition to printing the disassembly, some sanity checks are made:
2371n/a
2372n/a + All embedded opcode arguments "make sense".
2373n/a
2374n/a + Explicit and implicit pop operations have enough items on the stack.
2375n/a
2376n/a + When an opcode implicitly refers to a markobject, a markobject is
2377n/a actually on the stack.
2378n/a
2379n/a + A memo entry isn't referenced before it's defined.
2380n/a
2381n/a + The markobject isn't stored in the memo.
2382n/a
2383n/a + A memo entry isn't redefined.
2384n/a """
2385n/a
2386n/a # Most of the hair here is for sanity checks, but most of it is needed
2387n/a # anyway to detect when a protocol 0 POP takes a MARK off the stack
2388n/a # (which in turn is needed to indent MARK blocks correctly).
2389n/a
2390n/a stack = [] # crude emulation of unpickler stack
2391n/a if memo is None:
2392n/a memo = {} # crude emulation of unpickler memo
2393n/a maxproto = -1 # max protocol number seen
2394n/a markstack = [] # bytecode positions of MARK opcodes
2395n/a indentchunk = ' ' * indentlevel
2396n/a errormsg = None
2397n/a annocol = annotate # column hint for annotations
2398n/a for opcode, arg, pos in genops(pickle):
2399n/a if pos is not None:
2400n/a print("%5d:" % pos, end=' ', file=out)
2401n/a
2402n/a line = "%-4s %s%s" % (repr(opcode.code)[1:-1],
2403n/a indentchunk * len(markstack),
2404n/a opcode.name)
2405n/a
2406n/a maxproto = max(maxproto, opcode.proto)
2407n/a before = opcode.stack_before # don't mutate
2408n/a after = opcode.stack_after # don't mutate
2409n/a numtopop = len(before)
2410n/a
2411n/a # See whether a MARK should be popped.
2412n/a markmsg = None
2413n/a if markobject in before or (opcode.name == "POP" and
2414n/a stack and
2415n/a stack[-1] is markobject):
2416n/a assert markobject not in after
2417n/a if __debug__:
2418n/a if markobject in before:
2419n/a assert before[-1] is stackslice
2420n/a if markstack:
2421n/a markpos = markstack.pop()
2422n/a if markpos is None:
2423n/a markmsg = "(MARK at unknown opcode offset)"
2424n/a else:
2425n/a markmsg = "(MARK at %d)" % markpos
2426n/a # Pop everything at and after the topmost markobject.
2427n/a while stack[-1] is not markobject:
2428n/a stack.pop()
2429n/a stack.pop()
2430n/a # Stop later code from popping too much.
2431n/a try:
2432n/a numtopop = before.index(markobject)
2433n/a except ValueError:
2434n/a assert opcode.name == "POP"
2435n/a numtopop = 0
2436n/a else:
2437n/a errormsg = markmsg = "no MARK exists on stack"
2438n/a
2439n/a # Check for correct memo usage.
2440n/a if opcode.name in ("PUT", "BINPUT", "LONG_BINPUT", "MEMOIZE"):
2441n/a if opcode.name == "MEMOIZE":
2442n/a memo_idx = len(memo)
2443n/a markmsg = "(as %d)" % memo_idx
2444n/a else:
2445n/a assert arg is not None
2446n/a memo_idx = arg
2447n/a if memo_idx in memo:
2448n/a errormsg = "memo key %r already defined" % arg
2449n/a elif not stack:
2450n/a errormsg = "stack is empty -- can't store into memo"
2451n/a elif stack[-1] is markobject:
2452n/a errormsg = "can't store markobject in the memo"
2453n/a else:
2454n/a memo[memo_idx] = stack[-1]
2455n/a elif opcode.name in ("GET", "BINGET", "LONG_BINGET"):
2456n/a if arg in memo:
2457n/a assert len(after) == 1
2458n/a after = [memo[arg]] # for better stack emulation
2459n/a else:
2460n/a errormsg = "memo key %r has never been stored into" % arg
2461n/a
2462n/a if arg is not None or markmsg:
2463n/a # make a mild effort to align arguments
2464n/a line += ' ' * (10 - len(opcode.name))
2465n/a if arg is not None:
2466n/a line += ' ' + repr(arg)
2467n/a if markmsg:
2468n/a line += ' ' + markmsg
2469n/a if annotate:
2470n/a line += ' ' * (annocol - len(line))
2471n/a # make a mild effort to align annotations
2472n/a annocol = len(line)
2473n/a if annocol > 50:
2474n/a annocol = annotate
2475n/a line += ' ' + opcode.doc.split('\n', 1)[0]
2476n/a print(line, file=out)
2477n/a
2478n/a if errormsg:
2479n/a # Note that we delayed complaining until the offending opcode
2480n/a # was printed.
2481n/a raise ValueError(errormsg)
2482n/a
2483n/a # Emulate the stack effects.
2484n/a if len(stack) < numtopop:
2485n/a raise ValueError("tries to pop %d items from stack with "
2486n/a "only %d items" % (numtopop, len(stack)))
2487n/a if numtopop:
2488n/a del stack[-numtopop:]
2489n/a if markobject in after:
2490n/a assert markobject not in before
2491n/a markstack.append(pos)
2492n/a
2493n/a stack.extend(after)
2494n/a
2495n/a print("highest protocol among opcodes =", maxproto, file=out)
2496n/a if stack:
2497n/a raise ValueError("stack not empty after STOP: %r" % stack)
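# Illustration only: a usage sketch for dis() that captures the disassembly
# in a text buffer instead of printing to sys.stdout, and that shows one of
# the sanity checks firing. The names buf, text and problem are chosen for
# this example; b"0." is a deliberately malformed pickle (POP, then STOP).

```python
import io
import pickle
import pickletools

# Disassemble into a StringIO, with annotations starting near column 30.
buf = io.StringIO()
pickletools.dis(pickle.dumps((1, 2), protocol=2), out=buf, annotate=30)
text = buf.getvalue()
print(text)

# A POP on an empty stack trips the "enough items on the stack" check.
problem = None
try:
    pickletools.dis(b"0.", out=io.StringIO())
except ValueError as exc:
    problem = str(exc)
print(problem)
```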
2498n/a
2499n/a# For use in the doctest, simply as an example of a class to pickle.
2500n/aclass _Example:
2501n/a def __init__(self, value):
2502n/a self.value = value
2503n/a
2504n/a_dis_test = r"""
2505n/a>>> import pickle
2506n/a>>> x = [1, 2, (3, 4), {b'abc': "def"}]
2507n/a>>> pkl0 = pickle.dumps(x, 0)
2508n/a>>> dis(pkl0)
2509n/a 0: ( MARK
2510n/a 1: l LIST (MARK at 0)
2511n/a 2: p PUT 0
2512n/a 5: L LONG 1
2513n/a 9: a APPEND
2514n/a 10: L LONG 2
2515n/a 14: a APPEND
2516n/a 15: ( MARK
2517n/a 16: L LONG 3
2518n/a 20: L LONG 4
2519n/a 24: t TUPLE (MARK at 15)
2520n/a 25: p PUT 1
2521n/a 28: a APPEND
2522n/a 29: ( MARK
2523n/a 30: d DICT (MARK at 29)
2524n/a 31: p PUT 2
2525n/a 34: c GLOBAL '_codecs encode'
2526n/a 50: p PUT 3
2527n/a 53: ( MARK
2528n/a 54: V UNICODE 'abc'
2529n/a 59: p PUT 4
2530n/a 62: V UNICODE 'latin1'
2531n/a 70: p PUT 5
2532n/a 73: t TUPLE (MARK at 53)
2533n/a 74: p PUT 6
2534n/a 77: R REDUCE
2535n/a 78: p PUT 7
2536n/a 81: V UNICODE 'def'
2537n/a 86: p PUT 8
2538n/a 89: s SETITEM
2539n/a 90: a APPEND
2540n/a 91: . STOP
2541n/ahighest protocol among opcodes = 0
2542n/a
2543n/aTry again with a "binary" pickle.
2544n/a
2545n/a>>> pkl1 = pickle.dumps(x, 1)
2546n/a>>> dis(pkl1)
2547n/a 0: ] EMPTY_LIST
2548n/a 1: q BINPUT 0
2549n/a 3: ( MARK
2550n/a 4: K BININT1 1
2551n/a 6: K BININT1 2
2552n/a 8: ( MARK
2553n/a 9: K BININT1 3
2554n/a 11: K BININT1 4
2555n/a 13: t TUPLE (MARK at 8)
2556n/a 14: q BINPUT 1
2557n/a 16: } EMPTY_DICT
2558n/a 17: q BINPUT 2
2559n/a 19: c GLOBAL '_codecs encode'
2560n/a 35: q BINPUT 3
2561n/a 37: ( MARK
2562n/a 38: X BINUNICODE 'abc'
2563n/a 46: q BINPUT 4
2564n/a 48: X BINUNICODE 'latin1'
2565n/a 59: q BINPUT 5
2566n/a 61: t TUPLE (MARK at 37)
2567n/a 62: q BINPUT 6
2568n/a 64: R REDUCE
2569n/a 65: q BINPUT 7
2570n/a 67: X BINUNICODE 'def'
2571n/a 75: q BINPUT 8
2572n/a 77: s SETITEM
2573n/a 78: e APPENDS (MARK at 3)
2574n/a 79: . STOP
2575n/ahighest protocol among opcodes = 1
2576n/a
2577n/aExercise the INST/OBJ/BUILD family.
2578n/a
2579n/a>>> import pickletools
2580n/a>>> dis(pickle.dumps(pickletools.dis, 0))
2581n/a 0: c GLOBAL 'pickletools dis'
2582n/a 17: p PUT 0
2583n/a 20: . STOP
2584n/ahighest protocol among opcodes = 0
2585n/a
2586n/a>>> from pickletools import _Example
2587n/a>>> x = [_Example(42)] * 2
2588n/a>>> dis(pickle.dumps(x, 0))
2589n/a 0: ( MARK
2590n/a 1: l LIST (MARK at 0)
2591n/a 2: p PUT 0
2592n/a 5: c GLOBAL 'copy_reg _reconstructor'
2593n/a 30: p PUT 1
2594n/a 33: ( MARK
2595n/a 34: c GLOBAL 'pickletools _Example'
2596n/a 56: p PUT 2
2597n/a 59: c GLOBAL '__builtin__ object'
2598n/a 79: p PUT 3
2599n/a 82: N NONE
2600n/a 83: t TUPLE (MARK at 33)
2601n/a 84: p PUT 4
2602n/a 87: R REDUCE
2603n/a 88: p PUT 5
2604n/a 91: ( MARK
2605n/a 92: d DICT (MARK at 91)
2606n/a 93: p PUT 6
2607n/a 96: V UNICODE 'value'
2608n/a 103: p PUT 7
2609n/a 106: L LONG 42
2610n/a 111: s SETITEM
2611n/a 112: b BUILD
2612n/a 113: a APPEND
2613n/a 114: g GET 5
2614n/a 117: a APPEND
2615n/a 118: . STOP
2616n/ahighest protocol among opcodes = 0
2617n/a
2618n/a>>> dis(pickle.dumps(x, 1))
2619n/a 0: ] EMPTY_LIST
2620n/a 1: q BINPUT 0
2621n/a 3: ( MARK
2622n/a 4: c GLOBAL 'copy_reg _reconstructor'
2623n/a 29: q BINPUT 1
2624n/a 31: ( MARK
2625n/a 32: c GLOBAL 'pickletools _Example'
2626n/a 54: q BINPUT 2
2627n/a 56: c GLOBAL '__builtin__ object'
2628n/a 76: q BINPUT 3
2629n/a 78: N NONE
2630n/a 79: t TUPLE (MARK at 31)
2631n/a 80: q BINPUT 4
2632n/a 82: R REDUCE
2633n/a 83: q BINPUT 5
2634n/a 85: } EMPTY_DICT
2635n/a 86: q BINPUT 6
2636n/a 88: X BINUNICODE 'value'
2637n/a 98: q BINPUT 7
2638n/a 100: K BININT1 42
2639n/a 102: s SETITEM
2640n/a 103: b BUILD
2641n/a 104: h BINGET 5
2642n/a 106: e APPENDS (MARK at 3)
2643n/a 107: . STOP
2644n/ahighest protocol among opcodes = 1
2645n/a
2646n/aTry "the canonical" recursive-object test.
2647n/a
2648n/a>>> L = []
2649n/a>>> T = L,
2650n/a>>> L.append(T)
2651n/a>>> L[0] is T
2652n/aTrue
2653n/a>>> T[0] is L
2654n/aTrue
2655n/a>>> L[0][0] is L
2656n/aTrue
2657n/a>>> T[0][0] is T
2658n/aTrue
2659n/a>>> dis(pickle.dumps(L, 0))
2660n/a 0: ( MARK
2661n/a 1: l LIST (MARK at 0)
2662n/a 2: p PUT 0
2663n/a 5: ( MARK
2664n/a 6: g GET 0
2665n/a 9: t TUPLE (MARK at 5)
2666n/a 10: p PUT 1
2667n/a 13: a APPEND
2668n/a 14: . STOP
2669n/ahighest protocol among opcodes = 0
2670n/a
2671n/a>>> dis(pickle.dumps(L, 1))
2672n/a 0: ] EMPTY_LIST
2673n/a 1: q BINPUT 0
2674n/a 3: ( MARK
2675n/a 4: h BINGET 0
2676n/a 6: t TUPLE (MARK at 3)
2677n/a 7: q BINPUT 1
2678n/a 9: a APPEND
2679n/a 10: . STOP
2680n/ahighest protocol among opcodes = 1
2681n/a
2682n/aNote that, in the protocol 0 pickle of the recursive tuple, the disassembler
2683n/ahas to emulate the stack in order to realize that the POP opcode at 16 gets
2684n/arid of the MARK at 0.
2685n/a
2686n/a>>> dis(pickle.dumps(T, 0))
2687n/a 0: ( MARK
2688n/a 1: ( MARK
2689n/a 2: l LIST (MARK at 1)
2690n/a 3: p PUT 0
2691n/a 6: ( MARK
2692n/a 7: g GET 0
2693n/a 10: t TUPLE (MARK at 6)
2694n/a 11: p PUT 1
2695n/a 14: a APPEND
2696n/a 15: 0 POP
2697n/a 16: 0 POP (MARK at 0)
2698n/a 17: g GET 1
2699n/a 20: . STOP
2700n/ahighest protocol among opcodes = 0
2701n/a
2702n/a>>> dis(pickle.dumps(T, 1))
2703n/a 0: ( MARK
2704n/a 1: ] EMPTY_LIST
2705n/a 2: q BINPUT 0
2706n/a 4: ( MARK
2707n/a 5: h BINGET 0
2708n/a 7: t TUPLE (MARK at 4)
2709n/a 8: q BINPUT 1
2710n/a 10: a APPEND
2711n/a 11: 1 POP_MARK (MARK at 0)
2712n/a 12: h BINGET 1
2713n/a 14: . STOP
2714n/ahighest protocol among opcodes = 1
2715n/a
2716n/aTry protocol 2.
2717n/a
2718n/a>>> dis(pickle.dumps(L, 2))
2719n/a 0: \x80 PROTO 2
2720n/a 2: ] EMPTY_LIST
2721n/a 3: q BINPUT 0
2722n/a 5: h BINGET 0
2723n/a 7: \x85 TUPLE1
2724n/a 8: q BINPUT 1
2725n/a 10: a APPEND
2726n/a 11: . STOP
2727n/ahighest protocol among opcodes = 2
2728n/a
2729n/a>>> dis(pickle.dumps(T, 2))
2730n/a 0: \x80 PROTO 2
2731n/a 2: ] EMPTY_LIST
2732n/a 3: q BINPUT 0
2733n/a 5: h BINGET 0
2734n/a 7: \x85 TUPLE1
2735n/a 8: q BINPUT 1
2736n/a 10: a APPEND
2737n/a 11: 0 POP
2738n/a 12: h BINGET 1
2739n/a 14: . STOP
2740n/ahighest protocol among opcodes = 2
2741n/a
2742n/aTry protocol 3 with annotations:
2743n/a
2744n/a>>> dis(pickle.dumps(T, 3), annotate=1)
2745n/a 0: \x80 PROTO 3 Protocol version indicator.
2746n/a 2: ] EMPTY_LIST Push an empty list.
2747n/a 3: q BINPUT 0 Store the stack top into the memo. The stack is not popped.
2748n/a 5: h BINGET 0 Read an object from the memo and push it on the stack.
2749n/a 7: \x85 TUPLE1 Build a one-tuple out of the topmost item on the stack.
2750n/a 8: q BINPUT 1 Store the stack top into the memo. The stack is not popped.
2751n/a 10: a APPEND Append an object to a list.
2752n/a 11: 0 POP Discard the top stack item, shrinking the stack by one item.
2753n/a 12: h BINGET 1 Read an object from the memo and push it on the stack.
2754n/a 14: . STOP Stop the unpickling machine.
2755n/ahighest protocol among opcodes = 2
2756n/a
2757n/a"""
2758n/a
2759n/a_memo_test = r"""
2760n/a>>> import pickle
2761n/a>>> import io
2762n/a>>> f = io.BytesIO()
2763n/a>>> p = pickle.Pickler(f, 2)
2764n/a>>> x = [1, 2, 3]
2765n/a>>> p.dump(x)
2766n/a>>> p.dump(x)
2767n/a>>> f.seek(0)
2768n/a0
2769n/a>>> memo = {}
2770n/a>>> dis(f, memo=memo)
2771n/a 0: \x80 PROTO 2
2772n/a 2: ] EMPTY_LIST
2773n/a 3: q BINPUT 0
2774n/a 5: ( MARK
2775n/a 6: K BININT1 1
2776n/a 8: K BININT1 2
2777n/a 10: K BININT1 3
2778n/a 12: e APPENDS (MARK at 5)
2779n/a 13: . STOP
2780n/ahighest protocol among opcodes = 2
2781n/a>>> dis(f, memo=memo)
2782n/a 14: \x80 PROTO 2
2783n/a 16: h BINGET 0
2784n/a 18: . STOP
2785n/ahighest protocol among opcodes = 2
2786n/a"""
2787n/a
2788n/a__test__ = {'disassembler_test': _dis_test,
2789n/a 'disassembler_memo_test': _memo_test,
2790n/a }
2791n/a
2792n/adef _test():
2793n/a import doctest
2794n/a return doctest.testmod()
2795n/a
2796n/aif __name__ == "__main__":
2797n/a import argparse
2798n/a parser = argparse.ArgumentParser(
2799n/a description='disassemble one or more pickle files')
2800n/a parser.add_argument(
2801n/a 'pickle_file', type=argparse.FileType('br'),
2802n/a nargs='*', help='the pickle file')
2803n/a parser.add_argument(
2804n/a '-o', '--output', default=sys.stdout, type=argparse.FileType('w'),
2805n/a help='the file where the output should be written')
2806n/a parser.add_argument(
2807n/a '-m', '--memo', action='store_true',
2808n/a help='preserve memo between disassemblies')
2809n/a parser.add_argument(
2810n/a '-l', '--indentlevel', default=4, type=int,
2811n/a help='the number of blanks by which to indent a new MARK level')
2812n/a parser.add_argument(
2813n/a '-a', '--annotate', action='store_true',
2814n/a help='annotate each line with a short opcode description')
2815n/a parser.add_argument(
2816n/a '-p', '--preamble', default="==> {name} <==",
2817n/a help='if more than one pickle file is specified, print this before'
2818n/a ' each disassembly')
2819n/a parser.add_argument(
2820n/a '-t', '--test', action='store_true',
2821n/a help='run self-test suite')
2822n/a parser.add_argument(
2823n/a '-v', action='store_true',
2824n/a help='run verbosely; only affects self-test run')
2825n/a args = parser.parse_args()
2826n/a if args.test:
2827n/a _test()
2828n/a else:
2829n/a annotate = 30 if args.annotate else 0
2830n/a if not args.pickle_file:
2831n/a parser.print_help()
2832n/a elif len(args.pickle_file) == 1:
2833n/a dis(args.pickle_file[0], args.output, None,
2834n/a args.indentlevel, annotate)
2835n/a else:
2836n/a memo = {} if args.memo else None
2837n/a for f in args.pickle_file:
2838n/a preamble = args.preamble.format(name=f.name)
2839n/a args.output.write(preamble + '\n')
2840n/a dis(f, args.output, memo, args.indentlevel, annotate)