.. Copyright (C) 2001-2023 NLTK Project .. For license information, see LICENSE.TXT ========================================= Loading Resources From the Data Package ========================================= >>> import nltk.data Overview ~~~~~~~~ The `nltk.data` module contains functions that can be used to load NLTK resource files, such as corpora, grammars, and saved processing objects. Loading Data Files ~~~~~~~~~~~~~~~~~~ Resources are loaded using the function `nltk.data.load()`, which takes as its first argument a URL specifying what file should be loaded. The ``nltk:`` protocol loads files from the NLTK data distribution: >>> tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') >>> tokenizer.tokenize('Hello. This is a test. It works!') ['Hello.', 'This is a test.', 'It works!'] It is important to note that there should be no space following the colon (':') in the URL; 'nltk: tokenizers/punkt/english.pickle' will not work! The ``nltk:`` protocol is used by default if no protocol is specified: >>> nltk.data.load('tokenizers/punkt/english.pickle') But it is also possible to load resources from ``http:``, ``ftp:``, and ``file:`` URLs: >>> # Load a grammar from the NLTK webpage. >>> cfg = nltk.data.load('https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg') >>> print(cfg) # doctest: +ELLIPSIS Grammar with 14 productions (start state = S) S -> NP VP PP -> P NP ... P -> 'on' P -> 'in' >>> # Load a grammar using an absolute path. >>> url = 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg') >>> url.replace('\\', '/') 'file:...toy.cfg' >>> print(nltk.data.load(url)) Grammar with 14 productions (start state = S) S -> NP VP PP -> P NP ... P -> 'on' P -> 'in' The second argument to the `nltk.data.load()` function specifies the file format, which determines how the file's contents are processed before they are returned by ``load()``. The formats that are currently supported by the data module are described by the dictionary `nltk.data.FORMATS`: >>> for format, descr in sorted(nltk.data.FORMATS.items()): ... print('{0:<7} {1:}'.format(format, descr)) cfg A context free grammar. fcfg A feature CFG. fol A list of first order logic expressions, parsed with nltk.sem.logic.Expression.fromstring. json A serialized python object, stored using the json module. logic A list of first order logic expressions, parsed with nltk.sem.logic.LogicParser. Requires an additional logic_parser parameter pcfg A probabilistic CFG. pickle A serialized python object, stored using the pickle module. raw The raw (byte string) contents of a file. text The raw (unicode string) contents of a file. val A semantic valuation, parsed by nltk.sem.Valuation.fromstring. yaml A serialized python object, stored using the yaml module. `nltk.data.load()` will raise a ValueError if a bad format name is specified: >>> nltk.data.load('grammars/sample_grammars/toy.cfg', 'bar') Traceback (most recent call last): . . . ValueError: Unknown format type! By default, the ``"auto"`` format is used, which chooses a format based on the filename's extension. The mapping from file extensions to format names is specified by `nltk.data.AUTO_FORMATS`: >>> for ext, format in sorted(nltk.data.AUTO_FORMATS.items()): ... print('.%-7s -> %s' % (ext, format)) .cfg -> cfg .fcfg -> fcfg .fol -> fol .json -> json .logic -> logic .pcfg -> pcfg .pickle -> pickle .text -> text .txt -> text .val -> val .yaml -> yaml If `nltk.data.load()` is unable to determine the format based on the filename's extension, it will raise a ValueError: >>> nltk.data.load('foo.bar') Traceback (most recent call last): . . . ValueError: Could not determine format for foo.bar based on its file extension; use the "format" argument to specify the format explicitly. Note that by explicitly specifying the ``format`` argument, you can override the load method's default processing behavior. For example, to get the raw contents of any file, simply use ``format="raw"``: >>> s = nltk.data.load('grammars/sample_grammars/toy.cfg', 'text') >>> print(s) S -> NP VP PP -> P NP NP -> Det N | NP PP VP -> V NP | VP PP ... Making Local Copies ~~~~~~~~~~~~~~~~~~~ .. This will not be visible in the html output: create a tempdir to play in. >>> import tempfile, os >>> tempdir = tempfile.mkdtemp() >>> old_dir = os.path.abspath('.') >>> os.chdir(tempdir) The function `nltk.data.retrieve()` copies a given resource to a local file. This can be useful, for example, if you want to edit one of the sample grammars. >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy.cfg' >>> # Simulate editing the grammar. >>> with open('toy.cfg') as inp: ... s = inp.read().replace('NP', 'DP') >>> with open('toy.cfg', 'w') as out: ... _bytes_written = out.write(s) >>> # Load the edited grammar, & display it. >>> cfg = nltk.data.load('file:///' + os.path.abspath('toy.cfg')) >>> print(cfg) Grammar with 14 productions (start state = S) S -> DP VP PP -> P DP ... P -> 'on' P -> 'in' The second argument to `nltk.data.retrieve()` specifies the filename for the new copy of the file. By default, the source file's filename is used. >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg', 'mytoy.cfg') Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'mytoy.cfg' >>> os.path.isfile('./mytoy.cfg') True >>> nltk.data.retrieve('grammars/sample_grammars/np.fcfg') Retrieving 'nltk:grammars/sample_grammars/np.fcfg', saving to 'np.fcfg' >>> os.path.isfile('./np.fcfg') True If a file with the specified (or default) filename already exists in the current directory, then `nltk.data.retrieve()` will raise a ValueError exception. It will *not* overwrite the file: >>> os.path.isfile('./toy.cfg') True >>> nltk.data.retrieve('grammars/sample_grammars/toy.cfg') Traceback (most recent call last): . . . ValueError: File '...toy.cfg' already exists! .. This will not be visible in the html output: clean up the tempdir. >>> os.chdir(old_dir) >>> for f in os.listdir(tempdir): ... os.remove(os.path.join(tempdir, f)) >>> os.rmdir(tempdir) Finding Files in the NLTK Data Package ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The `nltk.data.find()` function searches the NLTK data package for a given file, and returns a pointer to that file. This pointer can either be a `FileSystemPathPointer` (whose `path` attribute gives the absolute path of the file); or a `ZipFilePathPointer`, specifying a zipfile and the name of an entry within that zipfile. Both pointer types define the `open()` method, which can be used to read the string contents of the file. >>> path = nltk.data.find('corpora/abc/rural.txt') >>> str(path) '...rural.txt' >>> print(path.open().read(60).decode()) PM denies knowledge of AWB kickbacks The Prime Minister has Alternatively, the `nltk.data.load()` function can be used with the keyword argument ``format="raw"``: >>> s = nltk.data.load('corpora/abc/rural.txt', format='raw')[:60] >>> print(s.decode()) PM denies knowledge of AWB kickbacks The Prime Minister has Alternatively, you can use the keyword argument ``format="text"``: >>> s = nltk.data.load('corpora/abc/rural.txt', format='text')[:60] >>> print(s) PM denies knowledge of AWB kickbacks The Prime Minister has Resource Caching ~~~~~~~~~~~~~~~~ NLTK uses a weakref dictionary to maintain a cache of resources that have been loaded. If you load a resource that is already stored in the cache, then the cached copy will be returned. This behavior can be seen by the trace output generated when verbose=True: >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True) <> >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', verbose=True) <> If you wish to load a resource from its source, bypassing the cache, use the ``cache=False`` argument to `nltk.data.load()`. This can be useful, for example, if the resource is loaded from a local file, and you are actively editing that file: >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg',cache=False,verbose=True) <> The cache *no longer* uses weak references. A resource will not be automatically expunged from the cache when no more objects are using it. In the following example, when we clear the variable ``feat0``, the reference count for the feature grammar object drops to zero. However, the object remains cached: >>> del feat0 >>> feat0 = nltk.data.load('grammars/book_grammars/feat0.fcfg', ... verbose=True) <> You can clear the entire contents of the cache, using `nltk.data.clear_cache()`: >>> nltk.data.clear_cache() Retrieving other Data Sources ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>> formulas = nltk.data.load('grammars/book_grammars/background.fol') >>> for f in formulas: print(str(f)) all x.(boxerdog(x) -> dog(x)) all x.(boxer(x) -> person(x)) all x.-(dog(x) & person(x)) all x.(married(x) <-> exists y.marry(x,y)) all x.(bark(x) -> dog(x)) all x y.(marry(x,y) -> (person(x) & person(y))) -(Vincent = Mia) -(Vincent = Fido) -(Mia = Fido) Regression Tests ~~~~~~~~~~~~~~~~ Create a temp dir for tests that write files: >>> import tempfile, os >>> tempdir = tempfile.mkdtemp() >>> old_dir = os.path.abspath('.') >>> os.chdir(tempdir) The `retrieve()` function accepts all url types: >>> urls = ['https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', ... 'file:%s' % nltk.data.find('grammars/sample_grammars/toy.cfg'), ... 'nltk:grammars/sample_grammars/toy.cfg', ... 'grammars/sample_grammars/toy.cfg'] >>> for i, url in enumerate(urls): ... nltk.data.retrieve(url, 'toy-%d.cfg' % i) Retrieving 'https://raw.githubusercontent.com/nltk/nltk/develop/nltk/test/toy.cfg', saving to 'toy-0.cfg' Retrieving 'file:...toy.cfg', saving to 'toy-1.cfg' Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-2.cfg' Retrieving 'nltk:grammars/sample_grammars/toy.cfg', saving to 'toy-3.cfg' Clean up the temp dir: >>> os.chdir(old_dir) >>> for f in os.listdir(tempdir): ... os.remove(os.path.join(tempdir, f)) >>> os.rmdir(tempdir) Lazy Loader ----------- A lazy loader is a wrapper object that defers loading a resource until it is accessed or used in any way. This is mainly intended for internal use by NLTK's corpus readers. >>> # Create a lazy loader for toy.cfg. >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg') >>> # Show that it's not loaded yet: >>> object.__repr__(ll) '' >>> # printing it is enough to cause it to be loaded: >>> print(ll) >>> # Show that it's now been loaded: >>> object.__repr__(ll) '' >>> # Test that accessing an attribute also loads it: >>> ll = nltk.data.LazyLoader('grammars/sample_grammars/toy.cfg') >>> ll.start() S >>> object.__repr__(ll) '' Buffered Gzip Reading and Writing --------------------------------- Write performance to gzip-compressed is extremely poor when the files become large. File creation can become a bottleneck in those cases. Read performance from large gzipped pickle files was improved in data.py by buffering the reads. A similar fix can be applied to writes by buffering the writes to a StringIO object first. This is mainly intended for internal use. The test simply tests that reading and writing work as intended and does not test how much improvement buffering provides. >>> from io import StringIO >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'wb', size=2**10) >>> ans = [] >>> for i in range(10000): ... ans.append(str(i).encode('ascii')) ... test.write(str(i).encode('ascii')) >>> test.close() >>> test = nltk.data.BufferedGzipFile('testbuf.gz', 'rb') >>> test.read() == b''.join(ans) True >>> test.close() >>> import os >>> os.unlink('testbuf.gz') JSON Encoding and Decoding -------------------------- JSON serialization is used instead of pickle for some classes. >>> from nltk import jsontags >>> from nltk.jsontags import JSONTaggedEncoder, JSONTaggedDecoder, register_tag >>> @jsontags.register_tag ... class JSONSerializable: ... json_tag = 'JSONSerializable' ... ... def __init__(self, n): ... self.n = n ... ... def encode_json_obj(self): ... return self.n ... ... @classmethod ... def decode_json_obj(cls, obj): ... n = obj ... return cls(n) ... >>> JSONTaggedEncoder().encode(JSONSerializable(1)) '{"!JSONSerializable": 1}' >>> JSONTaggedDecoder().decode('{"!JSONSerializable": 1}').n 1