Metadata-Version: 2.1
Name: gruut
Version: 2.2.3
Summary: A tokenizer, text cleaner, and phonemizer for many human languages.
Home-page: https://github.com/rhasspy/gruut
Author: Michael Hansen
Author-email: mike@rhasspy.org
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Babel <3.0.0,>=2.8.0
Requires-Dist: dateparser ~=1.1.0
Requires-Dist: gruut-ipa <1.0,>=0.12.0
Requires-Dist: gruut-lang-en ~=2.0.0
Requires-Dist: jsonlines ~=1.2.0
Requires-Dist: networkx <3.0.0,>=2.5.0
Requires-Dist: num2words <1.0.0,>=0.5.10
Requires-Dist: numpy <2.0.0,>=1.19.0
Requires-Dist: python-crfsuite ~=0.9.7
Requires-Dist: dataclasses ; python_version<"3.7"
Requires-Dist: types-dataclasses ; python_version<"3.7"
Requires-Dist: importlib-resources ; python_version<"3.9"
Provides-Extra: align
Requires-Dist: aeneas ~=1.7.3.0 ; extra == 'align'
Requires-Dist: pydub ~=0.24.1 ; extra == 'align'
Provides-Extra: all
Requires-Dist: hazm ~=0.7.0 ; extra == 'all'
Requires-Dist: conllu >=4.4 ; extra == 'all'
Requires-Dist: rapidfuzz >=1.4.1 ; extra == 'all'
Requires-Dist: aeneas ~=1.7.3.0 ; extra == 'all'
Requires-Dist: pydub ~=0.24.1 ; extra == 'all'
Requires-Dist: mishkal ~=0.4.0 ; extra == 'all'
Requires-Dist: codernitydb3 ~=0.6.0 ; extra == 'all'
Requires-Dist: phonetisaurus ~=0.3.0 ; extra == 'all'
Requires-Dist: gruut-lang-ar ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-cs ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-de ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-es ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-fa ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-fr ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-it ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-lb ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-nl ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-pt ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-ru ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-sv ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-sw ~=2.0.0 ; extra == 'all'
Provides-Extra: ar
Requires-Dist: mishkal ~=0.4.0 ; extra == 'ar'
Requires-Dist: codernitydb3 ~=0.6.0 ; extra == 'ar'
Requires-Dist: gruut-lang-ar ~=2.0.0 ; extra == 'ar'
Provides-Extra: cs
Requires-Dist: gruut-lang-cs ~=2.0.0 ; extra == 'cs'
Provides-Extra: de
Requires-Dist: gruut-lang-de ~=2.0.0 ; extra == 'de'
Provides-Extra: es
Requires-Dist: gruut-lang-es ~=2.0.0 ; extra == 'es'
Provides-Extra: fa
Requires-Dist: hazm ~=0.7.0 ; extra == 'fa'
Requires-Dist: gruut-lang-fa ~=2.0.0 ; extra == 'fa'
Provides-Extra: fr
Requires-Dist: gruut-lang-fr ~=2.0.0 ; extra == 'fr'
Provides-Extra: g2p
Requires-Dist: phonetisaurus ~=0.3.0 ; extra == 'g2p'
Provides-Extra: it
Requires-Dist: gruut-lang-it ~=2.0.0 ; extra == 'it'
Provides-Extra: lb
Requires-Dist: gruut-lang-lb ~=2.0.0 ; extra == 'lb'
Provides-Extra: nl
Requires-Dist: gruut-lang-nl ~=2.0.0 ; extra == 'nl'
Provides-Extra: pt
Requires-Dist: gruut-lang-pt ~=2.0.0 ; extra == 'pt'
Provides-Extra: ru
Requires-Dist: gruut-lang-ru ~=2.0.0 ; extra == 'ru'
Provides-Extra: sv
Requires-Dist: gruut-lang-sv ~=2.0.0 ; extra == 'sv'
Provides-Extra: sw
Requires-Dist: gruut-lang-sw ~=2.0.0 ; extra == 'sw'
Provides-Extra: train
Requires-Dist: conllu >=4.4 ; extra == 'train'
Requires-Dist: rapidfuzz >=1.4.1 ; extra == 'train'

# Gruut

A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).

```python
from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)
```

which outputs:

```
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
```

Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.

A [subset of SSML](#ssml) is also supported:

```python
from gruut import sentences

# One sentence per <s> element; the second overrides the document language.
ssml_text = """<speak xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
```

with the output:

```
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
```

See [the documentation](https://rhasspy.github.io/gruut/) for more details.

## Installation

```sh
pip install gruut
```

Languages besides English can be added during installation. For example, with French and Italian support:

```sh
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
```

The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.

You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).

gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed.
Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`.

## Supported Languages

gruut currently supports:

* Arabic (`ar`)
* Czech (`cs` or `cs-cz`)
* German (`de` or `de-de`)
* English (`en` or `en-us`)
* Spanish (`es` or `es-es`)
* Farsi/Persian (`fa`)
* French (`fr` or `fr-fr`)
* Italian (`it` or `it-it`)
* Luxembourgish (`lb`)
* Dutch (`nl`)
* Russian (`ru` or `ru-ru`)
* Swedish (`sv` or `sv-se`)
* Swahili (`sw`)

The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).

## Dependencies

* Python 3.7 or higher
* Linux
    * Tested on Debian Bullseye
* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
    * Currency/number handling
    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
* gruut-ipa
    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
    * Part of speech tagging and grapheme to phoneme models
* [pydateparser](https://github.com/GLibAi/pydateparser)
    * Date parsing for multiple languages

## Numbers, Dates, and More

`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., ``).
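
As a rough illustration of why the language matters for numeric dates, the sketch below parses the same string under two layouts using only the Python standard library. The `DATE_LAYOUTS` table and the `parse_numeric_date` helper are hypothetical names made up for this example; they are not part of gruut, which delegates parsing to its date parsing dependency and `Babel`.

```python
from datetime import datetime

# Hypothetical table mapping a language code to a numeric date layout.
# This is an assumption for illustration, not gruut's own data.
DATE_LAYOUTS = {
    "en-us": "%m/%d/%Y",  # month first
    "it": "%d/%m/%Y",     # day first
}


def parse_numeric_date(text: str, lang: str) -> datetime:
    """Parse a numeric date using the layout assumed for the language."""
    return datetime.strptime(text, DATE_LAYOUTS[lang])


us_date = parse_numeric_date("2/1/2000", "en-us")
it_date = parse_numeric_date("2/1/2000", "it")

print(us_date.month, us_date.day)  # 2 1 (February first)
print(it_date.month, it_date.day)  # 1 2 (second of January)
```

This matches the SSML example near the top of this README, where "2/1/2000" is spoken as "February first" in the `en-US` sentence and as "due gennaio" in the `it` sentence.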

The following types of expressions can be automatically expanded into words by `gruut`:

* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
    * Relies on `Babel` for parsing and `num2words` for verbalization
* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
    * Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
    * English only
    * Relies on `num2words` for verbalization

## Command-Line Usage

The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE>` or with the `gruut` command (from `setup.py`).

The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.

### Plain Text

Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.

```sh
echo 'This, right here, is some "RAW" text!' \
   | gruut --language en-us \
   | jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
```

More information is available in the full JSON output:

```sh
gruut --language en-us 'More text.' | jq .
```

Output:

```json
{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "training_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}
```

For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.

Within each word, there is:

* `idx` - zero-based index of the word in the sentence
* `sent_idx` - zero-based index of the sentence in the input text
* `pos` - part of speech tag (if available)
* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
* `is_break` - `true` if "word" is a major or minor break
* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
* `is_spoken` - `true` if not a break or punctuation

See `python3 -m gruut --help` for more options.

### SSML

A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:

* `<speak>` - wrap around SSML text
    * `lang` - set language for document
* `<p>` - paragraph
    * `lang` - set language for paragraph
* `<s>` - sentence (disables automatic sentence breaking)
    * `lang` - set language for sentence
* `<w>` / `<token>` - word (disables automatic tokenization)
    * `lang` - set language for word
    * `role` - set word role (see [word roles](#word-roles))
* `<lang>` - set language of inner text
* `<voice>` - set voice of inner text
* `<say-as>` - force interpretation of inner text
    * `interpret-as` - one of "spell-out", "date", "number", "time", or "currency"
    * `format` - way to format text depending on `interpret-as`
        * number - one of "cardinal", "ordinal", "digits", "year"
        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
* `<break>` - pause for given amount of time
    * time - seconds ("123s") or milliseconds ("123ms")
* `<mark>` - user-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
    * name - name of mark
* `<sub>` - substitute `alias` for inner text
* `<phoneme>` - supply phonemes for inner text
    * `ph` - phonemes for each word of inner text, separated by whitespace
* `<lexicon>` - inline or external pronunciation lexicon
    * `id` - unique id of lexicon (used in `<lookup>`)
    * `uri` - if empty or missing, lexicon is inline
    * One or more `<lexeme>` child elements with:
        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
        * `<grapheme>WORD</grapheme>` - word text
        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
* `<lookup>` - use pronunciation lexicon for child elements
    * `ref` - id from a `<lexicon>`

#### Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.

For `en-us`, the following additional roles are available from the part-of-speech tagger:

* `gruut:CD` - number
* `gruut:DT` - determiner
* `gruut:IN` - preposition or subordinating conjunction
* `gruut:JJ` - adjective
* `gruut:NN` - noun
* `gruut:PRP` - personal pronoun
* `gruut:RB` - adverb
* `gruut:VB` - verb
* `gruut:VBD` - verb (past tense)

#### Inline Lexicons

Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.

For example, the following document will yield three different pronunciations for the word "tomato":

``` xml
<speak xml:lang="en-US">
  <lexicon xml:id="test">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme role="fake-role">
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
```

The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made up pronunciation in this case).

Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document:

``` xml
<speak xml:lang="en-US">
  <lexicon>
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>
```

This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a ``).

## Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer).
Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).

For each supported language, gruut includes:

* A word pronunciation lexicon built from open source data
    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
* A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

* A pre-trained part of speech tagger built from open source data:
    * See [universal dependencies](https://universaldependencies.org/)
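
The lookup-then-guess approach described in this section can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not gruut's implementation: `LEXICON`, `guess_letters`, and `phonemize` are hypothetical names, the lexicon holds a single entry copied from the JSON example in this README, and the "model" simply returns the word's letters where a trained grapheme-to-phoneme model would predict phonemes.

```python
from typing import Callable, Dict, List

Pronunciation = List[str]

# Toy lexicon (assumption): one entry taken from the JSON output example.
LEXICON: Dict[str, Pronunciation] = {
    "text": ["t", "ˈɛ", "k", "s", "t"],
}


def guess_letters(word: str) -> Pronunciation:
    """Stand-in for a trained G2P model: returns the letters unchanged."""
    return list(word)


def phonemize(
    word: str,
    lexicon: Dict[str, Pronunciation] = LEXICON,
    g2p: Callable[[str], Pronunciation] = guess_letters,
) -> Pronunciation:
    """Prefer the lexicon entry; fall back to the guesser for unknown words."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return g2p(key)


print(phonemize("Text"))   # found in the lexicon
print(phonemize("gruut"))  # not in the lexicon, so the guesser is used
```

Keeping the guesser as a plain callable makes it easy to swap in a real model later without touching the lookup logic.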