ai-content-maker/.venv/Lib/site-packages/gruut-2.2.3.dist-info/METADATA

Metadata-Version: 2.1
Name: gruut
Version: 2.2.3
Summary: A tokenizer, text cleaner, and phonemizer for many human languages.
Home-page: https://github.com/rhasspy/gruut
Author: Michael Hansen
Author-email: mike@rhasspy.org
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Babel <3.0.0,>=2.8.0
Requires-Dist: dateparser ~=1.1.0
Requires-Dist: gruut-ipa <1.0,>=0.12.0
Requires-Dist: gruut-lang-en ~=2.0.0
Requires-Dist: jsonlines ~=1.2.0
Requires-Dist: networkx <3.0.0,>=2.5.0
Requires-Dist: num2words <1.0.0,>=0.5.10
Requires-Dist: numpy <2.0.0,>=1.19.0
Requires-Dist: python-crfsuite ~=0.9.7
Requires-Dist: dataclasses ; python_version<"3.7"
Requires-Dist: types-dataclasses ; python_version<"3.7"
Requires-Dist: importlib-resources ; python_version<"3.9"
Provides-Extra: align
Requires-Dist: aeneas ~=1.7.3.0 ; extra == 'align'
Requires-Dist: pydub ~=0.24.1 ; extra == 'align'
Provides-Extra: all
Requires-Dist: hazm ~=0.7.0 ; extra == 'all'
Requires-Dist: conllu >=4.4 ; extra == 'all'
Requires-Dist: rapidfuzz >=1.4.1 ; extra == 'all'
Requires-Dist: aeneas ~=1.7.3.0 ; extra == 'all'
Requires-Dist: pydub ~=0.24.1 ; extra == 'all'
Requires-Dist: mishkal ~=0.4.0 ; extra == 'all'
Requires-Dist: codernitydb3 ~=0.6.0 ; extra == 'all'
Requires-Dist: phonetisaurus ~=0.3.0 ; extra == 'all'
Requires-Dist: gruut-lang-ar ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-cs ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-de ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-es ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-fa ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-fr ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-it ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-lb ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-nl ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-pt ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-ru ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-sv ~=2.0.0 ; extra == 'all'
Requires-Dist: gruut-lang-sw ~=2.0.0 ; extra == 'all'
Provides-Extra: ar
Requires-Dist: mishkal ~=0.4.0 ; extra == 'ar'
Requires-Dist: codernitydb3 ~=0.6.0 ; extra == 'ar'
Requires-Dist: gruut-lang-ar ~=2.0.0 ; extra == 'ar'
Provides-Extra: cs
Requires-Dist: gruut-lang-cs ~=2.0.0 ; extra == 'cs'
Provides-Extra: de
Requires-Dist: gruut-lang-de ~=2.0.0 ; extra == 'de'
Provides-Extra: es
Requires-Dist: gruut-lang-es ~=2.0.0 ; extra == 'es'
Provides-Extra: fa
Requires-Dist: hazm ~=0.7.0 ; extra == 'fa'
Requires-Dist: gruut-lang-fa ~=2.0.0 ; extra == 'fa'
Provides-Extra: fr
Requires-Dist: gruut-lang-fr ~=2.0.0 ; extra == 'fr'
Provides-Extra: g2p
Requires-Dist: phonetisaurus ~=0.3.0 ; extra == 'g2p'
Provides-Extra: it
Requires-Dist: gruut-lang-it ~=2.0.0 ; extra == 'it'
Provides-Extra: lb
Requires-Dist: gruut-lang-lb ~=2.0.0 ; extra == 'lb'
Provides-Extra: nl
Requires-Dist: gruut-lang-nl ~=2.0.0 ; extra == 'nl'
Provides-Extra: pt
Requires-Dist: gruut-lang-pt ~=2.0.0 ; extra == 'pt'
Provides-Extra: ru
Requires-Dist: gruut-lang-ru ~=2.0.0 ; extra == 'ru'
Provides-Extra: sv
Requires-Dist: gruut-lang-sv ~=2.0.0 ; extra == 'sv'
Provides-Extra: sw
Requires-Dist: gruut-lang-sw ~=2.0.0 ; extra == 'sw'
Provides-Extra: train
Requires-Dist: conllu >=4.4 ; extra == 'train'
Requires-Dist: rapidfuzz >=1.4.1 ; extra == 'train'
# Gruut
A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).
```python
from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)
```
which outputs:
```
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
```
Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
A [subset of SSML](#ssml) is also supported:
```python
from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
```
with the output:
```
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
```
See [the documentation](https://rhasspy.github.io/gruut/) for more details.
## Installation
```sh
pip install gruut
```
Languages besides English can be added during installation. For example, with French and Italian support:
```sh
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
```
The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.
You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).
gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`.
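The lookup location described above can be sketched with the standard library. `language_dir` is a hypothetical helper written for illustration, not part of gruut's API; it only shows how the `$XDG_CONFIG_HOME` default and the full language name combine into a path.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def language_dir(lang: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where files for `lang` would be expected.

    Hypothetical helper for illustration; `lang` must be the full
    language name (e.g. "de-de", not "de").
    """
    env = os.environ if env is None else env

    config_home = env.get("XDG_CONFIG_HOME")
    if config_home:
        base = Path(config_home)
    else:
        # Default described in the text: $HOME/.config
        home = env.get("HOME")
        base = (Path(home) if home else Path.home()) / ".config"

    return base / "gruut" / lang


print(language_dir("de-de", {"HOME": "/home/user"}))
```

Passing the environment as a mapping keeps the helper easy to test without touching the real `os.environ`.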
## Supported Languages
gruut currently supports:
* Arabic (`ar`)
* Czech (`cs` or `cs-cz`)
* German (`de` or `de-de`)
* English (`en` or `en-us`)
* Spanish (`es` or `es-es`)
* Farsi/Persian (`fa`)
* French (`fr` or `fr-fr`)
* Italian (`it` or `it-it`)
* Luxembourgish (`lb`)
* Dutch (`nl`)
* Russian (`ru` or `ru-ru`)
* Swedish (`sv` or `sv-se`)
* Swahili (`sw`)
The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).
## Dependencies
* Python 3.7 or higher
* Linux
    * Tested on Debian Bullseye
* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
    * Currency/number handling
    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
* gruut-ipa
    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
    * Part of speech tagging and grapheme to phoneme models
* [pydateparser](https://github.com/GLibAi/pydateparser)
    * Date parsing for multiple languages
## Numbers, Dates, and More
`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
The following types of expressions can be automatically expanded into words by `gruut`:
* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
    * Relies on `Babel` for parsing and `num2words` for verbalization
* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
    * Relies on `pydateparser` for parsing and both `Babel` and `num2words` for verbalization
* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
    * English only
    * Relies on `num2words` for verbalization
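The locale dependence of date parsing can be illustrated with the standard library alone. This is a sketch of the idea, not how gruut parses dates (it relies on the libraries listed above). `DATE_FORMATS` and `parse_numeric_date` are hypothetical names, and the day/month order assigned to each language is an assumption made for the example.

```python
from datetime import date, datetime

# Assumed field order per language; illustrative only, not gruut's tables
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # month first
    "it": "%d/%m/%Y",  # day first
}


def parse_numeric_date(text: str, lang: str) -> date:
    """Interpret a numeric date string according to the language's field order."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


print(parse_numeric_date("2/1/2000", "en-us"))  # 2000-02-01
print(parse_numeric_date("2/1/2000", "it"))  # 2000-01-02
```

The same string yields February 1st under the month-first order and January 2nd under the day-first order, which is why the sentence language matters before verbalization.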
## Command-Line Usage
The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).
The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.
### Plain Text
Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.
```sh
echo 'This, right here, is some "RAW" text!' \
| gruut --language en-us \
| jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
```
More information is available in the full JSON output:
```sh
gruut --language en-us 'More text.' | jq .
```
Output:
```json
{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "trailing_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}
```
For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.
Within each word, there is:
* `idx` - zero-based index of the word in the sentence
* `sent_idx` - zero-based index of the sentence in the input text
* `pos` - part of speech tag (if available)
* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
* `is_break` - `true` if "word" is a major or minor break
* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
* `is_spoken` - `true` if not a break or punctuation
See `python3 -m gruut --help` for more options.
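If jq is not available, the same fields can be read with Python's `json` module. The sketch below assumes one JSON object per line with the `words`, `text`, `phonemes`, and `is_spoken` properties shown above; `spoken_pronunciations` is a hypothetical helper, and the sample line is a trimmed copy of the documented output.

```python
import json

# A trimmed copy of the output above, serialized as a single JSONL line
line = json.dumps(
    {
        "idx": 0,
        "text": "More text.",
        "words": [
            {"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": True},
            {"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": True},
            {"text": ".", "phonemes": ["‖"], "is_spoken": False},
        ],
    },
    ensure_ascii=False,
)


def spoken_pronunciations(jsonl_line):
    """Return (word, phoneme string) pairs for the spoken words of one sentence."""
    sentence = json.loads(jsonl_line)
    return [
        (word["text"], " ".join(word["phonemes"]))
        for word in sentence["words"]
        if word["is_spoken"]  # skips breaks and punctuation
    ]


for text, phonemes in spoken_pronunciations(line):
    print(text, phonemes)
```

In a pipeline, the same function could be applied to each line read from `sys.stdin`.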
### SSML
A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:
* `<speak>` - wrap around SSML text
    * `lang` - set language for document
* `<p>` - paragraph
    * `lang` - set language for paragraph
* `<s>` - sentence (disables automatic sentence breaking)
    * `lang` - set language for sentence
* `<w>` / `<token>` - word (disables automatic tokenization)
    * `lang` - set language for word
    * `role` - set word role (see [word roles](#word-roles))
* `<lang lang="...">` - set language of inner text
* `<voice name="...">` - set voice of inner text
* `<say-as interpret-as="">` - force interpretation of inner text
    * `interpret-as` - one of "spell-out", "date", "number", "time", or "currency"
    * `format` - way to format text depending on `interpret-as`
        * number - one of "cardinal", "ordinal", "digits", "year"
        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
* `<break time="">` - pause for given amount of time
    * `time` - seconds ("123s") or milliseconds ("123ms")
* `<mark name="">` - user-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
    * `name` - name of mark
* `<sub alias="">` - substitute `alias` for inner text
* `<phoneme ph="...">` - supply phonemes for inner text
    * `ph` - phonemes for each word of inner text, separated by whitespace
* `<lexicon id="...">` - inline or external pronunciation lexicon
    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
    * `uri` - if empty or missing, lexicon is inline
    * One or more `<lexeme>` child elements with:
        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
        * `<grapheme>WORD</grapheme>` - word text
        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
* `<lookup ref="...">` - use pronunciation lexicon for child elements
    * `ref` - id from a `<lexicon id="...">`
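As a small illustration of the `<break time="">` formats listed above, the two accepted units can be normalized to milliseconds. `break_time_ms` is a hypothetical helper, not gruut's parser, and it assumes whole-number values like the "123s" and "123ms" examples.

```python
import re

# Whole-number value followed by "ms" or "s" (assumption: no fractions)
_BREAK_TIME = re.compile(r"^\s*(\d+)\s*(ms|s)\s*$")


def break_time_ms(time_attr: str) -> int:
    """Convert a break time such as "123s" or "123ms" to milliseconds."""
    match = _BREAK_TIME.match(time_attr)
    if match is None:
        raise ValueError(f"Unsupported break time: {time_attr!r}")

    value, unit = match.groups()
    return int(value) * (1000 if unit == "s" else 1)


print(break_time_ms("123s"))  # 123000
print(break_time_ms("123ms"))  # 123
```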
#### Word Roles
During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.
For `en-us`, the following additional roles are available from the part-of-speech tagger:
* `gruut:CD` - number
* `gruut:DT` - determiner
* `gruut:IN` - preposition or subordinating conjunction
* `gruut:JJ` - adjective
* `gruut:NN` - noun
* `gruut:PRP` - personal pronoun
* `gruut:RB` - adverb
* `gruut:VB` - verb
* `gruut:VBD` - verb (past tense)
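The way a role selects between pronunciations can be sketched with a toy lexicon. Everything here is illustrative: `LEXICON`, `role_for_tag`, and `lookup` are hypothetical names, the phonemes are copied from the "wound" example at the top of this document, and the tag assigned to each reading is an assumption.

```python
from typing import Dict, Optional, Tuple

# Toy lexicon keyed by (word, role); a role of None is the fallback entry.
# Which tag goes with which reading is assumed for the example.
LEXICON: Dict[Tuple[str, Optional[str]], str] = {
    ("wound", "gruut:VBD"): "w ˈaʊ n d",  # past tense of "wind"
    ("wound", "gruut:NN"): "w ˈu n d",  # an injury
    ("wound", None): "w ˈu n d",
}


def role_for_tag(tag: str) -> str:
    """Derive a word role from a part of speech tag as `gruut:<TAG>`."""
    return f"gruut:{tag}"


def lookup(word: str, pos_tag: Optional[str] = None) -> Optional[str]:
    """Prefer the role-specific entry, then fall back to the role-less one."""
    role = role_for_tag(pos_tag) if pos_tag else None
    return LEXICON.get((word, role)) or LEXICON.get((word, None))


print(lookup("wound", "VBD"))
print(lookup("wound", "NN"))
```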
#### Inline Lexicons
Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.
For example, the following document will yield three different pronunciations for the word "tomato":
``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
```
The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made up pronunciation in this case).
Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document:
``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
</speak>
```
This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
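To see what an inline lexicon contains, it can be read with the standard library's `xml.etree.ElementTree`. This is a sketch for inspecting such documents, not gruut's own SSML handling: `read_inline_lexicons` is a hypothetical helper, it ignores word roles, and it files a lexicon without an id under the empty string.

```python
import xml.etree.ElementTree as ET

SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
  </lexicon>
  <w>tomato</w>
</speak>"""

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id


def read_inline_lexicons(ssml: str) -> dict:
    """Map lexicon id -> {grapheme: [phonemes]} for lexicons under <speak>."""
    root = ET.fromstring(ssml)
    lexicons = {}

    for lexicon in root.iterfind("ssml:lexicon", NS):
        entries = lexicons.setdefault(lexicon.get(XML_ID, ""), {})
        for lexeme in lexicon.iterfind("ssml:lexeme", NS):
            grapheme = lexeme.findtext("ssml:grapheme", default="", namespaces=NS)
            phoneme = lexeme.findtext("ssml:phoneme", default="", namespaces=NS)
            # Phonemes are separated by whitespace
            entries[grapheme.strip()] = phoneme.split()

    return lexicons


print(read_inline_lexicons(SSML))
```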
## Intended Audience
gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
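The lookup-then-guess strategy can be sketched as follows. The names are hypothetical and the grapheme-to-phoneme model is replaced by a stand-in function, so this shows only the control flow, not gruut's implementation; lowercasing the word before lookup is also an assumption.

```python
from typing import Callable, Dict, List


def pronounce(
    word: str,
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]],
) -> List[str]:
    """Use the lexicon entry if one exists, otherwise guess the pronunciation."""
    phonemes = lexicon.get(word.lower())
    if phonemes is None:
        phonemes = guess(word)
    return phonemes


# Tiny lexicon with the pronunciation of "text" shown earlier in this document
lexicon = {"text": ["t", "ˈɛ", "k", "s", "t"]}


def spell_out(word: str) -> List[str]:
    """Stand-in for a trained grapheme-to-phoneme model: one symbol per letter."""
    return list(word.lower())


print(pronounce("text", lexicon, spell_out))
print(pronounce("gruut", lexicon, spell_out))
```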
For each supported language, gruut includes:
* A word pronunciation lexicon built from open source data
    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
* A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:
* A pre-trained part of speech tagger built from open source data
    * See [universal dependencies](https://universaldependencies.org/)