Metadata-Version: 2.1
Name: bnunicodenormalizer
Version: 0.1.6
Summary: Bangla Unicode Normalization Toolkit
Home-page: https://github.com/mnansary/bnUnicodeNormalizer
Author: Bengali.AI
Author-email: research.bengaliai@gmail.com
License: MIT
Keywords: bangla,unicode,text normalization,indic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Education
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENSE

# bnUnicodeNormalizer
Bangla Unicode Normalization for word normalization

# install

```bash
pip install bnunicodenormalizer
```

# usage

**initialization and cleaning**

```python
# import
from bnunicodenormalizer import Normalizer
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'াটোবাকো'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
```

> output

```
Non-norm:াটোবাকো; Norm:টোবাকো
--------------------------------------------------
{'given': 'াটোবাকো',
 'normalized': 'টোবাকো',
 'ops': [{'after': 'টোবাকো',
          'before': 'াটোবাকো',
          'operation': 'InvalidUnicode'}]}
```

**a call to the normalizer returns a dictionary in the following format**

* ```given``` = the provided text
* ```normalized``` = the normalized text (None if the text length becomes 0 during the operations)
* ```ops``` = list of operations (dictionaries) that were executed on the given text to create the normalized text
* each dictionary in ```ops``` has:
    * ```operation``` : the name of the operation / problem in the given text
    * ```before``` : what the text looked like before the specific operation
    * ```after``` : what the text looks like after the specific operation

**allow english text**

```python
# initialize without english (default)
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"]) # --> returns None
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])
```

> output

```
without english: None
with english: ASD123
```

# Initialization: Bangla Normalizer

```python
'''
initialize a normalizer
args:
    allow_english       : allow english letters, numbers and punctuations [default:False]
    keep_legacy_symbols : legacy symbols will be considered as valid unicodes [default:False]
                            '৺' : Isshar
                            '৻' : Ganda
                            'ঀ' : Anji (not '৭')
                            'ঌ' : li
                            'ৡ' : dirgho li
                            'ঽ' : Avagraha
                            'ৠ' : Vocalic Rr (not 'ঋ')
                            '৲' : rupi
                            '৴' : currency numerator 1
                            '৵' : currency numerator 2
                            '৶' : currency numerator 3
                            '৷' : currency numerator 4
                            '৸' : currency numerator one less than the denominator
                            '৹' : currency denominator sixteen
    legacy_maps         : a dictionary for changing legacy symbols into more commonly used unicodes;
                          a default legacy map is included in the language class as well:
                            legacy_maps={'ঀ':'৭',
                                         'ঌ':'৯',
                                         'ৡ':'৯',
                                         '৵':'৯',
                                         '৻':'ৎ',
                                         'ৠ':'ঋ',
                                         'ঽ':'ই'}
                          pass:
                            * legacy_maps=None             : keep the legacy symbols as they are
                            * legacy_maps="default"        : use the default legacy map
                            * legacy_maps=<custom dict>    : map your desired legacy symbols to any symbols you want
                                * the keys in the custom dict must belong to the legacy symbols
                                * the values in the custom dict must belong to either vowels, consonants, numbers or diacritics:

                                    vowels               = ['অ', 'আ', 'ই', 'ঈ', 'উ', 'ঊ', 'ঋ', 'এ', 'ঐ', 'ও', 'ঔ']
                                    consonants           = ['ক', 'খ', 'গ', 'ঘ', 'ঙ', 'চ', 'ছ', 'জ', 'ঝ', 'ঞ',
                                                            'ট', 'ঠ', 'ড', 'ঢ', 'ণ', 'ত', 'থ', 'দ', 'ধ', 'ন',
                                                            'প', 'ফ', 'ব', 'ভ', 'ম', 'য', 'র', 'ল', 'শ', 'ষ',
                                                            'স', 'হ', 'ড়', 'ঢ়', 'য়', 'ৎ']
                                    numbers              = ['০', '১', '২', '৩', '৪', '৫', '৬', '৭', '৮', '৯']
                                    vowel_diacritics     = ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ']
                                    consonant_diacritics = ['ঁ', 'ং', 'ঃ']

                            > for example, you may want to map 'ঽ' (Avagraha) as 'হ' based on visual similarity (default:'ই')

legacy conditions: keep_legacy_symbols and legacy_maps operate as follows

    case-1) keep_legacy_symbols=True and legacy_maps=None :
        all legacy symbols will be considered valid unicodes. None of them will be changed
    case-2) keep_legacy_symbols=True and legacy_maps=valid dictionary, example {'ঀ':'ক'} :
        all legacy symbols will be considered valid unicodes. Only 'ঀ' will be changed to 'ক'; others will be untouched
    case-3) keep_legacy_symbols=False and legacy_maps=None :
        all legacy symbols will be removed
    case-4) keep_legacy_symbols=False and legacy_maps=valid dictionary, example {'ঽ':'ই','ৠ':'ঋ'} :
        'ঽ' will be changed to 'ই' and 'ৠ' will be changed to 'ঋ'. All other legacy symbols will be removed
'''
```

```python
my_legacy_maps={'ঌ':'ই',
                'ৡ':'ই',
                '৵':'ই',
                'ৠ':'ই',
                'ঽ':'ই'}
text="৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹"
# case 1
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=None)
print("case-1 normalized text:  ",norm(text)["normalized"])
# case 2
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=my_legacy_maps)
print("case-2 normalized text:  ",norm(text)["normalized"])
# case 2-default
norm=Normalizer(keep_legacy_symbols=True)
print("case-2 default normalized text:  ",norm(text)["normalized"])
# case 3
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=None)
print("case-3 normalized text:  ",norm(text)["normalized"])
# case 4
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=my_legacy_maps)
print("case-4 normalized text:  ",norm(text)["normalized"])
# case 4-default
norm=Normalizer(keep_legacy_symbols=False)
print("case-4 default normalized text:  ",norm(text)["normalized"])
```

> output

```
case-1 normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-2 normalized text:   ৺,৻,ঀ,ই,ই,ই,ই,৲,৴,ই,৶,৷,৸,৹
case-2 default normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-3 normalized text:   ,,,,,,,,,,,,,
case-4 normalized text:   ,,,ই,ই,ই,ই,,,ই,,,,
case-4 default normalized text:   ,,,,,,,,,,,,,
```

# Operations

* base operations available for all indic languages:

```python
self.word_level_ops={"LegacySymbols"    :self.mapLegacySymbols,
                     "BrokenDiacritics" :self.fixBrokenDiacritics}

self.decomp_level_ops={"BrokenNukta"              :self.fixBrokenNukta,
                       "InvalidUnicode"           :self.cleanInvalidUnicodes,
                       "InvalidConnector"         :self.cleanInvalidConnector,
                       "FixDiacritics"            :self.cleanDiacritics,
                       "VowelDiacriticAfterVowel" :self.cleanVowelDiacriticComingAfterVowel}
```

* extensions for bangla:

```python
self.decomp_level_ops["ToAndHosontoNormalize"]        = self.normalizeToandHosonto
# invalid folas
self.decomp_level_ops["NormalizeConjunctsDiacritics"] = self.cleanInvalidConjunctDiacritics
# complex root cleanup
self.decomp_level_ops["ComplexRootNormalization"]     = self.convertComplexRoots
```

# Normalization Problem Examples

**In all examples, (a) is the non-normalized form and (b) is the normalized form**

* Broken diacritics:

```
# Example-1:
(a)'আরো'==(b)'আরো' -> False
    (a) breaks as:['আ', 'র', 'ে', 'া']
    (b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে -> False
    (a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
    (b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি -> False
    (a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
    (b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
```

* Nukta Normalization:

```
# Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় -> False
    (a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
    (b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
# Example-2:
(a)রযে়ছে==(b)রয়েছে -> False
    (a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
    (b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
# Example-3:
(a)জ়ন্য==(b)জন্য -> False
    (a) breaks as:['জ', '়', 'ন', '্', 'য']
    (b) breaks as:['জ', 'ন', '্', 'য']
```

* Invalid hosonto:

```
# Example-1:
(a)দুই্টি==(b)দুইটি -> False
    (a) breaks as:['দ', 'ু', 'ই', '্', 'ট', 'ি']
    (b) breaks as:['দ', 'ু', 'ই', 'ট', 'ি']
# Example-2:
(a)এ্তে==(b)এতে -> False
    (a) breaks as:['এ', '্', 'ত', 'ে']
    (b) breaks as:['এ', 'ত', 'ে']
# Example-3:
(a)নেট্ওয়ার্ক==(b)নেটওয়ার্ক -> False
    (a) breaks as:['ন', 'ে', 'ট', '্', 'ও', 'য়', 'া', 'র', '্', 'ক']
    (b) breaks as:['ন', 'ে', 'ট', 'ও', 'য়', 'া', 'র', '্', 'ক']
# Example-4:
(a)এস্আই==(b)এসআই -> False
    (a) breaks as:['এ', 'স', '্', 'আ', 'ই']
    (b) breaks as:['এ', 'স', 'আ', 'ই']
# Example-5:
(a)'চু্ক্তি'==(b)'চুক্তি' -> False
    (a) breaks as:['চ', 'ু', '্', 'ক', '্', 'ত', 'ি']
    (b) breaks as:['চ', 'ু', 'ক', '্', 'ত', 'ি']
# Example-6:
(a)'যু্ক্ত'==(b)'যুক্ত' -> False
    (a) breaks as:['য', 'ু', '্', 'ক', '্', 'ত']
    (b) breaks as:['য', 'ু', 'ক', '্', 'ত']
# Example-7:
(a)'কিছু্ই'==(b)'কিছুই' -> False
    (a) breaks as:['ক', 'ি', 'ছ', 'ু', '্', 'ই']
    (b) breaks as:['ক', 'ি', 'ছ', 'ু', 'ই']
```

* To+hosonto:

```
# Example-1:
(a)বুত্পত্তি==(b)বুৎপত্তি -> False
    (a) breaks as:['ব', 'ু', 'ত', '্', 'প', 'ত', '্', 'ত', 'ি']
    (b) breaks as:['ব', 'ু', 'ৎ', 'প', 'ত', '্', 'ত', 'ি']
# Example-2:
(a)উত্স==(b)উৎস -> False
    (a) breaks as:['উ', 'ত', '্', 'স']
    (b) breaks as:['উ', 'ৎ', 'স']
```

* Unwanted doubles (consecutive doubles):

```
# Example-1:
(a)'যুুদ্ধ'==(b)'যুদ্ধ' -> False
    (a) breaks as:['য', 'ু', 'ু', 'দ', '্', 'ধ']
    (b) breaks as:['য', 'ু', 'দ', '্', 'ধ']
# Example-2:
(a)'দুুই'==(b)'দুই' -> False
    (a) breaks as:['দ', 'ু', 'ু', 'ই']
    (b) breaks as:['দ', 'ু', 'ই']
# Example-3:
(a)'প্রকৃৃতির'==(b)'প্রকৃতির' -> False
    (a) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ৃ', 'ত', 'ি', 'র']
    (b) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ত', 'ি', 'র']
# Example-4:
(a)আমাকোা==(b)'আমাকো' -> False
    (a) breaks as:['আ', 'ম', 'া', 'ক', 'ে', 'া', 'া']
    (b) breaks as:['আ', 'ম', 'া', 'ক', 'ো']
```

* Vowels and modifiers followed by vowel diacritics:

```
# Example-1:
(a)উুলু==(b)উলু -> False
    (a) breaks as:['উ', 'ু', 'ল', 'ু']
    (b) breaks as:['উ', 'ল', 'ু']
# Example-2:
(a)আর্কিওোলজি==(b)আর্কিওলজি -> False
    (a) breaks as:['আ', 'র', '্', 'ক', 'ি', 'ও', 'ো', 'ল', 'জ', 'ি']
    (b) breaks as:['আ', 'র', '্', 'ক', 'ি', 'ও', 'ল', 'জ', 'ি']
# Example-3:
(a)একএে==(b)একত্রে -> False
    (a) breaks as:['এ', 'ক', 'এ', 'ে']
    (b) breaks as:['এ', 'ক', 'ত', '্', 'র', 'ে']
```

* Repeated folas:

```
# Example-1:
(a)গ্র্রামকে==(b)গ্রামকে -> False
    (a) breaks as:['গ', '্', 'র', '্', 'র', 'া', 'ম', 'ক', 'ে']
    (b) breaks as:['গ', '্', 'র', 'া', 'ম', 'ক', 'ে']
```
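The "breaks as" lists above are simply the per-codepoint view of each string, i.e. ```list(word)```. A minimal stdlib sketch (independent of this package; the variable names are illustrative) shows how a broken and a normalized word can render identically while differing at the codepoint level, using Broken diacritics Example-1:

```python
import unicodedata

# broken form types ে (VOWEL SIGN E) + া (VOWEL SIGN AA)
# instead of the single vowel sign ো (VOWEL SIGN O)
broken = '\u0986\u09b0\u09c7\u09be'  # আ + র + ে + া
proper = '\u0986\u09b0\u09cb'        # আ + র + ো

print(broken == proper)  # False, although both can render as 'আরো'
print(list(broken))      # the "breaks as" view: ['আ', 'র', 'ে', 'া']
print(list(proper))      # ['আ', 'র', 'ো']

# naming each codepoint shows exactly where the two words diverge
for ch in broken:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Printing the codepoint names this way is also a handy sanity check when preparing an issue report for a word that does not normalize as expected.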
## IMPORTANT NOTE

**The normalization is purely based on how bangla text is used in ```Bangladesh``` (bn:bd). It does not necessarily cover every variation of textual content found in other regions.**

# unit testing

* clone the repository
* change working directory to ```tests```
* run: ```python3 -m unittest test_normalizer.py```

# Issue Reporting

* for reporting an issue, please provide the following information:
    * the invalid text
    * the expected valid text
    * why the output is expected
* clone the repository
* add a test case in **tests/test_normalizer.py** after **line no:91**

```python
# Dummy Non-Bangla, Numbers and Space cases / Invalid start-end cases
# english
self.assertEqual(norm('ASD1234')["normalized"],None)
self.assertEqual(ennorm('ASD1234')["normalized"],'ASD1234')
# random
self.assertEqual(norm('িত')["normalized"],'ত')
self.assertEqual(norm('সং্যুক্তি')["normalized"],"সংযুক্তি")
# Ending
self.assertEqual(norm("অজানা্")["normalized"],"অজানা")
#--------------------------------------------- insert your assertions here----------------------------------------
'''
    ### case: give a comment about your case
    ## (a) invalid text==(b) valid text <---- an example of your case
    self.assertEqual(norm(invalid text)["normalized"],expected output)
    or
    self.assertEqual(ennorm(invalid text)["normalized"],expected output) <----- for including english text
'''
# your case goes here-
```

* perform the unit testing
* make sure the unit test fails under true conditions

# Indic Base Normalizer

* to use the indic language normalizer for 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam', 'sylhetinagri':

```python
from bnunicodenormalizer import IndicNormalizer
norm=IndicNormalizer('devanagari')
```

* initialization

```python
'''
initialize a normalizer
args:
    language      : language identifier from 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam', 'sylhetinagri'
    allow_english : allow english letters, numbers and punctuations [default:False]
'''
```

# ABOUT US

* Authors: [Bengali.AI](https://bengali.ai/) in association with the OCR Team, [APSIS Solutions Limited](https://apsissolutions.com/)
* **Cite the Bengali.AI multipurpose grapheme dataset paper**

```bibtex
@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}
```

Change Log
===========

0.0.5 (9/03/2022)
-------------------
- added details for execution map
- checkop typo correction

0.0.6 (9/03/2022)
-------------------
- broken diacritics op addition

0.0.7 (11/03/2022)
-------------------
- assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- legacy symbols option
- legacy mapper added
- added bn:bd declaration

0.0.8 (14/03/2022)
-------------------
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction

0.0.9 (15/04/2022)
-------------------
- base normalizer
- language class
- bangla extension
- complex root normalization

0.0.10 (15/04/2022)
-------------------
- added conjuncts
- exception for english words

0.0.11 (15/04/2022)
-------------------
- fixed no-space char issue for bangla

0.0.12 (26/04/2022)
-------------------
- fixed consonant orders

0.0.13 (26/04/2022)
-------------------
- fixed non-char followed by diacritics

0.0.14 (01/05/2022)
-------------------
- word based normalization
- encoding fix

0.0.15 (02/05/2022)
-------------------
- import correction

0.0.16 (02/05/2022)
-------------------
- local variable issue

0.0.17 (17/05/2022)
-------------------
- nukta mod break

0.0.18 (08/06/2022)
-------------------
- no-space chars fix
0.0.19 (15/06/2022)
-------------------
- no-space chars further fix
- base_bangla_compose to avoid false op flags
- added foreign conjuncts

0.0.20 (01/08/2022)
-------------------
- এ্যা replacement correction

0.0.21 (01/08/2022)
-------------------
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts

0.0.22 (22/08/2022)
-------------------
- \u200d combination limiting

0.0.23 (23/08/2022)
-------------------
- \u200d condition change

0.0.24 (26/08/2022)
-------------------
- \u200d error handling

0.0.25 (10/09/22)
-------------------
- removed unnecessary operations: fixRefOrder, fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'

0.1.0 (20/10/22)
-------------------
- added indic parser
- fixed language class

0.1.1 (21/10/22)
-------------------
- added nukta and diacritic maps for indics
- cleaned conjuncts for now
- fixed issues with no-space and connector

0.1.2 (10/12/22)
-------------------
- allow halant ending for indic languages except bangla

0.1.3 (10/12/22)
-------------------
- broken char break cases for halant

0.1.4 (01/01/23)
-------------------
- added sylhetinagri

0.1.5 (01/01/23)
-------------------
- cleaned panjabi double quotes in diac map

0.1.6 (15/04/23)
-------------------
- added bangla punctuations