Metadata-Version: 2.1
Name: bnunicodenormalizer
Version: 0.1.6
Summary: Bangla Unicode Normalization Toolkit
Home-page: https://github.com/mnansary/bnUnicodeNormalizer
Author: Bengali.AI
Author-email: research.bengaliai@gmail.com
License: MIT
Keywords: bangla,unicode,text normalization,indic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Education
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Description-Content-Type: text/markdown
License-File: LICENSE

# bnUnicodeNormalizer
Bangla Unicode Normalization for word normalization

# install

```bash
pip install bnunicodenormalizer
```

# usage

**initialization and cleaning**

```python
# import
from bnunicodenormalizer import Normalizer
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'াটোবাকো'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
```

> output

```
Non-norm:াটোবাকো; Norm:টোবাকো
--------------------------------------------------
{'given': 'াটোবাকো',
 'normalized': 'টোবাকো',
 'ops': [{'after': 'টোবাকো',
          'before': 'াটোবাকো',
          'operation': 'InvalidUnicode'}]}
```

**a call to the normalizer returns a dictionary in the following format**

* ```given``` = the provided text
* ```normalized``` = the normalized text (None if the text length becomes 0 during the operations)
* ```ops``` = list of operations (dictionaries) that were executed on the given text to create the normalized text
* each dictionary in ```ops``` has:
    * ```operation``` : the name of the operation / problem in the given text
    * ```before``` : what the text looked like before the specific operation
    * ```after``` : what the text looks like after the specific operation

**allow english text**

```python
# initialize without english (default)
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"]) # --> returns None
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])
```

> output

```
without english: None
with english: ASD123
```

# Initialization: Bangla Normalizer

```python
'''
initialize a normalizer
args:
    allow_english       : allow english letters, numbers and punctuations [default:False]
    keep_legacy_symbols : legacy symbols will be considered as valid unicodes [default:False]
                            '৺' : Isshar
                            '৻' : Ganda
                            'ঀ' : Anji (not '৭')
                            'ঌ' : li
                            'ৡ' : dirgho li
                            'ঽ' : Avagraha
                            'ৠ' : Vocalic Rr (not 'ঋ')
                            '৲' : rupi
                            '৴' : currency numerator 1
                            '৵' : currency numerator 2
                            '৶' : currency numerator 3
                            '৷' : currency numerator 4
                            '৸' : currency numerator one less than the denominator
                            '৹' : currency denominator sixteen
    legacy_maps         : a dictionary for changing legacy symbols into more commonly used unicodes;
                          a default legacy map is included in the language class as well:
                            legacy_maps={'ঀ':'৭',
                                         'ঌ':'৯',
                                         'ৡ':'৯',
                                         '৵':'৯',
                                         '৻':'ৎ',
                                         'ৠ':'ঋ',
                                         'ঽ':'ই'}
                          pass:
                            * legacy_maps=None             : keep the legacy symbols as they are
                            * legacy_maps="default"        : use the default legacy map
                            * legacy_maps=<custom dict>    : map your desired legacy symbols to any symbols you want
                                * the keys in the custom dict must belong to the legacy symbols
                                * the values in the custom dict must belong to either vowels, consonants, numbers or diacritics:

                                    vowels               = ['অ', 'আ', 'ই', 'ঈ', 'উ', 'ঊ', 'ঋ', 'এ', 'ঐ', 'ও', 'ঔ']
                                    consonants           = ['ক', 'খ', 'গ', 'ঘ', 'ঙ', 'চ', 'ছ', 'জ', 'ঝ', 'ঞ',
                                                            'ট', 'ঠ', 'ড', 'ঢ', 'ণ', 'ত', 'থ', 'দ', 'ধ', 'ন',
                                                            'প', 'ফ', 'ব', 'ভ', 'ম', 'য', 'র', 'ল', 'শ', 'ষ',
                                                            'স', 'হ', 'ড়', 'ঢ়', 'য়', 'ৎ']
                                    numbers              = ['০', '১', '২', '৩', '৪', '৫', '৬', '৭', '৮', '৯']
                                    vowel_diacritics     = ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ']
                                    consonant_diacritics = ['ঁ', 'ং', 'ঃ']

                            > for example, you may want to map 'ঽ' (Avagraha) as 'হ' based on visual similarity (default:'ই')

legacy conditions: keep_legacy_symbols and legacy_maps operate as follows

    case-1) keep_legacy_symbols=True and legacy_maps=None :
        all legacy symbols will be considered valid unicodes. None of them will be changed
    case-2) keep_legacy_symbols=True and legacy_maps=valid dictionary, example {'ঀ':'ক'} :
        all legacy symbols will be considered valid unicodes. Only 'ঀ' will be changed to 'ক'; others will be untouched
    case-3) keep_legacy_symbols=False and legacy_maps=None :
        all legacy symbols will be removed
    case-4) keep_legacy_symbols=False and legacy_maps=valid dictionary, example {'ঽ':'ই','ৠ':'ঋ'} :
        'ঽ' will be changed to 'ই' and 'ৠ' will be changed to 'ঋ'. All other legacy symbols will be removed
'''
```

```python
my_legacy_maps={'ঌ':'ই',
                'ৡ':'ই',
                '৵':'ই',
                'ৠ':'ই',
                'ঽ':'ই'}
text="৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹"
# case 1
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=None)
print("case-1 normalized text:  ",norm(text)["normalized"])
# case 2
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=my_legacy_maps)
print("case-2 normalized text:  ",norm(text)["normalized"])
# case 2-default
norm=Normalizer(keep_legacy_symbols=True)
print("case-2 default normalized text:  ",norm(text)["normalized"])
# case 3
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=None)
print("case-3 normalized text:  ",norm(text)["normalized"])
# case 4
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=my_legacy_maps)
print("case-4 normalized text:  ",norm(text)["normalized"])
# case 4-default
norm=Normalizer(keep_legacy_symbols=False)
print("case-4 default normalized text:  ",norm(text)["normalized"])
```

> output

```
case-1 normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-2 normalized text:   ৺,৻,ঀ,ই,ই,ই,ই,৲,৴,ই,৶,৷,৸,৹
case-2 default normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-3 normalized text:   ,,,,,,,,,,,,,
case-4 normalized text:   ,,,ই,ই,ই,ই,,,ই,,,,
case-4 default normalized text:   ,,,,,,,,,,,,,
```

# Operations

* base operations available for all indic languages:

```python
self.word_level_ops={"LegacySymbols"    :self.mapLegacySymbols,
                     "BrokenDiacritics" :self.fixBrokenDiacritics}

self.decomp_level_ops={"BrokenNukta"              :self.fixBrokenNukta,
                       "InvalidUnicode"           :self.cleanInvalidUnicodes,
                       "InvalidConnector"         :self.cleanInvalidConnector,
                       "FixDiacritics"            :self.cleanDiacritics,
                       "VowelDiacriticAfterVowel" :self.cleanVowelDiacriticComingAfterVowel}
```

* extensions for bangla:

```python
self.decomp_level_ops["ToAndHosontoNormalize"]        = self.normalizeToandHosonto
# invalid folas
self.decomp_level_ops["NormalizeConjunctsDiacritics"] = self.cleanInvalidConjunctDiacritics
# complex root cleanup
self.decomp_level_ops["ComplexRootNormalization"]     = self.convertComplexRoots
```

# Normalization Problem Examples

**In all examples, (a) is the non-normalized form and (b) is the normalized form**

* Broken diacritics:

```
# Example-1:
(a)'আরো'==(b)'আরো' -> False
    (a) breaks as:['আ', 'র', 'ে', 'া']
    (b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে -> False
    (a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
    (b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি -> False
    (a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
    (b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
```

* Nukta Normalization:

```
# Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় -> False
    (a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
    (b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
# Example-2:
(a)রযে়ছে==(b)রয়েছে -> False
    (a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
    (b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
# Example-3:
(a)জ়ন্য==(b)জন্য -> False
    (a) breaks as:['জ', '়', 'ন', '্', 'য']
    (b) breaks as:['জ', 'ন', '্', 'য']
```

* Invalid hosonto:

```
# Example-1:
(a)দুই্টি==(b)দুইটি -> False
    (a) breaks as:['দ', 'ু', 'ই', '্', 'ট', 'ি']
    (b) breaks as:['দ', 'ু', 'ই', 'ট', 'ি']
# Example-2:
(a)এ্তে==(b)এতে -> False
    (a) breaks as:['এ', '্', 'ত', 'ে']
    (b) breaks as:['এ', 'ত', 'ে']
# Example-3:
(a)নেট্ওয়ার্ক==(b)নেটওয়ার্ক -> False
    (a) breaks as:['ন', 'ে', 'ট', '্', 'ও', 'য়', 'া', 'র', '্', 'ক']
    (b) breaks as:['ন', 'ে', 'ট', 'ও', 'য়', 'া', 'র', '্', 'ক']
# Example-4:
(a)এস্আই==(b)এসআই -> False
    (a) breaks as:['এ', 'স', '্', 'আ', 'ই']
    (b) breaks as:['এ', 'স', 'আ', 'ই']
# Example-5:
(a)'চু্ক্তি'==(b)'চুক্তি' -> False
    (a) breaks as:['চ', 'ু', '্', 'ক', '্', 'ত', 'ি']
    (b) breaks as:['চ', 'ু', 'ক', '্', 'ত', 'ি']
# Example-6:
(a)'যু্ক্ত'==(b)'যুক্ত' -> False
    (a) breaks as:['য', 'ু', '্', 'ক', '্', 'ত']
    (b) breaks as:['য', 'ু', 'ক', '্', 'ত']
# Example-7:
(a)'কিছু্ই'==(b)'কিছুই' -> False
    (a) breaks as:['ক', 'ি', 'ছ', 'ু', '্', 'ই']
    (b) breaks as:['ক', 'ি', 'ছ', 'ু', 'ই']
```

* To+hosonto:

```
# Example-1:
(a)বুত্পত্তি==(b)বুৎপত্তি -> False
    (a) breaks as:['ব', 'ু', 'ত', '্', 'প', 'ত', '্', 'ত', 'ি']
    (b) breaks as:['ব', 'ু', 'ৎ', 'প', 'ত', '্', 'ত', 'ি']
# Example-2:
(a)উত্স==(b)উৎস -> False
    (a) breaks as:['উ', 'ত', '্', 'স']
    (b) breaks as:['উ', 'ৎ', 'স']
```

* Unwanted doubles (consecutive doubles):

```
# Example-1:
(a)'যুুদ্ধ'==(b)'যুদ্ধ' -> False
    (a) breaks as:['য', 'ু', 'ু', 'দ', '্', 'ধ']
    (b) breaks as:['য', 'ু', 'দ', '্', 'ধ']
# Example-2:
(a)'দুুই'==(b)'দুই' -> False
    (a) breaks as:['দ', 'ু', 'ু', 'ই']
    (b) breaks as:['দ', 'ু', 'ই']
# Example-3:
(a)'প্রকৃৃতির'==(b)'প্রকৃতির' -> False
    (a) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ৃ', 'ত', 'ি', 'র']
    (b) breaks as:['প', '্', 'র', 'ক', 'ৃ', 'ত', 'ি', 'র']
# Example-4:
(a)আমাকোা==(b)'আমাকো' -> False
    (a) breaks as:['আ', 'ম', 'া', 'ক', 'ে', 'া', 'া']
    (b) breaks as:['আ', 'ম', 'া', 'ক', 'ো']
```

* Vowels and modifiers followed by vowel diacritics:

```
# Example-1:
(a)উুলু==(b)উলু -> False
    (a) breaks as:['উ', 'ু', 'ল', 'ু']
    (b) breaks as:['উ', 'ল', 'ু']
# Example-2:
(a)আর্কিওোলজি==(b)আর্কিওলজি -> False
    (a) breaks as:['আ', 'র', '্', 'ক', 'ি', 'ও', 'ো', 'ল', 'জ', 'ি']
    (b) breaks as:['আ', 'র', '্', 'ক', 'ি', 'ও', 'ল', 'জ', 'ি']
# Example-3:
(a)একএে==(b)একত্রে -> False
    (a) breaks as:['এ', 'ক', 'এ', 'ে']
    (b) breaks as:['এ', 'ক', 'ত', '্', 'র', 'ে']
```

* Repeated folas:

```
# Example-1:
(a)গ্র্রামকে==(b)গ্রামকে -> False
    (a) breaks as:['গ', '্', 'র', '্', 'র', 'া', 'ম', 'ক', 'ে']
    (b) breaks as:['গ', '্', 'র', 'া', 'ম', 'ক', 'ে']
```
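The "breaks as" lists above are simply the per-codepoint view of each string, i.e. ```list(word)```. A minimal stdlib sketch (independent of this package; the variable names are illustrative) shows how a broken and a normalized word can render identically while differing at the codepoint level, using Broken diacritics Example-1:

```python
import unicodedata

# broken form types ে (VOWEL SIGN E) + া (VOWEL SIGN AA)
# instead of the single vowel sign ো (VOWEL SIGN O)
broken = '\u0986\u09b0\u09c7\u09be'  # আ + র + ে + া
proper = '\u0986\u09b0\u09cb'        # আ + র + ো

print(broken == proper)  # False, although both can render as 'আরো'
print(list(broken))      # the "breaks as" view: ['আ', 'র', 'ে', 'া']
print(list(proper))      # ['আ', 'র', 'ো']

# naming each codepoint shows exactly where the two words diverge
for ch in broken:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Printing the codepoint names this way is also a handy sanity check when preparing an issue report for a word that does not normalize as expected.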
## IMPORTANT NOTE

**The normalization is purely based on how bangla text is used in ```Bangladesh``` (bn:bd). It does not necessarily cover every variation of textual content found in other regions.**

# unit testing

* clone the repository
* change working directory to ```tests```
* run: ```python3 -m unittest test_normalizer.py```

# Issue Reporting

* for reporting an issue, please provide the following information:
    * the invalid text
    * the expected valid text
    * why the output is expected
* clone the repository
* add a test case in **tests/test_normalizer.py** after **line no:91**

```python
# Dummy Non-Bangla, Numbers and Space cases / Invalid start-end cases
# english
self.assertEqual(norm('ASD1234')["normalized"],None)
self.assertEqual(ennorm('ASD1234')["normalized"],'ASD1234')
# random
self.assertEqual(norm('িত')["normalized"],'ত')
self.assertEqual(norm('সং্যুক্তি')["normalized"],"সংযুক্তি")
# Ending
self.assertEqual(norm("অজানা্")["normalized"],"অজানা")
#--------------------------------------------- insert your assertions here----------------------------------------
'''
    ### case: give a comment about your case
    ## (a) invalid text==(b) valid text <---- an example of your case
    self.assertEqual(norm(invalid text)["normalized"],expected output)
    or
    self.assertEqual(ennorm(invalid text)["normalized"],expected output) <----- for including english text
'''
# your case goes here-
```

* perform the unit testing
* make sure the unit test fails under true conditions

# Indic Base Normalizer

* to use the indic language normalizer for 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam', 'sylhetinagri':

```python
from bnunicodenormalizer import IndicNormalizer
norm=IndicNormalizer('devanagari')
```

* initialization

```python
'''
initialize a normalizer
args:
    language      : language identifier from 'devanagari', 'gujarati', 'odiya', 'tamil', 'panjabi', 'malayalam', 'sylhetinagri'
    allow_english : allow english letters, numbers and punctuations [default:False]
'''
```

# ABOUT US

* Authors: [Bengali.AI](https://bengali.ai/) in association with the OCR Team, [APSIS Solutions Limited](https://apsissolutions.com/)
* **Cite the Bengali.AI multipurpose grapheme dataset paper**

```bibtex
@inproceedings{alam2021large,
  title={A large multi-target dataset of common bengali handwritten graphemes},
  author={Alam, Samiul and Reasat, Tahsin and Sushmit, Asif Shahriyar and Siddique, Sadi Mohammad and Rahman, Fuad and Hasan, Mahady and Humayun, Ahmed Imtiaz},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={383--398},
  year={2021},
  organization={Springer}
}
```

Change Log
===========

0.0.5 (9/03/2022)
-------------------
- added details for execution map
- checkop typo correction

0.0.6 (9/03/2022)
-------------------
- broken diacritics op addition

0.0.7 (11/03/2022)
-------------------
- assamese replacement
- word op and unicode op mapping
- modifier list modification
- doc string for call and initialization
- verbosity removal
- typo correction for operation
- unit test updates
- 'এ' replacement correction
- NonGylphUnicodes
- legacy symbols option
- legacy mapper added
- added bn:bd declaration

0.0.8 (14/03/2022)
-------------------
- MultipleConsonantDiacritics handling change
- to+hosonto correction
- invalid hosonto correction

0.0.9 (15/04/2022)
-------------------
- base normalizer
- language class
- bangla extension
- complex root normalization

0.0.10 (15/04/2022)
-------------------
- added conjuncts
- exception for english words

0.0.11 (15/04/2022)
-------------------
- fixed no-space char issue for bangla

0.0.12 (26/04/2022)
-------------------
- fixed consonant orders

0.0.13 (26/04/2022)
-------------------
- fixed non-char followed by diacritics

0.0.14 (01/05/2022)
-------------------
- word based normalization
- encoding fix

0.0.15 (02/05/2022)
-------------------
- import correction

0.0.16 (02/05/2022)
-------------------
- local variable issue

0.0.17 (17/05/2022)
-------------------
- nukta mod break

0.0.18 (08/06/2022)
-------------------
- no-space chars fix
0.0.19 (15/06/2022)
-------------------
- no-space chars further fix
- base_bangla_compose to avoid false op flags
- added foreign conjuncts

0.0.20 (01/08/2022)
-------------------
- এ্যা replacement correction

0.0.21 (01/08/2022)
-------------------
- "য","ব" + hosonto combination correction
- added 'ব্ল্য' in conjuncts

0.0.22 (22/08/2022)
-------------------
- \u200d combination limiting

0.0.23 (23/08/2022)
-------------------
- \u200d condition change

0.0.24 (26/08/2022)
-------------------
- \u200d error handling

0.0.25 (10/09/22)
-------------------
- removed unnecessary operations: fixRefOrder, fixOrdersForCC
- added conjuncts: 'র্ন্ত','ঠ্য','ভ্ল'

0.1.0 (20/10/22)
-------------------
- added indic parser
- fixed language class

0.1.1 (21/10/22)
-------------------
- added nukta and diacritic maps for indics
- cleaned conjuncts for now
- fixed issues with no-space and connector

0.1.2 (10/12/22)
-------------------
- allow halant ending for indic languages except bangla

0.1.3 (10/12/22)
-------------------
- broken char break cases for halant

0.1.4 (01/01/23)
-------------------
- added sylhetinagri

0.1.5 (01/01/23)
-------------------
- cleaned panjabi double quotes in diac map

0.1.6 (15/04/23)
-------------------
- added bangla punctuations