o žªj5,ã@sÔUdZddlZddlmZmZmZdZdZdZdZ dZ d Zd ZdeZ eed<d ededefdd„Zd ededBfdd„Zd ededBfdd„Zd ededBfdd„Zdedefdd„Zddededefdd„ZdS) a«Stage 1a+: UTF-16/UTF-32 detection for data without BOM. This stage runs after BOM detection but before binary detection. UTF-16 and UTF-32 encoded text contains characteristic null-byte patterns that would otherwise cause binary detection to reject the data. Note: ``from __future__ import annotations`` is intentionally omitted because this module is compiled with mypyc, which does not support PEP 563 string annotations. éN)ÚASCII_TEXT_BYTESÚDETERMINISTIC_CONFIDENCEÚDetectionResultiéé g¸…ëQ¸ž?çà?gffffffæ?g333333Ã?óÚ_NULL_SEPARATOR_ALLOWEDÚdataÚ null_fracÚreturncCs|tkrdS| dt¡S)u‹Return True if the data looks like ASCII with null byte separators. :param data: The raw byte sample to examine. :param null_frac: The positional null fraction for this UTF-16 candidate (i.e. fraction of null bytes in even positions for BE, or odd positions for LE) â€” not the total null fraction across all bytes. Checks two conditions: 1. The positional null fraction is below ``_NULL_SEPARATOR_MAX_FRACTION`` 2. Every non-null byte is printable ASCII or common whitespace When both conditions are met, the nulls are likely field separators (e.g. ``find -print0``), not UTF-16 encoding artifacts. FN)Ú_NULL_SEPARATOR_MAX_FRACTIONÚ translater )r r©rú^/var/www/html/fyndo/pharma/fyndo/venv/lib/python3.10/site-packages/chardet/pipeline/utf1632.pyÚ_is_null_separator_pattern6srcCs8|dt…}t|ƒtkrdSt|ƒ}|dur|St|ƒS)aDetect UTF-32 or UTF-16 encoding from null-byte patterns. UTF-32 is checked before UTF-16 since UTF-32 patterns are more specific. :param data: The raw byte data to examine. :returns: A :class:`DetectionResult` if a strong pattern is found, or ``None``. N)Ú_SAMPLE_SIZEÚlenÚ_MIN_BYTES_UTF16Ú_check_utf32Ú_check_utf16)r ÚsampleÚresultrrrÚdetect_utf1632_patternsJsrcs`tˆƒtˆƒd}|tkrdSˆd|…‰|d}t‡fdd„tdtˆƒdƒDƒƒ}t‡fdd„tdtˆƒdƒDƒƒ}||krc||dkrczˆ d¡}t|ƒrXtdtdd WSWn tybYnwt‡fd d„tdtˆƒdƒDƒƒ}t‡fdd„td tˆƒdƒDƒƒ}||kr®||dkr®zˆ d¡}t|ƒr¡tdtdd WSWdStyYdSwdS)a’Check for UTF-32 encoding based on 4-byte unit structure. For valid Unicode (U+0000 to U+10FFFF = 0x0010FFFF): - UTF-32-BE: the first byte of each 4-byte unit is always 0x00 - UTF-32-LE: the last byte of each 4-byte unit is always 0x00 For BMP characters (U+0000 to U+FFFF), additionally: - UTF-32-BE: the second byte is also 0x00 - UTF-32-LE: the third byte is also 0x00 éNc3ó |]}ˆ|dkrdVqdS©réNr©Ú.0Úi©r rrÚ tó€z_check_utf32..rc3s$|] }ˆ|ddkrdVqdS)rrNrrr!rrr"vó€"rz utf-32-be©ÚencodingÚ confidenceÚlanguagec3rrrrr!rrr"…r#éc3rrrrr!rrr"‡r#éz utf-32-le) rÚ_MIN_BYTES_UTF32ÚsumÚrangeÚdecodeÚ_looks_like_textrrÚUnicodeDecodeError)r Útrimmed_lenÚ num_unitsÚ be_first_nullÚbe_second_nullÚtextÚle_last_nullÚ le_third_nullrr!rr`sL"" ýÿÿ"" ýÿ ýýrc s®ttˆƒtƒ}||d8}|tkrdS|d}t‡fdd„td|dƒDƒƒ}t‡fdd„td|dƒDƒƒ}||}||}g}|tkrStˆd|…|ƒsS| d|f¡|tkrgtˆd|…|ƒsg| d |f¡|skdSt|ƒdkrš|dd}zˆd|… |¡} t | ƒrt|tdd WSWdSt y™YdSwd} d}|D]%\}}zˆd|… |¡} Wn t y¸Yq wt| ƒ} | |krÅ| }|} q | durÕ|tkrÕt| tdd SdS)aýCheck for UTF-16 via null-byte patterns in alternating positions. UTF-16 encodes each BMP character as two bytes. For characters whose code-point high byte is 0x00 (Latin, digits, basic punctuation, many control structures), one of the two bytes in each unit will be a null. Even for non-Latin scripts (Arabic, CJK, Cyrillic, etc.) a significant fraction of code units still contain at least one null byte. Non-UTF-16 single-byte encodings never contain null bytes, so even a small null-byte fraction in alternating positions is a strong signal. When both endiannesses show null-byte patterns (e.g., Latin text where every other byte is null), we disambiguate by decoding both ways and comparing text-quality scores. r*Nc3rrrrr!rrr"°r#z_check_utf16..rc3rrrrr!rrr"²r#rz utf-16-lez utf-16-ber%çð¿)Úminrrrr,r-Ú_UTF16_MIN_NULL_FRACTIONrÚappendr.r/rrr0Ú _text_qualityÚ_MIN_TEXT_QUALITY)r Ú sample_lenr2Ú be_null_countÚ le_null_countÚbe_fracÚle_fracÚ candidatesr&r5Ú best_encodingÚbest_qualityÚ_Úqualityrr!rr˜sp ÿ ÿýÿþþÿ€ýrr5cCs6|sdS|dd…}tdd„|Dƒƒ}|t|ƒtkS)z9Quick check: is decoded text mostly printable characters.FNéôcss$|] }| ¡s|dvrdVqdS)ú rN)Úisprintable)rÚcrrrr"ñr$z#_looks_like_text..)r,rÚ_MIN_PRINTABLE_FRACTION)r5rÚ printablerrrr/ìs r/rHÚlimitcCs |d|…}t|ƒ}|dkrdSd}d}d}d}d}|D]>} t | ¡} | ddkr8|d7}t| ƒdkr7|d7}q| ddkrC|d7}q| dksK| d vrP|d7}q| dd krZ|d7}q||dkrcdS||dkrkdS||}|||d 7}|dkrƒ|dkrƒ|d7}|S)uØScore how much *text* looks like real human-readable content. Returns a score in the range [-1.0, ~1.6). Higher values indicate more natural text. The practical maximum is 1.5 for all-ASCII-letter input (1.6 approaches as sample size grows with all ASCII letters plus whitespace). A score of -1.0 means the content is almost certainly not valid text (too many control characters or combining marks). Scoring factors: * Base score: ratio of Unicode letters (category ``L*``) to sample length. * ASCII bonus: additional 0.5x weight for ASCII letters. This is the primary signal for disambiguating endianness â€” correct decoding of Latin-heavy text produces ASCII letters, wrong decoding produces CJK. * Space bonus: +0.1 when the sample contains at least one whitespace character and is longer than 20 characters. * Rejection: returns -1.0 if >10% control characters or >20% combining marks (category ``M*``). Nrr8ÚLré€ÚMÚZsrIÚCgš™™™™™¹?gš™™™™™É?ré)rÚunicodedataÚcategoryÚord)r5rNrÚnÚlettersÚmarksÚspacesÚcontrolsÚ ascii_lettersrKÚcatÚscorerrrr<õs@ € €r<)rH)Ú__doc__rUÚchardet.pipelinerrrrr+rr:r=rLr r ÚbytesÚ__annotations__ÚfloatÚboolrrrrÚstrr/Úintr<rrrrÚs"8T