This paper addresses the challenges of managing and processing unstructured or semi-structured text, particularly in the context of increasing data volumes that traditional linguistic databases and algorithms struggle to handle in real-time scenarios. While humans can easily navigate linguistic complexities, computational systems face significant difficulties due to algorithmic limitations and the shortcomings of Large Language Models (LLMs). These challenges often result in issues such as a lack of standardized formats, malformed expressions, semantic and lexical ambiguities, hallucinations, and failures to produce outputs aligned with the intricate meaning layers present in human language.
Regarding the automatic analysis of linguistic data, it is well known that Natural Language Processing (NLP) relies on two different approaches, which stem from diverse cultural and experiential backgrounds. The first approach is based on probabilistic computational statistics (PCS), which underpins most Machine Learning (ML), LLM, and Artificial Intelligence (AI) techniques. The second approach is based, for each specific language, on the formalization of the morpho-syntactic features and constraints that humans use in ordinary communication. At first glance, the second approach appears more effective in addressing linguistic phenomena such as polysemy and the formation of meaningful distributional sequences or, more precisely, acceptable and grammatical morpho-syntactic contexts; the illustrative sketch below makes this contrast concrete.
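To illustrate the contrast between the two approaches, the following minimal sketch (our own illustration, not drawn from the paper) disambiguates the polysemous English word "bank" twice: once with a toy distributional co-occurrence score in the spirit of the PCS approach, and once with a hand-written local morpho-syntactic rule in the spirit of the formal approach. The tiny labeled corpus, the sense labels, and the rule itself are invented assumptions for illustration only and stand in for the much larger resources either approach would require in practice.

```python
# Illustrative sketch only: a toy word-sense disambiguation task for the
# polysemous word "bank". The labeled corpus, sense labels, and the
# hand-written rule are invented assumptions, not resources from the paper.

from collections import Counter

# --- Approach 1: probabilistic / distributional (PCS-style) ---------------
# Estimate, from a toy sense-labeled corpus, how often each context word
# co-occurs with each sense, then pick the sense with the most evidence.
labeled_corpus = [
    ("deposit money in the bank account", "bank/FINANCE"),
    ("the bank approved the loan quickly", "bank/FINANCE"),
    ("we walked along the river bank", "bank/RIVER"),
    ("fish gathered near the muddy bank of the river", "bank/RIVER"),
]

context_counts = {"bank/FINANCE": Counter(), "bank/RIVER": Counter()}
for sentence, sense in labeled_corpus:
    context_counts[sense].update(w for w in sentence.lower().split() if w != "bank")

def statistical_sense(sentence):
    """Score each sense by summed co-occurrence counts of the context words."""
    tokens = [w for w in sentence.lower().split() if w != "bank"]
    scores = {sense: sum(counts[w] for w in tokens)
              for sense, counts in context_counts.items()}
    return max(scores, key=scores.get), scores

# --- Approach 2: formalized morpho-syntactic constraints ------------------
# A hand-written local-context rule of the kind a linguist might encode:
# "river bank" as a compound, or "bank of the river", selects the RIVER
# sense; otherwise the FINANCE reading is returned as the default.
def rule_based_sense(sentence):
    tokens = sentence.lower().split()
    for i, tok in enumerate(tokens):
        if tok == "bank":
            if i > 0 and tokens[i - 1] == "river":
                return "bank/RIVER"
            if tokens[i + 1:i + 4] == ["of", "the", "river"]:
                return "bank/RIVER"
    return "bank/FINANCE"

if __name__ == "__main__":
    test = "she opened an account at the bank"
    print(statistical_sense(test))   # evidence-based guess with per-sense scores
    print(rule_based_sense(test))    # deterministic, rule-derived reading
```

The sketch is deliberately reductive: the statistical side generalizes from whatever co-occurrences happen to be observed, while the rule-based side encodes an explicit, language-specific constraint that either fires or does not. The trade-offs between these behaviors are the subject of the discussion that follows.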
In this paper, we initiate a scientific discussion on the differences between these two approaches, aiming to shed light on their respective advantages and limitations.