1. Introduction
Search engines detect the user intent or search goal from web queries to retrieve accurate results. The user intent is defined as “the expression of an affective, cognitive, or situational goal” during users’ interactions with web systems, i.e., search engines or dialogue systems [
1]. Intent detection, or recognition, entails classifying diversely-expressed queries into predefined intent categories; however, intent detection from search queries is a critical and challenging problem due to multiple complexities. Firstly, search queries are short and lack sufficient context. In addition, queries are loosely structured, which compounds their semantic ambiguity [
2,
3]. Queries in morphologically rich languages are further complicated by complex word structures and frequent inflections and derivations.
Urdu is a major language spoken by more than 160 million people across the world. According to a study, about 0.04% of the global web content is in the Urdu language [
4]. Urdu is written in the Perso-Arabic script, possesses a free word order and has a rich inflectional morphology. As Urdu words have multiple surface forms, the same information need can be formulated as a number of different queries. To capture these variations maximally, large intent-annotated query corpora are required for intent detection.
Annotated datasets are used to extract features for machine learning- or deep learning-based intent detection models [
5,
6]. However, to deal with diversified web queries without significant feature engineering, deep neural models are also being used. Capsule neural networks (CapsNet) [7], introduced for image recognition tasks, are a new type of neural network that groups neurons together into vectors that encode the specific parameters of entities and uses dynamic routing between layers to pass the important parameters on to the next layer. Due to the improved performance of capsule neural networks in capturing the spatial relationships between features, CapsNet-based models are being extensively researched for user intent detection [
8,
9,
10].
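The routing-by-agreement mechanism described above can be illustrated with a minimal NumPy sketch. This is an illustrative implementation of the generic algorithm from [7], not the authors' code; array shapes and variable names are assumptions.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule non-linearity: preserves direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iter=3):
    """Routing-by-agreement between low-level and high-level capsules.

    u_hat: prediction vectors, shape (n_low, n_high, dim).
    Returns the output vectors of the high-level capsules, shape (n_high, dim).
    """
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                              # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients
        s = (c[:, :, None] * u_hat).sum(axis=0)                # weighted sum per high capsule
        v = squash(s)                                          # (n_high, dim)
        b = b + np.einsum('ijk,jk->ij', u_hat, v)              # agreement update
    return v
```

The coupling coefficients are a softmax over the high-level capsules, so each low-level capsule distributes its "vote" across the next layer; iterating strengthens the routes whose predictions agree with the emerging output.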
In this article, we tackle the Urdu intent detection challenge by developing a state-of-the-art model and by designing and developing the first intent-annotated Urdu web queries dataset. The Urdu web queries dataset consists of 11,751 search queries issued by 165 local users, covering 11 AOL query classification domains [
1], by utilizing the search query logs of Humkinar [
4], an Urdu search engine. The dataset is annotated using Broder’s taxonomy for search intents [
5]. In our prior study, two Urdu intent-annotated datasets were developed by translating English queries into Urdu [
11], while preserving the original intent annotations. The translated queries were limited in capturing native search characteristics, e.g., the query structure, search domains and vocabulary; thus, the present dataset was developed to adequately capture this variability. The availability of this dataset will also be instrumental in benchmarking subsequent efforts in Urdu language understanding.
Our proposed intent detection model is based on a capsule neural network architecture, having an initial layer of LSTM cells instead of a convolution layer as proposed in the original capsule network model [
7]. The proposed model achieves a significantly higher intent detection performance compared to the machine learning- and deep learning-based baselines. Through empirical evaluations, we further investigate the effect of query corpus-trained and pre-trained input word representations from a natural language (NL) corpus on the performance of the proposed intent detection model. The results indicate that using pre-trained input vectors limits the performance of the model due to differences in the word collocation context in the query and the natural language corpora. The key contributions of this article are as follows.
The design and development of the first intent-annotated Urdu web queries dataset (UWQ-22), covering 11 AOL query classification domains and annotated with Broder’s intent taxonomy (presented in
Section 3 and
Section 4).
The development of a customized neural network-based model for intent detection, namely, U-IntentCapsNet, utilizing LSTM cells and an iterative routing mechanism between capsules to effectively discriminate diversely-expressed search intents (presented in
Section 5).
A rigorous performance evaluation of the proposed model, demonstrating state-of-the-art results for intent detection and outperforming several strong baselines and alternative classification techniques (presented in
Section 7).
The rest of the paper is organized as follows:
Section 2 presents an extensive review of intent taxonomies, intent detection datasets, and models.
Section 3 and
Section 4 describe the design, development and annotation of the Urdu web queries dataset.
Section 5 provides a detailed description of the proposed CapsNet-based model for intent detection.
Section 6 highlights the experimental setup.
Section 7 presents the results and discussion regarding the performance evaluation of the intent detection model, and finally,
Section 8 concludes the paper.
2. Related Works
User intent detection from search queries requires an intent-annotated dataset for system learning. The most widely-used intent taxonomy for search queries was developed by Broder [
5]. In this taxonomy, a single-level structure of three intent classes, namely, informational, navigational and transactional, was proposed. Jansen et al. [
12] extended Broder’s taxonomy by defining secondary and tertiary level intent classes for each of the three top level intents. Rose and Levinson [
13] redefined Broder’s taxonomy by introducing sub-levels and replacing the “transactional intent” with a “resource-seeking intent.” In this restructuring, at level 2, five sub-classes for the informational and four for the resource intent were formulated. Baeza-Yates et al. [
14] proposed a different taxonomy from the earlier research, and classified queries as informational, not informational and ambiguous. The intent taxonomies were used to annotate datasets extracted from publicly-released query logs, e.g., the TREC Web Corpus and WT10g collection (
http://ir.dcs.gla.ac.uk/test_collections/, accessed on 15 June 2022), AltaVista logs [
13], DogPile [
12], Lycos [
15], MSN Search Query log (
http://www.sobigdata.eu/content/query-log-msn-rfp-2006, accessed on 15 June 2022), Yahoo (
https://webscope.sandbox.yahoo.com/, accessed on 15 June 2022) and AOL Search query logs [
16]. In addition, search query logs from Russian (
http://switchdetect.yandex.ru/en/datasets, accessed on 15 June 2022), Chinese (
http://www.sogou.com/labs/, accessed on 15 June 2022), Chilean [
17] and Vietnamese [
18] search engines were also used.
Web query datasets in local languages have also been developed through translation. For example, Schuster et al. [
19] reported a translated query dataset for Spanish and Thai. In MultiATIS++ [
20], an English ATIS dataset [
21] was translated into eight languages, extending earlier research by Uday et al. [
22]. PhoATIS, a Vietnamese query dataset, was developed by translating the English ATIS dataset into Vietnamese [
18]. Additionally, query datasets for the Estonian, Latvian, Lithuanian, and Russian languages were developed [
23] using the Tilde machine-translation system by translating publicly-released datasets [
24,
25]. Moreover, two Urdu query datasets were developed by translating the English ATIS and AOL query datasets into Urdu [
11]. Although translated query datasets have been used for intent detection research, they are limited in capturing the local syntactic structure, vocabulary and search characteristics. In this study, a native web queries dataset was designed and developed to capture this variability for user intent detection.
Traditionally, user intent detection models have exploited classifiers such as support vector machines (SVM), naïve Bayes and logistic regression, with discriminative features modelled from corpora, query logs or pre-trained language models [
6]. Subsequent approaches have extensively developed Convolutional Neural Network (CNN)-based [
26,
27] and long short term memory (LSTM)-based [
28,
29] architectures for classifying intents from diversely-framed user queries, mostly in English. Recently, capsule neural network-based approaches have been explored for detecting the intent from user queries in virtual assistants’ and dialogue systems’ scenarios. In [
9,
30], a CAPSULE-NLU model is proposed for joint intent detection and slot filling. The architecture comprises three capsule types: WordCaps, SlotCaps and IntentCaps. Input word representations are learnt from the training dataset using a BLSTM in the WordCaps and are then used to predict slots in the SlotCaps. The outputs of the SlotCaps predict the intent of the utterance using a dynamic routing-by-agreement algorithm between each capsule pair. The model achieves 0.950 and 0.973 intent accuracy on the ATIS and SNIPS datasets, respectively. In [
10], an INTENTCAPSNET model is proposed for intent detection, using two capsule types: SemanticCaps and DetectionCaps. In the SemanticCaps, input word representations are trained from scratch using a BLSTM and fed into a multi-head attention layer to extract semantic features. The DetectionCaps dynamically combine these features into higher-level representations through an unsupervised routing-by-agreement algorithm. The model achieves 0.9621 and 0.9088 accuracy, respectively, on SNIPS and CVA, a Chinese voice assistant dataset. In [
8], a BERT-Cap-based model is proposed for intent detection from user queries in English and Chinese. This model leverages pre-trained BERT [
31] to optimize the sentence representation and pass it as input to low-level capsules. Through dynamic routing, the low-level capsules capture the rich features of the sentence and forward them to the high-level semantic capsules for intent classification; the model achieved 0.967 accuracy on one of the Chinese datasets.
Models for joint intent detection and slot filling have been largely used for English and other languages where datasets with semantic slots and intent labels are available. Other languages rarely have datasets with slot annotations, and generally, only intent category labels are available [
8]. Notwithstanding these limitations, in the proposed architecture, a two-tiered capsule network-based model was designed, having WordCaps and IntentCaps layers with an intermediate dynamic routing mechanism. The IntentCaps in the proposed architecture use the output vectors of the WordCaps directly, similar to the DetectionCaps in [
10], but differ from the IntentCaps in [
9], which use the output from the SlotCaps; the proposed model applies a max-margin loss for intent classification.
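As a minimal illustration, the max-margin loss over the IntentCaps activation lengths can be sketched as follows. The margins m+ = 0.9, m− = 0.1 and down-weighting λ = 0.5 are the values commonly used in capsule networks, assumed here for illustration rather than confirmed hyperparameters of the proposed model.

```python
import numpy as np

def margin_loss(v_norms, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Max-margin loss over capsule output lengths, one per intent class.

    v_norms: IntentCaps activation lengths, shape (n_classes,).
    target:  one-hot vector marking the true intent.
    """
    # Penalize the true class if its capsule length falls below m_pos.
    pos = target * np.maximum(0.0, m_pos - v_norms) ** 2
    # Penalize other classes if their capsule lengths exceed m_neg.
    neg = lam * (1.0 - target) * np.maximum(0.0, v_norms - m_neg) ** 2
    return float(np.sum(pos + neg))
```

The loss is zero when the true intent's capsule is long (>= m+) and all other intent capsules are short (<= m−), encouraging well-separated class activations.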
The BERT-Caps model reported in [
8] utilized the language model pre-trained on a large NL corpus, as the initial parameters in the sentence encoders, whereas the models in [
9,
10] used an input representation trained from the queries dataset. Studies reveal that there are stark differences in the syntactic properties of an NL and query corpus, specifically, in the word co-occurrence structure [
2]. Thus, word representations extracted from the two corpora may manifest varying patterns owing to this differentiation. In the proposed model, we investigated the impact of using sentence encoding representations from pre-trained models trained from an NL corpus [
32] versus a contextual representation learnt from the queries dataset, for intent detection in web queries.
3. Urdu Web Queries Dataset (UWQ-22)
The Urdu web queries dataset is the first dataset comprising native web queries extracted from a localized platform. This section discusses in detail the dataset development and the intent annotation process.
The Urdu web queries dataset has been extracted from the search records of Humkinar, an Urdu search engine [
4], covering user queries from 1 December 2020 to 31 January 2021. Each search record tuple consisted of the following five fields: (i) a user identification number, (ii) the user query, (iii) the clicked URL, (iv) the total time (in seconds) spent by the user on the clicked URL and (v) the server time (hours, minutes, seconds and the date). The user identification number, user query and clicked URL have been retained in this dataset. The total number of search records extracted from the search engine was 13,785. Search records with missing entries were removed, and the final dataset comprised 11,751 search records from 165 users.
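The record filtering step described above can be sketched as follows. The field names are hypothetical, chosen to mirror the five fields listed in the text; only the first three fields are retained, as in the dataset.

```python
# Hypothetical field names mirroring the five-field search record tuple.
FIELDS = ("user_id", "query", "clicked_url", "dwell_time", "server_time")

def clean_records(records):
    """Drop search records with any missing (None/empty) field and keep only
    the user_id, query and clicked_url fields, as retained for UWQ-22."""
    kept = []
    for rec in records:
        if len(rec) == len(FIELDS) and all(v not in (None, "") for v in rec):
            kept.append(rec[:3])  # user_id, query, clicked_url
    return kept
```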
Table 1 describes the prominent statistics of the Urdu web queries dataset. The queries in the dataset comprised 42,214 terms, of which 38,789 were unique. The mean length of the web queries is 3.78 terms. The queries included in the dataset cover 11 domains [
1].
Table 2 presents the domain-wise distribution of the queries with the respective number of terms.
Table 2 highlights a balanced coverage of queries across all domains. The highest number of queries in the dataset were from entertainment, i.e., 1253 (11%), while queries related to the geography domain were the lowest, i.e., 964 queries (8%). A similar analysis over the AOL query domains was reported in [
33] regarding Turkish and English datasets, where the highest proportions of queries, 19.9% and 12.6% in the respective datasets, also fell in the entertainment category.
As presented in
Table 1, the mean length of the queries was 3.78 terms.
Figure 1 describes the frequency distribution plot of the web queries with respect to query length.
Figure 1 shows that the length of the web queries in the dataset ranged between 1 and 39 terms. Of these queries, 82.39% had five or fewer terms. Beyond a query length of 6, the frequency of queries declined steadily, becoming very low beyond a query length of 10.
Table 3 describes the mean query length with respect to the query domains. It can be observed that the highest mean query length across all domains was for travel domain queries, i.e., 4.1 terms, while the lowest mean query length was for books and entertainment queries, i.e., 3.4 terms. It is interesting to note that the shopping domain had the longest query in the dataset (max. length 38 terms), while all domains included one-word queries.
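The reported length statistics can be reproduced from a query list with a short sketch. Whitespace tokenization is assumed here for simplicity; the actual term segmentation for Urdu is described in Section 4.1.

```python
from collections import Counter

def query_length_stats(queries):
    """Mean query length (in whitespace-separated terms) and the share of
    queries with five or fewer terms, mirroring the UWQ-22 statistics."""
    lengths = [len(q.split()) for q in queries]
    freq = Counter(lengths)                      # length -> number of queries
    mean_len = sum(lengths) / len(lengths)
    short_share = sum(v for k, v in freq.items() if k <= 5) / len(lengths)
    return mean_len, short_share, freq
```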
4. Dataset Annotation
Following Broder’s taxonomy of web queries, the Urdu web queries dataset was annotated with three intents: Informational, Navigational and Transactional.
Figure 2 describes the block diagram of the intent annotation process for this dataset. The first step after the extraction and finalization of the dataset was pre-processing, to ensure data consistency for annotation. In parallel, detailed rules for annotating the queries with the required intents were developed and finalized. As a next step, the dataset was manually annotated in two passes. In Annotation Pass I, a sample dataset was extracted and annotated by two linguists following the annotation rules. The inter-annotator agreement was measured until, after resolving disagreements, it converged to at least a 0.6 Cohen’s kappa coefficient [
34] after resolving disagreements. The remaining dataset was annotated in Pass II. These steps are further elaborated in the following section.
4.1. Data Pre-Processing
The preliminary step of data pre-processing for intent annotation was tokenization. Urdu text is written in the Perso-Arabic script, which includes two types of Unicode characters: joiners and non-joiners. If the last character of an Urdu word is a joiner, a space is necessary to mark the word boundary; however, if the word ends with a non-joiner character, a space character may not be inserted to mark the word boundary or form the required word shape [
35]. Thus, traditional tokenization techniques, such as separating words on whitespace, may lead to two possible types of word tokenization errors in Urdu: space insertion and space deletion. A space inserted between morphemes in the context of affixation, compounding, reduplication, foreign words and abbreviations is erroneous. In the Urdu web queries dataset, a Zero Width Non-Joiner (ZWNJ) character was used to resolve such erroneous space insertions manually. Similarly, white space was added between words where the ending character of a word was a non-joiner and the space character had been omitted.
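The ZWNJ rule for joining two space-separated units of a single word can be illustrated with a small sketch. The non-joiner set below is an illustrative subset, not the complete inventory of Urdu non-joining characters, and the function name is hypothetical.

```python
ZWNJ = "\u200c"  # Zero Width Non-Joiner

# Illustrative subset of Urdu non-joining characters; the full set is larger.
NON_JOINERS = set("اآدڈذرڑزژوے")

def mark_word_boundary(first_part, second_part):
    """Join two space-separated units that form a single word.

    If the first unit ends in a non-joiner, the units can be concatenated
    directly; if it ends in a joiner, a ZWNJ is inserted so the letters keep
    their non-joined (word-final) shapes after the space is removed.
    """
    if first_part and first_part[-1] in NON_JOINERS:
        return first_part + second_part
    return first_part + ZWNJ + second_part
```

For example, joining خوب and صورت (where خوب ends in the joiner ب) requires a ZWNJ to preserve the word shape, whereas کار and خانہ (where کار ends in the non-joiner ر) concatenate directly.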
After tokenization, similar queries were removed systematically in two iterations. Firstly, the queries were trimmed of leading and trailing spaces, and duplicate queries were removed. The resulting dataset was normalized through diacritics removal, systematic whitespace normalization, the removal of punctuation and accents and the removal of English alphabet characters. After normalization, duplicate queries and records resulting in null queries were again removed. The resultant dataset had 8519 queries.
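A minimal sketch of the normalization and deduplication steps follows. The exact character classes used for UWQ-22 are not specified in the text; Unicode combining marks (for diacritics) and Unicode punctuation categories are used here as approximations.

```python
import re
import unicodedata

def normalize_query(q):
    """Approximate UWQ-22 normalization: trim spaces, remove diacritics
    (combining marks), English letters and punctuation, and collapse
    internal whitespace."""
    q = unicodedata.normalize("NFD", q.strip())
    q = "".join(ch for ch in q if not unicodedata.combining(ch))  # diacritics
    q = re.sub(r"[A-Za-z]", "", q)                                # English letters
    q = "".join(ch for ch in q if not unicodedata.category(ch).startswith("P"))
    return re.sub(r"\s+", " ", q).strip()

def deduplicate(queries):
    """Remove duplicates and records that normalize to the empty string."""
    seen, kept = set(), []
    for q in map(normalize_query, queries):
        if q and q not in seen:
            seen.add(q)
            kept.append(q)
    return kept
```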
4.2. Annotation Rules
The Urdu web queries dataset was annotated with three intents: navigational (NAV), transactional (TRAN) and informational (INFO). The definitions of these intent classes, as specified in [
5,
12], are given in the following section. Additionally, the salient characteristics of each intent class were also specified, with examples from the Urdu web queries dataset, which have been used as rules to annotate the queries according to the respective intent class.
4.2.1. Navigational Intent
Navigational intent implies a web search focused on finding a particular website, mostly based on prior knowledge. Users issuing queries with a navigational intent mostly click a particular website. User queries depicting a navigational intent have the following characteristics:
Queries containing domain suffixes, e.g., زمین ۔ کوم (Zameen.com), and دراز ڈاٹ پی کے (Daraz.pk);
Queries with the names of online platforms, e.g., زوم (Zoom), ویکیپیڈیا (Wikipedia), فیسبوک (Facebook) and ایمیزون (Amazon);
Queries with an organizational or brand name, e.g., ڈذنی لینڈ (Disneyland), ریڈیو پاکستان (Radio Pakistan), and سماء نیوز (SAMAA News).
4.2.2. Transactional Intent
Transactional intent aims at locating a specific website to obtain something by executing a web service. These action-oriented queries are also termed as resource-seeking queries as they aim to download, book or view a resource. These queries have the following characteristics:
Queries containing multimedia and text-based file formats or extensions, e.g., سورہ رحمن mp3 (Quranic Verse Surah Rahman mp3), and ہیری پوٹر pdf (Harry Potter pdf.);
Queries containing terms related to entertainment videos or audios, books and course-works, e.g., ارتغل ترقی ڈرامہ (Ertugul Turkish drama), ہیری پوٹر (Harry Potter), CSS جغرافیائی سلیبس (Geography Syllabus CSS), and سورہ رحمن (Quranic Verse Surah Rahman);
Queries containing terms related to technology downloads (i.e., software, applications, anti-virus and drivers), e.g., 11 ونڈو (Windows 11), and آن لائن ہاکی (online hockey);
Queries with e-commerce and booking related terms, e.g., لاہور سے استنبول کی پروازیں (flight from Lahore to Istanbul), مالدیپ ہوٹل ریٹ (hotel rates Maldivies), تاج محل کا ٹکٹ (Taj Mahal ticket), and سوزوکی مہران برائے فروخت (Suzuki Mehran sale and purchase);
Queries with terms related to the weather, maps, and calculators, e.g., اج کا لاہور کا موسم (today’s Lahore weather), اسام آباد سے مری کا فاصلہ (distance between Islamabad and Murree), and پیکجز مال کا راستہ (route to Packages Mall).
4.2.3. Informational Intent
Informational intent implies reviewing the web content available on the Internet to gain awareness or knowledge. This includes reviewing information from a single webpage or reviewing multiple websites to collect the requisite information. Informational intent queries have the following distinguishable characteristics.
Queries having interrogative phrases, e.g., نینو ٹیکنالوجی کیا ہے؟ (what is nanotechnology?), کشمیر پریمئر لیگ کون جیتا (who won the Kashmir Premier League?), and کس ملک کی آبادی سب سے کم ہے (which country has the least population?);
Queries containing names of celebrities or famous personalities, e.g., کپل شرما (Kapil Sharma), فیض احمد فیض (Faiz Ahmad Faiz), and غالب (Ghalib);
Queries with natural language terms, e.g., موبائل لاک کھولنے کا طریقہ (procedure of opening a mobile (phone) lock), کرونا ویکسین کی ایجاد (invention of the Corona Vaccine), کرکٹ ورلڈ کپ (the Cricket World Cup), and کرپٹو کرنسی (crypto currency).
Although queries consisting of personality names could be both informational and navigational, for consistency, queries having only personality names were annotated as informational. Short phrases with natural language terms that presented neither a navigational nor a transactional intent were classified as informational.
4.3. Manual Annotation and Consistency Evaluation
The dataset was manually annotated by the linguists in two passes. At the end of each annotation pass, inter-annotator agreement was calculated to evaluate the annotation consistency. The inter-annotator agreement threshold for finalizing the annotation was set at ≥0.6 (on the standard scale, a coefficient of 0 or less = poor, 0.01–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial and 0.81–1.00 = almost perfect agreement).
In the first pass, a sample of queries was extracted to test the annotation rules and to bring both annotators to the same level of labelling agreement. Due to the large diversity in the dataset, random sampling could have resulted in a lack of topical coverage in the drawn sample. To address this, clustering was used to collate textually similar queries before sampling. Therefore,
K-means clustering with clusters = 80 (evaluated using the elbow method) was used [
36] and a random sample of 20 queries from each cluster, resulting in 1600 queries, was extracted for the Annotation Pass I. The two linguists annotated the queries and an inter-annotator agreement using Cohen’s kappa coefficient [
34] was calculated. The agreement value was 0.7213, indicating substantial agreement between the annotators.
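Cohen's kappa, as used for the agreement evaluation, can be computed directly from the two annotators' label sequences:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e), where p_o
    is the observed agreement and p_e the agreement expected by chance from
    each annotator's label distribution."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Unlike raw percent agreement, kappa discounts the agreement that two annotators would reach by chance given their individual label frequencies.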
The disagreements were analyzed and categorized into two groups: informational-transactional (INFO-TRAN) and informational-navigational (INFO-NAV). In the INFO-TRAN disagreements, the queries were inherently ambiguous; e.g., ٹرین میں نماز (prayers in train) is subject to the interpretation of the annotator. It could be informational in the context of finding facts or procedures related to the query, or transactional in the context of finding or downloading a related resource, e.g., a book or video. In the INFO-NAV disagreements, e.g., اولمپک سپورٹس (Olympics sports) and ریل میڈرڈ (Rail Madrid), labeling the queries as either informational or navigational was equally valid, unless more contextual information was available for disambiguation. This analysis highlights the complex structure of the queries and the resulting annotation challenges. The remaining corpus was annotated in Annotation Pass II by dividing the data into four batches.
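The cluster-based sampling used for Annotation Pass I (80 clusters × 20 queries = 1600) can be sketched as follows, assuming cluster assignments have already been produced, e.g., by K-means. The function and variable names are illustrative.

```python
import random
from collections import defaultdict

def stratified_sample(queries, cluster_ids, per_cluster=20, seed=0):
    """Draw up to `per_cluster` queries from each cluster, so the sample
    covers every topical cluster rather than only the frequent ones."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for q, c in zip(queries, cluster_ids):
        buckets[c].append(q)
    sample = []
    for c in sorted(buckets):
        members = buckets[c]
        sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return sample
```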
4.4. Intent Annotated Dataset Statistics
This section describes the detailed statistics of the intent annotated Urdu web queries dataset with respect to frequency, query length and domain coverage. The frequency distribution of the three intents in the Urdu web queries dataset is presented in
Table 4.
Table 4 highlights the percentage coverage of each intent type in the Urdu web queries dataset. Informational queries had the highest frequency in the dataset, forming 76.25% of the total queries. Transactional queries constitute 13.82% of the dataset, while navigational queries form 9.92%. These findings are in conformity with the log analyses of English queries reported in [
5,
12,
13], where the informational queries were also in the majority followed by the transactional and navigational queries.
Table 5 further presents the query length for each intent type as well as the maximum and minimum query lengths for each intent.
The query length was calculated as the total number of terms per query. Queries with an informational intent were the longest, having the highest mean query length, i.e., 4.0, followed by the transactional and navigational queries with 3.7 and 3.2 mean query lengths, respectively. As per
Table 5, the maximum length of a query with an informational intent was twice that of a query with a navigational intent. The maximum length of queries with a transactional intent was significantly smaller than that of the informational intent queries. The percentage frequency of query intents with respect to the query domains is presented in
Table 6.
As the number of queries in each domain was different, for a comparative analysis across the domains, the percentage frequency of the queries per intent is shown in
Table 6. From the table it can be observed that the informational queries were balanced across all domains. Technology-related informational queries were the most frequent (12%), while the travel-related queries were the least frequent (7%). In the transactional queries, the highest number of queries were related to fact–info (20%), followed by geography (19%). Transactional queries related to health and travel were the least frequent, i.e., 1%. The majority of the navigational queries were from the entertainment domain (36%) followed by books (28%), while navigational queries belonging to business and geography were the least frequent (1%). A similar analysis has been reported in [
1] for English web queries with informational, transactional and navigational intents. In that study, the predominant domain was health (89.6%) for informational queries, adult (62.3%) for transactional queries and business (51.9%) for navigational queries. In comparison, the Urdu queries showed a different distribution of query intents with respect to the query domains. This might be due to the limited localized web content in Urdu for the respective domains.
7. Results and Discussion
To demonstrate the effectiveness of the proposed model, the intent detection results for the Urdu web queries dataset are presented and compared with the baselines in terms of accuracy and F1 scores. Finally, ablation results are reported, highlighting the contribution of the various components of the proposed model.
The proposed model was trained using the annotated Urdu web queries dataset by utilizing the train and evaluation sets given in
Table 7.
Figure 4 describes the training and validation accuracies and the loss during the training of the proposed U-IntentCapsNet model. The optimal number of epochs, 40, can be seen from the training accuracy and loss curves: at epoch 40, the training accuracy converged to its highest value and the loss reached its minimum.
Table 8 shows the label-wise intent detection results with the proposed U-IntentCapsNet model. It is evident that the INFO intent had the highest F1 score, 0.9439, while the NAV and TRAN intents had comparatively lower and similar F1 scores, 0.7553 and 0.7555, respectively.
In
Table 9 the confusion matrix for intent detection with the proposed U-IntentCapsNet model is presented. It is evident that the majority of the confusions were between the TRAN-INFO and NAV-INFO classes. One obvious reason is the lexical similarity between the TRAN-INFO and NAV-INFO queries. For example, “Geo News” (the name of a local TV channel) was NAV, and “Kashaf Drama” (the name of a drama) was TRAN while the “News drama industry” was INFO. The majority of the INFO queries contained terms occurring individually in queries with a TRAN or NAV intent.
7.1. Comparison with Baselines of Alternate Classification Techniques
In this section, the baseline models with alternative text classification techniques are compared with the proposed U-IntentCapsNet in terms of accuracy and F1 scores with respect to intent detection.
It is clearly evident from the results presented in
Table 10 that the proposed model demonstrated a significant improvement over the classifiers using TFIDF-based as well as neural models for text classification. Compared with the best baseline, i.e., Baseline V, in which an n-gram convolution layer was used with max pooling, the proposed model yielded a 0.144 improvement in the F1 score. In comparison with Baseline VII, which only leveraged the BLSTM without capsule networks, the proposed U-IntentCapsNet model showed an improvement of almost 5% in accuracy. This highlights the effectiveness of using a capsule network-based architecture for intent detection. The proposed model also performed significantly well against the feature-engineered, machine learning (ML)-based classification techniques. When compared with the best-performing ML-based baselines, Baselines I/II, having TFIDF as features and SVM/NB as the classifier, the proposed model attained a 7.6% improvement in accuracy. It is evident from
Table 10 that the proposed model demonstrated novelty and effectiveness in modelling text for intent detection and that it improved upon the baselines by a significant margin.
7.2. Comparison with the Baselines Using Pre-Trained Embeddings
To analyze the impact of using pre-trained word vectors, experiments using multiple word vector models as the input embedding layers of the proposed U-IntentCapsNet model were performed. Baselines IX–XI were trained using context-free Word2Vec embeddings of 100, 200 and 300 dimensions, respectively. An experiment was also conducted using pre-trained mBERT as the input word vector representation for the proposed U-IntentCapsNet model. The results of these models in terms of accuracy, precision, recall and F1 scores are given in
Table 11.
The results presented in
Table 11 show that the proposed model performed best when it utilized the vector encodings of tokens and learned the embeddings without pre-trained word representations. When analyzing the experimental results utilizing the W2V embeddings, it is worth noting that intent detection with W2V-100 gave a 0.8136 F1 score. A significant improvement was observed with W2V-200 (Baseline X), where the F1 score increased from 0.8136 to 0.8535. Using W2V-300 (Baseline XI), the results further improved to a 0.8583 F1 score. Using the pre-trained mBERT embeddings in Baseline XII, the overall accuracy attained by the model was 83.94% and the F1 score was 0.8391. These results show that the W2V-based baselines performed better than the contextualized embeddings.
Figure 5 presents the learning curve of the W2V Baselines IX–XI and the proposed model in terms of the accuracy over the epochs.
It is evident from
Figure 5 that the proposed U-IntentCapsNet model attained an improved accuracy of 91.12%, surpassing the W2V-based baselines by a significant margin and converging in fewer epochs, whereas the baselines might have required more epochs to converge to their local maxima. In the context of skewed datasets such as the Urdu web queries dataset used in these experiments, where the INFO class is in the majority, the F1 scores of all the intent classes need to be analyzed to measure the quality of the trained model.
Table 12 highlights the F1 scores of the INFO, TRAN and NAV classes in the dataset for the W2V-based baselines and the proposed model.
It can be further observed from
Table 12 that the proposed model had the highest F1 score for all three intents. The F1 score for the INFO intent, i.e., 0.9439, was the highest among all the baselines. Baseline IX performed the most poorly in predicting TRAN queries, with the lowest F1 score (0.2947) among all the baselines. Baseline XI underperformed the most in predicting NAV queries, with an F1 score of 0.0985, among all the other baselines and the proposed model. The F1 scores of the proposed model for TRAN and NAV were 0.7553 and 0.7555, respectively, which is quite satisfactory, highlighting the fact that the model was able to learn fine-grained differences among the queries.
We delved deeper into this investigation and analyzed the vector representations generated by the W2V model to understand the low performance of the model with pre-trained embeddings. Five frequently searched, two-term queries from the queries dataset given in
Table 13 were analyzed. These queries included examples of five different types of query composition: (i) abbreviations, e.g., “T, V”; (ii) a one-word query split into two terms, e.g., “I, phone”; (iii) compound words, e.g., “smart, phone”; (iv) terms searched together, e.g., “Tom, Jerry”; and (v) newer terms, e.g., “Corona, Vaccine”. In order to visualize the distance between these frequently co-searched terms, we plotted the W2V-based feature vectors for the queries in a two-dimensional plane, as shown in
Figure 6. The dimensionality reduction was performed using t-distributed stochastic neighbor embedding (t-SNE) for visualization. The cosine similarities of the word vectors for the sample queries are given in
Table 13.
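The cosine similarity between two word vectors, as reported in Table 13, can be computed as follows (a standard pure-Python sketch, not the authors' code):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors; 1.0 means
    identical direction, 0.0 means orthogonal (unrelated) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Terms that frequently co-occur in the pre-training corpus (e.g., Tom and Jerry) receive vectors with high cosine similarity, while under-represented terms do not.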
The similarity measures for the feature vectors of the query terms in the sample queries given in
Table 13 show that vectors pre-trained on published or web-based natural language corpora provide coverage for very popular terms such as Tom or Jerry; however, newer terms, e.g., corona-vaccine or smart-phone, are under-represented. Additionally, it was observed that queries contain a large number of transliterated terms, which may not have adequate coverage in natural language corpora. All the queries shown in
Table 13 had transliterated terms; e.g., Tom, Jerry, Corona, Vaccine, Smart, and Phone are all English words adopted into the Urdu language. The spellings of transliterated terms are non-standardized, which adds another dimension of complexity to the coverage of the same word in the corpus used for pre-trained models. This observation was validated through a simple search in the web queries dataset, and the different variations of the words Corona and Iphone that were found are presented in
Table 14.
The multiple spellings shown for Iphone illustrate the inconsistent use of a space between the two terms I and phone. It can further be noted that the multiple typing formats in use have resulted in multiple Unicode character combinations for typing “I” in Urdu, i.e., “ا+ی”, “آ+ء+ی”, “آ+ئی”, and “ا+ئ”. These variations are prominent reasons that can affect the performance of the proposed model when using pre-trained embeddings.
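One common mitigation for such spelling variation is to Unicode-normalize query terms and map known variants to a canonical form before embedding lookup. A minimal sketch, assuming a hand-built variant table (the map below is hypothetical and given only for illustration; it is not from the paper's dataset):

```python
import unicodedata

# Hypothetical variant map: keys are attested spellings of transliterated
# terms; values are the canonical form chosen for embedding lookup.
VARIANTS = {
    "آئیفون": "آئی فون",  # fused "Iphone" mapped to the spaced form
}

def normalize_term(term, variants=VARIANTS):
    """Unicode-normalize a query term (NFC) and map known spelling
    variants to a single canonical form."""
    term = unicodedata.normalize("NFC", term.strip())
    return variants.get(term, term)
```

Normalizing in this way collapses several surface forms onto one vocabulary entry, improving the effective coverage of a pre-trained embedding table.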
7.3. Ablation Results
To study the contribution of the different modules of the U-IntentCapsNet, the ablation test results are presented in
Table 15. The “U-IntentCapsNet w/o BLSTM” used the LSTM with only a forward pass; the “U-IntentCapsNet w/o Regularizer” did not include the dropout of 0.3 used in the model to avoid overfitting.
From the results of these experiments presented in
Table 15, it is clear that every module of the proposed U-IntentCapsNet played an instrumental part in improving the overall performance of the model. In the proposed model, the BLSTM significantly contributed to boosting the performance by 3.35%, and introducing regularization to avoid overfitting made a comparable contribution by increasing the accuracy by 1%.
As shown in
Table 16, the proposed U-IntentCapsNet architecture was further evaluated by varying the dimensionality of the input vector, training models with 200- and 100-dimensional inputs. The results showed a 4% and 5% decrease in accuracy when using 200 or 100 input vector dimensions, respectively. Further experiments were conducted by varying the number of routing iterations, i.e., two, three, and five, in the proposed model.
Iterative routing computes and updates the coupling coefficient that determines the contribution of the t-th word to the k-th intent. In order to determine the best number of iterations for the coupling coefficients, experiments were conducted with two, three, and five routing iterations. The accuracy and F1 scores of the model with the varying numbers of iterations are given in
Table 17, and the influence of the routing iterations on the proposed model is visually presented in
Figure 7. It is evident that the proposed model with three iterations converged faster and gave the best results.
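The iterative routing described above can be sketched in simplified form. The pure-Python illustration below omits the squash non-linearity and the learned transformation matrices of a full capsule network, and all names are illustrative rather than the authors' implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def route(predictions, iterations=3):
    """predictions[t][k] is the vector the t-th word capsule predicts for
    the k-th intent capsule. Returns the final coupling coefficients
    (one softmax-normalized row per word) and the intent vectors."""
    T, K, D = len(predictions), len(predictions[0]), len(predictions[0][0])
    logits = [[0.0] * K for _ in range(T)]
    for _ in range(iterations):
        coupling = [softmax(row) for row in logits]
        # Each intent vector is the coupling-weighted sum of predictions.
        intents = [[sum(coupling[t][k] * predictions[t][k][d]
                        for t in range(T))
                    for d in range(D)] for k in range(K)]
        # Agreement between a prediction and its intent raises the logit.
        for t in range(T):
            for k in range(K):
                logits[t][k] += dot(predictions[t][k], intents[k])
    return coupling, intents
```

With each iteration, words whose predictions agree with an intent vector couple more strongly to that intent, which is why too few iterations under-sharpen the coupling while too many can over-commit it.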
7.4. Error Analysis
Errors in intent detection have been thoroughly analyzed to understand the model’s performance and categorize the areas in which the model did not perform well. The salient observations are given below:
Skewed dataset: As per general searching trends, web queries datasets consist of more informational queries than navigational and transactional queries [
12]. This characteristic was also present in the Urdu web queries dataset, in which 76% of the queries were informational. The maximum inter-class confusion was observed between the NAV-INFO and TRAN-INFO classes; however, the proposed model performed better at discriminating these classes than the baselines.
Named entities: NAV class queries that did not have “www” or a domain identifier, e.g., .com or .pk, tended to be misclassified as INFO. A deeper analysis showed that most of these queries contained brand names or other named entities with a very low occurrence in the dataset, potentially causing the misclassification.
Queries with more than one valid intent: Due to the nature of the data, it is possible for the user intent to belong to two intent classes. For this reason, the TRAN class suffered the most, as transactional queries, being more descriptive, adopted the jargon and characteristics of INFO queries. For example, “ڈورےمون” (Doremon) and “بانگدرہ پر تبصرہ” (Analysis of Bang-e-Dara, pointing to a book download) were predicted as INFO although annotated as TRAN, when both labels could be valid. Similar confusion was seen in NAV queries, which were largely misclassified as INFO.
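Inter-class confusions such as NAV-INFO and TRAN-INFO can be located by tallying (gold, predicted) label pairs. A minimal sketch of this bookkeeping (illustrative, not the analysis script used for the paper):

```python
from collections import Counter

def confusion_counts(gold, pred):
    """Count (gold, predicted) label pairs from two parallel label lists."""
    return Counter(zip(gold, pred))

def most_confused(counts):
    """Return the off-diagonal (gold, predicted) pair with the highest
    count, i.e., the dominant misclassification direction."""
    off_diag = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return max(off_diag, key=off_diag.get) if off_diag else None
```

Applied to a skewed dataset like this one, the dominant off-diagonal pairs typically point into the majority class, matching the NAV-to-INFO and TRAN-to-INFO confusions observed above.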