Template-Type: ReDIF-Article 1.0
Author-Name:Muhammad Hashir, Zulqarnain Channa, Shamshad Lakho, Atta Muhammad Panhyar, Manzoor Hussain, Muhammad Ibrahim Channa
Author-Email:hashirlakho46@gmail.com,zulqarnainchana@gmail.com,shamshad.lakho@quest.edu.pk,attapanhyar@quest.edu.pk,manzoorhussain575@gmail.com,muhammadibrahimchanna112@gmail.com
Author-Workplace-Name:Department of Information Technology, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan; Department of Computer Science, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan; Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah, Pakistan
Title:Sindhi Keyword Extraction from Online Articles for SEO Experts Using Web Scraping and MultiBERT Model
Abstract:Keyword extraction for search engine optimization (SEO) in Sindhi (سنڌي) is hampered by a lack of computational tools, poor optimization for low-resource languages, and the peculiarities of the Sindhi script. These limitations make Sindhi content difficult to index and reduce the visibility of Sindhi web pages in search engine results. To mitigate these issues, this paper presents a deep learning-based approach to Sindhi keyword extraction that combines a multilingual BERT (MultiBERT) model with Named Entity Recognition (NER). Over 6,300 Sindhi news articles were collected by web scraping the Daily Kawish. The scraped data, including URLs, categories, and textual content, was organized in CSV format and then normalized to accommodate linguistic variation in Sindhi text. A multilingual BERT-based NER model was then fine-tuned on the processed data to identify keywords. Experimental results show that the proposed model achieves an accuracy of 92.5%, precision of 91.8%, recall of 89.6%, and F1-score of 90.7%. It outperformed conventional keyword extraction baselines, including TF-IDF, TextRank, and RAKE, by up to 17% in F1-score, demonstrating its effectiveness for low-resource language processing. The extracted keywords were then visualized to examine their distribution and relevance. The proposed framework offers a working model for improving Sindhi keyword extraction and gives SEO professionals a practical means of enhancing content visibility in low-resource languages. It also contributes to the development of natural language processing (NLP) for regional languages and provides a basis for future work on Sindhi text analytics.
Keywords:Sindhi Language, Keyword Extraction, Deep Learning, Natural Language Processing, Multilingual BERT, NER, Web Scraping, Text Normalization, Search Engine Optimization
Journal:International Journal of Innovations in Science and Technology
Pages:336-350
Volume:7
Issue:10
Year:2025
Month:December
File-URL:https://journal.50sea.com/index.php/IJIST/article/view/1766/2655
File-Format: application/pdf
File-URL:https://journal.50sea.com/index.php/IJIST/article/view/1766
File-Format: text/html
Handle: RePEc:abq:IJIST1:v:7:y:2025:i:10:p:336-350