Aggregate Functions in Categorical Data Skyline Search (CDSS) for Multi-keyword Document Search

Mardiah Mardiah(1*), Annisa Annisa(2), Shelvie Nidya Neyman(3)

(1) Department of Computer Science, IPB University
(2) IPB University
(3) IPB University
(*) Corresponding Author
DOI: https://doi.org/10.23917/khif.v9i1.18127

Abstract

A literature review is the first step in a research project and builds a deep understanding of the research area. However, finding literature relevant to a research interest is difficult and time-consuming. A skyline query is a filtering method that returns the objects not dominated by any other object. An object p is said to dominate an object q if p is no worse than q on all attributes and strictly better than q on at least one attribute. Categorical Data Skyline Search (CDSS) is an algorithm that finds skyline objects in categorical data such as documents. CDSS uses Extended Distance Wu and Palmer (DEWP) to calculate the distance between user query keywords and document keywords. The document keywords and user queries are represented as nodes in the ACM CCS ontology, and each document is assumed to be represented by a single keyword. This study extends CDSS to search for skyline documents represented by more than one keyword by adding an aggregate function (average, minimum, or maximum) to the DEWP calculation. The data set consists of thesis documents from the IPB University computer science department. Document keywords are extracted using the Term Frequency-Inverse Document Frequency (TF-IDF) method. The extracted keywords are mapped onto a mixed ontology tree based on the Association for Computing Machinery Computing Classification System 2012 (ACM CCS 2012) and the Computer Science Ontology (CSO), which serve as ontology standards in computer science. Skyline documents are determined with the Block Nested Loop (BNL) skyline query algorithm. The evaluation compares the skyline ratio obtained with each aggregate function in CDSS. Based on the ratio values, CDSS with the maximum DEWP yields the most relevant skyline results compared with the average DEWP and the minimum DEWP.
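
The idea of aggregating keyword distances before the skyline step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the distance table TOY_DIST, the keyword labels, and the document names are made-up placeholders standing in for DEWP values computed on the ontology, and the sketch assumes that a smaller distance is better. Only the dominance test and the Block Nested Loop scan follow the general definitions given in the abstract.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical distances between a query keyword and a document keyword.
# In the paper these would be DEWP values; the numbers here are invented.
TOY_DIST: Dict[Tuple[str, str], float] = {
    ("q1", "a"): 0.25, ("q1", "b"): 0.75, ("q1", "c"): 0.5, ("q1", "d"): 1.0,
    ("q2", "a"): 0.75, ("q2", "b"): 0.25, ("q2", "c"): 0.5, ("q2", "d"): 1.0,
}

# Hypothetical documents, each represented by one or more keywords.
DOCS: Dict[str, List[str]] = {
    "doc1": ["a", "b"],
    "doc2": ["c"],
    "doc3": ["c", "d"],
}

QUERY: List[str] = ["q1", "q2"]


def distance_vector(keywords: List[str], agg: Callable) -> List[float]:
    """One aggregated distance per query keyword (one skyline attribute each)."""
    return [agg([TOY_DIST[(q, k)] for k in keywords]) for q in QUERY]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q: no worse everywhere, strictly better somewhere.

    Assumption: smaller distance means a better match.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl_skyline(points: Dict[str, Sequence[float]]) -> List[str]:
    """Block Nested Loop skyline with an unbounded in-memory window."""
    window: List[str] = []
    for name, vec in points.items():
        # Skip the candidate if something already in the window dominates it.
        if any(dominates(points[w], vec) for w in window):
            continue
        # Otherwise evict window entries that the candidate dominates.
        window = [w for w in window if not dominates(vec, points[w])]
        window.append(name)
    return window


def skyline_for(agg: Callable) -> List[str]:
    """Skyline documents when keyword distances are combined with `agg`."""
    vectors = {name: distance_vector(kws, agg) for name, kws in DOCS.items()}
    return bnl_skyline(vectors)


if __name__ == "__main__":
    for label, agg in (("average", mean), ("minimum", min), ("maximum", max)):
        print(label, skyline_for(agg))
```

On this toy data the three aggregate functions return different skyline sets, which is the effect the study evaluates with the skyline ratio. The toy numbers say nothing about which aggregate is more relevant in practice.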

Keywords

categorical data skyline search, aggregate function, ontology, skyline query, term frequency-inverse document frequency
