Full Text

Turn on search term navigation

1. Introduction

As an application of an open source operating system in the field of smart phones, the Android system faces risks and serious challenges related to malicious applications. In addition, with the continuous growth in the number of applications in the Android market, the system has become an easy target for malware authors to attack. Recent statistics [1] reveal that the total number of new Android malware samples till March 2020 amounted to 482,579 per month. Additionally, over 4.18 million malicious applications were found on the Android platform in 2019, with approximately 11,500 new Android malware instances emerging on a daily basis. Some conventional commercial malware detection systems primarily utilize string signatures for retrieving established known malware databases for fast and lightweight detection. Some other malware detection systems use static and dynamic analysis for malware variants detection. Nevertheless, given the prodigious quantity of novel malware emerging daily, these methodologies frequently fall short in achieving comprehensive coverage. Consequently, in addressing the escalating number of novel malware, the detection capabilities of these traditional and commercial malware detection systems has proven inadequate.

In recent years, many strategies have been devised by researchers to mitigate the impact of malicious software. These methodologies leverage an integration of static and dynamic analysis techniques, combined with machine learning algorithms, to effectively discern malicious software on the Android platform. This identification process relies heavily on features like opcodes. Given that Dalvik opcodes [2] serve as a crucial static feature that can meticulously depict the behavioral patterns of Android programs, a multitude of research endeavors have been dedicated to the extraction of Dalvik opcodes as distinctive features, coupled with the employment of a variety of machine learning strategies for the purpose of detection. Among them, most studies focus on the semantic information within opcode sequences. Some studies select opcode N-grams as semantic features (generally, the N value is less than 3 as an excessively high feature dimension can significantly impact performance). The formula for calculating the number of feature dimensions is $o p_{n u m}^{N}$ , where $o p_{n u m}$ represents the number of opcode types, and then traditional machine learning algorithms or statistical models are employed as classifiers to distinguish between malicious and legitimate programs.

Although the opcode-based method has demonstrated its practicality in many applications, it has encountered bottlenecks in the pursuit of higher accuracy. When trying to integrate more semantic features, it has not only failed to bring significant improvements in accuracy, but also increased the time cost of processing. In addition, the excessively long opcode sequence leads to difficulties in understanding the underlying semantics, which increases the complexity of analysis and processing. In order to break through this bottleneck and deeply understand the operational characteristics of malware, it is particularly important to explore new feature extraction methods, which are expected to bring a new perspective and understanding to malware analysis.

Given the rich information contained in the dependencies among opcodes and their neglect in research, we aim to delve into these dependencies and construct opcode graphs accordingly to analyze their topological characteristics. Currently, there are research methods based on control flow graphs or API call graphs and searches for resemblances among the subgraphs of unlabeled instances and their labeled counterparts. Regrettably, these methods are prone to evasion detection through changes in call references due to the coarse granularity of call block acquisition. In other words, when obfuscating the call references of a malware, it might evade the detection. Additionally, there are file-to-file graph-based methods, through the establishment of interactions between files and hosts, which construct graphs on a file-to-file graph to identify the malware which has similar network behaviors. However, these approaches encounter difficulties in managing novel malware variants owing to the challenges when the scale of the file-to-file graph is insufficient.

We opted to construct an opcode graph, as opposed to depending on API or system calls, to more precisely illustrate the interdependencies among opcodes. We employ transition probabilities, as opposed to control or data flow, to handle the vital statistical information required to differentiate between benign and malicious software. Notably, we extract frequent subgraphs from them and calculate their topological features to describe the characteristics of malicious software, as shown in Figure 1. The topological features mean the dependencies and their information entropy. The topological features revealed by these frequent subgraphs provide us with an understanding of the characteristics of malicious software. The topological features will be embedded into a neural network by graph convolution. By extracting the topological features of the subgraphs, we can effectively detect Android malicious software. Topological features contain rich topological information, thereby improving the accuracy of detection. In addition, subgraphs also demonstrate clusters of operation chains, making it easier for humans to understand and analyze the practical significance of these features.

The detection of Android malware variants through the utilization of subgraphs presents a number of challenges. Given the extensive amount of subgraphs present in opcode graphs, the processes of subgraph extraction, model creation for topology feature computation, and the utilization of these features for malware detection are paramount. In this study, we initiate by constructing an opcode graph, where the vertices represent opcodes, and the directed edges depict dependencies between sequential opcodes. The edge property values represent the co-occurrence probabilities of two consecutive opcodes. Subsequently, we extract frequent subgraphs of a uniform size. Following this, we introduce an innovative graph convolutional neural network to compute distinct topology features for each subgraph. These features are then utilized to train a classification model aimed at detecting malware.

The main contributions of this paper are summarized as follows:

Upon conducting a thorough analysis, it has been discovered that the dependencies among opcodes encompass significant information, exhibiting a distinctive trait of malware. To further elaborate, by constructing a graph of opcodes based on these dependencies, numerous frequent subgraphs and topological characteristics of these subgraphs were extracted. These extracted features serve as valuable representatives for the identification and detection of malware variants.
A novel approach, utilizing graph convolutional neural networks, has been put forward by us for the purpose of detecting Android malware in a highly effective and efficient manner. This neural network architecture is capable of extracting and embedding frequent subgraphs, subsequently obtaining topological features of these subgraphs by computing the dot product between their adjacent matrices and several randomly initialized convolutional cores.
A prototype has been successfully implemented and has undergone rigorous evaluation using Drebin [3] and the MobileSandbox project [4] datasets. The evaluation outcomes demonstrate that our method exhibits remarkable stability in attaining a high degree of accuracy, nearing 95%, while maintaining minimal detection time. Specifically, the average detection time per executable remains below 0.1 s.

The paper is organized as follows: Section 2 describes related works. Section 3 presents our frequent subgraph convolutional neural networks-based Android malware detection technique. Section 4 presents evaluations of our approach. Section 5 and Section 6 present the discussion and conclusions.

2. Related Works

Recent research has demonstrated a strong preference for utilizing machine learning techniques, including deep learning, alongside graph-based methodologies for the detection of malware and its evolving variants. Liu et al. [5] proposed a novel lightweight deep learning framework for Android malware detection based on attention temporal networks. The framework introduced multi-head attention mechanisms and reinforcement learning to direct the model’s focus toward behavioral clues in malware sequences. The average accuracy exceeded that of traditional machine learning methods. Muzaffar et al. [6] developed a tool named DroidDissector, which specializes in Android malware detection. The tool combines static and dynamic feature extraction techniques, significantly improving the detection capabilities for mobile platform malware. Research indicates that the tool outperforms existing solutions in various real-world scenarios. Araujo et al. [7] proposed a malware detection method based on a CNN. This method converts malware binary files into 2D images and uses the CNN for classification, effectively identifying previously unseen malware variants. Experimental results showed that the CNN model has strong expressive power when handling high-dimensional features, improving detection accuracy. Seneviratne et al. [8] proposed a self-supervised learning-based method for unsupervised malware detection. This approach utilizes self-supervised learning techniques to automatically extract potential malicious behavior features from unlabeled data, and combines traditional machine learning models for classification. The study showed that malware detection systems using self-supervised learning can still perform strongly even when labeled data are scarce and adapt to different types of malware and attack patterns. He et al. [9] presented a transfer learning-based approach that leverages knowledge from known malware categories to help detection systems identify new malware types. This research showed that transfer learning can effectively overcome issues such as malware sample imbalance and label scarcity, improving the adaptability of malware detection systems. Paola et al. [10] proposed a big data-based malware detection framework that combines distributed computing and machine learning models to accelerate the malware detection process through parallel processing. The method uses distributed data storage and multi-node computing to reduce the time required for model training and prediction, and is particularly suitable for large-scale malware datasets. Research shows that in big data environments, this method can significantly improve the accuracy and efficiency of malware detection.

Antonio et al. [11,12] proposed a supervised learning technique that demonstrated promising results in Android malware detection. They also created a comprehensive labeled dataset, comprising over 18,000 samples classified into five distinct categories. They provide valuable perspectives on machine learning-based attack detection and Android vulnerabilities. Han et al. [13] proposed a clustering-based anomaly detection method for network traffic. The study applied the K-means clustering algorithm to analyze network traffic data, extracting features of normal and abnormal traffic and comparing them with historical traffic to monitor malware spread in real-time. This method operates efficiently in large-scale network environments with low computational costs and high detection efficiency. Chen et al. [14] proposed a malware detection method based on attack graphs. By modeling nodes, attack behaviors, and propagation paths in the network, the researchers built a dynamic malware propagation model that can effectively predict malware spread paths and monitor devices potentially affected in real-time, providing decision support for malware defense. Daniel et al. [15] proposed a hybrid analysis method based on deep learning, which combines static code analysis and dynamic behavior analysis. The method first extracts key features of malicious code through static analysis and then uses dynamic behavior analysis to further confirm malicious activities. By incorporating Deep Neural Networks (DNNs), the system can classify malware with high accuracy in a short time. This method demonstrated high precision on multiple datasets, especially in handling highly mutated malware. Singh et al. [16] proposed a multimodal learning-based malware detection framework. The framework integrates static file features and dynamic behavior features through multimodal neural networks, enabling the establishment of connections between static and dynamic information of malware. The experiment showed that combining multimodal information significantly improved detection accuracy, especially when dealing with complex or novel malware. Dai et al. [17] proposed a malware detection framework based on ensemble learning and big data analytics. The framework uses multiple machine learning models (such as Support Vector Machine, Random Forest, and Convolutional Neural Networks) to process big data from multiple sources in parallel, effectively identifying various types of malware. The study showed that combining ensemble learning with big data processing not only improved detection rates but also significantly reduced false positives. Yerima et al. [18] proposed a multi-level decision fusion framework that combines static analysis, dynamic behavior analysis, and machine learning models. The framework improves malware detection accuracy through multi-level decision fusion strategies by weighting and integrating the outputs of different feature extractors and classifiers. The experiments showed that this method is more advantageous than single-method approaches when handling complex malware attacks.

Despite significant progress in malware detection technologies, several challenges remain. First, the issue of data imbalance continues to be severe. The scarcity of malware samples leads to model training being biased towards normal software, affecting the detection of low-frequency or novel malware. Although methods such as data augmentation and Generative Adversarial Networks have improved this situation, limitations still exist. Currently, detection methods perform poorly when facing unknown malware. Existing models usually rely on training with known samples, limiting their ability to adapt to zero-day attacks and novel malware. Although adaptive and reinforcement learning methods have been proposed, their actual effectiveness remains limited. In summary, despite the progress made in malware detection technologies in terms of accuracy and efficiency, there are still many challenges in data processing, real-time performance, adaptability, accuracy, and robustness. These issues need to be effectively addressed in future research.

3. Methods

3.1. Overview of Our Approach

In the forthcoming section, we advocate for the utilization of a frequent subgraph convolutional neural network to tackle the challenge of detecting Android malware variants. The operational schema of our method encompasses four stages: the process of unpacking and disassembly, the assembly of the opcode graph, the extraction of frequent subgraphs, and feature training and classification rooted in graph convolutional neural networks.

Step 1: Unpacking and disassembly process. The procedure of unpacking and disassembly is designed to deconstruct and unpack binary files (.apk files) to unveil their opcode sequences. Given the fact that certain binaries are encapsulated utilizing specific packing tools, this complicates the disassembly process. To circumvent this, we examine the binaries and initially unpack them using several unpacking tools, as noted in reference materials such as [19], among other resources. However, if a binary has not been previously packed, the unpacking stage becomes redundant. While decompiling an APK file, we only consider Java-based applications. The unpacked .apk files are then decompressed to procure their .dex files. Through the application of the Dedexer tool [20], these .dex files are disassembled to generate .ddx files, which aids the extraction of opcode sequences. Methods involving multiple disassembly tools have been avoided to prevent any potential skewing of the results. Post the systematic scanning of the .ddx files, an opcode profile is established for each binary. This profile is essentially a compilation of opcodes. To illustrate, an opcode profile could potentially contain a list like NEW-INSTANCE, SGET-OBJECT, INVOKE-DIRECT, INVOKE-VIRTUAL, INVOKE-VIRTUAL, INVOKE-VIRTUAL, MOVE-RESULT, ....

Step 2: Opcode graph construction. Initially, our endeavor starts with securing opcode sequences by disassembling .dex files, which are originally decompressed from .apk files, the Android executables. Each opcode sequence is then transformed into a directed graph, with each file having its distinct graph structure. In the context of this structure, each vertex signifies an operation, while each edge denotes the relationship existing between a pair of sequential operations. The metric assigned as the property value for each edge illustrates the co-occurrence likelihood of two companion operations.

Step 3: Frequent subgraph extraction. Subsequently, the amalgamation of all the opcode graphs ensues, resulting in an aggregated opcode graph. Within this composite structure, the vertices delineate the operations, while the edges map out the relationships. The property values provide a representation of the average co-occurrence probabilities. Progressing this, we derive numerous frequent subgraphs from the comprehensive opcode graph. These frequent subgraphs are essentially substructures wherein each vertex is either connected to, or connected by, at least one other vertex within the subgraph, providing that the information entropy of the connection surpasses a specified threshold.

Step 4: Graph convolutional neural networks training. Post the extraction of the subgraphs, the adjacency matrices corresponding to these subgraphs are introduced into a graph convolutional neural network. Here, the neural network performs computations to establish the dot product of the filter entries and the adjacency matrix, thereby generating a specific subgraph topology feature in relation to that convolutional core. Additionally, pooling, a type of down-sampling procedure, is implemented to decrease the feature dimensions. Lastly, a softmax classifier is utilized to categorize the subgraph features.

By implementing the preceding four stages, we possess the ability to identify variants of Android malware in an efficient and effective manner.

3.2. Dalvik Opcode Graph Construction

The Dalvik opcode graph can be delineated as a directed graph which comprises vertices embodying operations and directed edges signifying the connections between these operations. Allowing $G = (V, E, P)$ to stand in for the opcode graph, V represents the vertex set, E constitutes the set of directed edges, and P is a compilation of edge property values. Each individual vertex, noted as v, symbolizes a specific opcode, for instance, ADD-DOUBLE, INVOKE-DIRECT-EMPTY, MONITOR-ENTER, and so forth. Every directed edge epitomizes the dependencies amongst two consecutive opcodes. To illustrate, the edge referred to as $e_{i, j} = < v_{i}, v_{j} >$ depicts a sequence of operations transitioning from $v_{i}$ to $v_{j}$ . Furthermore, each property value, denoted as $p_{i, j}$ , embodies the co-occurrence likelihood of the corresponding edge $e_{i, j}$ .

Consider $o s_{i} = {o p_{j}, o p_{k}, \dots, o p_{t}, o p_{m}}$ as a specific opcode sequence of an application, where its length is signified by m. Envision $O P_{t o t a l} = {o p_{1}, o p_{2}, \dots, o p_{n}}$ as the opcode set, wherein n indicates the number of opcode types. Also, conceive $O P_{i} = {o p_{j}, o p_{k}, o p_{t}, o p_{m}, \dots}$ as a subset of $O P_{t o t a l}$ , incorporating opcodes that also reside in $o s_{i}$ . Let $G_{i} = (V_{i}, E_{i}, P_{i})$ be defined as the program’s opcode graph. $G_{i} = (V_{i}, E_{i}, P_{i})$ is autonomously constructed through interlinking the opcodes found in the sequence $o s_{i}$ in line with the sequential correlations between two opcodes, wherein $V_{i} = O P_{i}$ and $E_{i} = {e_{j, k} = < o p_{j}, o p_{k} >, \dots, e_{t, m} = < o p_{t}, o p_{m} >}$ .

For instance, we have $o s_{i}$ = NEW-INSTANCE, SGET-OBJECT, INVOKE-DIRECT, INVOKE-VIRTUAL, INVOKE-VIRTUAL, INVOKE-VIRTUAL, MOVE-RESULT, ..., which is a sequence of opcodes for a program. From $o s_{i}$ , we derive $O P_{i}$ = NEW-INSTANCE, SGET-OBJECT, INVOKE-DIRECT, INVOKE-VIRTUAL, MOVE-RESULT, .... In the program’s $G_{i} = (V_{i}, E_{i})$ , we have $V_{i} = O P_{i}$ and the $E_{i}$ = <NEW-INSTANCE, SGET-OBJECT>, <SGET-OBJECT, INVOKE-DIRECT>, <INVOKE-DIRECT, INVOKE-VIRTUAL>, <INVOKE-VIRTUAL, INVOKE-VIRTUAL>, <INVOKE-VIRTUAL, MOVE-RESULT>, ....

In every executable, we construct an opcode graph that reflects the relationships between every pair of successive opcodes in the opcode sequences. Each interrelationship between two adjacent opcodes signifies the operational dependency. Given the extensive length of the opcode sequences, the resultant opcode graph is highly interconnected, signifying a vast quantity of dependency relations.

While we note that neither the operations nor the relations in malware are exclusive—in other words, the operations and dependencies in malware might also be found in benign software—the capacities and interactions within the graph alone do not sufficiently categorize malware. However, the co-occurrence probabilities of two consecutive operations provide a distinct point of difference between malware and benign files. Therefore, in this study, we employ these co-occurrence probabilities as attribute values of the relationships between two successive operations.

Consider val( $e_{i, j}$ ) as the property value $p_{i, j}$ of $e_{i, j}$ , signifying the co-occurrence likelihood of the connections $e_{i, j}$ from $v_{i}$ to $v_{j}$ as per Equation (1). Here, $N_{e_{i, j}}$ symbolizes the frequency of $e_{i, j}$ , obtained by reckoning the count of $e_{i, j}$ within the program.

(1) $v a l (e_{i, j}) = \frac{N_{e_{i, j}}}{\sum_{k = 1, t = 1}^{n, n} N_{e_{k, t}}}$

3.3. Frequent Subgraph Extraction

Upon constructing the opcode graph for each executable, we begin to amalgamate the opcode graphs of every executable included in both the malware and benign training datasets, resulting in two average opcode graphs. In these graphs, the nodes symbolize the operations, the edges depict the relations or connections, and the property values signify the average co-occurrence probabilities in accordance with Equations (2) and (3). Here, $v a l_{m a l w a r e} (e_{i, j})$ is the mean property value of $e_{i, j}$ in malware, while $v a l_{b e n i g n} (e_{i, j})$ is the mean characteristic value of $e_{i, j}$ in benign software. Meanwhile, $G_{k}$ denotes the opcode graph of the $k^{t h}$ executable, and $v a l_{G_{k}} (e_{i, j})$ represents the property value of the edge $e_{i, j}$ in $G_{k}$ . Subsequently, we extract numerous frequent subgraphs from the average opcode graph. A frequent subgraph is one where each vertex $v_{i}$ either connects to at least one vertex $v_{j}$ or is connected by at least one vertex $v_{j}$ within the subgraph. The information entropy $I E (v a l (e_{i, j}))$ of $v a l (e_{i, j})$ is greater than a specified threshold (in this case, we set the threshold at 2), as per Equation (Equation (4)). Lastly, we assemble these frequent subgraphs and compile them into a list. This list of frequent subgraphs will then be utilized to extract the frequent subgraph for each executable.

(2) $v a l_{m a l w a r e} (e_{i, j}) = \frac{\sum_{G_{k} \in m a l w a r e} v a l_{G_{k}} (e_{i, j})}{n_{G_{k}}}$

(3) $v a l_{b e n i g n} (e_{i, j}) = \frac{\sum_{G_{k} \in b e n i g n} v a l_{G_{k}} (e_{i, j})}{n_{G_{k}}}$

(4) $I E (v a l (e_{i, j})) = l n (\frac{v a l_{m a l w a r e} (e_{i, j})}{v a l_{b e n i g n} (e_{i, j})})$

Although there exist some edge cases, such as dense graphs where most nodes are connected, we use the information entropy $I E (v a l (e_{i, j})) > t h r e s h o l d$ to choose the connections which are significant for malware and drop the others.

For the extraction of frequent subgraphs from the average opcode graph, where the property values determine the degree of support for each edge, upon extraction, we scale down the redundant subgraphs. These subgraphs have a consistent size of K, resulting in uniform adjacency matrices. Every subgraph can be represented by an adjacency matrix A of size $K \times K$ , whereby $A_{i, j} = v a l (e_{i, j})$ if an edge exists from vertex $v_{i}$ to vertex $v_{j}$ , otherwise $A_{i, j} = 0$ . The adjacency matrix A is then sent to our graph convolutional neural networks for computation of a topology feature.

3.4. Graph Convolutional Neural Networks

We propose a graph convolutional neural net for embedding the opcode subgraph and for training the features of the subgraph topology. The structure of our neural networks, depicted in Figure 1, encompasses a subgraph layer, convolutional layer, pooling layer, fully connected layer, softmax layer, and an output layer. The subgraph layer is tasked with receiving the adjacency matrices of subgraphs. The convolutional layer calculates the dot product between the adjacency matrices and several convolutional kernels initialized randomly, thus obtaining the topology features of the subgraphs. The pooling layer leverages mean pooling to decrement the number of feature dimensions. Through the fully connected and softmax layers, the neural networks produce a vector comprising two values representing the probabilities of malware and benign software, respectively.

Training process. During the forward pass, we initially extract frequent subgraphs for each executable based on the aforementioned list of frequent subgraphs and feed the adjacency matrices of the opcode subgraphs into our neural networks. Subsequently, every convolutional kernel convolves across the matrix’s width and height, computing a subgraph topology feature specific to that convolutional kernel. These features are forwarded to the ensuing pooling layer to diminish the quantity of feature dimensions. The pooling layer is fully connected to the fully connected layer. The pooling operation acts as a down-sampling technique, and in this context, we utilize mean-pooling. Following this, the fully connected layer, along with the softmax classifier, are used to train the features derived from malicious and legitimate executables. Ultimately, the softmax classifier emits a vector of two confidence values, indicative of the probabilities of malware or benign intrusion, respectively.

During the backpropagation phase, the neural networks leverage the gradient descent technique [21] to retro-propagate the variance from the output layer down to the input layer, updating the weight matrices across adjacent layers in the process. The objective function is outlined in Equation (5), introducing X as the inputs, W as the weights within the neural network, $H (W \cdot X)$ as the intermediate outcome of a layer during the forward pass, and Y as the label value (1/0). The weights are updated as prescribed by Equations (6) and (7), introducing $α$ as the learning rate and $(Y - H (W \cdot X))$ as the variance. The softmax function is defined following Equation (8), leading us to integrate $H (W \cdot X)) = S o f t m a x (W \cdot X)$ within Equation (6), consequently updating the weights spanning from the softmax layer to the output layer as dictated by Equation (9).

(5) $m i n E = {(Y - H (W \cdot X))}^{2}$

(6) $▵ W = α \cdot \frac{\partial E}{\partial W} \cdot (Y - H (W \cdot X)))$

(7) $W = W + ▵ W$

(8) $S o f t m a x (W \cdot X) = \frac{e x p (W \cdot X)}{\sum e x p (W \cdot X)}$

(9) $▵ W = α \cdot S o f t m a x (W \cdot X) \cdot X \cdot (Y - H (W \cdot X))$

Over multiple iterations, the variance within the neural networks is brought to convergence. These neural networks establish a model that is subsequently leveraged for the detection of Android malware variants during the detection stage.

4. Results

In this section, we present a series of experiments designed to attest to the effectiveness of the methodology we have proposed. We initially elaborate on the specific setup of the experiments, including the deployed dataset and verification procedure. Subsequently, we conduct a comprehensive performance evaluation of our method and, by contrasting it with current state-of-the-art methods, we provide ample evidence of its superior efficacy.

4.1. Setup, Dataset, and Validation

Our experimental setting is established on a single computer system. It features an Intel i7-13700kf @ 3.40 GHz processor (Intel Corporation, Santa Clara, CA, USA), equipped with a RAM of 16.0 GB, and operates on the Linux Ubuntu 20.04 system (Canonical Ltd., London, UK). The development of our approach uses the Java programming language because it is suitable for deployment on the Android platform with efficient performance.

In analyzing the performance of our approach towards detecting Android malware variants, we examine two distinct datasets: one composed of Android malware and the other of Android benign data. To approximate a realistic environment, we sourced the Android benign data from Google Play [22]. The instances of Android malware were aggregated from Drebin [3] and the MobileSandbox project [4]. The final dataset that we assembled comprises 5550 benign Android apps and 5560 instances of Android malware, as illustrated in Table 1. A total of 2000 benign Android apps and 2000 Android malware apps are used for training; the other instances are used for testing. The potential skew in training datasets may refer to issues where the distribution or representation of data within the dataset is not balanced or fair, for example, class imbalance, sample selection bias, etc. Addressing these issues requires employing techniques like data augmentation, stratified sampling, and adjusting for biases throughout the development lifecycle of the model. In the case of opcode-based methodologies, we disassemble Android instances using Dedexer [20], enabling us to procure their opcode sequences.

To assess the efficiency of our methodology, we employ a k-fold cross-validation technique in our experiments, with the resultant values being averaged for our findings. To avert potential data skew, both the preliminarily labeled malicious and benign datasets utilized for the training process are of an equivalent size. For the purposes of training and detection, we arbitrarily select 2000 samples each from our malware and benign datasets, previously labeled, with the remaining samples in our datasets serving as undefined targets.

4.2. Data Pre-Processing

We initiate our approach by writing a script, which enables automatic decompression of .apk files to .dex files, for each benign sample and malware. This is followed by the disassembly of the .dex file to the .ddx file, facilitated via the previously mentioned tool, Dedexer [20]. Subsequently, we extract the opcode sequence from the .ddx file, producing sequences similar to “SGET-OBJECT, INVOKE-VIRTUAL, ...”. This opcode sequence is later utilized to construct an opcode graph based on the co-occurrence probabilities of sequential opcodes. Thanks to the standardized format of the .ddx files, extracting opcode sequences and building the respective Dalvik opcode graph becomes straightforward. It is worth mentioning that this graph construction methodology applies to all .ddx files.

4.3. Hyper-Parameter Settings

The hyper-parameters can exert a significant influence on the performance outcome. We make note of these hyper-parameter settings in this subsection, appropriate to our methodology. During the initialization process, we establish the hyper-parameters, which encompass the size of the subgraph adjacent matrix, the size of the subgraph, the count of the convolutional layers and cores, the dimensions of each convolutional core, the count of the pooling layers, and the dimensions for each pooling core, as represented in Table 2. Our guiding principle in setting these hyper-parameters is to maintain a balance between accuracy and training speed.

4.4. Frequent Subgraph Analysis

The objective of this subsection is to conduct a qualitative examination of malware features, setting the groundwork for future in-depth studies. We scrutinize hundreds of frequently occurring subgraphs. This in-depth review provides us with several initial insights. Firstly, the subgraphs exhibit a symmetrical characteristic, which implies that the property values of $e_{i, j}$ and $e_{j, i}$ are likely to be similar. This property can be utilized for reducing the feature dimensions. Secondly, the aggregation characteristic is also evident in the subgraphs, suggesting that when a vertex is present in a subgraph, it is likely to be found in numerous other subgraphs as well. This feature could be useful for automatically identifying operation chains that serve as malware evidence. Lastly, we notice that the subgraphs tend to be complete, indicating a high level of interconnectivity among the vertices within the subgraphs.This characteristic is illustrated in Figure 2.

4.5. Performance Comparison of Several Opcode-Based Approaches

The benchmarks we selected to evaluate performance comparison comprise the classification accuracy, detection precision, detection recall, detection false positive rate, F1-score, detection time consumption, training time consumption, and disassembly time consumption. The classification accuracy is quantified based on Equation (9). TPR represents the true positive rate (equivalent to malware detection recall) as per Equation (10), in which TP denotes the count of accurately classified malware cases and FN refers to the count of malware instances erroneously classified as benign binaries. TNR signifies the true negative rate, as outlined in Equation (11), wherein FP represents benign instances inaccurately classified as malware binaries and TN equates to the count of correctly classified benign binaries. FPR stands for false positive rate and Precision denotes the precision of malware detection, as indicated in Equations (12) and (13), respectively.

(10) $a c c u r a c y = \frac{T P R + T N R}{2}$

(11) $T P R (R e c a l l) = \frac{T P}{T P + F N}$

(12) $T N R = \frac{T N}{F P + T N}$

(13) $F P R = \frac{F P}{F P + T N}$

(14) $P r e c i s i o n = \frac{T P}{T P + F P}$

As illustrated in Figure 3 and Table 3, findings from our experimental study reveal the following: (1) Through the implementation of the proposed frequent subgraph convolutional neural networks, our methodology conspicuously enhances performance, particularly in the areas of classification accuracy, detection precision, recall, 1-FPR, and F1-score, alongside detection speed. This is revealed when juxtaposed with numerous leading-edge Android malware detection strategies. (2) The results show the precision is 93.8%, which means the false positives occurrence in the proposed method relate to whether they are associated with benign applications, such as wifi applications. If the scale and categories of the training dataset increase, the false positives may be reduced. (3) The sacrifice made by our approach lies in the cost of training time, as the proposed frequent subgraph convolutional neural networks require a substantial amount of time to extract numerous subgraphs and train their features through numerous iterations. (4) Compared with ngram-based methods, the proposed methods can achieve faster training and detection speed due to the fewer features. Compared with the control flow graph-based method, the proposed method uses frequent subgraphs to reduce the complexity of graph construction and improve the accuracy of malware variants.

4.6. Stability Analysis of Our Approach

Given the swift escalation in the quantity of Internet-facing malware variants, the volume of training samples typically falls short compared to the detection set. When the detection batch contains a substantial number of binaries and the training set is comparatively smaller (resulting in a smaller ratio of $\frac{t r a i n i n g}{t r a i n i n g + d e t e c t i o n}$ ), it imposes a severe constraint on the detection precision. Consequently, we proceeded to gauge the steadiness of our approach by subjecting different volumes of training sets to testing. The sizes of our training sets stood at 500, 1000, and 2000, while the respective sizes of the testing sets were 5060, 4560, and 3560. This resulted in the ratio $\frac{t r a i n i n g}{t r a i n i n g + d e t e c t i o n}$ settling at 0.09, 0.18, and 0.36.

Figure 4 elucidates the accuracy, precision, recall, 1-FPR, and the F1-score of our approach under different training volumes. The outcomes indicate the stability of our method, regardless of variations in the ratio $\frac{t r a i n i n g}{t r a i n i n g + d e t e c t i o n}$ . The effectiveness of our approach is further underscored when the ratio of $\frac{t r a i n i n g}{t r a i n i n g + d e t e c t i o n}$ falls below 0.1. Additionally, Figure 5 sheds light on the time required for training our approach, corresponding to different training volumes. A nearly linear pattern is observed in the escalating cost of training time as the training volume increases.

5. Discussion

Despite the proficiency of our approach in the effective and efficient detection of Android malware variants, there remain certain limitations to its functionality.

Malware detection methodologies grounded in static analysis often hinge on disassembly tools and other contrarian techniques. Should a malicious sample undergo encryption or be compacted by packers, the disassembly tools hit a roadblock. This conundrum curtails the effectiveness of static analysis-based malware detection procedures under certain circumstances. Nevertheless, unpacking techniques such as [19] can undo the impact of packers in the majority of instances by restoring the original software sources. Given that packed softwares must unwind their internal original codes prior to their execution, it leaves a window open for the application of unpacking techniques. For innovative packing methodologies, unpacking becomes a dynamic process, making effective static analysis against emerging malware a multifaceted approach. Furthermore, runtime process image snapshots can help skirt the common issues associated with packing and can be applied to categorically determine if these processes bear a link to a known malware family [23]. On another note, methods rooted in dynamic analysis may capture software behaviors in real-time, but they, too, come with their share of constraints. By concealing their malevolent behaviors until certain conditions are met, such as non-virtual machine, burst point, and more, malicious samples can elude detection. The search for an immaculate technique capable of surmounting all obstacles remains elusive.

Obfuscation is a consequence of a semantics-maintaining transformation and obfuscated programmes. Obfuscated programmes, developed from the same source, bear a resemblance to each other. The preserved semantics and the cited similarities pave the way for our technique to perceive the degree of obfuscated malware variants. A range of obfuscation methods have been uncovered, which include identifier renaming, junk code infusion, and control flow-based obfuscation, to name a few. Identifier renaming obfuscation reassigns variable names, which could impede manual analysis, but is unlikely to influence the distributions of directives, operation codes, or other binary representations. Junk code infusion might alter the distributions but is easily discerned and de-noised. Control flow-based obfuscation modifies the control flow graph, rendering the control flow-based detection inaccurate. It incorporates some disruptive instructions but retains the original ones, thereby preserving a semblance of similarity. A significant portion of obfuscation techniques, such as renaming, are simple obfuscations and do not bring about substantial changes to the similarities. Additionally, some instances of obfuscated malware are considered malware variants. In essence, our approach can counteract certain inaccuracies brought about by obfuscation. The accuracy of detecting obfuscated malware variants is contingent upon the degree of obfuscation involved. Addressing the challenge of handling obfuscated malware requires a mixture approach that integrates static and dynamic analysis techniques, alongside machine learning methodologies. Effective strategies include using static analysis tools to identify typical obfuscation patterns, leveraging dynamic analysis environments to observe the true behaviors of malware at runtime, and applying reverse engineering to unpack complex obfuscation layers.

Malware authors may adopt adversarial attacks and obfuscation techniques to evade machine learning-based detection. The adversarial authors train a generative network to minimize the malicious probabilities assigned by this substitute, lowering the detection rate close to zero and complicating defenses based on retraining. It reveals the limitations of current machine learning-based malware detectors. We suggest mitigation strategies involving the integration of multi-feature and domain expert knowledge into the learning process.

6. Conclusions

This manuscript introduces an innovative frequent subgraph convolutional neural network method for Android malware detection, which accurately recognizes Android malware variants among benign binaries. The suggested approach is not solely applicable to Android malware detection, but also holds potential for malware detection across various platforms, including Windows, Linux, and more.

In terms of prospective endeavors, we plan on infusing our approach with an array of graph topology attributes to enhance accuracy further. Additionally, we anticipate extending the implementation of our proposed neural networks to other sectors, encompassing social network inference, knowledge graphs, network traffic, and more.

Author Contributions

Conceptualization, J.Z.; Methodology, S.S. and Y.Z.; Software, X.H. and Y.Z.; Validation, X.H. and Y.Z.; Formal analysis, S.S.; Investigation, X.H.; Writing—original draft, S.S.; Writing—review & editing, Y.Z. and J.Z.; Supervision, J.Z. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

Figure 1. The architecture of our graph convolutional neural networks.

Figure 2. Examples of several frequent subgraphs.

Figure 3. The accuracy of our method for Android malware detection.

Figure 4. The detection accuracy of our approach in different training volume.

Figure 5. The training time cost of our approach in different training volume.

Table 1

The Android malware dataset.

Malware Family	Number
FakeInstaller	925
DroidKungFu	667
Plankton	625
Opfake	613
GingerMaster	339
BaseBridge	330
Iconosys	152
Others	1909
Total	5560

Table 2

The hyper-parameter settings of our frequent subgraph convolutional neural networks.

Item	Value
Number of convolutional layers	1
Number of convolutional cores	5
Size of each convolutional core	5 × 5
Number of pooling layers	1
Number of pooling cores	5
Size of each pooling core	2 × 2

Table 3

Time cost comparison with several state-of-art approaches for Android malware detection.

Method	Detection	Training	Disassembly
Method	Time Cost	Time Cost	Time Cost
Our approach (GCN)	0.045 s	57.68 h	24.98 s

References

1. Manzil, H.H.R.; Naik, S.M. Detection approaches for android malware: Taxonomy and review analysis. Expert Syst. Appl.; 2024; 238, 122255. [DOI: https://dx.doi.org/10.1016/j.eswa.2023.122255]

2. Dalvik Opcodes. 2018; Available online: http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html (accessed on 6 March 2025).

3. Arp, D.; Spreitzenbarth, M.; Hubner, M.; Gascon, H.; Rieck, K. Drebin: Efficient and Explainable Detection of Android Malware in Your Pocket; Technical Report Georg-August Institute of Computer Science: Göttingen, Germany, 2013.

4. Michael, S.; Florian, E.; Thomas, S.; Felix, C.F.; Hoffmann, J. Mobile-Sandbox: Looking deeper into android applications. Proceedings of the 28th International ACM Symposium on Applied Computing (SAC); Coimbra, Portugal, 18–22 March 2013.

5. Liu, H.; Gong, L.M.X. LTAChecker: Lightweight Android Malware Detection Based on Dalvik Opcode Sequences Using Attention Temporal Networks. IEEE Internet Things J.; 2024; 11, pp. 25371-25381. [DOI: https://dx.doi.org/10.1109/JIOT.2024.3394555]

6. Muzaffar, A.; Ragab Hassen, H.; Zantout, H.; Lones, M.A. DroidDissector: A Static and Dynamic Analysis Tool for Android Malware Detection. International Conference on Applied CyberSecurity; Springer Nature: Cham, Switzerland, 2023.

7. Vasan, D.; Alazab, M.; Wassan, S.; Naeem, H.; Safaei, B.; Zheng, Q. IMCFN: Image-Based Malware Classification Using Fine-Tuned Convolutional Neural Network Architecture. Comput. Netw.; 2020; 171, 107138. [DOI: https://dx.doi.org/10.1016/j.comnet.2020.107138]

8. Seneviratne, S.; Shariffdeen, R.; Rasnayaka, S.; Kasthuriarachchi, N. Self-supervised vision transformers for malware detection. IEEE Access; 2022; 10, pp. 103121-103135. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3206445]

9. He, Z.; Homayoun, H.; Sayadi, H. Guarding Against the Unknown: Deep Transfer Learning for Hardware Image-Based Malware Detection. J. Hardw. Syst. Secur.; 2024; 8, pp. 61-78. [DOI: https://dx.doi.org/10.1007/s41635-024-00146-6]

10. De Paola, A.; Gaglio, S.; Re, G.L.; Morana, M. A hybrid system for malware detection on big data. Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications Workshops; Honolulu, HI, USA, 15–19 April 2018.

11. Muoz, A. Cracking the Core: Hardware Vulnerabilities in Android Devices Unveiled. Electronics; 2024; 13, 4269. [DOI: https://dx.doi.org/10.3390/electronics13214269]

12. Gómez, A.; Muñoz, A. Deep Learning-Based Attack Detection and Classification in Android Devices. Electronics; 2023; 12, 3253. [DOI: https://dx.doi.org/10.3390/electronics12153253]

13. Han, X.; Liu, S.; Liu, J.; Jiang, B.; Lu, Z.; Liu, B. ECNet: Robust Malicious Network Traffic Detection with Multi-View Feature and Confidence Mechanism. IEEE Trans. Inf. Forensics Secur.; 2024; 19, pp. 6871-6885. [DOI: https://dx.doi.org/10.1109/TIFS.2024.3426304]

14. Chen, J.; Sun, S.; Xia, C.; Shi, D.; Chen, G. Modeling and analyzing malware propagation over wireless networks based on hypergraphs. IEEE Trans. Netw. Sci. Eng.; 2023; 10, pp. 3767-3778. [DOI: https://dx.doi.org/10.1109/TNSE.2023.3273184]

15. Gibert, D.; Mateu, C.; Planes, J. HYDRA: A multimodal deep learning framewo HYDRA: A multimodal deep learning framework for malware classification. Comput. Secur.; 2020; 95, 101873. [DOI: https://dx.doi.org/10.1016/j.cose.2020.101873]

16. Singh, N.; Tripathy, S. MDLDroid: Multimodal Deep Learning Based Android Malware Detection. International Conference on Information Systems Security; Springer Nature: Cham, Switzerland, 2023.

17. Dai, Y.; Li, H.; Qian, Y.; Yang, R.; Zheng, M. SMASH: A malware detection method based on multi-feature ensemble learning. IEEE Access; 2019; 7, pp. 112588-112597. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2934012]

18. Yerima, S.Y.; Sezer, S. Droidfusion: A novel multilevel classifier fusion approach for android malware detection. IEEE Trans. Cybern.; 2018; 49, pp. 453-466. [DOI: https://dx.doi.org/10.1109/TCYB.2017.2777960] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29993965]

19. Xue, L.; Luo, X.; Yu, L.; Wang, S.; Wu, D. Adaptive Unpacking of Android Apps. Proceedings of the IEEE/ACM 39th International Conference on Software Engineering (ICSE); Buenos Aires, Argentina, 20–28 May 2017.

20. Dedexer. 2024; Available online: http://dedexer.sourceforge.net/ (accessed on 6 March 2025).

21. Gradient Descent. 2018; Available online: https://en.wikipedia.org/wiki/Gradient_descent (accessed on 6 March 2025).

22. Google Play. 2024; Available online: http://vxheaven.org/vl.php (accessed on 6 March 2025).

23. Cesare, S.; Xiang, Y.; Zhou, W. Control Flow-Based Malware Variant Detection. IEEE Trans. Dependable Secur. Comput. (TDSC); 2014; 11, pp. 230-817.

Word count: 6643

Show less

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

As Android holds a commanding position in the smartphone operating system market, the proliferation of malicious applications on this platform has also escalated rapidly. This surge in diverse malware variants has compelled researchers to explore innovative techniques leveraging machine learning. Given the significance of static analysis in network security, and the proven effectiveness of Dalvik opcode as a precise representation of malware, many studies have adopted the use of Dalvik opcode in conjunction with machine learning algorithms to detect Android malware. Currently, a considerable number of opcode-based approaches are being developed to extract semantic information from opcode sequences. Nonetheless, these approaches encounter considerable challenges in terms of achieving precision. Despite the integration of additional semantic features, they do not succeed in enhancing precision and often result in longer computation times. Furthermore, the extensive length of opcode sequences poses a significant obstacle in the analysis of their underlying semantics. When confronted with these challenges, delving into alternative characteristics could hold the potential to overcome the prevailing predicament, thereby enhancing our comprehension of malwares’ operational mechanisms. Considering the rich informational content embedded within opcode dependencies, despite the scarcity of research in this domain, we intend to prioritize our focus on these dependencies. By constructing opcode graphs, we aim to gain deeper insights into the topological properties of these dependencies, thereby facilitating a more comprehensive analysis. This paper presents an innovative Android malware detection method. The core process of this method includes building a Dalvik opcode graph, extracting frequent subgraphs, and embedding subgraphs using graph convolutional neural networks to extract topological features and train classification models. This model aims to accurately distinguish between malicious Android applications and legitimate applications. Based on the above method, we have successfully developed a lightweight prototype for Android malware variant detection. Through theoretical analysis and practical experimental verification, the prototype demonstrates excellent effectiveness, efficiency, and stability. Specifically, its detection accuracy is nearly 95%, and the time cost for a single detection does not exceed 0.1 s.

Details

Title

An Android Malware Detection Method Using Frequent Graph Convolutional Neural Networks

Author

Zhao, Yulong¹; Sun, Shi¹; Huang, Xiaofeng²; Zhang, Jixin¹

¹ School of Computer Science, Hubei University of Technology, Wuhan 430068, China; [email protected] (Y.Z.); [email protected] (S.S.)
² Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China

First page

1151

Publication year

2025

Publication date

2025

Publisher

MDPI AG

e-ISSN

20799292

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/electronics14061151

ProQuest document ID

3181459023