Methodology

Our proposed method differs from the classification and clustering methods above in that it integrates classification and clustering into a single holistic model. Benign URLs are designed to be easy for people to remember, whereas malicious URLs are designed to avoid attracting attention. Malicious URLs are often padded with junk characters, switch between encoding methods, use IP addresses instead of domain names, or rely on randomly generated domain names. From this point of view, lexical mining of malicious URLs is a viable approach, so our work clusters and identifies malicious URLs based on their lexical features. The FCM algorithm is used to cluster similar URLs into groups while identifying the malicious URLs within each cluster, which yields better clustering performance.

Fig1: The overview of URL clustering and detection method

Our goal is to cluster similar samples together while keeping benign and malicious samples separated. Figure 1 presents an overview of our method, including URL extraction, feature representation, and clustering and detection.

URL Extraction

We design a traffic collection platform to collect the network data that Android apps generate during network interaction, and then extract URL samples from that traffic. A large amount of network traffic generated by both benign and malicious apps is collected. This module consists of two components: app execution and network traffic collection.
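As a rough illustration of the extraction step, the sketch below rebuilds a URL from a single captured plaintext HTTP/1.x request. It is only a sketch under stated assumptions: the function name `url_from_http_request`, the sample request, and the `example.com` host are hypothetical, and the source does not describe how the platform actually parses traffic.

```python
from typing import Optional


def url_from_http_request(raw: bytes, scheme: str = "http") -> Optional[str]:
    """Rebuild the requested URL from one raw HTTP/1.x request, or return None.

    Hypothetical helper: the source does not specify its extraction logic.
    """
    # Keep only the header section and decode it byte-for-byte.
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")

    # The request line has the form: METHOD SP request-target SP HTTP-version
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None  # not an HTTP request line (for example, TLS bytes)

    target = parts[1]
    if target.startswith(("http://", "https://")):
        return target  # absolute-form target, as sent to proxies

    # Otherwise combine the Host header with the origin-form target.
    host = None
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None or not target.startswith("/"):
        return None
    return f"{scheme}://{host}{target}"


# Made-up request used purely for illustration.
request = (
    b"GET /ads/track?id=42 HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"User-Agent: hypothetical-app/1.0\r\n\r\n"
)
print(url_from_http_request(request))
```

Input that does not start with an HTTP request line, such as encrypted traffic, yields `None` here, so a real collection platform would need a separate path for it.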

Feature Representation

This step is divided into three parts: feature representation, feature selection, and vectorization. See the paper for more details.
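To make the idea of lexical features concrete, the following minimal sketch turns a URL into a small numeric vector based on the cues mentioned in the methodology overview (junk characters, changed encodings, IP-address hosts, random-looking domain names). The six features and the names `lexical_features`, `host_is_ip`, and `shannon_entropy` are illustrative assumptions, not the feature set used in the paper.

```python
import ipaddress
import math
from collections import Counter
from typing import List
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Character-level entropy in bits; random-looking strings score higher."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def host_is_ip(host: str) -> bool:
    """True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url: str) -> List[float]:
    """Hypothetical feature vector; the paper's actual features may differ."""
    host = urlsplit(url).hostname or ""
    return [
        float(len(url)),                             # overall length
        float(sum(ch.isdigit() for ch in url)),      # digit count
        float(url.count("%")),                       # percent-encoding markers
        float(sum(not ch.isalnum() for ch in url)),  # non-alphanumeric characters
        1.0 if host_is_ip(host) else 0.0,            # IP address used as host
        shannon_entropy(host),                       # randomness of the host name
    ]


print(lexical_features("http://192.168.0.1/a%20b"))
print(lexical_features("http://example.com/"))
```

Vectors of this kind are what a downstream clustering model would consume, usually after scaling so that no single feature dominates the distance computation.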

Clustering and Detection Model

The original FCM algorithm comprises two parts. The first part is a three-layer feedforward neural network, and the second part is a K-Means clustering algorithm. K-Means requires the value of K to be set in advance, and choosing K is difficult but critical. We therefore enhance FCM by adding the Canopy algorithm for coarse clustering of the data. The modified FCM uses the Particle Swarm Optimization (PSO) algorithm to adjust the parameters of the neural network according to the clustering accuracy of K-Means.
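The role of the Canopy step can be sketched as follows: a cheap coarse pass picks canopy centers, the number of canopies serves as K, and the centers seed K-Means. This is a simplified illustration, not the paper's implementation. It assumes Euclidean distance, a hand-picked tight threshold `t2`, and toy 2-D points in place of the neural network's output; it also omits the looser threshold T1 that the full Canopy algorithm uses for overlapping membership, as well as the neural network and PSO parts.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points: List[Point], t2: float) -> List[Point]:
    """Coarse pass: choose centers so that no two lie within distance t2."""
    remaining = list(points)
    centers: List[Point] = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        # Points tightly bound to this center can no longer become centers.
        remaining = [p for p in remaining if dist(p, center) > t2]
    return centers


def nearest(p: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda k: dist(p, centers[k]))


def kmeans(points: List[Point], seeds: List[Point],
           max_iter: int = 50) -> Tuple[List[int], List[List[float]]]:
    """Plain K-Means seeded with the canopy centers; K = len(seeds)."""
    centers = [list(c) for c in seeds]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data: two well-separated groups (illustrative values only).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
seeds = canopy_centers(points, t2=1.0)
labels, centers = kmeans(points, seeds)
print(len(seeds), labels)
```

The appeal of this arrangement is that K falls out of the data and a distance threshold rather than being guessed up front, although the threshold itself still has to be chosen.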

Evaluation

Fig2: Detection rate comparison on novel malware in the wild between our method and other anti-virus scanners

Our new malware dataset consists of 305 malicious apps confirmed by VirusTotal reports. The 305 samples were scanned by 59 anti-virus scanners in VirusTotal; however, each scanner detects only part of them. We select nine popular anti-virus scanners: AegisLab, Avira, Sophos, McAfee, F-Secure, BitDefender, Tencent, Kaspersky, and Baidu. Their detection results, taken from the VirusTotal service, vary considerably. The best scanner, AegisLab, detects 189 of the 305 malware samples, a detection rate of 61.9%, whereas Baidu discovers only 17 in the wild app set, a detection rate of only 5.6%. Figure 2 shows the detailed statistics. By comparison, our detection model identifies 188 of the 305 apps, a detection rate of 61.6%, which is on par with the best-performing scanner and outperforms the eight other anti-virus scanners. Note that detecting novel malicious apps is a notoriously difficult task, and no existing method achieves a high detection rate because malware adapts quickly. The comparison result therefore supports the capability of our model in scanning wild apps.
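The reported rates follow directly from the counts, as the short check below shows. Only the three counts stated in the text are used; the dictionary name and the label "Proposed model" are placeholders. Note that 189/305 is about 61.97%, so the reported 61.9% appears to be truncated rather than rounded to one decimal place.

```python
TOTAL_SAMPLES = 305  # size of the wild malware set, from the text

# Detected counts stated in the text; other scanners are not listed there.
detected = {"AegisLab": 189, "Baidu": 17, "Proposed model": 188}

# Detection rate in percent = detected samples / total samples * 100.
rates = {name: 100.0 * count / TOTAL_SAMPLES for name, count in detected.items()}

for name, rate in rates.items():
    print(f"{name}: {detected[name]}/{TOTAL_SAMPLES} = {rate:.2f}%")

gap = rates["AegisLab"] - rates["Proposed model"]
print(f"Gap to the best scanner: {gap:.2f} percentage points")
```

The gap between the proposed model and the best scanner is a single sample out of 305, which is the basis for describing the two as on par.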