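View thesis information

Title:

 基于智能优化的KNN算法改进研究

Author:

 Wu Jingxue (吴敬学)

Thesis language:

 chi

Discipline code:

 125500

Discipline category:

 Library and Information Studies

First-level discipline:

 Library and Information Studies

Major:

 Library and Information Studies

Degree level:

 Master's

Author's country:

 China

Degree-granting institution:

 South China Normal University

Department:

 005 School of Economics and Management

Primary supervisor:

 Feng Guohe (奉国和)

Primary supervisor's affiliation:

 School of Economics and Management, South China Normal University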

Submission date:

 2013-06-05

Defense date:

 2013-05-31

Degree conferral date:

 2013-07-01

English title:

 Research on Improvement of KNN Algorithm Based on Intelligent Optimization

Keywords:

 text categorization; k nearest neighbors; particle swarm optimization; immune clonal selection

English keywords:

 text categorization, k nearest neighbors (KNN), particle swarm optimization, immune clonal selection, journal manuscript categorization

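Abstract:
In the age of the "information explosion", the effective organization, filtering, and retrieval of information are especially important. Text categorization draws on information processing technology, machine learning, and statistical learning theory; it is widely applied in text recognition, search engines, information filtering, e-government, and other areas, and has become one of the key technologies of modern information processing. The k nearest neighbors (KNN) algorithm is an important text categorization method and was once regarded as one of the best algorithms under the vector space model (VSM). However, KNN is a lazy learner and must compare the query against the training samples one by one to find its neighbors, so its computational cost is high; it is also sensitive to the sample distribution and the feature space. This thesis first addresses the excessive time that traditional KNN spends searching for neighbors by proposing to search for them with particle swarm optimization. Each particle is initialized as the indices of k training samples. In every generation, the individual best pbest is defined as the nearest-neighbor set found among the current particle and the previous generation's pbest, and the global best gbest is defined as the nearest-neighbor set found among the current pbest sets. A distance table is built during iteration, and updating and querying it avoids repeated distance computations. Comparative experiments on open datasets show that the improved algorithm cuts running time by more than 80% while classification accuracy is maintained or even slightly improved. Next, based on the immune clonal selection algorithm, the thesis proposes ICS-KNN, an improved KNN algorithm aimed at higher classification accuracy. It redefines the antibody mutation operator and introduces an antibody elimination operator, which increases population diversity while keeping the antibody population from degenerating, and it improves decision accuracy by selecting the features best suited to the KNN classifier. Comparative experiments on open datasets confirm the gain in classification accuracy. Finally, the two proposed algorithms are combined in an empirical study of the automatic assignment of journal papers to columns. Papers from regular thematic columns of three journals are selected; the text formed by each paper's title, abstract, and keywords serves as a sample; training and test sets are built by year; and comparative experiments against traditional KNN show that the improved algorithm is better suited to the automatic column assignment of journal papers.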
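The neighbor search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract defines pbest, gbest, and the distance table, but it does not say how particles move, so the slot-wise move rule below, the Euclidean distance, and all names and parameters (`pso_knn_search`, `n_particles`, `n_iters`, the 0.4/0.8 thresholds) are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pso_knn_search(train, query, k, n_particles=10, n_iters=20, seed=0):
    """Approximate the k nearest training samples to `query`.

    Returns (gbest, evaluated): the indices found and the number of
    distinct training samples whose distance was actually computed.
    """
    rng = random.Random(seed)
    n = len(train)
    table = {}  # distance table: training index -> distance to the query

    def dist(i):
        # Query the table first; compute and store only on a miss.
        if i not in table:
            table[i] = euclidean(train[i], query)
        return table[i]

    def k_nearest(indices):
        return sorted(set(indices), key=dist)[:k]

    # Each particle starts as the indices of k distinct training samples.
    particles = [rng.sample(range(n), k) for _ in range(n_particles)]
    pbest = [k_nearest(p) for p in particles]
    gbest = k_nearest([i for pb in pbest for i in pb])

    for _ in range(n_iters):
        for j in range(n_particles):
            # Hypothetical move rule: each slot copies from pbest,
            # copies from gbest, or jumps to a random training index.
            moved = []
            for s in range(k):
                r = rng.random()
                if r < 0.4:
                    moved.append(pbest[j][s])
                elif r < 0.8:
                    moved.append(gbest[s])
                else:
                    moved.append(rng.randrange(n))
            particles[j] = moved
            # pbest: nearest set among the current particle and the old pbest.
            pbest[j] = k_nearest(moved + pbest[j])
        # gbest: nearest set among all current pbest sets.
        gbest = k_nearest([i for pb in pbest for i in pb])
    return gbest, len(table)


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1], [9.0, 9.0]]
    print(pso_knn_search(data, [0.0, 0.0], k=2))
```

Because every distance goes through the table, a training sample is measured against the query at most once, however many particles visit it. The result is an approximation: a true neighbor that no particle ever visits is missed.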
English abstract:
In this age of "information explosion", the effective organization, filtering, and retrieval of information have become particularly necessary. Text categorization combines information processing technology, machine learning, and statistical learning theory. It has been widely used in text recognition, search engines, information filtering, e-government, and other areas, and has become one of the key technologies in the modernization of information processing. The k nearest neighbors (KNN) algorithm is an important text categorization method and has been regarded as one of the best algorithms that use the vector space model (VSM). However, KNN is a lazy learning method and must compare the query with the training samples one by one when looking for nearest neighbors, which leads to a large computational overhead. In addition, its performance depends to a great extent on the sample distribution and the feature space. First, to address the excessive time that the traditional KNN algorithm spends looking for neighbors, this thesis proposes using particle swarm optimization to search for them. Each particle is initialized as the serial numbers of k training samples. In each iteration, pbest is defined as the nearest-neighbor set of the current particle and its previous pbest, and gbest is defined as the nearest-neighbor set of all current pbest sets. A distance table is used to avoid repeated distance computations. Comparative experiments demonstrate the improvement. Then an improved, more accurate KNN algorithm based on the immune clonal selection algorithm, named ICS-KNN, is put forward. In ICS-KNN, the antibody mutation operator is redefined and an antibody elimination operator is introduced; together these increase population diversity and avoid degradation of the antibody population. Simulation experiments demonstrate the performance gain of the improved algorithm. Lastly, the improved algorithm is applied in an empirical study of journal manuscript categorization. Papers from particular columns of three selected journals form the experimental samples, each consisting of a title, abstract, and keywords. The training and test sets are organized chronologically. The comparative experiment with the traditional KNN algorithm demonstrates the performance gain of the improved algorithm in journal manuscript categorization. In addition, the consistency volatility of different columns and different journals is analyzed according to the experimental data of the improved algorithm.
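The feature-selection idea behind ICS-KNN can be illustrated with a generic clonal selection loop. The abstract does not define the redefined mutation operator or the elimination operator, so everything below is an assumption for illustration: antibodies are binary feature masks, affinity is leave-one-out KNN accuracy, mutation flips one random bit, and elimination replaces the lowest-affinity antibodies with random ones. The function and parameter names are hypothetical.

```python
import random


def loo_knn_accuracy(X, y, mask, k=3):
    """Leave-one-out KNN accuracy using only features whose mask bit is 1."""
    feats = [f for f, bit in enumerate(mask) if bit]
    if not feats:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Squared Euclidean distance to every other sample, nearest first.
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in feats), j)
            for j in range(len(X)) if j != i
        )
        votes = {}
        for _, j in dists[:k]:
            votes[y[j]] = votes.get(y[j], 0) + 1
        if max(votes, key=votes.get) == y[i]:
            correct += 1
    return correct / len(X)


def clonal_feature_selection(X, y, k=3, pop_size=8, n_clones=3,
                             n_gens=10, n_replace=2, seed=0):
    """Search for a feature mask with high KNN accuracy (illustrative only)."""
    rng = random.Random(seed)
    n_feats = len(X[0])
    cache = {}  # affinity of each mask already evaluated

    def affinity(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = loo_knn_accuracy(X, y, key, k)
        return cache[key]

    def random_antibody():
        return [rng.randint(0, 1) for _ in range(n_feats)]

    pop = [random_antibody() for _ in range(pop_size)]
    for _gen in range(n_gens):
        survivors = []
        for antibody in pop:
            clones = []
            for _c in range(n_clones):
                clone = antibody[:]
                clone[rng.randrange(n_feats)] ^= 1  # assumed mutation: flip one bit
                clones.append(clone)
            # Keeping the parent as a candidate means affinity never degrades.
            survivors.append(max(clones + [antibody], key=affinity))
        survivors.sort(key=affinity, reverse=True)
        # Assumed elimination: replace the weakest antibodies with random ones.
        for idx in range(pop_size - n_replace, pop_size):
            survivors[idx] = random_antibody()
        pop = survivors
    best = max(pop, key=affinity)
    return best, affinity(best)
```

In this sketch the parent competes with its own clones and the top antibody is never eliminated, so the best affinity in the population cannot decrease from one generation to the next. That is one simple way to get the "no degradation" property the abstract mentions, and the thesis may achieve it differently.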
Total pages:

 54

Number of references:

 91

Resource type:

 Degree thesis

Release date:

 2015-06-04
