Journal of Beijing University of Posts and Telecommunications (Social Sciences Edition), 2019, Vol. 21, Issue (5): 85-92. doi: 10.19722/j.cnki.1008-7729.2019.0064

• Economics and Management •

Nutch-based Multi-source Social Media Intelligence Collection System

  1. School of Economics, Wuhan University of Technology, Wuhan 430070, China
  • Published online: 2019-10-31
  • Funding:
    Humanities and Social Sciences Research Planning Fund of the Ministry of Education (17YJA870006); Natural Science Foundation of Hubei Province (2018CFB564)

Abstract: This paper takes Internet social media platforms, namely news websites, BBS forums, post bars, and microblogs, as its research objects. Building on domain modeling, intelligence collection process design, and content parsing for each platform, it designs a general-purpose collection system based on the open-source web crawler Nutch. According to the characteristics of each platform, classification ranking, block-based parsing, and simulated login are applied to the collection of news, BBS and post-bar content, and microblogs, respectively. This improves the generality and cost-effectiveness of the system and enables efficient collection of multi-source social media intelligence.

Key words: Nutch, social media intelligence, multi-source intelligence collection, content parsing, simulated login
