Building a Java Web Crawler with jsoup
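The jsoup library makes HTML parsing easy.
It supports a wide range of selectors, so once you work out the pattern in a page's markup, writing a crawler for the site you need is really easy.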
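To give a feel for what those selectors look like, here is a minimal sketch that parses an inline HTML string instead of a live site. It assumes jsoup is already on the classpath (the Maven dependency is in the next step). The markup, the class name SelectorSketch, and the summarize helper are all made up for illustration; only the entry-title class is borrowed from the crawler later in this post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {

    // Hypothetical markup, invented for this example only.
    static final String HTML =
        "<div class=\"post\">"
      + "<h2 class=\"entry-title\"><a href=\"/archives/1/\">First post</a></h2>"
      + "<img class=\"thumb\" src=\"/img/1.jpg\">"
      + "</div>"
      + "<div class=\"post\">"
      + "<h2 class=\"entry-title\"><a href=\"/archives/2/\">Second post</a></h2>"
      + "</div>";

    /** Returns one "title | link | image" line per div.post element. */
    static List<String> summarize(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        // tag.class selector: every <div class="post">
        for (Element post : doc.select("div.post")) {
            // child combinator: the <a> directly inside the title heading
            String title = post.select("h2.entry-title > a").text();
            // attribute selector: only anchors that actually carry an href
            String href = post.select("h2.entry-title > a[href]").attr("href");
            // attr() on an empty match returns "", so a missing image is safe
            String image = post.select("img[src]").attr("src");
            lines.add(title + " | " + href + " | " + image);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize(HTML)) {
            System.out.println(line);
        }
    }
}
```

Selecting inside each post element, rather than indexing two separate lists, keeps the title, link, and image of one post together even when a post has no thumbnail.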
1. Add the jsoup Maven dependency
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
2. Coding
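I wrote a crawler that fetches the post titles, links, and images from my blog.
It only picks up the posts shown on the main page, so it effectively works as a "recent posts" fetcher.
For details on using jsoup, refer to the official documentation, which is the most reliable source.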
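import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JavaWebCrawler {

    // Returns the current timestamp, used to log when the crawl starts and ends.
    public static String getCurrentData() {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss");
        return sdf.format(new Date());
    }

    public static void goCrawling() throws IOException {
        System.out.println("Start Date : " + getCurrentData());

        // Fetch the blog's main page and parse it into a DOM.
        Document doc2 = Jsoup.connect("http://archive.htrucci.com").get();
        //System.out.println(doc2.data());
        //System.out.println(doc2.body());

        // Note: the two-class strings passed to getElementsByClass are kept from the
        // original post. They appear to match only when the class attribute is exactly
        // that string; a CSS query such as select(".navbar-brand.text-logo") is the
        // usual way to require both classes.
        // "블로그 제목" means "blog title".
        System.out.println("블로그 제목 :" + doc2.body().getElementsByClass("navbar-brand text-logo").text());

        Elements posts = doc2.body().getElementsByClass("entry-title");
        //System.out.println(doc2.body().getElementsByClass("entry-title"));

        // Look the thumbnails up once instead of on every loop iteration.
        // They are paired with the titles by position, as in the original.
        Elements thumbnails = doc2.body().getElementsByClass("post-thumbnail post-thumbnail-small");

        for (int i = 0; i < posts.size(); i++) {
            System.out.println("Post Title : " + posts.eq(i).text());
            System.out.println("Link : " + posts.eq(i).select("a").attr("href"));
            System.out.println("Post Image : " + thumbnails.eq(i).select("img").attr("src"));
            System.out.println();
        }

        System.out.println("End Date : " + getCurrentData());

        // Unused alternative: fetch the HTML with Apache HttpClient, then hand it to Jsoup.parse.
        // HttpPost http = new HttpPost("http://www.htrucci.com");
        // HttpClient httpClient = HttpClientBuilder.create().build();
        // HttpResponse response = httpClient.execute(http);
        // HttpEntity entity = response.getEntity();
        //
        // ContentType contentType = ContentType.getOrDefault(entity);
        // Charset charset = contentType.getCharset();
        // if (charset == null) {
        //     charset = Charset.defaultCharset();
        // }
        // BufferedReader br = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
        // StringBuffer sb = new StringBuffer();
        // String line = "";
        // while ((line = br.readLine()) != null) {
        //     sb.append(line + "\n");
        // }
        // System.out.println(sb.toString());
        //
        // Document doc = Jsoup.parse(sb.toString());
    }

    // Entry point, added so the class can be run directly.
    public static void main(String[] args) throws IOException {
        goCrawling();
    }
}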
3. Result
블로그 제목 :황교빈의 아카이브
Post Title : SpringBoot + Redis + Synology NAS
Link : http://archive.htrucci.com/archives/344/springboot-redis-synology-nas/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-2.jpg

Post Title : 이클립스 플러그인 moreclipboard
Link : http://archive.htrucci.com/archives/73/%ec%9d%b4%ed%81%b4%eb%a6%bd%ec%8a%a4-%ed%94%8c%eb%9f%ac%ea%b7%b8%ec%9d%b8-moreclipboard/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-4.jpg

Post Title : 맥북 원격 재부팅하는 방법
Link : http://archive.htrucci.com/archives/336/%eb%a7%a5%eb%b6%81-%ec%9b%90%ea%b2%a9-%ec%9e%ac%eb%b6%80%ed%8c%85%ed%95%98%eb%8a%94-%eb%b0%a9%eb%b2%95/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-6.jpg

Post Title : 앱스토어, 플레이스토어, 원스토어 앱이관시 주요 참고사항
Link : http://archive.htrucci.com/archives/325/%ec%95%b1%ec%8a%a4%ed%86%a0%ec%96%b4-%ed%94%8c%eb%a0%88%ec%9d%b4%ec%8a%a4%ed%86%a0%ec%96%b4-%ec%9b%90%ec%8a%a4%ed%86%a0%ec%96%b4-%ec%95%b1%ec%9d%b4%ea%b4%80%ec%8b%9c-%ec%a3%bc%ec%9a%94-%ec%b0%b8-2/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-10.jpg

Post Title : [Thymeleaf] 요소유형 “”은 짝이 맞는 종료 태그로 종료되어야 합니다.
Link : http://archive.htrucci.com/archives/311/thymeleaf-%ec%9a%94%ec%86%8c%ec%9c%a0%ed%98%95-%ec%9d%80-%ec%a7%9d%ec%9d%b4-%eb%a7%9e%eb%8a%94-%ec%a2%85%eb%a3%8c-%ed%83%9c%ea%b7%b8%eb%a1%9c-%ec%a2%85%eb%a3%8c%eb%90%98%ec%96%b4%ec%95%bc/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-9.jpg

Post Title : 부트스트랩 템플릿 사이트
Link : http://archive.htrucci.com/archives/306/%ec%bb%a4%eb%ae%a4%eb%8b%88%ed%8b%b0-%ec%a0%9c%ec%9e%91/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-3.jpg

Post Title : 로봇 입력방지 reCAPTCHA 적용하기
Link : http://archive.htrucci.com/archives/303/%eb%a1%9c%eb%b4%87-%ec%9e%85%eb%a0%a5%eb%b0%a9%ec%a7%80-recaptcha-%ec%a0%81%ec%9a%a9%ed%95%98%ea%b8%b0/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-7.jpg

Post Title : webjar를 이용하여 client-side 라이브러리를 jar로 관리하자
Link : http://archive.htrucci.com/archives/299/webjars-org/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-5.jpg

Post Title : Swift문법 – 함수만들기
Link : http://archive.htrucci.com/archives/292/swift%eb%ac%b8%eb%b2%95-%ed%95%a8%ec%88%98%eb%a7%8c%eb%93%a4%ea%b8%b0/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-6.jpg

Post Title : NSBundle.mainBundle() 이 없다고 나온다
Link : http://archive.htrucci.com/archives/289/nsbundle-mainbundle-%ec%9d%b4-%ec%97%86%eb%8b%a4%ea%b3%a0-%eb%82%98%ec%98%a8%eb%8b%a4/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-8.jpg

End Date : 2017.07.05 14:06:32
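Several of the links in the output are percent-encoded, because the post slugs are Korean. If you want them readable in logs, one option is to decode them with the JDK's URLDecoder. This is only a sketch: the LinkDecoder class and toReadable method are names made up here, and the Charset overload of URLDecoder.decode needs Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class LinkDecoder {

    /** Decodes percent-encoded UTF-8 sequences so a link is readable in logs. */
    static String toReadable(String link) {
        // URLDecoder targets form encoding, so it also turns '+' into a space.
        // None of the links printed above contain '+', so that is harmless here.
        return URLDecoder.decode(link, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // One of the links from the crawler output above.
        String link = "http://archive.htrucci.com/archives/73/"
            + "%ec%9d%b4%ed%81%b4%eb%a6%bd%ec%8a%a4-%ed%94%8c%eb%9f%ac%ea%b7%b8%ec%9d%b8-moreclipboard/";
        // prints http://archive.htrucci.com/archives/73/이클립스-플러그인-moreclipboard/
        System.out.println(toReadable(link));
    }
}
```

Keep the encoded form for actual requests; the decoded string is meant for display only.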