Building a Java Web Crawler with jsoup

The jsoup library makes HTML parsing easy.

It supports a wide range of selectors, so once you work out the patterns in a page's markup, writing a crawler for the site you need is very easy.
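
To give a feel for what those selectors look like, here is a minimal sketch that runs a few CSS queries against an inline HTML string. The markup, the class name SelectorSketch, and its helper methods are made up for illustration; only Jsoup.parse and select are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {
    // Hypothetical markup for illustration; not taken from any real site.
    static final String HTML =
          "<div id='main'>"
        + "<h2 class='entry-title'><a href='/a'>First</a></h2>"
        + "<h2 class='entry-title featured'><a href='/b'>Second</a></h2>"
        + "<img class='thumb' src='/img/b.jpg'>"
        + "</div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a CSS query.
    public static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    // Combined text of all elements matched by a CSS query.
    public static String text(String cssQuery) {
        return DOC.select(cssQuery).text();
    }

    // Attribute value from the first matched element that has the attribute.
    public static String attr(String cssQuery, String name) {
        return DOC.select(cssQuery).attr(name);
    }

    public static void main(String[] args) {
        System.out.println(count(".entry-title"));              // by class: 2
        System.out.println(text(".entry-title.featured"));      // both classes: Second
        System.out.println(attr("#main h2 > a[href]", "href")); // id, child, has-attribute: /a
        System.out.println(attr("img[src$=.jpg]", "src"));      // attribute suffix: /img/b.jpg
    }
}
```

Working against a saved HTML string like this is a convenient way to try out selectors before pointing the crawler at a live site.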

1. Add jsoup via Maven

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>

2. Coding

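I built a crawler that fetches the post titles, links, and images from my blog.

It only picks up the posts shown on the main page, so in effect it fetches the most recent posts.

For jsoup usage, refer to the official documentation, which is the authoritative source.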

https://jsoup.org/apidocs/

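import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JavaWebCrawler {
    // Returns the current date and time, used to log when the crawl starts and ends.
    public static String getCurrentData() {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy.MM.dd HH:mm:ss");
        return sdf.format(new Date());
    }

    public static void goCrawling() throws IOException {
        System.out.println("Start Date : " + getCurrentData());

        // Fetch the blog's main page and parse it into a DOM.
        Document doc2 = Jsoup.connect("http://archive.htrucci.com").get();
        Element body = doc2.body();

        // getElementsByClass() expects a single class name, so elements that carry
        // two classes are matched with a CSS selector instead.
        // "블로그 제목" means "blog title".
        System.out.println("블로그 제목 :" + body.select(".navbar-brand.text-logo").text());

        Elements posts = body.getElementsByClass("entry-title");
        // Thumbnails are looked up once and paired with posts by position, which
        // assumes every post on the page has exactly one thumbnail.
        Elements thumbnails = body.select(".post-thumbnail.post-thumbnail-small");
        for (int i = 0; i < posts.size(); i++) {
            Element post = posts.get(i);
            System.out.println("Post Title : " + post.text());
            System.out.println("Link : " + post.select("a").attr("href"));
            System.out.println("Post Image : " + thumbnails.eq(i).select("img").attr("src"));
            System.out.println();
        }
        System.out.println("End Date : " + getCurrentData());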


//        HttpPost http = new HttpPost("http://www.htrucci.com");
//
//        HttpClient httpClient = HttpClientBuilder.create().build();
//
//        HttpResponse response = httpClient.execute(http);
//
//        HttpEntity entity = response.getEntity();
//
//        ContentType contentType = ContentType.getOrDefault(entity);
//        Charset charset = contentType.getCharset();
//        if(charset == null){
//            charset = Charset.defaultCharset();
//        }
//        BufferedReader br = new BufferedReader(new InputStreamReader(entity.getContent(), charset));
//
//        StringBuffer sb = new StringBuffer();
//
//        String line = "";
//        while((line=br.readLine()) != null){
//            sb.append(line+"\n");
//        }
//        System.out.println(sb.toString());
//
//        Document doc = Jsoup.parse(sb.toString());

    }

    public static void main(String[] args) throws IOException {
        goCrawling();
    }
}
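
The crawler above pairs titles and thumbnails by index, which can drift if a post has no image. One possible alternative, sketched below, walks a container element per post and reads the title, link, and image from inside it. This assumes each post is wrapped in an article element, which I have not verified for this blog's theme; the sample markup, the base URI http://example.com, and the names PostListSketch and Post are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PostListSketch {
    // Plain holder for one scraped post.
    public static final class Post {
        public final String title;
        public final String link;
        public final String image;

        public Post(String title, String link, String image) {
            this.title = title;
            this.link = link;
            this.image = image;
        }
    }

    // Hypothetical markup: one <article> per post, the second without an image.
    public static final String SAMPLE_HTML =
          "<article>"
        + "<div class='post-thumbnail'><img src='/img/1.jpg'></div>"
        + "<h2 class='entry-title'><a href='/archives/1/'>First post</a></h2>"
        + "</article>"
        + "<article>"
        + "<h2 class='entry-title'><a href='/archives/2/'>No image post</a></h2>"
        + "</article>";

    // Parses the HTML and builds one Post per <article>.
    // The "abs:" prefix asks jsoup to resolve the attribute against baseUri.
    public static List<Post> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<Post> posts = new ArrayList<>();
        for (Element article : doc.select("article")) {
            String title = article.select(".entry-title").text();
            String link = article.select(".entry-title a").attr("abs:href");
            // Empty string when the post has no thumbnail.
            String image = article.select(".post-thumbnail img").attr("abs:src");
            posts.add(new Post(title, link, image));
        }
        return posts;
    }

    public static void main(String[] args) {
        for (Post p : extract(SAMPLE_HTML, "http://example.com")) {
            System.out.println(p.title + " | " + p.link + " | " + p.image);
        }
    }
}
```

Because every field is read from within the same container, a missing thumbnail yields an empty string for that post only and does not shift the images of the posts after it.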

3. Result

블로그 제목 :황교빈의 아카이브
Post Title : SpringBoot + Redis + Synology NAS
Link : http://archive.htrucci.com/archives/344/springboot-redis-synology-nas/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-2.jpg

Post Title : 이클립스 플러그인 moreclipboard
Link : http://archive.htrucci.com/archives/73/%ec%9d%b4%ed%81%b4%eb%a6%bd%ec%8a%a4-%ed%94%8c%eb%9f%ac%ea%b7%b8%ec%9d%b8-moreclipboard/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-4.jpg

Post Title : 맥북 원격 재부팅하는 방법
Link : http://archive.htrucci.com/archives/336/%eb%a7%a5%eb%b6%81-%ec%9b%90%ea%b2%a9-%ec%9e%ac%eb%b6%80%ed%8c%85%ed%95%98%eb%8a%94-%eb%b0%a9%eb%b2%95/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-6.jpg

Post Title : 앱스토어, 플레이스토어, 원스토어 앱이관시 주요 참고사항
Link : http://archive.htrucci.com/archives/325/%ec%95%b1%ec%8a%a4%ed%86%a0%ec%96%b4-%ed%94%8c%eb%a0%88%ec%9d%b4%ec%8a%a4%ed%86%a0%ec%96%b4-%ec%9b%90%ec%8a%a4%ed%86%a0%ec%96%b4-%ec%95%b1%ec%9d%b4%ea%b4%80%ec%8b%9c-%ec%a3%bc%ec%9a%94-%ec%b0%b8-2/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-10.jpg

Post Title : [Thymeleaf] 요소유형 “”은 짝이 맞는 종료 태그로 종료되어야 합니다.
Link : http://archive.htrucci.com/archives/311/thymeleaf-%ec%9a%94%ec%86%8c%ec%9c%a0%ed%98%95-%ec%9d%80-%ec%a7%9d%ec%9d%b4-%eb%a7%9e%eb%8a%94-%ec%a2%85%eb%a3%8c-%ed%83%9c%ea%b7%b8%eb%a1%9c-%ec%a2%85%eb%a3%8c%eb%90%98%ec%96%b4%ec%95%bc/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-9.jpg

Post Title : 부트스트랩 템플릿 사이트
Link : http://archive.htrucci.com/archives/306/%ec%bb%a4%eb%ae%a4%eb%8b%88%ed%8b%b0-%ec%a0%9c%ec%9e%91/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-3.jpg

Post Title : 로봇 입력방지 reCAPTCHA 적용하기
Link : http://archive.htrucci.com/archives/303/%eb%a1%9c%eb%b4%87-%ec%9e%85%eb%a0%a5%eb%b0%a9%ec%a7%80-recaptcha-%ec%a0%81%ec%9a%a9%ed%95%98%ea%b8%b0/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-7.jpg

Post Title : webjar를 이용하여 client-side 라이브러리를 jar로 관리하자
Link : http://archive.htrucci.com/archives/299/webjars-org/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-5.jpg

Post Title : Swift문법 – 함수만들기
Link : http://archive.htrucci.com/archives/292/swift%eb%ac%b8%eb%b2%95-%ed%95%a8%ec%88%98%eb%a7%8c%eb%93%a4%ea%b8%b0/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-6.jpg

Post Title : NSBundle.mainBundle() 이 없다고 나온다
Link : http://archive.htrucci.com/archives/289/nsbundle-mainbundle-%ec%9d%b4-%ec%97%86%eb%8b%a4%ea%b3%a0-%eb%82%98%ec%98%a8%eb%8b%a4/
Post Image : http://archive.htrucci.com/wp-content/themes/vega/sample/images/featured-image-8.jpg

End Date : 2017.07.05 14:06:32
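
Several links in the output contain percent-encoded UTF-8, because the post slugs are in Korean. If a readable form is wanted for logging, something like the sketch below could decode them with the standard library. The class name LinkDecodeSketch is made up, and note that URLDecoder also turns '+' into a space, which is harmless for the links shown here but may not suit every URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class LinkDecodeSketch {
    // Decodes percent-encoded UTF-8 sequences in a link for display purposes.
    public static String decode(String link) {
        try {
            return URLDecoder.decode(link, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String link = "http://archive.htrucci.com/archives/292/"
            + "swift%eb%ac%b8%eb%b2%95-%ed%95%a8%ec%88%98%eb%a7%8c%eb%93%a4%ea%b8%b0/";
        // prints http://archive.htrucci.com/archives/292/swift문법-함수만들기/
        System.out.println(decode(link));
    }
}
```

The decoded form is only for reading; the encoded link is still the one to use when requesting the page.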
