Wednesday, December 9, 2015

Mini Search Engine - Just the basics, using Neo4j, Crawler4j, Graphstream and Encog



http://www.briandupreez.net/2013/07/mini-search-engine-just-basics-using.html
Then, using Crawler4j, I created a graph of all the URLs reachable from my blog, their links to other URLs, and all the words those pages contain, together with the index (position) of each word within its page.

    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {

        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();
            Transaction tx = graphDb.beginTx();
            try {

                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }

        }
    }


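    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            // Lowercase, strip punctuation and digits, then trim so split() cannot yield an empty first token
            final String cleaned = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").trim();
            if (!cleaned.isEmpty()) {
                return Arrays.asList(cleaned.split("\\s+"));
            }
        }
        return new ArrayList<>();
    }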
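
To make the "words and indexes" idea concrete without a running Neo4j instance, here is a minimal in-memory sketch. It applies the same cleaning steps as `cleanAndSplitString` and records the position of every word occurrence, which is the information the crawler stores on each word node. The class name `WordPositionSketch`, the method `wordPositions` and the sample sentence are hypothetical and not part of the original post; the sketch assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordPositionSketch {

    // Hypothetical helper: maps each cleaned word to the positions at which it occurs.
    static Map<String, List<Integer>> wordPositions(final String text) {
        final Map<String, List<Integer>> positions = new LinkedHashMap<>();
        if (text == null) {
            return positions;
        }
        // Same cleaning as the crawler: lowercase, drop punctuation and digits.
        final String cleaned = text.toLowerCase()
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\p{Digit}", "")
                .trim();
        if (cleaned.isEmpty()) {
            return positions;
        }
        int index = 0;
        for (final String word : cleaned.split("\\s+")) {
            // One entry per distinct word, one list element per occurrence.
            positions.computeIfAbsent(word, k -> new ArrayList<>()).add(index++);
        }
        return positions;
    }

    public static void main(final String[] args) {
        // prints {java=[0, 5], and=[1], spring=[2, 3], with=[4]}
        System.out.println(wordPositions("Java and Spring: Spring 4 with Java 8!"));
    }
}
```

Note that the crawler above creates one word node per occurrence rather than one per distinct word; the map here only groups those occurrences for readability.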

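Once the data was collected, I could query it and perform the basic functions of a search engine. Each function is a separate task, and all of them are submitted together to a fixed thread pool: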

    final ExecutorService executorService = Executors.newFixedThreadPool(4);
    final String[] searchTerms = {"java", "spring"};

    List<Callable<TaskResponse>> tasks = new ArrayList<>();
    tasks.add(new WordFrequencyTask(searchTerms));
    tasks.add(new DocumentLocationTask(searchTerms));
    tasks.add(new PageRankTask(searchTerms));
    tasks.add(new NeuralNetworkTask(searchTerms));

    final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);
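
The snippet above stops at `invokeAll`, so here is a hedged, self-contained sketch of how the returned futures might be unwrapped and the pool shut down afterwards. `ScoreResponse`, `runAll`, the task names and the score values are hypothetical placeholders; the post's actual `TaskResponse` and task classes are not shown in this excerpt, so this is not their implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllSketch {

    // Hypothetical stand-in for the post's TaskResponse.
    static final class ScoreResponse {
        final String taskName;
        final double score;

        ScoreResponse(final String taskName, final double score) {
            this.taskName = taskName;
            this.score = score;
        }
    }

    // Runs all tasks on a fixed pool and returns their results in submission order.
    static List<ScoreResponse> runAll(final List<Callable<ScoreResponse>> tasks) {
        final ExecutorService executorService = Executors.newFixedThreadPool(4);
        try {
            final List<ScoreResponse> responses = new ArrayList<>();
            // invokeAll blocks until every task has completed (or failed).
            for (final Future<ScoreResponse> future : executorService.invokeAll(tasks)) {
                responses.add(future.get());
            }
            return responses;
        } catch (final InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        } catch (final ExecutionException e) {
            throw new IllegalStateException("A task failed", e.getCause());
        } finally {
            // Without this, the pool's non-daemon threads keep the JVM alive.
            executorService.shutdown();
        }
    }

    public static void main(final String[] args) {
        final List<Callable<ScoreResponse>> tasks = new ArrayList<>();
        tasks.add(() -> new ScoreResponse("wordFrequency", 0.8));    // placeholder score
        tasks.add(() -> new ScoreResponse("documentLocation", 0.5)); // placeholder score
        for (final ScoreResponse response : runAll(tasks)) {
            System.out.println(response.taskName + " -> " + response.score);
        }
    }
}
```

Because `invokeAll` returns its futures in the same order as the submitted task list, the results can be matched back to their tasks by position.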

   


