Finally I have released my PhantomSQL project.
PhantomSQL is a domain-specific language designed for mining content from static and dynamic sources. It closely resembles SQL, with features borrowed from other popular dynamic languages.
It can be run as an interpreter or in 'server' mode, and it comes with a type 4 JDBC driver for ease of integration with Java applications.
Sample Queries
Here are just a few examples taken from the project site to show some of the language's syntax.
Hello World
The following example illustrates how to query google.com for some blog search results.

    SELECT FIRST css("#ires a") AS title
    FROM https://www.google.com/SEARCH
    USING GET WITH {'q': "josh bloch", 'tbm':"blg"}
    print @title +" : "+ @title['href']
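To make the `USING GET WITH {...}` clause more concrete, here is a minimal Java sketch of how such a parameter map would typically end up in a GET query string. This is an assumption about ordinary HTTP behaviour, not a description of PhantomSQL's internals; `GetQuerySketch` and `buildUrl` are hypothetical names introduced only for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of PhantomSQL: shows how a parameter map
// like {'q': "josh bloch", 'tbm': "blg"} maps onto a GET query string.
class GetQuerySketch {

    // URL-encodes each key and value and appends them to the base URL.
    static String buildUrl(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return params.isEmpty() ? base : base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "josh bloch");
        params.put("tbm", "blg");
        // prints https://www.google.com/SEARCH?q=josh+bloch&tbm=blg
        System.out.println(buildUrl("https://www.google.com/SEARCH", params));
    }
}
```

The sketch needs Java 10 or later for the `URLEncoder.encode(String, Charset)` overload.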
Crawling Flickr.com
This query searches flickr.com for 'nabilishes', then crawls the result pages using GET for as long as the css("a.Next") condition matches. At the end it prints how many results were found.

    @RESULT = SELECT css(".pc_img")
    FROM http://www.flickr.com/SEARCH
    USING GET WITH {'q': "nabilishes", 'm' : "text"}
    crawl(css("a.Next"))
    print @RESULT.length()
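The crawl behaviour described above (keep following the "next" link while it matches, accumulating results from every page) can be modelled as a simple loop. The Java sketch below is a conceptual model only: `CrawlSketch`, `Page`, and the in-memory `site` map are hypothetical stand-ins for fetched HTML pages and say nothing about how PhantomSQL actually implements `crawl`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual model only: Page and the in-memory "site" are hypothetical
// stand-ins for fetched HTML, not PhantomSQL types.
class CrawlSketch {

    // One fetched page: the items matched by the SELECT expression, and the
    // target of the "next" link (null when the crawl condition does not match).
    static final class Page {
        final List<String> items;
        final String next;

        Page(List<String> items, String next) {
            this.items = items;
            this.next = next;
        }
    }

    // Follows "next" links from start, collecting the items of every page.
    static List<String> crawl(Map<String, Page> site, String start) {
        List<String> results = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        String url = start;
        // Stop when there is no next link, the page is unknown, or a page repeats.
        while (url != null && site.containsKey(url) && visited.add(url)) {
            Page page = site.get(url);
            results.addAll(page.items);
            url = page.next;
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Page> site = Map.of(
            "/search?page=1", new Page(List.of("a.jpg", "b.jpg"), "/search?page=2"),
            "/search?page=2", new Page(List.of("c.jpg"), null));
        System.out.println(crawl(site, "/search?page=1").size()); // prints 3
    }
}
```

The `visited` set is a design choice of this sketch: it guards against pagination links that loop back to an earlier page.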
Extracting images from Flickr.com search results
Here we build on the previous example by adding the 'WHILE-SELECT' construct and actually saving each image with the 'save' function.

    while SELECT css(".pc_img", "src", TRUE) AS img
    FROM http://www.flickr.com/SEARCH
    USING GET WITH {'q': "nabilishes", 'm' : "text"}
    crawl(css("a.Next"))
    BEGIN
        save (@img)
    END
Retrieving Binary Content
The following two examples are equivalent.
The first example illustrates how to retrieve binary content from a site and save it to the filesystem when running via the interpreter.

    SELECT FIRST css("#main_image", "src", TRUE) AS item
    FROM https://1saleaday.com
    save(@item)
The second example illustrates how to retrieve the same binary content through the JDBC driver and write it to the filesystem.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.sql.Blob;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public void retieveFile() {
        try {
            // Register the PhantomSQL JDBC driver.
            Class.forName("com.gbltech.phantomsql.driver.PhantomDriver");
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }

        String query = "select first css(\"#main_image\", \"src\", true) as item from https://1saleaday.com";

        // try-with-resources closes the connection, statement and result set.
        try (Connection conn = DriverManager.getConnection("jdbc:phantomsql://localhost?characterEncoding=utf8");
             Statement statement = conn.createStatement();
             ResultSet resultSet = statement.executeQuery(query)) {
            if (resultSet.next()) {
                // The selected content comes back as a BLOB in the first column.
                Blob blob = resultSet.getBlob(1);
                try (InputStream is = blob.getBinaryStream();
                     OutputStream out = new FileOutputStream("./test.jpg")) {
                    // Copy the BLOB stream to the file in 1 KB chunks.
                    byte[] buff = new byte[1024];
                    int read;
                    while ((read = is.read(buff, 0, buff.length)) != -1) {
                        out.write(buff, 0, read);
                    }
                    out.flush();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
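The manual buffer loop in the JDBC example can also be replaced with the standard `java.nio.file.Files.copy` call. The sketch below shows that alternative on its own; `StreamToFileSketch` and `saveTo` are hypothetical names, and a byte array stands in for `blob.getBinaryStream()` so that the example runs without a PhantomSQL server.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves any InputStream (for example a BLOB stream) to a file.
class StreamToFileSketch {

    // Copies the whole stream to target, replacing any existing file,
    // and returns the number of bytes written.
    static long saveTo(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for blob.getBinaryStream() in this sketch.
        byte[] fake = {1, 2, 3, 4};
        Path target = Files.createTempFile("phantomsql-demo", ".bin");
        try (InputStream in = new ByteArrayInputStream(fake)) {
            System.out.println(saveTo(in, target)); // prints 4
        }
        Files.delete(target);
    }
}
```

In the JDBC example this would amount to calling `saveTo(blob.getBinaryStream(), Path.of("./test.jpg"))` inside the `if (resultSet.next())` branch.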
This is just a taste of what PhantomSQL can do. There is much more, so go check it out.
I am looking for feedback, criticism, and bug reports to help me make the project better, so if you have something, drop me a line.