Fetching API

Variables

api.web.service.addr = the IP or HTTP address (with port) of our API web service
your.web.service.addr = the IP or HTTP address (with port or default 80) to send our callbacks to

Request for fetching

User programs can request the fetching of web pages by sending an HTTP POST to api.web.service.addr/parse containing JSON such as the following:

{
  "ApiKey": "YourApiKeyHere",
  "CallbackUrl": "TheUrlAtWhichYouReceiveCallbacks",
  "UseBrowser": boolean_value,
  "Pages": [
    { "Src": "TheUrlYouRequestHere", "FileName": "TheFileNameToUseForThisPage", "UserAgent": "UserAgentStringToPassToTheWebsite", "Cookies": "CookieStringToPassToTheWebsite", "Locale": "LocaleStringToPassToTheWebsite" },
    { "Src": "https://www.bing.com/search?q=blowzier+Ted+conversationize+LDMTS+Biblicism+Vereshchagin+choux", "FileName": "0.html", "UserAgent": "Chrome", "Locale": "en-GB", "Cookies": "tool=curl; fun=yes;" }
  ]
}

    
Description of request:

Src – the full URL to fetch, for example a Google search URL (max 4096 characters);

UserAgent – the user agent string to pass to the website, for example "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 OPR/68.0.3618.173";

FileName – the name of the file in the resulting archive;

Cookies – optional field for the cookie string to pass to the website;

Locale – optional field (for example en-GB) that is sent in the Accept-Language header. For example, if you set Locale to en-GB, our system will send the header Accept-Language: en-GB.

UseBrowser – set this to true for websites that need JavaScript rendering, such as LinkedIn or Google Trends; our system will then use a browser with JavaScript rendering rather than fetching only the HTML page source. In all other cases it should be set to false. Please note that using a browser involves extra costs compared to plain HTML fetching.

Note: Please send correct user agents and cookies; Google's CAPTCHA is very sensitive to these parameters.
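
The field descriptions above can be checked on your side before a request is sent. The following is a minimal Python sketch, not an official client: the helper name build_parse_request is hypothetical, and treating Src and FileName as required is an assumption based on the descriptions (only Cookies and Locale are marked optional there).

```python
import json

MAX_SRC_LENGTH = 4096  # limit stated in the description of Src


def build_parse_request(api_key, callback_url, pages, use_browser=False):
    """Return the JSON body for a POST to /parse (hypothetical helper)."""
    for page in pages:
        # Assumption: every page needs at least Src and FileName.
        for field in ('Src', 'FileName'):
            if not page.get(field):
                raise ValueError(f'page is missing {field}: {page!r}')
        if len(page['Src']) > MAX_SRC_LENGTH:
            raise ValueError(f'Src exceeds {MAX_SRC_LENGTH} characters')
    request = {
        'ApiKey': api_key,
        'CallbackUrl': callback_url,
        'UseBrowser': use_browser,  # False means plain HTML fetching
        'Pages': pages,
    }
    return json.dumps(request)


body = build_parse_request(
    'YourApiKeyHere',
    'http://your.web.service.addr/your/path',
    [{'Src': 'https://www.bing.com/search?q=example',
      'FileName': '0.html',
      'UserAgent': 'Chrome',
      'Locale': 'en-GB'}],
)
print(body)
```

The returned string can be sent as the POST body with the Content-Type header set to application/json.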

Complete example (PHP):

<?php 

// Log string
$log = "User: ".$_SERVER['REMOTE_ADDR'].' - '.date("F j, Y, g:i a").PHP_EOL;


# These are the settings of our services
$api_addr = 'http://apiserp.com:88';
$api_parse = $api_addr . '/parse';
$api_download = $api_addr . '/download';


# These are the settings of your services
# TODO: put your actual settings into these variables

$my_api_key = 'Put your API key here';
$my_host = 'apiserp.com'; # Put your actual domain or IP address here
$my_port = 80; # Put your actual port here
$my_callback_path = '/demo-php/index-anon.php'; # Put your actual callback path here
$my_callback_addr = 'http://' . $my_host . ':' . $my_port;
$my_callback_url = $my_callback_addr . $my_callback_path; # Check that this URL is what you actually serve
$my_blocks_dir = 'Blocks';


# If the launch parameter is present in the URL and equals 1, send a fetching request to obtain a block token

if(isset($_GET['launch']) && $_GET['launch'] == 1){
    # Compose a request
    $pages = [];

    $page0 = ['Src' => 'https://www.google.com/search?q=game+recommendation',
             'FileName' => 'game_recommendation.html',
             'UserAgent' => 'Mozilla',
             'Cookies' => 'test0=a; test1=b;',
             'Locale' => 'en-GB'];

    array_push($pages, $page0);

    $page1 = ['Src'=> 'https://www.google.com/search?q=game+suggester',
             'FileName'=>  'game_suggester.html',
             'UserAgent'=> 'Chrome',
             'Cookies'=> 'testA=0; testB=1',
             'Locale'=> 'fr-CH'];

    array_push($pages, $page1);

    $obj_request = ['ApiKey' => $my_api_key, 'CallbackUrl' => $my_callback_url, 'Pages' => $pages];
    
    $ch = curl_init( $api_parse );
    # Setup request to send json via POST.
    $postdata = json_encode($obj_request);
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $postdata );
    curl_setopt( $ch, CURLOPT_HTTPHEADER, array('Content-Type:application/json'));
    # Return response instead of printing.
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    # Send request.
    $result = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    curl_close($ch);

    # Check http code
    if($httpcode != 200){
        throw new Exception('Unexpected status code received from the parser: ' . $httpcode);
    }

    $obj_response = json_decode($result, true);

    if(!isset($obj_response['BlockToken'])){
        throw new Exception('Didn\'t receive a block token from the parser, but got: ' . $result);
    }

    $g_block_token = $obj_response['BlockToken'];

    // Append the token to the tokens file (the file is created automatically if it does not exist)
    file_put_contents('tokens.txt', $g_block_token . PHP_EOL, FILE_APPEND | LOCK_EX);

    print('Got block token: ' . $g_block_token . PHP_EOL);
    $log .= 'Got block token: ' . $g_block_token . PHP_EOL;
} else {
    // callback answer

    // Get token from post
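    // Fall back to an empty string when the parameter is missing, to avoid an undefined index notice
    $block_token = $_POST['BlockToken'] ?? '';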
    $checked = false;
    $new_lines = '';

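    // Check tokens (the file may not exist yet if no request has been sent)
    if (file_exists('tokens.txt')) {
        $fp = fopen('tokens.txt', 'r');
        while (($line = stream_get_line($fp, 1024 * 1024, PHP_EOL)) !== false) {
            if ($line === '') {
                continue; // skip empty lines
            }
            if ($block_token === $line) {
                $checked = true;
            } else {
                // Keep every other token so it can be written back later
                $new_lines .= $line . PHP_EOL;
            }
        }
        fclose($fp);
    }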

    // If token is valid then download blocks
    if($checked){
        // Get blocks
        $download_url = $api_download . '?BlockToken=' . urlencode($block_token);

        $ch = curl_init($download_url);
        # Do not include the response headers in the output: they would corrupt the saved .zip file
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
        $raw_file_data = curl_exec($ch);
        $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

        curl_close($ch);

        # Check http code
        if($httpcode != 200){
            throw new Exception('Download status Code is ' . $httpcode);
        }
        

        $log .= 'Will download to: ' . getcwd() . PHP_EOL;
        if (!file_exists('./' . $my_blocks_dir)) {
            mkdir('./' . $my_blocks_dir, 0777, true);
        }

        // Set file name
        $file_name = $my_blocks_dir . '/' . $block_token . '.zip';
        // Download file
        file_put_contents($file_name, $raw_file_data);

        $log .= 'Downloaded block: ' . $file_name . PHP_EOL;

        
        // Now rewrite the tokens file without the token that was just handled
        // Open the file in write mode (w), which truncates it
        $file = fopen('tokens.txt', 'w');

        // Write the remaining tokens back
        fwrite($file, $new_lines);

        // Close the file
        fclose($file);

        print('OK');

        // Add OK to log too
        $log .= 'OK' . PHP_EOL;
    } else {
        print('Don\'t know token: ' . $block_token . PHP_EOL);
        $log .= 'Don\'t know token: ' . $block_token . PHP_EOL;
    }
    
}

$log .= "-------------------------".PHP_EOL;
//Save string to log, use FILE_APPEND to append.
file_put_contents('./log_'.date("j.n.Y").'.log', $log, FILE_APPEND);

Complete example (Python):
    
import os
from http.server import HTTPServer, BaseHTTPRequestHandler
from threading import Thread
import json
import urllib.parse
from queue import SimpleQueue
# Note: the cgi module was removed in Python 3.13, so this example requires Python 3.12 or earlier
from cgi import parse_header, parse_multipart
from io import BytesIO
# To get this import, run on the command line: pip3 install requests
import requests

# These are the settings of our services
api_addr = 'http://apiserp.com:88'
api_parse = f'{api_addr}/parse'
api_download = f'{api_addr}/download'


# These are the settings of your services
# TODO: put your actual settings into these variables
my_api_key = 'Put your API key here'
my_host = 'Put your domain or IP address here'
my_port = 8003 # Put your actual port here
my_callback_path = '/do-callback/' # Put your actual callback path here
my_callback_addr = f'http://{my_host}:{my_port}'
my_callback_url = f'{my_callback_addr}{my_callback_path}' # Check that this URL is what you actually serve 
my_blocks_dir = 'Blocks'


# The normal variables of the program follow
ready_blocks = SimpleQueue()


class CallbackRequestHandler(BaseHTTPRequestHandler):
    def parse_POST(self):
        ctype, pdict = parse_header(self.headers['content-type'])
        if ctype == 'multipart/form-data':
            postvars = parse_multipart(self.rfile, pdict)
        elif ctype == 'application/x-www-form-urlencoded':
            length = int(self.headers['content-length'])
            postvars = urllib.parse.parse_qs(
                self.rfile.read(length).decode('utf-8'),
                keep_blank_values=True)
        else:
            postvars = {}
        return postvars

    def write_response(self, message: str):
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        response = BytesIO()
        response.write(message.encode('ASCII'))
        self.wfile.write(response.getvalue())

    def do_POST(self):
        postvars = self.parse_POST()
        bt_list = postvars.get('BlockToken')
        if bt_list is None:
            print('Got a request without a block token in it.')
            self.write_response('ERROR: no block token')
            return
        if len(bt_list) == 0:
            print('Got an empty block token.')
            self.write_response('ERROR: empty block token')
            return
        block_token = bt_list[0]
        self.write_response('OK')
        print('Delivered block: ', block_token)
        ready_blocks.put(block_token)


def serve_async(httpd: HTTPServer):
    httpd.serve_forever()


g_httpd = HTTPServer(('0.0.0.0', my_port), CallbackRequestHandler)
thr_httpd = Thread(target=serve_async, args=(g_httpd,))
thr_httpd.start()

# Compose a request
pages = []

page0 = {'Src': 'https://www.google.com/search?q=game+recommendation',
         'FileName': 'game_recommendation.html',
         'UserAgent': 'Mozilla',
         'Cookies': 'test0=a; test1=b;',
         'Locale': 'en-GB'}
pages.append(page0)

page1 = {'Src': 'https://www.google.com/search?q=game+suggester',
         'FileName': 'game_suggester.html',
         'UserAgent': 'Chrome',
         'Cookies': 'testA=0; testB=1',
         'Locale': 'fr-CH'}
pages.append(page1)

obj_request = {'ApiKey': my_api_key, 'CallbackUrl': my_callback_url, 'Pages': pages}

parser_response = requests.post(api_parse, json=obj_request)
if parser_response.status_code != 200:
    raise Exception(f'Unexpected status code received from the parser: {parser_response.status_code}')
obj_response = json.loads(parser_response.content)
g_block_token = obj_response.get('BlockToken')
if g_block_token is None:
    raise Exception(f'Didn\'t receive a block token from the parser, but got: {parser_response.content}')
print(f'Got block token: {g_block_token}')

while True:
    bt_ready = ready_blocks.get()
    if bt_ready == g_block_token:
        break
    print('Unexpected block token:', bt_ready)

download_url = f'{api_download}?BlockToken={urllib.parse.quote(g_block_token)}'
download_resp = requests.get(download_url)
if download_resp.status_code != 200:
    raise Exception('Download status Code is ' + str(download_resp.status_code))
os.makedirs(my_blocks_dir, exist_ok=True)
file_name = f'{my_blocks_dir}/{g_block_token}.zip'
with open(file_name, 'wb') as out_file:
    out_file.write(download_resp.content)
print('Downloaded block: ', file_name)
g_httpd.shutdown()
    

In the sample JSON request shown under "Request for fetching", the Pages array contains two entries: the first describes the fields and the second is a concrete example. Do not send us the description entry: it is there only for your convenience. We recommend sending 500-1000 different pages per block to optimize your user experience.

Typically, the latency for a block is up to 10 minutes, depending on how successfully we can query the target websites. If you need more than 500-1000 pages fetched, you can send us several blocks one after another, each 500-1000 pages in size: the throughput of our system is millions of pages per hour, and the latency is determined mainly by the websites being queried.
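
Splitting a large page list into blocks can be sketched as follows. This is a minimal Python illustration, assuming a block size of 1000 (the upper end of the recommended range); the function name split_into_blocks is hypothetical, and each resulting block would be sent as its own request to /parse.

```python
def split_into_blocks(pages, block_size=1000):
    """Split a list of page dicts into consecutive blocks of at most block_size."""
    if block_size < 1:
        raise ValueError('block_size must be positive')
    return [pages[i:i + block_size] for i in range(0, len(pages), block_size)]


# Hypothetical input: 2500 pages to fetch
pages = [{'Src': f'https://www.google.com/search?q=term{i}',
          'FileName': f'{i}.html'} for i in range(2500)]

blocks = split_into_blocks(pages)
print([len(block) for block in blocks])  # [1000, 1000, 500]
```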

Our API web service responds with one of the following JSON objects:
{ "BlockToken": "128934ygieh89y13" } or { "Error": "ErrorDetails" }
depending on whether your request is well-formed, contains a valid API key, is covered by your page balance, and could be handled by our system at that time. If you receive an error, you can retry your request in ~30 seconds or use our alternative IP/HTTP address in case our main IP/HTTP address is undergoing maintenance.
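
The response handling and the ~30 second retry can be sketched in Python as shown here. This is an illustration under assumptions: the names ParseRequestError, extract_block_token and send_with_retry are hypothetical, the limit of three attempts is arbitrary, and because the format of the error details is not specified here, the sketch retries on any Error response, including ones (such as an invalid API key) where retrying will not help.

```python
import json
import time


class ParseRequestError(Exception):
    """Raised when the service answers with an Error field (hypothetical name)."""


def extract_block_token(response_text):
    """Return the BlockToken from a /parse response, or raise on Error."""
    data = json.loads(response_text)
    if 'BlockToken' in data:
        return data['BlockToken']
    raise ParseRequestError(data.get('Error', 'unknown error'))


def send_with_retry(send, attempts=3, delay_seconds=30, sleep=time.sleep):
    """Call send() until it yields a block token or the attempts run out.

    send is any zero-argument callable that performs the POST and
    returns the response body as text.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        if attempt > 0:
            sleep(delay_seconds)  # wait before every retry
        try:
            return extract_block_token(send())
        except ParseRequestError as error:
            last_error = error
    raise last_error
```

With the requests library, send could be lambda: requests.post(api_parse, json=obj_request).text, using the variable names from the complete Python example.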



Callbacks

For callbacks, you should expose a web service at your.web.service.addr/your/path (e.g. http://apiserp.com:8000/fetcher/callback) that can handle a POST request without any CSRF token. The POST body carries the following parameter (please note that if your script uses low-level HTTP, the parameters are URL-encoded):
BlockToken – the block token you received when you sent us the fetching request

If your script has handled the callback successfully, you must respond with the plain-text word OK (without quotes). Otherwise, we will try to deliver the callback two more times, with a delay of 60 seconds.
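
For scripts that work with the raw HTTP body, decoding the callback and choosing the reply can be sketched as below. This is a minimal Python illustration; the function name handle_callback_body and the wording of the error reply are assumptions, and the only reply that counts as success is the exact text OK.

```python
from urllib.parse import parse_qs


def handle_callback_body(raw_body: bytes):
    """Return (block_token, reply_text) for a URL-encoded callback body."""
    fields = parse_qs(raw_body.decode('utf-8'), keep_blank_values=True)
    tokens = fields.get('BlockToken', [])
    if not tokens or not tokens[0]:
        # Any reply other than OK is treated as a failed delivery
        return None, 'ERROR: no block token'
    return tokens[0], 'OK'


print(handle_callback_body(b'BlockToken=128934ygieh89y13'))
# ('128934ygieh89y13', 'OK')
```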



Download of Blocks

After you have received a callback notifying you that your block of pages is ready, you can download the block as a .zip archive with a GET request to api.web.service.addr/download?BlockToken=YourBlockToken (the token is passed without quotes), e.g. http://apiserp.com:88/download?BlockToken=128934ygieh89y13

Note:
- The zip file is kept for 3 hours, after which it is automatically deleted.
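
Building the download URL and keeping track of the 3-hour window can be sketched in Python as follows. The helper names are hypothetical, and counting the 3 hours from the moment the callback arrives is an assumption: the note above does not say when the retention period starts.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote

API_ADDR = 'http://apiserp.com:88'  # address used in the examples of this document
RETENTION = timedelta(hours=3)      # archives are kept for 3 hours


def download_url(block_token):
    """Return the GET URL for a block, with the token URL-encoded."""
    return f'{API_ADDR}/download?BlockToken={quote(block_token, safe="")}'


def is_probably_expired(callback_received_at, now=None):
    """Assumption: the 3-hour window is counted from the callback time."""
    now = now or datetime.now(timezone.utc)
    return now - callback_received_at >= RETENTION


print(download_url('128934ygieh89y13'))
# http://apiserp.com:88/download?BlockToken=128934ygieh89y13
```

Downloading each block as soon as its callback arrives avoids depending on the exact start of the retention period.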