API Documentation

bonfire.cli

bonfire.cli.build_universe(universe, build_mappings=True)

Expand the universe from the seed user list in the universe file. This should be called no more than once every 15 minutes.

This command also serves to update an existing universe.

The seed is currently limited to the first 14 users in the file: the API limit is 15 calls per 15 minutes, and one extra call is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15-minute window.
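
A minimal sketch of that rate-limit arithmetic (illustrative only; the constants come from the description above, not from bonfire's source):

    # Why the seed is capped at 14 users per 15-minute window.
    TWITTER_CALLS_PER_WINDOW = 15   # Twitter allows 15 calls per 15-minute window
    SEED_ID_LOOKUP_CALLS = 1        # one extra call to resolve seed screen names to IDs
    MAX_SEED_USERS = TWITTER_CALLS_PER_WINDOW - SEED_ID_LOOKUP_CALLS

    assert MAX_SEED_USERS == 14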

bonfire.cli.build_universe_mappings(universe, rebuild=False)

Create and map the universe.

bonfire.cli.cache_queries(universe, top_links=False, tweet=False)
bonfire.cli.cleanup_universe(universe, days=30)
bonfire.cli.collect_universe_tweets(universe)

Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users due to an API limitation.

bonfire.cli.command(name=None, cls=None, **attrs)
bonfire.cli.config_file_path()
bonfire.cli.delete_content_by_url(universe, url)

Delete the content specified by url.

bonfire.cli.delete_tweets_by_url(universe, url)

Delete tweets specified by url.

bonfire.cli.edit_file(filename)
bonfire.cli.ensure_config()
bonfire.cli.get_latest_raw_tweet(universe)
bonfire.cli.get_latest_tweet(universe)
bonfire.cli.get_universes()
bonfire.cli.logging_config()
bonfire.cli.process_universe_rawtweets(universe, build_mappings=True)

Take all unprocessed tweets in the given universe, and extract and process their contents. When there are no tweets left, sleep until a new one arrives.

bonfire.cli.yes_no(s)

bonfire.config

bonfire.config.config_file_path()
bonfire.config.configuration()
bonfire.config.get(section, option, default=None)
bonfire.config.get_elasticsearch_hosts(universe)
bonfire.config.get_twitter_keys(universe)
bonfire.config.get_universe_seed(universe)
bonfire.config.get_universes()
bonfire.config.logging_config()

bonfire.content

class bonfire.content.BaseFetcher
get_authors()

A shim to help support using Newspaper’s canonical_link.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.
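
A rough sketch of that fallback chain; the helper arguments and the short-url domain list below are hypothetical stand-ins, not bonfire's own:

    import requests

    SHORT_URL_DOMAINS = {"bit.ly", "t.co", "ow.ly"}   # assumed example list

    def resolve_canonical_url(url, og_url=None, twitter_url=None, newspaper_guess=None):
        # Prefer og:url, then twitter:url, then newspaper's guess, skipping
        # anything that looks like a short-url domain.
        for candidate in (og_url, twitter_url, newspaper_guess):
            if candidate and not any(d in candidate for d in SHORT_URL_DOMAINS):
                return candidate
        # Give up and request the URL to follow its final redirect.
        return requests.head(url, allow_redirects=True).url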

get_description()

Retrieve description from opengraph, twitter, or meta tags.

get_facebook_image()

Retrieve opengraph image for the resource.

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()
get_title()

Retrieve title from opengraph, twitter, or meta tags.

get_top_image()

Get the top article image.

get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

class bonfire.content.DefaultFetcher(url, html=None)

Class to fetch article from a URL.

get_authors()

A shim to help support using Newspaper’s canonical_link.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.

get_description()

Retrieve description from opengraph, twitter, or meta tags.

get_facebook_image()

Retrieve opengraph image for the resource.

get_favicon()

Retrieve favicon url from article tags or from http://g.etfv.co

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_image_dimensions(img_url)
get_metadata()
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()
get_text()

Get the article text.

get_title()
get_top_image()
get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

class bonfire.content.NewspaperFetcher(url, html=None)

Smartly fetches metadata from a newspaper article, and cleans the results.

get_authors()

Retrieve an author or authors. This works very sporadically.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.

get_description()
get_facebook_image()

Retrieve opengraph image for the resource.

get_favicon()

Retrieve favicon url from article tags or from http://g.etfv.co

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_image_dimensions(img_url)
get_metadata()
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()

Retrieve a comma-separated list of all keywords, categories, and tags, flattened:

  • opengraph tags + sections
  • keywords + meta_keywords
  • tags
get_text()

Get the article text.

get_title()

Retrieve title from opengraph, twitter, or meta tags.

get_top_image()
get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

bonfire.content.extract(url, html=None)

Extract metadata from a URL, and return a dict result.

Uses newspaper https://github.com/codelucas/newspaper/, but overrides some defaults in favor of opengraph and twitter elements.

Parameters: html – if provided, skip downloading and go straight to parsing the HTML.
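
A hedged usage sketch; the URL and the dict keys shown are assumptions about what the fetchers typically return:

    from bonfire.content import extract

    # Download and parse the page; pass html= to skip the download step.
    result = extract("http://example.com/some-article")

    # `result` is a dict of extracted metadata; exact keys depend on the page.
    print(result.get("title"), result.get("description"))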

bonfire.dates

bonfire.dates.apply_offset(start_date, offset)

Apply an offset in minutes to a given date.

bonfire.dates.dateify_string(datestr, format='%a %b %d %H:%M:%S +0000 %Y')

Convert a date string to a python datetime (UTC).

bonfire.dates.epoch_to_datetime(epoch)

Converts unix timestamp to python datetime (UTC).

bonfire.dates.get_query_dates(start, end, hours=None, stringify=True)

Gets the start and end dates for a query.

Parameters:
  • start – datetime to start with. If not present, it is derived from the hours argument.
  • end – datetime to end with. If not present, defaults to now.
  • hours – number of hours before end to start with, if no start is specified.
  • stringify – return as formatted strings.
bonfire.dates.get_since_now(start_time, time_type=None, stringify=True)

Gets the amount of time that has elapsed between start_time and now. Smartly chooses between hours, days, minutes, and seconds. Accepts a datetime, unix epoch, or Twitter-formatted datestring.

Parameters:
  • time_type – force return of a certain time measurement (e.g. “120 seconds ago” instead of “2 minutes ago”)
  • stringify – turn result into a string instead of tuple, pluralized if need be
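
A hedged usage sketch of get_since_now; the unit name passed as time_type and the exact output format are assumptions:

    from datetime import datetime, timedelta
    from bonfire.dates import get_since_now

    two_hours_ago = datetime.utcnow() - timedelta(hours=2)

    # Default: picks a sensible unit and returns a pluralized string.
    print(get_since_now(two_hours_ago))

    # Force a particular unit; "minutes" is an assumed unit name.
    print(get_since_now(two_hours_ago, time_type="minutes"))
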
bonfire.dates.now(stringify=False)

Return now.

Parameters: stringify – return now as a formatted string.

bonfire.dates.stringify_date(dt)

Convert a datetime to an elasticsearch-formatted datestring (UTC).

bonfire.dates.stringify_since_now(amt, time_type)

Takes an amount and a type. Returns a pluralized string.

bonfire.db

bonfire.db.add_to_results_cache(universe, hours, results)

Cache a set of results under a certain number of hours.

Index a new top link to the given universe.

bonfire.db.build_universe_mappings(universe, rebuild=False)

Create and map the universe.

bonfire.db.cleanup(universe, days=30)

Delete everything in the universe that is more than days old. Does not apply to top content.

bonfire.db.delete_content_by_url(universe, url)

Delete the content specified by url.

bonfire.db.delete_tweets_by_url(universe, url)

Delete tweets specified by url.

bonfire.db.delete_user(universe, user_id)

Delete a user from the universe index by their id.

bonfire.db.enqueue_tweet(universe, tweet)

Save a tweet to the universe index as an unprocessed tweet document.

bonfire.db.es(universe)

Return a new-style Elasticsearch client connection for the universe.

bonfire.db.get_all_docs(universe, index, doc_type, body={}, size=None, field='_id')

Helper function to return all values in a certain field. Defaults to retrieving all ids from a given index and doc type.

Parameters:
  • universe – current universe.
  • index – current index.
  • doc_type – the type of doc to return all values for.
  • body – add custom body, or leave blank to retrieve everything.
  • size – limit by size, or leave as None to retrieve all.
  • field – retrieve all of a specific field. Defaults to id.
bonfire.db.get_cached_url(universe, url)

Get a resolved URL from the index. Returns None if URL doesn’t exist.

bonfire.db.get_items(universe, quantity=20, hours=24, start=None, end=None, time_decay=True)

The default function: gets the most popular links shared from a given universe and time frame.

Parameters:
  • quantity – number of links to return
  • hours – hours since end to search through.
  • start – start datetime in UTC. Defaults to hours.
  • end – end datetime in UTC. Defaults to now.
  • time_decay – whether or not to decay the score based on the time of its first tweet.
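
A hedged usage sketch of get_items; the universe name is a placeholder, and the shape of each returned item depends on the index mapping:

    from bonfire.db import get_items

    # Top 10 links shared over the last 12 hours, with time decay applied.
    links = get_items("my-universe", quantity=10, hours=12, time_decay=True)
    for link in links:
        print(link)
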
bonfire.db.get_latest_raw_tweet(universe)
bonfire.db.get_latest_tweet(universe)

Get the most recently added top links in the given universe.

bonfire.db.get_score_stats(universe, hours=4)

Get extended stats on the scores returned from the results cache.

Parameters: hours – type of query to search for.

Search for any links in the current set whose scores are high enough to qualify as top links. Return one (and only one) if so.

bonfire.db.get_top_providers(universe, size=2000)

Get a list of all providers (i.e. domains) in order of popularity. Possible future use for autocomplete, to search across publications.

bonfire.db.get_universe_tweets(universe, query=None, quantity=20, hours=24, start=None, end=None)

Get tweets in a given universe.

Parameters:
  • query – accepts None, a string, or a dict. If None, matches all tweets; if a string, searches across the tweets’ text for the given string; if a dict, accepts any elasticsearch match query http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
  • start – accepts an int or a datetime (timezone-unaware, UTC). If an int, starts that many hours before now.
  • end – accepts a datetime (timezone-unaware, UTC); defaults to now.
  • quantity – number of tweets to return
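
A hedged usage sketch showing the three query forms; the universe name is a placeholder, and whether the dict form is the inner field/value match body is an assumption:

    from bonfire.db import get_universe_tweets

    # query=None matches all tweets from the last 24 hours.
    tweets = get_universe_tweets("my-universe")

    # A string searches across the tweets' text.
    tweets = get_universe_tweets("my-universe", query="elasticsearch", hours=6)

    # A dict is passed through as an Elasticsearch match query
    # (assumed here to be the inner field/value body).
    tweets = get_universe_tweets("my-universe", query={"text": "open data"})
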
bonfire.db.get_user_ids(universe, size=None)

Get top users for the universe by weight.

Parameters: size – number of users to get. Defaults to all users.

bonfire.db.get_user_weights(universe, user_ids)

Takes a list of user ids and returns a dict with their weighted influence.

bonfire.db.logger()
bonfire.db.next_unprocessed_tweet(universe, not_ids=None)

Get the next unprocessed tweet and delete it from the index.

bonfire.db.save_content(universe, content)

Save the content of a URL to the index.

bonfire.db.save_tweet(universe, tweet)

Save a tweet to the universe index, fully processed.

bonfire.db.save_user(universe, user)

Check if a user exists in the database. If not, create it. If so, update it.

Scores a given link returned from Elasticsearch.

Parameters:
  • link – the full Elasticsearch result for the link
  • user_weights – a dict of key/value pairs where the key is a user’s id and the value is that user’s weighted Twitter influence
  • time_decay – whether or not to decay the link’s score based on time
  • hours – used for determining the decay factor if decay is enabled
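
Illustrative only: one way such a scorer could combine tweeter influence with time decay. The field names and the decay formula below are assumptions, not bonfire's actual implementation:

    import math

    def score_link_sketch(link, user_weights, time_decay=True, hours=24):
        # Sum the weighted influence of the users who tweeted the link;
        # "tweeter_ids" and "age_hours" are assumed field names.
        source = link.get("_source", {})
        base = sum(user_weights.get(uid, 0) for uid in source.get("tweeter_ids", []))
        if not time_decay:
            return base
        # Exponential decay by the link's age relative to the query window.
        return base * math.exp(-source.get("age_hours", 0) / float(hours))
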
bonfire.db.search_content(universe, query, size=100)

Search the full text of all content across universes for a given string, or a custom match query.

bonfire.db.search_items(universe, term, quantity=100)

Search the text of both tweets and content for a given term and universe, and return some items matching one or the other.

Parameters:
  • term – search term to use for querying both tweets and content
  • quantity – number of items to return
bonfire.db.set_cached_url(universe, url, resolved_url)

Index a URL and its resolution in Elasticsearch.

bonfire.elastic

An attempt to make the Elasticsearch client a bit more usable. Currently implements search, get, and mget. get_source maps to get because there really should not be a need for both if the API is done correctly. For the time being, other client methods will behave just as they do for the client provided by Elasticsearch.

The implemented methods return ESDocument or ESCollection objects. ESDocument is dot-addressable and dict-like. Its keys are your document keys. Metadata is attached to the document with underscored properties: _index, _type, _id, and, when applicable, _score, _version, _found. I realize that _ has the general connotation of “privacy” in Python, but this seems to me to be the safest way to have special keys and keep the API fairly usable.

ESCollection is an iterable of those documents and has the special properties: took, timed_out, total_shards, successful_shards, failed_shards. No need for ['hits']['hits'] dereferencing – just iterate the collection. Same for ['docs'] on an mget operation.
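
A hedged usage sketch based on the description above; the host, index, and type names are placeholders:

    from bonfire.elastic import ESClient

    es = ESClient(hosts=["localhost:9200"])

    # search() returns an ESCollection: iterate it directly, no ['hits']['hits'].
    results = es.search(index="my-index", doc_type="my-type",
                        body={"query": {"match_all": {}}})
    print(results.took, results.total_shards)
    for doc in results:
        # ESDocument is dot-addressable; metadata sits under underscored keys.
        print(doc._id, doc._score)

    # get() returns a single ESDocument.
    doc = es.get(index="my-index", doc_type="my-type", id="1")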

class bonfire.elastic.ESAggregation
clear() → None. Remove all items from D.
copy() → a shallow copy of D
static fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values
class bonfire.elastic.ESClient(hosts=None, transport_class=<class 'elasticsearch.transport.Transport'>, **kwargs)
abort_benchmark(*args, **kwargs)

Aborts a running benchmark. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters: name – A benchmark name
benchmark(*args, **kwargs)

The benchmark API provides a standard mechanism for submitting queries and measuring their performance relative to one another. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters:
  • index – A comma-separated list of index names; use _all or empty string to perform the operation on all indices
  • doc_type – The name of the document type
  • body – The search definition using the Query DSL
  • verbose – Specify whether to return verbose statistics about each iteration (default: false)
bulk(*args, **kwargs)

Perform many index/delete operations in a single API call. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html

See the bulk() helper function for a more friendly API.

Parameters:
  • body – The operation definition and data (action-data pairs), as either a newline separated string, or a sequence of dicts to serialize (one per row).
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide one
  • consistency – Explicit write consistency setting for the operation
  • refresh – Refresh the index after performing the operation
  • routing – Specific routing value
  • replication – Explicitly set the replication type (default: sync)
  • timeout – Explicit operation timeout
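
A hedged sketch of the action/data-pair body format described above; the host, index, and type names are placeholders:

    from bonfire.elastic import ESClient

    es = ESClient(hosts=["localhost:9200"])   # example host

    # A sequence of dicts: each action line is followed by its document.
    body = [
        {"index": {"_index": "my-index", "_type": "my-type", "_id": "1"}},
        {"title": "first doc"},
        {"index": {"_index": "my-index", "_type": "my-type", "_id": "2"}},
        {"title": "second doc"},
    ]
    es.bulk(body=body)
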
clear_scroll(*args, **kwargs)

Clear the scroll request created by specifying the scroll parameter to search. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

Parameters:
  • scroll_id – The scroll ID or a list of scroll IDs
  • body – A comma-separated list of scroll IDs to clear if none was specified via the scroll_id parameter
count(*args, **kwargs)

Execute a query and get the number of matches for that query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-count.html

Parameters:
  • index – A comma-separated list of indices to restrict the results
  • doc_type – A comma-separated list of types to restrict the results
  • body – A query to restrict the results (optional)
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • min_score – Include only documents with a specific _score value in the result
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • q – Query in the Lucene query string syntax
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
count_percolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being count percolated.
  • doc_type – The type of the document being count percolated.
  • id – Substitute the document in the request body with a document that is known by the specified id. On top of the id, the index and type parameter will be used to retrieve the document from within the cluster.
  • body – The count percolator request definition using the percolate DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • percolate_index – The index to count percolate the document into. Defaults to index.
  • percolate_type – The type to count percolate document into. Defaults to type.
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
create(*args, **kwargs)

Adds a typed JSON document in a specific index, making it searchable. Behind the scenes this method calls index(..., op_type=’create’) http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – Document ID
  • body – The document
  • consistency – Explicit write consistency setting for the operation
  • id – Specific document ID (when the POST method is used)
  • parent – ID of the parent document
  • percolate – Percolator queries to execute while indexing the document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
delete(*args, **kwargs)

Delete a typed JSON document from a specific index based on its id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – The document ID
  • consistency – Specific write consistency setting for the operation
  • parent – ID of parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
delete_by_query(*args, **kwargs)

Delete documents from one or more indices and one or more types based on a query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

Parameters:
  • index – A comma-separated list of indices to restrict the operation; use _all to perform the operation on all indices
  • doc_type – A comma-separated list of types to restrict the operation
  • body – A query to restrict the operation specified with the Query DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • analyzer – The analyzer to use for the query string
  • consistency – Specific write consistency setting for the operation
  • default_operator – The default operator for query string query (AND or OR), default ‘OR’
  • df – The field to use as default where no field prefix is given in the query string
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • q – Query in the Lucene query string syntax
  • replication – Specific replication type, default ‘sync’
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
  • timeout – Explicit operation timeout
delete_script(*args, **kwargs)

Remove a stored script from elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
delete_template(*args, **kwargs)

Delete a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters: id – Template ID
exists(*args, **kwargs)

Returns a boolean indicating whether or not given document exists in Elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html

Parameters:
  • index – The name of the index
  • id – The document ID
  • doc_type – The type of the document (uses _all by default to fetch the first document matching the ID across all types)
  • parent – The ID of the parent document
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • realtime – Specify whether to perform the operation in realtime or search mode
  • refresh – Refresh the shard containing the document before performing the operation
  • routing – Specific routing value
explain(*args, **kwargs)

The explain api computes a score explanation for a query and a specific document. This can give useful feedback about whether a document matched or didn’t match a specific query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-explain.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – The document ID
  • body – The query definition using the Query DSL
  • _source – True or false to return the _source field or not, or a list of fields to return
  • _source_exclude – A list of fields to exclude from the returned _source field
  • _source_include – A list of fields to extract and return from the _source field
  • analyze_wildcard – Specify whether wildcards and prefix queries in the query string query should be analyzed (default: false)
  • analyzer – The analyzer for the query string query
  • default_operator – The default operator for query string query (AND or OR), (default: OR)
  • df – The default field for query string query (default: _all)
  • fields – A comma-separated list of fields to return in the response
  • lenient – Specify whether format-based query failures (such as providing text to a numeric field) should be ignored
  • lowercase_expanded_terms – Specify whether query terms should be lowercased
  • parent – The ID of the parent document
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • q – Query in the Lucene query string syntax
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
get(*args, **kwargs)
get_script(*args, **kwargs)

Retrieve a script from the API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
get_source(*args, **kwargs)
get_template(*args, **kwargs)

Retrieve a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters:
  • id – Template ID
  • body – The document
index(*args, **kwargs)

Adds or updates a typed JSON document in a specific index, making it searchable. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • body – The document
  • id – Document ID
  • consistency – Explicit write consistency setting for the operation
  • op_type – Explicit operation type (default: index)
  • parent – ID of the parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
info(*args, **kwargs)

Get the basic info from the current cluster.

list_benchmarks(*args, **kwargs)

View the progress of long-running benchmarks. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters:
  • index – A comma-separated list of index names; use _all or empty string to perform the operation on all indices
  • doc_type – The name of the document type
mget(*args, **kwargs)
mlt(*args, **kwargs)

Get documents that are “like” a specified document. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-more-like-this.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document (use _all to fetch the first document matching the ID across all types)
  • id – The document ID
  • body – A specific search request definition
  • boost_terms – The boost factor
  • include – Whether to include the queried document from the response
  • max_doc_freq – The word occurrence frequency as count: words with higher occurrence in the corpus will be ignored
  • max_query_terms – The maximum query terms to be included in the generated query
  • max_word_length – The maximum length of the word: longer words will be ignored
  • min_doc_freq – The word occurrence frequency as count: words with lower occurrence in the corpus will be ignored
  • min_term_freq – The term frequency as percent: terms with lower occurrence in the source document will be ignored
  • min_word_length – The minimum length of the word: shorter words will be ignored
  • mlt_fields – Specific fields to perform the query against
  • percent_terms_to_match – How many terms have to match in order to consider the document a match (default: 0.3)
  • routing – Specific routing value
  • search_from – The offset from which to return results
  • search_indices – A comma-separated list of indices to perform the query against (default: the index containing the document)
  • search_query_hint – The search query hint
  • search_scroll – A scroll search request definition
  • search_size – The number of documents to return (default: 10)
  • search_source – A specific search request definition (instead of using the request body)
  • search_type – Specific search type (e.g. dfs_then_fetch, count, etc.)
  • search_types – A comma-separated list of types to perform the query against (default: the same type as the document)
  • stop_words – A list of stop words to be ignored
mpercolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being count percolated to use as default
  • doc_type – The type of the document being percolated to use as default.
  • body – The percolate request definitions (header & body pair), separated by newlines
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
msearch(*args, **kwargs)

Execute several search requests within the same API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-multi-search.html

Parameters:
  • body – The request definitions (metadata-search request definition pairs), as either a newline separated string, or a sequence of dicts to serialize (one per row).
  • index – A comma-separated list of index names to use as default
  • doc_type – A comma-separated list of document types to use as default
  • search_type – Search operation type
mtermvectors(*args, **kwargs)

Multi termvectors API allows to get multiple termvectors based on an index, type and id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/docs-multi-termvectors.html

Parameters:
  • index – The index in which the document resides.
  • doc_type – The type of the document.
  • body – Define ids, parameters or a list of parameters per document here. You must at least provide a list of document ids. See documentation.
  • field_statistics – Specifies if document count, sum of document frequencies and sum of total term frequencies should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • fields – A comma-separated list of fields to return. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • ids – A comma-separated list of documents ids. You must define ids as parameter or set “ids” or “docs” in the request body
  • offsets – Specifies if term offsets should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • parent – Parent id of documents. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • payloads – Specifies if term payloads should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • positions – Specifies if term positions should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • preference – Specify the node or shard the operation should be performed on (default: random) .Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • routing – Specific routing value. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • term_statistics – Specifies if total term frequency and document frequency should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default False
percolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being percolated.
  • doc_type – The type of the document being percolated.
  • id – Substitute the document in the request body with a document that is known by the specified id. On top of the id, the index and type parameter will be used to retrieve the document from within the cluster.
  • body – The percolator request definition using the percolate DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • percolate_format – Return an array of matching query IDs instead of objects
  • percolate_index – The index to percolate the document into. Defaults to index.
  • percolate_type – The type to percolate document into. Defaults to type.
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
ping(*args, **kwargs)

Returns True if the cluster is up, False otherwise.

put_script(*args, **kwargs)

Create a script in given language with specified ID. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
  • body – The document
put_template(*args, **kwargs)

Create a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters:
  • id – Template ID
  • body – The document
scroll(*args, **kwargs)

Scroll a search request created by specifying the scroll parameter. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

Parameters:
  • scroll_id – The scroll ID
  • scroll – Specify how long a consistent view of the index should be maintained for scrolled search
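
A hedged sketch of a scrolled search using the plain elasticsearch-py client that ESClient wraps (how ESClient's overridden search() exposes the scroll id may differ); the host and index names are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])    # example host

    # Open a scrolled search, then keep paging until no hits remain.
    page = es.search(index="my-index", body={"query": {"match_all": {}}},
                     scroll="5m", size=100)
    scroll_id = page["_scroll_id"]
    while True:
        page = es.scroll(scroll_id=scroll_id, scroll="5m")
        if not page["hits"]["hits"]:
            break
        scroll_id = page["_scroll_id"]
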
search(*args, **kwargs)
search_shards(*args, **kwargs)

The search shards api returns the indices and shards that a search request would be executed against. This can give useful feedback for working out issues or planning optimizations with routing and shard preferences. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-shards.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both (default: ‘open’)
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • local – Return local information, do not retrieve the state from master node (default: false)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – Specific routing value
search_template(*args, **kwargs)

A query that accepts a query template and a map of key/value pairs to fill in template parameters. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/query-dsl-template-query.html

Parameters:
  • index – A comma-separated list of index names to search; use _all or empty string to perform the operation on all indices
  • doc_type – A comma-separated list of document types to search; leave empty to perform the operation on all types
  • body – The search definition template and its params
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • scroll – Specify how long a consistent view of the index should be maintained for scrolled search
  • search_type – Search operation type
suggest(*args, **kwargs)

The suggest feature suggests similar looking terms based on a provided text by using a suggester. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html

Parameters:
  • index – A comma-separated list of index names to restrict the operation; use _all or empty string to perform the operation on all indices
  • body – The request definition
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – Specific routing value
  • source – The URL-encoded request definition (instead of using request body)
termvector(*args, **kwargs)

Returns information and statistics on terms in the fields of a particular document (added in Elasticsearch 1.0). http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-termvectors.html

Parameters:
  • index – The index in which the document resides.
  • doc_type – The type of the document.
  • id – The id of the document.
  • body – Define parameters. See documentation.
  • field_statistics – Specifies if document count, sum of document frequencies and sum of total term frequencies should be returned., default True
  • fields – A comma-separated list of fields to return.
  • offsets – Specifies if term offsets should be returned., default True
  • parent – Parent id of documents.
  • payloads – Specifies if term payloads should be returned., default True
  • positions – Specifies if term positions should be returned., default True
  • preference – Specify the node or shard the operation should be performed on (default: random).
  • routing – Specific routing value.
  • term_statistics – Specifies if total term frequency and document frequency should be returned., default False
update(*args, **kwargs)

Update a document based on a script or partial data provided. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – Document ID
  • body – The request definition using either script or partial doc
  • consistency – Explicit write consistency setting for the operation
  • fields – A comma-separated list of fields to return in the response
  • lang – The script language (default: mvel)
  • parent – ID of the parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • retry_on_conflict – Specify how many times should the operation be retried when a conflict occurs (default: 0)
  • routing – Specific routing value
  • script – The URL-encoded script definition (instead of using request body)
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
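
A hedged sketch of a partial-document update and a script-based update (the host, index, type, and field names are placeholders, and scripting must be enabled on the cluster for the second form):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])    # example host

    # Merge the given fields into the stored document.
    es.update(index="my-index", doc_type="my-type", id="1",
              body={"doc": {"title": "updated title"}})

    # Script-based update.
    es.update(index="my-index", doc_type="my-type", id="1",
              body={"script": "ctx._source.counter += 1"})
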
class bonfire.elastic.ESCollection(resultset)
next()
class bonfire.elastic.ESDocument(doc)
clear() → None. Remove all items from D.
copy() → a shallow copy of D
static fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values

bonfire.extract

class bonfire.extract.ArticleExtractor(url=None, html=None)
article_node
author
doc
fetch(url)
get_article_text()
get_top_image()

Return the first of the article images.

html
images
metadata
title
url
exception bonfire.extract.InstantiationError
args
message
bonfire.extract.as_date(node)
bonfire.extract.clean_attribution(string)
bonfire.extract.clean_whitespace(string)
bonfire.extract.content_nodes(elem, node_types=None)
bonfire.extract.find_pubdate(node)
bonfire.extract.is_attribution(string)
bonfire.extract.word_count(text)

bonfire.mappings

bonfire.process

bonfire.process.create_session()

Create a requests session optimized for many connections.
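
One plausible shape for such a session (a sketch; the pool sizes below are assumptions, not bonfire's actual values):

    import requests
    from requests.adapters import HTTPAdapter

    def make_session(pool_size=25):
        # Hypothetical stand-in for create_session(): a requests.Session
        # with a larger connection pool for fetching many URLs.
        session = requests.Session()
        adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session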

bonfire.process.logger()
bonfire.process.process_rawtweet(universe, raw_tweet, session=None)

Take a raw tweet from the queue, extract and save metadata from its content, then save as a processed tweet.

bonfire.process.process_universe_rawtweets(universe, build_mappings=True)

Take all unprocessed tweets in the given universe, and extract and process their contents. When there are no tweets left, sleep until a new one arrives.

bonfire.twitter

bonfire.twitter.client(universe)

Return a Twitter client for the given universe.

bonfire.twitter.collect_universe_tweets(universe)

Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users due to an API limitation.

bonfire.twitter.get_friends(universe, user_id)

Get Twitter IDs for friends of the given user_id.

bonfire.twitter.logger()
bonfire.twitter.lookup_users(universe, usernames)

Look up Twitter users by screen name. Limited to the first 100 usernames due to an API limitation.
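
Because of the 100-name cap, larger lists need to be batched by the caller; a hedged sketch (the helper below is not part of bonfire, and it assumes lookup_users returns an iterable of user objects):

    from bonfire.twitter import lookup_users

    def lookup_in_batches(universe, usernames, batch_size=100):
        # Look users up 100 at a time to stay under the API cap.
        results = []
        for i in range(0, len(usernames), batch_size):
            results.extend(lookup_users(universe, usernames[i:i + batch_size]))
        return results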

bonfire.twitter.stream_client(universe)

Return a Twitter streaming client for the given universe.

bonfire.universe

bonfire.universe.build_universe(universe, build_mappings=True)

Expand the universe from the seed user list in the universe file. This should be called no more than once every 15 minutes.

This command also serves to update an existing universe.

The seed is currently limited to the first 14 users in the file: the API limit is 15 calls per 15 minutes, and one extra call is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15-minute window.

bonfire.universe.cache_queries(universe, top_links=False, tweet=False)
bonfire.universe.cleanup_universe(universe, days=30)