API Documentation

bonfire.cli

bonfire.cli.build_universe(universe, build_mappings=True)

Expand the universe from the seed user list in the universe file. This should be called no more than once every 15 minutes.

This command also serves to update an existing universe.

The seed is currently limited to the first 14 users in the file: the API limit is 15 calls per 15 minutes, and one extra call is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15-minute window.
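
A minimal sketch of that rate-limit arithmetic (illustrative only; the constants come from the description above, not from bonfire's source):

    # Why the seed is capped at 14 users per 15-minute window.
    TWITTER_CALLS_PER_WINDOW = 15   # Twitter allows 15 calls per 15-minute window
    SEED_ID_LOOKUP_CALLS = 1        # one extra call to resolve seed screen names to IDs
    MAX_SEED_USERS = TWITTER_CALLS_PER_WINDOW - SEED_ID_LOOKUP_CALLS

    assert MAX_SEED_USERS == 14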

bonfire.cli.build_universe_mappings(universe, rebuild=False)

Create and map the universe.

bonfire.cli.cache_queries(universe, top_links=False, tweet=False)
bonfire.cli.cleanup_universe(universe, days=30)
bonfire.cli.collect_universe_tweets(universe)

Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users due to an API limitation.

bonfire.cli.command(name=None, cls=None, **attrs)
bonfire.cli.config_file_path()
bonfire.cli.delete_content_by_url(universe, url)

Delete the content specified by url.

bonfire.cli.delete_tweets_by_url(universe, url)

Delete tweets specified by url.

bonfire.cli.edit_file(filename)
bonfire.cli.ensure_config()
bonfire.cli.get_latest_raw_tweet(universe)
bonfire.cli.get_latest_tweet(universe)
bonfire.cli.get_universes()
bonfire.cli.logging_config()
bonfire.cli.process_universe_rawtweets(universe, build_mappings=True)

Take all unprocessed tweets in the given universe, and extract and process their contents. When there are no tweets left, sleep until a new one arrives.

bonfire.cli.yes_no(s)

bonfire.config

bonfire.config.config_file_path()
bonfire.config.configuration()
bonfire.config.get(section, option, default=None)
bonfire.config.get_elasticsearch_hosts(universe)
bonfire.config.get_twitter_keys(universe)
bonfire.config.get_universe_seed(universe)
bonfire.config.get_universes()
bonfire.config.logging_config()

bonfire.content

class bonfire.content.BaseFetcher
get_authors()

A shim to help support using Newspaper’s canonical_link.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.
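
A rough sketch of that fallback chain; the helper arguments and the short-url domain list below are hypothetical stand-ins, not bonfire's own:

    import requests

    SHORT_URL_DOMAINS = {"bit.ly", "t.co", "ow.ly"}   # assumed example list

    def resolve_canonical_url(url, og_url=None, twitter_url=None, newspaper_guess=None):
        # Prefer og:url, then twitter:url, then newspaper's guess, skipping
        # anything that looks like a short-url domain.
        for candidate in (og_url, twitter_url, newspaper_guess):
            if candidate and not any(d in candidate for d in SHORT_URL_DOMAINS):
                return candidate
        # Give up and request the URL to follow its final redirect.
        return requests.head(url, allow_redirects=True).url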

get_description()

Retrieve description from opengraph, twitter, or meta tags.

get_facebook_image()

Retrieve opengraph image for the resource.

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()
get_title()

Retrieve title from opengraph, twitter, or meta tags.

get_top_image()

Get the top article image.

get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

class bonfire.content.DefaultFetcher(url, html=None)

Class to fetch article from a URL.

get_authors()

A shim to help support using Newspaper’s canonical_link.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.

get_description()

Retrieve description from opengraph, twitter, or meta tags.

get_facebook_image()

Retrieve opengraph image for the resource.

get_favicon()

Retrieve favicon url from article tags or from http://g.etfv.co

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_image_dimensions(img_url)
get_metadata()
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()
get_text()

Get the article text.

get_title()
get_top_image()
get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

class bonfire.content.NewspaperFetcher(url, html=None)

Smartly fetches metadata from a newspaper article, and cleans the results.

get_authors()

Retrieve an author or authors. This works very sporadically.

get_canonical_url()
Main function for determining the canonical URL. Checks, in order:
  • opengraph url (og:url)
  • twitter url (twitter:url)
  • newspaper’s guess (usually from meta tags)

If none of these work, or newspaper guesses a short-url domain, it gives up and requests the url to follow its final redirect.

get_description()
get_facebook_image()

Retrieve opengraph image for the resource.

get_favicon()

Retrieve favicon url from article tags or from http://g.etfv.co

get_image()
Retrieve a preferred image for the resource, checking in this order:
  • opengraph
  • twitter
  • newspaper’s top image.
get_image_dimensions(img_url)
get_metadata()
get_provider()

Returns a prettified domain for the resource, stripping www.

get_published()

Retrieve a published date. This almost never gets anything.

get_tags()

Retrieve a comma-separated list of all keywords, categories, and tags, flattened:

  • opengraph tags + sections
  • keywords + meta_keywords
  • tags
get_text()

Get the article text.

get_title()

Retrieve title from opengraph, twitter, or meta tags.

get_top_image()
get_twitter_creator()

Retrieve twitter username of the creator.

get_twitter_image()

Retrieve the twitter image for the resource from twitter:image:src or twitter:image.

get_twitter_player()

Retrieve default player for twitter cards.

bonfire.content.extract(url, html=None)

Extract metadata from a URL, and return a dict result.

Uses newspaper https://github.com/codelucas/newspaper/, but overrides some defaults in favor of opengraph and twitter elements.

Parameters: html – if provided, skip downloading and go straight to parsing the HTML.
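
A hedged usage sketch; the URL and the dict keys shown are assumptions about what the fetchers typically return:

    from bonfire.content import extract

    # Download and parse the page; pass html= to skip the download step.
    result = extract("http://example.com/some-article")

    # `result` is a dict of extracted metadata; exact keys depend on the page.
    print(result.get("title"), result.get("description"))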

bonfire.dates

bonfire.dates.apply_offset(start_date, offset)

Apply an offset in minutes to a given date.

bonfire.dates.dateify_string(datestr, format='%a %b %d %H:%M:%S +0000 %Y')

Convert a date string to a python datetime (UTC).

bonfire.dates.epoch_to_datetime(epoch)

Converts unix timestamp to python datetime (UTC).

bonfire.dates.get_query_dates(start, end, hours=None, stringify=True)

Gets the start and end dates for a query.

Parameters:
  • start – datetime to start with. If not present, it is derived from the hours argument.
  • end – datetime to end with. If not present, defaults to now.
  • hours – number of hours before end to start with, if no start is specified.
  • stringify – return as formatted strings.
bonfire.dates.get_since_now(start_time, time_type=None, stringify=True)

Gets the amount of time that has elapsed between start_time and now. Smartly chooses between hours, days, minutes, and seconds. Accepts a datetime, unix epoch, or Twitter-formatted datestring.

Parameters:
  • time_type – force return of a certain time measurement (e.g. “120 seconds ago” instead of “2 minutes ago”)
  • stringify – turn result into a string instead of tuple, pluralized if need be
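
A hedged usage sketch of get_since_now; the unit name passed as time_type and the exact output format are assumptions:

    from datetime import datetime, timedelta
    from bonfire.dates import get_since_now

    two_hours_ago = datetime.utcnow() - timedelta(hours=2)

    # Default: picks a sensible unit and returns a pluralized string.
    print(get_since_now(two_hours_ago))

    # Force a particular unit; "minutes" is an assumed unit name.
    print(get_since_now(two_hours_ago, time_type="minutes"))
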
bonfire.dates.now(stringify=False)

Return now.

Parameters: stringify – return now as a formatted string.

bonfire.dates.stringify_date(dt)

Convert a datetime to an elasticsearch-formatted datestring (UTC).

bonfire.dates.stringify_since_now(amt, time_type)

Takes an amount and a type. Returns a pluralized string.

bonfire.db

bonfire.db.add_to_results_cache(universe, hours, results)

Cache a set of results under a certain number of hours.

Index a new top link to the given universe.

bonfire.db.build_universe_mappings(universe, rebuild=False)

Create and map the universe.

bonfire.db.cleanup(universe, days=30)

Delete everything in the universe that is more than days old. Does not apply to top content.

bonfire.db.delete_content_by_url(universe, url)

Delete the content specified by url.

bonfire.db.delete_tweets_by_url(universe, url)

Delete tweets specified by url.

bonfire.db.delete_user(universe, user_id)

Delete a user from the universe index by their id.

bonfire.db.enqueue_tweet(universe, tweet)

Save a tweet to the universe index as an unprocessed tweet document.

bonfire.db.es(universe)

Return a new-style Elasticsearch client connection for the universe.

bonfire.db.get_all_docs(universe, index, doc_type, body={}, size=None, field='_id')

Helper function to return all values in a certain field. Defaults to retrieving all ids from a given index and doc type.

Parameters:
  • universe – current universe.
  • index – current index.
  • doc_type – the type of doc to return all values for.
  • body – add custom body, or leave blank to retrieve everything.
  • size – limit by size, or leave as None to retrieve all.
  • field – retrieve all of a specific field. Defaults to id.
bonfire.db.get_cached_url(universe, url)

Get a resolved URL from the index. Returns None if URL doesn’t exist.

bonfire.db.get_items(universe, quantity=20, hours=24, start=None, end=None, time_decay=True)

The default function: gets the most popular links shared from a given universe and time frame.

Parameters:
  • quantity – number of links to return
  • hours – hours since end to search through.
  • start – start datetime in UTC. Defaults to hours.
  • end – end datetime in UTC. Defaults to now.
  • time_decay – whether or not to decay the score based on the time of its first tweet.
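
A hedged usage sketch of get_items; the universe name is a placeholder, and the shape of each returned item depends on the index mapping:

    from bonfire.db import get_items

    # Top 10 links shared over the last 12 hours, with time decay applied.
    links = get_items("my-universe", quantity=10, hours=12, time_decay=True)
    for link in links:
        print(link)
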
bonfire.db.get_latest_raw_tweet(universe)
bonfire.db.get_latest_tweet(universe)

Get the most recently added top links in the given universe.

bonfire.db.get_score_stats(universe, hours=4)

Get extended stats on the scores returned from the results cache.

Parameters: hours – type of query to search for.

Search for any links in the current set whose scores are high enough to qualify as top links. Return one (and only one) if so.

bonfire.db.get_top_providers(universe, size=2000)

Get a list of all providers (i.e. domains) in order of popularity. Possible future use for autocomplete, to search across publications.

bonfire.db.get_universe_tweets(universe, query=None, quantity=20, hours=24, start=None, end=None)

Get tweets in a given universe.

Parameters:
  • query – accepts None, a string, or a dict. If None, matches all tweets; if a string, searches across the tweets’ text for the given string; if a dict, accepts any elasticsearch match query http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
  • start – accepts an int or a datetime (timezone-unaware, UTC). If an int, starts that many hours before now.
  • end – accepts a datetime (timezone-unaware, UTC); defaults to now.
  • quantity – number of tweets to return
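
A hedged usage sketch showing the three query forms; the universe name is a placeholder, and whether the dict form is the inner field/value match body is an assumption:

    from bonfire.db import get_universe_tweets

    # query=None matches all tweets from the last 24 hours.
    tweets = get_universe_tweets("my-universe")

    # A string searches across the tweets' text.
    tweets = get_universe_tweets("my-universe", query="elasticsearch", hours=6)

    # A dict is passed through as an Elasticsearch match query
    # (assumed here to be the inner field/value body).
    tweets = get_universe_tweets("my-universe", query={"text": "open data"})
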
bonfire.db.get_user_ids(universe, size=None)

Get top users for the universe by weight.

Parameters: size – number of users to get. Defaults to all users.

bonfire.db.get_user_weights(universe, user_ids)

Takes a list of user ids and returns a dict with their weighted influence.

bonfire.db.logger()
bonfire.db.next_unprocessed_tweet(universe, not_ids=None)

Get the next unprocessed tweet and delete it from the index.

bonfire.db.save_content(universe, content)

Save the content of a URL to the index.

bonfire.db.save_tweet(universe, tweet)

Save a tweet to the universe index, fully processed.

bonfire.db.save_user(universe, user)

Check if a user exists in the database. If not, create it. If so, update it.

Scores a given link returned from Elasticsearch.

Parameters:
  • link – the full Elasticsearch result for the link
  • user_weights – a dict of key/value pairs where the key is a user’s id and the value is that user’s weighted Twitter influence
  • time_decay – whether or not to decay the link’s score based on time
  • hours – used for determining the decay factor if decay is enabled
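
Illustrative only: one way such a scorer could combine tweeter influence with time decay. The field names and the decay formula below are assumptions, not bonfire's actual implementation:

    import math

    def score_link_sketch(link, user_weights, time_decay=True, hours=24):
        # Sum the weighted influence of the users who tweeted the link;
        # "tweeter_ids" and "age_hours" are assumed field names.
        source = link.get("_source", {})
        base = sum(user_weights.get(uid, 0) for uid in source.get("tweeter_ids", []))
        if not time_decay:
            return base
        # Exponential decay by the link's age relative to the query window.
        return base * math.exp(-source.get("age_hours", 0) / float(hours))
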
bonfire.db.search_content(universe, query, size=100)

Search the full text of all content across universes for a given string, or a custom match query.

bonfire.db.search_items(universe, term, quantity=100)

Search the text of both tweets and content for a given term and universe, and return some items matching one or the other.

Parameters:
  • term – search term to use for querying both tweets and content
  • quantity – number of items to return
bonfire.db.set_cached_url(universe, url, resolved_url)

Index a URL and its resolution in Elasticsearch.

bonfire.elastic

An attempt to make the Elasticsearch client a bit more usable. Currently implements search, get, and mget. get_source maps to get because there really should not be a need for both if the API is done correctly. For the time being, other client methods will behave just as they do for the client provided by Elasticsearch.

The implemented methods return ESDocument or ESCollection objects. ESDocument is dot-addressable and dict-like. Its keys are your document keys. Metadata is attached to the document with underscored properties: _index, _type, _id, and, when applicable, _score, _version, _found. I realize that _ has the general connotation of “privacy” in Python, but this seems to me to be the safest way to have special keys and keep the API fairly usable.

ESCollection is an iterable of those documents and has the special properties: took, timed_out, total_shards, successful_shards, failed_shards. No need for ['hits']['hits'] dereferencing – just iterate the collection. Same for ['docs'] on an mget operation.
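
A hedged usage sketch based on the description above; the host, index, and type names are placeholders:

    from bonfire.elastic import ESClient

    es = ESClient(hosts=["localhost:9200"])

    # search() returns an ESCollection: iterate it directly, no ['hits']['hits'].
    results = es.search(index="my-index", doc_type="my-type",
                        body={"query": {"match_all": {}}})
    print(results.took, results.total_shards)
    for doc in results:
        # ESDocument is dot-addressable; metadata sits under underscored keys.
        print(doc._id, doc._score)

    # get() returns a single ESDocument.
    doc = es.get(index="my-index", doc_type="my-type", id="1")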

class bonfire.elastic.ESAggregation
clear() → None. Remove all items from D.
copy() → a shallow copy of D
static fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values
class bonfire.elastic.ESClient(hosts=None, transport_class=<class 'elasticsearch.transport.Transport'>, **kwargs)
abort_benchmark(*args, **kwargs)

Aborts a running benchmark. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters: name – A benchmark name
benchmark(*args, **kwargs)

The benchmark API provides a standard mechanism for submitting queries and measuring their performance relative to one another. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters:
  • index – A comma-separated list of index names; use _all or empty string to perform the operation on all indices
  • doc_type – The name of the document type
  • body – The search definition using the Query DSL
  • verbose – Specify whether to return verbose statistics about each iteration (default: false)
bulk(*args, **kwargs)

Perform many index/delete operations in a single API call. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html

See the bulk() helper function for a more friendly API.

Parameters:
  • body – The operation definition and data (action-data pairs), as either a newline separated string, or a sequence of dicts to serialize (one per row).
  • index – Default index for items which don’t provide one
  • doc_type – Default document type for items which don’t provide one
  • consistency – Explicit write consistency setting for the operation
  • refresh – Refresh the index after performing the operation
  • routing – Specific routing value
  • replication – Explicitly set the replication type (default: sync)
  • timeout – Explicit operation timeout
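
A hedged sketch of the action/data-pair body format described above; the host, index, and type names are placeholders:

    from bonfire.elastic import ESClient

    es = ESClient(hosts=["localhost:9200"])   # example host

    # A sequence of dicts: each action line is followed by its document.
    body = [
        {"index": {"_index": "my-index", "_type": "my-type", "_id": "1"}},
        {"title": "first doc"},
        {"index": {"_index": "my-index", "_type": "my-type", "_id": "2"}},
        {"title": "second doc"},
    ]
    es.bulk(body=body)
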
clear_scroll(*args, **kwargs)

Clear the scroll request created by specifying the scroll parameter to search. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

Parameters:
  • scroll_id – The scroll ID or a list of scroll IDs
  • body – A comma-separated list of scroll IDs to clear if none was specified via the scroll_id parameter
count(*args, **kwargs)

Execute a query and get the number of matches for that query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-count.html

Parameters:
  • index – A comma-separated list of indices to restrict the results
  • doc_type – A comma-separated list of types to restrict the results
  • body – A query to restrict the results (optional)
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • min_score – Include only documents with a specific _score value in the result
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • q – Query in the Lucene query string syntax
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
count_percolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being count percolated.
  • doc_type – The type of the document being count percolated.
  • id – Substitute the document in the request body with a document that is known by the specified id. On top of the id, the index and type parameter will be used to retrieve the document from within the cluster.
  • body – The count percolator request definition using the percolate DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • percolate_index – The index to count percolate the document into. Defaults to index.
  • percolate_type – The type to count percolate document into. Defaults to type.
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
create(*args, **kwargs)

Adds a typed JSON document in a specific index, making it searchable. Behind the scenes this method calls index(..., op_type=’create’) http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – Document ID
  • body – The document
  • consistency – Explicit write consistency setting for the operation
  • id – Specific document ID (when the POST method is used)
  • parent – ID of the parent document
  • percolate – Percolator queries to execute while indexing the document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
delete(*args, **kwargs)

Delete a typed JSON document from a specific index based on its id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – The document ID
  • consistency – Specific write consistency setting for the operation
  • parent – ID of parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
delete_by_query(*args, **kwargs)

Delete documents from one or more indices and one or more types based on a query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

Parameters:
  • index – A comma-separated list of indices to restrict the operation; use _all to perform the operation on all indices
  • doc_type – A comma-separated list of types to restrict the operation
  • body – A query to restrict the operation specified with the Query DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • analyzer – The analyzer to use for the query string
  • consistency – Specific write consistency setting for the operation
  • default_operator – The default operator for query string query (AND or OR), default ‘OR’
  • df – The field to use as default where no field prefix is given in the query string
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • q – Query in the Lucene query string syntax
  • replication – Specific replication type, default ‘sync’
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
  • timeout – Explicit operation timeout
delete_script(*args, **kwargs)

Remove a stored script from elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
delete_template(*args, **kwargs)

Delete a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters: id – Template ID
exists(*args, **kwargs)

Returns a boolean indicating whether or not given document exists in Elasticsearch. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html

Parameters:
  • index – The name of the index
  • id – The document ID
  • doc_type – The type of the document (uses _all by default to fetch the first document matching the ID across all types)
  • parent – The ID of the parent document
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • realtime – Specify whether to perform the operation in realtime or search mode
  • refresh – Refresh the shard containing the document before performing the operation
  • routing – Specific routing value
explain(*args, **kwargs)

The explain api computes a score explanation for a query and a specific document. This can give useful feedback about whether a document matched or didn’t match a specific query. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-explain.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – The document ID
  • body – The query definition using the Query DSL
  • _source – True or false to return the _source field or not, or a list of fields to return
  • _source_exclude – A list of fields to exclude from the returned _source field
  • _source_include – A list of fields to extract and return from the _source field
  • analyze_wildcard – Specify whether wildcards and prefix queries in the query string query should be analyzed (default: false)
  • analyzer – The analyzer for the query string query
  • default_operator – The default operator for query string query (AND or OR), (default: OR)
  • df – The default field for query string query (default: _all)
  • fields – A comma-separated list of fields to return in the response
  • lenient – Specify whether format-based query failures (such as providing text to a numeric field) should be ignored
  • lowercase_expanded_terms – Specify whether query terms should be lowercased
  • parent – The ID of the parent document
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • q – Query in the Lucene query string syntax
  • routing – Specific routing value
  • source – The URL-encoded query definition (instead of using the request body)
get(*args, **kwargs)
get_script(*args, **kwargs)

Retrieve a script from the API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
get_source(*args, **kwargs)
get_template(*args, **kwargs)

Retrieve a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters:
  • id – Template ID
  • body – The document
index(*args, **kwargs)

Adds or updates a typed JSON document in a specific index, making it searchable. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • body – The document
  • id – Document ID
  • consistency – Explicit write consistency setting for the operation
  • op_type – Explicit operation type (default: index)
  • parent – ID of the parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • routing – Specific routing value
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
info(*args, **kwargs)

Get the basic info from the current cluster.

list_benchmarks(*args, **kwargs)

View the progress of long-running benchmarks. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-benchmark.html

Parameters:
  • index – A comma-separated list of index names; use _all or empty string to perform the operation on all indices
  • doc_type – The name of the document type
mget(*args, **kwargs)
mlt(*args, **kwargs)

Get documents that are “like” a specified document. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-more-like-this.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document (use _all to fetch the first document matching the ID across all types)
  • id – The document ID
  • body – A specific search request definition
  • boost_terms – The boost factor
  • include – Whether to include the queried document from the response
  • max_doc_freq – The word occurrence frequency as count: words with higher occurrence in the corpus will be ignored
  • max_query_terms – The maximum query terms to be included in the generated query
  • max_word_length – The maximum length of the word: longer words will be ignored
  • min_doc_freq – The word occurrence frequency as count: words with lower occurrence in the corpus will be ignored
  • min_term_freq – The term frequency as percent: terms with lower occurrence in the source document will be ignored
  • min_word_length – The minimum length of the word: shorter words will be ignored
  • mlt_fields – Specific fields to perform the query against
  • percent_terms_to_match – How many terms have to match in order to consider the document a match (default: 0.3)
  • routing – Specific routing value
  • search_from – The offset from which to return results
  • search_indices – A comma-separated list of indices to perform the query against (default: the index containing the document)
  • search_query_hint – The search query hint
  • search_scroll – A scroll search request definition
  • search_size – The number of documents to return (default: 10)
  • search_source – A specific search request definition (instead of using the request body)
  • search_type – Specific search type (e.g. dfs_then_fetch, count, etc.)
  • search_types – A comma-separated list of types to perform the query against (default: the same type as the document)
  • stop_words – A list of stop words to be ignored
mpercolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being count percolated to use as default
  • doc_type – The type of the document being percolated to use as default.
  • body – The percolate request definitions (header & body pair), separated by newlines
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
msearch(*args, **kwargs)

Execute several search requests within the same API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-multi-search.html

Parameters:
  • body – The request definitions (metadata-search request definition pairs), as either a newline separated string, or a sequence of dicts to serialize (one per row).
  • index – A comma-separated list of index names to use as default
  • doc_type – A comma-separated list of document types to use as default
  • search_type – Search operation type
mtermvectors(*args, **kwargs)

Multi termvectors API allows to get multiple termvectors based on an index, type and id. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/docs-multi-termvectors.html

Parameters:
  • index – The index in which the document resides.
  • doc_type – The type of the document.
  • body – Define ids, parameters or a list of parameters per document here. You must at least provide a list of document ids. See documentation.
  • field_statistics – Specifies if document count, sum of document frequencies and sum of total term frequencies should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • fields – A comma-separated list of fields to return. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • ids – A comma-separated list of documents ids. You must define ids as parameter or set “ids” or “docs” in the request body
  • offsets – Specifies if term offsets should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • parent – Parent id of documents. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • payloads – Specifies if term payloads should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • positions – Specifies if term positions should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default True
  • preference – Specify the node or shard the operation should be performed on (default: random) .Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • routing – Specific routing value. Applies to all returned documents unless otherwise specified in body “params” or “docs”.
  • term_statistics – Specifies if total term frequency and document frequency should be returned. Applies to all returned documents unless otherwise specified in body “params” or “docs”., default False
percolate(*args, **kwargs)

The percolator allows you to register queries against an index and then send percolate requests that include a doc, getting back the queries that match that doc out of the set of registered queries. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html

Parameters:
  • index – The index of the document being percolated.
  • doc_type – The type of the document being percolated.
  • id – Substitute the document in the request body with a document that is known by the specified id. On top of the id, the index and type parameter will be used to retrieve the document from within the cluster.
  • body – The percolator request definition using the percolate DSL
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • percolate_format – Return an array of matching query IDs instead of objects
  • percolate_index – The index to percolate the document into. Defaults to index.
  • percolate_type – The type to percolate document into. Defaults to type.
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
ping(*args, **kwargs)

Returns True if the cluster is up, False otherwise.

put_script(*args, **kwargs)

Create a script in given language with specified ID. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html

Parameters:
  • lang – Script language
  • id – Script ID
  • body – The document
put_template(*args, **kwargs)

Create a search template. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-template.html

Parameters:
  • id – Template ID
  • body – The document
scroll(*args, **kwargs)

Scroll a search request created by specifying the scroll parameter. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

Parameters:
  • scroll_id – The scroll ID
  • scroll – Specify how long a consistent view of the index should be maintained for scrolled search
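
A hedged sketch of a scrolled search using the plain elasticsearch-py client that ESClient wraps (how ESClient's overridden search() exposes the scroll id may differ); the host and index names are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])    # example host

    # Open a scrolled search, then keep paging until no hits remain.
    page = es.search(index="my-index", body={"query": {"match_all": {}}},
                     scroll="5m", size=100)
    scroll_id = page["_scroll_id"]
    while True:
        page = es.scroll(scroll_id=scroll_id, scroll="5m")
        if not page["hits"]["hits"]:
            break
        scroll_id = page["_scroll_id"]
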
search(*args, **kwargs)
search_shards(*args, **kwargs)

The search shards api returns the indices and shards that a search request would be executed against. This can give useful feedback for working out issues or planning optimizations with routing and shard preferences. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-shards.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both (default: ‘open’)
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • local – Return local information, do not retrieve the state from master node (default: false)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – Specific routing value
search_template(*args, **kwargs)

A query that accepts a query template and a map of key/value pairs to fill in template parameters. http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/query-dsl-template-query.html

Parameters:
  • index – A comma-separated list of index names to search; use _all or empty string to perform the operation on all indices
  • doc_type – A comma-separated list of document types to search; leave empty to perform the operation on all types
  • body – The search definition template and its params
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – A comma-separated list of specific routing values
  • scroll – Specify how long a consistent view of the index should be maintained for scrolled search
  • search_type – Search operation type
suggest(*args, **kwargs)

The suggest feature suggests similar looking terms based on a provided text by using a suggester. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html

Parameters:
  • index – A comma-separated list of index names to restrict the operation; use _all or empty string to perform the operation on all indices
  • body – The request definition
  • allow_no_indices – Whether to ignore if a wildcard indices expression resolves into no concrete indices. (This includes _all string or when no indices have been specified)
  • expand_wildcards – Whether to expand wildcard expression to concrete indices that are open, closed or both., default ‘open’
  • ignore_unavailable – Whether specified concrete indices should be ignored when unavailable (missing or closed)
  • preference – Specify the node or shard the operation should be performed on (default: random)
  • routing – Specific routing value
  • source – The URL-encoded request definition (instead of using request body)
termvector(*args, **kwargs)

Returns information and statistics on terms in the fields of a particular document (added in Elasticsearch 1.0). http://www.elasticsearch.org/guide/en/elasticsearch/reference/master/search-termvectors.html

Parameters:
  • index – The index in which the document resides.
  • doc_type – The type of the document.
  • id – The id of the document.
  • body – Define parameters. See documentation.
  • field_statistics – Specifies if document count, sum of document frequencies and sum of total term frequencies should be returned., default True
  • fields – A comma-separated list of fields to return.
  • offsets – Specifies if term offsets should be returned., default True
  • parent – Parent id of documents.
  • payloads – Specifies if term payloads should be returned., default True
  • positions – Specifies if term positions should be returned., default True
  • preference – Specify the node or shard the operation should be performed on (default: random).
  • routing – Specific routing value.
  • term_statistics – Specifies if total term frequency and document frequency should be returned., default False
update(*args, **kwargs)

Update a document based on a script or partial data provided. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html

Parameters:
  • index – The name of the index
  • doc_type – The type of the document
  • id – Document ID
  • body – The request definition using either script or partial doc
  • consistency – Explicit write consistency setting for the operation
  • fields – A comma-separated list of fields to return in the response
  • lang – The script language (default: mvel)
  • parent – ID of the parent document
  • refresh – Refresh the index after performing the operation
  • replication – Specific replication type (default: sync)
  • retry_on_conflict – Specify how many times should the operation be retried when a conflict occurs (default: 0)
  • routing – Specific routing value
  • script – The URL-encoded script definition (instead of using request body)
  • timeout – Explicit operation timeout
  • timestamp – Explicit timestamp for the document
  • ttl – Expiration time for the document
  • version – Explicit version number for concurrency control
  • version_type – Specific version type
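
A hedged sketch of a partial-document update and a script-based update (the host, index, type, and field names are placeholders, and scripting must be enabled on the cluster for the second form):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])    # example host

    # Merge the given fields into the stored document.
    es.update(index="my-index", doc_type="my-type", id="1",
              body={"doc": {"title": "updated title"}})

    # Script-based update.
    es.update(index="my-index", doc_type="my-type", id="1",
              body={"script": "ctx._source.counter += 1"})
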
class bonfire.elastic.ESCollection(resultset)
next()
class bonfire.elastic.ESDocument(doc)
clear() → None. Remove all items from D.
copy() → a shallow copy of D
static fromkeys(S[, v]) → New dict with keys from S and values equal to v.

v defaults to None.

get(k[, d]) → D[k] if k in D, else d. d defaults to None.
has_key(k) → True if D has a key k, else False
items() → list of D's (key, value) pairs, as 2-tuples
iteritems() → an iterator over the (key, value) items of D
iterkeys() → an iterator over the keys of D
itervalues() → an iterator over the values of D
keys() → list of D's keys
pop(k[, d]) → v, remove specified key and return the corresponding value.

If key is not found, d is returned if given, otherwise KeyError is raised

popitem() → (k, v), remove and return some (key, value) pair as a 2-tuple; but raise KeyError if D is empty.

setdefault(k[, d]) → D.get(k,d), also set D[k]=d if k not in D
update([E, ]**F) → None. Update D from dict/iterable E and F.

If E present and has a .keys() method, does: for k in E: D[k] = E[k] If E present and lacks .keys() method, does: for (k, v) in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values() → list of D's values
viewitems() → a set-like object providing a view on D's items
viewkeys() → a set-like object providing a view on D's keys
viewvalues() → an object providing a view on D's values

bonfire.extract

class bonfire.extract.ArticleExtractor(url=None, html=None)
article_node
author
doc
fetch(url)
get_article_text()
get_top_image()

Return the first of the article images.

html
images
metadata
title
url
exception bonfire.extract.InstantiationError
args
message
bonfire.extract.as_date(node)
bonfire.extract.clean_attribution(string)
bonfire.extract.clean_whitespace(string)
bonfire.extract.content_nodes(elem, node_types=None)
bonfire.extract.find_pubdate(node)
bonfire.extract.is_attribution(string)
bonfire.extract.word_count(text)

bonfire.mappings

bonfire.process

bonfire.process.create_session()

Create a requests session optimized for many connections.
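
One plausible shape for such a session (a sketch; the pool sizes below are assumptions, not bonfire's actual values):

    import requests
    from requests.adapters import HTTPAdapter

    def make_session(pool_size=25):
        # Hypothetical stand-in for create_session(): a requests.Session
        # with a larger connection pool for fetching many URLs.
        session = requests.Session()
        adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session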

bonfire.process.logger()
bonfire.process.process_rawtweet(universe, raw_tweet, session=None)

Take a raw tweet from the queue, extract and save metadata from its content, then save as a processed tweet.

bonfire.process.process_universe_rawtweets(universe, build_mappings=True)

Take all unprocessed tweets in the given universe, and extract and process their contents. When there are no tweets left, sleep until a new one arrives.

bonfire.twitter

bonfire.twitter.client(universe)

Return a Twitter client for the given universe.

bonfire.twitter.collect_universe_tweets(universe)

Connects to the streaming API and enqueues tweets from universe users. Limited to the top 5000 users due to an API limitation.

bonfire.twitter.get_friends(universe, user_id)

Get Twitter IDs for friends of the given user_id.

bonfire.twitter.logger()
bonfire.twitter.lookup_users(universe, usernames)

Look up Twitter users by screen name. Limited to the first 100 usernames due to an API limitation.
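
Because of the 100-name cap, larger lists need to be batched by the caller; a hedged sketch (the helper below is not part of bonfire, and it assumes lookup_users returns an iterable of user objects):

    from bonfire.twitter import lookup_users

    def lookup_in_batches(universe, usernames, batch_size=100):
        # Look users up 100 at a time to stay under the API cap.
        results = []
        for i in range(0, len(usernames), batch_size):
            results.extend(lookup_users(universe, usernames[i:i + batch_size]))
        return results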

bonfire.twitter.stream_client(universe)

Return a Twitter streaming client for the given universe.

bonfire.universe

bonfire.universe.build_universe(universe, build_mappings=True)

Expand the universe from the seed user list in the universe file. This should be called no more than once every 15 minutes.

This command also serves to update an existing universe.

The seed is currently limited to the first 14 users in the file: the API limit is 15 calls per 15 minutes, and one extra call is needed for the initial request for seed user IDs. Supporting a seed larger than 14 would require making sure we do not exceed 15 calls per 15-minute window.

bonfire.universe.cache_queries(universe, top_links=False, tweet=False)
bonfire.universe.cleanup_universe(universe, days=30)